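Good day. I've received a very messy data set to clean up, and my first thought was Power Query (PQ). When first imported into PQ, the file is a single tab-delimited column. This is the information I currently need to extract:
- C:
- Subscriber Name
- Current Charges (all fields)
- Other Charges and Credit
- Other Fees
- Taxes

Below is a snippet of the actual .txt file, which has more than 10k lines; the same block simply repeats for each user ("C:"), with each user block starting at a "--------------" line.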
"Company-Name." Client No: "5780859"
" " Purchase Order No:
Invoice Date: 30-Sep-23 Unique Invoice No: "123456789"
"12345 Main Street"
"City AB"
"T4A 1B7"
"Account Number 1234567"
"-------------------------------------------------------------------------------"
"REPORT - INDIVIDUAL DETAILS"
--------------
"C:" "1234567890"
"Subscriber Name:" "NAME.NAME SPARE"
"Additional line user name:" ""
"Sublevel:" " "
"Sublevel:" ""
"Reference 1:" ""
"Reference 2:" ""
"Handset Transparency"
"Number/Device Information" ""
"Starting Balance" $0.00
"Last Month's Balance" $0.00
"Current Balance" $0.00
"Monthly Credit" $0.00
"Monthly Balance Adjust" $0.00
"CURRENT CHARGES"
"Monthly Service Plan" $40.00
"Additional Local Airtime" $0.00
"Long Distance Charges" $22.40
"Roaming Charges" $0.00
"Total Taxes:" $7.49
"Total Current Charges:" $69.89
"MONTHLY SERVICE PLAN" 01-Oct-23 to 31-Oct-23
"Service Plan Name" "Total"
"Business SharePro 5GB Q1 offer (01-Oct-23 to 31-Oct-23)" $40.00
"Total Monthly Service Plan Charges" $40.00
"ADDITIONAL LOCAL AIRTIME"
"Service" "Total Airtime" "Free Airtime" "Included Airtime" "Chargeable Airtime" "Total"
"Phone (minutes)" 28:00 0:00 28:00 0:00 $0.00
"Total Additional Local Airtime Charges" $0.00
"LONG DISTANCE CHARGES"
"Service" "Total LD Minutes" "Free LD Minutes" "Included LD Minutes" "Chargeable LD Minutes" "Total"
"Domestic Phone" 28:00 0:00 0:00 28:00 $22.40
"Total Long Distance Charges" $22.40
"ROAMING"
"Service" "Roaming Minutes" "Roaming Charges" "Roaming LD Minutes" "Roaming LD Charges" "Roaming Surcharge" "Total"
"Total Roaming Charges" $0.00
"DO MORE DATA SERVICES"
"Service" "Total Events" "Event Type" "Total"
"Total Do More Data Services Charges" $0.00
"DO MORE VOICE SERVICES"
"Service" "Total Events" "Event Type" "Total"
"Total Do More Voice Services Charges" $0.00
"PAGER SERVICES"
"Service" "Total Messages" "Included Messages" "Chargeable Messages" "Total"
"Total Pager Charges" $0.00
"VALUE-ADDED SERVICES" 01-Oct-23 to 31-Oct-23
"Service" "Total"
"Can - Can/US LD $0.80/min (01-Oct-23 to 31-Oct-23)" $0.00
"Easy Roam INTL - $16/day Business (01-Oct-23 to 31-Oct-23)" $0.00
"Easy Roam US - $14/day Business (01-Oct-23 to 31-Oct-23)" $0.00
"UL Can - Can LD min (01-Oct-23 to 31-Oct-23)" $0.00
"UL domestic SMS / MMS (01-Oct-23 to 31-Oct-23)" $0.00
"Visual Voicemail (01-Oct-23 to 31-Oct-23)" $0.00
"Total Value Added Service Charges" $0.00
"OTHER CHARGES AND CREDIT"
"Charge or Credit" "Total"
"Total Other Charges and Credits" $0.00
"OTHER FEES"
"Service" "Total"
"Other Fees" $0.00
"TAXES"
"" "Total"
"GST" $3.12
"PST - BC" $4.37
"Total Taxes" $7.49
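After a lot of trial and error, I used the following steps:
- Filtered rows (removed the top rows)
- Added an index column, then a conditional column that returns a value at each "-----" row, which I then filled down
- Filtered again to keep only the rows I actually need
- Split the single column by delimiter (tab), since that is how the .txt file is laid out
- Removed what would eventually be the header column, because it was confusing me
- Grouped rows on the conditional column and drilled down to get a list
- As the last step, used "Table.Combine" on the list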
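For comparison, the grouping part of these steps (index column, conditional column, fill down) can be roughly sketched in Python. This is only a sketch under assumptions: fields are taken to be tab-separated and double-quoted as in the snippet, `split_blocks` is a hypothetical helper name, and the second subscriber in `SAMPLE` is invented purely for illustration.

```python
import csv
import io

# Tiny stand-in for the real file. Tabs are written explicitly as \t.
# The second subscriber is made up for illustration.
SAMPLE = (
    '"REPORT - INDIVIDUAL DETAILS"\n'
    '--------------\n'
    '"C:"\t"1234567890"\n'
    '"Subscriber Name:"\t"NAME.NAME SPARE"\n'
    '"Total Taxes:"\t$7.49\n'
    '--------------\n'
    '"C:"\t"2222222222"\n'
    '"Subscriber Name:"\t"OTHER USER"\n'
)


def split_blocks(text):
    """Return one list of rows per subscriber; each row is a list of fields."""
    blocks = []
    current = None  # None until the first "C:" row, so the invoice header is skipped
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        # Skip empty lines and the dashed separator lines.
        if not row or row[0].startswith("-----"):
            continue
        # A "C:" row opens a new subscriber block.
        if row[0] == "C:":
            current = []
            blocks.append(current)
        if current is not None:
            current.append(row)
    return blocks


blocks = split_blocks(SAMPLE)
for block in blocks:
    print(len(block), "rows for", block[0][1])
```

Starting a new list at every "C:" row plays the same role as the filled-down conditional column: every row ends up attached to exactly one user.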
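So now the rows display the way they should, but here is the next problem: not every user (C:) has the same number of rows (charges), so the data spills across many columns that no longer line up where they should.
Is there any way to fix this? Is something like this better suited to Python?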
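One possible way around the shifting columns, sketched here in Python, is to key every value by its label (the first field of each row) instead of by its position, so a user with fewer charge rows gets blanks rather than shifted columns. Assumptions: the file is tab-separated with double-quoted fields as in the snippet; `WANTED`, `parse_users` and `to_table` are hypothetical names; the labels in `WANTED` are copied from the snippet and would need adjusting to the real file; the second subscriber in `SAMPLE` is invented for illustration.

```python
import csv
import io

# Labels to keep, in output column order (copied from the snippet above).
WANTED = [
    "C:", "Subscriber Name:",
    "Monthly Service Plan", "Additional Local Airtime",
    "Long Distance Charges", "Roaming Charges",
    "Total Taxes:", "Total Current Charges:",
    "Total Other Charges and Credits", "Other Fees",
    "GST", "PST - BC",
]

# Stand-in data: the second (invented) user has fewer rows and no PST line.
SAMPLE = (
    '"C:"\t"1234567890"\n'
    '"Subscriber Name:"\t"NAME.NAME SPARE"\n'
    '"CURRENT CHARGES"\n'
    '"Monthly Service Plan"\t$40.00\n'
    '"Long Distance Charges"\t$22.40\n'
    '"Total Taxes:"\t$7.49\n'
    '"Total Current Charges:"\t$69.89\n'
    '"TAXES"\n'
    '"GST"\t$3.12\n'
    '"PST - BC"\t$4.37\n'
    '--------------\n'
    '"C:"\t"2222222222"\n'
    '"Subscriber Name:"\t"OTHER USER"\n'
    '"Monthly Service Plan"\t$40.00\n'
    '"Total Current Charges:"\t$42.00\n'
    '"GST"\t$2.00\n'
)


def parse_users(text):
    """Return one {label: value} dict per subscriber."""
    users = []
    current = None  # None until the first "C:" row is seen
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue
        label = row[0].strip()
        if label == "C:":
            current = {}
            users.append(current)
        # Ignore rows before the first user, unwanted labels, and
        # single-field rows such as section headings.
        if current is None or label not in WANTED or len(row) < 2:
            continue
        # Keep the first occurrence of each label within a user block.
        current.setdefault(label, row[1].strip())
    return users


def to_table(users):
    """One row per user, one column per wanted label; missing labels become ''."""
    return [[user.get(label, "") for label in WANTED] for user in users]


table = to_table(parse_users(SAMPLE))
for line in table:
    print(line)
```

Because each value is looked up by label, every output row has the same number of columns regardless of how many charge lines a user has. The same label-keyed idea should carry over to PQ by pivoting on the label column rather than relying on row position, though that is not shown here.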