
Rasec Malkic

Asked: 2024-05-20 05:07:56 +0800 CST

How to remove duplicates based on conditions in column 1 and column 3?


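I'm trying to remove duplicates from a large CSV file based on the values in column 1, while taking the following into account:

Column 3 can be empty or hold multiple values separated by ":::". If a value in column 1 appears more than once, keep the record that has the most elements in column 3. Remove any hyphens in column 3 (if present).

My input is: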

H1,H2,H3,H4
a,2,8005:::+2287:::3426,2
b,4,1111:::+15-00:::01354,1
b,4,1111:::+1500,1
c,4,2208:::+6583,9
d,5,7761:::+993733:::+53426,4
d,5,7761:::+993-733:::+53-426:::87425,4
d,5,7761:::53-426,4

The output I want is:

H1,H2,H3,H4
a,2,8005:::+2287:::3426,2
b,4,1111:::+1500:::01354,1
c,4,2208:::+6583,9
d,5,7761:::+993733:::+53426:::87425,4

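My current script only removes duplicates without any further consideration, because I don't know how to combine these two commands or how to add the condition that keeps the record with more elements in column 3.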

awk -F, '{ gsub(/-/,"", $3); print } ' input.csv > input_without_hyphen.csv
awk -F',' -v OFS=',' '!a[$1]++' input_without_hyphen.csv > output.csv

Thanks for your help.

1 Answer


markp-fuso · Answer 1 · 2024-05-20T05:52:52+08:00

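Assumptions:

commas only appear as field delimiters (i.e., we do not have to worry about commas inside the actual data)
if two rows share the same key (column 1) and have the same number of elements in the 3rd column, then we keep the first one we process
input order is to be maintained (keys are printed in the order they first appear)

One awk idea: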

awk '
BEGIN  { FS = OFS = "," }

FNR==1 { print; next }              # print header as is

       { key = $1                   # define key for this row
         if (! (key in counts))     # if we have not seen this key before then ...
            order[++ordnum] = key   # associate the key with the next ordering number

         gsub(/-/,"",$3)            # remove all hyphens from 3rd column

         n = split($3,a,/:::/)      # split 3rd column on ":::" delimiter, store results in array a[]; "n" == number of elements in array

         if (!(key in rows) || n > counts[key]) {   # if this is the first row for this key (even with an empty 3rd column), or it has more elements than the best row so far, then ...
            counts[key] = n         # save the new count and ...
            rows[key]   = $0        # save the current row
         }
       }

END    { for (i=1; i<=ordnum; i++)  # iterate through our ordering numbers and ...
            print rows[order[i]]    # print the associated row to stdout
       }
' input.csv

This generates:

H1,H2,H3,H4
a,2,8005:::+2287:::3426,2
b,4,1111:::+1500:::01354,1
c,4,2208:::+6583,9
d,5,7761:::+993733:::+53426:::87425,4
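To see the element counting in isolation, here is a minimal sketch. The helper name count_elements is hypothetical (it is not part of the script above); it only wraps the same gsub() and split() calls so they can be tried on a single column-3 value. An empty value yields 0, because split() on an empty string returns 0.

```shell
# count_elements (hypothetical helper): strip hyphens from a column-3 value,
# then print how many ":::"-separated elements it contains
count_elements() {
    printf '%s\n' "$1" | awk '{ gsub(/-/, ""); n = split($0, a, /:::/); print n }'
}

count_elements '7761:::+993-733:::+53-426:::87425'   # prints 4
count_elements '1111:::+1500'                        # prints 2
count_elements ''                                    # prints 0 (empty 3rd column)
```

The count of 0 for an empty column is why a key whose rows all have an empty 3rd column needs its first row stored explicitly rather than relying on the "more elements than before" comparison alone.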

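Note: the output order matches the OP's expected output because the order[] array records each key the first time it is seen, and the END block prints the rows in that sequence. Looping with for (key in rows) instead would not guarantee any particular order.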
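A side note on why the script sets OFS as well as FS: when awk modifies a field, it rebuilds the whole record using OFS, which defaults to a single space. The sketch below illustrates this with two hypothetical helper functions (without_ofs and with_ofs are names made up for this example); the first mirrors the hyphen-removal command from the question, the second uses the BEGIN block from the answer.

```shell
line='b,4,1111:::+15-00:::01354,1'

# Only FS is set: changing $3 rebuilds the record with the default OFS (a space)
without_ofs() {
    printf '%s\n' "$1" | awk -F, '{ gsub(/-/, "", $3); print }'
}

# FS and OFS are both a comma: the rebuilt record keeps its commas
with_ofs() {
    printf '%s\n' "$1" | awk 'BEGIN { FS = OFS = "," } { gsub(/-/, "", $3); print }'
}

without_ofs "$line"   # b 4 1111:::+1500:::01354 1
with_ofs "$line"      # b,4,1111:::+1500:::01354,1
```

With the first form, any line that actually contained a hyphen comes out space-separated, so a later awk -F',' pass would see that whole line as a single field.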
