Question

mabanalyst

Asked: 2024-08-23 03:04:44 +0800 CST

Counting repetitions in PySpark

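I am currently working with a large DataFrame and have run into a problem.

I want to return the number of times each value is repeated in the table (its count).

For example, the number 10 appears twice, so I want to get 2 for it, and so on.

My code is: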

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

right_table_23 = [
    ("ID1", 2),
    ("ID2", 3),
    ("ID3", 5),
    ("ID4", 6),
    ("ID6", 10),
    ("ID8", 15),
    ("ID9", 10),
    ("ID10", 5),
    ("ID2", 5),
    ("ID3", 8),
    ("ID4", 3),
    ("ID2", 2),
    ("ID3", 4),
    ("ID4", 3)
]

The schema for the table above is:

schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Count", IntegerType(), True)
    ])

Next, I create the table with the following code:

df_right_table_23 = spark.createDataFrame(right_table_23, schema)

To count the repetitions, I use the following code:

# Counts how many rows hold the value 2 in the "Count" column.
# Note: df.count is the DataFrame method, so the column is accessed by name here.
df_right_table_23.where(df_right_table_23["Count"] == 2).count()

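However, if the values range from 2 to 100, rewriting the code above for every number would be tedious and time-consuming.

Is there a way to automate the repetition count?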

1 Answer

Voted

Derek Roberts · Answer 1 · 2024-08-23T03:16:47+08:00

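No need to worry: with the classic DataFrame groupBy and count functions in PySpark, you can count the repetitions of each value automatically.

I have to say you were almost there. Here is a code snippet that should help: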

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a Spark session
spark = SparkSession.builder.appName("CountRepetitions").getOrCreate()

# Same schema as in the question, with 'Count' renamed to 'Value' for clarity
schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Value", IntegerType(), True)
])

# Build the DataFrame from the right_table_23 list of tuples defined in the question
df_right_table_23 = spark.createDataFrame(right_table_23, schema)

# Group by both 'ID' and 'Value' and count the occurrences of each pair
result = df_right_table_23.groupBy("ID", "Value").count()

# Rename the 'count' column to 'Occurrences' for readability
result = result.withColumnRenamed("count", "Occurrences")

# Display the result
result.show()

The result looks like this:

+----+-----+-----------+
|  ID|Value|Occurrences|
+----+-----+-----------+
| ID1|    2|          1|
| ID2|    3|          1|
| ID3|    5|          1|
| ID6|   10|          1|
| ID4|    6|          1|
| ID9|   10|          1|
| ID8|   15|          1|
|ID10|    5|          1|
| ID3|    8|          1|
| ID2|    5|          1|
| ID4|    3|          2|
| ID2|    2|          1|
| ID3|    4|          1|
+----+-----+-----------+
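
Note that grouping by both ID and Value counts repeated (ID, Value) pairs, whereas the question asks how often each value occurs regardless of ID. In PySpark that would presumably mean grouping by the value column alone, for example groupBy("Value").count(). The following is a minimal plain-Python sketch, not part of the original answer, that uses collections.Counter on the question's sample data to show what such a per-value count should produce. It can serve as a quick sanity check for the Spark result on small inputs.

```python
from collections import Counter

# Sample rows from the question: (ID, value) pairs.
right_table_23 = [
    ("ID1", 2), ("ID2", 3), ("ID3", 5), ("ID4", 6), ("ID6", 10),
    ("ID8", 15), ("ID9", 10), ("ID10", 5), ("ID2", 5), ("ID3", 8),
    ("ID4", 3), ("ID2", 2), ("ID3", 4), ("ID4", 3),
]

# Count how often each value occurs, ignoring the ID column.
value_counts = Counter(value for _id, value in right_table_23)

# Print one line per distinct value, sorted by value.
for value, occurrences in sorted(value_counts.items()):
    print(value, occurrences)
```

Here the value 10 is counted twice, matching the example given in the question.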
