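I have a variable number of pl.DataFrames that share some columns (e.g. symbol and date). Each pl.DataFrame has a number of additional columns, which are not important for the actual task.
The symbol columns have exactly the same content (the same distinct str values exist in every dataframe). The date columns differ somewhat, in that the pl.DataFrames do not all contain exactly the same dates.
The actual task is to find the common dates per grouping (i.e. symbol) and filter each pl.DataFrame accordingly.
Here are three example pl.DataFrames: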
import polars as pl

df1 = pl.DataFrame(
    {
        "symbol": ["AAPL"] * 4 + ["GOOGL"] * 3,
        "date": [
            "2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04",
            "2023-01-02", "2023-01-03", "2023-01-04",
        ],
        "some_other_col": range(7),
    }
)
df2 = pl.DataFrame(
    {
        "symbol": ["AAPL"] * 3 + ["GOOGL"] * 5,
        "date": [
            "2023-01-02", "2023-01-03", "2023-01-04",
            "2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05",
        ],
        "another_col": range(8),
    }
)
df3 = pl.DataFrame(
    {
        "symbol": ["AAPL"] * 4 + ["GOOGL"] * 2,
        "date": [
            "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05",
            "2023-01-03", "2023-01-04",
        ],
        "some_col": range(6),
    }
)
DataFrame 1:
shape: (7, 3)
┌────────┬────────────┬────────────────┐
│ symbol ┆ date       ┆ some_other_col │
│ ---    ┆ ---        ┆ ---            │
│ str    ┆ str        ┆ i64            │
╞════════╪════════════╪════════════════╡
│ AAPL   ┆ 2023-01-01 ┆ 0              │
│ AAPL   ┆ 2023-01-02 ┆ 1              │
│ AAPL   ┆ 2023-01-03 ┆ 2              │
│ AAPL   ┆ 2023-01-04 ┆ 3              │
│ GOOGL  ┆ 2023-01-02 ┆ 4              │
│ GOOGL  ┆ 2023-01-03 ┆ 5              │
│ GOOGL  ┆ 2023-01-04 ┆ 6              │
└────────┴────────────┴────────────────┘
DataFrame 2:
shape: (8, 3)
┌────────┬────────────┬─────────────┐
│ symbol ┆ date       ┆ another_col │
│ ---    ┆ ---        ┆ ---         │
│ str    ┆ str        ┆ i64         │
╞════════╪════════════╪═════════════╡
│ AAPL   ┆ 2023-01-02 ┆ 0           │
│ AAPL   ┆ 2023-01-03 ┆ 1           │
│ AAPL   ┆ 2023-01-04 ┆ 2           │
│ GOOGL  ┆ 2023-01-01 ┆ 3           │
│ GOOGL  ┆ 2023-01-02 ┆ 4           │
│ GOOGL  ┆ 2023-01-03 ┆ 5           │
│ GOOGL  ┆ 2023-01-04 ┆ 6           │
│ GOOGL  ┆ 2023-01-05 ┆ 7           │
└────────┴────────────┴─────────────┘
DataFrame 3:
shape: (6, 3)
┌────────┬────────────┬──────────┐
│ symbol ┆ date       ┆ some_col │
│ ---    ┆ ---        ┆ ---      │
│ str    ┆ str        ┆ i64      │
╞════════╪════════════╪══════════╡
│ AAPL   ┆ 2023-01-02 ┆ 0        │
│ AAPL   ┆ 2023-01-03 ┆ 1        │
│ AAPL   ┆ 2023-01-04 ┆ 2        │
│ AAPL   ┆ 2023-01-05 ┆ 3        │
│ GOOGL  ┆ 2023-01-03 ┆ 4        │
│ GOOGL  ┆ 2023-01-04 ┆ 5        │
└────────┴────────────┴──────────┘
Now, the first step is to find the common dates for each symbol.
AAPL: ["2023-01-02", "2023-01-03", "2023-01-04"]
GOOGL: ["2023-01-03", "2023-01-04"]
That means each pl.DataFrame needs to be filtered accordingly. The expected result looks like this:
DataFrame 1 filtered:
shape: (5, 3)
┌────────┬────────────┬────────────────┐
│ symbol ┆ date       ┆ some_other_col │
│ ---    ┆ ---        ┆ ---            │
│ str    ┆ str        ┆ i64            │
╞════════╪════════════╪════════════════╡
│ AAPL   ┆ 2023-01-02 ┆ 1              │
│ AAPL   ┆ 2023-01-03 ┆ 2              │
│ AAPL   ┆ 2023-01-04 ┆ 3              │
│ GOOGL  ┆ 2023-01-03 ┆ 5              │
│ GOOGL  ┆ 2023-01-04 ┆ 6              │
└────────┴────────────┴────────────────┘
DataFrame 2 filtered:
shape: (5, 3)
┌────────┬────────────┬─────────────┐
│ symbol ┆ date       ┆ another_col │
│ ---    ┆ ---        ┆ ---         │
│ str    ┆ str        ┆ i64         │
╞════════╪════════════╪═════════════╡
│ AAPL   ┆ 2023-01-02 ┆ 0           │
│ AAPL   ┆ 2023-01-03 ┆ 1           │
│ AAPL   ┆ 2023-01-04 ┆ 2           │
│ GOOGL  ┆ 2023-01-03 ┆ 5           │
│ GOOGL  ┆ 2023-01-04 ┆ 6           │
└────────┴────────────┴─────────────┘
DataFrame 3 filtered:
shape: (5, 3)
┌────────┬────────────┬──────────┐
│ symbol ┆ date       ┆ some_col │
│ ---    ┆ ---        ┆ ---      │
│ str    ┆ str        ┆ i64      │
╞════════╪════════════╪══════════╡
│ AAPL   ┆ 2023-01-02 ┆ 0        │
│ AAPL   ┆ 2023-01-03 ┆ 1        │
│ AAPL   ┆ 2023-01-04 ┆ 2        │
│ GOOGL  ┆ 2023-01-03 ┆ 4        │
│ GOOGL  ┆ 2023-01-04 ┆ 5        │
└────────┴────────────┴──────────┘
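You can find the intersection with joins, and then join again to "filter" each frame. For this you can use pl.DataFrame.join() with the how="semi" parameter. This can also be generalized a bit.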