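I have a large Excel spreadsheet. I'm only interested in certain columns. Furthermore, I'm only interested in rows where particular columns meet certain criteria.
The following works: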
import pandas as pd
import warnings
# this suppresses the openpyxl warning that we're seeing
warnings.filterwarnings("ignore", category=UserWarning, module="openpyxl")
# These are the columns we're interested in
COLUMNS = [
"A",
"B",
"C"
]
# the source file
XL = "source.xlsx"
# sheet name in the source file
SHEET = "Sheet1"
# the output file
OUTPUT = "target.xlsx"
# the sheet name to be used in the output file
OUTSHEET = "Sheet1"
# This loads only the columns of interest into a pandas dataframe and drops rows with missing values
df = pd.read_excel(XL, sheet_name=SHEET, usecols=COLUMNS).dropna()
# keep only the rows where A contains "FOO" as a whole word
df = df[df["A"].str.contains(r"\bFOO\b", regex=True)]
# now keep only those rows where B contains "BAR" as a whole word
df = df[df["B"].str.contains(r"\bBAR\b", regex=True)]
# output to the new spreadsheet
df.to_excel(OUTPUT, sheet_name=OUTSHEET, index=False)
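This does work. However, I can't help thinking that there may be a better way to manage the selection criteria, especially as they become more complex.
Or is a "step-by-step" approach fine?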
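
To make the idea concrete, one possible way to manage the criteria is to keep them in a single mapping from column name to pattern and combine the resulting masks in one place. This is only a minimal sketch, not a definitive answer: `CRITERIA`, `build_mask`, and the in-memory `sample` frame are hypothetical names, and `sample` stands in for the real spreadsheet so the snippet runs without a file.

```python
import operator
from functools import reduce

import pandas as pd

# Hypothetical criteria table: column name -> regex the cell must match
CRITERIA = {
    "A": r"\bFOO\b",
    "B": r"\bBAR\b",
}


def build_mask(df, criteria):
    """Return a boolean Series that is True where every criterion matches."""
    masks = [
        # na=False treats missing cells as non-matching
        df[col].str.contains(pattern, regex=True, na=False)
        for col, pattern in criteria.items()
    ]
    # AND all per-column masks together; an empty criteria dict keeps every row
    return reduce(operator.and_, masks, pd.Series(True, index=df.index))


# Small in-memory stand-in for the spreadsheet (assumed data)
sample = pd.DataFrame({
    "A": ["FOO one", "FOOD", "a FOO", None],
    "B": ["BAR", "BAR", "BARN", "BAR"],
    "C": [1, 2, 3, 4],
})

filtered = sample[build_mask(sample, CRITERIA)]
print(filtered)
```

Adding or changing a criterion then means editing `CRITERIA` rather than adding another filtering line. Because `na=False` already excludes missing cells in the filtered columns, the blanket `.dropna()` is only needed if rows with gaps in other columns should also be dropped.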