Question

Oyibo

Asked: 2024-07-24 05:34:43 +0800 CST

Efficiently parsing formulas using regex and Polars

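I am trying to parse a series of mathematical formulas and need to extract the variable names efficiently using Polars in Python. Regex support in Polars seems to be limited, particularly for look-around assertions. Is there a simple, efficient way to parse the symbols out of a formula?

Here is my code snippet: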

import re
import polars as pl

# Define the regex pattern
FORMULA_DECODER = r"\b[A-Za-z][A-Za-z_0-9_]*\b(?!\()"
# \b          # Assert a word boundary to ensure matching at the beginning of a word
# [A-Za-z]    # Match an uppercase or lowercase letter at the start
# [A-Za-z_0-9_]* # Match zero or more following letters, digits, or underscores (the second "_" in the class is redundant)
# \b          # Assert a word boundary to ensure matching at the end of a word
# (?!\()      # Negative lookahead to ensure the match is not followed by an open parenthesis (indicating a function)

# Sample formulas
formulas = ["3*sin(x1+x2)+A_0",
            "ab*exp(2*x)"]

# expected result
pl.Series(formulas).map_elements(lambda formula: re.findall(FORMULA_DECODER, formula), return_dtype=pl.List(pl.String))
# Series: '' [list[str]]
# [
#   ["x1", "x2", "A_0"]
#   ["ab", "x"]
# ]

# Polars does not support this regex pattern
pl.Series(formulas).str.extract_all(FORMULA_DECODER)
# ComputeError: regex error: regex parse error:
#     \b[A-Za-z][A-Za-z_0-9_]*\b(?!\()
#                               ^^^
# error: look-around, including look-ahead and look-behind, is not supported

Edit: Here is a small benchmark:

import random
import string
import re
import polars as pl

def generate_symbol():
    """Generate random symbol of length 1-3."""
    characters = string.ascii_lowercase + string.ascii_uppercase
    return ''.join(random.sample(characters, random.randint(1, 3)))

def generate_formula():
    """Generate random formula of 2-6 symbols (not necessarily unique) joined by random operators."""
    op = ['+', '-', '*', '/']
    return ''.join([generate_symbol()+random.choice(op) for _ in range(random.randint(2, 6))])[:-1]


def generate_formulas(num_formulas):
    """Generate random formulas."""
    return [generate_formula() for _ in range(num_formulas)]

# Sample formulas
# formulas = ["3*sin(x1+x2)+(A_0+B)",
#             "ab*exp(2*x)"]

def parse_baseline(formulas):
    """Baseline serves as performance reference. It does not filter out function names."""
    FORMULA_DECODER_NO_LOOKAHEAD = r"\b[A-Za-z][A-Za-z_0-9_]*\b\(?"
    return pl.Series(formulas).str.extract_all(FORMULA_DECODER_NO_LOOKAHEAD)

def parse_lookahead(formulas):
    FORMULA_DECODER = r"\b[A-Za-z][A-Za-z_0-9_]*\b(?!\()"
    return pl.Series(formulas).map_elements(lambda formula: re.findall(FORMULA_DECODER, formula), return_dtype=pl.List(pl.String))

def parse_no_lookahead_and_filter(formulas):
    FORMULA_DECODER_NO_LOOKAHEAD = r"\b[A-Za-z][A-Za-z_0-9_]*\b\(?"
    return (
        pl.Series(formulas)
        .str.extract_all(FORMULA_DECODER_NO_LOOKAHEAD)
        # filter for matches not containing an open parenthesis
        .list.eval(pl.element().filter(~pl.element().str.contains("(", literal=True)))
    )

formulas = generate_formulas(1000)
%timeit parse_lookahead(formulas)
%timeit parse_no_lookahead_and_filter(formulas)
%timeit parse_baseline(formulas)
# 10.7 ms ± 387 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 1.31 ms ± 76.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# 708 μs ± 6.43 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
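
To put the timings above in perspective, the ratios can be worked out directly from the reported means. This is only a small arithmetic sketch: the variable names are illustrative, and the numbers are copied from the %timeit output above, so they reflect that one machine and run.

```python
# Mean timings copied from the %timeit output above, converted to milliseconds.
lookahead_ms = 10.7    # parse_lookahead (map_elements + re.findall)
filter_ms = 1.31       # parse_no_lookahead_and_filter
baseline_ms = 0.708    # parse_baseline (708 μs)

# How much faster the native extract_all + filter is than the Python-level lookahead.
speedup_vs_lookahead = lookahead_ms / filter_ms
# How much the extra filtering step costs relative to the unfiltered baseline.
overhead_vs_baseline = filter_ms / baseline_ms

print(f"filter approach vs. map_elements: {speedup_vs_lookahead:.1f}x faster")
print(f"filter approach vs. baseline: {overhead_vs_baseline:.2f}x slower")
```

On these numbers the filtering approach is roughly 8 times faster than the map_elements version and costs a little under twice as much as the unfiltered baseline.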

1 Answer

Hericks · Answer 1 · 2024-07-24T05:55:52+08:00

Best Answer

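As mentioned in the comments, you can drop the negative lookahead and instead optionally include the opening parenthesis in the match. In a post-processing step, you can then filter out any match that contains an opening parenthesis (using pl.Series.list.eval).

This could look as follows.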

# avoid negative lookahead and optionally match open parenthesis
FORMULA_DECODER_NO_LOOKAHEAD = r"\b[A-Za-z][A-Za-z_0-9_]*\b\(?"

(
    pl.Series(formulas)
    .str.extract_all(FORMULA_DECODER_NO_LOOKAHEAD)
    # filter for matches not containing an open parenthesis
    .list.eval(pl.element().filter(~pl.element().str.contains("(", literal=True)))
)

shape: (2,)
Series: '' [list[str]]
[
    ["x1", "x2", "A_0"]
    ["ab", "x"]
]
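
As a sanity check on the idea, the two patterns can be compared with Python's standard `re` module, which does support lookahead. This is a minimal sketch rather than part of the Polars solution: the helper names `symbols_with_lookahead` and `symbols_with_filter` are made up for illustration, and agreement is only checked on the sample formulas from the question.

```python
import re

# Pattern from the question: relies on a negative lookahead.
WITH_LOOKAHEAD = r"\b[A-Za-z][A-Za-z_0-9_]*\b(?!\()"
# Pattern from the answer: no lookahead; an opening parenthesis is optionally captured.
NO_LOOKAHEAD = r"\b[A-Za-z][A-Za-z_0-9_]*\b\(?"


def symbols_with_lookahead(formula):
    """Extract variable names by rejecting identifiers followed by '('."""
    return re.findall(WITH_LOOKAHEAD, formula)


def symbols_with_filter(formula):
    """Extract identifiers (with an optional trailing '('), then drop function calls."""
    return [m for m in re.findall(NO_LOOKAHEAD, formula) if "(" not in m]


# Sample formulas taken from the question (including the commented-out variant).
formulas = ["3*sin(x1+x2)+A_0", "ab*exp(2*x)", "3*sin(x1+x2)+(A_0+B)"]
for formula in formulas:
    # Both approaches should yield the same symbols in the same order.
    assert symbols_with_lookahead(formula) == symbols_with_filter(formula)
    print(formula, "->", symbols_with_filter(formula))
```

The filter works because a function name such as `sin` is always captured together with its `(`, so dropping every match that contains `(` removes exactly the identifiers the lookahead would have rejected.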
