I have the following DataFrame:
dataDictionary = [('value1', [{'key': 'Fruit', 'value': 'Apple'}, {'key': 'Colour', 'value': 'White'}]),
('value2', [{'key': 'Fruit', 'value': 'Mango'}, {'key': 'Bird', 'value': 'Eagle'}, {'key': 'Colour', 'value': 'Black'}])]
df = spark.createDataFrame(data=dataDictionary)
df.printSchema()
df.show(truncate=False)
+------+------------------------------------------------------------------------------------------------+
|_1    |_2                                                                                              |
+------+------------------------------------------------------------------------------------------------+
|value1|[{value -> Apple, key -> Fruit}, {value -> White, key -> Colour}]                               |
|value2|[{value -> Mango, key -> Fruit}, {value -> Eagle, key -> Bird}, {value -> Black, key -> Colour}]|
+------+------------------------------------------------------------------------------------------------+
I only want to extract the value for the entries where key -> Colour. The result should be:
White
Black
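I tried several options using regexp_extract_all, as well as substring combined with instr, but the result is always null. Any suggestions would be greatly appreciated.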
result = spark.sql("""select
regexp_extract('_2', '''key': 'Colour' + '(\\w+)') as value
from table
""")