I have the following DataFrame:
dataDictionary = [('value1', [{'key': 'Fruit', 'value': 'Apple'}, {'key': 'Colour', 'value': 'White'}]),
('value2', [{'key': 'Fruit', 'value': 'Mango'}, {'key': 'Bird', 'value': 'Eagle'}, {'key': 'Colour', 'value': 'Black'}]),
('value3', [{'key': 'Fruit', 'value': 'Apple'}, {'key': 'colour', 'value': 'Blue'}])]
df = spark.createDataFrame(data=dataDictionary)
df.printSchema()
df.show(truncate=False)
+------+------------------------------------------------------------------------------------------------+
|_1    |_2                                                                                              |
+------+------------------------------------------------------------------------------------------------+
|value1|[{value -> Apple, key -> Fruit}, {value -> White, key -> Colour}]                               |
|value2|[{value -> Mango, key -> Fruit}, {value -> Eagle, key -> Bird}, {value -> Black, key -> Colour}]|
|value3|[{value -> Apple, key -> Fruit}, {value -> Blue, key -> colour}]                                |
+------+------------------------------------------------------------------------------------------------+
I only want to extract the value for key -> Colour, and the approach below gives exactly that:
from pyspark.sql import SparkSession, functions as F
...
df = df.select('_1', F.filter('_2', lambda x: x['key'] == 'Colour')[0]['value'])
Result:
_1 _2
value1 White
value2 Black
value3
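However, value3 returns nothing because its key is the lowercase colour, whereas value1 and value2 use the capitalized Colour, which is what the lambda in F.filter('_2', lambda x: x['key'] == 'Colour')[0]['value'] matches. I tried upper-casing the key to handle all three cases, but it does not work: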
F.filter('_2', lambda x: x['key'].upper() == 'COLOUR')[0]['value']
Any suggestions would be greatly appreciated.