Suppose I have a GCS bucket containing JSON files with the following structure:
[
{
"Id": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
"Name": "alibaba",
"storeid": "Y1",
"storeName": "alibaba1",
"a": "1/2/3",
"b": "1.0/1.0/3",
"c": "0/0/0",
"d": "0/0/0",
"e": "1.8/3.4",
"f": "1/2/3",
"g": "1/2/3",
},
{
"Id": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
"Name": "alibaba",
"storeUuid": "Y2",
"storeName": "alibaba2",
"a": "1/2/3",
"b": "1.0/1.0/3",
"c": "0/0/0",
"d": "0/0/0",
"e": "1.7/2.4",
"f": "1/2/3",
"g": "1/2/3",
},
{
"Id": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
"Name": "alibaba",
"storeUuid": "Y3",
"storeName": "alibaba3",
"a": "1/2/3",
"b": "1.0/1.0/3",
"c": "0/0/0",
"d": "0/0/0",
"e": "2.7/4.4",
"f": "1/2/3",
"g": "1/2/3",
},
{
"Id": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
"Name": "alibaba",
"storeUuid": "Y4",
"storeName": "alibaba4",
"a": "1/2/3",
"b": "1.0/1.0/3",
"c": "0/0/0",
"d": "0/0/0",
"e": "3.7/5.4",
"f": "1/2/3",
"g": "1/2/3",
}
]
What I want to do is aggregate the different values, summing a, b, c, d, f, and g and averaging e, to return a JSON like:
[
{
"Id": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
"Name": "alibaba",
"a": "sum over all first instances/sum over all second instances/sum over all third instances",
"b": "sum over all first instances/sum over all second instances/sum over all third instances",
"c": "sum over all first instances/sum over all second instances/sum over all third instances",
"d": "sum over all first instances/sum over all second instances/sum over all third instances",
"e": "average over all first instances/average over all second instances",
"f": "sum over all first instances/sum over all second instances/sum over all third instances",
"g": "sum over all first instances/sum over all second instances/sum over all third instances",
}
]
Note that any of the values in */*/* can be NaN, and the data in e can be the string data unavailable.
I have already created this function:
import numpy as np

def format_large_numbers_optimized(value):
    abs_values = np.abs(value)
    mask = abs_values >= 1e6
    formatted_values = np.where(mask,
                                np.char.add(np.round(value / 1e6, 2).astype(str), "M"),
                                np.round(value, 2).astype(str))
    return formatted_values

def process_json_data_optimized(json_list):
    result = {}
    keys = set(json_list[0].keys()) - {'Id', 'Name', 'storeid', 'storeName'}
    for key in keys:
        result[key] = {'values': []}
    for json_data in json_list:
        for key in keys:
            value = json_data.get(key, '0')
            result[key]['values'].append(value)
    for key in keys:
        all_values_processed = []
        for value in result[key]['values']:
            if isinstance(value, str) and '/' in value:
                processed_values = [float(v) if v != 'data unavailable' else 0 for v in value.split('/')]
            elif isinstance(value, float) or isinstance(value, int):
                processed_values = [value]
            else:
                processed_values = [0.0]
            all_values_processed.append(processed_values)
        numeric_values = np.array(all_values_processed)
        if numeric_values.ndim == 1:
            numeric_values = numeric_values[:, np.newaxis]
        summed_values = np.sum(numeric_values, axis=0)
        formatted_summed_values = '/'.join(format_large_numbers_optimized(summed_values))
        result[key]['summed'] = formatted_summed_values
    processed_result = {key: data['summed'] for key, data in result.items()}
    processed_result['Id'] = json_list[0]['Id']
    processed_result['Name'] = json_list[0]['Name']
    return processed_result
But it doesn't produce what I expect. I'm completely lost. Any help is greatly appreciated.
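Note that you are appending the values as lists to all_values_processed. Assuming the / character is just a separator, replacing all_values_processed.append(processed_values) with all_values_processed += processed_values should give you what you want.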
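To make the difference between the two calls concrete, here is a small illustration. The `rows` data is made up for the example and stands in for the already-parsed values of a single key.

```python
# Two hypothetical records' parsed values for one key (illustrative data only)
rows = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]

nested = []
flat = []
for processed_values in rows:
    nested.append(processed_values)  # keeps one inner list per record
    flat += processed_values         # extends with the individual numbers

print(nested)  # [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
print(flat)    # [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
```

With the nested form each position stays in its own column, while the flat form mixes all positions into one sequence, so which one you want depends on whether the positions should be aggregated separately.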
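Or, better yet, you can aggregate the values. For example, you could have a function that aggregates the values of a given key in the JSON, like this: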
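The original function is not reproduced here, so what follows is only a minimal sketch of what such an aggregation could look like. The names `aggregate_key` and `aggregate`, and the choice to keep one (count, sum) tuple per `/`-separated position, are assumptions made for illustration. NaN and `data unavailable` entries are skipped rather than counted as zero.

```python
import math

def aggregate_key(json_list, key):
    """Return one (count, total) tuple per '/'-separated position of `key`."""
    stats = []  # stats[i] holds (count, total) for position i
    for record in json_list:
        raw = record.get(key)
        if not isinstance(raw, str):
            continue  # missing key or non-string value: nothing to add
        for i, part in enumerate(raw.split('/')):
            if i == len(stats):
                stats.append((0, 0.0))  # first time this position is seen
            try:
                number = float(part)
            except ValueError:
                continue  # e.g. 'data unavailable'
            if math.isnan(number):
                continue  # float('NaN') parses fine, so skip it explicitly
            count, total = stats[i]
            stats[i] = (count + 1, total + number)
    return stats

def aggregate(json_list, skip=('Id', 'Name', 'storeid', 'storeUuid', 'storeName')):
    """Aggregate every key of the first record that is not listed in `skip`."""
    keys = [k for k in json_list[0] if k not in skip]
    return {key: aggregate_key(json_list, key) for key in keys}

# Hypothetical two-record input, shaped like the question's data
data = [
    {"Id": "X", "Name": "alibaba", "storeName": "s1", "a": "1/2/3", "e": "1.8/3.4"},
    {"Id": "X", "Name": "alibaba", "storeName": "s2", "a": "1/NaN/3", "e": "data unavailable"},
]
aggregated = aggregate(data)
print(aggregated)
```

Keeping the count next to the sum means the same structure can later yield either a sum or an average, without a second pass over the records.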
Now, calling it will give you a dictionary of tuples holding the count and sum for each field. To get your final answer, you could do this:
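As a rough sketch of that last step, the snippet below turns a dictionary of (count, sum) tuples into the slash-separated strings from the question. The function name `finalize`, the `mean_keys` parameter, and the sample `aggregated` values are hypothetical. It assumes each key maps to a list with one tuple per position.

```python
def finalize(aggregated, mean_keys=('e',)):
    """Build 'x/y/z' strings: averages for mean_keys, sums for everything else."""
    result = {}
    for key, stats in aggregated.items():
        parts = []
        for count, total in stats:
            if count == 0:
                parts.append('NaN')  # no usable value at this position
            elif key in mean_keys:
                parts.append(str(round(total / count, 2)))
            else:
                parts.append(str(round(total, 2)))
        result[key] = '/'.join(parts)
    return result

# Hypothetical aggregated input: (count, sum) per position
aggregated = {
    'a': [(4, 4.0), (4, 8.0), (4, 12.0)],
    'e': [(4, 10.0), (4, 15.6)],
}
summary = finalize(aggregated)
print(summary['a'])  # 4.0/8.0/12.0
```

The Id and Name fields can then be copied over from the first record, as the question's own code already does.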
Since you tagged pandas, and if I understand the task correctly, here is a potential generic solution:
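The answer's original code is not available, so the following is only one possible pandas approach, written under stated assumptions: records are grouped by Id and Name, the store columns are ignored, e is averaged, and every other column is summed. The function name `aggregate_with_pandas` and its parameters are made up for this sketch. Text that cannot be parsed as a number (such as `data unavailable`) becomes NaN and is skipped by `sum` and `mean`.

```python
import pandas as pd

def aggregate_with_pandas(records, mean_cols=('e',), id_cols=('Id', 'Name'),
                          ignore_cols=('storeid', 'storeUuid', 'storeName')):
    df = pd.DataFrame(records)
    id_cols = list(id_cols)
    value_cols = [c for c in df.columns
                  if c not in id_cols and c not in ignore_cols]
    rows = []
    for group_key, group in df.groupby(id_cols, sort=False):
        row = dict(zip(id_cols, group_key))
        for col in value_cols:
            # One column per '/'-separated position
            parts = group[col].astype(str).str.split('/', expand=True)
            # Unparseable text and missing positions become NaN
            parts = parts.apply(pd.to_numeric, errors='coerce')
            # sum() and mean() skip NaN by default
            agg = parts.mean() if col in mean_cols else parts.sum()
            row[col] = '/'.join(str(round(float(v), 2)) for v in agg)
        rows.append(row)
    return rows

# Hypothetical two-record input, shaped like the question's data
records = [
    {"Id": "X", "Name": "alibaba", "storeName": "s1", "a": "1/2/3", "e": "1.8/3.4"},
    {"Id": "X", "Name": "alibaba", "storeName": "s2", "a": "1/NaN/3", "e": "data unavailable"},
]
result = aggregate_with_pandas(records)
print(result)
```

One behaviour worth knowing: a position where every value is NaN sums to 0.0 but averages to NaN, because that is how pandas treats empty sums and means.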
Output:
Input used: