I have a table my_link_t with a column t.weight of type number and a column speedcat holding integer category values from 2 to 8. I want to split the data into buckets (min=0, max=0.6, step=0.001) and build a 3D plot to see the weight distribution for each category.
The initial data looks like:

weight   speedcat
0.0234   2
0.8643   6
0.1854   7

(plus about a hundred million more entries with weight between 0 and 0.6 and speedcat between 2 and 8)
These queries return correct results and complete in under a minute:
--repeat for each variable. Here we look for speedcat = 8
--It takes seconds to run this query
create table histogram_tbl_8 as (
select ttt."Start" as bucket_index, ttt.hist_row as bin8 --here
FROM ((
SELECT Bucket*1 "Start" , Bucket "End", Count(Bucket) hist_row
FROM (SELECT WIDTH_BUCKET (weight, 0, 0.6, 601) Bucket FROM my_link_t where speedcat=8)
GROUP BY Bucket ORDER BY Bucket ) ttt ) );
Repeat the above query seven times, once for each speedcat in range 2..8 (a loop sketch follows below).
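To avoid copy-pasting the statement seven times, the seven tables could also be generated with a small PL/SQL loop and dynamic SQL. This is only a sketch, assuming the naming pattern histogram_tbl_N / binN from above is kept and the tables do not yet exist:

-- Hypothetical helper: builds histogram_tbl_2 .. histogram_tbl_8 in one pass.
BEGIN
  FOR cat IN 2 .. 8 LOOP
    EXECUTE IMMEDIATE
      'CREATE TABLE histogram_tbl_' || cat || ' AS
       SELECT Bucket AS bucket_index, COUNT(Bucket) AS bin' || cat || '
       FROM (SELECT WIDTH_BUCKET(weight, 0, 0.6, 601) Bucket
             FROM my_link_t WHERE speedcat = ' || cat || ')
       GROUP BY Bucket';
  END LOOP;
END;
/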
--if a bin is empty populate it with zero, don't skip it.
create table histogram_output as (
select tr.bucket_index,
CASE
WHEN 1 > (select count(*) from histogram_tbl_2 htm where htm.bucket_index = tr.bucket_index) THEN 0
ELSE (select htm.bin2 from histogram_tbl_2 htm where htm.bucket_index = tr.bucket_index and rownum = 1)
END
as b2,
--same for b3-b7
CASE
WHEN 1 > (select count(*) from histogram_tbl_8 htm where htm.bucket_index = tr.bucket_index) THEN 0
ELSE (select htm.bin8 from histogram_tbl_8 htm where htm.bucket_index = tr.bucket_index and rownum = 1)
END
as b8
FROM (SELECT LEVEL as bucket_index, 0 as b2, /* 0 as b3, 0 as b4, 0 as b5, 0 as b6, 0 as b7, */ 0 as b8 FROM DUAL CONNECT BY LEVEL < 600) tr
)
Finally:
-- the per-category totals, used below as normalization denominators
select sum(b2), sum(b3), sum(b4), sum(b5), sum(b6), sum(b7), sum(b8) from histogram_output
select bucket_index,
round(b2 * 1000000 / 12921) as b2, --normalize so that total is 1000000 ppm
-- repeat for b3-b7
round(b8 * 1000000 / 6262) as b8 --normalize so that total is 1000000 ppm
from histogram_output
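As a side note, the hard-coded totals (12921, 6262, ...) could instead be derived in the same statement with analytic sums; a minimal sketch, assuming histogram_output has the columns shown above:

-- Sketch: compute each denominator from the data instead of hard-coding it.
select bucket_index,
       round(b2 * 1000000 / sum(b2) over ()) as b2,
       -- repeat for b3-b7
       round(b8 * 1000000 / sum(b8) over ()) as b8
from histogram_output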
I get a table like this:
bin_end  speedcat_2  speedcat_3  speedcat_4  ..  speedcat_8
0.001
0.002
 ..
0.599
0.600
showing the ppm of objects in that category and that bin. Now, when I combine the queries into one:
-- DON'T USE THE EXAMPLE BELOW - it is inefficient (runs 2+ hours instead of seconds for the method above)
SELECT Bucket_2*1 "Start" , Bucket_2 "End",
Count(Bucket_2) as b2,
--same for b3 .. b7
Count(Bucket_8) as b8
FROM
(
SELECT WIDTH_BUCKET (t2.weight, 0, 0.6, 601) Bucket_2,
--same for t3,.. t7
WIDTH_BUCKET (t8.weight, 0, 0.6, 601) Bucket_8
FROM (select weight from my_link_t where speedcat = 2) t2,
-- ..speedcat = 3) t3, .. speedcat = 4) t4, etc
(select weight from my_link_t where speedcat = 8 ) t8
)
GROUP BY Bucket_2 ORDER BY Bucket_2
------
the query runs for hours (roughly 500 times longer than a single query) until I kill it. Books recommend doing all data slicing in SQL; this example suggests that for complex queries it may be better to load the data into Java and slice it there.

What causes the difference?
Short answer:
Your 7-way Cartesian JOIN is going to have some serious performance problems: the seven inline views in the FROM clause have no join condition, so their row counts multiply before anything is bucketed or grouped.

Long answer:
Think in sets.
I assume the data set you need contains: speedcat, bucket_index, count(*). A single set-based query produces that data set directly (see the sketch below).
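A minimal sketch of what that query can look like, assuming the same WIDTH_BUCKET parameters as in the question:

-- One pass over the table: bucket and count per speedcat in a single GROUP BY.
SELECT speedcat,
       WIDTH_BUCKET(weight, 0, 0.6, 601) AS bucket_index,
       COUNT(*) AS cnt
FROM   my_link_t
GROUP BY speedcat, WIDTH_BUCKET(weight, 0, 0.6, 601)
ORDER BY speedcat, bucket_index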
That (x, y, z) format is what most graphing packages expect.
If you want the result in a "grid" format instead, PIVOT the result (a sketch follows below).
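A minimal sketch of such a pivot, assuming Oracle 11g or later (where the PIVOT clause is available):

-- Sketch: one row per bucket_index, one count column per speedcat category.
SELECT *
FROM (
  SELECT speedcat,
         WIDTH_BUCKET(weight, 0, 0.6, 601) AS bucket_index
  FROM   my_link_t
)
PIVOT (
  COUNT(*) FOR speedcat IN (2 AS b2, 3 AS b3, 4 AS b4, 5 AS b5, 6 AS b6, 7 AS b7, 8 AS b8)
)
ORDER BY bucket_index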