Up front
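If you just want to hack at the code, you can safely ignore everything below (and including) the "JOIN: Start" section. The background and results serve only as context. If you want to see what the code looked like initially, see the edit history before 2015-10-06.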
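Objective
Ultimately, I want to calculate interpolated GPS coordinates for the transmitter (X or Xmit) based on the DateTime stamps of the available GPS data in table SecondTable that directly flank the observation in table FirstTable.
My immediate objective toward that ultimate goal is to figure out how best to join FirstTable to SecondTable to get those flanking time points. Later I can use that information to calculate intermediate GPS coordinates, assuming a linear fit along an equirectangular coordinate system (a fancy way of saying I don't care that the Earth is a sphere at this scale).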
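As a minimal sketch of the interpolation described in the objective, here is the arithmetic in Python rather than Access SQL. The function name, tuple layout, and sample coordinates are hypothetical and chosen only for illustration; latitude and longitude are treated as plain Cartesian axes, which is the equirectangular assumption stated above.

```python
from datetime import datetime


def interpolate_position(t, before, after):
    """Linearly interpolate (lat, lon) at time t between two GPS fixes.

    before and after are (timestamp, lat, lon) tuples with
    before[0] <= t < after[0].
    """
    t0, lat0, lon0 = before
    t1, lat1, lon1 = after
    # Fraction of the way from the "before" fix to the "after" fix.
    after_weight = (t - t0).total_seconds() / (t1 - t0).total_seconds()
    lat = lat0 * (1 - after_weight) + lat1 * after_weight
    lon = lon0 * (1 - after_weight) + lon1 * after_weight
    return lat, lon


# Hypothetical fixes four seconds apart; the observation is 1 s after "before".
before = (datetime(2015, 10, 1, 12, 0, 0), 40.0, -105.0)
after = (datetime(2015, 10, 1, 12, 0, 4), 40.004, -105.008)
observed = datetime(2015, 10, 1, 12, 0, 1)
lat, lon = interpolate_position(observed, before, after)
```

With the observation a quarter of the way between the two fixes, a quarter of each coordinate comes from the "after" fix and three quarters from the "before" fix.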
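Questions
- Is there a more efficient way to generate the closest before and after timestamps?
  - Fixed by myself, by just grabbing the "after", and then getting the "before" only as it relates to the "after".
- Is there a more intuitive way that doesn't involve the (A<>B OR A=B) structure?
  - Byrdzeye provided the basic alternatives; however, my "real-world" experience did not line up with all 4 of his join strategies performing the same. Full credit to him, though, for addressing the alternative join styles.
- Any other thoughts, tricks, and advice you may have.
  - So far, both byrdzeye and Phrancis have been quite helpful in this regard. I found Phrancis' advice excellently laid out, and it provided help at a critical stage, so I'll give him the edge here.
I would still appreciate any additional help I can get with regard to question 3. The bullet points reflect who I believe helped me the most on each individual question.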
Table definitions
Semi-visual representation
FirstTable
Fields
RecTStamp | DateTime --can contain milliseconds via VBA code (see Ref 1)
ReceivID | LONG
XmitID | TEXT(25)
Keys and Indices
PK_DT | Primary, Unique, No Null, Compound
XmitID | ASC
RecTStamp | ASC
ReceivID | ASC
UK_DRX | Unique, No Null, Compound
RecTStamp | ASC
ReceivID | ASC
XmitID | ASC
SecondTable
Fields
X_ID | LONG AUTONUMBER -- seeded after main table has been created and already sorted on the primary key
XTStamp | DateTime --will not contain partial seconds
Latitude | Double --these are in decimal degrees, not degrees/minutes/seconds
Longitude | Double --this way straight decimal math can be performed
Keys and Indices
PK_D | Primary, Unique, No Null, Simple
XTStamp | ASC
UIDX_ID | Unique, No Null, Simple
X_ID | ASC
ReceiverDetails table
Fields
ReceivID | LONG
Receiver_Location_Description | TEXT -- NULL OK
Beginning | DateTime --no partial seconds
Ending | DateTime --no partial seconds
Lat | DOUBLE
Lon | DOUBLE
Keys and Indices
PK_RID | Primary, Unique, No Null, Simple
ReceivID | ASC
ValidXmitters table
Field (and primary key)
XmitID | TEXT(25) -- primary, unique, no null, simple
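SQL Fiddle...
...so that you can play with the table definitions and code
This question is for MS Access, but as Phrancis pointed out, there is no SQL Fiddle style for Access. So, you should be able to go here to see my table definitions and code based on Phrancis' answer:
http://sqlfiddle.com/#!6/e9942/4 (external link)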
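JOIN: Start
My current "inner guts" JOIN strategy
First, create a FirstTable_rekeyed with column order and compound primary key (RecTStamp, ReceivID, XmitID), all indexed/sorted ASC. I also created indexes on each column individually. Then fill it like so.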
INSERT INTO FirstTable_rekeyed (RecTStamp, ReceivID, XmitID)
SELECT DISTINCTROW RecTStamp, ReceivID, XmitID
FROM FirstTable
WHERE XmitID IN (SELECT XmitID from ValidXmitters)
ORDER BY RecTStamp, ReceivID, XmitID;
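The above query fills the new table with 153006 records and returns within 10 seconds or so.
When the TOP 1 subquery method is used, the following completes within a second or two when the whole thing is wrapped in a "SELECT Count(*) FROM ( ... )":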
SELECT
ReceiverRecord.RecTStamp,
ReceiverRecord.ReceivID,
ReceiverRecord.XmitID,
(SELECT TOP 1 XmitGPS.X_ID FROM SecondTable as XmitGPS WHERE ReceiverRecord.RecTStamp < XmitGPS.XTStamp ORDER BY XmitGPS.X_ID) AS AfterXmit_ID
FROM FirstTable_rekeyed AS ReceiverRecord
-- INNER JOIN SecondTable AS XmitGPS ON (ReceiverRecord.RecTStamp < XmitGPS.XTStamp)
GROUP BY RecTStamp, ReceivID, XmitID;
-- No separate join needed for the Top 1 method, but it would be required for the other methods.
-- Additionally no restriction of the returned set is needed if I create the _rekeyed table.
-- May not need GROUP BY either. Could try ORDER BY.
-- The three AfterXmit_ID alternatives below take longer than 3 minutes to complete (or do not ever complete).
-- FIRST(XmitGPS.X_ID)
-- MIN(XmitGPS.X_ID)
-- MIN(SWITCH(XmitGPS.XTStamp > ReceiverRecord.RecTStamp, XmitGPS.X_ID, Null))
Previous "inner guts" JOIN queries
First (fastish... but not good enough)
SELECT
A.RecTStamp,
A.ReceivID,
A.XmitID,
MAX(IIF(B.XTStamp<= A.RecTStamp,B.XTStamp,Null)) as BeforeXTStamp,
MIN(IIF(B.XTStamp > A.RecTStamp,B.XTStamp,Null)) as AfterXTStamp
FROM FirstTable as A
INNER JOIN SecondTable as B ON
(A.RecTStamp<>B.XTStamp OR A.RecTStamp=B.XTStamp)
GROUP BY A.RecTStamp, A.ReceivID, A.XmitID
-- alternative for BeforeXTStamp MAX(-(B.XTStamp<=A.RecTStamp)*B.XTStamp)
-- alternatives for AfterXTStamp (see "Aside" note below)
-- 1.0/(MAX(1.0/(-(B.XTStamp>A.RecTStamp)*B.XTStamp)))
-- -1.0/(MIN(1.0/((B.XTStamp>A.RecTStamp)*B.XTStamp)))
Second (slower)
SELECT
A.RecTStamp, AbyB1.XTStamp AS BeforeXTStamp, AbyB2.XTStamp AS AfterXTStamp
FROM (FirstTable AS A INNER JOIN
(select top 1 B1.XTStamp, A1.RecTStamp
from SecondTable as B1, FirstTable as A1
where B1.XTStamp<=A1.RecTStamp
order by B1.XTStamp DESC) AS AbyB1 --MAX (time points before)
ON A.RecTStamp = AbyB1.RecTStamp) INNER JOIN
(select top 1 B2.XTStamp, A2.RecTStamp
from SecondTable as B2, FirstTable as A2
where B2.XTStamp>A2.RecTStamp
order by B2.XTStamp ASC) AS AbyB2 --MIN (time points after)
ON A.RecTStamp = AbyB2.RecTStamp;
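Background
I have a telemetry table (aliased as A) of just under 1 million entries, with a compound primary key based on a DateTime stamp, a transmitter ID, and a recording device ID. Due to circumstances beyond my control, my SQL language is the standard Jet DB in Microsoft Access (users will use 2007 and later versions). Only about 200,000 of these entries are relevant to the query because of the transmitter ID.
There is a second telemetry table (alias B) that holds approximately 50,000 entries with a single DateTime primary key.
For the first step, I focused on finding the timestamps in the second table that are closest to the stamps in the first table.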
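To make "closest timestamps on either side" concrete, here is a small sketch in Python (not Access SQL) using the standard-library bisect module on a sorted list. The function name and the integer sample stamps are hypothetical. It follows the same convention as the queries in this post: "before" is the last stamp less than or equal to the observation, and "after" is the first stamp strictly greater than it.

```python
from bisect import bisect_right


def flanking_indices(sorted_stamps, t):
    """Return (before_idx, after_idx) into sorted_stamps for time t.

    before is the last stamp <= t, after is the first stamp > t.
    Either index is None when t falls outside the covered range.
    """
    after_idx = bisect_right(sorted_stamps, t)
    before_idx = after_idx - 1  # "before" is simply the row preceding "after"
    return (
        before_idx if before_idx >= 0 else None,
        after_idx if after_idx < len(sorted_stamps) else None,
    )


gps_stamps = [10, 20, 30, 40]  # hypothetical XTStamp values, in seconds
inside = flanking_indices(gps_stamps, 25)
exact = flanking_indices(gps_stamps, 20)
too_early = flanking_indices(gps_stamps, 5)
too_late = flanking_indices(gps_stamps, 40)
```

Finding only the "after" position and stepping back one row for the "before" is the same idea as joining on X_ID and X_ID + 1 when the IDs were assigned in timestamp order.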
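JOIN Results
Quirks that I discovered...
...along the way during debugging
It feels really odd to be writing the JOIN logic as
FROM FirstTable as A INNER JOIN SecondTable as B ON (A.RecTStamp<>B.XTStamp OR A.RecTStamp=B.XTStamp)
which, as @byrdzeye pointed out in a comment (that has since disappeared), is a form of cross join. Note that substituting LEFT OUTER JOIN for INNER JOIN in the code above appears to have no impact on the quantity or identity of the rows returned. I also can't seem to leave off the ON clause or say ON (1=1). Just using a comma to join (rather than INNER or LEFT OUTER JOIN) results in Count(select * from A) * Count(select * from B) rows returned by this query, rather than just one row per table A row, as the explicit (A<>B OR A=B) JOIN returns. This is clearly not suitable. FIRST does not seem to be available for use given a compound primary key type.
The second JOIN style, although arguably more legible, suffers from being slower. This may be because two additional inner JOINs are required against the larger table, as well as the two CROSS JOINs found in both options.
Aside: Replacing the IIF clause with MIN/MAX appears to return the same number of entries.
MAX(-(B.XTStamp<=A.RecTStamp)*B.XTStamp)
works for the "Before" (MAX) timestamp, but does not work directly for the "After" (MIN), written as follows:
MIN(-(B.XTStamp>A.RecTStamp)*B.XTStamp)
because the minimum is always 0 for the FALSE condition. This 0 is less than any post-epoch DOUBLE (which a DateTime field is a subset of in Access, and which this calculation transforms the field into). The IIF and MIN/MAX alternatives proposed for the AfterXTStamp value work because division by zero (FALSE) generates null values, which the aggregate functions MIN and MAX skip over.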
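The aside above can be modelled in a few lines of Python. This is only a sketch of the arithmetic, not Access itself: the helper names and sample values are hypothetical, TRUE is modelled as -1 and FALSE as 0 as Jet represents them, and division by zero is modelled as yielding NULL (None) because that is the behaviour the post describes.

```python
def access_bool(condition):
    """Jet/Access represents TRUE as -1 and FALSE as 0."""
    return -1 if condition else 0


def agg(func, values):
    """Apply MIN or MAX while skipping None, as SQL aggregates skip NULL."""
    kept = [v for v in values if v is not None]
    return func(kept) if kept else None


def safe_div(numerator, denominator):
    """Assumption from the post: division by zero produces NULL (None)."""
    return None if denominator == 0 else numerator / denominator


gps = [100.0, 200.0, 300.0, 400.0]  # hypothetical XTStamp values as DOUBLEs
rec = 250.0                         # hypothetical RecTStamp

# MAX(-(B.XTStamp<=A.RecTStamp)*B.XTStamp): FALSE rows contribute 0, which
# never beats a positive timestamp, so MAX still finds the "before" value.
before = agg(max, [-access_bool(x <= rec) * x for x in gps])

# MIN(-(B.XTStamp>A.RecTStamp)*B.XTStamp): FALSE rows contribute 0, and 0
# is smaller than every real timestamp, so MIN is always 0.
naive_after = agg(min, [-access_bool(x > rec) * x for x in gps])

# 1.0/(MAX(1.0/(-(B.XTStamp>A.RecTStamp)*B.XTStamp))): FALSE rows divide
# by zero, become NULL, and are skipped by the aggregate.
fixed_after = 1.0 / agg(max, [safe_div(1.0, -access_bool(x > rec) * x) for x in gps])
```

Here before is 200.0, naive_after collapses to 0.0, and fixed_after recovers approximately 300.0.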
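Next Steps
Taking this further, I wish to find the timestamps in the second table that directly flank the timestamps in the first table, and perform a linear interpolation of the data values from the second table based on the time distance to those points (i.e. if the timestamp from the first table is 25% of the way between "before" and "after", I would like 25% of the calculated value to come from the second-table data associated with the "after" point and 75% from the "before"). Using the revised join type as part of the inner guts, and after the suggested answers below, I produce...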
SELECT
AvgGPS.XmitID,
StrDateIso8601Msec(AvgGPS.RecTStamp) AS RecTStamp_ms,
-- StrDateIso8601MSec is a VBA function returning a TEXT string in yyyy-mm-dd hh:nn:ss.lll format
AvgGPS.ReceivID,
RD.Receiver_Location_Description,
RD.Lat AS Receiver_Lat,
RD.Lon AS Receiver_Lon,
AvgGPS.Before_Lat * (1 - AvgGPS.AfterWeight) + AvgGPS.After_Lat * AvgGPS.AfterWeight AS Xmit_Lat,
AvgGPS.Before_Lon * (1 - AvgGPS.AfterWeight) + AvgGPS.After_Lon * AvgGPS.AfterWeight AS Xmit_Lon,
AvgGPS.RecTStamp AS RecTStamp_basic
FROM ( SELECT
AfterTimestampID.RecTStamp,
AfterTimestampID.XmitID,
AfterTimestampID.ReceivID,
GPSBefore.XTStamp AS BeforeXTStamp,
GPSBefore.Latitude AS Before_Lat,
GPSBefore.Longitude AS Before_Lon,
GPSAfter.XTStamp AS AfterXTStamp,
GPSAfter.Latitude AS After_Lat,
GPSAfter.Longitude AS After_Lon,
( (AfterTimestampID.RecTStamp - GPSBefore.XTStamp) / (GPSAfter.XTStamp - GPSBefore.XTStamp) ) AS AfterWeight
FROM (
(SELECT
ReceiverRecord.RecTStamp,
ReceiverRecord.ReceivID,
ReceiverRecord.XmitID,
(SELECT TOP 1 XmitGPS.X_ID FROM SecondTable as XmitGPS WHERE ReceiverRecord.RecTStamp < XmitGPS.XTStamp ORDER BY XmitGPS.X_ID) AS AfterXmit_ID
FROM FirstTable AS ReceiverRecord
-- WHERE ReceiverRecord.XmitID IN (select XmitID from ValidXmitters)
GROUP BY RecTStamp, ReceivID, XmitID
) AS AfterTimestampID INNER JOIN SecondTable AS GPSAfter ON AfterTimestampID.AfterXmit_ID = GPSAfter.X_ID
) INNER JOIN SecondTable AS GPSBefore ON AfterTimestampID.AfterXmit_ID = GPSBefore.X_ID + 1
) AS AvgGPS INNER JOIN ReceiverDetails AS RD ON (AvgGPS.ReceivID = RD.ReceivID) AND (AvgGPS.RecTStamp BETWEEN RD.Beginning AND RD.Ending)
ORDER BY AvgGPS.RecTStamp, AvgGPS.ReceivID;
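...which returns 152928 records, conforming (at least approximately) to the final number of expected records. Run time is probably 5-10 minutes on my i7-4790, 16GB RAM, no SSD, Win 8.1 Pro system.
Reference 1: MS Access Can Handle Millisecond Time Values--Really and accompanying source file [08080011.txt]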
I must first compliment you on your courage to do something like this with an Access DB, in which, in my experience, it is very difficult to do anything SQL-like. Anyway, on to the review.
First join
Your IIF field selections might benefit from using a Switch statement instead. It seems to be sometimes the case, especially with things SQL, that a SWITCH (more commonly known as CASE in typical SQL) is quite fast when just making simple comparisons in the body of a SELECT. The syntax in your case would be almost identical, although a switch can be expanded to cover a large chunk of comparisons in one field. Something to consider.
A switch can also help readability, in larger statements. In context:
As for the join itself, I think (A.RecTStamp<>B.XTStamp OR A.RecTStamp=B.XTStamp) is about as good as you're going to get, given what you are trying to do. It's not that fast, but I wouldn't expect it to be either.
Second join
You said this is slower. It's also less readable from a code standpoint. Given equally satisfactory result sets between 1 and 2, I'd say go for 1. At least it's obvious what you are trying to do that way. Subqueries are often not very fast (though often unavoidable), especially since in this case you are throwing an extra join into each one, which must certainly complicate the execution plan.
One remark: I saw that you used the old ANSI-89 join syntax. It's best to avoid that. The performance will be the same or better with the more modern join syntax, which is also less ambiguous, easier to read, and harder to get wrong.
Naming things
I think the way your things are named is unhelpful at best, and cryptic at worst. A, B, A1, B1 etc. as table aliases I think could be better. Also, I think the field names are not very good, but I realize you may not have control over this. I will just quickly quote The Codeless Code on the topic of naming things, and leave it at that...
"Next steps" query
I couldn't make much sense of it how it was written, I had to take it to a text editor and do some style changes to make it more readable. I know Access' SQL editor is beyond clunky, so I usually write my queries in a good editor like Notepad++ or Sublime Text. Some of the stylistic changes I applied to make it more readable:
So as it turns out, this is a very complicated query indeed. To make sense of it, I have to start from the innermost query, your ID data set, which I understand is the same as your First Join. It returns the IDs and timestamps of the devices where the before/after timestamps are the closest, within the subset of devices you are interested in. So instead of ID why not call it ClosestTimestampID.
Your Det join is used only once:
The rest of the time, it only joins the values you already have from ClosestTimestampID. So instead we should be able to just do this:
Maybe not a huge performance gain, but anything we can do to help the poor Jet DB optimizer will help!
I can't shake the feeling that the calculations/algorithm for BeforeWeight and AfterWeight which you use to interpolate could be done better, but unfortunately I'm not very good with those.
One suggestion to avoid crashing (although it's not ideal depending on your application) would be to break out your nested subqueries into tables of their own and update those when needed. I'm not sure how often you need your source data to be refreshed, but if it is not too often you might think of writing some VBA code to schedule an update of the tables and derived tables, and just leave your outermost query to pull from those tables instead of the original source. Just a thought, like I said not ideal but given the tool you may not have a choice.
Everything together:
SQL Server execution plan (because Access can't show this)
Without the final ORDER BY, because it is expensive:
Clustered Index Scan [ReceiverDetails].[PK_ReceiverDetails] Cost 16%
Clustered Index Seek [FirstTable].[PK_FirstTable] Cost 19%
Clustered Index Seek [SecondTable].[PK_SecondTable] Cost 16%
Clustered Index Seek [SecondTable].[PK_SecondTable] Cost 16%
Clustered Index Seek [SecondTable].[PK_SecondTable] [TL2] Cost 16%
Clustered Index Seek [SecondTable].[PK_SecondTable] [TL1] Cost 16%
With the final ORDER BY:
Sort Cost 36%
Clustered Index Scan [ReceiverDetails].[PK_ReceiverDetails] Cost 10%
Clustered Index Seek [FirstTable].[PK_FirstTable] Cost 12%
Clustered Index Seek [SecondTable].[PK_SecondTable] Cost 10%
Clustered Index Seek [SecondTable].[PK_SecondTable] Cost 10%
Clustered Index Seek [SecondTable].[PK_SecondTable] [TL2] Cost 10%
Clustered Index Seek [SecondTable].[PK_SecondTable] [TL1] Cost 10%
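Code:
Performance testing my query against the query containing the cross join.
FirstTable was loaded with 13 records and SecondTable with 1,000,000 records.
The execution plan for my query did not change much from what was posted.
Execution plan for the cross join:
Nested Loops Cost 81% when using
INNER JOIN SecondTable AS B ON (A.RecTStamp <> B.XTStamp OR A.RecTStamp = B.XTStamp)
Nested Loops drops to 75% if using
'CROSS JOIN SecondTable AS B' or ',SecondTable AS B'
Stream Aggregate 8%
Index Scan [SecondTable][UK_ID][B] 6%
Table Spool 5%
Several other Clustered Index Seeks and Index Seeks (similar to my query as posted) with a Cost of 0%.
Execution time is 0.007 seconds for my query and 8-9 seconds for the CROSS JOIN.
Cost comparison 0% and 100%.
I loaded FirstTable with 50,000 records and a single record into ReceiverDetails for the join condition, and ran my query.
50,013 returned in between 0.9 and 1.0 seconds.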
I ran second query with the cross join and allowed it to run for about 20 minutes before I killed it.
If the cross join query is filtered to return only the original 13, execution time is again, 8-9 seconds.
Placement of the filter condition was at inner most select, outer most select and both. No difference.
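A rough sketch of why the two approaches scale so differently, written in Python rather than SQL. The function names and sample data are hypothetical, and this models only the number of row pairs touched, not what either database engine actually does internally. The cross-join style examines every (record, GPS row) pair, while a seek against a sorted index goes directly to the first later timestamp.

```python
from bisect import bisect_right


def after_by_cross_join(rec_stamps, gps_stamps):
    """For each record, scan every GPS row and keep the smallest later stamp."""
    pairs_examined = 0
    result = []
    for rec in rec_stamps:
        best = None
        for gps in gps_stamps:
            pairs_examined += 1
            if gps > rec and (best is None or gps < best):
                best = gps
        result.append(best)
    return result, pairs_examined


def after_by_seek(rec_stamps, sorted_gps):
    """For each record, binary-search the sorted GPS stamps for the next one."""
    result = []
    for rec in rec_stamps:
        i = bisect_right(sorted_gps, rec)
        result.append(sorted_gps[i] if i < len(sorted_gps) else None)
    return result


gps = list(range(0, 10_000, 10))      # 1,000 hypothetical GPS stamps
recs = [5, 995, 4321, 9990, 12345]    # hypothetical receiver stamps
slow, pairs = after_by_cross_join(recs, gps)
fast = after_by_seek(recs, gps)
```

Both functions return the same "after" values, but the scan touches len(recs) * len(gps) pairs, which is the multiplication that makes the cross join explode as either table grows.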
There is a difference between these two join conditions in favor of the CROSS JOIN: the first uses a predicate, the CROSS JOIN does not:
INNER JOIN SecondTable AS B ON (A.RecTStamp <> B.XTStamp OR A.RecTStamp = B.XTStamp)
CROSS JOIN SecondTable AS B
Adding a second answer, not better than the first but without changing any of the requirements presented: there are a few ways to beat Access into submission and make it appear snappy. 'Materialize' the complications a bit at a time, effectively using 'triggers'. Access tables do not have triggers, so intercept and inject the CRUD processes.
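One way to picture the "intercept and inject" idea is the sketch below. It is Python rather than Access/VBA, and the class, method, and attribute names are all hypothetical. The point is only that every write goes through one routine, and that routine also updates a small derived table, so the expensive lookup is paid once per inserted row instead of once per query.

```python
from bisect import bisect_right


class MaterializedFlanks:
    """Keeps a derived 'after position' per receiver record, updated on insert."""

    def __init__(self, gps_stamps):
        self.gps_stamps = sorted(gps_stamps)  # stands in for SecondTable
        self.records = []                     # stands in for FirstTable
        self.after_index = {}                 # stands in for a derived table

    def insert_record(self, rec_stamp, receiv_id, xmit_id):
        """All inserts go through here, so this acts as the 'trigger'."""
        key = (rec_stamp, receiv_id, xmit_id)
        self.records.append(key)
        # Derived row computed at write time rather than at query time.
        self.after_index[key] = bisect_right(self.gps_stamps, rec_stamp)
        return key


store = MaterializedFlanks([10, 20, 30])
key = store.insert_record(25, 1, "X1")
after_position = store.after_index[key]
```

In Access the equivalent would be routing inserts, updates, and deletes through forms or VBA procedures that also maintain the derived tables; how that is wired up depends on the application.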