I'm using Microsoft SQL Server 2017, and recently we hit a situation where the IO_COMPLETION wait type was contributing 60% of waits. After running UPDATE STATISTICS, the problem went away. There is a procedure that uses a table variable, and we are seeing the IO_COMPLETION wait type at the individual query level. Would running UPDATE STATISTICS against the database resolve the IO_COMPLETION problem occurring in that particular procedure?
I'm using Postgres 12.10 on AWS RDS. A query that uses GROUP BY to find the minimum date is faster than a plain MIN on the date. I expected the plain MIN to be just as fast, but I'm not sure whether I created the wrong index or whether there's another parameter I need to tune.
I have a table:
CREATE TABLE IF NOT EXISTS public.ed
(
isd character varying(90) COLLATE pg_catalog."default" NOT NULL,
e_id character varying(32) COLLATE pg_catalog."default" NOT NULL,
d_date timestamp with time zone NOT NULL,
CONSTRAINT ed_pkey PRIMARY KEY (isd, e_id)
)
Indexes:
CREATE INDEX IF NOT EXISTS ix_ed_d_date
ON public.ed USING btree
(d_date ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX IF NOT EXISTS ix_ed_e_id
ON public.ed USING btree
(e_id COLLATE pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;
The query with just min takes 3 minutes:
select min(d_date)
from ed
where e_id = '62e2032b029b036ba25c73cf';
EXPLAIN ANALYZE of the query:
Result (cost=171.70..171.71 rows=1 width=8) (actual time=186940.968..186941.463 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.56..171.70 rows=1 width=8) (actual time=186940.963..186940.964 rows=1 loops=1)
-> Index Scan using ix_ed_d_date on ed (cost=0.56..2214942.25 rows=12943 width=8) (actual time=186940.961..186940.962 rows=1 loops=1)
Index Cond: (d_date IS NOT NULL)
Filter: ((e_id)::text = '62e2032b029b036ba25c73cf'::text)
Rows Removed by Filter: 30539883
Planning Time: 0.195 ms
Execution Time: 186941.491 ms
While the query using GROUP BY takes under a second:
select min(d_date)
from ed
where e_id in ('62e2032b029b036ba25c73cf')
group by e_id;
EXPLAIN ANALYZE:
GroupAggregate (cost=0.56..5365.73 rows=2319 width=33) (actual time=92.093..92.095 rows=1 loops=1)
Group Key: e_id
-> Index Scan using ix_ed_e_id on ed (cost=0.56..5277.83 rows=12943 width=33) (actual time=6.753..90.622 rows=6698 loops=1)
Index Cond: ((e_id)::text = '62e2032b029b036ba25c73cf'::text)
Planning Time: 0.098 ms
Execution Time: 92.127 ms
I get the same result either way, but why does the simpler query use the d_date index? How can I get the plain min(d_date), without GROUP BY, to perform the way the GROUP BY version does?
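From the first plan, the planner chose to walk ix_ed_d_date in date order, betting that a row matching the e_id filter would turn up early; instead it filtered out ~30 million rows. One plausible fix (a sketch, not verified against this data; the index name is mine) is a composite index that puts the equality column first:

```sql
-- Hypothetical composite index: with (e_id, d_date), Postgres can jump
-- straight to the smallest d_date for one e_id value, so the plain
-- min(d_date) becomes a single index probe (Limit 1 over an index scan)
-- instead of a filtered scan of the whole date index.
CREATE INDEX IF NOT EXISTS ix_ed_e_id_d_date
    ON public.ed USING btree (e_id, d_date);
```

With such an index in place, the existing ix_ed_e_id index may become redundant for this workload, since the composite index can also serve plain e_id equality lookups.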
The application I work with uses a SQL Server database that includes many tables holding a single row of configuration data, which is sometimes needed in queries against more conventional multi-row tables. Most of the code I've seen accesses these tables via joins when a single query needs them, but in a recent code review I saw an approach using a scalar subquery, roughly like this:
Select T.Id
From dbo.SomeTable T
Where T.SomeValue > (Select Tolerance From dbo.Settings)
While it clearly works, my initial reaction was to assume it violated our standard practice, but after experimenting with the form I found it raises the error "Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <=, >, >= or when the subquery is used as an expression." That makes it seem like this form avoids the risk of an accidental 1:n join causing bad behavior. (In practice these single-row tables shouldn't be a worry - they're fairly robust - but I have seen that problem come up elsewhere in the system.)
Apart from a (presumably very cheap) Stream Aggregate and Assert, which I take to be responsible for the engine's ability to detect and raise the error in the multi-row case, the execution plans for my simple test cases look very similar.
Is there a generally accepted best practice for using tables like this? What are the main pros and cons I should be aware of when choosing between the approaches?
(I know that using a variable to hold the data is also an option, but that isn't always feasible in some of our code, so I'd like to focus on comparing these two approaches and/or any other way of collapsing it into a single query.)
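For comparison, the join form the codebase already favors might look like this (a sketch; with a guaranteed one-row Settings table the two forms should return identical results, but their failure modes differ):

```sql
-- Join-based equivalent of the scalar-subquery form: a CROSS JOIN
-- against the one-row Settings table. If Settings ever gains a second
-- row, this silently multiplies the result rows, whereas the scalar
-- subquery raises "Subquery returned more than 1 value" instead.
SELECT T.Id
FROM dbo.SomeTable AS T
CROSS JOIN dbo.Settings AS S
WHERE T.SomeValue > S.Tolerance;
```

The trade-off in a nutshell: the subquery form fails loudly on bad data, the join form fails silently; which is preferable depends on whether a hard error or degraded results is the safer outcome for the application.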
Azure SQL Database.
I have a table from which I need to get the first row and the most recent row per Col1 and Col2, based on CreateDate.
CREATE TABLE dbo.table1 (
Id INT IDENTITY(1,1) PRIMARY KEY ,
Col1 VARCHAR(255) COLLATE SQL_Latin1_General_CP1_CS_AS NOT NULL ,
Col2 VARCHAR(255) COLLATE SQL_Latin1_General_CP1_CS_AS NOT NULL ,
CreateDate DATETIME NOT NULL
) ;
I have an index like this:
CREATE INDEX IX__table1_ASC
ON dbo.table1 (Col1, Col2, CreateDate );
My query to get the first row is (plan here):
--Get the first row
SELECT TOP (1) WITH TIES
*
FROM table1
ORDER BY ROW_NUMBER()
OVER (PARTITION BY Col1, Col2
ORDER BY CreateDate );
The index scan is using the index I created (IX__table1_ASC), but why am I getting a Sort?
My query to get the latest row is (plan here):
--get latest row
SELECT TOP (1) WITH TIES
*
FROM table1
ORDER BY ROW_NUMBER()
OVER (PARTITION BY Col1, Col2
ORDER BY CreateDate DESC); --desc here
Again, the index scan uses the index (IX__table1_ASC), but this time I get two Sorts, the first right after the index scan. Is the optimizer not smart enough to read the index in reverse order? And what is the second Sort for?
The actual table is very large, so as you can imagine the Sorts are expensive. How can I best optimize here?
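One thing worth trying (a sketch, not verified against this table; the index name is mine) is a second index whose CreateDate key is descending, so the DESC window order matches an index order directly rather than relying on a backward scan:

```sql
-- Hypothetical descending index: the keys match
-- PARTITION BY Col1, Col2 ORDER BY CreateDate DESC,
-- so ROW_NUMBER() in the "latest row" query can consume rows
-- already in window order instead of sorting them.
CREATE INDEX IX__table1_DESC
    ON dbo.table1 (Col1, Col2, CreateDate DESC);
```

Also note that both queries select *, and IX__table1_ASC does not cover the remaining columns; depending on costing, the optimizer may prefer scanning the clustered index and sorting. Selecting only the needed columns, or adding INCLUDE columns, can change that calculus.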
In a course on the impact of indexes on performance, the instructor used this example to show us how preparing an index can improve the performance of the first query:
SELECT
SOH.CustomerID,
SOH.SalesOrderID,
SOH.OrderDate,
C.TerritoryID,
ROW_NUMBER() OVER ( PARTITION BY SOH.CustomerID
ORDER BY SOH.OrderDate ) AS Row_Num
FROM Sales.SalesOrderHeader AS SOH
JOIN Sales.Customer AS C
ON SOH.CustomerID = C.CustomerID;
GO
The second query:
WITH Sales
AS
(
SELECT
CustomerID,
OrderDate,
SalesOrderID,
ROW_NUMBER() OVER ( PARTITION BY CustomerID
ORDER BY OrderDate ) AS Row_Num
FROM Sales.SalesOrderHeader
)
SELECT
Sales.CustomerID,
Sales.SalesOrderID,
Sales.OrderDate,
C.TerritoryID,
Sales.Row_Num
FROM Sales
JOIN Sales.Customer AS C
ON C.CustomerID = Sales.CustomerID;
GO
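The course's actual index isn't shown above; a plausible index for this demonstration (my own sketch on the AdventureWorks schema, not the instructor's) keys the columns the way the window function partitions and orders them:

```sql
-- Hypothetical supporting index (name is mine): keyed on
-- (CustomerID, OrderDate), it delivers rows already in the order
-- ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate)
-- needs, so the plan can drop its Sort operator.
CREATE NONCLUSTERED INDEX IX_SalesOrderHeader_CustomerID_OrderDate
    ON Sales.SalesOrderHeader (CustomerID, OrderDate)
    INCLUDE (SalesOrderID);
```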
I have a table consisting of the following columns:
- id, a string that looks like 8b28347448d3fff (15 characters long)
- x, decimal (8,6)
- y, decimal (9,6)
There are indexes on all columns. Now, I want to find matching pairs. On the table side (foo), there can be up to 300k rows. There are two ways I can think of to query the table. First, this one, using WHERE ... IN; on the query side, possible_matching_indexes can have up to 11k elements:
SELECT id FROM foo WHERE id IN (possible_matching_indexes);
The other is this one, which yields only four values on the query side (x1, x2, y1, y2):
SELECT id FROM foo WHERE (x BETWEEN x1 AND x2) AND (y BETWEEN y1 AND y2);
Which one is likely to perform better? I'm using an SQLite database, but I guess this can be estimated for any SQL-based database?
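For the range version, a composite index is the standard way to let SQLite narrow on x via the index and check y from the same index entries, rather than probing the id index 11k times (a sketch; the index name and the :x1-style bound parameters are mine):

```sql
-- Hypothetical composite index for the range query: SQLite can use the
-- leading x column for the BETWEEN range and evaluate the y condition
-- from the same index rows, avoiding table lookups for non-matches.
CREATE INDEX IF NOT EXISTS idx_foo_x_y ON foo (x, y);

SELECT id
FROM foo
WHERE x BETWEEN :x1 AND :x2
  AND y BETWEEN :y1 AND :y2;
```

Which form wins still depends on selectivity: an 11k-element IN list is 11k index probes, while the range form's cost is proportional to how many rows fall inside the x range. EXPLAIN QUERY PLAN on both, with realistic data, is the reliable arbiter.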
This simplified query on SQL Server 2017 takes 40+ seconds to complete, and I suspect a parameter sniffing problem, but I'm not 100% sure.
exec sp_executesql N'
SELECT
T.[TicketRecId]
, T.[Title]
FROM dbo.Ticket T
INNER JOIN [dbo].[State] S
ON S.[StateRecId] = T.[StateRecId]
INNER JOIN [dbo].[StateType] ST
ON S.[StateTypeRecId] = ST.StateTypeRecId
INNER JOIN [dbo].[Board] B
ON B.[BoardRecId] = T.[BoardRecId]
WHERE 1=1
AND ST.[Name] NOT IN (''Closed'',''Canceled'')
AND T.BoardRecId IN (SELECT items FROM dbo.Split(@BoardRecId, '',''))
--AND T.BoardRecId = @BoardRecId2
--AND T.BoardRecId = 17
ORDER BY T.[TicketRecId] ASC
OFFSET (@PageNo - 1) * @PageSize ROWS FETCH NEXT @PageSize ROWS ONLY
',N'@PageNo int,@PageSize int,@TeamRecId int,@BoardRecId VARCHAR(2),@BoardRecId2 INT'
,@PageNo=1,@PageSize=50,@TeamRecId=4,@BoardRecId ='17',@BoardRecId2=17
Some observations:
- This query returns 15 rows out of the 1.7 million in the dbo.Ticket table.
- The State, StateType, and Board tables are really small, with 400, 20, and 35 rows respectively.
- In the WHERE clause, if I swap the IN filter for = on T.BoardRecId, it completes in 7 seconds.
- Removing the OFFSET FETCH, the original query completes in 13 seconds and the previous variant in 1 second.
- If I set the parameter value @BoardRecId='14', the duration improves (most of the table contains rows with this value).
- Tried updating statistics with a full scan of the tables; no change in performance.
- Tried creating different indexes; no improvement.
- Tried OPTION (RECOMPILE) inside the sp_executesql batch; it didn't help.
- Haven't tried rebuilding indexes yet, since that has to be done during a maintenance window.
- Tried replacing dbo.Split with a table variable and/or a temp table; no improvement.
I need to be able to support multiple BoardRecIds, which is the reason for the dbo.Split function; all it does is break up a comma-separated string for use in the IN clause.
The schema is much wider column-wise, so I've tried to simplify it; note that the joined columns all have indexes.
CREATE TABLE [Support].[Ticket] (
[TicketRecId] BIGINT IDENTITY (1, 1) NOT NULL,
[BoardRecId] INT NOT NULL,
[StateRecId] INT NOT NULL,
[Title] VARCHAR (250) NOT NULL,
[IsDeleted] BIT NOT NULL,
[IsTemplate] BIT NOT NULL,
CONSTRAINT [PK_Ticket_TicketRecId] PRIMARY KEY CLUSTERED ([TicketRecId] ASC),
CONSTRAINT [FK_Ticket_BoardRecId] FOREIGN KEY ([BoardRecId]) REFERENCES [Support].[Board] ([BoardRecId])
);
GO
CREATE NONCLUSTERED INDEX [nc_Ticket_BoardRecIdStateRecIdIsDeletedIsTemplate_Include]
ON [Support].[Ticket] ([BoardRecId],[StateRecId],[IsDeleted],[IsTemplate])
GO
CREATE NONCLUSTERED INDEX [nc_Ticket_BoardRecId_IsDeleted_IsTemplate_Includes]
ON [Support].[Ticket] ([BoardRecId], [IsDeleted], [IsTemplate], [ContactRecId], [ContactSourceRecId])
INCLUDE ([StateRecId]);
GO
CREATE TABLE [dbo].[State] (
[StateRecId] INT IDENTITY (1, 1) NOT NULL,
[BoardRecId] INT NOT NULL,
[StateTypeRecId] INT NOT NULL,
[Name] VARCHAR (50) NOT NULL,
[SortOrder] SMALLINT NOT NULL,
[IsDefault] BIT NOT NULL,
[IsDeleted] BIT DEFAULT ((0)) NOT NULL,
CONSTRAINT [PK_State_StateRecId] PRIMARY KEY CLUSTERED ([StateRecId] ASC),
CONSTRAINT [FK_State_BoardRecId] FOREIGN KEY ([BoardRecId]) REFERENCES [dbo].[Board] ([BoardRecId]),
CONSTRAINT [FK_State_StateTypeRecId] FOREIGN KEY ([StateTypeRecId]) REFERENCES [dbo].[StateType] ([StateTypeRecId])
);
GO
CREATE TABLE [dbo].[StateType] (
[StateTypeRecId] INT IDENTITY (1, 1) NOT NULL,
[Name] VARCHAR (50) NOT NULL,
[IsDeleted] BIT DEFAULT ((0)) NOT NULL,
CONSTRAINT [PK_StateType_StateTypeRecId] PRIMARY KEY CLUSTERED ([StateTypeRecId] ASC)
);
GO
CREATE TABLE [dbo].[Board] (
[BoardRecId] INT IDENTITY (1, 1) NOT NULL,
[TeamRecId] INT NOT NULL,
[Name] VARCHAR (50) NOT NULL,
[IsExternal] BIT DEFAULT ((0)) NOT NULL,
[IsDefault] BIT DEFAULT ((0)) NOT NULL,
[IsDeleted] BIT DEFAULT ((0)) NOT NULL,
[UpdatedDateUTC] DATETIMEOFFSET (0) DEFAULT (SYSUTCDATETIME()) NOT NULL,
CONSTRAINT [PK_Board_BoardRecId] PRIMARY KEY CLUSTERED ([BoardRecId] ASC),
CONSTRAINT [FK_Board_TeamRecId] FOREIGN KEY ([TeamRecId]) REFERENCES [dbo].[Team] ([TeamRecId])
);
GO
CREATE NONCLUSTERED INDEX [nc_TicketType_BoardRecId]
ON [dbo].[Board]([BoardRecId] ASC);
GO
Simplified plan using T.BoardRecId IN (SELECT items FROM dbo.Split(...)) - 40+ seconds
Simplified plan using T.BoardRecId = @BoardRecId2 - 7 seconds
Simplified plan using T.BoardRecId = 17 and dbo.Split(...) - 12 seconds
None of these durations is great, but #2 and #3 are far better than 40+ seconds, so I'm just trying to figure out how to make the best of a bad situation, and hoping someone has a silver bullet here.
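Since this is SQL Server 2017, one combination worth testing (a sketch, not verified against this schema; it assumes the IDs are ints and database compatibility level 130+ for STRING_SPLIT) is materializing the split IDs into a temp table and then recompiling the paging statement, so the optimizer sees both a real cardinality for the ID list and the literal paging values at once:

```sql
-- Hypothetical rewrite: a temp table (unlike a table variable on 2017
-- without RECOMPILE) carries statistics, and STRING_SPLIT replaces the
-- user-defined dbo.Split. OPTION (RECOMPILE) on the paging statement
-- lets the optimizer embed the parameter values.
CREATE TABLE #boards (BoardRecId int PRIMARY KEY);

INSERT INTO #boards (BoardRecId)
SELECT CONVERT(int, value)
FROM STRING_SPLIT(@BoardRecId, ',');

SELECT T.TicketRecId, T.Title
FROM dbo.Ticket AS T
JOIN dbo.[State] AS S       ON S.StateRecId = T.StateRecId
JOIN dbo.StateType AS ST    ON ST.StateTypeRecId = S.StateTypeRecId
JOIN #boards AS B           ON B.BoardRecId = T.BoardRecId
WHERE ST.[Name] NOT IN ('Closed', 'Canceled')
ORDER BY T.TicketRecId
OFFSET (@PageNo - 1) * @PageSize ROWS FETCH NEXT @PageSize ROWS ONLY
OPTION (RECOMPILE);
```

The key difference from the variants already tried is applying both pieces together: the temp table fixes the cardinality estimate for the ID list, and the statement-level recompile fixes the row-goal estimate that OFFSET/FETCH with parameters otherwise induces.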
I'm trying to optimize a table with 80+ million rows. Getting a row-count result takes 20+ minutes. I've tried CLUSTER, VACUUM FULL, and REINDEX, but performance hasn't improved. What do I need to configure or tune to improve data querying and retrieval? I'm using PostgreSQL 12 on Windows 2019.
Updated info:
- Total row count is now about 92+ million
- Number of table columns = 44
- EXPLAIN result for 'select count(*) from doc_details':
Finalize Aggregate (cost=5554120.84..5554120.85 rows=1 width=8) (actual time=1249204.001..1249210.027 rows=1 loops=1)
  -> Gather (cost=5554120.63..5554120.83 rows=2 width=8) (actual time=1249203.642..1249210.020 rows=3 loops=1)
       Workers Planned: 2
       Workers Launched: 2
       -> Partial Aggregate (cost=5553120.63..5553120.63 rows=1 width=8) (actual time=1249153.615..1249153.616 rows=1 loops=3)
            -> Parallel Seq Scan on doc_details (cost=0.00..5456055.30 rows=38826130 width=0) (actual time=3.793..1245165.604 rows=31018949 loops=3)
Planning Time: 1.290 ms
Execution Time: 1249210.115 ms
(I don't know how to get the row size in KB/MB.)
Machine info:
- Windows 2019 Datacenter
- 32 GB RAM
- PostgreSQL 12
Table info:
Table "public.doc_details"
Column | Type | Collation | Nullable | Default
-------------------------+--------------------------------+-----------+----------+----------------------------------------------
id | integer | | not null | nextval('doc_details_id_seq'::regclass)
trans_ref_number | character varying(30) | | not null |
outbound_time | timestamp(0) without time zone | | |
lm_tracking | character varying(30) | | not null |
cargo_dealer_tracking | character varying(30) | | not null |
order_sn | character varying(30) | | |
operations_no | character varying(30) | | |
box_no | character varying(30) | | |
box_size | character varying(30) | | |
parcel_weight_kg | numeric(8,3) | | |
parcel_size | character varying(30) | | |
box_weight_kg | numeric(8,3) | | |
box_volume | integer | | |
parcel_volume | integer | | |
transportation | character varying(100) | | |
channel | character varying(30) | | |
service_code | character varying(20) | | |
country | character varying(60) | | |
destination_code | character varying(20) | | |
assignee_name | character varying(100) | | |
assignee_province_state | character varying(30) | | |
assignee_city | character varying(30) | | |
postal_code | character varying(20) | | |
assignee_telephone | character varying(30) | | |
assignee_address | text | | |
shipper_name | character varying(100) | | |
shipper_country | character varying(60) | | |
shipper_province | character varying(30) | | |
shipper_city | character varying(30) | | |
shipper_address | text | | |
shipper_telephone | character varying(30) | | |
package_qty | integer | | |
hs_code | integer | | |
hs_code_manual | integer | | |
reviewed | boolean | | |
created_at | timestamp(0) without time zone | | |
updated_at | timestamp(0) without time zone | | |
invalid | boolean | | |
arrival_id | integer | | |
excel_row_number | integer | | |
is_additional | boolean | | |
arrival_datetime | timestamp(6) without time zone | | |
invoice_date | timestamp without time zone | | |
unit_code | character varying(100) | | |
Indexes:
"doc_details_pkey" PRIMARY KEY, btree (id) CLUSTER
"doc_details_box_no_idx" btree (box_no)
"doc_details_trans_ref_number_idx" btree (trans_ref_number)
Triggers:
trigger_log_awb_box AFTER INSERT ON doc_details FOR EACH ROW EXECUTE FUNCTION log_awb_box()
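An exact count(*) in Postgres must visit every visible row, so on a 92M-row table it is inherently expensive. If an approximate count is acceptable, the planner's own estimate is essentially free (a sketch; reltuples is maintained by VACUUM/ANALYZE and autovacuum, so its accuracy depends on how recently those ran):

```sql
-- Read the planner's row estimate from the catalog instead of
-- scanning the table. Returns an approximation, not an exact count.
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'doc_details';
```

For an exact count, the main levers are a faster disk, more parallel workers (max_parallel_workers_per_gather), and keeping the table well-vacuumed so an index-only scan on a small index (such as doc_details_pkey) can satisfy the count without reading all 44 columns' worth of heap pages.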
I'm using DBCC SHOW_STATISTICS to look for skewed data in my histograms and to improve the quality of my statistics.
OPTIMIZE FOR UNKNOWN doesn't use a value - instead, it uses the density vector.
If you run DBCC SHOW_STATISTICS, it's the value listed in the "All density" column of the second result set.
In the image below, the estimated number of rows differs from the actual number because of data skew.
This talks about @variables and recompile, which can help and are part of the solution.
Question:
How can I find queries in the cached execution plans whose estimated row count differs from the actual row count?
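One caveat: cached plans store only estimates; per-operator actual row counts are not kept in the plan cache. A common workaround (a sketch, assuming SQL Server 2012+ for the total_rows column) is to treat the average rows returned per execution from sys.dm_exec_query_stats as a statement-level "actual" and compare it against the estimate extracted from the cached plan XML:

```sql
-- Compare each cached statement's optimizer estimate (StatementEstRows
-- in the showplan XML) with its average rows actually returned.
-- Large gaps are candidates for skew/sniffing investigation.
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (50)
    qs.execution_count,
    qs.total_rows * 1.0 / qs.execution_count        AS avg_actual_rows,
    s.stmt.value('(@StatementEstRows)[1]', 'float') AS estimated_rows,
    st.[text]                                       AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle)    AS st
CROSS APPLY qp.query_plan.nodes('//StmtSimple')    AS s(stmt)
WHERE qs.execution_count > 0
ORDER BY ABS(qs.total_rows * 1.0 / qs.execution_count
             - s.stmt.value('(@StatementEstRows)[1]', 'float')) DESC;
```

This is only a statement-level proxy; for true per-operator actuals you need a captured actual plan, lightweight profiling (sys.dm_exec_query_profiles), or Query Store runtime statistics.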
I have a recursive query that takes a long time - 30+ ms - where manually extracting the same data with individual queries takes < 0.12 ms, so we're talking about a factor of 250.
I have the following database structure, which allows a DAG of group memberships (db-fiddle here):
create table subjects
(
subject_id bigint not null
constraint pk_subjects
primary key
);
create table subject_group_members
(
subject_group_id bigint not null
constraint fk_subject_group_members_subject_group_id_subjects_subject_id
references subjects(subject_id)
on delete cascade,
subject_id bigint not null
constraint fk_subject_group_members_subject_id_subjects_subject_id
references subjects(subject_id)
on delete cascade,
constraint pk_subject_group_members
primary key (subject_group_id, subject_id)
);
create index idx_subject_group_members_subject_id
on subject_group_members (subject_id);
create index idx_subject_group_members_subject_group_id
on subject_group_members (subject_group_id);
The data might look like this:
subject_group_id | subject_id |
---|---|
1 | 2 |
1 | 3 |
1 | 4 |
2 | 5 |
3 | 5 |
I want to know all the groups that 5 belongs to (1 through inheritance, 2 and 3 directly; not 4 or any other subject_id).
This query works as expected:
with recursive flat_members(subject_group_id, subject_id) as (
select subject_group_id, subject_id
from subject_group_members gm
union
select
flat_members.subject_group_id as subject_group_id,
subject_group_members.subject_id as subject_id
from subject_group_members
join flat_members on flat_members.subject_id = subject_group_members.subject_group_id
)
select * from flat_members where subject_id = 5
But running against real data, I get this query plan:
CTE Scan on flat_members (cost=36759729.47..59962757.76 rows=5156229 width=16) (actual time=26.526..55.166 rows=3 loops=1)
Filter: (subject_id = 30459)
Rows Removed by Filter: 48984
CTE flat_members
-> Recursive Union (cost=0.00..36759729.47 rows=1031245702 width=16) (actual time=0.022..47.638 rows=48987 loops=1)
-> Seq Scan on subject_group_members gm (cost=0.00..745.82 rows=48382 width=16) (actual time=0.019..4.286 rows=48382 loops=1)
-> Merge Join (cost=63629.74..1613406.96 rows=103119732 width=16) (actual time=10.897..11.038 rows=320 loops=2)
Merge Cond: (subject_group_members.subject_group_id = flat_members_1.subject_id)
-> Index Scan using idx_subject_group_members_subject_group_id on subject_group_members (cost=0.29..1651.02 rows=48382 width=16) (actual time=0.009..1.987 rows=24192 loops=2)
-> Materialize (cost=63629.45..66048.55 rows=483820 width=16) (actual time=4.124..6.592 rows=24668 loops=2)
-> Sort (cost=63629.45..64839.00 rows=483820 width=16) (actual time=4.120..5.034 rows=24494 loops=2)
Sort Key: flat_members_1.subject_id
Sort Method: quicksort Memory: 53kB
-> WorkTable Scan on flat_members flat_members_1 (cost=0.00..9676.40 rows=483820 width=16) (actual time=0.001..0.916 rows=24494 loops=2)
Planning Time: 0.296 ms
Execution Time: 56.735 ms
Now, if I instead manually run the query select subject_group_id from subject_group_members where subject_id = 30459 and follow the tree by hand, it takes 4 queries of about 0.02 ms each.
Is there a way to get the recursive query close to the speed of doing the recursion manually?
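The plan shows why the CTE is slow: the recursion builds the entire membership closure (48k+ rows) and only then filters for one subject. Seeding the recursion at the subject of interest, the way the manual process does, turns each step into an index lookup (a sketch against the schema above; it walks upward from the member to its groups):

```sql
-- Seed with the groups that subject 5 belongs to directly, then
-- repeatedly look up the groups those groups belong to. Each step is
-- an index lookup on subject_id, mirroring the manual queries.
WITH RECURSIVE groups_of(subject_group_id) AS (
    SELECT subject_group_id
    FROM subject_group_members
    WHERE subject_id = 5          -- the subject we are asking about
    UNION
    SELECT gm.subject_group_id
    FROM subject_group_members gm
    JOIN groups_of g ON gm.subject_id = g.subject_group_id
)
SELECT * FROM groups_of;
```

On the sample data this yields groups 2 and 3 (direct) plus 1 (inherited), and the existing idx_subject_group_members_subject_id index supports every recursive step, so the work is proportional to the size of the answer rather than the size of the table.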