Espero que alguém possa me ajudar a aprender como o SQL Server usa estatísticas para estimar o número de registros.
Scripts de teste
USE [tempdb]
GO
CREATE TABLE t1
(
a INT NOT NULL,
b INT NOT NULL,
c INT CHECK (c between 1 and 50),
CONSTRAINT pk_a primary key(a)
);
GO
INSERT INTO t1(a,b,c)
SELECT number, number%1000+1, number%50+1
FROM master..spt_values
WHERE type = 'P' AND number BETWEEN 1 and 1000;
GO
CREATE STATISTICS s_b ON t1(b);
CREATE STATISTICS s_c ON t1(c);
GO
Consulta de amostra
DECLARE @c INT=300
SELECT * FROM t1 WHERE b>@c
SELECT * FROM t1 WHERE b>300
Consulta 1
Número estimado de linhas 300
Consulta 2
Número estimado de linhas 700
Perguntas
- Por que uma diferença tão grande entre a 1ª e a 2ª estimativa de linha de consulta.
- 2ª Consulta como SQL estimando para 700 a partir das Estatísticas
- Qualquer bom artigo para aprender estes.
As consultas fornecem estimativas diferentes porque a primeira não se beneficia da Otimização de Incorporação de Parâmetros . Do ponto de vista do otimizador, a consulta também pode ser:
To get an idea of how cardinality estimates might work for simple queries, sometimes I think about what I would do if I had to program the query optimizer to produce an estimate. Often those thoughts don't exactly match what SQL Server does but I find it helpful to keep in mind that the optimizer is almost always working with imperfect or limited information. Let's go back to your first query:
Without a
RECOMPILE
hint the optimizer doesn't know the value of the@c
local variable. There isn't enough information here to do a proper estimate, but the optimizer needs to create an estimate for all queries. So what can a programmer of the optimizer do? I suppose that a sophisticated algorithm that makes a guess based on the distribution of values in the column would be possible to implement, but Microsoft uses a hardcoded guess of 30% for an unknown inequality. You can observe this by changing the value of the local variable:The estimate is still 300 rows. Note that the estimates do not need to be consistent between queries. The following query also has an estimate of 300 rows as opposed to 1000 - 300 = 700:
Conceptually, it can be rewritten as
Which will have the same hardcoded 30% estimate.
Your second query does not use a local variable, so the optimizer can directly use the statistics object that you created on the column to aid with estimation. Histograms can only have up to 201 steps and your table has 1000 distinct values so there are some filter values in which the optimizer does not have complete information. In those cases it needs to make more of a guess. Here's how you can view the histogram for the
b
column:On my machine I got lucky in that 300 is the high value for one of the steps of the histogram:
You can find an explanation of the different columns in the documentation. Going forward I'm going to assume that you know what they all mean. The query requests rows with
b > 300
, so the histogram with aRANGE_HI_KEY
of 300 is irrelevant, as are all of the steps with a lowerRANGE_HI_KEY
. For this query, I would program the optimizer to simply sum all of theRANGE_ROWS
andEQ_ROWS
values for the remaining 133 steps of the histogram. Those columns add up to 700 and that is SQL Server's estimate.Other filter values may not give exact results. For example, the following two queries both have cardinality estimates of 704 rows:
Both estimates are very close but not exactly right. The histogram does not contain enough information for those values to provide an exact estimate.
Community wiki answer:
For the first case, see the Q & A: How SQL server generates a Query plan with Auto Create Statistics set to OFF. You use a variable so it's considered unknown input and is estimated as 30% of table cardinality (30% of 1000 rows).
In the second case it uses statistics histograms. You can learn about them in SQL Server: Part 2 : All About SQL Server Statistics :Histogram by Nelson John.
You'll also get the 700 row estimate for your parameterized query if you force a new plan to be compiled based on the current value of
@c1
. For example: