SQL Server - How data pages are stored when using a clustered index

Question

user1716729

Asked: 2017-09-19 01:48:40 +0800 CST

Statistics and row estimation


I hope someone can help me learn how SQL Server uses statistics to estimate the number of rows.

Test scripts

USE [tempdb]

GO

CREATE TABLE t1
(
a INT NOT NULL, 
b INT NOT NULL, 
c INT CHECK (c between 1 and 50), 
CONSTRAINT pk_a primary key(a)
);

GO

INSERT INTO t1(a,b,c) 
SELECT number, number%1000+1, number%50+1 
FROM master..spt_values 
WHERE type = 'P' AND number BETWEEN 1 and 1000;

GO

CREATE STATISTICS s_b ON t1(b);
CREATE STATISTICS s_c ON t1(c);

GO

Sample query

DECLARE @c INT=300 
SELECT * FROM t1 WHERE b>@c 
SELECT * FROM t1 WHERE b>300

Query 1

Estimated number of rows: 300

Query 2

Estimated number of rows: 700

Questions

Why is there such a big difference between the row estimates for the 1st and 2nd queries?
For the 2nd query, how does SQL Server arrive at an estimate of 700 from the statistics?
Are there any good articles for learning about this?

2 Answers


Joe Obbish · Answer 1 · 2017-09-19T16:01:43+08:00

The queries produce different estimates because the first one does not benefit from the Parameter Embedding Optimization. From the optimizer's point of view, the query might as well be:

SELECT * FROM t1 WHERE b>???;

To get an idea of how cardinality estimates might work for simple queries, sometimes I think about what I would do if I had to program the query optimizer to produce an estimate. Often those thoughts don't exactly match what SQL Server does but I find it helpful to keep in mind that the optimizer is almost always working with imperfect or limited information. Let's go back to your first query:

DECLARE @c INT=300; 
SELECT * FROM t1 WHERE b>@c;

Without a RECOMPILE hint the optimizer doesn't know the value of the @c local variable. There isn't enough information here to do a proper estimate, but the optimizer needs to create an estimate for all queries. So what can a programmer of the optimizer do? I suppose that a sophisticated algorithm that makes a guess based on the distribution of values in the column would be possible to implement, but Microsoft uses a hardcoded guess of 30% for an unknown inequality. You can observe this by changing the value of the local variable:

DECLARE @c INT=999999999; 
SELECT * FROM t1 WHERE b>@c;

The estimate is still 300 rows. Note that the estimates do not need to be consistent between queries. The following query also has an estimate of 300 rows as opposed to 1000 - 300 = 700:

DECLARE @c INT=300; 
SELECT * FROM t1 WHERE NOT (b>@c);

Conceptually, it can be rewritten as

DECLARE @c INT=300; 
SELECT * FROM t1 WHERE b<=@c;

Which will have the same hardcoded 30% estimate.
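To make the fixed guess concrete, here is a minimal Python sketch of the behavior described above. It is only a model for illustration, not SQL Server's actual code; the function and constant names are made up. The point is that the variable's value is never an input, so every form of the query gets the same estimate.

```python
# Hypothetical model of the fixed guess for an inequality against an unknown value.
# All names here are invented for illustration; this is not SQL Server's code.
UNKNOWN_INEQUALITY_PERCENT = 30  # the hardcoded 30% guess described above


def estimate_unknown_inequality(table_rows: int) -> float:
    """Estimate rows for an inequality such as b > @c when @c is unknown."""
    # Note that neither the variable's value nor the operator direction is used.
    return table_rows * UNKNOWN_INEQUALITY_PERCENT / 100


TABLE_ROWS = 1000  # t1 holds 1000 rows

estimates = {
    "b > @c with @c = 300": estimate_unknown_inequality(TABLE_ROWS),
    "b > @c with @c = 999999999": estimate_unknown_inequality(TABLE_ROWS),
    "NOT (b > @c), i.e. b <= @c": estimate_unknown_inequality(TABLE_ROWS),
}

for predicate, rows in estimates.items():
    print(f"{predicate}: {rows:.0f} rows")
```

All three predicates come out at 300 rows, which is why the estimates for b > @c and NOT (b > @c) do not add up to the table's 1000 rows.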

Your second query does not use a local variable, so the optimizer can directly use the statistics object that you created on the column to aid with estimation. Histograms can only have up to 201 steps and your table has 1000 distinct values, so there are some filter values for which the optimizer does not have complete information. In those cases it needs to make more of a guess. Here's how you can view the histogram for the b column:

DBCC SHOW_STATISTICS ('t1', 's_b') WITH HISTOGRAM;

On my machine I got lucky in that 300 is the high value for one of the steps of the histogram:

You can find an explanation of the different columns in the documentation. Going forward I'm going to assume that you know what they all mean. The query requests rows with b > 300, so the histogram step with a RANGE_HI_KEY of 300 is irrelevant, as are all of the steps with a lower RANGE_HI_KEY. For this query, I would program the optimizer to simply sum all of the RANGE_ROWS and EQ_ROWS values for the remaining 133 steps of the histogram. Those columns add up to 700 and that is SQL Server's estimate.
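The summation can be sketched in a few lines of Python. The histogram below is a toy one that I made up (one step every 5 values, 200 steps in total), so its step boundaries and step count will not match what DBCC SHOW_STATISTICS prints on a real instance. Only the arithmetic is the point: when the filter value is exactly a RANGE_HI_KEY, summing RANGE_ROWS and EQ_ROWS over the higher steps gives the estimate.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    range_hi_key: int  # upper boundary value of the step
    range_rows: int    # rows strictly between the previous boundary and this one
    eq_rows: int       # rows equal to range_hi_key


# Column b holds each value from 1 to 1000 exactly once.
# Toy histogram (an assumption, not real SHOW_STATISTICS output):
# boundaries at 5, 10, ..., 1000, each step covering 4 range rows + 1 equal row.
toy_histogram: List[Step] = [Step(hi, 4, 1) for hi in range(5, 1001, 5)]


def estimate_greater_than(histogram: List[Step], boundary: int) -> int:
    """Estimate rows for b > boundary when boundary is itself a RANGE_HI_KEY."""
    return sum(
        step.range_rows + step.eq_rows
        for step in histogram
        if step.range_hi_key > boundary
    )


print(estimate_greater_than(toy_histogram, 300))  # 700
```

The sketch only covers the lucky case where the filter value lands on a step boundary. A value that falls inside a step needs some additional guess, which this code does not attempt.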

Other filter values may not give exact results. For example, the following two queries both have cardinality estimates of 704 rows:

SELECT * FROM t1 WHERE b>293; -- returns 707 rows
SELECT * FROM t1 WHERE b>299; -- returns 701 rows

Both estimates are very close but not exactly right. The histogram does not contain enough information for those values to provide an exact estimate.
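If you want to check the actual row counts without running the queries, the test data is easy to rebuild outside the database. This Python sketch regenerates column b using the same expression as the INSERT in the question (number % 1000 + 1 for number from 1 to 1000) and counts matching rows. It says nothing about how SQL Server arrives at 704; it only confirms the true counts the estimates are compared against.

```python
# Rebuild column b the way the test script does:
# b = number % 1000 + 1 for number between 1 and 1000.
b_values = [number % 1000 + 1 for number in range(1, 1001)]


def actual_rows_greater_than(values, threshold):
    """Count rows that a filter b > threshold would really return."""
    return sum(1 for value in values if value > threshold)


for threshold in (293, 299, 300):
    rows = actual_rows_greater_than(b_values, threshold)
    print(f"b > {threshold}: {rows} rows")
```

This prints 707, 701 and 700 rows respectively, so the estimate of 704 is off by 3 in each of the first two cases, while the estimate for b > 300 is exact.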

user126897 (2 revs) · Answer 2 · 2017-09-19T19:31:03+08:00

Community wiki answer:

For the first case, see the Q & A: How SQL server generates a Query plan with Auto Create Statistics set to OFF. You use a variable so it's considered unknown input and is estimated as 30% of table cardinality (30% of 1000 rows).

In the second case it uses statistics histograms. You can learn about them in SQL Server: Part 2 : All About SQL Server Statistics :Histogram by Nelson John.

You'll also get the 700-row estimate for the query that uses the local variable if you force a new plan to be compiled based on the current value of @c. For example:

DECLARE @c INT = 300;
SELECT * FROM t1 WHERE b > @c OPTION (RECOMPILE);
