SQL Server - Como as páginas de dados são armazenadas ao usar um índice clusterizado

Question

Joe Obbish

Asked: 2017-04-22 07:33:41 +0800 CST2017-04-22 07:33:41 +0800 CST 2017-04-22 07:33:41 +0800 CST

Por que uma subconsulta reduz a estimativa de linha para 1?

772

Considere a seguinte consulta artificial, mas simples:

SELECT 
  ID
, CASE
    WHEN ID <> 0 
    THEN (SELECT TOP 1 ID FROM X_OTHER_TABLE) 
    ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE_2) 
  END AS ID2
FROM X_HEAP;

Eu esperaria que a estimativa de linha final para essa consulta fosse igual ao número de linhas na X_HEAPtabela. O que quer que eu esteja fazendo na subconsulta não deve importar para a estimativa de linha porque ela não pode filtrar nenhuma linha. No entanto, no SQL Server 2016, vejo a estimativa de linha reduzida para 1 por causa da subconsulta:

Por que isso acontece? O que posso fazer sobre isso?

É muito fácil reproduzir esse problema com a sintaxe correta. Aqui está um conjunto de definições de tabela que farão isso:

CREATE TABLE dbo.X_HEAP (ID INT NOT NULL)
CREATE TABLE dbo.X_OTHER_TABLE (ID INT NOT NULL);
CREATE TABLE dbo.X_OTHER_TABLE_2 (ID INT NOT NULL);

INSERT INTO dbo.X_HEAP WITH (TABLOCK)
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values;

CREATE STATISTICS X_HEAP__ID ON X_HEAP (ID) WITH FULLSCAN;

link do violino db .

2 respostas

Voted

Joe Obbish · Answer 1 · 2017-04-22T07:33:41+08:00

Isso definitivamente parece um comportamento não intencional. É verdade que as estimativas de cardinalidade não precisam ser consistentes em cada etapa de um plano, mas esse é um plano de consulta relativamente simples e a estimativa de cardinalidade final é inconsistente com o que a consulta está fazendo. Uma estimativa de cardinalidade tão baixa pode resultar em escolhas ruins para tipos de junção e métodos de acesso para outras tabelas downstream em um plano mais complicado.

Por tentativa e erro, podemos criar algumas consultas semelhantes para as quais o problema não aparece:

SELECT 
  ID
, CASE
    WHEN ID <> 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    ELSE (SELECT -1) 
  END AS ID2
FROM dbo.X_HEAP;

SELECT 
  ID
, CASE
    WHEN ID < 500 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    WHEN ID >= 500 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
  END AS ID2
FROM dbo.X_HEAP;

Também podemos criar mais consultas para as quais o problema aparece:

SELECT 
  ID
, CASE
    WHEN ID < 500 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    WHEN ID >= 500 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
    ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE) 
  END AS ID2
FROM dbo.X_HEAP;

SELECT 
  ID
, CASE
    WHEN ID = 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    ELSE (SELECT -1) 
  END AS ID2
FROM dbo.X_HEAP;

SELECT 
  ID
, CASE
    WHEN ID = 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
  END AS ID2
FROM dbo.X_HEAP;

Parece haver um padrão: se houver uma expressão dentro do CASEque não se espera que seja executada e a expressão de resultado for uma subconsulta em uma tabela, a estimativa de linha cairá para 1 após essa expressão.

Se eu escrever a consulta em uma tabela com um índice clusterizado, as regras mudam um pouco. Podemos usar os mesmos dados:

CREATE TABLE dbo.X_CI (ID INT NOT NULL, PRIMARY KEY (ID))

INSERT INTO dbo.X_CI WITH (TABLOCK)
SELECT * FROM dbo.X_HEAP;

UPDATE STATISTICS X_CI WITH FULLSCAN;

Esta consulta tem uma estimativa final de 1.000 linhas:

SELECT 
  ID
, CASE
    WHEN ID = 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
    ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
  END
FROM dbo.X_CI;

Mas esta consulta tem uma estimativa final de 1 linha:

SELECT 
  ID
, CASE
    WHEN ID <> 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
  END
FROM dbo.X_CI;

Para aprofundar isso, podemos usar o sinalizador de rastreamento não documentado 2363 para obter informações sobre como o otimizador de consulta executou cálculos de seletividade. Achei útil emparelhar esse sinalizador de rastreamento com o sinalizador de rastreamento não documentado 8606 . O TF 2363 parece fornecer cálculos de seletividade tanto para a árvore simplificada quanto para a árvore após a normalização do projeto. Ter ambos os sinalizadores de rastreamento habilitados deixa claro quais cálculos se aplicam a qual árvore.

Vamos tentar para a consulta original postada na pergunta:

SELECT 
  ID
, CASE
    WHEN ID <> 0 
    THEN (SELECT TOP 1 ID FROM X_OTHER_TABLE) 
    ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE_2) 
  END AS ID2
FROM X_HEAP
OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, QUERYTRACEON 8606);

Aqui está parte da saída que eu acho relevante, juntamente com alguns comentários:

Plan for computation:

  CSelCalcColumnInInterval -- this is the type of calculator used

      Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID -- this is the column used for the calculation

Pass-through selectivity: 0 -- all rows are expected to have a true value for the case expression

Stats collection generated: 

  CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter) -- the row estimate after the join will still be 1000

      CStCollBaseTable(ID=1, CARD=1000 TBL: X_HEAP)

      CStCollBaseTable(ID=2, CARD=1 TBL: X_OTHER_TABLE)

...

Plan for computation:

  CSelCalcColumnInInterval

      Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID

Pass-through selectivity: 1 -- no rows are expected to have a true value for the case expression

Stats collection generated: 

  CStCollOuterJoin(ID=9, CARD=1 x_jtLeftOuter) -- the row estimate after the join will still be 1

      CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter) -- here is the row estimate after the previous join

          CStCollBaseTable(ID=1, CARD=1000 TBL: X_HEAP)

          CStCollBaseTable(ID=2, CARD=1 TBL: X_OTHER_TABLE)

      CStCollBaseTable(ID=3, CARD=1 TBL: X_OTHER_TABLE_2)

Agora vamos tentar para uma consulta semelhante que não tenha o problema. Vou usar este:

SELECT 
  ID
, CASE
    WHEN ID <> 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    ELSE (SELECT -1) 
  END AS ID2
FROM dbo.X_HEAP
OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, QUERYTRACEON 8606);

Saída de depuração no final:

Plan for computation:

  CSelCalcColumnInInterval

      Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID

Pass-through selectivity: 1

Stats collection generated: 

  CStCollOuterJoin(ID=9, CARD=1000 x_jtLeftOuter)

      CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter)

          CStCollBaseTable(ID=1, CARD=1000 TBL: dbo.X_HEAP)

          CStCollBaseTable(ID=2, CARD=1 TBL: dbo.X_OTHER_TABLE)

      CStCollConstTable(ID=4, CARD=1) -- this is different than before because we select a constant instead of from a table

Vamos tentar outra consulta para a qual a estimativa de linha incorreta está presente:

SELECT 
  ID
, CASE
    WHEN ID < 500 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    WHEN ID >= 500 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
    ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE) 
  END AS ID2
FROM dbo.X_HEAP
OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, QUERYTRACEON 8606);

No final, a estimativa de cardinalidade cai para 1 linha, novamente após Seletividade de passagem = 1. A estimativa de cardinalidade é preservada após uma seletividade de 0,501 e 0,499.

Plan for computation:

 CSelCalcColumnInInterval

      Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID

Pass-through selectivity: 0.501

...

Plan for computation:

  CSelCalcColumnInInterval

      Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID

Pass-through selectivity: 0.499

...

Plan for computation:

  CSelCalcColumnInInterval

      Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID

Pass-through selectivity: 1

Stats collection generated: 

  CStCollOuterJoin(ID=12, CARD=1 x_jtLeftOuter) -- this is associated with the ELSE expression

      CStCollOuterJoin(ID=11, CARD=1000 x_jtLeftOuter)

          CStCollOuterJoin(ID=10, CARD=1000 x_jtLeftOuter)

              CStCollBaseTable(ID=1, CARD=1000 TBL: dbo.X_HEAP)

              CStCollBaseTable(ID=2, CARD=1 TBL: dbo.X_OTHER_TABLE)

          CStCollBaseTable(ID=3, CARD=1 TBL: dbo.X_OTHER_TABLE_2)

      CStCollBaseTable(ID=4, CARD=1 TBL: X_OTHER_TABLE)

Vamos novamente mudar para outra consulta semelhante que não apresenta o problema. Vou usar este:

SELECT 
  ID
, CASE
    WHEN ID < 500 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    WHEN ID >= 500 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
  END AS ID2
FROM dbo.X_HEAP
OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, QUERYTRACEON 8606);

Na saída de depuração nunca há uma etapa que tenha uma seletividade de passagem de 1. A estimativa de cardinalidade permanece em 1.000 linhas.

Plan for computation:

  CSelCalcColumnInInterval

      Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID

Pass-through selectivity: 0.499

Stats collection generated: 

  CStCollOuterJoin(ID=9, CARD=1000 x_jtLeftOuter)

      CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter)

          CStCollBaseTable(ID=1, CARD=1000 TBL: dbo.X_HEAP)

          CStCollBaseTable(ID=2, CARD=1 TBL: dbo.X_OTHER_TABLE)

      CStCollBaseTable(ID=3, CARD=1 TBL: dbo.X_OTHER_TABLE_2)

End selectivity computation

E a consulta quando envolve uma tabela com um índice clusterizado? Considere a seguinte consulta com o problema de estimativa de linha:

SELECT 
  ID
, CASE
    WHEN ID <> 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
  END
FROM dbo.X_CI
OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, QUERYTRACEON 8606);

O final da saída de depuração é semelhante ao que já vimos:

Plan for computation:

  CSelCalcColumnInInterval

      Column: QCOL: [SE_DB].[dbo].[X_CI].ID

Pass-through selectivity: 1

Stats collection generated: 

  CStCollOuterJoin(ID=9, CARD=1 x_jtLeftOuter)

      CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter)

          CStCollBaseTable(ID=1, CARD=1000 TBL: dbo.X_CI)

          CStCollBaseTable(ID=2, CARD=1 TBL: dbo.X_OTHER_TABLE)

      CStCollBaseTable(ID=3, CARD=1 TBL: dbo.X_OTHER_TABLE_2)

No entanto, a consulta no IC sem o problema tem uma saída diferente. Usando esta consulta:

SELECT 
  ID
, CASE
    WHEN ID = 0 
    THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
    ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
  END
FROM dbo.X_CI
OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, QUERYTRACEON 8606);

Resultados em diferentes calculadoras sendo usadas. CSelCalcColumnInIntervalnão aparece mais:

Plan for computation:

  CSelCalcFixedFilter (0.559)

Pass-through selectivity: 0.559

Stats collection generated: 

  CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter)

      CStCollBaseTable(ID=1, CARD=1000 TBL: dbo.X_CI)

      CStCollBaseTable(ID=2, CARD=1 TBL: dbo.X_OTHER_TABLE_2)

...

Plan for computation:

  CSelCalcUniqueKeyFilter

Pass-through selectivity: 0.001

Stats collection generated: 

  CStCollOuterJoin(ID=9, CARD=1000 x_jtLeftOuter)

      CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter)

          CStCollBaseTable(ID=1, CARD=1000 TBL: dbo.X_CI)

          CStCollBaseTable(ID=2, CARD=1 TBL: dbo.X_OTHER_TABLE_2)

      CStCollBaseTable(ID=3, CARD=1 TBL: dbo.X_OTHER_TABLE)

Em conclusão, parecemos obter uma estimativa de linha ruim após a subconsulta nas seguintes condições:

A CSelCalcColumnInIntervalcalculadora de seletividade é usada. Não sei exatamente quando isso é usado, mas parece aparecer com muito mais frequência quando a tabela base é um heap.
Seletividade de passagem = 1. Em outras palavras, CASEespera-se que uma das expressões seja avaliada como falsa para todas as linhas. Não importa se a primeira CASEexpressão for avaliada como verdadeira para todas as linhas.
Existe uma junção externa para CStCollBaseTable. Em outras palavras, a CASEexpressão de resultado é uma subconsulta em uma tabela. Um valor constante não funcionará.

Perhaps under those conditions the query optimizer is unintentionally applying the pass-through selectivity to the row estimate of the outer table instead of to the work done on the inner part of the nested loop. That would reduce the row estimate to 1.

I was able to find two workarounds. I was not able to reproduce the issue when using APPLY instead of a subquery. The output of trace flag 2363 was very different with APPLY. Here's one way to rewrite the original query in the question:

SELECT 
  h.ID
, a.ID2
FROM X_HEAP h
OUTER APPLY
(
SELECT CASE
    WHEN ID <> 0 
    THEN (SELECT TOP 1 ID FROM X_OTHER_TABLE) 
    ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE_2) 
  END
) a(ID2);

The legacy CE appears to avoid the issue as well.

SELECT 
  ID
, CASE
    WHEN ID <> 0 
    THEN (SELECT TOP 1 ID FROM X_OTHER_TABLE) 
    ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE_2) 
  END AS ID2
FROM X_HEAP
OPTION (USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

A connect item was submitted for this issue (with some of the details that Paul White provided in his answer).

Paul White · Answer 2 · 2017-04-24T08:55:09+08:00

This cardinality estimation (CE) issue surfaces when:

The join is an outer join with a pass-through predicate
The selectivity of the pass-through predicate is estimated to be exactly 1.

Note: The particular calculator used to determine the selectivity is not important.

Details

The CE computes the selectivity of the outer join as the sum of:

The inner join selectivity with the same predicate
The anti join selectivity with the same predicate

The only difference between an outer and inner join is that an outer join also returns rows that do not match on the join predicate. The anti join provides exactly this difference. Cardinality estimation for inner and anti join is easier than for outer join directly.

The join selectivity estimation process is very straightforward:

First, the selectivity S_PT of the pass-through predicate is assessed.
- This is done using whichever calculator is appropriate to the circumstances.
- The predicate is the whole thing, including any negating IsFalseOrNull component.
Inner join selectivity := 1 - S_PT
Anti join selectivity := S_PT

The anti join represents rows that will 'pass through' the join. The inner join represents rows that will not 'pass through'. Note that 'pass through' means rows that flow through the join without running the inner side at all. To emphasise: all rows will be returned by the join, the distinction is between rows that run the inner side of the join before emerging, and those that do not.

Clearly, adding 1 - S_PT to S_PT should always give a total selectivity of 1, meaning all rows are returned by the join, as expected.

Indeed, the above calculation works exactly as described for all values of S_PT except 1.

When S_PT = 1, both inner join and anti join selectivities are estimated to be zero, resulting in a cardinality estimate (for the join as a whole) of one row. As far as I can tell, this is unintentional, and should be reported as a bug.

A related issue

This bug is more likely to manifest than one might think, due to a separate CE limitation. This arises when the CASE expression uses an EXISTS clause (as is common). For example the following modified query from the question does not encounter the unexpected cardinality estimate:

-- This is fine
SELECT 
    CASE
        WHEN XH.ID = 1
        THEN (SELECT TOP (1) XOT.ID FROM dbo.X_OTHER_TABLE AS XOT) 
    END
FROM dbo.X_HEAP AS XH;

Introducing a trivial EXISTS does cause the issue to surface:

-- This is not fine
SELECT 
    CASE
        WHEN EXISTS (SELECT 1 WHERE XH.ID = 1)
        THEN (SELECT TOP (1) XOT.ID FROM dbo.X_OTHER_TABLE AS XOT) 
    END
FROM dbo.X_HEAP AS XH;

Using EXISTS introduces a semi join (highlighted) to the execution plan:

The estimate for the semi join is fine. The problem is that the CE treats the associated probe column as a simple projection, with a fixed selectivity of 1:

Semijoin with probe column treated as a Project.

Selectivity of probe column = 1

Isso atende automaticamente a uma das condições exigidas para que esta emissão de CE se manifeste, independentemente do conteúdo da EXISTScláusula.

Para obter informações básicas importantes, consulte Subconsultas em CASEexpressões de Craig Freedman.

Por que uma subconsulta reduz a estimativa de linha para 1?

Details

A related issue

conectar ao servidor PostgreSQL: FATAL: nenhuma entrada pg_hba.conf para o host

Como fazer a saída do sqlplus aparecer em uma linha?

Selecione qual tem data máxima ou data mais recente

Como faço para listar todos os esquemas no PostgreSQL?

Listar todas as colunas de uma tabela especificada

Como usar o sqlplus para se conectar a um banco de dados Oracle localizado em outro host sem modificar meu próprio tnsnames.ora

Como você mysqldump tabela (s) específica (s)?

Listar os privilégios do banco de dados usando o psql

Como inserir valores em uma tabela de uma consulta de seleção no PostgreSQL?

Como faço para listar todos os bancos de dados e tabelas usando o psql?

Por que uma subconsulta reduz a estimativa de linha para 1?

2 respostas

Details

A related issue

relate perguntas