Question

Matthew

Asked: 2016-12-14 16:44:33 +0800 CST

Is a MERGE with OUTPUT better practice than a conditional INSERT and SELECT?

We often encounter the "If not exists, insert" situation. Dan Guzman's blog has an excellent investigation of how to make this process thread-safe.

I have a basic table that simply catalogs a string to an integer from a SEQUENCE. In a stored procedure I need to either get the integer key for the value if it exists, or INSERT it and then get the resulting value. There is a uniqueness constraint on the dbo.NameLookup.ItemName column, so data integrity is not at risk, but I don't want to encounter the exceptions.

It's not an IDENTITY, so I can't use SCOPE_IDENTITY, and the value could be NULL in certain cases.

In my situation I only have to deal with INSERT safety on the table, so I'm trying to decide whether it is better practice to use MERGE like this:

SET NOCOUNT, XACT_ABORT ON;

DECLARE @vValueId INT 
DECLARE @inserted AS TABLE (Id INT NOT NULL)

MERGE 
    dbo.NameLookup WITH (HOLDLOCK) AS f 
USING 
    (SELECT @vName AS val WHERE @vName IS NOT NULL AND LEN(@vName) > 0) AS new_item
        ON f.ItemName= new_item.val
WHEN MATCHED THEN
    UPDATE SET @vValueId = f.Id
WHEN NOT MATCHED BY TARGET THEN
    INSERT
      (ItemName)
    VALUES
      (@vName)
OUTPUT inserted.Id AS Id INTO @inserted;
SELECT @vValueId = s.Id FROM @inserted AS s

I could do this without using MERGE, with just a conditional INSERT followed by a SELECT. I think this second approach is clearer to the reader, but I'm not convinced it's "better" practice:

SET NOCOUNT, XACT_ABORT ON;

INSERT INTO 
    dbo.NameLookup (ItemName)
SELECT
    @vName
WHERE
    NOT EXISTS (SELECT * FROM dbo.NameLookup AS t WHERE @vName IS NOT NULL AND LEN(@vName) > 0 AND t.ItemName = @vName)

DECLARE @vValueId int;
SELECT @vValueId = i.Id FROM dbo.NameLookup AS i WHERE i.ItemName = @vName

Or perhaps there's another, better way that I haven't considered.

I did search and reference other questions. This one: https://stackoverflow.com/questions/5288283/sql-server-insert-if-not-exists-best-practice is the most appropriate I could find, but it doesn't seem very applicable to my use case. Other questions point to the IF NOT EXISTS() THEN approach, which I don't think is acceptable.

2 Answers

Solomon Rutzky · Answer 1 · 2016-12-16T22:32:09+08:00

Since you are using a Sequence, you can use the same NEXT VALUE FOR function -- that you already have in a Default Constraint on the Id Primary Key field -- to generate a new Id value ahead of time. Generating the value first means that you don't need to worry about not having SCOPE_IDENTITY, which in turn means that you need neither the OUTPUT clause nor an additional SELECT to get the new value; you will have the value before you do the INSERT, and you don't even need to mess with SET IDENTITY_INSERT ON / OFF :-)

So that takes care of one part of the overall situation. The other part is handling the concurrency issue of two processes, at the exact same time, not finding an existing row for the exact same string, and both proceeding with the INSERT. The concern is avoiding the Unique Constraint violation that would occur.

One way to handle these types of concurrency issues is to force this particular operation to be single threaded. The way to do that is by using application locks (which work across sessions). While effective, they can be a bit heavy-handed for a situation like this, where the frequency of collisions is probably fairly low.
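As a rough illustration of the application-lock option, the sketch below wraps the lookup and insert in a transaction guarded by sp_getapplock. It assumes the dbo.NameLookup table and dbo.MagicNumber sequence from the test setup in this answer; the resource name N'NameLookup_GetOrInsert' and the 5-second timeout are made up for the example and are not part of the original answer.

```sql
-- Sketch only: serialize the "get or insert" with an application lock.
-- The resource name and timeout below are illustrative assumptions.
DECLARE @SomeName   NVARCHAR(50) = N'test1',
        @ID         INT,
        @LockResult INT;

BEGIN TRAN;

-- With @LockOwner = 'Transaction', the lock is released at COMMIT / ROLLBACK.
EXEC @LockResult = sys.sp_getapplock
       @Resource    = N'NameLookup_GetOrInsert',
       @LockMode    = 'Exclusive',
       @LockOwner   = 'Transaction',
       @LockTimeout = 5000; -- milliseconds

IF (@LockResult < 0) -- negative return values mean the lock was not granted
BEGIN
  ROLLBACK TRAN;
  RAISERROR(N'Could not acquire the application lock.', 16, 1);
  RETURN;
END;

-- Only one session at a time gets past this point.
SELECT @ID = nl.[Id]
FROM   dbo.NameLookup nl
WHERE  nl.[ItemName] = @SomeName;

IF (@ID IS NULL)
BEGIN
  SET @ID = NEXT VALUE FOR dbo.MagicNumber;

  INSERT INTO dbo.NameLookup ([Id], [ItemName])
  VALUES (@ID, @SomeName);
END;

COMMIT TRAN;

SELECT @ID AS [ItemID];
```

Note that every caller queues on the same lock here, including callers that only need to read an existing Id, which is why this option can be heavier than the situation warrants.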

The other way to deal with the collisions is to accept that they will sometimes occur and handle them rather than try to avoid them. Using the TRY...CATCH construct, you can effectively trap a specific error (in this case: "unique constraint violation", Msg 2601) and re-execute the SELECT to get the Id value, since we know that it now exists due to being in the CATCH block with that particular error. Other errors can be handled in the typical RAISERROR / RETURN or THROW manner.

Test Setup: Sequence, Table, and Unique Index

USE [tempdb];

CREATE SEQUENCE dbo.MagicNumber
  AS INT
  START WITH 1
  INCREMENT BY 1;

CREATE TABLE dbo.NameLookup
(
  [Id] INT NOT NULL
         CONSTRAINT [PK_NameLookup] PRIMARY KEY CLUSTERED
        CONSTRAINT [DF_NameLookup_Id] DEFAULT (NEXT VALUE FOR dbo.MagicNumber),
  [ItemName] NVARCHAR(50) NOT NULL         
);

CREATE UNIQUE NONCLUSTERED INDEX [UIX_NameLookup_ItemName]
  ON dbo.NameLookup ([ItemName]);
GO

Test Setup: Stored Procedure

CREATE PROCEDURE dbo.GetOrInsertName
(
  @SomeName NVARCHAR(50),
  @ID INT OUTPUT,
  @TestRaceCondition BIT = 0
)
AS
SET NOCOUNT ON;

BEGIN TRY
  SELECT @ID = nl.[Id]
  FROM   dbo.NameLookup nl
  WHERE  nl.[ItemName] = @SomeName
  AND    @TestRaceCondition = 0;

  IF (@ID IS NULL)
  BEGIN
    SET @ID = NEXT VALUE FOR dbo.MagicNumber;

    INSERT INTO dbo.NameLookup ([Id], [ItemName])
    VALUES (@ID, @SomeName);
  END;
END TRY
BEGIN CATCH
  IF (ERROR_NUMBER() = 2601) -- "Cannot insert duplicate key row in object"
  BEGIN
    SELECT @ID = nl.[Id]
    FROM   dbo.NameLookup nl
    WHERE  nl.[ItemName] = @SomeName;
  END;
  ELSE
  BEGIN
    ;THROW; -- SQL Server 2012 or newer
    /*
    DECLARE @ErrorNumber INT = ERROR_NUMBER(),
            @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE();

    RAISERROR(N'Msg %d: %s', 16, 1, @ErrorNumber, @ErrorMessage);
    RETURN;
    */
  END;

END CATCH;
GO

The Test

DECLARE @ItemID INT;
EXEC dbo.GetOrInsertName
  @SomeName = N'test1',
  @ID = @ItemID OUTPUT;
SELECT @ItemID AS [ItemID];
GO

DECLARE @ItemID INT;
EXEC dbo.GetOrInsertName
  @SomeName = N'test1',
  @ID = @ItemID OUTPUT,
  @TestRaceCondition = 1;
SELECT @ItemID AS [ItemID];
GO

Question from O.P.

Why is this better than the MERGE? Won't I get the same functionality without the TRY by using the WHERE NOT EXISTS clause?

MERGE has various "issues" (several references are linked in @SqlZim's answer, so there is no need to duplicate that info here). And there is no additional locking in this approach (less contention), so it should be better on concurrency. In this approach the caller will never see a Unique Constraint violation, all without any HOLDLOCK, etc. It's pretty much guaranteed to work.

The reasoning behind this approach is:

If you have enough executions of this procedure such that you need to worry about collisions, then you don't want to:
1. take any more steps than are necessary
2. hold locks on any resources for longer than necessary

Since collisions can only happen upon new entries (new entries submitted at the exact same time), the frequency of falling into the CATCH block in the first place will be pretty low. It makes more sense to optimize the code that will run 99% of the time instead of the code that will run 1% of the time (unless there is no cost to optimizing both, but that is not the case here).

Comment from @SqlZim's answer (emphasis added)

I personally prefer to try and tailor a solution to avoid doing that when possible. In this case, I don't feel that using the locks from serializable is a heavy handed approach, and I would be confident it would handle high concurrency well.

I would agree with this first sentence if it were amended to state "and when prudent". Just because something is technically possible doesn't mean that the situation (i.e. the intended use case) would benefit from it.

The issue I see with this approach is that it locks more than what is being suggested. It is important to re-read the quoted documentation on "serializable", specifically the following (emphasis added):

Other transactions cannot insert new rows with key values that would fall in the range of keys read by any statements in the current transaction until the current transaction completes.

Now, here is the comment in the example code:

SELECT [Id]
FROM   dbo.NameLookup WITH (SERIALIZABLE) /* hold that key range for @vName */

The operative word there is "range". The lock being taken is not just on the value in @vName, but more accurately on a range starting at the location where this new value should go (i.e. between the existing key values on either side of where the new value fits), but not the value itself. Meaning, other processes will be blocked from inserting new values, depending on the value(s) currently being looked up. If the lookup is being done at the top of the range, then inserting anything that could occupy that same position will be blocked. For example, if values "a", "b", and "d" exist, and one process is doing the SELECT on "f", then it will not be possible to insert values "g" or even "e" (since either of those would come immediately after "d"). But inserting a value of "c" will be possible, since it wouldn't be placed in the "reserved" range.

The following example should illustrate this behavior:

(In query tab (i.e. Session) #1)

INSERT INTO dbo.NameLookup ([ItemName]) VALUES (N'test5');

BEGIN TRAN;

SELECT [Id]
FROM   dbo.NameLookup WITH (SERIALIZABLE) /* hold that key range for @vName */
WHERE  ItemName = N'test8';

--ROLLBACK;

(In query tab (i.e. Session) #2)

EXEC dbo.NameLookup_getset_byName @vName = N'test4';
-- works just fine

EXEC dbo.NameLookup_getset_byName @vName = N'test9';
-- hangs until you either hit "cancel" in this query tab,
-- OR issue a COMMIT or ROLLBACK in query tab #1

EXEC dbo.NameLookup_getset_byName @vName = N'test7';
-- hangs until you either hit "cancel" in this query tab,
-- OR issue a COMMIT or ROLLBACK in query tab #1

EXEC dbo.NameLookup_getset_byName @vName = N's';
-- works just fine

EXEC dbo.NameLookup_getset_byName @vName = N'u';
-- hangs until you either hit "cancel" in this query tab,
-- OR issue a COMMIT or ROLLBACK in query tab #1

Likewise, if value "C" exists, and value "A" is being selected (and hence locked), then you can insert a value of "D", but not a value of "B":

(In query tab (i.e. Session) #1)

INSERT INTO dbo.NameLookup ([ItemName]) VALUES (N'testC');

BEGIN TRAN

SELECT [Id]
FROM   dbo.NameLookup WITH (SERIALIZABLE) /* hold that key range for @vName */
WHERE  ItemName = N'testA';

--ROLLBACK;

(In query tab (i.e. Session) #2)

EXEC dbo.NameLookup_getset_byName @vName = N'testD';
-- works just fine

EXEC dbo.NameLookup_getset_byName @vName = N'testB';
-- hangs until you either hit "cancel" in this query tab,
-- OR issue a COMMIT or ROLLBACK in query tab #1
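
To look at the range lock itself rather than infer it from blocked sessions, one option is to query sys.dm_tran_locks from query tab #1 while its transaction is still open. This is only a sketch: it assumes the session has permission to read that view (VIEW SERVER STATE), and the exact rows returned will depend on the data in the table.

```sql
-- Run in query tab #1 after the SELECT ... WITH (SERIALIZABLE), before ROLLBACK.
SELECT tl.resource_type,
       tl.resource_description,
       tl.request_mode,    -- key-range locks have modes starting with "Range" (e.g. RangeS-S)
       tl.request_status
FROM   sys.dm_tran_locks tl
WHERE  tl.request_session_id = @@SPID
AND    tl.resource_type = N'KEY';
```

A KEY row with a "Range" request mode is the lock on the key at the end of the reserved range, which is what the blocked calls in query tab #2 are waiting on.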

To be fair, in my suggested approach, when there is an exception, there will be 4 entries in the Transaction Log that won't be happening in this "serializable transaction" approach. BUT, as I said above, if the exception happens 1% (or even 5%) of the time, that is far less impacting than the far more likely case of the initial SELECT temporarily blocking INSERT operations.

Another, albeit minor, issue with this "serializable transaction + OUTPUT clause" approach is that the OUTPUT clause (in its present usage) sends the data back as a result set. A result set requires more overhead (probably on both sides: in SQL Server to manage the internal cursor, and in the app layer to manage the DataReader object) than a simple OUTPUT parameter. Given that we are only dealing with a single scalar value, and that the assumption is a high frequency of executions, that extra overhead of the result set probably adds up.

While the OUTPUT clause can be used in such a way as to return an OUTPUT parameter, that would require additional steps to create a temp table or table variable, and then to select the value out of that temp table / table variable into the OUTPUT parameter.
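
For illustration, those additional steps might look roughly like the following. The procedure name dbo.NameLookup_InsertViaOutputClause is hypothetical, the table is assumed to be the dbo.NameLookup from the test setup (with Id filled in by its default constraint), and no concurrency handling is included; the point is only to show the table variable and the extra SELECT.

```sql
-- Hypothetical procedure: returns the new Id through an OUTPUT parameter
-- while still using the OUTPUT clause to capture it.
CREATE PROCEDURE dbo.NameLookup_InsertViaOutputClause
(
  @SomeName NVARCHAR(50),
  @ID INT OUTPUT
)
AS
SET NOCOUNT ON;

DECLARE @NewIds TABLE ([Id] INT NOT NULL);   -- extra step 1: somewhere to capture the value

INSERT INTO dbo.NameLookup ([ItemName])
OUTPUT inserted.[Id] INTO @NewIds ([Id])     -- captured into the table variable, not sent to the client
VALUES (@SomeName);

SELECT @ID = ni.[Id]                         -- extra step 2: copy it into the OUTPUT parameter
FROM   @NewIds ni;
GO
```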

Further clarification: Response to @SqlZim's response (updated answer) to my response to @SqlZim's response (in the original answer) to my statement regarding concurrency and performance ;-)

Sorry if this part is a wee bit long, but at this point we are just down to the nuances of the two approaches.

I believe the way the information is presented could lead to false assumptions about the amount of locking one could expect to encounter when using serializable in the scenario as presented in the original question.

Yes, I will admit that I am biased, though to be fair:

1. It is impossible for a human to not be biased, at least to some small degree, and I do try to keep it at a minimum.
2. The example given was simplistic, but that was for illustrative purposes, to convey the behavior without over-complicating it. Implying excessive frequency was not intended, though I do understand that I also didn't explicitly state otherwise and it could be read as implying a larger problem than actually exists. I will try to clarify that below.
3. I did also include an example of locking a range between two existing keys (the second set of "Query tab 1" and "Query tab 2" blocks).
4. I did find (and volunteer) the "hidden cost" of my approach, that being the four extra Tran Log entries each time that the INSERT fails due to a Unique Constraint violation. I have not seen that mentioned in any of the other answers / posts.

Regarding @gbn's "JFDI" approach, Michael J. Swart's "Ugly Pragmatism For The Win" post, and Aaron Bertrand's comment on Michael's post (regarding his tests showing what scenarios have decreased performance), and your comment on your "adaptation of Michael J. Swart's adaptation of @gbn's Try Catch JFDI procedure" stating:

If you are inserting new values more often than selecting existing values, this may be more performant than @srutzky's version. Otherwise I would prefer @srutzky's version over this one.

With respect to the gbn / Michael / Aaron discussion related to the "JFDI" approach, it would be incorrect to equate my suggestion to gbn's "JFDI" approach. Due to the nature of the "Get or Insert" operation, there is an explicit need to do the SELECT to get the ID value for existing records. This SELECT acts as the IF EXISTS check, which makes this approach closer to the "CheckTryCatch" variation of Aaron's tests. Michael's re-written code (and your final adaptation of Michael's adaptation) also includes a WHERE NOT EXISTS to do that same check first. Hence, my suggestion (along with Michael's final code and your adaptation of his final code) won't actually hit the CATCH block all that often. It could only be situations where two sessions, given the same non-existent ItemName, run the INSERT...SELECT at the exact same moment, such that both sessions receive a "true" for the WHERE NOT EXISTS at the exact same moment and thus both attempt to do the INSERT at the exact same moment. That very specific scenario happens much less often than either selecting an existing ItemName or inserting a new ItemName when no other process is attempting to do so at the exact same moment.

WITH ALL OF THE ABOVE IN MIND: Why do I prefer my approach?

First, let's look at what locking takes place in the "serializable" approach. As mentioned above, the "range" that gets locked depends on the existing key values on either side of where the new key value would fit. The beginning or ending of the range could also be the beginning or ending of the index, respectively, if there is no existing key value in that direction. Assume we have the following index and keys (^ represents the beginning of the index while $ represents the end of it):

Range #:    |--- 1 ---|--- 2 ---|--- 3 ---|--- 4 ---|
Key Value:  ^         C         F         J         $

If session 55 attempts to insert a key value of:

A, then range # 1 (from ^ to C) is locked: session 56 cannot insert a value of B, even if unique and valid (yet). But session 56 can insert values of D, G, and M.
D, then range # 2 (from C to F) is locked: session 56 cannot insert a value of E (yet). But session 56 can insert values of A, G, and M.
M, then range # 4 (from J to $) is locked: session 56 cannot insert a value of X (yet). But session 56 can insert values of A, D, and G.

As more key values are added, the ranges between key values become narrower, hence reducing the probability / frequency of multiple values being inserted at the same time fighting over the same range. Admittedly, this is not a major problem, and fortunately it appears to be a problem that actually decreases over time.

The issue with my approach was described above: it only happens when two sessions attempt to insert the same key value at the same time. In this respect it comes down to which has the higher probability of happening: two different, yet close, key values being attempted at the same time, or the same key value being attempted at the same time? I suppose the answer lies in the structure of the app doing the inserts, but generally speaking I would assume it to be more likely that two different values that just happen to share the same range are being inserted. But the only way to really know would be to test both on the O.P.'s system.

Next, let's consider two scenarios and how each approach handles them:

All requests being for unique key values:

In this case, the CATCH block in my suggestion is never entered, hence no "issue" (i.e. 4 tran log entries and the time it takes to do that). But, in the "serializable" approach, even with all inserts being unique, there will always be some potential for blocking other inserts in the same range (albeit not for very long).

High frequency of requests for the same key value at the same time:

In this case -- a very low degree of uniqueness in terms of incoming requests for non-existent key values -- the CATCH block in my suggestion will be entered regularly. The effect of this will be that each failed insert will need to auto-rollback and write the 4 entries to the Transaction Log, which is a slight performance hit each time. But the overall operation should never fail (at least not due to this).

In the "serializable" approach, a previous version of the "updated" procedure could suffer from deadlocks in this case. Why? Because the serializable behavior only prevents INSERT operations in the range that has been read and hence locked; it doesn't prevent SELECT operations on that range. An updlock hint was added to address this and it no longer gets deadlocks.

The serializable approach, in this case, would seem to have no additional overhead, and might perform slightly better than what I am suggesting.

As with many / most discussions regarding performance, due to there being so many factors that can affect the outcome, the only way to really have a sense of how something will perform is to try it out in the target environment where it will run. At that point it won't be a matter of opinion :).

SqlZim · Answer 2 · 2016-12-17T11:10:50+08:00

Updated Answer

Response to @srutzky

Another, albeit minor, issue with this "serializable transaction + OUTPUT clause" approach is that the OUTPUT clause (in its present usage) sends the data back as a result set. A result set requires more overhead (probably on both sides: in SQL Server to manage the internal cursor, and in the app layer to manage the DataReader object) than a simple OUTPUT parameter. Given that we are only dealing with a single scalar value, and that the assumption is a high frequency of executions, that extra overhead of the result set probably adds up.

I agree, and for those same reasons I do use output parameters when prudent. It was my mistake not to use an output parameter on my initial answer, I was being lazy.

Here is a revised procedure using an output parameter, additional optimizations, along with next value for that @srutzky explains in his answer:

create procedure dbo.NameLookup_getset_byName (@vName nvarchar(50), @vValueId int output) as
begin
  set nocount on;
  set xact_abort on;
  set @vValueId = null;
  if nullif(@vName,'') is null                                 
    return;                                        /* if @vName is empty, return early */
  select  @vValueId = Id                                              /* go get the Id */
    from  dbo.NameLookup
    where ItemName = @vName;
  if @vValueId is not null                                 /* if we got the id, return */
    return;
  begin try;                                  /* if it is not there, then get the lock */
    begin tran;
      select  @vValueId = Id
        from  dbo.NameLookup with (updlock, serializable) /* hold key range for @vName */
        where ItemName = @vName;
      if @@rowcount = 0                    /* if we still do not have an Id for @vName */
      begin;                                         /* get a new Id and insert @vName */
        set @vValueId = next value for dbo.IdSequence;      /* get next sequence value */
        insert into dbo.NameLookup (ItemName, Id)
          values (@vName, @vValueId);
      end;
    commit tran;
  end try
  begin catch;
    if @@trancount > 0 
      begin;
        rollback transaction;
        throw;
      end;
  end catch;
end;

update note: Including updlock with the select will grab the proper locks in this scenario. Thanks to @srutzky, who pointed out that this could cause deadlocks when only using serializable on the select.

Note: This might not be the case, but if it is possible the procedure will be called with a value for @vValueId, include set @vValueId = null; after set xact_abort on;, otherwise it can be removed.

Concerning @srutzky's examples of key range locking behavior:

@srutzky only uses one value in his table, and locks the "next"/"infinity" key for his tests to illustrate key range locking. While his tests illustrate what happens in those situations, I believe the way the information is presented could lead to false assumptions about the amount of locking one could expect to encounter when using serializable in the scenario as presented in the original question.

Even though I perceive a bias (perhaps falsely) in the way he presents his explanation and examples of key range locking, they are still correct.

After more research, I found a particularly pertinent blog article from 2011 by Michael J. Swart: Mythbusting: Concurrent Update/Insert Solutions. In it, he tests multiple methods for accuracy and concurrency. Method 4: Increased Isolation + Fine Tuning Locks is based on Sam Saffron's post Insert or Update Pattern For SQL Server, and the only method in the original test to meet his expectations (joined later by merge with (holdlock)).

In February of 2016, Michael J. Swart posted Ugly Pragmatism For The Win. In that post, he covers some additional tuning he made to his Saffron upsert procedures to reduce locking (which I included in the procedure above).

After making those changes, Michael wasn't happy that his procedure was starting to look more complicated and consulted with a colleague named Chris. Chris read all of the original Mythbusters post and read all the comments and asked about @gbn's TRY CATCH JFDI pattern. This pattern is similar to @srutzky's answer, and is the solution that Michael ended up using in that instance.

Michael J Swart:

Yesterday I had my mind changed about the best way to do concurrency. I describe several methods in Mythbusting: Concurrent Update/Insert Solutions. My preferred method is to increase the isolation level and fine tune locks.

At least that was my preference. I recently changed my approach to use a method that gbn suggested in the comments. He describes his method as the “TRY CATCH JFDI pattern”. Normally I avoid solutions like that. There’s a rule of thumb that says developers should not rely on catching errors or exceptions for control flow. But I broke that rule of thumb yesterday.

By the way, I love the gbn’s description for the pattern “JFDI”. It reminds me of Shia Labeouf’s motivational video.

In my opinion, both solutions are viable. While I still prefer to increase the isolation level and fine tune locks, @srutzky's answer is also valid and may or may not be more performant in your specific situation.

Perhaps in the future I too will arrive at the same conclusion that Michael J. Swart did, but I'm just not there yet.

It isn't my preference, but here is what my adaptation of Michael J. Swart's adaptation of @gbn's Try Catch JFDI procedure would look like:

create procedure dbo.NameLookup_JFDI (
    @vName nvarchar(50)
  , @vValueId int output
  ) as
begin
  set nocount on;
  set xact_abort on;
  set @vValueId = null;
  if nullif(@vName,'') is null                                 
    return;                     /* if @vName is empty, return early */
  begin try                                                 /* JFDI */
    insert into dbo.NameLookup (ItemName)
      select @vName
      where not exists (
        select 1
          from dbo.NameLookup
          where ItemName = @vName);
  end try
  begin catch        /* ignore duplicate key errors, throw the rest */
    if error_number() not in (2601, 2627) throw;
  end catch
  select  @vValueId = Id                              /* get the Id */
    from  dbo.NameLookup
    where ItemName = @vName;
end;

If you are inserting new values more often than selecting existing values, this may be more performant than @srutzky's version. Otherwise I would prefer @srutzky's version over this one.

Aaron Bertrand's comment on Michael J. Swart's post links to relevant testing he has done, and led to this exchange. Excerpt from the comment section on Ugly Pragmatism For the Win:

Sometimes, though, JFDI leads to worse performance overall, depending on what % of calls fail. Raising exceptions has substantial overhead. I showed this in a couple of posts:

http://sqlperformance.com/2012/08/t-sql-queries/error-handling

https://www.mssqltips.com/sqlservertip/2632/checking-for-potential-constraint-violations-before-entering-sql-server-try-and-catch-logic/

Comment by Aaron Bertrand — February 11, 2016 @ 11:49 am

and the reply of:

You’re right Aaron, and we did test it.

It turns out that in our case, the percent of calls that failed was 0 (when rounded to the nearest percent).

I think you illustrate the point that as much as possible, evaluate things on a case-by-case basis over following rules-of-thumb.

It’s also why we added the not-strictly-necessary WHERE NOT EXISTS clause.

Comment by Michael J. Swart — February 11, 2016 @ 11:57 am

New links:

Original answer

I still prefer the Sam Saffron upsert approach vs using merge, especially when dealing with a single row.

I would adapt that upsert method to this situation like this:

declare @vName nvarchar(50) = 'Invader';
declare @vValueId int       = null;

if nullif(@vName,'') is not null /* this gets your where condition taken care of before we start doing anything */
begin /* wrap the whole transaction so an empty @vName skips all of it, not just "begin tran" */
  begin tran;
    select @vValueId = Id
      from dbo.NameLookup with (serializable) 
      where ItemName = @vName;
    if @@rowcount > 0 
      begin;
        select @vValueId as id;
      end;
      else
      begin;
        insert into dbo.NameLookup (ItemName)
          output inserted.id
            values (@vName);
      end;
  commit tran;
end;

I would be consistent with your naming, and as serializable is the same as holdlock, pick one and be consistent in its use. I tend to use serializable because it is the same name used as when specifying set transaction isolation level serializable.

By using serializable or holdlock, a range lock is taken based on the value of @vName, which makes any other operations wait if they are selecting or inserting values into dbo.NameLookup that include the value in the where clause.

For the range lock to work properly, there needs to be an index on the ItemName column; this applies when using merge as well.

Here is what the procedure would look like mostly following Erland Sommarskog's whitepapers for error handling, using throw. If throw isn't how you are raising your errors, change it to be consistent with the rest of your procedures:

create procedure dbo.NameLookup_getset_byName (@vName nvarchar(50) ) as
begin
  set nocount on;
  set xact_abort on;
  declare @vValueId int;
  if nullif(@vName,'') is null /* if @vName is null or empty, select Id as null */
    begin
      select Id = cast(null as int);
    end 
    else                       /* else go get the Id */
    begin try;
      begin tran;
        select @vValueId = Id
          from dbo.NameLookup with (serializable) /* hold key range for @vName */
          where ItemName = @vName;
        if @@rowcount > 0      /* if we have an Id for @vName select @vValueId */
          begin;
            select @vValueId as Id; 
          end;
          else                     /* else insert @vName and output the new Id */
          begin;
            insert into dbo.NameLookup (ItemName)
              output inserted.Id
                values (@vName);
            end;
      commit tran;
    end try
    begin catch;
      if @@trancount > 0 
        begin;
          rollback transaction;
          throw;
        end;
    end catch;
  end;
go

To summarize what is going on in the procedure above: set nocount on; set xact_abort on; like you always do, then if our input variable is null or empty, select id = cast(null as int) as the result. If it isn't null or empty, then get the Id for our variable while holding that spot in case it isn't there. If the Id is there, send it out. If it isn't there, insert it and send out that new Id.

Meanwhile, other calls to this procedure trying to find the Id for the same value will wait until the first transaction is done and then select & return it. Other calls to this procedure or other statements looking for other values will continue on because this one isn't in the way.

While I agree with @srutzky that you can handle collisions and swallow the exceptions for this sort of issue, I personally prefer to try and tailor a solution to avoid doing that when possible. In this case, I don't feel that using the locks from serializable is a heavy handed approach, and I would be confident it would handle high concurrency well.

Quote from sql server documentation on the table hints serializable / holdlock:

SERIALIZABLE

Is equivalent to HOLDLOCK. Makes shared locks more restrictive by holding them until a transaction is completed, instead of releasing the shared lock as soon as the required table or data page is no longer needed, whether the transaction has been completed or not. The scan is performed with the same semantics as a transaction running at the SERIALIZABLE isolation level. For more information about isolation levels, see SET TRANSACTION ISOLATION LEVEL (Transact-SQL).

Quote from sql server documentation on transaction isolation level serializable

SERIALIZABLE Specifies the following:

Statements cannot read data that has been modified but not yet committed by other transactions.

No other transactions can modify data that has been read by the current transaction until the current transaction completes.

Other transactions cannot insert new rows with key values that would fall in the range of keys read by any statements in the current transaction until the current transaction completes.

Links related to the solution above:

Insert or Update pattern for Sql Server - Sam Saffron
Documentation on serializable and other Table Hints - MSDN
Error and Transaction Handling in SQL Server Part One – Jumpstart Error Handling - Erland Sommarskog
Erland Sommarskog's advice regarding @@rowcount, (which I didn't follow in this instance).

MERGE has a spotty history, and it seems to take more poking around to make sure that the code is behaving how you want it to under all that syntax. Relevant merge articles:

One last link, Kendra Little did a rough comparison of merge vs insert with left join, with the caveat where she says "I didn’t do thorough load testing on this", but it is still a good read.
