6

How to Batch Updates A Few Thousand Rows at a Time

 3 years ago
source link: https://www.brentozar.com/archive/2020/12/how-to-batch-updates-a-few-thousand-rows-at-a-time/?utm_campaign=Feed%3A+BrentOzar-SqlServerDba+%28Brent+Ozar+Unlimited%29
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

How to Batch Updates A Few Thousand Rows at a Time

You’ve got a staging table with millions of rows, and you want to join that over to a production table and update the contents. However, when you try to do it all in one big statement, you end up with lock escalation, large transaction log usage, slow replication to Availability Groups, and angry users with pitchforks gathered in the Zoom lobby.

In this post, I’ll explain how to use a combination of two separate topics that I’ve blogged about recently:

Setting up the problem

I’ll start with the Stack Overflow database (any version will work) and make a copy of the Users table called Users_Staging, and I’ll change a bunch of values in it to simulate a table we need to import from somewhere else:

SELECT *
INTO dbo.Users_Staging
FROM dbo.Users;
/* Change some of their data randomly: */
UPDATE dbo.Users_Staging
Reputation=CASEWHEN Id%2=0THENReputation+100ELSEReputation END,
LastAccessDate=CASEWHEN Id%3=0THENGETDATE()ELSELastAccessDate END,
DownVotes=CASEWHEN Id%10=0THEN0ELSEDownVotes END,
UpVotes=CASEWHEN Id%11=0THEN0ELSEUpVotes END,
Views=CASEWHEN Id%7=0THENViews+1ELSEViews END;
CREATE UNIQUE CLUSTERED INDEX Id ON dbo.Users_Staging(Id);

If I try to run a single update statement and update all of the rows:

UPDATEu
SET Age=us.Age,CreationDate=us.CreationDate,DisplayName=us.DisplayName,
DownVotes=us.DownVotes,EmailHash=us.EmailHash,LastAccessDate=us.LastAccessDate,
Location=us.Location,Reputation=us.Reputation,UpVotes=us.UpVotes,
Views=us.Views,WebsiteUrl=us.WebsiteUrl,AccountId=us.AccountId
FROM dbo.Usersu
INNER JOIN dbo.Users_Staging us ONu.Id=us.Id

Then while it’s running, check the locks it’s holding in another window with sp_WhoIsActive @get_locks = 1:

See how the Users table says “OBJECT” request_mode “X”? That means my update query has gotten an eXclusive lock on the Users table. That’s your sign that other queries will be screaming with anger as they wait around for your update to finish.

Now, sometimes that’s what you actually want: sometimes you want to rip the Band-Aid off and get all of your work done in a single transaction. However, sometimes you want to work through the operation in small chunks, avoiding lock escalation. In that case, we’re going to need a batching process.

How to fix it using the fast ordered delete technique and the output table technique

I’m going to encapsulate my work in a stored procedure so that it can be repeatedly called by whatever application I’m using to control my ETL process. You could do this same technique with a loop (like while exists rows in the staging table), but I’m choosing not to cover that here. When you get your own blog, you’ll realize that you also get to control what you write about … and everyone will complain regardless. Here we go:

CREATE ORALTER PROC dbo.usp_UsersETL
@RowsAffected INT=NULLOUTPUT AS
BEGIN
CREATE TABLE#RowsAffected (Id INT);
BEGIN TRAN;
WITH RowsToUpdate AS(SELECT TOP1000*FROM dbo.Users_Staging ORDER BY Id
UPDATEu
SET Age=us.Age,CreationDate=us.CreationDate,DisplayName=us.DisplayName,
DownVotes=us.DownVotes,EmailHash=us.EmailHash,LastAccessDate=us.LastAccessDate,
Location=us.Location,Reputation=us.Reputation,UpVotes=us.UpVotes,
Views=us.Views,WebsiteUrl=us.WebsiteUrl,AccountId=us.AccountId
OUTPUT INSERTED.Id INTO#RowsAffected
FROM RowsToUpdate us
INNER JOIN dbo.UsersuONu.Id=us.Id
DELETE dbo.Users_Staging
WHERE Id IN(SELECT Id FROM#RowsAffected);
COMMIT
SELECT@RowsAffected=COUNT(*)FROM#RowsAffected;

When I execute that stored procedure, the locks look like a hot mess, but note the lock level on the Users object:

Now, they say “OBJECT” request_mode=”IX”, which means INTENT exclusive as opposed to just straight exclusive. This means that SQL Server is starting some work, and it might need to escalate the locks to table level…but as long as you keep the number of rows & locks low, it won’t have to. In this case, my stored procedure runs & finishes quickly without escalating to a table-level lock.

There two parts of the proc that make this magic happen. This part:

WITH RowsToUpdate AS(SELECT TOP1000*FROM dbo.Users_Staging ORDER BY Id)

Tells SQL Server that it’s only going to grab 1,000 rows, and it’s going to be easy to identify exactly which 1,000 rows they are because our staging table has a clustered index on Id. That enables SQL Server to grab those 1,000 rows first, then do exactly 1,000 clustered index seeks on the dbo.Users table.

The second magical component:

OUTPUT INSERTED.Id INTO#RowsAffected

Tells SQL Server to track which 1,000 Ids got updated. This way, we can be certain about which rows we can safely remove from the Users_Staging table.

For bonus points, if you wanted to keep the rows in the dbo.Users_Staging table while you worked rather than deleting them, you could do something like:

  • Add an Is_Processed bit column to dbo.Users_Staging
  • Add “WHERE Is_Processed IS NULL” filter to the update clause so that we don’t update rows twice
  • After the update finishes, update the Is_Processed column in dbo.Users_Staging to denote that the row is already taken care of

However, when you add more stuff like this, you also introduce more overhead to the batch process. I’ve also seen cases where complex filters on the dbo.Users_Staging table would cause SQL Server to not quickly identify the next 1,000 rows to process.

If you’re doing this work,
you should also read…

This blog post was to explain one very specific technique: combining fast ordered deletes with an output table. This post isn’t meant to be an overall compendium of everything you need to know while building ETL code. However, as long as you’ve finished this post, I want to leave you with a few related links that you’re gonna love because they help you build more predictable and performant code:

Or, if you’d like to watch me write this blog post, I did it on a recent stream:

To see more of these, follow me on Twitch or YouTube.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK