SQL Server tuning: DML for multiple rows in one shot
Today's topic relates to a training series I delivered over two years as a database/ETL architect at one of the top-10 e-commerce companies in Europe. The goal of the series was to train developers, junior to senior, on how to optimize the database-related operations in their application code.
One part of the series that I particularly liked was about sending arrays of values to the database in order to perform a series of DML operations.
I still have the video series, and it has no relation to anything done at this e-commerce company. The copyright is mine, so I can definitely share it. However, the videos are far too big to share on a blog.
I have also seen similar training sessions on the Net, but I would be happy to discuss your projects and how to implement this in ColdFusion, .NET, PHP, or any other web language.
In this article I will summarize my findings and my advice on this need, which is quite often overlooked in the web industry.
1. Ad-hoc queries versus stored procedures
As a DBA, there is absolutely no controversy here for me: application code should call stored procedures whenever it reaches the database, for any DML operation or even for SELECTs.
Some DBAs say it is because of performance. Well, that reason is shaky. It is true that stored procedures help the SQL Server engine reuse execution plans. However, performance can sometimes be an issue when dealing with searches.
Many searches like "give me the orders for specific dates and/or specific customer names and/or specific cities, etc." have what we call dynamic portions of (hopefully) SARGable conditions, and a stored procedure built on CASE WHEN conditions will not optimize such queries.
There has been a lot of debate here, but the truth, and the solution, is somewhat simple: you will eventually use parameterized queries.
Parameterized queries give you the performance you need while increasing security and avoiding SQL injection, provided the parameters are properly checked, whether at the application layer or at the stored procedure layer.
And in that last sentence I gave my preference: stored procedures.
Yes, indeed! Even though some queries might be very dynamic, for security reasons (and only a little for performance reasons), stored procedures are unequivocal for me.
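The two ideas combine nicely: a stored procedure can build only the predicates it actually needs and still run them as a parameterized query through sp_executesql. Here is a minimal sketch for the search example above; the table and column names (dbo.Orders, CustomerName, City, OrderDate) are illustrative, not from a real schema.

```sql
CREATE PROCEDURE dbo.SearchOrders
    @CustomerName nvarchar(100) = NULL,
    @City         nvarchar(100) = NULL,
    @FromDate     date          = NULL,
    @ToDate       date          = NULL
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @sql nvarchar(max) = N'
        SELECT o.OrderId, o.OrderDate, o.CustomerName, o.City
        FROM dbo.Orders AS o
        WHERE 1 = 1';

    -- Append only the predicates that were actually supplied,
    -- so each search variant gets its own well-optimized plan.
    IF @CustomerName IS NOT NULL SET @sql += N' AND o.CustomerName = @CustomerName';
    IF @City         IS NOT NULL SET @sql += N' AND o.City = @City';
    IF @FromDate     IS NOT NULL SET @sql += N' AND o.OrderDate >= @FromDate';
    IF @ToDate       IS NOT NULL SET @sql += N' AND o.OrderDate < @ToDate';

    -- The values travel as parameters, never concatenated into the text:
    -- no SQL injection, and plans are reused per predicate combination.
    EXEC sys.sp_executesql
        @sql,
        N'@CustomerName nvarchar(100), @City nvarchar(100), @FromDate date, @ToDate date',
        @CustomerName = @CustomerName, @City = @City,
        @FromDate = @FromDate, @ToDate = @ToDate;
END;
```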
That said, if you want to update several rows of a table (for example, to update the whole shopping cart in one stored procedure call), you end up passing several values for the same parameter (for example, CartDetailIds) to the database.
2. Repeated stored procedure calls vs. one stored procedure call
Keeping the same example, one click of a button can update the shopping cart (adding quantities, removing products, etc.). As I said, there are three ways to update the cart:
The first is to build your DML operation dynamically and execute it as one dynamic statement: as I stated, this is not good for security reasons.
Another is to call the same stored procedure n times, updating one cart detail at a time, in a loop of stored procedure calls. This solution is not good either, for two reasons. The first is that you have to manage a transaction block at the application layer, and I would rather keep transactions as short as possible. The second is that performance degrades significantly. Do the test; I did, and I could see significant differences (I can share the tests upon request).
Finally, you can pass all the details in one stored procedure call and let the stored procedure handle the transaction and the update operation at once. In this example, the trick is to pass some kind of array of CartDetailId and Quantity pairs, where quantity 0 means removing the line. We will discuss later how to pass such arrays, but first, conceptually, I want to point out that you can definitely improve performance by using a MERGE statement (to handle inserts, updates, and deletes of cart details at the same time) and then the OUTPUT clause to return the results of the changes to the application (notably the inserts, since new CartDetailIds would be created). If you had no other table to update (which is probably not the case, as you would likely update the Cart table for a total-amount column or a last-update-date column), you would get an implicit transaction and would not even need an explicit transaction block.
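Conceptually, the MERGE-plus-OUTPUT step looks like the sketch below. The dbo.CartDetail table and the @Changes table variable are illustrative (section 3 covers how the array actually arrives); only the update and delete branches are shown, since an insert branch would also need product information.

```sql
-- Illustrative schema: dbo.CartDetail(CartDetailId, CartId, ProductId, Quantity).
-- @Changes stands in for the array of rows passed by the application.
DECLARE @Changes TABLE (CartDetailId int NOT NULL, Quantity smallint NOT NULL);
INSERT INTO @Changes (CartDetailId, Quantity)
VALUES (541001, 1), (541002, 0), (541004, 5);

MERGE dbo.CartDetail AS target
USING @Changes AS source
    ON target.CartDetailId = source.CartDetailId
WHEN MATCHED AND source.Quantity = 0
    THEN DELETE                                  -- quantity 0 removes the line
WHEN MATCHED
    THEN UPDATE SET target.Quantity = source.Quantity
-- OUTPUT returns what actually happened to each row, in one round trip.
OUTPUT $action AS ChangeType,
       COALESCE(inserted.CartDetailId, deleted.CartDetailId) AS CartDetailId,
       inserted.Quantity AS NewQuantity;
```

A single MERGE like this is one atomic statement, which is why no explicit transaction block is needed when it is the only table touched.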
3. Solutions for passing arrays of data to stored procedures
There are five solutions for this. I tested all of them, and I will comment on them from worst to best performance.
The worst one is calling a stored procedure several times, but this one was already discarded for the reasons stated above. On a performance scale, this solution rates 1/10.
The second is passing arrays as delimited lists (text with a separator). For example, in our case we could pass @CartDetailIds = '541001,541002,541004' with @CartDetailQuantities = '1,0,5'. The stored procedure then parses the text and allocates the corresponding quantity to each CartDetailId. To do so, you need to create a parsing function. This is pretty classic, and this kind of function can be written in any SQL Server version, and in other database engines (Oracle, MySQL, etc.), which makes the solution universal and therefore useful if you are not committed to a specific technology. You then use this function to gather the pieces and insert all the data into one table variable. The preparation is done and the DML operation becomes possible. In terms of performance, I would rate this solution a good 7/10. It does the job nicely and answers all the concerns: security, performance, scalability.
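A minimal sketch of such a parsing function is below. The names (dbo.SplitList) and datatypes are illustrative; the function returns each element with its position so that the two parallel lists can be re-paired by ordinal, which a plain STRING_SPLIT (before SQL Server 2022's ordinal option) does not guarantee.

```sql
CREATE FUNCTION dbo.SplitList (@list varchar(max), @sep char(1))
RETURNS @out TABLE (Ordinal int NOT NULL, Item varchar(100) NOT NULL)
AS
BEGIN
    DECLARE @pos int = 1, @next int, @ord int = 0;
    WHILE @pos <= LEN(@list)
    BEGIN
        SET @next = CHARINDEX(@sep, @list, @pos);       -- next separator
        IF @next = 0 SET @next = LEN(@list) + 1;        -- last element
        SET @ord += 1;
        INSERT INTO @out (Ordinal, Item)
        VALUES (@ord, SUBSTRING(@list, @pos, @next - @pos));
        SET @pos = @next + 1;
    END;
    RETURN;
END;
GO

-- Re-pair the two lists by position into one table variable:
DECLARE @CartDetailIds        varchar(max) = '541001,541002,541004',
        @CartDetailQuantities varchar(max) = '1,0,5';

DECLARE @Changes TABLE (CartDetailId int NOT NULL, Quantity smallint NOT NULL);
INSERT INTO @Changes (CartDetailId, Quantity)
SELECT CAST(i.Item AS int), CAST(q.Item AS smallint)
FROM dbo.SplitList(@CartDetailIds, ',')        AS i
JOIN dbo.SplitList(@CartDetailQuantities, ',') AS q
  ON q.Ordinal = i.Ordinal;
```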
The third is passing XML. You can pass an array as an XML fragment like <CartDetail id="541001" Quantity="1"/><CartDetail id="541002" Quantity="0"/><CartDetail id="541004" Quantity="5"/>. Once it is passed to an xml parameter @CartDetails in the stored procedure, you can rebuild a table with XQuery and then proceed as above. The performance is good and I would rate it a bit higher than the previous solution; it actually beats it, but not by much. I would give it 8/10. Note that XML is verbose. The advantage is certainly the universality of the format. However, the text is longer to pass, and if you have to pass thousands of rows, that can be a concern.
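Shredding that fragment into a table variable takes only one XQuery call, using the nodes() and value() methods of the xml type:

```sql
DECLARE @CartDetails xml = N'
  <CartDetail id="541001" Quantity="1"/>
  <CartDetail id="541002" Quantity="0"/>
  <CartDetail id="541004" Quantity="5"/>';

-- nodes() turns each <CartDetail> element into a row;
-- value() extracts and types each attribute.
DECLARE @Changes TABLE (CartDetailId int NOT NULL, Quantity smallint NOT NULL);
INSERT INTO @Changes (CartDetailId, Quantity)
SELECT c.value('@id', 'int'),
       c.value('@Quantity', 'smallint')
FROM @CartDetails.nodes('/CartDetail') AS t(c);
```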
The fourth is passing a JSON fragment. This solution is similar to the XML one but integrates better with today's applications, which render JSON output very easily. The JSON parsing functions are only available from SQL Server 2016 onward, however. I use this solution quite often now; it is relatively fast.
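With OPENJSON and a WITH clause, the JSON variant of the same array shreds just as directly (the property names here mirror the XML example and are, of course, up to you):

```sql
-- Requires SQL Server 2016 or later for OPENJSON.
DECLARE @CartDetails nvarchar(max) = N'
  [ {"id": 541001, "Quantity": 1},
    {"id": 541002, "Quantity": 0},
    {"id": 541004, "Quantity": 5} ]';

DECLARE @Changes TABLE (CartDetailId int NOT NULL, Quantity smallint NOT NULL);
INSERT INTO @Changes (CartDetailId, Quantity)
SELECT CartDetailId, Quantity
FROM OPENJSON(@CartDetails)
WITH (CartDetailId int      '$.id',        -- map and type each property
      Quantity     smallint '$.Quantity');
```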
The fifth solution is passing a table-valued parameter (TVP). This is by far the most efficient one. Again, in my tests, performance was very good and much better than the previous solutions (if you push the stress to hundreds of thousands of rows at once). Instantiating a table can be done in .NET from Framework 3.5 onward. Another advantage is the integrity of the data passed into the stored procedure, as you can decide the exact datatype of each column. In our example, the quantity could be a tinyint, allowing values from 0 to 255, reducing storage and ultimately improving performance. Your main concern with TVPs is how to change one that several stored procedures share: you cannot alter a table type while it is already in use. You might then consider versioning your TVPs so that the release does not happen in one big shot, at the cost of more regression-testing time. But this can be managed nicely if you know how to deal with versions.
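The T-SQL side of a TVP is a table type plus a READONLY parameter. The type and procedure names below are illustrative; note how the type itself carries the datatypes and constraints that give you the integrity discussed above.

```sql
CREATE TYPE dbo.CartDetailChange AS TABLE (
    CartDetailId int     NOT NULL PRIMARY KEY,
    Quantity     tinyint NOT NULL              -- 0..255; 0 means "remove the line"
);
GO

CREATE PROCEDURE dbo.UpdateCartDetails
    @Changes dbo.CartDetailChange READONLY     -- TVPs must be declared READONLY
AS
BEGIN
    SET NOCOUNT ON;

    -- Quantity 0 removes the line...
    DELETE cd
    FROM dbo.CartDetail AS cd
    JOIN @Changes AS c ON c.CartDetailId = cd.CartDetailId
    WHERE c.Quantity = 0;

    -- ...anything else updates it.
    UPDATE cd
    SET cd.Quantity = c.Quantity
    FROM dbo.CartDetail AS cd
    JOIN @Changes AS c ON c.CartDetailId = cd.CartDetailId
    WHERE c.Quantity > 0;
END;
```

On the .NET side, the application fills a DataTable (or a list of SqlDataRecord) with the same columns and passes it as a SqlParameter with SqlDbType.Structured, so the whole cart travels in one round trip.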
4. The small things that make the difference
Now that I have talked about performance (wearing my DBA hat), I would like to talk about architecture. I like TVPs for another reason. Sometimes I am a bit too neat about architecture, and TVPs help me be neater still. Just as I like to place all my tables on a filegroup other than PRIMARY, and just as I like to define user-defined datatypes such as telephone or email and use them in my tables, I like TVPs because I can enforce integrity on all my parameters, even the array ones.
When integrity competes with performance, as foreign keys can, there is room for debate; but when we are only talking about defining a datatype or a check constraint, I am in favor of spending the time on it. Once that time is spent, perhaps with an additional day or two to cover all the grounds on each table, we achieve state-of-the-art storage. Data partitioning falls into the same category of things worth the time in order to get the best shape for your database structure.
TVPs are the same kind of thing. Do not hesitate to spend extra time and ask extra questions of the project manager/business analyst to get the best storage and the best performance from your database.
Then everything will fall into place nicely: integrity, security, scalability.
Do not hesitate to ask me questions and to ask for the videos related to this training.
Have a good day!