Introduction
This article is not going to give you a pinpoint solution for a specific delete challenge; rather, it is meant to give a broader picture of the problem domain and the various tools you can choose from based on your situation.
There are several articles out there that address the bulk delete problem. In this post we are going to first explore the various factors which influence a bulk delete, and then deep dive into an approach that minimizes the influence of those factors and allows us to adopt a trial & error approach with greater flexibility. So without further ado, let's begin with a sincere thanks to all those who took the time to contribute their thoughts and share their experiences with our virtual nation of thirsty surfers. Remember that this article is a conceptual, approach-oriented post rather than a specific technical solution.
So before we begin, let's focus on some of our strengths and bottlenecks, which will familiarize us with the environment.
Let's understand the nature of the problem:
Philosophical...... No two patients are the same. Even though they suffer from the same disease, the treatment will vary from patient to patient; various factors influence the recovery, such as family history, food habits, mental state, allergies and so on. Likewise, no two databases are the same: even though both encounter the same problem (disease), factors such as processor speed, software edition (Standard/Enterprise), RAM, row size, disk space and several others influence the resolution. Just as medical science lets you try different prescriptions with the objective of finding the most suitable cure, SQL Server provides you various tools along with great flexibility to determine the approach that addresses your problem.
Let's understand what we are dealing with.... (Data Buckets)
As per industry standards, we can classify data into three categories/buckets with respect to its accessibility needs: Hot, Warm and Cold. Hot data is stored on highly reliable hardware and is essential for your business to run. Warm data is not critical for day-to-day business, but you still need it often for data analysis or background tasks such as promotion-performance reporting. Cold data is something you don't need immediately; you can keep it on backup drives and make it available on demand.
Let's summarize our strengths (Platform Strengths)
#1 Concurrency :- One of the biggest strengths we possess is concurrency. Since the database supports concurrent users, we can use that capability not only for inserts and updates but for deletes too.
#2 Indexing :- We all know indexing (and if not, this article is not meant for you). Indexing not only helps with faster retrieval of records, it is equally helpful when updating and deleting them.
#3 Joins :- We usually use joins for retrieving data from related tables, but a join can also drive a delete: by joining the target table to a small driver table on its clustered key range, the delete locates rows through an index seek instead of a full table scan, and it does not need the non-clustered indexes to find the rows. A quick sketch follows this list, and the worked example further down uses the same idea.
#4 Jobs :- The flexibility to schedule repeated tasks and let SQL Server Agent handle the pain of monitoring and executing them on a timely basis.
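To make the join idea from #3 concrete before we reach the worked example, here is a minimal, hypothetical sketch; dbo.BigTable and dbo.DeleteQueue are illustrative names only, not objects from the scenario below:

-- Hypothetical sketch: drive the DELETE by a clustered-key range held in a small driver table,
-- so the rows are located by a clustered index seek rather than a full table scan.
DELETE t
FROM dbo.BigTable AS t
JOIN dbo.DeleteQueue AS q
    ON t.ID BETWEEN q.Min_ID AND q.Max_ID
WHERE q.QueueID = 1;   -- the chunk currently being processed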
Example
Let's say that you are responsible for maintaining an eCommerce database that accepts 100,000 orders per day. The database is growing drastically, and you have to keep at least the last nine months of data on hand; let's call it Hot Data.
… Say there is an Order table with the following columns:
OrderID – Bigint
OrderDate – DateTime
CustomerID – Int
OrderAmount – Money
LineItemCount – Int
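A minimal sketch of such a table follows. The column names and types come from the list above; the clustered primary key on OrderID and the index on OrderDate are assumptions for illustration, but they are what the range-based deletes below rely on:

-- Sketch of the example Order table (constraints and index names are assumed).
CREATE TABLE dbo.[Order]
(
    OrderID       BIGINT   NOT NULL CONSTRAINT PK_Order PRIMARY KEY CLUSTERED,
    OrderDate     DATETIME NOT NULL,
    CustomerID    INT      NOT NULL,
    OrderAmount   MONEY    NOT NULL,
    LineItemCount INT      NOT NULL
);

-- Assumed: a non-clustered index on OrderDate so the archival queries can find
-- "older than nine months" rows without scanning the whole table.
CREATE NONCLUSTERED INDEX IX_Order_OrderDate ON dbo.[Order] (OrderDate);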
The challenge is to move data more than nine months old into an Archive database or some other location without impacting day-to-day operations.
Step-1: Typical Approach… Let's delete all the data on a weekly basis where OrderDate < Today - 9 months.
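In T-SQL the typical approach boils down to something like this minimal sketch, using the example table above:

-- Weekly purge: delete everything older than nine months in one statement.
-- Simple, but it runs as a single large transaction of unpredictable size.
DELETE FROM dbo.[Order]
WHERE OrderDate < DATEADD(MONTH, -9, GETDATE());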
Problem: A huge amount of data has to be deleted, and resource limitations may cause interruptions in day-to-day operations. The time required to delete the data may not be consistent, as it depends on the number of rows to be deleted.
Takeaway: Unknown execution time; the longer the job runs, the longer the interruption.
Step-2: Second Thought… How about deleting the data in chunks, i.e., deleting 1,000 records at a time in a cursor?
Problem: Cursors cause a performance bottleneck, and running a cursor over chunks of 1,000 records still does not resolve the unknown-execution-time factor.
Takeaway: Not a good idea
Step-3: Delegation….. How about isolating the chunks from cursors and delegating the chunk logic to an external object? Let's say we want to begin with a chunk size of 1,000.
Chunk Delegation: Let's create a new table named xOrderDeleteQueue with the following columns: QueueID (int), OrderDate, Min_OrderID, Max_OrderID and Status_ID. Then let's populate this table with the various chunks we want to delete, where the difference between Min_OrderID and Max_OrderID is the chunk size, i.e., 1,000.
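A possible shape for the queue table and its population is sketched below. Only the column names come from the text above; the data types beyond those mentioned, the Status_ID convention (0 = pending, 1 = done) and the use of OrderDate as a representative date per chunk are assumptions:

CREATE TABLE dbo.xOrderDeleteQueue
(
    QueueID     INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_xOrderDeleteQueue PRIMARY KEY,
    OrderDate   DATETIME NOT NULL,          -- representative date for the chunk (assumed use)
    Min_OrderID BIGINT   NOT NULL,
    Max_OrderID BIGINT   NOT NULL,
    Status_ID   TINYINT  NOT NULL CONSTRAINT DF_xODQ_Status DEFAULT (0)  -- 0 = pending, 1 = done
);

-- Populate the queue with 1,000-row chunks of orders older than nine months.
;WITH OldOrders AS
(
    SELECT OrderID, OrderDate,
           (ROW_NUMBER() OVER (ORDER BY OrderID) - 1) / 1000 AS ChunkNo
    FROM dbo.[Order]
    WHERE OrderDate < DATEADD(MONTH, -9, GETDATE())
)
INSERT INTO dbo.xOrderDeleteQueue (OrderDate, Min_OrderID, Max_OrderID, Status_ID)
SELECT MAX(OrderDate), MIN(OrderID), MAX(OrderID), 0
FROM OldOrders
GROUP BY ChunkNo;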
Cursor Delegation: Instead of using cursors, can we use SQL Server Agent jobs and schedule a delete statement which picks the first pending QueueID and deletes all the data from the Order table within its Min_OrderID to Max_OrderID range?
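A sketch of the T-SQL such a job step might run is shown below. The WITH (UPDLOCK, READPAST) hint is an addition of mine so that, once Step-4 adds parallel jobs, two jobs never claim the same chunk; the Status_ID convention is the one assumed above:

DECLARE @QueueID INT;

BEGIN TRANSACTION;

-- Claim the oldest pending chunk; parallel jobs skip rows already claimed.
SELECT TOP (1) @QueueID = QueueID
FROM dbo.xOrderDeleteQueue WITH (UPDLOCK, READPAST)
WHERE Status_ID = 0            -- 0 = pending (assumed convention)
ORDER BY QueueID;

IF @QueueID IS NOT NULL
BEGIN
    -- If the rows must be moved rather than just deleted, an INSERT INTO the
    -- archive table for the same OrderID range would go here, before the DELETE.

    -- Join-driven range delete on the clustered key (see strength #3 above).
    DELETE o
    FROM dbo.[Order] AS o
    JOIN dbo.xOrderDeleteQueue AS q
        ON o.OrderID BETWEEN q.Min_OrderID AND q.Max_OrderID
    WHERE q.QueueID = @QueueID;

    -- Mark the chunk as done.
    UPDATE dbo.xOrderDeleteQueue
    SET Status_ID = 1
    WHERE QueueID = @QueueID;
END

COMMIT TRANSACTION;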
Takeaway: (a) Data is deleted in chunks, so the overall impact on resources is minimized. (b) We are flexible enough to schedule the job during non-peak hours to minimize the impact on business hours.
Step-4: Photocopy… How about scheduling parallel SQL Server jobs to delete the data? This lets us delete more data during non-peak hours and adds further flexibility.
So let's create a couple of SQL Server Agent jobs by replicating the first job, as sketched below.
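A sketch of creating one such job with the msdb stored procedures follows. The job name, database name, schedule window and the wrapper procedure usp_DeleteNextOrderChunk (assumed to contain the Step-3 script) are all placeholders; the second and third jobs are the same script with a different @job_name:

USE msdb;
GO

-- Create the job.
EXEC dbo.sp_add_job @job_name = N'Purge Old Orders - 1';

-- Single T-SQL step that runs the chunk-delete script from Step-3.
EXEC dbo.sp_add_jobstep
    @job_name      = N'Purge Old Orders - 1',
    @step_name     = N'Delete next pending chunk',
    @subsystem     = N'TSQL',
    @database_name = N'eCommerceDB',                        -- placeholder database name
    @command       = N'EXEC dbo.usp_DeleteNextOrderChunk;'; -- hypothetical wrapper for the Step-3 script

-- Run every 15 minutes during an assumed off-peak window (01:00 to 05:00 daily).
EXEC dbo.sp_add_jobschedule
    @job_name             = N'Purge Old Orders - 1',
    @name                 = N'Off-peak every 15 minutes',
    @freq_type            = 4,      -- daily
    @freq_interval        = 1,
    @freq_subday_type     = 4,      -- repeat in minutes
    @freq_subday_interval = 15,
    @active_start_time    = 010000, -- 01:00:00
    @active_end_time      = 050000; -- 05:00:00

-- Attach the job to the local server so SQL Server Agent will run it.
EXEC dbo.sp_add_jobserver @job_name = N'Purge Old Orders - 1', @server_name = N'(LOCAL)';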
Step-5: Stability… As I indicated earlier, every patient has its own prescription; you will have to use trial & error to figure out your stability threshold. Here are some possible combinations and factors:
> Increase the chunk size and keep the # of jobs constant.
> Or, keep the chunk size at 1,000 or 10,000 and keep increasing the # of jobs until you find the yielding point.
> Change your queue model from storing OrderID ranges to some other fields (e.g., OrderDate ranges) based on your requirement.
Surgery: There are various other surgical operations to consider besides the guidelines mentioned above. These include, but are not limited to, the following (a small sketch of a batched, transactional delete follows the list):
> Bulk delete
> Indexes
> Transaction vs. Non-Transaction
> Transaction Log on delete
> Index size
> Cascading delete
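As one small illustration of the first few items, a batch-wise delete wrapped in its own short transaction keeps each transaction, and therefore each chunk of transaction log, small. This is a sketch, not a tuned solution; the batch size of 5,000 is arbitrary:

-- Delete in small batches so each transaction stays short. Under the SIMPLE
-- recovery model the log space can be reused between batches; under FULL,
-- regular log backups are needed to keep the log from growing.
DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    BEGIN TRANSACTION;

    DELETE TOP (5000)
    FROM dbo.[Order]
    WHERE OrderDate < DATEADD(MONTH, -9, GETDATE());

    SET @rows = @@ROWCOUNT;   -- 0 rows deleted means we are done

    COMMIT TRANSACTION;
END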
Conclusion
As mentioned earlier, this is more of a conceptual direction than a direct technical solution. There are various possibilities to explore in this direction, and this article just provides some guidelines on how to reduce the influencing factors while taking advantage of the resources available to you.
History/Motivation behind this article
I often come across requests from fellow developers claiming that the Enterprise edition is required for database maintenance and archive processes. This article simply sheds light on the fact that going for Enterprise solely for database maintenance doesn't give you much bang for the buck; however, if there is a greater need for data availability and ETL, then it may justify the value for money.