1. Understanding the concept

One of the main challenges to a distributed system is how to handle transactions across several servers and technologies.

What if your application connect for the same transactions to several web services or database servers and if any failure occurs, need a rollback of all that is been implemented?

How do you handle this also during an ETL workflow with several servers on several data centers?

I can give you many examples where transactions are not easy to handle. Noting that distributed transactions could also be dangerous and block the entire system if a rollback cannot occur


2. Do not mix technologies

One of the nightmares I encounter during my consulting life is the following example. Let’s assume that a stored procedure ran from a SQL server job does push from a linked server a large amount of data and then update several tables from the source database and finally bcp a bunch of data to flat files and send an email with an attached file with error (errors bcp’ed to flat files too).

During all of the process, a transaction block is designed with try catch block to roll it all back.

In some case the bcp portion (run with Xp_cmdshell) gets blocked by an OS that do not respond for various reasons (that could be a memory leak). The production supervisor decides to stop the job, triggering a rollback. Unfortunately he waits hours and the rollback does not occur and the main table where data were pushed on the destination server gets frozen by a lot of locks? The supervisor calls the DBA on-call that eventually fix the issue by ending the OS process that runs the bcp and was hanging on the SQL server box.

In this case, the main issue is the distribution of the transactions on several software and here the rollback is initiated by SQL server engine (by stopping the job) but has no permission to rollback the DOS cmd (the bcp) initiated by a process on the OS level.

What does this mean? Do not mix technologies! Because if everything works well, the transaction will be committed and nothing will fail (because the Dos cmd is actually already committed as the transaction scope within SQL server does not have any authority).

In this very case, it is much better to consider the tasks to be independent and play each step only if the previous step works. In this case, it is better to implement it within an ETL/Workflow system. Each step, you can constrain the success and failure and based on a log system, mark a rollback or not on the specific technology of the step that failed.

3. Keep it simple

Obviously, a distributed environment require a certain level of complexity when dealing with transactions but if you can make it simple, make it simple!

Try to identify what is really transactional and what is not for 2 reasons: one is for performance but one is also to make your transaction simple.

A good architect will always look for new technology, look at various ways to set up transactions on distributed environment but will use manly his common sense to make the system viable especially if one of the components fails.

4. Keep it scalable

The “keep it simple” paragraph is probably the hardest one to describe as by means, a transaction on a distributed environment is complex.

“Keep it scalable” is easier to explain and is plain common sense. Most of the developers would make it work! Most of the architect would make it work with the right environment setup and the right software layers.

For example, if the task is to purge a multi-million row tables on a SQL server platform. As the production environment has man y concurrencies, he will make sure to keep the transactions short by implementing/using indexes and by keeping the DML operations as small as possible (by packets usually).

5. Keep it manageable

Another thing to focus on is to make sure that complex system has a very clear documentation and technology that is easily to manage. Another classic mistake of many companies is to setup technology that they cannot manage.

Another one is to setup a system that is hard to maintain when there is a new feature to implement or hard to restart when there is a failure that blocked the system for a while. Both things need to be considered when implementing such system.

6. ETL system

What I like about ETL! There is one thing that I really like. It is the way you have to follow a strict workflow! Each task is independent of the transaction scope and therefore if a failure occurred, you need to setup the contingency plan within your ETL workflow. In SSIS, Event handlers make sure you can set up generic tasks for many different tasks, which is good.

I usually put together SQL tasks that I consider a transaction block (within the same stored procedure call). This allows me to rollbacks several DML operations at once.

However, when changing technology (like BCP, FTP, Email, Zip, etc...), there are handled outside the same transaction scope with their independent tasks.

My ETL flow then needed to set up contingency plans if needed for any error on any task.

Most of the type, the contingency plan is just to log the error and the system would make sure to alert the Production team and guide him to restart the system or fix the error to get the incriminated transaction victim to be back in the ETL process.

For specific tasks, you can eventually set up a rollback plan.


7. Development cycle

The main classic mistake that IT service do is believing that the development phase is the longest part of the project. Here are the main phases for me of a project:

1. Architectural studies: see if the project is possible, goes well within the current Information system

2. Architectural implementation: define how to setup servers and code layers to accommodate the needs

3. Infrastructure installation: install the necessary network and servers

4. Development: develop the different code layers (this task can be done during step 3) based on architectural implementation

5. Q&A phase: tests and validation of the code layers and load/stress tests

6. Production deployment: deploy on production

7. Production supervision: supervise for a period the deployment.

Based on my experience, one step that is too often ignored is the step 2, the architectural implementation.

It reminds a project. I was in charge of the architecture and the development (oh well, it happens). Phase 1 was fairly quick as it involved current system in place. Phase 2 could have been very fast as well but it took me about 2 weeks to really grasp all the functional concepts and render them feasible technically. The project manager answered most of the functional needs but technical limitations rendered the specs largely under-documented. Technical documentation was written by me but required a lot of changes or at least a lot of assumptions from the part of the functional specs, which were not précised enough, especially when errors were occurring (classic mistake as well).

The project manager was giving me 2 months to deploy on production the specific system (it was a master data management system) and was worry about the fact that I spent 2 weeks on step 2.

Well thanks to it, the development phase took 1 week (he thought it would take 3 weeks).  The Q&A phase took 1 week (the project manager thought it would take 2 weeks) and the production phase took 1 day to deploy and 1 week to supervise.

Even though the project manager estimated the step 2 to last only 1 week instead of 2 weeks, he was agreeably surprised that the project overall took 5 weeks instead of the previously planned 8 weeks.

Since the deployment, there were no bugs fixing to implement. Only new features.

I wanted to spend time explaining this story because I think this is why I wanted to be an Architect. I believe that development starts way too quickly and main issues are coming from a bad architecture and it is especially true in a distributed environment with distributed transactions!