Musings on optimizing your OLAP processing in SSAS: measures and partitions
There are many ways to set up your cube but there are definite best practices to implement if you want to scale your cube. The first point will just remind us that a cube is an analytical tool and should not be used as a source of detailed report. Drillthrough utilization is according to me to avoid as this mislead the users to the purpose of a cube.
Your cube is an analytical tool
Main issues that we encounter if the misleading use of OLAP cube. Because end users see potentials and look at the data onto Excel (that is very friendly for them to set up reports), they sometimes believe that details can be reported by doing MDX queries. Let’s take a simple classical example. We have a bunch of orders from an Ecommerce. One can analyse the ordered sales and quantity per day, month or year but we cannot set up a report per order or customer! Well technically you can do all but the cube is not meant to do this and defeat the pupose of aggregation designs for that! A cube should be an analytical tool! You will solve just by these principles probably 50% of performance issues!
Avoid complex measures if you can
By complex measures, I mostly refer to measures that cannot be well aggregated such as Distinct count or last child measures. Sums, Counts are the measures that should be use the most as they are easily aggregatable. I am not saying not to use the other measure aggregates, I am just advising not to do them if not necessary. Your processing will then be optimized because your aggregation designs would be much easier for the engine to set. That is another 20-30% easy increase in performance there.
Partition your cube
Microsoft Best practice advises to partition your measure groups when it reaches 200 million rows. Microsoft also advise to avoid small partitions. Well, both are right and I strongly advise to partition your measure groups for the following reasons. The first one is to ease your processing. By separating your facts, you avoid to reprocess old historical facts that would never change. However those facts still remain in your cube as they are part of your analysis and mining strategies. I have many clients with million rows facts. There are many ways to optimize performance and one of them is to aggregate into fact tables granularities that are not useful. For example, we can aggregate order sales by day and product just because we would never analyze by customer. However, this is risky because your sales and marketing users might require additional dimensions such as the Age of the customer at the time of the order and your aggregate would then need to be recalculated. In other words, sure you can optimize your cube by having less fact rows but in general I advise more towards partitioning your fact tables and generally by date or incremental ids. By doing so, your daily, hourly processing would then be focus on your current partitions. Even with proactive caching it is nice to have a way to set up incremental processing on a specific partition.
Set up your cube partitioning based on life cycle of your facts
This is especially true when it comes to transactions, which is usually the main measures source of any cube. A transaction has always a life cycle, whether is it a one-day cycle, one-month cycle or one-year cycle. Rarely, your transaction can change at any time or if it does, this means your system has flaws. Your transaction should follow a clear deterministic life cycle. Otherwise you might have “Zombie” transactions in your system. Let’s take again the example of an order in an Ecommerce system. I will take a regular case. Your order is passed the same day you pay, then is fulfilled within the month and then can be reimbursed/return within the last 90 days. After 90 days, the order is closed and cannot be touched. Exceptions should be made only by manually requesting to the IT people to reopen the order status. Then we do statistics and we notice on this Ecommerce that the orders are 100% created, 90% paid, 70% fulfilled and 10% reimbursed within the 90 days and then 0.1% changed after 90 days. Of course we can say that incremental processing is not feasible due to the fact that orders changed at any time but this would prevent the Sales team to have a fair amount of analysis in almost real time. What I would do in the case of this Ecommerce is run incremental processing one every hour on the current partition, run a current partition processing daily (and set the partition to be worth a month of data) and then do a full partition processing daily of the previous three months. Finally I would merge the monthly partitions if they are small, every year, for the Year minus 2 (or merge in separate partitions if a year worth of data is more than 200 million rows). Finally, I would reprocess the whole thing once a month if not too big and if it is, I would process fully partitions up to 1 year old. Updating dimensions is based on how frequent the dimensions change but as we re process the current partition daily, I would just update dimensions usually once a day just before the partition processing.
This way, it would make the creation of orders appears within the hour on your cube (without proactive caching). The payment usually would appear within the hour but if not, it would appear the next day. The fulfillment and returns/reimbursement and order closing status would appear the next day as well.
Again, there is no need to be absolutely accurate for the real time because this is an analysis tool. If people likes to be accurate about the a specific order status, an OLAP cube is not the tool to use. A simple OLTP query on the production database for a specific order detail is what you need. As for how many orders are still in open status, one can do an asynchroneous query (which would be almost similar to our cube) or a synchroneous query on the production system (if this system is well indexed/ status partitioned/sharded).
Set up your cube processing schedule
You can then automate your processing by setting a small table that would gather all necessary information for your Measure group partitions (Cube name, measure group name, Partition Id, Minimum and maximum values for your indremental processing, query binding source). Once you have done that. You can use SSIS to set up all your processing needs.