Data density: a method for scoring your database

A database is a container for data. Its main purpose is to store data so that we can read and write it efficiently: keeping the data secure, reading it in its most up-to-date state, writing it in a well-organized way, etc…
Back in the day, databases were quite small and concurrency or performance issues were minor. Now it is the opposite: these are huge issues, and we often call DBAs to the rescue for that matter. PCI and other audit compliance requirements add the security of the data to the mix, which was not a major concern back then either.

Obviously there are many issues a DBA faces, but I think it comes down to one main issue: density. Data density is much like population density. Public services often have difficulty keeping post offices open in rural areas simply because the need for them is small or scattered. It is a luxury! Governments would rather manage medium and large cities: it is easier to deliver services, easier to clean streets, easier to deal with things. Of course, public services still need to be organized, but they get a lot more service for the buck spent.

Well, in a database, one can say the same thing. If the data in your databases were scattered all over the physical disk, you would have a much harder time managing and cleaning it. I am not talking about having thousands of tables and indexes; I am talking about having data spread all over your physical or logical storage space. For example, a table might gather information about orders, but the orders would be all over the place, with the data split everywhere.

The first problem is that you would not know how to index your data if you did not know how the orders were coming in. Fortunately, you would probably index your orders by an identifier, and then by the date and the user who placed the order. But let us say, for the sake of the example, that you did not know: the orders would be scattered across the data map of your database. In other words, unmanageable.
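As a hedged illustration (the Orders table and its columns are hypothetical, not taken from any specific system), the usual indexing strategy described above might look like this:

    -- Hypothetical Orders table: clustered on a monotonically increasing identifier,
    -- so new rows land at the end of the data map instead of splitting pages.
    CREATE TABLE dbo.Orders
    (
        OrderId   int IDENTITY(1,1) NOT NULL,
        OrderDate datetime2(0)      NOT NULL,
        UserId    int               NOT NULL,
        Amount    decimal(10,2)     NOT NULL,
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId)
    );

    -- Secondary access path: by date and ordering user, as described above.
    CREATE NONCLUSTERED INDEX IX_Orders_OrderDate_UserId
        ON dbo.Orders (OrderDate, UserId);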

The second problem is that you would store a lot of empty space because of the page splits in all your indexes. Storage would be fragmented and you would need more of it.

The third problem is that you would spend a lot of resources just on index maintenance during DML operations or reindexing.

The fourth problem is that performance would degrade regularly because of the instability of your indexes and the cost of maintaining them. Concurrency locks would be held longer, transactions would take longer to complete, etc…

All bad stuff. But what is common to all these issues is DENSITY. Ah!

What do I mean by DENSITY? Simple definition: data density, for me, is how much data I can fit in the same physical block of storage space. That is it. We do not need big words. DBAs and architects would ask: what is the depth of your indexes, how many page splits do they perform, what is the chronic fragmentation, etc….

And we could go on and on with different terms. They are all valid, and to be a good DBA/architect you need to know all these concepts, because you are usually called in when the damage is done.
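As a hedged sketch, on SQL Server those classic metrics (index depth, fragmentation, page fullness) can be inspected with sys.dm_db_index_physical_stats; the dbo.Orders name below is only the hypothetical table from earlier:

    -- Index depth, fragmentation and page fullness for one (hypothetical) table.
    -- 'SAMPLED' mode is needed to populate avg_page_space_used_in_percent.
    SELECT  i.name AS index_name,
            ips.index_depth,
            ips.avg_fragmentation_in_percent,
            ips.avg_page_space_used_in_percent,
            ips.page_count
    FROM    sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.Orders'), NULL, NULL, 'SAMPLED') AS ips
    JOIN    sys.indexes AS i
            ON  i.object_id = ips.object_id
            AND i.index_id  = ips.index_id;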

BUT if you are called before the damage is done, the answer is plain and simple: try to fit as much data as possible in the smallest storage space.

Obviously, you do not want to buy a cabinet if what you ultimately need to store is a car. The same goes for an index or a table. You have considerations such as fill factors, covering indexes, etc… But all of those are just strategies that, in the long run, serve the same concept: density.

Yes indeed! The goal of the fill factor is to say: I want to make sure I leave enough room when I rebuild or create an index, so that there are fewer page splits and less space wasted overall, and my data stays nicely ordered together and not too fragmented.
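A minimal sketch, reusing the hypothetical index from earlier and a fill factor of 90 (the value is illustrative, not a recommendation):

    -- Rebuild the index leaving roughly 10% free space on each leaf page,
    -- so future inserts and updates cause fewer page splits.
    ALTER INDEX IX_Orders_OrderDate_UserId ON dbo.Orders
        REBUILD WITH (FILLFACTOR = 90);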

The goal of a covering index is to limit the number of indexes you have to create on a table, so that you use less space; DML operations are then cheaper overall and reads are more efficient than if you had created several indexes instead of the one covering index!
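For instance, as a hedged sketch on the hypothetical Orders table, one could replace the earlier plain index with a covering version so a report reading amounts by date and user is answered from the index alone:

    -- Replace the plain index with a covering version: the included column
    -- lets the query be served from the index, with no key lookups.
    DROP INDEX IF EXISTS IX_Orders_OrderDate_UserId ON dbo.Orders;

    CREATE NONCLUSTERED INDEX IX_Orders_OrderDate_UserId
        ON dbo.Orders (OrderDate, UserId)
        INCLUDE (Amount);

    -- Covered query: every referenced column lives in the index.
    SELECT OrderDate, UserId, Amount
    FROM   dbo.Orders
    WHERE  OrderDate >= '2024-01-01';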

So my assumption, maybe not shared by many DBAs, is this one: DENSITY is the key!

There are exceptions of course, otherwise the work of a DB architect or DBA would be boring. Front-end e-commerce databases, for instance, are not in the mix. Why? Because the main concern there is obviously not storage, it is performance and lock concurrency on a volatile database. So NoSQL is the best deal; columnstore indexes in some cases, and denormalization in others, are the best options, especially when you work with large cache stores.

No, I am talking about when you need to access the data in real time and in a rich way: for example, a CRM database, a logistics database, payment transactions, etc…

My first principle for a good production database is to keep only the data you need in the tables or partitions you care about. I have seen so many client databases with the entire history of orders or transactions sitting in the production database. This fits the principle: obviously, if you have a density issue in a large table, the issue gets smaller as the table gets smaller.
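As a hedged sketch of that principle (the archive table and the two-year cutoff are hypothetical), old rows can be moved out of the hot table in one atomic statement:

    -- Move orders older than two years out of the production table.
    -- dbo.OrdersArchive is assumed to exist with matching, non-identity columns.
    DELETE FROM dbo.Orders
    OUTPUT deleted.OrderId, deleted.OrderDate, deleted.UserId, deleted.Amount
        INTO dbo.OrdersArchive (OrderId, OrderDate, UserId, Amount)
    WHERE OrderDate < DATEADD(YEAR, -2, SYSDATETIME());

In practice you would run this in batches (a TOP (n) loop) to keep transactions and locks short.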

My second principle is to make sure your table does not have hundreds of columns, or mostly empty "large" data-typed columns such as text, varchar(2000), xml… In other words, when you write or read data on such tables, key lookups and index updates are faster. Developers usually think: why care about data types, why not use an int instead of a smallint? This is where the architect/DBA kicks in! Obviously, a key-value type of table would be pretty dense, since there are not many columns (assuming the clustered index roughly follows the order in which the data is inserted).
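A minimal, hypothetical illustration of the data-type point: the narrow version of the same attribute table typically fits far more rows on each 8 KB page than the wide one.

    -- Wide version: generous variable-length types, no clustering key.
    CREATE TABLE dbo.OrderAttributeWide
    (
        OrderId        int            NOT NULL,
        AttributeName  nvarchar(200)  NOT NULL,
        AttributeValue varchar(2000)  NULL,
        Comments       xml            NULL
    );

    -- Narrow key-value version: small fixed-width types, clustered in insert order,
    -- so many more rows fit on each 8 KB page.
    CREATE TABLE dbo.OrderAttribute
    (
        OrderId        int          NOT NULL,
        AttributeId    smallint     NOT NULL,
        AttributeValue varchar(100) NULL,
        CONSTRAINT PK_OrderAttribute PRIMARY KEY CLUSTERED (OrderId, AttributeId)
    );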

The rest (putting the right indexes in place, not too many of them, avoiding subselects inside SELECT statements, listing the columns in your SELECT instead of using "*", etc…) makes perfect sense and usually ends up following the same principle of DENSITY.

I would not say a database is only that, of course. There is set-based programming, where the database is very useful, for example for rating and billing of transactions or for direct marketing. There are transactions to deal with on the various DML operations you set up. There is security compliance to handle, and your database engine can assist you with that (logins, permissions, database, backup and data encryption). Good for us.

But overall, I still believe the most important thing is DENSITY!

So I decided to set up, for all my audits, a DENSITY scorecard for every database, alongside all the server settings I need to review when I audit a client's production environment.

If you execute this statement on a specific table:
EXEC sp_spaceused '<Schema>.<Table>'
you obtain very useful information, such as the number of rows and the total size allocated in KB. There are other ways to get the size, using DMVs, but I like this one because it is pretty simple.

Then you build a stored procedure that runs the statement on all tables and stores the results in one table.
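A minimal sketch of that collection step, assuming nothing beyond sp_spaceused itself (the temp table and cursor are just one way to do it; an audit stored procedure would write to a permanent table instead):

    -- Collect sp_spaceused output for every user table into one results table.
    IF OBJECT_ID('tempdb..#SpaceUsed') IS NOT NULL
        DROP TABLE #SpaceUsed;

    CREATE TABLE #SpaceUsed
    (
        TableName varchar(300),
        RowCnt    bigint,
        Reserved  varchar(50),   -- e.g. '123456 KB'
        Data      varchar(50),
        IndexSize varchar(50),
        Unused    varchar(50)
    );

    DECLARE @TableName nvarchar(300);

    DECLARE TableCursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT QUOTENAME(s.name) + N'.' + QUOTENAME(t.name)
        FROM sys.tables AS t
        JOIN sys.schemas AS s ON s.schema_id = t.schema_id;

    OPEN TableCursor;
    FETCH NEXT FROM TableCursor INTO @TableName;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- sp_spaceused returns: name, rows, reserved, data, index_size, unused
        INSERT INTO #SpaceUsed
            EXEC sp_spaceused @objname = @TableName;

        FETCH NEXT FROM TableCursor INTO @TableName;
    END

    CLOSE TableCursor;
    DEALLOCATE TableCursor;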

You then calculate the number of data pages from the total allocated space, knowing that a data page is 8 kilobytes, which is the minimum amount SQL Server will read. It is therefore important to know the number of data pages. You then divide the number of rows by the number of data pages to get a rows-per-page ratio. The higher the number, the denser your table (this includes all its indexes).
Obviously, you can do the same calculation per index (using DMVs again), but I like to keep it simple at the table level.
You can then rank the tables from low density to high density, and you can suspect issues on the tables with low density.
You can also scorecard the whole database by calculating the same ratio over the summed numbers:

Sum(Rows) / ( Sum(Total Allocated Storage in KB) / 8 ).
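Continuing the sketch above (the column names are the ones assumed in the collection step), the per-table ratio and the overall scorecard can be computed like this:

    -- Per-table density: rows per 8 KB page, lowest (the suspects) first.
    SELECT  TableName,
            RowCnt,
            CAST(REPLACE(Reserved, ' KB', '') AS bigint) / 8 AS DataPages,
            RowCnt * 1.0
              / NULLIF(CAST(REPLACE(Reserved, ' KB', '') AS bigint) / 8, 0) AS RowsPerPage
    FROM    #SpaceUsed
    ORDER BY RowsPerPage ASC;

    -- Database-level scorecard: Sum(Rows) / ( Sum(Allocated KB) / 8 ).
    SELECT  SUM(RowCnt) * 1.0
              / ( SUM(CAST(REPLACE(Reserved, ' KB', '') AS bigint)) / 8.0 ) AS DatabaseRowsPerPage
    FROM    #SpaceUsed;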

You can also remove the outliers (the tables that are far too dense, or not dense enough, when they are really far off the median value). However, if their weight in the overall average is big, I would not advise it.

In my experience, when table density was higher than 10 rows per page, we were usually in the clear regarding performance and storage. We usually ran into issues when the density was between 1 and 3 rows per page.
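For intuition (rough arithmetic, not a measurement): an 8 KB page is 8,192 bytes, so 10 rows per page means each row averages roughly 800 bytes including overhead, while 1 to 3 rows per page means rows averaging several kilobytes each, which usually points to wide rows, oversized data types or heavy fragmentation.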

This scoring methodology allowed us to pinpoint the tables with storage issues and, overall, the databases with low density.

As stated above, there are many ways to improve density, and the benefits of improving it usually show up quickly in performance and storage.