I mentioned in my later blog posts of 2008 that one of my focal points for 2009 would relate to databases. After nine years of troubleshooting ColdFusion and JRun applications, two near-absolutes stand out.
- Most application problems either cause, or are caused by, dependency issues.
- Most dependency issues relate to databases.
The reason for this is that in most web application development we are interacting with dynamic data: data that is changing, and/or being changed by what the application code is doing. In this article, the first of many to come in 2009, I am starting with the basics and introducing overall concepts that I will amplify in subsequent articles.
Flying home from England I began to think of data very basically, almost as a commodity, and it occurred to me that we should start to think about the distance of data. This is something that is rarely considered, either at the inception of planning an application or in its ongoing maintenance. So what do I mean by the distance of data?
First, I will explain what I almost always see in CF-JRun applications. There is typically a database that stores the data the business needs, held in files on a hard drive. Hopefully there is some form of redundancy: multiple hard drives in a RAID configuration, perhaps multiple database servers. So at this stage the data we need is sitting in database files on the hard drive(s). Just as in life generally, distance relates to time; in considering data retrieval and update we can assume that, all else being equal, the more distant the data is, the longer it will take to be retrieved and/or updated.

In the most basic install the CF-web server and the database reside on the same physical server; granted, this is rarer than it used to be, but I still see it occasionally. In this case the distance to the data the CF-web server needs is shorter than if they were on two separate physical servers. We can reduce that distance further by holding regularly used data in memory, in a cache. Caching is the most common method I have found for reducing the distance to the data.
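To make that concrete, here is a minimal sketch of query caching in CFML. The datasource name, table, and columns are hypothetical, and the one-hour cache window is just an example; the point is simply that a cached query is answered from the CF server's own memory instead of travelling back out to the database.

```cfm
<!--- Hypothetical example: hold this query's result in ColdFusion's memory
      for one hour, so repeat requests never have to reach the database. --->
<cfquery name="getActiveProducts"
         datasource="myDSN"
         cachedwithin="#createTimeSpan(0, 1, 0, 0)#">
    SELECT  product_id, product_name, unit_price
    FROM    products
    WHERE   active = 1
</cfquery>
```

The trade-off, of course, is staleness: cached data can lag behind the database, so this suits reference data that changes rarely far better than data that is constantly being updated.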
As time passes the database grows, in most cases, as data is added. We are now increasing the distance to the data again, because the database engine has to work harder to locate the required records or recordsets. I have seen tables with hundreds of millions of records spanning many years where no more than 5% of that data is accessed on a regular basis. In a case like that we are increasing the distance to the data minute by minute, and archiving and partitioning can help enormously, yet I rarely find them used. By effectively archiving and partitioning we can reduce the distance to the data we need regularly, by moving the older, less-needed data to other storage areas.

In this blog piece I introduced a concept called "data distance", and I will be amplifying that by getting more specific about examples and solutions as 2009 rolls on. As a small first taste, the sketch below shows roughly what a simple archiving job can look like.
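This is only a sketch, assuming hypothetical orders and orders_archive tables and a two-year retention window; a real job would also have to consider indexes, foreign keys, and how the archive itself is stored.

```cfm
<!--- Hypothetical archiving job: move orders older than two years into an
      archive table so the active table stays small and "close". --->
<cfset cutoffDate = dateAdd("yyyy", -2, now())>

<cftransaction>
    <cfquery datasource="myDSN">
        INSERT INTO orders_archive (order_id, customer_id, order_date, order_total)
        SELECT  order_id, customer_id, order_date, order_total
        FROM    orders
        WHERE   order_date < <cfqueryparam cfsqltype="cf_sql_timestamp" value="#cutoffDate#">
    </cfquery>

    <cfquery datasource="myDSN">
        DELETE FROM orders
        WHERE  order_date < <cfqueryparam cfsqltype="cf_sql_timestamp" value="#cutoffDate#">
    </cfquery>
</cftransaction>
```

Run as a ColdFusion scheduled task, or better still as a job inside the database itself, something along these lines keeps the hot table small; partitioning by date achieves a similar effect within the database engine without moving rows at all.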