I have been working with Fannie Mae (the failed mortgage giant) these days, on the Database Architecture team as one of its senior members, and came across an interesting proposal. The P&E (Performance and Engineering) team has submitted a white paper on something they call ‘Data Co-Location’, apparently referring to clustering data according to need. The solution was partially implemented in a Perl/Shell/SQL/PL/SQL combination, and I was authorized to re-engineer this beast to a meaningful and logical conclusion.
The theory sounded perfect, but the implementation was not child’s play. Complexity aside, the requirements were vague at the lower level. Nevertheless, why fear when I am here, right? Anyway, the juicy part is not how I got to the solution but what all the talk is about.
In these days of Partitioning, Parallel Processing, EMC Disk Arrays and Striping, does the way we store data really make a difference? Apparently it does. E. F. Codd had one of his principles set in stone: “In a Relational Database, the order of tuples or columns is not important”. He was absolutely right, but only theoretically. When milliseconds make a difference to your system’s performance, you need to think outside the box, and Clustering does exactly that. When data is consistently read in a specific order, it makes sense to store it in that same order: rows that are used together then sit in the same blocks, so fewer blocks have to be read to fetch them.
Once we know that data has reached a stage where read-only operations dominate manipulations, it makes even more sense to set the clustering right, so that the whole thing runs like a well-oiled machine. All those Oracle gurus out there know about the ‘Clustering Factor’ and its importance as well. I can guarantee that the gain in performance outweighs the effort to accomplish the task. When data is ‘ordered’, it is well understood by the so-called performance paraphernalia of Statistics, Explain Plans, I/O, Buffer Gets, Cache Hits, etc.
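To make the ‘Clustering Factor’ idea a little more concrete, here is a minimal Python sketch. It is not Oracle's implementation, only a toy model of the usual description: walk the rows in index-key order and count how many times the table block changes. The table size, the rows-per-block figure and the function names are all assumptions made up for illustration.

```python
import random


def clustering_factor(blocks_in_key_order):
    """Count block switches while walking rows in index-key order."""
    switches = 0
    previous_block = None
    for block in blocks_in_key_order:
        if block != previous_block:
            switches += 1
            previous_block = block
    return switches


def blocks_in_key_order(keys_in_storage_order, rows_per_block):
    """Return the block of each row, listed in ascending key order.

    Toy storage model (an assumption): the row at storage position i
    lives in block i // rows_per_block.
    """
    placed = [
        (key, position // rows_per_block)
        for position, key in enumerate(keys_in_storage_order)
    ]
    placed.sort(key=lambda pair: pair[0])  # simulate an index range scan
    return [block for _, block in placed]


# Hypothetical table: 10,000 unique keys, 100 rows per block.
ROWS = 10_000
ROWS_PER_BLOCK = 100

rng = random.Random(42)
scattered = list(range(ROWS))
rng.shuffle(scattered)          # rows stored in arrival (random) order
ordered = sorted(scattered)     # rows stored in key order

cf_scattered = clustering_factor(blocks_in_key_order(scattered, ROWS_PER_BLOCK))
cf_ordered = clustering_factor(blocks_in_key_order(ordered, ROWS_PER_BLOCK))

print("blocks in table :", ROWS // ROWS_PER_BLOCK)
print("CF, scattered   :", cf_scattered)
print("CF, ordered     :", cf_ordered)
```

In this toy model the ordered layout gives a clustering factor equal to the number of blocks, while the scattered layout pushes it towards the number of rows. That gap is what the optimizer sees when it decides whether an index range scan is worth it.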
Think outside the box, cluster the clutter.