I am currently in the process of writing a small business SBIR grant to the National Institute of Health. The title is "Investigator-driven Data Transformations and Integration", and the idea is create a simple system for transforming data so that a researcher can organize information according to their needs and integrate multiple data sources into a common schema.
These type of systems often go by the acronym ETL (Extract, Transform, and Load), and long ago I liked tools that Informatica was selling. Of course "Extraction" is querying or parsing (rather mundane) and Loading usually refers to relational database loading (really boring, but attention must be paid to it). The Transformation part is whats interesting, of course, and tools for doing this have been around in some form since the first hard drive I assume.
Data transformations are important when loading new data or reorganizing your exisiting data (schema evolution), but the types of transformations are mostly the same for both.
Here are some common data transformations
- adding, removing, and renaming fields
- atomic data translation (e.g. string to number)
- mapping terms (e.g. string to controlled vocabulary string)
- structural rearrangements (e.g. string to list, sub-record creation)
- inferring references to objects (e.g. string becomes explicit foreign key)
Our approach to this problem rests on two important observations:
- Composable transformation primitives can assembled by a non-technical person using a simple interactive interface
- Most of the difficulty in transforming data has to do with managing exceptional data
So were proposing to build a powerful system that makes it very easy to author transformation progams and deal with exceptions that come up in the process of running them.
Working with my colleague Anthony on material for the background of this approach, we both thought it would be useful to compare our approach to Federated Database Systems. The contrast is that we are making tools that result in an off-line physical integration of data (often called a "warehouse") instead of a dynamic query-based integration, which is common in Federated Databases.
Both of us have been part of teams that built Federated Database Systems (Kleisli and OPM), and both of us have also spent the last 10 years or so building real systems that involved integrating biomedical databases. So its not surprising that we came the same conclusions about the limited utility of Federated systems. Anthony admits that they are good for prototyping a data integration project, but that's about it, and I agree.
Here are the reasons that Federated systems are limited:
- You often need to transform (filter, clean, fix) data before you can integrate it, and this is hard to do on the fly
Query languages used by federated systems aren't general designed for data transformation and they usually aren't optimized for it either
- Physically integrated warehouses allow you to find all/most of the exceptions earlier than Federated systems
Exceptions cause programs to break or give unexpect results and it is often best to look at a lot of data upfront to find out where they are hiding and test your assumptions.
- Federated querying is slow
OK, that's a cheap shot, but its true and should probably be on the top of the list. They are slower because federated systems have a more difficult time using optimizations that exploit data indexes. Of course, physical warehouses take a long time to build, but the user doesn't see that.
- Federate systems require service-level agreements between data providers
The agreement to provide a query service that will be used (and abused) by someone else requires far more commitment than just getting bulk access to the data for physical integration. The exception to this would be someone trying to control access by only giving query interfaces, and this is likely to be someone who will get mad when you fire off long running federated queries or try to subvert them with robots that download all the data anyway.