data flattening and data unflattening

Data Flattening and Data Unflattening

What is data flattening and data unflattening

Data flattening usually refers to the act of flattening semi-structured data, such as name-value pairs in JSON, into separate columns where the name becomes the column name that holds the values in the rows. Data unflattening is the opposite; adding nested structure to relational data.

If flattening does not sound great, you have good intuition. Flattening is mostly done to put data in a format that a relational data warehouse can use natively, and make sure queries perform well. But you do lose information; it’s like having a shadow instead of the real object. When you flatten you lose information. This is one reason why a data lake should store the full, raw structure. You cannot unflatten data unless you have this extra information somewhere else.

Flattening data also limits your analytics. Often you want to be able to walk the actual JSON structure. When you flatten JSON, those analytics become nested queries, which is a horribly slow query to run on any relational database.

If you have mostly relational data, then flattening some data may make good sense. But if you have a lot, or plan to have a lot, you need native support for both relational and semi-structured data in the same data warehouse. For a great example of semi-structured data support you can read this whitepaper on high-performance semi-structured analytics, and what it takes.