What Is Query Federation?
The concept of query federation emerged more than 20 years ago, when computer scientists began designing systems for the interoperability of heterogeneous databases. More recently, the rise of big data has renewed interest in the area.
In general terms, a federation is a grouping in which each member retains its autonomy while granting certain rights to a central entity that sits above it.
A federation of data silos lets clients access already aggregated data without worrying about how the various data sources are linked. Query federation, then, is a feature that lets clients access remote, heterogeneous data stored in different locations through a single query. More specifically, it is a technology that gives programmers virtualized access to data from different sources. Its aim is to make it easier for developers to query existing data on remote systems by joining tables scattered across different physical locations.
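To make the idea of a single query spanning several sources concrete, here is a minimal sketch. It assumes SQLite's `ATTACH DATABASE` as a stand-in for a federation engine: the two attached in-memory databases play the role of separate sources, and the names `crm`, `billing`, `customers`, and `invoices` are hypothetical. A real federated engine would reach remote systems over the network rather than attached local databases.

```python
import sqlite3

# One connection plays the role of the federated engine.
conn = sqlite3.connect(":memory:")

# Each attached in-memory database stands in for an independent source.
# (Hypothetical names; a real engine would connect to remote systems.)
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO billing.invoices VALUES (?, ?)",
    [(1, 120.0), (1, 30.0), (2, 75.5)],
)


def total_per_customer(connection):
    """Join tables living in two different databases with one SQL query."""
    query = """
        SELECT c.name, SUM(i.amount) AS total
        FROM crm.customers AS c
        JOIN billing.invoices AS i ON i.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
    """
    return connection.execute(query).fetchall()


totals = total_per_customer(conn)
print(totals)
```

The client writes one ordinary SQL statement and only qualifies each table with the name of the source it belongs to; where each table physically lives is the engine's concern.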
Experts define a virtual database, in the data federation sense, as a data structure that contains metadata about remote data rather than the data itself. The traditional alternative is to build a separate, consolidated on-premises data warehouse, but modern vendors work around this by offering data federation solutions that act as a "digital reference resource" from which data in multiple locations can be retrieved on demand.
Various names are used for approaches built on data federation technology, including data virtualization, information as a service, and enterprise information integration (EII). These offerings appeal to companies with particular needs, such as data governance.
Users of data systems that offer federated queries can create virtual databases by exposing remote data through a middleware layer. With the most advanced solutions, creating a virtualized database can be as simple as defining connectors to the different data sources.
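The notion of a virtual database that stores metadata and connectors instead of data can be sketched as follows. This is an illustrative toy under stated assumptions, not the design of any particular product: the classes `VirtualTable` and `VirtualDatabase`, their methods, and the `warehouse` source are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class VirtualTable:
    """Metadata only: which source holds the table and what its columns are."""
    source: str
    remote_name: str
    columns: List[str]


@dataclass
class VirtualDatabase:
    """Holds connectors and table metadata, never the rows themselves."""
    connectors: Dict[str, Callable[[str], Iterable[Tuple]]] = field(default_factory=dict)
    tables: Dict[str, VirtualTable] = field(default_factory=dict)

    def register_source(self, name: str, fetch: Callable[[str], Iterable[Tuple]]) -> None:
        # A connector is any callable that returns rows for a remote table name.
        self.connectors[name] = fetch

    def register_table(self, alias: str, table: VirtualTable) -> None:
        if table.source not in self.connectors:
            raise KeyError(f"unknown source: {table.source}")
        self.tables[alias] = table

    def scan(self, alias: str) -> List[dict]:
        # Rows are fetched from the source at query time, not cached here.
        meta = self.tables[alias]
        fetch = self.connectors[meta.source]
        return [dict(zip(meta.columns, row)) for row in fetch(meta.remote_name)]


# A plain dict stands in for a remote system (hypothetical data).
warehouse_rows = {"orders_2024": [(1, "book"), (2, "lamp")]}

vdb = VirtualDatabase()
vdb.register_source("warehouse", lambda table: warehouse_rows[table])
vdb.register_table("orders", VirtualTable("warehouse", "orders_2024", ["id", "item"]))

orders = vdb.scan("orders")
print(orders)
```

Because the virtual database keeps only the mapping from `orders` to its source, a later change in the source shows up on the next `scan` without any synchronization step.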
Several federation technologies are available on the market today, and their capabilities vary from one implementation to the next. Well-known SQL-based examples include IBM Big SQL, Amazon Athena Federated Query (which relies on AWS Lambda), Amazon Redshift (which enabled federated queries several years ago), Google BigQuery Cloud SQL federation (for both PostgreSQL and MySQL), and Apache Hive.
A good federated query engine should support the query features and specific capabilities of each underlying data source, and should be able to integrate as many kinds of database platforms and data infrastructures as possible.
Let’s now look at the main advantages of relying on a query federation engine to aggregate data without worrying about how the various sources are linked:
- Programmers who rely on query federation get a single, unified view of the data and can work with it as if it came from a single source.
- Data consumers do not need to learn different query dialects or interfaces; plain SQL syntax is enough.
- Data is accessed in real time. Because no caching mechanism or physical layer sits between the federated engine and the various sources of truth, each query reads the current state of the sources.
- Query federation is a great solution when physical data consolidation is impractical. Moving a huge amount of data from one source to another could be disruptive to existing applications.
- When working with big data, moving huge amounts of information from one place to another can be costly. Virtualizing aggregations means not moving data, thus resulting in good savings.
- Simplified querying of data from different sources also means simplified maintenance. Keeping physical copies of the data requires maintaining many ETL processes, whereas federation only requires maintaining links to the external sources.
- Virtualization of specific database parts could enhance the separation of concerns. It could be easier to assign specific grants to specific classes of users in an ad-hoc virtualized database.
- It provides a centralized access point for authorization across every linked data source, so there is no need to manage several different authentication mechanisms.
As with any virtualization layer, query federation comes with drawbacks and challenges, and it may not be ideal for every situation.
- Opening up existing data sources to query federation engines can increase the load on the source systems. System administrators should forecast client usage to ensure there is enough network bandwidth, processing power, and other system resources for the workload.
- Query federation doesn’t address data quality or data cleansing. Aggregated virtualized data is brought to the middleware system as is, without worrying about the quality of that data.
- Federated query engines are typically read-only: it is not possible to write to the external data source through them.
- Programmers can find it difficult to know where the underlying data is stored. Query federation engines should make the origin of the data transparent.
- Data federation engines generally cannot keep pace with data warehouses on performance, and tuning them can be difficult because the execution plan of a query is split among different sources, each with its own proprietary engine. Developers should minimize data transfer, minimize joins between different data sources, and follow the implementation-specific best practices of each source's database engine.
- If one of the sources behind the federated query engine becomes unreachable or is moved elsewhere, its data is no longer available, because the engine does not keep a local copy of the data.
- Generally speaking, query federation over big data may fall short on performance, reliability, and data quality. Applied to this scenario, it is still a relatively new feature that is early in its adoption curve.
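The advice about minimizing data transfer between sources can be illustrated with a small experiment. The sketch below assumes two independent SQLite connections standing in for separate sources and a middleware that joins their rows in Python. The table names, the sample data, and the functions `naive_join` and `pushdown_join` are hypothetical, and the "rows transferred" count is only a rough proxy for network cost.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an independent in-memory database acting as one source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Source 1: 100 customers, every tenth one located in "IT" (hypothetical data).
customers = make_source(
    "CREATE TABLE customers (id INTEGER, country TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(i, "IT" if i % 10 == 0 else "US") for i in range(1, 101)],
)
# Source 2: one order per customer.
orders = make_source(
    "CREATE TABLE orders (customer_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?)",
    [(i, float(i)) for i in range(1, 101)],
)


def naive_join(country):
    """Pull both tables in full, then filter and join in the middleware."""
    cust = customers.execute("SELECT id, country FROM customers").fetchall()
    ords = orders.execute("SELECT customer_id, amount FROM orders").fetchall()
    wanted = {cid for cid, c in cust if c == country}
    result = sorted((cid, amt) for cid, amt in ords if cid in wanted)
    return result, len(cust) + len(ords)


def pushdown_join(country):
    """Push the filter to each source so only matching rows are transferred."""
    cust = customers.execute(
        "SELECT id FROM customers WHERE country = ?", (country,)
    ).fetchall()
    ids = [cid for (cid,) in cust]
    if not ids:
        return [], 0
    placeholders = ",".join("?" for _ in ids)
    ords = orders.execute(
        f"SELECT customer_id, amount FROM orders WHERE customer_id IN ({placeholders})",
        ids,
    ).fetchall()
    return sorted(ords), len(cust) + len(ords)


naive_rows, naive_cost = naive_join("IT")
pushed_rows, pushed_cost = pushdown_join("IT")
print(naive_cost, pushed_cost)
```

Both functions return the same joined rows, but the second asks each source to do the filtering and moves far fewer rows into the middleware. Real engines decide what can be pushed down per connector, so the achievable savings depend on the implementation.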