TLDR; In this multi-series blog post we will take a look at the GithubArchive dataset consisting of public events happening on GitHub. We will cover discovering the data using Streamlit and Jupyter and the Firebolt Python SDK, writing a small data app using the Firebolt JDBC driver as well as leveraging Apache Airflow workflows for keeping our data up-to-date. So stay tuned for all the blog posts!
- Part 1: Analyzing the GitHub Events Dataset using Firebolt - Querying and Ingestion
- Part 2: Using Jupyter for data exploration
- Part 3: Incremental Updates with Apache Airflow
- Part 4: Writing a data app using Java
Apache Airflow is one of the heavily used orchestration tools in the data engineering industry. Firebolt has a nice Apache Airflow provider and we will use it in this blog post to have a full end-to-end integration for our GitHub dataset and ensure that incremental updates keep flowing in. The Firebolt provider contains operators that allow to start and stop engines as well as execute SQL queries against configured Firebolt connections. This allows for a start-execute-stop execution of a Firebolt engine to reduce costs if you only require a certain type of engine during workflow execution.
If you prefer watching videos, check out the corresponding video on youtube.
Doing an incremental update in our Firebolt database requires two different workflows.
- Synchronize data into our S3 bucket
- Incremental update from the data in the S3 bucket
Both workflows are not tied or dependent on each other, as each workflow is supposed to figure out which data is required. For example the data in our S3 bucket is synced until three days ago, but the data in Firebolt is already three months old – this should not stop the synchronization into S3 in any way.
Local Airflow development with Docker and the Firebolt provider
It is required to create a custom Docker Airflow image including the Firebolt provider.
Before we can test out our workflows, we need to extend the Apache Airflow base image with the Firebolt Airflow provider. If you are running locally and testing with Docker, this is done in a few seconds. You need to create a Dockerfile with a requirements.txt that contains the dependency to the Firebolt Provider. This is what the requirements.txt looks like
With a custom Dockerfile like this, the provider is installed as part of the build
Now you can build this image and configure a tag for it
If you are running Airflow locally using the docker compose files from the official HOWTO, then you can spin up your Airflow setup with a custom image like this by setting the correct environment variable to the above configured tag:
When checking the admin interface you are able now to create a Firebolt connection. The following screenshot shows a successful setup. The advanced connection properties can contain an engine name that should be used for this connection.
Synchronizing S3 buckets
Now this is essentially a bread-and-butter task for data engineers. Getting data from one point to another and making sure the format or the directory structure is right.
The above sample is heavily shortened, so make sure to check the GitHub repository for the complete source. The PythonOperator() refers to a method named update_github_events_data() and that method is figuring out the latest file in the S3 bucket and starts downloading all the complete days from that date onwards. The most important optimization is to ensure that only one file gets downloaded locally, then uploaded and then removed, so that you do not run out of space on your local disk as you would need terabytes of data for the whole dataset.
Running incremental updates
The second workflow is doing all the magic with regards to indexing into Firebolt. There are a few steps to cater for
- Ensure a engine is up, that can write data
- Find out the current most up-to-date timestamp in the live dataset
- Do an incremental update of data that is more current than the just queried timestamp
- Optionally: Power down that engine again to save money
In order to ensure that the engine is up – or wait until it has started, there is a custom FireboltStartEngineOperator()
Once the engine is started (the operator is blocking), we now need to run two pieces of code. First retrieving the last update:
The second step requires some python calculations first. If there is a delta of days between the last update and yesterday, dynamically create a filter for the WHERE clause in the SQL statement that updates our data. That filter refers only to files that are newer than the start date. Once that filter is created it is appended to the query that is used to insert data into Firebolt like this
When the query is running in Airflow it will look like this, if the missing days in our data are 22nd, 23rd and 24th of January 2023.
Finally if you wanted to power down your engine again, you could run the FireboltStopEngineOperator() which waits until the engine is shut down.
This ends our little journey into Apache Airflow, make sure to check out the complete workflows and the setup in the GitHub repository.
Now with the infrastructure of constant synchronization in place, the next post will be about our first standalone data app using the Firebolt JDBC driver.
If you have any questions, head over to help.firebolt.io and ask!