TL;DR: In this multi-part blog series we take a look at the GithubArchive dataset, which consists of public events happening on GitHub. We will cover exploring the data using the Firebolt Python SDK, writing a small data app using the Firebolt JDBC driver, and using Apache Airflow workflows to keep our data up to date. So stay tuned for all the blog posts!
- Part 1: Analyzing the GitHub Events Dataset using Firebolt - Querying with Streamlit
- Part 2: Using Jupyter for data exploration
- Part 3: Incremental Updates with Apache Airflow
- Part 4: Writing a data app using Java
Jupyter has been around for more than a decade. It is still one of the go-to tools for notebooks, letting you keep a list of code snippets that are easy to present and easy for others to consume. Combining code with graphs makes demos way more interactive than just showing code snippets.
Using the Firebolt Python SDK, we can build such a demo with a fair share of graphs showing our GitHub events dataset.
The most important part is initializing the connection to Firebolt. If you manage dependencies with pipenv and start Jupyter via pipenv run jupyter-notebook, the .env file is read automatically. Storing your Firebolt credentials in that file, or retrieving them from a secret store with a tool like envchain, keeps account data out of your Python snippets.
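A minimal sketch of that setup might look as follows. It assumes a recent Firebolt Python SDK with client-credential authentication (`ClientCredentials`); older SDK versions authenticate differently. The environment variable names (`FIREBOLT_CLIENT_ID` and so on) and the two helper functions are placeholders chosen for this example, not names the SDK requires.

```python
import os

# Hypothetical variable names; use whatever your .env file or envchain namespace defines.
ENV_KEYS = {
    "client_id": "FIREBOLT_CLIENT_ID",
    "client_secret": "FIREBOLT_CLIENT_SECRET",
    "account_name": "FIREBOLT_ACCOUNT",
    "engine_name": "FIREBOLT_ENGINE",
    "database": "FIREBOLT_DATABASE",
}


def firebolt_settings(env=None):
    """Read the connection settings from the environment, failing early if one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_KEYS.values() if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_KEYS.items()}


def open_cursor(settings):
    """Connect to Firebolt and return a cursor (assumes firebolt-sdk 1.x)."""
    # Imported here so that reading the settings works without the SDK installed.
    from firebolt.client.auth import ClientCredentials
    from firebolt.db import connect

    connection = connect(
        auth=ClientCredentials(settings["client_id"], settings["client_secret"]),
        account_name=settings["account_name"],
        engine_name=settings["engine_name"],
        database=settings["database"],
    )
    return connection.cursor()
```

In a notebook cell, `cursor = open_cursor(firebolt_settings())` then creates the cursor without any credentials appearing in the notebook itself.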
Now, the cursor can be reused across all further examples.
A first example of creating a graph from the data with matplotlib is to count the events per day.
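One way to sketch this, assuming the events live in a table called `gharchive` with a `created_at` timestamp column (both names are guesses for illustration, so adjust them to your schema) and that `cursor` is an open DB-API cursor:

```python
# Table and column names are assumptions; adapt them to your own schema.
EVENTS_PER_DAY_SQL = """
SELECT DATE_TRUNC('day', created_at) AS day, COUNT(*) AS events
FROM gharchive
GROUP BY 1
ORDER BY 1
"""


def fetch_events_per_day(cursor):
    """Run the aggregation and split the rows into two parallel lists."""
    cursor.execute(EVENTS_PER_DAY_SQL)
    rows = cursor.fetchall()
    days = [row[0] for row in rows]
    counts = [row[1] for row in rows]
    return days, counts


def plot_events_per_day(days, counts):
    """Draw one point per day as a simple line chart and return the figure."""
    # Imported here so that fetching the data works without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(12, 4))
    ax.plot(days, counts)
    ax.set_xlabel("Day")
    ax.set_ylabel("Events")
    ax.set_title("GitHub events per day")
    return fig
```

Calling `days, counts = fetch_events_per_day(cursor)` followed by `plot_events_per_day(days, counts)` in a notebook cell renders the chart. Keeping the query separate from the plotting makes it easy to reuse the fetched data for other graphs.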
This would result in the following output
With about 365 points per year, the graph needs some smoothing, either by using a coarser granularity such as month or by computing a rolling average over a window of days.
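The rolling-average variant can be sketched in plain Python. The helper `rolling_mean` and the default 30-day window are assumptions made for this example; with pandas, `Series.rolling(window).mean()` is the usual equivalent, although by default it yields NaN until the window is full.

```python
from collections import deque


def rolling_mean(values, window=30):
    """Trailing mean over the last `window` values.

    The first few results average over fewer values, so the output
    has the same length as the input.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)
    total = 0.0
    smoothed = []
    for value in values:
        if len(recent) == window:
            # The oldest value is about to be evicted by append().
            total -= recent[0]
        recent.append(value)
        total += value
        smoothed.append(total / len(recent))
    return smoothed
```

For example, `rolling_mean([1, 2, 3, 4, 5], window=2)` returns `[1.0, 1.5, 2.5, 3.5, 4.5]`. The smoothed list has one entry per day, so it can be plotted against the same dates as the raw counts.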
This time the graph looks a little different
You can spot the Christmas/New Year dip happening reliably every year. Starting with COVID, the growth looks less like a smooth curve and has more ups and downs. There is also one big dip in late 2021. It turns out that's just missing data, not the big GitHub decline...
Next steps could be to use a time series anomaly detection library or data frames to change the data into a desired format for further visualizations.
You can see these examples and more in the YouTube video.
If you have any questions, head over to help.firebolt.io and ask!
Also take a look at the Jupyter notebook in the GitHub repository.