Using Jupyter for data exploration

January 17, 2023

Analyzing the GitHub Events Dataset using Firebolt - Using Jupyter for data exploration

Alexander Reelsen

Developer Advocate at Firebolt

TL;DR: In this multi-part blog series we take a look at the GitHub Archive dataset, which consists of public events happening on GitHub. We will cover exploring the data using the Firebolt Python SDK, writing a small data app using the Firebolt JDBC driver, and leveraging Apache Airflow workflows to keep our data up to date. So stay tuned for all the blog posts!

Part 1: Analyzing the GitHub Events Dataset using Firebolt - Querying with Streamlit
Part 2: Using Jupyter for data exploration
Part 3: Incremental Updates with Apache Airflow
Part 4: Writing a data app using Java

Jupyter has been around for more than a decade. It is still one of the go-to tools for notebooks, allowing you to keep a list of code snippets that are easy to present and easy for others to consume. Combining code with graphs makes demos far more interactive than just showing code snippets.

Using the Firebolt Python SDK, we can build such a demo with a fair share of graphs showing our GitHub events dataset.

The most important part is initializing the connection to Firebolt. When you use pipenv for dependencies and start Jupyter via pipenv run jupyter-notebook, the .env file is read automatically. Storing your Firebolt credentials there, or retrieving them from a secret store with a tool like envchain, keeps account data out of your Python snippets. The notebook then only needs to read those values from the environment, open a connection, and create a cursor.
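A minimal sketch of such a connection setup, assuming a recent `firebolt-sdk` release that authenticates with `ClientCredentials` (older releases used `UsernamePassword` instead). The environment variable names and the `load_firebolt_settings` and `connect_to_firebolt` helpers are made up for this illustration; adapt them to whatever your `.env` file or envchain namespace provides.

```python
import os

# Hypothetical variable names; use whatever your .env file or envchain defines.
REQUIRED_VARS = {
    "client_id": "FIREBOLT_CLIENT_ID",
    "client_secret": "FIREBOLT_CLIENT_SECRET",
    "account_name": "FIREBOLT_ACCOUNT",
    "engine_name": "FIREBOLT_ENGINE",
    "database": "FIREBOLT_DATABASE",
}


def load_firebolt_settings(env=None):
    """Collect connection settings from the environment, failing early if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS.values() if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in REQUIRED_VARS.items()}


def connect_to_firebolt(settings):
    """Open a Firebolt connection from the collected settings."""
    # Imported here so the SDK is only required when a connection is opened.
    from firebolt.client.auth import ClientCredentials
    from firebolt.db import connect

    return connect(
        engine_name=settings["engine_name"],
        database=settings["database"],
        account_name=settings["account_name"],
        auth=ClientCredentials(settings["client_id"], settings["client_secret"]),
    )
```

In a notebook cell, `connection = connect_to_firebolt(load_firebolt_settings())` followed by `cursor = connection.cursor()` would then give you a cursor without any credentials appearing in the notebook itself.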

Now, the cursor can be reused across all further examples.

A first example is to count the events per day and plot the result with matplotlib.
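A sketch of how this could look with a DB-API style cursor. The table name `gharchive` and the timestamp column `created_at` are assumptions about the schema, and `split_rows` and `plot_events_per_day` are illustrative helpers, not part of the SDK.

```python
# Assumed schema: a table `gharchive` with a timestamp column `created_at`.
EVENTS_PER_DAY_SQL = """
SELECT CAST(created_at AS DATE) AS day, COUNT(*) AS events
FROM gharchive
GROUP BY 1
ORDER BY 1
"""


def split_rows(rows):
    """Turn (day, count) rows into two parallel lists for plotting."""
    days = [row[0] for row in rows]
    counts = [row[1] for row in rows]
    return days, counts


def plot_events_per_day(cursor):
    """Run the aggregation and draw a simple line chart of events per day."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    cursor.execute(EVENTS_PER_DAY_SQL)
    days, counts = split_rows(cursor.fetchall())

    fig, ax = plt.subplots(figsize=(12, 4))
    ax.plot(days, counts)
    ax.set_xlabel("Day")
    ax.set_ylabel("Events")
    ax.set_title("GitHub events per day")
    fig.autofmt_xdate()  # tilt the date labels so they do not overlap
    return fig
```

Calling `plot_events_per_day(cursor)` in a cell runs the query and draws the chart.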

This would result in the following output

With about 365 points per year, this requires some smoothing, either by using a coarser granularity such as months or by computing a rolling average over a window of days.
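One way to sketch the rolling-window variant in plain Python is a trailing moving average over the daily counts. The 30-day default window and the `rolling_mean` helper are assumptions for illustration; the same smoothing could also be done in SQL with a window function or with a data frame library.

```python
from collections import deque


def rolling_mean(values, window=30):
    """Return the trailing moving average of `values`.

    The first `window - 1` results average over the shorter prefix that is
    available, so the output has the same length as the input.
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    recent = deque(maxlen=window)  # holds the last `window` values
    total = 0.0
    smoothed = []
    for value in values:
        if len(recent) == window:
            total -= recent[0]  # this value is evicted by the append below
        recent.append(value)
        total += value
        smoothed.append(total / len(recent))
    return smoothed
```

The smoothed list has the same length as the input, so it can be plotted against the same list of days.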

This time the graph looks a little different

You can spot the Christmas/New Year dip happening reliably every year. Since the start of COVID, the growth looks less like a smooth curve and shows more ups and downs. There is also one big dip in late 2021. It turns out that this is just missing data, not a big GitHub decline.

Next steps could be to use a time series anomaly detection library, or to load the data into data frames and reshape it for further visualizations.
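As a very rough stand-in for a dedicated anomaly detection library, the following sketch flags days whose event count falls far below the median of the preceding days, which is the kind of check that would surface a gap like the one in late 2021. The `find_dips` helper, the 7-day window, and the 50% threshold are all assumptions chosen for illustration.

```python
from statistics import median


def find_dips(counts, window=7, ratio=0.5):
    """Return the indexes of values below `ratio` times the median of the previous `window` values."""
    dips = []
    for i in range(window, len(counts)):
        baseline = median(counts[i - window:i])
        if counts[i] < ratio * baseline:
            dips.append(i)
    return dips
```

Using the median rather than the mean as the baseline keeps a single bad day from dragging the baseline down for the days that follow it.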

You can see these examples and more in the YouTube video.

If you have any questions, head over to help.firebolt.io and ask!

Also take a look at the Jupyter notebook in the GitHub repository.