TLDR; In this multi-series blog post we will take a look at the GithubArchive dataset consisting of public events happening on GitHub. We will cover discovering the data using the Firebolt Python SDK, writing a small data app using the Firebolt JDBC driver as well as leveraging Apache Airflow workflows for keeping our data up-to-date. So stay tuned for all the blog posts!
- Part 1: Analyzing the GitHub Events Dataset using Firebolt - Querying with Streamlit
- Part 2: Using Jupyter for data exploration
- Part 3: Incremental Updates with Apache Airflow
- Part 4: Writing a data app using Java
In the last part of our series we will take a look at a data application. In case you are wondering what our definition of a data app is, take a look at this blog post.
This app uses an OAuth login via GitHub to retrieve the username and filter for that across our GitHub archive dataset. This is what the application looks like:
Let's take a look at the tech stack before diving into some code:
- Javalin: Web Framework
- JDBI using the Firebolt JDBC driver to retrieve data
- Pac4j for authentication against GitHub via OAuth
- JTE for templating
Now let's take a look at the setup of some parts of the application.
JDBC & JDBI setup
Firebolt features a full JDBC driver, which means that potentially a fair share of frameworks and data tools work out of the box by properly including that driver. JDBI is an abstraction on top of JDBC simplifying data access to a data store. The setup looks like this in Java
This object now allows to send queries or inserts to the JDBC databases, but a core feature is the easy mapping of SQL queries to Java objects.
Using Java interfaces to retrieve data
The created Jdbi object now allows to define interface methods that have SQL queries attached to them. This project features a GithubProfileDao. A sample interface looks like this:
There is a lot to unpack in the few lines of code above, apart from seeing that multi-line strings in Java are a natural fit for SQL queries.
The totalCount() method is returning the total number of documents within the table returning a single number.
The second method is more interesting. On the one hand a input parameter is passed for the filter clause. On the other hand a map having the key for the current year and the value of the corresponding count of events. This saves us from creating an extra POJO to store these responses which in turn would require some code to map a response row to that POJO - at the expense of being somewhat more generic with a map.
JDBI will pick up this interface on initialization and there is no need to write any implementation for it. If you want to retrieve the total number of events all you need to do is:
So, now that all the basics are up and running, the final step is to execute the data fetching methods in a Javalin handler.
Rendering GitHub statistics
In order to render the GitHub activity the controller must fetch all the required data. In the code snippet below there is only one SQL query shown, whereas in the actual source it is half a dozen queries for each user.
Let's go through this snippet step by step. If a user calls an URL like /stats/spinscale the following happens:
- Retrieve profile information from the HTTP session
- Redirect if a user tries to check different user activity than for him/herself
- Retrieve the activity per year for this user
- Put data into a hashmap and render the `activity.jte` using that map
The full code is doing a little bit more and also includes a handler to check if a user is logged in at and redirects to the GitHub login if that is not the case.
The final step is running the code somewhere. A practical first step is packaging the app into a docker container using a Dockerfile. This will allow it to run it at a PaaS provider of your choice.
Mine was fly.io. With an easy way to configure secrets as well as running fly deploy on the command line to roll out, this basically took no time to set up and it even has a free tier! Albeit that free tier has a maximum memory of 256MB, this is more than enough for a small Javalin application. Startup time is also pretty fast with only a little more than 120ms on my notebook.
If you are interested in more details about setting up Javalin, please take a look at the GitHub repository of this sample project.
If you have any questions, head over to help.firebolt.io and ask!
This post concludes our series about analyzing the GitHub archive data set. Stay tuned for the next series and see you soon!