Analyzing the NYC Restaurant Inspections Dataset

New York City is known for its diverse culinary scene, with thousands of restaurants serving a wide array of cuisines. To ensure the safety and hygiene of these establishments, the NYC Department of Health conducts regular inspections and maintains a comprehensive dataset of inspection records. In this tutorial, we'll walk you through the NYC restaurant inspections dataset's schema and provide sample SQL queries with explanations to help you get started.

Understanding the NYC Restaurant Inspections Dataset

Before diving into the analysis, it's crucial to understand the dataset's schema. Here's a tabular representation of the dataset.

Column Name Description
camis Unique identifier for each restaurant
dba Doing Business As (DBA) name of the restaurant
boro Borough where the restaurant is located
building Building number of the restaurant
street Street address of the restaurant
zipcode ZIP code of the restaurant
phone Phone number of the restaurant
cuisine_description Description of the restaurant's cuisine
inspection_date Date of the inspection
action Action taken during the inspection
violation_code Code indicating the type of violation
violation_description Description of the violation
critical_flag Flag indicating if the violation is critical
score Inspection score
grade Inspection grade (A,B,C,Z-Pending,N-Not yet graded, Z-Pending,P-Pending issued on reopening)
grade_date Date of the inspection grade
record_date Date when the record was created
inspection_type Type of inspection
latitude Latitude coordinates of the restaurant
longitude Longitude coordinates of the restaurant
community_board Community board where the restaurant is located
council_district Council district where the restaurant is located
census_tract Census tract where the restaurant is located
bin Building Identification Number (BIN)
bbl Borough-Block-Lot (BBL)
nta Neighborhood Tabulation Area (NTA)
location_point1 Location coordinates (point) in text format

Now that you're familiar with the dataset's schema, let's explore how to load the data from an S3 bucket and run some sample SQL queries to extract useful information.

Loading Restaurant Inspection data into Firebolt

Loading data into Firebolt can be done one of two ways. You can use the “COPY FROM” command to perform schema inference and load the data.  

Option 1: The first approach is to use the “COPY FROM”  statement to copy data from the external bucket.

1COPY INTO nyc_restaurant_inspections FROM 's3://firebolt-sample-datasets-public-us-east-1/nyc_sample_datasets/nyc_restaurant_inspections/parquet/'
2WITH PATTERN="*.parquet" AUTO_CREATE=TRUE TYPE=PARQUET;

Option 2: The second way is to use the concept of an external table to move data into Firebolt. We will use the second approach here using the following three steps.

  1. Create an external table as a connector from Firebolt to your external data source. In the table definition, you specify credentials that allow Firebolt to read from the data source. For more information, see CREATE EXTERNAL TABLE and Using AWS roles to access S3. Data that you ingest must be in an Amazon S3 bucket in the same AWS Region as the Firebolt database. 
1CREATE EXTERNAL TABLE ex_nyc_restaurant_inspections (
2  camis INTEGER,
3  dba TEXT,
4  boro TEXT,
5  building TEXT,
6  street TEXT,
7  zipcode TEXT,
8  phone TEXT,
9  cuisine_description TEXT,
10  inspection_date TEXT,
11  action TEXT,
12  violation_code TEXT,
13  violation_description TEXT,
14  critical_flag TEXT,
15  score NUMERIC,
16  grade TEXT,
17  grade_date DATE,
18  record_date DATE,
19  inspection_type TEXT,
20  latitude NUMERIC,
21  longitude NUMERIC,
22  community_board TEXT,
23  council_district TEXT,
24  census_tract TEXT,
25  bin TEXT,
26  bbl TEXT,
27  nta TEXT,
28  location_point1 TEXT
29)URL = 's3://firebolt-sample-datasets-public-us-east-1/nyc_sample_datasets/nyc_restaurant_inspections/parquet/'
30OBJECT_PATTERN = '*.parquet'
31TYPE = (PARQUET);

  1. Create a table to store the data in Firebolt to be queried.
1CREATE  TABLE nyc_restaurant_inspections(
2  camis INTEGER,
3  dba TEXT,
4  boro TEXT,
5  building TEXT,
6  street TEXT,
7  zipcode TEXT,
8  phone TEXT,
9  cuisine_description TEXT,
10  inspection_date TEXT,
11  action TEXT,
12  violation_code TEXT,
13  violation_description TEXT,
14  critical_flag TEXT,
15  score NUMERIC,
16  grade TEXT,
17  grade_date DATE,
18  record_date DATE,
19  inspection_type TEXT,
20  latitude NUMERIC,
21  longitude NUMERIC,
22  community_board TEXT,
23  council_district TEXT,
24  census_tract TEXT,
25  bin TEXT,
26  bbl TEXT,
27  nta TEXT,
28  location_point1 TEXT
29);

  1. Use the INSERT sql statement using a general purpose engine to load data from the external data source into the fact or dimension table.
1INSERT INTO
2  nyc_restaurant_inspections
3SELECT
4  camis,
5  dba,
6  boro,
7  building,
8  street,
9  zipcode,
10  phone,
11  cuisine_description,
12  inspection_date,
13  action,
14  violation_code,
15  violation_description,
16  critical_flag,
17  score,
18  grade,
19  grade_date,
20  record_date,
21  inspection_type,
22  latitude,
23  longitude,
24  community_board,
25  council_district,
26  census_tract,
27  bin,
28  bbl,
29  nta,
30  location_point1
31FROM
32  ex_nyc_restaurant_inspections;

Sample SQL Queries

Query 1: Find the Total Number of Inspections by Borough

1SELECT boro, COUNT(*) as total_inspections
2FROM nyc_restaurant_inspections
3GROUP BY boro
4ORDER BY total_inspections DESC;

This query calculates the total number of inspections conducted in each borough and presents the results in descending order of inspection counts.

Query 2: Identify the Most Common Cuisine Types

1SELECT cuisine_description, COUNT(*) as count
2FROM nyc_restaurant_inspections
3GROUP BY cuisine_description
4ORDER BY count DESC
5LIMIT 10;

This query identifies the top 10 most common cuisine types in NYC restaurants based on the dataset.

Query 3: Find the Restaurants with the Highest Scores

1SELECT dba, boro, max(score) as score
2FROM nyc_restaurant_inspections
3GROUP BY ALL
4ORDER BY score DESC
5LIMIT 10;

This query retrieves the top 10 restaurants with the highest inspection scores, along with their boroughs.

Query 4: Analyze Violations by Critical Flag

1SELECT critical_flag, COUNT(*) as count
2FROM nyc_restaurant_inspections
3GROUP BY critical_flag;

This query provides an overview of the number of inspections categorized by critical and non-critical violations.

Query 5: Calculate the Average Inspection Score by Cuisine Type

1SELECT cuisine_description, AVG(score) as avg_score
2FROM nyc_restaurant_inspections
3GROUP BY cuisine_description
4ORDER BY avg_score DESC
5LIMIT 10;

This query calculates the average inspection scores for different cuisine types and presents the top 10 with the highest average scores.

These sample SQL queries should help you get started with analyzing the NYC restaurant inspections dataset. Depending on your specific analysis goals, you can customize and expand these queries to gain deeper insights into the data. As a data engineer, your skills in data manipulation and SQL will play a crucial role in uncovering meaningful patterns and trends in the dataset.

Send me as pdf