STAT 29000: Project 13 — Fall 2021

Motivation: In the previous projects, you’ve gained experience writing all types of queries, touching on the majority of the main concepts. One critical concept that we haven’t yet done is creating your own database. While typically database administrators and engineers will typically be in charge of large production databases, it is likely that you may need to prop up a small development database for your own use at some point in time (and many of you have had to do so this year!). In this project, we will walk through all of the steps to prop up a simple sqlite database for one of our datasets.

Context: This is the final project for the semester, and we will be walking through the useful skill of creating a database and populating it with data. We will (mostly) be using the [sqlite3](www.sqlite.org/) command line tool to interact with the database.

Scope: sql, sqlite, unix

Learning Objectives

Create a sqlite database schema.
Populate the database with data using INSERT statements.
Populate the database with data using the command line interface (CLI) for sqlite3.
Run queries on a database.
Create an index to speed up queries.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

/depot/datamine/data/flights/subset/2007.csv

Questions

Question 1

First thing is first, create a new Jupyter Notebook called firstname-lastname-project13.ipynb. You will put the text of your solutions in this notebook. Next, in Jupyter Lab, open a fresh terminal window. We will be able to run the sqlite3 command line tool from the terminal window.

Okay, once completed, the first step is schema creation. First, it is important to note. The goal of this project is to put the data in /depot/datamine/data/flights/subset/2007.csv into a sqlite database we will call firstname-lastname-project13.db.

With that in mind, run the following (in your terminal) to get a sample of the data.

head /depot/datamine/data/flights/subset/2007.csv

You should receive a result like:

Output

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0
2007,1,1,1,1918,1905,2043,2035,WN,462,N370,85,90,74,8,13,SMF,PDX,479,5,6,0,,0,0,0,0,0,0
2007,1,1,1,2206,2130,2334,2300,WN,1229,N685,88,90,73,34,36,SMF,PDX,479,6,9,0,,0,3,0,0,0,31
2007,1,1,1,1230,1200,1356,1330,WN,1355,N364,86,90,75,26,30,SMF,PDX,479,3,8,0,,0,23,0,0,0,3
2007,1,1,1,831,830,957,1000,WN,2278,N480,86,90,74,-3,1,SMF,PDX,479,3,9,0,,0,0,0,0,0,0
2007,1,1,1,1430,1420,1553,1550,WN,2386,N611SW,83,90,74,3,10,SMF,PDX,479,2,7,0,,0,0,0,0,0,0
2007,1,1,1,1936,1840,2217,2130,WN,409,N482,101,110,89,47,56,SMF,PHX,647,5,7,0,,0,46,0,0,0,1
2007,1,1,1,944,935,1223,1225,WN,1131,N749SW,99,110,86,-2,9,SMF,PHX,647,4,9,0,,0,0,0,0,0,0
2007,1,1,1,1537,1450,1819,1735,WN,1212,N451,102,105,90,44,47,SMF,PHX,647,5,7,0,,0,20,0,0,0,24

An SQL schema is a set of text or code that defines how the database is structured and how each piece of data is stored. In a lot of ways it is similar to how a data.frame has columns with different types — just more "set in stone" than the very easily changed data.frame.

Each database handles schemas slightly differently. In sqlite, the database will contain a single schema table that describes all included tables, indexes, triggers, views, etc. Specifically, each entry in the sqlite_schema table will contain the type, name, tbl_name, rootpage, and sql for the database object.

For sqlite, the "database object" could refer to a table, index, view, or trigger.

This detail is more than is needed for right now. If you are interested in learning more, the sqlite documentation is very good, and the relevant page to read about this is here.

For our purposes, when I refer to "schema", what I really mean is the set of commands that will build our tables, indexes, views, and triggers. sqlite makes it particularly easy to open up a sqlite database and get the exact commands to build the database from scratch without the data itself. For example, take a look at our imdb.db database by running the following in your terminal.

module use /scratch/brown/kamstut/tdm/opt/modulefiles
module load sqlite/3.36.0

sqlite3 /depot/datamine/data/movies_and_tv/imdb.db

This will open the command line interface (CLI) for sqlite3. It will look similar to:

sqlite>

Type .schema to see the "schema" for the database.

Any command you run in the sqlite CLI that starts with a dot (.) is called a "dot command". A dot command is exclusive to sqlite and the same functionality cannot be expected to be available in other SQL tools like Postgresql, MariaDB, or MS SQL. You can list all of the dot commands by typing .help.

After running .schema, you should see a variety of legitimate SQL commands that will create the structure of your database without the data itself. This is an extremely useful self-documenting tool that is particularly useful.

Okay, great. Now, let’s study the sample of our 2007.csv dataset. Create a markdown list of key:value pairs for each column in the dataset. Each key should be the title of the column, and each value should be the type of data that is stored in that column.

For example:

Year: INTEGER

Where the value is one of the 5 "affinity types" (INTEGER, TEXT, BLOB, REAL, NUMERIC) in sqlite. See section "3.1.1" here.

Okay, you may be asking, "what is the difference between INTEGER, REAL, and NUMERIC?". Great question. In general (for other SQL RDBMSs), there are approximate numeric data types and exact numeric data types. What you are most familiar with is the approximate numeric data types. In R or Python for example, try running the following:

(3 - 2.9) <= 0.1

Output

FALSE

(3 - 2.9) <= 0.1

Output

False

Under the hood, the values are stored as a very close approximation of the real value. This small amount of error is referred to as floating point error. There are some instances where it is critical that values are stored as exact values (for example, in finance). In those cases, you would need to use special data types to handle it. In sqlite, this type is NUMERIC. So, for our example, store text as TEXT, numbers without decimal places as INTEGER, and numbers with decimal places as REAL — our example dataset doesn’t have a need for NUMERIC.

Items to submit

Code used to solve this problem.
Output from running the code.

Question 2

Okay, great! At this point in time you should have a list of key:value pairs with the column name and the data type, for each column. Now, let’s put together our CREATE TABLE statement that will create our table in the database.

See here for some good examples. Realize that the CREATE TABLE statement is not so different from any other query in SQL, and although it looks messy and complicated, it is not so bad. Name your table flights.

Once you’ve written your CREATE TABLE statement, copy and paste it into the sqlite CLI. Upon success, you should see the statement printed when running the dot command .schema. Fantastic! You can also verify that the table exists by running the dot command .tables.

Congratulations! To finish things off, please paste the CREATE TABLE statement into a markdown cell in your notebook.

Items to submit

Code used to solve this problem.
Output from running the code.

Question 3

The next step in the project is to add the data! After all, it is a _data_base.

To insert data into a table is a bit cumbersome. For example, let’s say we wanted to add the following row to our flights table.

Data to add

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0

The SQL way would be to run the following query.

INSERT INTO flights (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay) VALUES (2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0);

NOT ideal — especially since we have over 7 million rows to add! You could programmatically generate a .sql file with the INSERT INTO statement, hook the database up with Python or R and insert the data that way, or you could use the wonderful dot commands sqlite already provides.

You may find this post very helpful.

You want to make sure you don’t include the header line twice! If you included the header line twice, you can verify by running the following in the sqlite CLI.

.header on
SELECT * FROM flights LIMIT 2;

The .header on dot command will print the header line for every query you run. If you have double entered the header line, it will appear twice. Once for the .header on and another time because that is the first row of your dataset.

Connect to your database in your Jupyter notebook and run a query to get the first 5 rows of your table.

To connect to your database:

%load_ext sql
%sql sqlite:////home/PURDUEALIAS/flights.db

Assuming flights.db is in your home directory, and you change PURDUEALIAS to your alias, for example mdw for Dr. Ward or kamstut for Kevin.

Items to submit

Code used to solve this problem.
Output from running the code.

Question 4

For this question, please run take screenshots of your output from the terminal and add them to your notebook using a markdown cell. To do so, let’s say you have an image called my_image.png in your $HOME directory. All you need to do is run the following command in a markdown cell:

![](/home/PURDUEALIAS/my_image.png)

Be sure to replace PURDUEALIAS with your alias.

Woohoo! You’ve successfully created a database and populated it with data from a dataset — pretty cool! Now, run the following dot command in order to time our queries: .timer on. This will print out the time it takes to run each query. For example, try the following:

SELECT * FROM flights LIMIT 5;

Cool! Time the following query.

SELECT * FROM flights ORDER BY DepTime LIMIT 1000;

Output

Run Time: real 1.824 user 0.836007 sys 0.605384

That is pretty quick, but if (for some odd reason) there were going to be a lot of queries that searched on exact departure times, this could be a big waste of time when done at scale. What can we do to improve this? Add and index!

Run the following query.

EXPLAIN QUERY PLAN SELECT * FROM flights WHERE DepTime = 1232;

The output will indicate that the "plan" is to simply scan the entire table. This has a runtime of O(n), which means the speed is linear to the number of values in the table. If we had 1 million rows and it takes 1 second. If we get to a billion rows, it will take 16 minutes! An index is a data structure that will let us reduce the runtime to O(log(n)). This means if we had 1 million rows and it takes 1 second, if we had 1 billion rows, it would take only 3 seconds. Much more efficient! So what is the catch here? Space.

Leave the sqlite CLI by running .quit. Now, see how much space your flights.db file is using.

ls -la $HOME/flights.db

Output

571M

Okay, after I add an index on the DepTime column, the file is now 653M — while that isn’t a huge difference, it would certainly be significant if we scaled up the size of our database. In this case, another drawback would be the insert time. Inserting new data into the database would force the database to have to update the indexes. This can add a lot of time. These are just tradeoffs to consider when you’re working with a database.

In this case, we don’t care about the extra bit of space — create an index on the DepTime column. This article is a nice easy read that covers this in more detail.

Great! Once you’ve created your index, run the following query.

SELECT * FROM flights ORDER BY DepTime LIMIT 1000;

Output

Run Time: real 0.263 user 0.014261 sys 0.032923

Wow! That is some serious improvement. What does the "plan" look like?

EXPLAIN QUERY PLAN SELECT * FROM flights WHERE DepTime = 1232;

You’ll notice the "plan" shows it will utilize the index to speed the query up. Great!

Items to submit

Code used to solve this problem.
Output from running the code.

Question 5

We hope that this project has given you a small glimpse into the "other side" of databases. Now, write a query that uses one or more other columns. Time the query, then, create a new index to speed the query up. Time the query after creating the index. Did it work well?

Document the steps of this problem just like you did for question (4).

Optional challenge: Try to make your query utilize 2 columns and create an index on both columns to see if you can get a speedup.

Items to submit

Code used to solve this problem.
Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.