STAT 29000: Project 13 — Fall 2021
Motivation: In the previous projects, you’ve gained experience writing all types of queries, touching on the majority of the main concepts. One critical concept that we haven’t yet done is creating your own database. While typically database administrators and engineers will typically be in charge of large production databases, it is likely that you may need to prop up a small development database for your own use at some point in time (and many of you have had to do so this year!). In this project, we will walk through all of the steps to prop up a simple sqlite database for one of our datasets.
Context: This is the final project for the semester, and we will be walking through the useful skill of creating a database and populating it with data. We will (mostly) be using the [sqlite3](www.sqlite.org/) command line tool to interact with the database.
Scope: sql, sqlite, unix
Dataset(s)
The following questions will use the following dataset(s):
-
/depot/datamine/data/flights/subset/2007.csv
Questions
Question 1
First thing is first, create a new Jupyter Notebook called firstname-lastname-project13.ipynb
. You will put the text of your solutions in this notebook. Next, in Jupyter Lab, open a fresh terminal window. We will be able to run the sqlite3
command line tool from the terminal window.
Okay, once completed, the first step is schema creation. First, it is important to note. The goal of this project is to put the data in /depot/datamine/data/flights/subset/2007.csv
into a sqlite database we will call firstname-lastname-project13.db
.
With that in mind, run the following (in your terminal) to get a sample of the data.
head /depot/datamine/data/flights/subset/2007.csv
You should receive a result like:
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay 2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0 2007,1,1,1,1918,1905,2043,2035,WN,462,N370,85,90,74,8,13,SMF,PDX,479,5,6,0,,0,0,0,0,0,0 2007,1,1,1,2206,2130,2334,2300,WN,1229,N685,88,90,73,34,36,SMF,PDX,479,6,9,0,,0,3,0,0,0,31 2007,1,1,1,1230,1200,1356,1330,WN,1355,N364,86,90,75,26,30,SMF,PDX,479,3,8,0,,0,23,0,0,0,3 2007,1,1,1,831,830,957,1000,WN,2278,N480,86,90,74,-3,1,SMF,PDX,479,3,9,0,,0,0,0,0,0,0 2007,1,1,1,1430,1420,1553,1550,WN,2386,N611SW,83,90,74,3,10,SMF,PDX,479,2,7,0,,0,0,0,0,0,0 2007,1,1,1,1936,1840,2217,2130,WN,409,N482,101,110,89,47,56,SMF,PHX,647,5,7,0,,0,46,0,0,0,1 2007,1,1,1,944,935,1223,1225,WN,1131,N749SW,99,110,86,-2,9,SMF,PHX,647,4,9,0,,0,0,0,0,0,0 2007,1,1,1,1537,1450,1819,1735,WN,1212,N451,102,105,90,44,47,SMF,PHX,647,5,7,0,,0,20,0,0,0,24
An SQL schema is a set of text or code that defines how the database is structured and how each piece of data is stored. In a lot of ways it is similar to how a data.frame has columns with different types — just more "set in stone" than the very easily changed data.frame.
Each database handles schemas slightly differently. In sqlite, the database will contain a single schema table that describes all included tables, indexes, triggers, views, etc. Specifically, each entry in the sqlite_schema
table will contain the type, name, tbl_name, rootpage, and sql for the database object.
For sqlite, the "database object" could refer to a table, index, view, or trigger. |
This detail is more than is needed for right now. If you are interested in learning more, the sqlite documentation is very good, and the relevant page to read about this is here.
For our purposes, when I refer to "schema", what I really mean is the set of commands that will build our tables, indexes, views, and triggers. sqlite makes it particularly easy to open up a sqlite database and get the exact commands to build the database from scratch without the data itself. For example, take a look at our imdb.db
database by running the following in your terminal.
module use /scratch/brown/kamstut/tdm/opt/modulefiles
module load sqlite/3.36.0
sqlite3 /depot/datamine/data/movies_and_tv/imdb.db
This will open the command line interface (CLI) for sqlite3. It will look similar to:
sqlite>
Type .schema
to see the "schema" for the database.
Any command you run in the sqlite CLI that starts with a dot ( |
After running .schema
, you should see a variety of legitimate SQL commands that will create the structure of your database without the data itself. This is an extremely useful self-documenting tool that is particularly useful.
Okay, great. Now, let’s study the sample of our 2007.csv
dataset. Create a markdown list of key:value pairs for each column in the dataset. Each key should be the title of the column, and each value should be the type of data that is stored in that column.
For example:
-
Year: INTEGER
Where the value is one of the 5 "affinity types" (INTEGER, TEXT, BLOB, REAL, NUMERIC) in sqlite. See section "3.1.1" here.
Okay, you may be asking, "what is the difference between INTEGER, REAL, and NUMERIC?". Great question. In general (for other SQL RDBMSs), there are approximate numeric data types and exact numeric data types. What you are most familiar with is the approximate numeric data types. In R or Python for example, try running the following:
(3 - 2.9) <= 0.1
FALSE
(3 - 2.9) <= 0.1
False
Under the hood, the values are stored as a very close approximation of the real value. This small amount of error is referred to as floating point error. There are some instances where it is critical that values are stored as exact values (for example, in finance). In those cases, you would need to use special data types to handle it. In sqlite, this type is NUMERIC. So, for our example, store text as TEXT, numbers without decimal places as INTEGER, and numbers with decimal places as REAL — our example dataset doesn’t have a need for NUMERIC.
-
Code used to solve this problem.
-
Output from running the code.
Question 2
Okay, great! At this point in time you should have a list of key:value pairs with the column name and the data type, for each column. Now, let’s put together our CREATE TABLE
statement that will create our table in the database.
See here for some good examples. Realize that the CREATE TABLE
statement is not so different from any other query in SQL, and although it looks messy and complicated, it is not so bad. Name your table flights
.
Once you’ve written your CREATE TABLE
statement, copy and paste it into the sqlite CLI. Upon success, you should see the statement printed when running the dot command .schema
. Fantastic! You can also verify that the table exists by running the dot command .tables
.
Congratulations! To finish things off, please paste the CREATE TABLE
statement into a markdown cell in your notebook.
-
Code used to solve this problem.
-
Output from running the code.
Question 3
The next step in the project is to add the data! After all, it is a _data_base.
To insert data into a table is a bit cumbersome. For example, let’s say we wanted to add the following row to our flights
table.
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay 2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0
The SQL way would be to run the following query.
INSERT INTO flights (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay) VALUES (2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0);
NOT ideal — especially since we have over 7 million rows to add! You could programmatically generate a .sql
file with the INSERT INTO
statement, hook the database up with Python or R and insert the data that way, or you could use the wonderful dot commands sqlite already provides.
You may find this post very helpful.
You want to make sure you don’t include the header line twice! If you included the header line twice, you can verify by running the following in the sqlite CLI.
The |
Connect to your database in your Jupyter notebook and run a query to get the first 5 rows of your table.
To connect to your database:
Assuming |
-
Code used to solve this problem.
-
Output from running the code.
Question 4
For this question, please run take screenshots of your output from the terminal and add them to your notebook using a markdown cell. To do so, let’s say you have an image called
Be sure to replace PURDUEALIAS with your alias. |
Woohoo! You’ve successfully created a database and populated it with data from a dataset — pretty cool! Now, run the following dot command in order to time our queries: .timer on
. This will print out the time it takes to run each query. For example, try the following:
SELECT * FROM flights LIMIT 5;
Cool! Time the following query.
SELECT * FROM flights ORDER BY DepTime LIMIT 1000;
Run Time: real 1.824 user 0.836007 sys 0.605384
That is pretty quick, but if (for some odd reason) there were going to be a lot of queries that searched on exact departure times, this could be a big waste of time when done at scale. What can we do to improve this? Add and index!
Run the following query.
EXPLAIN QUERY PLAN SELECT * FROM flights WHERE DepTime = 1232;
The output will indicate that the "plan" is to simply scan the entire table. This has a runtime of O(n), which means the speed is linear to the number of values in the table. If we had 1 million rows and it takes 1 second. If we get to a billion rows, it will take 16 minutes! An index is a data structure that will let us reduce the runtime to O(log(n)). This means if we had 1 million rows and it takes 1 second, if we had 1 billion rows, it would take only 3 seconds. Much more efficient! So what is the catch here? Space.
Leave the sqlite CLI by running .quit
. Now, see how much space your flights.db
file is using.
ls -la $HOME/flights.db
571M
Okay, after I add an index on the DepTime
column, the file is now 653M
— while that isn’t a huge difference, it would certainly be significant if we scaled up the size of our database. In this case, another drawback would be the insert time. Inserting new data into the database would force the database to have to update the indexes. This can add a lot of time. These are just tradeoffs to consider when you’re working with a database.
In this case, we don’t care about the extra bit of space — create an index on the DepTime
column. This article is a nice easy read that covers this in more detail.
Great! Once you’ve created your index, run the following query.
SELECT * FROM flights ORDER BY DepTime LIMIT 1000;
Run Time: real 0.263 user 0.014261 sys 0.032923
Wow! That is some serious improvement. What does the "plan" look like?
EXPLAIN QUERY PLAN SELECT * FROM flights WHERE DepTime = 1232;
You’ll notice the "plan" shows it will utilize the index to speed the query up. Great!
-
Code used to solve this problem.
-
Output from running the code.
Question 5
We hope that this project has given you a small glimpse into the "other side" of databases. Now, write a query that uses one or more other columns. Time the query, then, create a new index to speed the query up. Time the query after creating the index. Did it work well?
Document the steps of this problem just like you did for question (4).
Optional challenge: Try to make your query utilize 2 columns and create an index on both columns to see if you can get a speedup.
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. |