TDM 10200: Project 7 — 2023
Motivation: Pandas allows us to work with data frames. The actions that we perform on data frames will sometimes remind us of similar actions that we have performed on data frames during the previous semester with R. For instance, we often want to extract information about one or more variables, sometimes grouping the data according to one variable and summarizing another variable within each of those groups.
Context: Unifying our understanding of Pandas and the ability to develop functions will allow us to systematically analyze data.
Scope: Pandas and functions
Dataset(s)
The following questions will use the following dataset(s):
/anvil/projects/tdm/data/flights/subset/
Questions
ONE
Make sure to have 2 cores when you start your Jupyter Lab session.
For convenience, we will help you quickly get all of the data for all of the flights that depart or arrive at Indianapolis airport, as follows. Make a new cell in JupyterLab that has exactly this content (please copy and paste for accuracy):
%%bash
head -n1 /anvil/projects/tdm/data/flights/subset/1987.csv >~/INDflights.csv
grep -h ",IND," /anvil/projects/tdm/data/flights/subset/*.csv >>~/INDflights.csv
We want to use Pandas to read in the data frame, and it will have a lot of columns. So we set Pandas to display an unlimited number of columns.
import pandas as pd
pd.set_option('display.max_columns', None)
Afterwards, in a separate cell in JupyterLab, you can read in your data to a Pandas data frame like this:
myDF = pd.read_csv('~/INDflights.csv')
Do not worry that you get a DtypeWarning
; this will not affect our work on this project~
Now your data frame called myDF
will contain all of the data for the flights (from October 1987 to April 2008) that depart or arrive at Indianapolis airport, which has 3-letter code IND
.
These files correspond to the years 1987 through 2008. Your data frame should contain all of the data for all of the flights with IND
as the Origin
or Dest
airport.
-
How many flights are there altogether in
myDF
? You can check this usingmyDF.shape
. -
How many of the flights are departing from
IND
? (I.e., theOrigin
airport isIND
.) -
How many of the flights are arriving to
IND
? (I.e., theDest
airport isIND
.)
-
Code used to answer the question.
-
Result of code.
TWO
-
For flights departing from 'IND' (i.e., with
IND
as theOrigin
), what are the 20 most popular destination airports (i.e., the 20 most popularDest
airports)? -
For flights departing from 'IND' (i.e., with
IND
as theOrigin
), what are the 5 most popular airlines (i.e., the 5 most popular `UniqueCarrier`s)?
-
Code used to answer the question.
-
Result of code.
THREE
-
Wrap your work for question 2a into a function that takes 1 data frame as an argument and the corresponding 3-letter code as an argument, and finds the 20 most popular destination airports in that data frame.
-
Wrap your work for question 2b into a function that takes 1 data frame as an argument and the corresponding 3-letter code as an argument, and finds the 5 most popular airlines in that data frame.
-
Code used to answer the question.
-
Result of code.
FOUR
Test your functions from question 3a and 3b on a couple of other airports. Hint: If we use huge airports, we likely will not have enough member in Pandas and our kernel might crash. So we will consider some midsize airports for testing the functions. Test your functions from questions 3a and 3b on Jacksonville (JAX
) and Buffalo (BUF
).
-
Code used to answer the question.
-
Result of code.
TA applications for The Data Mine are currently being accepted. Please visit us here to apply! |
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |