Reminder: If you have not installed pak, do so now. You can install it from CRAN with install.packages("pak"). Once installed, load it with library(pak).
The following examples are adapted from the tidy data example.
library(pak)
## Warning: package 'pak' was built under R version 4.3.3
pak::pkg_install("tidyverse")
library(tidyverse)
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
# Visualizing the structure of the data to understand it better
head(billboard) # billboard is a dataset that comes with the tidyr package
## # A tibble: 6 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors Do… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors Do… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## # wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
In this step, the pivot_longer function is employed to
transform the billboard dataset from a wide format to a
long format.
names_to = "week": Specifies that the
names of the original set of columns (wk1 to
wk76) are to be stored in a new column named
week.values_to = "rank": Signifies that the
values of the original set of columns will be gathered into a new column
named rank.values_drop_na = TRUE: Ensures that
any resulting rows containing NA in the rank
column are omitted from the billboard2 dataset.billboard2 <- billboard %>%
pivot_longer(
wk1:wk76,
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
billboard2
## # A tibble: 5,307 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
## # ℹ 5,297 more rows
Next, the mutate function is utilized to create and
modify variables within the long-format dataset created in the previous
step.
week = as.integer(gsub("wk", "", week)):
Converts the week column to integer by removing the “wk”
prefix from the values in the week column and then coercing
them to integer.date = as.Date(date.entered) + 7 * (week - 1):
Calculates a new date column by adding the number of weeks
(converted to days) to the date.entered column, allowing
tracking of the specific date related to each week”s data.date.entered = NULL: Removes the
original date.entered column after the new
date column has been created.billboard3 <- billboard2 %>%
mutate(
week = as.integer(gsub("wk", "", week)),
# Adding to dates in R adds days!
date = as.Date(date.entered) + 7 * (week - 1),
date.entered = NULL
)
billboard3
## # A tibble: 5,307 × 5
## artist track week rank date
## <chr> <chr> <int> <dbl> <date>
## 1 2 Pac Baby Don't Cry (Keep... 1 87 2000-02-26
## 2 2 Pac Baby Don't Cry (Keep... 2 82 2000-03-04
## 3 2 Pac Baby Don't Cry (Keep... 3 72 2000-03-11
## 4 2 Pac Baby Don't Cry (Keep... 4 77 2000-03-18
## 5 2 Pac Baby Don't Cry (Keep... 5 87 2000-03-25
## 6 2 Pac Baby Don't Cry (Keep... 6 94 2000-04-01
## 7 2 Pac Baby Don't Cry (Keep... 7 99 2000-04-08
## 8 2Ge+her The Hardest Part Of ... 1 91 2000-09-02
## 9 2Ge+her The Hardest Part Of ... 2 87 2000-09-09
## 10 2Ge+her The Hardest Part Of ... 3 92 2000-09-16
## # ℹ 5,297 more rows
Finally, the arrange function is applied to organize the
dataset based on the artist, track, and
week columns. This operation ensures a coherent and ordered
display of the dataset, making it more manageable and intuitive for
subsequent analysis.
long_billboard_sorted <- billboard3 %>% arrange(artist, track, week)
glimpse(long_billboard_sorted)
## Rows: 5,307
## Columns: 5
## $ artist <chr> "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", …
## $ track <chr> "Baby Don't Cry (Keep...", "Baby Don't Cry (Keep...", "Baby Don…
## $ week <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
## $ rank <dbl> 87, 82, 72, 77, 87, 94, 99, 91, 87, 92, 81, 70, 68, 67, 66, 57,…
## $ date <date> 2000-02-26, 2000-03-04, 2000-03-11, 2000-03-18, 2000-03-25, 20…
This example code creates a new song data frame holding
unique artist and track combinations from the
billboard3 dataframe, and assigns a unique
song_id to each row (representing each unique song).
song <- billboard3 %>%
distinct(artist, track) %>%
mutate(song_id = row_number())
glimpse(song)
## Rows: 317
## Columns: 3
## $ artist <chr> "2 Pac", "2Ge+her", "3 Doors Down", "3 Doors Down", "504 Boyz"…
## $ track <chr> "Baby Don't Cry (Keep...", "The Hardest Part Of ...", "Krypton…
## $ song_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
The song dataframe is then joined with the
billboard3 dataframe to create a new dataframe
rank that includes the song_id column.
rank <- billboard3 %>%
  left_join(song, by = c("artist", "track")) %>%
select(song_id, date, week, rank)
glimpse(rank)
## Rows: 5,307
## Columns: 4
## $ song_id <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ date <date> 2000-02-26, 2000-03-04, 2000-03-11, 2000-03-18, 2000-03-25, 2…
## $ week <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1…
## $ rank <dbl> 87, 82, 72, 77, 87, 94, 99, 91, 87, 92, 81, 70, 68, 67, 66, 57…
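As a quick sanity check on this decomposition (a minimal sketch, not part of the original example), the two tables can be joined back together; if the split was lossless, the rejoined data should contain the same rows as billboard3.
# Rejoin the normalized tables to verify nothing was lost
billboard_rejoined <- rank %>%
  left_join(song, by = "song_id") %>%
  select(artist, track, week, rank, date)
setequal(billboard_rejoined, billboard3) # should return TRUE
The same workflow can be carried out in pandas: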
import pandas as pd
from datetime import timedelta
billboard = pd.read_csv("billboard.csv")
# Visualizing the structure of the data to understand it better
billboard.head()
## artist track date.entered wk1 wk2 wk3 wk4 ... wk70 wk71 wk72 wk73 wk74 wk75 wk76
## 0 2 Pac Baby Don't Cry (Keep... 2000-02-26 87 82.0 72.0 77.0 ... NaN NaN NaN NaN NaN NaN NaN
## 1 2Ge+her The Hardest Part Of ... 2000-09-02 91 87.0 92.0 NaN ... NaN NaN NaN NaN NaN NaN NaN
## 2 3 Doors Down Kryptonite 2000-04-08 81 70.0 68.0 67.0 ... NaN NaN NaN NaN NaN NaN NaN
## 3 3 Doors Down Loser 2000-10-21 76 76.0 72.0 69.0 ... NaN NaN NaN NaN NaN NaN NaN
## 4 504 Boyz Wobble Wobble 2000-04-15 57 34.0 25.0 17.0 ... NaN NaN NaN NaN NaN NaN NaN
##
## [5 rows x 79 columns]
In this step, the melt function is employed to transform
the billboard dataset from a wide format to a long
format.
- id_vars: Specifies the columns that will remain as identifiers in the long-format dataset.
- var_name = "week": Specifies that the names of the original set of columns (wk1 to wk76) are to be stored in a new column named week.
- value_name = "rank": Signifies that the values of the original set of columns will be gathered into a new column named rank.

billboard2 = billboard.melt(id_vars = ["artist", "track", "date.entered"],
                            var_name = "week",
                            value_name = "rank")
billboard2
## artist track date.entered week rank
## 0 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87.0
## 1 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91.0
## 2 3 Doors Down Kryptonite 2000-04-08 wk1 81.0
## 3 3 Doors Down Loser 2000-10-21 wk1 76.0
## 4 504 Boyz Wobble Wobble 2000-04-15 wk1 57.0
## ... ... ... ... ... ...
## 24087 Yankee Grey Another Nine Minutes 2000-04-29 wk76 NaN
## 24088 Yearwood, Trisha Real Live Woman 2000-04-01 wk76 NaN
## 24089 Ying Yang Twins Whistle While You Tw... 2000-03-18 wk76 NaN
## 24090 Zombie Nation Kernkraft 400 2000-09-02 wk76 NaN
## 24091 matchbox twenty Bent 2000-04-29 wk76 NaN
##
## [24092 rows x 5 columns]
Next, the transform function is utilized to create and
modify variables within the long-format dataset created in the previous
step.
- lambda x: x.str.replace("wk", "").astype(int): Converts the week column to integer by removing the "wk" prefix from the values in the week column and then coercing them to integer.
- pd.to_timedelta(7 * (billboard2["week"] - 1), unit = "D"): Calculates a new date column by adding the number of weeks (converted to days) to the date.entered column, allowing tracking of the specific date related to each week's data.
- drop(columns = ["date.entered"]): Removes the original date.entered column after the new date column has been created.

billboard2["week"] = billboard2["week"].transform(lambda x: x.str.replace("wk", "").astype(int))
billboard2["date"] = pd.to_datetime(billboard2["date.entered"]) + pd.to_timedelta(7 * (billboard2["week"] - 1), unit = "D")
billboard3 = billboard2.drop(columns = ["date.entered"])
billboard3
## artist track week rank date
## 0 2 Pac Baby Don't Cry (Keep... 1 87.0 2000-02-26
## 1 2Ge+her The Hardest Part Of ... 1 91.0 2000-09-02
## 2 3 Doors Down Kryptonite 1 81.0 2000-04-08
## 3 3 Doors Down Loser 1 76.0 2000-10-21
## 4 504 Boyz Wobble Wobble 1 57.0 2000-04-15
## ... ... ... ... ... ...
## 24087 Yankee Grey Another Nine Minutes 76 NaN 2001-10-06
## 24088 Yearwood, Trisha Real Live Woman 76 NaN 2001-09-08
## 24089 Ying Yang Twins Whistle While You Tw... 76 NaN 2001-08-25
## 24090 Zombie Nation Kernkraft 400 76 NaN 2002-02-09
## 24091 matchbox twenty Bent 76 NaN 2001-10-06
##
## [24092 rows x 5 columns]
Finally, the sort_values function is applied to organize
the dataset based on the artist, track, and
week columns. This operation ensures a coherent and ordered
display of the dataset, making it more manageable and intuitive for
subsequent analysis.
long_billboard_sorted = billboard3.sort_values(by = ["artist", "track", "week"])
long_billboard_sorted
## artist track week rank date
## 0 2 Pac Baby Don't Cry (Keep... 1 87.0 2000-02-26
## 317 2 Pac Baby Don't Cry (Keep... 2 82.0 2000-03-04
## 634 2 Pac Baby Don't Cry (Keep... 3 72.0 2000-03-11
## 951 2 Pac Baby Don't Cry (Keep... 4 77.0 2000-03-18
## 1268 2 Pac Baby Don't Cry (Keep... 5 87.0 2000-03-25
## ... ... ... ... ... ...
## 22823 matchbox twenty Bent 72 NaN 2001-09-08
## 23140 matchbox twenty Bent 73 NaN 2001-09-15
## 23457 matchbox twenty Bent 74 NaN 2001-09-22
## 23774 matchbox twenty Bent 75 NaN 2001-09-29
## 24091 matchbox twenty Bent 76 NaN 2001-10-06
##
## [24092 rows x 5 columns]
This example code creates a new song data frame holding
unique artist and track combinations from the
billboard3 dataframe, and assigns a unique
song_id to each row (representing each unique song).
song = billboard3[["artist", "track"]].drop_duplicates().reset_index(drop = True)
song["song_id"] = song.index + 1
song
## artist track song_id
## 0 2 Pac Baby Don't Cry (Keep... 1
## 1 2Ge+her The Hardest Part Of ... 2
## 2 3 Doors Down Kryptonite 3
## 3 3 Doors Down Loser 4
## 4 504 Boyz Wobble Wobble 5
## .. ... ... ...
## 312 Yankee Grey Another Nine Minutes 313
## 313 Yearwood, Trisha Real Live Woman 314
## 314 Ying Yang Twins Whistle While You Tw... 315
## 315 Zombie Nation Kernkraft 400 316
## 316 matchbox twenty Bent 317
##
## [317 rows x 3 columns]
The song dataframe is then merged with the
billboard3 dataframe to create a new dataframe
rank that includes the song_id column.
rank = billboard3.merge(song, on = ["artist", "track"])[["song_id", "date", "week", "rank"]]
rank
## song_id date week rank
## 0 1 2000-02-26 1 87.0
## 1 2 2000-09-02 1 91.0
## 2 3 2000-04-08 1 81.0
## 3 4 2000-10-21 1 76.0
## 4 5 2000-04-15 1 57.0
## ... ... ... ... ...
## 24087 313 2001-10-06 76 NaN
## 24088 314 2001-09-08 76 NaN
## 24089 315 2001-08-25 76 NaN
## 24090 316 2002-02-09 76 NaN
## 24091 317 2001-10-06 76 NaN
##
## [24092 rows x 4 columns]
In this section, we will join the flights dataset with
the weather dataset from the nycflights13 R
package to analyze how weather conditions might have affected the
flights.
Make sure you have loaded in flights.csv and
weather.csv!
library(nycflights13) # R has built-in datasets that can be loaded directly from a library
## Warning: package 'nycflights13' was built under R version 4.3.3
import pandas as pd
flights = pd.read_csv("flights.csv")
weather = pd.read_csv("weather.csv")
In an inner join, only the rows with matching keys in both data frames are returned; rows with non-matching keys are excluded from the result. Use an inner join when you want to combine datasets on common key columns and are only interested in rows with matching keys in both.
# R example
flights_weather_inner_joined <- inner_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
# Python example
flights_weather_inner_merged = pd.merge(flights, weather, how = "inner", on = ["year", "month", "day", "hour", "origin"])
A left join returns all rows from the left dataset and the matched
rows from the right dataset. If there is no match found in the right
dataset, then the result will contain NA. Use a left join
when you want to retain all records from the “left” dataset, and add
matching records from the “right” dataset where available.
# R example
flights_weather_left_joined <- left_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
# Python example
flights_weather_left_merged = pd.merge(flights, weather, how = "left", on = ["year", "month", "day", "hour", "origin"])
In a right join, all rows from the right dataset and the matched rows
from the left dataset are returned. If there is no match found in the
left dataset, then the result will contain NA. It is the
opposite of a left join and is used when you want to retain all records
from the “right” dataset.
# R example
flights_weather_right_joined <- right_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
# Python example
flights_weather_right_merged = pd.merge(flights, weather, how = "right", on = ["year", "month", "day", "hour", "origin"])
A full join returns all rows from both datasets, matched where possible. If there is no match found in either dataset, the result will contain NA. It is useful when you want to retain all records from both datasets.
# R example
flights_weather_full_joined <- full_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
head(flights_weather_full_joined)
## # A tibble: 6 × 29
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 21 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour.x <dttm>, temp <dbl>, dewp <dbl>,
## # humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour.y <dttm>
# Python example
flights_weather_full_merged = pd.merge(flights, weather, how = "outer", on = ["year", "month", "day", "hour", "origin"])
flights_weather_full_merged.head()
## year month day dep_time sched_dep_time ... wind_gust precip pressure visib time_hour_y
## 0 2013 1 1 NaN NaN ... NaN 0.0 1012.0 10.0 2013-01-01 01:00:00
## 1 2013 1 1 NaN NaN ... NaN 0.0 1012.6 10.0 2013-01-01 01:00:00
## 2 2013 1 1 NaN NaN ... 23.0156 0.0 1011.9 10.0 2013-01-01 01:00:00
## 3 2013 1 1 NaN NaN ... NaN 0.0 1012.3 10.0 2013-01-01 02:00:00
## 4 2013 1 1 NaN NaN ... NaN 0.0 1012.4 10.0 2013-01-01 02:00:00
##
## [5 rows x 29 columns]
Use the inner joined dataset to calculate the average departure delay for flights with precipitation greater than 0.5 inches.
# R example
average_delay_per_condition <- flights_weather_inner_joined %>%
group_by(precip > 0.5) %>%
summarise(Average_Departure_Delay = mean(dep_delay, na.rm = TRUE))
average_delay_per_condition
## # A tibble: 2 × 2
## `precip > 0.5` Average_Departure_Delay
## <lgl> <dbl>
## 1 FALSE 12.6
## 2 TRUE 48.8
# Python example
average_delay_per_condition = flights_weather_inner_merged.groupby(flights_weather_inner_merged["precip"] > 0.5).agg(Average_Departure_Delay = ("dep_delay", "mean"))
average_delay_per_condition
## Average_Departure_Delay
## precip
## False 12.619899
## True 48.800000
An anti join returns rows from the left dataset where there are no matching keys in the right dataset. It’s useful for identifying records in one dataset that do not have a counterpart in another dataset.
# R example
flights_weather_anti_joined <- anti_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
head(flights_weather_anti_joined)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 1153 1200 -7 1450 1529
## 2 2013 1 1 1154 1200 -6 1253 1306
## 3 2013 1 1 1155 1200 -5 1517 1510
## 4 2013 1 1 1155 1200 -5 1312 1315
## 5 2013 1 1 1157 1200 -3 1452 1456
## 6 2013 1 1 1158 1200 -2 1256 1300
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
In Python, the anti merge operation can be achieved by performing a
left merge and then filtering out the rows that have no match in the
right dataset. The indicator parameter is set to
True to create the _merge column. The
_merge column is used to identify the rows that have no
match in the right dataset with the value left_only.
# Python example
flights_weather_left_merged = pd.merge(flights, weather, how = "left", on = ["year", "month", "day", "hour", "origin"], indicator = True)
flights_weather_anti_merged = flights_weather_left_merged[(flights_weather_left_merged["_merge"] == "left_only")].drop("_merge", axis = 1)
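# The weather columns are entirely NA in the anti-merged rows, so dropping
# all-NA columns below leaves only the original flight columns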
flights_weather_anti_merged_cleaned = flights_weather_anti_merged.dropna(axis = 1, how = 'all')
flights_weather_anti_merged_cleaned.head()
## year month day dep_time sched_dep_time dep_delay ... dest air_time distance hour minute time_hour_x
## 292 2013 1 1 1153.0 1200 -7.0 ... LAX 330.0 2475 12 0 2013-01-01 12:00:00
## 293 2013 1 1 1154.0 1200 -6.0 ... BOS 40.0 200 12 0 2013-01-01 12:00:00
## 295 2013 1 1 1155.0 1200 -5.0 ... LAX 353.0 2475 12 0 2013-01-01 12:00:00
## 298 2013 1 1 1155.0 1200 -5.0 ... DCA 57.0 213 12 0 2013-01-01 12:00:00
## 301 2013 1 1 1157.0 1200 -3.0 ... MCO 151.0 937 12 0 2013-01-01 12:00:00
##
## [5 rows x 19 columns]
We will use the weather and flights datasets from the
nycflights13 package (also provided as .csv
files for Python users) for the exercises.
Please see the nycflights13 documentation for more information about the datasets.
1. Filter the flights dataset to include only flights with a delay of more than 12 hours. Group and count this output by origin and sort the result in descending order. Your output should be a data.frame showing the number of flights delayed over 12 hours by airport, ordered from most to least.
2. Calculate the average air time for flights from JFK to LAX using the flights data set. Make sure to report this result in hours. Your output should be a data.frame with a single row showing the average air time in hours and the number of flights from JFK to LAX.
3. Using the airports dataset, report the frequency of the time zones of destinations in descending order. Additionally, find an example of an airport with a missing time zone and report the name of the airport, explaining how you checked for it. Your output should be a data.frame listing the time zones by frequency in descending order.
4. Find the three months with the highest average departure delay. Your output should be a data.frame showing the top 3 months along with their respective average, minimum, and maximum departure delay values.
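As a starting point for the first exercise, here is a minimal sketch using the same dplyr verbs demonstrated above. It assumes dep_delay is recorded in minutes, so a 12-hour delay corresponds to 720 minutes; treat it as one possible approach rather than the official solution.
# Possible sketch: flights delayed more than 12 hours, counted by origin
delays_by_origin <- flights %>%
  filter(dep_delay > 12 * 60) %>% # dep_delay is in minutes
  count(origin, sort = TRUE)      # one row per airport, descending count
delays_by_origin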