Reminder: If you have not installed pak, do so now. You can install it from CRAN with install.packages("pak"). Once installed, load it with library(pak).
The following examples are adapted from the tidy data example.
library(pak)
## Warning: package 'pak' was built under R version 4.3.3
pak::pkg_install("tidyverse")
library(tidyverse)
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
# Visualizing the structure of the data to understand it better
head(billboard) # billboard is a dataset that comes with the tidyr package
## # A tibble: 6 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors Do… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors Do… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## # wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
In this step, the pivot_longer function is employed to
transform the billboard dataset from a wide format to a
long format.
names_to = "week": Specifies that the
names of the original set of columns (wk1 to
wk76) are to be stored in a new column named
week.values_to = "rank": Signifies that the
values of the original set of columns will be gathered into a new column
named rank.values_drop_na = TRUE: Ensures that
any resulting rows containing NA in the rank
column are omitted from the billboard2 dataset.billboard2 <- billboard %>%
pivot_longer(
wk1:wk76,
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
billboard2
## # A tibble: 5,307 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
## # ℹ 5,297 more rows
Next, the mutate function is utilized to create and
modify variables within the long-format dataset created in the previous
step.
week = as.integer(gsub("wk", "", week)):
Converts the week column to integer by removing the “wk”
prefix from the values in the week column and then coercing
them to integer.date = as.Date(date.entered) + 7 * (week - 1):
Calculates a new date column by adding the number of weeks
(converted to days) to the date.entered column, allowing
tracking of the specific date related to each week”s data.date.entered = NULL: Removes the
original date.entered column after the new
date column has been created.billboard3 <- billboard2 %>%
mutate(
week = as.integer(gsub("wk", "", week)),
# Adding to dates in R adds days!
date = as.Date(date.entered) + 7 * (week - 1),
date.entered = NULL
)
billboard3
## # A tibble: 5,307 × 5
## artist track week rank date
## <chr> <chr> <int> <dbl> <date>
## 1 2 Pac Baby Don't Cry (Keep... 1 87 2000-02-26
## 2 2 Pac Baby Don't Cry (Keep... 2 82 2000-03-04
## 3 2 Pac Baby Don't Cry (Keep... 3 72 2000-03-11
## 4 2 Pac Baby Don't Cry (Keep... 4 77 2000-03-18
## 5 2 Pac Baby Don't Cry (Keep... 5 87 2000-03-25
## 6 2 Pac Baby Don't Cry (Keep... 6 94 2000-04-01
## 7 2 Pac Baby Don't Cry (Keep... 7 99 2000-04-08
## 8 2Ge+her The Hardest Part Of ... 1 91 2000-09-02
## 9 2Ge+her The Hardest Part Of ... 2 87 2000-09-09
## 10 2Ge+her The Hardest Part Of ... 3 92 2000-09-16
## # ℹ 5,297 more rows
Finally, the arrange function is applied to organize the
dataset based on the artist, track, and
week columns. This operation ensures a coherent and ordered
display of the dataset, making it more manageable and intuitive for
subsequent analysis.
long_billboard_sorted <- billboard3 %>% arrange(artist, track, week)
glimpse(long_billboard_sorted)
## Rows: 5,307
## Columns: 5
## $ artist <chr> "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", "2 Pac", …
## $ track <chr> "Baby Don't Cry (Keep...", "Baby Don't Cry (Keep...", "Baby Don…
## $ week <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
## $ rank <dbl> 87, 82, 72, 77, 87, 94, 99, 91, 87, 92, 81, 70, 68, 67, 66, 57,…
## $ date <date> 2000-02-26, 2000-03-04, 2000-03-11, 2000-03-18, 2000-03-25, 20…
This example code creates a new song data frame holding
unique artist and track combinations from the
billboard3 dataframe, and assigns a unique
song_id to each row (representing each unique song).
song <- billboard3 %>%
distinct(artist, track) %>%
mutate(song_id = row_number())
glimpse(song)
## Rows: 317
## Columns: 3
## $ artist <chr> "2 Pac", "2Ge+her", "3 Doors Down", "3 Doors Down", "504 Boyz"…
## $ track <chr> "Baby Don't Cry (Keep...", "The Hardest Part Of ...", "Krypton…
## $ song_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
The song dataframe is then joined with the
billboard3 dataframe to create a new dataframe
rank that includes the song_id column.
rank <- billboard3 %>%
  left_join(song, by = c("artist", "track")) %>%
select(song_id, date, week, rank)
glimpse(rank)
## Rows: 5,307
## Columns: 4
## $ song_id <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ date <date> 2000-02-26, 2000-03-04, 2000-03-11, 2000-03-18, 2000-03-25, 2…
## $ week <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1…
## $ rank <dbl> 87, 82, 72, 77, 87, 94, 99, 91, 87, 92, 81, 70, 68, 67, 66, 57…
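As a quick sanity check on this decomposition (a minimal sketch, not part of the original example), the two tables can be joined back together; if the split was lossless, the rejoined data should contain the same rows as billboard3.
# Rejoin the normalized tables to verify nothing was lost
billboard_rejoined <- rank %>%
  left_join(song, by = "song_id") %>%
  select(artist, track, week, rank, date)
setequal(billboard_rejoined, billboard3) # should return TRUE
The same workflow can be carried out in pandas: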
import pandas as pd
from datetime import timedelta
billboard = pd.read_csv("billboard.csv")
# Visualizing the structure of the data to understand it better
billboard.head()
## artist track date.entered wk1 wk2 wk3 wk4 ... wk70 wk71 wk72 wk73 wk74 wk75 wk76
## 0 2 Pac Baby Don't Cry (Keep... 2000-02-26 87 82.0 72.0 77.0 ... NaN NaN NaN NaN NaN NaN NaN
## 1 2Ge+her The Hardest Part Of ... 2000-09-02 91 87.0 92.0 NaN ... NaN NaN NaN NaN NaN NaN NaN
## 2 3 Doors Down Kryptonite 2000-04-08 81 70.0 68.0 67.0 ... NaN NaN NaN NaN NaN NaN NaN
## 3 3 Doors Down Loser 2000-10-21 76 76.0 72.0 69.0 ... NaN NaN NaN NaN NaN NaN NaN
## 4 504 Boyz Wobble Wobble 2000-04-15 57 34.0 25.0 17.0 ... NaN NaN NaN NaN NaN NaN NaN
##
## [5 rows x 79 columns]
In this step, the melt function is employed to transform
the billboard dataset from a wide format to a long
format.
- id_vars: Specifies the columns that will remain as identifiers in the long-format dataset.
- var_name = "week": Specifies that the names of the original set of columns (wk1 to wk76) are to be stored in a new column named week.
- value_name = "rank": Signifies that the values of the original set of columns will be gathered into a new column named rank.

billboard2 = billboard.melt(id_vars = ["artist", "track", "date.entered"],
                            var_name = "week",
                            value_name = "rank")
billboard2
## artist track date.entered week rank
## 0 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87.0
## 1 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91.0
## 2 3 Doors Down Kryptonite 2000-04-08 wk1 81.0
## 3 3 Doors Down Loser 2000-10-21 wk1 76.0
## 4 504 Boyz Wobble Wobble 2000-04-15 wk1 57.0
## ... ... ... ... ... ...
## 24087 Yankee Grey Another Nine Minutes 2000-04-29 wk76 NaN
## 24088 Yearwood, Trisha Real Live Woman 2000-04-01 wk76 NaN
## 24089 Ying Yang Twins Whistle While You Tw... 2000-03-18 wk76 NaN
## 24090 Zombie Nation Kernkraft 400 2000-09-02 wk76 NaN
## 24091 matchbox twenty Bent 2000-04-29 wk76 NaN
##
## [24092 rows x 5 columns]
Next, the transform function is utilized to create and
modify variables within the long-format dataset created in the previous
step.
- lambda x: x.str.replace("wk", "").astype(int): Converts the week column to integer by removing the "wk" prefix from the values in the week column and then coercing them to integer.
- pd.to_timedelta(7 * (billboard2["week"] - 1), unit = "D"): Calculates a new date column by adding the number of weeks (converted to days) to the date.entered column, allowing tracking of the specific date related to each week's data.
- drop(columns = ["date.entered"]): Removes the original date.entered column after the new date column has been created.

billboard2["week"] = billboard2["week"].transform(lambda x: x.str.replace("wk", "").astype(int))
billboard2["date"] = pd.to_datetime(billboard2["date.entered"]) + pd.to_timedelta(7 * (billboard2["week"] - 1), unit = "D")
billboard3 = billboard2.drop(columns = ["date.entered"])
billboard3
## artist track week rank date
## 0 2 Pac Baby Don't Cry (Keep... 1 87.0 2000-02-26
## 1 2Ge+her The Hardest Part Of ... 1 91.0 2000-09-02
## 2 3 Doors Down Kryptonite 1 81.0 2000-04-08
## 3 3 Doors Down Loser 1 76.0 2000-10-21
## 4 504 Boyz Wobble Wobble 1 57.0 2000-04-15
## ... ... ... ... ... ...
## 24087 Yankee Grey Another Nine Minutes 76 NaN 2001-10-06
## 24088 Yearwood, Trisha Real Live Woman 76 NaN 2001-09-08
## 24089 Ying Yang Twins Whistle While You Tw... 76 NaN 2001-08-25
## 24090 Zombie Nation Kernkraft 400 76 NaN 2002-02-09
## 24091 matchbox twenty Bent 76 NaN 2001-10-06
##
## [24092 rows x 5 columns]
Finally, the sort_values function is applied to organize
the dataset based on the artist, track, and
week columns. This operation ensures a coherent and ordered
display of the dataset, making it more manageable and intuitive for
subsequent analysis.
long_billboard_sorted = billboard3.sort_values(by = ["artist", "track", "week"])
long_billboard_sorted
## artist track week rank date
## 0 2 Pac Baby Don't Cry (Keep... 1 87.0 2000-02-26
## 317 2 Pac Baby Don't Cry (Keep... 2 82.0 2000-03-04
## 634 2 Pac Baby Don't Cry (Keep... 3 72.0 2000-03-11
## 951 2 Pac Baby Don't Cry (Keep... 4 77.0 2000-03-18
## 1268 2 Pac Baby Don't Cry (Keep... 5 87.0 2000-03-25
## ... ... ... ... ... ...
## 22823 matchbox twenty Bent 72 NaN 2001-09-08
## 23140 matchbox twenty Bent 73 NaN 2001-09-15
## 23457 matchbox twenty Bent 74 NaN 2001-09-22
## 23774 matchbox twenty Bent 75 NaN 2001-09-29
## 24091 matchbox twenty Bent 76 NaN 2001-10-06
##
## [24092 rows x 5 columns]
This example code creates a new song data frame holding
unique artist and track combinations from the
billboard3 dataframe, and assigns a unique
song_id to each row (representing each unique song).
song = billboard3[["artist", "track"]].drop_duplicates().reset_index(drop = True)
song["song_id"] = song.index + 1
song
## artist track song_id
## 0 2 Pac Baby Don't Cry (Keep... 1
## 1 2Ge+her The Hardest Part Of ... 2
## 2 3 Doors Down Kryptonite 3
## 3 3 Doors Down Loser 4
## 4 504 Boyz Wobble Wobble 5
## .. ... ... ...
## 312 Yankee Grey Another Nine Minutes 313
## 313 Yearwood, Trisha Real Live Woman 314
## 314 Ying Yang Twins Whistle While You Tw... 315
## 315 Zombie Nation Kernkraft 400 316
## 316 matchbox twenty Bent 317
##
## [317 rows x 3 columns]
The song dataframe is then merged with the
billboard3 dataframe to create a new dataframe
rank that includes the song_id column.
rank = billboard3.merge(song, on = ["artist", "track"])[["song_id", "date", "week", "rank"]]
rank
## song_id date week rank
## 0 1 2000-02-26 1 87.0
## 1 2 2000-09-02 1 91.0
## 2 3 2000-04-08 1 81.0
## 3 4 2000-10-21 1 76.0
## 4 5 2000-04-15 1 57.0
## ... ... ... ... ...
## 24087 313 2001-10-06 76 NaN
## 24088 314 2001-09-08 76 NaN
## 24089 315 2001-08-25 76 NaN
## 24090 316 2002-02-09 76 NaN
## 24091 317 2001-10-06 76 NaN
##
## [24092 rows x 4 columns]
In this section, we will join the flights dataset with
the weather dataset from the nycflights13 R
package to analyze how weather conditions might have affected the
flights.
Make sure you have loaded in flights.csv and
weather.csv!
library(nycflights13) # R has built-in datasets that can be loaded directly from a library
## Warning: package 'nycflights13' was built under R version 4.3.3
import pandas as pd
flights = pd.read_csv("flights.csv")
weather = pd.read_csv("weather.csv")
In an inner join, only the rows with matching keys in both data frames are returned; rows with non-matching keys are excluded from the result. Use an inner join when you want to combine datasets on common key columns and are only interested in rows with matching keys in both.
# R example
flights_weather_inner_joined <- inner_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
# Python example
flights_weather_inner_merged = pd.merge(flights, weather, how = "inner", on = ["year", "month", "day", "hour", "origin"])
A left join returns all rows from the left dataset and the matched
rows from the right dataset. If there is no match found in the right
dataset, then the result will contain NA. Use a left join
when you want to retain all records from the “left” dataset, and add
matching records from the “right” dataset where available.
# R example
flights_weather_left_joined <- left_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
# Python example
flights_weather_left_merged = pd.merge(flights, weather, how = "left", on = ["year", "month", "day", "hour", "origin"])
In a right join, all rows from the right dataset and the matched rows
from the left dataset are returned. If there is no match found in the
left dataset, then the result will contain NA. It is the
opposite of a left join and is used when you want to retain all records
from the “right” dataset.
# R example
flights_weather_right_joined <- right_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
# Python example
flights_weather_right_merged = pd.merge(flights, weather, how = "right", on = ["year", "month", "day", "hour", "origin"])
A full join returns all rows from both datasets, matched where possible. If there is no match found in either dataset, the result will contain NA. It is useful when you want to retain all records from both datasets.
# R example
flights_weather_full_joined <- full_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
head(flights_weather_full_joined)
## # A tibble: 6 × 29
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 21 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour.x <dttm>, temp <dbl>, dewp <dbl>,
## # humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour.y <dttm>
# Python example
flights_weather_full_merged = pd.merge(flights, weather, how = "outer", on = ["year", "month", "day", "hour", "origin"])
flights_weather_full_merged.head()
## year month day dep_time sched_dep_time ... wind_gust precip pressure visib time_hour_y
## 0 2013 1 1 NaN NaN ... NaN 0.0 1012.0 10.0 2013-01-01 01:00:00
## 1 2013 1 1 NaN NaN ... NaN 0.0 1012.6 10.0 2013-01-01 01:00:00
## 2 2013 1 1 NaN NaN ... 23.0156 0.0 1011.9 10.0 2013-01-01 01:00:00
## 3 2013 1 1 NaN NaN ... NaN 0.0 1012.3 10.0 2013-01-01 02:00:00
## 4 2013 1 1 NaN NaN ... NaN 0.0 1012.4 10.0 2013-01-01 02:00:00
##
## [5 rows x 29 columns]
Use the inner joined dataset to calculate the average departure delay for flights with precipitation greater than 0.5 inches.
# R example
average_delay_per_condition <- flights_weather_inner_joined %>%
group_by(precip > 0.5) %>%
summarise(Average_Departure_Delay = mean(dep_delay, na.rm = TRUE))
average_delay_per_condition
## # A tibble: 2 × 2
## `precip > 0.5` Average_Departure_Delay
## <lgl> <dbl>
## 1 FALSE 12.6
## 2 TRUE 48.8
# Python example
average_delay_per_condition = flights_weather_inner_merged.groupby(flights_weather_inner_merged["precip"] > 0.5).agg(Average_Departure_Delay = ("dep_delay", "mean"))
average_delay_per_condition
## Average_Departure_Delay
## precip
## False 12.619899
## True 48.800000
An anti join returns rows from the left dataset where there are no matching keys in the right dataset. It’s useful for identifying records in one dataset that do not have a counterpart in another dataset.
# R example
flights_weather_anti_joined <- anti_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
head(flights_weather_anti_joined)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 1153 1200 -7 1450 1529
## 2 2013 1 1 1154 1200 -6 1253 1306
## 3 2013 1 1 1155 1200 -5 1517 1510
## 4 2013 1 1 1155 1200 -5 1312 1315
## 5 2013 1 1 1157 1200 -3 1452 1456
## 6 2013 1 1 1158 1200 -2 1256 1300
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
In Python, the anti merge operation can be achieved by performing a
left merge and then filtering out the rows that have no match in the
right dataset. The indicator parameter is set to
True to create the _merge column. The
_merge column is used to identify the rows that have no
match in the right dataset with the value left_only.
# Python example
flights_weather_left_merged = pd.merge(flights, weather, how = "left", on = ["year", "month", "day", "hour", "origin"], indicator = True)
flights_weather_anti_merged = flights_weather_left_merged[(flights_weather_left_merged["_merge"] == "left_only")].drop("_merge", axis = 1)
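# The weather columns are entirely NA in the anti-merged rows, so dropping
# all-NA columns below leaves only the original flight columns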
flights_weather_anti_merged_cleaned = flights_weather_anti_merged.dropna(axis = 1, how = 'all')
flights_weather_anti_merged_cleaned.head()
## year month day dep_time sched_dep_time dep_delay ... dest air_time distance hour minute time_hour_x
## 292 2013 1 1 1153.0 1200 -7.0 ... LAX 330.0 2475 12 0 2013-01-01 12:00:00
## 293 2013 1 1 1154.0 1200 -6.0 ... BOS 40.0 200 12 0 2013-01-01 12:00:00
## 295 2013 1 1 1155.0 1200 -5.0 ... LAX 353.0 2475 12 0 2013-01-01 12:00:00
## 298 2013 1 1 1155.0 1200 -5.0 ... DCA 57.0 213 12 0 2013-01-01 12:00:00
## 301 2013 1 1 1157.0 1200 -3.0 ... MCO 151.0 937 12 0 2013-01-01 12:00:00
##
## [5 rows x 19 columns]
We will use the weather and flights datasets from the
nycflights13 package (also provided as .csv
files for Python users) for the exercises.
Please see the nycflights13 documentation for more information about the datasets.
1. Filter the flights dataset to include only flights with a delay of more than 12 hours. Group and count this output by origin and sort the result in descending order. Your output should be a data.frame showing the number of flights delayed over 12 hours by airport, ordered from most to least.
2. Calculate the average air time for flights from JFK to LAX using the flights data set. Make sure to report this result in hours. Your output should be a data.frame with a single row showing the average air time in hours and the number of flights from JFK to LAX.
3. Using the airports dataset, report the frequency of the time zones of destinations in descending order. Additionally, find an example of an airport with a missing time zone and report the name of the airport, explaining how you checked for it. Your output should be a data.frame listing the time zones by frequency in descending order.
4. Find the three months with the highest average departure delay. Your output should be a data.frame showing the top 3 months along with their respective average, minimum, and maximum departure delay values.
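As a starting point for the first exercise, here is a minimal sketch using the same dplyr verbs demonstrated above. It assumes dep_delay is recorded in minutes, so a 12-hour delay corresponds to 720 minutes; treat it as one possible approach rather than the official solution.
# Possible sketch: flights delayed more than 12 hours, counted by origin
delays_by_origin <- flights %>%
  filter(dep_delay > 12 * 60) %>% # dep_delay is in minutes
  count(origin, sort = TRUE)      # one row per airport, descending count
delays_by_origin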