This week’s assignment can be found here.

Introduction

In this document, we will explore two significant aspects of data analysis: Data Input/Output (IO) and String Wrangling with Regular Expressions (Regex). Data IO is fundamental as it’s the first step in the data analysis pipeline, where we read or load data from various sources. String wrangling, particularly using Regex, is vital when we are dealing with text data and need to extract, replace, or modify strings based on specific patterns.

Data IO in R

Data can be read from various sources such as CSV files, Excel files, or databases. Below are some ways to read data into R.

Using `read.csv` from Base R

# Reading a CSV file using read.csv from base R
data_base_r <- read.csv("places_in_princeton.csv")

# Inspecting the first few rows of the data
head(data_base_r)

##   id                            name                                address
## 1  1                     Nassau Hall     1 Nassau Hall, Princeton, NJ 08544
## 2  2 Princeton University Art Museum            Elm Dr, Princeton, NJ 08544
## 3  3           Albert Einstein House     112 Mercer St, Princeton, NJ 08540
## 4  4        Princeton Public Library 65 Witherspoon St, Princeton, NJ 08542
## 5  5         McCarter Theatre Center  91 University Pl, Princeton, NJ 08540
## 6  6                   Marquand Park      68 Lovers Ln, Princeton, NJ 08540
##                                             comment
## 1                 Historical building in Princeton.
## 2 A place with a vast and varied collection of art.
## 3                      Albert Einstein's residence.
## 4                    The hub of community learning.
## 5            Famous for its performances and shows.
## 6        A peaceful place to walk and enjoy nature.

Using `read_csv` from `readr`

pak::pkg_install("readr")
library(readr)

# Reading a CSV file using read_csv from readr package
data_readr <- read_csv("places_in_princeton.csv")

## Rows: 20 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, address, comment
## dbl (1): id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Inspecting the first few rows of the data
head(data_readr)

## # A tibble: 6 × 4
##      id name                            address                          comment
##   <dbl> <chr>                           <chr>                            <chr>  
## 1     1 Nassau Hall                     1 Nassau Hall, Princeton, NJ 08… Histor…
## 2     2 Princeton University Art Museum Elm Dr, Princeton, NJ 08544      A plac…
## 3     3 Albert Einstein House           112 Mercer St, Princeton, NJ 08… Albert…
## 4     4 Princeton Public Library        65 Witherspoon St, Princeton, N… The hu…
## 5     5 McCarter Theatre Center         91 University Pl, Princeton, NJ… Famous…
## 6     6 Marquand Park                   68 Lovers Ln, Princeton, NJ 085… A peac…

write_delim(data_readr, "places_in_princeton.tsv", delim = "\t")

Data IO in Python

Data can be read from various sources such as CSV files, Excel files, or databases. Below are some ways to read data into Python.

Using `csv.reader` from `csv`

# Reading a CSV file using csv.reader in Pythom
import csv
from itertools import islice

with open("places_in_princeton.csv", mode = "r") as file:
  data = csv.reader(file)

  # To print the data
  for lines in islice(data,6):
    print(lines) # notice how the first line here is the row of headers

## ['id', 'name', 'address', 'comment']
## ['1', 'Nassau Hall', '1 Nassau Hall, Princeton, NJ 08544', 'Historical building in Princeton.']
## ['2', 'Princeton University Art Museum', 'Elm Dr, Princeton, NJ 08544', 'A place with a vast and varied collection of art.']
## ['3', 'Albert Einstein House', '112 Mercer St, Princeton, NJ 08540', "Albert Einstein's residence."]
## ['4', 'Princeton Public Library', '65 Witherspoon St, Princeton, NJ 08542', 'The hub of community learning.']
## ['5', 'McCarter Theatre Center', '91 University Pl, Princeton, NJ 08540', 'Famous for its performances and shows.']

Using `read_csv` from `pandas`

# Reading a CSV file using pandas in Python
import pandas

data = pandas.read_csv("places_in_princeton.csv")

# Inspecting the first three rows of the data
data.head(6)

##    id  ...                                            comment
## 0   1  ...                  Historical building in Princeton.
## 1   2  ...  A place with a vast and varied collection of art.
## 2   3  ...                       Albert Einstein's residence.
## 3   4  ...                     The hub of community learning.
## 4   5  ...             Famous for its performances and shows.
## 5   6  ...         A peaceful place to walk and enjoy nature.
## 
## [6 rows x 4 columns]

A Quick Regex Guide

Regular Expressions (Regex) are a powerful tool to match and manipulate text and can be used across languages. Here is a brief guide on some of the basic regex symbols and patterns with examples:

\\d: Matches any digit.
- Example: "\\d" will match any single digit like 5 in "Price: $5".
\\w: Matches any word character (alphanumeric + underscore).
- Example: "\\w" will match any single word character like a in "apple".
\\b: Word boundary.
- Example: "\\bword\\b" will match the word "word" but not "swordfish".
+: Matches one or more of the preceding character or group.
- Example: "\\d+" will match one or more digits like 123 in "123 apples".
*: Matches zero or more of the preceding character or group.
- Example: "a*" will match any number of consecutive a’s, including none, as in "baaa!".
(?:): Non-capturing group.
- Example: "\\d(?:\\.\\d)?" will match 1 or 1.5 but not .5.
^: Start of a line.
- Example: "^Start" will match any line that begins with the word "Start".
$: End of a line.
- Example: "end$" will match any line that ends with the word "end".
[]: Defines a character class.
- Example: "[aeiou]" will match any vowel in "apple".
|: Acts as an OR operator.
- Example: "apple|orange" will match either "apple" or "orange" in a given text.
\\s: Matches any whitespace character.
- Example: "\\s" will match the space in "apple orange".
{n}: Matches exactly n occurrences of the preceding character or group.
- Example: "\\d{3}" will match 123 in "12345".
{n,}: Matches n or more occurrences of the preceding character or group.
- Example: "\\d{2,}" will match 123 in "12345", but not 1.
{n,m}: Matches between n and m occurrences of the preceding character or group.
- Example: "\\d{2,4}" will match 123 and 12 in "12345".

By understanding and combining these patterns, you can create complex expressions to match a wide range of strings within your text data.

String Wrangling in R

Installing and loading the `stringr` package

pak::pkg_install("stringr")
library(stringr)

`stringr` examples

Extracting digits from strings can be crucial to isolate specific numerical information such as prices or zip codes.

strings <- c("123 Main St", "Price: $200 0")
digits <- str_extract_all(strings, "\\b\\d+\\b") # notice how only one backslash is needed for regex in Python

digits

## [[1]]
## [1] "123"
## 
## [[2]]
## [1] "200" "0"

`tidyr` examples

Exampled pulled from: https://r4ds.hadley.nz/regexps#sec-extract-variables

pak::pkg_install("tidyr")
library(tidyr)

df <- tribble(
  ~str,
  "<Sheryl>-F_34",
  "<Kisha>-F_45",
  "<Brandon>-N_33",
  "<Sharon>-F_38",
  "<Penny>-F_58",
  "<Justin>-M_41",
  "<Patricia>-F_84",
)

df |>
  separate_wider_regex(
    str,
    patterns = c(
      "<", # Match the literal character '<'.
      name = "[A-Za-z]+", # Match one or more alphabets (upper or lower case) and create a new column 'name' with the matched value.
      ">-", # Match the literal string '>-'.
      gender = ".", # Match any single character (except newline) and create a new column 'gender' with the matched value.
      "_", # Match the literal character '_'.
      age = "[0-9]+" # Match one or more digits and create a new column 'age' with the matched value.
    )
  )

## # A tibble: 7 × 3
##   name     gender age  
##   <chr>    <chr>  <chr>
## 1 Sheryl   F      34   
## 2 Kisha    F      45   
## 3 Brandon  N      33   
## 4 Sharon   F      38   
## 5 Penny    F      58   
## 6 Justin   M      41   
## 7 Patricia F      84

String Wrangling in Python

`re` examples

Extracting digits from strings can be crucial to isolate specific numerical information such as prices or zip codes.

import re

strings = ["123 Main St", "Price: $200 0"]
digits = re.findall(r"\b\d+\b", str(strings))

digits

## ['123', '200', '0']

`pandas` examples

import pandas as pd

# Create a pandas series
sr = pd.Series([
  "<Sheryl>-F_34",
  "<Kisha>-F_45",
  "<Brandon>-N_33",
  "<Sharon>-F_38",
  "<Penny>-F_58",
  "<Justin>-M_41",
  "<Patricia>-F_84",
])

# Pattern breakdown:
# <(.+)>: captures any characters (except newline character) inside the angle brackets
# -([FMN]): returns a match for any of the characters F, M, or N after the hyphen
# _(\d+): returns a match where the string contains digits after the underscore

df = sr.str.extract(r"<(.+)>-([FMN])_(\d+)")

# Assign column names
df.columns = ["name", "gender", "age"]

# Convert "age" column to integer type
df["age"] = df["age"].astype(int)

df

##        name gender  age
## 0    Sheryl      F   34
## 1     Kisha      F   45
## 2   Brandon      N   33
## 3    Sharon      F   38
## 4     Penny      F   58
## 5    Justin      M   41
## 6  Patricia      F   84

Exercises

Exercise 1: Reading TSV Files

A TSV file, places_in_princeton.tsv, uses a tab character as a delimiter between values. Your task is to read this file into R or Python using an appropriate reading function.

Exercise 2. Regex to English

Explain what this regex patters does: “\d+[-]\d+[-]\d+\s” in words. What is an example of something it matches? Leave your response as a comment block in your working script or file.

Exercise 3: Extracting the parts of an address

With the address column from the data from Exercise 1, we can extract ZIP codes as follows:

# in R
zip_codes <- str_extract(data_tsv$address, "\\b\\d{5}(?:-\\d{4})?\\b")

# in Python
zip_codes = data_tsv["address"].str.extract(r"\b(\d{5}(?:-\d{4})?)\b")

Now, extract the street address from the address string and search each comment for the word “Famous” to determine which locations are famous in Princeton. Follow this up by searching for “historic.” Make sure that your search is case-insensitive and captures when historic is a substring as well, like in “historical”.

Exercise 4: More string extraction

Identify locations with a number in their names and extract that number.
Extract all words starting with ‘a’ or ‘A’ from the comment column.

Exercise 5: Place names

Which Princeton place name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)

Exercise 6: Extracting month, day, year from a date variable

Create a dataframe called flights from nycflights.csv and create the following variable date_strings:

date_strings <- paste(flights$year, flights$month, flights$day, sep = "-")

date_strings = flights[["year", "month", "day"]].agg("-".join, axis = 1)

Extract the months, days, and years from your new variable into three separate variables for each date component

EEB330 – Precept 03: Data IO and String Wrangling with Regex in R

Scott Wolf, Michelle White

Sept. 26th, 2024

Introduction

Data IO in R

Using `read.csv` from Base R

Using `read_csv` from `readr`

Data IO in Python

Using `csv.reader` from `csv`

Using `read_csv` from `pandas`

A Quick Regex Guide

String Wrangling in R

Installing and loading the `stringr` package

`stringr` examples

`tidyr` examples

String Wrangling in Python

`re` examples

`pandas` examples

Exercises

Exercise 1: Reading TSV Files

Exercise 2. Regex to English

Exercise 3: Extracting the parts of an address

Exercise 4: More string extraction

Exercise 5: Place names

Exercise 6: Extracting month, day, year from a date variable

EEB330 – Precept 03: Data IO and String Wrangling with Regex in R

Scott Wolf, Michelle White

Sept. 26th, 2024

Introduction

Data IO in R

Using read.csv from Base R

Using read_csv from readr

Data IO in Python

Using csv.reader from csv

Using read_csv from pandas

A Quick Regex Guide

String Wrangling in R

Installing and loading the stringr package

stringr examples

tidyr examples

String Wrangling in Python

re examples

pandas examples

Exercises

Exercise 1: Reading TSV Files

Exercise 2. Regex to English

Exercise 3: Extracting the parts of an address

Exercise 4: More string extraction

Exercise 5: Place names

Exercise 6: Extracting month, day, year from a date variable

Using `read.csv` from Base R

Using `read_csv` from `readr`

Using `csv.reader` from `csv`

Using `read_csv` from `pandas`

Installing and loading the `stringr` package

`stringr` examples

`tidyr` examples

`re` examples

`pandas` examples