class: center, middle, inverse, title-slide

.title[
# Working with text data
]
.subtitle[
## Regular expressions and other tools
]
.author[
### Guillaume Falmagne, Scott Wolf, Diogo Melo
]
.date[
### Sept. 25th, 2024
]

---

# Working with Strings

### What are strings again?

- Strings are sequences of characters
- We use them to represent text

``` r
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
```

If you forget to close a quote, you'll see `+`, the continuation prompt:

```r
> "This is a string without a closing quote
+
+
+ HELP I'M STUCK IN A STRING
```

- Special characters can be "escaped" with a `\`

``` r
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
```

---

# Other special characters

### New lines and tabs

- `\n` is a new line
- `\t` is a tab
- `\b` is a backspace (it only means "word boundary" inside a regular expression)

Other unicode characters can be generated using the `\u` or `\U` prefix:

``` r
x <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
x
```

```
## [1] "one\ntwo" "one\ttwo" "µ" "😄"
```

``` r
library(stringr)
str_view(x)
```

```
## [1] │ one
##     │ two
## [2] │ one{\t}two
## [3] │ µ
## [4] │ 😄
```

---

# Basic string manipulation

.pull-left[

Gluing strings together with the `paste()` function:

``` r
s1 = "Hello"
s2 = "Everyone!"
paste(s1, s2)
```

```
## [1] "Hello Everyone!"
```

We can specify the separator with the `sep` argument:

``` r
paste(s1, s2, sep = ".")
```

```
## [1] "Hello.Everyone!"
```

Use the `paste0()` function if you don't want any separator:

``` r
paste0(s1, s2) # same as paste(s1, s2, sep = "")
```

```
## [1] "HelloEveryone!"
```

]

--

.pull-right[

We can pass vectors to `paste()`

``` r
s1 = "Hi "
s2 = c("Pol", "Mike", "Carol")
paste0(s1, s2, "!")
```

```
## [1] "Hi Pol!" "Hi Mike!" "Hi Carol!"
```

We can use the `collapse` argument to merge strings in a vector:

``` r
paste(s2, collapse = "-")
```

```
## [1] "Pol-Mike-Carol"
```

]

---

# Find and Replace

### Find

The `grep()` function can be used to search for a pattern in a character vector:

``` r
fruits <- c("apple", "banana", "cherry", "date")
# Find indices of fruits containing 'an'
grep("an", fruits)
```

```
## [1] 2
```

### Replace

The `gsub()` function replaces every occurrence of a pattern in a string:

``` r
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
gsub("quote", "banana", string2)
```

```
## [1] "If I want to include a \"banana\" inside a string, I use single bananas"
```

---

class: inverse, center, middle

# Regular expressions

---

# Regular expressions

.pull-left[

- Think "super advanced find"
- Very useful for data clean-up
- The `stringr` package in R provides functions to work with regular expressions.
- We use a weird syntax to specify the general structure of the thing we are searching for.
- This is called a `pattern`
- These slides draw heavily from [https://r4ds.hadley.nz/regexps](https://r4ds.hadley.nz/regexps)

]

.pull-right[

]

---

class: inverse, center, middle

# Regular expressions are super confusing!

---

# Basic matching

.pull-left[

``` r
library(stringr)
length(fruit)
```

```
## [1] 80
```

``` r
fruit[1:4]
```

```
## [1] "apple" "apricot" "avocado" "banana"
```

]

.pull-right[

- Let's look for entries that match the pattern: `berry`

``` r
str_view(fruit, "berry")
```

```
##  [6] │ bil<berry>
##  [7] │ black<berry>
## [10] │ blue<berry>
## [11] │ boysen<berry>
## [19] │ cloud<berry>
## [21] │ cran<berry>
## [29] │ elder<berry>
## [32] │ goji <berry>
## [33] │ goose<berry>
## [38] │ huckle<berry>
## [50] │ mul<berry>
## [70] │ rasp<berry>
## [73] │ salal <berry>
## [76] │ straw<berry>
```

]

---

# Special characters

- Letters and numbers match exactly and are called **literal characters**.
- Most punctuation characters, like `.`, `+`, `*`, `[`, `]`, and `?`, have special meanings and are called **metacharacters**.
- Some common regex metacharacters:
  - `.`: Matches any character (except a newline, by default).
  - `*`: Matches 0 or more occurrences of the preceding item.
  - `+`: Matches 1 or more occurrences of the preceding item.
  - `?`: Matches 0 or 1 occurrence of the preceding item.
  - `[ ]`: Character class.
  - `|`: Alternation (OR).
  - `^`: Anchors to the start of the string.
  - `$`: Anchors to the end of the string.

---

# The Wild Card character `.`

Look for the letter `a`, followed by any 3 characters, followed by the letter `e`

``` r
str_view(fruit, "a...e")
```

```
##  [1] │ <apple>
##  [7] │ bl<ackbe>rry
## [48] │ mand<arine>
## [51] │ nect<arine>
## [62] │ pine<apple>
## [64] │ pomegr<anate>
## [70] │ r<aspbe>rry
## [73] │ sal<al be>rry
```

---

# Quantifiers `?`, `+`, `*` and `{}`

.pull-left[

Quantifiers are metacharacters that control how many times the preceding item may repeat.

- `?`: Matches 0 or 1 occurrence.
- `+`: Matches 1 or more occurrences.
- `*`: Matches 0 or more occurrences.
- `{n}` matches exactly n times.
- `{n,}` matches at least n times.
- `{n,m}` matches between n and m times.

``` r
string_vec = c("a", "ab", "abb")
# ab? matches an "a",
# optionally followed by a "b".
str_view(string_vec, "ab?")
```

```
## [1] │ <a>
## [2] │ <ab>
## [3] │ <ab>b
```

]

.pull-right[

``` r
# ab+ matches an "a",
# followed by at least one "b".
str_view(string_vec, "ab+")
```

```
## [2] │ <ab>
## [3] │ <abb>
```

``` r
# ab* matches an "a",
# followed by any number of "b"s.
str_view(string_vec, "ab*")
```

```
## [1] │ <a>
## [2] │ <ab>
## [3] │ <abb>
```

]

---

# Character classes

Character classes are defined by `[]` and let you match a set of characters

.pull-left[

- `[abcd]` matches `a`, `b`, `c`, or `d`
- `[^abc]` matches any character except `a`, `b`, or `c`

``` r
str_view(words, "[aeiou]x[aeiou]")
```

```
## [284] │ <exa>ct
## [285] │ <exa>mple
## [288] │ <exe>rcise
## [289] │ <exi>st
```

``` r
x <- "abcd ABCD 12345 -!@#%."
str_view(x, "[^abcd]+")
```

```
## [1] │ abcd< ABCD 12345 -!@#%.>
```

]

--

.pull-right[

- `-` defines a range, e.g., `[a-z]` matches any lower case letter and `[0-9]` matches any digit.

``` r
str_view(x, "[a-z]+")
```

```
## [1] │ <abcd> ABCD 12345 -!@#%.
```

``` r
str_view(x, "[a-z0-9]+")
```

```
## [1] │ <abcd> ABCD <12345> -!@#%.
```

]

---

# Alternation using `|`

Looking for one of 3 patterns:

``` r
str_view(fruit, "apple|melon|nut")
```

```
##  [1] │ <apple>
## [13] │ canary <melon>
## [20] │ coco<nut>
## [52] │ <nut>
## [62] │ pine<apple>
## [72] │ rock <melon>
## [80] │ water<melon>
```

- Fruits with double vowels?

``` r
str_view(fruit, "aa|ee|ii|oo|uu")
```

```
##  [9] │ bl<oo>d orange
## [33] │ g<oo>seberry
## [47] │ lych<ee>
## [66] │ purple mangost<ee>n
```

---

# Anchor characters `^` and `$`

.pull-left[

Fruits that contain `pl`

``` r
str_view(fruit, "pl")
```

```
##  [1] │ ap<pl>e
## [28] │ egg<pl>ant
## [62] │ pineap<pl>e
## [63] │ <pl>um
## [66] │ pur<pl>e mangosteen
```

Fruits that start with `pl`

``` r
str_view(fruit, "^pl")
```

```
## [63] │ <pl>um
```

]

--

.pull-right[

Words that end with `es`

``` r
str_view(words, "es$")
```

```
## [161] │ clos<es>
## [976] │ y<es>
```

Words that begin with `b` and end with `k`

``` r
str_view(words, "^b.*k$")
```

```
##  [67] │ <back>
##  [72] │ <bank>
##  [95] │ <black>
## [103] │ <book>
## [110] │ <break>
```

]

---

# Escaping inside regex

.center[ .font200[**Sorry**] ]

.pull-left[

Inside regular expressions, the usual escape symbol `\` has several uses

- The main one is searching for explicit instances of the metacharacters. Ex: To find the pattern `a.c`
- In order to match a literal `.`, you need an escape, which tells the regular expression to treat the metacharacter literally.
- So, to match a `.`, you need the regexp `\.`
- Unfortunately this creates a problem. We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings.
So to create the regular expression `\.` we need the string `\\.`

]

.pull-right[

``` r
# To create the regular expression \.,
# we need to use \\.
dot <- "\\."
# But the expression itself only contains one \
str_view(dot)
```

```
## [1] │ \.
```

``` r
# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
```

```
## [2] │ <a.c>
```

]

---

# Searching for `\`

If we want to search for `\` it is even worse!

- To match a literal `\`, you need to escape it, creating the regular expression `\\`.
- To create that regular expression, you need to use a string, which also needs to escape `\`.
- That means to match a literal `\` you need to write `\\\\`: **you need four backslashes to match one!**

``` r
x <- "a\\b"
str_view(x)
```

```
## [1] │ a\b
```

``` r
str_view(x, "\\\\")
```

```
## [1] │ a<\>b
```

---

# Why are you doing this to me? This is awful

.center[]

.pull-left[

We can also use raw (literal) strings to avoid this mess:

- Raw strings are delimited by `r"{` ... `}"`

``` r
str_view(x, r"{\\}")
```

```
## [1] │ a<\>b
```

]

.pull-right[

- For other metacharacters, we can use character classes:

``` r
x = c("abc", "a.c", "a*c", "a c")
str_view(x, "a[.]c")
```

```
## [2] │ <a.c>
```

]

---

# Common short-cut characters

.pull-left[

There is also a set of special characters that we can use to build our patterns:

- `\d` matches any digit;
- `\D` matches anything that isn't a digit.
- `\s` matches any whitespace (e.g., space, tab, newline);
- `\S` matches anything that isn't whitespace.
- `\w` matches any "word" character, i.e. letters, digits, and the underscore;
- `\W` matches any "non-word" character.

``` r
x <- "abcd ABCD 12345 -!@#%."
str_view(x, "\\d+")
```

```
## [1] │ abcd ABCD <12345> -!@#%.
```

]

--

.pull-right[
.code80[

``` r
str_view(x, "\\D+")
```

```
## [1] │ <abcd ABCD >12345< -!@#%.>
```

``` r
str_view(x, "\\s+")
```

```
## [1] │ abcd< >ABCD< >12345< >-!@#%.
```

``` r
str_view(x, "\\S+")
```

```
## [1] │ <abcd> <ABCD> <12345> <-!@#%.>
```

``` r
str_view(x, "\\w+")
```

```
## [1] │ <abcd> <ABCD> <12345> -!@#%.
```

``` r
str_view(x, "\\W+")
```

```
## [1] │ abcd< >ABCD< >12345< -!@#%.>
```

]
]

---

# Grouping and capturing

Parentheses have two uses in regex:

.pull-left[

(1) Setting precedence (like in math expressions)

- `ab+` matches `a` followed by at least one `b`
- `(ab)+` matches at least one `ab`

``` r
str_view(c("aab", "abab", "abb"), "ab+")
```

```
## [1] │ a<ab>
## [2] │ <ab><ab>
## [3] │ <abb>
```

``` r
str_view(c("aab", "abab", "abb"), "(ab)+")
```

```
## [1] │ a<ab>
## [2] │ <abab>
## [3] │ <ab>b
```

]

--

.pull-right[

(2) Referring back to previous matches, or **capturing**

- Each parenthesized group gets a number: `\1` for the first, `\2` for the second...

``` r
# Fruits with repeated pairs of letters:
str_view(fruit, "(\\w{2})\\1")
```

```
##  [4] │ b<anan>a
## [20] │ <coco>nut
## [22] │ <cucu>mber
## [41] │ <juju>be
## [56] │ <papa>ya
## [73] │ s<alal> berry
```

]

---

# Capturing and replacing

- Capturing is exceptionally powerful for replacing text
- We can use regex in the `str_replace()` function
- For example: switching the order of the second and third words in sentences:

``` r
sentences[1:5] |>
  str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") |>
  str_view()
```

```
## [1] │ The canoe birch slid on the smooth planks.
## [2] │ Glue sheet the to the dark blue background.
## [3] │ It's to easy tell the depth of a well.
## [4] │ These a days chicken leg is a rare dish.
## [5] │ Rice often is served in round bowls.
```

---

# Other useful `stringr` functions

- `str_detect()`: returns a logical vector of the same length as the initial vector, `TRUE` wherever the pattern matches

``` r
str_detect(c("a", "b", "c"), "[aeiou]")
```

```
## [1] TRUE FALSE FALSE
```

- Example with the `filter()` function, which we will see in the Data Wrangling class

``` r
library(tidyverse)
```

```
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```

``` r
library(babynames)
babynames |>
  filter(str_detect(name, "x")) |>
  count(name, wt = n, sort = TRUE)
```

```
## # A tibble: 974 × 2
##    name            n
##    <chr>       <int>
##  1 Alexander  665492
##  2 Alexis     399551
##  3 Alex       278705
##  4 Alexandra  232223
##  5 Max        148787
##  6 Alexa      123032
##  7 Maxine     112261
##  8 Alexandria  97679
##  9 Maxwell     90486
## 10 Jaxon       71234
## # ℹ 964 more rows
```

---

# Other useful `stringr` functions

.pull-left[

- The next step up in complexity from `str_detect()` is `str_count()`
- Rather than a true or false, it tells you how many matches there are in each string.

``` r
x <- c("apple", "banana", "pear")
str_count(x, "p")
```

```
## [1] 2 0 1
```

]

--

.pull-right[

- Pairs naturally with `mutate()`, which again we will revisit in the Data Wrangling class.
``` r
# Note: the patterns are case-sensitive, so an initial "A"
# is counted as a consonant, not a vowel.
babynames |>
  count(name) |>
  mutate(
    vowels = str_count(name, "[aeiou]"),
    consonants = str_count(name, "[^aeiou]")
  )
```

```
## # A tibble: 97,310 × 4
##    name          n vowels consonants
##    <chr>     <int>  <int>      <int>
##  1 Aaban        10      2          3
##  2 Aabha         5      2          3
##  3 Aabid         2      2          3
##  4 Aabir         1      2          3
##  5 Aabriella     5      4          5
##  6 Aada          1      2          2
##  7 Aadam        26      2          3
##  8 Aadan        11      2          3
##  9 Aadarsh      17      2          5
## 10 Aaden        18      2          3
## # ℹ 97,300 more rows
```

]

---

# Replacing using regex

We have two match-replace functions:

- `str_replace()`: replaces the first match
- `str_replace_all()`: replaces all matches

``` r
x <- c("apple", "pear", "banana")
str_replace_all(x, "[aeiou]", "-")
```

```
## [1] "-ppl-" "p--r" "b-n-n-"
```

---

class: inverse, center, middle

# Regular expressions in Python

---

# Regular expressions in general

Regular expression syntax is largely shared across programming languages.

- Same basic rules (special characters, escape characters, etc.)
- Mostly the same matching semantics

The functions for string comparison, search, and replacement vary from language to language.

- R: `stringr`
- Python: `re`
- Matlab: `regexp()`
- Julia: `match()`

---

# The `re` library

Python also has regex utilities; they're in the `re` library.
`re` ships with Python (no need to install it)

``` python
import re
```

Python (`re`) and R (`stringr`) equivalents:

| Function | R | Python |
|----------|---|--------|
| find (first) | `str_extract` | `re.search` |
| find (all) | `str_extract_all` | `re.findall` |
| find at beginning | `str_starts` | `re.match` |
| replace (all) | `str_replace_all` | `re.sub` |

---

# Using the `re` library

Some examples

``` python
import re
email = "tony@tiremove_thisger.net"
m = re.search("remove_this", email)
print(m)
```

```
## <re.Match object; span=(7, 18), match='remove_this'>
```

``` python
email[:m.start()] + email[m.end():]
```

```
## 'tony@tiger.net'
```

``` python
import re
re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
```

```
## ['foot', 'fell', 'fastest']
```

---

# Using `pandas`

We can also extract strings from `pandas` `Series` objects (think: dataframe columns)

2 functions to know:

1. `extract`: return the first match
2. `extractall`: return all matches

These functions return dataframes.

``` python
import pandas as pd
s = pd.Series(['a1', 'b2', 'c3'])
print(s.str.extract(r'([ab])')) # 1 capture group, returns 1 column
```

```
##      0
## 0    a
## 1    b
## 2  NaN
```

``` python
s.str.extract(r'([ab])(\d)') # 2 capture groups, returns 2 columns
```

```
##      0    1
## 0    a    1
## 1    b    2
## 2  NaN  NaN
```

---

# Using `pandas`

Can also do the same on dataframe columns.

``` python
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
print(iris.species.unique())
```

```
## ['setosa' 'versicolor' 'virginica']
```

``` python
pd.concat([iris['species'], iris.species.str.extract(r'(^v)')], axis=1)
```

```
##        species    0
## 0       setosa  NaN
## 1       setosa  NaN
## 2       setosa  NaN
## 3       setosa  NaN
## 4       setosa  NaN
## ..         ...  ...
## 145  virginica    v
## 146  virginica    v
## 147  virginica    v
## 148  virginica    v
## 149  virginica    v
##
## [150 rows x 2 columns]
```
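
---

# Capturing and replacing in Python

The R/Python equivalents table and the word-swapping `str_replace()` example can be tied together in a short sketch using only the standard-library `re` module. This is an illustration under stated assumptions, not a definitive recipe: the input is assumed to be the first example sentence in its original word order, and the variable names are made up for this slide. Patterns are written as raw strings (`r"..."`), so a backreference is just `\1` rather than `\\1`.

```python
import re

# Assumed input: the first example sentence, before any words are swapped
sentence = "The birch canoe slid on the smooth planks."

# re.search: first match anywhere in the string (roughly str_extract)
first_word = re.search(r"\w+", sentence).group()
print(first_word)  # The

# re.findall: every match (roughly str_extract_all); here, words containing "o"
words_with_o = re.findall(r"\w*o\w*", sentence)
print(words_with_o)  # ['canoe', 'on', 'smooth']

# re.match: only matches at the beginning of the string (roughly str_starts)
starts_with_the = re.match(r"The", sentence) is not None
print(starts_with_the)  # True

# re.sub with backreferences: swap the second and third words.
# count=1 replaces only the first match, like str_replace().
swapped = re.sub(r"(\w+) (\w+) (\w+)", r"\1 \3 \2", sentence, count=1)
print(swapped)  # The canoe birch slid on the smooth planks.
```

Leaving out `count=1` makes `re.sub` replace every match, which is the behaviour of `str_replace_all()`.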