Extracting Required Words from Text Using Pattern Mapping with Regex and R

Text Capture Using Patterns in R: Regular Expressions

Introduction

Regular expressions (regex) are a powerful tool for text manipulation and pattern matching. In this article, we will explore how to use regex to capture specific patterns in text data.

Problem Statement

The problem at hand is to extract required words from a given text using pattern mapping. We have a sample dataset with two columns: Unique_Id and Text. The Text column contains strings that may contain repeated values of the format “YYYY-XXXX”.

For example, one row in the data might look like this:

Unique_Id   Text
Ax23z12     Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134

We can match these values with the regex pattern \d+-\d+ (one or more digits, a hyphen, then one or more digits). However, matching alone does not give us the desired format, in which each extracted value sits in its own column.
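As a quick check of what the pattern picks up, here is a minimal base R sketch run on the sample row. The stricter four-digit variant and the use of unique() to drop the repeated value are assumptions about what the data and the desired result look like, not part of the original problem.

```r
# The sample row from the problem statement
text <- "Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134"

# gregexpr() locates every match; regmatches() pulls out the matched substrings
matches <- regmatches(text, gregexpr("\\d+-\\d+", text, perl = TRUE))[[1]]
matches
# [1] "2015-8134" "2015-8134"

# Assumption: a repeated code should be reported only once
unique(matches)
# [1] "2015-8134"

# Assumption: codes always have exactly four digits on each side of the hyphen
strict <- regmatches(text, gregexpr("\\b\\d{4}-\\d{4}\\b", text, perl = TRUE))[[1]]
```

On this row the loose and the strict pattern return the same matches; the strict one only matters if the text also contains other digit-hyphen-digit runs, such as phone numbers.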

Solution

One approach to solving this problem is to use the stri_extract_all_regex function from the stringi package in R. By default this function returns a list with one character vector of matches per input string; with simplify = TRUE it instead returns a character matrix with one row per input string and one column per match. We can then convert this matrix to a data frame and bind it to the original dataset using the bind_cols function from the dplyr package.
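The difference between the two return shapes can be seen in a small sketch. The vector name texts is only illustrative; the strings are the ones used in the sample data of this article.

```r
library(stringi)

texts <- c("Here is my code, 2015-8134. Here is your code, 2015-1111.",
           "His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666")

# Default (simplify = FALSE): a list with one character vector per input string
as_list <- stri_extract_all_regex(texts, "\\d+-\\d+")

# simplify = TRUE: a character matrix with one row per input string
as_matrix <- stri_extract_all_regex(texts, "\\d+-\\d+", simplify = TRUE)

lengths(as_list)  # number of matches per string: 2 3
dim(as_matrix)    # rows x columns: 2 3
```

The matrix form is what makes the later conversion to a data frame straightforward, since every row has the same number of columns.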

Code

library(dplyr)
library(stringi)

# Sample data
foo <- tibble(
  id = 1:2,
  text = c("Here is my code, 2015-8134. Here is your code, 2015-1111.",
           "His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666")
)

# Extract every match; simplify = TRUE returns a character matrix
# with one row per input string and one column per match
out <- stri_extract_all_regex(str = foo$text, pattern = "\\d+-\\d+", simplify = TRUE) %>%
       data.frame(stringsAsFactors = FALSE) %>%
       bind_cols(foo, .)

# Rename the default matrix columns (X1, X2, ...) to Column1, Column2, ...
# The ^ anchor ensures only a leading "X" is replaced
names(out) <- gsub("^X", "Column", names(out))

# Display the output
print(out)

Output

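id   Column1     Column2     Column3
1    2015-8134   2015-1111
2    2016-8888   2016-7777   2016-6666

The text column is omitted here for width. Because the second string contains three matches, the result has three match columns; Column3 is an empty string for the first row, which has only two matches.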
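When the result is simplified to a matrix, rows with fewer matches than the widest row have to be padded. As documented for stringi, simplify = TRUE pads with empty strings, while simplify = NA pads with NA, which is often easier to filter on later. A minimal sketch, reusing the sample strings under the illustrative name texts:

```r
library(stringi)

texts <- c("Here is my code, 2015-8134. Here is your code, 2015-1111.",
           "His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666")

# simplify = NA also returns a matrix, but pads short rows with NA instead of ""
padded <- stri_extract_all_regex(texts, "\\d+-\\d+", simplify = NA)

padded[1, ]
# [1] "2015-8134" "2015-1111" NA
```

Which padding value is preferable depends on how the columns are used downstream; NA works naturally with is.na() and with functions that accept na.rm.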

Last modified on 2024-01-20