Regular Expressions

We had a nice chat about the uses of regular expressions in R, and found that we use them mainly for three things: dealing with messy data files, changing the file names of data files, and doing some linguistics data analysis tasks. That doesn’t sound like much, but they’re really amazing, and once you’ve started to use them, you’ll wonder how you ever went without.

Check out some of the useful base R functions that make use of them with ?grep:


     ‘grep’, ‘grepl’, ‘regexpr’ and ‘gregexpr’ search for matches to
     argument ‘pattern’ within each element of a character vector: they
     differ in the format of and amount of detail in the results.

     ‘sub’ and ‘gsub’ perform replacement of the first and all matches
     respectively.
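
To make the differences between these functions concrete, here is a minimal sketch in base R. The file names are invented for illustration and don't come from any real data set.

```r
# Hypothetical file names, made up for this example
files <- c("trial_01.csv", "trial_02.csv", "notes.txt")

# grep returns the positions of the elements that match
grep("csv$", files)                 # 1 2
# ... or the matching elements themselves with value = TRUE
grep("csv$", files, value = TRUE)   # "trial_01.csv" "trial_02.csv"
# grepl returns one TRUE/FALSE per element
grepl("csv$", files)                # TRUE TRUE FALSE

# regexpr gives the start of the first match in each element (-1 if none)
first_digits <- regexpr("[0-9]+", files)
as.integer(first_digits)            # 7 7 -1
# regmatches pulls out the matched text (non-matching elements are dropped)
regmatches(files, first_digits)     # "01" "02"

# sub replaces only the first match, gsub replaces all of them
sub("_", "-", "a_b_c")              # "a-b_c"
gsub("_", "-", "a_b_c")             # "a-b-c"
```

The logical vector from grepl is usually the handiest for subsetting, for example files[grepl("csv$", files)].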

And if you use tidyr, you’ll love using them with its extract function.
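
Here is a rough sketch of what that can look like, assuming tidyr is installed. The data frame, its column names, and the file-naming scheme are all hypothetical, chosen only to show how each capture group in the pattern becomes a new column.

```r
library(tidyr)

# Hypothetical data: one column of file names that encode subject and phase
df <- data.frame(file = c("subj01_pre.csv", "subj02_post.csv"),
                 stringsAsFactors = FALSE)

# Each parenthesised group in the regex fills one of the 'into' columns
out <- tidyr::extract(df, file,
                      into  = c("subject", "phase"),
                      regex = "subj([0-9]+)_([a-z]+)")

out$subject   # "01" "02"
out$phase     # "pre" "post"
```

By default the original column is dropped from the result; passing remove = FALSE keeps it alongside the new ones.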

How do you get started? Check out RegexOne. Once you complete all the lessons you’ll be set for a good long while. There are many other resources on the Internet.

Note well for regular expression usage in R: You’ll learn that the backslash (\) gets used a lot in regular expressions. Well, it’s also the escape character in R strings (for example, newline is '\n'). For that reason, when you write regular expressions in R, you need to use two backslashes, so '\w' should actually be '\\w'.
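
A short sketch of the doubling rule in practice follows. The input strings are made up. The last line uses raw string syntax, which assumes R 4.0.0 or later, so skip it on older versions.

```r
# Writing "\w+" directly is a parse error in R ('\w' is an unrecognized escape),
# so the backslash is doubled in the source code
pattern <- "\\w+"

nchar(pattern)        # 3: the string itself holds a single backslash, then w and +
cat(pattern, "\n")    # prints \w+

# Replace every run of word characters
gsub(pattern, "X", "hello world")     # "X X"

# The same rule applies when escaping a literal dot
gsub("\\.", "_", "a.b.c")             # "a_b_c"

# Assumption: R >= 4.0.0, where raw strings avoid the doubling
gsub(r"(\w+)", "X", "hello world")    # "X X"
```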