---
title: "README.rmd"
output:
  md_document:
    variant: markdown_github
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(re2r)
```

# re2r

[![Build Status](https://travis-ci.org/qinwf/re2r.svg?branch=master)](https://travis-ci.org/qinwf/re2r) [![Build status](https://ci.appveyor.com/api/projects/status/n34unrvurpv18si5/branch/master?svg=true)](https://ci.appveyor.com/project/qinwf/re2r/branch/master) [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/re2r)](http://cran.r-project.org/package=re2r)  [![codecov](https://codecov.io/gh/qinwf/re2r/branch/master/graph/badge.svg)](https://codecov.io/gh/qinwf/re2r)

RE2 is a primarily DFA-based regular expression engine from Google that is very fast at matching large amounts of text.

## Installation

From CRAN:

```r
install.packages("re2r")
```

From GitHub:

```r
library(devtools)
install_github("qinwf/re2r", build_vignettes = TRUE)
```

To learn how to use the package, see the [vignettes](https://qinwenfeng.com/re2r_doc/).

## Related Work

[Google Summer of Code](https://github.com/rstats-gsoc/gsoc2016/wiki/re2-regular-expressions) - re2 regular expressions.

## Brief Intro

### 1. Search a string for a pattern

`re2_detect(string, pattern)` searches the string for a pattern and returns a logical (boolean) result.

```{r}
test_string = "this is just one test"
re2_detect(test_string, "(o.e)")
```

Here is an example of an email pattern.

```r
show_regex("\\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,4}\\b", width = 670, height = 280)
```

![email pattern](https://raw.githubusercontent.com/qinwf/re2r/master/inst/img/email.png)

```{r}
re2_detect("test@gmail.com", "\\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,4}\\b")
```
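
Because the other examples in this README pass character vectors as `string`, `re2_detect()` can presumably be used the same way to filter a vector. The following is a minimal sketch under that assumption; the `candidates` vector is made-up sample data.

```r
library(re2r)

# Compile the email pattern from above once so it can be reused.
email_pattern = re2("\\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,4}\\b")

# Hypothetical sample inputs (not from the package documentation).
candidates = c("test@gmail.com", "not an email", "someone@example.org")

# Assumed to return one logical value per input string.
is_email = re2_detect(candidates, email_pattern)

# Keep only the strings that contain an email address.
candidates[is_email]
```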

`re2_match(string, pattern)` will return the capture groups in `()`.

```{r}
(res = re2_match(test_string, "(o.e)"))
```

The result is a character matrix. `.1` is the first capture group; it is an unnamed group, so it is labelled by its position.

Create a named capture group with the `(?P<name>pattern)` syntax.

```{r}
(res = re2_match(test_string, "(?P<testname>this)( is)"))
is.matrix(res)
is.character(res)
res$testname
```

If the pattern has no capture group, the matched text itself is returned.

```{r}
test_string = c("this is just one test", "the second test")
(res = re2_match(test_string, "is"))
```

`re2_match_all()` returns all matches of the pattern in a string instead of just the first one.

```{r}
res = re2_match_all(c("this is test", 
            "this is test, and this is not test", 
            "they are tests"), 
          pattern = "(?P<testname>this)( is)")
print(res)
is.list(res)
```

Match all numbers:

```{r}
texts = c("pi is 3.14529..",
          "-15.34 °F",
          "128 days",
          "1.9e10",
          "123,340.00$",
          "only texts")
(number_pattern = re2(".*?(?P<number>-?\\d+(,\\d+)*(\\.\\d+(e\\d+)?)?).*?"))

(res = re2_match(texts, number_pattern))
res$number
```

```r
show_regex(number_pattern)
```

![number pattern](https://raw.githubusercontent.com/qinwf/re2r/master/inst/img/number.png)

### 2. Replace a substring

```r
re2_replace(string, pattern, rewrite)
```

Searches `string` for the occurrence(s) of a substring that matches `pattern` and replaces the matched substrings with the `rewrite` text.

```{r}
input_string = "this is just one test"
new_string = "my"
# Argument order is (string, pattern, rewrite): search input_string, insert new_string.
re2_replace(input_string, "(o.e)", new_string)
```

Mask the middle three digits of a US phone number:

```{r}
texts = c("415-555-1234",
          "650-555-2345",
          "(416)555-3456",
          "202 555 4567",
          "4035555678",
          "1 416 555 9292")

us_phone_pattern = re2("(1?[\\s-]?\\(?\\d{3}\\)?[\\s-]?)(\\d{3})([\\s-]?\\d{4})")

re2_replace(texts, us_phone_pattern, "\\1***\\3")
```
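
The `\\1` and `\\3` in the rewrite string above refer back to the first and third capture groups of the pattern. As a minimal sketch of the same mechanism on a simpler input, the example below reorders the parts of a date; the date string and the variable names are made up for illustration.

```r
library(re2r)

# Hypothetical sample input: an ISO-style date.
iso_date = "2016-07-04"

# Three capture groups: year, month, day.
date_pattern = re2("(\\d{4})-(\\d{2})-(\\d{2})")

# \\3, \\2 and \\1 insert the text captured by groups 3, 2 and 1.
reordered = re2_replace(iso_date, date_pattern, "\\3/\\2/\\1")
print(reordered)
```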

### 3. Extract a substring

```r
re2_extract(string, pattern, replacement)
```

Extract matching patterns from a string.

```{r}
re2_extract("yabba dabba doo", "(.)")
```

```{r}
re2_extract("test@me.com", "(.*)@([^.]*)")
```

### 4. `Regular Expression Object` for better performance

We can create a regular expression object (RE2 object) from a string. Reusing the object avoids parsing the same pattern again on every call.

It also gives us more options for the pattern. Run `help(re2)` for more details.

```{r}
regexp = re2("test", case_sensitive = FALSE)
print(regexp)
```

```{r}
regexp = re2("test", case_sensitive = FALSE)
re2_match("TEST", regexp)
re2_replace("TEST", regexp, "ops")
```
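
A minimal sketch of the intended usage pattern: compile the pattern once, then pass the same object to several functions. The `inputs` vector is made-up sample data, and passing an RE2 object to `re2_detect()` is assumed to work the same way as it does for `re2_match()` and `re2_replace()` above.

```r
library(re2r)

# Compile once; case_sensitive = FALSE makes "test" match "TEST" and "Test" too.
word_re = re2("test", case_sensitive = FALSE)

# Hypothetical sample inputs.
inputs = c("TEST one", "a Test", "nothing here")

# Reuse the compiled object across calls instead of re-parsing the pattern string.
has_word = re2_detect(inputs, word_re)
masked = re2_replace(inputs, word_re, "ops")

print(has_word)
print(masked)
```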

### 5. Multithread

Use the `parallel` option to enable multithreading. It improves performance for large inputs on a multi-core CPU.

```r
re2_match(string, pattern, parallel = TRUE)
```
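
Whether `parallel = TRUE` pays off depends on the input size and the machine, so it is worth timing both variants on your own data. The sketch below does that with base R's `system.time()`; the input vector is synthetic, its size is an arbitrary choice, and no particular speed-up is implied.

```r
library(re2r)

# Synthetic input: 100,000 strings built by repeating two sample sentences.
big_input = rep(c("this is just one test", "the second test"), times = 50000)
pattern = re2("(o.e)")

# `<-` is required inside system.time(); `=` would be read as an argument name.
serial_time = system.time(serial_res <- re2_match(big_input, pattern))
parallel_time = system.time(parallel_res <- re2_match(big_input, pattern, parallel = TRUE))

# Compare elapsed times; results vary by machine and input size.
print(rbind(serial = serial_time, parallel = parallel_time))
```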