re2

RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.

Quick Overview

RE2 provides a regular expression API for C++ and implements matching with a finite-state machine approach rather than backtracking, which guarantees match time that is linear in the length of the input.

Pros

Fast and efficient, with linear-time performance guarantees
Thread-safe and suitable for concurrent use
Avoids stack overflow by eschewing recursion and keeps memory use within a configurable budget
Supports most Perl-style regular expression syntax

Cons

Lacks support for constructs that require backtracking, such as backreferences and look-around assertions
May require more memory than backtracking engines for complex patterns
Learning curve for C++ developers not familiar with the library's API
Only the Python wrapper is official; bindings for other languages are unofficial third-party projects

Code Examples

Basic pattern matching:

#include <re2/re2.h>
#include <iostream>
#include <string>

int main() {
    RE2 pattern("\\w+@\\w+\\.com");
    std::string text = "Contact us at info@example.com";
    
    if (RE2::PartialMatch(text, pattern)) {
        std::cout << "Email found!" << std::endl;
    }
    return 0;
}

Extracting matched groups:

#include <re2/re2.h>
#include <iostream>
#include <string>

int main() {
    std::string text = "Date: 2023-04-15";
    int year, month, day;
    
    if (RE2::PartialMatch(text, "Date: (\\d{4})-(\\d{2})-(\\d{2})", &year, &month, &day)) {
        std::cout << "Year: " << year << ", Month: " << month << ", Day: " << day << std::endl;
    }
    return 0;
}

Global replacement:

#include <re2/re2.h>
#include <iostream>
#include <string>

int main() {
    std::string text = "The quick brown fox jumps over the lazy dog";
    RE2::GlobalReplace(&text, "\\b\\w{4}\\b", "****");
    std::cout << "Censored: " << text << std::endl;
    return 0;
}

Getting Started

To use RE2 in your C++ project:

Install RE2 (e.g., sudo apt-get install libre2-dev on Ubuntu)
Include the RE2 header in your source file: #include <re2/re2.h>
Compile with the RE2 library: g++ -std=c++17 your_file.cpp -lre2 (RE2 requires at least C++17)

Example usage:

#include <re2/re2.h>
#include <iostream>
#include <string>

int main() {
    RE2 pattern("Hello, (\\w+)!");
    std::string text = "Hello, World!";
    std::string name;
    
    if (RE2::PartialMatch(text, pattern, &name)) {
        std::cout << "Matched name: " << name << std::endl;
    }
    return 0;
}

Competitor Comparisons

ripgrep

54,157

ripgrep recursively searches directories for a regex pattern while respecting your gitignore

Pros of ripgrep

Faster performance for searching large codebases
Built-in support for various file types and gitignore rules
User-friendly command-line interface with colored output

Cons of ripgrep

Limited to text search and not a general-purpose regex library
Less flexibility for complex pattern matching compared to RE2

Code Comparison

RE2:

#include <re2/re2.h>
RE2 pattern("\\w+");
RE2::FullMatch(text, pattern);

ripgrep:

use grep_matcher::Matcher;  // trait that provides is_match
use grep_regex::RegexMatcher;
let matcher = RegexMatcher::new(r"\w+").unwrap();
matcher.is_match(text.as_bytes()).unwrap()

Key Differences

RE2 is a C++ library focused on efficient regular expression matching, while ripgrep is a command-line search tool written in Rust. RE2 is a general-purpose regex engine meant to be embedded in other applications, whereas ripgrep excels in fast, user-friendly code searching.

RE2 can be integrated into larger applications wherever regular expression matching is needed. ripgrep, on the other hand, is optimized for searching through large codebases quickly and provides a more intuitive interface for developers working directly from the command line.

hyperscan

5,066

High-performance regular expression matching library

Pros of Hyperscan

Designed for high-performance, multi-pattern matching
Supports advanced features like stream matching and vectorized processing
Optimized for Intel architectures, leveraging hardware acceleration

Cons of Hyperscan

Limited platform support (primarily x86)
More complex API and usage compared to RE2
Larger memory footprint for pattern compilation

Code Comparison

RE2:

// FindAndConsume needs one capturing group per output argument.
re2::RE2 pattern("(\\w+)");
re2::StringPiece input("Hello, world!");
re2::StringPiece match;
while (re2::RE2::FindAndConsume(&input, pattern, &match)) {
    std::cout << match << std::endl;
}

Hyperscan:

hs_database_t *database;
hs_compile_error_t *compile_err;
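// Argument order is (expression, flags, mode, platform, db, error).
hs_compile("\\w+", HS_FLAG_DOTALL, HS_MODE_BLOCK, NULL, &database, &compile_err);
hs_scratch_t *scratch = NULL;  // must be NULL before the first hs_alloc_scratch call
hs_alloc_scratch(database, &scratch);
// onMatch is a match_event_handler callback defined elsewhere.
hs_scan(database, "Hello, world!", 13, 0, scratch, onMatch, NULL);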
hs_free_scratch(scratch);
hs_free_database(database);

Both RE2 and Hyperscan are powerful regular expression engines, but they cater to different use cases. RE2 focuses on simplicity and safety, while Hyperscan prioritizes high-performance matching for large sets of patterns. RE2 is more portable and easier to use, making it suitable for general-purpose regex needs. Hyperscan excels in scenarios requiring fast processing of multiple patterns simultaneously, particularly in network security and content inspection applications.

oniguruma

2,464

regular expression library

Pros of Oniguruma

Supports a wider range of regular expression syntax, including Perl, Python, and Emacs styles
Offers better support for Unicode and multi-byte character sets
Provides more advanced features like look-behind assertions and recursive patterns

Cons of Oniguruma

Generally slower performance compared to RE2, especially for complex patterns
Less memory-efficient, particularly for large-scale text processing tasks
Not as well-suited for high-throughput, production environments

Code Comparison

RE2:

// FindAndConsume needs one capturing group per output argument.
RE2 pattern("(\\w+)");
re2::StringPiece input("Hello, World!");
std::string word;
while (RE2::FindAndConsume(&input, pattern, &word)) {
    std::cout << word << std::endl;
}

Oniguruma:

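const UChar* pat = (const UChar*)"\\w+";
const UChar* str = (const UChar*)"Hello, World!";
const UChar* end = str + strlen((const char*)str);
regex_t* reg;
onig_new(&reg, pat, pat + strlen((const char*)pat), ONIG_OPTION_DEFAULT, ONIG_ENCODING_UTF8, ONIG_SYNTAX_DEFAULT, NULL);
OnigRegion* region = onig_region_new();
const UChar* start = str;
// Resume after the previous match; searching from str every time would loop forever.
while (onig_search(reg, str, end, start, end, region, ONIG_OPTION_NONE) >= 0) {
    // Process the match spanning byte offsets region->beg[0] to region->end[0]
    start = str + region->end[0];
}
onig_region_free(region, 1);
onig_free(reg);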

Both RE2 and Oniguruma are powerful regular expression libraries, but they cater to different use cases. RE2 focuses on speed and safety, making it ideal for high-performance applications. Oniguruma offers more extensive feature support and flexibility, suitable for complex pattern matching scenarios.

README

RE2, a regular expression library

RE2 is an efficient, principled regular expression library that has been used in production at Google and many other places since 2006.

Safety is RE2's primary goal.

RE2 was designed and implemented with an explicit goal of being able to handle regular expressions from untrusted users without risk. One of its primary guarantees is that the match time is linear in the length of the input string. It was also written with production concerns in mind: the parser, the compiler and the execution engines limit their memory usage by working within a configurable budget, failing gracefully when exhausted, and they avoid stack overflow by eschewing recursion.

It is not a goal to be faster than all other engines under all circumstances. Although RE2 guarantees a running time that is asymptotically linear in the length of the input, more complex expressions may incur larger constant factors; longer expressions increase the overhead required to handle those expressions safely. In a sense, RE2 is pessimistic where a backtracking engine is optimistic: A backtracking engine tests each alternative sequentially, making it fast when the first alternative is common. By contrast RE2 evaluates all alternatives in parallel, avoiding the performance penalty for the last alternative, at the cost of some overhead. This pessimism is what makes RE2 secure.
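The idea of evaluating all alternatives in parallel can be sketched with a toy state-set simulation. This is not RE2's implementation: the Edge and Nfa types, the MatchAll function, and the hand-built automaton for the anchored pattern ab|ac|ad are all assumptions made for illustration.

```cpp
#include <set>
#include <string>
#include <vector>

// One transition: from state `from`, on byte `ch`, go to state `to`.
struct Edge { int from; char ch; int to; };

// A tiny hand-built automaton (hypothetical, for illustration only).
struct Nfa {
    int start;
    int accept;
    std::vector<Edge> edges;
};

// Advances every live state at once for each input byte. Each byte costs
// at most one pass over the edge list, so the running time does not depend
// on which alternative finally matches.
bool MatchAll(const Nfa& nfa, const std::string& text) {
    std::set<int> current = {nfa.start};
    for (char c : text) {
        std::set<int> next;
        for (const Edge& e : nfa.edges) {
            if (e.ch == c && current.count(e.from) > 0) next.insert(e.to);
        }
        if (next.empty()) return false;  // no alternative survives
        current.swap(next);
    }
    return current.count(nfa.accept) > 0;
}

// Automaton for ab|ac|ad: three branches leave the start state on 'a'.
Nfa MakeAbAcAd() {
    return Nfa{0, 4, {{0, 'a', 1}, {1, 'b', 4},
                      {0, 'a', 2}, {2, 'c', 4},
                      {0, 'a', 3}, {3, 'd', 4}}};
}
```

After reading the 'a' in "ad", the live set holds all three branch states, so the last alternative costs no more than the first. A backtracking matcher would instead try ab, then ac, then ad in turn.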

It is also not a goal to implement all of the features offered by Perl, PCRE and other engines. As a matter of principle, RE2 does not support constructs for which only backtracking solutions are known to exist. Thus, backreferences and look-around assertions are not supported.

For more information, please refer to Russ Cox's articles on regular expression theory and practice.

Syntax

In POSIX mode, RE2 accepts standard POSIX (egrep) syntax regular expressions. In Perl mode, RE2 accepts most Perl operators. The only excluded ones are those that require backtracking (and its potential for exponential runtime) to implement. These include backreferences (submatching is still okay) and generalized assertions. The Syntax wiki page documents the supported Perl-mode syntax in detail. The default is Perl mode.

C++ API

RE2's native language is C++, although there are ports and wrappers listed below.

Matching Interface

There are two basic operators: RE2::FullMatch requires the regexp to match the entire input text, and RE2::PartialMatch looks for a match for a substring of the input text, returning the leftmost-longest match in POSIX mode and the same match that Perl would have chosen in Perl mode.

Examples:

assert(RE2::FullMatch("hello", "h.*o"));
assert(!RE2::FullMatch("hello", "e"));

assert(RE2::PartialMatch("hello", "h.*o"));
assert(RE2::PartialMatch("hello", "e"));

Submatch Extraction

Both matching functions take additional arguments in which submatches will be stored. The argument can be a string*, or an integer type, or the type absl::string_view*. (The absl::string_view type is very similar to the std::string_view type, but for historical reasons, RE2 uses the former.) A string_view is a pointer to the original input text, along with a count. It behaves like a string but doesn't carry its own storage. Like when using a pointer, when using a string_view you must be careful not to use it once the original text has been deleted or gone out of scope.

Examples:

// Successful parsing.
int i;
string s;
assert(RE2::FullMatch("ruby:1234", "(\\w+):(\\d+)", &s, &i));
assert(s == "ruby");
assert(i == 1234);

// Fails: "ruby" cannot be parsed as an integer.
assert(!RE2::FullMatch("ruby", "(.+)", &i));

// Success; does not extract the number.
assert(RE2::FullMatch("ruby:1234", "(\\w+):(\\d+)", &s));

// Success; skips NULL argument.
assert(RE2::FullMatch("ruby:1234", "(\\w+):(\\d+)", (void*)NULL, &i));

// Fails: integer overflow keeps value from being stored in i.
assert(!RE2::FullMatch("ruby:123456789123", "(\\w+):(\\d+)", &s, &i));

Pre-Compiled Regular Expressions

The examples above all recompile the regular expression on each call. Instead, you can compile it once to an RE2 object and reuse that object for each call.

Example:

RE2 re("(\\w+):(\\d+)");
assert(re.ok());  // compiled; if not, see re.error();

assert(RE2::FullMatch("ruby:1234", re, &s, &i));
assert(RE2::FullMatch("ruby:1234", re, &s));
assert(RE2::FullMatch("ruby:1234", re, (void*)NULL, &i));
assert(!RE2::FullMatch("ruby:123456789123", re, &s, &i));

Options

The constructor takes an optional second argument that can be used to change RE2's default options. For example, RE2::Quiet silences the error messages that are usually printed when a regular expression fails to parse:

RE2 re("(ab", RE2::Quiet);  // don't write to stderr for parser failure
assert(!re.ok());  // can check re.error() for details

Other useful predefined options are Latin1 (disable UTF-8) and POSIX (use POSIX syntax and leftmost longest matching).

You can also declare your own RE2::Options object and then configure it as you like. See the header for the full set of options.

Unicode Normalization

RE2 operates on Unicode code points: it makes no attempt at normalization. For example, the regular expression /ü/ (U+00FC, u with diaeresis) does not match the input "ü" (U+0075 U+0308, u followed by combining diaeresis). Normalization is a long, involved topic. The simplest solution, if you need such matches, is to normalize both the regular expressions and the input in a preprocessing step before using RE2. For more details on the general topic, see https://www.unicode.org/reports/tr15/.

Additional Tips and Tricks

For advanced usage, like constructing your own argument lists, or using RE2 as a lexer, or parsing hex, octal, and C-radix numbers, see re2.h.

Installation

RE2 can be built and installed using GNU make, CMake, or Bazel. The simplest installation instructions are:

make
make test
make benchmark
make install
make testinstall

Building RE2 requires a C++17 compiler and the Abseil library. Building the tests and benchmarks requires GoogleTest and Benchmark. To obtain those:

Linux: apt install libabsl-dev libgtest-dev libbenchmark-dev
macOS: brew install abseil googletest google-benchmark pkg-config-wrapper
Windows: vcpkg install abseil gtest benchmark (or vcpkg add port abseil gtest benchmark)

Once those are installed, the build has to be able to find them. If the standard Makefile has trouble, then switching to CMake can help:

rm -rf build
cmake -DRE2_TEST=ON -DRE2_BENCHMARK=ON -S . -B build
cd build
make
make test
make install

When using CMake, with benchmarks enabled, make test builds and runs test binaries and builds a regexp_benchmark binary but does not run it. If you don't need the tests or benchmarks at all, you can omit the corresponding -D arguments, and then you don't need the GoogleTest or Benchmark dependencies either.

Another useful option is -DRE2_USE_ICU=ON, which adds a dependency on the ICU Unicode library but also extends the list of property names available in the \p and \P patterns.

CMake can also be used to generate Visual Studio and Xcode projects, as well as Cygwin, MinGW, and MSYS makefiles.

Visual Studio users: You need Visual Studio 2019 or later.
Cygwin users: You must run CMake from the Cygwin command line, not the Windows command line.

If you are adding RE2 to your own CMake project, CMake has two ways to use a dependency: add_subdirectory(), which is when the dependency's sources are in a subdirectory of your project; and find_package(), which is when the dependency's binaries have been built and installed somewhere on your system. The Abseil documentation walks through both approaches. Once you get Abseil working, getting RE2 working will be a very similar process and, either way, target_link_libraries(… re2::re2) should Just Work™.

If you are using Bazel, it will handle the dependencies for you, although you still need to download Bazel, which you can do with Bazelisk.

go install github.com/bazelbuild/bazelisk@latest
# or on mac: brew install bazelisk

bazelisk build :all
bazelisk test :all

If you are using RE2 from another project, you need to make sure you are using at least C++17. See the RE2 .bazelrc file for an example.

Ports and Wrappers

RE2 is implemented in C++.

The official Python wrapper is in the python directory and published on PyPI as google-re2. Note that there is also a PyPI re2 but it is not by the RE2 authors and is unmaintained. Use google-re2.

There are also other unofficial wrappers:

A C wrapper is at https://github.com/marcomaggi/cre2/.
A D wrapper is at https://github.com/ShigekiKarita/re2d/ and on DUB.
An Erlang wrapper is at https://github.com/dukesoferl/re2/ and on Hex.
An Inferno wrapper is at https://github.com/powerman/inferno-re2/.
A Node.js wrapper is at https://github.com/uhop/node-re2/ and on NPM.
An OCaml wrapper is at https://github.com/janestreet/re2/ and on OPAM.
A Perl wrapper is at https://github.com/dgl/re-engine-RE2/ and on CPAN.
An R wrapper is at https://github.com/girishji/re2/ and on CRAN.
A Ruby wrapper is at https://github.com/mudge/re2/ and on RubyGems (rubygems.org).
A WebAssembly wrapper is at https://github.com/google/re2-wasm/ and on NPM (npmjs.com).

RE2J is a port of the RE2 C++ code to pure Java, and RE2JS is a port of RE2J to JavaScript.

The Go regexp package and Rust regex crate do not share code with RE2, but they follow the same principles, accept the same syntax, and provide the same efficiency guarantees.

Contact

The issue tracker is the best place for discussions.

There is a mailing list for keeping up with code changes.

Please read the contribution guide before sending changes. In particular, note that RE2 does not use GitHub pull requests.
