Gorp

Gorp is a library for building efficient log-line extractors from a multi-regexp input definition. Starting from an ordered set of regular expressions (a subset of the usual Java regexp syntax; more on this below), it builds one large DFA using the excellent Automaton library, together with complementary extractors that perform the actual value extraction.

In addition to Automaton itself, the code relies on part of the multiregexp package to combine multiple expressions. For more information on multiregexp, please see this blog post.

Gorp builds on these two libraries, adding:

The ability to not only match regular expressions but also extract data
A Domain-Specific Language (DSL) for defining extractors, which allows modular, (more) readable source definitions

Basic operation

To use Gorp, you need three things:

1. This library: it comes as a single jar that includes shaded dependencies (a so-called "uber-jar")
2. An extraction definition: often a File, either stand-alone or a resource within a bigger jar, or possibly read from external storage such as Amazon S3
3. Input, in the form of java.lang.Strings, often coming from a line-oriented source such as (a set of) log file(s)

Extractor input definition

An extractor input definition is a line-oriented text document consisting of three kinds of declarations:

Pattern declarations, which define low-level building blocks that consist of snippets of regular expressions and/or references to other patterns
Template declarations, which define intermediate building blocks that consist of references to patterns, literal text segments, references to named templates, inlined patterns and extractors
Extraction declarations, named matching rules that associate a named template with output, possibly augmented by additional properties

In addition to these declarations, individual "extractors" are declared as part of templates or extractions.

Simple example declarations would be:

# Patterns
pattern %num \d+
pattern %hostname [a-zA-Z0-9_\-\.]+
pattern %status \w+

# Templates
template @endpoint %hostname:%num
# or, if we want to create a parametric template:
template @extractEndpoint() $1(%hostname):$2(%num)

# Extraction
extract HostDefinition {
  template @extractEndpoint($srcHost,$srcPort) $status(%status)
}

which shows both a simple template (no parameters), @endpoint, and a parametric variant, @extractEndpoint.
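To make the layering of patterns and templates more concrete, here is a minimal sketch of how %name references could be expanded into one plain regular expression. It uses only java.util.regex; the class PatternExpansionSketch and its expand method are hypothetical names chosen for illustration and are not part of Gorp's API, and Gorp's real resolution logic may well differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: NOT Gorp's implementation or API.
public class PatternExpansionSketch {
    // Matches a pattern reference such as %num or %hostname
    private static final Pattern REF = Pattern.compile("%(\\w+)");

    /** Replaces each %name reference with its regex snippet; quotes literal text. */
    static String expand(String template, Map<String, String> patterns) {
        Matcher m = REF.matcher(template);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String snippet = patterns.get(m.group(1));
            if (snippet == null) {
                throw new IllegalArgumentException("Undefined pattern: %" + m.group(1));
            }
            appendLiteral(sb, template.substring(last, m.start()));
            // Wrap in a non-capturing group so repetition inside stays contained
            sb.append("(?:").append(snippet).append(")");
            last = m.end();
        }
        appendLiteral(sb, template.substring(last));
        return sb.toString();
    }

    private static void appendLiteral(StringBuilder sb, String literal) {
        if (!literal.isEmpty()) {
            sb.append(Pattern.quote(literal));
        }
    }

    public static void main(String[] args) {
        Map<String, String> patterns = new LinkedHashMap<>();
        patterns.put("num", "\\d+");
        patterns.put("hostname", "[a-zA-Z0-9_\\-\\.]+");
        String regex = expand("%hostname:%num", patterns);
        System.out.println(regex);
        System.out.println(Pattern.matches(regex, "foobar.com:8080"));
    }
}
```

The sketch handles only pattern references and literal text; template references, parameters and extractors are left out to keep it short.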

Basic usage

Assuming you have a file extractions.xtr that contains an extraction definition (item 2 in the list above) and want to extract values from input lines, you could use:

import java.io.File;
import java.util.Map;

import com.salesforce.gorp.DefinitionReader;
import com.salesforce.gorp.ExtractionResult;
import com.salesforce.gorp.Gorp;

// ...

// Read the extraction definition and build the extractor
DefinitionReader r = DefinitionReader.reader(new File("extractions.xtr"));
Gorp gorp = r.read();
final String TEST_INPUT = "prefix: time=12546778 verb=PUT";
ExtractionResult result = gorp.extract(TEST_INPUT);
if (result == null) { // no match, handle
   throw new IllegalArgumentException("no match!");
}
// Extracted values, keyed by extractor name
Map<String,Object> properties = result.asMap();
// and then use extracted property values

and a sample extraction definition could be something like (note that this is not the only, or even the simplest, way to define it):

pattern %num \d+
pattern %word \w+
# define both simple and parametric just for fun; either one would work
template @extractTime time=$time(%num)
template @extractVerb() verb=$1(%word)
extract SimpleEntry {
   template prefix: @extractTime @extractVerb($verb)
}

and as a result you would get a Map like:

{
  "time" : "12546778",
  "verb" : "PUT"
}
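For comparison, the same extraction can be written by hand with java.util.regex named groups. This is only an illustrative counterpart of the SimpleEntry definition above, not how Gorp performs matching (Gorp builds an Automaton-based DFA); the class name NamedGroupSketch and its extract method are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a hand-written equivalent, NOT Gorp's implementation.
public class NamedGroupSketch {
    // Counterpart of: template prefix: @extractTime @extractVerb($verb)
    private static final Pattern SIMPLE_ENTRY =
            Pattern.compile("prefix: time=(?<time>\\d+) verb=(?<verb>\\w+)");

    /** Returns the extracted properties, or null if the line does not match. */
    static Map<String, Object> extract(String line) {
        Matcher m = SIMPLE_ENTRY.matcher(line);
        if (!m.matches()) {
            return null;
        }
        Map<String, Object> props = new LinkedHashMap<>();
        props.put("time", m.group("time"));
        props.put("verb", m.group("verb"));
        return props;
    }

    public static void main(String[] args) {
        System.out.println(extract("prefix: time=12546778 verb=PUT"));
    }
}
```

Writing one such regexp per log format quickly becomes repetitive, which is the duplication the pattern and template declarations are meant to reduce.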

Sample Extractor Input Definition

Let's try another example, this time for input of the form:

[102456879]: GET 123ms /rest-service/v1/endpoint?foo=bar

and we could use a definition like:

pattern %num \d+
pattern %word \w+
pattern %phrase \S+

extract PutRequest {
   # It's ok to: (a) extract constant value; (b) concatenate physical lines with backslash
   template [$timestamp(%num)]: $verb(PUT) $timeTakenInMsec(%num)ms\
 $path(%phrase)
   append { "marker" : "EXTRACTED" }
}
extract GetRequest {
   template [$timestamp(%num)]: $verb(GET) $timeTakenInMsec(%num)ms\
 $path(%phrase)
   append { "marker" : "EXTRACTED" }
}
extract OtherRequest {
   template [$timestamp(%num)]: $verb(%word) $timeTakenInMsec(%num)ms\
 $path(%phrase)
   append { "marker" : "EXTRACTED" }
}

which would be one way to define a multi-matcher. Things to note include:

Multiple physical lines may be concatenated into a single logical line by ending the physical line with a backslash ("\")
Comments are allowed; they may have optional leading white-space, but the first non-white-space character has to be '#'
Order of declarations matters: the first match will be taken; in this case this means that the "OtherRequest" match needs to come after both "GetRequest" and "PutRequest"
You can append arbitrary key/value pairs by using the "append" property of an extraction
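The "first match wins" rule can be pictured with a small java.util.regex sketch that tries hand-written equivalents of the three extractions in declaration order. The class FirstMatchSketch and its firstMatch method are hypothetical; Gorp itself combines the expressions into a single DFA rather than looping over them, so this only mirrors the observable ordering behaviour.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustration only: mirrors the ordering rule, NOT Gorp's matching strategy.
public class FirstMatchSketch {
    // LinkedHashMap keeps insertion order, so rules are tried as declared
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        String tail = " \\d+ms \\S+";
        RULES.put("PutRequest", Pattern.compile("\\[\\d+\\]: PUT" + tail));
        RULES.put("GetRequest", Pattern.compile("\\[\\d+\\]: GET" + tail));
        // Also matches GET and PUT lines, so it has to be declared last
        RULES.put("OtherRequest", Pattern.compile("\\[\\d+\\]: \\w+" + tail));
    }

    /** Returns the name of the first rule that matches, or null if none does. */
    static String firstMatch(String line) {
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(line).matches()) {
                return rule.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("[102456879]: GET 123ms /rest-service/v1/endpoint?foo=bar"));
        System.out.println(firstMatch("[102456880]: DELETE 7ms /rest-service/v1/endpoint"));
    }
}
```

If "OtherRequest" were registered first, it would shadow the two more specific rules, which is why the catch-all declaration goes last.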

It is also worth noting that we did not use templates here; but we could simplify things a bit by doing something like:

template @timeAndPath $timeTakenInMsec(%num)ms $path(%phrase)
# ...
extract OtherRequest {
   template [$timestamp(%num)]: $verb(%word) @timeAndPath
   append { "marker" : "EXTRACTED" }
}

that is, by creating reusable templates to reduce the amount of duplication.

Further simplification may be possible by using parametric templates, where parameters can refer either to other templates (by passing the template name as a parameter) or to extractor names (by passing the name of the property to extract to), for example:

template @extractBracketed() [$1(@2)]

where you would invoke it by something like:

template @endpointDef %hostname:%port
extract OtherRequest {
   template Connection: @extractBracketed($src,@endpointDef) ->\
 @extractBracketed($dst,@endpointDef)
}

to match lines like:

Connection: [foobar.com:8080] -> [barfoo.internal.org:80]

and extract the values foobar.com:8080 (as src) and barfoo.internal.org:80 (as dst).
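One way to picture parametric templates is as positional text substitution: each $N or @N placeholder in the template body is replaced by the N-th argument of the invocation. The sketch below assumes exactly that model; the class TemplateParamSketch and its instantiate method are hypothetical, and Gorp's actual parameter binding may work differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: assumes plain positional substitution, NOT Gorp's API.
public class TemplateParamSketch {
    // A positional placeholder: $1, $2, ... (extractor name) or @1, @2, ... (template)
    private static final Pattern PARAM = Pattern.compile("[$@](\\d+)");

    /** Replaces each positional placeholder in body with the matching argument. */
    static String instantiate(String body, String... args) {
        Matcher m = PARAM.matcher(body);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            int index = Integer.parseInt(m.group(1));
            if (index < 1 || index > args.length) {
                throw new IllegalArgumentException("No argument for parameter " + index);
            }
            // Copy the text before the placeholder, then the argument itself
            sb.append(body, last, m.start()).append(args[index - 1]);
            last = m.end();
        }
        sb.append(body, last, body.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        // @extractBracketed($src,@endpointDef) with body [$1(@2)]
        System.out.println(instantiate("[$1(@2)]", "$src", "@endpointDef"));
    }
}
```

Under this model, the invocation @extractBracketed($src,@endpointDef) behaves as if [$src(@endpointDef)] had been written inline.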

Regular Expressions supported

Expressions supported for named and inline patterns can be thought of either as a subset of the full java.util.regex.Pattern syntax, or as a superset of what the Automaton RegExp implementation supports (see the Automaton RegExp Javadocs).

Additions to Automaton base

Additions above and beyond Automaton RegExp are:

Quoted control characters like \t (stock Automaton does NOT allow those, only literal tabs!)
Pre-defined character classes \d/\D, \s/\S, \w/\W
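As a rough idea of what such additions involve, the sketch below rewrites a few shorthand escapes into forms that a more basic regexp dialect would accept: \d and \w become bracket expressions, and \t becomes a literal tab. The class ShorthandClassSketch and its translate method are hypothetical, the ASCII-only class definitions are an assumption, and this is not Gorp's actual conversion code.

```java
// Illustration only: a simplified rewrite, NOT Gorp's implementation.
public class ShorthandClassSketch {
    /**
     * Rewrites \d, \D, \w, \W and \t; all other escapes are kept as-is.
     * Limitation: shorthand classes inside [...] are not handled specially.
     */
    static String translate(String regex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c != '\\' || i + 1 == regex.length()) {
                sb.append(c); // ordinary character (or a trailing backslash)
                continue;
            }
            char next = regex.charAt(++i); // consume the escaped character
            switch (next) {
                case 'd': sb.append("[0-9]"); break;
                case 'D': sb.append("[^0-9]"); break;
                case 'w': sb.append("[a-zA-Z0-9_]"); break;
                case 'W': sb.append("[^a-zA-Z0-9_]"); break;
                case 't': sb.append('\t'); break; // literal tab character
                default:  sb.append('\\').append(next); // e.g. \. stays \.
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("\\w+\\.\\d+ms"));
    }
}
```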

Basic Automaton supports:

Simple character classes (one level of brackets, optionally starting with ^ for negation)
Basic repetition markers * (Kleene star), +, ?, {n}
Grouping ((...))
Literal escaping with \ (that is, the character immediately following is used as-is)
- NOTE: due to the additions described above, literal quoting is ONLY used for non-alphanumeric characters!
Concatenation, union (|)

but none of its extension features are enabled, to make it more likely that the same input patterns can be used with both Automaton and the regexp-based extractors.

Features missing from java.util.regex

The following java.util.regex features are explicitly not supported:

"Advanced" character class features like:
- Advanced combinations of character classes (subtraction, intersection)
- POSIX, java.lang.Character etc classes (anything of form \p{...})
Boundary matchers (except for implicit ^ and $)
Reluctant or Possessive quantifiers (all matching is greedy by default)
Back-references
Named matching groups (instead, extractors are used to same effect)
Special constructs (anything that starts with (?)
- NOTE: internally, non-capturing group markers are used so that only groups defined via extractors are captured, as an optimization

Some of these features could potentially be supported if the Automaton package adds such support or, in the case of named character classes, by adding conversion within Gorp itself. Others (like back-references) are unlikely to ever be supportable.
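When porting existing java.util.regex patterns, a quick pre-flight scan for the unsupported constructs listed above can save some trial and error. The following is a heuristic sketch only; the class UnsupportedFeatureCheck and its findUnsupported method are hypothetical, are not part of Gorp, and do not claim to cover every corner of the regexp syntax (nested character classes, for example, are not tracked).

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a heuristic scanner, NOT part of Gorp.
public class UnsupportedFeatureCheck {
    /** Returns a description of each construct from the unsupported list found in regex. */
    static List<String> findUnsupported(String regex) {
        List<String> problems = new ArrayList<>();
        boolean inClass = false; // true while inside [...]
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            char next = (i + 1 < regex.length()) ? regex.charAt(i + 1) : '\0';
            if (c == '\\') {
                if (next == 'p' || next == 'P') {
                    problems.add("named character class at index " + i);
                } else if (!inClass && (next == 'b' || next == 'B')) {
                    problems.add("boundary matcher at index " + i);
                } else if (!inClass && next >= '1' && next <= '9') {
                    problems.add("back-reference at index " + i);
                }
                i++; // skip the escaped character
            } else if (inClass) {
                if (c == ']') {
                    inClass = false;
                } else if (c == '&' && next == '&') {
                    problems.add("character class intersection at index " + i);
                    i++;
                }
            } else if (c == '[') {
                inClass = true;
            } else if (c == '(' && next == '?') {
                problems.add("special construct at index " + i);
            } else if ((c == '*' || c == '+' || c == '?' || c == '}')
                    && (next == '?' || next == '+')) {
                problems.add("reluctant or possessive quantifier at index " + i);
                i++; // skip the quantifier modifier
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(findUnsupported("(?i)\\p{Alpha}+?"));
        System.out.println(findUnsupported("\\d+ms"));
    }
}
```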
