EES Advanced Topics

Aliases (macros)

Instead of repeating a common set of matching patterns, you can define the pattern once as an alias. The alias can then be used in any rule by prefixing its name with @.

Example:

//Create an alias GREETING:
#alias GREETING =
Token<text()="hello"> |
Token<text() ="welcome"> #

//The alias can then be used in several rules, for example:

@GREETING
Token<text()="world">
> HelloWorld
@GREETING
Token<text()="universe">
> HelloUniverse
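
One way to picture what the alias does is to list the token sequences each rule accepts once the alias is expanded. The Python sketch below models only that reading. The names ALIASES and expand are hypothetical and not part of Sintelix, and the real engine may resolve aliases differently.

```python
from itertools import product

# Hypothetical model: an alias maps a name to the alternative token texts it accepts.
ALIASES = {"GREETING": ["hello", "welcome"]}

def expand(rule_elements, aliases):
    """Return every concrete token sequence a rule accepts.

    Elements starting with "@" are alias references; any other element
    stands for a single literal token text.
    """
    choices = [
        aliases[element[1:]] if element.startswith("@") else [element]
        for element in rule_elements
    ]
    return [list(sequence) for sequence in product(*choices)]

hello_world = expand(["@GREETING", "world"], ALIASES)
hello_universe = expand(["@GREETING", "universe"], ALIASES)
print(hello_world)     # [['hello', 'world'], ['welcome', 'world']]
print(hello_universe)  # [['hello', 'universe'], ['welcome', 'universe']]
```

Under this model, the two rules written with @GREETING cover the same four token sequences as four rules written without it.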

Conditional execution #cond

Projects frequently need a document processing configuration that works well across a range of document types. Often each document type requires a different EES file. Unfortunately, the EES rules intended for one document type may fire when processing another, causing errors or misses. To avoid this problem and to keep the task of writing EESs as simple as possible, you can use Sintelix's built-in document classifier to classify the different document types and give them identifying tags. These tags can then be used to trigger individual entity extraction scripts.

To make the operation of an EES on a document conditional on the presence of a document tag (for example, "MyDocType" in category "DocTypes"), insert the following command before the first rule in the entity extraction script:

#cond document.tag<category="DocTypes", name="MyDocType"> #
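
The gating effect of #cond can be pictured as a filter over the available scripts. The Python sketch below is an illustration only: scripts_to_run, the script file names and the "Invoice" tag are hypothetical, and it assumes that a script without a condition always runs.

```python
def scripts_to_run(document_tags, script_conditions):
    """Return the scripts whose condition is met by the document's tags.

    document_tags: set of (category, name) pairs attached to the document.
    script_conditions: maps a script name to the (category, name) tag it
    requires, or to None when the script has no condition (assumed to run).
    """
    selected = []
    for script, required_tag in script_conditions.items():
        if required_tag is None or required_tag in document_tags:
            selected.append(script)
    return selected

# Hypothetical configuration: one script per document type plus a shared one.
script_conditions = {
    "invoices.ees": ("DocTypes", "Invoice"),
    "reports.ees": ("DocTypes", "MyDocType"),
    "common.ees": None,
}
document_tags = {("DocTypes", "MyDocType")}

selected = scripts_to_run(document_tags, script_conditions)
print(selected)  # ['reports.ees', 'common.ees']
```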

Conditional execution #section

You may have rules in an EES that should only be applied in a specific context, such as a particular section of a document.

For example, consider a document with an Executive Summary at the beginning, followed by an Introduction and other normal document sections. In this example, the document sections are clearly labelled and a simple dictionary has been used to insert text references with name "section_marker" and feature "key = [document section]". The key has values such as "executive_summary" and "introduction".

Example:

The example below uses #section to apply a rule only within the Executive Summary.

//Create a section called EXEC that only executes in the Executive Summary context:
#section EXEC = section_marker<key = "executive_summary">, section_marker<key != "executive_summary"> #
Token > InSection
#sectionend EXEC #

The effect of this snippet would be to tag every Token in the executive summary with tag "InSection".

Syntax:

Syntax for the #section label is:

#section section-name = matching-pattern1, matching-pattern2 #

where:

section-name is an arbitrary name for the section which must be unique within the EES.
matching-pattern1 is a matching pattern that determines the start of conditional processing, that is, it turns the section on.
matching-pattern2 is a matching pattern that determines the end of conditional processing, that is, it turns the section off.

Matching-pattern1 and matching-pattern2 follow the EES syntax for Matching Patterns.

A dash (-) may be substituted for either matching pattern. It acts as a wildcard meaning "any".

Example:

#section BOD = - , section_marker #

Replacing matching-pattern1 with a dash means that section BOD applies from the beginning of the document to the first occurrence of tag section_marker.
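
The on/off behaviour described above, including the dash, can be modelled as a small state machine that walks the document in order. The Python sketch below is a simplified illustration, not the Sintelix implementation. The function tokens_in_section and the (kind, value) node representation are hypothetical, a dash is modelled as None, and the sketch assumes the marker nodes themselves are not tagged.

```python
def tokens_in_section(nodes, starts, ends):
    """Return the token values that fall inside the section.

    nodes: (kind, value) pairs in document order.
    starts / ends: predicates on a node, or None to model a dash.
    """
    # A dash as the start pattern: the section is on from the beginning.
    active = starts is None
    tagged = []
    for node in nodes:
        if active and ends is not None and ends(node):
            active = False   # matching-pattern2 turns the section off
        elif not active and starts is not None and starts(node):
            active = True    # matching-pattern1 turns the section on
        kind, value = node
        if active and kind == "token":
            tagged.append(value)
    return tagged

nodes = [
    ("token", "Title"),
    ("section_marker", "executive_summary"),
    ("token", "Revenue"),
    ("token", "rose"),
    ("section_marker", "introduction"),
    ("token", "This"),
    ("token", "report"),
]

def is_marker(node):
    return node[0] == "section_marker"

# Section EXEC: from the executive_summary marker to the next different marker.
exec_tokens = tokens_in_section(
    nodes,
    starts=lambda n: n == ("section_marker", "executive_summary"),
    ends=lambda n: is_marker(n) and n[1] != "executive_summary",
)
# Section BOD: from the beginning of the document to the first marker.
bod_tokens = tokens_in_section(nodes, starts=None, ends=is_marker)

print(exec_tokens)  # ['Revenue', 'rose']
print(bod_tokens)   # ['Title']
```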

Sections may be nested.

Syntax:

Syntax for the #sectionend label is:

#sectionend section-name #

where section-name matches the corresponding section label.

The use of the #sectionend label is optional. If omitted, all of the rules to the end of the EES are considered to be part of the section. The recommended practice is to always use a #sectionend label.

Creating high-performance Entity Extraction Scripts

Sintelix’s EES rule engine runs very fast, but it is still possible to write rules that are very slow to execute.

To make rules run fast, use the rarest and most specific pattern elements in matching patterns.

If a text graph (and therefore a text block) doesn't contain a pattern element required by a rule, the entire rule is skipped for that graph.

You could match an exclamation mark (!) with either of the rules below.

Syntax:

Token<string="!">

or

Token.punctuation.exclamation

The first rule is a slow rule because every node contains a token. This rule requires that each token is tested to see if its text is an exclamation mark.

The second rule is faster because Token.punctuation.exclamation is much rarer and the number of times the rule is run is therefore drastically reduced.

The first link of any sequence is the most important. Try not to start sequences with very common pattern elements. Choose the rarest first, if you can, and then work along to the most common.
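
The cost difference between a common and a rare first element can be illustrated by counting how many nodes a rule has to be tried on. The Python sketch below is a rough model under stated assumptions: it supposes the engine keeps an index from pattern element to node positions, which the text above does not confirm, and build_index and candidate_positions are hypothetical names.

```python
from collections import defaultdict

def build_index(nodes):
    """Map each pattern element to the positions of the nodes that carry it."""
    index = defaultdict(list)
    for position, labels in enumerate(nodes):
        for label in labels:
            index[label].append(position)
    return index

def candidate_positions(index, required_labels):
    """Return the positions where the rule must be tried.

    If any required element is absent, the whole graph is skipped. Otherwise
    only nodes carrying the first element of the rule are candidates.
    """
    if any(label not in index for label in required_labels):
        return []
    return index[required_labels[0]]

# A six-token text graph whose last token is an exclamation mark.
graph = [{"Token"} for _ in range(5)]
graph.append({"Token", "Token.punctuation.exclamation"})
index = build_index(graph)

slow = candidate_positions(index, ["Token"])
fast = candidate_positions(index, ["Token.punctuation.exclamation"])
print(len(slow), len(fast))  # 6 1

# A graph with no exclamation mark is skipped entirely by the rarer rule.
plain_index = build_index([{"Token"} for _ in range(4)])
skipped = candidate_positions(plain_index, ["Token.punctuation.exclamation"])
print(skipped)  # []
```

In this model the rule that starts with Token is tried on every node, while the rule that starts with the rarer element is tried once, or not at all when the element is absent.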
