Matching Patterns

The basic syntax of a matching pattern is a sequence of pattern elements that is matched against the graph. Each pattern element usually matches a link. For example, a matching pattern might have three elements:
pattern_element1
pattern_element2
pattern_element3
The elements of a matching pattern can also be written on a single line without changing its meaning:
pattern_element1 pattern_element2 pattern_element3
For example, the token string "The happy dog" is matched by the following sequence:
Token<text()="The">
Token<text()="happy">
Token<text()="dog">
or
Token<text()="The"> Token<text()="happy"> Token<text()="dog">
Entity Extraction Script matching patterns find the longest possible match consistent with the pattern.
This is useful when a word list produces several partial matches as well as one match that covers the whole entity. For example, consider the word list:
#wordlist numbers
one
one two
one two three
For the text "one two three" this would result in the markup:
However, we want only the longest match. We achieve this by changing the namespace of the word list from "tag" to, say, "wordlist", and then creating the output text reference using the EES rule:
numbers
> numbers
We now get:
as required.
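The longest-match behaviour described above can be illustrated outside EES with a minimal Python sketch. It is not how Sintelix implements matching; the function name longest_matches and the tuple layout of the result are hypothetical, chosen only for this illustration. At each token position it tries the longest word-list entry first and keeps only that one.

```python
def longest_matches(tokens, phrases):
    """Return (start, end, phrase) spans, keeping only the longest phrase at each position."""
    # Try longer word-list entries before shorter ones.
    candidates = sorted((p.split() for p in phrases), key=len, reverse=True)
    spans = []
    i = 0
    while i < len(tokens):
        for cand in candidates:
            if tokens[i:i + len(cand)] == cand:
                spans.append((i, i + len(cand), " ".join(cand)))
                i += len(cand)  # continue after the matched span
                break
        else:
            i += 1  # no entry starts here; move to the next token
    return spans


numbers = ["one", "one two", "one two three"]
print(longest_matches("one two three".split(), numbers))
# [(0, 3, 'one two three')]
```

Only the entry covering all three tokens is reported; the shorter entries "one" and "one two" are not emitted for the same text.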

Dot notation
To organise your link names, you can use dot notation; for example:
root.level1.level2.leaf
There is no limit to the number of dots you can use.
The value of dot notation in matching is that a link name can be partially specified; for example,
root.level1
matches
root.level1.a
root.level1.b
root.level1.c.e.f.g
[There is an implicit wildcard after "level1".]
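The partial matching of dotted names can be sketched in Python as a prefix comparison on dot-separated levels. This is an illustration only: the function name matches_link_name is hypothetical, and the sketch assumes that matching works on whole levels and that a pattern also matches a link with exactly the same name, neither of which is stated above.

```python
def matches_link_name(pattern, link_name):
    """True if link_name equals pattern or extends it with further dot-separated levels."""
    pattern_parts = pattern.split(".")
    name_parts = link_name.split(".")
    # Assumption: comparison is per level, so "root.level1" does not match "root.level10".
    return name_parts[:len(pattern_parts)] == pattern_parts


for name in ["root.level1.a", "root.level1.b", "root.level1.c.e.f.g", "root.other"]:
    print(name, matches_link_name("root.level1", name))
```

The first three names are accepted and "root.other" is rejected, mirroring the list above.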

Grouping
You can arbitrarily group matches using brackets.
For example, to match a token whose "text()" is "hello" or "welcome", followed by a Term.symbolic link, you would use:
(
Token<text()="hello">
|Token<text()="welcome">
)
Term.symbolic

Repeating matches
You can also specify how many times a pattern element should be repeated in a sequence. The syntax is as follows:
pattern_element{number} // used to match exactly the number of times specified
pattern_element{min,max} // used to match in a range between the min and max specified
pattern_element* // used to match any number of times
pattern_element+ // used to match at least once
pattern_element? // used to match either once or zero times (this is equivalent to pattern_element{0,1})
So to match a token whose "text()" is "hello", followed by one or more Term.symbolic links, you would use:
Token<text()="hello">
Term.symbolic+
To make the Term.symbolic link optional instead, you would use:
Token<text()="hello">
Term.symbolic?
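The repetition operators listed above resemble regular-expression quantifiers, so their behaviour can be explored with a small Python sketch. This is an analogy, not EES itself: the one-letter encoding (H for the "hello" token, S for a Term.symbolic link) and the helper sequence_matches are hypothetical, and the sketch assumes EES quantifiers behave like their regular-expression counterparts.

```python
import re

# Hypothetical one-letter encoding of a link sequence (not EES syntax):
#   H = Token<text()="hello">, S = Term.symbolic


def sequence_matches(pattern, links):
    """Return True if the whole encoded link sequence satisfies the pattern."""
    return re.fullmatch(pattern, links) is not None


# pattern_element+ : at least once
print(sequence_matches("HS+", "HSSS"), sequence_matches("HS+", "H"))

# pattern_element? : zero times or once
print(sequence_matches("HS?", "H"), sequence_matches("HS?", "HSS"))

# pattern_element{number} and pattern_element{min,max}
print(sequence_matches("HS{2}", "HSS"), sequence_matches("HS{1,3}", "HSSSS"))
```

Each line prints True for the first sequence and False for the second.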

Wildcards
Entity Extraction Scripts permit the use of an explicit wildcard only in the following circumstance: any namespaced link without a root level may be referred to via the wildcard "*"; for example:
* // matches all links in the "tag" namespace.
But
level3* // means any number of repeats of "tag:level3"

Access to Learned Entities (Person, Location, Organisation etc.)
EES matching patterns can only find learned entities after they have been created. Learned entities are created after the "Early" EES stage in the processing pipeline but before the "Late" EES stage, so only "Late" EESs can refer to learned entities (see The Document Processing Workflow).
Learned entity types include:
- tag:Person
- tag:Person-locational...
- tag:Location
- tag:Location-indicator....
- tag:Organisation
- tag:Job-or-activity
- tag:Person-title
- tag:Ethnicity
Attempting to match learned entities in an EES at the "Early" position in the workflow fails, because the entities do not exist yet.

Rewind
EESs allow you to rewind (go back to the beginning of a matched section) and continue with another matching sequence by using the rewind operator (~).
This is useful, for example, when you are creating a link within another link and you want the new link to inherit all the features of the containing link.
For example:
containing_link[*] ~
(element1
element2
element3) = $for_output
>$for_output = out_link
Rewind can operate repeatedly (you can use ~ several times within a matching pattern).
The return point of the rewind is the beginning of the bracket in which it is situated, or the beginning of the matching pattern if it is not within a bracket.