The Text Graph

Nodes and tokens

Nodes and Tokens form the backbone of the graph. Each Node is linked to its adjacent Nodes by Tokens, which are the most basic links in the graph. Nodes and Tokens are created automatically during document processing; afterwards, none can be added or deleted.

Nodes contain any white space characters (spaces, carriage returns, etc.) found between the Tokens. Each conventional word becomes a Token. Alphanumeric sequences are split into separate Tokens wherever a run of letters meets a run of digits, so "BA123" becomes two Tokens, "BA" and "123".
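The splitting described above can be sketched in Python. This is a minimal illustration, not the product's tokenizer: the names Node, Token and build_graph are hypothetical, and the handling of punctuation (one Token per character) and of non-ASCII letters is an assumption, since the text above only specifies words, letter and digit runs, and white space.

```python
import re
from dataclasses import dataclass

# Assumption: a Token is a run of ASCII letters, a run of digits, or any
# other single non-space character. Only the first two are stated in the text.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")


@dataclass
class Node:
    index: int
    whitespace: str  # white space found at this position (may be empty)


@dataclass
class Token:
    start: int  # index of the Node before the Token
    end: int    # index of the Node after the Token
    text: str


def build_graph(text):
    """Return (nodes, tokens) where each Token links two adjacent Nodes."""
    nodes, tokens = [], []
    pos = 0
    for match in TOKEN_RE.finditer(text):
        # Everything between two matches is white space and goes in a Node.
        nodes.append(Node(len(nodes), text[pos:match.start()]))
        tokens.append(Token(len(nodes) - 1, len(nodes), match.group()))
        pos = match.end()
    # Closing Node after the last Token.
    nodes.append(Node(len(nodes), text[pos:]))
    return nodes, tokens


nodes, tokens = build_graph("Flight BA123 departs")
print([t.text for t in tokens])       # ['Flight', 'BA', '123', 'departs']
print([n.whitespace for n in nodes])  # ['', ' ', '', ' ', '']
```

Note that the Node between "BA" and "123" exists but holds no white space, which is how the sketch keeps every Token bounded by exactly two Nodes.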

Like other graph elements, Nodes and Tokens also have features (see Features). These are key-value pairs that contain more information about the element's state and position.

A good way to find out the kind and subkind descriptors available is to put some text relevant to your project into the graph analyser UI, and see what is listed. In the example below, the cursor is hovering over "Term.symbolic" and the instances are shown highlighted on the text graph above.

More about links

Links can be made between any pair of nodes in the graph.

Each link contains:

The link name. The name is permanent. No processing module can alter a link name after its creation.
The start and end nodes of the link. These are also permanent and cannot be altered following link creation.
Features. The features of any link can be altered by further processing.
Text. The full text covered by the link, excluding the text held in its start and end nodes.

Links represent spans of text.
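The four parts of a link listed above can be modelled roughly as follows. This is a sketch under stated assumptions rather than the product's data model: the Link class, the link_text helper, the link name "FlightCode" and the feature key "reviewed" are all hypothetical, and Nodes and Tokens are reduced to two plain lists in which Token i joins Node i to Node i + 1.

```python
from dataclasses import dataclass, field, FrozenInstanceError


@dataclass(frozen=True)
class Link:
    name: str   # permanent once the link is created
    start: int  # index of the start Node, permanent
    end: int    # index of the end Node, permanent
    # The dict itself stays mutable, so later processing can alter features.
    features: dict = field(default_factory=dict, compare=False)


def link_text(link, node_whitespace, token_texts):
    """Text covered by the link, leaving out the start and end Nodes."""
    parts = []
    for i in range(link.start, link.end):
        if i > link.start:
            # Interior Node: its white space is part of the covered text.
            parts.append(node_whitespace[i])
        parts.append(token_texts[i])
    return "".join(parts)


# Hypothetical graph for the text "Flight BA123 departs".
node_whitespace = ["", " ", "", " ", ""]
token_texts = ["Flight", "BA", "123", "departs"]

code = Link("FlightCode", start=1, end=3)
covered = link_text(code, node_whitespace, token_texts)  # "BA123"

code.features["reviewed"] = True  # allowed: features can change
try:
    code.name = "Other"           # rejected: the name is permanent
except FrozenInstanceError:
    renamed = False
```

A frozen dataclass is used here only because it gives fixed name, start and end fields while leaving the features dictionary editable, which mirrors the permanence rules in the list.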

In entity extraction scripts (EES), a rule that affects links changes the graph immediately after the rule has fired.

Entity extraction scripts as formal grammars

Entity extraction scripts are written in a scripting language based on an extended form of context-sensitive grammar (CSG), the third level (Type 1) within the Chomsky hierarchy, pictured here.
