Document Processing

Introduction

Document processing is an integral part of the Ingestion configuration. It is used to define what information is extracted from your documents and how it is marked up in the document output.

Early, Mid and Late Stage

Dictionaries, Entity Extraction Scripts and Document Processing Scripts can be run at different stages of the process:

Early stage runs added dictionaries and entity extraction scripts before Sintelix’s Learned Entity Extractor
Mid stage runs Document Processing Scripts immediately after the Learned Entity Extractor.
Late Stage runs after everything else has been run.

As a general guide, Document Processing is run as listed on the configuration page.

Process

To configure Ingestion:

Select Configurations > Document Processing
Select the configuration you want to modify.

See Manage Configurations for information on creating, copying, renaming, importing, exporting and deleting configurations.
Complete or modify each section's settings, as described below.
Select the Save button.

Enable built-in Entity Extraction

In its default state, document processing has a built in Entity Extraction that will extract common entities such as people, organisations and locations from your documents.

You can unselect the Enable Built-in Entity Extraction checkbox to disable the built-in entity extraction.

When disabled, the Dictionaries, Entity Extraction Scripts and Document Processing Scripts added below are used to apply entity extraction.

Phrase Chunker

The Phrase Chunker is an advanced feature for dividing a sentence into sequences of semantically-related words. Selecting the Phase Chunker checkbox will generate a new annotation type on the Text Graph.

Machine Learning

When editing a document, you can manually modify marked up to text references and connections.

You can save these edits so they can be applied to future documents processed by this configuration, by selecting the Enable Machine Learning checkbox.

Sintelix will save the text references in a Machine Learning dictionary and the connections in a Machine Learning entity extraction script.

How Machine Learning Works

Machine Learning applies when you make edits to the markups in a document (entities, text references and links). See Edit Documents.

A message at the top of the Document pane confirms that Machine Learning is active.

Sintelix does the following:

When you create a new text reference, the exact text will be added to the ‘learned’ namespace in the Machine Learning dictionary (unless an identical entry already exists).
When you delete a text reference:

if it was in the Machine Learning dictionary in the ‘Learned’ namespace it will be deleted
if it was not in the ‘Learned’ namespace it is added to the ‘removed’ namespace

When Sintelix applies the dictionary during document processing, any matching ‘removed’ namespace text references are removed first then any matching ‘learned’ text references are added (if they don't overlap existing markup). This removes any bad text references before new ones are added and classes are changed.
The dictionary will be applied immediately before Late Stage entity extraction scripts, so a custom script can be used to interact with and refine the result.
When you create a new connection between text references, Machine Learning will update the Machine Learning script (which is an entity extraction script). For each connection that is created this way, a very simple rule is added that matches the exact text of the connection.
If you remove a connection, Machine Learning will remove the matching rule from the script (if it was created by Machine Learning).
When Sintelix applies the script during document processing, the rules will add the Machine Learning connections but not remove any that you have deleted.

Script rules created by Machine Learning are very basic. They simply consist of existing text references and the tokens between them. They will work as they are but it is recommended that you use them as a template for developing custom entity extraction scripts.

Dictionaries (Early Stage)

Dictionaries added here will run before Sintelix's Learned Entity Extractor. Text References that have been created by the Dictionary in this stage will not be overridden by the Learned Entity Extractor.

Click and drag to change the order.

Open configuration in new tab.

Click to remove item.

Entity Extraction Scripts (Early Stage)

Entity Extraction Scripts added here will run before Sintelix's Learned Entity Extractor. Text References that have been created by the Entity Extraction Script in this stage will not be overridden by the Learned Entity Extractor.

Click and drag to change the order.

Open configuration in new tab.

Click to remove item.

Learned Entity Extraction Configuration

This section enables you to exclude specific Text References from the document output. You may either enter the name of the Text Reference class, or select it from an Ontology to add to the exclusion list.

Document Processing Scripts (Mid Stage)

Document Processing Scripts added here will run after Sintelix's Learned Entity Extractor.

Click and drag to change the order.

Open configuration in new tab.

Click to remove item.

Dictionaries (Late Stage)

Dictionaries added here will run after Sintelix's Learned Entity Extractor. Text References created here can overlap Text References created by the Learned Entity Extractor.

Entity Extraction Scripts (Late Stage)

Entity Extraction Scripts added here will run after Sintelix's Learned Entity Extractor. Text References created here can overlap Text References created by the Learned Entity Extractor.

Scripts added here can refer to Text References created by the Learned Entity Extractor. Scripts added here can 0modify or delete existing Text References.

Click and drag to change the order.

Open configuration in new tab.

Click to remove item.

Document Processing Scripts

This is an advanced feature that rarely needs to be used. It exists to cover any marginal use cases that may require a modification of the standard Document Processing workflow.

Select the Save button.