Document Processing

Introduction

Document processing is an integral part of the Ingestion configuration. It is used to define what information is extracted from your documents and how it is marked up in the document output.

Early, Mid and Late Stage

Dictionaries, Entity Extraction Scripts and Document Processing Scripts can be run at different stages of the process:

  • Early stage runs the added Dictionaries and Entity Extraction Scripts before Sintelix's Learned Entity Extractor.

  • Mid stage runs Document Processing Scripts immediately after the Learned Entity Extractor.

  • Late stage runs after everything else has run.

As a general guide, Document Processing runs its steps in the order in which they are listed on the configuration page.
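The stage ordering described above can be pictured as a simple sequential pipeline. The Python sketch below is illustrative only: `make_step` and `run_pipeline` are hypothetical names and not part of Sintelix, and each step just records its own name so that the resulting order is visible.

```python
from typing import Callable, List

# A step takes the log of steps run so far and returns the extended log.
# This is a stand-in for real processing, used only to show ordering.
Step = Callable[[List[str]], List[str]]


def make_step(name: str) -> Step:
    """Build a hypothetical step that appends its name to the log."""
    def step(log: List[str]) -> List[str]:
        return log + [name]
    return step


def run_pipeline(early: List[Step], learned: Step,
                 mid: List[Step], late: List[Step]) -> List[str]:
    """Run early steps, then the learned extractor, then mid, then late."""
    log: List[str] = []
    for step in [*early, learned, *mid, *late]:
        log = step(log)
    return log


order = run_pipeline(
    early=[make_step("early dictionary"),
           make_step("early entity extraction script")],
    learned=make_step("learned entity extractor"),
    mid=[make_step("mid document processing script")],
    late=[make_step("late dictionary"),
          make_step("late entity extraction script")],
)
print(order)
```

Within each stage, the sketch runs items in the order they are supplied, which mirrors the ability to reorder items on the configuration page.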

Process

To configure Document Processing:

  1. Select Configurations > Document Processing.

  2. Select the configuration you want to modify.

    See Manage Configurations for information on creating, copying, renaming, importing, exporting and deleting configurations.

  3. Complete or modify each section's settings, as described below.

  4. Select the Save button.

Enable Built-in Entity Extraction

By default, Document Processing uses built-in entity extraction to extract common entities such as people, organisations and locations from your documents.

To disable the built-in entity extraction, clear the Enable Built-in Entity Extraction checkbox.

When built-in entity extraction is disabled, the Dictionaries, Entity Extraction Scripts and Document Processing Scripts added below perform the entity extraction.

Phrase Chunker

The Phrase Chunker is an advanced feature for dividing a sentence into sequences of semantically related words. Selecting the Phrase Chunker checkbox generates a new annotation type on the Text Graph.

Machine Learning

When editing a document, you can manually modify the marked-up text references and connections.

To save these edits so that they are applied to future documents processed by this configuration, select the Enable Machine Learning checkbox.

Sintelix will save the text references in a Machine Learning dictionary and the connections in a Machine Learning entity extraction script.

Dictionaries (Early Stage)

Dictionaries added here run before Sintelix's Learned Entity Extractor. Text References created by a Dictionary at this stage are not overridden by the Learned Entity Extractor.
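One way to picture this precedence rule is as a merge in which early-stage Text References always win. The Python sketch below is an assumption about the behaviour, not Sintelix's actual implementation: the `TextRef` class, the `overlaps` and `merge_early_first` functions, and the rule that an overlapping learned reference is dropped are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TextRef:
    """Hypothetical text reference: a character span with a class name."""
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    cls: str     # e.g. "Organisation" or "Location"
    source: str  # which component created it


def overlaps(a: TextRef, b: TextRef) -> bool:
    """True if the two spans share at least one character."""
    return a.start < b.end and b.start < a.end


def merge_early_first(early: List[TextRef],
                      learned: List[TextRef]) -> List[TextRef]:
    """Keep every early reference; keep a learned reference only if it
    does not overlap any early reference (assumed behaviour)."""
    kept = list(early)
    for ref in learned:
        if not any(overlaps(ref, e) for e in early):
            kept.append(ref)
    return sorted(kept, key=lambda r: (r.start, r.end))


text = "Acme Bank opened in Paris."
early = [TextRef(0, 9, "Organisation", "dictionary")]
learned = [
    TextRef(5, 9, "Location", "learned"),    # "Bank": clashes with the dictionary
    TextRef(20, 25, "Location", "learned"),  # "Paris": no clash
]
merged = merge_early_first(early, learned)
for ref in merged:
    print(text[ref.start:ref.end], ref.cls, ref.source)
```

In this sketch the dictionary's "Acme Bank" reference survives and the conflicting learned reference for "Bank" is discarded, while the unrelated "Paris" reference is kept.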

Click and drag to change the order.

Open configuration in new tab.

Click to remove item.

Entity Extraction Scripts (Early Stage)

Entity Extraction Scripts added here run before Sintelix's Learned Entity Extractor. Text References created by an Entity Extraction Script at this stage are not overridden by the Learned Entity Extractor.

Click and drag to change the order.

Open configuration in new tab.

Click to remove item.

Learned Entity Extraction Configuration

This section enables you to exclude specific Text References from the document output. You may either enter the name of the Text Reference class, or select it from an Ontology to add to the exclusion list.
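Conceptually, the exclusion list acts as a filter on Text Reference classes. The Python sketch below illustrates the idea only: the `apply_exclusions` function, the tuple representation and the exact-match comparison of class names are assumptions, not a description of how Sintelix applies the list.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation: (class name, matched text).
Ref = Tuple[str, str]


def apply_exclusions(refs: Iterable[Ref],
                     excluded_classes: Iterable[str]) -> List[Ref]:
    """Drop every reference whose class name is on the exclusion list.
    Class names are compared exactly (an assumption)."""
    excluded = set(excluded_classes)
    return [ref for ref in refs if ref[0] not in excluded]


refs = [
    ("Person", "Jane Doe"),
    ("Location", "Paris"),
    ("Organisation", "Acme Bank"),
]
output = apply_exclusions(refs, excluded_classes=["Location"])
print(output)
```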

Document Processing Scripts (Mid Stage)

Document Processing Scripts added here run immediately after Sintelix's Learned Entity Extractor.

Click and drag to change the order.

Open configuration in new tab.

Click to remove item.

Dictionaries (Late Stage)

Dictionaries added here will run after Sintelix's Learned Entity Extractor. Text References created here can overlap Text References created by the Learned Entity Extractor.

Entity Extraction Scripts (Late Stage)

Entity Extraction Scripts added here will run after Sintelix's Learned Entity Extractor. Text References created here can overlap Text References created by the Learned Entity Extractor.

Scripts added here can refer to Text References created by the Learned Entity Extractor, and can modify or delete existing Text References.

Click and drag to change the order.

Open configuration in new tab.

Click to remove item.

Document Processing Scripts

This is an advanced feature that is rarely needed. It exists to cover marginal use cases that require a modification of the standard Document Processing workflow.

Select the Save button.