Ingestion

The ingestion configuration manages what happens to documents, files, URLs, and their textual content after the data is uploaded to Sintelix.

The Ingestion configuration also assigns other configurations to ingestion processing, including:

- Document Processing, which also links to other configurations:
  - Dictionaries
  - Entity Extraction Scripts
  - Document Processing Scripts
- PDF Form Extractors (when licensed)
- Ontologies, which use Icon Sets
- Classification
- Tagging
- Structured Classifiers, and
- Network Creation.

To configure Ingestion:

Select Configurations > Ingestion.
Select the configuration you want to modify.

See Manage Configurations for information on creating, copying, renaming, importing, exporting and deleting configurations.
Complete or modify each section's settings, as described below.
Select the Save button.

The Optical Character Recognition (OCR) Processing section allows you to enable and choose settings for OCR, which converts images and scanned documents into text.

Configure OCR

You must have Sintelix connected to an OCR server to use this feature.

If Sintelix is connected to:

- no OCR server, the OCR options will be greyed out or the Optical Character Recognition Processing section will be unavailable. See Connect OCR.
- a Sintelix OCR server, all OCR options will be available.
- an ABBYY FineReader OCR server, only the first three OCR options will be available and the remainder will be greyed out.

Enable OCR by choosing one or more of the following options:

Perform OCR processing on PDF files
Perform OCR processing on image files
Perform OCR processing on images embedded in other documents

Select the required OCR processing options:

Correct page skew	Correct for any rotation that occurred during scanning
Correct page geometry	Correct for any warping or distortion that occurred during scanning
Perform Fast Mode	Choose faster processing
Extract text only	Ignore images and other non-text elements.
Perform aggressive table detection	Apply additional processing to detect whether text is organised as a table, and to recognise and extract the individual cells of the table, including column and row headings.
Specify type of field marking for documents with forms	If a document is a form, you can choose how fields are marked up in the document.
Language	You can allow Sintelix to automatically detect the language or choose the language.
Forbidden Characters	Enter characters you do not want to be extracted, for example special characters.

Audio and Video Processing enables Sintelix to transcribe audio and video files into text.

Configure Audio and Video Processing

Select the required options:

Enable audio and video files to be played in Sintelix.

The Unavailable message indicates that the connection to the media processing server is not set up. See Audio-Video Transcription for information about connecting this capability.
Enable transcripts to be generated - to convert the audio/video to text.
Select transcript language code. Click the lightbulb and select the language required for transcription.

Select speaker recognition mode. When this option is selected, each Speaker (person talking) will be identified. Select the required option from the dropdown:
- None (default)
- Voice
- Channel
- Voice and Channel

Additional vocabulary dictionaries

When transcribing audio/video files that use industry-specific terminology, you can improve the accuracy of the transcription by including a dictionary of terms, for example when the recordings contain many abbreviations or specialist wording.
- From the drop-down select Additional vocabulary dictionaries based on the context of the audio/video files being transcribed. You can add multiple dictionaries.
- Select Create a Vocabulary Dictionary to create and open a sample dictionary called Transcription Vocabulary. See Dictionaries for more information on creating and editing dictionaries.
  
  If a dictionary has already been created, an error message is displayed.

Document parsing exclusions allow you to ingest files with a specific content type as plain text. For example, when the text/html content type is added, ingested web pages display their full raw HTML markup rather than the parsed HTML content.

Configure Document parsing exclusions

Content Types

Below are a few examples of content types, and the types of files that normally use them:

text/html - HTML documents and web pages.
application/xml - XML documents and pages.
message/rfc822 - Email messages.
text/csv - Comma-separated values (CSV) files, such as spreadsheet exports.
application/json - JSON data files.

You can add the content types in two ways:

- selecting the Add Content-Type option to add each content type individually, or
- selecting the Edit All option to enter multiple content types.

Add Content-Type

To add the content types individually:

Click on the Add content-type option.

Result: An empty field is displayed.
Click the lightbulb and select the content type required.
Repeat the above steps until all content types have been added.
Click the red x icon to remove any unwanted option.

Edit All

To add the content types all at once:

Click on the Edit All option.

Result: A dialog box is displayed.
Enter each content type on its own line.

Warning: This method does not check for incorrect or misspelt entries.
Select OK.

Result: The entered options are updated on screen.
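
Because the Edit All dialog does not validate entries, it can help to check the list before pasting it in. The following is a minimal sketch in Python, not part of Sintelix: the function name check_content_types and the sample entries are hypothetical, and the pattern is a simplified form of the type/subtype naming rules from RFC 6838. It catches malformed entries only; a well-formed but misspelt type such as text/htlm would still pass.

```python
import re

# Simplified "type/subtype" check, loosely based on RFC 6838 restricted names.
_NAME = r"[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]{0,126}"
CONTENT_TYPE_RE = re.compile(rf"^{_NAME}/{_NAME}$")


def check_content_types(text):
    """Split a one-entry-per-line block into (valid, suspicious) lists."""
    valid, suspicious = [], []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        if CONTENT_TYPE_RE.match(entry):
            valid.append(entry)
        else:
            suspicious.append(entry)
    return valid, suspicious


# Hypothetical input, as it might be typed into the Edit All dialog.
entries = """text/html
application/xml
message/rfc822
text csv
application/json/
"""

valid, suspicious = check_content_types(entries)
print("valid:", valid)
print("check these:", suspicious)
```

Entries reported under "check these" are worth correcting before selecting OK.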

When HTML cleaning is selected, non-content related and hidden elements in web pages, such as unwanted social media links, ads and navigation links, will automatically be detected and removed. This is particularly useful for extracting only the content of a news or blog article, for example.

You can choose which PDF Form Extractor configurations to include in this ingestion. PDF Form Extractor configurations define how PDF form fields are identified, extracted and marked up.

The PDF Form Extractors option is only visible if active on the user license.

The PDF Ingestion section contains settings that improve the quality of regular PDF ingestion.

Content Generation creates additional content in a text block at the beginning of an ingested document. The content can be created from document properties or XML elements.

Generate Content from Document Properties

To create content from document properties, select the Add (Content Generation Rule) button to add a row and complete the required fields.

Category	Click the lightbulb and select the category of content you want to include: Ingestion Property Metadata Native External
Name	Click the lightbulb and select the property you want to include. The options displayed vary depending on what Category is selected and what properties or data are available. You can use a wildcard to include multiple properties with similar names. For example, Harvest* will include all data with a name starting with Harvest, such as HarvestType and Harvester Rule Set.
Content in Output	Choose to include either: Name and Value, or Value Only.
Enabled	Select or clear the checkbox to enable or disable the content.
Remove	Select the remove icon to remove a content option.
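
The way content generation rules combine can be illustrated with a short Python sketch. It is not Sintelix code: the property values, the rule dictionaries and the "Name: Value" output layout are assumptions made for illustration, the Category column is left out for brevity, and the wildcard is assumed to behave like a shell-style pattern (here handled by Python's fnmatch module).

```python
from fnmatch import fnmatchcase

# Hypothetical document properties; names and values are illustrative only.
properties = {
    "HarvestType": "Web",
    "Harvester Rule Set": "News sites",
    "Author": "J. Smith",
}

# Each rule mirrors one table row: Name, Content in Output, Enabled.
rules = [
    {"name": "Harvest*", "output": "Name and Value", "enabled": True},
    {"name": "Author", "output": "Value Only", "enabled": True},
    {"name": "Title", "output": "Value Only", "enabled": False},
]


def generate_content(properties, rules):
    """Build the text block that would precede the document content."""
    lines = []
    for rule in rules:
        if not rule["enabled"]:
            continue  # disabled rules contribute nothing
        for name, value in properties.items():
            if fnmatchcase(name, rule["name"]):
                if rule["output"] == "Name and Value":
                    lines.append(f"{name}: {value}")
                else:  # Value Only
                    lines.append(str(value))
    return "\n".join(lines)


print(generate_content(properties, rules))
```

In this sketch the Harvest* rule picks up both HarvestType and Harvester Rule Set, while the Author rule contributes only its value.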

Example of Configuration

Example of Generated Content

Below is an example of content generated from document properties and inserted at the top of the document content.

Document Deduplication removes duplicate (identical) documents. It does not remove different versions of the same document: if there are even minor variations between documents, each of them is ingested.
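
The difference between identical documents and near-identical versions can be sketched as follows. This page does not describe how Sintelix detects duplicates, so the content hashing used here is an assumption chosen to illustrate the behaviour, and the file names are hypothetical.

```python
import hashlib


def deduplicate(documents):
    """Keep the first copy of each byte-identical document."""
    seen = set()
    kept = []
    for name, content in documents:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            continue  # identical content was already ingested
        seen.add(digest)
        kept.append(name)
    return kept


documents = [
    ("report.txt", b"Quarterly report, final."),
    ("report_copy.txt", b"Quarterly report, final."),  # identical content
    ("report_v2.txt", b"Quarterly report, final!"),    # minor variation
]

print(deduplicate(documents))
```

The exact copy is dropped, but the version with a one-character difference is kept as a separate document.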

Storage rules allow you to choose which types of source documents are stored in Sintelix: archives (e.g. zip files), images and all other documents.

Archive Handling creates a list of the files contained in a zip archive file and stores it in Sintelix.

By default, if a file fails to be processed during ingestion, Sintelix does not save the document to the Collection. With Failure Handling enabled, documents that fail processing are saved to the Collection with the ingestion property "is_failed" set to true. This allows users to find and inspect failed documents.
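
The effect of the Failure Handling setting can be sketched in Python as below. Only the property name is_failed comes from this page; the ingest and process functions, the document dictionaries and the corrupt flag are hypothetical stand-ins used to simulate a processing failure.

```python
def process(doc):
    """Stand-in for real processing: fail on documents marked as corrupt."""
    if doc.get("corrupt"):
        raise ValueError(f"cannot process {doc['name']}")


def ingest(documents, process, failure_handling):
    """Return the documents that end up in the collection."""
    collection = []
    for original in documents:
        doc = dict(original)  # work on a copy of the input
        try:
            process(doc)
        except Exception:
            if not failure_handling:
                continue  # failed document is not saved
            doc["is_failed"] = True  # saved, but flagged for inspection
        collection.append(doc)
    return collection


docs = [{"name": "a.pdf"}, {"name": "b.pdf", "corrupt": True}]

with_handling = ingest(docs, process, failure_handling=True)
without_handling = ingest(docs, process, failure_handling=False)

# Find the failed documents, as a user might by filtering on is_failed.
failed = [d["name"] for d in with_handling if d.get("is_failed")]
print(failed, len(with_handling), len(without_handling))
```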

Ingestion rules determine what is done to a document based on its characteristics. There can be more than one rule, and the order of the rules is important, because only the first matching rule is applied to a document.

Choose the language to apply to the ingested documents. The default is Auto Detect.

Choose the Classification configurations to apply during Ingestion, if required.

Choose the Tagging configurations to apply during Ingestion, if required.

You can choose actions to apply to documents during Ingestion before the Document Processing. For example, if a Document Property is present then create a Document Tag.

Configure pre-processing

You can check for when a condition applies to a document and then select the action required, as defined in the table below.

When (condition applies)	Take an action
a Document Property is present; an Ingestion Property is present; a Document Tag is present; Always	Create Document Property; Create Document Tag; Stop further pre-processing

Procedure

To configure a pre-processing action:

Select the Add Processor button.

Result: A new row is added.
Click on the When dropdown, choose the required condition and complete the fields displayed.
Click on the Action dropdown, choose the required action and complete the fields displayed.
The actions are carried out in sequence, so the order of the rules is important. You can use the move up and down arrows to change the order of the rules.
Select the Stop further pre-processing checkbox when you don't want any further actions carried out on the documents matching the current condition.
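
The effect of rule order and of the Stop further pre-processing checkbox can be sketched in Python. This is an illustration only, not Sintelix code: the helper functions, the processor dictionaries and the tag names are all hypothetical.

```python
def has_property(name):
    """Condition: a Document Property with this name is present."""
    return lambda doc: name in doc["properties"]


def always(doc):
    """Condition: Always."""
    return True


def create_tag(tag):
    """Action: Create Document Tag."""
    def action(doc):
        doc["tags"].append(tag)
    return action


def run_preprocessors(doc, processors):
    """Apply matching processors in order and return the resulting tags."""
    for proc in processors:
        if not proc["when"](doc):
            continue
        proc["action"](doc)
        if proc.get("stop"):
            break  # Stop further pre-processing for this document
    return doc["tags"]


def new_doc():
    return {"properties": {"Author": "J. Smith"}, "tags": []}


author_first = [
    {"when": has_property("Author"), "action": create_tag("has-author"), "stop": True},
    {"when": always, "action": create_tag("reviewed")},
]
always_first = list(reversed(author_first))

print(run_preprocessors(new_doc(), author_first))
print(run_preprocessors(new_doc(), always_first))
```

With the author rule first, the stop flag prevents the second rule from running, so the document receives one tag. With the order reversed, both rules run and the document receives two tags.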

Example

See Ingestion Report Example for an example of combining Pre-processing and Document processing configuration settings.

Select the dropdown arrow to select an ontology to be used for document processing. Extracted text references can be turned into entities by placing a text reference’s class under Entity Classes in the selected ontology.

Document processing allows you to select the Document Processing configuration to apply during Ingestion. You can create rules to apply different Document Processing configurations based on document characteristics, such as tags and metadata. If no rules have been created, or no rules match a document, the default processing rule action will be applied.

Configure document processing

You can choose to:

- apply the Default Processing Rule only, or
- add Custom Document Processing Rules to apply before the Default Processing Rule.

Conditions and Actions

You can set conditional rules for documents and then select the action required for documents matching the rule, as defined in the table below.

When (condition applies)	Take an action
a Document Property is present; an Ingestion Property is present; a Document Tag is present	Don't Ingest; Extract Metadata Only; Extract Metadata and Text; Extract Metadata, Text and Entities

Procedure

To configure a document processing action:

Select the Add Custom Processing Rule button.

Result: A new row is added.
Click on the When dropdown, choose the required condition and complete the fields displayed.
Click on the Action dropdown and choose the required action.
If the action Extract Metadata, Text and Entities is selected, the Document Processing dropdown is displayed. Select the Document Processing configuration you want to apply to documents matching the When condition.
The actions are carried out in sequence, so the order of the rules is important. You can use the move up and down arrows to change the order of the rules.
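
How custom rules and the Default Processing Rule interact can be sketched as follows, assuming that the first matching custom rule is the one applied and that the default is used only when nothing matches. The rule dictionaries, tag names, configuration names and the action chosen for the default rule are hypothetical; only the four action labels come from the table above.

```python
# Hypothetical default rule; the real default action is set in the configuration.
DEFAULT_RULE = {
    "action": "Extract Metadata, Text and Entities",
    "document_processing": "Default",
}

custom_rules = [
    {"when": lambda d: "spam" in d["tags"], "action": "Don't Ingest"},
    {"when": lambda d: "Archived" in d["properties"], "action": "Extract Metadata Only"},
    {
        "when": lambda d: "legal" in d["tags"],
        "action": "Extract Metadata, Text and Entities",
        # Only this action carries a Document Processing configuration.
        "document_processing": "Legal documents",
    },
]


def select_rule(doc, custom_rules, default_rule):
    """Return the first matching custom rule, else the default rule."""
    for rule in custom_rules:
        if rule["when"](doc):
            return rule
    return default_rule


spam_legal = {"tags": ["spam", "legal"], "properties": {}}
legal_only = {"tags": ["legal"], "properties": {}}
plain = {"tags": [], "properties": {}}

for doc in (spam_legal, legal_only, plain):
    rule = select_rule(doc, custom_rules, DEFAULT_RULE)
    print(rule["action"], "|", rule.get("document_processing", "n/a"))
```

The first document matches two rules, but because the spam rule is listed first it is not ingested; moving the legal rule above it would change the outcome.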

Example

See Ingestion Report Example for an example of combining Pre-processing and Document processing configuration settings.

Choose the Structured Classifiers configurations to apply during the document ingestion process, if required.

The network update configuration is used to automatically generate or update a network every time a collection is processed. More than one network can be updated through this setting.

By default, ingested documents are stored in a Collection and the entities and links extracted from these documents are stored in a Network (when a Network Update Configuration is included in Ingestion). This allows you to view and modify the documents from which the entities and links were extracted.

However, you can choose not to keep the documents in a Collection and only keep the entities and links extracted from the documents in the Network.

This does not affect existing documents in the Collection.

For example, if you ingest 100 documents with document persistence enabled (the default), then disable document persistence in the ingestion configuration and ingest a further 50 documents, the original 100 documents remain in the collection but the additional 50 documents are not added to it.

Document persistence - keeping this enabled stores the ingested documents in their assigned collection. If this is disabled, newly ingested documents will not persist in the collection. This will not affect the existing documents in the collection.
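
The 100 and 50 document example can be worked through with a small Python simulation. It is a sketch under stated assumptions rather than a description of Sintelix internals: the ingest function and the list-based collection and network are hypothetical, each document is assumed to yield one entity, and a Network Update Configuration is assumed to be included in Ingestion.

```python
def ingest(batch, collection, network, persist_documents):
    """Add extracted entities to the network; store documents only if persisting."""
    for doc in batch:
        network.extend(doc["entities"])
        if persist_documents:
            collection.append(doc["name"])


def make_batch(start, stop):
    return [{"name": f"doc{i}", "entities": [f"entity{i}"]} for i in range(start, stop)]


collection, network = [], []

# First run: document persistence enabled (the default).
ingest(make_batch(0, 100), collection, network, persist_documents=True)

# Second run: document persistence disabled.
ingest(make_batch(100, 150), collection, network, persist_documents=False)

print(len(collection), len(network))  # 100 150
```

The collection still holds the original 100 documents, while the network contains entities extracted from all 150.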

Ingestion

Introduction

Process

Document Pre-processing

Optical Character Recognition (OCR) Processing

Audio and Video Processing

Document parsing exclusions

HTML Cleaning

PDF Form Extractors

PDF Ingestion

Content Generation

Storage Options

Document Deduplication

Storage rules

Archive Handling

Failure Handling

Document Ingestion Stages

Language detection

Document classifiers

Document taggers

Pre-processing

Ontology

Document processing

Structured classifiers

Network update configuration

Document persistence