Ingestion

Introduction

The ingestion configuration manages what happens to documents, files, URLs, and their textual content after the data is uploaded to Sintelix.

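The Ingestion configuration also assigns other configurations to ingestion processing, including PDF Form Extractor, Classification, Tagging, Document Processing, Structured Classifier and Network Update configurations, as well as the ontology used for document processing.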

Process

To configure Ingestion:

  1. Select Configurations > Ingestion.

  2. Select the configuration you want to modify.

    See Manage Configurations for information on creating, copying, renaming, importing, exporting and deleting configurations.

  3. Complete or modify each section's settings, as described below.

  4. Select the Save button.

Document Pre-processing

Optical Character Recognition (OCR) Processing

Allows you to enable and choose settings for Optical Character Recognition (OCR) Processing to convert images and scanned documents into text.

Audio and Video Processing

Enables Sintelix to transcribe audio and video files into text.

Document parsing exclusions

This option allows you to ingest files with a specific content type as plain text sources. For example, when the text/html content type is added, ingested web pages display their full raw HTML markup instead of the parsed HTML content.
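The difference between parsed and excluded content can be illustrated with a minimal Python sketch. It is not Sintelix code: the function render_source, the class TextExtractor and the excluded_types parameter are hypothetical names, and the standard-library HTML parser only stands in for whatever parser Sintelix uses.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Ignore whitespace-only text between tags.
        if data.strip():
            self.parts.append(data.strip())


def render_source(content, content_type, excluded_types):
    """Return the text a reader would see for an ingested file (illustrative only)."""
    if content_type in excluded_types:
        # Excluded content types are kept as plain text, markup and all.
        return content
    if content_type == "text/html":
        parser = TextExtractor()
        parser.feed(content)
        parser.close()
        return " ".join(parser.parts)
    return content


page = "<html><body><h1>Title</h1><p>Body text.</p></body></html>"
parsed = render_source(page, "text/html", excluded_types=set())
raw = render_source(page, "text/html", excluded_types={"text/html"})
```

Here parsed holds only the readable text, while raw is the unchanged markup, which mirrors the effect of adding text/html to the exclusions.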

HTML Cleaning

When HTML cleaning is selected, Sintelix automatically detects and removes hidden elements and elements unrelated to the content of a web page, such as social media links, ads and navigation links. This is particularly useful when you want to extract only the content of a news or blog article.

PDF Form Extractors

You can choose which PDF Form Extractor configurations to include in this ingestion. PDF Form Extractor configurations define how PDF form fields are identified, extracted and marked up.

The PDF Form Extractors option is only visible if it is active on the user license.

PDF Ingestion

Configure settings that improve the quality of regular PDF ingestion.

Content Generation

Creates additional content in a text block at the beginning of an ingested document. The content can be created from document properties or XML elements.

Storage Options

Document Deduplication

Removes duplicate (identical) documents. It does not remove different versions of the same document. For example, if documents differ from each other by minor variations, all of them are ingested.
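The distinction between identical documents and near-identical versions can be sketched in Python as follows. This is an assumption-laden illustration: the document does not describe how Sintelix decides that two documents are identical, so comparing SHA-256 digests of the raw bytes is used here only as a stand-in, and the function deduplicate is a hypothetical name.

```python
import hashlib


def deduplicate(documents):
    """Keep the first copy of each byte-identical document (illustrative only)."""
    seen = set()
    kept = []
    for name, content in documents:
        # Assumption: identity is judged on exact content, modelled as a hash.
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            continue  # exact duplicate, skipped
        seen.add(digest)
        kept.append(name)
    return kept


docs = [
    ("report.txt", b"Quarterly results: revenue up 4%."),
    ("report_copy.txt", b"Quarterly results: revenue up 4%."),
    ("report_v2.txt", b"Quarterly results: revenue up 5%."),
]
kept = deduplicate(docs)
```

In this sketch report_copy.txt is dropped because its content matches report.txt exactly, while report_v2.txt is kept because it differs by a single character.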

Storage rules

Allows you to choose which types of source documents are stored in Sintelix: archives (e.g. zip files), images and all other documents.

Archive Handling

Creates a list of files contained in a zip archive file and stores it in Sintelix.

Failure Handling

By default, if a file fails to be processed during ingestion, Sintelix does not save the document to the Collection. With Failure Handling enabled, documents that fail processing are saved to the Collection with the ingestion property "is_failed" set to true. This allows users to find and inspect failed documents.
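The effect of this setting can be modelled with a short Python sketch. Everything except the property name "is_failed" is hypothetical: the functions ingest and toy_process, the dictionary layout of a stored document, and the choice to leave the property unset on successful documents are assumptions made for illustration, not a description of how Sintelix stores documents.

```python
def toy_process(content):
    """Stand-in for document processing: fails on empty files."""
    if not content:
        raise ValueError("empty file")
    return content.decode("utf-8")


def ingest(files, process, failure_handling):
    """Return the documents saved to a collection (illustrative only)."""
    collection = []
    for name, content in files:
        try:
            text = process(content)
        except ValueError:
            if failure_handling:
                # Failed documents are kept and flagged so they can be found later.
                collection.append(
                    {"name": name, "text": None, "properties": {"is_failed": True}}
                )
            continue
        collection.append({"name": name, "text": text, "properties": {}})
    return collection


files = [("a.txt", b"hello"), ("b.txt", b"")]
without_handling = ingest(files, toy_process, failure_handling=False)
with_handling = ingest(files, toy_process, failure_handling=True)

# Find failed documents by their ingestion property.
failed = [d["name"] for d in with_handling if d["properties"].get("is_failed")]
```

Without failure handling only a.txt is saved. With it, both files are saved and b.txt can be located through the "is_failed" property.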

Document Ingestion Stages

Ingestion rules decide what should be done to a document based on its characteristics. There can be more than one rule, and the order of the rules is important, because only the first rule that matches a document is applied to it.

Language detection

Choose the language to apply to the ingested documents. The default is Auto Detect.

Document classifiers

Choose the Classification configurations to apply during Ingestion, if required.

Document taggers

Choose the Tagging configurations to apply during Ingestion, if required.

Pre-processing

You can choose actions to apply to documents during Ingestion, before Document Processing. For example, you can create a Document Tag when a particular Document Property is present.

See Ingestion Report Example for an example of combining Pre-processing and Document processing configuration settings.

Ontology

Select the dropdown arrow to select an ontology to be used for document processing. Extracted text references can be turned into entities by placing a text reference’s class under Entity Classes in the selected ontology.

Document processing

Allows you to select the Document Processing configuration to apply during Ingestion. You can create rules to apply different Document Processing Configurations based on document characteristics, such as tags and metadata. If no rules have been created, or no rules match a document, the default processing rule action will be applied.
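The rule behaviour described in this section (rules checked in order, the first match applied, and a default used when nothing matches) can be sketched in Python. The function choose_processing, the configuration names and the document fields below are hypothetical and chosen only for illustration. In Sintelix the rules are defined in the user interface, not in code.

```python
def choose_processing(document, rules, default_config):
    """Return the configuration of the first matching rule, else the default."""
    for matches, config in rules:
        if matches(document):
            return config  # first match wins; later rules are not checked
    return default_config


# Ordered rules: (condition on the document, configuration to apply).
rules = [
    (lambda d: "legal" in d["tags"], "Legal Processing"),
    (lambda d: d["metadata"].get("language") == "fr", "French Processing"),
]

french_contract = {"tags": ["legal"], "metadata": {"language": "fr"}}
english_memo = {"tags": [], "metadata": {"language": "en"}}

first_choice = choose_processing(french_contract, rules, "Default Processing")
fallback_choice = choose_processing(english_memo, rules, "Default Processing")
```

The French contract matches both rules, but only the first one is applied, so rule order matters. The English memo matches no rule and falls back to the default.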

See Ingestion Report Example for an example of combining Pre-processing and Document processing configuration settings.

Structured classifiers

Choose the Structured Classifiers configurations to apply to the document ingestion process, if required.

Network update configuration

The network update configuration is used to automatically generate or update a network every time a collection is processed. More than one network can be updated through this setting.

Document persistence

By default, ingested documents are stored in a Collection and the entities and links extracted from these documents are stored in a Network (when a Network Update Configuration is included in Ingestion). This allows you to view and modify the documents from which the entities and links were extracted.

However, you can choose not to keep the documents in a Collection and only keep the entities and links extracted from the documents in the Network.

This does not affect existing documents in the Collection.

For example, suppose you ingest 100 documents with the default document persistence setting, then disable document persistence in the ingestion configuration and ingest another 50 documents. The original 100 documents remain in the collection, but the additional 50 documents are not added to it.

Document persistence - when this is enabled, ingested documents are stored in their assigned collection. When it is disabled, newly ingested documents do not persist in the collection. Existing documents in the collection are not affected.
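The 100 and 50 document example above can be traced with a small Python model. It assumes that a Network Update Configuration is included, so that extracted entities reach the network in both runs. The function ingest_batch and the list-based collection and network are hypothetical simplifications, not Sintelix data structures.

```python
def ingest_batch(collection, network, batch, persist_documents):
    """Add a batch's entities to the network and, optionally, its documents
    to the collection (illustrative only)."""
    for doc in batch:
        # Extracted entities always go to the network in this model.
        network.extend(doc["entities"])
        if persist_documents:
            collection.append(doc["name"])


collection, network = [], []
first = [{"name": f"doc{i}", "entities": [f"entity{i}"]} for i in range(100)]
second = [{"name": f"doc{i}", "entities": [f"entity{i}"]} for i in range(100, 150)]

ingest_batch(collection, network, first, persist_documents=True)
ingest_batch(collection, network, second, persist_documents=False)
```

After both runs the collection still holds the first 100 documents, while the network holds entities from all 150.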