Concept: Harvester Gold Standard

What is it?

The term Gold Standard Test comes from scientific testing, where it refers to the best available method against which other tests are assessed.

To assess the accuracy of a test, its outcome must be compared with an independently established Gold Standard.

In other words, the results from a test sample are compared with those from a "perfect" sample (the gold standard), and the differences are evaluated.

An ideal gold standard test has a sensitivity of 100% (it identifies every item it should) and a specificity of 100% (it never identifies an item it should not).
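To make the two measures concrete, here is a minimal sketch of how sensitivity and specificity are conventionally calculated from counts. The function names and the sample counts are illustrative assumptions and are not part of Sintelix.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of the items that should be identified that actually were."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of the items that should be left alone that actually were."""
    total = true_neg + false_pos
    return true_neg / total if total else 0.0


# Hypothetical counts, for illustration only.
print(sensitivity(true_pos=45, false_neg=5))   # 0.9
print(specificity(true_neg=90, false_pos=10))  # 0.9
```

An ideal test would return 1.0 (100%) for both functions.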

Harvester Gold Standard

To test and evaluate a Harvester Rule Set, you can prepare a collection of model "documents" that you can learn from and test with, which we call a Gold Standard Collection.

In Sintelix Harvester, a Gold Standard Collection is a collection of documents harvested from web pages for the specific purpose of creating, evaluating and modifying a rule set.

The web pages you harvest must be representative of the content you want to harvest with the rule set, for example articles from a targeted news site or profiles from a specific social media site.

Gold Standard Collection

When you create a new Rule Set, Sintelix automatically creates a Collection with the same name and the suffix 'GS', to indicate that it is a gold standard collection.

For example, if you create a new Rule Set called "Event News", Sintelix creates a Collection called "Event News GS".

How many?

The number of web pages you need to harvest to create a gold standard collection will vary according to the uniformity of the HTML elements on the pages.

For example, if you are creating a rule set to harvest content from a single news site and there is very little variation in the elements that are used from one article to the next (such as a heading, subheading, author, date and paragraphs), you may only need to harvest three or four pages.

For sites with greater variations, or to create a rule set that can be applied to multiple sites, you may need to harvest many more pages to gather an adequate sample of the different variants that may be found across the multiple sites.

Versions of documents?

The Gold Standard Collection needs to contain two versions of each web page:

  • Gold Standard document: A harvested document which only includes the content required.
  • Full Page document: The full original document from the website with no changes.

You can harvest both versions simultaneously by ticking the:

The Goal

The goal is to refine the gold standard documents to the point where each is a perfect, or near-perfect, example of all the text you want to harvest from the corresponding web pages.

Evaluating a Pane

By comparing the Gold Standard document (on the right) with the Full Page document with the Rule Set applied (in the middle), you can easily see any missing or incorrectly harvested elements (highlighted using Colour Coding).

The table to the left evaluates the reliability of the Rule Set by counting the correct, spurious and missed elements, and calculates an overall score (see Rule Set Scoring), as illustrated in the screen below.

You can then create, edit and refine the rules to improve the reliability of the Rule Set; see Evaluate and Modify the Rule Set.

Colour Coding

Elements are colour coded to indicate their status:

Colour                            Gold Standard   Rule            Fix
Correct (Green) elements          Included        Covered         None Required
Spurious (Orange) elements        Missing         Covered         Add to Gold Standard
Missing (Pink) elements           Included        Missing         Remove from Gold Standard
Unselected (no colour) elements   Not Included    None Included   None Required

A quick trick to remember the colours: The Golden colour (orange) is missing from the Gold Standard document.
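The colour coding above comes down to two yes/no questions for each element: is it in the Gold Standard document, and is it covered by a rule? The sketch below illustrates that logic and tallies the results. It is a simplified model, not Sintelix code; the function name and the element identifiers are hypothetical.

```python
from collections import Counter


def classify(in_gold_standard: bool, covered_by_rule: bool) -> str:
    """Map an element to its colour-coding status."""
    if in_gold_standard and covered_by_rule:
        return "correct"      # green
    if covered_by_rule:
        return "spurious"     # orange: harvested, but not in the gold standard
    if in_gold_standard:
        return "missing"      # pink: in the gold standard, but not harvested
    return "unselected"       # no colour


# Hypothetical element identifiers, for illustration only.
gold_standard = {"heading", "author", "date", "paragraph-1"}
harvested = {"heading", "author", "paragraph-1", "footer-links"}
all_elements = gold_standard | harvested | {"sidebar"}

counts = Counter(
    classify(element in gold_standard, element in harvested)
    for element in all_elements
)
print(counts["correct"], counts["spurious"], counts["missing"], counts["unselected"])
```

In this example, "date" is missing (pink) and "footer-links" is spurious (orange).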

Rule Set Scoring

The F1 score combines the precision with which the rule set selects the text you want and the recall it achieves (that is, whether it misses a few or many elements). An F1 score of 1 indicates perfect precision and recall.
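As a rough illustration, the conventional F1 calculation can be sketched from the counts of correct, spurious and missed elements. This assumes the standard definition of F1 (the harmonic mean of precision and recall). This page does not state exactly how Sintelix calculates or aggregates the score, so treat the function below as a hypothetical sketch.

```python
def f1_score(correct: int, spurious: int, missed: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    selected = correct + spurious   # everything the rule set harvested
    wanted = correct + missed       # everything in the gold standard
    precision = correct / selected if selected else 0.0
    recall = correct / wanted if wanted else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 3 correct, 1 spurious and 1 missed element.
print(f1_score(correct=3, spurious=1, missed=1))  # 0.75
# No spurious or missed elements gives a perfect score.
print(f1_score(correct=5, spurious=0, missed=0))  # 1.0
```

Under this definition, removing a spurious element raises precision and harvesting a missed element raises recall, and either change moves the F1 score closer to 1.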

Click on a document title in the table to display the Full Page document with correct, spurious and missed elements highlighted.