Harvest to a Gold Standard Collection

Background

When you create a new Harvester Rule Set, a Gold Standard collection is created so you can store documents for the sole purpose of evaluating and modifying the Rule Set.

Before you evaluate and modify a Rule Set, harvest web pages to the Gold Standard collection to create a representative sample of the documents the Rule Set will process in the future.

If the Rule Set works for this representative sample, then it should work for all web pages harvested using this Rule Set.

If you make changes to the Rule Set in the future, you can re-evaluate it against the Gold Standard collection to make sure it still works as expected and that any modifications made to the rules are effective.

Two Versions Collected

We collect two types of documents:

  • the Gold Standard document (harvested document) and

  • the Full Page document (full original document).

Video: Create a New Rule Set

Click on the image below to view the video. The video uses the Sintelix Extension to create a new Rule Set and add another document to the Gold Standard collection, and finally shows you how to create and modify rules to get a perfect score.

Harvest Documents

You can collect documents for a Gold Standard collection in two ways:

If you create a new collection, give it the same name as the Rule Set you plan to create, and add the suffix ‘GS’ to identify it as a Gold Standard collection. For example, if the Rule Set you want to create will be titled ‘The Atlantic’, name the collection ‘The Atlantic GS’.
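
If you maintain many Rule Sets, it may help to derive collection names in a script so the convention is applied consistently. The following is a minimal Python sketch of the naming convention only. The function name gold_standard_collection_name is hypothetical and is not part of Sintelix; the sketch does not create a collection or call any Sintelix API.

```python
GS_SUFFIX = "GS"  # suffix from the naming convention described above


def gold_standard_collection_name(rule_set_name: str) -> str:
    """Return '<rule set name> GS' (hypothetical helper, not a Sintelix API)."""
    name = rule_set_name.strip()  # ignore accidental leading/trailing spaces
    if not name:
        raise ValueError("rule set name must not be empty")
    return f"{name} {GS_SUFFIX}"


print(gold_standard_collection_name("The Atlantic"))  # The Atlantic GS
```

The returned name is what you would enter when you create the collection.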