Harvest to a Gold Standard Collection
Background
When you create a new Harvester Rule Set, a Gold Standard collection is created so you can store documents for the sole purpose of evaluating and modifying the Rule Set.
Before evaluating and modifying a Rule Set, you need to harvest web pages to the Gold Standard collection, to create a representative sample of documents to be processed by the Rule Set in the future.
If the Rule Set works for this representative sample, then it should work for all web pages harvested using this Rule Set.
If you make changes to the Rule Set in the future, you can re-evaluate it against the Gold Standard collection to make sure it still works as expected and that any modifications made to the rules are effective.
Two Versions Collected
We collect two versions of each document:
- the Gold Standard document (the harvested document), and
- the Full Page document (the full original document).
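The idea of re-evaluating a Rule Set against stored samples can be pictured with a small sketch. This is an illustration only, not Sintelix's implementation or API: the `Sample` class, `score_rule_set`, and `toy_rule_set` are hypothetical names. It assumes that a rule set can be modelled as a function from the Full Page text to extracted text, and that the Gold Standard document holds the expected result.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """Hypothetical pairing of the two versions collected for one web page."""
    full_page: str       # full original document
    gold_standard: str   # harvested document, assumed here to be the expected result


def score_rule_set(rule_set: Callable[[str], str], samples: List[Sample]) -> float:
    """Return the fraction of samples whose rule set output matches the gold standard."""
    if not samples:
        raise ValueError("the Gold Standard collection is empty")
    matches = sum(
        1 for s in samples
        if rule_set(s.full_page).strip() == s.gold_standard.strip()
    )
    return matches / len(samples)


def toy_rule_set(full_page: str) -> str:
    """Toy stand-in for a rule set: keep only the text inside <article> tags."""
    start = full_page.find("<article>")
    end = full_page.find("</article>")
    if start == -1 or end == -1:
        return ""
    return full_page[start + len("<article>"):end]


samples = [
    Sample("<nav>Menu</nav><article>Story one</article>", "Story one"),
    Sample("<article>Story two</article><footer>Ads</footer>", "Story two"),
]
print(score_rule_set(toy_rule_set, samples))  # 1.0 means every sample matched
```

Rerunning a check like this after each rule change is the regression-test pattern the Gold Standard collection supports: a score below 1.0 points to sample pages the modified rules no longer handle.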
Video: Create a New Rule Set
Click on the image below to view the video. The video uses the Sintelix Extension to create a new rule set and add another document to the Gold Standard collection, then shows you how to create and modify rules to get a perfect score.
Harvest Documents
You can collect documents for a Gold Standard collection in two ways:
- (Recommended) Use the Sintelix Extension, selecting the Advanced checkbox and leaving the Full Page checkbox selected (see Harvest via Sintelix Extension).
  This method gives you greater control over the sample pages you select and harvest. Starting with a small collection makes creating and evaluating rules quicker and easier.
- Run a Harvester job from the Harvester tab, making sure the Harvest Full Page checkbox is selected (see Harvest via Search Engines, Harvest via URLs, or Perform Batch/Scheduled Harvest).
If you create a new collection, give it the same name as the rule set you plan to create, and add the suffix ‘GS’ to identify it as a gold standard collection. For example, if the rule set you want to create will be titled ‘The Atlantic’, name the collection ‘The Atlantic GS’.
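If you keep your own notes or scripts about collections, the naming convention above can be captured in a small helper. This is a minimal sketch; `gold_standard_name` is a hypothetical function and is not part of Sintelix.

```python
def gold_standard_name(rule_set_name: str, suffix: str = "GS") -> str:
    """Build a collection name from a rule set name plus the 'GS' suffix."""
    name = rule_set_name.strip()
    if not name:
        raise ValueError("rule set name must not be empty")
    return f"{name} {suffix}"


print(gold_standard_name("The Atlantic"))  # The Atlantic GS
```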