Harvest Web Pages for Gold Standard
There are four main steps to building an effective rule set. View the workflow here.
This section covers the second step in the workflow: how to harvest web pages to create a gold standard. A gold standard is a set of model data that you can learn from and test on. For example, in Sintelix, this would be a collection of documents that have been created with specific, preferred properties such as correct document tags and text references. In Sintelix Harvester, this would be a collection of documents harvested from web pages where only the correct elements have been selected (that is, only the content you want).
To harvest web pages for a gold standard:
Find web pages that are representative of the content you ultimately want to harvest using the rule set, and harvest them using either of the following methods. In Sintelix Harvester, content is the text you want to harvest from a web page, such as headings, authors, dates, captions and paragraphs (as opposed to the text you want to ignore from menus, sidebars and other boilerplate elements).

Using Sintelix Harvester
- Select a Project, navigate to Harvester, and under the Query tab, select URL List as the query type.
- Do one of the following:
  - Select the gold standard collection that Sintelix created when you created the rule set.
  - Create a new collection. See Create a new Collection for more information.
    If you create a new collection, give it the same name as the rule set you plan to create, and add the suffix ‘GS’ to identify it as a gold standard collection. For example, if the rule set you want to create will be titled ‘The Atlantic’, name the collection ‘The Atlantic GS’.
- Enter a valid URL specific to your search. Any invalid URL is ignored.
- Under Harvest Parameter, select Harvest full page.
- Select Harvest.
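
Because any invalid URL in the list is ignored, it can help to check a URL list before you paste it into the URL List query. The following is a minimal Python sketch for doing that outside Sintelix. It is not part of Sintelix and does not call any Sintelix API. The function names (`is_probably_valid_url`, `split_url_list`, `gold_standard_name`) are hypothetical, and the validity test (an http or https scheme plus a host name) is an assumption, since the exact criteria Sintelix applies are not documented here.

```python
from urllib.parse import urlsplit


def is_probably_valid_url(url: str) -> bool:
    """Return True if the string has an http(s) scheme and a host.

    Assumption: this approximates what a harvester would accept.
    Sintelix's own validation rules may differ.
    """
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs,
        # for example an unclosed IPv6 bracket.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def split_url_list(lines):
    """Split candidate lines into (valid, rejected), skipping blank lines."""
    valid, rejected = [], []
    for line in lines:
        candidate = line.strip()
        if not candidate:
            continue
        if is_probably_valid_url(candidate):
            valid.append(candidate)
        else:
            rejected.append(candidate)
    return valid, rejected


def gold_standard_name(rule_set_name: str) -> str:
    """Apply the naming convention above: rule set name plus the suffix 'GS'."""
    return f"{rule_set_name.strip()} GS"


if __name__ == "__main__":
    # example.com URLs are placeholders, not real harvest targets.
    candidates = [
        "https://example.com/article-1",
        "not a url",
        "",
        "ftp://example.com/file",
    ]
    valid, rejected = split_url_list(candidates)
    print("Collection name:", gold_standard_name("The Atlantic"))
    print("Paste into URL List:", valid)
    print("Fix or drop:", rejected)
```

Reviewing the rejected entries before harvesting makes it easier to notice pages that would otherwise be missing from the gold standard collection.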

Using Sintelix Extension
- Open a Chrome browser, then open a web page you want to harvest.
- Click the Sintelix Extension icon, and then click Manual Harvest. See Harvest via Sintelix Extension for more information.
- In the Sintelix Harvester dialog, click Next, and then select the Advanced check box.
- Select a project and collection from the dropdown.
If you create a new collection, give it the same name as the rule set you plan to create, and add the suffix ‘GS’ to identify it as a gold standard collection. For example, if the rule set you want to create will be titled ‘The Atlantic’, name the collection ‘The Atlantic GS’.
- Select the Full Page check box and select Harvest.