Harvest Web Pages for Gold Standard

There are four main steps to building an effective rule set. View the workflow here.

This section covers the second step in the workflow: how to harvest web pages to create a gold standard. A gold standard is a set of model data that you can learn from and test on. For example, in Sintelix, this would be a collection of documents that have been created with specific, preferred properties such as correct document tags and text references. In Sintelix Harvester, this would be a collection of documents harvested from web pages where only the correct elements have been selected (that is, only the content you want).

To harvest web pages for a gold standard:

In Sintelix Harvester, content is the text you want to harvest from a web page, such as headings, authors, dates, captions and paragraphs (as opposed to the text you want to ignore from menus, sidebars and other boilerplate elements). To find web pages that are representative of the content you ultimately want to harvest using the rule set, use either of the following: