Harvester Rule Sets
What is it?
A Harvester Rule Set is a group of rules applied when harvesting a web page to:
- select the content elements on a web page, such as headings, authors, dates, captions and paragraphs, and
- ignore (exclude) the boilerplate elements, such as navigation bars, sidebars, menus, headers, footers and advertisements.
See Concept: Harvesting.
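The select/ignore idea described above can be sketched in a few lines of Python. This is only an illustration and does not show how Sintelix stores or evaluates its rules: the tag lists `CONTENT_TAGS` and `BOILERPLATE_TAGS`, the class `ContentExtractor` and the function `extract_content` are all hypothetical names chosen for the example.

```python
from html.parser import HTMLParser

# Hypothetical "rule set": which elements count as content and which as
# boilerplate. These tag lists are assumptions, not Sintelix's rule format.
CONTENT_TAGS = {"h1", "h2", "p", "figcaption", "time"}
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside"}


class ContentExtractor(HTMLParser):
    """Collects text found inside content elements, skipping boilerplate."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.keep_depth = 0  # > 0 while inside a content element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1
        elif tag in CONTENT_TAGS:
            self.keep_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in CONTENT_TAGS and self.keep_depth:
            self.keep_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when inside content and outside any boilerplate.
        if text and self.keep_depth and not self.skip_depth:
            self.chunks.append(text)


def extract_content(html):
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

For a page such as `<nav><p>Home</p></nav><h1>Title</h1><p>Body text</p>`, the sketch keeps the heading and paragraph and drops the link text inside the navigation bar, because exclusion takes precedence over selection.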
Requirements
To create/modify Rule Sets, you need to install the Sintelix Extension.
See About the Sintelix Extension for Harvesting and Install Sintelix Extension.
Default Rule Sets
Sintelix provides a number of default Rule Sets.
Default Rule Sets have a DEF symbol next to the Rule Set name.
If you modify a default Rule Set, a MOD symbol is shown next to the Rule Set.
You can restore the default Rule Set using the Revert option.
An Admin user can update the global default Harvester Rule Sets (See Configure Harvester Settings).
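One way to picture the DEF and MOD symbols and the Revert option is as a comparison against a stored system default. The sketch below assumes a hypothetical data model in which a rule set is a dictionary with `name` and `rules` keys and `system_defaults` maps names to the shipped rules. None of these names come from Sintelix.

```python
def badge(rule_set, system_defaults):
    """Return "DEF", "MOD" or "" for a rule set (hypothetical data model)."""
    default_rules = system_defaults.get(rule_set["name"])
    if default_rules is None:
        return ""  # not a default rule set, so no symbol is shown
    return "DEF" if rule_set["rules"] == default_rules else "MOD"


def revert(rule_set, system_defaults):
    """Return a copy of the rule set restored to the system default rules."""
    return {**rule_set, "rules": list(system_defaults[rule_set["name"]])}
```

Under these assumptions, reverting a modified default always brings its symbol back to DEF, which matches the behaviour described above.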
Apply Rule Sets
When you create a harvesting task, you can select which Rule Sets to apply. By default, all Rule Sets are selected.
From the Rule Sets you select, Sintelix automatically applies the most relevant Rule Set to each web page you harvest.
Rule Sets are applied in order of Rule Set priority, which is set in the Rule Set Configuration.
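Priority-ordered selection can be sketched as follows. How Sintelix decides that a Rule Set is "relevant" to a page is not described here, so the example makes two loud assumptions: relevance is a simple domain match, and a lower `priority` number means the Rule Set is tried first. `RuleSet`, `matches` and `pick_rule_set` are hypothetical names.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class RuleSet:
    name: str
    priority: int        # assumption: lower number is tried first
    domains: tuple = ()  # hypothetical criterion; empty tuple matches any site


def matches(rule_set, url):
    """Assumed relevance test: the page's host is one of the listed domains."""
    host = urlparse(url).hostname or ""
    if not rule_set.domains:
        return True
    return any(host == d or host.endswith("." + d) for d in rule_set.domains)


def pick_rule_set(selected, url):
    """Return the first matching rule set in priority order, or None."""
    for rule_set in sorted(selected, key=lambda r: r.priority):
        if matches(rule_set, url):
            return rule_set
    return None
```

With a site-specific Rule Set at priority 10 and a catch-all at priority 100, a page on the listed domain picks the site-specific one and every other page falls through to the catch-all.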
Manage Rule Sets
The actions available depend on the type of Rule Set:
- All Rule Sets: copy, export, import and modify.
- Default Rule Sets: revert a modified default Rule Set to the system default.
- Created Rule Sets: create, rename and delete Rule Sets created for this Project.
See Manage Rule Sets.
Effective Rule Sets
There are three stages to establishing an effective rule set.
How Upgrades Affect Rule Sets
When you upgrade Sintelix:
- default rule sets that you have not modified will be upgraded.
- default rule sets that you have modified will be retained and not upgraded or overwritten.
- rule sets you have created will be retained.
- new default rule sets may be added to the Harvest Default rule sets configuration.
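The four upgrade rules above amount to a merge between the installed rule sets and the defaults shipped with the new version. The sketch below is an illustration only: the dictionary layout (`kind`, `modified`, `rules`) and the function `upgrade_rule_sets` are hypothetical, and cases the list does not cover (for example, a default that no longer ships) are simply left unchanged.

```python
def upgrade_rule_sets(installed, new_defaults):
    """Merge installed rule sets with the defaults from a new version.

    installed:    name -> {"kind": "default" | "created",
                           "modified": bool, "rules": ...}
    new_defaults: name -> rules shipped with the upgrade
    Both structures are assumptions made for this example.
    """
    result = {}
    for name, rule_set in installed.items():
        unmodified_default = (
            rule_set["kind"] == "default" and not rule_set["modified"]
        )
        if unmodified_default and name in new_defaults:
            # Unmodified defaults are upgraded to the new version.
            result[name] = {
                "kind": "default",
                "modified": False,
                "rules": new_defaults[name],
            }
        else:
            # Modified defaults and created rule sets are retained as-is.
            result[name] = rule_set
    for name, rules in new_defaults.items():
        if name not in result:
            # Defaults that are new in this version are added.
            result[name] = {"kind": "default", "modified": False, "rules": rules}
    return result
```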