Rule Set Configuration Settings
Field or option | Description |
---|---|
Name | The Rule Set Name is displayed on the left. In the example above, it is "CNN Politics". |
Description | The Description of the Rule Set is filled in automatically when the Rule Set is created. You can edit this field to describe the purpose and proposed use of this Rule Set. |
Batch Harvest Parameters | |
URL Patterns | Sintelix uses this field to determine which Rule Set to apply to which web pages, based on the web page URL. Selecting will populate this field with URLs found in the Gold Standard Collection. Enter the domain or URL patterns you want the rule set to harvest. If the rule set is for a specific domain, enter the domain, using wildcards if necessary, for example `*nytimes.com*`. If the rule set is more generic, enter specific domains and/or parts of URLs, using wildcards to make them as generic as possible, for example: `*.*blog*.com*` `*.*news*.com*` `*.*post*.com*` `*.*press*.com*` `*.abc.*` `*/article*` `*/news*` `*/story*` |
Persona Domains | Determines the URLs to use when creating a persona. Selecting will populate this field with URLs found in the Gold Standard Collection. |
Pre-append Text | You can add a label or title to the top of each harvested document. For example, if you want to record the name of the rule set used to harvest the content of the document, enter it in this field. Optional parameters are: |
Rule Set Priority | You can give the Rule Set a priority. Sintelix sorts the Rule Sets in priority order, from highest priority to lowest priority. When it goes to harvest a page, it runs through the Rule Sets in priority order until it finds a matching rule. As a guide: For example, you may have a rule set for a specific news site such as The New York Times, and a more generic rule set for other news sites. Giving The New York Times rule set a higher priority ensures that it, and not the generic rule set for news sites, is applied to pages on The New York Times website. |
Max Harvest Depth | Enter the number of hyperlink levels you want Harvester to follow from the main URL. If you do not want Harvester to follow any hyperlinks, enter 1. Too high a number may result in the harvest task taking a long time. Generally, a depth of 3 to 5 is considered a balanced approach. |
Wait Before Harvest | If you want Harvester to wait for a specific time period before harvesting, so that pages can load completely, select the time period. This waiting period is indicated by the status ‘Waiting (rule set)’. |
Harvest Links Only | If you do not want the rule to harvest content from the main URL, but only to follow hyperlinks and harvest the content from those links, check the Harvest Links Only box. If you select this option, enter the maximum number of hyperlink levels you want Harvester to follow in the Max Harvest Depth field. |
Hide rule set in Harvester | Check this box to exclude the rule set from the Harvester rule set search option. This is useful when running large harvests not targeted by this Rule Set. Hiding the Rule Set means the Harvester has one fewer check to run during a harvesting task. |
Harvest All IMGs | Check this box to automatically harvest all images inside a selected element. |
Harvest 'Alt' from Extracted IMGs | Automatically extracts the Alt attribute of all images (IMG tags). Alternatively, you can extract the Alt attribute of selected images using Add attributes within specific rules. |
Scroll to bottom | Defines the number of times Harvester scrolls to the bottom of a web page during a harvest. This is useful for web pages where new content is loaded only when the user scrolls down. |
Duplicate URLs | Defines how to handle duplicate URLs. There are 3 options that can be selected: |
Display colour | Defines the text colour of the rule name as it appears on the Harvester page under rule set options. |
Search Engine Driver | Select this option if the rule set provides search engine results. The Google and Duck Duck Go default rule sets are examples of rules that utilise this option. When selected, the following additional options appear: |
Override Preview CSS | Used to modify the CSS of the gold standard preview document. This is useful when sections of the content cannot be clicked on due to display issues. |

After the required configuration changes have been made, select Save. A success message will be displayed.
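
The interaction between URL Patterns and Rule Set Priority can be illustrated with a minimal Python sketch. It is not Sintelix code: the `RuleSet` class and the `wildcard_match` and `select_rule_set` functions are hypothetical names introduced here for illustration. The sketch assumes that `*` matches any run of characters, that matching is case-insensitive, and that a larger number means a higher priority; the actual product may differ on any of these points.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RuleSet:
    """Hypothetical stand-in for a rule set; not a Sintelix type."""
    name: str
    priority: int              # assumption: larger number = higher priority
    url_patterns: List[str]


def wildcard_match(pattern: str, url: str) -> bool:
    """Return True if the whole URL matches the pattern, where * is any run of characters."""
    # Escape every literal piece, then join the pieces with ".*" in place of each "*".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Case-insensitive matching is an assumption of this sketch.
    return re.fullmatch(regex, url, flags=re.IGNORECASE | re.DOTALL) is not None


def select_rule_set(rule_sets: List[RuleSet], url: str) -> Optional[RuleSet]:
    """Walk the rule sets from highest to lowest priority and return the first match."""
    for rule_set in sorted(rule_sets, key=lambda rs: rs.priority, reverse=True):
        if any(wildcard_match(p, url) for p in rule_set.url_patterns):
            return rule_set
    return None


# A site-specific rule set and a generic one, as in the Rule Set Priority example.
nyt = RuleSet("The New York Times", priority=10, url_patterns=["*nytimes.com*"])
generic = RuleSet("Generic news", priority=1,
                  url_patterns=["*/article*", "*/news*", "*/story*"])
rule_sets = [generic, nyt]

for url in ("https://www.nytimes.com/news/today",
            "https://example.org/news/today",
            "https://example.org/about"):
    chosen = select_rule_set(rule_sets, url)
    print(url, "->", chosen.name if chosen else "no matching rule set")
```

The first URL matches both rule sets, but the site-specific one is chosen because it is checked first. The second URL falls through to the generic rule set, and the third matches nothing.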