Rules Tab: Modify Rules

Background
Once you have created a Rule Set and added documents to a Gold Standard Collection, you can create and modify the rules.
Refer to Create a Rule Set: Sintelix Extension, Harvest to a Gold Standard Collection and Evaluate and Modify the Rule Set for information.
When you select Configurations > Harvester Rule Sets, and open a Rule Set, the Rules tab is displayed in the right pane.
What you can do
The Rules tab lists the rules in the Rule Set. In the Rules tab you can:
-
Select all rules by selecting the checkbox at the top left of the Rules tab.
-
Copy a rule by selecting the copy icon next to the rule.
-
Edit a rule by clicking on it to open the Rules dialog. See Rules: Fields and Options.
-
Change the order of the rules by clicking and dragging a rule.
-
Test Selected Rules by selecting the checkbox next to each rule you want to test. This removes all unselected rules from the Full Page Document pane, showing only the impact of the selected rule(s).
-
Delete Selected Rules by selecting the checkbox next to each rule and then selecting the Delete Selected Rule button.
-
Automatically simplify every rule in the set by selecting the button.
-
Update all Entity Tags by selecting the button. This updates the full collection with all Entity Tags that have been associated with a rule.
Copy a rule
To copy a rule, select the copy icon next to the rule.
Result: The copied rule will be added just below the current rule, and will have the same name.
This can be useful when you want to:
-
test alternative settings on a rule while keeping a backup of the original rule
-
create rules that are similar with slight variations in tags or classes
-
create a negative rule to work with the copied rule.
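The copy behaviour described above can be pictured with a minimal sketch. The `Rule` class, its fields, and `copy_rule` are hypothetical names used only for illustration; they are not part of Sintelix, and the real rule model may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Rule:
    # Hypothetical stand-in for a harvester rule; field names are assumptions.
    name: str
    path: str
    negative: bool = False


def copy_rule(rules: list, index: int) -> list:
    """Return a new list with a copy of rules[index] placed directly below it."""
    duplicate = replace(rules[index])  # identical name and settings
    return rules[: index + 1] + [duplicate] + rules[index + 1:]


rules = [Rule("Title", "h1.headline"), Rule("Body", "div.article p")]
rules = copy_rule(rules, 0)

# Turn the copy into a negative variant while the original stays as a backup.
rules[1] = replace(rules[1], negative=True)
```

The copy keeps the original name, so the two rules can only be told apart by position or by their settings until one of them is renamed.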
Edit a rule
To edit a rule, simply click on the rule to open the Rules dialog. See Rules: Fields and Options.
Change the Order
The position of the rules in a Rule Set Configuration is important: rules at the bottom will overwrite those at the top if the Rule Paths overlap.
You can change the order of the rules by clicking and dragging a rule up or down the list.
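To illustrate why position matters, here is a minimal sketch of "later rules overwrite earlier ones". It assumes each rule can be reduced to a label plus the set of page elements its path matches; the function name and the element identifiers are hypothetical and do not reflect Harvester's internals.

```python
def apply_in_order(rules):
    """Apply rules from top to bottom; rules is a list of (label, matched_elements)."""
    result = {}
    for label, matched in rules:
        for element in matched:
            # A rule lower in the list overwrites an earlier rule on overlap.
            result[element] = label
    return result


top_to_bottom = [
    ("Body", {"p1", "p2", "p3"}),
    ("Caption", {"p3"}),  # overlaps with "Body" on p3
]

outcome = apply_in_order(top_to_bottom)
reordered = apply_in_order(list(reversed(top_to_bottom)))
```

With "Caption" at the bottom, `p3` ends up labelled "Caption". Dragging "Caption" above "Body" means "Body" is applied last and claims `p3` instead.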
Order Rules are Applied
The order in which rules are applied to a URL is:
- If the setting Wait random time before harvesting is selected, the rule set waits a random time.
- The page is loaded.
- If the Pre-click before other rules option is selected in any rules, the buttons to which these rules apply (for example, 'Show more') are clicked and Harvester waits for the duration of the 'rule set wait' period to allow this content to be loaded.
- Positive rules are applied.
- Negative rules are applied (overriding positive rules where there is contention).
- Rules that require a previous element to be selected (h1, h*, p, etc.) are applied.
- Rules that require (any) previous element to be selected are applied.
- Links to be pushed to the search queue are selected, if the current depth <= harvest depth.
There are two 'wait' settings related to rule sets:
- 'rule set wait', where you can configure individual rule sets to wait a specified amount of time (up to 60 seconds) before harvesting to enable pages to load completely
- 'random wait', where Sintelix Harvester goes to the websites in the Harvest Queue and then waits a random amount of time (up to 60 seconds) before it begins harvesting text, to mimic patterns of human interaction with websites
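The order in which rules are applied can be summarised as a small planning function. This is a sketch only: the `Rule` and `RuleSet` classes, the `kind` values, and `plan_steps` are hypothetical, and it assumes that a pre-click rule is not applied again in the later phases, which the description above does not state.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    # Hypothetical values: "positive", "negative",
    # "previous_specific" (h1, h*, p, etc.), "previous_any".
    kind: str = "positive"
    pre_click: bool = False


@dataclass
class RuleSet:
    rules: list = field(default_factory=list)
    wait_random_time: bool = False


def plan_steps(rule_set, current_depth, harvest_depth):
    """List the steps taken for one URL, in the order described above."""
    steps = []
    if rule_set.wait_random_time:
        steps.append("random wait")
    steps.append("load page")

    pre_click = [r.name for r in rule_set.rules if r.pre_click]
    if pre_click:
        steps.append("pre-click: " + ", ".join(pre_click))
        steps.append("rule set wait")

    # Phases run in a fixed order, regardless of list position.
    for kind in ("positive", "negative", "previous_specific", "previous_any"):
        for rule in rule_set.rules:
            if rule.kind == kind and not rule.pre_click:  # assumption, see lead-in
                steps.append(f"apply {kind} rule: {rule.name}")

    if current_depth <= harvest_depth:
        steps.append("select links for the search queue")
    return steps


rule_set = RuleSet(
    rules=[
        Rule("Remove ads", kind="negative"),
        Rule("Show more", pre_click=True),
        Rule("Body"),
    ],
    wait_random_time=True,
)
steps = plan_steps(rule_set, current_depth=1, harvest_depth=2)
```

In this example the negative rule "Remove ads" is listed first but is still applied after the positive rule "Body", because the phase order takes precedence over list position.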
Test Selected Rules
You can test the effect of selected rules by selecting the checkbox next to the rule(s) you want to test.
This can be useful when it is not clear which elements are affected by a rule.
Result: This will:
-
make all unselected rules inactive, and
-
update the Full Page document to colour code only those elements impacted by the selected rule(s).
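Conceptually, testing selected rules is a filter over the rule list. The sketch below assumes a hypothetical mapping from rule names to the elements each rule matches, and it assumes that all rules stay active when nothing is selected; neither detail is confirmed by the product documentation.

```python
def highlighted_elements(rule_matches, selected):
    """Return only the matches that would be colour coded.

    rule_matches: dict of rule name -> list of matched element ids (hypothetical).
    selected: set of rule names whose checkbox is ticked.
    """
    if not selected:
        # Assumption: with no checkbox ticked, every rule remains active.
        return dict(rule_matches)
    return {name: hits for name, hits in rule_matches.items() if name in selected}


matches_by_rule = {
    "Title": ["h1"],
    "Body": ["p1", "p2"],
    "Author": ["span.byline"],
}

only_body = highlighted_elements(matches_by_rule, {"Body"})
everything = highlighted_elements(matches_by_rule, set())
```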
Delete Selected Rules
You can delete one or more rules:
-
select the checkbox next to the rules you want to delete, and
-
select the Delete Selected Rule button (which is only displayed when at least one rule is selected).
Result: The selected rule(s) are removed from the list.
Auto Simplify All Rules
To automatically remove extraneous tags and classes from the path of every rule, select the Auto Simplify All Rules button.
Simplified rules are more generic and run faster than rules with more complex paths.
However, Auto Simplify may make the rule too generic.
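The trade-off can be shown with a toy model. The code below is not the Auto Simplify algorithm, which is not documented here; it is a hypothetical simplification that keeps only the last step of a rule path and drops its class requirements, to show how a shorter path matches more elements.

```python
def matches(path, element):
    """True if the rule path matches the tail of the element's ancestor chain.

    Both arguments are lists of (tag, classes) pairs, outermost element first.
    """
    if len(path) > len(element):
        return False
    tail = element[len(element) - len(path):]
    return all(
        tag == e_tag and classes <= e_classes  # every required class is present
        for (tag, classes), (e_tag, e_classes) in zip(path, tail)
    )


def simplify(path, keep=1):
    """Hypothetical simplification: keep the last `keep` steps, drop all classes."""
    return [(tag, frozenset()) for tag, _ in path[-keep:]]


page = [
    [("div", {"article"}), ("p", {"body"})],  # the paragraph we want
    [("div", {"footer"}), ("p", {"legal"})],  # a paragraph we do not want
]

original = [("div", {"article"}), ("p", {"body"})]
simplified = simplify(original)

original_hits = [e for e in page if matches(original, e)]
simplified_hits = [e for e in page if matches(simplified, e)]
```

Here the original path selects one paragraph, while the simplified path selects both, including the footer text. This is the "too generic" case.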
If the rule selects too many elements when simplified, you can delete the rule and recreate it by clicking on the element you want to harvest, selecting one of the available rule options, and then manually deleting excess tags or ignoring unnecessary classes to determine the most effective combination.
Update all Entity Tags
Update all Entity Tags by selecting the button. This updates the full collection with all Entity Tags that have been associated with a rule.