Evaluate and Modify the Rule Set
Background
Rule Sets are used to harvest web pages. See Concept: Harvesting and Harvester Rule Sets.
Refer to Concept: Harvester Gold Standard for information on Gold Standards.
Goal
The ultimate purpose of a Gold Standard collection is to provide a tool for testing and evaluating the effectiveness of a Rule Set.
By running the current rules on the Full Page document and comparing the result with the Gold Standard document, you can identify any issues with the Rule Set.
For a Gold Standard collection to be most effective, each Gold Standard document should contain only the elements you want the rule set to harvest.
However, when a Gold Standard document is first harvested, it may be missing some content or contain spurious content.
Therefore, you need to refine the set of rules, which means:
- adding/removing content from the Gold Standard document until it represents what you want harvested
- creating/modifying rules to add the elements you want selected, and
- removing/modifying rules to make sure unwanted elements are not selected.
This is a manual process, which is simplified by being able to view the Gold Standard document (which contains only text) beside the corresponding Full Page document (which contains every harvestable element, both content and boilerplate, from the original web page). As you select elements on the Full Page document to add to the Gold Standard (or remove from it), you immediately see the effect on the Gold Standard document.
To create or modify a rule set you need a solid understanding of HTML, including nested elements and classes.
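As a rough illustration of what "nested elements and classes" means in practice, the Python sketch below lists each piece of text in a small, made-up page together with the chain of elements and classes that contains it. The sample HTML, the class names and the helper names (`PathCollector`, `text_paths`) are invented for illustration. This is not how Sintelix represents or applies rules.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record the nesting path (tag.class > tag.class ...) of every text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, outermost first
        self.paths = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = dict(attrs).get("class") or ""
        label = tag + "".join("." + c for c in classes.split())
        self.stack.append(label)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack:
            self.stack.pop()  # assumes well-formed, properly nested HTML

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append((" > ".join(self.stack), text))


# Hypothetical page: a navigation block (boilerplate) and an article (content).
SAMPLE_HTML = """
<div class="page">
  <div class="nav"><a href="/home">Home</a></div>
  <div class="article">
    <h1 class="title">Headline</h1>
    <p class="body">Story text.</p>
  </div>
</div>
"""


def text_paths(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.paths


for path, text in text_paths(SAMPLE_HTML):
    print(f"{path}: {text}")
```

In this made-up page, the text you would want harvested sits under `div.article`, while the text under `div.nav` is boilerplate. Telling the two apart by their position in the nesting and by their classes is the kind of reasoning needed when building a rule.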
Access
To evaluate and modify a Rule Set, select:
- Configurations > Harvester Rule Sets, and
- the Rule Set you want to modify.
Result: The Rule Set panes are displayed.
Once you have selected the Rule Set, you can select the collapse icon to collapse the Configurations and Harvester Rule Sets panes to maximise the view of the remaining three panes.
Rule Set Panes
When configuring the Harvester Rule Sets, there are three panes displayed.
You can resize the three panes by hovering over a pane separator, then clicking and dragging.
You can also select the collapse icon to collapse the Configurations, Harvester Rule Sets, and Collection Evaluation panes.
Rule Set Panes and Tabs
There are three pane areas:
- Left pane has two tabs:
  - Collection Evaluation tab (default) - displays the Evaluation Table, which identifies the number of correct, spurious and missed elements based on the selected Gold Standard Collection.
  - Rule Set Configuration tab - allows you to modify the Rule Set settings (see Rule Set Configuration Settings).
- Middle pane always displays the Full Page Document pane. You select the Full Page document to display by clicking the document title in the Evaluation Table in the left pane.
- Right pane has four tabs:
  - Rules tab - selected whenever you add a new rule or modify an existing rule (see Rules: Fields and Options).
  - Gold Standard tab - displays the Gold Standard document so you can see the resulting document once the rules have been applied.
  - Selected by Rule Set tab - displays Pre-click events, harvested content (including content tagged as an entity), and follow hyperlinks identified.
  - Errors tab - lists the spurious elements and misses in the document and provides a quick way to correct the errors. See Errors tab: Quickly Fix Errors.
Collection Evaluation tab
When you open a Rule Set, the last Collection linked to the Rule Set will be displayed in the Collection Evaluation tab.
On this pane, you can perform three key tasks:
- Select the Gold Standard Collection
- Update the Evaluation Table, and
- Select a Document to evaluate and test.
See Collection Evaluation tab for more details.
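The counts in the Evaluation Table can be thought of as a set comparison between what the rules select and what the Gold Standard contains. The sketch below assumes each element can be identified by a simple string. The identifiers and the `evaluate` function are hypothetical, and the comparison Sintelix performs internally may differ.

```python
def evaluate(selected_by_rules, gold_standard):
    """Compare rule output with the Gold Standard (both are collections of element ids)."""
    selected = set(selected_by_rules)
    gold = set(gold_standard)
    return {
        "correct": sorted(selected & gold),    # selected by the rules and in the Gold Standard
        "spurious": sorted(selected - gold),   # selected by the rules but not in the Gold Standard
        "missed": sorted(gold - selected),     # in the Gold Standard but not selected by the rules
    }


# Hypothetical element identifiers for one document.
result = evaluate(
    selected_by_rules=["h1.title", "p.body", "div.nav"],
    gold_standard=["h1.title", "p.body", "p.byline"],
)
print({name: len(items) for name, items in result.items()})
# {'correct': 2, 'spurious': 1, 'missed': 1}
```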
Full Page Document pane
The Full Page Document pane displays the Full Page document containing all the original HTML.
Using the Full Page document, you can:
- visualise the effect of selected rules on the Full Page document
- add and modify rules by selecting elements so you can:
  - create a new rule for selected elements
  - add the selected elements to the Gold Standard document
  - remove the selected elements from the Gold Standard document
- change the zoom on the pane between small, medium and large
- switch views between visual mode and plain document mode, which can be useful when the visual mode is not displaying correctly.
See Full Page Document pane for more details.
Rules tab
The Rules tab lists the rules in the Rule Set. In the Rules tab you can:
- Copy a rule by selecting the copy icon next to the rule.
- Edit a rule by clicking it to open the Rules Dialog. See Rules: Fields and Options.
- Change the order of the rules by clicking and dragging a rule.
- Test selected rules by selecting the checkbox next to each rule. This removes all unselected rules from the Full Page Document pane, showing only the impact of the selected rule(s).
- Delete selected rules by selecting the checkbox next to each rule and then selecting the delete button.
- Automatically simplify every rule in the set by selecting the simplify button.
- Update the full collection with all Entity Tags that have been associated with a rule by selecting the corresponding button.
See Rules Tab: Modify Rules for more details.
Gold Standard tab
The Gold Standard tab displays the Gold Standard document matching the Full Page document in the middle pane.
You can:
- Clear the Gold Standard document, which effectively creates a blank document. You can then re-apply the rules to the document using the Errors tab. This can be useful when you are troubleshooting rules.
- Open the Sintelix document, which shows the Gold Standard document with the entities and links applied.
Selected by Rule Set
The Selected by Rule Set tab displays Pre-click events, harvested content (including content tagged as an entity), and follow hyperlinks identified.
This is useful for evaluating rules which are not reflected in the Gold Standard document, for example, following hyperlinks.
Errors tab
The Errors tab lists the spurious elements and misses for this document. On the Errors tab you can:
- Select an error to find it in the document.
- Select one or more errors to take corrective action, by either removing content from or adding content to the Gold Standard document.
  This is useful when you have made a lot of changes to the rules and you want to quickly update the Gold Standard document with the latest rules applied:
  - Select all Spurious text errors and Add them to the Gold Standard.
  - Select all Misses and Remove them from the Gold Standard.
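The effect of adding all spurious elements and removing all misses can be sketched with sets: afterwards, the Gold Standard contains exactly what the rules currently select. The element identifiers and the `reconcile_gold_standard` function below are hypothetical and only illustrate the idea. They are not part of Sintelix.

```python
def reconcile_gold_standard(gold_standard, spurious, misses):
    """Return the Gold Standard after adding spurious elements and removing misses."""
    return (set(gold_standard) | set(spurious)) - set(misses)


# Hypothetical element identifiers for one document.
gold = {"h1.title", "p.body", "p.byline"}
selected = {"h1.title", "p.body", "div.nav"}

spurious = selected - gold   # selected by the rules, not in the Gold Standard
misses = gold - selected     # in the Gold Standard, not selected by the rules

updated_gold = reconcile_gold_standard(gold, spurious, misses)
print(updated_gold == selected)  # True: the Gold Standard now matches the rule output
```

Because this makes the Gold Standard agree with whatever the rules select, it is only appropriate once you are satisfied that the current rules select the right content.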