Rule Set Configuration Settings

Field or option

Description

Name

The Rule Set Name is displayed on the left. In the example above, it is "CNN Politics".

Description

The Description of the Rule Set is auto-filled when the Rule Set is created.

You can edit this field to describe the purpose and proposed use of this Rule Set.

Batch Harvest Parameters

URL Patterns

Sintelix uses these patterns to determine which Rule Set to apply to a web page, based on the page's URL.

Selecting Infer from documents will populate this field with URLs found in the Gold Standard Collection.

Enter the domain or URL patterns you want the rule set to harvest.

If the rule set is for a specific domain, enter the domain, using wildcards if necessary, for example *nytimes.com*

If the rule set is more generic, enter specific domains and/or parts of URLs, using wildcards to make them as generic as possible. For example:

*.*blog*.com*

*.*news*.com*

*.*post*.com*

*.*press*.com*

*.abc.*

*/article*

*/news*

*/story*
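
The effect of these wildcard patterns can be previewed with a short script. This is a minimal sketch that assumes * means "any run of characters", as in shell-style globbing; the exact matching rules Sintelix applies are not described here, and the helper name matching_patterns is hypothetical.

```python
from fnmatch import fnmatchcase

# A few of the example patterns from this section.
URL_PATTERNS = [
    "*nytimes.com*",
    "*.*blog*.com*",
    "*.abc.*",
    "*/article*",
]


def matching_patterns(url: str, patterns: list[str]) -> list[str]:
    """Return every pattern that matches the URL, treating * as 'any characters'.

    Assumption: glob-style matching. Note that fnmatch also treats ? and [ ]
    as special characters, which may differ from the product's behaviour.
    """
    return [p for p in patterns if fnmatchcase(url, p)]


print(matching_patterns("https://www.nytimes.com/section/politics", URL_PATTERNS))
print(matching_patterns("https://example.org/article/123", URL_PATTERNS))
print(matching_patterns("https://example.org/about", URL_PATTERNS))
```

A URL that matches none of the patterns, such as the last one, would not be handled by this rule set.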

Persona Domains

Determines the URLs to use when creating a persona.

Selecting Infer from documents will populate this field with URLs found in the Gold Standard Collection.

Pre-append Text

You can add a label or title to the top of each harvested document. For example, if you want to record the name of the rule set used to harvest the content of the document, enter it in this field.

Optional parameters are:

  • %fromurl%

  • %url%
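
As an illustration of how the placeholders might expand, the sketch below substitutes them into a label. It assumes %url% stands for the URL of the harvested page and %fromurl% for the page that linked to it; this section does not define either parameter, so treat both meanings, and the function name render_pre_append, as assumptions.

```python
def render_pre_append(template: str, url: str, from_url: str) -> str:
    """Fill the Pre-append Text placeholders with concrete values.

    Assumption: %url% is the harvested page and %fromurl% is the referring page.
    """
    return template.replace("%fromurl%", from_url).replace("%url%", url)


label = render_pre_append(
    "Rule set: CNN Politics | %url% (from %fromurl%)",
    url="https://example.org/politics/story",
    from_url="https://example.org/politics",
)
print(label)
```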

Rule Set Priority

You can give the Rule Set a priority.

Sintelix sorts the Rule Sets in priority order, from highest priority to lowest priority.

When it harvests a page, it runs through the Rule Sets in priority order until it finds a matching rule.

As a guide:

  • Links Only: Search sites or news sites featuring primarily links to other sites can be set to harvest links only, adding more URLs to the harvesting task. These should have a higher priority so the links are captured first.

  • Specific URLs: Rule Sets targeting specific sites, for example Facebook, should come next with medium to high priority.

  • Generic Rules: Rule Sets with more generic rules should come with low to medium priority, catching pages not caught by the more targeted rule sets.

  • The last Rule Set is named Last Resort with a priority of 0 (the lowest). This is a very generic rule set and is only used when no other rule sets match any given web page. However, because it is so generic the results may not be as successful as rule sets created for specific domains or URLs.

For example, you may have a rule set for a specific news site such as The New York Times, and a more generic rule set for other news sites. Giving The New York Times rule set a higher priority ensures that it will be applied to pages on The New York Times website and not the generic rule set for news sites.
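
The selection behaviour described above can be sketched as follows. This is an illustrative model only: the RuleSet class, the glob-style matching, the priority values 30 and 70, and the catch-all pattern for Last Resort are all assumptions; only the Last Resort priority of 0 comes from this section.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass
class RuleSet:
    name: str
    priority: int            # higher values are checked first
    url_patterns: list[str]


def select_rule_set(url: str, rule_sets: list[RuleSet]) -> Optional[RuleSet]:
    """Return the highest-priority rule set with a URL pattern matching the URL."""
    for rule_set in sorted(rule_sets, key=lambda r: r.priority, reverse=True):
        if any(fnmatchcase(url, p) for p in rule_set.url_patterns):
            return rule_set
    return None


RULE_SETS = [
    RuleSet("Last Resort", 0, ["*"]),  # assumed to match every page
    RuleSet("Generic News", 30, ["*.*news*.com*", "*/news*", "*/story*"]),
    RuleSet("The New York Times", 70, ["*nytimes.com*"]),
]

for url in (
    "https://www.nytimes.com/news/story.html",
    "https://example.org/news/today",
    "https://example.org/about",
):
    print(url, "->", select_rule_set(url, RULE_SETS).name)
```

The first URL also matches the generic news patterns, but the higher-priority rule set is reached first, which is the outcome the example above relies on.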

Max Harvest Depth

Enter the number of hyperlink levels you want Harvester to follow from the main URL. If you do not want Harvester to follow any hyperlinks, enter 1. Too high a number may result in the harvest task taking a long time. Generally, a depth of 3 to 5 is considered a balanced approach.
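
To see why the depth value has such a large effect on task length, the following sketch walks a small, made-up link graph. It assumes a breadth-first traversal and counts the main URL as level 1, consistent with the guidance that a value of 1 follows no hyperlinks; the actual traversal order used by Harvester is not stated here, and the LINKS data and function name are hypothetical.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links found on it.
LINKS = {
    "home": ["a", "b"],
    "a": ["c"],
    "b": ["c", "home"],
    "c": ["d"],
    "d": [],
}


def pages_to_harvest(start: str, max_depth: int) -> list[str]:
    """Breadth-first walk that stops max_depth levels from the start page.

    Depth 1 is the start page itself, so max_depth=1 follows no hyperlinks.
    """
    seen = {start}
    queue = deque([(start, 1)])
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth >= max_depth:
            continue  # do not follow links beyond the configured depth
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


for depth in (1, 2, 3):
    print(depth, pages_to_harvest("home", depth))
```

On real sites the number of pages typically grows much faster with each extra level than in this toy graph.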

Wait Before Harvest

If you want Harvester to wait for a set period before harvesting, so that pages can load completely, select the time period. This waiting period is indicated by the status ‘Waiting (rule set)’.

Harvest Links Only

If you do not want the rule to harvest content from the main URL but to follow hyperlinks and harvest the content from those links, check the Harvest Links Only box.

If you select this option, enter the maximum number of hyperlink levels you want Harvester to follow in the Max Harvest Depth field.

Hide rule set in Harvester

Check this box to exclude the rule set from the Harvester rule set search option. This is useful when running large harvests not targeted by this Rule Set. Hiding the Rule Set means the Harvester has one less check to run during a Harvesting task.

Harvest All IMGs

Check this box to automatically harvest all images inside a selected element.

Harvest 'Alt' from Extracted IMGs

Automatically extracts the Alt attribute of all images (IMG tags).

Alternatively, you can extract the Alt attribute of selected images using the Add attributes option within specific rules.
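
For readers unfamiliar with the Alt attribute, the sketch below shows the kind of text this option collects. It uses Python's standard HTML parser on a made-up fragment and is not how Sintelix performs the extraction; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Collect the non-empty alt text of every IMG tag in a document."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alts.append(alt)


def extract_alts(html: str) -> list[str]:
    parser = AltCollector()
    parser.feed(html)
    parser.close()
    return parser.alts


SAMPLE = (
    '<div><img src="a.png" alt="Senate vote">'
    '<img src="b.png">'
    '<img src="c.png" alt="Press briefing"/></div>'
)
print(extract_alts(SAMPLE))
```

Images without an Alt attribute, like the second one in the sample, contribute nothing.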

Scroll to bottom

Used to define the number of times to scroll to the bottom of a web page during a harvest. This is useful for web pages where new content is loaded only when the user scrolls down.

Duplicate URLs

Defines how to handle duplicate URLs. There are 3 options that can be selected:

  • No Limit – Will harvest the same URL even if it already exists in the target collection.
  • No Duplicates – Will not harvest the same URL if it already exists in the target collection.
  • Rule Based Filter – Will only harvest the same URL if the content of the web page has changed. Rules can be configured to ignore changed text, e.g. dynamically changing content such as dates.
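
The three options can be compared with a small decision function. This is a sketch under stated assumptions: the target collection is modelled as a dictionary of URL to content fingerprint, change detection uses a SHA-256 hash, and the single ignore rule (strip ISO-style dates) is a hypothetical stand-in for the configurable rules mentioned above.

```python
import hashlib
import re

MODES = ("No Limit", "No Duplicates", "Rule Based Filter")

# Hypothetical ignore rule: drop ISO-style dates before comparing content.
IGNORE_PATTERNS = [re.compile(r"\d{4}-\d{2}-\d{2}")]


def content_fingerprint(text: str) -> str:
    """Hash the page text after removing content the rules say to ignore."""
    for pattern in IGNORE_PATTERNS:
        text = pattern.sub("", text)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def should_harvest(mode: str, url: str, text: str, collection: dict[str, str]) -> bool:
    """Decide whether to harvest a URL; collection maps harvested URLs to fingerprints."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if mode == "No Limit" or url not in collection:
        return True
    if mode == "No Duplicates":
        return False
    # Rule Based Filter: harvest again only if the filtered content changed.
    return collection[url] != content_fingerprint(text)


url = "https://example.org/a"
collection = {url: content_fingerprint("Updated 2024-01-01: markets rise")}
print(should_harvest("Rule Based Filter", url, "Updated 2024-02-15: markets rise", collection))
print(should_harvest("Rule Based Filter", url, "Updated 2024-02-15: markets fall", collection))
```

In the first call only the date differs, so the page counts as unchanged; in the second the body text differs, so it would be harvested again.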

Display colour

Defines the text colour of the rule name as it appears on the Harvester page under rule set options.

Search Engine Driver

Selected if the rule set provides search engine results. The Google and Duck Duck Go default rule sets are examples of rules that utilise this option.

When selected, the following additional options appear:

  • Search URL – This defines the Base URL with the Search term injection placeholder. Below is an example of how the Search URL for Google is formatted:

    Base URL:                                        http://google.com?q=
    Search term injection placeholder (mandatory):   %terms%
    Search URL:                                      http://google.com?q=%terms%
  • Search Parameters – These allow you to define specific requirements for search results. For example, search parameters can be used to limit the search results based on attributes like their language and time of creation.

    Parameters are injected into the URL wherever %param% is defined, and are normally appended to the end of the base URL.

    • Select Add Parameter and enter a name in the Option 1 field.
    • Enter a parameter, or leave as default.
    • From the dropdown, select Text, Integer, Range, Step Range, Multiple Texts, or Lists, as required.
    • To add more parameters, select Add Parameter again and repeat these steps as many times as required.

    The default Google Rule Set provides a good example of how this capability is applied.
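
The placeholder substitution described above can be sketched in a few lines. The sketch assumes search terms are URL-encoded and that parameters are rendered as &name=value pairs at the %param% position; the exact encoding Sintelix applies is not documented here, and the parameter name lang and the function name build_search_url are hypothetical.

```python
from typing import Optional
from urllib.parse import quote_plus, urlencode


def build_search_url(template: str, terms: str,
                     params: Optional[dict[str, str]] = None) -> str:
    """Inject search terms and optional parameters into a Search URL template."""
    url = template.replace("%terms%", quote_plus(terms))
    # Assumption: parameters are appended as "&name=value" pairs.
    extra = "&" + urlencode(params) if params else ""
    return url.replace("%param%", extra)


print(build_search_url("http://google.com?q=%terms%", "rule sets"))
print(build_search_url("http://google.com?q=%terms%%param%", "rule sets", {"lang": "en"}))
```

If no parameters are supplied, the %param% placeholder is simply removed from the resulting URL.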

Override Preview CSS

Used to modify the CSS of the gold standard preview document. This is useful when sections of the content cannot be clicked on due to display issues.

Save

After the required configuration changes have been made, select Save. A success message will be displayed.