Harvest via URLs
You can harvest individual web pages by entering their URLs. If required, you can also add a URL using the Sintelix Extension. See Add URL to Store for more information.
To harvest using a URL:
- Select the project into which you want to harvest the documents.
- Navigate to Harvester > URL List.
- Enter a valid URL specific to your search. Note that any invalid URL is ignored.
Add additional URLs on separate lines.
- Select a collection from the Select Collection dropdown.
If you do not have a collection, create one at this stage by clicking the create icon. See Create a Collection for more details.
- Click Harvest.
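
Because invalid URLs are ignored, it can help to check a long list before pasting it into the URL List. The following is a minimal sketch in Python, not part of Sintelix. The function name split_url_list is hypothetical, and the rule it applies (an http or https scheme plus a host name) is an assumption; the Harvester's own validity check may differ.

```python
from urllib.parse import urlparse


def split_url_list(text):
    """Split pasted text (one URL per line) into (valid, invalid) lists.

    Assumption: a URL counts as valid here when it has an http or https
    scheme and a non-empty host. This is not Sintelix's documented rule.
    """
    valid, invalid = [], []
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        try:
            parts = urlparse(candidate)
        except ValueError:
            # urlparse rejects some malformed inputs, such as bad IPv6 hosts
            invalid.append(candidate)
            continue
        if parts.scheme in ("http", "https") and parts.netloc:
            valid.append(candidate)
        else:
            invalid.append(candidate)
    return valid, invalid


if __name__ == "__main__":
    pasted = "https://example.com/a\nnot a url\n\nhttp://example.org"
    ok, rejected = split_url_list(pasted)
    print("\n".join(ok))           # lines that can be pasted into the URL List
    print("Rejected:", rejected)   # lines that would likely be ignored
```

Only the lines in the first list would then be entered in the URL List, one per line.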

Optional Tasks
- Online Persona - If you have created a persona, select it from the Persona dropdown list.
A persona is needed only for harvesting content from sites that require you to log in. See Create a login Persona for more information.
- Rule Set Options - Expand the Rule Set Options to view the default rule sets.
- You can define the Depth of a rule set by entering a number in the field.
- You can clear the Enabled check box next to a rule set to remove it from the default selection.
- You can configure the rule sets if you have the required permissions. See Configuring Harvester rule sets for more information.
- Save - To save this search for future reference, select Save.
To view your saved search, open the project, click Harvester > Saved > Open Query.
- Copy into Batch - To add this search to a batch job, click Copy into Batch, do one of the following, then click Save.
- Select a job from the Existing Batch Job dropdown.
- Type the new batch job name in the Create a new Batch Job text box.
See Batch Harvest for more information.
- Harvester Parameters - To set a parameter, select one or more of the check boxes in the Harvester Parameters section. The parameters do the following:
- Harvest Full Page - Harvests the complete web page, including its content and the boilerplate elements. Each full page is saved as a separate document in the same collection. Select this option if you are harvesting a web page to create a gold standard. A gold standard is a set of model data that you can learn from and test on. For example, in Sintelix, this would be a collection of documents that have been created with specific, preferred properties such as correct document tags and text references. In Sintelix Harvester, this would be a collection of documents harvested from web pages where only the correct elements have been selected (that is, only the content you want).
- Capture Screenshots - Creates screenshots of the websites that are harvested.
- Disable Adblocker - Disables the installed adblock capability.
- Random Wait - To add a delay between page requests to the same domain, select Random Wait. In the Random Wait Time (per Domain) dialog, do the following:
- Enter the domain name in the Domain Group section. Use a separate line for each domain entry.
- Enter the minimum and/or maximum delay in seconds, then click Save.
Adding a delay at this stage overrides the delay settings added by your Admin.
- Set as Default - To make your current selections the default, select Set as Default. This covers the selections made for the search engines and their parameters, the online persona, the rule set options, and the Harvester Parameters. Default settings apply to your current project only.
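
To illustrate what a per-domain random wait means in practice, the sketch below picks a delay for a URL from a table of minimum and maximum values in seconds. It is an illustration only, not the Harvester's implementation. The function name pick_delay, the exact host matching, and the uniform choice between the two limits are all assumptions.

```python
import random
from urllib.parse import urlparse


def pick_delay(url, domain_delays, default=(0.0, 0.0), rng=random):
    """Return a delay in seconds to wait before requesting `url`.

    domain_delays maps a host name to a (minimum, maximum) pair in seconds.
    Assumptions: hosts are matched exactly, and the delay is drawn uniformly
    between the two limits. Hosts without an entry use `default`.
    """
    host = urlparse(url).hostname or ""
    low, high = domain_delays.get(host, default)
    return rng.uniform(low, high)


if __name__ == "__main__":
    # Hypothetical settings: wait 2 to 5 seconds between pages on example.com
    delays = {"example.com": (2, 5)}
    for page in ("https://example.com/a", "https://example.com/b"):
        print(page, "wait", round(pick_delay(page, delays), 2), "seconds")
```

A wider gap between the minimum and maximum gives less regular request timing, while setting both to the same value gives a fixed delay.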