Concept: Harvesting

What is the Harvester?

The Sintelix Harvester can collect content from the web.

The content is added to a collection and processed as a Sintelix document.

Elements of a web page

A web page is constructed from a number of elements.

Different sites may use different design layouts and elements, depending on the purpose of the website.

However, any one website generally follows consistent standards that define how each of its pages is put together.

Harvesting process

When harvesting a page, you may be interested in only some of the content on the page, not all of it.

For example, you may want to include the content of a news article but exclude the navigation menu, advertising, footer and links to other news articles.

What you do want to include is the article title, author, date and body, including any images contained in the body.

Harvesting, then, is the process of selecting the elements to include from a page, ignoring the unwanted elements, and saving the selected elements into a Sintelix document.
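The idea of including some elements and ignoring others can be illustrated with a minimal Python sketch. This is not how Sintelix implements harvesting; it only shows the general technique using the standard library. The tag names in EXCLUDED_TAGS, the choice of the article element as the wanted region, and the sample page are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical exclusions for illustration; these are not Sintelix settings.
EXCLUDED_TAGS = {"nav", "footer", "aside", "script", "style"}


class ArticleExtractor(HTMLParser):
    """Collect text and image URLs inside <article>, skipping excluded elements."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0   # greater than 0 while inside <article>
        self.excluded_depth = 0  # greater than 0 while inside an excluded element
        self.text_parts = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag in EXCLUDED_TAGS:
            self.excluded_depth += 1
        elif tag == "article":
            self.article_depth += 1
        elif tag == "img" and self.article_depth and not self.excluded_depth:
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag in EXCLUDED_TAGS and self.excluded_depth:
            self.excluded_depth -= 1
        elif tag == "article" and self.article_depth:
            self.article_depth -= 1

    def handle_data(self, data):
        # Keep text only when it is inside the article and not inside
        # an excluded element such as a navigation menu or footer.
        if self.article_depth and not self.excluded_depth and data.strip():
            self.text_parts.append(data.strip())


def harvest(html):
    """Return the wanted text fragments and image URLs from an HTML string."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return {"text": parser.text_parts, "images": parser.images}


# Made-up page used only to exercise the sketch.
SAMPLE_PAGE = """
<html><body>
<nav><a href="/">Home</a></nav>
<article>
<h1>Example headline</h1>
<p class="byline">Jane Doe, 1 January 2020</p>
<p>First paragraph.</p>
<img src="photo.jpg">
<aside>Related: another story</aside>
</article>
<footer>Copyright notice</footer>
</body></html>
"""

result = harvest(SAMPLE_PAGE)
print(result["text"])
print(result["images"])
```

Here the headline, byline, paragraph and image inside the article are kept, while the navigation menu, the related-story aside and the footer are dropped.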

You can harvest:

Harvesting Rule Sets

Harvesting Rule Sets capture rules to identify the elements to include and exclude when extracting content from web pages.

Rule Sets may be defined for different websites that use different standards for organising and presenting their content, for example, different news websites.

Sintelix can check the URL of a website to determine the best matching Rule Set to use when harvesting from that website.
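One plausible way to pick a rule set from a URL is sketched below in Python. The source does not describe how Sintelix decides which rule set matches best, so the approach here (match on host name, then prefer the longest matching path prefix) is an assumption, and the RULE_SETS entries, their field names and the best_rule_set function are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical rule sets; the names, hosts and prefixes are made up.
RULE_SETS = [
    {"name": "Example News (general)", "host": "news.example.com", "path_prefix": "/"},
    {"name": "Example News (sport)", "host": "news.example.com", "path_prefix": "/sport/"},
    {"name": "Other Site", "host": "www.example.org", "path_prefix": "/"},
]


def best_rule_set(url, rule_sets=RULE_SETS):
    """Return the name of the most specific matching rule set, or None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    candidates = [
        rs for rs in rule_sets
        if rs["host"] == host and path.startswith(rs["path_prefix"])
    ]
    if not candidates:
        return None
    # Assumed heuristic: the longest path prefix is the most specific match.
    return max(candidates, key=lambda rs: len(rs["path_prefix"]))["name"]


print(best_rule_set("https://news.example.com/sport/match-report"))
print(best_rule_set("https://news.example.com/world/story"))
print(best_rule_set("https://unknown.example.net/"))
```

In this sketch a URL under /sport/ selects the more specific sport rule set, other pages on the same host fall back to the general one, and a URL with no matching host returns None.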

For more information about Rule Sets, see .