Advanced website crawling

Note : You can use the standard configuration on Datafari for crawling a website (pages and files).

If you want to crawl only web pages (without files), you can customize which part of each page is indexed. The configuration described here is intended for that case.

Starting with Datafari 3.2, it is easier to crawl a website into Datafari.

With the normal crawl using the MCF web connector, the raw stream of the page ends up in your Solr document: the content is the same as what you see when you display the HTML source code of the page, which is not very pleasant for users.

To improve this, we added the JSOUP library to the Datafari project. JSOUP is essentially an HTML parser. With a CSS selector, you indicate which part of the content to gather, so you control the parts of the web pages that you want to index. The tags are also removed so that only the content is kept.
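To make the idea concrete, here is a minimal sketch of the same kind of extraction, written with Python's standard html.parser module rather than JSOUP. It is not Datafari's implementation: the class name TitleAndBodyText and the function extract are hypothetical, and it only handles the two default targets (the title tag and the body tag) instead of arbitrary CSS selectors.

```python
from html.parser import HTMLParser


class TitleAndBodyText(HTMLParser):
    """Collects the text found inside <title> and <body>, dropping all tags."""

    def __init__(self):
        super().__init__()
        self._stack = []        # names of the tags that are currently open
        self.title_parts = []
        self.body_parts = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (tolerates unclosed tags like <br>).
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Scripts and stylesheets are not readable content.
        if "script" in self._stack or "style" in self._stack:
            return
        if "title" in self._stack:
            self.title_parts.append(text)
        elif "body" in self._stack:
            self.body_parts.append(text)


def extract(html):
    """Returns a dict with the tag-free 'title' and 'content' of a page."""
    parser = TitleAndBodyText()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(parser.title_parts),
        "content": " ".join(parser.body_parts),
    }
```

For the page `<html><head><title>Datafari</title></head><body><h1>Hello</h1><p>World <b>wide</b></p></body></html>`, extract returns the title "Datafari" and the content "Hello World wide", which is the kind of clean text a user expects in search results instead of raw HTML.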

The following steps show how to use it in Datafari :

1) Go to the MCF admin page

Add an output connector :
Click on List Output Connectors.
Then click on the button Add a new output connector.
The configuration is exactly the same as for the existing standard output connector named DatafariSolr, so you can copy all of its settings except the Paths tab and, of course, the name of the output connector (Solr in this example). In the Paths tab, change the update handler field to /update/website (instead of /update/extract).
The configuration of that handler in Solr is described later in this tutorial.
Add a repository connector
Now that the output connector for the Solr handler is configured, add the Web repository connector and the associated job. Click on List Repository Connectors, then click on Add new connection.
Name : whatever you want, in this example Web.
Connection Type : Web
Authority group : None
For the other values, refer to the MCF documentation. You can keep the default configuration; just add an email address that identifies the crawler to the website that you will crawl.
Add a job
We can now add the job linked to the repository connector and the output connector that we just added.
Let's see the configuration :
Name : choose whatever you want. In this example : DatafariWebSite
Connection : choose the repository connection and the output connection that we just created
Seeds : the website that you want to index. Here we entered http://www.datafari.com
Inclusions : we want to follow every link but only index HTML pages, so we enter ".*" in the text field Include in crawl and ".html$" in the text field Include in index.

For the other tabs, you can keep the default parameters.
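The effect of the two inclusion patterns can be checked with a short sketch. It assumes that each pattern is searched anywhere in the URL (the behavior of re.search in Python), which is an assumption about how the connector applies them. The function names and the URLs other than the seed are made up for illustration.

```python
import re

# Patterns taken from the Inclusions tab of the job.
INCLUDE_IN_CRAWL = [r".*"]
INCLUDE_IN_INDEX = [r".html$"]


def matches_any(patterns, url):
    """True if at least one pattern is found somewhere in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


def decide(url):
    """Returns (crawled, indexed) for a URL under the patterns above."""
    crawled = matches_any(INCLUDE_IN_CRAWL, url)
    indexed = crawled and matches_any(INCLUDE_IN_INDEX, url)
    return crawled, indexed


for url in (
    "http://www.datafari.com/index.html",   # crawled and indexed
    "http://www.datafari.com/guide.pdf",    # crawled, not indexed
    "http://www.datafari.com/",             # crawled, not indexed
):
    print(url, decide(url))
```

Note that the dot in ".html$" is not escaped, so it matches any character before "html". A stricter pattern would be "\.html$"; whether the difference matters depends on the URLs of your site.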

2) Optional : customize the parts of the web page to be gathered : go to the Solr admin page
In the default configuration that we added to Solr, JSOUP extracts the content of the body tag into the field content and the content of the title tag into the field title.
You can change that in solr_home/FileShare/conf/solrconfig.xml, in the /update/website handler.

To do so, edit the lines f.content.selector and f.title.selector. For example, to take the title from h1:first-child and the content from pane-node-body, set f.title.selector to h1:first-child and f.content.selector to pane-node-body.

After that, you have to either restart Datafari or reload the Solr core FileShare :
In the Solr admin page, click on Core Admin and then on the button Reload.
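The reload can also be scripted through Solr's CoreAdmin HTTP API, which is what the Reload button calls. The sketch below is only an illustration: the base URL http://localhost:8983/solr is an assumption (your Datafari installation may use another host, port, or require authentication), and the function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def reload_url(base_url, core):
    """Builds the CoreAdmin RELOAD URL for the given core."""
    query = urlencode({"action": "RELOAD", "core": core})
    return f"{base_url.rstrip('/')}/admin/cores?{query}"


def reload_core(base_url, core, timeout=30):
    """Asks Solr to reload the core and returns the HTTP status code."""
    with urlopen(reload_url(base_url, core), timeout=timeout) as response:
        return response.status


# Example (assumed base URL, adapt it to your installation):
# reload_core("http://localhost:8983/solr", "FileShare")
```

The call is left commented out because it needs a running Solr instance.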

3) Go to MCF and launch the job
Finally, go back to the MCF admin page, click on Status and Job Management, and then click on the Start button next to your new job.
Once the job has run, you can go to the search page and see your new Solr documents.

Be aware that this handler only works for web pages. If you want to index documents such as PDF or Microsoft Office files, you have to add an additional job with another Web repository connector and choose the standard output connector. With this configuration, Tika is used to extract the content of the documents and add the associated metadata. In the configuration of that job, you have to exclude the web pages that are indexed by your "JSOUP job".
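The split between the two jobs can be pictured as follows. This is a rough sketch that assumes the second job excludes from its index the same ".html$" pattern that the "JSOUP job" includes. The function job_for and the example URLs are hypothetical, and the exact exclusion settings depend on your site.

```python
import re

# Pattern included in the index of the "JSOUP job" and, by assumption,
# excluded from the index of the standard (Tika) job.
HTML_PAGE = re.compile(r".html$")


def job_for(url):
    """Names the job that should index the URL, so that no URL is indexed twice."""
    return "jsoup" if HTML_PAGE.search(url) else "tika"


print(job_for("http://www.datafari.com/index.html"))  # jsoup
print(job_for("http://www.datafari.com/guide.pdf"))   # tika
```

Using the same pattern on both sides keeps the two jobs from indexing the same page twice.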