Advanced website crawling
Valid from 6.0
The documentation below is valid from Datafari v6.0.0 onwards.
You have two possibilities to crawl a website: either use the standard way with embedded Tika (or Tika server), or use a specific request handler (based on JSOUP) if you need a specific configuration.
1. Standard way with embedded Tika or Tika server
If you use the standard way with the web connector from MCF, you index all the content contained in the body tag into a Solr document. Note that HTML tags and Javascript code are excluded from the extracted text.
To gather the correct extension of the documents, we recommend changing this parameter in the Datafari update processor configuration located in solrconfig.xml (/opt/datafari/solr/solrcloud/FileShare/conf/solrconfig.xml):
<str name="extension.fromname">false</str>
After that, you need to send the new configuration to Zookeeper. Go to the admin UI, then to Search Engine Configuration → Zookeeper. Click on the Upload button and finally on the Reload button.
You can now launch your MCF job.
2. Use a custom request handler with JSOUP for specific configuration
Let's say that you want to index only a subpart of the webpages of your website. The first indexing method is not flexible enough for that, so we added the JSOUP library to the Datafari project. JSOUP is basically an HTML parser: you indicate with a CSS selector which part of the content should be gathered. Thanks to that, you can control which parts of the webpages you want to index. The tags are also removed to keep only the text content.
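To illustrate the principle, here is a minimal sketch of selector-based extraction using Python's standard library instead of JSOUP (the `SelectorExtractor` class and the sample HTML are invented for this illustration; Datafari itself relies on JSOUP and real CSS selectors):

```python
from html.parser import HTMLParser

# Minimal illustration of the idea behind JSOUP's selector-based
# extraction: keep only the text found inside a chosen element, drop all
# markup. This sketch only mimics matching on a single class name.
class SelectorExtractor(HTMLParser):
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0          # > 0 while inside the matched element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.depth > 0:
            self.depth += 1
        elif self.target_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())

html = ('<html><body><div class="pane-node-body"><p>Indexed text</p></div>'
        '<div class="footer">Ignored</div></body></html>')
parser = SelectorExtractor("pane-node-body")
parser.feed(html)
print(" ".join(parser.chunks))  # prints "Indexed text"
```

The markup outside the selected element (here the footer) never reaches the extracted content, which is exactly what the JSOUP handler gives you for the Solr fields.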
Let's follow this tutorial to use it in Datafari in a few steps:
a) Go to the MCF admin page
Add an output connector:
Click on List Output Connectors.
Then click on the button Add an output connector.
The configuration is exactly the same as for the standard output connector that already exists, named DatafariSolr, so you can copy/paste the whole configuration except for the Paths tab and, obviously, the name of the output connector (in this example: Solr). Change the update handler field to /update/website (instead of /update/extract).
We will see at the end of the tutorial what the configuration of that handler in Solr consists of.
Add a repository connector
Now that we have the configuration for the Solr handler, we have to add the Web repository connector and the associated job. So click on List Repository Connectors, then click on Add new connection.
Name: whatever you want; in this example datafari_e5e9b22cde684f858a4dcdb.
Connection Type: Web
Authority group: None
For the other values, we invite you to read the MCF documentation. You can leave the default configuration; just add an email address to identify the crawler to the websites you will crawl.
Bandwidth tab configuration
Your web crawling settings strongly depend on the web sources you intend to crawl. On the connector side, you can start your tests with the following values, but beware that you may be temporarily banned by the sources if they deem you too aggressive:
Go to the Bandwidth tab of your repository connector:
Max connections: 30
Max KBytes/sec: 1000
Max fetches/min: 240
Add a job
We can now add the job linked to the repository connector and the output connector that we just added.
Let's see the configuration:
Name: choose whatever you want. In this example: DatafariWebSite
Connection: choose the repository and the output connection that we just created
Seeds: the website that you want to index. Here we entered http://www.datafari.com
Inclusions: we only want to parse HTML pages, so we enter ".*" in the Include in Crawl field and ".html$" in the Include in Index field.
The "Include only hosts matching seeds" option allows you to restrict the crawled files to those discovered files that share a host name with at least one of the seeds.
Since Datafari 6.0, the "Force the inclusion of redirection" option allows you to include hosts redirected from the original seeds. You might want to use this option if the site you are crawling is subject to redirections. Note that it is not required if the previous option is not checked. Here are the possible behaviors:
If the admin checks the "Include only hosts matching seeds" option but not the "Force the inclusion of redirection" option, then the redirected files will be filtered out if their new URL does not match a seed.
If the admin checks both options, then when the job finds a URL that is not in the same domain, it is dropped, EXCEPT if the URL originates from a 301 or 302 redirection in the document queue.
If the admin does NOT check the "Include only hosts matching seeds" option but checks the "Force the inclusion of redirection" option, then the job will crawl any URL found, even if it originates from a 301 or 302 redirection.
If the admin checks neither option, then the behavior is the same as in the previous case.
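The four behaviors above can be condensed into a small decision function. This is a hypothetical sketch of the logic, not Datafari's or MCF's actual code:

```python
def url_is_kept(matches_seed_host: bool, via_redirect: bool,
                include_only_hosts: bool, force_redirections: bool) -> bool:
    """Sketch of the inclusion logic described above (hypothetical helper)."""
    if not include_only_hosts:
        # Without host filtering, every discovered URL is crawled,
        # whatever the redirection option says.
        return True
    if matches_seed_host:
        # URL shares a host name with a seed: always kept.
        return True
    # Foreign host: kept only when it comes from a 301/302 redirection
    # and "Force the inclusion of redirection" is checked.
    return via_redirect and force_redirections

# A redirected foreign URL survives only with the force option enabled:
print(url_is_kept(False, True, True, True))   # True
print(url_is_kept(False, True, True, False))  # False
```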
For the other tabs, you can keep the default parameters.
Here are some further examples of regexes to include/exclude pages, in case you need inspiration.
Limiting the URL depth
The following regex allows you to exclude at fetch time all the URLs beyond a certain depth; set it in the Exclusions tab of your job: (/[^\/]+){6}/?.* (replace 6 with the integer of your choice; the higher the number, the deeper the crawl in terms of slashes in the URL).
This regex has to be interpreted with a shift of 2 from the number X declared in (/[^\/]+){X}/?.* . For instance:
If you crawl http://www.datafari.com and use (/[^\/]+){2}/?.* , Datafari will only fetch one page, the seed url www.datafari.com/ (and NOT any page exposed at the root), so a depth of 0.
If you crawl http://www.datafari.com and use (/[^\/]+){3}/?.* , Datafari will be fetching pages that are directly at the root, such as Solutions << Datafari << France Labs , so a depth of 1.
If you crawl http://www.datafari.com and use (/[^\/]+){4}/?.* , Datafari will be fetching pages described above, plus those one slash deeper, such as Big Data << Innovation << France Labs , so a depth of 2.
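You can check this shift yourself. The snippet below approximates the exclusion matching with Python's re.search; the exact matching semantics in MCF may differ slightly, so treat this as an illustration:

```python
import re

# Depth-limiting exclusion regex from the examples above, with X = 3:
# URLs that MATCH are excluded, so the seed (depth 0) and root-level
# pages (depth 1) are kept, while anything deeper is dropped.
pattern = re.compile(r"(/[^\/]+){3}/?.*")

for url in ["http://www.datafari.com/",           # seed, depth 0 -> kept
            "http://www.datafari.com/solutions",  # depth 1 -> kept
            "http://www.datafari.com/a/b"]:       # depth 2 -> excluded
    status = "excluded" if pattern.search(url) else "kept"
    print(url, "->", status)
```

The shift of 2 comes from the fact that the "//host" part of the URL itself consumes one of the (/[^\/]+) groups, and the exclusion only triggers at the next level.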
In the screenshot below, we specify that we do not want URLs with the CSS extension, and we exclude from the index all the URLs which contain in their path the layouts directory, vti bin, or SiteAssets. The difference between Exclude from crawl and Exclude from index is that, for example, if ManifoldCF finds a URL containing the layouts folder, it will still be crawled and MCF will search it for other URLs to fetch, but the URL itself will not be included in the Solr index.
b) Optional: customize the parts of the webpage to be gathered: go to the Solr admin page
In the default configuration that we added to Solr, we tell JSOUP to extract the part inside the body tag into the field content, and the part included in the title tag into the field title.
You can change that in solr_home/FileShare/conf/solrconfig.xml, in the /update/website handler: edit the lines f.content.selector and f.title.selector. Let's say that for the title you want the content of h1:first-child and for the content the content of the pane-node-body element; the configuration will be:
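For instance, the two selector lines could look like this (a hypothetical excerpt; the exact layout of the /update/website handler in your solrconfig.xml may differ):

```xml
<!-- Hypothetical excerpt of the /update/website handler configuration;
     prefix pane-node-body with a dot (.pane-node-body) if it is a CSS
     class rather than a tag name. -->
<str name="f.title.selector">h1:first-child</str>
<str name="f.content.selector">pane-node-body</str>
```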
After that, you have to restart Datafari or reload the Solr core FileShare: click on Core Admin and then on the Reload button.
c) Go to MCF and launch the job
Finally, you can go back to the MCF admin page, click on Status and Job management, and then click the Start button in front of your new job.
You can then go to the search page and see your new Solr documents:
Be aware that this handler only works for webpages. If you want to index documents such as PDF documents, Microsoft Office documents, etc., you have to add an additional job with another Web repository connector and choose the standard output connector. With this configuration, Tika will be used to extract the content of the documents and add the associated metadata. In that job's configuration, you have to exclude the webpages that are already indexed by your "JSOUP job".