...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
Info |
---|
...
...
Info | |
---|---|
title | Valid from 4.0. The documentation below is valid from Datafari v4.0.0 upwards. |
...
Follow this tutorial to use it in Datafari in a few steps :
a) Go to the MCF admin page
Add an output connector :
Click on List Output Connectors.
Then click on the button Add a new output connector.
The configuration is exactly the same as for the existing standard output connector named DatafariSolr, so you can copy/paste all of it except the Paths tab configuration and, obviously, the name of the output connector (Solr in this example). Change the update handler field to /update/website (instead of /update/extract).
The end of the tutorial describes what the configuration of that handler consists of in Solr.
Add a repository connector
Now that we have the configuration for the Solr handler, we have to add the Web repository connector and the associated job. Click on List Repository Connectors, then click on Add new connection.
Name : what you want, in this example Web.
Connection Type : Web
Authority group : None
For the other values, refer to the MCF documentation. You can keep the default configuration; just add an email address that identifies the crawler to the websites you will crawl.
Info | |
---|---|
title | Bandwidth tab configuration. The right crawling settings depend strongly on the web sources you intend to crawl. On the connector side, you can start your tests with the following values, but beware that you may be temporarily banned by the sources if they deem you too aggressive. Go to the Bandwidth tab of your repository connector:
|
Add a job connector
We can now add the job linked to the repository connector and the output connector that we just added.
Let's see the configuration :
Name : choose whatever you want. In this example : DatafariWebSite
Connection : choose the repository connection and the output connection that we just created
Seeds : the website that you want to index. Here we entered http://www.datafari.com
Inclusions : we only want to index HTML pages, so we enter ".*" in the text field Include in crawl and ".html$" in the text field Include in index.
For the other tabs, you can keep the default parameters.
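The effect of the two inclusion fields can be sketched in a few lines of Python. This is only an illustration, not ManifoldCF code: the function names are hypothetical, and it assumes that a pattern applies as soon as it is found anywhere in the URL (search semantics) and that a URL can only be indexed if it was crawled first.

```python
import re

# Patterns taken from the job configuration above.
INCLUDE_IN_CRAWL = [r".*"]
INCLUDE_IN_INDEX = [r".html$"]


def matches_any(patterns, url):
    """Return True if at least one pattern is found somewhere in the URL.

    Assumption: search semantics (the pattern may match any part of the URL).
    """
    return any(re.search(pattern, url) for pattern in patterns)


def classify(url):
    """Hypothetical helper: say whether a URL would be crawled and indexed."""
    crawled = matches_any(INCLUDE_IN_CRAWL, url)
    # A page that is never fetched cannot reach the index.
    indexed = crawled and matches_any(INCLUDE_IN_INDEX, url)
    return {"crawl": crawled, "index": indexed}


if __name__ == "__main__":
    for url in (
        "http://www.datafari.com/index.html",
        "http://www.datafari.com/style.css",
        "http://www.datafari.com/",
    ):
        print(url, classify(url))
```

Under these assumptions, every URL is fetched and parsed for links, but only URLs ending in "html" are sent to the index. Note that the dot in ".html$" is unescaped and therefore matches any character; "\.html$" is the stricter form if you want a literal dot.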
Here are some further examples of regexes to include or exclude pages, in case you need inspiration:
Info | |
---|---|
title | Limits of the URLs depth. The following regex lets you exclude, at fetch time, all URLs beyond a certain depth. Set it in the Exclusions tab of your job: (/[^\/]+){6}/?.* (replace 6 with the integer of your choice; the higher the value, the deeper the crawl goes in terms of slashes in the URL). This regex has to be interpreted with a shift of 2 from the number X declared in (/[^\/]+){X}/?.* , for instance:
|
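To check the depth regex before putting it into a job, you can try it against a few sample URLs. The sketch below is a standalone Python test, not ManifoldCF code: the helper names are hypothetical, and it assumes the exclusion applies as soon as the pattern is found anywhere in the URL.

```python
import re


def depth_exclusion_pattern(x):
    """Build the exclusion regex from the text for a given integer X."""
    return re.compile(r"(/[^\/]+){%d}/?.*" % x)


def is_excluded(url, x=6):
    """Hypothetical helper: True if the URL would be excluded for this X.

    Assumption: the pattern may match anywhere in the URL (search semantics).
    """
    return depth_exclusion_pattern(x).search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://www.datafari.com/a/b/c/d",
        "http://www.datafari.com/a/b/c/d/",
        "http://www.datafari.com/a/b/c/d/e",
    ):
        print(url, "excluded" if is_excluded(url, 6) else "kept")
```

Under that assumption, the host name counts as the first "/segment" (the first slash of "//" cannot start a segment), so with X = 6 a URL is excluded once it has five or more path segments after the host. The deepest URLs that are kept have X - 2 = 4 path segments, which appears consistent with the shift of 2 mentioned above.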
In the screenshot below, we specify that we do not want URLs with the CSS extension, and we exclude from the index all URLs whose path contains the layout directory, vti bin or SiteAssets. The difference between exclude from crawl and exclude from index is the following: if ManifoldCF finds a URL containing the layouts folder, for example, that URL is still crawled and MCF searches it for other URLs to fetch, but the URL itself is not included in the Solr index.
...
b) Optional : customize the parts of the web page to be gathered : go to the Solr admin page
In the default configuration that we add into Solr, we tell JSoup to extract the content of the body tag into the field content and the content of the title tag into the field title.
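To make the default mapping concrete, here is a minimal Python sketch that produces the same two fields from an HTML page. It is an illustration only: it uses Python's standard html.parser instead of JSoup, the class and function names are hypothetical, and it makes no claim about how the /update/website handler is actually implemented.

```python
from html.parser import HTMLParser


class TitleBodyExtractor(HTMLParser):
    """Collect the text found inside <title> and inside <body>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._in_body = False
        self.title_parts = []
        self.content_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._in_body:
            # Simplification: text inside <script> or <style> is kept too.
            self.content_parts.append(data)


def extract_fields(html):
    """Hypothetical helper: map a page to the title and content fields."""
    parser = TitleBodyExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join("".join(parser.title_parts).split()),
        "content": " ".join(" ".join(parser.content_parts).split()),
    }


if __name__ == "__main__":
    page = (
        "<html><head><title>Datafari</title></head>"
        "<body><h1>Hello</h1><p>World</p></body></html>"
    )
    print(extract_fields(page))
```

Changing which tags feed which field in this sketch corresponds to what you would adjust in the handler configuration if you want to gather other parts of the page.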
You can change that in solr_home/FileShare/conf/solrconfig.xml, in the /update/website handler :
...