Info |
---|
Valid from 6.0: The documentation below is valid from Datafari v6.0.0 upwards |
...
In the screenshot below, we specify that we do not want URLs with the CSS extension, and we exclude from the index all URLs whose path contains the layouts directory, vti bin or SiteAssets. The difference between excluding from the crawl and excluding from the index is the following: if ManifoldCF (MCF) finds a URL containing the layouts folder, that URL is still crawled and MCF parses it for other URLs to fetch, but the URL itself is not added to the Solr index.
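To check the effect of such patterns before entering them in the job, a small script can help. The following is only a sketch: the three outcomes mirror the description above, but the patterns themselves (`\.css$`, `/_layouts/`, `/_vti_bin/`, `/SiteAssets/`) and the `example.org` URLs are assumptions for illustration, not the values from the screenshot. MCF evaluates its patterns as Java regular expressions; simple patterns like these behave the same way in Python.

```python
import re

# Hypothetical patterns, chosen for illustration only.
# Replace them with the ones you actually enter in the MCF job.
EXCLUDE_FROM_CRAWL = [r"\.css$"]
EXCLUDE_FROM_INDEX = [r"/_layouts/", r"/_vti_bin/", r"/SiteAssets/"]


def matches_any(url, patterns):
    """Return True if at least one pattern is found anywhere in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


def classify(url):
    """Describe what happens to a URL under the two exclusion lists."""
    if matches_any(url, EXCLUDE_FROM_CRAWL):
        # Never fetched, so never indexed either.
        return "skipped"
    if matches_any(url, EXCLUDE_FROM_INDEX):
        # Fetched and parsed for links, but not sent to the index.
        return "crawled, not indexed"
    return "crawled and indexed"


if __name__ == "__main__":
    samples = [
        "https://example.org/styles/site.css",
        "https://example.org/_layouts/15/start.aspx",
        "https://example.org/news/article.html",
    ]
    for url in samples:
        print(url, "->", classify(url))
```

Keeping the two lists separate in the script makes it easy to see which URLs are still followed for link discovery even though they never reach the index.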
...
b) Optional: customize the parts of the webpage to be gathered: go to the Solr admin page
In the default configuration that we add to Solr, we tell JSOUP to extract the content of the body tag into the content field and the content of the title tag into the title field.
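To make this default mapping concrete, here is a minimal sketch of the same idea. It does not use JSOUP and is not the Datafari handler: it relies on Python's standard `html.parser` module, and the class and function names (`TitleBodyExtractor`, `extract_fields`) are hypothetical. It only illustrates which text ends up in a `title` field and which in a `content` field when the tags are dropped.

```python
from html.parser import HTMLParser


class TitleBodyExtractor(HTMLParser):
    """Collect the text inside <title> and <body>, dropping the tags themselves."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.content_parts = []
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return  # ignore whitespace-only chunks and script/style code
        if self._in_title:
            self.title_parts.append(text)
        elif self._in_body:
            self.content_parts.append(text)


def extract_fields(html):
    """Return a dict with the 'title' and 'content' text of an HTML page."""
    parser = TitleBodyExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(parser.title_parts),
        "content": " ".join(parser.content_parts),
    }


if __name__ == "__main__":
    page = (
        "<html><head><title>Hello</title></head>"
        "<body><h1>Head</h1><p>Some <b>text</b></p></body></html>"
    )
    print(extract_fields(page))
```

In Datafari itself, the extraction is driven by CSS selectors in the handler configuration rather than by code like this.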
You can change these defaults in solr_home/FileShare/conf/solrconfig.xml, in the /update/website handler:
...
c) Go to MCF and launch the job
Finally, go back to the MCF admin page, click on Status and Job Management, and then on the Start button in front of your new job.
You can then go to the search page and see your new Solr documents:
Be aware that this handler only works for webpages. If you want to index documents such as PDF or Microsoft Office files, you have to add an additional job with another Web repository connector and choose the standard output connector. With this configuration, Tika is used to extract the content of the documents and add the associated metadata. In the configuration of that additional job, you have to exclude the webpages that are indexed by your "JSOUP job".
...
Expand | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Starting from Datafari 4, you have two ways to crawl a website: either use the standard way with embedded Tika (or Tika server), or use a specific request handler (based on JSOUP) if you need a specific configuration.
1. Standard way with embedded Tika or Tika server
If you use the standard way with the web connector from MCF, all the content of the body tag is indexed into a Solr document. Note that all HTML tags and JavaScript code are excluded from the extracted text. To get the correct extension of the documents, we recommend changing this parameter in the Datafari update processor configuration located in solrconfig.xml (/opt/datafari/solr/solrcloud/FileShare/conf/solrconfig.xml):
After that, you need to send the new configuration to Zookeeper. Go to the admin UI, then to Search Engine Configuration → Zookeeper. Click on the Upload button and finally on the Reload button. You can now launch your MCF job.
2. Use a custom request handler with JSOUP for a specific configuration
Let's say that you want to index only a subpart of the webpages of your website. The first indexing method is not flexible enough for that, which is why we added the JSOUP library to the Datafari project. JSOUP is basically an HTML parser. You can indicate with a CSS selector which part of the content is to be gathered. Thanks to that, you can control the parts of the webpages that you want to index. The tags are also removed to keep only the content. Let's follow this tutorial to use it in Datafari in a few steps:
a) Go to MCF admin page
Here are some further examples of regular expressions to include or exclude pages, in case you need inspiration:
In the screenshot below, we specify that we do not want URLs with the CSS extension, and we exclude from the index all URLs whose path contains the layouts directory, vti bin or SiteAssets. The difference between excluding from the crawl and excluding from the index is the following: if ManifoldCF (MCF) finds a URL containing the layouts folder, that URL is still crawled and MCF parses it for other URLs to fetch, but the URL itself is not added to the Solr index.
b) Optional: customize the parts of the webpage to be gathered: go to the Solr admin page
Edit the lines f.content.selector and f.title.selector. Let's say that for the title you want the content of the tag h1:first-child and for the content the content of the tag pane-node-body; the configuration will be:
After that, you have to restart Datafari or reload the Solr core FileShare:
c) Go to MCF and launch the job
|