Web Connectors

Valid from Datafari X.X

We rely on the web connector provided by Apache ManifoldCF.

Note: for debugging or network troubleshooting, here is a sample access log entry showing the user agent that ManifoldCF sends when crawling a website:

172.17.0.3 - - [03/Feb/2021:14:21:05 +0000] "GET /en/practice.html HTTP/1.1" 200 16512 "-" "Mozilla/5.0 (ApacheManifoldCFWebCrawler; olivier[AT]francelabs.com)" "-"

This can be useful if you need to ask your web source admin to add a rule for traffic management.

Replace olivier[AT]francelabs.com with the email address that you entered in the web repository connector.
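
As an illustration of how a web source admin might spot crawler traffic, here is a minimal sketch that extracts the contact address from an access log line. The pattern is derived only from the sample log entry above; the function name crawler_contact is hypothetical, and the exact user agent format may differ between ManifoldCF versions.

```python
import re
from typing import Optional

# Pattern based on the sample log entry above (an assumption, not an official format).
# The contact address varies per repository connection, so it is captured rather than hard-coded.
CRAWLER_UA = re.compile(r"\(ApacheManifoldCFWebCrawler; ([^)]+)\)")

def crawler_contact(log_line: str) -> Optional[str]:
    """Return the contact string found in the crawler user agent, or None for other traffic."""
    match = CRAWLER_UA.search(log_line)
    return match.group(1) if match else None

line = ('172.17.0.3 - - [03/Feb/2021:14:21:05 +0000] "GET /en/practice.html HTTP/1.1" '
        '200 16512 "-" "Mozilla/5.0 (ApacheManifoldCFWebCrawler; olivier[AT]francelabs.com)" "-"')
print(crawler_contact(line))  # prints olivier[AT]francelabs.com
```

A traffic management rule can then be keyed on the ApacheManifoldCFWebCrawler token rather than on the source IP address, which may change between crawls.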

Inclusion tab

The "Include only hosts matching seeds" option allows you to restrict the crawl to discovered files that share a host name with at least one of the seeds.
Since Datafari 6.0, the "Force the inclusion of redirection" option allows you to include hosts that the original seeds redirect to. You might want to use this option if the site you are crawling is subject to redirections. Note that it is not needed when the previous option is unchecked. Here are the possible behaviors:

  1. If the admin checks the "Include only hosts" option but not the "Force the inclusion" option, redirected files are filtered out if their new URL does not match a seed.

  2. If the admin checks both the "Include only hosts" option and the "Force the inclusion" option, any URL the job finds that is not in the same domain is dropped, except when that URL results from a 301 or 302 redirection in the document queue.

  3. If the admin does not check the "Include only hosts" option but checks the "Force the inclusion" option, the job crawls any URL it finds, including URLs that result from a 301 or 302 redirection.

  4. If the admin checks neither option, the behavior is the same as in case 3.

When the first option is checked, the job fills a Set of Strings containing the allowed hosts. Any URL found by the job that does not match a host in this Set is then filtered out.

If the admin checks the second option AND the first option is checked, the job checks every host added to the Set. If a host is subject to a redirection, the destination URL is added to the Set.
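
The four behaviors and the Set mechanism described above can be sketched as follows. This is a simplified illustration, not the actual connector code: the function names, the parameter names, and the redirects mapping (which stands in for the 301 or 302 responses the job would observe) are all hypothetical, and the sketch assumes that the host of the destination URL is what ends up in the Set.

```python
from urllib.parse import urlparse

def build_allowed_hosts(seeds, include_only_seed_hosts, force_redirect_inclusion, redirects):
    """Return the set of allowed hosts, or None when no host filtering applies.

    `redirects` is a hypothetical mapping: host -> destination URL of its redirection.
    """
    if not include_only_seed_hosts:
        # Cases 3 and 4: no host filtering, every discovered URL is crawled.
        return None
    allowed = {urlparse(seed).hostname for seed in seeds}
    if force_redirect_inclusion:
        # Case 2: also allow the destination of each redirected seed host.
        # Assumption: the host of the destination URL is stored.
        for host in list(allowed):
            target = redirects.get(host)
            if target:
                allowed.add(urlparse(target).hostname)
    return allowed

def is_crawlable(url, allowed_hosts):
    """A URL passes when filtering is off or when its host is in the allowed set."""
    return allowed_hosts is None or urlparse(url).hostname in allowed_hosts

seeds = ["http://example.com/"]
redirects = {"example.com": "https://www.example.com/"}
page = "https://www.example.com/page.html"

case1 = build_allowed_hosts(seeds, True, False, redirects)
case2 = build_allowed_hosts(seeds, True, True, redirects)
case3 = build_allowed_hosts(seeds, False, True, redirects)

print(is_crawlable(page, case1))  # False: redirected host is filtered out
print(is_crawlable(page, case2))  # True: redirection destination was added to the set
print(is_crawlable(page, case3))  # True: no host filtering at all
```

In this sketch, case 4 (neither option checked) produces the same result as case 3, since the second option is only consulted when the first one is checked.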