Info

Valid from Datafari X.X

We rely on the web connector provided by Apache ManifoldCF.

Info

Note: for debugging or network configuration purposes, here is the user agent that ManifoldCF sends when crawling a website, as it appears in a web server access log:

Code Block
172.17.0.3 - - [03/Feb/2021:14:21:05 +0000] "GET /en/practice.html HTTP/1.1" 200 16512 "-" "Mozilla/5.0 (ApacheManifoldCFWebCrawler; olivier[AT]francelabs.com)" "-"

This can be useful if you need to ask your web source admin to add a rule for traffic management.

Replace olivier[AT]francelabs.com with the email address that you entered in the web repository connector.
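As an illustration of how a web source admin might spot this traffic, here is a minimal Python sketch that checks whether an access log line was produced by the ManifoldCF crawler. It assumes the log follows the layout of the sample line above (request, status, size, referer, then user agent, each quoted where shown). The function name `is_manifoldcf_request` and the regular expression are hypothetical helpers written for this example, not part of Datafari or ManifoldCF.

```python
import re

# Marker taken from the user agent string shown in the sample log line.
CRAWLER_MARKER = "ApacheManifoldCFWebCrawler"

# Assumption: the log line has the same layout as the sample above:
# "request" status size "referer" "user agent"
LOG_PATTERN = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def is_manifoldcf_request(log_line: str) -> bool:
    """Return True if the user agent field contains the ManifoldCF marker."""
    match = LOG_PATTERN.search(log_line)
    return bool(match) and CRAWLER_MARKER in match.group("agent")
```

Matching on the marker rather than on the full user agent string keeps the check valid whatever email address is configured in the connector.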

Inclusion tab

The "Include only hosts matching seeds" option restricts the crawl to discovered files that share a host name with at least one of the seeds.
Since Datafari 6.0, the "Force the inclusion of redirection" option allows you to include hosts that the original seeds redirect to. You might want to use this option if the site you are crawling is subject to redirections. Note that it is not required if the previous option is not checked. Here are the possible behaviors:

  1. If the admin checks the "Include only hosts" option, but not the "Force the inclusion" option, then the redirected files will be filtered out if their new URL doesn't match the seed.

  2. If the admin checks both the "Include only hosts" option and the "Force the inclusion" option, then any URL the job finds that is not in the same domain is dropped, except if the URL originates from a 301 or 302 redirection in the document queue.

  3. If the admin does not check the "Include only hosts" option, but checks the "Force the inclusion" option, then the job will crawl any URL found, even if it originates from a 301 or 302 redirection.

  4. If the admin checks neither option, then the behavior is the same as in the previous case.
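
The four cases above can be summarized as a single decision function. The following Python sketch is only an illustration of the described behavior, not ManifoldCF's actual implementation: the function name, its parameters, and the choice to compare exact host names (rather than domains) are all assumptions made for this example.

```python
from urllib.parse import urlparse


def should_crawl(url, seeds, include_only_seed_hosts,
                 force_redirect_inclusion, reached_by_redirect):
    """Decide whether a discovered URL is kept, following the four cases above.

    All names are hypothetical. `reached_by_redirect` is True when the URL
    was reached through a 301 or 302 redirection.
    """
    if not include_only_seed_hosts:
        # Cases 3 and 4: no host restriction, every discovered URL is crawled.
        return True

    # Assumption: "matching a seed" means sharing the exact host name.
    seed_hosts = {urlparse(seed).hostname for seed in seeds}
    if urlparse(url).hostname in seed_hosts:
        return True

    # Host differs from every seed: dropped in case 1, kept in case 2
    # only when the URL comes from a redirection.
    return force_redirect_inclusion and reached_by_redirect
```

For example, with the seed `https://www.example.com/` and only the first option checked, a redirection to `https://other.example.org/page` is filtered out; checking the second option as well keeps it.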

...