Info |
---|
Valid from Datafari X.X |
We rely on the web connector provided by Apache ManifoldCF.
Info | ||
---|---|---|
Note: for debug or network purposes, here is the user agent that ManifoldCF uses when crawling a website
This can be useful if you need to ask your web source admin to add a rule for traffic management. |
Replace olivier[AT]francelabs.com with the email address that you entered in the web repository connector.
Inclusion tab
The "Include only hosts matching seeds" option allows you to restrict the crawl to discovered files that share a host name with at least one of the seeds.
Since Datafari 6.0, the "Force the inclusion of redirection" option allows you to include hosts redirected from the original seeds. You might want to use this option if the site you are crawling is subject to redirections. Note that it is not required if the previous option is not checked. Here are the possible behaviors:
- If the admin checks "Include only hosts matching seeds" but not "Force the inclusion of redirection", then redirected files are filtered out if their new URL doesn't match the seed.
- If the admin checks both "Include only hosts matching seeds" and "Force the inclusion of redirection", then a URL that is not in the same domain is dropped, except if it originates from a 301 or 302 redirection in the document queue.
- If the admin does not check "Include only hosts matching seeds" but checks "Force the inclusion of redirection", then the job crawls any URL found, even if it originates from a 301 or 302 redirection.
- If the admin checks neither option, then the behavior is the same as in the previous case.
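
The four behaviors above can be summarized as a small decision function. This is only an illustrative sketch of the rules as described on this page, not the actual ManifoldCF or Datafari implementation: the function name `should_crawl` and its parameters are hypothetical, and comparing exact host names (rather than domains) is an assumption.

```python
from urllib.parse import urlparse


def should_crawl(url, seeds, include_only_seed_hosts,
                 force_redirect_inclusion, reached_via_redirect=False):
    """Return True if `url` would be kept under the rules described above.

    Hypothetical helper: all names here are illustrative, and host matching
    by exact host name is an assumption.
    """
    if not include_only_seed_hosts:
        # Cases 3 and 4: no host restriction, so every discovered URL is kept,
        # whether or not "Force the inclusion of redirection" is checked.
        return True

    seed_hosts = {urlparse(seed).hostname for seed in seeds}
    if urlparse(url).hostname in seed_hosts:
        # The URL shares a host name with at least one seed.
        return True

    # The host does not match any seed. It is kept only when the force
    # option is checked and the URL comes from a 301/302 redirection (case 2);
    # otherwise it is filtered out (case 1).
    return force_redirect_inclusion and reached_via_redirect


seeds = ["https://www.example.com/"]
redirected = "https://other.example.org/page"

# Case 1: host restriction only, the redirected URL is filtered out.
print(should_crawl(redirected, seeds, True, False, reached_via_redirect=True))
# Case 2: both options checked, the redirected URL is kept.
print(should_crawl(redirected, seeds, True, True, reached_via_redirect=True))
```

Under these assumptions, the second option only changes the outcome when the first one is checked, which matches the note that it is not required otherwise.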
...