Access credentials
Valid from Datafari 5.0
When you want to crawl a website that implements security, there are two cases: either the pages to crawl are protected by page access credentials (typically through .htaccess files), or they are protected by session-based access credentials.
To find out which type of security is in place, here is a tip: pages protected by page access credentials return a 40x code (for example 401) when you try to access them without being authenticated, and usually show a direct prompt to enter credentials. Pages protected by session-based access credentials usually return a 30x code (for example 301 or 302) and redirect to the login page of the website.
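The tip above can be written down as a small check. This is only a sketch: the function name `classify_protection` and its return values are hypothetical, and treating exactly 401 and 403 as the "40x" case is an assumption, not something the connector does.

```python
def classify_protection(status_code, headers):
    """Guess the protection type from the response to an unauthenticated request.

    `headers` is a dict mapping response header names to values.
    Returns "page", "session" or "none" (hypothetical labels for this sketch).
    """
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Assumption: 401/403 stand for the "40x" case described in the text.
    if status_code in (401, 403):
        return "page"

    # A 30x answer with a Location header is a redirect, typically to a login page.
    if status_code in (301, 302) and "location" in lowered:
        return "session"

    return "none"


print(classify_protection(401, {"WWW-Authenticate": 'Basic realm="protected"'}))
print(classify_protection(302, {"Location": "http://www.test-website.com/login.jsp"}))
```

To feed it real values, the request must be made with a client that does not follow redirects automatically (Python's `http.client` behaves that way), otherwise the 30x answer is hidden behind the login page's own status code.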
Once you have identified the type of security you are dealing with, you need to configure the web repository connector accordingly in the MCF administration UI.
Create or edit a web repository connector and go to the access credentials tab:
You will notice that this tab contains two sections: the Page access credentials section and the Session-based access credentials section. Depending on the type of security used by the website, refer to the corresponding section below.
Page access credentials
This is the easiest section to configure. It consists of a list of URL regular expressions, each associated with its credentials.
For example, let's assume that I want to crawl the website http://www.mypersonalwebsite.com. This website has a section protected by page access credentials, and the URL of this section is http://www.mypersonalwebsite.com/protected. It means that every page under that section can only be accessed using Basic authentication.
Here is the configuration I will set in the connector:
As you can see, the URL regular expression contains neither the full URL of the website nor a .* pattern at its beginning or end. The reason is that, during the crawl, every encountered page whose URL contains a match for this regex is considered a match, and the specified credentials are then used. So, for example, if the crawl encounters the page http://www.mypersonalwebsite.com/protected/personal-data.php, the connector will use the specified credentials to retrieve it, because the URL contains mypersonalwebsite.com/protected.
You need to design your URL regular expression so that it matches as many as possible, if not all, of the pages protected by the same username/password pair, while avoiding matches on unprotected pages. If it is not possible to match all the protected pages with a single URL regular expression, use several rules. The one thing you need to be really careful with is to click on the "+" button after filling in a line and before clicking on the "Save" button of the connector. Otherwise you will lose your rule!
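Before saving a rule, it can help to try the regular expression against a few URLs. The sketch below uses Python's `re.search` to imitate the "contains" matching described above. It assumes the rule is written as `mypersonalwebsite\.com/protected`, and the second URL is a made-up unprotected page. Python's regex dialect is not identical to the one used by the connector, so treat this as an approximation for simple patterns.

```python
import re

# Assumed rule for the example website; the dot is escaped so that it only
# matches a literal "." and not any character.
RULE = re.compile(r"mypersonalwebsite\.com/protected")


def needs_credentials(url):
    # re.search succeeds if the pattern occurs anywhere in the URL, which is
    # why the rule needs no leading or trailing ".*".
    return RULE.search(url) is not None


for url in [
    "http://www.mypersonalwebsite.com/protected/personal-data.php",
    "http://www.mypersonalwebsite.com/public/index.html",  # hypothetical unprotected page
]:
    print(url, needs_credentials(url))
```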
Since your connector may be used to crawl several websites, you will have to create as many rules as you have website sections protected by page access credentials.
Session-based access credentials
This section is more complex to configure than the page access credentials section, but once you know how the connector works, it will not be a problem.
The best way to understand how this section works is through an example. Let's assume I want to crawl the website http://www.test-website.com, which has a section under /users that can only be accessed by an authenticated user. The security is session-based: if I try to access a page under that section without being authenticated, for example http://www.test-website.com/users/info.html, I am redirected to the login form of the website, http://www.test-website.com/login.jsp.
How do we crawl the pages of that user section? It is simple: we need to tell the web connector that every encountered page containing test-website.com/users in its URL needs a session cookie, and we need to tell it where and how to authenticate to obtain that cookie.
Let's start by declaring a URL regular expression that every protected page of the website will match:
Once I click on the "Add" button, a new section should appear:
If it is not the case, it means that the section is hidden and you need to click on "URL regular expression: test-website\.com\/users" to make it appear.
At this point we have declared a "rule" that applies to every crawled page containing "test-website.com/users" in its URL.
In this rule, we first need to tell the connector that, when it is not authenticated, it will be redirected by the website to the login form page, and that this is normal. If the connector is redirected during a crawl and has no rule telling it where it will be redirected, or if the redirect page specified in the rule does not match the actual one, it will consider this a problem and will not crawl the page:
Here is the first "trap": to specify the regular expression of the URL to which the connector will be redirected, the page type must be set to "Redirection to", and the ONLY field that must be used to specify the regular expression is the "Identification regular expression" field. The other fields must stay empty, except the "Override target URL" field if you want to redirect the connector to a page other than the one specified by the website. This case is explained later.
Now click on the "Add" button.
At this point, the rule we have set tells the connector that every page containing "test-website.com/users" in its URL will redirect, if the connector is not authenticated, to a new URL that must contain "test-website.com/login.jsp". If this is not the case during the crawl, the connector will consider the rule wrong and will ignore the rest of it. If it is the case, we now need to enrich the rule to tell the connector how to log in.
Here we specify in the "Login URL regular expression" field the regular expression that the login page URL must contain (logically, it is the same as the one we specified in the "Identification regular expression" field of the previous sub-rule). The page type is "Form name/is/action", as the page contains a form to fill in with a username and a password, and the "Identification regular expression" field must contain the id/name of the form element of the login page. In our case, the form of our login page looks like this:
<form id="formLogin" class="box login" method="POST" action="./login.jsp?lang=fr" accept-charset="utf-8">
  <fieldset class="boxBody">
    <label id="loginLabel">Nom d'utilisateur :</label>
    <input type="text" tabindex="1" name="j_username" required="">
    <label id="passwordLabel">Mot de passe :</label>
    <input type="password" tabindex="2" required="" name="j_password">
  </fieldset>
  <input type="submit" id="loginBtn" class="btn btn-primary col-sm-3" value="Login" tabindex="4">
</form>
This is why we put "formLogin" in the "Identification regular expression" field.
Click on the "Add" button. A new sub-section, "Override form parameters", should appear below it:
Here you need to specify the NAME of the username and password fields.
Another trap here is that you MUST specify the name attribute of the username and password parameters and NOT the id, because the web connector searches for the name attribute in the form, not the id.
For example, if your form looks like this:
<form id="formLogin" class="box login" method="POST" action="./login.jsp?lang=fr" accept-charset="utf-8">
  <fieldset class="boxBody">
    <label id="loginLabel">Nom d'utilisateur :</label>
    <input type="text" tabindex="1" name="j_username" id="id_username" required="">
    <label id="passwordLabel">Mot de passe :</label>
    <input type="password" tabindex="2" required="" name="j_password" id="id_password">
  </fieldset>
  <input type="submit" id="loginBtn" class="btn btn-primary col-sm-3" value="Login" tabindex="4">
</form>
You MUST enter ‘j_username’ for the username parameter regular expression and ‘j_password’ for the password parameter regular expression, NOT ‘id_username’ and ‘id_password’. Otherwise it will not work!
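If you are unsure which values to enter, you can list the name attributes of the form fields from the page source. The following is a minimal sketch using Python's standard `html.parser`; the class `FormFieldCollector` is a hypothetical helper written for this page and is not part of the connector. It looks up the form by its id or name and ignores the id attributes of the inputs.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the name attributes of <input> elements inside one form."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.inside_form = False
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form" and self.form_id in (attributes.get("id"), attributes.get("name")):
            # Found the form identified by its id or name.
            self.inside_form = True
        elif tag == "input" and self.inside_form and "name" in attributes:
            # Only the name attribute matters; the id attribute is ignored.
            self.field_names.append(attributes["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self.inside_form = False


# Shortened copy of the login form shown above.
LOGIN_PAGE = """
<form id="formLogin" method="POST" action="./login.jsp?lang=fr">
  <input type="text" name="j_username" id="id_username" required="">
  <input type="password" name="j_password" id="id_password" required="">
  <input type="submit" id="loginBtn" value="Login">
</form>
"""

collector = FormFieldCollector("formLogin")
collector.feed(LOGIN_PAGE)
print(collector.field_names)
```

The submit button has no name attribute, so it is not listed; only the two fields that need a value in the rule remain.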
Now, for both parameters, you have to indicate the value to enter: the username for the username field and the password for the password field.
Finally, DO NOT FORGET to click on the "Add" button after specifying each parameter:
Then click on the "Save" button and you are done! You can check your configuration in the repository view:
So what happens during the crawl phase?
If the connector encounters a page URL matching "test-website.com/users", it loads the rule.
The rule is of type session-based access credentials and therefore works with cookies, so the connector checks whether it already has a cookie for this rule and whether that cookie is still valid and not expired (the connector maintains a list of cookies for session-based access credentials rules).
If a cookie is found, it is used. Otherwise, if there is no available cookie, or if the available cookie does not work or has expired, the connector tries to get the page.
When trying to get the page, the connector is redirected by the website to the http://www.test-website.com/login.jsp page. This page URL matches the one indicated by the rule, so the connector continues to the login page.
On the login page, the connector follows the rule and searches for a form element having "formLogin" as id/name. It finds it, then searches for two parameters named j_username and j_password. It finds them, fills them in with the values indicated in the rule, and submits the form.
The form authentication succeeds, and the connector retrieves a session cookie that is stored in its cookie list, associated with the rule "test-website.com/users".
With the newly acquired cookie, the connector retries the original page (the one that redirected it to the login page) and now succeeds in getting it, as it is authenticated with a valid session cookie.
From now on, for every subsequent page URL matching "test-website.com/users", the newly acquired session cookie is used until it expires. When that happens, the connector is redirected again and has to authenticate to "renew" the session cookie.
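The sequence above can be summarized with a small simulation. It is only an illustration of the described flow, not the connector's actual code: `FakeSite`, `SessionRule`, `fetch`, the cookie value and the credentials `crawler`/`secret` are all hypothetical, and the form lookup step is reduced to submitting the two named parameters.

```python
import re


class FakeSite:
    """Stand-in for the website: pages under /users require a session cookie."""

    LOGIN_URL = "http://www.test-website.com/login.jsp"

    def get(self, url, cookie=None):
        if "/users" in url and cookie != "session-1":
            # Not authenticated: redirect to the login page.
            return 302, self.LOGIN_URL
        return 200, "content of " + url

    def submit_login(self, fields):
        # Issue a session cookie only for the expected form parameters.
        if fields == {"j_username": "crawler", "j_password": "secret"}:
            return "session-1"
        return None


class SessionRule:
    def __init__(self, url_regex, redirect_regex, credentials):
        self.url_regex = re.compile(url_regex)
        self.redirect_regex = re.compile(redirect_regex)
        self.credentials = credentials
        self.cookie = None  # cached session cookie for this rule


def fetch(site, rule, url, log):
    if not rule.url_regex.search(url):
        return site.get(url)  # the rule does not apply to this URL
    status, body = site.get(url, rule.cookie)
    if status == 302:
        if not rule.redirect_regex.search(body):
            log.append("unexpected redirect")  # rule does not match: give up
            return status, body
        log.append("login")
        rule.cookie = site.submit_login(rule.credentials)
        status, body = site.get(url, rule.cookie)  # retry the original page
    return status, body


site = FakeSite()
rule = SessionRule(
    r"test-website\.com/users",
    r"test-website\.com/login\.jsp",
    {"j_username": "crawler", "j_password": "secret"},
)
log = []
first = fetch(site, rule, "http://www.test-website.com/users/info.html", log)
second = fetch(site, rule, "http://www.test-website.com/users/other.html", log)
print(first, second, log)
```

In this run the login happens once, for the first page; the second page reuses the cached cookie.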
Specific case: Override target URL
As mentioned when configuring the redirection to the login page, you may want or need to override the redirection URL specified by the website. For example, if the website does not redirect unauthenticated users to its login page but to another page instead, it is useful to be able to replace the original redirection URL with the URL of the login page.
Let's assume that the http://www.test-website.com/users/info.html page used in the configuration example redirects to http://www.test-website.com/access-rules.html instead of http://www.test-website.com/login.jsp. The configuration of the redirection sub-rule will then look like this:
This time we indicate the original redirection page in the "Identification regular expression" field, but we also tell the system that this redirection must be replaced by the one indicated in the "Override target URL" field. Note that the override target URL is not a regular expression and must be a complete URL!
The rest of the rule stays unchanged as the login page and form are still the same.
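The override logic described in this section can be sketched as follows. The function `resolve_redirect` is a hypothetical illustration of the behaviour, not the connector's API: the identification regular expression is searched for in the redirect location, while the override, when present, is used verbatim as a complete URL.

```python
import re


def resolve_redirect(location, identification_regex, override_target_url=None):
    """Return the URL to follow, or None if the redirect is not the expected one."""
    if re.search(identification_regex, location) is None:
        # The redirect does not match the rule: treated as a problem.
        return None
    # The override is a complete URL, not a pattern, so it is returned as-is.
    return override_target_url or location


print(resolve_redirect(
    "http://www.test-website.com/access-rules.html",
    r"test-website\.com/access-rules\.html",
    "http://www.test-website.com/login.jsp",
))
```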