The date metadata of LibreOffice documents is not extracted correctly by the local file system connector.
This page details how to get started quickly in Datafari by crawling a local file folder. However, for security reasons, it is NOT recommended to use this connector in a production environment.
Local and shared files crawling
The local file crawler is a DEMO connector that shouldn't be used for production environments.
With the local file crawler, Date metadata is not extracted from plain text documents.
A recommended alternative to the local file crawler is the "Windows share" connector (a.k.a. the JCIFS connector), which can be used to crawl files in a shared directory (it works under Linux as well). You can check its official documentation on the Apache ManifoldCF version 2.5 website.
Documentation on how to set up a Samba share on Linux is available here.
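For reference, a Samba share definition is just a short block in /etc/samba/smb.conf. The sketch below is a minimal example only: the share name "enron" and the path are placeholders to adapt to your own setup, and the smbd service must be restarted afterwards.

    [enron]
        path = /var/enron    ; folder to expose on the network
        browseable = yes     ; show the share when listing the server
        read only = yes      ; crawling only needs read access
        guest ok = yes       ; demo setting; use real authentication in production

After editing the file, restart Samba, for instance with "sudo systemctl restart smbd".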
From the admin menu, click on Connectors and then on "Administration of MCF".
ManifoldCF then asks you for your credentials, which you set up either in the configuration XML file, or during the installation phase if you used the installer.
You are now on the MCF admin page. Since this is the MCF of Datafari, an output connection is already configured to point to the Solr of Datafari. You can check that by clicking on "List Output Connections" in the left menu. You will then see, on the right, the DatafariSolr output connection.
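If you prefer scripting over the UI, ManifoldCF also exposes a JSON API through its mcf-api-service web application, and the same check can be done programmatically. The sketch below rests on assumptions you must adapt to your installation: the host, the port (8345 is the ManifoldCF standalone example default, not necessarily Datafari's), and the absence of API authentication.

    import requests  # third-party HTTP client (pip install requests)

    # Assumed base URL of the ManifoldCF JSON API; adapt host and port.
    MCF_API = "http://localhost:8345/mcf-api-service/json"

    # List the configured output connections; DatafariSolr should appear.
    response = requests.get(MCF_API + "/outputconnections")
    response.raise_for_status()
    connections = response.json().get("outputconnection", [])
    if isinstance(connections, dict):
        # The MCF JSON API returns a bare object when there is a single entry.
        connections = [connections]
    for connection in connections:
        print(connection["name"])

If API authentication is enabled on your instance, you will need to authenticate first; see the "Programmatic Operation" chapter of the ManifoldCF documentation.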
Since we want to crawl local files, we first need to declare a new repository connection (to define the type of connection), and then we will instantiate a job that uses this repository connection, pointing at the desired local folder. To declare a new repository connection, click on List Repository Connections in the left menu, then click the Add new connection button on the right panel.
Type in the name and the description, then click on the Type tab.
In the Type tab, pick File System in the dropdown list (note that this is a DEMO connector with known bugs; do not use it in production environments), and leave the Authority group set to None, since this is a quick start example.
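For the record, the same declaration can be scripted through the JSON API introduced above. This is a sketch under the same assumptions (base URL, no API authentication); the connection name "LocalFiles" anticipates the name used later in this guide, and the class name is the stock ManifoldCF file system connector (verify it against your installation's connectors.xml).

    import requests

    MCF_API = "http://localhost:8345/mcf-api-service/json"  # assumed base URL

    # Repository connection definition; "LocalFiles" matches the name used
    # later in this guide.
    connection = {
        "repositoryconnection": {
            "name": "LocalFiles",
            "class_name": "org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector",
            "description": "Quick start local file crawl",
            "max_connections": 10,
        }
    }

    # PUT to repositoryconnections/<name> creates or updates the connection.
    response = requests.put(MCF_API + "/repositoryconnections/LocalFiles",
                            json=connection)
    response.raise_for_status()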
Again, refer to the full documentation of Apache ManifoldCF to dig deeper into these options.
Click on Continue; a Throttling tab appears, as well as Save and Cancel buttons. Click on Save.
Clicking on Save displays the View Repository Connection Status page. You should see something that looks like the illustration below.
In the menu, click again on List Repository Connections to check that your new connection is now listed. You should have something like this:
Now that we have declared a repository connection of type local file system, we can "instantiate" it through a job. To do so, click in the menu on "List all Jobs". The right panel will display an empty job list. Click on the button "Add a new job".
You then need to declare a name. Do so, then click on the Connection tab.
In this tab, click on the Output connection dropdown and select the Datafari-specific output connection "DatafariSolr". On the right, click on the Repository connection dropdown and pick the one we just created, in our case "LocalFiles". Click on Add Output, and then click on Continue.
A list of tabs then appears. Check the ManifoldCF official documentation to get an understanding of all of them.
This is where you define the actual location in your local file system that you want the job to crawl and send to the Datafari Solr for indexing. For our example, click on the Repository Paths tab. In the illustration below, we enter the local path "/var/enron" and click on Add (not Save yet!).
A list of potential regex rules appears on the right. Here is some information about these rules, taken from the ManifoldCF 2.5 documentation:
Using the regex option
For each included path, a list of rules is displayed which determines what folders and documents get included with the job. These rules will be evaluated from top to bottom, in order. Whichever rule first matches a given path is the one that will be used for that path.
Each rule describes the path matching criteria. This consists of the file specification (e.g. "*.txt"), whether the path is a file or folder name, and whether a file is considered indexable or not by the output connection. The rule also describes the action to take should the rule be matched: include or exclude. The file specification character "*" is a wildcard which matches zero or more characters, while the character "?" matches exactly one character. All other characters must match exactly.
Remember that your specification must match all characters included in the file's path. That includes all path separator characters ("/"). The path you must match always begins with an initial path separator. Thus, if you want to exclude the file "foo.txt" at the root level, your exclude rule must match "/foo.txt".
To add a rule for a starting path, select the desired values of all the pulldowns, type in the desired file criteria, and click the "Add" button. You may also insert a new rule above any existing rule, by using one of the "Insert" buttons.
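To make the first-match evaluation concrete, here is a small Python sketch (an illustration of the semantics described above, not actual ManifoldCF code). It uses fnmatch-style matching, where "*" matches zero or more characters, "?" matches exactly one, and "/" is not treated specially, which mirrors the behavior described above.

    from fnmatch import fnmatchcase

    # Rules are (pattern, action) pairs, evaluated top to bottom;
    # the first pattern matching the full path decides the outcome.
    rules = [
        ("/foo.txt", "exclude"),   # exclude foo.txt at the root level
        ("*.txt", "include"),      # include any other .txt file
    ]

    def decide(path: str) -> str:
        for pattern, action in rules:
            if fnmatchcase(path, pattern):
                return action
        return "exclude"  # assumed default when no rule matches

    print(decide("/foo.txt"))      # -> exclude
    print(decide("/mail/a.txt"))   # -> include (matches "*.txt")
    print(decide("/mail/a.pdf"))   # -> exclude (no rule matches)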
Click now on the Save button.
You get the confirmation that the job was successfully created when you see the summary of the created job in the "View a Job" window. This should look like the following illustration:
To ensure things really went OK, go to the left menu and select "List all Jobs". You should see on the right panel a job list with only one entry, the job you just created, as in the illustration below.
It is now time to start the actual crawling of the folder. For that, click in the left menu on "Status and Job Management". You should get a Jobs status list on the right pane, with only one entry, which is the job you recently created. Click on Start to trigger the crawling, and wait for the magic to happen.
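Starting the job can also be done through the JSON API. The sketch below keeps the same assumptions as before, and the job identifier is a placeholder: use the id of your own job (it is returned when the job is created, and listed by a GET on the jobs resource).

    import requests

    MCF_API = "http://localhost:8345/mcf-api-service/json"  # assumed base URL
    JOB_ID = "1234567890123456789"  # placeholder: replace with your job's id

    # PUT on start/<job id> queues the job for immediate execution.
    response = requests.put(MCF_API + "/start/" + JOB_ID)
    response.raise_for_status()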
If you want, you can click several times on the Refresh button to get a feeling that things are being done. This can be observed by looking at the last 3 fields of the list: "Documents", "Active" and "Processed", whose numbers should evolve as you click on Refresh. Wait for a few documents to be counted in the "Processed" field; you can then start searching for them in the Search view, which is accessible at any time through the Search link at the upper right of the Datafari header.
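If you prefer polling to clicking Refresh, the same counters are available from the jobstatuses resource of the JSON API. The sketch below keeps the earlier assumptions, and the field names follow the ManifoldCF 2.x API documentation, so double-check them against your version.

    import time

    import requests

    MCF_API = "http://localhost:8345/mcf-api-service/json"  # assumed base URL

    # Poll the job statuses; stop the loop manually once documents
    # start being processed.
    while True:
        response = requests.get(MCF_API + "/jobstatuses")
        response.raise_for_status()
        statuses = response.json().get("jobstatus", [])
        if isinstance(statuses, dict):  # single job: bare object, not a list
            statuses = [statuses]
        for status in statuses:
            print(status.get("status"),
                  "processed:", status.get("documents_processed"))
        time.sleep(10)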