Detect duplicates configuration

Valid from version 5.4

This documentation is valid from Datafari 5.4 upwards

You can configure Datafari to detect duplicate files and display them in a dedicated section of the admin UI. The default algorithm used to determine duplicates is the one provided by the Solr update processor described in TextProfileSignature - Solr - Apache Software Foundation. It performs near-duplicate detection and works best on longer text. We explain further down in this documentation how you can change the algorithm. In Datafari, it operates by default on the "content" field of documents.

Option 1: Using the simplified job creation to enable duplicate detection

This is the most straightforward way of enabling duplicate detection if you intend to use the off-the-shelf configuration, including the default algorithm mentioned above. As explained in the MCF Simplified UI configuration, connect as a Datafari admin, go to the admin UI and, in the left menu, click on Connectors => Data Crawlers Simplified Mode. Select the connector type of your choice. Once in the configuration interface, you will see a checkbox corresponding to the duplicate detection functionality. Simply check the box and you are done: the functionality will be activated once you start the job.

Before using that feature, we strongly suggest that you take a look at the “2.1 Manual configuration/modification of a job” section to better understand how the simplified job creation configures the job, so that you will be able to fine-tune it if needed.

Starting from Datafari 6.0, each simplified job now has a selectable option named “Duplicates detection”:

When enabling this option, the created job will automatically be configured as described in the “2.1 Manual configuration/modification of a job” section in order to fill the “Duplicates” Solr index. The configuration is done within a single job (you will see in 2.1 that you can also create a separate job dedicated to duplicates). So it is highly recommended to:

  • Read the next documentation section “2.1 Manual configuration/modification of a job”

  • Use this automatic feature whenever possible to avoid mistakes. You only need to manually create jobs when no simplified job creation fits your need (e.g. the repository connector you want to use is not listed)

Option 2: Manual configuration of the duplicate detector

This option gives you much more flexibility for the configuration of the detector, but requires more attention.

2.1 Manual configuration/modification of a job

For each crawling job you have, you will need to add the output connector “Duplicates”; alternatively, you can create an equivalent job on an https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/2181922821 using the output connector “Duplicates” instead of the “DatafariSolrNoTika” one. The advantage of creating separate jobs on an annotator MCF node is that the standard jobs will not be slowed down by the indexation into the “Duplicates” index; the disadvantage is that you have to maintain the separate jobs, which can be quite inconvenient depending on the number of jobs you have. Also, the gain in processing time strongly depends on the type of jobs, the environment, etc., and it may not be worth it. Our recommendation is to consider separate jobs only if you are not happy with the job processing time when using the output connector “Duplicates”!

To add the “Duplicates” output connector to a job, edit the concerned job in the MCF admin interface and go to the “Connection” tab. Follow these steps:

  • Select the output “Duplicates” and, in the dropdown of its “Precedent” column, pick the number that corresponds to the “Stage” number of the last transformation connector before the “DatafariSolrNoTika” output

    In the above screen you can see that the transformation connector before the “DatafariSolrNoTika” connector is the “MetadataCleaner”, which corresponds to stage “5”, so we selected stage “5” as precedent for the “Duplicates” output

  • Click on the “Add output” button

  • Select the transformation connector “DocFilter” and insert it before the output “Duplicates”

  • Go to the “Doc Filter” tab, empty the include and exclude filters and set the “Minimum document size” to 1000. This step is very important because the hash algorithms of Solr do not work well with documents containing very few characters. If you decide to use another algorithm (for instance a simple file name comparison), you may not need this Doc Filter.

  • Save the job

If you decide to create a separate job on an Annotator node, the configuration is the same except that you only need the “Duplicates” output and not the “DatafariSolrNoTika” one!

Now you can run your jobs and the “Duplicates” index will be filled automatically. The section below is optional, but if you want to apply it, you must run the jobs AFTER doing that configuration.
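
Once a job has completed, a quick way to check that the “Duplicates” index is actually being filled is to count its documents with a simple Solr query. This is only a minimal sketch: it assumes Solr is reachable locally on its default 8983 port and that the index keeps its default “Duplicates” name, so adapt the host, port and any authentication to your installation:

curl "http://localhost:8983/solr/Duplicates/select?q=*:*&rows=0"

The numFound value in the response gives the number of documents for which a signature has been stored.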

2.2 OPTIONAL: Configure Solr

This “Configure Solr” / “Duplicates Configuration” section serves to activate an UpdateProcessor that synchronizes the “FileShare” collection with the “Duplicates” collection during document deletion. This feature is there as an “insurance”, but it is not strictly necessary as the “Duplicates” index will always be kept up to date by the jobs.

To detect near duplicates, Solr needs to calculate a hash of each document's content. This process is time consuming and the required time per document is proportional to the document size. So, in order to avoid increasing the indexation time when this feature is enabled, the hashes are calculated and stored in their own index, named "Duplicates" by default.

The "Duplicates" index will be filled either by a specific job that you can configure on an Annotator Node, or by adding the output connector “Duplicates” to your jobs.

Before configuring and running the jobs, you may want to configure the hash algorithm used by Solr. You can do this through the admin UI of Datafari, Search Engine Administration → Duplicates Configuration:

In the "Algorithm configuration" section, you can also OPTIONALLY change the parameters of the algorithm used for the hashes generation. By default Datafari uses the TextProfileSignature algorithm and allow the graphical modification of two parameters:

  • The calculation fields: this is the list of fields on which the algorithm bases its hash calculation. By default we only use the "content" field, as duplicates of a document may have different titles but the same content, but it is up to you if you want to change or add more fields to the hash calculation. The fields must be comma separated with no spaces (for example content,title, assuming a title field exists in your schema)!

  • The quant rate: this is a percentage that represents a variation factor. To better understand this parameter, refer to the official Solr documentation

If the TextProfileSignature algorithm does not fit your needs, you can replace it with another one provided by Solr (for instance Solr's Lookup3Signature or MD5Signature, which compute exact hashes rather than fuzzy ones). However, Datafari does not provide an admin UI for that and you will need to modify the solrconfig.xml file of the "Duplicates" index. It is located in DATAFARI_HOME/solr/solrcloud/Duplicates/conf/solrconfig.xml:

<updateRequestProcessorChain name="signature">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.update.processor.TextProfileSignature</str>
    <str name="quantRate">.1</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Above, you can see the default configuration

After having modified the solrconfig.xml file, you will need to push it to Zookeeper by running the following command on the main Solr node (replace the ${DATAFARI_HOME} parameter by the path of your installation):

"${DATAFARI_HOME}/solr/server/scripts/cloud-scripts/zkcli.sh" -cmd upconfig -zkhost localhost:2181 -confdir "${DATAFARI_HOME}/solr/solrcloud/Duplicates/conf" -confname Duplicates

Valid from version 5.1 up to 5.3 - Enterprise Edition Only

This documentation is valid from Datafari 5.1 up to Datafari 5.3

You can configure Datafari to detect duplicate files and display them in a specific admin UI. The default algorithm used to determine duplicates is the one provided by the Solr update processor described in TextProfileSignature - Solr - Apache Software Foundation. It performs near-duplicate detection and works best on longer text. We will explain later in this documentation how you can change the algorithm. In Datafari, it operates by default on the "content" field of documents.

1. Configure Solr

To detect near duplicates, Solr needs to calculate a hash of each document's content. This process is time consuming and the required time per document is proportional to the document size. So, in order to avoid increasing the indexation time when this feature is enabled, the hashes are calculated and stored in their own index, named "Duplicates" by default.

The "Duplicates" index will be filled by a specific job that you can either configure on an Annotator Node, or by adding the output connector “Duplicates” to your jobs.

Before configuring and running the jobs, you will have to configure the hash algorithm used by Solr. You can do this through the admin UI of Datafari, Search Engine Administration → Duplicates Configuration:

The “Duplicates Configuration” section serves to activate an UpdateProcessor that synchronizes the “FileShare” collection with the “Duplicates” collection during document deletion. This feature is there as an “insurance”, but it is not necessary in Datafari 5.1 upwards as the “Duplicates” index will always be kept up to date by the jobs.

In the "Algorithm configuration" section, you can also OPTIONALLY change the parameters of the algorithm used for the hashes generation. By default Datafari uses the TextProfileSignature algorithm and allow the graphical modification of two parameters:

  • The calculation fields: this is the list of fields on which the algorithm bases its hash calculation. By default we only use the "content" field, as duplicates of a document may have different titles but the same content, but it is up to you if you want to change or add more fields to the hash calculation. The fields must be comma separated with no spaces!

  • The quant rate: this is a percentage that represents a variation factor. To better understand this parameter, refer to the official Solr documentation

If the TextProfileSignature algorithm does not fit your needs, you can replace it with another one provided by Solr. However, Datafari does not provide an admin UI for that and you will need to modify the solrconfig.xml file of the "Duplicates" index. It is located in DATAFARI_HOME/solr/solrcloud/Duplicates/conf/solrconfig.xml:

<updateRequestProcessorChain name="signature">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.update.processor.TextProfileSignature</str>
    <str name="quantRate">.1</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Above, you can see the default configuration

After having modified the solrconfig.xml file, you will need to push it to Zookeeper by running the following command on the main Solr node (replace the ${DATAFARI_HOME} parameter by the path of your installation):
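
"${DATAFARI_HOME}/solr/server/scripts/cloud-scripts/zkcli.sh" -cmd upconfig -zkhost localhost:2181 -confdir "${DATAFARI_HOME}/solr/solrcloud/Duplicates/conf" -confname Duplicates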

2. Configure the jobs

As said in the introduction, for each crawling job you have, you will need to add the output connector “Duplicates”; alternatively, you can create an equivalent job on an https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/2181922821 using the output connector “Duplicates” instead of the “DatafariSolrNoTika” one. The advantage of creating separate jobs on an annotator node is that the standard jobs will not be slowed down by the indexation into the “Duplicates” index; the disadvantage is that you have to maintain the separate jobs, which can be quite inconvenient depending on the number of jobs you have. Also, the gain in processing time strongly depends on the type of jobs, the environment, etc., and it may not be worth it. Our recommendation is to consider separate jobs only if you are not happy with the job processing time when using the output connector “Duplicates”!

To add the “Duplicates” output connector to a job, edit the concerned job in the MCF admin interface and go to the “Connection” tab. Follow these steps:

  • Select the output “Duplicates” and, in the dropdown of its “Precedent” column, pick the number that corresponds to the “Stage” number of the last transformation connector before the “DatafariSolrNoTika” output

    In the above screen you can see that the transformation connector before the “DatafariSolrNoTika” connector is the “MetadataCleaner”, which corresponds to stage “5”, so we selected stage “5” as precedent for the “Duplicates” output

  • Click on the “Add output” button

  • Select the transformation connector “DocFilter” and insert it before the output “Duplicates”

  • Go to the “Doc Filter” tab, empty the include and exclude filters and set the “Minimum document size” to 1000. This step is very important because the hash algorithms of Solr do not work well with documents containing very few characters. If you decide to use another algorithm (for instance a simple file name comparison), you may not need this Doc Filter.

  • Save the job

If you decide to create a separate job on an Annotator node, the configuration is the same except that you only need the “Duplicates” output and not the “DatafariSolrNoTika” one!

Now you can run your jobs and the “Duplicates” index will be filled automatically.

3. Use admin UI to browse through duplicates

Once your jobs are done, you can consult the list of detected duplicates in the admin UI of Datafari, Extra Functionalities → Duplicate files

As mentioned in the UI, by clicking on a file/document name in the first table, you will display a second table containing the full list of similar documents corresponding to the clicked one:

In this second table, you will be able to directly open documents by clicking on them, assuming you have the rights to do so and that your browser is correctly configured.

The list of detected duplicates will evolve with your Datafari index, but be aware that it will not be instantly up to date when a job is done. The hash calculation time per document depends on the document size: it can vary from a few milliseconds for documents whose content is smaller than 1 KB to almost a minute for content reaching the 1 MB limit of Datafari!


You can configure Datafari to detect duplicate files and display them in a specific admin UI. The default algorithm used to determine duplicates is the one provided by the Solr update processor described here: https://cwiki.apache.org/confluence/display/solr/TextProfileSignature. It performs near-duplicate detection and works best on longer text. We will explain later in this documentation how you can change the algorithm. In Datafari, it operates by default on the "content" field of documents.

1. Configure Solr

To detect near duplicates, Solr needs to calculate a hash of each document's content. This process is time consuming and the required time per document is proportional to the document size. So, in order to avoid increasing the indexation time when this feature is enabled, the hashes are calculated and stored in their own index, named "Duplicates" by default.
The "Duplicates" index will be filled by the Annotator, and the deletes will be managed by an update processor named "DuplicatesDeleteUpdateProcessor" that is configured by default in the update processor chain of the main "FileShare" index. This way, the "Duplicates" index is perfectly synchronized with the main "FileShare" index.
Note that the "DuplicatesDeleteUpdateProcessor" does nothing by default because it relies on a parameter named "duplicates.enabled" to do something or not, and this parameter is set to false on a vanilla installation. You will need to set it to true through the admin UI of Datafari, Search Engine Administration → Duplicates Configuration:


Toggle the "Activate deletion synchronization with Datafari main collection" switch button to the "On" position then click on "Save"

This is equivalent to executing the following curl command on the Solr main node :
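
As an indication, assuming the parameter is exposed as a Solr user property of the "FileShare" collection, the command would look like the following sketch (the exact endpoint and property mechanism may differ in your installation; adapt host, port and collection name accordingly):

curl -X POST "http://localhost:8983/solr/FileShare/config" -H "Content-Type: application/json" -d '{"set-user-property": {"duplicates.enabled": "true"}}'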

The other parameters of the "Duplicates Configuration" section should not be changed; they are only there in case you want to isolate the 'Duplicates' index on its own Solr cluster or (highly discouraged) change the default index name.

In the "Algorithm configuration" section, you can also OPTIONALLY change the parameters of the algorithm used for the hashes generation. By default Datafari uses the TextProfileSignature algorithm and allow the graphical modification of two parameters:
- The calculation fields : it is the list of fields on which the algorithm bases its hash calculation. By default we only use the "content" field as duplicates of a document may have different titles but same content, but up to you if you want to change or add more fields in the hashes calculation. The fields must be coma separated with no spaces !
- The quant rate : it is a per cent number that will represent a variation factor. To better understand this number, it is better to refer to the official Solr documentation

If the TextProfileSignature algorithm does not fit your needs, you can replace it with another one provided by Solr. However, Datafari does not provide an admin UI for that and you will need to modify the solrconfig.xml file of the "Duplicates" index. It is located in DATAFARI_HOME/solr/solrcloud/Duplicates/conf/solrconfig.xml:
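
<updateRequestProcessorChain name="signature">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.update.processor.TextProfileSignature</str>
    <str name="quantRate">.1</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>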

Above, you can see the default configuration
After having modified the solrconfig.xml file, you will need to push it to Zookeeper by running the following command on the main Solr node (replace the ${DATAFARI_HOME} parameter by the path of your installation):
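
"${DATAFARI_HOME}/solr/server/scripts/cloud-scripts/zkcli.sh" -cmd upconfig -zkhost localhost:2181 -confdir "${DATAFARI_HOME}/solr/solrcloud/Duplicates/conf" -confname Duplicates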

2. Configure the "DuplicatesAnnotator"

To fill the "Duplicates" index, you will need to add the "DuplicatesAnnotator" transformation connector to all your MCF jobs. The "DuplicatesAnnotator" exists by default in MCF after an installation of Datafari and you can check it in the MCF admin UI:

Now that you have checked that the annotator is correctly configured, add it to all of your jobs, right before the output in the pipeline:

In the DuplicatesAnnotator tab, you can set a minimum document size limit of 1000 bytes as, with the default algorithm, any document with less content will result in very poor duplicate detection:

Now before running your jobs, ensure that the annotator batch is up and running, thanks to the admin UI - Search Engine Administration → Annotator Configuration:

Just toggle the switch button to the "On" position.
If you want/have to do it manually, you need to set the BATCH parameter to true in the DATAFARI_HOME/tomcat/conf/datafari.properties file of the main Datafari node : 
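
For reference, the expected line in datafari.properties looks like the following (a minimal sketch based on the BATCH parameter mentioned above; check the existing entry in your file before editing it):

BATCH=true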

Then you will need to either restart Datafari, or manually start the batch by executing the following commands:

Then you can run your jobs!

3. Use admin UI to browse through duplicates

Once your jobs are done, you can consult the list of detected duplicates in the admin UI of Datafari. System Analysis → Duplicate files : 

As mentioned in the UI, by clicking on a file/document name in the first table, you will display a second table containing the full list of similar documents corresponding to the clicked one:

In this second table, you will be able to directly open documents by clicking on them, assuming you have the rights to do so and that your browser is correctly configured.

The list of detected duplicates will evolve with your Datafari index, but be aware that it will not be instantly up to date when a job is done. The hash calculation time per document depends on the document size: it can vary from a few milliseconds for documents whose content is smaller than 1 KB to almost a minute for content reaching the 1 MB limit of Datafari!