[DEPRECATED] Deduplication : Technical note

Deduplication has been straightforward to set up since Solr 1.4, because it can be enabled directly through Solr's configuration files. Datafari relies on this feature and exposes its results in the front end.

  • For the backend, we followed the Solr wiki page available here: https://cwiki.apache.org/confluence/display/solr/De-Duplication. The hash we generate is stored in a Solr field named signature. By default, the hash is computed on the Solr content field.

  • We set the overwriteDupes parameter to false:

    <bool name="overwriteDupes">false</bool>

    in solrconfig.xml. You can set it to true if you want the index to keep only one instance per set of duplicated documents. This is not our goal here: we want to give the search administrator the option of removing duplicates after the fact.

  • We set the fields parameter in solrconfig.xml to:

    <str name="fields">content,content_en,content_fr</str>

    so that the hash takes only the content of each document into account. You can change this parameter to list any fields you want. For instance, adding the title makes the hash depend on the title as well.

  • The deduplication processors are declared in the following chain:

    <updateRequestProcessorChain name="datafari">

    We also use Solr's MD5 signature class for hashing, for maximum precision. You can switch to another algorithm by changing the parameter:

    <str name="signatureClass">solr.processor.MD5Signature</str>

    in solrconfig.xml.
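
Put together, the backend settings above could look like the sketch below. Only the chain name and the overwriteDupes, fields and signatureClass values come from this note. The processor class names and the enabled and signatureField parameters follow the Solr De-Duplication documentation and are assumptions about Datafari's actual file, which may declare additional processors in the same chain. The schema declaration of the signature field (its type and attributes) is likewise an assumption.

```xml
<!-- solrconfig.xml: sketch of the dedupe part of the "datafari" chain -->
<updateRequestProcessorChain name="datafari">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Field that receives the computed hash -->
    <str name="signatureField">signature</str>
    <!-- false: keep every duplicate in the index -->
    <bool name="overwriteDupes">false</bool>
    <!-- Fields used as input for the hash -->
    <str name="fields">content,content_en,content_fr</str>
    <str name="signatureClass">solr.processor.MD5Signature</str>
  </processor>
  <!-- Assumed: standard processors that close an update chain -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- Schema: assumed declaration of the field holding the hash -->
<field name="signature" type="string" stored="true" indexed="true" multiValued="false" />
```

The signature field needs to be indexed so that it can later be used as a facet.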

 

For the front-end, we want to expose duplicates using a facet:

To do so, we created a new class called FacetDuplicates, located in /datafari/WebContent/js/AjaxFranceLabs/widgets/. This class inherits from TableWidget and overrides the update method. The override is needed because the signature facet returns only the hashes, whereas we want to display the name of a document from each set of duplicates. The widget therefore sends one GET query for every hash in the facet. We also set a mincount on the facet so that it lists only file names that actually have duplicates, which is not the case when mincount is 0 (the default configuration). A short description of the mincount parameter is available here: https://cwiki.apache.org/confluence/display/solr/Faceting. Finally, we hide the facet when it contains no duplicated document: a simple call to $('#facet_signature').show() or .hide() does the trick.
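
For reference, the facet parameters described above can also be expressed as request-handler defaults on the Solr side. This is only a sketch: the handler name /select and the mincount value of 2 (a hash shared by at least two documents) are assumptions, and Datafari may instead pass these parameters from the widget itself.

```xml
<!-- solrconfig.xml: hypothetical server-side equivalent of the facet settings -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="facet">true</str>
    <!-- Facet on the field that stores the dedupe hash -->
    <str name="facet.field">signature</str>
    <!-- Assumed value: only return hashes shared by 2 or more documents -->
    <str name="f.signature.facet.mincount">2</str>
  </lst>
</requestHandler>
```

With such a threshold, an empty signature facet means that the current result set contains no duplicates, which is the case in which the facet is hidden.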