Deduplication

Valid from Datafari v5.4

The documentation below is valid from Datafari v5.4 onward.

When Deduplication is active and properly configured, a user with the searchexpert role can check for duplicates in the admin UI. The feature is available in the Extra Functionalities menu.

For more details about this functionality, see Detect duplicates configuration.


Valid from Datafari v5.0 - Enterprise Edition only

The documentation below is valid from Datafari v5.0 up to and including v5.3.

When Deduplication is active and properly configured, a user with the searchexpert role can check for duplicates in the admin UI. The feature is available in the Extra Functionalities menu.

For more details about this functionality, see Detect duplicates configuration.


Not active since v3.0 - Enterprise Edition only

Since version 3.0, Deduplication is no longer active or maintained.

Datafari can let a user see which documents are duplicated in the search results.

The deduplication functionality hashes each document with the MD5 algorithm so that Solr can recognize which documents are duplicates.
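The idea of hash-based duplicate detection can be illustrated with a minimal Python sketch. It is not Datafari's or Solr's actual code: the function names `md5_signature` and `group_duplicates` and the sample documents are hypothetical, and the sketch assumes that two documents count as duplicates only when their content is byte-for-byte identical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def md5_signature(content: bytes) -> str:
    """Return the MD5 digest of a document's content as a hex string."""
    return hashlib.md5(content).hexdigest()


def group_duplicates(documents: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Group document ids by signature and keep only groups with duplicates."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for doc_id, content in documents.items():
        groups[md5_signature(content)].append(doc_id)
    # A signature shared by a single document is not a duplicate set.
    return {sig: ids for sig, ids in groups.items() if len(ids) > 1}


# Hypothetical corpus: two identical documents and one distinct document.
docs = {
    "a.txt": b"quarterly report",
    "b.txt": b"quarterly report",
    "c.txt": b"meeting notes",
}
duplicates = group_duplicates(docs)
```

Here `duplicates` holds a single entry whose value is the list `["a.txt", "b.txt"]`. Because the comparison is on exact hashes, a one-character difference between two documents produces different signatures in this sketch.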

When activated, a special “duplication” facet appears at the bottom left of the results page. Each item in this facet represents a set of duplicated documents, shown with a name and the number of duplicates in parentheses.

Clicking a facet item displays all the duplicated documents in that set. This functionality can be useful to find out how many duplicated documents are present in the corpus.
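The facet behaviour described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the mapping from document id to signature, the signature values `sig-1` and `sig-2`, and the helper names `duplication_facet` and `documents_for_facet` are all hypothetical and do not come from Datafari.

```python
from collections import Counter
from typing import Dict, List, Tuple


def duplication_facet(signatures: Dict[str, str]) -> List[Tuple[str, int]]:
    """Return (signature, count) pairs for signatures shared by 2+ documents."""
    counts = Counter(signatures.values())
    return [(sig, n) for sig, n in counts.most_common() if n > 1]


def documents_for_facet(signatures: Dict[str, str], selected: str) -> List[str]:
    """Return the ids of all documents whose signature matches the clicked item."""
    return sorted(doc_id for doc_id, sig in signatures.items() if sig == selected)


# Hypothetical index content: document id -> precomputed signature.
index = {"a.txt": "sig-1", "b.txt": "sig-1", "c.txt": "sig-2", "d.txt": "sig-1"}

facet = duplication_facet(index)
labels = [f"{sig} ({n})" for sig, n in facet]  # name with the count in parentheses
selected = documents_for_facet(index, "sig-1")
```

With this sample data, `labels` is `["sig-1 (3)"]` and `selected` lists `a.txt`, `b.txt`, and `d.txt`; the document `c.txt` has a unique signature, so it produces no facet item.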
