Info |
---|
Valid from Datafari v5.4 upwards |
...
Info |
---|
In aggregation mode, the further a user goes in pagination, the slower the response becomes. We repeat it here as a warning because this limitation is important to understand. |
...
The SearchAggregator is a servlet that replaces the |
...
[DEPRECATED] SearchProxy API. The main difference is that it is able to dispatch the request to several external Datafari sites and aggregate the responses with the local one, keeping the standard format described in the |
...
1. Working details

When the SearchAggregator servlet receives a request, there are two kinds of behavior:
|
...
To dispatch the request to other Datafari instances, the SearchAggregator creates, for each external Datafari, a new request that duplicates the original request except for two parameters:
The requesting user is passed as a request parameter because it is the only safe way. Normally, the user used to query the Solr index is deduced from the session, and ONLY from the session. A user passed as a request parameter to the Search API is automatically ignored for security reasons: otherwise anybody could impersonate another user and access their data.

To be fast and efficient, the SearchAggregator parallelizes the requests to the different external Datafari instances. Each request is executed in its own thread, with a timeout after which the request is cancelled. All request threads are monitored by a ThreadExecutor that sets another timeout, which we call the “global timeout”, after which all threads are killed regardless of their status. This ensures that, once the global timeout has elapsed, a response is always built from whatever responses are available.

2. Configuration

There is a dedicated searchaggregator configuration page.

3. SearchAggregator query tips

To select the sites to be queried, the SearchAggregator servlet looks for a specific URL parameter: “aggregator”. If this parameter is not set, the default site is used: the servlet first looks for a user-specific default, falls back to a global default if that does not exist, and falls back to all sources as a last resort.
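The fallback order for site selection can be sketched as follows. This is a minimal illustration in Python, not the servlet's actual code: the function names `resolve_sites` and `aggregator_query` are hypothetical, and treating an empty `aggregator=` value the same as a missing parameter is an assumption.

```python
from typing import List, Optional, Sequence
from urllib.parse import urlencode


def resolve_sites(
    aggregator_param: Optional[str],
    user_default: Optional[Sequence[str]],
    global_default: Optional[Sequence[str]],
    all_sites: Sequence[str],
) -> List[str]:
    """Return the sites to query, following the fallback order described above."""
    # Explicit selection: comma-separated site names, e.g. "Centos1,Local".
    requested = [name.strip() for name in (aggregator_param or "").split(",")]
    requested = [name for name in requested if name]
    if requested:
        return requested
    # Assumption: a missing or empty parameter triggers the fallback chain.
    if user_default:
        return list(user_default)
    if global_default:
        return list(global_default)
    return list(all_sites)


def aggregator_query(sites: Sequence[str]) -> str:
    """Build the URL-encoded "aggregator" parameter for a list of site names."""
    return urlencode({"aggregator": ",".join(sites)})
```

For instance, `aggregator_query(["Centos1", "Local"])` produces `aggregator=Centos1%2CLocal`, which is the encoded form a client would append to a SearchAggregator query.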
Below are some example URLs (%2C stands for a comma: “,”):

https://54.36.146.228/Datafari/SearchAggregator/select?fl=title%2Curl%2Cid%2Cextension%2Cpreview_content%2Clast_modified%2Ccrawl_date%2Cauthor%2Coriginal_file_size%2Cemptied&facet=true&q=*%3A*&rows=10&facet.field={!ex%3Drepo_source}repo_source&facet.field={!ex%3Dextension}extension&facet.field={!ex%3Dentity_person}entity_person&facet.field={!ex%3Dentity_phone_present}entity_phone_present&facet.field={!ex%3Dentity_phone}entity_phone&facet.field={!ex%3Dentity_special_present}entity_special_present&facet.field={!ex%3Dlanguage}language&facet.field={!ex%3Dsource}source&facet.query={!key%3DMoins%20de%20un%20mois}last_modified%3A[NOW-1MONTH TO NOW]&facet.query={!key%3DMoins%20de%20un%20an}last_modified%3A[NOW-1YEAR TO NOW]&facet.query={!key%3DMoins%20de%20cinq%20ans}last_modified%3A[NOW-5YEARS TO NOW]&facet.query={!key%3DMoins%20de%20100ko}original_file_size%3A[0 TO 102400]&facet.query={!key%3DDe%20100ko%20%C3%A0%2010Mo}original_file_size%3A[102400 TO 10485760]&facet.query={!key%3DPlus%20de%2010Mo}original_file_size%3A[10485760 TO *]&id=d36f89ed-03fa-41ee-9aee-ff67c4d8a352&aggregator=&sort=score desc&q.op=AND&spellcheck.collateParam.q.op=AND&spellcheck=false&wt=json&json.wrf=jQuery34104197056299893184_1594988462839&_=1594988462840

https://54.36.146.228/Datafari/SearchAggregator/select?fl=title%2Curl%2Cid%2Cextension%2Cpreview_content%2Clast_modified%2Ccrawl_date%2Cauthor%2Coriginal_file_size%2Cemptied&facet=true&q=*%3A*&rows=10&facet.field={!ex%3Drepo_source}repo_source&facet.field={!ex%3Dextension}extension&facet.field={!ex%3Dentity_person}entity_person&facet.field={!ex%3Dentity_phone_present}entity_phone_present&facet.field={!ex%3Dentity_phone}entity_phone&facet.field={!ex%3Dentity_special_present}entity_special_present&facet.field={!ex%3Dlanguage}language&facet.field={!ex%3Dsource}source&facet.query={!key%3DMoins%20de%20un%20mois}last_modified%3A[NOW-1MONTH TO NOW]&facet.query={!key%3DMoins%20de%20un%20an}last_modified%3A[NOW-1YEAR TO NOW]&facet.query={!key%3DMoins%20de%20cinq%20ans}last_modified%3A[NOW-5YEARS TO NOW]&facet.query={!key%3DMoins%20de%20100ko}original_file_size%3A[0 TO 102400]&facet.query={!key%3DDe%20100ko%20%C3%A0%2010Mo}original_file_size%3A[102400 TO 10485760]&facet.query={!key%3DPlus%20de%2010Mo}original_file_size%3A[10485760 TO *]&id=d36f89ed-03fa-41ee-9aee-ff67c4d8a352&sort=score desc&q.op=AND&spellcheck.collateParam.q.op=AND&fq={!tag%3Drepo_source}repo_source%3A"New Enron"&aggregator=Centos1%2CLocal&spellcheck=false&wt=json&json.wrf=jQuery34104197056299893184_1594988462839&_=1594988462842

4. What is deactivated / changing in Aggregator mode
5. Algorithm used and performance impact

The algorithm currently used is as follows.

Assumption: the corpora across the Datafaris are rather homogeneous and large. This means that a local ranking score can be treated as a global ranking score, since the term distribution is probably similar across the indexes.

Based on this assumption, we retrieve the top X results from each Datafari, take their scores, and rerank them globally by score. We then display the first ten results on the first page, and so on.

The impact is that, to display the global results page from rank N to rank N+9, we need to fetch from each Datafari the local results from rank 0 to N+9, since we do not know in advance how they score globally. For instance, the overall top N+2 results may all come from a single Datafari, but then the local top 1 and top 2 results of another Datafari may rank N+3 and N+4 overall and therefore need to be displayed.

This has a strong impact on performance: the number of results retrieved from each Datafari grows linearly with N, and so does the execution time. As a consequence, the further a user goes in pagination, the slower it gets. This does not affect performance when searching within only ONE of the Datafaris.
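The merge step and its cost can be illustrated with a small Python sketch. It is only a model of the algorithm described above, not Datafari's implementation: `global_page` and `hits_fetched` are hypothetical names, each site's hits are assumed to arrive already sorted by descending local score, and a hit is reduced to a `(score, id)` pair.

```python
import heapq
from itertools import islice
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (local score, document id)


def global_page(per_site_hits: Dict[str, List[Hit]], start: int, rows: int = 10) -> List[Hit]:
    """Return global ranks start .. start+rows-1 from per-site ranked hits."""
    needed = start + rows
    # Every site must contribute its local ranks 0 .. start+rows-1,
    # because any of them could end up on the requested global page.
    truncated = [hits[:needed] for hits in per_site_hits.values()]
    # Merge the already-sorted lists by descending score.
    merged = heapq.merge(*truncated, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, start, needed))


def hits_fetched(start: int, rows: int, site_count: int) -> int:
    """Upper bound on hits retrieved across all sites for one global page."""
    return site_count * (start + rows)
```

With three sites and ten rows per page, `hits_fetched(0, 10, 3)` is 30 while `hits_fetched(90, 10, 3)` is 300: the tenth page requires fetching ten times as many hits as the first one, which is the linear growth described above.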
|