SearchAggregator

Valid from Datafari v5.4 upwards

The SearchAggregator can dispatch a request to several external Datafari sites and aggregate their responses with the local one, while keeping the standard response format described at https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/1672871937

1. Working details

When the SearchAggregator servlet receives a request, it behaves in one of two ways:

  • If results aggregation is enabled: the request is dispatched to the configured external Datafari instances (see section 2), and the returned results are aggregated with the local ones into a single response in the standard format

  • If results aggregation is disabled: the servlet acts as the Search API and executes the request locally

To dispatch the request to other Datafari instances, the SearchAggregator creates, for each external Datafari, a new request that duplicates the original one except for two parameters:

  • The target URL is replaced by the URL of the Search API of the targeted Datafari instance

  • The requesting user is passed as a request parameter

The requesting user is passed as a request parameter because it is the only safe way. Normally, the user used to query the Solr index is deduced from the session, and only from the session. A user passed as a request parameter to the Search API is automatically ignored for security reasons: otherwise, anybody could impersonate another user to access their data.
However, when the SearchAggregator servlet sends requests to external Datafari instances, those instances cannot deduce the user from the session, because the SearchAggregator cannot authenticate to them as the requesting user. Doing so would require knowing the user's credentials and using them for each request, which would be unsafe as well as technically very complex.
This is why the SearchAggregator is the only “entity” allowed by the Search API to pass the requesting user as a request parameter. The SearchAggregator has dedicated client credentials on the Search API (client name: search-aggregator), and those credentials are used to retrieve an OAuth2 access token that the SearchAggregator must include in every request so that the Search API can identify it.
To summarise: when the SearchAggregator sends a request to an external Datafari Search API, it must include an access token previously obtained from that external Datafari. The external Datafari instance can then recognize the SearchAggregator and takes the user passed as a request parameter into account instead of ignoring it.
The SearchAggregator uses a TokenManager to retrieve an access token for each external Datafari instance it queries, and to renew tokens when they expire.
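
As an illustration of the two changes described above, here is a minimal Python sketch of how a forwarded request could be built. It is not Datafari's actual code: the parameter name requesting_user, the function name and the example host are hypothetical, and sending the token as an "Authorization: Bearer" header is an assumption (the text only says that the token must be included in the request).

```python
from urllib.parse import urlencode

# Hypothetical name: the documentation does not say what the user parameter is called.
USER_PARAM = "requesting_user"


def build_external_request(original_params, search_api_url, user, access_token):
    """Duplicate the original query for one external Datafari.

    original_params is a list of (name, value) pairs from the incoming request.
    Returns the target URL and the headers to send with it.
    """
    # Drop any user parameter supplied by the client: only the value set by
    # the aggregator itself may be forwarded.
    params = [(name, value) for name, value in original_params if name != USER_PARAM]
    params.append((USER_PARAM, user))
    # The target URL becomes the Search API of the external instance.
    url = search_api_url + "?" + urlencode(params)
    # Assumption: the OAuth2 access token travels as a bearer token header.
    headers = {"Authorization": "Bearer " + access_token}
    return url, headers
```

In this sketch, a user parameter already present in the incoming request is discarded before the aggregator adds its own, which mirrors the rule that a client-supplied user must never be trusted.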

To be fast and efficient, the SearchAggregator sends the requests to the different external Datafari instances in parallel. Each request is executed in its own thread with a timeout, after which the request is cancelled. All request threads are monitored by a ThreadExecutor that sets a second timeout, called the “global timeout”, after which all threads are killed regardless of their status. This ensures that, once the global timeout has elapsed, a response is always built from whatever responses are available.
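
The two-level timeout can be sketched in Python with the standard concurrent.futures module. This is only an illustration under assumptions: query_all and the fetcher callables are hypothetical names, each fetcher is expected to enforce its own per-request timeout, and, unlike the behaviour described above, Python cannot kill a running thread, so late requests are simply abandoned.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def query_all(fetchers, global_timeout):
    """Run one request per site in parallel and keep what arrives in time.

    fetchers maps a site name to a zero-argument callable returning that
    site's response. Returns (responses, errors), both keyed by site name.
    """
    executor = ThreadPoolExecutor(max_workers=max(1, len(fetchers)))
    futures = {executor.submit(fetch): name for name, fetch in fetchers.items()}
    # Global timeout: stop waiting after this delay, whatever the thread status.
    done, not_done = wait(list(futures), timeout=global_timeout)
    responses, errors = {}, {}
    for future in done:
        name = futures[future]
        try:
            responses[name] = future.result()
        except Exception as exc:  # a failed request must not break the others
            errors[name] = str(exc)
    for future in not_done:
        future.cancel()  # only prevents tasks that have not started yet
        errors[futures[future]] = "global timeout reached"
    # Do not block on threads that are still running.
    executor.shutdown(wait=False)
    return responses, errors
```

The response is then built from the responses dictionary alone, so a slow or failing site never delays the answer beyond the global timeout.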

2. Configuration

The configuration is covered in a dedicated documentation page: https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/1637908489

3. SearchAggregator query tips

To select the sites to query, the SearchAggregator servlet looks for a specific URL parameter: “aggregator”.

If this parameter is not set, the default site is used: the servlet first looks for a user-specific default, falls back to a global default if none exists, and falls back to all sources as a last resort.
If this parameter is set but empty (url?q=*:*&aggregator=&…), all sources are queried.
If it is set to a comma-separated list of sites (no spaces), those sites are queried.
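
The three cases above can be summarised by the following Python sketch. It only mirrors the rules as stated here: select_sites and its arguments are hypothetical names, defaults are modelled as a single site name, and the handling of unknown site names is not covered because the text does not specify it.

```python
from urllib.parse import parse_qs


def select_sites(query_string, all_sites, user_default=None, global_default=None):
    """Return the list of sites to query for a given query string."""
    # keep_blank_values=True is required: otherwise "aggregator=" would be
    # dropped and could not be told apart from a missing parameter.
    params = parse_qs(query_string, keep_blank_values=True)
    if "aggregator" not in params:
        # Not set: user-specific default, then global default, then all sources.
        if user_default:
            return [user_default]
        if global_default:
            return [global_default]
        return list(all_sites)
    value = params["aggregator"][0]
    if value == "":
        # Set but empty: all sources are queried.
        return list(all_sites)
    # Comma-separated list without spaces (parse_qs already decoded %2C).
    return value.split(",")
```

For example, select_sites("q=*:*&aggregator=Cagnes,Toulon", sites) selects the two named sites, while select_sites("q=*:*&aggregator=", sites) selects every site.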

Below is an example URL (in URL-encoded form, the comma “,” is written %2C):

https://xx.datafari.com/Datafari/rest/v2.0/search/select?q=*:*&fl=title,url,id,extension,preview_content,last_modified,crawl_date,author,original_file_size,emptied,repo_source&sort=score desc&q.op=AND&rows=10&start=0&aggregator=Cagnes,Toulon&facet=true&facet.field={!ex=author}author&facet.field={!ex=repo_source}repo_source&facet.field={!ex=extension}extension&facet.field={!ex=language}language&facet.field={!ex=urlHierarchy}urlHierarchy&facet.query={!key=date__lastmodified_facet_0}last_modified:[NOW/DAY TO NOW]&facet.query={!key=date__lastmodified_facet_1}last_modified:[NOW/DAY-7DAY TO NOW/DAY]&facet.query={!key=date__lastmodified_facet_2}last_modified:[NOW/DAY-30DAY TO NOW/DAY-8DAY]&facet.query={!key=date__lastmodified_facet_3}last_modified:([1970-09-01T00:01:00Z TO NOW/DAY-31DAY] || [* TO 1970-08-31T23:59:59Z])&facet.query={!key=date__lastmodified_facet_4}last_modified:[1970-09-01T00:00:00Z TO 1970-09-01T00:00:00Z]

4. Aggregator request errors

Some requests to external Datafaris may fail, because of connection issues or other unexpected errors. In that case, the response returned by the Datafari API contains a field named “aggregator_errors” holding a list of JSON objects. Each object in that list represents a failed request to an external Datafari and looks like this:

{ "datafari_name":"Datafari ext 1", "search_api_url":"http://datafari.ext1.com/", "error_msg":"Request error: 500 internal server error" }

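Here is a quick description of each field:

  • datafari_name: the name of the external Datafari whose request failed

  • search_api_url: the URL of the Search API that was requested on that Datafari

  • error_msg: a short description of the error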

The “aggregator_errors” field of the JSON response therefore tells you which aggregator requests failed, with a quick description of what happened, so that the search UI can display information about these errors.
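
As a hedged illustration of how a client could use this field, the Python sketch below turns the error list into display messages. The function name is hypothetical, and the sketch assumes that “aggregator_errors” sits at the top level of the JSON response, which the text above does not state explicitly.

```python
import json


def describe_aggregator_errors(response_text):
    """Return one human-readable message per failed aggregator request."""
    response = json.loads(response_text)
    messages = []
    # Assumption: the field is at the top level; it is absent when nothing failed.
    for error in response.get("aggregator_errors", []):
        messages.append(
            "{} ({}): {}".format(
                error.get("datafari_name", "unknown source"),
                error.get("search_api_url", ""),
                error.get("error_msg", ""),
            )
        )
    return messages
```

A search UI could show these messages next to the results so that users know that some sources are missing from the aggregated list.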

5. What is deactivated / changing in Aggregator mode

  • The graphical mode for declaring document boosts is not available.

  • The entity autocomplete mode is not available, because it relies on entity IDs that may be identical in different Datafaris while not representing the same entity.

  • The autocomplete mode only works when a single source is selected, because consolidating the autocomplete suggestions from the different sources would be far too slow to feel real-time.

  • The Tag Cloud widget only works when a single source is selected.

  • The spellcheck mode only works when a single source is selected.

  • The promolinks only work when a single source is selected.

  • Search alerts only monitor documents indexed in the local Datafari, not the aggregated ones.

  • Synonyms and stopwords must be declared individually on each Datafari, and must be identical on all of them.

6. Algorithm used and performance impact

The algorithm currently used is as follows:

Assumption: the corpora of the different Datafaris are rather homogeneous and large. The term distribution is therefore probably similar across the indexes, so a local ranking score can be treated as a global ranking score.

Based on this assumption, we retrieve the top X results from each Datafari, take their scores, and rerank them globally by score. We then display the first ten results on the first page, and so on.

The consequence is that, to display the global results page from rank N to rank N+9, we need to fetch the local results from rank 0 to N+9 from each Datafari, since we do not know in advance how they score globally. For instance, the overall top N+2 results may all come from a single Datafari, while the local top 1 and 2 results of another Datafari rank N+3 and N+4 overall and therefore need to be displayed. This has a strong impact on performance: the number of results retrieved from each Datafari grows linearly with N, and so does the execution time.
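
The reranking step can be sketched as follows in Python. This is an illustration of the principle, not the actual implementation: global_page and fetch_top are hypothetical names, results are modelled as (score, document id) pairs, and merging with heapq.merge is one possible way to combine lists that are already sorted by descending score.

```python
import heapq
from itertools import islice


def global_page(fetch_top, sources, start, rows=10):
    """Return the global results ranked from start to start + rows - 1.

    fetch_top(source, count) must return the count best local results of
    that source as (score, doc_id) pairs sorted by descending score.
    """
    # Each source must provide its local ranks 0 .. start + rows - 1,
    # because any of them could end up on the requested global page.
    needed = start + rows
    local_lists = [fetch_top(source, needed) for source in sources]
    # Merge the sorted lists, highest score first.
    merged = heapq.merge(*local_lists, key=lambda hit: hit[0], reverse=True)
    # Keep only the requested window of the global ranking.
    return list(islice(merged, start, needed))
```

Note that the number of results requested from each source is start + rows, which is what makes deep pagination expensive.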

As a consequence, the further a user goes in the pagination, the slower the response. This does not affect performance when searching within only ONE of the Datafaris.

Warning: in aggregation mode, the further a user goes in the pagination, the slower the response. We repeat it here because this limitation is important to understand.


Valid for Datafari Enterprise v4.6 and Datafari Community v5.0 up to 5.3

The SearchAggregator is a servlet that replaces the servlet normally used for search requests. The main difference is that it can dispatch a request to several external Datafari sites and aggregate their responses with the local one, while keeping the standard response format.

1. Working details

When the SearchAggregator servlet receives a request, it behaves in one of two ways:

  • If results aggregation is enabled: the request is dispatched to the configured external Datafari instances (see section 2), and the returned results are aggregated with the local ones into a single response in the standard format

  • If results aggregation is disabled: the servlet behaves like the one it replaces and executes the request locally

To dispatch the request to other Datafari instances, the SearchAggregator creates, for each external Datafari, a new request that duplicates the original one except for two parameters:

  • The target URL is replaced by the URL of the Search API of the targeted Datafari instance

  • The requesting user is passed as a request parameter

The requesting user is passed as a request parameter because it is the only safe way. Normally, the user used to query the Solr index is deduced from the session, and only from the session. A user passed as a request parameter to the Search API is automatically ignored for security reasons: otherwise, anybody could impersonate another user to access their data.
However, when the SearchAggregator servlet sends requests to external Datafari instances, those instances cannot deduce the user from the session, because the SearchAggregator cannot authenticate to them as the requesting user. Doing so would require knowing the user's credentials and using them for each request, which would be unsafe as well as technically very complex.
This is why the SearchAggregator is the only “entity” allowed by the Search API to pass the requesting user as a request parameter. The SearchAggregator has dedicated client credentials on the Search API (client name: search-aggregator), and those credentials are used to retrieve an OAuth2 access token that the SearchAggregator must include in every request so that the Search API can identify it.
To summarise: when the SearchAggregator sends a request to an external Datafari Search API, it must include an access token previously obtained from that external Datafari. The external Datafari instance can then recognize the SearchAggregator and takes the user passed as a request parameter into account instead of ignoring it.
The SearchAggregator uses a TokenManager to retrieve an access token for each external Datafari instance it queries, and to renew tokens when they expire.
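
The token handling described above can be pictured as a small per-instance cache. The Python sketch below is only an illustration of that idea: TokenCache and fetch_token are hypothetical names, not the TokenManager API, and the way a token is actually requested from an external Datafari is deliberately left to the injected callable.

```python
import time


class TokenCache:
    """Keep one access token per external instance and renew it on expiry."""

    def __init__(self, fetch_token, clock=time.monotonic):
        # fetch_token(site) -> (access_token, lifetime_in_seconds).
        # Hypothetical callable standing in for the real OAuth2 token request.
        self._fetch_token = fetch_token
        self._clock = clock
        self._tokens = {}  # site -> (token, expiry time)

    def get(self, site):
        cached = self._tokens.get(site)
        # Request a new token when none is cached or the cached one has expired.
        if cached is None or self._clock() >= cached[1]:
            token, lifetime = self._fetch_token(site)
            cached = (token, self._clock() + lifetime)
            self._tokens[site] = cached
        return cached[0]
```

Injecting the clock keeps the expiry logic easy to test without waiting for a real token to expire.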

To be fast and efficient, the SearchAggregator sends the requests to the different external Datafari instances in parallel. Each request is executed in its own thread with a timeout, after which the request is cancelled. All request threads are monitored by a ThreadExecutor that sets a second timeout, called the “global timeout”, after which all threads are killed regardless of their status. This ensures that, once the global timeout has elapsed, a response is always built from whatever responses are available.

2. Configuration

There is a dedicated SearchAggregator configuration page.

3. SearchAggregator query tips

To select the sites to query, the SearchAggregator servlet looks for a specific URL parameter: “aggregator”.

If this parameter is not set, the default site is used: the servlet first looks for a user-specific default, falls back to a global default if none exists, and falls back to all sources as a last resort.
If this parameter is set but empty (url?q=*:*&aggregator=&…), all sources are queried.
If it is set to a comma-separated list of sites (no spaces), those sites are queried.

Below are some example URLs (%2C stands for the comma “,”):

https://54.36.146.228/Datafari/SearchAggregator/select?fl=title%2Curl%2Cid%2Cextension%2Cpreview_content%2Clast_modified%2Ccrawl_date%2Cauthor%2Coriginal_file_size%2Cemptied&facet=true&q=*%3A*&rows=10&facet.field={!ex%3Drepo_source}repo_source&facet.field={!ex%3Dextension}extension&facet.field={!ex%3Dentity_person}entity_person&facet.field={!ex%3Dentity_phone_present}entity_phone_present&facet.field={!ex%3Dentity_phone}entity_phone&facet.field={!ex%3Dentity_special_present}entity_special_present&facet.field={!ex%3Dlanguage}language&facet.field={!ex%3Dsource}source&facet.query={!key%3DMoins%20de%20un%20mois}last_modified%3A[NOW-1MONTH TO NOW]&facet.query={!key%3DMoins%20de%20un%20an}last_modified%3A[NOW-1YEAR TO NOW]&facet.query={!key%3DMoins%20de%20cinq%20ans}last_modified%3A[NOW-5YEARS TO NOW]&facet.query={!key%3DMoins%20de%20100ko}original_file_size%3A[0 TO 102400]&facet.query={!key%3DDe%20100ko%20%C3%A0%2010Mo}original_file_size%3A[102400 TO 10485760]&facet.query={!key%3DPlus%20de%2010Mo}original_file_size%3A[10485760 TO *]&id=d36f89ed-03fa-41ee-9aee-ff67c4d8a352&aggregator=&sort=score desc&q.op=AND&spellcheck.collateParam.q.op=AND&spellcheck=false&wt=json&json.wrf=jQuery34104197056299893184_1594988462839&_=1594988462840

https://54.36.146.228/Datafari/SearchAggregator/select?fl=title%2Curl%2Cid%2Cextension%2Cpreview_content%2Clast_modified%2Ccrawl_date%2Cauthor%2Coriginal_file_size%2Cemptied&facet=true&q=*%3A*&rows=10&facet.field={!ex%3Drepo_source}repo_source&facet.field={!ex%3Dextension}extension&facet.field={!ex%3Dentity_person}entity_person&facet.field={!ex%3Dentity_phone_present}entity_phone_present&facet.field={!ex%3Dentity_phone}entity_phone&facet.field={!ex%3Dentity_special_present}entity_special_present&facet.field={!ex%3Dlanguage}language&facet.field={!ex%3Dsource}source&facet.query={!key%3DMoins%20de%20un%20mois}last_modified%3A[NOW-1MONTH TO NOW]&facet.query={!key%3DMoins%20de%20un%20an}last_modified%3A[NOW-1YEAR TO NOW]&facet.query={!key%3DMoins%20de%20cinq%20ans}last_modified%3A[NOW-5YEARS TO NOW]&facet.query={!key%3DMoins%20de%20100ko}original_file_size%3A[0 TO 102400]&facet.query={!key%3DDe%20100ko%20%C3%A0%2010Mo}original_file_size%3A[102400 TO 10485760]&facet.query={!key%3DPlus%20de%2010Mo}original_file_size%3A[10485760 TO *]&id=d36f89ed-03fa-41ee-9aee-ff67c4d8a352&sort=score desc&q.op=AND&spellcheck.collateParam.q.op=AND&fq={!tag%3Drepo_source}repo_source%3A"New Enron"&aggregator=Centos1%2CLocal&spellcheck=false&wt=json&json.wrf=jQuery34104197056299893184_1594988462839&_=1594988462842
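
To show where the %2C in these URLs comes from, here is a minimal Python sketch that builds a much shorter query URL. The function name and the host datafari.example.com are hypothetical, and only the q, rows and aggregator parameters are included for brevity.

```python
from urllib.parse import urlencode


def aggregator_query_url(base_url, query, sites=None, rows=10):
    """Build a query URL, optionally restricted to some sites.

    sites=None leaves the aggregator parameter out (default site),
    sites=[] sends it empty (all sources), otherwise the names are joined
    with commas, which urlencode writes as %2C.
    """
    params = [("q", query), ("rows", rows)]
    if sites is not None:
        params.append(("aggregator", ",".join(sites)))
    return base_url + "?" + urlencode(params)
```

For instance, passing ["Centos1", "Local"] produces a URL ending in aggregator=Centos1%2CLocal, as in the second example above.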

4. What is deactivated / changing in Aggregator mode

  • The graphical mode for declaring document boosts is not available.

  • The entity autocomplete mode is not available, because it relies on entity IDs that may be identical in different Datafaris while not representing the same entity.

  • The autocomplete mode only works when a single source is selected, because consolidating the autocomplete suggestions from the different sources would be far too slow to feel real-time.

  • The Tag Cloud widget only works when a single source is selected.

  • The spellcheck mode only works when a single source is selected.

  • The promolinks only work when a single source is selected.

  • Search alerts only monitor documents indexed in the local Datafari, not the aggregated ones.

  • Synonyms and stopwords must be declared individually on each Datafari, and must be identical on all of them.

5. Algorithm used and performance impact

The algorithm currently used is as follows:

Assumption: the corpora of the different Datafaris are rather homogeneous and large. The term distribution is therefore probably similar across the indexes, so a local ranking score can be treated as a global ranking score.

Based on this assumption, we retrieve the top X results from each Datafari, take their scores, and rerank them globally by score. We then display the first ten results on the first page, and so on.

The consequence is that, to display the global results page from rank N to rank N+9, we need to fetch the local results from rank 0 to N+9 from each Datafari, since we do not know in advance how they score globally. For instance, the overall top N+2 results may all come from a single Datafari, while the local top 1 and 2 results of another Datafari rank N+3 and N+4 overall and therefore need to be displayed. This has a strong impact on performance: the number of results retrieved from each Datafari grows linearly with N, and so does the execution time.

As a consequence, the further a user goes in the pagination, the slower the response. This does not affect performance when searching within only ONE of the Datafaris.
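
To give an order of magnitude, the small Python helper below computes how many results must be retrieved for a given page, following the reasoning above. The function name is hypothetical, and the figures only count retrieved results, not actual response times.

```python
def fetch_cost(start, rows, source_count):
    """Return (results fetched per Datafari, results fetched in total).

    start is the global rank of the first displayed result (N in the text),
    rows the page size, source_count the number of Datafaris queried.
    """
    # Each Datafari must return its local ranks 0 .. start + rows - 1.
    per_source = start + rows
    return per_source, per_source * source_count
```

With three Datafaris and pages of ten results, the first page (start 0) requires 10 results per Datafari and 30 in total, whereas the fiftieth page (start 490) requires 500 per Datafari and 1500 in total.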