...
Extract regex groups: If the regular expression contains one or more groups and this option is enabled, the values of the groups are stored in the destination metadata. If it is not enabled, the complete match is stored in the destination metadata. The groups of a regular expression are delimited by `(` and `)`. For example, this regular expression has one group: `word1 (.*?) word2`. It searches for the part of a text that starts with word1 and ends with word2. Take this text: `My word1 titi tata toto word2 and other words`. With the chosen regular expression, if the checkbox is enabled, the destination metadata = "titi tata toto". Otherwise, the destination metadata = "word1 titi tata toto word2". With several groups, the values are added separated by a space. For example, with this regex: `word1 (.*?) tata (.*?) word2`, the result is: destination metadata = "titi toto". This field is optional.
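The behaviour described above can be reproduced with a short Python sketch. This is only an illustration, not the connector's code: the function name `extract` and its `extract_groups` parameter are hypothetical stand-ins for the checkbox, and Python's `re` module is assumed to behave like the connector's regex engine for these simple patterns.

```python
import re


def extract(pattern, text, extract_groups):
    """Hypothetical helper mimicking the 'Extract regex groups' option."""
    match = re.search(pattern, text)
    if match is None:
        return None
    if extract_groups and match.groups():
        # Option enabled: keep only the group values, separated by a space.
        return " ".join(g for g in match.groups() if g is not None)
    # Option disabled (or no group in the regex): keep the complete match.
    return match.group(0)


text = "My word1 titi tata toto word2 and other words"

one_group_on = extract(r"word1 (.*?) word2", text, True)
one_group_off = extract(r"word1 (.*?) word2", text, False)
two_groups_on = extract(r"word1 (.*?) tata (.*?) word2", text, True)

print(one_group_on)   # titi tata toto
print(one_group_off)  # word1 titi tata toto word2
print(two_groups_on)  # titi toto
```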
Note |
---|
|
...
But if you want to extract only the department of `<part1>`, this is not possible with the Regex Entity Connector, because it processes a file line by line. The regex to extract the department of `<part1>` could have been: `<part1>\s<department>(.*?)</department>`. Even expecting the line break in the regex with `\s` does not help: the connector applies the regex to one line at a time, so when `<part1>` is followed by a line break, only `<part1>` is read when the regex is applied, never the `<department>` line that follows.
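The following Python sketch illustrates why the line-by-line processing defeats this regex. The sample `document` string and both helper functions are hypothetical and only approximate the behaviour described above; they are not taken from the connector.

```python
import re

PATTERN = re.compile(r"<part1>\s<department>(.*?)</department>")

# Hypothetical sample: <part1> and its <department> sit on separate lines.
document = "<part1>\n<department>DEP1</department>\n</part1>\n"


def match_line_by_line(pattern, text):
    """Apply the regex to each line separately, as the connector is described to do."""
    results = []
    for line in text.splitlines():
        match = pattern.search(line)
        if match:
            results.append(match.group(1))
    return results


def match_whole_text(pattern, text):
    """Apply the regex to the whole text at once, for comparison."""
    return [match.group(1) for match in pattern.finditer(text)]


per_line = match_line_by_line(PATTERN, document)
whole = match_whole_text(PATTERN, document)

print(per_line)  # []
print(whole)     # ['DEP1']
```

No single line contains both `<part1>` and `<department>`, so the per-line pass never matches, while the same regex succeeds on the whole text.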
However, as long as you do not need to extract data from nested tags, the Regex Entity Connector is a useful solution for, say, extracting chunks of information between tags. Consider this other example: `<department>DEP1/COMMON-DEP/DEP2</department>`, from which you want to extract DEP1 and DEP2. Let's say COMMON-DEP is part of the selection criteria for finding the departments to extract. The regular expression will be: `<department>(.*?)/COMMON-DEP/(.*?)</department>`.
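As a quick check of this regular expression, here is a minimal Python sketch applied to the single line from the example. The variable names are illustrative, and joining the values with a space assumes that the "Extract regex groups" option is enabled.

```python
import re

line = "<department>DEP1/COMMON-DEP/DEP2</department>"
pattern = re.compile(r"<department>(.*?)/COMMON-DEP/(.*?)</department>")

match = pattern.search(line)
# Each group captures one department on either side of COMMON-DEP.
departments = list(match.groups()) if match else []
# With several groups, the values are joined with a space.
destination_metadata = " ".join(departments)

print(departments)           # ['DEP1', 'DEP2']
print(destination_metadata)  # DEP1 DEP2
```

A line that does not contain `/COMMON-DEP/` between the tags produces no match, which is how COMMON-DEP acts as a selection criterion.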
Note |
---|
Warning! When using a destination field, make sure it exists in Solr, or create it if necessary (if it is missing there is no crash, but the field is silently ignored). It has to be a multivalued field. Note that if you want to use it in facets, it has to be a “String” field. |
...