Introduction
Decompounding in Solr often doesn’t work as expected out of the box, especially for the German language. This article goes step by step from the default setting to a custom setting that makes it work nearly perfectly.
Basic Setting
So, let’s start with the basic configuration of decompounding in Solr.
Lucene provides “DictionaryCompoundWordTokenFilter”. This filter decompounds a compound word into tokens, based on a dictionary that we have to provide. It also offers a set of configuration parameters, such as minWordSize and maxWordSize, to make it more precise and to configure it for our application data.
The setting is quite simple: we need to enable the filter in the analyzer chain of the field type. For example, enable it for your field type in schema.xml:
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="german.txt"/>
german.txt is the dictionary file and should be present in the config folder.
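For context, a complete field type definition could look like the following (a sketch; the field type name, the LowerCaseFilterFactory, and the parameter values are assumptions, not part of the original setup; lowercasing first is usually a good idea because the dictionary lookups are case-sensitive):

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- lowercase first, since dictionary lookups are case-sensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- minWordSize/minSubwordSize/maxSubwordSize are the standard Lucene
         parameters; the values here are illustrative -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="german.txt" minWordSize="5"
            minSubwordSize="3" maxSubwordSize="15"/>
  </analyzer>
</fieldType>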
To check that decompounding is working, reload the cores from the Solr admin UI and go to Analysis. Type any compound word, select the field type for which the filter is enabled in schema.xml, and analyze. You should see the input token decompounded into smaller tokens, with the original input token preserved as well.
To get more details about this filter, please see the official Solr wiki.
Problems with Basic Setting
The very basic problem with the above setting is that the filter Lucene provides doesn’t work the way we expect it to: if the dictionary used is very generic, it decompounds a word into too many tokens. A word like “Rotwein” would be broken into “rot”, “wein”, “ein”, where “ein” is something we don’t want. To solve this problem, I have written a custom filter which breaks compound words only into the best sub-tokens, so in the case of “Rotwein” the tokens would be “rot” and “wein” (see the sketch below for an illustration of the idea).
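To illustrate the idea (this is a hypothetical sketch for clarity, not the actual custom filter): greedily take the longest dictionary match at each position and jump past it, so that overlapping fragments such as “ein” inside “wein” are never emitted.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class BestSubTokens {
    // Greedy longest-match decomposition: at each position, take the longest
    // dictionary word that starts there and skip past it.
    static List<String> decompose(String word, Set<String> dictionary, int maxSubwordSize) {
        List<String> subTokens = new ArrayList<>();
        int i = 0;
        while (i < word.length()) {
            int longest = 0;
            // try the longest candidate first, then shorter ones
            for (int j = Math.min(maxSubwordSize, word.length() - i); j > 0; j--) {
                if (dictionary.contains(word.substring(i, i + j))) {
                    longest = j;
                    break;
                }
            }
            if (longest > 0) {
                subTokens.add(word.substring(i, i + longest));
                i += longest; // jump over the match, so "ein" inside "wein" is never reached
            } else {
                i++; // no dictionary word starts here
            }
        }
        return subTokens;
    }

    public static void main(String[] args) {
        // prints [rot, wein]; "ein" is skipped because "wein" is consumed first
        System.out.println(decompose("rotwein", Set.of("rot", "wein", "ein"), 15));
    }
}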
So far, so good. But let’s see what happens when we index with this setting and then query for the word “rotwein”. Consider that we have the same analyzer chain for both query and indexing.
Let’s say we have 4 documents with the name field as follows:
1. rotwein => rotwein, rot, wein
2. rot wein => rot, wein
3. rot => rot
4. wein => wein
The results that we want when we search for “rotwein” are docs 1 and 2. But this cannot be achieved with the work we have done so far.
If we use q.op as OR, it searches for rotwein OR rot OR wein, which returns docs 3 and 4 as well as 1 and 2.
If we use q.op as AND, it searches for rotwein AND rot AND wein, which returns only doc 1.
To achieve the expected results, we need to change the query to include only rot AND wein, which means we need to remove the original token after the decompounding filter.
Custom Filter to Remove Original Token
This filter removes the original token and thus keeps only the decompounded tokens. In cases where the original token is not a compound, the filter should not remove it.
For example:
Rotwein => rot, wein
Milch => milch
This is just the filter class; make sure to write a factory for it (a minimal sketch of a factory follows further below).
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class RemoveOriginalFilter extends TokenFilter {

    // must match the flag value set in CompoundWordTokenFilterBase (see below)
    private static final int FLAG = 1;

    private final CharTermAttribute charTermAttr;
    protected PositionIncrementAttribute posIncAtt;
    protected FlagsAttribute flagsAtt;

    public RemoveOriginalFilter(TokenStream input) {
        super(input);
        this.charTermAttr = addAttribute(CharTermAttribute.class);
        posIncAtt = addAttribute(PositionIncrementAttribute.class);
        flagsAtt = addAttribute(FlagsAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        if (flagsAtt.getFlags() == FLAG) {
            // the current token is a flagged original compound: skip it and
            // emit the next token (the first decompounded sub-token) instead
            return input.incrementToken();
        }
        return true;
    }
}
Make sure the above flag is set in CompoundWordTokenFilterBase. This means modifying Lucene’s CompoundWordTokenFilterBase (or a copy of it): add a FlagsAttribute (flagsAtt) and the same FLAG constant to that class, and change its incrementToken as shown below.
@Override
public final boolean incrementToken() throws IOException {
    if (!tokens.isEmpty()) {
        assert current != null;
        CompoundToken token = tokens.removeFirst();
        restoreState(current); // keep all other attributes untouched
        termAtt.setEmpty().append(token.txt);
        offsetAtt.setOffset(token.startOffset, token.endOffset);
        // modified: stock Lucene stacks sub-tokens at the same position as the
        // original (increment 0); here each sub-token gets its own position,
        // since the original token will be removed at query time
        posIncAtt.setPositionIncrement(1);
        return true;
    }

    current = null; // not really needed, but for safety
    if (input.incrementToken()) {
        // Only words longer than minWordSize get processed
        if (termAtt.length() >= this.minWordSize) {
            decompose();
            // only capture the state if we really need it for producing new tokens
            if (!tokens.isEmpty()) {
                current = captureState();
                // modified: mark the original token so RemoveOriginalFilter can
                // drop it; the state was captured before this call, so the
                // restored sub-tokens stay unflagged
                flagsAtt.setFlags(FLAG);
            }
        }
        // return original token:
        return true;
    } else {
        return false;
    }
}
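For completeness, a minimal factory could look like the following (a sketch; the package name matches the config below, and the location of the TokenFilterFactory base class varies between Lucene versions):

package de.custom.lucene;

import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class RemoveOriginalFilterFactory extends TokenFilterFactory {

    public RemoveOriginalFilterFactory(Map<String, String> args) {
        super(args);
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    @Override
    public TokenStream create(TokenStream input) {
        return new RemoveOriginalFilter(input);
    }
}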
Final Settings
After we have the custom filter in place, enable it in the query analyzer chain of the field type, after the decompounding filter:
<filter class="de.custom.lucene.RemoveOriginalFilterFactory" />
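Putting it all together, the field type could look like this (a sketch; the field type name and the LowerCaseFilterFactory are assumptions, and the important part is that RemoveOriginalFilterFactory appears only in the query chain):

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="german.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="german.txt"/>
    <!-- removes the original compound token, keeping only the sub-tokens -->
    <filter class="de.custom.lucene.RemoveOriginalFilterFactory"/>
  </analyzer>
</fieldType>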
Change solrconfig.xml to set the default parser to edismax and the default q.op to “AND”:
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="q.op">AND</str>
  </lst>
</requestHandler>
Non Decompounded Fields
With the above settings, we make sure that a compound word entered with whitespace, “rot wein”, is matched perfectly, but the same is not true for the compound word without the space, “rotwein”. To enable this to be matched as well, add one more field with a new field type that does not include decompounding (see the sketch below).
Conclusion
As we have seen, decompounding doesn’t work perfectly out of the box, but with a little bit of customization we can achieve good results.