FragListBuilder | FragListBuilder is an interface for FieldFragList builder classes. |
FragmentsBuilder | FragmentsBuilder is an interface for fragments (snippets) builder classes. |
BaseFragmentsBuilder | |
FastVectorHighlighter | Another highlighter implementation. |
FieldFragList | FieldFragList has a list of "frag info" that is used by the FragmentsBuilder class to create fragments (snippets). |
FieldFragList.WeightedFragInfo | |
FieldFragList.WeightedFragInfo.SubInfo | |
FieldPhraseList | FieldPhraseList has a list of WeightedPhraseInfo that is used by FragListBuilder to create a FieldFragList object. |
FieldPhraseList.WeightedPhraseInfo | |
FieldPhraseList.WeightedPhraseInfo.Toffs | |
FieldQuery | FieldQuery breaks a query object down into terms and phrases and keeps them in a QueryPhraseMap structure. |
FieldQuery.QueryPhraseMap | |
FieldTermStack | FieldTermStack is a stack that keeps the query terms found in the specified field of the document to be highlighted. |
FieldTermStack.TermInfo | |
ScoreOrderFragmentsBuilder | An implementation of FragmentsBuilder that outputs score-order fragments. |
ScoreOrderFragmentsBuilder.ScoreComparator | |
SimpleFragListBuilder | A simple implementation of FragListBuilder. |
SimpleFragmentsBuilder | A simple implementation of FragmentsBuilder. |
To explain the algorithm, let's use the following sample text (to be highlighted) and user query:
Sample Text | Lucene is a search engine library. |
User Query | Lucene^2 OR "search library"~1 |
The user query is a BooleanQuery that consists of TermQuery("Lucene") with a boost of 2 and PhraseQuery("search library") with a slop of 1.
For your convenience, here are the offsets and positions of the sample text.
+--------+-----------------------------------+
|        |          1111111111222222222233333|
|  offset|01234567890123456789012345678901234|
+--------+-----------------------------------+
|document|Lucene is a search engine library. |
+--------+-----------------------------------+
|position|0      1  2 3      4      5        |
+--------+-----------------------------------+
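The offsets and positions in the table can be reproduced with a few lines of plain Java. This is only a sketch: OffsetSketch and its Token class are hypothetical names, and the toy tokenizer (split on anything that is not a letter or digit) stands in for whatever analyzer is actually used at index time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Lucene class: a toy tokenizer that splits on
// anything that is not a letter or digit.
class OffsetSketch {
    static final class Token {
        final String text;
        final int startOffset, endOffset, position;

        Token(String text, int startOffset, int endOffset, int position) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.position = position;
        }

        @Override
        public String toString() {
            return "\"" + text + "\"(" + startOffset + "," + endOffset + "," + position + ")";
        }
    }

    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip separators (spaces, punctuation).
            while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) i++;
            if (i > start) {
                // The position is the token's index in the token stream.
                tokens.add(new Token(text.substring(start, i), start, i, tokens.size()));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("Lucene is a search engine library.")) {
            System.out.println(t);
        }
    }
}
```

Each token prints as "termText"(startOffset,endOffset,position), the same notation the later steps use.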
In Step 1, Fast Vector Highlighter generates org.apache.lucene.search.vectorhighlight.FieldQuery.QueryPhraseMap from the user query. QueryPhraseMap consists of the following members:
public class QueryPhraseMap {
  boolean terminal;
  int slop;    // valid if terminal == true and phraseHighlight == true
  float boost; // valid if terminal == true
  Map<String, QueryPhraseMap> subMap;
}
QueryPhraseMap has a subMap. Each key of the subMap is a term text from the user query, and each value is the subsequent QueryPhraseMap. If the query is a term (not a phrase), the subsequent QueryPhraseMap is marked as terminal. If the query is a phrase, the subsequent QueryPhraseMap is not terminal and holds the next term text of the phrase.
From the sample user query, the following QueryPhraseMap will be generated:
QueryPhraseMap
+--------+-+  +-------+-+
|"Lucene"|o+->|boost=2|*|  * : terminal
+--------+-+  +-------+-+

+--------+-+  +---------+-+  +-------+------+-+
|"search"|o+->|"library"|o+->|boost=1|slop=1|*|
+--------+-+  +---------+-+  +-------+------+-+
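To make the nested-map structure concrete, here is a minimal sketch of how such a map could be populated for the sample query. PhraseMapSketch and its add() helper are hypothetical; only the four field names are taken from the member listing above, and the real FieldQuery builds its map from Query objects rather than from string arrays.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for FieldQuery.QueryPhraseMap. The fields mirror the
// listing above; the add() helper is an assumption made for illustration.
class PhraseMapSketch {
    boolean terminal;
    int slop;
    float boost;
    final Map<String, PhraseMapSketch> subMap = new LinkedHashMap<>();

    // Walk (and create) one node per term, then mark the last node terminal.
    void add(String[] terms, int slop, float boost) {
        PhraseMapSketch node = this;
        for (String term : terms) {
            PhraseMapSketch next = node.subMap.get(term);
            if (next == null) {
                next = new PhraseMapSketch();
                node.subMap.put(term, next);
            }
            node = next;
        }
        node.terminal = true;
        node.slop = slop;
        node.boost = boost;
    }

    static PhraseMapSketch sample() {
        PhraseMapSketch root = new PhraseMapSketch();
        root.add(new String[] {"Lucene"}, 0, 2f);            // Lucene^2
        root.add(new String[] {"search", "library"}, 1, 1f); // "search library"~1
        return root;
    }

    public static void main(String[] args) {
        PhraseMapSketch root = sample();
        System.out.println("Lucene terminal: " + root.subMap.get("Lucene").terminal);
        System.out.println("search terminal: " + root.subMap.get("search").terminal);
    }
}
```

A single term ends in a terminal node immediately, while a phrase needs one hop per term before it reaches its terminal node, which is where boost and slop are stored.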
In Step 2, Fast Vector Highlighter generates org.apache.lucene.search.vectorhighlight.FieldTermStack. To generate it, Fast Vector Highlighter uses org.apache.lucene.index.TermFreqVector data (the field must store term vectors with org.apache.lucene.document.Field.TermVector#WITH_POSITIONS_OFFSETS). FieldTermStack keeps the terms that appear in the user query. Therefore, in this sample case, Fast Vector Highlighter generates the following FieldTermStack:
FieldTermStack
+------------------+
|"Lucene"(0,6,0)   |
+------------------+
|"search"(12,18,3) |
+------------------+
|"library"(26,33,5)|
+------------------+
where : "termText"(startOffset,endOffset,position)
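The filtering in Step 2 can be sketched as follows. This is not the FieldTermStack source: TermStackSketch is a hypothetical class, the term list is hard-coded from the offsets table instead of being read from a term vector, and matching is done on exact term text.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: keep only the field terms that occur in the query,
// ordered by position.
class TermStackSketch {
    static final class TermInfo {
        final String text;
        final int startOffset, endOffset, position;

        TermInfo(String text, int startOffset, int endOffset, int position) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.position = position;
        }

        @Override
        public String toString() {
            return "\"" + text + "\"(" + startOffset + "," + endOffset + "," + position + ")";
        }
    }

    // fieldTerms is assumed to be sorted by position already.
    static Deque<TermInfo> build(List<TermInfo> fieldTerms, Set<String> queryTerms) {
        Deque<TermInfo> stack = new ArrayDeque<>();
        for (TermInfo t : fieldTerms) {
            if (queryTerms.contains(t.text)) {
                stack.addLast(t);
            }
        }
        return stack;
    }

    // Terms of the sample text, taken from the offsets and positions table.
    static List<TermInfo> sampleFieldTerms() {
        return Arrays.asList(
            new TermInfo("Lucene", 0, 6, 0), new TermInfo("is", 7, 9, 1),
            new TermInfo("a", 10, 11, 2), new TermInfo("search", 12, 18, 3),
            new TermInfo("engine", 19, 25, 4), new TermInfo("library", 26, 33, 5));
    }

    public static void main(String[] args) {
        Set<String> queryTerms = new HashSet<>(Arrays.asList("Lucene", "search", "library"));
        for (TermInfo t : build(sampleFieldTerms(), queryTerms)) {
            System.out.println(t);
        }
    }
}
```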
In Step 3, Fast Vector Highlighter generates org.apache.lucene.search.vectorhighlight.FieldPhraseList by reference to QueryPhraseMap and FieldTermStack.
FieldPhraseList
+----------------+-----------------+---+
|"Lucene"        |[(0,6)]          |w=2|
+----------------+-----------------+---+
|"search library"|[(12,18),(26,33)]|w=1|
+----------------+-----------------+---+
The type of each entry is WeightedPhraseInfo, which consists of an array of term offsets and a weight. The weight (Fast Vector Highlighter uses the query boost to calculate it) will be taken into account when Fast Vector Highlighter creates org.apache.lucene.search.vectorhighlight.FieldFragList in the next step.
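One way to picture Step 3 is a loop that pops terms off the stack and walks the phrase map until it reaches a terminal node. The sketch below is an assumption-laden simplification, not the FieldPhraseList source: all class and method names are hypothetical, a match stops at the first terminal node it reaches, and the slop test simply compares position gaps between consecutive terms.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of matching a term stack against a phrase map.
class PhraseListSketch {
    static final class TermInfo {
        final String text; final int start, end, position;
        TermInfo(String text, int start, int end, int position) {
            this.text = text; this.start = start; this.end = end; this.position = position;
        }
    }

    static final class Node {
        boolean terminal; int slop; float boost;
        final Map<String, Node> subMap = new HashMap<>();
    }

    static final class PhraseInfo {
        final String text; final float weight;
        final List<int[]> offsets = new ArrayList<>();
        PhraseInfo(List<TermInfo> terms, float weight) {
            StringBuilder sb = new StringBuilder();
            for (TermInfo t : terms) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(t.text);
                offsets.add(new int[] {t.start, t.end});
            }
            this.text = sb.toString();
            this.weight = weight; // assumption: weight is just the query boost
        }
    }

    static void addQuery(Node root, float boost, int slop, String... terms) {
        Node node = root;
        for (String term : terms) {
            Node next = node.subMap.get(term);
            if (next == null) { next = new Node(); node.subMap.put(term, next); }
            node = next;
        }
        node.terminal = true; node.boost = boost; node.slop = slop;
    }

    // Simplified slop test: the gap between consecutive terms must not exceed slop.
    static boolean withinSlop(List<TermInfo> terms, int slop) {
        for (int i = 1; i < terms.size(); i++) {
            int gap = terms.get(i).position - terms.get(i - 1).position - 1;
            if (Math.abs(gap) > slop) return false;
        }
        return true;
    }

    static List<PhraseInfo> match(Deque<TermInfo> stack, Node root) {
        List<PhraseInfo> phrases = new ArrayList<>();
        while (!stack.isEmpty()) {
            TermInfo first = stack.pop();
            Node node = root.subMap.get(first.text);
            if (node == null) continue;
            List<TermInfo> candidate = new ArrayList<>();
            candidate.add(first);
            while (!node.terminal && !stack.isEmpty()) {
                Node next = node.subMap.get(stack.peek().text);
                if (next == null) break;
                candidate.add(stack.pop());
                node = next;
            }
            if (node.terminal && withinSlop(candidate, node.slop)) {
                phrases.add(new PhraseInfo(candidate, node.boost));
            } else {
                // No match: return the continuation terms so they can start a new match.
                for (int i = candidate.size() - 1; i >= 1; i--) stack.push(candidate.get(i));
            }
        }
        return phrases;
    }

    static List<PhraseInfo> sample() {
        Node root = new Node();
        addQuery(root, 2f, 0, "Lucene");
        addQuery(root, 1f, 1, "search", "library");
        Deque<TermInfo> stack = new ArrayDeque<>();
        stack.addLast(new TermInfo("Lucene", 0, 6, 0));
        stack.addLast(new TermInfo("search", 12, 18, 3));
        stack.addLast(new TermInfo("library", 26, 33, 5));
        return match(stack, root);
    }

    public static void main(String[] args) {
        for (PhraseInfo p : sample()) System.out.println("\"" + p.text + "\" w=" + p.weight);
    }
}
```

For the sample data, "search" (position 3) and "library" (position 5) have one term between them, which is within the slop of 1, so the phrase is accepted.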
In Step 4, Fast Vector Highlighter creates FieldFragList by reference to FieldPhraseList. In this sample case, the following FieldFragList will be generated:
FieldFragList
+---------------------------------+
|"Lucene"[(0,6)]                  |
|"search library"[(12,18),(26,33)]|
|totalBoost=3                     |
+---------------------------------+
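The totalBoost value is the sum of the weights of the phrases that land in the same fragment. The following sketch illustrates that grouping under simplifying assumptions: FragListSketch is hypothetical, a fragment is a fixed-size character window that opens at the first phrase it contains, and no margin is added around phrases. The actual FragListBuilder implementations may choose fragment boundaries differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: group phrases into fixed-size fragments and sum their boosts.
class FragListSketch {
    static final class Phrase {
        final String text; final int start, end; final float boost;
        Phrase(String text, int start, int end, float boost) {
            this.text = text; this.start = start; this.end = end; this.boost = boost;
        }
    }

    static final class Frag {
        final int start, end;
        final List<Phrase> phrases = new ArrayList<>();
        float totalBoost;
        Frag(int start, int end) { this.start = start; this.end = end; }
    }

    // phrases is assumed to be sorted by start offset.
    static List<Frag> build(List<Phrase> phrases, int fragCharSize) {
        List<Frag> frags = new ArrayList<>();
        Frag current = null;
        for (Phrase p : phrases) {
            // A phrase longer than a fragment can never fit; skip it.
            if (p.end - p.start > fragCharSize) continue;
            // Open a new fragment when the phrase does not fit in the current one.
            if (current == null || p.end > current.end) {
                current = new Frag(p.start, p.start + fragCharSize);
                frags.add(current);
            }
            current.phrases.add(p);
            current.totalBoost += p.boost;
        }
        return frags;
    }

    // The two entries of the sample FieldPhraseList, using the span from the
    // first term's start offset to the last term's end offset.
    static List<Phrase> samplePhrases() {
        return Arrays.asList(
            new Phrase("Lucene", 0, 6, 2f),
            new Phrase("search library", 12, 33, 1f));
    }

    public static void main(String[] args) {
        for (Frag f : build(samplePhrases(), 100)) {
            System.out.println("frag[" + f.start + "," + f.end + ") totalBoost=" + f.totalBoost);
        }
    }
}
```

With a window of 100 characters both sample phrases share one fragment, so their boosts add up to 3. A smaller window, for example 25, splits them into two fragments with separate totals.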
In Step 5, by using FieldFragList and the field stored data, Fast Vector Highlighter creates highlighted snippets!
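As a rough illustration of this last step, the sketch below wraps each matched offset range of the stored text in tags. SnippetSketch and highlight() are hypothetical names, the tag strings are passed in by the caller, and the offsets are assumed to be sorted and non-overlapping; the real FragmentsBuilder classes also handle fragment boundaries and multiple fragments.

```java
// Hypothetical sketch: insert pre/post tags around matched offset ranges.
class SnippetSketch {
    // offsets holds {startOffset, endOffset} pairs, sorted and non-overlapping.
    static String highlight(String text, int[][] offsets, String preTag, String postTag) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (int[] o : offsets) {
            sb.append(text, last, o[0]);                                  // text before the match
            sb.append(preTag).append(text, o[0], o[1]).append(postTag);  // the tagged match
            last = o[1];
        }
        sb.append(text, last, text.length());                            // trailing text
        return sb.toString();
    }

    public static void main(String[] args) {
        String stored = "Lucene is a search engine library.";
        int[][] offsets = { {0, 6}, {12, 18}, {26, 33} };
        // prints: <b>Lucene</b> is a <b>search</b> engine <b>library</b>.
        System.out.println(highlight(stored, offsets, "<b>", "</b>"));
    }
}
```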