org.apache.lucene.analysis
abstract public class: TokenStream
java.lang.Object
org.apache.lucene.analysis.TokenStream
Direct Known Subclasses:
CJKTokenizer, PorterStemFilter, WikipediaTokenizer, RussianStemFilter, TokenFilter, CharTokenizer, EdgeNGramTokenizer, ChineseTokenizer, NGramTokenFilter, FrenchStemFilter, FastStringTokenizer, SingleTokenTokenStream, Tokenizer, RussianLowerCaseFilter, LetterTokenizer, ElisionFilter, GermanStemFilter, RussianLetterTokenizer, TokenRangeSinkTokenizer, SnowballFilter, KeywordTokenizer, EmptyTokenStream, DateRecognizerSinkTokenizer, PatternTokenizer, DutchStemFilter, CachingTokenFilter, BrazilianStemFilter, CompoundWordTokenFilterBase, ShingleMatrixFilter, ChineseFilter, EdgeNGramTokenFilter, StandardFilter, LowerCaseTokenizer, TeeTokenFilter, PrefixAndSuffixAwareTokenFilter, ShingleFilter, PrefixAwareTokenFilter, TokenTypeSinkTokenizer, SynonymTokenFilter, SinkTokenizer, StandardTokenizer, ThaiWordFilter, WhitespaceTokenizer, NumericPayloadTokenFilter, TokenOffsetPayloadTokenFilter, StopFilter, TypeAsPayloadTokenFilter, NGramTokenizer, ISOLatin1AccentFilter, QPTestFilter, LengthFilter, GreekLowerCaseFilter, DictionaryCompoundWordTokenFilter, HyphenationCompoundWordTokenFilter, LowerCaseFilter
A TokenStream enumerates the sequence of tokens, either from
fields of a document or from query text.
This is an abstract class. Concrete subclasses are:
- Tokenizer, a TokenStream whose input is a Reader; and
- TokenFilter, a TokenStream whose input is another TokenStream.
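NOTE: subclasses must override next(Token). It is also OK to
override next() instead, but that method is now deprecated in
favor of next(Token). The two default implementations call each
other, so a subclass that overrides neither will recurse without
terminating.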
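The two kinds of subclass can be sketched as follows. This is a minimal illustration, not Lucene code: Token and TokenStream here are simplified stand-ins defined in the block itself, and the names SubclassSketch, WhitespaceSketchTokenizer, LowerCaseSketchFilter and analyze are hypothetical. The stand-in Token carries only a term string, and checked IOExceptions are wrapped in analyze to keep the caller short.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in types, NOT the real Lucene classes: just enough to show the shape.
final class SubclassSketch {

    /** Simplified stand-in for org.apache.lucene.analysis.Token (assumption). */
    static final class Token {
        String term = "";
        void clear() { term = ""; }
    }

    /** Simplified stand-in for TokenStream: only next(Token) and close(). */
    static abstract class TokenStream {
        abstract Token next(Token reusableToken) throws IOException;
        void close() throws IOException {}
    }

    /** Tokenizer-style stream: its input is a Reader. */
    static final class WhitespaceSketchTokenizer extends TokenStream {
        private final Reader input;
        WhitespaceSketchTokenizer(Reader input) { this.input = input; }

        @Override
        Token next(Token reusableToken) throws IOException {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) break;   // end of the current term
                } else {
                    sb.append((char) c);
                }
            }
            if (sb.length() == 0) return null;    // end of stream
            reusableToken.clear();                // producer clears before filling
            reusableToken.term = sb.toString();
            return reusableToken;
        }

        @Override
        void close() throws IOException { input.close(); }
    }

    /** TokenFilter-style stream: its input is another TokenStream. */
    static final class LowerCaseSketchFilter extends TokenStream {
        private final TokenStream input;
        LowerCaseSketchFilter(TokenStream input) { this.input = input; }

        @Override
        Token next(Token reusableToken) throws IOException {
            Token t = input.next(reusableToken);
            if (t == null) return null;
            t.term = t.term.toLowerCase(Locale.ROOT);
            return t;
        }

        @Override
        void close() throws IOException { input.close(); }
    }

    /** Drains a tokenizer-to-filter chain, reusing one Token instance. */
    static List<String> analyze(String text) {
        try {
            TokenStream stream = new LowerCaseSketchFilter(
                    new WhitespaceSketchTokenizer(new StringReader(text)));
            List<String> terms = new ArrayList<>();
            Token reusable = new Token();
            for (Token t = stream.next(reusable); t != null; t = stream.next(reusable)) {
                terms.add(t.term);   // copy the term out before the next call
            }
            stream.close();
            return terms;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(analyze("Hello  Lucene World"));
    }
}
```

Both stand-in subclasses override only next(Token), which is the non-deprecated entry point described in the note above.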
Method detail from org.apache.lucene.analysis.TokenStream:
public void close() throws IOException {
}
Releases resources associated with this stream.
public Token next() throws IOException {
final Token reusableToken = new Token();
Token nextToken = next(reusableToken);
if (nextToken != null) {
Payload p = nextToken.getPayload();
if (p != null) {
nextToken.setPayload((Payload) p.clone());
}
}
return nextToken;
}
Deprecated! The returned Token is a "full private copy" (not
re-used across calls to next()), but this method is slower
than calling next(Token) instead.
Returns the next token in the stream, or null at EOS.
public Token next(Token reusableToken) throws IOException {
// We don't actually use reusableToken, but still add this assert
assert reusableToken != null;
return next();
}
Returns the next token in the stream, or null at EOS.
When possible, the input Token should be used as the
returned Token (this gives fastest tokenization
performance), but this is not required and a new Token
may be returned. Callers may re-use a single Token
instance for successive calls to this method.
This implicitly defines a "contract" between
consumers (callers of this method) and
producers (implementations of this method
that are the source for tokens):
- A consumer must fully consume the previously
returned Token before calling this method again.
- A producer must call Token#clear()
before setting the fields in it and returning it.
Also, the producer must make no assumptions about a
Token after it has been returned: the caller may
arbitrarily change it. If the producer needs to hold
onto the token for subsequent calls, it must clone()
it before storing it.
Note that a TokenFilter is considered a consumer.
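The contract above can be illustrated with a small sketch. It assumes a simplified stand-in Token (a term string plus clear() and clone()) rather than the real Lucene class, and the names ReuseContractSketch, ListProducer, copyTerms and holdTokens are hypothetical. It contrasts a consumer that fully consumes each Token before the next call with one that holds on to the returned reference, with and without cloning it first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-in types, NOT the real Lucene classes.
final class ReuseContractSketch {

    /** Simplified stand-in for Token (assumption): a term, clear() and clone(). */
    static final class Token implements Cloneable {
        String term = "";
        void clear() { term = ""; }
        @Override
        public Token clone() {
            try {
                return (Token) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
    }

    /** Producer: clears the reusable Token, fills it in, and returns it. */
    static final class ListProducer {
        private final Iterator<String> terms;
        ListProducer(List<String> terms) { this.terms = terms.iterator(); }

        Token next(Token reusableToken) {
            if (!terms.hasNext()) return null;   // end of stream
            reusableToken.clear();
            reusableToken.term = terms.next();
            return reusableToken;
        }
    }

    /** Consumer that fully consumes each Token before calling next again. */
    static List<String> copyTerms(List<String> input) {
        ListProducer producer = new ListProducer(input);
        List<String> seen = new ArrayList<>();
        Token reusable = new Token();
        for (Token t = producer.next(reusable); t != null; t = producer.next(reusable)) {
            seen.add(t.term);
        }
        return seen;
    }

    /**
     * Consumer that keeps Tokens beyond the next call. Without cloning, every
     * held reference points at the same instance, which is overwritten later.
     */
    static List<String> holdTokens(List<String> input, boolean cloneFirst) {
        ListProducer producer = new ListProducer(input);
        List<Token> held = new ArrayList<>();
        Token reusable = new Token();
        for (Token t = producer.next(reusable); t != null; t = producer.next(reusable)) {
            held.add(cloneFirst ? t.clone() : t);
        }
        List<String> seen = new ArrayList<>();
        for (Token t : held) {
            seen.add(t.term);
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c");
        System.out.println(copyTerms(input));
        System.out.println(holdTokens(input, false));
        System.out.println(holdTokens(input, true));
    }
}
```

In this sketch, holding uncloned references yields the last term three times, because the single reusable instance was overwritten on every call; cloning before storing preserves each term.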
public void reset() throws IOException {
}
Resets this stream to the beginning. This is an
optional operation, so subclasses may or may not
implement this method. reset() is not needed for
the standard indexing process. However, if the Tokens
of a TokenStream are intended to be consumed more than
once, it is necessary to implement reset(). Note that
if your TokenStream caches tokens and feeds them back
again after a reset, it is imperative that you
clone the tokens when you store them away (on the
first pass) as well as when you return them (on future
passes after reset()).
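A caching stream that follows this advice might look like the sketch below. It is an illustration under stated assumptions, not the Lucene CachingTokenFilter source: Token and TokenStream are simplified stand-ins, and CachingResetSketch, ListStream, CachingSketchFilter, drain and twoPasses are hypothetical names. The filter clones each Token when storing it and again when handing it back, so a caller that changes a returned Token cannot corrupt the cache.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-in types, NOT the real Lucene classes.
final class CachingResetSketch {

    /** Simplified stand-in for Token (assumption). */
    static final class Token implements Cloneable {
        String term = "";
        void clear() { term = ""; }
        @Override
        public Token clone() {
            try {
                return (Token) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
    }

    /** Simplified stand-in for TokenStream; reset() is optional, as in the text. */
    static abstract class TokenStream {
        abstract Token next(Token reusableToken);
        void reset() {}
    }

    /** Source stream over a fixed list of terms; it does not support reset(). */
    static final class ListStream extends TokenStream {
        private final Iterator<String> terms;
        ListStream(List<String> terms) { this.terms = terms.iterator(); }

        @Override
        Token next(Token reusableToken) {
            if (!terms.hasNext()) return null;
            reusableToken.clear();
            reusableToken.term = terms.next();
            return reusableToken;
        }
    }

    /** Caches its input on first use and replays it after reset(). */
    static final class CachingSketchFilter extends TokenStream {
        private final TokenStream input;
        private List<Token> cache;   // null until the first call to next
        private int pos = 0;

        CachingSketchFilter(TokenStream input) { this.input = input; }

        @Override
        Token next(Token reusableToken) {
            if (cache == null) {
                cache = new ArrayList<>();
                for (Token t = input.next(reusableToken); t != null;
                        t = input.next(reusableToken)) {
                    cache.add(t.clone());          // clone when storing
                }
            }
            if (pos >= cache.size()) return null;  // end of stream
            return cache.get(pos++).clone();       // clone when returning
        }

        @Override
        void reset() { pos = 0; }
    }

    /** Drains a stream; optionally overwrites each Token after reading it. */
    static List<String> drain(TokenStream stream, boolean scribble) {
        List<String> out = new ArrayList<>();
        Token reusable = new Token();
        for (Token t = stream.next(reusable); t != null; t = stream.next(reusable)) {
            out.add(t.term);
            if (scribble) t.term = "overwritten";  // the caller may change the Token
        }
        return out;
    }

    /** Consumes the same cached stream twice, with a reset() in between. */
    static List<List<String>> twoPasses(List<String> terms) {
        TokenStream stream = new CachingSketchFilter(new ListStream(terms));
        List<String> first = drain(stream, true);
        stream.reset();
        List<String> second = drain(stream, false);
        return Arrays.asList(first, second);
    }

    public static void main(String[] args) {
        System.out.println(twoPasses(Arrays.asList("quick", "brown", "fox")));
    }
}
```

If the final clone() in next were dropped, the first pass in twoPasses would overwrite the cached Tokens and the second pass would no longer see the original terms.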