public class SpanishTokenizer<T extends HasWord> extends AbstractTokenizer<T>
Tokenizes Spanish text according to a modified version of the AnCora corpus tokenization standard, so its rules differ somewhat from Penn Treebank (PTB) tokenization.
A single SpanishTokenizer instance is not thread-safe, because it uses a non-thread-safe JFlex object to do the processing. Creating multiple instances is safe, however. A single SpanishTokenizerFactory instance is also not thread-safe, because it keeps its options in a local variable.
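One way to respect the threading constraint described above is to create a fresh tokenizer inside each task instead of sharing one across threads. The sketch below illustrates only that pattern. `StatefulTokenizer` is a hypothetical stand-in (a whitespace splitter with a mutable cursor), not the CoreNLP class, so the example runs without the library on the classpath; in real code each task would obtain its own SpanishTokenizer instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerTaskTokenizerSketch {

    /** Hypothetical stand-in for a stateful, non-thread-safe tokenizer. */
    static final class StatefulTokenizer {
        private final String[] parts;
        private int pos = 0; // mutable cursor: the reason an instance must not be shared

        StatefulTokenizer(String text) {
            String trimmed = text.trim();
            this.parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        boolean hasNext() { return pos < parts.length; }

        String next() { return parts[pos++]; }
    }

    /** Tokenizes every text on a thread pool, creating one tokenizer per task. */
    public static List<List<String>> tokenizeAll(List<String> texts, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String text : texts) {
                Callable<List<String>> task = () -> {
                    // New instance per task: no tokenizer is ever visible to two threads.
                    StatefulTokenizer tok = new StatefulTokenizer(text);
                    List<String> out = new ArrayList<>();
                    while (tok.hasNext()) {
                        out.add(tok.next());
                    }
                    return out;
                };
                futures.add(pool.submit(task));
            }
            // Collect results in submission order.
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenizeAll(Arrays.asList("Vamos al mercado", "Del dicho al hecho"), 2));
    }
}
```

The same reasoning would apply to the factory: since a single factory instance is documented as not thread-safe, giving each thread its own factory is the conservative choice.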
Modifier and Type | Class and Description |
---|---|
static class | SpanishTokenizer.SpanishTokenizerFactory<T extends HasWord>: A factory for Spanish tokenizer instances. |
Modifier and Type | Field and Description |
---|---|
static String | ANCORA_OPTIONS |
static String | DEFAULT_OPTIONS |

Fields inherited from class AbstractTokenizer: NEWLINE_TOKEN, nextToken

Constructor and Description |
---|
SpanishTokenizer(Reader r, LexedTokenFactory<T> tf, Properties lexerProperties, boolean splitCompounds, boolean splitVerbs, boolean splitContractions): Constructor. |
Modifier and Type | Method and Description |
---|---|
static TokenizerFactory<CoreLabel> | ancoraFactory(): Returns a tokenizer factory that uses AnCora tokenization. |
static TokenizerFactory<CoreLabel> | coreLabelFactory(): Returns a factory that vends CoreLabel tokens with default tokenization. |
static TokenizerFactory<CoreLabel> | factory() |
static <T extends HasWord> TokenizerFactory<T> | factory(LexedTokenFactory<T> factory) |
static <T extends HasWord> TokenizerFactory<T> | factory(LexedTokenFactory<T> factory, String options): The recommended factory method. |
protected T | getNext(): Internally fetches the next token. |
static void | main(String[] args): A fast, rule-based tokenizer for Spanish based on AnCora. |
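
Methods inherited from class AbstractTokenizer: hasNext, next, peek, remove, tokenize
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface java.util.Iterator: forEachRemaining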
public static final String ANCORA_OPTIONS
public static final String DEFAULT_OPTIONS
public SpanishTokenizer(Reader r, LexedTokenFactory<T> tf, Properties lexerProperties, boolean splitCompounds, boolean splitVerbs, boolean splitContractions)
Parameters: r, tf, lexerProperties, splitCompounds
protected T getNext()
Description copied from class: AbstractTokenizer
Internally fetches the next token.
Specified by: getNext in class AbstractTokenizer<T extends HasWord>
public static <T extends HasWord> TokenizerFactory<T> factory(LexedTokenFactory<T> factory, String options)
public static <T extends HasWord> TokenizerFactory<T> factory(LexedTokenFactory<T> factory)
public static TokenizerFactory<CoreLabel> ancoraFactory()
public static TokenizerFactory<CoreLabel> coreLabelFactory()
public static TokenizerFactory<CoreLabel> factory()
public static void main(String[] args)
Currently, this tokenizer does not do line splitting. It assumes that the input file is delimited by the system line separator. The output will be equivalently delimited.
Parameters: args - Command-line arguments
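The line-handling behaviour described for `main` can be pictured roughly as follows: each input line is treated as one unit, and one output line is written per input line, terminated by the system line separator. This is only a sketch of that contract, not the actual implementation. `tokenizeLine` is a hypothetical placeholder that splits on whitespace, where the real tool would apply the Spanish tokenization rules, and `process` is likewise an illustrative name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LineDelimitedSketch {

    /** Hypothetical placeholder for real tokenization: splits a line on whitespace. */
    public static List<String> tokenizeLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Reads input line by line and emits one space-joined token line per input line. */
    public static String process(Reader input) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            // Note: readLine() accepts \n, \r and \r\n, so this sketch is more
            // lenient than the "system line separator" assumption in the text.
            while ((line = reader.readLine()) != null) {
                out.append(String.join(" ", tokenizeLine(line)));
                // Output lines are delimited by the system line separator.
                out.append(System.lineSeparator());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String nl = System.lineSeparator();
        System.out.print(process(new StringReader("Vamos  al mercado" + nl + "Del dicho al hecho" + nl)));
    }
}
```

Because no sentence or line splitting is performed, an input line holding several sentences stays on a single output line in this sketch.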