org.inria.ns.reflex.util.io
Class RegexpTokenizer

java.lang.Object
  extended by org.inria.ns.reflex.util.io.RegexpTokenizer

public class RegexpTokenizer
extends Object

A RegexpTokenizer tokenizes a stream of characters according to a regular expression.

This class breaks the input stream into an iterator over strings, treating any substring that matches the pattern as a separator. The separators themselves are not returned.

Unlike the well-known UNIX grep command, it does not split the input into lines; it applies the pattern to the entire input while still reading it in a streaming fashion. This class uses an internal buffer that must be big enough for the pattern to match. The default size of the buffer is 2048.

When no character sequence in the buffer matches the pattern, the whole chunk of data is returned.
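
The buffering behaviour described above can be sketched with the JDK alone. The class below is a hypothetical stand-in, not the RegexpTokenizer source: the names BufferedSplitSketch and split are invented for illustration, and it assumes the separator never matches the empty string. It collects the tokens into a list for brevity, whereas the class documented here returns an iterator.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not the RegexpTokenizer implementation.
class BufferedSplitSketch {

    // Splits the input on the separator while holding at most bufferSize chars.
    // Assumption: the separator never matches the empty string.
    static List<String> split(Reader input, Pattern separator, int bufferSize) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        char[] chunk = new char[bufferSize];
        boolean eof = false;
        try {
            while (true) {
                // Top up the buffer until it is full or the input is exhausted.
                while (!eof && buffer.length() < bufferSize) {
                    int n = input.read(chunk, 0, bufferSize - buffer.length());
                    if (n < 0) {
                        eof = true;
                    } else {
                        buffer.append(chunk, 0, n);
                    }
                }
                if (buffer.length() == 0) {
                    break; // nothing left to emit
                }
                Matcher m = separator.matcher(buffer);
                if (m.find() && m.end() > m.start()) {
                    // Emit the text before the separator, then drop both.
                    int start = m.start();
                    int end = m.end();
                    tokens.add(buffer.substring(0, start));
                    buffer.delete(0, end);
                } else {
                    // No separator in the buffer: emit the whole chunk.
                    tokens.add(buffer.toString());
                    buffer.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A deliberately tiny buffer of 4 chars makes the chunking visible.
        List<String> tokens = split(new StringReader("abcdefgh,ij"), Pattern.compile(","), 4);
        System.out.println(tokens); // prints [abcd, efgh, , ij]
    }
}
```

In this sketch, a separator that straddles two buffer loads is missed, which illustrates why the buffer has to be large enough for the pattern to match. How the real class handles that case is not stated here.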

The same instance can be used to tokenize several different inputs simultaneously.
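
One way such sharing can work is sketched below, using java.util.Scanner as a stand-in for the tokenizer. A compiled java.util.regex.Pattern is immutable, so a single instance can drive several independent readers, each keeping its own position. The class name SharedPatternSketch and the method interleave are hypothetical, and nothing here is taken from the RegexpTokenizer source.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical illustration: one shared Pattern, two inputs consumed in turn.
class SharedPatternSketch {

    // A single compiled pattern, shared by every input.
    static final Pattern COMMA = Pattern.compile(",");

    // Alternates between the tokens of two inputs split by the same pattern.
    static List<String> interleave(String first, String second) {
        List<String> out = new ArrayList<>();
        try (Scanner a = new Scanner(new StringReader(first)).useDelimiter(COMMA);
             Scanner b = new Scanner(new StringReader(second)).useDelimiter(COMMA)) {
            while (a.hasNext() || b.hasNext()) {
                if (a.hasNext()) {
                    out.add(a.next());
                }
                if (b.hasNext()) {
                    out.add(b.next());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(interleave("a,b,c", "1,2")); // prints [a, 1, b, 2, c]
    }
}
```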

Author:
Philippe Poulard

Field Summary
static Pattern WHITE_SPACE
          A pattern for splitting on whitespace.
 
Constructor Summary
RegexpTokenizer(Pattern pattern)
          Create a new tokenizer.
RegexpTokenizer(Pattern pattern, int bufferSize)
          Create a new tokenizer.
RegexpTokenizer(String regexp)
          Create a new tokenizer.
RegexpTokenizer(String regexp, int flags)
          Create a new tokenizer.
RegexpTokenizer(String regexp, int flags, int bufferSize)
          Create a new tokenizer.
 
Method Summary
 Iterator tokenize(Reader input)
          Return an iterator over the strings found in the input.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

WHITE_SPACE

public static final Pattern WHITE_SPACE
A pattern for splitting on whitespace.

Constructor Detail

RegexpTokenizer

public RegexpTokenizer(Pattern pattern,
                       int bufferSize)
Create a new tokenizer.

Parameters:
pattern - The compiled regexp, or null for "\s*".
bufferSize - The size of the internal buffer.

RegexpTokenizer

public RegexpTokenizer(Pattern pattern)
Create a new tokenizer.

Parameters:
pattern - The compiled regexp.

RegexpTokenizer

public RegexpTokenizer(String regexp,
                       int flags,
                       int bufferSize)
Create a new tokenizer.

Parameters:
regexp - The regexp.
flags - The flags used to compile the regexp.
bufferSize - The size of the internal buffer.
See Also:
Pattern
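
The effect of the flags parameter can be sketched with java.util.regex.Pattern directly. This assumes, as the "See Also" reference suggests, that the flags are the standard Pattern constants such as Pattern.CASE_INSENSITIVE. The class FlagsSketch and the method countTokens are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of how compile flags change where a regexp splits.
class FlagsSketch {

    // Compiles the regexp with the given flags and counts the resulting tokens.
    static int countTokens(String regexp, int flags, String text) {
        return Pattern.compile(regexp, flags).split(text).length;
    }

    public static void main(String[] args) {
        String text = "a AND b and c";
        // Without flags, only the lower-case "and" acts as a separator.
        System.out.println(countTokens("and", 0, text)); // prints 2
        // With CASE_INSENSITIVE, "AND" is a separator as well.
        System.out.println(countTokens("and", Pattern.CASE_INSENSITIVE, text)); // prints 3
    }
}
```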

RegexpTokenizer

public RegexpTokenizer(String regexp,
                       int flags)
Create a new tokenizer.

Parameters:
regexp - The regexp.
flags - The flags used to compile the regexp.

RegexpTokenizer

public RegexpTokenizer(String regexp)
Create a new tokenizer.

Parameters:
regexp - The regexp.

Method Detail

tokenize

public Iterator tokenize(Reader input)
Return an iterator over the strings found in the input.

The input is read progressively, as the iterator is consumed.

Parameters:
input - The input to tokenize.
Returns:
An iterator over the strings found, excluding those that matched the pattern. The iterator may contain empty strings.
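
Because the returned iterator is untyped and may contain empty strings, a caller will typically cast each element and may want to skip the empty ones. The helper below is a minimal sketch of that pattern. NonEmptyTokensSketch and nonEmpty are hypothetical names, and a list iterator stands in for the result of tokenize so that the example runs without the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper for consuming an untyped iterator of string tokens.
class NonEmptyTokensSketch {

    // Casts each element to String and keeps only the non-empty tokens.
    static List<String> nonEmpty(Iterator<?> tokens) {
        List<String> kept = new ArrayList<>();
        while (tokens.hasNext()) {
            String token = (String) tokens.next();
            if (!token.isEmpty()) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in for the iterator returned by tokenize(Reader).
        Iterator<?> standIn = Arrays.asList("abcd", "", "ij").iterator();
        System.out.println(nonEmpty(standIn)); // prints [abcd, ij]
    }
}
```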