org.htmlparser.scanners.package.html

HTML Lexer is the low level lexical analyzer.

The scanners package contains the classes responsible for the tertiary
(third-stage) identification of tags. The lower-level classes in the {@link
org.htmlparser.lexer.Lexer lexer} package handle the first two stages: they
convert byte streams to characters and characters to nodes (via the {@link
org.htmlparser.NodeFactory NodeFactory}). In the case of tags, the
scanners in this package can then either complete the tag or override the
current tag and return an augmented tag. The existing implementation of the {@link
org.htmlparser.scanners.CompositeTagScanner composite tag
scanner}, for example, gathers the children of composite tags, thereby
identifying the nested structure of HTML documents. The {@link
org.htmlparser.scanners.ScriptScanner script scanner} overrides the nodes
returned by the lexer and creates a tag containing a single string that is the
script code.
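
The child-gathering step described above can be illustrated with a small,
self-contained sketch. It does not use the library's classes: Node, scan and
render are hypothetical stand-ins invented for this illustration, and the
sketch assumes that every start tag is a composite tag with a matching end
tag, which is not true of real HTML.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.StringJoiner;

public class CompositeScanSketch {

    /** Hypothetical stand-in for a lexer node: a start tag, an end tag, or text. */
    static final class Node {
        final String name;     // tag name, or null for a text node
        final String text;     // text content, or null for a tag
        final boolean endTag;  // true for an end tag such as </p>
        final List<Node> children = new ArrayList<>();

        private Node(String name, String text, boolean endTag) {
            this.name = name;
            this.text = text;
            this.endTag = endTag;
        }

        static Node open(String name)  { return new Node(name, null, false); }
        static Node close(String name) { return new Node(name, null, true); }
        static Node text(String text)  { return new Node(null, text, false); }
    }

    /**
     * Hypothetical scanner step: pulls nodes from the flat stream and attaches
     * them to the given start tag until the matching end tag is seen.
     * Assumption of this sketch: every start tag is treated as composite.
     */
    static Node scan(Node tag, Iterator<Node> lexer) {
        while (lexer.hasNext()) {
            Node next = lexer.next();
            if (next.endTag && next.name.equals(tag.name)) {
                return tag;                 // matching end tag completes the tag
            }
            if (next.name != null && !next.endTag) {
                next = scan(next, lexer);   // nested start tag: gather its children first
            }
            tag.children.add(next);         // text, completed tag, or stray end tag
        }
        return tag;                         // input ended before the end tag
    }

    /** Renders the nested structure, e.g. div(a,p(b)). */
    static String render(Node node) {
        if (node.name == null) {
            return node.text;
        }
        StringJoiner joined = new StringJoiner(",", node.name + "(", ")");
        for (Node child : node.children) {
            joined.add(render(child));
        }
        return joined.toString();
    }

    /** Flat stream for <div>a<p>b</p></div>, as a lexer might deliver it. */
    static String demo() {
        Iterator<Node> lexer = Arrays.asList(
                Node.open("div"), Node.text("a"),
                Node.open("p"), Node.text("b"), Node.close("p"),
                Node.close("div")).iterator();
        Node root = lexer.next();
        return render(scan(root, lexer));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints div(a,p(b))
    }
}
```

The recursion is what turns the flat node stream into a tree: each start tag
takes over the stream until its own end tag appears, then hands control back
to its parent.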

You might need to create a scanner (a class that implements the
{@link org.htmlparser.scanners.Scanner Scanner} interface) if
the text you are trying to parse doesn't look like HTML, as is the case for the
script scanner, or if the normal processing of tags by nesting their structure
is inadequate.
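
As a rough illustration of the first case, the sketch below treats everything
after a start tag as raw text up to the matching end tag, instead of letting
it be split into HTML nodes. It is only loosely modeled on what the script
scanner is described as doing. Cursor and scanRaw are hypothetical names made
up for this example and are not part of the Scanner interface, and the sketch
ignores complications such as an end tag appearing inside a script string.

```java
public class RawTextScanSketch {

    /** Hypothetical stand-in for the lexer's position within the page text. */
    static final class Cursor {
        final String page;
        int position;

        Cursor(String page, int position) {
            this.page = page;
            this.position = position;
        }
    }

    /**
     * Returns the raw text from the cursor up to (not including) the end tag
     * for tagName, matched case-insensitively, and moves the cursor past that
     * end tag. If no end tag is found, the rest of the page is returned.
     */
    static String scanRaw(String tagName, Cursor cursor) {
        String endTag = "</" + tagName + ">";
        int start = cursor.position;
        for (int i = start; i < cursor.page.length(); i++) {
            // regionMatches returns false rather than throwing near the end of the page
            if (cursor.page.regionMatches(true, i, endTag, 0, endTag.length())) {
                cursor.position = i + endTag.length();
                return cursor.page.substring(start, i);
            }
        }
        cursor.position = cursor.page.length();
        return cursor.page.substring(start);
    }

    public static void main(String[] args) {
        String page = "<script>if (a < b) { go(); }</SCRIPT><p>after</p>";
        // Assume the start tag has already been consumed by the lexer.
        Cursor cursor = new Cursor(page, "<script>".length());
        System.out.println(scanRaw("script", cursor));       // the script code as one string
        System.out.println(page.substring(cursor.position)); // what remains for normal parsing
    }
}
```

The point of the design is that the `<` inside the script body never reaches
the normal tag-recognition logic; the scanner consumes the text itself and
returns control only after the end tag.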