org.htmlparser.package.html Maven / Gradle / Ivy

HTML Lexer is the low level lexical analyzer.

This package contains the basic API classes that most developers will use
when working with the HTML Parser.
The {@link org.htmlparser.Parser} class is the main high level class that
provides simplified access to the contents of an HTML page.
A wide range of methods is available to customize the operation of the Parser,
as well as to access specific pieces of the page as
{@link org.htmlparser.Node Nodes}.
The {@link org.htmlparser.NodeFactory} interface specifies what a developer
must provide for the Parser or Lexer to generate nodes. Three types of
nodes are required: {@link org.htmlparser.Text}, {@link org.htmlparser.Remark}
and {@link org.htmlparser.Tag Tags}. Tags contain lists
of child nodes and {@link org.htmlparser.Attribute attributes}.
The only provided implementation of the NodeFactory interface
is the {@link org.htmlparser.PrototypicalNodeFactory} which
operates by holding example nodes and cloning them as needed to satisfy the
requests for nodes by the Parser. By default, a Lexer is its own NodeFactory,
returning new {@link org.htmlparser.nodes.TextNode},
{@link org.htmlparser.nodes.RemarkNode} and undifferentiated
{@link org.htmlparser.nodes.TagNode TagNodes} (see the
{@link org.htmlparser.nodes nodes} package), but when the parser uses a lexer
it replaces this behaviour with a PrototypicalNodeFactory to return a rich
set of specific tags (see the {@link org.htmlparser.tags tags} package).
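
The prototype-and-clone idea described above can be illustrated with a small, self-contained sketch. Every type here (SketchTag, SketchLinkTag, SketchNodeFactory) is a hypothetical stand-in written for this example, not a class from org.htmlparser, and the real PrototypicalNodeFactory may differ in its details. The sketch only shows the general technique: the factory holds one example object per tag name and hands out clones, falling back to an undifferentiated tag when no prototype is registered.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class PrototypeFactoryDemo {
    public static void main(String[] args) {
        SketchNodeFactory factory = new SketchNodeFactory();
        factory.registerTag(new SketchLinkTag());
        System.out.println(factory.createTag("a").kind());   // prints "link"
        System.out.println(factory.createTag("div").kind()); // prints "generic"
    }
}

// Hypothetical stand-in for a tag node; not org.htmlparser.Tag.
class SketchTag implements Cloneable {
    private String name;

    SketchTag(String name) { this.name = name; }

    String getName() { return name; }
    void setName(String name) { this.name = name; }

    // Subclasses override this so callers can see which prototype was cloned.
    String kind() { return "generic"; }

    @Override
    public SketchTag clone() {
        try {
            return (SketchTag) super.clone(); // keeps the runtime subclass
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // unreachable: the class is Cloneable
        }
    }
}

// Hypothetical specific tag, registered as the prototype for "A".
class SketchLinkTag extends SketchTag {
    SketchLinkTag() { super("A"); }
    @Override String kind() { return "link"; }
}

// Hypothetical factory: holds example nodes and clones them on request.
class SketchNodeFactory {
    private final Map<String, SketchTag> prototypes = new HashMap<>();
    private final SketchTag fallback = new SketchTag("");

    void registerTag(SketchTag prototype) {
        prototypes.put(prototype.getName().toUpperCase(Locale.ROOT), prototype);
    }

    SketchTag createTag(String name) {
        String key = name.toUpperCase(Locale.ROOT);
        SketchTag copy = prototypes.getOrDefault(key, fallback).clone();
        copy.setName(key);
        return copy;
    }
}
```

Because each request returns a fresh clone, callers can modify the returned node without affecting the registered prototype.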
The {@link org.htmlparser.NodeFilter} interface is used by the filtering
code to determine if a node meets certain criteria. Some generic examples of
filters can be found in the {@link org.htmlparser.filters filters} package.
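
As a rough illustration of how a filter of this kind is applied, the following self-contained sketch uses hypothetical stand-in types (SketchNode, SketchFilter, NameFilter) rather than the library's own classes. It assumes only the general shape of the idea: a filter exposes a single accept method that returns true for nodes that should be kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FilterDemo {
    public static void main(String[] args) {
        List<SketchNode> page = Arrays.asList(
            new SketchNode("A", "home"),
            new SketchNode("P", "intro"),
            new SketchNode("A", "about"));
        for (SketchNode node : select(page, new NameFilter("a"))) {
            System.out.println(node.text); // prints "home", then "about"
        }
    }

    // Keeps only the nodes the filter accepts, preserving their order.
    static List<SketchNode> select(List<SketchNode> nodes, SketchFilter filter) {
        List<SketchNode> kept = new ArrayList<>();
        for (SketchNode node : nodes) {
            if (filter.accept(node)) {
                kept.add(node);
            }
        }
        return kept;
    }
}

// Hypothetical stand-in for a parsed node; not org.htmlparser.Node.
class SketchNode {
    final String name;
    final String text;

    SketchNode(String name, String text) {
        this.name = name;
        this.text = text;
    }
}

// Hypothetical stand-in for the filter contract: one yes/no decision per node.
interface SketchFilter {
    boolean accept(SketchNode node);
}

// Accepts nodes whose name matches, ignoring case.
class NameFilter implements SketchFilter {
    private final String name;

    NameFilter(String name) { this.name = name; }

    @Override
    public boolean accept(SketchNode node) {
        return node.name.equalsIgnoreCase(name);
    }
}
```

Keeping the contract to a single method means filters can be combined by wrapping one filter inside another.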