Jericho HTML Parser
Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML.
It can also determine the data structures represented in an HTML form.
It is an open source library released under the GNU Lesser General Public License (LGPL).
You are therefore free to use it in commercial applications subject to the terms detailed in the licence document.
For downloads, support, updates and release notes visit the SourceForge.net project page at
http://sourceforge.net/projects/jerichohtml/
Please let me know if you are using the library in your own project or find it useful in any way.
You can also rate it at http://freshmeat.net/projects/jerichohtml/
All classes and methods have been comprehensively documented in the javadocs.
The package description
contains a brief overview of how to use the package.
At this time no files have been submitted into CVS. If others are interested in extending or porting the library, a CVS repository will be made available.
Features
The library distinguishes itself from other HTML parsers by its four major features:
-
No parse tree of the entire document is ever generated. In this sense the library
is, strictly speaking, not a true parser. The document source text is searched only for
the markup relevant to the current operation. This allows the library to analyse
and modify documents containing incorrect or badly formatted HTML
or any other server or client side code, script, macro or markup. Most other
parsers can't handle content that they are not explicitly programmed to accept.
-
The beginning and end positions in the source text of all parsed segments are accessible,
allowing modification of only selected segments of the document without having to reconstruct
the entire document from a parse tree. This feature, in combination with the one above,
makes the toolkit extremely powerful in its simplicity.
-
An entire set of
FormField
objects can automatically be generated
from the source document. These provide a very useful means for determining how to store
and present data that is submitted from an arbitrary HTML form.
-
ASP,
JSP,
PSP,
PHP and
Mason
server tags are explicitly recognised as accurately as is possible without incorporating
actual parsers for these languages into the library.
The library then allows any of these segments to be ignored when parsing the rest of the document
so that they do not interfere with the HTML syntax. (see
Segment.ignoreWhenParsing()
)
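The first two features can be illustrated with a minimal sketch. It does not use the library's API: the class SpanEditSketch, its Span type and the regular expression are hypothetical names invented for this illustration, and the library's own search logic is certainly more thorough. The sketch only shows the principle of locating one segment by its begin and end positions and rewriting that span while every other character, including badly formatted markup, is copied verbatim.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: this is NOT the Jericho API. It mimics the idea of
// locating one segment by position and rewriting just that span.
final class SpanEditSketch {

    /** Half-open character range [begin, end) in the source text. */
    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    // Matches a <title>...</title> element, case-insensitively, across lines.
    private static final Pattern TITLE =
        Pattern.compile("<title\\b[^>]*>(.*?)</title\\s*>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the span of the title's content, or null if there is no title. */
    static Span findTitleContent(String source) {
        Matcher m = TITLE.matcher(source);
        return m.find() ? new Span(m.start(1), m.end(1)) : null;
    }

    /** Replaces only the given span; all other characters are copied verbatim. */
    static String replaceSpan(String source, Span span, String replacement) {
        return source.substring(0, span.begin) + replacement + source.substring(span.end);
    }

    public static void main(String[] args) {
        // The body contains a server tag and an unclosed element on purpose.
        String source = "<html><head><TITLE>Old</TITLE></head>\n"
                      + "<body><% badly <b>formed %><p>unclosed</body>";
        Span span = findTitleContent(source);
        if (span != null) {
            System.out.println(replaceSpan(source, span, "New"));
        }
    }
}
```

Because nothing outside the span is ever interpreted, the malformed body in the example passes through untouched. A parser that rebuilds the document from a tree generally cannot promise this.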
Sample Programs
The samples
directory in the download package contains sample programs
for performing common tasks.
The .bat
files can be run directly on an MS-Windows operating system,
or the following syntax can be used on a UNIX-based operating system from the samples
directory (note that the classpath separator on UNIX is a colon, not a semicolon):
java -classpath bin:../lib/jericho-html-x.x.jar ProgramName
where x.x
is the current release number and ProgramName
is the name of the sample program to run.
The following sample programs are available:
ConvertStyleSheets
Demonstrates how to detect all external style sheets and place them inline into the document.
DisplayAllElements
Demonstrates the behaviour of the library when retrieving all elements from a document containing
a mix of normal HTML, different types of server tags, and badly formatted HTML.
DisplayFormFields
Demonstrates the use of the Segment.findFormFields()
method.
DisplaySpecialTags
Demonstrates how to search for special tags such as document type declarations, XML declarations,
processing instructions, common server tags, PHP tags, Mason tags, and HTML comments.
JSPTest
Demonstrates how to parse a document containing JSP tags without the server tags interfering with the
syntax of the HTML.
SplitLongLines
Demonstrates how to reformat a document so that lines exceeding a certain number of characters are split
into multiple lines.
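The problem addressed by the JSPTest sample, server tags interfering with the syntax of the HTML, can be sketched without the library as follows. This is not how the library implements Segment.ignoreWhenParsing(); the class IgnoreServerTagsSketch, its methods and both regular expressions are assumptions made purely for illustration. The sketch overwrites each <% ... %> tag with spaces in a same-length copy of the text, searches the copy for start tags, and then reads the matched ranges back from the original source so that positions stay valid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: NOT the Jericho API or its implementation.
final class IgnoreServerTagsSketch {

    // Common server tag of the form <% ... %> (as used by ASP and JSP).
    private static final Pattern SERVER_TAG =
        Pattern.compile("<%.*?%>", Pattern.DOTALL);

    // A very rough HTML start tag: '<', a name, then anything up to '>'.
    private static final Pattern START_TAG =
        Pattern.compile("<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>");

    /** Returns a same-length copy with every server tag overwritten by spaces. */
    static String maskServerTags(String source) {
        StringBuilder masked = new StringBuilder(source);
        Matcher m = SERVER_TAG.matcher(source);
        while (m.find()) {
            for (int i = m.start(); i < m.end(); i++) {
                masked.setCharAt(i, ' ');
            }
        }
        return masked.toString();
    }

    /** Finds start tags in the masked text; positions are valid in the original. */
    static List<String> findStartTags(String source) {
        String masked = maskServerTags(source);
        List<String> tags = new ArrayList<>();
        Matcher m = START_TAG.matcher(masked);
        while (m.find()) {
            // Slice the ORIGINAL text so server tags inside attributes survive.
            tags.add(source.substring(m.start(), m.end()));
        }
        return tags;
    }

    public static void main(String[] args) {
        String source = "<a href=\"<%= url %>\">link</a>"
                      + "<% if (x > 1) { %><b>hi</b><% } %>";
        for (String tag : findStartTags(source)) {
            System.out.println(tag);
        }
    }
}
```

Without the masking step, the rough start tag pattern would end the anchor tag at the closing %> of the embedded server tag, which is exactly the kind of interference the sample demonstrates avoiding.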
Handling of Invalid or Badly Formatted HTML
Note that although the library does a good job of analysing documents containing invalid or badly
formatted HTML in areas irrelevant to the analysis, any attempt to analyse the badly formatted HTML
itself will yield unpredictable results, which may or may not correspond with the interpretation of
the majority of user agents.
Furthermore, the behaviour of the library in relation to badly formatted HTML is not guaranteed to
remain consistent in future versions.
An exception to this is where any of the sample files containing badly formatted HTML produce
particular results in any of the sample applications.
Building
The build and sample files are implemented as DOS .bat files only.
This is because I wanted to avoid the need to install Ant for such a simple library.
Sorry to all the UNIX users for the inconvenience, but the batch files really don't do anything complicated anyway.
The javadoc compiler in j2sdk 1.4.0 has a problem with the first line of documentation in the
Element.isInline()
and Element.isBlock()
methods which causes an exception
to be thrown. This apparent bug in the javadoc processor has been fixed in j2sdk 1.4.2.
Alternative HTML Parsers
This package was originally written in the latter half of 2002. At that time I evaluated 6 other parsers,
none of which were capable of achieving my aims. Most couldn't reproduce a typical HTML document without change,
none could reproduce a source document containing badly formatted or non-HTML components without change,
and none provided a means to track the positions of nodes in the source text.
A list of these parsers and a brief description follows, but please note that I have not revised this
analysis since before this package was written.
Please let me know if there are any errors.
-
JavaCC HTML Parser by Quiotix Corporation (http://www.quiotix.com/downloads/html-parser/)
GNU GPL licence; an expensive licence fee applies for use in commercial applications.
Does not support document structure (parses into a flat node stream).
-
Demonstrational HTML 3.2 parser bundled with JavaCC. Virtually useless.
-
JTidy (http://jtidy.sourceforge.net/)
Supports document structure, but by its very nature it "tidies" up anything it doesn't like in the source document.
At first glance it looks like the positions of nodes in the source are accessible, at least in protected start and end fields in the Node class, but these are pointers into a different buffer and are of no use.
-
javax.swing.text.html.parser.Parser
Comes standard in the JDK.
Supports document structure.
Does not track the positions of nodes in the source text, but can be easily modified to do so (although I am not sure of the legal implications of such modifications).
Requires a DTD to function, but only comes with an HTML 3.2 DTD, which is unsuitable.
Even if an HTML 4.01 DTD were found, the parser itself might need tweaking to cater for the new element types.
The DTD needs to be in the format of a "bdtd" file, which is a binary format used only by Sun in this parser implementation.
I have found many requests for a 4.01 bdtd file in newsgroups etc. on the web, but they all remain unanswered.
Building it from scratch is not so easy.
-
Kizna HTML Parser v1.1 (http://htmlparser.sourceforge.net/)
GNU LGPL licence. Version 1.1 was very simple without support for document structure.
I have since revisited this project at sourceforge (early 2004), where version 1.4 is now available.
There are now two separate libraries, one with and one without document structure support.
It claims to now also be capable of reproducing source text verbatim.
-
CyberNeko HTML Parser (http://www.apache.org/~andyc/neko/doc/html/index.html)
Apache-style licence. Supports document structure. Based on the very popular Xerces XML parser.
At the time of evaluation this parser didn't regenerate the source accurately enough.