
Jericho HTML Parser is a simple but powerful java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.


Jericho HTML Parser

Jericho HTML Parser is a simple but powerful java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It can also determine the data structures represented in an HTML form.

It is an open source library released under the GNU Lesser General Public License (LGPL). You are therefore free to use it in commercial applications subject to the terms detailed in the licence document.

For downloads, support, updates and release notes visit the SourceForge.net project page at http://sourceforge.net/projects/jerichohtml/

Please let me know if you are using the library in your own project or find it useful in any way. You can also rate it at http://freshmeat.net/projects/jerichohtml/

All classes and methods have been comprehensively documented in the javadocs.

The package description contains a brief overview of how to use the package.

At this time no files have been submitted into CVS. If others are interested in extending or porting the library, a CVS repository will be made available.

Features

The library distinguishes itself from other HTML parsers by its four major features:

  • No parse tree of the entire document is ever generated. In this sense the library is strictly speaking not a true parser. The document source text is searched only for the markup relevant to the current operation. This allows the library to analyse and modify documents containing incorrect or badly formatted HTML or any other server or client side code, script, macro or markup. Most other parsers can't handle content that they are not explicitly programmed to accept.
  • The beginning and end positions in the source text of all parsed segments are accessible, allowing modification of only selected segments of the document without having to reconstruct the entire document from a parse tree. This feature, in combination with the one above, makes the toolkit extremely powerful in its simplicity.
  • An entire set of FormField objects can automatically be generated from the source document. These provide a very useful means for determining how to store and present data that is submitted from an arbitrary HTML form.
  • ASP, JSP, PSP, PHP and Mason server tags are explicitly recognised as accurately as is possible without incorporating actual parsers for these languages into the library. The library then allows any of these segments to be ignored when parsing the rest of the document so that they do not interfere with the HTML syntax. (see Segment.ignoreWhenParsing())
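
The first two features can be illustrated with a minimal, self-contained sketch. It does not use the library: the class SegmentEditSketch and its methods findStartTags and replaceSpans are hypothetical names invented for this illustration, and the matching rules are deliberately naive. It only shows the general idea of searching the source text for the relevant markup, recording begin and end positions, and rewriting just those spans while everything else is copied through verbatim.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not the Jericho API. */
final class SegmentEditSketch {

    /** A span of the source text: begin is inclusive, end is exclusive. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Finds "<name ...>" start tags by scanning the text directly; no tree is built. */
    static List<Span> findStartTags(String source, String name) {
        List<Span> spans = new ArrayList<>();
        for (int i = source.indexOf('<'); i >= 0; i = source.indexOf('<', i + 1)) {
            int afterName = i + 1 + name.length();
            if (afterName >= source.length()) break;
            // Case-insensitive comparison of the tag name at this position.
            if (!source.regionMatches(true, i + 1, name, 0, name.length())) continue;
            // The name must be followed by a delimiter, so "<bogus" does not match "b".
            char next = source.charAt(afterName);
            if (next != '>' && next != '/' && !Character.isWhitespace(next)) continue;
            int end = source.indexOf('>', afterName);
            if (end < 0) break;
            spans.add(new Span(i, end + 1));
            i = end; // resume the search after this tag
        }
        return spans;
    }

    /** Replaces only the given spans; all other text is copied unchanged. */
    static String replaceSpans(String source, List<Span> spans, String replacement) {
        StringBuilder out = new StringBuilder();
        int copiedUpTo = 0;
        for (Span span : spans) {
            out.append(source, copiedUpTo, span.begin);
            out.append(replacement);
            copiedUpTo = span.end;
        }
        out.append(source, copiedUpTo, source.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // The invalid markup and the server tag are never interpreted, so they survive as-is.
        String source = "<p>Some <B class=x>bold</b> text <bogus <<% junk %></p>";
        List<Span> tags = findStartTags(source, "b");
        System.out.println(replaceSpans(source, tags, "<strong>"));
    }
}
```

Because the spans refer to positions in the original text, the unrecognised "<bogus" fragment and the "<% junk %>" segment pass through untouched.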

Sample Programs

The samples directory in the download package contains sample programs for performing common tasks. The .bat files can be run directly on an MS-Windows operating system, or the following syntax can be used on a UNIX-based operating system from the samples directory:

java -classpath bin:../lib/jericho-html-x.x.jar ProgramName

where x.x is the current release number and ProgramName is the name of the sample program to run.

The following sample programs are available:

ConvertStyleSheets Demonstrates how to detect all external style sheets and place them inline into the document.
DisplayAllElements Demonstrates the behaviour of the library when retrieving all elements from a document containing a mix of normal HTML, different types of server tags, and badly formatted HTML.
DisplayFormFields Demonstrates the use of the Segment.findFormFields() method.
DisplaySpecialTags Demonstrates how to search for special tags such as document type declarations, XML declarations, processing instructions, common server tags, PHP tags, Mason tags, and HTML comments.
JSPTest Demonstrates how to parse a document containing JSP tags without the server tags interfering with the syntax of the HTML.
SplitLongLines Demonstrates how to reformat a document so that lines exceeding a certain number of characters are split into multiple lines.
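
The idea behind the JSPTest sample, keeping server tags from interfering with the HTML syntax, can be sketched without the library. The class ServerTagMaskSketch and its method maskServerTags below are hypothetical names, and blanking the segments out is an assumption made for this illustration; it is not a description of how Segment.ignoreWhenParsing() is implemented. The sketch overwrites each <% ... %> segment with spaces of the same length, so that a later search of the HTML does not trip over the embedded code and every character position still matches the original source.

```java
import java.util.Arrays;

/** Hypothetical illustration only; this is not the Jericho API. */
final class ServerTagMaskSketch {

    /** Returns a copy of the source in which every "<% ... %>" segment is blanked out. */
    static String maskServerTags(String source) {
        char[] chars = source.toCharArray();
        int from = 0;
        while (true) {
            int begin = source.indexOf("<%", from);
            if (begin < 0) break;
            int close = source.indexOf("%>", begin + 2);
            // An unterminated server tag is masked through to the end of the text.
            int end = (close < 0) ? source.length() : close + 2;
            Arrays.fill(chars, begin, end, ' ');
            from = end;
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // The server tag contains '>' and sits inside an attribute value.
        String source = "<a href=\"<%= a > b ? x : y %>\">link</a>";
        String masked = maskServerTags(source);
        System.out.println(masked);
        // Positions are preserved, so a span found in the masked text
        // can be applied to the original source.
        System.out.println(masked.length() == source.length());
    }
}
```

A naive search for the closing '>' of the a tag would stop inside the server tag in the original text, but finds the real end of the tag in the masked copy.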

Handling of Invalid or Badly Formatted HTML

Note that although the library does a good job of analysing documents containing invalid or badly formatted HTML in areas irrelevant to the analysis, any attempt to analyse the badly formatted HTML itself will yield unpredictable results, which may or may not correspond with the interpretation of the majority of user agents. Furthermore, the behaviour of the library in relation to badly formatted HTML is not guaranteed to remain consistent in future versions. An exception to this is where any of the sample files containing badly formatted HTML produce particular results in any of the sample applications.

Building

The build and sample files are implemented as DOS .bat files only. This is because I wanted to avoid the need to install Ant for such a simple library. Sorry to all the UNIX users for the inconvenience, but the batch files really don't do anything complicated anyway.

The javadoc compiler in j2sdk 1.4.0 has a problem with the first line of documentation in the Element.isInline() and Element.isBlock() methods which causes an exception to be thrown. This apparent bug in the javadoc processor has been fixed in j2sdk 1.4.2.

Alternative HTML Parsers

This package was originally written in the latter half of 2002. At that time I evaluated six other parsers, none of which were capable of achieving my aims. Most couldn't reproduce a typical HTML document without change, none could reproduce a source document containing badly formatted or non-HTML components without change, and none provided a means to track the positions of nodes in the source text. A list of these parsers and a brief description follows, but please note that I have not revised this analysis since before this package was written. Please let me know if there are any errors.

  • JavaCC HTML Parser by Quiotix Corporation (http://www.quiotix.com/downloads/html-parser/)
    GNU GPL licence, with an expensive licence fee for use in a commercial application. Does not support document structure (parses into a flat node stream).
  • Demonstrational HTML 3.2 parser bundled with JavaCC. Virtually useless.
  • JTidy (http://jtidy.sourceforge.net/)
    Supports document structure, but by its very nature it "tidies" up anything it doesn't like in the source document. On first glance it looks like the positions of nodes in the source are accessible, at least in protected start and end fields in the Node class, but these are pointers into a different buffer and are of no use.
  • javax.swing.text.html.parser.Parser
    Comes standard in the JDK. Supports document structure. Does not track the positions of nodes in the source text, but can be easily modified to do so (although I am not sure of the legal implications of such modifications). Requires a DTD to function, but only comes with an HTML 3.2 DTD, which is unsuitable. Even if an HTML 4.01 DTD were found, the parser itself might need tweaking to cater for the new element types. The DTD needs to be in the format of a "bdtd" file, which is a binary format used only by Sun in this parser implementation. I have found many requests for a 4.01 bdtd file in newsgroups and elsewhere on the web, but they all remain unanswered. Building it from scratch is not so easy.
  • Kizna HTML Parser v1.1 (http://htmlparser.sourceforge.net/)
    GNU LGPL licence. Version 1.1 was very simple without support for document structure. I have since revisited this project at sourceforge (early 2004), where version 1.4 is now available. There are now two separate libraries, one with and one without document structure support. It claims to now also be capable of reproducing source text verbatim.
  • CyberNeko HTML Parser (http://www.apache.org/~andyc/neko/doc/html/index.html)
    Apache-style licence. Supports document structure. Based on the very popular Xerces XML parser. At the time of evaluation this parser didn't regenerate the source accurately enough.



