Jericho HTML Parser is a simple but powerful java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.





 
  Jericho HTML Parser
  
  
  
  
   
 
 
  

  
   Jericho HTML Parser is a simple but powerful java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML.
   It can also determine the data structures represented in an HTML form.
  
  
   It is an open source library released under the GNU Lesser General Public License (LGPL).
   You are therefore free to use it in commercial applications subject to the terms detailed in the licence document.
  
  
   For downloads, support, updates and release notes visit the SourceForge.net project page at
   http://sourceforge.net/projects/jerichohtml/
  
  
    Please let me know if you are using the library in your own project or find it useful in any way.
    You can also rate it at http://freshmeat.net/projects/jerichohtml/
  
  All classes and methods have been comprehensively documented in the javadocs.
  
   The package description
   contains a brief overview of how to use the package.
  
  At this time no files have been submitted into CVS.  If others are interested in extending or porting the library, a CVS repository will be made available.

  Features
  The library distinguishes itself from other HTML parsers by its four major features:
  
   
    No parse tree of the entire document is ever generated.  In this sense the library
    is strictly speaking not a true parser.  The document source text is searched only for
    the markup relevant to the current operation.  This allows the library to analyse
    and modify documents containing incorrect or badly formatted HTML
    or any other server or client side code, script, macro or markup.  Most other
    parsers can't handle content that they are not explicitly programmed to accept.
   
   
    The beginning and end positions in the source text of all parsed segments are accessible,
    allowing modification of only selected segments of the document without having to reconstruct
    the entire document from a parse tree.  This feature, in combination with the one above,
    makes the toolkit extremely powerful in its simplicity.
   
   
    An entire set of FormField objects can automatically be generated
    from the source document.  These provide a very useful means for determining how to store
    and present data that is submitted from an arbitrary HTML form.
   
   
    ASP,
    JSP,
    PSP,
    PHP and
    Mason
    server tags are explicitly recognised as accurately as is possible without incorporating
    actual parsers for these languages into the library.
    The library then allows any of these segments to be ignored when parsing the rest of the document
    so that they do not interfere with the HTML syntax. (see Segment.ignoreWhenParsing())
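   The first two features can be pictured with a small standalone sketch. It is not taken from the library and does not use its API: the class SpanEditSketch and its methods are hypothetical names chosen for illustration. It searches the source text only for the element of interest, records the begin and end offsets of its content, and replaces just that range, so the rest of the document (including badly formatted HTML and server tags) is reproduced verbatim.

```java
/** Illustrative only: edits one element by source offsets, without building a parse tree. */
class SpanEditSketch {

    /** Half-open [begin, end) character range in the source text. */
    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Case-insensitive indexOf that never changes the length of the text it scans. */
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Locates the content between the first start tag beginning with "<name" and the
     * next "</name" end tag. Returns null if either is missing. This is a simplification:
     * it would also match longer names that share the prefix.
     */
    static Span findContent(String source, String name) {
        int open = indexOfIgnoreCase(source, "<" + name, 0);
        if (open < 0) return null;
        int openEnd = source.indexOf('>', open);
        if (openEnd < 0) return null;
        int close = indexOfIgnoreCase(source, "</" + name, openEnd + 1);
        if (close < 0) return null;
        return new Span(openEnd + 1, close);
    }

    /** Replaces only the given range; every other character is copied unchanged. */
    static String replace(String source, Span span, String replacement) {
        return source.substring(0, span.begin) + replacement + source.substring(span.end);
    }

    public static void main(String[] args) {
        // The body contains an unclosed <p>, an unclosed <b> and a server tag on purpose.
        String source = "<html><title>Old</title><body><p>unclosed <b>bold<% out.print(x) %></body>";
        Span title = findContent(source, "title");
        if (title != null) {
            System.out.println(replace(source, title, "New"));
        }
    }
}
```

   Because the edit is expressed purely as offsets into the original text, nothing outside the title content needs to be understood or regenerated.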
   
  

  Sample Programs
  
   The samples directory in the download package contains sample programs
   for performing common tasks.
   The .bat files can be run directly on a MS-Windows operating system,
   or the following syntax can be used on a UNIX based operating system from the samples directory:
  
  java -classpath bin:../lib/jericho-html-x.x.jar ProgramName
  
   where x.x is the current release number and ProgramName
   is the name of the sample program to run.
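   Note that the separator between classpath entries is platform dependent: ";" on MS-Windows and ":" on UNIX based systems. Java exposes the current platform's value as File.pathSeparator. The following minimal sketch assembles the command line for either platform; the class SampleCommandSketch and its method are hypothetical and are not part of the download package.

```java
import java.io.File;

/** Illustrative only: builds the sample launch command with a given classpath separator. */
class SampleCommandSketch {

    /** version is the release number (for example "x.x"); separator is ";" or ":". */
    static String command(String version, String programName, String separator) {
        String jar = "../lib/jericho-html-" + version + ".jar";
        return "java -classpath bin" + separator + jar + " " + programName;
    }

    public static void main(String[] args) {
        // File.pathSeparator is ";" on Windows and ":" on UNIX-like systems.
        System.out.println(command("x.x", "DisplayAllElements", File.pathSeparator));
    }
}
```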
  
  The following sample programs are available:
  
   
    ConvertStyleSheets
    
     Demonstrates how to detect all external style sheets and place them inline into the document.
    
   
   
    DisplayAllElements
    
     Demonstrates the behaviour of the library when retrieving all elements from a document containing
     a mix of normal HTML, different types of server tags, and badly formatted HTML.
    
   
   
    DisplayFormFields
    
     Demonstrates the use of the Segment.findFormFields() method.
    
   
   
    DisplaySpecialTags
    
     Demonstrates how to search for special tags such as document type declarations, XML declarations,
     processing instructions, common server tags, PHP tags, Mason tags, and HTML comments.
    
   
   
    JSPTest
    
     Demonstrates how to parse a document containing JSP tags without the server tags interfering with the
     syntax of the HTML.
    
   
   
    SplitLongLines
    
     Demonstrates how to reformat a document so that lines exceeding a certain number of characters are split
     into multiple lines.
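   The idea behind tasks such as JSPTest, where server tags must not interfere with the HTML syntax, can be sketched without the library. One possible approach, shown here as an assumption and not as the library's actual implementation, is to overwrite each server tag with spaces of the same length before searching for HTML markup, so that every offset found in the masked copy still applies to the original source. The class ServerTagMaskSketch is a hypothetical name, and only tags of the form <% ... %> are handled.

```java
import java.util.Arrays;

/** Illustrative only: hides <% ... %> blocks while keeping all character offsets intact. */
class ServerTagMaskSketch {

    /** Returns a copy of source in which every <% ... %> block is overwritten with spaces. */
    static String maskServerTags(String source) {
        char[] chars = source.toCharArray();
        int pos = 0;
        while (true) {
            int begin = source.indexOf("<%", pos);
            if (begin < 0) break;
            int close = source.indexOf("%>", begin + 2);
            // An unterminated server tag is masked through to the end of the text.
            int end = (close < 0) ? source.length() : close + 2;
            Arrays.fill(chars, begin, end, ' ');
            pos = end;
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String source = "<a href=\"<%= url %>\">link</a>";
        String masked = maskServerTags(source);
        // Without masking, the first '>' belongs to the server tag, not to the <a> start tag.
        int tagEnd = masked.indexOf('>') + 1;
        // The offset was found in the masked copy but is applied to the original text.
        System.out.println(source.substring(0, tagEnd));
    }
}
```

   A naive search for the end of the start tag in the unmasked text would stop at the "%>" inside the attribute value; in the masked copy it finds the real end of the tag.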
    
   
  

  Handling of Invalid or Badly Formatted HTML
  
   Note that although the library does a good job of analysing documents containing invalid or badly
   formatted HTML in areas irrelevant to the analysis, any attempt to analyse the badly formatted HTML
   itself will yield unpredictable results, which may or may not correspond with the interpretation of
   the majority of user agents.
   Furthermore, the behaviour of the library in relation to badly formatted HTML is not guaranteed to
   remain consistent in future versions.
   An exception to this is where any of the sample files containing badly formatted HTML produce
   particular results in any of the sample applications.
  

  Building
  
   The build and sample files are implemented as DOS .bat files only.
   This is because I wanted to avoid the need to install ANT for such a simple library.
   Sorry to all the UNIX users for the inconvenience, but the batch files really don't do anything complicated anyway.
  
  
   The javadoc compiler in j2sdk 1.4.0 has a problem with the first line of documentation in the
   Element.isInline() and Element.isBlock() methods which causes an exception
   to be thrown.  This apparent bug in the javadoc processor has been fixed in j2sdk 1.4.2.
  

  Alternative HTML Parsers
  
   This package was originally written in the latter half of 2002.  At that time I evaluated 6 other parsers,
   none of which were capable of achieving my aims.  Most couldn't reproduce a typical HTML document without change,
   none could reproduce a source document containing badly formatted or non-HTML components without change,
   and none provided a means to track the positions of nodes in the source text.
   A list of these parsers and a brief description follows, but please note that I have not revised this
   analysis since before this package was written.
   Please let me know if there are any errors.
  
  
   
    JavaCC HTML Parser by Quiotix Corporation (http://www.quiotix.com/downloads/html-parser/)

    GNU GPL licence; an expensive licence fee applies for use in commercial applications.
    Does not support document structure (parses into a flat node stream).
   
   
    Demonstrational HTML 3.2 parser bundled with JavaCC.  Virtually useless.
   
   
    JTidy (http://jtidy.sourceforge.net/)

    Supports document structure, but by its very nature it "tidies" up anything it doesn't like in the source document.
    At first glance it looks like the positions of nodes in the source are accessible, at least in protected start and end fields in the Node class, but these are pointers into a different buffer and are of no use.
   
   
    javax.swing.text.html.parser.Parser

    Comes standard in the JDK.
    Supports document structure.
    Does not track the positions of nodes in the source text, but can be easily modified to do so (although I am not sure of the legal implications of such modifications).
    Requires a DTD to function, but only comes with an HTML 3.2 DTD, which is unsuitable.
    Even if an HTML 4.01 DTD were found, the parser itself might need tweaking to cater for the new element types.
    The DTD needs to be in the format of a "bdtd" file, which is a binary format used only by Sun in this parser implementation.
    I have found many requests for a 4.01 bdtd file in newsgroups etc. on the web, but they all remain unanswered.
    Building it from scratch is not so easy.
   
   
    Kizna HTML Parser v1.1 (http://htmlparser.sourceforge.net/)

    GNU LGPL licence.  Version 1.1 was very simple without support for document structure.
    I have since revisited this project at sourceforge (early 2004), where version 1.4 is now available.
    There are now two separate libraries, one with and one without document structure support.
    It claims to now also be capable of reproducing source text verbatim.
   
   
    CyberNeko HTML Parser (http://www.apache.org/~andyc/neko/doc/html/index.html)

    Apache-style licence.  Supports document structure.  Based on the very popular Xerces XML parser.
    At the time of evaluation this parser didn't regenerate the source accurately enough.
