Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.







Source (Jericho HTML Parser 1.5-dev1)

















au.id.jericho.lib.html
Class Source

java.lang.Object
  extended by au.id.jericho.lib.html.Segment
      extended by au.id.jericho.lib.html.Source
All Implemented Interfaces:
java.lang.CharSequence, java.lang.Comparable

public class Source
extends Segment

Represents a source HTML document.

Note that many of the useful functions which can be performed on the source document are defined in its superclass, Segment. The Source object is itself a Segment which spans the entire document.

Most of the methods defined in this class are useful for determining the elements and tags surrounding or neighbouring a particular character position in the document.

IMPORTANT NOTE: Because HTML allows '<' characters within attribute values (see section 5.3.2 of the HTML spec), it is theoretically impossible to determine with certainty whether any given '<' character in a source document is the start of a tag without having parsed from the beginning of the document (which Jericho HTML Parser doesn't do). For this reason, the parser may reject a start tag completely if its attributes are not properly formed, although it does try to provide some leniency. In XHTML, such characters must be represented in attribute values as character entities (see section 3.1 of the XML spec).
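
The ambiguity described in the note can be illustrated with a small JDK-only sketch. It does not use the Jericho API: the class TagStartScan and both of its methods are hypothetical names, and the scanner is deliberately simplistic (it knows nothing about comments, scripts or server tags). It only shows that a '<' inside a quoted attribute value looks like a tag start unless the text is scanned from the beginning.

```java
import java.util.ArrayList;
import java.util.List;

final class TagStartScan {
    // Every '<' in the text, regardless of context.
    static List<Integer> naiveCandidates(CharSequence text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '<') result.add(i);
        }
        return result;
    }

    // Only those '<' characters that are not inside a quoted attribute value,
    // found by tracking tag and quote state from the start of the text.
    static List<Integer> scanFromBeginning(CharSequence text) {
        List<Integer> result = new ArrayList<>();
        boolean inTag = false;
        char quote = 0; // 0 means "not inside a quoted value"
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!inTag) {
                if (c == '<') {
                    result.add(i);
                    inTag = true;
                }
            } else if (quote != 0) {
                if (c == quote) quote = 0; // closing quote
            } else if (c == '"' || c == '\'') {
                quote = c;                 // opening quote
            } else if (c == '>') {
                inTag = false;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a title=\"x < y\">link</a>";
        System.out.println(naiveCandidates(html));   // [0, 12, 21]
        System.out.println(scanFromBeginning(html)); // [0, 21]
    }
}
```

The '<' at index 12 is part of the title attribute value, so a search that starts in the middle of the document has no reliable way to rule it out.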

See Also:
Segment

Constructor Summary
Source(java.lang.CharSequence text)
          Constructs a new Source object with the specified text.
 
Method Summary
 Segment findEnclosingComment(int pos)
          Returns a Segment spanning the HTML comment that encloses the specified position in the source document.
 Element findEnclosingElement(int pos)
          Returns the most nested Element enclosing the specified position in the source document.
 Element findEnclosingElement(int pos, java.lang.String name)
          Returns the most nested Element with the specified name enclosing the specified position in the source document.
 StartTag findEnclosingStartTag(int pos)
          Returns the StartTag enclosing the specified position in the source document.
 CharacterReference findNextCharacterReference(int pos)
          Returns the CharacterReference beginning at or immediately following the specified position in the source document.
 StartTag findNextComment(int pos)
          Returns the Comment beginning at or immediately following the specified position in the source document.
 EndTag findNextEndTag(int pos)
          Returns the EndTag beginning at or immediately following the specified position in the source document.
 EndTag findNextEndTag(int pos, java.lang.String name)
          Returns the EndTag with the specified name beginning at or immediately following the specified position in the source document.
 StartTag findNextStartTag(int pos)
          Returns the StartTag beginning at or immediately following the specified position in the source document.
 StartTag findNextStartTag(int pos, java.lang.String name)
          Returns the StartTag with the specified name beginning at or immediately following the specified position in the source document.
 StartTag findNextStartTag(int pos, java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive)
          Returns the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.
 Tag findNextTag(int pos)
          Returns the tag (either a StartTag or EndTag) beginning at or immediately following the specified position in the source document.
 CharacterReference findPreviousCharacterReference(int pos)
          Returns the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document.
 EndTag findPreviousEndTag(int pos, java.lang.String name)
          Returns the EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
 StartTag findPreviousStartTag(int pos)
          Returns the StartTag at or immediately preceding (or enclosing) the specified position in the source document.
 StartTag findPreviousStartTag(int pos, java.lang.String name)
          Returns the StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
 Element getElementById(java.lang.String id)
          Returns the Element with the specified id attribute value.
 java.util.Iterator getNextTagIterator(int pos)
          Returns an iterator of Tag objects beginning at or immediately following the specified position in the source document.
 void ignoreWhenParsing(java.util.Collection segments)
          Causes all of the segments in the specified collection to be ignored when parsing.
 void ignoreWhenParsing(int begin, int end)
          Causes the specified range of the source text to be ignored when parsing.
 Attributes parseAttributes(int pos, int maxEnd)
          Parses any Attributes starting at the specified position.
 Attributes parseAttributes(int pos, int maxEnd, int maxErrorCount)
          Parses any Attributes starting at the specified position.
 void setLogWriter(java.io.Writer writer)
          Sets the destination for log messages.
 java.lang.String toString()
          Returns the source text as a String.
 
Methods inherited from class au.id.jericho.lib.html.Segment
charAt, compareTo, encloses, encloses, equals, findAllCharacterReferences, findAllComments, findAllElements, findAllElements, findAllStartTags, findAllStartTags, findAllStartTags, findFormControls, findFormFields, findWords, getBegin, getDebugInfo, getEnd, getSourceText, getSourceTextNoWhitespace, hashCode, ignoreWhenParsing, isComment, isWhiteSpace, length, parseAttributes, subSequence
 
Methods inherited from class java.lang.Object
getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

Source

public Source(java.lang.CharSequence text)
Constructs a new Source object with the specified text.

Parameters:
text - the source text.
Method Detail

toString

public java.lang.String toString()
Returns the source text as a String.

If the original CharSequence supplied when this instance was constructed was not a String, the first conversion of the text to a String is cached for subsequent calls.

Specified by:
toString in interface java.lang.CharSequence
Overrides:
toString in class Segment
Returns:
the source text as a String.
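
The caching behaviour described for toString() could look roughly like the following JDK-only sketch. CachedText is a hypothetical class, not part of the library, and it assumes single-threaded use; the actual Source implementation may differ.

```java
final class CachedText implements CharSequence {
    private final CharSequence text;
    private String cached; // null until the first conversion

    CachedText(CharSequence text) {
        this.text = text;
        // A String needs no conversion, so it can be kept as-is.
        if (text instanceof String) cached = (String) text;
    }

    @Override public int length() { return text.length(); }
    @Override public char charAt(int index) { return text.charAt(index); }
    @Override public CharSequence subSequence(int start, int end) {
        return text.subSequence(start, end);
    }

    @Override public String toString() {
        // Convert once, then hand out the same String on later calls.
        if (cached == null) cached = text.toString();
        return cached;
    }

    public static void main(String[] args) {
        CachedText t = new CachedText(new StringBuilder("<p>hi</p>"));
        System.out.println(t.toString() == t.toString()); // true
    }
}
```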

getElementById

public Element getElementById(java.lang.String id)
Returns the Element with the specified id attribute value.

This simulates the script method getElementById defined in DOM HTML level 1.

This is equivalent to findNextStartTag(0,"id",id,true).getElement().

A well-formed HTML document should have no more than one element with any given id attribute value.

Calls to this method are not cached.

Parameters:
id - the id attribute value (case sensitive) to search for, must not be null.
Returns:
the Element with the specified id attribute value.

findPreviousStartTag

public StartTag findPreviousStartTag(int pos)
Returns the StartTag at or immediately preceding (or enclosing) the specified position in the source document.

If the specified position is within an HTML comment, the segment spanning the comment is returned.

Parameters:
pos - the position in the source document from which to start the search.
Returns:
the StartTag at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists.

findPreviousStartTag

public StartTag findPreviousStartTag(int pos,
                                     java.lang.String name)
Returns the StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.

Start tags positioned within an HTML comment are ignored, but the comment segment itself is treated as a start tag.

Specifying a null name parameter is equivalent to findPreviousStartTag(pos).

Parameters:
pos - the position in the source document from which to start the search.
name - the name of the StartTag to search for.
Returns:
the StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists.

findNextStartTag

public StartTag findNextStartTag(int pos)
Returns the StartTag beginning at or immediately following the specified position in the source document.

StartTags positioned within an HTML comment are ignored, but subsequent comment segments are treated as start tags.

Parameters:
pos - the position in the source document from which to start the search.
Returns:
the StartTag beginning at or immediately following the specified position in the source document, or null if none exists.

findNextStartTag

public StartTag findNextStartTag(int pos,
                                 java.lang.String name)
Returns the StartTag with the specified name beginning at or immediately following the specified position in the source document.

Start tags positioned within an HTML comment are ignored.

Specifying a null name parameter is equivalent to findNextStartTag(pos).

Specifying a name parameter ending in a colon (:) searches for all start tags in the specified XML namespace.

Parameters:
pos - the position in the source document from which to start the search.
name - the name of the StartTag to search for.
Returns:
the StartTag with the specified name beginning at or immediately following the specified position in the source document, or null if none exists.
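
The three cases for the name parameter (null, a name ending in a colon, and an ordinary name) can be summarised as a matching rule. The sketch below is JDK-only; TagNameMatch is a hypothetical helper, and the case-insensitive comparison is an assumption that this page does not confirm.

```java
import java.util.Locale;

final class TagNameMatch {
    // Decides whether a tag name satisfies a search name:
    //   null       -> every tag matches
    //   "prefix:"  -> every tag in that XML namespace matches
    //   otherwise  -> the names must be equal
    // Case-insensitive comparison is an assumption of this sketch.
    static boolean matches(String tagName, String searchName) {
        if (searchName == null) return true;
        String tag = tagName.toLowerCase(Locale.ROOT);
        String search = searchName.toLowerCase(Locale.ROOT);
        if (search.endsWith(":")) return tag.startsWith(search);
        return tag.equals(search);
    }

    public static void main(String[] args) {
        System.out.println(matches("o:p", "o:"));  // true
        System.out.println(matches("p", "o:"));    // false
        System.out.println(matches("div", null));  // true
    }
}
```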

findNextStartTag

public StartTag findNextStartTag(int pos,
                                 java.lang.String attributeName,
                                 java.lang.String value,
                                 boolean valueCaseSensitive)
Returns the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.

Calls to this method are not cached.

Parameters:
pos - the position in the source document from which to start the search.
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
Returns:
the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.
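
The matching rule in the parameter list (attribute name always case insensitive, value case sensitivity chosen by the caller) can be sketched without the library. AttributeSearch, its regular expressions and the -1 return value are all hypothetical; the real method returns a StartTag and uses the library's own attribute parser, which is far more lenient than these patterns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AttributeSearch {
    // Simplified start tag: '<', a name, then attributes in which '>' may
    // appear only inside quotes. Group 1 is the raw attribute text.
    private static final Pattern START_TAG =
        Pattern.compile("<[A-Za-z][^\\s/>]*((?:[^>\"']|\"[^\"]*\"|'[^']*')*)>");
    // name = "value" | 'value' | unquoted value
    private static final Pattern ATTRIBUTE =
        Pattern.compile("([^\\s=/\"'>]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    // Returns the index of the first start tag at or after pos that carries
    // the given attribute name/value pair, or -1 if there is none.
    static int indexOfStartTagWithAttribute(CharSequence text, int pos,
            String attributeName, String value, boolean valueCaseSensitive) {
        if (pos < 0 || pos > text.length()) return -1;
        Matcher tag = START_TAG.matcher(text);
        tag.region(pos, text.length());
        while (tag.find()) {
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            while (attr.find()) {
                String actual = attr.group(2) != null ? attr.group(2)
                              : attr.group(3) != null ? attr.group(3)
                              : attr.group(4);
                boolean nameMatches = attr.group(1).equalsIgnoreCase(attributeName);
                boolean valueMatches = valueCaseSensitive
                        ? actual.equals(value)
                        : actual.equalsIgnoreCase(value);
                if (nameMatches && valueMatches) return tag.start();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<p id=\"intro\">a</p><div ID=\"Main\">b</div>";
        System.out.println(indexOfStartTagWithAttribute(html, 0, "id", "Main", true));  // 19
        System.out.println(indexOfStartTagWithAttribute(html, 0, "id", "main", true));  // -1
        System.out.println(indexOfStartTagWithAttribute(html, 0, "id", "main", false)); // 19
    }
}
```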

findNextComment

public StartTag findNextComment(int pos)
Returns the Comment beginning at or immediately following the specified position in the source document.

If the specified position is within a comment, the comment following the enclosing comment is returned.

Parameters:
pos - the position in the source document from which to start the search.
Returns:
the Comment beginning at or immediately following the specified position in the source document, or null if none exists.

findPreviousEndTag

public EndTag findPreviousEndTag(int pos,
                                 java.lang.String name)
Returns the EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.

End tags positioned within an HTML comment are ignored.

Parameters:
pos - the position in the source document from which to start the search.
name - the name of the EndTag to search for, must not be null.
Returns:
the EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists.

findNextEndTag

public EndTag findNextEndTag(int pos)
Returns the EndTag beginning at or immediately following the specified position in the source document.

End tags positioned within an HTML comment are ignored.

Parameters:
pos - the position in the source document from which to start the search.
Returns:
the EndTag beginning at or immediately following the specified position in the source document, or null if none exists.

findNextEndTag

public EndTag findNextEndTag(int pos,
                             java.lang.String name)
Returns the EndTag with the specified name beginning at or immediately following the specified position in the source document.

End tags positioned within an HTML comment are ignored.

Parameters:
pos - the position in the source document from which to start the search.
name - the name of the EndTag to search for, must not be null.
Returns:
the EndTag with the specified name beginning at or immediately following the specified position in the source document, or null if none exists.

getNextTagIterator

public java.util.Iterator getNextTagIterator(int pos)
Returns an iterator of Tag objects beginning at or immediately following the specified position in the source document.

Tags positioned within an HTML comment are ignored, but the comment segments themselves are treated as start tags.

Parameters:
pos - the position in the source document from which to start the iteration.
Returns:
an iterator of Tag objects beginning at or immediately following the specified position in the source document.

findNextTag

public Tag findNextTag(int pos)
Returns the tag (either a StartTag or EndTag) beginning at or immediately following the specified position in the source document.

IMPLEMENTATION NOTE: Sequential tags in a document should be retrieved using the iterator from getNextTagIterator(int pos) as it is far more efficient than using multiple calls to this method.

Parameters:
pos - the position in the source document from which to start the search.
Returns:
the tag beginning at or immediately following the specified position in the source document, or null if none exists.
See Also:
getNextTagIterator(int pos)
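
The usage pattern recommended by the implementation note, a single iterator that walks forward through the tags, can be pictured with a JDK-only sketch. SimpleTagIterator and its regular expression are hypothetical and say nothing about how the library is implemented; the sketch only shows one scanner object keeping its position between tags instead of a fresh search being started for each one.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SimpleTagIterator implements Iterator<String> {
    // Very rough notion of a tag; real HTML needs a proper parser.
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    private final Matcher matcher; // remembers where the previous tag ended
    private String next;           // look-ahead, null when exhausted

    SimpleTagIterator(CharSequence text, int pos) {
        matcher = TAG.matcher(text);
        matcher.region(pos, text.length());
        advance();
    }

    private void advance() {
        next = matcher.find() ? matcher.group() : null;
    }

    @Override public boolean hasNext() { return next != null; }

    @Override public String next() {
        if (next == null) throw new NoSuchElementException();
        String current = next;
        advance();
        return current;
    }

    @Override public void remove() { throw new UnsupportedOperationException(); }

    public static void main(String[] args) {
        Iterator<String> tags = new SimpleTagIterator("<p>a<b>c</b></p>", 0);
        while (tags.hasNext()) System.out.println(tags.next());
    }
}
```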

findEnclosingStartTag

public StartTag findEnclosingStartTag(int pos)
Returns the StartTag enclosing the specified position in the source document.

If the specified position is within an HTML comment, the segment spanning the comment is returned.

A segment is considered to enclose a character position x if
segment.getBegin() <= x < segment.getEnd()

Parameters:
pos - the position in the source document.
Returns:
the StartTag enclosing the specified position in the source document, or null if the position is not within a StartTag.

findEnclosingComment

public Segment findEnclosingComment(int pos)
Returns a Segment spanning the HTML comment that encloses the specified position in the source document.

A segment is considered to enclose a character position x if
segment.getBegin() <= x < segment.getEnd()

Parameters:
pos - the position in the source document.
Returns:
a Segment spanning the HTML comment that encloses the specified position in the source document, or null if the position is not within a comment.
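
The enclosing rule begin <= x < end can be made concrete with a JDK-only sketch that locates the comment around a position. EnclosingComment is a hypothetical helper that returns a {begin, end} pair instead of a Segment, and it assumes comments are delimited by "<!--" and "-->" and that an unterminated comment encloses nothing.

```java
final class EnclosingComment {
    // Returns {begin, end} of the comment enclosing pos, or null if there is
    // none. A comment encloses pos when begin <= pos < end.
    static int[] find(String text, int pos) {
        int from = 0;
        while (true) {
            int begin = text.indexOf("<!--", from);
            if (begin == -1 || begin > pos) return null; // no comment starts at or before pos
            int close = text.indexOf("-->", begin + 4);
            if (close == -1) return null;                // unterminated: treated as no comment
            int end = close + 3;                         // end is exclusive
            if (pos < end) return new int[] {begin, end};
            from = end;                                  // comment ended before pos, keep scanning
        }
    }

    public static void main(String[] args) {
        String text = "ab<!-- c -->de";
        int[] span = find(text, 11);
        System.out.println(span[0] + ".." + span[1]); // 2..12
        System.out.println(find(text, 12) == null);   // true: end is exclusive
    }
}
```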

findEnclosingElement

public Element findEnclosingElement(int pos)
Returns the most nested Element enclosing the specified position in the source document.

If the specified position is within an HTML comment, the segment spanning the comment is returned.

A segment is considered to enclose a character position x if
segment.getBegin() <= x < segment.getEnd()

Parameters:
pos - the position in the source document.
Returns:
the most nested Element enclosing the specified position in the source document, or null if the position is not within an Element.

findEnclosingElement

public Element findEnclosingElement(int pos,
                                    java.lang.String name)
Returns the most nested Element with the specified name enclosing the specified position in the source document.

Elements positioned within an HTML comment are ignored, but the comment segment itself is treated as an Element.

Parameters:
pos - the position in the source document.
name - the name of the Element to search for.
Returns:
the most nested Element with the specified name enclosing the specified position in the source document, or null if none exists.

findPreviousCharacterReference

public CharacterReference findPreviousCharacterReference(int pos)
Returns the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document.

Character references positioned within an HTML comment are NOT ignored.

Parameters:
pos - the position in the source document from which to start the search.
Returns:
the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists.

findNextCharacterReference

public CharacterReference findNextCharacterReference(int pos)
Returns the CharacterReference beginning at or immediately following the specified position in the source document.

Character references positioned within an HTML comment are NOT ignored.

Parameters:
pos - the position in the source document from which to start the search.
Returns:
the CharacterReference beginning at or immediately following the specified position in the source document, or null if none exists.
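
A forward search of this kind can be sketched with the JDK alone. CharRefSearch and its pattern are hypothetical: the pattern accepts only named, decimal and hexadecimal references that end in a semicolon, which is an assumption of the sketch rather than the library's definition. Like the documented method, it does not skip references inside comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharRefSearch {
    // &name;  &#123;  &#x1F;  (terminating semicolon required in this sketch)
    private static final Pattern CHAR_REF =
        Pattern.compile("&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);");

    // Index of the character reference beginning at or after pos, or -1.
    static int indexOfNext(CharSequence text, int pos) {
        if (pos < 0 || pos > text.length()) return -1;
        Matcher m = CHAR_REF.matcher(text);
        m.region(pos, text.length());
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String text = "a &amp; b &#169; c";
        System.out.println(indexOfNext(text, 0)); // 2
        System.out.println(indexOfNext(text, 3)); // 10
    }
}
```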

parseAttributes

public Attributes parseAttributes(int pos,
                                  int maxEnd)
Parses any Attributes starting at the specified position. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations.

The returned Attributes segment will always begin at pos, and will end at the first occurrence of "/>" or ">" outside of a quoted attribute value, or at maxEnd, whichever comes first.

Only returns null if the segment contains a major syntactical error or more than the default maximum number of minor syntactical errors.

This is equivalent to parseAttributes(pos,maxEnd,Attributes.getDefaultMaxErrorCount())

Parameters:
pos - the position in the source document at the beginning of the attribute list
maxEnd - the maximum end position of the attribute list, or -1 if no maximum
Returns:
the Attributes starting at the specified position, or null if too many errors occur while parsing.
See Also:
StartTag.getAttributes(), Segment.parseAttributes()
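
The rule for where the returned segment ends (the first "/>" or ">" outside a quoted value, or maxEnd, whichever comes first) can be sketched as follows. AttributesEnd is a hypothetical JDK-only helper; it performs no attribute parsing or error counting, and treating a maxEnd beyond the end of the text as "no maximum" is an assumption.

```java
final class AttributesEnd {
    // Returns the end position of an attribute list starting at pos.
    // maxEnd == -1 means no maximum.
    static int find(CharSequence text, int pos, int maxEnd) {
        int limit = (maxEnd == -1 || maxEnd > text.length()) ? text.length() : maxEnd;
        char quote = 0; // 0 means "not inside a quoted value"
        for (int i = pos; i < limit; i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '>') {
                return i;
            } else if (c == '/' && i + 1 < text.length() && text.charAt(i + 1) == '>') {
                return i;
            }
        }
        return limit;
    }

    public static void main(String[] args) {
        String html = "<img alt=\"a > b\" src=x.png />";
        System.out.println(find(html, 5, -1)); // 27, the '/' of "/>"
        System.out.println(find(html, 5, 20)); // 20, capped by maxEnd
    }
}
```

The '>' at index 12 sits inside the quoted alt value, so it does not end the attribute list.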

parseAttributes

public Attributes parseAttributes(int pos,
                                  int maxEnd,
                                  int maxErrorCount)
Parses any Attributes starting at the specified position. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations.

Only returns null if the segment contains a major syntactical error or more than the specified number of minor syntactical errors.

The maxErrorCount argument overrides the default maximum number of minor errors allowed, which can be set using the Attributes.setDefaultMaxErrorCount(int) static method.

See parseAttributes(int pos, int maxEnd) for more information.

Parameters:
pos - the position in the source document at the beginning of the attribute list
maxEnd - the maximum end position of the attribute list, or -1 if no maximum
maxErrorCount - the maximum number of minor errors allowed while parsing
Returns:
the Attributes starting at the specified position, or null if too many errors occur while parsing.
See Also:
StartTag.getAttributes(), parseAttributes(int pos, int maxEnd)

ignoreWhenParsing

public void ignoreWhenParsing(int begin,
                              int end)
Causes the specified range of the source text to be ignored when parsing.

This method is usually used to exclude server tags or other non-HTML segments from the source text so that they do not interfere with the parsing of the surrounding HTML.

This is necessary because many server tags are used as attribute values and in other places within HTML tags, and very often contain characters that prevent the parser from recognising the surrounding tag.

For efficiency reasons, all segments to be ignored should be registered at once, without performing searches in between.

Parameters:
begin - the beginning character position in the source text.
end - the end character position in the source text.
See Also:
Segment.ignoreWhenParsing()
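
One way to picture the effect is a position-preserving mask: the ignored ranges are blanked out in a copy used for parsing, so character positions stay valid while the server tag no longer hides the surrounding HTML. The sketch below is JDK-only and hypothetical (MaskedSource, parseText and the use of spaces are assumptions, not the library's mechanism). Rebuilding the masked copy after every registration also hints at why registering all ranges before searching is cheaper.

```java
import java.util.ArrayList;
import java.util.List;

final class MaskedSource {
    private final String text;
    private final List<int[]> ignored = new ArrayList<>();
    private String masked; // rebuilt lazily after each registration

    MaskedSource(String text) { this.text = text; }

    // Registers the range [begin, end) to be blanked out for parsing.
    void ignoreWhenParsing(int begin, int end) {
        ignored.add(new int[] {begin, end});
        masked = null; // invalidate the cached copy
    }

    // Text with the same length as the original, ignored ranges as spaces.
    String parseText() {
        if (masked == null) {
            char[] chars = text.toCharArray();
            for (int[] range : ignored) {
                for (int i = range[0]; i < range[1]; i++) chars[i] = ' ';
            }
            masked = new String(chars);
        }
        return masked;
    }

    public static void main(String[] args) {
        MaskedSource source = new MaskedSource("<a href=\"<%= url %>\">x</a>");
        source.ignoreWhenParsing(9, 19); // the server tag <%= url %>
        System.out.println(source.parseText());
    }
}
```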

ignoreWhenParsing

public void ignoreWhenParsing(java.util.Collection segments)
Causes all of the segments in the specified collection to be ignored when parsing.

This is equivalent to calling Segment.ignoreWhenParsing() on each segment in the collection.


setLogWriter

public void setLogWriter(java.io.Writer writer)
Sets the destination for log messages.

By default, the log writer is set to null, which suppresses log messages.

Parameters:
writer - the java.io.Writer where log messages will be sent
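
The null-by-default behaviour can be sketched with a small JDK-only class. QuietLogger and its log method are hypothetical and only mirror the documented contract: with no writer set, messages are dropped; once a java.io.Writer is supplied, messages go to it.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class QuietLogger {
    private Writer logWriter; // null by default, so messages are suppressed

    void setLogWriter(Writer writer) { logWriter = writer; }

    void log(String message) {
        if (logWriter == null) return; // nothing to write to
        try {
            logWriter.write(message);
            logWriter.write('\n');
            logWriter.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        QuietLogger logger = new QuietLogger();
        logger.log("dropped");           // no writer yet
        StringWriter out = new StringWriter();
        logger.setLogWriter(out);
        logger.log("kept");
        System.out.print(out);           // kept
    }
}
```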





