doc.api.au.id.jericho.lib.html.Source.html Maven / Gradle / Ivy


Jericho HTML Parser is a simple but powerful java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

There is a newer version: 2.3








Source (Jericho HTML Parser 1.5-dev1)





















  









au.id.jericho.lib.html


Class Source
java.lang.Object
  au.id.jericho.lib.html.Segment
      au.id.jericho.lib.html.Source


All Implemented Interfaces: 
java.lang.CharSequence, java.lang.Comparable



public class Source
extends Segment


Represents a source HTML document.
 

 Note that many of the useful functions which can be performed on the source document are
 defined in its superclass, Segment.
 The Source object is itself a Segment which spans the entire document.
 

 Most of the methods defined in this class are useful for determining the elements and tags
 surrounding or neighbouring a particular character position in the document.
 

 IMPORTANT NOTE: Because HTML allows '<' characters within attribute values
 (see section 5.3.2 of the HTML spec),
 it is theoretically impossible to determine with certainty whether
 any given '<' character in a source document is the start of a tag
 without having parsed from the beginning of the document (which Jericho HTML Parser doesn't do).
 For this reason, the parser may reject a start tag completely if its attributes are not
 properly formed, although it does try to provide some leniency.
 In XHTML, such characters must be represented in attribute values as character entities
 (see section 3.1 of the XML spec).
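
 To illustrate why a '<' character cannot be classified in isolation, the following sketch scans a fragment from its beginning and tracks whether it is inside a quoted attribute value. It is plain Java that does not use Jericho HTML Parser and is not the parser's actual algorithm; the class and method names are hypothetical, and comments, server tags and unquoted values are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Jericho HTML Parser. */
final class TagStartScanner {
    /**
     * Returns the positions of '<' characters that are not inside a quoted
     * attribute value, found by scanning from the start of the text.
     */
    static List<Integer> tagStarts(CharSequence text) {
        List<Integer> starts = new ArrayList<>();
        boolean inTag = false;   // true between an unquoted '<' and its closing '>'
        char quote = 0;          // the currently open quote character, or 0 if none
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;            // closing quote ends the value
            } else if (inTag) {
                if (c == '"' || c == '\'') quote = c; // opening quote of a value
                else if (c == '>') inTag = false;     // end of the tag
            } else if (c == '<') {
                starts.add(i);
                inTag = true;
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // The '<' at position 11 lies inside the title attribute value.
        System.out.println(tagStarts("<a title=\"x<y\">link</a>")); // prints [0, 19]
    }
}
```

 Starting the same scan at position 11 instead of 0 would report that '<' as a tag start, which is the ambiguity described in the note above.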




See Also:
Segment















Constructor Summary


Source(java.lang.CharSequence text)



          Constructs a new Source object with the specified text.


 






Method Summary



 Segment
findEnclosingComment(int pos)



          Returns a Segment spanning the HTML comment that encloses the specified position in the source document.



 Element
findEnclosingElement(int pos)



          Returns the most nested Element enclosing the specified position in the source document.



 Element
findEnclosingElement(int pos,
                     java.lang.String name)



          Returns the most nested Element with the specified name enclosing the specified position in the source document.



 StartTag
findEnclosingStartTag(int pos)



          Returns the StartTag enclosing the specified position in the source document.



 CharacterReference
findNextCharacterReference(int pos)



          Returns the CharacterReference beginning at or immediately following the specified position in the source document.



 StartTag
findNextComment(int pos)



          Returns the Comment beginning at or immediately following the specified position in the source document.



 EndTag
findNextEndTag(int pos)



          Returns the EndTag beginning at or immediately following the specified position in the source document.



 EndTag
findNextEndTag(int pos,
               java.lang.String name)



          Returns the EndTag with the specified name beginning at or immediately following the specified position in the source document.



 StartTag
findNextStartTag(int pos)



          Returns the StartTag beginning at or immediately following the specified position in the source document.



 StartTag
findNextStartTag(int pos,
                 java.lang.String name)



          Returns the StartTag with the specified name beginning at or immediately following the specified position in the source document.



 StartTag
findNextStartTag(int pos,
                 java.lang.String attributeName,
                 java.lang.String value,
                 boolean valueCaseSensitive)



          Returns the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.



 Tag
findNextTag(int pos)



          Returns the tag (either a StartTag or EndTag) beginning at or immediately following the specified position in the source document.



 CharacterReference
findPreviousCharacterReference(int pos)



          Returns the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document.



 EndTag
findPreviousEndTag(int pos,
                   java.lang.String name)



          Returns the EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.



 StartTag
findPreviousStartTag(int pos)



          Returns the StartTag at or immediately preceding (or enclosing) the specified position in the source document.



 StartTag
findPreviousStartTag(int pos,
                     java.lang.String name)



          Returns the StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.



 Element
getElementById(java.lang.String id)



          Returns the Element with the specified id attribute value.



 java.util.Iterator
getNextTagIterator(int pos)



          Returns an iterator of Tag objects beginning at or immediately following the specified position in the source document.



 void
ignoreWhenParsing(java.util.Collection segments)



          Causes all of the segments in the specified collection to be ignored when parsing.



 void
ignoreWhenParsing(int begin,
                  int end)



          Causes the specified range of the source text to be ignored when parsing.



 Attributes
parseAttributes(int pos,
                int maxEnd)



          Parses any Attributes starting at the specified position.



 Attributes
parseAttributes(int pos,
                int maxEnd,
                int maxErrorCount)



          Parses any Attributes starting at the specified position.



 void
setLogWriter(java.io.Writer writer)



          Sets the destination for log messages.



 java.lang.String
toString()



          Returns the source text as a String.


 


Methods inherited from class au.id.jericho.lib.html.Segment


charAt, compareTo, encloses, encloses, equals, findAllCharacterReferences, findAllComments, findAllElements, findAllElements, findAllStartTags, findAllStartTags, findAllStartTags, findFormControls, findFormFields, findWords, getBegin, getDebugInfo, getEnd, getSourceText, getSourceTextNoWhitespace, hashCode, ignoreWhenParsing, isComment, isWhiteSpace, length, parseAttributes, subSequence


 


Methods inherited from class java.lang.Object


getClass, notify, notifyAll, wait, wait, wait


 











Constructor Detail




Source
public Source(java.lang.CharSequence text)

Constructs a new Source object with the specified text.

Parameters:
text - the source text.







Method Detail




toString
public java.lang.String toString()

Returns the source text as a String.
 
 If the original CharSequence supplied when this instance was constructed was not a String,
 the first conversion of the text to a String is cached for subsequent calls.


Specified by:
toString in interface java.lang.CharSequence
Overrides:
toString in class Segment



Returns:
the source text as a String.
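
The caching behaviour described above can be pictured with the following sketch. It is a hypothetical wrapper written for illustration, not the Source class itself, and it assumes nothing about how Jericho HTML Parser stores its text internally.

```java
/** Hypothetical wrapper for illustration only; not Jericho's Source class. */
final class CachedText implements CharSequence {
    private final CharSequence text;
    private String cached;   // filled lazily on the first toString() call

    CachedText(CharSequence text) {
        this.text = text;
        // A String needs no conversion, so it can serve as its own cache.
        if (text instanceof String) cached = (String) text;
    }

    public int length() { return text.length(); }

    public char charAt(int index) { return text.charAt(index); }

    public CharSequence subSequence(int start, int end) {
        return text.subSequence(start, end);
    }

    @Override
    public String toString() {
        if (cached == null) cached = text.toString();   // convert once, reuse afterwards
        return cached;
    }
}
```

When the wrapped text is a StringBuilder, repeated calls to toString() on this wrapper return the same String instance rather than building a new one each time.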





getElementById
public Element getElementById(java.lang.String id)

Returns the Element with the specified id attribute value.
 
 This simulates the script method
 getElementById
 defined in DOM HTML level 1.
 

 This is equivalent to findNextStartTag(0,"id",id,true).getElement().
 

 A well-formed HTML document should have no more than one element with any given id attribute value.
 

 Calls to this method are not cached.


Parameters:
id - the id attribute value (case sensitive) to search for, must not be null.
Returns:
the Element with the specified id attribute value.





findPreviousStartTag
public StartTag findPreviousStartTag(int pos)

Returns the StartTag at or immediately preceding (or enclosing) the specified position in the source document.
 
 If the specified position is within an HTML comment, the segment
 spanning the comment is returned.


Parameters:
pos - the position in the source document from which to start the search.
Returns:
the StartTag at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists.





findPreviousStartTag
public StartTag findPreviousStartTag(int pos,
                                     java.lang.String name)

Returns the StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
 
 Start tags positioned within an HTML comment are ignored, but the comment segment itself is treated as a start tag.
 

 Specifying a null name parameter is equivalent to findPreviousStartTag(pos).


Parameters:
pos - the position in the source document from which to start the search.
name - the name of the StartTag to search for.
Returns:
the StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists.





findNextStartTag
public StartTag findNextStartTag(int pos)

Returns the StartTag beginning at or immediately following the specified position in the source document.
 
 StartTags positioned within an HTML comment are ignored, but subsequent comment segments are treated as start tags.


Parameters:
pos - the position in the source document from which to start the search.
Returns:
the StartTag beginning at or immediately following the specified position in the source document, or null if none exists.





findNextStartTag
public StartTag findNextStartTag(int pos,
                                 java.lang.String name)

Returns the StartTag with the specified name beginning at or immediately following the specified position in the source document.
 
 Start tags positioned within an HTML comment are ignored.
 

 Specifying a null name parameter is equivalent to findNextStartTag(pos).
 

 Specifying a name parameter ending in a colon (:) searches for all start tags in the specified XML namespace.


Parameters:
pos - the position in the source document from which to start the search.
name - the name of the StartTag to search for.
Returns:
the StartTag with the specified name beginning at or immediately following the specified position in the source document, or null if none exists.
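
The three forms of the name parameter described above (null, a plain name, and a namespace prefix ending in a colon) can be summarised by the following sketch. The class is hypothetical and is not how the library is implemented; in particular, comparing names without regard to case is an assumption made here, since this page does not state how tag names are compared.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of Jericho HTML Parser. */
final class TagNameFilter {
    /**
     * Returns true if a tag named tagName would be selected by searchName:
     * null selects every tag, a name ending in ':' selects every tag in that
     * namespace, and any other name must match the whole tag name.
     */
    static boolean matches(String tagName, String searchName) {
        if (searchName == null) return true;
        // Assumption: names are compared case-insensitively.
        String tag = tagName.toLowerCase(Locale.ROOT);
        String search = searchName.toLowerCase(Locale.ROOT);
        return search.endsWith(":") ? tag.startsWith(search) : tag.equals(search);
    }
}
```

Under these assumptions, a search name of "o:" selects a tag named "o:p" but not one named "p".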





findNextStartTag
public StartTag findNextStartTag(int pos,
                                 java.lang.String attributeName,
                                 java.lang.String value,
                                 boolean valueCaseSensitive)

Returns the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.
 
 Calls to this method are not cached.


Parameters:
pos - the position in the source document from which to start the search.
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
Returns:
the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.





findNextComment
public StartTag findNextComment(int pos)

Returns the Comment beginning at or immediately following the specified position in the source document.
 
 If the specified position is within a comment, the comment following the enclosing comment is returned.


Parameters:
pos - the position in the source document from which to start the search.
Returns:
the Comment beginning at or immediately following the specified position in the source document, or null if none exists.





findPreviousEndTag
public EndTag findPreviousEndTag(int pos,
                                 java.lang.String name)

Returns the EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
 
 End tags positioned within an HTML comment are ignored.


Parameters:
pos - the position in the source document from which to start the search.
name - the name of the EndTag to search for, must not be null.
Returns:
the EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists.





findNextEndTag
public EndTag findNextEndTag(int pos)

Returns the EndTag beginning at or immediately following the specified position in the source document.
 
 End tags positioned within an HTML comment are ignored.


Parameters:
pos - the position in the source document from which to start the search.
Returns:
the EndTag beginning at or immediately following the specified position in the source document, or null if none exists.





findNextEndTag
public EndTag findNextEndTag(int pos,
                             java.lang.String name)

Returns the EndTag with the specified name beginning at or immediately following the specified position in the source document.
 
 End tags positioned within an HTML comment are ignored.


Parameters:
pos - the position in the source document from which to start the search.
name - the name of the EndTag to search for, must not be null.
Returns:
the EndTag with the specified name beginning at or immediately following the specified position in the source document, or null if none exists.





getNextTagIterator
public java.util.Iterator getNextTagIterator(int pos)

Returns an iterator of Tag objects beginning at or immediately following the specified position in the source document.
 
 Tags positioned within an HTML comment are ignored, but the comment segments themselves are treated as start tags.


Parameters:
pos - the position in the source document from which to start the iteration.
Returns:
an iterator of Tag objects beginning at or immediately following the specified position in the source document.





findNextTag
public Tag findNextTag(int pos)

Returns the tag (either a StartTag or EndTag) beginning at or immediately following the specified position in the source document.
 
 IMPLEMENTATION NOTE: Sequential tags in a document should be retrieved using the iterator from
 getNextTagIterator(int pos) as it is far more efficient than using multiple calls to this method.


Parameters:
pos - the position in the source document from which to start the search.
Returns:
the tag beginning at or immediately following the specified position in the source document, or null if none exists.
See Also:
getNextTagIterator(int pos)





findEnclosingStartTag
public StartTag findEnclosingStartTag(int pos)

Returns the StartTag enclosing the specified position in the source document.
 
 If the specified position is within an HTML comment, the segment
 spanning the comment is returned.
 

 A segment is considered to enclose a character position x if
segment.getBegin() <= x < segment.getEnd()


Parameters:
pos - the position in the source document.
Returns:
the StartTag enclosing the specified position in the source document, or null if the position is not within a StartTag.





findEnclosingComment
public Segment findEnclosingComment(int pos)

Returns a Segment spanning the HTML comment that encloses the specified position in the source document.
 
 A segment is considered to enclose a character position x if
segment.getBegin() <= x < segment.getEnd()


Parameters:
pos - the position in the source document.
Returns:
a Segment spanning the HTML comment that encloses the specified position in the source document, or null if the position is not within a comment.
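
The enclosing rule, segment.getBegin() <= x < segment.getEnd(), can be illustrated with a plain string search. The helper below is hypothetical and does not use Jericho HTML Parser; it returns a begin/end pair instead of a Segment, and treating an unterminated comment as running to the end of the text is an assumption of this sketch.

```java
/** Hypothetical helper for illustration only; not part of Jericho HTML Parser. */
final class CommentLocator {
    /**
     * Returns {begin, end} of the HTML comment enclosing pos, such that
     * begin <= pos < end, or null if pos is not within a comment.
     */
    static int[] enclosingComment(String text, int pos) {
        // Scan from the start so that "<!--" inside a comment is not mistaken for a new one.
        int begin = text.indexOf("<!--");
        while (begin != -1 && begin <= pos) {
            int close = text.indexOf("-->", begin + 4);
            // Assumption: an unterminated comment runs to the end of the text.
            int end = (close == -1) ? text.length() : close + 3;
            if (pos < end) return new int[] {begin, end};
            begin = text.indexOf("<!--", end);
        }
        return null;
    }
}
```

For the text "a<!-- b -->c" the comment spans positions 1 to 11, so position 10 (the closing '>') is enclosed while position 11 (the 'c') is not, because the end position is exclusive.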





findEnclosingElement
public Element findEnclosingElement(int pos)

Returns the most nested Element enclosing the specified position in the source document.
 
 If the specified position is within an HTML comment, the segment
 spanning the comment is returned.
 

 A segment is considered to enclose a character position x if
segment.getBegin() <= x < segment.getEnd()


Parameters:
pos - the position in the source document.
Returns:
the most nested Element enclosing the specified position in the source document, or null if the position is not within an Element.





findEnclosingElement
public Element findEnclosingElement(int pos,
                                    java.lang.String name)

Returns the most nested Element with the specified name enclosing the specified position in the source document.
 
 Elements positioned within an HTML comment are ignored, but the comment segment itself is treated as an Element.


Parameters:
pos - the position in the source document.
name - the name of the Element to search for.
Returns:
the most nested Element with the specified name enclosing the specified position in the source document, or null if none exists.





findPreviousCharacterReference
public CharacterReference findPreviousCharacterReference(int pos)

Returns the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document.
 
 Character references positioned within an HTML comment are NOT ignored.


Parameters:
pos - the position in the source document from which to start the search.
Returns:
the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists.





findNextCharacterReference
public CharacterReference findNextCharacterReference(int pos)

Returns the CharacterReference beginning at or immediately following the specified position in the source document.
 
 Character references positioned within an HTML comment are NOT ignored.


Parameters:
pos - the position in the source document from which to start the search.
Returns:
the CharacterReference beginning at or immediately following the specified position in the source document, or null if none exists.





parseAttributes
public Attributes parseAttributes(int pos,
                                  int maxEnd)

Parses any Attributes starting at the specified position.
 This method is only used in the unusual situation where attributes exist outside of a start tag.
 The StartTag.getAttributes() method should be used in normal situations.
 
 The returned Attributes segment will always begin at pos,
 and will end at the first occurrence of "/>" or ">" outside of a quoted attribute value,
 or at maxEnd, whichever comes first.
 

 Only returns null if the segment contains a major syntactical error
 or more than the default maximum number of
 minor syntactical errors.
 

 This is equivalent to
 parseAttributes(pos,maxEnd,Attributes.getDefaultMaxErrorCount())


Parameters:
pos - the position in the source document at the beginning of the attribute list
maxEnd - the maximum end position of the attribute list, or -1 if no maximum
Returns:
the Attributes starting at the specified position, or null if too many errors occur while parsing.
See Also:
StartTag.getAttributes(), 
Segment.parseAttributes()
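
The rule for where the returned Attributes segment ends can be sketched as follows. This is a hypothetical helper in plain Java, not the library's parsing code; it interprets "ends at" as the index where the delimiter begins, handles only single and double quotes, and performs no error counting.

```java
/** Hypothetical helper for illustration only; not part of Jericho HTML Parser. */
final class AttributeListEnd {
    /**
     * Returns the index of the first "/>" or ">" found at or after pos that is
     * outside a quoted attribute value, or the limit set by maxEnd if that
     * comes first. A maxEnd of -1 means no maximum.
     */
    static int find(CharSequence text, int pos, int maxEnd) {
        int limit = (maxEnd == -1 || maxEnd > text.length()) ? text.length() : maxEnd;
        char quote = 0;   // the currently open quote character, or 0 if none
        for (int i = pos; i < limit; i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;        // closing quote ends the value
            } else if (c == '"' || c == '\'') {
                quote = c;                        // opening quote of a value
            } else if (c == '>') {
                return i;
            } else if (c == '/' && i + 1 < text.length() && text.charAt(i + 1) == '>') {
                return i;
            }
        }
        return limit;
    }
}
```

For the text <img src="a>b.png" alt=x/> with pos set to 5, the '>' at position 11 is inside a quoted value and is skipped, so the list ends at position 24 where "/>" begins; with maxEnd set to 12 it ends at 12 instead.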





parseAttributes
public Attributes parseAttributes(int pos,
                                  int maxEnd,
                                  int maxErrorCount)

Parses any Attributes starting at the specified position.
 This method is only used in the unusual situation where attributes exist outside of a start tag.
 The StartTag.getAttributes() method should be used in normal situations.
 
 Only returns null if the segment contains a major syntactical error
 or more than the specified number of minor syntactical errors.
 

 The maxErrorCount argument overrides the default maximum number of minor errors allowed,
 which can be set using the Attributes.setDefaultMaxErrorCount(int) static method.
 

 See parseAttributes(int pos, int maxEnd) for more information.


Parameters:
pos - the position in the source document at the beginning of the attribute list
maxEnd - the maximum end position of the attribute list, or -1 if no maximum
maxErrorCount - the maximum number of minor errors allowed while parsing
Returns:
the Attributes starting at the specified position, or null if too many errors occur while parsing.
See Also:
StartTag.getAttributes(), 
parseAttributes(int pos, int maxEnd)





ignoreWhenParsing
public void ignoreWhenParsing(int begin,
                              int end)

Causes the specified range of the source text to be ignored when parsing.
 
 This method is usually used to exclude server tags or other non-HTML segments from the source text
 so that they do not interfere with the parsing of the surrounding HTML.
 

 This is necessary because many server tags are used as attribute values and in other places within
 HTML tags, and very often contain characters that prevent the parser from recognising the surrounding tag.
 

 For efficiency reasons, all segments to be ignored should be registered at once, without performing
 searches in between.


Parameters:
begin - the beginning character position in the source text.
end - the end character position in the source text.
See Also:
Segment.ignoreWhenParsing()
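
One way to picture the effect of ignoring a range is to blank it out in a separate copy of the text, so that every other character keeps its original position. The sketch below does exactly that in plain Java; the class is hypothetical, and whether Jericho HTML Parser works this way internally is an assumption rather than something this page states.

```java
/** Hypothetical helper for illustration only; not part of Jericho HTML Parser. */
final class ParseTextMask {
    /**
     * Returns a copy of text in which the characters from begin (inclusive)
     * to end (exclusive) are replaced by spaces. The length is unchanged, so
     * positions found in the copy are also valid in the original text.
     */
    static String mask(String text, int begin, int end) {
        char[] chars = text.toCharArray();
        for (int i = begin; i < end; i++) {
            chars[i] = ' ';
        }
        return new String(chars);
    }
}
```

In a fragment such as <input value="<%= get("v") %>">, the server tag contains quotes and angle brackets of its own. Masking the range from the "<%" to just after the "%>" leaves a single quoted, blank attribute value, and the surrounding tag can then be recognised.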





ignoreWhenParsing
public void ignoreWhenParsing(java.util.Collection segments)

Causes all of the segments in the specified collection to be ignored when parsing.
 
 This is equivalent to calling Segment.ignoreWhenParsing() on each segment in the collection.








setLogWriter
public void setLogWriter(java.io.Writer writer)

Sets the destination for log messages.
 
 By default, the log writer is set to null, which suppresses log messages.


Parameters:
writer - the java.io.Writer where log messages will be sent














  

