Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

Segment (Jericho HTML Parser 1.5-dev1)

au.id.jericho.lib.html


Class Segment
java.lang.Object
  au.id.jericho.lib.html.Segment


All Implemented Interfaces: 
java.lang.CharSequence, java.lang.Comparable


Direct Known Subclasses: 
Attribute, CharacterReference, Element, FormControl, au.id.jericho.lib.html.internal.SequentialListSegment, Source, Tag



public class Segment
extends java.lang.Object
implements java.lang.Comparable, java.lang.CharSequence


Represents a segment of a Source document.
 

 The "span" of a segment is defined by the combination of its begin and end character positions.


















Constructor Summary


Segment(Source source, int begin, int end)
          Constructs a new Segment with the specified Source and the specified begin and end character positions.


 






Method Summary



char charAt(int index)
          Returns the character at the specified index.

int compareTo(java.lang.Object o)
          Compares this Segment object to another object.

boolean encloses(int pos)
          Indicates whether this segment encloses the specified character position in the Source document.

boolean encloses(Segment segment)
          Indicates whether this Segment encloses the specified Segment.

boolean equals(java.lang.Object object)
          Compares the specified object with this Segment for equality.

java.util.List findAllCharacterReferences()
          Returns a list of all CharacterReference objects enclosed by this segment.

java.util.List findAllComments()
          Returns a list of all Segment objects enclosed by this segment that represent HTML comments.

java.util.List findAllElements()
          Returns a list of all Element objects enclosed by this segment.

java.util.List findAllElements(java.lang.String name)
          Returns a list of all Element objects with the specified name enclosed by this segment.

java.util.List findAllStartTags()
          Returns a list of all StartTag objects enclosed by this segment.

java.util.List findAllStartTags(java.lang.String name)
          Returns a list of all StartTag objects with the specified name enclosed by this segment.

java.util.List findAllStartTags(java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive)
          Returns a list of all StartTag objects with the specified attribute name/value pair enclosed by this segment.

java.util.List findFormControls()
          Returns a list of the FormControl objects enclosed by this segment.

FormFields findFormFields()
          Returns the FormFields object representing all form fields enclosed by this segment.

java.util.List findWords()
          Deprecated. no replacement

int getBegin()
          Returns the character position in the Source where this segment begins.

java.lang.String getDebugInfo()
          Returns a string representation of this object useful for debugging purposes.

int getEnd()
          Returns the character position in the Source where this segment ends.

java.lang.String getSourceText()
          Deprecated. Use the toString() method instead

java.lang.String getSourceTextNoWhitespace()
          Deprecated. Use the more useful CharacterReference.decodeCollapseWhiteSpace(CharSequence) method instead.

int hashCode()
          Returns a hash code value for the segment.

void ignoreWhenParsing()
          Causes this segment to be ignored when parsing.

boolean isComment()
          Indicates whether this Segment represents an HTML comment.

static boolean isWhiteSpace(char ch)
          Indicates whether the specified character is white space.

int length()
          Returns the length of the segment.

Attributes parseAttributes()
          Parses any Attributes within this segment.

java.lang.CharSequence subSequence(int beginIndex, int endIndex)
          Returns a new character sequence that is a subsequence of this sequence.

java.lang.String toString()
          Returns the source text of this segment as a String.


 


Methods inherited from class java.lang.Object


getClass, notify, notifyAll, wait, wait, wait


 











Constructor Detail




Segment
public Segment(Source source,
               int begin,
               int end)

Constructs a new Segment with the specified Source and the specified begin and end character positions.

Parameters:
source - the source document.
begin - the character position in the source where this segment begins.
end - the character position in the source where this segment ends.







Method Detail




getBegin
public final int getBegin()

Returns the character position in the Source where this segment begins.






Returns:
the character position in the Source where this segment begins.





getEnd
public final int getEnd()

Returns the character position in the Source where this segment ends.






Returns:
the character position in the Source where this segment ends.





equals
public final boolean equals(java.lang.Object object)

Compares the specified object with this Segment for equality.
 
 Returns true if and only if the specified object is also a Segment,
 and both segments have the same Source, and the same begin and end positions.





Parameters:
object - the object to be compared for equality with this Segment.
Returns:
true if the specified object is equal to this Segment, otherwise false.





hashCode
public int hashCode()

Returns a hash code value for the segment.
 
 The current implementation returns the sum of the begin and end positions, although this is not
 guaranteed in future versions.






Returns:
a hash code value for the segment.
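
The equality rule and the documented hash code can be sketched with a small stand-in class. SpanKey below is hypothetical and is not the library's Segment; it assumes that "the same Source" means the same object reference, and it copies the begin-plus-end hash only because the text above describes it as the current behaviour.

```java
// Hypothetical stand-in for a segment; this is not the library's Segment class.
final class SpanKey {
    private final Object source; // stands in for the Source document
    private final int begin;
    private final int end;

    SpanKey(Object source, int begin, int end) {
        this.source = source;
        this.begin = begin;
        this.end = end;
    }

    @Override
    public boolean equals(Object object) {
        if (this == object) return true;
        if (!(object instanceof SpanKey)) return false;
        SpanKey other = (SpanKey) object;
        // Assumption: "same Source" is modelled as reference identity.
        return source == other.source && begin == other.begin && end == other.end;
    }

    @Override
    public int hashCode() {
        // Sum of the begin and end positions, as documented for the current implementation.
        return begin + end;
    }
}
```

With this scheme the spans (2, 7) and (3, 6) share a hash code although they are not equal. That is allowed by the hashCode contract, but it means hash-based collections holding many overlapping spans may see frequent collisions.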





length
public final int length()

Returns the length of the segment.
 This is defined as the number of characters between the begin and end positions.


Specified by:
length in interface java.lang.CharSequence



Returns:
the length of the segment.





encloses
public final boolean encloses(Segment segment)

Indicates whether this Segment encloses the specified Segment.





Parameters:
segment - the segment to be tested for being enclosed by this segment.
Returns:
true if this Segment encloses the specified Segment, otherwise false.





encloses
public final boolean encloses(int pos)

Indicates whether this segment encloses the specified character position in the Source document.
 
 This is the case if getBegin() <= pos < getEnd().





Parameters:
pos - the position in the source document to be tested.
Returns:
true if this segment encloses the specified position, otherwise false.
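
The condition getBegin() <= pos < getEnd() describes a half-open interval: the begin position is inside the segment and the end position is not. A minimal sketch of that test follows; EnclosesSketch and its parameters are illustrative names, not part of the library.

```java
// Illustrative helper only; the library exposes this test as Segment.encloses(int).
final class EnclosesSketch {
    // True if begin <= pos < end: begin is inclusive, end is exclusive.
    static boolean encloses(int begin, int end, int pos) {
        return begin <= pos && pos < end;
    }
}
```

For a span with begin 3 and end 8, positions 3 through 7 are enclosed and position 8 is not. A zero-length span, where begin equals end, encloses no position at all.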





isComment
public boolean isComment()

Indicates whether this Segment represents an HTML comment.
 
 An HTML comment is an area of the source document enclosed by the delimiters
 <!-- on the left and --> on the right.
 

 The HTML 4.01 Specification section 3.2.4
 states that the end of comment delimiter may contain white space between the "--" and ">" characters,
 but this library does not recognise end of comment delimiters containing white space.






Returns:
true if this Segment represents an HTML comment, otherwise false.
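
The delimiter rule above can be illustrated with plain string searching. This is only a sketch of the rule as stated, not the library's parser; CommentSketch and commentEnd are hypothetical names.

```java
// Hypothetical helper showing the delimiter rule; not how the library parses comments.
final class CommentSketch {
    // Returns the position just past the closing "-->" of a comment that starts at begin,
    // or -1 if the text at begin does not start a terminated comment.
    static int commentEnd(String text, int begin) {
        if (!text.startsWith("<!--", begin)) return -1;
        // Only the exact sequence "-->" closes the comment; "-- >" is not recognised.
        int close = text.indexOf("-->", begin + 4);
        return close == -1 ? -1 : close + 3;
    }
}
```

Under this rule the text "<!-- x -- >" contains no complete comment, because the only candidate end delimiter has white space between "--" and ">".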





toString
public java.lang.String toString()

Returns the source text of this segment as a String.
 
 The returned String is newly created with every call to this method, unless this
 segment is itself a Source object.
 

 Note that before version 1.5 this returned a representation of this object useful for debugging purposes,
 which can now be obtained via the getDebugInfo() method.


Specified by:
toString in interface java.lang.CharSequence



Returns:
the source text of this segment as a String.





findAllStartTags
public java.util.List findAllStartTags()

Returns a list of all StartTag objects enclosed by this segment.






Returns:
a list of all StartTag objects enclosed by this segment.





findAllStartTags
public java.util.List findAllStartTags(java.lang.String name)

Returns a list of all StartTag objects with the specified name enclosed by this segment.
 
 If the name argument is null, all StartTags are returned.





Parameters:
name - the name of the StartTags to find.
Returns:
a list of all StartTag objects with the specified name enclosed by this segment.





findAllStartTags
public java.util.List findAllStartTags(java.lang.String attributeName,
                                       java.lang.String value,
                                       boolean valueCaseSensitive)

Returns a list of all StartTag objects with the specified attribute name/value pair enclosed by this segment.





Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
Returns:
a list of all StartTag objects with the specified attribute name/value pair enclosed by this segment.





findAllComments
public java.util.List findAllComments()

Returns a list of all Segment objects enclosed by this segment that represent HTML comments.






Returns:
a list of all Segment objects enclosed by this segment that represent HTML comments.





findAllElements
public java.util.List findAllElements()

Returns a list of all Element objects enclosed by this segment.






Returns:
a list of all Element objects enclosed by this segment.





findAllElements
public java.util.List findAllElements(java.lang.String name)

Returns a list of all Element objects with the specified name enclosed by this segment.
 
 If the name argument is null, all Elements are returned.





Parameters:
name - the name of the Elements to find.
Returns:
a list of all Element objects with the specified name enclosed by this segment.





findAllCharacterReferences
public java.util.List findAllCharacterReferences()

Returns a list of all CharacterReference objects enclosed by this segment.






Returns:
a list of all CharacterReference objects enclosed by this segment.





findFormControls
public java.util.List findFormControls()

Returns a list of the FormControl objects enclosed by this segment.






Returns:
a list of the FormControl objects enclosed by this segment.





findFormFields
public FormFields findFormFields()

Returns the FormFields object representing all form fields enclosed by this segment.
 
 This is equivalent to FormFields.constructFrom(findFormControls())






Returns:
the FormFields object representing all form fields enclosed by this segment.
See Also:
findFormControls()





parseAttributes
public Attributes parseAttributes()

Parses any Attributes within this segment.
 This method is only used in the unusual situation where attributes exist outside of a start tag.
 The StartTag.getAttributes() method should be used in normal situations.
 
 This is equivalent to source.parseAttributes(this.getBegin(),this.getEnd())






Returns:
the Attributes within this segment, or null if too many errors occur while parsing.





ignoreWhenParsing
public void ignoreWhenParsing()

Causes this segment to be ignored when parsing.
 
 This is equivalent to source.ignoreWhenParsing(this.getBegin(),this.getEnd())





See Also:
Source.ignoreWhenParsing(int begin, int end), 
Source.ignoreWhenParsing(Collection segments)





compareTo
public int compareTo(java.lang.Object o)

Compares this Segment object to another object.
 
 If the argument is not a Segment, a ClassCastException is thrown.
 

 A segment is considered to be before another segment if its begin position is earlier,
 or in the case that both segments begin at the same position, its end position is earlier.
 

 Segments that begin and end at the same position are considered equal for
 the purposes of this comparison, even if they relate to different source documents.
 

 Note: this class has a natural ordering that is inconsistent with equals.
 This means that this method may return zero in some cases where calling the
 equals(Object) method with the same argument returns false.


Specified by:
compareTo in interface java.lang.Comparable


Parameters:
o - the segment to be compared
Returns:
a negative integer, zero, or a positive integer as this segment is before, equal to, or after the specified segment.
Throws:
java.lang.ClassCastException - if the argument is not a Segment
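
The ordering rule and its inconsistency with equals can be sketched with a stand-in class. OrderedSpan is hypothetical; it uses a generic Comparable for brevity, whereas the method documented here takes an Object and throws a ClassCastException, and it assumes that source documents are compared by reference.

```java
// Hypothetical stand-in showing the ordering rule; not the library's Segment class.
final class OrderedSpan implements Comparable<OrderedSpan> {
    private final Object source; // stands in for the Source document
    private final int begin;
    private final int end;

    OrderedSpan(Object source, int begin, int end) {
        this.source = source;
        this.begin = begin;
        this.end = end;
    }

    @Override
    public int compareTo(OrderedSpan other) {
        // Earlier begin sorts first; on a tie, earlier end sorts first.
        if (begin != other.begin) return begin < other.begin ? -1 : 1;
        if (end != other.end) return end < other.end ? -1 : 1;
        return 0; // the source document is deliberately not consulted
    }

    @Override
    public boolean equals(Object object) {
        if (!(object instanceof OrderedSpan)) return false;
        OrderedSpan other = (OrderedSpan) object;
        // Assumption: "same Source" is modelled as reference identity.
        return source == other.source && begin == other.begin && end == other.end;
    }

    @Override
    public int hashCode() {
        return begin + end;
    }
}
```

Two spans with identical positions in different documents compare as zero but are not equal. One practical consequence is that a sorted collection such as java.util.TreeSet, which relies on compareTo, would keep only one of them.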





isWhiteSpace
public static final boolean isWhiteSpace(char ch)

Indicates whether the specified character is white space.
 
 The HTML 4.01 Specification section 9.1
 specifies the following white space characters:

  space (U+0020)
  tab (U+0009)
  form feed (U+000C)
  line feed (U+000A)
  carriage return (U+000D)
  zero-width space (U+200B)

 Despite the explicit inclusion of the zero-width space in the HTML specification, Microsoft IE6 does not
 recognise it as white space and renders it as an unprintable character (empty square).
 Even zero-width spaces included using the numeric character reference &#x200B; are rendered this way.
 

 Note that in versions prior to 1.5, this method did not recognise form feeds or zero-width spaces as white space.





Parameters:
ch - the character to test.
Returns:
true if the specified character is white space, otherwise false.
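
The character list above translates directly into a membership test. The sketch below mirrors that list and nothing more; WhiteSpaceSketch is an illustrative name and the library's own implementation may be written differently.

```java
// Illustrative re-statement of the documented character list; not the library's source.
final class WhiteSpaceSketch {
    static boolean isWhiteSpace(char ch) {
        switch (ch) {
            case ' ':      // space (U+0020)
            case '\t':     // tab (U+0009)
            case '\f':     // form feed (U+000C)
            case '\n':     // line feed (U+000A)
            case '\r':     // carriage return (U+000D)
            case '\u200B': // zero-width space (U+200B)
                return true;
            default:
                return false;
        }
    }
}
```

Characters outside the list, such as the no-break space (U+00A0), are not treated as white space by this definition.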





getDebugInfo
public java.lang.String getDebugInfo()

Returns a string representation of this object useful for debugging purposes.






Returns:
a string representation of this object useful for debugging purposes.





charAt
public char charAt(int index)

Returns the character at the specified index.
 
 This is logically equivalent to toString().charAt(index)
 for valid argument values 0 <= index < length().
 

 However because this implementation works directly on the underlying document source string,
 it should not be assumed that an IndexOutOfBoundsException will be thrown
 for an invalid argument value.


Specified by:
charAt in interface java.lang.CharSequence


Parameters:
index - the index of the character.
Returns:
the character at the specified index.





subSequence
public final java.lang.CharSequence subSequence(int beginIndex,
                                                int endIndex)

Returns a new character sequence that is a subsequence of this sequence.
 
 This is logically equivalent to toString().subSequence(beginIndex,endIndex)
 for valid values of beginIndex and endIndex.
 

 However because this implementation works directly on the underlying document source string,
 it should not be assumed that an IndexOutOfBoundsException will be thrown
 for invalid argument values as described in the String.subSequence(int,int) method.


Specified by:
subSequence in interface java.lang.CharSequence


Parameters:
beginIndex - the begin index, inclusive.
endIndex - the end index, exclusive.
Returns:
a new character sequence that is a subsequence of this sequence.
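
One way to picture the caveat on charAt and subSequence is a view that offsets each index into the full document text without checking it against its own length. SourceView below is a hypothetical sketch under that assumption; it is not the library's implementation.

```java
// Hypothetical view over a backing string; not the library's Segment class.
final class SourceView implements CharSequence {
    private final String text; // the whole document
    private final int begin;   // inclusive
    private final int end;     // exclusive

    SourceView(String text, int begin, int end) {
        this.text = text;
        this.begin = begin;
        this.end = end;
    }

    public int length() {
        return end - begin;
    }

    public char charAt(int index) {
        // Offsets straight into the document; the index is not checked against length().
        return text.charAt(begin + index);
    }

    public CharSequence subSequence(int beginIndex, int endIndex) {
        // Same approach: only the bounds of the whole document are enforced.
        return text.subSequence(begin + beginIndex, begin + endIndex);
    }

    @Override
    public String toString() {
        return text.substring(begin, end);
    }
}
```

For a view of "bold" inside "<b>bold</b>", charAt(4) in this sketch returns the '<' that follows the view instead of throwing an IndexOutOfBoundsException, which is why callers should validate indexes themselves.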





getSourceText
public java.lang.String getSourceText()

Deprecated. Use the toString() method instead

Returns the source text of this segment.
 
 This method has been deprecated as of version 1.5 as it now duplicates the functionality of the toString() method.






Returns:
the source text of this segment.





getSourceTextNoWhitespace
public final java.lang.String getSourceTextNoWhitespace()

Deprecated. Use the more useful CharacterReference.decodeCollapseWhiteSpace(CharSequence) method instead.

Returns the source text of this segment without white space.
 
 All leading and trailing white space is omitted, and any sections of internal white space are replaced by a single space.
 

 This method has been deprecated as of version 1.5 as it is no longer used internally and
 was never very useful as a public method.
 It is similar to the new CharacterReference.decodeCollapseWhiteSpace(CharSequence) method, but
 does not decode the text after collapsing the white space.
 







Returns:
the source text of this segment without white space.
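
The collapsing behaviour described above can be sketched as a single pass over the text. CollapseSketch is a hypothetical helper, and it assumes that white space is defined by the same character list as isWhiteSpace(char); like the deprecated method, it does not decode character references.

```java
// Hypothetical helper showing white space collapsing; not the library's implementation.
final class CollapseSketch {
    // Assumption: the same character list that isWhiteSpace(char) documents.
    static boolean isWhiteSpace(char ch) {
        return ch == ' ' || ch == '\t' || ch == '\f' || ch == '\n' || ch == '\r' || ch == '\u200B';
    }

    static String collapse(CharSequence text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean pendingSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (isWhiteSpace(ch)) {
                // Remember the run, but drop it entirely if nothing has been written yet.
                pendingSpace = out.length() > 0;
            } else {
                // A run is emitted as one space only when more text follows it.
                if (pendingSpace) out.append(' ');
                pendingSpace = false;
                out.append(ch);
            }
        }
        return out.toString();
    }
}
```

Because the pending space is written only when a later non-white-space character arrives, trailing white space never reaches the output.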





findWords
public final java.util.List findWords()

Deprecated. no replacement

Returns a list of Segment objects representing every word in this segment separated by white space.
 Note that any markup contained in this segment will be regarded as normal text for the purposes of this method.
 
 This method has been deprecated as of version 1.5 as it has no discernible use.






Returns:
a list of Segment objects representing every word in this segment separated by white space.














  