Segment (Jericho HTML Parser 1.5-dev1)
au.id.jericho.lib.html
Class Segment
java.lang.Object
au.id.jericho.lib.html.Segment
- All Implemented Interfaces:
- java.lang.CharSequence, java.lang.Comparable
- Direct Known Subclasses:
- Attribute, CharacterReference, Element, FormControl, au.id.jericho.lib.html.internal.SequentialListSegment, Source, Tag
- public class Segment
- extends java.lang.Object
- implements java.lang.Comparable, java.lang.CharSequence
Represents a segment of a Source document.
The "span" of a segment is defined by the combination of its begin and end character positions.
Constructor Summary
Segment(Source source, int begin, int end)
- Constructs a new Segment with the specified Source and the specified begin and end character positions.
Method Summary
char charAt(int index)
- Returns the character at the specified index.
int compareTo(java.lang.Object o)
- Compares this Segment object to another object.
boolean encloses(int pos)
- Indicates whether this segment encloses the specified character position in the Source document.
boolean encloses(Segment segment)
- Indicates whether this Segment encloses the specified Segment.
boolean equals(java.lang.Object object)
- Compares the specified object with this Segment for equality.
java.util.List findAllCharacterReferences()
- Returns a list of all CharacterReference objects enclosed by this segment.
java.util.List findAllComments()
- Returns a list of all Segment objects enclosed by this segment that represent HTML comments.
java.util.List findAllElements()
- Returns a list of all Element objects enclosed by this segment.
java.util.List findAllElements(java.lang.String name)
- Returns a list of all Element objects with the specified name enclosed by this segment.
java.util.List findAllStartTags()
- Returns a list of all StartTag objects enclosed by this segment.
java.util.List findAllStartTags(java.lang.String name)
- Returns a list of all StartTag objects with the specified name enclosed by this segment.
java.util.List findAllStartTags(java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive)
- Returns a list of all StartTag objects with the specified attribute name/value pair enclosed by this segment.
java.util.List findFormControls()
- Returns a list of the FormControl objects enclosed by this segment.
FormFields findFormFields()
- Returns the FormFields object representing all form fields enclosed by this segment.
java.util.List findWords()
- Deprecated. no replacement
int getBegin()
- Returns the character position in the Source where this segment begins.
java.lang.String getDebugInfo()
- Returns a string representation of this object useful for debugging purposes.
int getEnd()
- Returns the character position in the Source where this segment ends.
java.lang.String getSourceText()
- Deprecated. Use the toString() method instead.
java.lang.String getSourceTextNoWhitespace()
- Deprecated. Use the more useful CharacterReference.decodeCollapseWhiteSpace(CharSequence) method instead.
int hashCode()
- Returns a hash code value for the segment.
void ignoreWhenParsing()
- Causes this segment to be ignored when parsing.
boolean isComment()
- Indicates whether this Segment represents an HTML comment.
static boolean isWhiteSpace(char ch)
- Indicates whether the specified character is white space.
int length()
- Returns the length of the segment.
Attributes parseAttributes()
- Parses any Attributes within this segment.
java.lang.CharSequence subSequence(int beginIndex, int endIndex)
- Returns a new character sequence that is a subsequence of this sequence.
java.lang.String toString()
- Returns the source text of this segment as a String.
Methods inherited from class java.lang.Object
getClass, notify, notifyAll, wait, wait, wait
Constructor Detail
Segment
public Segment(Source source,
int begin,
int end)
- Constructs a new Segment with the specified Source and the specified begin and end character positions.
- Parameters:
source - the source document.
begin - the character position in the source where this segment begins.
end - the character position in the source where this segment ends.
Method Detail
getBegin
public final int getBegin()
- Returns the character position in the Source where this segment begins.
- Returns:
- the character position in the Source where this segment begins.
getEnd
public final int getEnd()
- Returns the character position in the Source where this segment ends.
- Returns:
- the character position in the Source where this segment ends.
equals
public final boolean equals(java.lang.Object object)
- Compares the specified object with this Segment for equality.
Returns true if and only if the specified object is also a Segment, and both segments have the same Source and the same begin and end positions.
- Parameters:
object - the object to be compared for equality with this Segment.
- Returns:
- true if the specified object is equal to this Segment, otherwise false.
hashCode
public int hashCode()
- Returns a hash code value for the segment.
The current implementation returns the sum of the begin and end positions, although this is not
guaranteed in future versions.
- Returns:
- a hash code value for the segment.
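The interplay between equals and the documented hash code can be sketched with a small stand-in class. HashedSpan and its fields are hypothetical names, not part of the library, and comparing sources by reference is an assumption of this sketch. It shows that distinct spans may share a hash code when their begin and end positions have the same sum.

```java
// Hypothetical stand-in for illustration; this is not the library's Segment class.
final class HashedSpan {
    final Object source; // stands in for the Source document (assumption)
    final int begin;
    final int end;

    HashedSpan(Object source, int begin, int end) {
        this.source = source;
        this.begin = begin;
        this.end = end;
    }

    // Equal only when the source, begin and end all match.
    // Comparing sources by reference is an assumption of this sketch.
    @Override
    public boolean equals(Object object) {
        if (!(object instanceof HashedSpan)) return false;
        HashedSpan other = (HashedSpan) object;
        return source == other.source && begin == other.begin && end == other.end;
    }

    // The documented current behaviour: the sum of the begin and end positions.
    @Override
    public int hashCode() {
        return begin + end;
    }

    public static void main(String[] args) {
        Object doc = new Object();
        HashedSpan a = new HashedSpan(doc, 2, 8);
        HashedSpan b = new HashedSpan(doc, 4, 6);
        System.out.println(a.hashCode() == b.hashCode()); // true: 2 + 8 == 4 + 6
        System.out.println(a.equals(b));                  // false: different spans
    }
}
```

Such collisions are permitted by the hashCode contract; they only affect how evenly hash-based collections distribute their entries.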
length
public final int length()
- Returns the length of the segment.
This is defined as the number of characters between the begin and end positions.
- Specified by:
length in interface java.lang.CharSequence
- Returns:
- the length of the segment.
encloses
public final boolean encloses(Segment segment)
- Indicates whether this Segment encloses the specified Segment.
- Parameters:
segment - the segment to be tested for being enclosed by this segment.
- Returns:
- true if this Segment encloses the specified Segment, otherwise false.
encloses
public final boolean encloses(int pos)
- Indicates whether this segment encloses the specified character position in the Source document.
This is the case if getBegin() <= pos < getEnd().
- Parameters:
pos - the position in the source document to be tested.
- Returns:
- true if this segment encloses the specified position, otherwise false.
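The rule getBegin() <= pos < getEnd() describes a half-open interval: the begin position is enclosed, the end position is not. A minimal sketch, using a hypothetical HalfOpenSpan class rather than the library's Segment:

```java
// Hypothetical stand-in for illustration; this is not the library's Segment class.
final class HalfOpenSpan {
    final int begin; // first enclosed position
    final int end;   // first position after the span

    HalfOpenSpan(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    // Mirrors the documented rule: getBegin() <= pos < getEnd().
    boolean encloses(int pos) {
        return begin <= pos && pos < end;
    }

    public static void main(String[] args) {
        HalfOpenSpan span = new HalfOpenSpan(3, 7);
        System.out.println(span.encloses(3)); // true: the begin position is enclosed
        System.out.println(span.encloses(6)); // true: the last enclosed position
        System.out.println(span.encloses(7)); // false: the end position is excluded
    }
}
```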
isComment
public boolean isComment()
- Indicates whether this Segment represents an HTML comment.
An HTML comment is an area of the source document enclosed by the delimiters <!-- on the left and --> on the right.
The HTML 4.01 Specification section 3.2.4 states that the end of comment delimiter may contain white space between the "--" and ">" characters,
but this library does not recognise end of comment delimiters containing white space.
- Returns:
- true if this Segment represents an HTML comment, otherwise false.
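One consequence of matching the end delimiter literally can be sketched as follows. CommentSketch and findFirstComment are hypothetical names, and the search strategy is an assumption made for illustration, not the library's parser. A closing delimiter written as "-- >" is skipped, so the comment runs on to the next literal "-->".

```java
// Hypothetical sketch of literal delimiter matching; not the library's parser.
final class CommentSketch {
    // Returns {begin, end} of the first comment in text, or null if none is found.
    // Returning null for an unterminated comment is an assumption of this sketch.
    static int[] findFirstComment(String text) {
        int begin = text.indexOf("<!--");
        if (begin < 0) return null;
        // Only the exact three characters "-->" count as the end delimiter.
        int close = text.indexOf("-->", begin + 4);
        if (close < 0) return null;
        return new int[] {begin, close + 3};
    }

    public static void main(String[] args) {
        String text = "a<!-- x -- >b<!-- y -->c";
        int[] span = findFirstComment(text);
        // The "-- >" after x is not an end delimiter, so the comment extends
        // to the "-->" after y.
        System.out.println(text.substring(span[0], span[1]));
    }
}
```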
toString
public java.lang.String toString()
- Returns the source text of this segment as a String.
The returned String is newly created with every call to this method, unless this segment is itself a Source object.
Note that before version 1.5 this returned a representation of this object useful for debugging purposes,
which can now be obtained via the getDebugInfo() method.
- Specified by:
toString in interface java.lang.CharSequence
- Returns:
- the source text of this segment as a String.
findAllStartTags
public java.util.List findAllStartTags()
- Returns a list of all StartTag objects enclosed by this segment.
- Returns:
- a list of all StartTag objects enclosed by this segment.
findAllStartTags
public java.util.List findAllStartTags(java.lang.String name)
- Returns a list of all StartTag objects with the specified name enclosed by this segment.
If the name argument is null, all StartTags are returned.
- Parameters:
name - the name of the StartTags to find.
- Returns:
- a list of all StartTag objects with the specified name enclosed by this segment.
findAllStartTags
public java.util.List findAllStartTags(java.lang.String attributeName,
java.lang.String value,
boolean valueCaseSensitive)
- Returns a list of all StartTag objects with the specified attribute name/value pair enclosed by this segment.
- Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
- Returns:
- a list of all StartTag objects with the specified attribute name/value pair enclosed by this segment.
findAllComments
public java.util.List findAllComments()
- Returns a list of all Segment objects enclosed by this segment that represent HTML comments.
- Returns:
- a list of all Segment objects enclosed by this segment that represent HTML comments.
findAllElements
public java.util.List findAllElements()
- Returns a list of all Element objects enclosed by this segment.
- Returns:
- a list of all Element objects enclosed by this segment.
findAllElements
public java.util.List findAllElements(java.lang.String name)
- Returns a list of all Element objects with the specified name enclosed by this segment.
If the name argument is null, all Elements are returned.
- Parameters:
name - the name of the Elements to find.
- Returns:
- a list of all Element objects with the specified name enclosed by this segment.
findAllCharacterReferences
public java.util.List findAllCharacterReferences()
- Returns a list of all CharacterReference objects enclosed by this segment.
- Returns:
- a list of all CharacterReference objects enclosed by this segment.
findFormControls
public java.util.List findFormControls()
- Returns a list of the FormControl objects enclosed by this segment.
- Returns:
- a list of the FormControl objects enclosed by this segment.
findFormFields
public FormFields findFormFields()
- Returns the FormFields object representing all form fields enclosed by this segment.
This is equivalent to FormFields.constructFrom(findFormControls())
- Returns:
- the FormFields object representing all form fields enclosed by this segment.
- See Also:
findFormControls()
parseAttributes
public Attributes parseAttributes()
- Parses any Attributes within this segment.
This method is only used in the unusual situation where attributes exist outside of a start tag.
The StartTag.getAttributes() method should be used in normal situations.
This is equivalent to source.parseAttributes(this.getBegin(),this.getEnd())
- Returns:
- the Attributes within this segment, or null if too many errors occur while parsing.
ignoreWhenParsing
public void ignoreWhenParsing()
- Causes this segment to be ignored when parsing.
This is equivalent to source.ignoreWhenParsing(this.getBegin(),this.getEnd())
compareTo
public int compareTo(java.lang.Object o)
- Compares this Segment object to another object.
If the argument is not a Segment, a ClassCastException is thrown.
A segment is considered to be before another segment if its begin position is earlier,
or in the case that both segments begin at the same position, its end position is earlier.
Segments that begin and end at the same position are considered equal for
the purposes of this comparison, even if they relate to different source documents.
Note: this class has a natural ordering that is inconsistent with equals.
This means that this method may return zero in some cases where calling the
equals(Object) method with the same argument returns false.
- Specified by:
compareTo in interface java.lang.Comparable
- Parameters:
o - the segment to be compared
- Returns:
- a negative integer, zero, or a positive integer as this segment is before, equal to, or after the specified segment.
- Throws:
java.lang.ClassCastException - if the argument is not a Segment
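The ordering described above, and its inconsistency with equals, can be sketched with a stand-in class. OrderedSpan is a hypothetical name; it uses the generic Comparable interface for brevity, whereas the method documented here takes a plain Object. Comparing sources by reference in equals is an assumption of this sketch.

```java
// Hypothetical stand-in for illustration; this is not the library's Segment class.
final class OrderedSpan implements Comparable<OrderedSpan> {
    final Object source; // stands in for the Source document (assumption)
    final int begin;
    final int end;

    OrderedSpan(Object source, int begin, int end) {
        this.source = source;
        this.begin = begin;
        this.end = end;
    }

    // Earlier begin position first; ties are broken by the earlier end position.
    // The source is deliberately ignored, as the description above states.
    @Override
    public int compareTo(OrderedSpan other) {
        if (begin != other.begin) return Integer.compare(begin, other.begin);
        return Integer.compare(end, other.end);
    }

    // Unlike compareTo, equality also requires the same source.
    @Override
    public boolean equals(Object object) {
        if (!(object instanceof OrderedSpan)) return false;
        OrderedSpan other = (OrderedSpan) object;
        return source == other.source && begin == other.begin && end == other.end;
    }

    @Override
    public int hashCode() {
        return begin + end;
    }

    public static void main(String[] args) {
        OrderedSpan first = new OrderedSpan(new Object(), 1, 5);
        OrderedSpan second = new OrderedSpan(new Object(), 1, 5);
        System.out.println(first.compareTo(second)); // 0: same begin and end
        System.out.println(first.equals(second));    // false: different sources
    }
}
```

Sorted collections such as java.util.TreeSet rely on compareTo rather than equals, so they would treat two such spans from different documents as duplicates.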
isWhiteSpace
public static final boolean isWhiteSpace(char ch)
- Indicates whether the specified character is white space.
The HTML 4.01 Specification section 9.1
specifies the following white space characters:
- space (U+0020)
- tab (U+0009)
- form feed (U+000C)
- line feed (U+000A)
- carriage return (U+000D)
- zero-width space (U+200B)
Despite the explicit inclusion of the zero-width space in the HTML specification, Microsoft IE6 does not
recognise it as white space and renders it as an unprintable character (empty square).
Even zero-width spaces included using a numeric character reference such as &#x200B; are rendered this way.
Note that in versions prior to 1.5, this method did not recognise form feeds or zero-width spaces as white space.
- Parameters:
ch - the character to test.
- Returns:
- true if the specified character is white space, otherwise false.
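The six characters listed above translate directly into a membership test. The following is a sketch of that check under a hypothetical class name, not the library's source code:

```java
// Hypothetical sketch of the documented character list; not the library's source.
final class WhiteSpaceSketch {
    static boolean isWhiteSpace(char ch) {
        switch (ch) {
            case ' ':      // space (U+0020)
            case '\t':     // tab (U+0009)
            case '\f':     // form feed (U+000C)
            case '\n':     // line feed (U+000A)
            case '\r':     // carriage return (U+000D)
            case '\u200B': // zero-width space (U+200B)
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWhiteSpace('\f'));     // true
        System.out.println(isWhiteSpace('\u200B')); // true
        System.out.println(isWhiteSpace('\u00A0')); // false: no-break space is not in the list
    }
}
```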
getDebugInfo
public java.lang.String getDebugInfo()
- Returns a string representation of this object useful for debugging purposes.
- Returns:
- a string representation of this object useful for debugging purposes.
charAt
public char charAt(int index)
- Returns the character at the specified index.
This is logically equivalent to toString().charAt(index) for valid argument values 0 <= index < length().
However, because this implementation works directly on the underlying document source string,
it should not be assumed that an IndexOutOfBoundsException will be thrown for an invalid argument value.
- Specified by:
charAt in interface java.lang.CharSequence
- Parameters:
index
- the index of the character.
- Returns:
- the character at the specified index.
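Why an out-of-range index may go unnoticed can be sketched with a stand-in that reads from the whole source string. SourceBackedSpan is a hypothetical class, and offsetting the index by the begin position is an assumption about how such an implementation could work, not the library's code.

```java
// Hypothetical stand-in for illustration; this is not the library's Segment class.
final class SourceBackedSpan {
    final String sourceText; // the whole document text
    final int begin;
    final int end;

    SourceBackedSpan(String sourceText, int begin, int end) {
        this.sourceText = sourceText;
        this.begin = begin;
        this.end = end;
    }

    int length() {
        return end - begin;
    }

    // Reads straight from the document text; the index is not checked against length().
    char charAt(int index) {
        return sourceText.charAt(begin + index);
    }

    @Override
    public String toString() {
        return sourceText.substring(begin, end);
    }

    public static void main(String[] args) {
        SourceBackedSpan span = new SourceBackedSpan("<b>bold</b>", 3, 7);
        System.out.println(span);           // bold
        System.out.println(span.charAt(0)); // b
        // Index 4 equals length(), so it is invalid, yet no exception is thrown:
        // the character after the span in the document is returned instead.
        System.out.println(span.charAt(4)); // <
    }
}
```

By contrast, span.toString().charAt(4) would throw a StringIndexOutOfBoundsException, which is why callers should validate indexes themselves.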
subSequence
public final java.lang.CharSequence subSequence(int beginIndex,
int endIndex)
- Returns a new character sequence that is a subsequence of this sequence.
This is logically equivalent to toString().subSequence(beginIndex,endIndex) for valid values of beginIndex and endIndex.
However, because this implementation works directly on the underlying document source string,
it should not be assumed that an IndexOutOfBoundsException will be thrown
for invalid argument values as described in the String.subSequence(int,int) method.
- Specified by:
subSequence in interface java.lang.CharSequence
- Parameters:
beginIndex - the begin index, inclusive.
endIndex - the end index, exclusive.
- Returns:
- a new character sequence that is a subsequence of this sequence.
getSourceText
public java.lang.String getSourceText()
- Deprecated. Use the toString() method instead.
- Returns the source text of this segment.
This method has been deprecated as of version 1.5 as it now duplicates the functionality of the toString() method.
- Returns:
- the source text of this segment.
getSourceTextNoWhitespace
public final java.lang.String getSourceTextNoWhitespace()
- Deprecated. Use the more useful CharacterReference.decodeCollapseWhiteSpace(CharSequence) method instead.
- Returns the source text of this segment without white space.
All leading and trailing white space is omitted, and any sections of internal white space are replaced by a single space.
This method has been deprecated as of version 1.5 as it is no longer used internally and
was never very useful as a public method.
It is similar to the new CharacterReference.decodeCollapseWhiteSpace(CharSequence) method, but
does not decode the text after collapsing the white space.
- Returns:
- the source text of this segment without white space.
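The collapsing rule described above (drop leading and trailing white space, replace each internal run with one space, leave character references undecoded) can be sketched as follows. CollapseSketch and collapseWhiteSpace are hypothetical names, and reusing the white space list documented for isWhiteSpace(char) is an assumption of this sketch.

```java
// Hypothetical sketch of the collapsing rule; not the library's implementation.
final class CollapseSketch {
    // Assumption: white space means the characters listed for isWhiteSpace(char).
    static boolean isWhiteSpace(char ch) {
        return ch == ' ' || ch == '\t' || ch == '\f'
            || ch == '\n' || ch == '\r' || ch == '\u200B';
    }

    static String collapseWhiteSpace(CharSequence text) {
        StringBuilder out = new StringBuilder();
        boolean pendingSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (isWhiteSpace(ch)) {
                // Remember the gap, but only once some text has been emitted,
                // so leading white space is dropped.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) out.append(' ');
                out.append(ch);
                pendingSpace = false;
            }
        }
        // A gap still pending here is trailing white space and is never written.
        return out.toString();
    }

    public static void main(String[] args) {
        // Character references such as &amp; are left undecoded.
        System.out.println(collapseWhiteSpace("  fish \n &amp;\t chips  "));
    }
}
```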
findWords
public final java.util.List findWords()
- Deprecated. no replacement
- Returns a list of Segment objects representing every word in this segment separated by white space.
Note that any markup contained in this segment will be regarded as normal text for the purposes of this method.
This method has been deprecated as of version 1.5 as it has no discernible use.
- Returns:
- a list of Segment objects representing every word in this segment separated by white space.
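The note that markup is regarded as normal text can be illustrated with a simple splitter. WordSplitSketch is a hypothetical class that returns begin/end position pairs instead of Segment objects, and it uses Character.isWhitespace as a stand-in for the library's white space test; both are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting a span into words; not the library's implementation.
final class WordSplitSketch {
    // Returns {begin, end} pairs (end exclusive) for each run of non-white-space
    // characters between positions begin and end of sourceText.
    static List<int[]> findWords(String sourceText, int begin, int end) {
        List<int[]> words = new ArrayList<>();
        int wordStart = -1; // -1 means "not currently inside a word"
        for (int pos = begin; pos < end; pos++) {
            boolean white = Character.isWhitespace(sourceText.charAt(pos));
            if (!white && wordStart < 0) {
                wordStart = pos;
            } else if (white && wordStart >= 0) {
                words.add(new int[] {wordStart, pos});
                wordStart = -1;
            }
        }
        if (wordStart >= 0) words.add(new int[] {wordStart, end});
        return words;
    }

    public static void main(String[] args) {
        String text = "<p>Hello big world</p>";
        for (int[] word : findWords(text, 0, text.length())) {
            // Tags are not separated from the adjacent text.
            System.out.println(text.substring(word[0], word[1]));
        }
    }
}
```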