Jericho HTML Parser is a simple but powerful Java library for analysing and manipulating parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

Segment (Jericho HTML Parser 1.5-dev1)

au.id.jericho.lib.html
Class Segment

java.lang.Object
  extended by au.id.jericho.lib.html.Segment
All Implemented Interfaces:
java.lang.CharSequence, java.lang.Comparable
Direct Known Subclasses:
Attribute, CharacterReference, Element, FormControl, au.id.jericho.lib.html.internal.SequentialListSegment, Source, Tag

public class Segment
extends java.lang.Object
implements java.lang.Comparable, java.lang.CharSequence

Represents a segment of a Source document.

The "span" of a segment is defined by the combination of its begin and end character positions.


Constructor Summary
Segment(Source source, int begin, int end)
          Constructs a new Segment with the specified Source and the specified begin and end character positions.
 
Method Summary
 char charAt(int index)
          Returns the character at the specified index.
 int compareTo(java.lang.Object o)
          Compares this Segment object to another object.
 boolean encloses(int pos)
          Indicates whether this segment encloses the specified character position in the Source document.
 boolean encloses(Segment segment)
          Indicates whether this Segment encloses the specified Segment.
 boolean equals(java.lang.Object object)
          Compares the specified object with this Segment for equality.
 java.util.List findAllCharacterReferences()
          Returns a list of all CharacterReference objects enclosed by this segment.
 java.util.List findAllComments()
          Returns a list of all Segment objects enclosed by this segment that represent HTML comments.
 java.util.List findAllElements()
          Returns a list of all Element objects enclosed by this segment.
 java.util.List findAllElements(java.lang.String name)
          Returns a list of all Element objects with the specified name enclosed by this segment.
 java.util.List findAllStartTags()
          Returns a list of all StartTag objects enclosed by this segment.
 java.util.List findAllStartTags(java.lang.String name)
          Returns a list of all StartTag objects with the specified name enclosed by this segment.
 java.util.List findAllStartTags(java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive)
          Returns a list of all StartTag objects with the specified attribute name/value pair enclosed by this segment.
 java.util.List findFormControls()
          Returns a list of the FormControl objects enclosed by this segment.
 FormFields findFormFields()
          Returns the FormFields object representing all form fields enclosed by this segment.
 java.util.List findWords()
          Deprecated. no replacement
 int getBegin()
          Returns the character position in the Source where this segment begins.
 java.lang.String getDebugInfo()
          Returns a string representation of this object useful for debugging purposes.
 int getEnd()
          Returns the character position in the Source where this segment ends.
 java.lang.String getSourceText()
          Deprecated. Use the toString() method instead
 java.lang.String getSourceTextNoWhitespace()
          Deprecated. Use the more useful CharacterReference.decodeCollapseWhiteSpace(CharSequence) method instead.
 int hashCode()
          Returns a hash code value for the segment.
 void ignoreWhenParsing()
          Causes this segment to be ignored when parsing.
 boolean isComment()
          Indicates whether this Segment represents an HTML comment.
static boolean isWhiteSpace(char ch)
          Indicates whether the specified character is white space.
 int length()
          Returns the length of the segment.
 Attributes parseAttributes()
          Parses any Attributes within this segment.
 java.lang.CharSequence subSequence(int beginIndex, int endIndex)
          Returns a new character sequence that is a subsequence of this sequence.
 java.lang.String toString()
          Returns the source text of this segment as a String.
 
Methods inherited from class java.lang.Object
getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

Segment

public Segment(Source source,
               int begin,
               int end)
Constructs a new Segment with the specified Source and the specified begin and end character positions.

Parameters:
source - the source document.
begin - the character position in the source where this segment begins.
end - the character position in the source where this segment ends.
Method Detail

getBegin

public final int getBegin()
Returns the character position in the Source where this segment begins.

Returns:
the character position in the Source where this segment begins.

getEnd

public final int getEnd()
Returns the character position in the Source where this segment ends.

Returns:
the character position in the Source where this segment ends.

equals

public final boolean equals(java.lang.Object object)
Compares the specified object with this Segment for equality.

Returns true if and only if the specified object is also a Segment, and both segments have the same Source, and the same begin and end positions.

Parameters:
object - the object to be compared for equality with this Segment.
Returns:
true if the specified object is equal to this Segment, otherwise false.

hashCode

public int hashCode()
Returns a hash code value for the segment.

The current implementation returns the sum of the begin and end positions, although this is not guaranteed in future versions.

Returns:
a hash code value for the segment.
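
The equality and hash code rules described above can be illustrated with a small stand-in class. This is only a sketch: `SpanKey` is a hypothetical name, not part of the library, and it assumes that "the same Source" means the same object reference.

```java
/** Hypothetical stand-in for Segment, used only to illustrate the documented rules. */
final class SpanKey {
    private final Object source; // stands in for the Source document
    private final int begin;
    private final int end;

    SpanKey(Object source, int begin, int end) {
        this.source = source;
        this.begin = begin;
        this.end = end;
    }

    @Override
    public boolean equals(Object object) {
        if (this == object) return true;
        if (!(object instanceof SpanKey)) return false;
        SpanKey other = (SpanKey) object;
        // Assumption: "same Source" is modelled as reference identity.
        return source == other.source && begin == other.begin && end == other.end;
    }

    @Override
    public int hashCode() {
        // Mirrors the documented current behaviour: sum of begin and end.
        return begin + end;
    }
}
```

With this scheme, equal spans always share a hash code, while unequal spans such as (0, 3) and (1, 2) collide. Collisions are permitted by the `hashCode` contract; they only affect hash table performance.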

length

public final int length()
Returns the length of the segment. This is defined as the number of characters between the begin and end positions.

Specified by:
length in interface java.lang.CharSequence
Returns:
the length of the segment.

encloses

public final boolean encloses(Segment segment)
Indicates whether this Segment encloses the specified Segment.

Parameters:
segment - the segment to be tested for being enclosed by this segment.
Returns:
true if this Segment encloses the specified Segment, otherwise false.

encloses

public final boolean encloses(int pos)
Indicates whether this segment encloses the specified character position in the Source document.

This is the case if getBegin() <= pos < getEnd().

Parameters:
pos - the position in the source document to be tested.
Returns:
true if this segment encloses the specified position, otherwise false.
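
The condition getBegin() <= pos < getEnd() describes a half-open interval: the begin position is included and the end position is excluded. A minimal sketch of that test, using a hypothetical `HalfOpenSpan` class rather than the library's own Segment:

```java
/** Hypothetical class illustrating the half-open interval test. */
final class HalfOpenSpan {
    private final int begin;
    private final int end;

    HalfOpenSpan(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    /** True if begin <= pos < end. */
    boolean encloses(int pos) {
        return begin <= pos && pos < end;
    }
}
```

One consequence is that a zero-length span, where begin equals end, encloses no position at all.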

isComment

public boolean isComment()
Indicates whether this Segment represents an HTML comment.

An HTML comment is an area of the source document enclosed by the delimiters <!-- on the left and --> on the right.

The HTML 4.01 Specification section 3.2.4 states that the end of comment delimiter may contain white space between the "--" and ">" characters, but this library does not recognise end of comment delimiters containing white space.

Returns:
true if this Segment represents an HTML comment, otherwise false.
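
The delimiter rule above can be sketched with plain string searching. The class and method names here are hypothetical, and the handling of an unterminated comment (the scan simply stops) is an assumption of this sketch, not documented library behaviour.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: locates comment spans delimited by <!-- and -->. */
final class CommentScan {
    /** Returns {begin, end} pairs, with end exclusive, for each comment found. */
    static List<int[]> findCommentSpans(String html) {
        List<int[]> spans = new ArrayList<int[]>();
        int from = 0;
        while (true) {
            int begin = html.indexOf("<!--", from);
            if (begin < 0) break;
            // Only the exact sequence "-->" closes a comment in this sketch,
            // so "-- >" (with white space) is not treated as a delimiter.
            int close = html.indexOf("-->", begin + 4);
            if (close < 0) break; // assumption: stop at an unterminated comment
            int end = close + 3;
            spans.add(new int[] {begin, end});
            from = end;
        }
        return spans;
    }
}
```

For an input such as `a<!-- x -->b<!-- y -- >c`, only the first comment is reported, because the second end delimiter contains white space.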

toString

public java.lang.String toString()
Returns the source text of this segment as a String.

The returned String is newly created with every call to this method, unless this segment is itself a Source object.

Note that before version 1.5 this returned a representation of this object useful for debugging purposes, which can now be obtained via the getDebugInfo() method.

Specified by:
toString in interface java.lang.CharSequence
Returns:
the source text of this segment as a String.

findAllStartTags

public java.util.List findAllStartTags()
Returns a list of all StartTag objects enclosed by this segment.

Returns:
a list of all StartTag objects enclosed by this segment.

findAllStartTags

public java.util.List findAllStartTags(java.lang.String name)
Returns a list of all StartTag objects with the specified name enclosed by this segment.

If the name argument is null, all StartTags are returned.

Parameters:
name - the name of the StartTags to find.
Returns:
a list of all StartTag objects with the specified name enclosed by this segment.

findAllStartTags

public java.util.List findAllStartTags(java.lang.String attributeName,
                                       java.lang.String value,
                                       boolean valueCaseSensitive)
Returns a list of all StartTag objects with the specified attribute name/value pair enclosed by this segment.

Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
Returns:
a list of all StartTag objects with the specified attribute name/value pair enclosed by this segment.

findAllComments

public java.util.List findAllComments()
Returns a list of all Segment objects enclosed by this segment that represent HTML comments.

Returns:
a list of all Segment objects enclosed by this segment that represent HTML comments.

findAllElements

public java.util.List findAllElements()
Returns a list of all Element objects enclosed by this segment.

Returns:
a list of all Element objects enclosed by this segment.

findAllElements

public java.util.List findAllElements(java.lang.String name)
Returns a list of all Element objects with the specified name enclosed by this segment.

If the name argument is null, all Elements are returned.

Parameters:
name - the name of the Elements to find.
Returns:
a list of all Element objects with the specified name enclosed by this segment.

findAllCharacterReferences

public java.util.List findAllCharacterReferences()
Returns a list of all CharacterReference objects enclosed by this segment.

Returns:
a list of all CharacterReference objects enclosed by this segment.

findFormControls

public java.util.List findFormControls()
Returns a list of the FormControl objects enclosed by this segment.

Returns:
a list of the FormControl objects enclosed by this segment.

findFormFields

public FormFields findFormFields()
Returns the FormFields object representing all form fields enclosed by this segment.

This is equivalent to FormFields.constructFrom(findFormControls())

Returns:
the FormFields object representing all form fields enclosed by this segment.
See Also:
findFormControls()

parseAttributes

public Attributes parseAttributes()
Parses any Attributes within this segment. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations.

This is equivalent to source.parseAttributes(this.getBegin(),this.getEnd())

Returns:
the Attributes within this segment, or null if too many errors occur while parsing.

ignoreWhenParsing

public void ignoreWhenParsing()
Causes this segment to be ignored when parsing.

This is equivalent to source.ignoreWhenParsing(segment.getBegin(),segment.getEnd())

See Also:
Source.ignoreWhenParsing(int begin, int end), Source.ignoreWhenParsing(Collection segments)

compareTo

public int compareTo(java.lang.Object o)
Compares this Segment object to another object.

If the argument is not a Segment, a ClassCastException is thrown.

A segment is considered to be before another segment if its begin position is earlier, or in the case that both segments begin at the same position, its end position is earlier.

Segments that begin and end at the same position are considered equal for the purposes of this comparison, even if they relate to different source documents.

Note: this class has a natural ordering that is inconsistent with equals. This means that this method may return zero in some cases where calling the equals(Object) method with the same argument returns false.

Specified by:
compareTo in interface java.lang.Comparable
Parameters:
o - the segment to be compared
Returns:
a negative integer, zero, or a positive integer as this segment is before, equal to, or after the specified segment.
Throws:
java.lang.ClassCastException - if the argument is not a Segment
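
The ordering rules, and the practical effect of an ordering that is inconsistent with equals, can be sketched as follows. `OrderedSpan` is a hypothetical class; unlike the documented method it uses a generic `Comparable`, and it models the source document as an opaque object compared by reference.

```java
import java.util.TreeSet;

/** Hypothetical class: ordered by begin, then end, ignoring the source. */
final class OrderedSpan implements Comparable<OrderedSpan> {
    private final Object source; // stands in for the Source document
    private final int begin;
    private final int end;

    OrderedSpan(Object source, int begin, int end) {
        this.source = source;
        this.begin = begin;
        this.end = end;
    }

    @Override
    public int compareTo(OrderedSpan other) {
        // Earlier begin sorts first; ties are broken by earlier end.
        if (begin != other.begin) return Integer.compare(begin, other.begin);
        return Integer.compare(end, other.end);
    }

    @Override
    public boolean equals(Object object) {
        if (!(object instanceof OrderedSpan)) return false;
        OrderedSpan other = (OrderedSpan) object;
        return source == other.source && begin == other.begin && end == other.end;
    }

    @Override
    public int hashCode() {
        return begin + end;
    }
}

class OrderedSpanDemo {
    public static void main(String[] args) {
        OrderedSpan a = new OrderedSpan(new Object(), 0, 5);
        OrderedSpan b = new OrderedSpan(new Object(), 0, 5);
        TreeSet<OrderedSpan> set = new TreeSet<OrderedSpan>();
        set.add(a);
        set.add(b); // rejected: compareTo returns 0 although a.equals(b) is false
        System.out.println(set.size());
    }
}
```

Sorted collections such as `TreeSet` rely on `compareTo` rather than `equals`, so spans from different documents that share begin and end positions are treated as duplicates there.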

isWhiteSpace

public static final boolean isWhiteSpace(char ch)
Indicates whether the specified character is white space.

The HTML 4.01 Specification section 9.1 specifies the following white space characters:

  • space (U+0020)
  • tab (U+0009)
  • form feed (U+000C)
  • line feed (U+000A)
  • carriage return (U+000D)
  • zero-width space (U+200B)

Despite the explicit inclusion of the zero-width space in the HTML specification, Microsoft IE6 does not recognise it as white space and renders it as an unprintable character (empty square). Even zero-width spaces included using a numeric character reference are rendered this way.

Note that in versions prior to 1.5, this method did not recognise form feeds or zero-width spaces as white space.

Parameters:
ch - the character to test.
Returns:
true if the specified character is white space, otherwise false.
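
A test for exactly the six characters listed above might look like the sketch below. `HtmlWhiteSpace` is a hypothetical class written for illustration; it is not the library's source code.

```java
/** Hypothetical helper covering the six HTML 4.01 white space characters. */
final class HtmlWhiteSpace {
    static boolean isWhiteSpace(char ch) {
        switch (ch) {
            case ' ':       // space (U+0020)
            case '\t':      // tab (U+0009)
            case '\f':      // form feed (U+000C)
            case '\n':      // line feed (U+000A)
            case '\r':      // carriage return (U+000D)
            case '\u200B':  // zero-width space (U+200B)
                return true;
            default:
                return false;
        }
    }
}
```

Characters outside this list, such as the non-breaking space (U+00A0), are not treated as white space by this check.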

getDebugInfo

public java.lang.String getDebugInfo()
Returns a string representation of this object useful for debugging purposes.

Returns:
a string representation of this object useful for debugging purposes.

charAt

public char charAt(int index)
Returns the character at the specified index.

This is logically equivalent to toString().charAt(index) for valid argument values 0 <= index < length().

However because this implementation works directly on the underlying document source string, it should not be assumed that an IndexOutOfBoundsException will be thrown for an invalid argument value.

Specified by:
charAt in interface java.lang.CharSequence
Parameters:
index - the index of the character.
Returns:
the character at the specified index.
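
The caveat about invalid indexes is easier to see with a view class that reads straight from the underlying string. `SourceView` below is a hypothetical model, and the offset arithmetic is an assumption about how such a view could work, not the library's actual code.

```java
/** Hypothetical view over part of a source string. */
final class SourceView implements CharSequence {
    private final String source;
    private final int begin;
    private final int end;

    SourceView(String source, int begin, int end) {
        this.source = source;
        this.begin = begin;
        this.end = end;
    }

    public int length() {
        return end - begin;
    }

    public char charAt(int index) {
        // Reads the source directly: the index is not checked against length(),
        // only (implicitly) against the bounds of the whole source string.
        return source.charAt(begin + index);
    }

    public CharSequence subSequence(int beginIndex, int endIndex) {
        return source.substring(begin + beginIndex, begin + endIndex);
    }

    @Override
    public String toString() {
        return source.substring(begin, end);
    }
}
```

For a view of positions 2 to 5 over "0123456789", `charAt(3)` in this sketch returns '5' from beyond the view, whereas `toString().charAt(3)` throws an IndexOutOfBoundsException. Callers should therefore validate indexes themselves rather than rely on an exception.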

subSequence

public final java.lang.CharSequence subSequence(int beginIndex,
                                                int endIndex)
Returns a new character sequence that is a subsequence of this sequence.

This is logically equivalent to toString().subSequence(beginIndex,endIndex) for valid values of beginIndex and endIndex.

However because this implementation works directly on the underlying document source string, it should not be assumed that an IndexOutOfBoundsException will be thrown for invalid argument values as described in the String.subSequence(int,int) method.

Specified by:
subSequence in interface java.lang.CharSequence
Parameters:
beginIndex - the begin index, inclusive.
endIndex - the end index, exclusive.
Returns:
a new character sequence that is a subsequence of this sequence.

getSourceText

public java.lang.String getSourceText()
Deprecated. Use the toString() method instead

Returns the source text of this segment.

This method has been deprecated as of version 1.5 as it now duplicates the functionality of the toString() method.

Returns:
the source text of this segment.

getSourceTextNoWhitespace

public final java.lang.String getSourceTextNoWhitespace()
Deprecated. Use the more useful CharacterReference.decodeCollapseWhiteSpace(CharSequence) method instead.

Returns the source text of this segment without white space.

All leading and trailing white space is omitted, and any sections of internal white space are replaced by a single space.

This method has been deprecated as of version 1.5 as it is no longer used internally and was never very useful as a public method. It is similar to the new CharacterReference.decodeCollapseWhiteSpace(CharSequence) method, but does not decode the text after collapsing the white space.

Returns:
the source text of this segment without white space.
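
The collapsing behaviour described above (trim both ends, reduce each internal run of white space to one space, decode nothing) can be sketched as below. `WhiteSpaceCollapser` is a hypothetical class, and its use of the six white space characters listed under isWhiteSpace is an assumption of this sketch.

```java
/** Hypothetical helper that collapses white space without decoding the text. */
final class WhiteSpaceCollapser {
    // Assumption: the same six characters documented for isWhiteSpace(char).
    static boolean isWhiteSpace(char ch) {
        return ch == ' ' || ch == '\t' || ch == '\f'
            || ch == '\n' || ch == '\r' || ch == '\u200B';
    }

    static String collapse(CharSequence text) {
        StringBuilder sb = new StringBuilder(text.length());
        boolean pendingSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (isWhiteSpace(ch)) {
                // Remember the gap, but never emit a leading space.
                pendingSpace = sb.length() > 0;
            } else {
                if (pendingSpace) sb.append(' ');
                sb.append(ch);
                pendingSpace = false;
            }
        }
        // A pending space at the end is dropped, which trims trailing white space.
        return sb.toString();
    }
}
```

Character references such as `&amp;` pass through unchanged, which matches the note that this method does not decode the text.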

findWords

public final java.util.List findWords()
Deprecated. no replacement

Returns a list of Segment objects representing every word in this segment separated by white space. Note that any markup contained in this segment will be regarded as normal text for the purposes of this method.

This method has been deprecated as of version 1.5 as it has no discernible use.

Returns:
a list of Segment objects representing every word in this segment separated by white space.
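
Since the method has no replacement, callers that still need word boundaries could compute them directly. The sketch below uses a hypothetical `WordSpans` class and, as a simplifying assumption, `Character.isWhitespace` instead of the library's own white space definition. Markup is treated as ordinary text, as described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds spans of non-white-space characters. */
final class WordSpans {
    /** Returns {begin, end} pairs, with end exclusive, one per word. */
    static List<int[]> findWordSpans(CharSequence text) {
        List<int[]> spans = new ArrayList<int[]>();
        int wordBegin = -1; // -1 means "not currently inside a word"
        for (int i = 0; i <= text.length(); i++) {
            // The end of the text is treated like white space so the last word is closed.
            boolean ws = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!ws && wordBegin < 0) {
                wordBegin = i;
            } else if (ws && wordBegin >= 0) {
                spans.add(new int[] {wordBegin, i});
                wordBegin = -1;
            }
        }
        return spans;
    }
}
```

For the input `<b>Hi</b> there`, the first span covers the whole of `<b>Hi</b>` because the tags are not separated from the text by white space.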





