Source (Jericho HTML Parser 1.5-dev1)
au.id.jericho.lib.html
Class Source
java.lang.Object
au.id.jericho.lib.html.Segment
au.id.jericho.lib.html.Source
- All Implemented Interfaces:
- java.lang.CharSequence, java.lang.Comparable
public class Source
extends Segment
Represents a source HTML document.
Note that many of the useful functions which can be performed on the source document are
defined in its superclass, Segment.
The Source object is itself a Segment which spans the entire document.
Most of the methods defined in this class are useful for determining the elements and tags
surrounding or neighbouring a particular character position in the document.
IMPORTANT NOTE: Because HTML allows '<' characters within attribute values
(see section 5.3.2 of the HTML spec),
it is theoretically impossible to determine with certainty whether
any given '<' character in a source document is the start of a tag
without having parsed from the beginning of the document (which Jericho HTML Parser doesn't do).
For this reason, the parser may reject a start tag completely if its attributes are not
properly formed, although it does try to provide some leniency.
In XHTML, such characters must be represented in attribute values as character entities
(see section 3.1 of the XML spec).
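The ambiguity described in the note can be illustrated with a small self-contained sketch. It does not use the Jericho API; the class AngleBracketAmbiguity and both of its methods are hypothetical names introduced only for this illustration. The first method treats every '<' followed by a letter as a start tag, while the second scans from the beginning of the text and skips '<' characters inside quoted attribute values.

```java
// Hypothetical illustration only; not part of the Jericho HTML Parser API.
final class AngleBracketAmbiguity {

    // Naive approach: every '<' followed by a letter is assumed to start a start tag.
    static int countNaive(CharSequence text) {
        int count = 0;
        for (int i = 0; i + 1 < text.length(); i++) {
            if (text.charAt(i) == '<' && Character.isLetter(text.charAt(i + 1))) {
                count++;
            }
        }
        return count;
    }

    // Quote-aware approach: scan from the beginning of the text and ignore
    // any '<' that appears inside a quoted attribute value.
    static int countQuoteAware(CharSequence text) {
        int count = 0;
        boolean inTag = false;
        char quote = 0; // 0 means "not inside a quoted value"
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;          // closing quote
            } else if (inTag) {
                if (c == '"' || c == '\'') quote = c; // opening quote
                else if (c == '>') inTag = false;     // end of tag
            } else if (c == '<' && i + 1 < text.length()
                    && Character.isLetter(text.charAt(i + 1))) {
                inTag = true;
                count++;
            }
        }
        return count;
    }
}
```

For a fragment such as `<a title="<b>">x</a>` the two methods disagree: the naive count also includes the '<' inside the title value, which is the situation the note warns about.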
- See Also:
Segment
Constructor Summary
Source(java.lang.CharSequence text)
Constructs a new Source object with the specified text.
Method Summary
Segment findEnclosingComment(int pos)
Returns a Segment spanning the HTML comment that encloses the specified position in the source document.
Element findEnclosingElement(int pos)
Returns the most nested Element enclosing the specified position in the source document.
Element findEnclosingElement(int pos, java.lang.String name)
Returns the most nested Element with the specified name enclosing the specified position in the source document.
StartTag findEnclosingStartTag(int pos)
Returns the StartTag enclosing the specified position in the source document.
CharacterReference findNextCharacterReference(int pos)
Returns the CharacterReference beginning at or immediately following the specified position in the source document.
StartTag findNextComment(int pos)
Returns the Comment beginning at or immediately following the specified position in the source document.
EndTag findNextEndTag(int pos)
Returns the EndTag beginning at or immediately following the specified position in the source document.
EndTag findNextEndTag(int pos, java.lang.String name)
Returns the EndTag with the specified name beginning at or immediately following the specified position in the source document.
StartTag findNextStartTag(int pos)
Returns the StartTag beginning at or immediately following the specified position in the source document.
StartTag findNextStartTag(int pos, java.lang.String name)
Returns the StartTag with the specified name beginning at or immediately following the specified position in the source document.
StartTag findNextStartTag(int pos, java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive)
Returns the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.
Tag findNextTag(int pos)
Returns the tag (either a StartTag or EndTag) beginning at or immediately following the specified position in the source document.
CharacterReference findPreviousCharacterReference(int pos)
Returns the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document.
EndTag findPreviousEndTag(int pos, java.lang.String name)
Returns the EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
StartTag findPreviousStartTag(int pos)
Returns the StartTag at or immediately preceding (or enclosing) the specified position in the source document.
StartTag findPreviousStartTag(int pos, java.lang.String name)
Returns the StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
Element getElementById(java.lang.String id)
Returns the Element with the specified id attribute value.
java.util.Iterator getNextTagIterator(int pos)
Returns an iterator of Tag objects beginning at or immediately following the specified position in the source document.
void ignoreWhenParsing(java.util.Collection segments)
Causes all of the segments in the specified collection to be ignored when parsing.
void ignoreWhenParsing(int begin, int end)
Causes the specified range of the source text to be ignored when parsing.
Attributes parseAttributes(int pos, int maxEnd)
Parses any Attributes starting at the specified position.
Attributes parseAttributes(int pos, int maxEnd, int maxErrorCount)
Parses any Attributes starting at the specified position.
void setLogWriter(java.io.Writer writer)
Sets the destination for log messages.
java.lang.String toString()
Returns the source text as a String.
Methods inherited from class au.id.jericho.lib.html.Segment
charAt, compareTo, encloses, encloses, equals, findAllCharacterReferences, findAllComments, findAllElements, findAllElements, findAllStartTags, findAllStartTags, findAllStartTags, findFormControls, findFormFields, findWords, getBegin, getDebugInfo, getEnd, getSourceText, getSourceTextNoWhitespace, hashCode, ignoreWhenParsing, isComment, isWhiteSpace, length, parseAttributes, subSequence
Methods inherited from class java.lang.Object
getClass, notify, notifyAll, wait, wait, wait
Constructor Detail
Source
public Source(java.lang.CharSequence text)
- Constructs a new Source object with the specified text.
- Parameters:
text - the source text.
Method Detail
toString
public java.lang.String toString()
- Returns the source text as a String.
If the original CharSequence supplied when this instance was constructed was not a String,
the first conversion of the text to a String is cached for subsequent calls.
- Returns:
- the source text as a String.
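The caching behaviour described for toString() can be sketched as follows. This is a minimal illustration, not the actual Source implementation: the class name CachedText and its fields are hypothetical, and the real class may cache differently.

```java
// Hypothetical sketch; the real Source class may implement caching differently.
final class CachedText {
    private final CharSequence text;
    private String cached; // filled in by the first toString() call

    CachedText(CharSequence text) {
        this.text = text;
        // A String is already its own String form, so nothing needs converting.
        if (text instanceof String) {
            cached = (String) text;
        }
    }

    @Override
    public String toString() {
        if (cached == null) {
            cached = text.toString(); // converted once, reused afterwards
        }
        return cached;
    }
}
```

With a StringBuilder as input, repeated calls return the same String instance instead of converting the text again each time.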
getElementById
public Element getElementById(java.lang.String id)
- Returns the Element with the specified id attribute value.
This simulates the script method getElementById defined in DOM HTML level 1.
This is equivalent to findNextStartTag(0,"id",id,true).getElement().
A well-formed HTML document should have no more than one element with any given id attribute value.
Calls to this method are not cached.
- Parameters:
id - the id attribute value (case sensitive) to search for, must not be null.
- Returns:
- the Element with the specified id attribute value.
findPreviousStartTag
public StartTag findPreviousStartTag(int pos)
- Returns the StartTag at or immediately preceding (or enclosing) the specified position in the source document.
If the specified position is within an HTML comment, the segment spanning the comment is returned.
- Parameters:
pos - the position in the source document from which to start the search.
- Returns:
- the StartTag immediately preceding the specified position in the source document, or null if none exists.
findPreviousStartTag
public StartTag findPreviousStartTag(int pos, java.lang.String name)
- Returns the StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
Start tags positioned within an HTML comment are ignored, but the comment segment itself is treated as a start tag.
Specifying a null name parameter is equivalent to findPreviousStartTag(pos).
- Parameters:
pos - the position in the source document from which to start the search.
name - the name of the StartTag to search for.
- Returns:
- the StartTag with the specified name immediately preceding the specified position in the source document, or null if none exists.
findNextStartTag
public StartTag findNextStartTag(int pos)
- Returns the StartTag beginning at or immediately following the specified position in the source document.
Start tags positioned within an HTML comment are ignored, but subsequent comment segments are treated as start tags.
- Parameters:
pos - the position in the source document from which to start the search.
- Returns:
- the StartTag beginning at or immediately following the specified position in the source document, or null if none exists.
findNextStartTag
public StartTag findNextStartTag(int pos, java.lang.String name)
- Returns the StartTag with the specified name beginning at or immediately following the specified position in the source document.
Start tags positioned within an HTML comment are ignored.
Specifying a null name parameter is equivalent to findNextStartTag(pos).
Specifying a name parameter ending in a colon (:) searches for all start tags in the specified XML namespace.
- Parameters:
pos - the position in the source document from which to start the search.
name - the name of the StartTag to search for.
- Returns:
- the StartTag with the specified name beginning at or immediately following the specified position in the source document, or null if none exists.
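The three cases for the name parameter (null, a name ending in a colon, and a plain name) can be sketched as a single predicate. The class StartTagNameFilter and its method are hypothetical and not part of the library; case handling of tag names is deliberately left out because this page does not specify it.

```java
// Hypothetical sketch of the name-matching rule; not the library's own code.
final class StartTagNameFilter {

    // tagName: the name of a candidate start tag, e.g. "div" or "o:p".
    // searchName: the name argument as it would be passed to findNextStartTag.
    static boolean matches(String tagName, String searchName) {
        if (searchName == null) {
            return true; // null name: any start tag qualifies
        }
        if (searchName.endsWith(":")) {
            return tagName.startsWith(searchName); // namespace prefix search
        }
        return tagName.equals(searchName); // exact name search
    }
}
```

Under this reading, a search for "o:" accepts tags such as o:p but not a plain p.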
findNextStartTag
public StartTag findNextStartTag(int pos, java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive)
- Returns the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.
Calls to this method are not cached.
- Parameters:
pos - the position in the source document from which to start the search.
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
- Returns:
- the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.
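The matching rule stated in the parameter descriptions (attribute name always case insensitive, value case sensitive only on request) can be sketched as a standalone predicate. AttributeMatch and its method are hypothetical names used only for this illustration.

```java
// Hypothetical sketch of the name/value comparison; not the library's own code.
final class AttributeMatch {

    // name, value: an attribute found in a candidate start tag.
    // attributeName, searchValue, valueCaseSensitive: the search arguments.
    static boolean matches(String name, String value,
                           String attributeName, String searchValue,
                           boolean valueCaseSensitive) {
        // Attribute names are compared without regard to case.
        if (!name.equalsIgnoreCase(attributeName)) {
            return false;
        }
        // Values are compared exactly only when the caller asks for it.
        return valueCaseSensitive
                ? value.equals(searchValue)
                : value.equalsIgnoreCase(searchValue);
    }
}
```

This is why getElementById passes true as the last argument: id values are compared case sensitively.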
findNextComment
public StartTag findNextComment(int pos)
- Returns the Comment beginning at or immediately following the specified position in the source document.
If the specified position is within a comment, the comment following the enclosing comment is returned.
- Parameters:
pos - the position in the source document from which to start the search.
- Returns:
- the Comment beginning at or immediately following the specified position in the source document, or null if none exists.
findPreviousEndTag
public EndTag findPreviousEndTag(int pos, java.lang.String name)
- Returns the EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.
End tags positioned within an HTML comment are ignored.
- Parameters:
pos - the position in the source document from which to start the search.
name - the name of the EndTag to search for, must not be null.
- Returns:
- the EndTag immediately preceding the specified position in the source document, or null if none exists.
findNextEndTag
public EndTag findNextEndTag(int pos)
- Returns the EndTag beginning at or immediately following the specified position in the source document.
End tags positioned within an HTML comment are ignored.
- Parameters:
pos - the position in the source document from which to start the search.
- Returns:
- the EndTag beginning at or immediately following the specified position in the source document, or null if none exists.
findNextEndTag
public EndTag findNextEndTag(int pos, java.lang.String name)
- Returns the EndTag with the specified name beginning at or immediately following the specified position in the source document.
End tags positioned within an HTML comment are ignored.
- Parameters:
pos - the position in the source document from which to start the search.
name - the name of the EndTag to search for, must not be null.
- Returns:
- the EndTag with the specified name beginning at or immediately following the specified position in the source document, or null if none exists.
getNextTagIterator
public java.util.Iterator getNextTagIterator(int pos)
- Returns an iterator of Tag objects beginning at or immediately following the specified position in the source document.
Tags positioned within an HTML comment are ignored, but the comment segments themselves are treated as start tags.
- Parameters:
pos - the position in the source document from which to start the iteration.
- Returns:
- an iterator of Tag objects beginning at or immediately following the specified position in the source document.
findNextTag
public Tag findNextTag(int pos)
- Returns the tag (either a StartTag or EndTag) beginning at or immediately following the specified position in the source document.
IMPLEMENTATION NOTE: Sequential tags in a document should be retrieved using the iterator from
getNextTagIterator(int pos), as it is far more efficient than using multiple calls to this method.
- Parameters:
pos - the position in the source document from which to start the search.
- Returns:
- the tag beginning at or immediately following the specified position in the source document, or null if none exists.
- See Also:
getNextTagIterator(int pos)
findEnclosingStartTag
public StartTag findEnclosingStartTag(int pos)
- Returns the StartTag enclosing the specified position in the source document.
If the specified position is within an HTML comment, the segment spanning the comment is returned.
A segment is considered to enclose a character position x if segment.getBegin() <= x < segment.getEnd().
- Parameters:
pos - the position in the source document.
- Returns:
- the StartTag enclosing the specified position in the source document, or null if the position is not within a StartTag.
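The enclosure rule used by the findEnclosing methods is a half-open interval test: the begin position is included and the end position is excluded. A minimal sketch follows; the class Span is a hypothetical stand-in for Segment, and only the comparison itself is taken from the text above.

```java
// Hypothetical stand-in for Segment, reduced to its begin and end positions.
final class Span {
    private final int begin; // index of the first character in the span
    private final int end;   // index just past the last character in the span

    Span(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    // Mirrors the rule: begin <= x < end.
    boolean encloses(int pos) {
        return begin <= pos && pos < end;
    }
}
```

A span covering positions 3 to 7 therefore encloses positions 3, 4, 5 and 6, but not 7, so two adjacent segments never both enclose the same position.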
findEnclosingComment
public Segment findEnclosingComment(int pos)
- Returns a Segment spanning the HTML comment that encloses the specified position in the source document.
A segment is considered to enclose a character position x if segment.getBegin() <= x < segment.getEnd().
- Parameters:
pos - the position in the source document.
- Returns:
- a Segment spanning the HTML comment that encloses the specified position in the source document, or null if the position is not within a comment.
findEnclosingElement
public Element findEnclosingElement(int pos)
- Returns the most nested Element enclosing the specified position in the source document.
If the specified position is within an HTML comment, the segment spanning the comment is returned.
A segment is considered to enclose a character position x if segment.getBegin() <= x < segment.getEnd().
- Parameters:
pos - the position in the source document.
- Returns:
- the most nested Element enclosing the specified position in the source document, or null if the position is not within an Element.
findEnclosingElement
public Element findEnclosingElement(int pos, java.lang.String name)
- Returns the most nested Element with the specified name enclosing the specified position in the source document.
Elements positioned within an HTML comment are ignored, but the comment segment itself is treated as an Element.
- Parameters:
pos - the position in the source document.
name - the name of the Element to search for.
- Returns:
- the most nested Element with the specified name enclosing the specified position in the source document, or null if none exists.
findPreviousCharacterReference
public CharacterReference findPreviousCharacterReference(int pos)
- Returns the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document.
Character references positioned within an HTML comment are NOT ignored.
- Parameters:
pos - the position in the source document from which to start the search.
- Returns:
- the CharacterReference beginning at or immediately preceding the specified position in the source document, or null if none exists.
findNextCharacterReference
public CharacterReference findNextCharacterReference(int pos)
- Returns the CharacterReference beginning at or immediately following the specified position in the source document.
Character references positioned within an HTML comment are NOT ignored.
- Parameters:
pos - the position in the source document from which to start the search.
- Returns:
- the CharacterReference beginning at or immediately following the specified position in the source document, or null if none exists.
parseAttributes
public Attributes parseAttributes(int pos, int maxEnd)
- Parses any Attributes starting at the specified position.
This method is only used in the unusual situation where attributes exist outside of a start tag.
The StartTag.getAttributes() method should be used in normal situations.
The returned Attributes segment will always begin at pos,
and will end at the first occurrence of "/>" or ">" outside of a quoted attribute value,
or at maxEnd, whichever comes first.
Only returns null if the segment contains a major syntactical error
or more than the default maximum number of minor syntactical errors.
This is equivalent to parseAttributes(pos,maxEnd,Attributes.getDefaultMaxErrorCount()).
- Parameters:
pos - the position in the source document at the beginning of the attribute list
maxEnd - the maximum end position of the attribute list, or -1 if no maximum
- Returns:
- the Attributes starting at the specified position, or null if too many errors occur while parsing.
- See Also:
StartTag.getAttributes(), Segment.parseAttributes()
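The rule for where the attribute list ends can be sketched as a quote-aware scan. This is only an illustration of the stated rule, not the parser's code: AttributeListEnd and find are hypothetical names, the sketch does no error counting, and the exact end position reported by the real parser (for example, its treatment of trailing whitespace) may differ.

```java
// Hypothetical sketch of the end-of-attribute-list rule; not the parser itself.
final class AttributeListEnd {

    // Returns the index of the first "/>" or ">" found outside a quoted
    // attribute value, or the limit imposed by maxEnd, whichever comes first.
    // A maxEnd of -1 means "no maximum", as in the parameter description.
    static int find(CharSequence text, int pos, int maxEnd) {
        int limit = (maxEnd == -1 || maxEnd > text.length()) ? text.length() : maxEnd;
        char quote = 0; // 0 means "not inside a quoted value"
        for (int i = pos; i < limit; i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;   // closing quote
            } else if (c == '"' || c == '\'') {
                quote = c;                   // opening quote
            } else if (c == '>') {
                return i;                    // plain tag end
            } else if (c == '/' && i + 1 < text.length() && text.charAt(i + 1) == '>') {
                return i;                    // empty-element tag end
            }
        }
        return limit;
    }
}
```

Note that a '>' inside a quoted value, as in a="1>2", does not end the list; only the unquoted terminator does.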
parseAttributes
public Attributes parseAttributes(int pos, int maxEnd, int maxErrorCount)
- Parses any Attributes starting at the specified position.
This method is only used in the unusual situation where attributes exist outside of a start tag.
The StartTag.getAttributes() method should be used in normal situations.
Only returns null if the segment contains a major syntactical error
or more than the specified number of minor syntactical errors.
The maxErrorCount argument overrides the default maximum number of minor errors allowed,
which can be set using the Attributes.setDefaultMaxErrorCount(int) static method.
See parseAttributes(int pos, int maxEnd) for more information.
- Parameters:
pos - the position in the source document at the beginning of the attribute list
maxEnd - the maximum end position of the attribute list, or -1 if no maximum
maxErrorCount - the maximum number of minor errors allowed while parsing
- Returns:
- the Attributes starting at the specified position, or null if too many errors occur while parsing.
- See Also:
StartTag.getAttributes(), parseAttributes(int pos, int maxEnd)
ignoreWhenParsing
public void ignoreWhenParsing(int begin, int end)
- Causes the specified range of the source text to be ignored when parsing.
This method is usually used to exclude server tags or other non-HTML segments from the source text
so that they do not interfere with the parsing of the surrounding HTML.
This is necessary because many server tags are used as attribute values and in other places within
HTML tags, and very often contain characters that prevent the parser from recognising the surrounding tag.
For efficiency reasons, all segments to be ignored should be registered at once, without performing
searches in between.
- Parameters:
begin - the beginning character position in the source text.
end - the end character position in the source text.
- See Also:
Segment.ignoreWhenParsing()
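One plausible way to ignore a range without disturbing character positions is to blank it out in a copy of the text that is used only for parsing. The sketch below shows that idea under the assumption that positions must stay aligned with the original text; whether Source works this way internally is not stated on this page, and ParseTextMask is a hypothetical name.

```java
// Hypothetical sketch: blank out a range so a parser would skip it,
// while every other character keeps its original position.
final class ParseTextMask {

    // Replaces the characters in [begin, end) with spaces.
    static String mask(CharSequence text, int begin, int end) {
        StringBuilder sb = new StringBuilder(text);
        for (int i = begin; i < end; i++) {
            sb.setCharAt(i, ' ');
        }
        return sb.toString();
    }
}
```

Masking a server tag such as <%= url %> inside an href value removes the stray '<' and '>' characters that could otherwise prevent the surrounding start tag from being recognised, while positions found in the masked text remain valid in the original.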
ignoreWhenParsing
public void ignoreWhenParsing(java.util.Collection segments)
- Causes all of the segments in the specified collection to be ignored when parsing.
This is equivalent to calling Segment.ignoreWhenParsing() on each segment in the collection.
setLogWriter
public void setLogWriter(java.io.Writer writer)
- Sets the destination for log messages.
By default, the log writer is set to null, which suppresses log messages.
- Parameters:
writer - the java.io.Writer where log messages will be sent