doc.api.au.id.jericho.lib.html.CharacterReference.html Maven / Gradle / Ivy


Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

There is a newer version: 2.3








CharacterReference (Jericho HTML Parser 1.5-dev1)





















  









au.id.jericho.lib.html


Class CharacterReference
java.lang.Object
  au.id.jericho.lib.html.Segment
      au.id.jericho.lib.html.CharacterReference


All Implemented Interfaces: 
java.lang.CharSequence, java.lang.Comparable


Direct Known Subclasses: 
CharacterEntityReference, NumericCharacterReference



public abstract class CharacterReference
extends Segment


Represents either a CharacterEntityReference or NumericCharacterReference.
 

 This class, together with its subclasses, contains static methods to perform most required operations without ever having to instantiate an object.
 

 Objects of this class are useful when the positions of character references in a source document are required,
 or to replace the found character references with customised text.
 

 Objects are created using one of the following methods:
 

  parse(CharSequence characterReferenceText)
  Source.findNextCharacterReference(int pos)
  Source.findPreviousCharacterReference(int pos)
  Segment.findAllCharacterReferences()
 















Field Summary



static boolean ApostropheEncoded
          Determines whether apostrophes are encoded when calling the encode(CharSequence) method.

static int INVALID_CODE_POINT
          Represents an invalid Unicode code point.


 









Method Summary



static java.lang.String decode(java.lang.CharSequence encodedText)
          Decodes the specified HTML encoded text into normal text.

static java.lang.String decodeCollapseWhiteSpace(java.lang.CharSequence text)
          Decodes the specified text after collapsing its white space.

static java.lang.String encode(java.lang.CharSequence unencodedText)
          Encodes the specified text, escaping special characters into character references.

static java.lang.String encodeWithWhiteSpaceFormatting(java.lang.CharSequence unencodedText)
          Encodes the specified text, preserving line breaks, tabs and spaces for rendering by converting them to markup.

char getChar()
          Returns the character represented by this character reference.

abstract java.lang.String getCharacterReferenceString()
          Returns the encoded form of this character reference.

static java.lang.String getCharacterReferenceString(int codePoint)
          Returns the encoded form of the specified Unicode code point.

int getCodePoint()
          Returns the Unicode code point represented by this character reference.

static int getCodePointFromCharacterReferenceString(java.lang.CharSequence characterReferenceText)
          Parses a single encoded character reference text into a Unicode code point.

java.lang.String getDecimalCharacterReferenceString()
          Returns the decimal encoded form of this character reference.

static java.lang.String getDecimalCharacterReferenceString(int codePoint)
          Returns the decimal encoded form of the specified Unicode code point.

java.lang.String getHexadecimalCharacterReferenceString()
          Returns the hexadecimal encoded form of this character reference.

static java.lang.String getHexadecimalCharacterReferenceString(int codePoint)
          Returns the hexadecimal encoded form of the specified Unicode code point.

java.lang.String getUnicodeText()
          Returns the Unicode code point of this character reference in U+ notation.

static java.lang.String getUnicodeText(int codePoint)
          Returns the specified Unicode code point in U+ notation.

static CharacterReference parse(java.lang.CharSequence characterReferenceText)
          Parses a single encoded character reference text into a CharacterReference object.

static java.lang.String reencode(java.lang.CharSequence encodedText)
          Re-encodes the specified text, equivalent to decoding and then encoding again.

static boolean requiresEncoding(char ch)
          Indicates whether the specified character would need to be encoded in HTML text.


 


Methods inherited from class au.id.jericho.lib.html.Segment


charAt, compareTo, encloses, encloses, equals, findAllCharacterReferences, findAllComments, findAllElements, findAllElements, findAllStartTags, findAllStartTags, findAllStartTags, findFormControls, findFormFields, findWords, getBegin, getDebugInfo, getEnd, getSourceText, getSourceTextNoWhitespace, hashCode, ignoreWhenParsing, isComment, isWhiteSpace, length, parseAttributes, subSequence, toString


 


Methods inherited from class java.lang.Object


getClass, notify, notifyAll, wait, wait, wait


 








Field Detail




INVALID_CODE_POINT
public static final int INVALID_CODE_POINT

Represents an invalid Unicode code point.
 
 This can be the result of parsing a numeric character reference outside of the valid Unicode range of 0x000000-0x10FFFF, or any other invalid character reference.


See Also:
Constant Field Values




ApostropheEncoded
public static boolean ApostropheEncoded

Determines whether apostrophes are encoded when calling the encode(CharSequence) method.
 
 This is a global setting which affects all threads.
 

 Specifying a value of false means apostrophe
 (U+0027) characters will not be encoded.
 The only time apostrophes need to be encoded is within an attribute value delimited by
 single quotes (apostrophes), so in most cases ignoring apostrophes is perfectly safe and
 enhances readability of the source document.
 

 The recommended setting is false, although the default value is true so that
 the behaviour of the encode(CharSequence) method is consistent with previous versions.














Method Detail




getCodePoint
public int getCodePoint()

Returns the Unicode code point represented by this character reference.



Returns:
the Unicode code point represented by this character reference.





getChar
public char getChar()

Returns the character represented by this character reference.
 
 If this character reference represents a Unicode
 supplementary code point,
 any bits outside of the least significant 16 bits of the code point are truncated, yielding an incorrect result.
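 The truncation described above can be reproduced with plain Java casts. The following is a standalone sketch and not the library's code; the class and method names are hypothetical.

```java
final class CharTruncationSketch {
    /** Narrowing cast: keeps only the least significant 16 bits of the code point. */
    static char truncate(int codePoint) {
        return (char) codePoint;
    }

    /** Lossless alternative: one char for BMP code points, a surrogate pair otherwise. */
    static String toText(int codePoint) {
        return new String(Character.toChars(codePoint));
    }
}
```

 For a supplementary code point such as U+1F600, truncate yields the unrelated character U+F600, while toText yields a two-char surrogate pair. Callers that may encounter supplementary code points are probably better served by getCodePoint().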



Returns:
the character represented by this character reference.





encode
public static java.lang.String encode(java.lang.CharSequence unencodedText)

Encodes the specified text, escaping special characters into character references.
 
 Each character is encoded only if the requiresEncoding(char) method would return true for that character,
 using its CharacterEntityReference if available, or a decimal NumericCharacterReference if its Unicode
 code point value is greater than U+007F.
 

 The only exception to this is an apostrophe (U+0027),
 which depending on the current setting of the static ApostropheEncoded property,
 is either encoded as the numeric character reference "&#39;" (default setting), or left unencoded.
 

 This method will never encode an apostrophe into its character entity reference "&apos;" as this
 entity is not defined for use in HTML.  See the comments in the CharacterEntityReference class for more information.
 

 To encode text using only numeric character references, use the

 NumericCharacterReference.encode(CharSequence unencodedText) method instead.
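 The encoding rules above can be sketched in a few lines of standalone Java. This is an illustration under stated assumptions, not the library's implementation: the entity table is a tiny hypothetical subset, apostropheEncoded stands in for the static ApostropheEncoded property, and the loop walks code points rather than individual chars.

```java
final class EncodeSketch {
    /** Stand-in for the static ApostropheEncoded flag; true mirrors the documented default. */
    static boolean apostropheEncoded = true;

    /** Tiny illustrative subset of the HTML entity table; returns null when no entity is known. */
    static String entityName(int codePoint) {
        switch (codePoint) {
            case '&': return "amp";
            case '<': return "lt";
            case '>': return "gt";
            case '"': return "quot";
            case 0xA0: return "nbsp";
            case 0xE9: return "eacute";
            default: return null;
        }
    }

    /** True if an entity exists or the code point is above U+007F; apostrophes follow the flag. */
    static boolean requiresEncoding(int codePoint) {
        if (codePoint == '\'') return apostropheEncoded;
        return entityName(codePoint) != null || codePoint > 0x7F;
    }

    static String encode(CharSequence unencodedText) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < unencodedText.length(); ) {
            int codePoint = Character.codePointAt(unencodedText, i);
            i += Character.charCount(codePoint);
            String name = entityName(codePoint);
            if (!requiresEncoding(codePoint)) {
                out.appendCodePoint(codePoint);
            } else if (name != null) {
                out.append('&').append(name).append(';');
            } else {
                // Decimal numeric reference; this branch also handles the apostrophe (39).
                out.append("&#").append(codePoint).append(';');
            }
        }
        return out.toString();
    }
}
```

 With the flag left at true, EncodeSketch.encode("it's") gives "it&#39;s"; after setting it to false the apostrophe passes through unchanged.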


Parameters:
unencodedText - the text to encode.
Returns:
the encoded string.
See Also:
decode(CharSequence encodedText)





encodeWithWhiteSpaceFormatting
public static java.lang.String encodeWithWhiteSpaceFormatting(java.lang.CharSequence unencodedText)

Encodes the specified text, preserving line breaks, tabs and spaces for rendering by converting them to markup.
 
 This performs the same encoding as the encode(CharSequence) method, but also performs the following conversions:
 

  Line breaks, being Carriage Return (U+000D) or Line Feed (U+000A) characters, and Form Feed characters (U+000C) are converted to "<br />". CR/LF pairs are treated as a single line break.
  Multiple consecutive spaces are converted so that every second space is converted to "&nbsp;" while ensuring the last is always a normal space.
  Tab characters (U+0009) are converted as if they were four consecutive spaces.
 
 
 The conversion of multiple consecutive spaces to alternating space/non-breaking-space allows the correct number of
 spaces to be rendered, but also allows the line to wrap in the middle of it.
 

 Note that zero-width spaces (U+200B) are converted to the numeric character reference
 &#x200B; through the normal encoding process, but IE6 does not render them properly
 either encoded or unencoded.
 

 There is no method provided to reverse this encoding.
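 The white space conversions listed above can be sketched as follows. This is a standalone illustration, not the library's code: it performs only the white space conversions (the character encoding step is omitted), and it assumes that within a run of spaces the alternation is counted back from the final space, so the last space is normal, the one before it is "&nbsp;", and so on. The class and method names are hypothetical.

```java
final class WhiteSpaceFormattingSketch {
    static String format(CharSequence text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char ch = text.charAt(i);
            if (ch == ' ' || ch == '\t') {
                // Measure the whole run, counting each tab as four spaces.
                int run = 0;
                while (i < text.length() && (text.charAt(i) == ' ' || text.charAt(i) == '\t')) {
                    run += text.charAt(i) == '\t' ? 4 : 1;
                    i++;
                }
                appendSpaces(out, run);
            } else if (ch == '\r' || ch == '\n' || ch == '\f') {
                out.append("<br />");
                i++;
                // Treat a CR/LF pair as a single line break.
                if (ch == '\r' && i < text.length() && text.charAt(i) == '\n') i++;
            } else {
                out.append(ch); // the real method would also encode special characters here
                i++;
            }
        }
        return out.toString();
    }

    /** Alternates "&nbsp;" and " " so that the final space of the run is a normal space. */
    private static void appendSpaces(StringBuilder out, int count) {
        for (int k = 0; k < count; k++) {
            boolean nonBreaking = (count - 1 - k) % 2 == 1;
            out.append(nonBreaking ? "&nbsp;" : " ");
        }
    }
}
```

 Under these assumptions a single space is left untouched, two spaces become "&nbsp; ", and a tab becomes "&nbsp; &nbsp; ".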


Parameters:
unencodedText - the text to encode.
Returns:
the encoded string with whitespace formatting converted to markup.
See Also:
encode(CharSequence unencodedText)





decode
public static java.lang.String decode(java.lang.CharSequence encodedText)

Decodes the specified HTML encoded text into normal text.
 
 All character entity references and numeric character references are converted to their respective characters.
 

 The SGML specification allows character references without a terminating semicolon (;) in some circumstances.
 Although not permitted in HTML or XHTML, some browsers do accept them.

 The behaviour of this library is as follows:
 

  Character entity references terminated by any non-alphabetic character are accepted.
  Decimal numeric character references terminated by any non-digit character are accepted.
  Hexadecimal numeric character references must be terminated correctly by a semicolon.
 
 
 Although character entity references are case sensitive, and in some cases differ from other entity references only by their case,
 some browsers will also recognise them in a case-insensitive way.
 For this reason, all decoding methods in this library will recognise character entity references even if they are in the wrong case.
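 The termination and case rules above can be sketched with a regular expression. This is a standalone illustration rather than the library's implementation: the entity table is a tiny hypothetical subset with purely alphabetic names, and trying an exact-case lookup before a lower-case fallback is an assumption about how case-insensitive matching might be resolved.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LenientDecodeSketch {
    private static final Map<String, Integer> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", 0x26);
        ENTITIES.put("lt", 0x3C);
        ENTITIES.put("gt", 0x3E);
        ENTITIES.put("quot", 0x22);
        ENTITIES.put("nbsp", 0xA0);
    }

    // Group 1: hex digits (semicolon mandatory). Group 2: decimal digits. Group 3: entity name.
    private static final Pattern REFERENCE =
            Pattern.compile("&(?:#[xX]([0-9A-Fa-f]+);|#([0-9]+);?|([A-Za-z]+);?)");

    static String decode(CharSequence encodedText) {
        Matcher m = REFERENCE.matcher(encodedText);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            int codePoint = codePointOf(m);
            if (codePoint < 0) continue; // not a recognised reference: leave the text as-is
            out.append(encodedText, last, m.start()).appendCodePoint(codePoint);
            last = m.end();
        }
        return out.append(encodedText, last, encodedText.length()).toString();
    }

    /** Returns the code point for the current match, or -1 if it is unknown or out of range. */
    private static int codePointOf(Matcher m) {
        try {
            if (m.group(1) != null) return checkRange(Long.parseLong(m.group(1), 16));
            if (m.group(2) != null) return checkRange(Long.parseLong(m.group(2)));
        } catch (NumberFormatException tooManyDigits) {
            return -1;
        }
        String name = m.group(3);
        Integer exact = ENTITIES.get(name);
        if (exact != null) return exact;
        Integer folded = ENTITIES.get(name.toLowerCase(Locale.ROOT));
        return folded != null ? folded : -1;
    }

    private static int checkRange(long value) {
        return value <= Character.MAX_CODE_POINT ? (int) value : -1;
    }
}
```

 In this sketch "&lt" and "&#62" decode without a semicolon and "&GT;" decodes despite its case, while "&#x3E" with no semicolon is left untouched.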


Parameters:
encodedText - the text to decode.
Returns:
the decoded string.
See Also:
encode(CharSequence unencodedText)





decodeCollapseWhiteSpace
public static java.lang.String decodeCollapseWhiteSpace(java.lang.CharSequence text)

Decodes the specified text after collapsing its white space.
 
 All leading and trailing white space is omitted, and any sections of internal white space are replaced by a single space.
 

 The resultant text is what would normally be rendered by a user agent.
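 The collapsing step can be sketched on its own as below. This standalone example does not decode character references, and it assumes that Character.isWhitespace is an acceptable definition of white space; the library may classify characters differently. The class and method names are hypothetical.

```java
final class CollapseWhiteSpaceSketch {
    static String collapse(CharSequence text) {
        StringBuilder out = new StringBuilder();
        boolean pendingSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isWhitespace(ch)) {
                // Remember a separator only once some text has been emitted (drops leading white space).
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) out.append(' ');
                pendingSpace = false;
                out.append(ch);
            }
        }
        // A separator still pending at the end is never emitted (drops trailing white space).
        return out.toString();
    }
}
```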


Parameters:
text - the source text
Returns:
the decoded text with collapsed white space.
See Also:
FormControl.getPredefinedValues()





reencode
public static java.lang.String reencode(java.lang.CharSequence encodedText)

Re-encodes the specified text, equivalent to decoding and then encoding again.
 
 This process ensures that the specified encoded text does not contain any remaining unencoded characters.
 

 IMPLEMENTATION NOTE: At present this method simply calls the decode method
 followed by the encode method, but a more efficient implementation
 may be used in future.


Parameters:
encodedText - the text to re-encode.
Returns:
the re-encoded string.





getCharacterReferenceString
public abstract java.lang.String getCharacterReferenceString()

Returns the encoded form of this character reference.
 
 The exact behaviour of this method depends on the class of this object.
 See the CharacterEntityReference.getCharacterReferenceString() and
 NumericCharacterReference.getCharacterReferenceString() methods for more details.
 

 

  Examples:
   CharacterReference.parse("&GT;").getCharacterReferenceString() returns "&gt;"
   CharacterReference.parse("&#x3E;").getCharacterReferenceString() returns "&#x3e;"
 



Returns:
the encoded form of this character reference.
See Also:
getCharacterReferenceString(int codePoint), 
getDecimalCharacterReferenceString()





getCharacterReferenceString
public static java.lang.String getCharacterReferenceString(int codePoint)

Returns the encoded form of the specified Unicode code point.
 
 This method returns the character entity reference encoded form of the code point
 if one exists, otherwise it returns the decimal numeric character reference encoded form.
 

 The only exception to this is an apostrophe (U+0027),
 which is encoded as the numeric character reference "&#39;" instead of its character entity reference
 "&apos;".
 

 

  Examples:
   CharacterReference.getCharacterReferenceString(62) returns "&gt;"
   CharacterReference.getCharacterReferenceString('>') returns "&gt;"
   CharacterReference.getCharacterReferenceString('☺') returns "&#9786;"
 


Parameters:
codePoint - the Unicode code point to encode.
Returns:
the encoded form of the specified Unicode code point.
See Also:
getHexadecimalCharacterReferenceString(int codePoint)





getDecimalCharacterReferenceString
public java.lang.String getDecimalCharacterReferenceString()

Returns the decimal encoded form of this character reference.
 
 This is equivalent to getDecimalCharacterReferenceString(getCodePoint()).
 

 

  Example:
  CharacterReference.parse("&gt;").getDecimalCharacterReferenceString() returns "&#62;"
 



Returns:
the decimal encoded form of this character reference.
See Also:
getCharacterReferenceString(), 
getHexadecimalCharacterReferenceString()





getDecimalCharacterReferenceString
public static java.lang.String getDecimalCharacterReferenceString(int codePoint)

Returns the decimal encoded form of the specified Unicode code point.
 
 

  Example:
  CharacterReference.getDecimalCharacterReferenceString('>') returns "&#62;"
 


Parameters:
codePoint - the Unicode code point to encode.
Returns:
the decimal encoded form of the specified Unicode code point.
See Also:
getCharacterReferenceString(int codePoint), 
getHexadecimalCharacterReferenceString(int codePoint)





getHexadecimalCharacterReferenceString
public java.lang.String getHexadecimalCharacterReferenceString()

Returns the hexadecimal encoded form of this character reference.
 
 This is equivalent to getHexadecimalCharacterReferenceString(getCodePoint()).
 

 

  Example:
  CharacterReference.parse("&gt;").getHexadecimalCharacterReferenceString() returns "&#x3e;"
 



Returns:
the hexadecimal encoded form of this character reference.
See Also:
getCharacterReferenceString(), 
getDecimalCharacterReferenceString()





getHexadecimalCharacterReferenceString
public static java.lang.String getHexadecimalCharacterReferenceString(int codePoint)

Returns the hexadecimal encoded form of the specified Unicode code point.
 
 

  Example:
  CharacterReference.getHexadecimalCharacterReferenceString('>') returns "&#x3e;"
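
 The decimal and hexadecimal forms differ only in radix and prefix. A minimal standalone sketch, with hypothetical class and method names, that reproduces the documented outputs:

```java
final class NumericReferenceSketch {
    /** Decimal form, e.g. 62 gives "&#62;". */
    static String decimal(int codePoint) {
        return "&#" + codePoint + ";";
    }

    /** Hexadecimal form with lower-case digits, e.g. 62 gives "&#x3e;". */
    static String hexadecimal(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint) + ";";
    }
}
```

 A char argument such as '>' is widened to its int code point automatically, which is why the examples above can pass character literals.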
 


Parameters:
codePoint - the Unicode code point to encode.
Returns:
the hexadecimal encoded form of the specified Unicode code point.
See Also:
getCharacterReferenceString(int codePoint), 
getDecimalCharacterReferenceString(int codePoint)





getUnicodeText
public java.lang.String getUnicodeText()

Returns the Unicode code point of this character reference in U+ notation.
 
 This is equivalent to getUnicodeText(getCodePoint()).
 

 

  Example:
  CharacterReference.parse("&gt;").getUnicodeText() returns "U+003E"
 



Returns:
the Unicode code point of this character reference in U+ notation.
See Also:
getUnicodeText(int codePoint)





getUnicodeText
public static java.lang.String getUnicodeText(int codePoint)

Returns the specified Unicode code point in U+ notation.
 
 

  Example:
  CharacterReference.getUnicodeText('>') returns "U+003E"
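
 U+ notation is conventionally the code point in upper-case hexadecimal, zero-padded to at least four digits. A standalone sketch of that formatting, with hypothetical names, which assumes the library follows the same convention for code points above U+FFFF:

```java
final class UnicodeTextSketch {
    /** Formats a code point as "U+" followed by at least four upper-case hex digits. */
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }
}
```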
 


Parameters:
codePoint - the Unicode code point.
Returns:
the specified Unicode code point in U+ notation.





parse
public static CharacterReference parse(java.lang.CharSequence characterReferenceText)

Parses a single encoded character reference text into a CharacterReference object.
 
 The character reference must be at the start of the given text, but may contain other characters at the end.
 The getEnd() method can be used on the resulting object to determine at which character position the character reference ended.
 

 If the text does not represent a valid character reference, this method returns null.
 

 To decode all character references in a given text, use the decode(CharSequence encodedText) method instead.
 

 

  Example:
  CharacterReference.parse("&gt;").getChar() returns '>'
 


Parameters:
characterReferenceText - the text containing a single encoded character reference.
Returns:
a CharacterReference object representing the specified text, or null if the text does not represent a valid character reference.
See Also:
decode(CharSequence encodedText)





getCodePointFromCharacterReferenceString
public static int getCodePointFromCharacterReferenceString(java.lang.CharSequence characterReferenceText)

Parses a single encoded character reference text into a Unicode code point.
 
 The character reference must be at the start of the given text, but may contain other characters at the end.
 

 If the text does not represent a valid character reference, this method returns INVALID_CODE_POINT.
 

 

  Example:
  CharacterReference.getCodePointFromCharacterReferenceString("&gt;") returns 62
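
 The "must be at the start, may have trailing text" contract can be sketched with Matcher.lookingAt(), which anchors a match at the beginning of the input without requiring it to span the whole input. This standalone example handles numeric references only, and the INVALID sentinel is a hypothetical stand-in because the value of INVALID_CODE_POINT is not given here. It also assumes the termination rules described for decode apply.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LeadingReferenceSketch {
    /** Hypothetical sentinel for this sketch only. */
    static final int INVALID = -1;

    // Group 1: hex digits (semicolon mandatory). Group 2: decimal digits (semicolon optional).
    private static final Pattern NUMERIC =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]+);|([0-9]+);?)");

    static int codePointAtStart(CharSequence text) {
        Matcher m = NUMERIC.matcher(text);
        if (!m.lookingAt()) return INVALID; // no reference at position 0
        try {
            long value = m.group(1) != null
                    ? Long.parseLong(m.group(1), 16)
                    : Long.parseLong(m.group(2));
            // Reject values beyond the Unicode range 0x000000-0x10FFFF.
            return value <= Character.MAX_CODE_POINT ? (int) value : INVALID;
        } catch (NumberFormatException tooManyDigits) {
            return INVALID;
        }
    }
}
```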
 


Parameters:
characterReferenceText - the text containing a single encoded character reference.
Returns:
the Unicode code point representing the specified text, or INVALID_CODE_POINT if the text does not represent a valid character reference.





requiresEncoding
public static final boolean requiresEncoding(char ch)

Indicates whether the specified character would need to be encoded in HTML text.
 
 This is the case if a character entity reference exists for the character, or the Unicode code point value is greater than U+007F.
 

 The only exception to this is an apostrophe (U+0027),
 which only returns true if the static ApostropheEncoded property is currently set to true.


Parameters:
ch - the character to be tested.
Returns:
true if the specified character would need to be encoded in HTML text, otherwise false.














  