
nu.validator.gnu.xml.aelfred2.package.html Maven / Gradle / Ivy


An HTML-checking library (used by https://html5.validator.nu and the HTML5 facet of the W3C Validator)




    package overview



This package contains Ælfred2, which includes an enhanced SAX2-compatible version of the Ælfred non-validating XML parser, a modular (and hence optional) DTD validating parser, and modular (and hence optional) JAXP glue to those. Use these like any other SAX2 parsers.
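As a rough illustration of "like any other SAX2 parser", the sketch below drives an `XMLReader` through the standard SAX2 interfaces only. It is an assumption of this example that the JDK's default parser (obtained through `SAXParserFactory`) stands in for Ælfred2; with this package on the classpath you would pass in an instance of its own `XMLReader` implementation instead. The class and method names (`Sax2Usage`, `elementNames`, `defaultReader`) are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class Sax2Usage {
    /** Assumption: the JDK's built-in SAX parser is used as a stand-in reader. */
    static XMLReader defaultReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses the document and returns element names in document order. */
    static List<String> elementNames(XMLReader reader, String xml) {
        List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName);
            }
        });
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(
            elementNames(defaultReader(), "<config><item/><item/></config>"));
    }
}
```

Because `elementNames` depends only on the `XMLReader` interface, swapping parsers does not change the application code.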

Some of the documentation below was modified from the original Ælfred README.txt file. All of it has been updated.

About Ælfred

Ælfred is a Java-based XML parser originally from Microstar Software Limited (no longer in existence) and more or less placed into the public domain.

Design Principles

In most Java applets and applications, XML should not be the central feature; instead, XML is the means to another end, such as loading configuration information, reading meta-data, or parsing transactions.

When an XML parser is only a single component of a much larger program, it cannot be large, slow, or resource-intensive. With Java applets, in particular, code size is a significant issue. The standard modem is still not operating at 56 Kbaud, or sometimes even with data compression. Assuming an uncompressed 28.8 Kbaud modem, only about 3 KBytes can be downloaded in one second; compression often doubles that speed, but a V.90 modem may not provide another doubling. When used with embedded processors, similar size concerns apply.

Ælfred is designed for easy and efficient use over the Internet, based on the following principles:

  1. Ælfred must be as small as possible, so that it doesn't add too much to an applet's download time.
  2. Ælfred must use as few class files as possible, to minimize the number of HTTP connections necessary. (The use of JAR files has made this less of a concern.)
  3. Ælfred must be compatible with most or all Java implementations and platforms. (Write once, run anywhere.)
  4. Ælfred must use as little memory as possible, so that it does not take away resources from the rest of your program. (It doesn't force you to use DOM or a similar costly data structure API.)
  5. Ælfred must run as fast as possible, so that it does not slow down the rest of your program.
  6. Ælfred must produce correct output for well-formed and valid documents, but need not reject every document that is not valid or not well-formed. (In Ælfred2, correctness was a bigger concern than in the original version; and a validation option is available.)
  7. Ælfred must provide full internationalization from the first release. (Ælfred2 now automatically handles all encodings supported by the underlying JVM; previous versions handled only UTF-8, UTF-16, ASCII, and ISO-8859-1.)

As you can see from this list, Ælfred is designed for production use, but neither validation nor perfect conformance was a requirement. Good validating parsers exist, including one in this package, and you should use them as appropriate. (See conformance reviews available at http://www.xml.com)

One of the main goals of Ælfred2 was to significantly improve conformance, while not significantly affecting the other goals stated above. Since the only use of this parser is with SAX, some classes could be removed, and so the overall size of Ælfred was actually reduced. Subsequent performance work produced a notable speedup (over twenty percent on larger files). That is, the tradeoffs between speed, size, and conformance were re-targeted towards conformance and support of newer APIs (SAX2), with a positive performance impact.

The role anticipated for this version of Ælfred is as a lightweight Free Software SAX parser that can be used in essentially every Java program where the handful of conformance violations (noted below) are acceptable. That certainly includes applets, and nowadays one must also mention embedded systems as being even more size-critical. At this writing, all parsers that are more conformant are significantly larger, even when counting the optional validation support in this version of Ælfred.

About the Name Ælfred

Ælfred the Great (AElfred in ASCII) was King of Wessex, and some say King of England, at the time of his death in 899 AD. Ælfred introduced a widespread literacy program in the hope that his people would learn to read English, at least, if Latin was too difficult for them. This Ælfred hopes to bring another sort of literacy to Java, using XML, at least, if full SGML is too difficult.

The initial Æ ligature ("AE") is also a reminder that XML is not limited to ASCII.

Character Encodings

The Ælfred parser currently builds in support for a handful of input encodings. Of course these include UTF-8 and UTF-16, which all XML parsers are required to support:

  • UTF-8 ... the standard eight bit encoding, used unless you provide an encoding declaration or a MIME charset tag.
  • US-ASCII ... an extremely common seven bit encoding, which happens to be a subset of UTF-8 and ISO-8859-1 as well as many other encodings. XHTML web pages using US-ASCII (without an encoding declaration) are probably more widely interoperable than those in any other encoding.
  • ISO-8859-1 ... includes accented characters used in much of western Europe (but excluding the Euro currency symbol).
  • UTF-16 ... with several variants, this encodes each sixteen bit Unicode character in sixteen bits of output. Variants include UTF-16BE (big endian, no byte order mark), UTF-16LE (little endian, no byte order mark), and ISO-10646-UCS-2 (an older and less used encoding, using a version of Unicode without surrogate pairs). This is essentially the native encoding used by Java.
  • ISO-10646-UCS-4 ... a seldom-used four byte encoding, also known as UTF-32BE. Four byte order variants are supported, including one known as UTF-32LE. Some operating systems standardized on UCS-4 despite its significant size penalty, in anticipation that Unicode (even with surrogate pairs) would eventually become limiting. UCS-4 permits encoding of non-Unicode characters, which Java can't represent (and XML doesn't allow).

If you use any encoding other than UTF-8 or UTF-16 you should make sure to label your data appropriately:

<?xml version="1.0" encoding="ISO-8859-15"?>

Encodings accessed through java.io.InputStreamReader are now fully supported for both external labels (such as MIME types) and internal labels (as shown above). There is one limitation in the support for internal labels: the encoding must be derived from US-ASCII; the EBCDIC family of encodings is not recognized. Note that Java defines its own encoding names, which don't always correspond to the standard Internet encoding names defined by the IETF/IANA, and that Java may even require use of nonstandard encoding names. Please report such problems; some of them can be worked around in this parser, and many can be worked around by using external labels.
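The difference between an internal label (the encoding declaration) and an external label can be sketched with the standard SAX `InputSource` API. This is a minimal sketch, assuming the JDK's default parser as a stand-in for Ælfred2 and using ISO-8859-1 as the sample encoding; `EncodingLabels` and `textOf` are hypothetical names. The external label is supplied through `InputSource.setEncoding`, which is where an application would pass on a MIME charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class EncodingLabels {
    /**
     * Parses raw bytes and returns the character content.
     * externalEncoding may be null, in which case the parser relies on
     * the encoding declaration (or the UTF-8/UTF-16 defaults).
     */
    static String textOf(byte[] bytes, String externalEncoding) {
        StringBuilder text = new StringBuilder();
        try {
            // Assumption: the JDK's built-in parser stands in for Ælfred2.
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            InputSource source =
                new InputSource(new ByteArrayInputStream(bytes));
            if (externalEncoding != null) {
                source.setEncoding(externalEncoding); // external label
            }
            reader.parse(source);
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Internal label: the document declares its own encoding.
        byte[] labelled =
            "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><p>caf\u00e9</p>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(textOf(labelled, null));

        // External label: same bytes without a declaration.
        byte[] unlabelled =
            "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(textOf(unlabelled, "ISO-8859-1"));
    }
}
```

Without either label, the second document would be read as UTF-8 and the accented byte would be a decoding error.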

Note that if you are using the Euro symbol with a fixed-length eight-bit encoding, you should probably be using the encoding label iso-8859-15 or, with a Microsoft OS, cp-1252. Of course, UTF-8 and UTF-16 handle the Euro symbol directly.

Known Conformance Violations

Known conformance issues should be of negligible importance for most applications, and include:

  • Rather than following the voluminous "Appendix B" rules about what characters may appear in names (and name tokens), the Unicode rules embedded in java.lang.Character are used. This means mostly that some names are inappropriately accepted, though a few are inappropriately rejected. (It's much simpler to avoid that much special case code. Recent OASIS/NIST test cases may make these rules realistically testable.)
  • Text containing "]]>" is not rejected unless it fully resides in an internal buffer ... which is, thankfully, the typical case. This text is illegal, but sometimes appears in illegal attempts to nest CDATA sections. (Not catching that boundary condition substantially simplifies parsing text.)
  • Surrogate characters that aren't correctly paired are ignored rather than rejected, unless they were encoded using UTF-8. (This simplifies parsing text.) Unicode 3.1 assigned the first characters to those character codes, in early 2001, so few documents (or tools) use such characters in any case.
  • Declarations following references to an undefined parameter entity reference are not ignored. (Not maintaining and using state about this validity error simplifies declaration handling; few XML parsers address this constraint in any case.)
  • Well formedness constraints for general entity references are not enforced. (The code to handle the "content" production is merged with the element parsing code, making it hard to reuse for this additional situation.)
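The first item in the list above (name checking via java.lang.Character) can be approximated as follows. This is only a sketch of the general idea, not the parser's actual code: the class `NameChars`, its methods, and the exact set of extra punctuation characters are assumptions made for illustration.

```java
class NameChars {
    /** Approximate test for the first character of an XML name. */
    static boolean isNameStart(int cp) {
        return cp == ':' || cp == '_'
            || Character.isUnicodeIdentifierStart(cp);
    }

    /** Approximate test for the remaining characters of an XML name. */
    static boolean isNameChar(int cp) {
        return cp == ':' || cp == '_' || cp == '-' || cp == '.'
            || Character.isUnicodeIdentifierPart(cp);
    }

    /** Checks a whole name, iterating by code point rather than by char. */
    static boolean isName(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!isNameStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isNameChar(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isName("xml:lang"));
        System.out.println(isName("1abc"));
    }
}
```

A check of this shape is short, but it differs from the XML name productions at the edges. For instance, `Character.isUnicodeIdentifierPart` also accepts ignorable control characters that XML names exclude, which is the kind of inappropriate acceptance the list item describes.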

When tested against the July 12, 1999 version of the OASIS XML Conformance test suite, an earlier version passed 1057 of 1067 tests. That contrasts with the original version, which passed 867. The current parser is top-ranked in terms of conformance, as is its validating sibling (which has some additional conformance violations imposed on it by SAX2 API deficiencies as well as some of the more curious SGML layering artifacts found in the XML specification).

The XML 1.0 specification itself was not without problems, and after some delays the W3C has come out with a revised "second edition" specification. While that doesn't resolve all the problems identified in the XML specification, many of the most egregious problems have been resolved. (You still need to drink magic Kool-Aid before some DTD-related issues make sense.) To the extent possible, this parser conforms to that second edition specification, and does well against corrected versions of the OASIS/NIST XML conformance test cases. See http://xmlconf.sourceforge.net for more information about SAX2/XML conformance testing.

Licensing

As noted above, the original distribution was more or less public domain. The license had the constraint that modifications be clearly documented, as has been done here.

This version is Copyright (C) 1999,2000,2001 The Free Software Foundation, and all the modifications are distributed under the GNU General Public License (GPL). It is subject to the "Library Exception", supporting use in some environments (such as embedded systems where dynamic linking may not be available) by proprietary code without necessarily requiring all code to be licensed under the GPL.

Changes Since the last Microstar Release

As noted above, Microstar has not updated this parser since the summer of 1998, when it released version 1.2a on its web site. This release is intended to benefit the developer community by refocusing the API on SAX2, and improving conformance to the extent that most developers should not need to use another XML parser.

The code has been cleaned up (referring to the XML 1.0 spec in all the production numbers in comments, rather than some preliminary draft, for one example) and has been sped up a bit as well. JAXP support has been added, although developers are still strongly encouraged to use the SAX2 APIs directly.

SAX2 Support

The original version of Ælfred did not support the SAX2 APIs.

This version supports the SAX2 APIs, exposing the standard boolean feature descriptors. It supports the "DeclHandler" property to provide access to all DTD declarations not already exposed through the SAX1 API. The "LexicalHandler" property is supported, exposing entity boundaries (including the unnamed external subset) and things like comments and CDATA boundaries. SAX1 compatibility is currently provided.
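The handler properties mentioned above are set through `XMLReader.setProperty` using the standard SAX2 property URIs. The sketch below registers one `DefaultHandler2` as both LexicalHandler and DeclHandler and records a few events. It assumes the JDK's default parser as a stand-in for Ælfred2; `Sax2Extensions` and `events` are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class Sax2Extensions {
    // Standard SAX2 property identifiers.
    static final String LEXICAL_HANDLER =
        "http://xml.org/sax/properties/lexical-handler";
    static final String DECL_HANDLER =
        "http://xml.org/sax/properties/declaration-handler";

    /** Returns a log of selected lexical and declaration events. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        // DefaultHandler2 implements both LexicalHandler and DeclHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void elementDecl(String name, String model) {
                log.add("elementDecl:" + name);
            }
            @Override
            public void comment(char[] ch, int start, int length) {
                log.add("comment:" + new String(ch, start, length));
            }
            @Override
            public void startCDATA() {
                log.add("startCDATA");
            }
            @Override
            public void endCDATA() {
                log.add("endCDATA");
            }
        };
        try {
            // Assumption: the JDK's built-in parser stands in for Ælfred2.
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setProperty(LEXICAL_HANDLER, handler);
            reader.setProperty(DECL_HANDLER, handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE doc [<!ELEMENT doc (#PCDATA)>]>"
                   + "<doc><!--note--><![CDATA[x]]></doc>";
        System.out.println(events(xml));
    }
}
```

A reader that does not recognize one of these properties throws `SAXNotRecognizedException` from `setProperty`, so applications that must run against arbitrary parsers may want to catch that separately.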

Validation

In the 'pipeline' package in this same software distribution is an XML Validation component using any full SAX2 event stream (including all document type declarations) to validate. There is now an XmlReader class which combines that class and this enhanced Ælfred parser, creating an optionally validating SAX2 parser.

As noted in the documentation for that validating component, certain validity constraints can't reliably be tested by a layered validator. These include all constraints relying on layering violations (exposing XML at the level of tokens or below, required since XML isn't a context-free grammar), some that SAX2 doesn't support, and a few others. The resulting validating parser is conformant enough for most applications that aren't doing strange SGML tricks with DTDs. Moreover, that validating filter can be used without a parser ... any application component that emits SAX event streams can DTD-validate its output on demand.
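From the application's side, an optionally validating SAX2 parser is typically switched with the standard validation feature flag, and validity errors arrive through `ErrorHandler.error` rather than as fatal errors. The sketch below shows that pattern. It assumes the JDK's default parser as a stand-in, and it assumes (not confirmed by this document) that the combined parser here honors the same standard feature URI; `ValidationFeature` and `validityErrors` are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ValidationFeature {
    // Standard SAX2 feature identifier for DTD validation.
    static final String VALIDATION =
        "http://xml.org/sax/features/validation";

    /** Parses with validation on and returns the validity error messages. */
    static List<String> validityErrors(String xml) {
        List<String> errors = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                // Validity errors are recoverable; collect and continue.
                errors.add(e.getMessage());
            }
        };
        try {
            // Assumption: the JDK's built-in parser stands in for Ælfred2.
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(VALIDATION, true);
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (IOException | ParserConfigurationException | SAXException e) {
            // Well-formedness (fatal) errors end up here.
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE doc ["
                   + "<!ELEMENT doc (item)><!ELEMENT item EMPTY>]>";
        System.out.println(validityErrors(dtd + "<doc><item/></doc>").size());
        System.out.println(validityErrors(dtd + "<doc/>").size());
    }
}
```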

You want Smaller?

You'll have noticed that the original version of Ælfred had small size as a top goal. Ælfred2 normally includes a DTD validation layer, but you can package without that. Similarly, JAXP factory support is available but optional. Then the main added cost due to this revision is for supporting the SAX2 API itself; DTD validation is as cleanly layered as allowed by SAX2.

Bugs Fixed

Bugs fixed in Ælfred2 include:

  1. Originally Ælfred didn't close file descriptors, which led to file descriptor leakage on programs which ran for any length of time.
  2. NOTATION declarations without system identifiers are now handled correctly.
  3. DTD events are now reported for all invocations of a given parser, not just the first one.
  4. More correct character handling:
    • Rejects out-of-range characters, both in text and in character references.
    • Correctly handles character references that expand to surrogate pairs.
    • Correctly handles UTF-8 encodings of surrogate pairs.
    • Correctly handles Unicode 3.1 rules about illegal UTF-8 encodings: there is only one legal encoding per character.
    • PUBLIC identifiers are now rejected if they have illegal characters.
    • The parser is more correct about what characters are allowed in names and name tokens. Uses Unicode rules (built in to Java) rather than the voluminous XML rules, although some extensions have been made to match XML rules more closely.
    • Line ends are now normalized to newlines in all known cases.
  5. Certain validity errors were previously treated as well formedness violations.
    • Repeated declarations of an element type are no longer fatal errors.
    • Undeclared parameter entity references are no longer fatal errors.
  6. Attribute handling is improved:
    • Whitespace must exist between attributes.
    • Only one value for a given attribute is permitted.
    • ATTLIST declarations don't need to declare attributes.
    • Attribute values are normalized when required.
    • Tabs in attribute values are normalized to spaces.
    • Attribute values containing a literal "<" are rejected.
  7. More correct entity handling:
    • Whitespace must precede NDATA when declaring unparsed entities.
    • Parameter entity declarations may not have NDATA annotations.
    • The XML specification has a bug in that it doesn't specify that certain contexts exist within which parameter entity expansion must not be performed. Lacking an official erratum, this parser now disables such expansion inside comments, processing instructions, ignored sections, public identifiers, and parts of entity declarations.
    • Entity expansions that include quote characters no longer confuse parsing of strings using such expansions.
    • Whitespace in the values of internal entities is not mapped to space characters.
    • General Entity references in attribute defaults within the DTD now cause fatal errors when the entity is not defined at the time it is referenced.
    • Malformed general entity references in entity declarations are now detected.
  8. Neither conditional sections nor parameter entity references within markup declarations are permitted in the internal subset.
  9. Processing instructions whose target names are "XML" (ignoring case) are now rejected.
  10. Comments may not include "--".
  11. Most "]]>" sequences in text are rejected.
  12. Correct syntax for standalone declarations is enforced.
  13. Setting a locale for diagnostics only produces an exception if the language of that locale isn't English.
  14. Some more encoding names are recognized. These include the Unicode 3.0 variants of UTF-16 (UTF-16BE, UTF-16LE) as well as US-ASCII and a few commonly seen synonyms.
  15. Text (from character content, PIs, or comments) large enough not to fit into internal buffers is now handled correctly even in some cases which were originally handled incorrectly.
  16. Content is now reported for element types for which attributes have been declared, but no content model is known. (Such documents are invalid, but may still be well formed.)
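Two of the behaviors in the list above are easy to observe from application code: a character reference above U+FFFF arrives as a surrogate pair, and a literal tab in an attribute value is normalized to a space. The sketch below checks both through plain SAX2 callbacks. It assumes the JDK's default parser as a stand-in for Ælfred2; `NormalizationChecks` and `rootAttrAndText` are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class NormalizationChecks {
    /** Returns { value of attribute "a" on the root, character content }. */
    static String[] rootAttrAndText(String xml) {
        String[] attr = new String[1];
        StringBuilder text = new StringBuilder();
        try {
            // Assumption: the JDK's built-in parser stands in for Ælfred2.
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (attr[0] == null) {
                        attr[0] = atts.getValue("a");
                    }
                }
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return new String[] { attr[0], text.toString() };
    }

    public static void main(String[] args) {
        // Literal tab inside the attribute; U+1D11E as a character reference.
        String[] r = rootAttrAndText("<doc a=\"x\ty\">&#x1D11E;</doc>");
        System.out.println("[" + r[0] + "] chars=" + r[1].length()
            + " codePoint=" + Integer.toHexString(r[1].codePointAt(0)));
    }
}
```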

Other bugs may also have been fixed.

For better overall validation support, some of the validity constraints that can't be verified using the SAX2 event stream are now reported directly by Ælfred2.




