fxp Features

----------------
o Unicode Support
o Catalog Support

----------------

Unicode Support

fxp has full support for Unicode and auto-detection of encoding of external XML entities. The supported encodings are currently:

Encoding   Other recognized names
ASCII      ANSI_X3.4-1968, ANSI_X3.4-1986, US-ASCII, US, ISO646-US, ISO-IR-6, ISO_646.IRV:1991, IBM367, CP367
EBCDIC
LATIN1     ISO_8859-1:1987, ISO-8859-1, ISO_8859-1, ISO-IR-100, CP819, IBM819, L1
UCS-4      ISO-10646-UCS-4
UCS-2      ISO-10646-UCS-2
UTF-16
UTF-8

----------------

Catalog Support

o Catalogs
o Options by Example
o Summary of Options

----------------

Catalogs

fxp supports XML Catalog. Catalogs are used to generate system identifiers from public identifiers (mapping), or to substitute one system identifier for another (remapping). Catalogs come in two syntaxes, both of which fxp reads: the Socat syntax is a subset of a catalog syntax used for SGML; the XML syntax is an XML document instance.

Syntax

There are five kinds of entries in a catalog:

Type       Socat syntax           XML syntax
base       BASE uri               <Base HRef="uri">
           Specifies a URI to be used as a base for succeeding relative URIs.
extend     CATALOG uri            <Extend HRef="uri">
           Indicates an alternative catalog to be searched if the current catalog does not contain a matching entry.
delegate   DELEGATE prefix uri    <Delegate PublicId="prefix" HRef="uri">
           Specifies an alternative catalog, but only for public identifiers beginning with prefix.
map        PUBLIC pubid uri       <Map PublicId="pubid" HRef="uri">
           Maps a public identifier to a URI.
remap      SYSTEM src dst         <Remap SystemId="src" HRef="dst">
           Indicates that URI dst shall be used in place of the source URI src.

If the XML syntax is used, the catalog is parsed in non-validating mode and everything except for the start-tags of the above five elements is ignored. It is recommended, however, that the catalog be a valid XML document with a document type similar to this.

Relative URIs are treated as relative to the catalog in which they appear or, if there was a preceding base entry, relative to the URI of that entry. The only exception is the src URI in a remap entry, which must be matched exactly, ignoring any specified base.
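
The rule for relative URIs can be illustrated with a short Python sketch built on the standard urllib.parse.urljoin. This is not fxp's implementation; the helper name resolve_entry_uri and the catalog location are assumptions made for the example, as is the choice to resolve a relative base entry against the catalog's own URI.

```python
from urllib.parse import urljoin

def resolve_entry_uri(uri, catalog_uri, base=None):
    """Resolve a URI taken from a catalog entry (hypothetical helper).

    A relative URI is interpreted relative to the most recent base entry
    if there is one, otherwise relative to the catalog's own URI.
    """
    # Assumption: a base entry may itself be relative to the catalog.
    effective_base = urljoin(catalog_uri, base) if base else catalog_uri
    return urljoin(effective_base, uri)

catalog = "file:///etc/xml/catalog.soc"  # assumed catalog location
print(resolve_entry_uri("spec.dtd", catalog))
print(resolve_entry_uri("spec.dtd", catalog, base="/pub/dtd/w3c/"))
```

The src URI of a remap entry would bypass this helper entirely, since it is compared literally.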

Example in Socat Syntax

If a catalog's file name ends in .SOC or .soc, fxp assumes it is in Socat syntax, e.g.:
BASE     "/pub/dtd/w3c/"
PUBLIC   "-//W3C//DTD Specification::19980910//EN" "spec.dtd"
SYSTEM   "spec.dtd" "xmlspec.dtd"
DELEGATE "ISO" "/pub/dtd/iso/iso.soc"
CATALOG  "/pub/entities/ent.soc"
PUBLIC   "ISO 8879:1986//ENTITIES Added Latin 1//EN" "/pub/iso/lat1.ent"
SYSTEM   "isolat1.ent" "latin1.ent"
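
To make the structure of such a catalog concrete, the following Python sketch tokenizes Socat text into entries. It is a simplified illustration, not fxp's parser: it assumes upper-case keywords and double-quoted arguments only, and it silently skips anything else. The names ARITY, TOKEN and parse_socat are hypothetical.

```python
import re

# Number of quoted arguments expected after each entry keyword.
ARITY = {"BASE": 1, "CATALOG": 1, "DELEGATE": 2, "PUBLIC": 2, "SYSTEM": 2}
# A token is either a double-quoted string or a run of non-blank characters.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')

def parse_socat(text):
    """Return a list of (keyword, args) tuples for the five entry kinds."""
    tokens = [(m.group(1), m.group(2)) for m in TOKEN.finditer(text)]
    entries = []
    i = 0
    while i < len(tokens):
        _, word = tokens[i]
        i += 1
        if word not in ARITY:
            continue  # not an entry keyword: skip it
        n = ARITY[word]
        args = [quoted for quoted, _ in tokens[i:i + n]]
        # Accept the entry only if all expected arguments are quoted strings.
        if len(args) == n and all(a is not None for a in args):
            entries.append((word, tuple(args)))
            i += n
    return entries

sample = '''
BASE     "/pub/dtd/w3c/"
PUBLIC   "-//W3C//DTD Specification::19980910//EN" "spec.dtd"
SYSTEM   "spec.dtd" "xmlspec.dtd"
'''
for entry in parse_socat(sample):
    print(entry)
```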

Example in XML Syntax

For XML syntax, the catalog must be a well-formed, but not necessarily valid, XML document. In particular, if the catalog has more than one entry, there must be a single root element containing all the entries. All textual data and all elements other than the five catalog entries are ignored.
<Catalog>
  <Base HRef="/pub/dtd/w3c/"/>
  <Map  PublicId="-//W3C//DTD Specification::19980910//EN" HRef="spec.dtd"/>
  <Remap SystemId="spec.dtd" HRef="xmlspec.dtd"/>
  <Delegate PublicId="ISO" HRef="/pub/dtd/iso/iso.soc"/>
  <Extend HRef="/pub/entities/ent.soc"/>
  <Map PublicId="ISO 8879:1986//ENTITIES Added Latin 1//EN" HRef="/pub/iso/lat1.ent"/>
  <Remap SystemId="isolat1.ent" HRef="latin1.ent"/>
</Catalog>
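
As an illustration of how little of the XML document matters, the sketch below uses Python's standard xml.etree.ElementTree to pull out the five entry kinds and ignore everything else. It is not fxp's code; ENTRY_ATTRS and parse_xml_catalog are hypothetical names, and entries lacking a required attribute are simply skipped, which is an assumption.

```python
import xml.etree.ElementTree as ET

# Attributes that carry the arguments of each entry element.
ENTRY_ATTRS = {
    "Base": ("HRef",),
    "Extend": ("HRef",),
    "Delegate": ("PublicId", "HRef"),
    "Map": ("PublicId", "HRef"),
    "Remap": ("SystemId", "HRef"),
}

def parse_xml_catalog(text):
    """Return (element name, args) tuples in document order."""
    entries = []
    for elem in ET.fromstring(text).iter():
        attrs = ENTRY_ATTRS.get(elem.tag)
        # Unknown elements and text content are ignored.
        if attrs and all(a in elem.attrib for a in attrs):
            entries.append((elem.tag, tuple(elem.attrib[a] for a in attrs)))
    return entries

sample = '''<Catalog>
  <Base HRef="/pub/dtd/w3c/"/>
  <Note>ignored</Note>
  <Remap SystemId="spec.dtd" HRef="xmlspec.dtd"/>
</Catalog>'''
for entry in parse_xml_catalog(sample):
    print(entry)
```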

Search Order

The search order is breadth-first, i.e., a matching map or remap entry is always preferred to a matching entry in an alternative catalog specified by a preceding delegate or extend entry. E.g., in the example above the public identifier "ISO 8879:1986//ENTITIES Added Latin 1//EN" is mapped to /pub/iso/lat1.ent even if the catalog /pub/entities/ent.soc contains a matching entry for it.
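
The breadth-first rule can be sketched in Python as follows. This is one possible reading of the search order, not fxp's actual algorithm: catalogs are modelled as an in-memory dictionary from a catalog name to its list of entries, only public identifiers are looked up, and the function name lookup_public is hypothetical.

```python
from collections import deque

def lookup_public(pubid, start, catalogs):
    """Breadth-first search for a map entry matching pubid.

    catalogs maps a catalog name to a list of (kind, args) entries.
    """
    queue = deque([start])
    seen = set()
    while queue:
        name = queue.popleft()
        if name in seen or name not in catalogs:
            continue
        seen.add(name)
        entries = catalogs[name]
        # A match in this catalog wins over anything in alternative catalogs.
        for kind, args in entries:
            if kind == "map" and args[0] == pubid:
                return args[1]
        # Only then are the alternative catalogs queued for searching.
        for kind, args in entries:
            if kind == "extend":
                queue.append(args[0])
            elif kind == "delegate" and pubid.startswith(args[0]):
                queue.append(args[1])
    return None

LATIN1 = "ISO 8879:1986//ENTITIES Added Latin 1//EN"
catalogs = {
    "main.soc": [
        ("delegate", ("ISO", "iso.soc")),
        ("extend", ("ent.soc",)),
        ("map", (LATIN1, "/pub/iso/lat1.ent")),
    ],
    "ent.soc": [("map", (LATIN1, "/elsewhere/lat1.ent"))],
}
print(lookup_public(LATIN1, "main.soc", catalogs))
```

In this model the entry in main.soc is found before ent.soc is ever opened, even though the extend entry precedes the map entry.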

----------------

Catalog Options by Example

Catalog Search Path

A catalog to be used for resolving can be specified with the --catalog option. Repeating this option is equivalent to concatenating all specified catalogs into one. As a consequence, a matching entry in the second catalog overrides a match in a catalog named by a delegate or extend entry in the first one. For example, suppose that iso.soc contains the line
DELEGATE "ISO 8879:1986//ENTITIES" "8879.soc"
8879.soc contains
PUBLIC   "ISO 8879:1986//ENTITIES Added Latin 1//EN" "/pub/iso/lat1.ent"
and ents.soc contains
PUBLIC   "ISO 8879:1986//ENTITIES Added Latin 1//EN" "isolat1.ent"
Specifying --catalog=iso.soc --catalog=ents.soc makes "ISO 8879:1986//ENTITIES Added Latin 1//EN" resolve to isolat1.ent, and not to /pub/iso/lat1.ent.

Resolving Strategy

A catalog may be used in two ways: as a fall-back, i.e., for generating system identifiers if the information in the XML document itself is not sufficient; or as the default, overriding the system identifiers specified in the DTD. By default, fxp resolves an external identifier as follows:
  1. if a public identifier is present, fxp tries to map it to a system identifier using the catalog; if this fails or no public identifier was given, the declared system identifier is used;
  2. fxp then tries to remap the system identifier obtained in step 1 using a matching catalog entry.
This behaviour can be changed with the --catalog-priority option, which takes one of the following arguments:

map     the default behaviour: steps 1 and 2 as described above.
remap   first try to remap the declared system identifier; only if that fails, proceed with step 1.
sys     if a system identifier is given, don't consider the catalog at all; if there is no system identifier, proceed with steps 1 and 2. Note that in well-formed documents an external identifier must always contain a system identifier, except in a notation declaration. The catalog is therefore consulted only for external identifiers declared for notations.

E.g., suppose you have the following declarations in the DTD:

<!ENTITY % isolat1 PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN" "isolat1.ent">
<!NOTATION ps PUBLIC "PostScript Level 3">
By default, the external identifier for isolat1 is mapped to /pub/iso/lat1.ent. With --catalog-priority=remap remapping of the declared system identifier comes first and yields latin1.ent (which is modified to /pub/dtd/w3c/latin1.ent due to the base entry in the catalog's first line). Giving option --catalog-priority=sys totally disables the catalog for this external identifier because it has a system identifier. For notation ps, however, the catalog is still consulted because its declaration lacks a system identifier.

Since remapping should be used with caution in publicly available catalogs, it can be disabled with --catalog-remap=no. E.g., resolving public identifier "-//W3C//DTD Specification::19980910//EN" first results in the URI spec.dtd. By default, this is remapped to xmlspec.dtd, but with --catalog-remap=no it is returned as is.
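
The interplay of --catalog-priority and --catalog-remap can be summarized in a small Python model. It is a sketch under simplifying assumptions, not fxp's implementation: the catalog is reduced to two flat dictionaries (maps and remaps), base entries and alternative catalogs are left out, and the remap priority is read as "fall through to steps 1 and 2 if the declared system identifier cannot be remapped". The function name resolve and its parameters are hypothetical.

```python
def resolve(pubid, sysid, maps, remaps, priority="map", remap=True):
    """Return the system identifier chosen for an external identifier."""
    def step1(current):
        # Map the public identifier; fall back to the declared system id.
        return maps.get(pubid, current) if pubid is not None else current

    def step2(current):
        # Remap the result of step 1, unless remapping is switched off.
        if remap and current is not None:
            return remaps.get(current, current)
        return current

    if priority == "sys" and sysid is not None:
        return sysid              # the catalog is not consulted at all
    if priority == "remap" and remap and sysid in remaps:
        return remaps[sysid]      # remapping the declared identifier succeeded
    return step2(step1(sysid))

LATIN1 = "ISO 8879:1986//ENTITIES Added Latin 1//EN"
SPEC = "-//W3C//DTD Specification::19980910//EN"
maps = {LATIN1: "/pub/iso/lat1.ent", SPEC: "spec.dtd"}
remaps = {"isolat1.ent": "latin1.ent", "spec.dtd": "xmlspec.dtd"}

print(resolve(LATIN1, "isolat1.ent", maps, remaps))
print(resolve(LATIN1, "isolat1.ent", maps, remaps, priority="remap"))
print(resolve(LATIN1, "isolat1.ent", maps, remaps, priority="sys"))
print(resolve(SPEC, "other.dtd", maps, remaps, remap=False))
```

Because base entries are not modelled, the remap case prints latin1.ent rather than the base-adjusted /pub/dtd/w3c/latin1.ent.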

Catalog syntax and encoding

A catalog is used for resolving system identifiers in XML documents. A system identifier is a URI and may, according to RFC 2396, only contain ASCII characters. Due to an inaccuracy in the XML recommendation, however, arbitrary Unicode characters may occur in system identifiers. Since system identifiers in catalogs are matched literally, it is desirable to specify them identically both in the catalog and in the XML document. Therefore catalogs are Unicode documents and can be written in all encodings supported for XML documents. Though XML recommends encoding non-ASCII characters in system identifiers in UTF-8 and escaping the resulting bytes in the URI, matching of system identifiers in catalogs is performed on the Unicode representation. Therefore, system identifier "entité" does not match "entit%C3%A9", though both decode to the same URI.
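
The difference between the literal and the escaped form can be seen in a few lines of Python using the standard urllib.parse module. The snippet only illustrates why the two strings differ under literal comparison; it does not reflect how fxp compares identifiers internally.

```python
from urllib.parse import quote, unquote

declared = "entit\u00e9"     # the system identifier "entité" as Unicode text
escaped = quote(declared)    # UTF-8 encode, then percent-escape the bytes

print(escaped)
print(declared == escaped)           # literal comparison of the two forms
print(unquote(escaped) == declared)  # comparison after undoing the escaping
```

A catalog entry written in one form therefore does not match a document that uses the other.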

Catalogs in Socat syntax, however, have no encoding declaration. Therefore fxp only checks for a byte-order mark at the beginning of a catalog in order to auto-detect a UTF-16 encoding. If it doesn't find one it assumes a default encoding. Because catalogs are usually written by hand, this is by default LATIN1. The --catalog-encoding option tells fxp to use another default encoding.
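
The detection rule for Socat catalogs amounts to a check of the first two bytes. The Python sketch below shows the idea; the function names are hypothetical and Python's codec names stand in for the encoding names fxp uses.

```python
import codecs

def catalog_encoding(data, default="latin-1"):
    """Choose an encoding for the raw bytes of a Socat catalog."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # Python's utf-16 codec reads and drops the BOM
    return default       # no byte-order mark: use the default encoding

def read_catalog(data, default="latin-1"):
    """Decode catalog bytes using the detected or default encoding."""
    return data.decode(catalog_encoding(data, default))

print(catalog_encoding('BASE "/pub/"'.encode("utf-16")))
print(catalog_encoding('BASE "/pub/"'.encode("latin-1")))
```

The default argument plays the role of the --catalog-encoding option.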

fxp tries to guess the syntax of a catalog from the suffix of its file name. A suffix of .soc or .SOC selects Socat syntax, whereas for suffixes .xml and .XML the XML syntax is chosen. For files having none of these suffixes, fxp assumes XML syntax. This can be changed with --catalog-syntax=soc.
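
The suffix rule is simple enough to write down directly. The following Python sketch mirrors it; the function name catalog_syntax is hypothetical, and its default parameter stands for the value given to --catalog-syntax.

```python
def catalog_syntax(filename, default="xml"):
    """Guess the catalog syntax from the file name suffix."""
    if filename.endswith((".soc", ".SOC")):
        return "soc"
    if filename.endswith((".xml", ".XML")):
        return "xml"
    return default  # unknown suffix: fall back to the configured default

for name in ("iso.soc", "catalog.XML", "catalog"):
    print(name, catalog_syntax(name))
```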

----------------

Summary of Catalog Options

-C uri
--catalog=uri
    Use uri as a catalog. Several catalogs can be specified by repeating this option.
--catalog-syntax=(soc|xml)
    For catalogs with unknown suffix, specifies whether to assume Socat syntax or XML syntax. Defaults to xml.
--catalog-encoding=enc
    Use encoding enc for reading a catalog unless it starts with a byte order mark. enc must be a supported encoding. Defaults to LATIN1.
--catalog-remap=[(yes|no)]
    Turn on or off support for remapping system identifiers. Defaults to yes.
--catalog-priority=(map|remap|sys)
    Controls the resolving strategy in catalogs. map means that mapping the public identifier has highest priority; remap means that remapping the system identifier comes first; sys means that the catalog is used only if no system identifier is present. Defaults to map.

----------------

A. Neumann (neumann@PSI.Uni-Trier.DE)