Scripting Lecture notes -- XML Schemas

Brad Vander Zanden

Schemas Vs DTDs

DTDs are good for specifying the structure of an XML document. They are best suited to files that consist mainly of text.
Schemas are good for specifying the organization of XML documents that contain a great deal of specifically typed data.
1. Schemas are very verbose and seem to be oriented toward business programmers. Types can be restricted using regular expressions, or using very Cobol-like syntax. With the Cobol-like syntax, it takes a large number of statements to specify relatively simple types.
2. Schemas can be used with a validator, like xmllint, to typecheck a file, thus obviating the necessity of doing it in your program
3. Unlike a DTD file, a Schema file is itself an XML file

One drawback of Schemas is that they do not allow entities to be easily declared and entities are restricted to being used as attribute values.

Understanding Schema Headers

Unlike DTDs, Schemas must always be external to the XML file. Both the schema and the xml file start with fairly complicated looking syntax. Here's a breakdown of what that syntax means:

Here's a sample header from the books2.xsd schema file:

    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="http://www.wiley.com"
               xmlns="http://www.wiley.com"
               elementFormDefault="qualified">

Here's what each line means:
1. The xml element is the standard one that starts all xml files
2. The xs:schema element indicates that schema is the root element for this file. The xs prefix indicates that the schema element belongs to the namespace bound to the xs prefix
3. xmlns:xs="http://www.w3.org/2001/XMLSchema": indicates that the elements and data types used in the schema come from the "http://www.w3.org/2001/XMLSchema" namespace.
  1. It also specifies that the elements and data types that come from the "http://www.w3.org/2001/XMLSchema" namespace should be prefixed with xs.
  2. The namespace should be specified exactly as above since the browser associates a schema file with this particular name (it's a system file)
4. targetNamespace="http://www.wiley.com": The namespace we are creating. The elements we are defining will be placed in this namespace. By convention, the namespace name will be your company's URL. Often the namespace name will actually be a valid URL that serves a document describing the namespace. For example, "http://www.w3.org/2001/XMLSchema" serves a document that explains the elements of an XML schema.
5. xmlns="http://www.wiley.com": The name of the default namespace. If an element is not prefixed with anything, it is assumed to belong to this namespace
6. elementFormDefault="qualified": Indicates that any element declared in this namespace must be namespace qualified, although if we declare the namespace to be the default namespace, no qualifier will be necessary

Here's a sample header from the books2.xml file:

    <?xml version="1.0" encoding="UTF-8"?>
    <books xmlns="http://www.wiley.com"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.wiley.com books2.xsd">

Here's how each line is interpreted:
1. The xml tag is the standard beginning of any xml file
2. The books tag indicates that books is the root element
3. xmlns="http://www.wiley.com": indicates that the default namespace is "http://www.wiley.com".
4. xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance": Makes the XML Schema Instance namespace available. Once again you must specify the namespace exactly as shown here, so that the browser can locate the appropriate system defined schema file for this namespace.
5. xsi:schemaLocation="http://www.wiley.com books2.xsd": With the xsi namespace declared, you can now use the schemaLocation attribute to tell the browser where to find the schema file for the "http://www.wiley.com" namespace. Note that the string holds two distinct values separated by whitespace: the name of the namespace and then the URL for the XML schema to use for that namespace.
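
To see the two headers working together, here is a minimal sketch of a schema and a matching instance file. The file names greeting.xsd and greeting.xml and the greeting element are made up for illustration; only the header attributes come from the books2 example above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- greeting.xsd (hypothetical file name) -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.wiley.com"
           xmlns="http://www.wiley.com"
           elementFormDefault="qualified">
  <!-- One global element whose content is a plain string -->
  <xs:element name="greeting" type="xs:string"/>
</xs:schema>
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- greeting.xml (hypothetical file name) -->
<greeting xmlns="http://www.wiley.com"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.wiley.com greeting.xsd">Hello</greeting>
```

Assuming both files sit in the same directory, a command such as xmllint --noout --schema greeting.xsd greeting.xml should report that the instance validates.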

Defining Elements

Elements are defined using content models. A content model defines the type of content that can be contained in an element. The four content models for XML schema elements are:

text: contains text only, but the text may be typed.
element: contains child elements
mixed: contains elements and text
empty: contains no content

An element with the first type of model is called a simple element. An element with any other type of model, or with both text and attributes, is called a complex element.
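
As a rough illustration, here is what an instance element of each content model might look like. All element names and the examples wrapper are hypothetical and chosen only to make the four cases visible side by side.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance; element names are illustrative only -->
<examples>
  <!-- text: character data only (a simple element) -->
  <age>42</age>
  <!-- element: child elements only -->
  <name><first>Ada</first><last>Lovelace</last></name>
  <!-- mixed: text interleaved with child elements -->
  <question>What is 2 + 2? <points>5</points></question>
  <!-- empty: no content at all -->
  <pagebreak/>
  <!-- text plus an attribute: complex, even though the content is only text -->
  <antique quality="good">45</antique>
</examples>
```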

XML Types

XML Schema defines a type hierarchy that starts with the root type anyType. From this root type are derived complex types and simple types. Simple types are textual data that are constrained in some way, such as to be a boolean value or an integer. XML Schema provides a wide variety of simple types, which are described further in the next section. Complex types allow you to specify aggregate types that consist of sub-elements, as well as elements that contain attributes.

Simple Types

Simple elements can be associated with a simple type as follows:

    <xs:element name="lastname" type="xs:string"/>
    <xs:element name="age" type="xs:integer"/>
    <xs:element name="startDate" type="xs:date"/>

XML Schema defines a large number of built-in simple types. Before deriving your own type, check the built-in ones first. For example, there is a standard date type, so you should use that one in preference to a custom one.

You can also derive types by using restrictions or extensions and introducing the new type with the element xs:simpleType. Restrictions on types are called facets. You almost always will use restrictions when deriving types. Extensions are typically used in the context of associating attributes with simple types. Occasionally you might use an extension to add enumerated values to a previously enumerated string type or additional children elements to a complex type. An example of using extensions to add an attribute to a simple type is shown later in the notes.

Here is an example that limits a string to one of two enumerated values:

    <xs:simpleType name="BookType">
      <xs:restriction base="xs:string">
        <xs:enumeration value="Fiction"/>
        <xs:enumeration value="Nonfiction"/>
      </xs:restriction>
    </xs:simpleType>

Notice that you can name a type, so that it can be used in multiple places. You can create either anonymous, "inline" types, or external, named types. Here is an example of an inline, anonymous type:

    <xs:element name="age">
      <xs:simpleType>
        <xs:restriction base="xs:integer">
          <xs:minInclusive value="0"/>
          <xs:maxInclusive value="120"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:element>

Here are some of the common restrictions that one employs on integers, floats, decimals and strings (decimals are arbitrary precision numbers, while floats are supposed to conform to IEEE's floating point standard; decimal numbers are equivalent to Cobol's packed decimal numbers).

Restriction	Explanation	string	integer	float	decimal
length	String must be exactly this number of chars	x
minLength	Minimum number of chars in string	x
maxLength	Maximum number of chars in string	x
pattern	Perl style regular expression	x	x	x	x
enumeration	Constrains the value space to a specified set of values	x	x	x	x
minInclusive	Minimum possible value for a number, including the specified number		x	x	x
maxInclusive	Maximum possible value for a number, including the specified number		x	x	x
minExclusive	Minimum possible value for a number, excluding the specified number		x	x	x
maxExclusive	Maximum possible value for a number, excluding the specified number		x	x	x
totalDigits	Maximum number of digits in the number		x		x
fractionDigits	Maximum number of fractional digits in the number				x

A complete list of restrictions can be found here in Section 10 (Constraining Facets) and a table showing which restrictions can be used with which simple types can be found in Section 11. Some more examples can be found in books2.xsd.
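
To make a few of the facets in the table concrete, here is a small sketch of a schema that uses pattern, totalDigits, fractionDigits, minLength and maxLength. The type names ZipCode, Price and ShortTitle are hypothetical, and the schema has no target namespace to keep it short.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- pattern: five digits, optionally followed by a hyphen and four more digits -->
  <xs:simpleType name="ZipCode">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{5}(-[0-9]{4})?"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- totalDigits and fractionDigits: at most 6 digits, at most 2 after the point -->
  <xs:simpleType name="Price">
    <xs:restriction base="xs:decimal">
      <xs:minInclusive value="0"/>
      <xs:totalDigits value="6"/>
      <xs:fractionDigits value="2"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- minLength and maxLength: between 1 and 40 characters -->
  <xs:simpleType name="ShortTitle">
    <xs:restriction base="xs:string">
      <xs:minLength value="1"/>
      <xs:maxLength value="40"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

Under these definitions, 37996-1234 would be accepted as a ZipCode and 1234.56 as a Price, while 1234.567 would be rejected as a Price because it has three fractional digits.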

Attributes

Attributes are introduced just like elements, with a name and a type. For example:

    <xs:attribute name="productId" type="xs:integer" default="10"/>
    <xs:attribute name="contentType" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="Fiction"/>
          <xs:enumeration value="Nonfiction"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>

An attribute declaration may carry one of three settings, which determine the attribute's initial value:

default="value": optional attribute that will be given the default value if the user does not provide one
fixed="value": a fixed value that is given to the attribute and that cannot be changed by the user.
use="required": an attribute whose value must be provided by the user.
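
The three settings can be compared side by side in a sketch like the following. The product element and the currency and sku attributes are hypothetical; only productId comes from the example above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="product">
    <xs:complexType>
      <!-- default: optional; treated as 10 when the instance omits it -->
      <xs:attribute name="productId" type="xs:integer" default="10"/>
      <!-- fixed: optional, but if present its value must be exactly USD -->
      <xs:attribute name="currency" type="xs:string" fixed="USD"/>
      <!-- required: every product element must supply it -->
      <xs:attribute name="sku" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

With this schema, an instance element written as product with only sku="A-100" should validate, while one that omits sku, or one that sets currency="EUR", should be rejected.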

Complex Types

Complex types are divided into two groups: those with simple content and those with complex content. Simple content is used when you want to have an element with a simple type and attributes. Complex content is used when you want your element to have child elements. Of course complex content is also allowed to have attributes.

xs:simpleContent

While you typically use restrictions to derive new simple types, you typically use extensions to derive new types based on simpleContent. The reason is that you use an extension to add attributes to simple types. Here is an example where I take the age that I defined earlier and associate it with an element named antique that has an attribute named quality:

    <xs:simpleType name="antique_age">
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="30"/>
      </xs:restriction>
    </xs:simpleType>

    <xs:element name="antique">
      <xs:complexType>
        <xs:simpleContent>
          <xs:extension base="antique_age">
            <xs:attribute name="quality" type="xs:string"/>
          </xs:extension>
        </xs:simpleContent>
      </xs:complexType>
    </xs:element>

Note how the type is introduced with the tag xs:complexType.
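
As a quick check of the antique declaration, here is a sketch of an instance element that I would expect to validate. It assumes the schema's namespace setup matches the instance (the namespace header is omitted for brevity), and the quality value "excellent" is made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- 45 satisfies minInclusive 30; quality is an optional string attribute -->
<antique quality="excellent">45</antique>
```

A value such as 12 should be rejected because it falls below the minimum of 30, and a child element inside antique should be rejected because simpleContent permits only text.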

xs:complexContent

If I want my type to contain children elements, then I need to use xs:complexContent. This tag allows me to use one of three compositor elements to specify the structure of the tag:

sequence: indicates that the elements must occur in the specified order
choice: indicates that any one of the elements may occur
all: indicates that any or all of the elements may occur, and in any order

Within each of these compositor tags, you list the elements that may occur. For brevity, you may omit the xs:complexContent tag when it appears inside an xs:complexType tag, because content that appears inside such a tag is assumed by default to be complexContent. For example:

    <xs:complexType name="bookType">
      <xs:sequence>
        <xs:element ref="title"/>
        <xs:element ref="author"/>
        <xs:element ref="publisher"/>
        <xs:element ref="price"/>
        <xs:element ref="isbn"/>
      </xs:sequence>
    </xs:complexType>

or

    <xs:complexType>
      <xs:sequence>
        <xs:element name="lastname" type="xs:string"/>
        <xs:choice>
          <xs:element ref="spouse"/>
          <xs:element ref="child"/>
          <xs:element ref="sibling"/>
        </xs:choice>
      </xs:sequence>
    </xs:complexType>

Notice the use of ref rather than a name/type pair. The ref attribute must have a value equal to the name of an element that is declared either earlier or later in the document (i.e., the declared element's name attribute must be the same as the ref attribute's value).
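
To show where a ref actually points, here is a minimal sketch of a complete schema in which two elements are declared globally and then referenced from a complex type. The book element is hypothetical, and the schema has no target namespace to keep the names unprefixed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Global declarations: direct children of xs:schema, so they can be referenced -->
  <xs:element name="title" type="xs:string"/>
  <xs:element name="author" type="xs:string"/>

  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <!-- Each ref names one of the global declarations above -->
        <xs:element ref="title"/>
        <xs:element ref="author"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Only elements declared at the top level of the schema can be the target of a ref; an element declared inside another complex type cannot.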

Additionally, you can specify the number of occurrences of an element or whether mixed content is allowed with the following attributes:

minOccurs, maxOccurs: the minimum or maximum number of times an element may occur. Use "unbounded" for an unbounded number of times. If you want something to occur an exact number of times, you must use minOccurs and maxOccurs, providing both with the same value.
mixed="true": specifies that both child elements and text are allowed. For example, I might use it to define a question type in which I could embed a points element as well as a statement of the problem.

If I want to include attributes with complex types, they must be listed after the compositor element.
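
The following sketch pulls these pieces together: occurrence limits, an attribute placed after the compositor, and a mixed-content type. The book, reviewer and isbn names are hypothetical; the question and points names follow the description above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <!-- one or more authors -->
        <xs:element name="author" type="xs:string"
                    minOccurs="1" maxOccurs="unbounded"/>
        <!-- exactly two reviewers: both bounds get the same value -->
        <xs:element name="reviewer" type="xs:string"
                    minOccurs="2" maxOccurs="2"/>
      </xs:sequence>
      <!-- attributes are listed after the compositor -->
      <xs:attribute name="isbn" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>

  <!-- mixed content: text may surround the points child element -->
  <xs:element name="question">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element name="points" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Under this schema, a question element containing the text "What is 2 + 2?" followed by a points child holding 5 should validate.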

Empty Elements

Empty elements may be specified with or without attributes, but in either case they require a complexType.
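
Here is a sketch of both cases, one empty element without attributes and one with. The pagebreak and image element names and the src attribute are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Empty, no attributes: a complexType with nothing inside it -->
  <xs:element name="pagebreak">
    <xs:complexType/>
  </xs:element>

  <!-- Empty, with an attribute: no compositor, only the attribute declaration -->
  <xs:element name="image">
    <xs:complexType>
      <xs:attribute name="src" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Because neither complexType contains a compositor or simpleContent, any text or child element inside pagebreak or image should be rejected by a validator.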