Generating XML Data
This section takes you step by step through the process of constructing an XML document. Along the way, you'll gain experience with the XML components you'll typically use to create your data structures.
Writing a Simple XML File
You'll start by writing the kind of XML data you can use for a slide presentation. To become comfortable with the basic format of an XML file, you'll use your text editor to create the data. You'll use this file and extend it in later exercises.
Creating the File
Using a standard text editor, create a file called slideSample.xml.
Note: Here is a version of it that already exists: slideSample01.xml. (The browsable version is slideSample01-xml.html.) You can use this version to compare your work or just review it as you read this guide.
Writing the Declaration
Next, write the declaration, which identifies the file as an XML document. The declaration starts with the characters <?, which is also the standard XML identifier for a processing instruction. (You'll see processing instructions later in this tutorial.)
<?xml version='1.0' encoding='utf-8'?>
This line identifies the document as an XML document that conforms to version 1.0 of the XML specification and says that it uses the 8-bit Unicode character-encoding scheme, UTF-8. (For information on encoding schemes, see Appendix A.)
Because the document has not been specified as standalone, the parser assumes that it may contain references to other documents. To see how to specify a document as standalone, see The XML Prolog.
Adding a Comment
Comments are ignored by XML parsers. A program will never see them unless you activate special settings in the parser. To put a comment into the file, add the following highlighted text.
<?xml version='1.0' encoding='utf-8'?>
<!-- A SAMPLE set of slides -->
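To see what "special settings" can look like in practice, here is a minimal sketch that uses Python's standard xml.etree.ElementTree module (an assumption made for illustration; it is not the parser this tutorial uses, and the insert_comments option requires Python 3.8 or later). The comment sits inside the root element here so that it shows up as a child node.

```python
import xml.etree.ElementTree as ET

DOC = "<slideshow><!-- TITLE SLIDE --><slide/></slideshow>"

# Default settings: the comment never reaches the program.
plain = ET.fromstring(DOC)
plain_tags = [child.tag for child in plain]

# Special setting: ask the tree builder to keep comments as nodes.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
kept = ET.fromstring(DOC, parser=parser)
comments = [child.text for child in kept if child.tag is ET.Comment]

print(plain_tags)
print(comments)
```

With the default parser only the slide element is visible; with the option enabled, the comment text is available as well.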
Defining the Root Element
After the declaration, every XML file defines exactly one element, known as the root element. Any other elements in the file are contained within that element. Enter the following highlighted text to define the root element for this file, slideshow:
<?xml version='1.0' encoding='utf-8'?>
<!-- A SAMPLE set of slides -->
<slideshow>
</slideshow>
Note: XML element names are case-sensitive. The end tag must exactly match the start tag.
Adding Attributes to an Element
A slide presentation has a number of associated data items, none of which requires any structure. So it is natural to define these data items as attributes of the slideshow element. Add the following highlighted text to set up some attributes:
...
<slideshow
    title="Sample Slide Show"
    date="Date of publication"
    author="Yours Truly"
    >
</slideshow>
When you create a name for a tag or an attribute, you can use hyphens (-), underscores (_), colons (:), and periods (.) in addition to letters and numbers. Unlike HTML, values for XML attributes are always in quotation marks, and multiple attributes are never separated by commas.
Note: Colons should be used with care or avoided, because they are used when defining the namespace for an XML document.
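As a quick check of how these attributes reach a program, the following sketch reads them back with Python's standard xml.etree.ElementTree module. The choice of Python is an assumption made for illustration; any XML parser reports the same names and values.

```python
import xml.etree.ElementTree as ET

SLIDESHOW = """<slideshow
    title="Sample Slide Show"
    date="Date of publication"
    author="Yours Truly"
    >
</slideshow>"""

root = ET.fromstring(SLIDESHOW)

# Attributes arrive as a name-to-value mapping; the quotation marks
# delimit the values in the file but are not part of them.
attributes = dict(root.attrib)
for name, value in attributes.items():
    print(f"{name} = {value}")
```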
Adding Nested Elements
XML allows for hierarchically structured data, which means that an element can contain other elements. Add the following highlighted text to define a slide element and a title element contained within it:
<slideshow ... >
  <!-- TITLE SLIDE -->
  <slide type="all">
    <title>Wake up to WonderWidgets!</title>
  </slide>
</slideshow>
Here you have also added a type attribute to the slide. The idea of this attribute is that you can earmark slides for a mostly technical or mostly executive audience using type="tech" or type="exec", or identify them as suitable for both audiences using type="all".
More importantly, this example illustrates the difference between things that are more usefully defined as elements (the title element) and things that are more suitable as attributes (the type attribute). The visibility heuristic is primarily at work here. The title is something the audience will see, so it is an element. The type, on the other hand, is something that never gets presented, so it is an attribute. Another way to think about that distinction is that an element is a container, like a bottle. The type is a characteristic of the container (tall or short, wide or narrow). The title is a characteristic of the contents (water, milk, or tea). These are not hard-and-fast rules, of course, but they can help when you design your own XML structures.
Adding HTML-Style Text
Because XML lets you define any tags you want, it makes sense to define a set of tags that look like HTML. In fact, the XHTML standard does exactly that. You'll see more about that toward the end of the SAX tutorial. For now, type the following highlighted text to define a slide with a couple of list item entries that use an HTML-style <em> tag for emphasis (usually rendered as italicized text):
...
  <!-- TITLE SLIDE -->
  <slide type="all">
    <title>Wake up to WonderWidgets!</title>
  </slide>
  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    <item>Why <em>WonderWidgets</em> are great</item>
    <item>Who <em>buys</em> WonderWidgets</item>
  </slide>
</slideshow>
Note that defining a title element conflicts with the XHTML element that uses the same name. Later in this tutorial, we discuss the mechanism that produces the conflict (the DTD), along with possible solutions.
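An item that mixes text with an <em> element is worth a closer look, because the text ends up split around the nested element. The sketch below shows how one parser, Python's standard xml.etree.ElementTree (an assumption made for illustration), exposes the pieces: text before the child, text inside it, and the "tail" that follows it.

```python
import xml.etree.ElementTree as ET

item = ET.fromstring("<item>Why <em>WonderWidgets</em> are great</item>")
em = item.find("em")

# Text that precedes the first child element.
print(repr(item.text))
# Text inside the nested element.
print(repr(em.text))
# Text that follows the nested element's end tag.
print(repr(em.tail))

# Joining all text nodes recovers the sentence a reader would see.
full_text = "".join(item.itertext())
print(full_text)
```

Other APIs slice the same content differently (SAX, for example, delivers it as a sequence of events), but the content itself is the same.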
Adding an Empty Element
One major difference between HTML and XML is that all XML must be well formed, which means that every tag must have an ending tag or be an empty tag. By now, you're getting pretty comfortable with ending tags. Add the following highlighted text to define an empty list item element with no contents:
...
  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    <item>Why <em>WonderWidgets</em> are great</item>
    <item/>
    <item>Who <em>buys</em> WonderWidgets</item>
  </slide>
</slideshow>
Note that any element can be an empty element. All it takes is ending the tag with /> instead of >. You could do the same thing by entering <item></item>, which is equivalent.
Note: Another factor that makes an XML file well formed is proper nesting. So <b><i>some_text</i></b> is well formed, because the <i>...</i> sequence is completely nested within the <b>...</b> tag. This sequence, on the other hand, is not well formed: <b><i>some_text</b></i>.
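The well-formedness rules above are easy to try out. Here is a minimal sketch that assumes Python's standard xml.etree.ElementTree parser; the helper name is_well_formed is made up for this example. It simply reports whether the parser accepts a piece of XML.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Properly nested tags, and the two equivalent forms of an empty element.
print(is_well_formed("<b><i>some_text</i></b>"))
print(is_well_formed("<slide><item/></slide>"))
print(is_well_formed("<slide><item></item></slide>"))

# Improper nesting, a missing end tag, and mismatched case.
print(is_well_formed("<b><i>some_text</b></i>"))
print(is_well_formed("<slide><item></slide>"))
print(is_well_formed("<Slide></slide>"))
```

The first three calls succeed and the last three fail, which matches the rules for end tags, empty tags, nesting, and case-sensitive names.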
The Finished Product
Here is the completed version of the XML file:
<?xml version='1.0' encoding='utf-8'?>
<!-- A SAMPLE set of slides -->
<slideshow
    title="Sample Slide Show"
    date="Date of publication"
    author="Yours Truly"
    >
  <!-- TITLE SLIDE -->
  <slide type="all">
    <title>Wake up to WonderWidgets!</title>
  </slide>
  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    <item>Why <em>WonderWidgets</em> are great</item>
    <item/>
    <item>Who <em>buys</em> WonderWidgets</item>
  </slide>
</slideshow>
Save a copy of this file as slideSample01.xml so that you can use it as the initial data structure when experimenting with XML programming operations.
Writing Processing Instructions
It sometimes makes sense to code application-specific processing instructions in the XML data. In this exercise, you'll add a processing instruction to your slideSample.xml file.
Note: The file you'll create in this section is slideSample02.xml. (The browsable version is slideSample02-xml.html.)
As you saw in Processing Instructions, the format for a processing instruction is <?target data?>, where target is the application that is expected to do the processing, and data is the instruction or information for it to process. Add the following highlighted text to add a processing instruction for a mythical slide presentation program that will query the user to find out which slides to display (technical, executive-level, or all):
<slideshow ... >
  <!-- PROCESSING INSTRUCTION -->
  <?my.presentation.Program QUERY="exec, tech, all"?>
  <!-- TITLE SLIDE -->
Notes:
- The data portion of the processing instruction can contain spaces, or it can even be empty. But there cannot be any space between the initial <? and the target identifier.
- The data begins after the first space.
- It makes sense to fully qualify the target with the complete Web-unique package prefix, to preclude any conflict with other programs that might process the same data.
- For readability, it seems like a good idea to include a colon (:) after the name of the application:
  <?my.presentation.Program: QUERY="exec, tech, all"?>
  The colon makes the target name into a kind of "label" that identifies the intended recipient of the instruction. However, even though the W3C spec allows a colon in a target name, some versions of Internet Explorer 5 (IE5) consider it an error. For this tutorial, then, we avoid using a colon in the target name.
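The split between target and data can be observed directly. The sketch below uses Python's standard xml.dom.minidom module (an assumption made for illustration; it stands in for whatever parser the mythical presentation program would use) to pull the processing instruction out of a cut-down slideshow.

```python
from xml.dom import minidom

DOC = """<slideshow>
<?my.presentation.Program QUERY="exec, tech, all"?>
</slideshow>"""

document = minidom.parseString(DOC)

# Keep only the processing-instruction nodes among the root's children.
instructions = [
    node
    for node in document.documentElement.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
pi = instructions[0]

# Everything up to the first space is the target; the rest is the data.
print(pi.target)
print(pi.data)
```

Note that the parser does not interpret the data in any way. It is up to the target application to decide that QUERY="exec, tech, all" means anything.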
Save a copy of this file as slideSample02.xml so that you can use it when experimenting with processing instructions.
Introducing an Error
The parser can generate three kinds of errors: a fatal error, an error, and a warning. In this exercise, you'll make a simple modification to the XML file to introduce a fatal error. Then you'll see how it's handled in the Echo app.
Note: The XML structure you'll create in this exercise is in slideSampleBad1.xml. (The browsable version is slideSampleBad1-xml.html.)
One easy way to introduce a fatal error is to remove the final / from the empty item element to create a tag that does not have a corresponding end tag. That constitutes a fatal error, because all XML documents must, by definition, be well formed. This change produces the following:
...
<item>Why <em>WonderWidgets</em> are great</item>
<item>
<item>Who <em>buys</em> WonderWidgets</item>
...
Now you have a file that you can use to generate an error in any parser, any time. (XML parsers are required to generate a fatal error for this file, because the lack of an end tag for the <item> element means that the XML structure is no longer well formed.)
Substituting and Inserting Text
In this section, you'll learn about handling special characters (such as < and &) and handling text with XML-style syntax.
Handling Special Characters
In XML, an entity is an XML structure (or plain text) that has a name. Referencing the entity by name causes it to be inserted into the document in place of the entity reference. To create an entity reference, the entity name is surrounded by an ampersand and a semicolon, like this:
&entityName;
Later, when you learn how to write a DTD, you'll see that you can define your own entities so that &yourEntityName; expands to all the text you defined for that entity. For now, though, we'll focus on the predefined entities and character references that don't require any special definitions.
Predefined Entities
An entity reference such as &amp; contains a name (in this case, amp) between the start and end delimiters. The text it refers to (&) is substituted for the name, as with a macro in a programming language. Table 2-1 shows the predefined entities for special characters.
Table 2-1 Predefined Entities
Character   Name           Reference
&           ampersand      &amp;
<           less than      &lt;
>           greater than   &gt;
"           quote          &quot;
'           apostrophe     &apos;
Character References
A character reference such as &#8220; contains a hash mark (#) followed by a number. The number is the Unicode value for a single character, such as 65 for the letter A, 8220 for the left curly quote, or 8221 for the right curly quote. In this case, the "name" of the entity is the hash mark followed by the digits that identify the character.
Note: A number that directly follows the hash mark is read as decimal. However, the Unicode charts at http://www.unicode.org/charts/ specify values in hexadecimal! So you'll need either to convert the value to decimal or to write it in hexadecimal with an x after the hash mark (for example, &#x201C;) to get the right character into your XML data set.
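Both kinds of reference can be tried out in a few lines. This sketch assumes Python's standard xml.etree.ElementTree parser, chosen only for illustration. It relies on the fact that XML 1.0 accepts character references in decimal (&#65;) and in hexadecimal with an x prefix (&#x41;), and it shows the conversion from a chart value to decimal.

```python
import xml.etree.ElementTree as ET

# Predefined entity references are replaced by the characters they name.
item = ET.fromstring(
    "<item>Market Size &lt; predicted &amp; &quot;actual&quot;</item>"
)
print(item.text)

# Character references: "A" in decimal and hexadecimal form, then the
# curly double quotes (U+201C and U+201D) written both ways.
quoted = ET.fromstring(
    "<item>&#65;&#x41; &#8220;hi&#8221; &#x201C;hi&#x201D;</item>"
)
print(quoted.text)

# Converting a hexadecimal value from the Unicode charts to decimal.
left_quote_decimal = int("201C", 16)
print(left_quote_decimal)
```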
Using an Entity Reference in an XML Document
Suppose you want to insert a line like this in your XML document:
Market Size < predicted
The problem with putting that line into an XML file directly is that when the parser sees the left angle bracket (<), it starts looking for a tag name, which throws off the parse. To get around that problem, you put &lt; in the file instead of <.
Note: The results of the next modifications are contained in slideSample03.xml.
Add the following highlighted text to your slideSample.xml file, and save a copy of it for future use as slideSample03.xml:
  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    ...
  </slide>
  <slide type="exec">
    <title>Financial Forecast</title>
    <item>Market Size &lt; predicted</item>
    <item>Anticipated Penetration</item>
    <item>Expected Revenues</item>
    <item>Profit Margin</item>
  </slide>
</slideshow>
When you use an XML parser to echo this data, you will see the desired output:
Market Size < predicted
You see an angle bracket (<) where you coded &lt;, because the XML parser converts the reference into the entity it represents and passes that entity to the application.
Handling Text with XML-Style Syntax
When you are handling large blocks of XML or HTML that include many special characters, it is inconvenient to replace each of them with the appropriate entity reference. For those situations, you can use a CDATA section.
Note: The results of the next modifications are contained in slideSample04.xml.
A CDATA section works like <pre>...</pre> in HTML, only more so: all whitespace in a CDATA section is significant, and characters in it are not interpreted as XML. A CDATA section starts with <![CDATA[ and ends with ]]>.
Add the following highlighted text to your slideSample.xml file to define a CDATA section for a fictitious technical slide, and save a copy of the file as slideSample04.xml:
...
  <slide type="tech">
    <title>How it Works</title>
    <item>First we fozzle the frobmorten</item>
    <item>Then we framboze the staten</item>
    <item>Finally, we frenzle the fuznaten</item>
    <item><![CDATA[Diagram:
frobmorten <--------------- fuznaten
    |            <3>            ^
    | <1>                       |   <1> = fozzle
    V                           |   <2> = framboze
  staten-------------------------+   <3> = frenzle
             <2>
]]></item>
  </slide>
</slideshow>
When you echo this file with an XML parser, you see the following output:
Diagram:
frobmorten <--------------- fuznaten
    |            <3>            ^
    | <1>                       |   <1> = fozzle
    V                           |   <2> = framboze
  staten-------------------------+   <3> = frenzle
             <2>
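You can confirm this behavior with a much smaller CDATA section. The following sketch assumes Python's standard xml.etree.ElementTree parser (used here only for illustration) and a shortened, made-up version of the legend from the diagram.

```python
import xml.etree.ElementTree as ET

SLIDE = """<slide type="tech">
  <item><![CDATA[<1> = fozzle
<2> = framboze & <3> = frenzle]]></item>
</slide>"""

item = ET.fromstring(SLIDE).find("item")

# The angle brackets, the ampersand, and the newline all come through
# as ordinary text, and no child elements are created for <1>, <2>, <3>.
print(item.text)
print(len(item))
```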
The point here is that the text in the CDATA section arrives as it was written. Because the parser doesn't treat the angle brackets as XML, they don't generate the fatal errors they would otherwise cause. (If the angle brackets weren't in a CDATA section, the document would not be well formed.)
Creating a Document Type Definition
After the XML declaration, the document prolog can include a DTD, which lets you specify the kinds of tags that can be included in your XML document. In addition to telling a validating parser which tags are valid and in what arrangements, a DTD tells both validating and nonvalidating parsers where text is expected, which lets the parser determine whether the whitespace it sees is significant or ignorable.
Basic DTD Definitions
To begin learning about DTD definitions, let's start by telling the parser where text is expected and where any text (other than whitespace) would be an error. (Whitespace in such locations is ignorable.)
Note: The DTD defined in this section is contained in slideshow1a.dtd. (The browsable version is slideshow1a-dtd.html.)
Start by creating a file named slideshow.dtd. Enter an XML declaration and a comment to identify the file:
<?xml version='1.0' encoding='utf-8'?>
<!-- DTD for a simple "slide show" -->
Next, add the following highlighted text to specify that a slideshow element contains slide elements and nothing else:
<!ELEMENT slideshow (slide+)>
As you can see, the DTD tag starts with <! followed by the tag name (ELEMENT). After the tag name comes the name of the element that is being defined (slideshow) and, in parentheses, one or more items that indicate the valid contents for that element. In this case, the notation says that a slideshow consists of one or more slide elements.
Without the plus sign, the definition would be saying that a slideshow consists of a single slide element. The qualifiers you can add to an element definition are listed in Table 2-2.
Table 2-2 DTD Element Qualifiers
Qualifier   Name            Meaning
?           Question mark   Optional (zero or one)
*           Asterisk        Zero or more
+           Plus sign       One or more
You can include multiple elements inside the parentheses in a comma-separated list and use a qualifier on each element to indicate how many instances of that element can occur. The comma-separated list tells which elements are valid and the order they can occur in.
You can also nest parentheses to group multiple items. For example, after defining an image element (discussed shortly), you can specify ((image, title)+) to declare that every image element in a slide must be paired with a title element. Here, the plus sign applies to the image/title pair to indicate that one or more pairs of the specified items can occur.
Defining Text and Nested Elements
Now that you have told the parser something about where not to expect text, let's see how to tell it where text can occur. Add the following highlighted text to define the slide, title, and item elements:
<!ELEMENT slideshow (slide+)>
<!ELEMENT slide (title, item*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* >
The first line you added says that a slide consists of a title followed by zero or more item elements. Nothing new there. The next line says that a title consists entirely of parsed character data (PCDATA). That's known as "text" in most parts of the country, but in XML-speak it's called "parsed character data." (That distinguishes it from CDATA sections, which contain character data that is not parsed.) The # that precedes PCDATA indicates that what follows is a special word rather than an element name.
The last line introduces the vertical bar (|), which indicates an or condition. In this case, either PCDATA or an item can occur. The asterisk at the end says that either element can occur zero or more times in succession. The result of this specification is known as a mixed-content model, because any number of item elements can be interspersed with the text. Such models must always be defined with #PCDATA specified first, followed by some number of alternate items divided by vertical bars (|), and an asterisk (*) at the end.
Save a copy of this DTD as slideshow1a.dtd for use when you experiment with basic DTD processing.
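The ?, *, and + qualifiers behave much like the same symbols in regular expressions, and that analogy can be turned into a rough check. The sketch below is not a validating parser. It is a hand-written approximation in Python (an assumption made for illustration; the helper children_match and the two patterns are invented for this example) that compares only the order of child element names against (slide+) and (title, item*), ignoring text and attributes.

```python
import re
import xml.etree.ElementTree as ET

# Regular-expression stand-ins for two DTD content models, applied to
# the space-terminated names of an element's children.
SLIDESHOW_MODEL = re.compile(r"(slide )+")      # (slide+)
SLIDE_MODEL = re.compile(r"(title )(item )*")   # (title, item*)

def children_match(element, model):
    """Check the sequence of child element names against a pattern."""
    names = "".join(child.tag + " " for child in element)
    return model.fullmatch(names) is not None

good = ET.fromstring("<slide><title>Overview</title><item/><item/></slide>")
title_only = ET.fromstring("<slide><title>Overview</title></slide>")
no_title = ET.fromstring("<slide><item/></slide>")
wrong_order = ET.fromstring("<slide><item/><title>Overview</title></slide>")
empty_show = ET.fromstring("<slideshow></slideshow>")

print(children_match(good, SLIDE_MODEL))
print(children_match(title_only, SLIDE_MODEL))
print(children_match(no_title, SLIDE_MODEL))
print(children_match(wrong_order, SLIDE_MODEL))
print(children_match(empty_show, SLIDESHOW_MODEL))
```

A slide with a title and any number of items passes, while a missing title, items ahead of the title, or a slideshow with no slides all fail, just as the comma and the qualifiers require.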
Limitations of DTDs
It would be nice if we could specify that an item contains either text, or text followed by one or more list items. But that kind of specification turns out to be hard to achieve in a DTD. For example, you might be tempted to define an item this way:
<!ELEMENT item (#PCDATA | (#PCDATA, item+)) >
That would certainly be accurate, but as soon as the parser sees #PCDATA and the vertical bar, it requires the remaining definition to conform to the mixed-content model. This specification doesn't, so you get an error that says Illegal mixed content model for 'item'. Found &#x28; ..., where the hex character 28 is the left parenthesis that opens the nested group.
Trying to double-define the item element doesn't work either. Suppose you try a specification like this:
<!ELEMENT item (#PCDATA) >
<!ELEMENT item (#PCDATA, item+) >
This sequence produces a "duplicate definition" warning when the validating parser runs. The second definition is, in fact, ignored. So it seems that defining a mixed-content model (which allows item elements to be interspersed in text) is the best we can do.
In addition to the limitations of the mixed-content model we've mentioned, there is no way to further qualify the kind of text that can occur where PCDATA has been specified. Should it contain only numbers? Should it be in a date format, or possibly a monetary format? There is no way to specify such things in a DTD.
Finally, note that the DTD offers no sense of hierarchy. The definition of the title element applies equally to a slide title and to an item title. When we expand the DTD to allow HTML-style markup in addition to plain text, it would make sense to, for example, restrict the size of an item title compared with that of a slide title. But the only way to do that would be to give one of them a different name, such as item-title. The bottom line is that the lack of hierarchy in the DTD forces you to introduce a "hyphenation hierarchy" (or its equivalent) in your namespace. All these limitations are fundamental motivations behind the development of schema-specification standards.
Special Element Values in the DTD
Rather than specify a parenthesized list of elements, the element definition can use one of two special values: ANY or EMPTY. The ANY specification says that the element can contain any other defined element, or PCDATA. Such a specification is usually used for the root element of a general-purpose XML document such as you might create with a word processor. Textual elements can occur in any order in such a document, so specifying ANY makes sense.
The EMPTY specification says that the element contains no contents. So the DTD for email messages that let you flag the message with <flag/> might have a line like this in the DTD:
<!ELEMENT flag EMPTY>
Referencing the DTD
In this case, the DTD definition is in a separate file from the XML document. With this arrangement, you reference the DTD from the XML document, and that makes the DTD file part of the external subset of the full document type definition for the XML file. As you'll see later on, you can also include parts of the DTD within the document. Such definitions constitute the local subset of the DTD.
Note: The XML written in this section is contained in slideSample05.xml. (The browsable version is slideSample05-xml.html.)
To reference the DTD file you just created, add the following highlighted line to your slideSample.xml file, and save a copy of the file as slideSample05.xml:
<!DOCTYPE slideshow SYSTEM "slideshow.dtd">
Again, the DTD tag starts with <!. In this case, the tag name, DOCTYPE, says that the document is a slideshow, which means that the document consists of the slideshow element and everything within it:
<!DOCTYPE slideshow ...
This tag defines the slideshow element as the root element for the document. An XML document must have exactly one root element. This is where that element is specified. In other words, this tag identifies the document content as a slideshow.
The DOCTYPE tag occurs after the XML declaration and before the root element. The SYSTEM identifier specifies the location of the DTD file. Because it does not start with a prefix such as http:/ or file:/, the path is relative to the location of the XML document. Remember the setDocumentLocator method? The parser is using that information to find the DTD file, just as your application would use it to find a file relative to the XML document. A PUBLIC identifier can also be used to specify the DTD file using a unique name, but the parser would have to be able to resolve it.
The DOCTYPE specification can also contain DTD definitions within the XML document, rather than refer to an external DTD file. Such definitions are contained in square brackets:
<!DOCTYPE slideshow SYSTEM "slideshow.dtd" [
  ...local subset definitions here...
]>
You'll take advantage of that facility in a moment to define some entities that can be used in the document.
Documents and Data
Earlier, you learned that one reason you hear about XML documents, on the one hand, and XML data, on the other, is that XML handles both comfortably, depending on whether text is or is not allowed between elements in the structure.
In the sample file you have been working with, the slideshow element is an example of a data element: it contains only subelements with no intervening text. The item element, on the other hand, might be termed a document element, because it is defined to include both text and subelements.
As you work through this tutorial, you will see how to expand the definition of the title element to include HTML-style markup, which will turn it into a document element as well.
Defining Attributes and Entities in the DTD
The DTD you've defined so far is fine for use with a nonvalidating parser. It tells where text is expected and where it isn't, and that is all the nonvalidating parser pays attention to. But for use with the validating parser, the DTD must specify the valid attributes for the different elements. You'll do that in this section, and then you'll define one internal entity and one external entity that you can reference in your XML file.
Defining Attributes in the DTD
Let's start by defining the attributes for the elements in the slide presentation.
Note: The DTD written in this section is contained in slideshow1b.dtd. (The browsable version is slideshow1b-dtd.html.)
Add the following highlighted text to define the attributes for the slideshow element:
<!ELEMENT slideshow (slide+)>
<!ATTLIST slideshow
    title   CDATA  #REQUIRED
    date    CDATA  #IMPLIED
    author  CDATA  "unknown"
>
<!ELEMENT slide (title, item*)>
The DTD tag ATTLIST begins the series of attribute definitions. The name that follows ATTLIST specifies the element for which the attributes are being defined. In this case, the element is the slideshow element. (Note again the lack of hierarchy in DTD specifications.)
Each attribute is defined by a series of three space-separated values. Commas and other separators are not allowed, so formatting the definitions as shown here is helpful for readability. The first element in each line is the name of the attribute: title, date, or author, in this case. The second element indicates the type of the data: CDATA is character data--unparsed data, again, in which a left angle bracket (<) will never be construed as part of an XML tag. Table 2-3 presents the valid choices for the attribute type.
When the attribute type consists of a parenthesized list of choices separated by vertical bars, the attribute must use one of the specified values. For an example, add the following highlighted text to the DTD:
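<!ELEMENT slide (title, item*)>
<!ATTLIST slide
    type  (tech | exec | all)  #IMPLIED
>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* >
This specification says that the slide element's type attribute must be given as type="tech", type="exec", or type="all". No other values are acceptable. (DTD-aware XML editors can use such specifications to present a pop-up list of choices.)
The last entry in the attribute specification determines the attribute's default value, if any, and tells whether or not the attribute is required. Table 2-4 shows the possible choices.
Table 2-4 Attribute-Specification Parameters
Specification         Specifies
#REQUIRED             The attribute value must be specified in the document.
#IMPLIED              The value need not be specified in the document. If it isn't, the application will have a default value it uses.
"defaultValue"        The default value to use if a value is not specified in the document.
#FIXED "fixedValue"   The value to use. If the document specifies any value at all, it must be the same.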
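To see a default value being supplied, you can hand a parser a document that omits the attribute. The sketch below assumes Python's standard xml.parsers.expat module, which is nonvalidating, so it will not complain about a missing #REQUIRED attribute, but it does apply defaults declared in the DTD. The ATTLIST is placed in the document's local subset here so that the example needs no external file.

```python
import xml.parsers.expat

DOC = """<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE slideshow [
  <!ATTLIST slideshow
      title   CDATA  #REQUIRED
      date    CDATA  #IMPLIED
      author  CDATA  "unknown"
  >
]>
<slideshow title="Sample Slide Show"/>"""

seen = {}

def start_element(name, attributes):
    # The mapping holds the attributes written in the start tag plus
    # any defaults taken from the ATTLIST declaration.
    seen[name] = dict(attributes)

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse(DOC, True)

print(seen["slideshow"])
```

The author attribute shows up with the value "unknown" even though the document never mentions it, while the #IMPLIED date attribute is simply absent and left for the application to handle.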
Finally, save a copy of the DTD as slideshow1b.dtd for use when you experiment with attribute definitions.
Defining Entities in the DTD
So far, you've seen predefined entities such as &amp; and you've seen that an attribute can reference an entity. It's time now for you to learn how to define entities of your own.
Note: The XML you'll create here is contained in slideSample06.xml. (The browsable version is slideSample06-xml.html.)
Add the following highlighted text to the DOCTYPE tag in your XML file:
<!DOCTYPE slideshow SYSTEM "slideshow.dtd" [
  <!ENTITY product "WonderWidget">
  <!ENTITY products "WonderWidgets">
]>
The ENTITY tag name says that you are defining an entity. Next comes the name of the entity and its definition. In this case, you are defining an entity named product that will take the place of the product name. Later when the product name changes (as it most certainly will), you need only change the name in one place, and all your slides will reflect the new value.
The last part is the substitution string that replaces the entity name whenever it is referenced in the XML document. The substitution string is defined in quotes, which are not included when the text is inserted into the document.
Just for good measure, we defined two versions--one singular and one plural--so that when the marketing mavens come up with "Wally" for a product name, you will be prepared to enter the plural as "Wallies" and have it substituted correctly.
Note: Truth be told, this is the kind of thing that really belongs in an external DTD so that all your documents can reference the new name when it changes. But, hey, this is only an example.
Now that you have the entities defined, the next step is to reference them in the slide show. Make the following highlighted changes, replacing each occurrence of the product name with an entity reference:
<slideshow
    title="&product; Slide Show"
    ...
  <!-- TITLE SLIDE -->
  <slide type="all">
    <title>Wake up to &products;!</title>
  </slide>
  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    <item>Why <em>&products;</em> are great</item>
    <item/>
    <item>Who <em>buys</em> &products;</item>
  </slide>
Notice two points. Entities you define are referenced with the same syntax (&entityName;) that you use for predefined entities, and the entity can be referenced in an attribute value as well as in an element's contents.
When you echo this version of the file with an XML parser, here is the kind of thing you'll see:
Wake up to WonderWidgets!
Note that the product name has been substituted for the entity reference.
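The substitution is easy to reproduce. The following sketch assumes Python's standard xml.etree.ElementTree parser (chosen only for illustration) and declares the two entities in the local subset without a SYSTEM identifier, so that no external DTD file is needed.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE slideshow [
  <!ENTITY product  "WonderWidget">
  <!ENTITY products "WonderWidgets">
]>
<slideshow title="&product; Slide Show">
  <slide type="all">
    <title>Wake up to &products;!</title>
  </slide>
</slideshow>"""

root = ET.fromstring(DOC)

# The references are expanded both in the attribute value and in the
# element content before the application sees them.
print(root.get("title"))
print(root.find("slide/title").text)
```

Changing the substitution strings in the two ENTITY definitions changes every place the entities are referenced, which is the point of defining them.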
To finish, save a copy of the file as slideSample06.xml.
Additional Useful Entities
Here are several other examples for entity definitions that you might find useful when you write an XML document:
<!ENTITY ldquo  "&#8220;"> <!-- Left Double Quote -->
<!ENTITY rdquo  "&#8221;"> <!-- Right Double Quote -->
<!ENTITY trade  "&#8482;"> <!-- Trademark Symbol (TM) -->
<!ENTITY rtrade "&#174;">  <!-- Registered Trademark (R) -->
<!ENTITY copyr  "&#169;">  <!-- Copyright Symbol -->
Referencing External Entities
You can also use the SYSTEM or PUBLIC identifier to name an entity that is defined in an external file. You'll do that now.
Note: The XML defined here is contained in slideSample07.xml and in copyright.xml. (The browsable versions are slideSample07-xml.html and copyright-xml.html.)
To reference an external entity, add the following highlighted text to the DOCTYPE statement in your XML file:
<!DOCTYPE slideshow SYSTEM "slideshow.dtd" [
  <!ENTITY product "WonderWidget">
  <!ENTITY products "WonderWidgets">
  <!ENTITY copyright SYSTEM "copyright.xml">
]>
This definition references a copyright message contained in a file named copyright.xml. Create that file and put some interesting text in it, perhaps something like this:
<!-- A SAMPLE copyright -->
This is the standard copyright message that our lawyers make us put everywhere so we don't have to shell out a million bucks every time someone spills hot coffee in their lap...
Finally, add the following highlighted text to your slideSample.xml file to reference the external entity, and save a copy of the file as slideSample07.xml:
<!-- TITLE SLIDE -->
  ...
</slide>
<!-- COPYRIGHT SLIDE -->
<slide type="all">
  <item>&copyright;</item>
</slide>
You could also use an external entity declaration to access a servlet that produces the current date, by giving the servlet's URL as the SYSTEM identifier in the entity definition. You would then reference that entity the same as any other entity.
When you echo the latest version of the slide presentation with an XML parser, here is what you'll see:
...
<slide type="all">
<item>
This is the standard copyright message that our lawyers make us put everywhere so we don't have to shell out a million bucks every time someone spills hot coffee in their lap...
</item>
</slide>
...
You'll notice that the newline that follows the comment in the file is echoed as a character, but that the comment itself is ignored. This newline is the reason that the copyright message appears to start on the next line after the <item> element instead of on the same line: the first character echoed is actually the newline that follows the comment.
Summarizing Entities
An entity that is referenced in the document content, whether internal or external, is termed a general entity. An entity that contains DTD specifications that are referenced from within the DTD is termed a parameter entity. (More on that later.)
An entity that contains XML (text and markup), and is therefore parsed, is known as a parsed entity. An entity that contains binary data (such as images) is known as an unparsed entity. (By its nature, it must be external.) In the next section, we discuss references to unparsed entities.
Referencing Binary Entities
This section discusses the options for referencing binary files such as image files and multimedia data files.
Using a MIME Data Type
There are two ways to reference an unparsed entity such as a binary image file. One is to use the DTD's NOTATION specification mechanism. However, that mechanism is a complex, unintuitive holdover that exists mostly for compatibility with SGML documents.
Note: SGML stands for Standard Generalized Markup Language. It was extremely powerful but so general that a program had to read the beginning of a document just to find out how to parse the remainder of it. Some very large document-management systems were built using it, but it was so large and complex that only the largest organizations managed to deal with it. XML, on the other hand, chose to remain small and simple--more like HTML than SGML--and, as a result, it has enjoyed rapid, widespread deployment. This story may well hold a moral for schema standards as well. Time will tell.
We will have occasion to discuss the subject in a bit more depth when we look at the DTDHandler API. For now, it is enough to say that the XML namespaces standard, together with the MIME data types defined for electronic messaging attachments, provides a much more useful, understandable, and extensible mechanism for referencing unparsed external entities.
Note: The XML described here is in slideshow1b.dtd. (The browsable version is slideshow1b-dtd.html.) It shows how binary references can be made, assuming that the application that will process the XML data knows how to handle such references.
To set up the slide show to use image files, add the image-related declarations shown here to your slideshow1b.dtd file (the image? entry in the slide content model, and the image element with its attribute list):
<!ELEMENT slide (image?, title, item*)>
<!ATTLIST slide
    type (tech | exec | all) #IMPLIED
>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* >
<!ELEMENT image EMPTY>
<!ATTLIST image
    alt   CDATA #IMPLIED
    src   CDATA #REQUIRED
    type  CDATA "image/gif"
>
These modifications declare image as an optional element in a slide, define it as an empty element, and define the attributes it requires. The image tag is patterned after the HTML 4.0 img tag, with the addition of an image type specifier, type. (The img tag is defined in the HTML 4.0 specification.)
The image tag's attributes are defined by the ATTLIST entry. The alt attribute, which defines alternative text to display in case the image can't be found, accepts character data (CDATA). It has an implied value, which means that it is optional and that the program processing the data knows enough to substitute something such as "Image not found." On the other hand, the src attribute, which names the image to display, is required.
The type attribute is intended for the specification of a MIME data type, as defined at http://www.iana.org/assignments/media-types/. It has a default value: image/gif.
Note: It is understood here that the character data (CDATA) used for the type attribute will be one of the MIME data types. The two most common formats are image/gif and image/jpeg. Given that fact, it might be nice to specify an attribute list here, using something like
type ("image/gif", "image/jpeg")
That won't work, however, because attribute lists are restricted to name tokens. The forward slash isn't part of the valid set of name-token characters, so this declaration fails. Also, creating an attribute list in the DTD would limit the valid MIME types to those defined today. Leaving it as CDATA leaves things more open-ended so that the declaration will continue to be valid as additional types are defined.
In the document, a reference to an image named "intro-pic" might look something like this:
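A minimal sketch of such a reference, assuming the hypothetical file path image/intro-pic.gif and placeholder alt text:

```xml
<!-- src is required; alt is optional; type may be omitted because the DTD defaults it to image/gif -->
<image src="image/intro-pic.gif" alt="Intro Pic" type="image/gif"/>
```

Because image is declared EMPTY, the element is written as a single self-closing tag with no content.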
The Alternative: Using Entity References
Using a MIME data type as an attribute of an element is a flexible and expandable mechanism. By contrast, to create an external ENTITY reference using the notation mechanism, you need DTD NOTATION declarations for JPEG and GIF data. Those can, of course, be obtained from a central repository. But then you need to write a separate ENTITY declaration for each image you intend to reference! In other words, adding a new image to your document always requires both a new entity definition in the DTD and a reference to it in the document. Given the anticipated ubiquity of the HTML 4.0 specification, the newer standard is to use the MIME data types and a declaration such as image, which assumes that the application knows how to process such elements.
Defining Parameter Entities and Conditional Sections
Just as a general entity lets you reuse XML data in multiple places, a parameter entity lets you reuse parts of a DTD in multiple places. In this section you'll see how to define and use parameter entities. You'll also see how to use parameter entities with conditional sections in a DTD.
Creating and Referencing a Parameter Entity
Recall that the existing version of the slide presentation cannot be validated because the document uses <em> tags, and they are not part of the DTD. In general, we'd like to use a variety of HTML-style tags in the text of a slide, and not just one or two, so using an existing DTD for XHTML makes more sense than defining such tags ourselves. A parameter entity is intended for exactly that kind of purpose.
Note: The DTD specifications shown here are contained in slideshow2.dtd and xhtml.dtd. The XML file that references them is slideSample08.xml. (The browsable versions are slideshow2-dtd.html, xhtml-dtd.html, and slideSample08-xml.html.)
Open your DTD file for the slide presentation and add the following text to define a parameter entity that references an external DTD file (the additions are the <!ENTITY % xhtml ...> definition and the %xhtml; reference):
<!ELEMENT slide (image?, title?, item*)>
<!ATTLIST slide
    ...
>
<!ENTITY % xhtml SYSTEM "xhtml.dtd">
%xhtml;
<!ELEMENT title ...
Here, you use an <!ENTITY> tag to define a parameter entity, just as for a general entity, but you use a somewhat different syntax. You include a percent sign (%) before the entity name when you define the entity, and you use the percent sign instead of an ampersand when you reference it.
Also, note that there are always two steps to using a parameter entity. The first is to define the entity name. The second is to reference the entity name, which actually does the work of including the external definitions in the current DTD. Because the uniform resource identifier (URI) for an external entity could contain slashes (/) or other characters that are not valid in an XML name, the definition step allows a valid XML name to be associated with an actual document. (This same technique is used in the definition of namespaces and anywhere else that XML constructs need to reference external documents.)
Notes:
- The DTD file referenced by this definition is xhtml.dtd. (The browsable version is xhtml-dtd.html.) You can either copy that file to your system or modify the SYSTEM identifier in the <!ENTITY> tag to point to the correct URL.
- This file is a small subset of the XHTML specification, loosely modeled after the Modularized XHTML draft, which aims at breaking up the DTD for XHTML into bite-sized chunks that can then be combined to create different XHTML subsets for different purposes. When work on the modularized XHTML draft has been completed, this version of the DTD should be replaced with something better. For now, this version will suffice for our purposes.
The point of using an XHTML-based DTD is to gain access to an entity it defines that covers HTML-style tags like <em> and <b>. Looking through xhtml.dtd reveals an entity named inline, which does exactly what we want. This entity is a simpler version of those defined in the Modularized XHTML draft. It defines the HTML-style tags we are most likely to want to use--emphasis, bold, and break--plus a couple of others for images and anchors that we may or may not use in a slide presentation.
To use the inline entity, change the element definitions in your DTD file, replacing the simple #PCDATA item with the inline entity. It is important to notice that #PCDATA is first in the inline entity and that inline is first wherever we use it. That sequence is required by XML's definition of a mixed-content model. To be in accord with that model, you also must add an asterisk at the end of the title definition.
Save the DTD as slideshow2.dtd for use when you experiment with parameter entities.
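The steps above can be sketched as follows. The entity definition is inferred from the #PCDATA|em|b|a|img|br content quoted in the note below, and the title and item declarations are an assumption about how the earlier definitions change, so treat this as an illustration rather than the exact contents of xhtml.dtd and slideshow2.dtd:

```xml
<!-- In xhtml.dtd (assumed form): #PCDATA comes first, as mixed content requires -->
<!ENTITY % inline "#PCDATA|em|b|a|img|br">

<!-- In slideshow2.dtd (assumed form): inline replaces #PCDATA and stays first -->
<!ELEMENT title (%inline;)*>
<!ELEMENT item (%inline; | item)* >
```

After the parameter entity is expanded, the title declaration reads (#PCDATA|em|b|a|img|br)*, which is why the trailing asterisk is needed.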
Note: The Modularized XHTML DTD defines both inline and Inline entities, and does so somewhat differently. Rather than specify #PCDATA|em|b|a|img|br, the definitions are more like (#PCDATA|em|b|a|img|br)*. Using one of those definitions, therefore, looks more like this:
<!ELEMENT title %Inline; >
Conditional Sections
Before we proceed with the next programming exercise, it is worth mentioning the use of parameter entities to control conditional sections. Although you cannot conditionalize the content of an XML document, you can define conditional sections in a DTD that become part of the DTD only if you specify INCLUDE. If you specify IGNORE, on the other hand, then the conditional section is not included.
Suppose, for example, that you wanted to use slightly different versions of a DTD, depending on whether you were treating the document as an XML document or as an SGML document. You can do that with DTD definitions such as the following:
someExternal.dtd:
<![ INCLUDE [
    ... XML-only definitions
]]>
<![ IGNORE [
    ... SGML-only definitions
]]>
... common definitions
The conditional sections are introduced by <![, followed by the INCLUDE or IGNORE keyword and another [. After that come the contents of the conditional section, followed by the terminator: ]]>. In this case, the XML definitions are included, and the SGML definitions are excluded. That's fine for XML documents, but you can't use the DTD for SGML documents. You could change the keywords, of course, but that only reverses the problem.
The solution is to use references to parameter entities in place of the INCLUDE and IGNORE keywords:
someExternal.dtd:
<![ %XML; [
    ... XML-only definitions
]]>
<![ %SGML; [
    ... SGML-only definitions
]]>
... common definitions
Then each document that uses the DTD can set up the appropriate entity definitions:
<!DOCTYPE foo SYSTEM "someExternal.dtd" [
    <!ENTITY % XML "INCLUDE" >
    <!ENTITY % SGML "IGNORE" >
]>
<foo>
    ...
</foo>
This procedure puts each document in control of the DTD. It also replaces the INCLUDE and IGNORE keywords with variable names that more accurately reflect the purpose of the conditional section, producing a more readable, self-documenting version of the DTD.
Resolving a Naming Conflict
The XML structures you have created thus far have actually encountered a small naming conflict. It seems that xhtml.dtd defines a title element that is entirely different from the title element defined in the slide-show DTD. Because there is no hierarchy in the DTD, these two definitions conflict.
Note: The Modularized XHTML DTD also defines a title element that is intended to be the document title, so we can't avoid the conflict by changing xhtml.dtd. The problem would only come back to haunt us later.
You can use XML namespaces to resolve the conflict. You'll take a look at that approach in the next section. Alternatively, you can use one of the more hierarchical schema proposals described in Schema Standards. The simplest way to solve the problem for now is to rename the title element in slideshow.dtd.
Note: The XML shown here is contained in slideshow3.dtd and slideSample09.xml, which references copyright.xml and xhtml.dtd. (The browsable versions are slideshow3-dtd.html, slideSample09-xml.html, copyright-xml.html, and xhtml-dtd.html.)
To keep the two title elements separate, you'll create a hyphenation hierarchy. Make the following changes to rename the title element in slideshow.dtd to slide-title:
<!ELEMENT slide (image?, slide-title?, item*)>
<!ATTLIST slide
    type (tech | exec | all) #IMPLIED
>
<!-- Defines the %inline; declaration -->
<!ENTITY % xhtml SYSTEM "xhtml.dtd">
%xhtml;
<!ELEMENT slide-title (%inline;)*>
Save this DTD as slideshow3.dtd.
The next step is to modify the XML file to use the new element name. To do that, make the following changes:
...
<slide type="all">
    <slide-title>Wake up to ... </slide-title>
</slide>
...
<!-- OVERVIEW -->
<slide type="all">
    <slide-title>Overview</slide-title>
    <item>...
Save a copy of this file as slideSample09.xml.
Using Namespaces
As you saw earlier, one way or another it is necessary to resolve the conflict between the title element defined in slideshow.dtd and the one defined in xhtml.dtd when the same name is used for different purposes. In the preceding exercise, you hyphenated the name in order to put it into a different namespace. In this section, you'll see how to use the XML namespace standard to do the same thing without renaming the element.
The primary goal of the namespace specification is to let the document author tell the parser which DTD or schema to use when parsing a given element. The parser can then consult the appropriate DTD or schema for an element definition. Of course, it is also important to keep the parser from aborting when a "duplicate" definition is found and yet still generate an error if the document references an element such as title without qualifying it (identifying the DTD or schema to use for the definition).
Note: Namespaces apply to attributes as well as to elements. In this section, we consider only elements. For more information on attributes, consult the namespace specification at http://www.w3.org/TR/REC-xml-names/.
Defining a Namespace in a DTD
In a DTD, you define a namespace that an element belongs to by adding an attribute to the element's definition, where the attribute name is xmlns ("xml namespace"). For example, you can do that in slideshow.dtd by adding an entry such as the following in the title element's attribute-list definition:
<!ELEMENT title (%inline;)*>
<!ATTLIST title
    xmlns CDATA #FIXED "http://www.example.com/slideshow"
>
Declaring the attribute as FIXED has several important effects:
- It prevents the document from specifying any nonmatching value for the xmlns attribute.
- The element defined in this DTD is made unique (because the parser understands the xmlns attribute), so it does not conflict with an element that has the same name in another DTD. That allows multiple DTDs to use the same element name without generating a parser error.
- When a document specifies the xmlns attribute for a tag, the document selects the element definition that has a matching attribute.
To be thorough, every element name in your DTD would get exactly the same attribute, with the same value. (Here, though, we're concerned only about the title element.) Note, too, that you are using a CDATA string to supply the URI. In this case, we've specified a URL. But you could also specify a uniform resource name (URN), possibly by specifying a prefix such as urn: instead of http:. (URNs are currently being researched. They're not seeing a lot of action at the moment, but that could change in the future.)
Referencing a Namespace
When a document uses an element name that exists in only one of the DTDs or schemas it references, the name does not need to be qualified. But when an element name that has multiple definitions is used, some sort of qualification is a necessity.
Note: In fact, an element name is always qualified by its default namespace, as defined by the name of the DTD file it resides in. As long as there is only one definition for the name, the qualification is implicit.
You qualify a reference to an element name by specifying the xmlns attribute on that element. The specified namespace applies to that element and to any elements contained within it.
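A minimal sketch of such a qualified reference, assuming the namespace URI http://www.example.com/slideshow used in the DTD example earlier in this section:

```xml
<!-- The xmlns attribute places title, and anything nested inside it, in the slideshow namespace -->
<title xmlns="http://www.example.com/slideshow">
  Overview
</title>
```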
Defining a Namespace Prefix
When you need only one namespace reference, it's not a big deal. But when you need to make the same reference several times, adding xmlns attributes becomes unwieldy. It also makes it harder to change the name of the namespace later.
The alternative is to define a namespace prefix, which is as simple as specifying xmlns, a colon (:), and the prefix name before the attribute value, as in xmlns:SL. A definition of this kind sets up SL as a prefix that can be used to qualify the current element name and any element within it. Because the prefix can be used on any of the contained elements, it makes the most sense to define it on the XML document's root element, as in the slideshow listing below.
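As a sketch of the syntax just described, assuming the same example namespace URI, a prefix defined directly on a single element might look like this:

```xml
<!-- xmlns:SL binds the prefix SL; the start tag and the end tag both carry it -->
<SL:title xmlns:SL="http://www.example.com/slideshow">
  Overview
</SL:title>
```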
Note: The namespace URI can contain characters that are not valid in an XML name, so it cannot be used directly as a prefix. The prefix definition associates an XML name with the URI, and that allows the prefix name to be used instead. It also makes it easier to change references to the URI in the future.
When the prefix is used to qualify an element name, the end tag also includes the prefix, as shown here:
<SL:slideshow xmlns:SL='http://www.example.com/slideshow' ...>
    ...
    <slide>
        <SL:title>Overview</SL:title>
    </slide>
    ...
</SL:slideshow>
Finally, note that multiple prefixes can be defined in the same element, for example an SL prefix and an xhtml prefix on the root element. With this kind of arrangement, all the prefix definitions are together in one place, and you can use them anywhere they are needed in the document. Such a definition could also use a URN instead of a URL to define the xhtml prefix. That definition would conceivably allow the application to reference a local copy of the XHTML DTD or some mirrored version, with a potentially beneficial impact on performance.
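A sketch of a root element that defines two prefixes at once. The URN urn:example:xhtml is a hypothetical placeholder, not a registered name, and the slide content is invented for illustration:

```xml
<!-- Both prefixes are declared once, on the root, and are usable on any nested element -->
<SL:slideshow xmlns:SL="http://www.example.com/slideshow"
              xmlns:xhtml="urn:example:xhtml">
  <slide>
    <SL:title>Overview</SL:title>
    <item>A point with <xhtml:em>emphasis</xhtml:em></item>
  </slide>
</SL:slideshow>
```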
All of the material in The J2EE(TM) 1.4 Tutorial is copyright-protected and may not be published in other works without express written permission from Sun Microsystems.