The HTML Structure and Collections

This section focuses on how the document's collections are constructed while the document is parsed. HTML documents are supposed to satisfy the rules defined by the HTML DTD (document type definition). The HTML object model relies on these rules and some real-world exceptions to ensure that the document's structure is properly maintained. This section introduces the relationship between the DTD and the collections exposed on the document.

**Building the all Collection**

The all collection of elements correlates directly to the HTML document's tree. The following simple HTML document demonstrates this relationship:

<HTML>
   <HEAD>
      <TITLE>My Document</TITLE>
   </HEAD>
   <BODY>
      <H1>Welcome to My Page</H1>
      <P>This is an <STRONG>important document.</STRONG></P>
   </BODY>
</HTML>

Figure 7-1 displays the containment relationships between the elements in this document.

Figure 7-1. Containment relationships between the elements in an HTML document.

The document's all collection, which contains every element in the document, represents this tree; it contains the elements in the tree in the order found in the source code. The parser creates the all collection by performing an operation known as a preorder traversal of the tree. In this example, the contents and order of the all collection are initially as follows: HTML, Head, Title, Body, H1, Paragraph, Strong.
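
As a rough illustration of the preorder traversal described above, the following Python sketch models the sample document as nested tuples and flattens it into a list. The tuple representation and the preorder helper are assumptions made for illustration; they are not part of the object model.

```python
# A toy model of the sample document's tree: (name, [children]).
# The names follow the element names used in the text.
TREE = ("HTML", [
    ("Head", [("Title", [])]),
    ("Body", [
        ("H1", []),
        ("Paragraph", [("Strong", [])]),
    ]),
])


def preorder(node):
    """Return element names in preorder: a node first, then its children in source order."""
    name, children = node
    result = [name]
    for child in children:
        result.extend(preorder(child))
    return result


all_collection = preorder(TREE)
print(all_collection)
```

Because each node is visited before its children, the resulting list follows the order in which the begin tags appear in the source.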

The all collection always represents the current state of the document. You can change the elements in the all collection by dynamically manipulating the document's contents, but the all collection always maintains the order of the elements, even when scripts modify the contents. Dynamic contents manipulation is discussed in detail in Chapter 13, "Dynamic Contents."

Scope of Influence

The HTML tree contains information not immediately apparent from the all collection—namely, the scope of each element. The scope of an element is the set of elements it contains. For example, in the preceding document the Paragraph element contains the Strong element, so the Strong element is within the scope of the Paragraph element. You can determine the scope of an element by analyzing the parentElement and children properties of each element in the all collection, a process described in Chapter 8, "Scripts and Elements."
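
The idea of recovering scope from parent relationships can be sketched in Python. This is a minimal model, not the object model itself: the flat list with parent indexes stands in for the all collection and each element's parentElement property, and the sketch assumes that scope includes indirectly contained elements.

```python
# Hypothetical flat model of the all collection for the sample document:
# each entry is (element name, index of its parent in the list, or None).
ALL = [
    ("HTML", None),
    ("Head", 0),
    ("Title", 1),
    ("Body", 0),
    ("H1", 3),
    ("Paragraph", 3),
    ("Strong", 5),
]


def is_within_scope(all_items, child_index, ancestor_index):
    """Walk the parent chain from child_index; True if ancestor_index is reached."""
    parent = all_items[child_index][1]
    while parent is not None:
        if parent == ancestor_index:
            return True
        parent = all_items[parent][1]
    return False


def scope_of(all_items, index):
    """Names of every element contained, directly or indirectly, by the element at index."""
    return [name for i, (name, _) in enumerate(all_items)
            if is_within_scope(all_items, i, index)]


print(scope_of(ALL, 5))  # scope of the Paragraph element
print(scope_of(ALL, 3))  # scope of the Body element
```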

NOTE: Like all the element collections, the all collection represents each element as a single object. Elements, rather than individual begin and end tags, are sufficient for manipulating the document's structure, and representing elements instead of tags makes the collections simpler to understand and work with. The exception to this rule is unrecognized tags: any unrecognized tag, whether a begin or an end tag, is added to the collection as its own entry. Unrecognized tags have no scope of influence over any children. This limitation is discussed in greater detail in the section "Unrecognized Elements" later in this chapter.

Implied Elements

The DTD for HTML specifies that tags for the HTML, Head, Body, and TBody elements are optional in the HTML document because these elements can be inferred from the content, as shown here:

<TITLE>Welcome to My Document</TITLE>
<H1>Welcome to My Page</H1>
<P>This is an <STRONG>important document.</STRONG></P>

This document is structurally equivalent to the preceding document. Although the title text differs, the trees, and therefore the contents of the all collections, are the same. The all collection always exposes the HTML, Head, and Body elements for every document, regardless of whether you explicitly authored them.

Differentiating the Head from the Body

In a document without <HEAD> and <BODY> tags, the split between head and body is determined by the rules of HTML as defined by the DTD. The set of elements that the Head element can contain is different from the set that the Body element can contain. Therefore, when the first element that belongs in the body (for example, H1) is encountered, the scope automatically changes from the head to the body.

The following code fragment represents the DTD for the head of the document. By examining this DTD, you can more clearly see the distinction between head and body:

<!ENTITY % head.misc "SCRIPT|STYLE|META|LINK"
   -- repeatable head elements -->
<!ENTITY % head.content "TITLE & ISINDEX? & BASE?">
<!ELEMENT HEAD O O  (%head.content) +(%head.misc)>

This code shows that there can be at most one IsIndex element and one Base element, that there must be exactly one Title element, and that there can be any number of the elements specified by the head.misc entity. With two exceptions, the Style and Script elements, the elements allowed in the head do not overlap with the elements allowed in the body. Therefore, it is quite easy for a parser to determine when the scope has switched from the head to the body.
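
The content model in the DTD fragment can be expressed as a small check. The following Python sketch is illustrative only: the check_head function and its message format are invented here, and the rules it encodes are limited to those stated in the fragment.

```python
from collections import Counter

# Content model taken from the DTD fragment above.
HEAD_MISC = {"SCRIPT", "STYLE", "META", "LINK"}  # repeatable head elements
HEAD_CONTENT = {"TITLE": (1, 1), "ISINDEX": (0, 1), "BASE": (0, 1)}  # (min, max)


def check_head(children):
    """Return a list of violations for a list of element names found in the head."""
    counts = Counter(name.upper() for name in children)
    problems = []
    for name, (low, high) in HEAD_CONTENT.items():
        if not low <= counts[name] <= high:
            problems.append(f"{name}: expected {low}..{high}, found {counts[name]}")
    for name in counts:
        if name not in HEAD_CONTENT and name not in HEAD_MISC:
            problems.append(f"{name}: not allowed in HEAD")
    return problems


print(check_head(["TITLE", "META", "META", "SCRIPT"]))
print(check_head(["META"]))
```

A head with one Title and any number of head.misc elements passes, while a head with no Title is reported as a violation.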

The Style and Script elements are ambiguous cases because they can exist in both the head and the body. If a Style or Script element is encountered before any body contents, the element is considered contents of the head. This rule has no impact on the rendering or behavior of the document, but it is important to understand because it affects the scope of influence of the Head and Body elements.
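
The head and body split described above might be sketched as follows. This is a simplified model under stated assumptions: the split_head_body function is hypothetical, it looks only at top-level element names, and it treats everything after the first body element as body contents, which may not match how a particular browser handles a stray head element.

```python
HEAD_ONLY = {"TITLE", "ISINDEX", "BASE", "META", "LINK"}
AMBIGUOUS = {"SCRIPT", "STYLE"}  # allowed in both the head and the body


def split_head_body(top_level_names):
    """Assign each top-level element name to the head or the body."""
    head, body = [], []
    in_body = False
    for name in top_level_names:
        key = name.upper()
        if not in_body and (key in HEAD_ONLY or key in AMBIGUOUS):
            # Still before any body contents, so this element belongs to the head.
            head.append(name)
        else:
            # The first element that cannot live in the head switches the scope.
            in_body = True
            body.append(name)
    return head, body


print(split_head_body(["TITLE", "SCRIPT", "H1", "SCRIPT", "P"]))
```

In this run the first Script element lands in the head and the second in the body, because the H1 element has already switched the scope.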

Optional End Tags

A few elements in HTML do not require an end tag. For example, a <P> tag does not require a </P> to end its scope of influence. To determine when a Paragraph or other element ends, the parser uses the DTD. When it encounters an element that cannot be contained within the current scope, it considers the prior scope to be terminated. As shown in the following example, if a <P> tag is followed by an <H2> tag, the Paragraph element ends with the <H2> tag because an H2 element cannot be a child of a Paragraph element.

<HTML>
   <H1>Scott's Home Page</H1>
   <P>Welcome to my page.<H2>New Cool Stuff</H2>
</HTML>

The tree in Figure 7-2 represents this HTML document. Notice that the H2 element is a child element of the Body element, not the Paragraph element.

Figure 7-2. Tree diagram of a document with an implied end tag.

In general, documents are more readable and maintainable when end tags are explicitly defined. Without end tags, anyone viewing the source must have knowledge of the HTML DTD to ascertain the relationship between various elements.
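
The implied end tag rule can be sketched with Python's standard html.parser module, which only tokenizes markup and reports tag names in lowercase. The TreeBuilder class and the CLOSED_BY table below are assumptions made for illustration; a real parser would derive the containment rules from the full DTD. The <BODY> tag is written explicitly because this sketch does not infer implied elements.

```python
from html.parser import HTMLParser

# Simplifying assumption: a tiny table of elements that a Paragraph cannot contain.
BLOCK = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "ul", "ol", "table"}
CLOSED_BY = {"p": BLOCK}


class TreeBuilder(HTMLParser):
    """Records (tag, parent tag) pairs, ending a <P> scope when the table requires it."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements
        self.elements = []  # (tag, parent tag or None) in source order

    def handle_starttag(self, tag, attrs):
        # Implied end tag: close open elements that cannot contain the new one.
        while self.stack and tag in CLOSED_BY.get(self.stack[-1], ()):
            self.stack.pop()
        parent = self.stack[-1] if self.stack else None
        self.elements.append((tag, parent))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close up to and including the matching open element, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


builder = TreeBuilder()
builder.feed("<HTML><BODY><H1>Scott's Home Page</H1>"
             "<P>Welcome to my page.<H2>New Cool Stuff</H2></BODY></HTML>")
builder.close()
print(builder.elements)
```

In the recorded pairs, the parent of h2 is body rather than p, which matches the tree described for this document.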

Unrecognized Elements

Parsing of unrecognized elements in the HTML document is an important consideration as HTML and browsers evolve. Imagine the introduction of an <H7> tag. New browsers will understand how to interpret <H7> as a block container tag, but down-level browsers will not recognize it. In accordance with the rules of HTML, this hypothetical down-level browser ignores the <H7> begin and end tags when it renders the document, because no DTD information is available to determine the rules and scope of an unrecognized tag.

Because there is no DTD for unrecognized elements, the unrecognized begin and end tags are exposed in the all collection. Although the rules of HTML specify that unrecognized elements should be ignored, the object model includes unrecognized tags in order to provide the developer with complete information about the document.

The unrecognized end tag is also exposed because there is no way to accurately determine whether the element is a container. Even if a begin and end tag appear in sequence in the document, there are no assurances that their use would be in conformance with a DTD rule if one did exist for the element. For example, the element might not be defined in the DTD as a container element. Therefore, both unrecognized begin and end tags are always exposed as leaf nodes in the tree:

<HTML>
   <P>Welcome to my <FOO><B>cool</B> document.</FOO></P>
</HTML>

The tree in Figure 7-3 demonstrates how the internal parser represents this document with unrecognized elements.

Figure 7-3. Tree diagram of a document with unrecognized begin and end <FOO> tags.

The all collection in the preceding example contains the following elements: HTML, Head, Body, Paragraph, <FOO>, Bold, </FOO>. Notice that the Bold element is not considered a child of the Foo element, but rather a child of the Paragraph element. Because DTD information about Foo is unavailable, there is no way to reliably determine whether the Foo element is a container. For unrecognized elements, exposing both the begin and end tags allows the developer to calculate the scope of the element by manually walking through the all collection.
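
Walking the collection to calculate the scope of an unrecognized element might look like the following Python sketch. The list of strings stands in for the all collection, and the unrecognized_scope function is a hypothetical helper. The result is only an approximation of scope: it lists the entries whose begin tags fall between the unrecognized begin tag and its matching end tag.

```python
# The all collection from the example above, modeled as strings.
# Unrecognized begin and end tags appear as separate entries.
ALL = ["HTML", "Head", "Body", "Paragraph", "<FOO>", "Bold", "</FOO>"]


def unrecognized_scope(all_items, name):
    """Return the entries between the first <name> and its matching </name>.

    Returns None if there is no begin tag or the begin tag is never closed.
    """
    begin, end = f"<{name}>", f"</{name}>"
    if begin not in all_items:
        return None
    start = all_items.index(begin)
    depth = 0
    for i in range(start, len(all_items)):
        if all_items[i] == begin:
            depth += 1  # a nested begin tag of the same name
        elif all_items[i] == end:
            depth -= 1
            if depth == 0:
                return all_items[start + 1:i]
    return None


print(unrecognized_scope(ALL, "FOO"))
```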

If in a future version of Internet Explorer Foo becomes a valid HTML element that can contain text, the document's tree will change. Figure 7-4 demonstrates this new tree.

Figure 7-4. Tree diagram that would result if the browser recognized <FOO> tags.

While the ordering will be consistent across implementations, the number of elements and the document's tree may vary depending on whether the Foo element is supported. This difference might cause problems if your code relies on ordinal positions of elements in the collection because the number of elements exposed can change from browser to browser. Instead, code that accesses a specific element should always use an ID or identify the element in a more explicit context.
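
To see why ordinal positions are fragile, consider the following Python sketch. Both lists are hypothetical models of the all collection for the same document, and the trailing EM element with the ID "note" is an invented addition to the example so that an element follows the Foo tags.

```python
# Two hypothetical all collections for the same document, modeled as (tag, id) pairs.
# The trailing EM element with the ID "note" is invented for illustration.
DOWN_LEVEL = [("HTML", None), ("Head", None), ("Body", None), ("Paragraph", None),
              ("<FOO>", None), ("Bold", None), ("</FOO>", None), ("EM", "note")]
FOO_AWARE = [("HTML", None), ("Head", None), ("Body", None), ("Paragraph", None),
             ("FOO", None), ("Bold", None), ("EM", "note")]


def index_of_id(all_items, element_id):
    """Return the ordinal position of the element with the given ID, or -1."""
    for i, (_, item_id) in enumerate(all_items):
        if item_id == element_id:
            return i
    return -1


print(index_of_id(DOWN_LEVEL, "note"), index_of_id(FOO_AWARE, "note"))
```

The same element sits at position 7 in one collection and position 6 in the other, so a hard-coded index would select different entries, while a lookup by ID finds the EM element in both.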

All unrecognized end tags are also exposed in the object model because the object model makes no attempt to associate invalid begin and end tags and accepts them into the collection as specified in the document. Therefore, if a </BAR> end tag is floating in the middle of the document, it will be represented in the all collection, even if no <BAR> begin tag was ever encountered.

From the point of view of the DTD, all unrecognized begin and end tags are considered to have no contents. Any attributes and style sheet information found on an unrecognized tag will have no effect on the document's rendering but will be represented in the object model.

Unmatched End Tags

When an unmatched end tag that is recognized by the parser is encountered, HTML specifies that the end tag should be ignored. However, as with unrecognized tags, unmatched end tags are exposed in the all collection. In the following example, the </B> end tag is exposed in the object model:

<HTML>
   This is not bold.</B>
</HTML>

The end tag is exposed because the object model attempts to maintain an accurate representation of the document.
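
Detecting unmatched end tags can be sketched with Python's standard html.parser module, which reports tag names in lowercase. The UnmatchedEndTags class and the find_unmatched helper are assumptions made for illustration and do not reflect how any browser implements its parser.

```python
from html.parser import HTMLParser


class UnmatchedEndTags(HTMLParser):
    """Collects end tags that have no open begin tag at the point they appear."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.unmatched = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Remove the most recently opened matching tag.
            last = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[last]
        else:
            self.unmatched.append(tag)


def find_unmatched(markup):
    """Return the names of end tags in markup that were never opened."""
    parser = UnmatchedEndTags()
    parser.feed(markup)
    parser.close()
    return parser.unmatched


print(find_unmatched("<HTML>This is not bold.</B></HTML>"))
```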

Overlapping Elements

Overlapping elements occur when the document does not follow a strict containment hierarchy, so that one element ends inside another element that began within it. The following example demonstrates an overlap of Strong and EM elements:

<HTML>
   <BODY>
      <P>This is a <STRONG>demonstration of
         <EM>overlapping</STRONG> elements.</EM></P>
   </BODY>
</HTML>

Even though elements overlap, they do not affect the composition or ordering of the all collection. The all collection consists of the following elements in this order: HTML, Head, Body, Paragraph, Strong, EM. The tree for this document, shown in Figure 7-5, does not represent the overlapping of elements or the true scope of influence for each element.

Overlapping elements are actually invalid HTML. To achieve the desired behavior without using overlapping tags, you should create the document with a clean containership hierarchy:

<HTML>
   <BODY>
      <P>This is a <STRONG>demonstration of
         <EM>overlapping</EM></STRONG><EM> elements.</EM>
      </P>
   </BODY>
</HTML>

Overlapping elements have little effect on most collections, but an element's children collection may be inaccurate. The relationship between overlapping elements and the document's contents is a strong one, and is discussed in Chapter 13, "Dynamic Contents."

Figure 7-5. Tree diagram of a document with overlapping elements.
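
One way to find overlapping elements in a document is to track open tags on a stack, as in the following Python sketch built on the standard html.parser module. The OverlapFinder class and the find_overlaps helper are hypothetical, and the sketch assumes every element has an explicit end tag; elements with omitted end tags would be reported as false overlaps.

```python
from html.parser import HTMLParser


class OverlapFinder(HTMLParser):
    """Records (outer, inner) pairs where an end tag closes an element that is not innermost."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.overlaps = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # unmatched end tag; not an overlap
        position = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
        # Every element opened after `tag` is still open, so it overlaps `tag`.
        for inner in self.open_tags[position + 1:]:
            self.overlaps.append((tag, inner))
        del self.open_tags[position]


def find_overlaps(markup):
    """Return (outer, inner) tag-name pairs for each overlap found in markup."""
    finder = OverlapFinder()
    finder.feed(markup)
    finder.close()
    return finder.overlaps


overlapping = ("<P>This is a <STRONG>demonstration of "
               "<EM>overlapping</STRONG> elements.</EM></P>")
clean = ("<P>This is a <STRONG>demonstration of "
         "<EM>overlapping</EM></STRONG><EM> elements.</EM></P>")
print(find_overlaps(overlapping))
print(find_overlaps(clean))
```

The overlapping markup yields one (strong, em) pair, and the version with a clean hierarchy yields none.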

Tagless Contents

Tagless contents—text that is not contained within any element—often occurs within the body:

<HTML>
   <BODY>
      These contents are without a tag.
      <P>These contents are within a Paragraph element.</P>
      These contents follow a Paragraph element without a tag.
   </BODY>
</HTML>

This HTML document would have only HTML, Head, Body, and Paragraph elements in its all collection. There is no element that represents text outside of containers. In strict HTML, this text is defined to be within a Paragraph element. However, a <P> tag cannot be synthesized in this case because explicitly defined paragraphs have a slightly different rendering scheme from implicit paragraphs.
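
The difference between elements and tagless contents can be made visible with a short Python sketch based on the standard html.parser module. The TaglessText class and the find_tagless helper are assumptions made for illustration. Because this sketch does not infer implied elements, the Head element does not appear in its output, and tag names are reported in lowercase.

```python
from html.parser import HTMLParser


class TaglessText(HTMLParser):
    """Collects element names and any text whose innermost open element is the body."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.elements = []
        self.tagless = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        self.elements.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self.open_tags and self.open_tags[-1] == "body":
            self.tagless.append(text)


def find_tagless(markup):
    """Return (element names, tagless text runs) for the given markup."""
    parser = TaglessText()
    parser.feed(markup)
    parser.close()
    return parser.elements, parser.tagless


SAMPLE = """<HTML>
   <BODY>
      These contents are without a tag.
      <P>These contents are within a Paragraph element.</P>
      These contents follow a Paragraph element without a tag.
   </BODY>
</HTML>"""
elements, tagless = find_tagless(SAMPLE)
print(elements)
print(tagless)
```

Both runs of tagless text are found, yet neither adds an entry to the list of elements.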

Invalid HTML

Dynamic HTML is designed to work with valid HTML. Therefore, tags that are placed outside of their proper scope are usually parsed as unrecognized elements. This rule is not fixed, however, and in some cases the HTML may be cleaned up automatically during parsing. For example, imagine the following invalid definition of a table:

<HTML>
   <BODY>
      <TD>This is a table cell outside of a table.</TD>
   </BODY>
</HTML>

In this document, a table cell appears where it doesn't belong—namely, outside the scope of a table. When the document is parsed, the table cell is not recognized and is parsed as an unrecognized element. Therefore, both the begin and end tags are considered invalid by the parser. The all collection exposes the elements of this HTML document in the following order: HTML, Head, Body, <TD>, </TD>.

You should not write documents that rely on this behavior. Browsers may choose to clean up the HTML, or they may choose not to do any cleanup and simply ignore the invalidly scoped elements. The only way to ensure that the element collection is built consistently is to create valid HTML documents.

There are a couple of known exceptions for which the document's tree will not conform to the HTML DTD. These exceptions exist because they appear in a large number of documents on the Internet. The exceptions discussed here are by no means the only exceptions, but they are ones that occur commonly in HTML documents.

Lists

Lists are one of the few areas in which the HTML is not cleaned up by the parser. To ensure compatibility, the object model recognizes two cases of invalid HTML as valid markup:

LI elements can exist outside of UL and OL list containers.
A list container can directly contain other list containers.

The first exception was allowed in Netscape Navigator 2.0 for creating bulleted items that are not indented; the second exception came about through the common, illegal practice of nesting lists.

When the first exception occurs, Netscape Navigator 2.0, whose implementation was followed by Internet Explorer 3.0, renders the list item without indenting it. Even though the DTD for LI elements prohibits them from existing outside of lists, the DTD used to create the tree is lax and will not automatically wrap these LI elements in a list container, as the following document shows:

<HTML>
   <BODY>
      <LI>This is an LI element outside of a list.</LI>
   </BODY>
</HTML>

The all collection for this document is ordered as follows: HTML, Head, Body, LI.

The second exception, in which nested lists are used solely to increase the amount of indentation for bullets, is shown here:

<HTML>
   <BODY>
      <UL>
         <UL>
            <LI>This is a deeply indented bulleted list item.</LI>
         </UL>
      </UL>
   </BODY>
</HTML>

This HTML violates the DTD because UL elements can only contain LI elements, not other ULs. When this situation is encountered, no cleanup occurs. The ordering for the all collection for this document is as follows: HTML, Head, Body, UL, UL, LI.
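
Although the parser accepts both list cases without cleanup, a script can still flag them. The following Python sketch checks a toy tree of nested tuples for the two violations described in this section. The tree representation and the list_violations function are assumptions made for illustration, and only UL and OL are treated as list containers.

```python
# Toy tree: (name, [children]). Only the two rules discussed in the text are modeled.
LIST_CONTAINERS = {"UL", "OL"}


def list_violations(node, parent=None):
    """Return DTD violations for the two list cases described in the text."""
    name, children = node
    found = []
    if name == "LI" and parent not in LIST_CONTAINERS:
        found.append(f"LI outside a list container (parent: {parent})")
    if name in LIST_CONTAINERS and parent in LIST_CONTAINERS:
        found.append(f"{name} directly inside {parent}")
    for child in children:
        found.extend(list_violations(child, name))
    return found


STRAY_LI = ("HTML", [("Head", []), ("Body", [("LI", [])])])
NESTED_UL = ("HTML", [("Head", []), ("Body", [("UL", [("UL", [("LI", [])])])])])
print(list_violations(STRAY_LI))
print(list_violations(NESTED_UL))
```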

Form Elements in Tables

Another common practice is to use forms in tables (outside of cells) to create a form that spans multiple rows or cells, as shown here:

<HTML>
   <HEAD>
      <TITLE>Forms in Tables</TITLE>
   </HEAD>
   <BODY>
      <TABLE>
         <FORM NAME="Form1">
            <TR><TD>Form1-related fields</TD></TR>
            <TR><TD>More Form1-related fields</TD></TR>
         </FORM>
         <FORM NAME="Form2">
            <TR><TD>Form2-related fields</TD></TR>
         </FORM>
      </TABLE>
   </BODY>
</HTML>

In this document, the forms will be maintained with the correct scope inside the table.

The tree for this document is represented by Figure 7-6.

Figure 7-6. Tree diagram for a document with Form elements inside a Table element.

There are probably other exceptions to the DTD. In general, invalid HTML may result in an unpredictable tree that may not be consistent in each browser release. Therefore, you should be careful to write HTML that corresponds to the DTD. Doing so not only makes the object model more consistent, but it also improves the likelihood that different browsers will render the document the same way.