Flexibility of XML versus HTML

Text, HTML, and XML
Web pages are most often created using HTML (HyperText Markup Language). A Web page can, however, consist of just plain text, or a combination of HTML and pre-formatted text.

A great deal of information is lost in the process of converting data into HTML:


<H1>New Millennium Software Company</H1>
144 West Villa Theresa Dr.<BR>
Phoenix, AZ 85023<BR>
Telephone: 602-368-8141

XML (eXtensible Markup Language) is a more structured language for the representation of data on the Web:


<NAME>New Millennium Software Company</NAME>
<STREET>144 West Villa Theresa</STREET>


Parsing is the process of reading a Text, HTML, or XML Web page to discover the structure of the document. The result of parsing is an object-based representation of the elements of a Web page.

Objects correspond to the HTML or XML elements that are discovered during the parsing. After parsing, arrays of like objects can be accessed by either zero-based indexing or wildcard matching.
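
The idea of parsing into objects and then indexing them can be sketched in Python with the standard xml.etree.ElementTree module. This is only an illustration, not the object model the text describes: the <COMPANY> wrapper element and the function name parse_company are assumptions added here so that the XML fragment has a single root and can be parsed.

```python
import xml.etree.ElementTree as ET

# The fragment in the text has no single root element, so a hypothetical
# <COMPANY> wrapper is added to make it parseable.
COMPANY_XML = """
<COMPANY>
  <NAME>New Millennium Software Company</NAME>
  <STREET>144 West Villa Theresa</STREET>
</COMPANY>
"""

def parse_company(xml_text):
    """Parse the document and group its child elements by tag name."""
    root = ET.fromstring(xml_text.strip())
    elements = {}
    for child in root:
        elements.setdefault(child.tag, []).append(child)
    return elements

company = parse_company(COMPANY_XML)
name = company["NAME"][0].text      # zero-based index into the NAME objects
street = company["STREET"][0].text  # zero-based index into the STREET objects
print(name)
print(street)
```

Because each element carries its own name, the lookup is by tag rather than by position in the page, which is the structural information the HTML version loses.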

For the HTML example above, object references can be used to access the company address and phone number:



For the XML example above, the object references are more straightforward:





Pattern Matching
Pattern matching is an alternative way of collecting arrays of objects from a parsed document.

In pattern matching, the elements of an HTML or XML document are viewed as a stream of tokens. For example, given the following HTML:


<LI> Games for only: <B>$19.95</B>
<LI> ART: $100.95
<LI> Novel for only: <B>$19.95</B>

A pattern of "LI B" would collect every instance where the <LI> and <B> tokens occur in the specified order; in this case, only the lines for Games and Novel would be collected.
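
A minimal sketch of the "LI B" pattern in Python, using the standard html.parser module, is shown here. It is one possible reading of the token-stream idea, not the matcher the text refers to: the class TokenCollector and the function match_li_b are hypothetical names, and the rule "an <LI> item matches if a <B> tag appears before the next <LI>" is an assumption about how the pattern is applied.

```python
from html.parser import HTMLParser

SAMPLE_HTML = (
    "<LI> Games for only: <B>$19.95</B>\n"
    "<LI> ART: $100.95\n"
    "<LI> Novel for only: <B>$19.95</B>"
)

class TokenCollector(HTMLParser):
    """Flatten an HTML fragment into a stream of (kind, value) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("tag", tag.upper()))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("text", text))

def match_li_b(html):
    """Return the text of each <LI> item that contains a <B> tag."""
    parser = TokenCollector()
    parser.feed(html)
    parser.close()

    matches = []
    current = None   # text tokens of the <LI> item being read
    has_b = False    # whether a <B> tag was seen inside that item
    for kind, value in parser.tokens:
        if kind == "tag" and value == "LI":
            if current is not None and has_b:
                matches.append(" ".join(current))
            current, has_b = [], False
        elif kind == "tag" and value == "B" and current is not None:
            has_b = True
        elif kind == "text" and current is not None:
            current.append(value)
    if current is not None and has_b:
        matches.append(" ".join(current))
    return matches

matches = match_li_b(SAMPLE_HTML)
for line in matches:
    print(line)
```

The ART line is skipped because no <B> token follows its <LI> token before the next item begins.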




HTTP (HyperText Transfer Protocol) is the protocol used to transfer documents over the Web. HTTPS, its secure variant, requires SSL (Secure Sockets Layer) libraries, available as a separate product from webMethods.

Web Automation applications behave exactly like Web browsers in their use of HTTP/HTTPS to submit requests to Web servers. In fact, Web servers cannot distinguish a Web Automation application from a Web browser.


Web Servers


Web servers respond to HTTP requests by delivering a stream of data (typically Text, HTML, or XML) to a calling application (typically a browser).

Web servers can deliver documents from a local file system, invoke CGI-BIN scripts, or access databases and legacy systems through any number of integration technologies. Regardless of the source of the data, Web servers always speak HTTP.

Web Automation applications leverage the fact that Web servers provide a common protocol for requesting data from diverse back-end systems.
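
The request and response cycle described above can be sketched in Python with the standard library alone. To stay self-contained, the example starts a throwaway server on the local machine and then queries it the way any HTTP client would. The handler class CompanyHandler, the /company path and the User-Agent string are all assumptions made for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener

class CompanyHandler(BaseHTTPRequestHandler):
    """Hypothetical server side: always answers with a small XML document."""

    def do_GET(self):
        body = b"<NAME>New Millennium Software Company</NAME>"
        self.send_response(200)
        self.send_header("Content-Type", "text/xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet

def fetch_from_local_server():
    """Start a local server, send one HTTP GET request, return the reply."""
    server = HTTPServer(("127.0.0.1", 0), CompanyHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/company" % server.server_port
        # A browser-like User-Agent header; the server sees an ordinary request.
        request = Request(url, headers={"User-Agent": "Mozilla/5.0 (sketch)"})
        opener = build_opener(ProxyHandler({}))  # bypass any configured proxy
        with opener.open(request, timeout=5) as response:
            content_type = response.headers.get("Content-Type")
            return response.status, content_type, response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()

status, content_type, body = fetch_from_local_server()
print(status, content_type)
print(body)
```

The client code would look the same whether the server read the data from a file, a script or a database, since only HTTP is visible from the outside.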


Unicode is a standard for representing characters as integers. Unlike ASCII, which defines only 128 characters (7 bits each, usually stored in an 8-bit byte), Unicode was originally designed around 16 bits per character, which allows more than 65,000 unique characters; later versions extend the range well beyond that. This is overkill for English and Western European languages, but it is necessary for some other languages, such as Greek, Chinese and Japanese. Many analysts believe that as the software industry becomes increasingly global, Unicode will eventually supplant ASCII as the standard character coding format.
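
A short Python sketch makes the "characters as integers" idea concrete. The function name describe is an assumption for illustration; ord and str.encode are standard Python. It reports the integer assigned to a character and how many bytes that character takes in two common encodings.

```python
def describe(ch):
    """Return the integer code and encoded sizes for a single character."""
    return {
        "code_point": ord(ch),                        # the character as an integer
        "utf8_bytes": len(ch.encode("utf-8")),        # variable-width encoding
        "utf16_bytes": len(ch.encode("utf-16-be")),   # 16-bit units, no byte-order mark
    }

latin_a = describe("A")         # also an ASCII character
greek_omega = describe("\u03a9")  # GREEK CAPITAL LETTER OMEGA
kanji_day = describe("\u65e5")    # a CJK ideograph used in Chinese and Japanese

for info in (latin_a, greek_omega, kanji_day):
    print(info)
```

The letter A keeps its ASCII value of 65, while the Greek and CJK characters have values that do not fit in a single 8-bit byte.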


A document type definition (DTD) is a series of definitions for element types, attributes, entities and notations.

A DTD provides formal markup declarations.

Markup declarations:


<!-- This allows: question, answer, question, answer ... -->


<!-- Questions are just made up of text -->


<!-- Answers are just text -->

Well-formedness and validity

XML has two notions of correctness. A well-formed document is one whose markup is intelligible: it follows the syntax rules of XML, such as properly nested and closed tags.

Validity means using the right elements in the appropriate places. A valid document declares conformance to a DTD and obeys its declarations.
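
The well-formedness half of this distinction can be checked with a few lines of Python. This is a sketch under stated assumptions: the element names FAQ, QUESTION and ANSWER and the function is_well_formed are hypothetical, and the standard xml.etree.ElementTree parser used here does not validate against a DTD, so it says nothing about validity.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as XML, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

# Hypothetical documents: the second closes QUESTION with the wrong tag.
good = "<FAQ><QUESTION>What is XML?</QUESTION><ANSWER>A markup language.</ANSWER></FAQ>"
bad = "<FAQ><QUESTION>What is XML?</ANSWER></FAQ>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

A document can pass this check and still be invalid, for example by putting an answer before its question when the DTD requires the opposite order.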

HYPERLINKS: Extended Links

XLink provides a notation for combining information from related links; relying on partial Web information can be dangerous.

Extended links can, furthermore, point to multiple resources: instead of linking a word to a single definition, you can link it to several definitions simultaneously.


Stylesheets provide personalized visual formats for Web pages, based on the style we want. Cascading Style Sheets (CSS) provide a standardized way of visually formatting Web pages. Extensible Stylesheet Language (XSL) combines many features of CSS with features of ISO's DSSSL stylesheet language. Like XML, XSL is extensible.

Module for XSL

The XML::XSLT Perl module implements the W3C's XSLT specification.

XML::XSLT makes use of XML::DOM and LWP::Simple, while XML::DOM uses XML::Parser. Therefore XML::Parser, XML::DOM and LWP::Simple have to be installed properly for XML::XSLT to run. IE5 and IE6 have the DOM embedded.

Stylesheets and Documents

The stylesheets and the documents may be passed as filenames, file handles, regular strings, string references or DOM trees. Functions that require sources (e.g. new) will accept either a named parameter or simply the argument.

Either of the following is allowed:

 my $xslt = XML::XSLT->new($xsl);
 my $xslt = XML::XSLT->new(Source => $xsl);

In documentation, the named parameter `Source' is always shown, but it is never required.


new(Source => $xml [, %args])
Returns a new XSLT parser object. Valid flags are:
Hashref of arguments to pass to the XML::DOM::Parser object's parse method.
Hashref of variables and their values for the stylesheet.
Base of URL for file inclusion.
Turn on debugging messages.
Turn on warning messages.
Starting amount of indention for debug messages. Defaults to 0.
Amount to indent each level of debug message. Defaults to 1.

open_xml(Source => $xml [, %args])
Gives the XSLT object new XML to process. Returns an XML::DOM object corresponding to the XML.
The base URL to use for opening documents.
Arguments to pass to the parser.

open_xsl(Source => $xml [, %args])
Gives the XSLT object a new stylesheet to use in processing XML. Returns an XML::DOM object corresponding to the stylesheet. Any arguments present are passed to the XML::DOM::Parser.
The base URL to use for opening documents.
Arguments to pass to the parser.

Processes the previously loaded XML through the stylesheet using the variables set in the argument.

transform(Source => $xml [, %args])
Processes the given XML through the stylesheet. Returns an XML::DOM object corresponding to the transformed XML. Any arguments present are passed to the XML::DOM::Parser.

serve(Source => $xml [, %args])
Processes the given XML through the stylesheet. Returns a string containing the result. Example:
  use XML::XSLT qw(serve);
  my $xslt = XML::XSLT->new($xsl);
  # Parentheses are required when passing arguments to a method call.
  print $xslt->serve($xml);
If true, prepends the appropriate HTTP headers (e.g. Content-Type, Content-Length). Defaults to true.
If true, the result contains the appropriate <?xml?> header. Defaults to true.
The version of the XML. Defaults to 1.0.
The type of DOCTYPE this document is. Defaults to SYSTEM.
Returns the result of transforming the XML with the stylesheet as a string.
Returns the result of transforming the XML with the stylesheet as an XML::DOM object.
Returns the media type (aka mime type) of the object.
Executes this method on each XML::DOM object.