XML MARKUP
Flexibility of XML versus HTML

Text, HTML, and XML
Web pages are most often created using HTML (HyperText Markup Language). A Web page can, however, consist of just plain text, or a combination of HTML and pre-formatted text.

A great deal of information is lost in the process of converting data into HTML:

 

<H1>New Millennium Software Company</H1>
<P>
144 West Villa Theresa Dr.<BR>
Phoenix, AZ 85023<BR>
Telephone: 602-368-8141

XML (eXtensible Markup Language) is a more structured language for the representation of data on the Web:

 

<COMPANY>
<NAME>New Millennium Software Company</NAME>
<ADDRESS>
<STREET>144 West Villa Theresa Dr.</STREET>
<CITY>Phoenix</CITY>
<STATE>AZ</STATE>
<ZIP>85023</ZIP>
</ADDRESS>
<PHONE>602-368-8141</PHONE>
</COMPANY>
 

 

Parsing
Parsing is the process of reading a Text, HTML, or XML Web page to discover the structure of the document. The result of parsing is an object-based representation of the elements of a Web page.

Objects correspond to the HTML or XML elements that are discovered during the parsing. After parsing, arrays of like objects can be accessed by either zero-based indexing or wildcard matching.

For the HTML example above, object references can be used to access the company address and phone number:

 

doc.p[0].line[0-1].text
doc.p[0].line['*Telephone*'].text
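
The same two lookups can be approximated outside webMethods. The following is a minimal Python sketch, not the webMethods object model: it assumes that `line[0-1]` means "lines 0 through 1" and that lines are the text runs separated by <BR>. The class and function names (`LineCollector`, `parse_lines`) are made up for illustration.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser

HTML = """<H1>New Millennium Software Company</H1>
<P>
144 West Villa Theresa Dr.<BR>
Phoenix, AZ 85023<BR>
Telephone: 602-368-8141
"""


class LineCollector(HTMLParser):
    """Collect the text lines inside each <P>, splitting at <BR>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # one list of lines per <P>
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "p":
            self.paragraphs.append([""])
            self._in_p = True
        elif tag == "br" and self._in_p:
            self.paragraphs[-1].append("")

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1][-1] += data


def parse_lines(html):
    collector = LineCollector()
    collector.feed(html)
    collector.close()
    return [[line.strip() for line in p] for p in collector.paragraphs]


paragraphs = parse_lines(HTML)
# Zero-based indexing: the first two lines of the first paragraph.
address = paragraphs[0][0:2]
# Wildcard matching: any line of the first paragraph containing "Telephone".
phone = [line for line in paragraphs[0] if fnmatchcase(line, "*Telephone*")]
print(address)
print(phone)
```

Note that the program has to know that the address happens to be the first two lines; nothing in the HTML says so.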

For the XML example above, the object references are more straightforward:

 

doc.company[0].address[0].text
doc.company[0].phone[0].text
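
For comparison, here is a rough Python equivalent using the standard library's ElementTree parser in place of the webMethods object model. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

XML = """<COMPANY>
<NAME>New Millennium Software Company</NAME>
<ADDRESS>
<STREET>144 West Villa Theresa</STREET>
<CITY>Phoenix</CITY>
<STATE>AZ</STATE>
<ZIP>85023</ZIP>
</ADDRESS>
<PHONE>602-368-8141</PHONE>
</COMPANY>"""

company = ET.fromstring(XML)  # the root element is <COMPANY>

# Roughly doc.company[0].address[0].text: join the address parts in order.
address = company.findall("ADDRESS")[0]
address_text = " ".join(part.text.strip() for part in address)

# Roughly doc.company[0].phone[0].text
phone_text = company.findall("PHONE")[0].text

print(address_text)
print(phone_text)
```

Because the element names carry the meaning, no guessing about line positions or "Telephone:" prefixes is needed.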

 

 

Pattern Matching
Pattern matching is an alternate way of collecting arrays of objects from a parsed document.

In pattern matching, the elements of an HTML or XML document are viewed as a stream of tokens. For example, given the following HTML:

 

<UL>
<LI> Games for only: <B>$19.95</B>
<LI> ART: $100.95
<LI> Novel for only: <B>$19.95</B>
</UL>

A pattern of "LI B" would collect every instance where the <LI> and <B> tokens occur in the specified order; in this case, only the lines for Games and Novel would be collected.
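
One way to picture the token-stream view is the Python sketch below. It is a simplified stand-in for the real pattern matcher, and its rule is an assumption: each occurrence of the first tag in the pattern opens a segment, and a segment is collected only if the remaining tags of the pattern appear in it in order. `Tokenizer` and `match_pattern` are hypothetical names.

```python
from html.parser import HTMLParser

HTML = """<UL>
<LI> Games for only: <B>$19.95</B>
<LI> ART: $100.95
<LI> Novel for only: <B>$19.95</B>
</UL>"""


class Tokenizer(HTMLParser):
    """Flatten a document into a stream of (kind, value) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("tag", tag.upper()))

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text
            self.tokens.append(("text", data.strip()))


def match_pattern(html, pattern):
    """Return the text of every segment that matches the tag sequence."""
    first, *rest = pattern.split()
    tokenizer = Tokenizer()
    tokenizer.feed(html)
    tokenizer.close()

    # Each occurrence of the first tag opens a new segment.
    segments = []
    for kind, value in tokenizer.tokens:
        if kind == "tag" and value == first:
            segments.append([])
        elif segments:
            segments[-1].append((kind, value))

    matches = []
    for segment in segments:
        tags = iter(value for kind, value in segment if kind == "tag")
        # "tag in tags" consumes the iterator, so order is enforced.
        if all(tag in tags for tag in rest):
            matches.append(" ".join(v for k, v in segment if k == "text"))
    return matches


print(match_pattern(HTML, "LI B"))
# ['Games for only: $19.95', 'Novel for only: $19.95']
```

The ART line is skipped because its segment contains no <B> token.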

 

HTTP/HTTPS

 

HTTP (HyperText Transfer Protocol) is the protocol used to transfer documents over the Web. HTTPS is HTTP carried over an encrypted SSL (Secure Sockets Layer) connection; it requires SSL libraries, available as a separate product from webMethods.

Web Automation applications behave exactly like Web browsers in their use of HTTP/HTTPS to submit requests to Web servers. In fact, Web servers cannot distinguish a Web Automation application from a Web browser.
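
As an illustration of a program speaking HTTP the way a browser does, the Python sketch below builds a GET request with the standard library. It is not a webMethods API; the URL and the User-Agent string are placeholders, and `build_request` and `fetch` are hypothetical helper names. The request is only constructed here; `fetch` shows how it would be sent.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (compatible; ExampleAgent/1.0)"):
    """Build the same kind of GET request a browser would send."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Send the request and return (status, content type, body bytes)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        content_type = response.headers.get_content_type()
        return response.status, content_type, response.read()


request = build_request("https://www.example.com/catalog.xml")
print(request.get_method(), request.full_url)
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

From the server's side, such a request looks like one sent by a browser: a method, a URL, and a set of headers.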

 

Web Servers

 

Web servers respond to HTTP requests by delivering a stream of data (typically Text, HTML, or XML) to a calling application (typically a browser).

Web servers can deliver documents from a local file system, invoke CGI-BIN scripts, or access databases and legacy systems through any number of integration technologies. Regardless of the source of the data, Web servers always speak HTTP.

Web Automation applications leverage the fact that Web servers provide a common protocol for requesting data from diverse back-end systems.

UNICODE: NEW STANDARD

Unicode is a standard for representing characters as integers. Unlike ASCII, which uses 7 bits for each character (usually stored in an 8-bit byte), Unicode was originally designed as a 16-bit code, which can represent more than 65,000 unique characters; later versions extend the code space to more than a million code points. This is a bit of overkill for English and Western European languages, but it is necessary for some other languages, such as Greek, Chinese and Japanese. Many analysts believe that as the software industry becomes increasingly global, Unicode will eventually supplant ASCII as the standard character coding format.
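
The idea of characters as integers can be seen directly in Python, whose strings are sequences of Unicode code points. This short sketch prints the integer for a few sample characters (chosen here only for illustration) and the number of bytes each takes in two common encodings. Code points above U+FFFF need two 16-bit units in UTF-16.

```python
# Sample characters written as escapes: A, Greek capital omega,
# a CJK ideograph, and an emoji outside the 16-bit range.
samples = ["A", "\u03a9", "\u4e2d", "\U0001F600"]

sizes = {}
for ch in samples:
    code_point = ord(ch)              # the integer that represents the character
    utf8 = ch.encode("utf-8")         # variable length: 1 to 4 bytes
    utf16 = ch.encode("utf-16-be")    # 2 bytes, or 4 for code points above U+FFFF
    sizes[code_point] = (len(utf8), len(utf16))
    print(f"U+{code_point:04X}  utf-8: {len(utf8)} bytes  utf-16: {len(utf16)} bytes")
```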

 

A document type definition (DTD) is a series of definitions for element types, attributes, entities and notations.

A DTD expresses these definitions as formal markup declarations.

Markup declarations:

<!ELEMENT Q-AND-A (QUESTION,ANSWER)+>

<!-- This allows: question, answer, question, answer ... -->

<!ELEMENT QUESTION (#PCDATA)>

<!-- Questions are just made up of text -->

<!ELEMENT ANSWER (#PCDATA)>

<!-- Answers are just text -->
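
To make the content model concrete, the Python sketch below checks by hand whether a document follows the question, answer, question, answer pattern. It is not a DTD validator (Python's standard library does not include one); it simply restates the Q-AND-A content model as a regular expression over the child element names. The function name `matches_q_and_a` is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# The content model (QUESTION,ANSWER)+ as a regex over child tag names.
Q_AND_A_MODEL = re.compile(r"(QUESTION,ANSWER,)+")


def matches_q_and_a(xml_text):
    """Return True if the document fits the three declarations by hand-check."""
    root = ET.fromstring(xml_text)
    if root.tag != "Q-AND-A":
        return False
    # (#PCDATA) means text only: questions and answers have no child elements.
    if any(len(child) for child in root):
        return False
    children = "".join(child.tag + "," for child in root)
    return Q_AND_A_MODEL.fullmatch(children) is not None


good = ("<Q-AND-A><QUESTION>What is XML?</QUESTION>"
        "<ANSWER>A markup language.</ANSWER></Q-AND-A>")
bad = ("<Q-AND-A><QUESTION>One?</QUESTION><QUESTION>Two?</QUESTION>"
       "<ANSWER>Only one answer.</ANSWER></Q-AND-A>")

print(matches_q_and_a(good))  # True
print(matches_q_and_a(bad))   # False: two questions in a row
```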

Well-formedness and validity

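XML has two notions of correctness: well-formedness and validity. A well-formed document is one whose markup is intelligible: it follows the syntax rules of XML, so that, for example, every start tag has a matching end tag and elements are properly nested.

Validity means using the right elements in the appropriate locations. A valid document is well-formed and also conforms to the DTD it declares.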
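
The difference can be seen with any XML parser. In this Python sketch, the standard library's ElementTree parser is used as a well-formedness check only; it does not validate against a DTD. `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Properly nested tags: well-formed.
print(is_well_formed("<COMPANY><NAME>New Millennium</NAME></COMPANY>"))  # True

# End tags in the wrong order: not well-formed.
print(is_well_formed("<COMPANY><NAME>New Millennium</COMPANY></NAME>"))  # False

# Well-formed, although a DTD might still reject a company with two phones.
print(is_well_formed("<COMPANY><PHONE/><PHONE/></COMPANY>"))  # True
```

The last case is the interesting one: whether it is valid depends entirely on what the DTD allows.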

HYPERLINKS: Extended Links

XLink provides a notation for combining information from related links. Seeing only part of the related information on the Web can be misleading.

XML extended links can, furthermore, point to multiple resources. Instead of linking a word to a single definition, you can link it to several definitions simultaneously.
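
As a sketch of what such a link might look like, the Python example below parses a small extended link and lists the resources it points to. The `xlink:type`, `xlink:href` and `xlink:title` attributes and the namespace URI come from the XLink specification; the element names (`definitions`, `entry`) and the example.com URLs are invented for illustration.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

DOC = """<definitions xmlns:xlink="http://www.w3.org/1999/xlink"
             xlink:type="extended">
  <entry xlink:type="locator" xlink:title="Dictionary A"
         xlink:href="http://example.com/dict-a#markup"/>
  <entry xlink:type="locator" xlink:title="Dictionary B"
         xlink:href="http://example.com/dict-b#markup"/>
</definitions>"""


def locators(xml_text):
    """Return (title, href) for every locator in an extended link."""
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced attributes as "{namespace-uri}name".
    if root.get(f"{{{XLINK}}}type") != "extended":
        return []
    return [
        (child.get(f"{{{XLINK}}}title"), child.get(f"{{{XLINK}}}href"))
        for child in root
        if child.get(f"{{{XLINK}}}type") == "locator"
    ]


for title, href in locators(DOC):
    print(title, href)
```

A single link element thus carries two targets, one per dictionary.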

Stylesheets

Stylesheets provide personalized visual formats for Web pages, based on the style we want. Cascading Style Sheets (CSS) provide a standardized way of specifying the visual format of Web pages. Extensible Stylesheet Language (XSL) combines many features from CSS with features of ISO's DSSSL stylesheet language. Like XML, XSL is extensible.

Module for XSL

The XML::XSLT module implements the W3C's XSLT specification.

XML::XSLT makes use of XML::DOM and LWP::Simple, while XML::DOM uses XML::Parser. Therefore XML::Parser, XML::DOM and LWP::Simple have to be installed properly for XML::XSLT to run. Internet Explorer 5 and 6 have a DOM implementation built in.

Stylesheets and Documents

The stylesheets and the documents may be passed as filenames, file handles, regular strings, string references or DOM trees. Functions that require sources (e.g. new) will accept either a named parameter or simply the argument.

Either of the following is allowed:

 my $xslt = XML::XSLT->new($xsl);
 my $xslt = XML::XSLT->new(Source => $xsl);

In this documentation, the named parameter 'Source' is always shown, but it is never required.

METHODS

new(Source => $xml [, %args])
    Returns a new XSLT parser object. Valid flags are:

    DOMparser_args
        Hashref of arguments to pass to the XML::DOM::Parser object's parse method.
    variables
        Hashref of variables and their values for the stylesheet.
    base
        Base of URL for file inclusion.
    debug
        Turn on debugging messages.
    warnings
        Turn on warning messages.
    indent
        Starting amount of indentation for debug messages. Defaults to 0.
    indent_incr
        Amount to indent each level of debug message. Defaults to 1.

open_xml(Source => $xml [, %args])
    Gives the XSLT object new XML to process. Returns an XML::DOM object corresponding to the XML.

    base
        The base URL to use for opening documents.
    parser_args
        Arguments to pass to the parser.

open_xsl(Source => $xml [, %args])
    Gives the XSLT object a new stylesheet to use in processing XML. Returns an XML::DOM object corresponding to the stylesheet. Any arguments present are passed to the XML::DOM::Parser.

    base
        The base URL to use for opening documents.
    parser_args
        Arguments to pass to the parser.

process(%variables)
    Processes the previously loaded XML through the stylesheet using the variables set in the argument.

transform(Source => $xml [, %args])
    Processes the given XML through the stylesheet. Returns an XML::DOM object corresponding to the transformed XML. Any arguments present are passed to the XML::DOM::Parser.

serve(Source => $xml [, %args])
    Processes the given XML through the stylesheet. Returns a string containing the result. Example:

      use XML::XSLT qw(serve);

      my $xslt = XML::XSLT->new($xsl);
      print $xslt->serve($xml);

    http_headers
        If true, prepends the appropriate HTTP headers (e.g. Content-Type, Content-Length). Defaults to true.
    xml_declaration
        If true, the result contains the appropriate <?xml?> header. Defaults to true.
    xml_version
        The version of the XML. Defaults to 1.0.
    doctype
        The type of DOCTYPE this document is. Defaults to SYSTEM.

toString
    Returns the result of transforming the XML with the stylesheet as a string.

to_dom
    Returns the result of transforming the XML with the stylesheet as an XML::DOM object.

media_type
    Returns the media type (also known as MIME type) of the object.

dispose
    Executes the dispose method on each XML::DOM object.