Adapting GRML

7 Oct 2004
Convert an HTML web page to GRML.

Introduction

This article introduces the process of adapting (or converting) a web page from one markup language to another, specifically an HTML web page to GRML. Two examples are provided. The first extracts the hyperlinks from an HTML web page and converts them to GRML. The second does the same with images. Both examples require server-side processing; here, IIS, Active Server Pages (ASP), and PERL are used.

Background

Some experience with ASP and PERL is recommended. PERL's regular expression support is used to extract the hyperlinks and images from the web page. Any server-side scripting environment can do this, including .NET, CGI, or PHP; this article uses PERL and ASP. More precisely, the server-side scripting language is PerlScript, which requires a PERL interpreter. ActivePerl is one that works with IIS. If you have not already done so, read Introducing GRML and Using GRML, which explain what GRML is and how it is used.

What is an Adapter?

An adapter is generally defined as...

an object that converts the original interface of a component to another interface.

For the purposes of this article, the definition of an adapter is...

server-side processing or scripting that converts one markup language to another.

This definition describes converting HTML to GRML using ASP. The adapter object is the ASP script, the original interface is HTML, and the other interface is GRML.

Whatever is being converted, an adapter must read from the original interface and write to the other interface. In other words, an adapter needs an interface reader and an interface writer. An HTML-to-GRML adapter requires an HTML reader and a GRML writer.
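
The reader/writer split can be sketched outside ASP. The following is a minimal Python sketch, not the PerlScript used in this article. The function names `read_links`, `write_grml`, and `adapt` are hypothetical, and the GRML layout is a simplified imitation of the tags written by the hyperlink script in this article; it has not been checked against a GRML specification.

```python
import re

# Matches simple <a ... href=...>text</a> pairs, quoted or unquoted.
# Like the article's own regular expression, this is a rough match,
# not a full HTML parser.
LINK_PATTERN = re.compile(
    r'<a\s+[^>]*?href=["\']?([^"\' >]+)[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

def read_links(html):
    """HTML reader: return (href, text) pairs with inner tags removed."""
    return [
        (href, re.sub(r"<[^>]+>", "", text).strip())
        for href, text in LINK_PATTERN.findall(html)
    ]

def write_grml(links):
    """GRML writer: emit a column header, then one entry per link."""
    lines = ["<grml>", "<column>", "<title>", "<link>", "</column>", "<result>"]
    for href, text in links:
        lines.append(f"<link>{href}")
        lines.append(f"<title>{text}")
    lines += ["</result>", "</grml>"]
    return "\n".join(lines)

def adapt(html):
    """Adapter: the reader feeds the writer."""
    return write_grml(read_links(html))
```

Keeping the two halves separate means a different source markup only needs a new reader, and a different target markup only needs a new writer.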

Adapting hyperlinks and images

HTML does not describe many elements of its content. For example, nothing in the markup distinguishes the role of one text block from another. However, not all HTML content lacks description: HTML has specific tags for hyperlinks and images.

Using a specific tag to identify content makes it possible to create a script that reads only those tags. When a tag is found, the unwanted tag elements are removed, leaving only the content. The script then writes the content in the new format or markup language. This is how the script adapts HTML hyperlinks or images to GRML.
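
Reading only specific tags can also be done with an event-driven parser instead of regular expressions. The sketch below uses Python's standard `html.parser` module rather than this article's PerlScript; the class name `TagReader` and the function `extract` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

class TagReader(HTMLParser):
    """Collects href values from <a> tags and src values from <img> tags.

    Every other tag is ignored, so only content that HTML explicitly
    identifies as a hyperlink or an image is kept.
    """

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])

def extract(html):
    """Return (links, images) found in the given HTML text."""
    reader = TagReader()
    reader.feed(html)
    reader.close()
    return reader.links, reader.images
```

A parser of this kind handles attribute order, quoting, and letter case on its own, which is the cleanup that a regular expression approach has to do by hand.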

The Hyperlink adapter

The <a href=> tag lets an HTML web browser identify which text is a hyperlink. This tag is the basis for the hyperlink adapter. It extracts all hyperlinks from an HTML web page and converts them to GRML.

Below is an example of an HTML-to-GRML hyperlink adapter, using PerlScript:

<%@ Language="PerlScript" %>
<HTML>
<center>
<form action=links.asp>
URL to extract: <input type=text name=url1 length=60>
<input type=submit>
</form>
</center>
<!--
<grml>
<edit url1>
<title>Enter URL:
</edit>
<%
use HTML::LinkExtor;
use URI::URL;
use LWP;

my ($url, $html);

# Parsing the Request
$url = $Request->QueryString("url1")->Item();

$Response->Write("<submit>\n");
$Response->Write("<location>GRMLBrowser.com/links.asp\n");
$Response->Write("</submit>\n");
$Response->Write("<edit url1>\n");
$Response->Write("<text>$url\n");
$Response->Write("</edit>\n");

if ($url eq "")
{
        $Response->Write("</GRML>\n");
}
else
{
        if ($url !~ /http:\/\//)
        {
            $url = "http://". $url;
        }
}

# Constructing the Request
    $_ = $sites;

# Retrieving the Response/Resultset
#    - Filtering the Resultset (optional)
my $ua = LWP::UserAgent->new(agent => "Mozilla 4.0");
my $request  = HTTP::Request->new('GET', $url);
my $response = $ua->request($request);

unless ($response->is_success)
{
    print $response->error_as_HTML . "\n";
    exit(1);
}

my $res = $response->content(); # content without HTTP header

$Response->Write("<column>\n");
$Response->Write("<Title>\n");
$Response->Write("<Request>\n");
$Response->Write("<link>\n");
$Response->Write("</column>\n");

$Response->Write("<result>\n");

$res =~ s/\n/ /gsi;

while($res =~ m|href=(.+?)>(.*?)</A>|gsi)   ## that's all ...
{
    my $temp_link = $1;
    my $temp_item = $2;
    
    $temp_link =~ s/\'//gsi;
    $temp_link =~ s/\"//gsi;
    $temp_link =~ s/ (.*)//gsi;
    $temp_link =~ s/<b>//gsi;
    $temp_link =~ s/<\/b>//gsi;
    $temp_link =~ s/&amp;/\&/gsi;
    $temp_link =~ s/\n(.*)//gsi;
    $temp_item =~ s/<b>//gsi;
    $temp_item =~ s/<\/b>//gsi;
    $temp_item =~ s/<(.+?)>//gsi;
    $temp_item =~ s/<\/font>//gsi;
    $temp_item =~ s/&amp;/\&/gsi;
    $temp_item =~ s/&nbsp;/ /gsi;
    $temp_item =~ s/&quot;/\"/gsi;
    $temp_item =~ s/\n(.*)//gsi;
    $temp_item =~ s/\n/  /gsi;
    $temp_item =~ s/  (.*)//gsi;
    $temp_item =~ s/   (.*)//gsi;
   

    if ($temp_item !~ /img src=/)
    {
        if ($temp_link !~ /$url/ && $temp_link !~ /\/\//)
        {
            $temp_link = $url . "\/" . $temp_link;
        }

        $temp_item =~ s/\n//gsi;
        $temp_link =~ s/\n//gsi;

        $Response->Write("<link>$temp_link\n");
        $Response->Write("<title>$temp_item\n");    
    }

    $Response->Write("<request>$url\n");
    $Response->Write("\n\n");
}

$Response->Write("</result>\n");
$Response->Write("</GRML>\n");
%>
-->
</html>

The above code creates an HTML form that takes a URL and extracts all the hyperlinks from that web page. The hyperlinks (and their titles) are formatted using GRML. To view GRML, a GRML web browser is required (such as Pioneer Report MDI).
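
The script makes a relative link absolute by appending it to the requested URL with a slash. That works when the URL is a site root, but it can produce the wrong address when the page sits in a subdirectory or the link starts with a slash. A minimal sketch of standards-based resolution follows, written in Python with the standard `urllib.parse.urljoin` function rather than PerlScript; the function name `absolutize` and the example URLs are assumptions made for illustration.

```python
from urllib.parse import urljoin

def absolutize(base_url, links):
    """Resolve each extracted link against the URL of the page it came from.

    Absolute links pass through unchanged; relative links are resolved
    the way a web browser would resolve them.
    """
    return [urljoin(base_url, link) for link in links]

if __name__ == "__main__":
    page = "http://example.com/docs/index.html"
    for link in absolutize(page, ["page.html", "/top.html", "http://other.org/x"]):
        print(link)
```

In the image script, the URI::URL call `url($_, $base)->abs` plays the same role for image addresses.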

Most of the server-side script acts as the HTML reader. Only the following lines act as the GRML writer:

  • $Response->Write("<column>\n");
  • $Response->Write("<Title>\n");
  • $Response->Write("<Request>\n");
  • $Response->Write("<link>\n");
  • $Response->Write("</column>\n");
  • $Response->Write("<result>\n");
  • $Response->Write("<link>$temp_link\n");
  • $Response->Write("<title>$temp_item\n");
  • $Response->Write("<request>$url\n");
  • $Response->Write("</result>\n");

Only the last three lines format the hyperlinks using GRML. The first two lines create the form in the browser window of a GRML web browser and do not use the adapted HTML hyperlinks.

To see the above in action, go to Hyperlink adapter (http://grmlbrowser.com/links.asp) or copy the above script to a file and host it from a local web server. Once the web page is displayed, enter a URL and press the 'Submit' button. It displays all the hyperlinks extracted from the HTML web page formatted in GRML.

After adapting hyperlinks from HTML to GRML, this is how it appears in a GRML web browser (using Pioneer Report MDI):

[Image: adaptingGRML/PRM1_001.jpg]

The Image adapter

Using the <img src=> tag, a script is able to find and extract images from HTML. By reading this tag and removing unwanted tag elements, the HTML images are converted to GRML.
The following script demonstrates this:

<%@ Language="PerlScript" %>
<center>
<form action=translate.asp>
URL to translate: <input type=text name=url1 length=60>
<input type=submit>
</form>
</center>
<!--
<grml>
<edit url1>
<title>Enter URL:
</edit>
<%
use HTML::LinkExtor;
use URI::URL;
use LWP;

my ($url, $html);

# Parsing the Request
$url = $Request->QueryString("url1")->Item();

if ($url eq "")
{
    $Response->Write("</GRML>\n");
}
else
{
    if ($url !~ /http:\/\//)
    {
        $url = "http://" . $url;
    }
}

$Response->Write("### URL ###\n\n");
$Response->Write("The url is: $url\n\n");

# Constructing the Request
$_ = $sites;

# Retrieving the Response/Results
#    - Filtering the Results (optional)
my $ua = LWP::UserAgent->new(agent => "my agent V1.00");
my $request = HTTP::Request->new('GET', $url);
my $response = $ua->request($request);

unless ($response->is_success)
{
    print $response->error_as_HTML . "\n";
    exit(1);
}

my $res = $response->content(); # content without HTTP header

my @imgs = ();
my @hrefs = ();

# Make the parser. Unfortunately, we don't know the base yet
# (it might be different from $url)
my $p = HTML::LinkExtor->new(\&callback);
$p->parse($res);

# Expand all image URLs to absolute ones
my $base = $response->base;
@imgs = map { $_ = url($_, $base)->abs; } @imgs;

$Response->Write("<column>\n");
$Response->Write("<image>\n");
$Response->Write("<link>\n");
$Response->Write("</column>\n\n");

$Response->Write("<result>\n");

foreach (@imgs)
{
    $Response->Write("<image>$_\n");
}

$Response->Write("\nLinks:\n");

foreach (@hrefs)
{
    my $temp = $_;

    if ($temp !~ /$url/ && $temp !~ /\/\//)
    {
        $temp = $url . "\/" . $temp;
    }

    $Response->Write("<link>$temp\n");
}

# Called by HTML::LinkExtor for every tag that carries a link attribute
sub callback
{
    my($tag, %attr) = @_;
    push(@imgs , values %attr) if $tag eq 'img';
    push(@hrefs, values %attr) if $tag eq 'a';
}
%>
</result>
</GRML>
-->

The above script is used as an HTML reader, except for the lines used to build the columns and each result. These lines are the GRML writer:

  • $Response->Write("<column>\n");
  • $Response->Write("<image>\n");
  • $Response->Write("<link>\n");
  • $Response->Write("</column>\n\n");
  • $Response->Write("<result>\n");
  • $Response->Write("<image>$_\n");
  • $Response->Write("<link>$temp\n");
  • $Response->Write("</result>\n");

Once the image content has been adapted to GRML, this is how it looks in a GRML web browser (using Pioneer Report MDI):

[Image: adaptingGRML/PRM1_002.jpg]

Conclusion

Converting HTML to GRML is possible when using an adapter. Only content with identifiable tags is adaptable from one markup language to another. In the case of HTML, there are tags to identify hyperlinks and images.

The examples described for adapting content show how to convert HTML hyperlinks or images to GRML. The adapter consists of an HTML reader and a GRML writer.
Using this adapter, a web page viewed with an HTML web browser is viewable using a GRML web browser.

Latest changes

  • 09/03/04
      • Using GRML v1.2 in code samples.
  • 10/08/04
      • Using GRML v2.3 in code samples. Pioneer Report MDI 3.64 uses GRML v1.2 while all other GRML web browsers use v2.3.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves.

Article Copyright 2004 by Toby Jacob Rhodes