How it works

The following picture describes the WebNavigator architecture.
[Figure: WebNavigator architecture]

Web Agents

Web Agents retrieve information from web sites using the standard HTTP protocol. They are guided by Search Templates and take care of connecting to the web site and navigating to the result page. This component is based on httpunit and jwebunit.

Search Templates

Templates are groups of commands that drive the navigation and examine the result. Templates are XML-based files. For example:

<?xml version="1.0" encoding="UTF-8"?>
<webscript name="test ws">
    <gatherdata>
        <webcommands name="google webnavigator">
            <setUp value="http://www.google.com"/>
            <beginAt value="/"/>
            <setFormElement name="q" value="webnavigator"/>
            <submit value="btnG"/>
            <clickLinkWithText value="webnavigator"/>
            <result_selectTableStartingWithPrefix value="news" />
            <result_setIfNew />
        </webcommands>
        <webcommands name="altavista webnavigator"> 
        ......   
        </webcommands>
    </gatherdata>
    <sendgathereddata email="sc@sourceforge.net" />
</webscript>

WebCommands interact with HTML pages as if they were driving a web browser, so it is possible to:
  • set form element values
  • set options
  • click images
  • click text
  • click buttons
  • submit pages
To ease the creation of templates, the commands directly use the names of the Java methods exposed by httpunit.
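
As an illustration of how command names in a template could map onto Java method names, the sketch below parses a template and invokes, by reflection, the method named after each command element. TemplateRunner and RecordingBrowser are hypothetical names introduced for this example. The recording class only stands in for the real httpunit/jwebunit driver, and the result_* commands are left out. It is a sketch under these assumptions, not the project's actual implementation.

```java
import java.io.ByteArrayInputStream;
import java.lang.reflect.Method;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class TemplateRunner {

    /** Hypothetical stand-in for the real browser driver: it only records calls. */
    public static class RecordingBrowser {
        public final List<String> calls = new ArrayList<>();
        public void setUp(String baseUrl) { calls.add("setUp " + baseUrl); }
        public void beginAt(String path) { calls.add("beginAt " + path); }
        public void setFormElement(String name, String value) {
            calls.add("setFormElement " + name + "=" + value);
        }
        public void submit(String button) { calls.add("submit " + button); }
        public void clickLinkWithText(String text) { calls.add("clickLinkWithText " + text); }
    }

    /** Runs, in document order, every command element inside each webcommands block. */
    static void run(String templateXml, Object browser) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(templateXml.getBytes(StandardCharsets.UTF_8)));
            NodeList groups = doc.getElementsByTagName("webcommands");
            for (int g = 0; g < groups.getLength(); g++) {
                NodeList children = groups.item(g).getChildNodes();
                for (int i = 0; i < children.getLength(); i++) {
                    Node node = children.item(i);
                    if (node.getNodeType() != Node.ELEMENT_NODE) {
                        continue; // skip whitespace and other non-element nodes
                    }
                    invoke(browser, (Element) node);
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Cannot run template", e);
        }
    }

    /** Looks up the public method named after the element and calls it with its attributes. */
    static void invoke(Object browser, Element cmd) throws Exception {
        String value = cmd.getAttribute("value");
        if (cmd.hasAttribute("name")) {
            // Two-argument commands such as setFormElement(name, value).
            Method m = browser.getClass().getMethod(cmd.getTagName(), String.class, String.class);
            m.invoke(browser, cmd.getAttribute("name"), value);
        } else {
            // One-argument commands such as beginAt(value).
            Method m = browser.getClass().getMethod(cmd.getTagName(), String.class);
            m.invoke(browser, value);
        }
    }

    public static void main(String[] args) {
        String template =
              "<webscript name=\"test ws\"><gatherdata>"
            + "<webcommands name=\"google webnavigator\">"
            + "<setUp value=\"http://www.google.com\"/>"
            + "<beginAt value=\"/\"/>"
            + "<setFormElement name=\"q\" value=\"webnavigator\"/>"
            + "<submit value=\"btnG\"/>"
            + "</webcommands></gatherdata></webscript>";
        RecordingBrowser browser = new RecordingBrowser();
        run(template, browser);
        browser.calls.forEach(System.out::println);
    }
}
```

With this kind of dispatch, supporting a new command only requires a method with the matching name on the driver object. No separate command table has to be maintained.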

Store

Persistent storage is used to cache results and to perform a differencing analysis that identifies updated results. WebNavigator uses hsqldb.
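
The caching idea can be sketched as follows: keep a digest of the last result for each search and report whether a newly retrieved result differs from it. ResultCache and storeIfNew are hypothetical names, and the in-memory map is only a stand-in for the hsqldb store. The sketch detects that something changed, not what changed.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

class ResultCache {
    // In-memory stand-in for the persistent store (assumption: the real store is a database table).
    private final Map<String, String> digestBySearch = new HashMap<>();

    /** Caches the result and returns true if it differs from the previously cached one. */
    boolean storeIfNew(String searchName, String resultText) {
        String digest = sha256(resultText);
        String previous = digestBySearch.put(searchName, digest);
        return !digest.equals(previous); // also true on the first run, when previous is null
    }

    /** Hex-encoded SHA-256 digest of the text. */
    static String sha256(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(text.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory on every Java platform
        }
    }

    public static void main(String[] args) {
        ResultCache cache = new ResultCache();
        System.out.println(cache.storeIfNew("google webnavigator", "row A"));
        System.out.println(cache.storeIfNew("google webnavigator", "row A"));
    }
}
```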

Scheduler

Takes care of scheduling repeated searches.
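
A minimal sketch of repeated scheduling, assuming the standard java.util.concurrent executor is an acceptable stand-in for the project's own scheduler. SearchScheduler and the 20 ms period in the demo are hypothetical, and a real search would run far less often.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class SearchScheduler {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    /** Runs the search immediately and then again at every period. */
    void scheduleRepeating(Runnable search, long period, TimeUnit unit) {
        executor.scheduleAtFixedRate(() -> {
            try {
                search.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all later runs, so log it and carry on.
                System.err.println("Search failed: " + e);
            }
        }, 0, period, unit);
    }

    void shutdown() {
        executor.shutdownNow();
    }

    /** Demo: schedules a counter every 20 ms and waits until it has run three times. */
    static boolean demo() {
        SearchScheduler scheduler = new SearchScheduler();
        CountDownLatch threeRuns = new CountDownLatch(3);
        scheduler.scheduleRepeating(threeRuns::countDown, 20, TimeUnit.MILLISECONDS);
        try {
            return threeRuns.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("three runs completed: " + demo());
    }
}
```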

Result Filter

The basic idea is to determine whether the latest search reported new results. At present, there are two groups of functions in this area:
  • identify the Result Area (the area of the page that presents the results, separated from the surrounding information). Right now it is possible to define the result area as an HTML table starting with some text, or as the whole page.
  • identify the differences from the previous result set. Right now, we are exploring XML differencing. This requires, at least, an XML representation of the HTML page.
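
The two steps above can be sketched as follows. The example selects the first table whose text starts with a given prefix, then lists the rows that were absent from the previous result set. ResultFilter and its method names are hypothetical. The regular expressions assume simple, non-nested tables, and the row-set comparison is a simpler stand-in for the XML differencing mentioned above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResultFilter {
    private static final Pattern TABLE = Pattern.compile(
            "<table\\b[^>]*>(.*?)</table>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern ROW = Pattern.compile(
            "<tr\\b[^>]*>(.*?)</tr>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Row texts of the first table whose text starts with prefix, or an empty list. */
    static List<String> selectTableStartingWithPrefix(String html, String prefix) {
        Matcher table = TABLE.matcher(html);
        while (table.find()) {
            String body = table.group(1);
            if (!text(body).startsWith(prefix)) {
                continue; // not the result area, try the next table
            }
            List<String> rows = new ArrayList<>();
            Matcher row = ROW.matcher(body);
            while (row.find()) {
                rows.add(text(row.group(1)));
            }
            return rows;
        }
        return new ArrayList<>();
    }

    /** Rows present in current but absent from previous, in page order. */
    static List<String> newRows(List<String> previous, List<String> current) {
        Set<String> fresh = new LinkedHashSet<>(current);
        fresh.removeAll(previous);
        return new ArrayList<>(fresh);
    }

    /** Strips tags and collapses runs of whitespace. */
    static String text(String fragment) {
        return fragment.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String page = "<table><tr><td>ads</td></tr></table>"
                + "<table><tr><th>news</th></tr><tr><td>item 1</td></tr></table>";
        System.out.println(selectTableStartingWithPrefix(page, "news"));
    }
}
```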

Mail / Web Alert

Results retrieved by the web agents may be sent by email to a configured account, together with the retrieved result pages. Results are also stored in a local repository, where they can be consulted to see what has changed.