Creating a search template is a fairly simple task, although some knowledge of
HTML helps. Take a look at the following sample script:
<webcommands name="fao">
    <!-- mandatory -->
    <setUp value="http://www.fao.org/" />
    <!-- mandatory -->
    <beginAt value="/" />
    <clickLinkWithText value="Employment" />
    <clickLinkWithText value="Professional vacancies" />
    <result_selectTableStartingWithPrefix value="Title" />
    <!-- mandatory -->
    <result_setIfNew />
</webcommands>

The <webcommands name="fao"> </webcommands> directives define the commands to:
- navigate the dynamic site
- retrieve the results
- define the results area
- compare the current results with the previous version
The enclosed commands are executed against the jwebunit framework. Their meaning is usually self-explanatory, but you can refer to the jwebunit documentation if you need more information.
Let's walk through the sample:
setUp declares the URL of the site to query; beginAt declares the relative path to start from.
Note that these two directives, together with the final result_setIfNew, are mandatory.
clickLinkWithText follows text links, and result_selectTableStartingWithPrefix defines the result area as the content of the table that starts with the declared text.
result_setIfNew takes care of updating the result area and comparing it with the previous version.
Set up an XML editor
To make creating search templates as easy as possible, we provide the webscript schema (webscript.xsd). You can set up an XML editor to use the schema, which adds code assist and code completion for your XML template. We use the XML editor included with the Eclipse platform.
Navigate
To navigate the dynamic site, use the commands provided by the jwebunit framework. These commands are read from your XML script and executed via reflection. If you need the name of an HTML element, e.g. to set an element of a form, look at the HTML source of the page or open the page in an HTML editor.
Retrieve results
Once you've finished setting form elements, you can use the submit command to post the data. It is also possible to use clickLinkWithText and clickLinkWithImage to submit data. The web agent automatically retrieves the result page.
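As a rough sketch of a form-based search, the fragment below sets one field, submits the form, and takes the whole returned page as the result area. The site URL, the path, the field name q and the search term are hypothetical. The name/value attribute pair on setFormElement is also an assumption, since the sample above only shows commands with a single value attribute; check webscript.xsd for the actual attribute names.

```xml
<webcommands name="example-search">
    <setUp value="http://www.example.org/" />
    <beginAt value="/search" />
    <!-- Assumption: attribute names are illustrative; see webscript.xsd -->
    <setFormElement name="q" value="vacancies" />
    <!-- Posts the form; the web agent then retrieves the result page -->
    <submit />
    <!-- Whole page as result area, then compare with the previous version -->
    <result_set />
    <result_setIfNew />
</webcommands>
```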
Define the results area
It is important to define the area that contains the results. This keeps the 'updated' status meaningful, so that it is not influenced by changes to the page that are unrelated to the results (e.g. a global news table repeated on every page could lead to 'updated' results, even if the actual results contained in a central table have not been updated).
- result_selectTableStartingWithPrefix
  defines the result area as a table starting with a declared text string
- result_set
  defines the result area as the whole page
- result_selectRegEx
  defines the result area as the group identified by the regular expression
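When the results are not held in a table, a regular expression can select the area instead. The following is only a minimal sketch: the site, the path and the markup (a ul element with class jobs) are hypothetical, passing the pattern through the value attribute is assumed by analogy with the other commands, and the (?s) flag assumes Java regular expression syntax. Note that < must be escaped as &lt; inside an XML attribute value (escaping > as &gt; is optional but common).

```xml
<webcommands name="example-jobs">
    <setUp value="http://www.example.org/" />
    <beginAt value="/jobs" />
    <!-- Group 1 of the pattern, i.e. everything inside the list, becomes the result area -->
    <result_selectRegEx value="(?s)&lt;ul class='jobs'&gt;(.*?)&lt;/ul&gt;" />
    <result_setIfNew />
</webcommands>
```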
Define the results data structure
Query results are usually grouped into a repeating frame, such as tables, rows of a table, or other elements such as div or li elements. Defining the data structure of the results allows the product to identify every single change in the result set, reporting an 'updated' status only when an element has been updated, and highlighting that element (e.g. as an RSS feed).
To define the results data structure, you can use:
- result_define_data_structure_as_tables
  if results are grouped in tables
- result_define_data_structure_as_rows
  if results are contained in rows of a table. Note that in this case it is assumed that only one table exists in the selected result area (see Define the results area)
- result_define_data_structure_as_regex
  if results are contained in other elements. There are two different behaviors:
  - group behavior: the regex defines a group that identifies every single element.
  - start/end behavior: if no group is declared in the regex, each element runs from one match of the regex to the next match of the regex.
See the 4 step tutorial to create a search script for more information.
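The two regex behaviors can be illustrated with a hypothetical page whose results are li elements. This is a sketch under several assumptions: the site and markup are invented, the pattern is assumed to be passed in the value attribute, the command is assumed to sit between the result area selection and result_setIfNew, and the (?s) flag assumes Java regular expression syntax.

```xml
<webcommands name="example-list">
    <setUp value="http://www.example.org/" />
    <beginAt value="/jobs" />
    <result_set />
    <!-- Group behavior: each match of group 1 is one result element -->
    <result_define_data_structure_as_regex value="(?s)&lt;li&gt;(.*?)&lt;/li&gt;" />
    <!-- Start/end behavior would instead use a pattern without a group,
         e.g. value="&lt;li&gt;", so that each element runs from one <li> to the next -->
    <result_setIfNew />
</webcommands>
```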
Compare
To compare the current results with the previous version use: result_setIfNew
Comparison is made in the following way:
- if no result data structure has been defined, the whole result area is taken into consideration.
- if the result data structure has been defined with the regex, the single elements are compared as text (e.g. URL parameters that refer to a session cookie will make the comparison fail, even if the data displayed on the HTML page are identical).
- if the result data structure has been defined using the table or row functions, an HTML comparison is performed (this usually means that visually equal pages are considered equal).
Common Errors (and their solution)
Once you have completed the script, you will want to run it. First upload the script to your local webnavigator repository; then you can execute it.
If you see the icon indicating execution problems, click to see the details. Together with the problems, the last retrieved page is returned.
Note: you will always see InvocationTargetException as the thrown exception. This is because of the reflection mechanism used.
Common problems
- webnavigator, setUp,
  - chances are that you need to set up proxy support.
    See proxySupport for the gatherdata tag.
- webnavigator, setUp, beginAt
  - chances are that the site produces HTML that is not supported by
    the HTML parser used by webnavigator.
- webnavigator, setUp, beginAt, clickLinkWithXXXX
  - chances are that the value of the link / image / etc. you provided isn't correct.
    Inspect the HTML source of the returned page (you see it together with the error) to check.
- webnavigator, setUp, beginAt, setFormElement/submit
  - same as above
- webnavigator, setUp, beginAt, ...., result_set
  - chances are that the site doesn't support the XML representation (the deep copy of the XML structure fails). This error will prevent the XML comparison features (currently not yet released) from working.
XML Commands
The following commands are available to navigate the site. They are provided by the jwebunit framework and are invoked using reflection. To avoid common spelling problems, a schema is provided to validate the XML (and to help with code completion, when supported by the XML editor). Most of the commands are self-explanatory; more information is available on the jwebunit web site, jwebunit.sourceforge.net.
Proxy support
If you need to set up a proxy to access the Internet, you may define the proxy host and port using the user settings options.
General script structure
A webscript is defined with a name and a list of sites to be searched for results.
The general structure is:
<webscript name="NAME OF THE WEBSCRIPT">
    <gatherdata>
        <webcommands name="SITE/RESULT NAME 1">
            ...
        </webcommands>
        ...
        <webcommands name="SITE/RESULT NAME x">
            ...
        </webcommands>
    </gatherdata>
    <sendgathereddata email="email address" />
</webscript>
You may define as many scripts as you want, provided that the WEBSCRIPT NAME is
UNIQUE. Inside a script, you may decide to gather data from as many sites as
you want.
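Putting the pieces together, a complete script might look like the sketch below, which wraps the FAO sample from the beginning of this page in the general structure. The script name and the email address are placeholders chosen for illustration, and the XML declaration on the first line is an assumption about what the repository accepts.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The webscript name must be unique among your scripts -->
<webscript name="fao-vacancies">
    <gatherdata>
        <!-- One webcommands block per site/result to gather -->
        <webcommands name="fao">
            <setUp value="http://www.fao.org/" />
            <beginAt value="/" />
            <clickLinkWithText value="Employment" />
            <clickLinkWithText value="Professional vacancies" />
            <result_selectTableStartingWithPrefix value="Title" />
            <result_setIfNew />
        </webcommands>
    </gatherdata>
    <!-- Placeholder address: replace with your own -->
    <sendgathereddata email="you@example.org" />
</webscript>
```

Further sites can be added as additional webcommands blocks inside the same gatherdata element.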