simavian.com

Tutorial: create a search script in 4 steps

Creating a search script is not easy, and the details depend heavily on the structure of the site you want to query. The following steps give some ideas on how to proceed.

XML commands

1: Set up the navigation

Define the script name and the commands to be performed on the site to navigate to the results.

<?xml version="1.0" encoding="UTF-8"?>
<webscript
	xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
	xsi:noNamespaceSchemaLocation='webscript.xsd'
	name='demo shop sites'>
	<gatherdata>
	<webcommands name="kelkoo">
		<setUp value="http://www.kelkoo.co.uk"/>
		<beginAt value="/"/>
		<setFormElement name="siteSearchQuery" value="toilet roll"/>
		<submit/>
		<!-- the result definitions from steps 2 and 3 go here -->
	</webcommands>
	</gatherdata>
</webscript>

To find the name of the input field, open the page in an HTML editor (we use Mozilla) and click on the INPUT field.
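If no HTML editor is at hand, the field names can also be listed programmatically. The following is a minimal sketch using Python's standard html.parser module; the sample form is hypothetical and only stands in for the real page markup, and the helper name input_names is ours, not part of the script engine.

```python
from html.parser import HTMLParser


class InputNameCollector(HTMLParser):
    """Collect the name attribute of every <input> element in a page."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # tag arrives lower-cased; attrs is a list of (name, value) pairs
        if tag == "input":
            name = dict(attrs).get("name")
            if name:
                self.names.append(name)


def input_names(html):
    """Return the names of all named <input> fields found in html."""
    collector = InputNameCollector()
    collector.feed(html)
    collector.close()
    return collector.names


# Hypothetical search form, not the real page markup
SAMPLE_FORM = """
<form action="/search" method="get">
  <input type="text" name="siteSearchQuery">
  <input type="submit" value="Search">
</form>
"""

print(input_names(SAMPLE_FORM))
```

The name printed for the text field is the value to use in the name attribute of setFormElement.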

2: Define the result area

First of all visually identify the result area. Then, use an HTML editor to find the exact starting and ending string that defines the result area. Use a regular expression to select the whole area.

Use the group expression (.)* to represent the result area.

<result_selectRegEx>
<![CDATA[
<div class="mod_std_sub">(.)*<div id="pages" class="pageDiv">
]]>
</result_selectRegEx>



To check the regular expression you typed, use a regex editor (we use the QuickREx Eclipse plugin).

Verify that the highlighted area is what you expected.
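The same check can be scripted. The sketch below applies the result-area pattern to a hypothetical page fragment with Python's re module, used here only as a stand-in for the script engine. It assumes the engine lets "." match line breaks (re.DOTALL in Python); if it does not, the pattern cannot span several lines of HTML.

```python
import re

# Same pattern as in the script. re.DOTALL lets "." match newlines too;
# that the script engine behaves this way is an assumption.
RESULT_AREA = re.compile(
    r'<div class="mod_std_sub">(.)*<div id="pages" class="pageDiv">',
    re.DOTALL,
)

# Hypothetical page fragment, not the real site markup
SAMPLE_PAGE = """<html><body>
<div id="header">site navigation</div>
<div class="mod_std_sub">
<div class="width">
first result</div>
<div class="width">
second result</div>
<div id="pages" class="pageDiv">1 2 3</div>
</body></html>"""


def result_area(html):
    """Return the text matched by the result-area pattern, or None."""
    match = RESULT_AREA.search(html)
    return match.group(0) if match else None


area = result_area(SAMPLE_PAGE)
print(area)
```

Because (.)* is greedy, the match extends to the last occurrence of the closing string, so it is worth confirming that the closing string appears only once on the page.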

3: Define the result data structure

This step can be tricky.

If the results are laid out in tables, or in the rows of a table, use the corresponding command:

<result_define_data_structure_as_tables/>



That is the case here, but for the sake of example, suppose we want to define the result data structure with a regular expression instead.

<result_define_data_structure_as_regex>
<![CDATA[
<div class="width">\s*
]]>
</result_define_data_structure_as_regex>

This tells the parsing engine that each result runs from one match of <div class="width">\s* to the next match (we call this the start/end strategy). It is also possible to define results with a group expression (we call this the group strategy). Note that balancing HTML tags with regular expressions is not easy.
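To make the two strategies concrete, here is a minimal Python sketch of how they could behave. It is an illustration under stated assumptions, not the parsing engine's actual code: the sample area is hypothetical, the sketch keeps the delimiter inside each result and treats everything after the last delimiter as the final result, and the group pattern is our own example.

```python
import re

# Delimiter from the script: marks where each result starts
DELIMITER = re.compile(r'<div class="width">\s*')

# Hypothetical group pattern (not from the script): one capture per result
GROUP = re.compile(r'<div class="width">\s*(.*?)</div>', re.DOTALL)

# Hypothetical result area, not the real site markup
SAMPLE_AREA = (
    '<div class="mod_std_sub">\n'
    '<div class="width">\n  first result</div>\n'
    '<div class="width">\n  second result</div>\n'
    '<div id="pages" class="pageDiv">'
)


def split_results(area):
    """Start/end strategy: cut the area at the start of every delimiter match."""
    matches = list(DELIMITER.finditer(area))
    results = []
    for current, following in zip(matches, matches[1:] + [None]):
        # Assumption: the last result runs to the end of the area
        end = following.start() if following else len(area)
        results.append(area[current.start():end])
    return results


def group_results(area):
    """Group strategy: each result is whatever the capture group matched."""
    return GROUP.findall(area)


print(split_results(SAMPLE_AREA))
print(group_results(SAMPLE_AREA))
```

The group pattern stops at the first closing </div>, so a result that contains a nested div would be cut short. This is the tag-balancing problem mentioned above, and one reason the start/end strategy is often easier to get right.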

4: Upload and run the script

Use the script management menu to upload the script, run it, and examine both the overall results and the detailed parsed data.


The whole script that we have created is:

<?xml version="1.0" encoding="UTF-8"?>
<webscript
	xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
	xsi:noNamespaceSchemaLocation='webscript.xsd'
	name='demo shop sites'>
	<gatherdata>
	<webcommands name="kelkoo">
		<setUp value="http://www.kelkoo.co.uk"/>
		<beginAt value="/"/>
		<setFormElement name="siteSearchQuery" value="toilet roll"/>
		<submit/>
		<!-- define result area -->
		<result_selectRegEx>
		<![CDATA[
		<div class="mod_std_sub">(.)*<div id="pages" class="pageDiv">
		]]>
		</result_selectRegEx>
		<!-- define result data structure -->
		<result_define_data_structure_as_tables/>
		<result_setIfNew/>
	</webcommands>
	</gatherdata>
</webscript>
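
Before uploading, it can help to confirm that the script parses and that every command is seen as an element. The sketch below uses Python's standard xml.etree.ElementTree on a copy of the script in which each command is written as a full element; the helper name check_script is ours. Note that a command missing its opening "<" is read as plain text, so the file stays well-formed and the command simply disappears from the list.

```python
import xml.etree.ElementTree as ET

# Copy of the tutorial script, with every command written as an element
SCRIPT = """<?xml version="1.0" encoding="UTF-8"?>
<webscript
    xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
    xsi:noNamespaceSchemaLocation='webscript.xsd'
    name='demo shop sites'>
    <gatherdata>
    <webcommands name="kelkoo">
        <setUp value="http://www.kelkoo.co.uk"/>
        <beginAt value="/"/>
        <setFormElement name="siteSearchQuery" value="toilet roll"/>
        <submit/>
        <!-- define result area -->
        <result_selectRegEx>
        <![CDATA[
        <div class="mod_std_sub">(.)*<div id="pages" class="pageDiv">
        ]]>
        </result_selectRegEx>
        <!-- define result data structure -->
        <result_define_data_structure_as_tables/>
        <result_setIfNew/>
    </webcommands>
    </gatherdata>
</webscript>
"""


def check_script(text):
    """Parse the script and list the command tags in each <webcommands> block.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed.
    """
    root = ET.fromstring(text.encode("utf-8"))
    return {
        block.get("name"): [child.tag for child in block]
        for block in root.iter("webcommands")
    }


for name, commands in check_script(SCRIPT).items():
    print(name, commands)
```

This only checks well-formedness and lists the tags; it does not validate the script against webscript.xsd.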