Parsing HTML Log Data

To parse data recorded with the HTML monitor, start AutoQUEST and type the command parseDirHTML <directory>. This parses the entire content of the directory, including subfolders. The resulting events are stored in an event sequence object in the tab sequences. Single files can be parsed using the command parseHTML <file>. For more information on the usage of these commands, type man parseDirHTML or man parseHTML.

Different pages of a website may contain the same element. For example, the menu structure of a website is usually repeated on all pages. For such elements to be treated as identical in subsequent analysis, they should have the same id in the HTML source code. If this is not the case for a monitored website, you may add these ids afterwards. Furthermore, you may want to treat several website elements as the same. For example, many different rows of a table can be considered the same row. For both purposes, the parsing of log files can be parameterized. This allows adding and replacing ids, or ignoring indexes, of website elements in the log files. If ids are replaced or indexes are ignored, the website model in AutoQUEST is more harmonized, and website elements are considered equal although they are not.

The parameterization is done in a separate properties file. The keys in the file specify the tags for which either the id shall be added or replaced, or the index shall be ignored. A single tag can be specified by its name alone, by its name and index, or by its name and id, as follows:

tagName
tagName[index]
tagName(htmlId\=id)

Furthermore, tags can be specified as paths through the DOM by giving several tag specifications concatenated with /. An example with three specified tags (tag1 with index 5, tag2, and tag3 with id "id") is the following:

tag1[5]/tag2/tag3(htmlId\=id)
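
To make the structure of such a key concrete, the following is a minimal Python sketch that splits a tag path into its segments. It is not AutoQUEST code: the function parse_tag_path and the pattern SEGMENT are hypothetical names introduced only for illustration. The sketch expects the key with the = already unescaped and does not handle a leading page specification.

```python
import re

# Hypothetical pattern, not part of AutoQUEST: a tag name, optionally followed
# by either "[index]" or "(htmlId=id)".
SEGMENT = re.compile(
    r"(?P<name>[^\[\](/]+)(?:\[(?P<index>\d+)\]|\(htmlId=(?P<id>[^)]*)\))?"
)


def parse_tag_path(spec):
    """Split a tag path into (name, index, html_id) triples.

    index and html_id are None when the segment does not specify them.
    """
    segments = []
    for part in spec.split("/"):
        match = SEGMENT.fullmatch(part)
        if match is None:
            raise ValueError(f"unrecognised tag specification: {part!r}")
        index = match.group("index")
        segments.append(
            (match.group("name"),
             int(index) if index is not None else None,
             match.group("id"))
        )
    return segments


print(parse_tag_path("tag1[5]/tag2/tag3(htmlId=id)"))
```

Each triple corresponds to one of the three forms listed above: name only, name with index, or name with id.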

The specification of a tag id may contain the # character to denote a wildcard. This allows matching several website elements with similar ids at once and giving them the same id. An example entry is:

div(htmlId\=id_number_#)

This line matches all divs with an id starting with id_number_, where # denotes any character.
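
The effect of the wildcard can be pictured with a small Python sketch. It rests on an assumption that this page does not confirm: that # stands for a run of one or more arbitrary characters, not exactly one character. The function wildcard_to_regex is a hypothetical helper, not part of AutoQUEST.

```python
import re


def wildcard_to_regex(id_pattern):
    """Translate an id pattern containing '#' into a compiled regex.

    Assumption: '#' matches one or more arbitrary characters. All other
    characters of the pattern are matched literally.
    """
    literal_parts = [re.escape(piece) for piece in id_pattern.split("#")]
    return re.compile(".+".join(literal_parts))


pattern = wildcard_to_regex("id_number_#")
for html_id in ("id_number_1", "id_number_42", "other_id"):
    print(html_id, bool(pattern.fullmatch(html_id)))
```

Under this assumption, ids such as id_number_1 and id_number_42 match, while an id that does not start with id_number_ does not.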

It is also possible to specify the page on which the tag path should match. A page is specified by giving a part of the page's path in the URL. After the page specification, the full path to the specified tag must be given. An example is the following:

document(path\=accounts)/html/body/div[0]/ul/li(htmlId\=breadcrumb1)/a

Please note that any = sign within a key of this properties file must be escaped. This is usually necessary when the path to a GUI element refers to elements by their id, as shown in the example.
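
The reason for the escaping becomes clear when you look at how a line is split into key and value. The following Python sketch assumes the file follows the usual Java properties convention, in which the first unescaped = ends the key. It is a simplification that ignores other escape sequences and separators of that format, and split_rule is a hypothetical helper, not part of AutoQUEST.

```python
import re


def split_rule(line):
    """Split a rule line at the first '=' not preceded by a backslash.

    Returns (key, value) with the escaped '\\=' in the key turned back
    into a plain '='.
    """
    match = re.search(r"(?<!\\)=", line)
    if match is None:
        raise ValueError("no unescaped '=' separator found")
    key = line[:match.start()].replace("\\=", "=")
    value = line[match.end():]
    return key, value


key, value = split_rule(r"div(htmlId\=date-chooser)/div[0]=date-chooser_day")
print(key)    # the tag path, with the '=' unescaped
print(value)  # the id to assign
```

Without the backslash, the key would end right after htmlId, and the rest of the line would be read as the value.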

To remove the id of a specified tag, the value must be empty. To set the id, the value must be the id the tag shall have. To clear the index of the specified tag, the value must be CLEAR_INDEX. Here are some further example entries:

body/div/div/div/form=
body/p/small/a=imprint-link
document(path\=accounts/login)/html/body/div[0]/div[1]/div[0]/form/p/a=password-reset-link
document(path\=accounts/login)/html/body/div[0]/div[1]/div[0]/form/div/button=CLEAR_INDEX
body/div[5]=date-chooser
div(htmlId\=date-chooser)/div[0]=date-chooser_day
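
The three kinds of values can be summarized in a short Python sketch. The function describe_action is a hypothetical helper for illustration only and does not reflect how AutoQUEST processes the file internally. The demonstration takes the value as the text after the last = of a line, which assumes that ids contain no = themselves.

```python
def describe_action(value):
    """Return a description of what a rule value requests."""
    if value == "":
        return "remove id"
    if value == "CLEAR_INDEX":
        return "clear index"
    return f"set id to {value!r}"


rules = [
    "body/div/div/div/form=",
    "body/p/small/a=imprint-link",
    r"document(path\=accounts/login)/html/body/div[0]/div[1]/div[0]/form/div/button=CLEAR_INDEX",
]
for rule in rules:
    # The value is everything after the last '=' of the line.
    value = rule.rpartition("=")[2]
    print(describe_action(value))
```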
Last modified on 10/04/17 11:52:47