Cross-browser testing using machine learning

The first goal of our R&D process was to develop a technique for comparing web page screenshots. We started with a simple pixel-by-pixel diff, and over time it grew into a cross-browser testing algorithm built on complex image segmentation and comparison procedures (described in previous blog posts).


Example of image segmentation
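
To make the starting point concrete, here is a minimal Python sketch of a pixel-by-pixel diff. It is not Browserbite's code: the `pixel_diff` and `diff_ratio` helpers, the `tolerance` parameter and the tiny hand-made "screenshots" are all assumptions for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each channel 0..255
Image = List[List[Pixel]]      # rows of pixels


def pixel_diff(a: Image, b: Image, tolerance: int = 0) -> List[List[bool]]:
    """Return a mask that is True wherever the two screenshots differ.

    A pixel counts as different when any channel differs by more than
    `tolerance` (a hypothetical knob, not a Browserbite parameter).
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("screenshots must have the same dimensions")
    return [
        [any(abs(ca - cb) > tolerance for ca, cb in zip(pa, pb))
         for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def diff_ratio(mask: List[List[bool]]) -> float:
    """Fraction of pixels flagged as different."""
    total = sum(len(row) for row in mask)
    return sum(flag for row in mask for flag in row) / total if total else 0.0


# Two tiny 2x2 "screenshots" that differ in one pixel by a barely visible amount.
white, off_white, black = (255, 255, 255), (254, 255, 255), (0, 0, 0)
chrome = [[white, white], [black, white]]
firefox = [[white, off_white], [black, white]]

strict = pixel_diff(chrome, firefox)                # exact comparison
lenient = pixel_diff(chrome, firefox, tolerance=2)  # ignore tiny colour shifts
```

With an exact comparison, a colour shift of a single unit is already flagged as a difference. This hints at why a raw diff tends to be oversensitive.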

Unfortunately, the approach based purely on image processing produced a large number of false positives. The main problem is that human perception of cross-browser incompatibilities is very difficult to mimic without a complex classification system. Based on our initial measurements, Browserbite had a recall of 98%, but a precision of only 66%. As a reminder for our blog readers: recall is the share of actual incompatibilities that the system detects, TP / (TP + FN), and precision is the share of reported incompatibilities that are real, TP / (TP + FP), where TP, FP and FN are the numbers of true positives, false positives and false negatives.


This means that 34% of the reported results were false positives. Simply put, our system was oversensitive. Two contrasting examples can be seen in Figures 1 and 2. Figure 1 shows clear differences, so it is considered a true positive. Figure 2 shows minute differences that are not considered incompatibilities, so this case is classified as a false positive.


Figure 1. Example of true positive


Figure 2. Example of false positive
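
The link between the 66% precision and the 34% false positive share can be checked in a few lines of Python. The counts below are hypothetical, chosen only so that they reproduce the reported rates. They are not Browserbite's measurements.

```python
def precision(tp: int, fp: int) -> float:
    """Share of reported incompatibilities that are real."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of real incompatibilities that were reported."""
    return tp / (tp + fn)


# Hypothetical counts (assumption): 98 true positives, 50 false positives,
# 2 false negatives.
tp, fp, fn = 98, 50, 2

p = precision(tp, fp)
r = recall(tp, fn)
false_positive_share = 1 - p  # fraction of reported results that are wrong

print(f"precision={p:.2f} recall={r:.2f} false positives={false_positive_share:.2f}")
```

Because precision and the false positive share are both taken over the reported results, they always sum to 1.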

Human decision making is strongly influenced by subjective factors such as the background colour, element size and position on the web page. To mimic this, we decided to build our own non-linear classification system to reduce the number of false positives.



Schematic of the comparison process with the added classifier

Building the classification system

We started by selecting the 140 most popular websites in Estonia. We ran our existing image processing based solution on cross-browser captures of these web pages, which found a large number of potential incompatibilities. From these results, we selected a few thousand samples for our training dataset and asked 40 people from six different countries to help us classify them. Their backgrounds ranged from IT to economics. Each participant had to sort the dataset into two classes: true positive and false positive. The judges disagreed very little: the inter-rater reliability index was 0.94.
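
The post does not say which inter-rater reliability index was used. As an illustration only, the sketch below computes Fleiss' kappa, one common agreement measure for more than two raters. The `fleiss_kappa` helper and the toy ratings are assumptions, not the study's data.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who put sample i into class j.
    Every sample must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each sample needs the same number (>= 2) of ratings")
    n_classes = len(counts[0])

    # Overall share of ratings that fell into each class.
    p_class = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_classes)
    ]
    # Per-sample agreement: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(p_item) / n_items
    expected = sum(p * p for p in p_class)
    if expected == 1.0:
        # All ratings in a single class: kappa is undefined. Treating this
        # as full agreement is a choice made for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data (assumption): 4 samples, 5 raters each.
# Columns are [true positive votes, false positive votes].
unanimous = [[5, 0], [0, 5], [5, 0], [0, 5]]
one_dissent = [[5, 0], [0, 5], [4, 1], [0, 5]]

kappa_unanimous = fleiss_kappa(unanimous)
kappa_one_dissent = fleiss_kappa(one_dissent)
```

A value of 1 means the raters always agree, and the value drops as soon as a single rater dissents on one sample.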


Artificial neural network

Based on our initial empirical experiments, we chose an artificial neural network (ANN) as the most effective classification system for our needs. A neural network consists of a large number of interconnected neurons, and mimics biological neural networks by performing functions in parallel. We used the collected dataset to train the neural network based classifier, with image-specific parameters such as histogram and size as input features. As a result, the precision of the Browserbite system improved from 66% to 96%. This means that we can produce test results that are very similar to those of human testers.
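
As a rough sketch of what such a classifier looks like, the Python below turns an image region into histogram and size features and passes them through a small feed-forward network. Everything specific is an assumption: the 4-bin grey histogram, the normalisation by page size, the network shape and the hand-picked, untrained weights. A real classifier would learn its weights from the labelled dataset.

```python
import math
from typing import List, Sequence


def grey_histogram(values: Sequence[int], bins: int = 4) -> List[float]:
    """Normalised histogram of grey values in the range 0..255."""
    counts = [0] * bins
    for v in values:
        counts[min(v * bins // 256, bins - 1)] += 1
    total = len(values) or 1
    return [c / total for c in counts]


def extract_features(region: List[List[int]], page_w: int, page_h: int) -> List[float]:
    """Histogram plus relative width and height of a differing region."""
    height, width = len(region), len(region[0])
    flat = [v for row in region for v in row]
    return grey_histogram(flat) + [width / page_w, height / page_h]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def mlp_predict(x: Sequence[float],
                w_hidden: List[List[float]], b_hidden: List[float],
                w_out: List[float], b_out: float) -> float:
    """Forward pass of a one-hidden-layer network with sigmoid activations."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# A tiny 3x2 grey-scale region on a hypothetical 300x200 page.
region = [[0, 0, 255],
          [128, 255, 255]]
x = extract_features(region, page_w=300, page_h=200)

# Hand-picked placeholder weights (assumption): 6 inputs, 2 hidden neurons.
w_hidden = [[0.5, -0.2, 0.1, 0.3, 2.0, 2.0],
            [-0.4, 0.6, 0.2, -0.1, 1.5, 1.5]]
b_hidden = [0.0, 0.1]
w_out = [1.2, -0.8]
b_out = -0.1

score = mlp_predict(x, w_hidden, b_hidden, w_out, b_out)
is_incompatibility = score >= 0.5  # the threshold is also an assumption
```

The output is a score between 0 and 1 that can be read as how likely the region is a real incompatibility. The non-linear activations let the network combine features in ways that a single linear threshold cannot.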

You can find a more detailed description of our research in the paper by N. Semenenko, M. Dumas, and T. Saar, “Browserbite: Accurate Cross-Browser Testing via Machine Learning Over Image Features.” We would like to thank the people from the University of Tartu, especially Nataliia Semenenko, for finding the human judges and collecting the dataset.
