Nowadays, Machine Learning techniques rely on enormous amounts of data to achieve their results. Techniques such as Google’s WaveNet achieve stellar results, but at a high cost in terms of computational power.

The main goal of this work is to investigate the possibility of reducing the workload of training these systems while maintaining acceptable performance. We focus on Bayesian classifiers as our main test-bench. The central concept behind the technique is a pre-processing step, based on Shannon’s surprisal, which filters out any training data that does not contribute significantly to the total information in the system:
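As a minimal sketch of the idea, the filter can be expressed as a thresholding rule on per-sample surprisal, -log2 p(x). The probability model and the name of the sensitivity threshold below are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def surprisal_filter(probs, threshold):
    """Keep only samples whose surprisal -log2 p(x) exceeds a threshold.

    `probs` holds model-assigned probabilities for each incoming sample;
    `threshold` is a hypothetical sensitivity parameter. Returns a boolean
    mask selecting the samples that carry enough information to keep.
    """
    surprisal = -np.log2(np.asarray(probs))
    return surprisal > threshold
```

A highly probable (unsurprising) sample is dropped; a low-probability (surprising) one is kept for training.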

Overview of the Technique

In essence, each dataset is filtered through our technique and used to train a classifier, from which we extract performance metrics.
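Under stated assumptions (a toy two-blob dataset, a hand-rolled Gaussian Naive Bayes as the classifier, and an illustrative quantile-based keep rule), the filter-train-evaluate pipeline might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary dataset: two Gaussian blobs (a stand-in for the real data).
X = np.concatenate([rng.normal(-2.0, 1.0, (500, 2)),
                    rng.normal(2.0, 1.0, (500, 2))])
y = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]

def fit_gaussian_nb(X, y):
    """Per-class means, variances and priors for a Gaussian Naive Bayes."""
    classes = np.unique(y)
    mu = np.array([X[y == c].mean(0) for c in classes])
    var = np.array([X[y == c].var(0) + 1e-9 for c in classes])
    prior = np.array([(y == c).mean() for c in classes])
    return classes, mu, var, prior

def log_joint(model, X):
    """log p(x, c) per class, under the naive feature-independence assumption."""
    _, mu, var, prior = model
    ll = -0.5 * (np.log(2 * np.pi * var)[None]
                 + (X[:, None, :] - mu[None]) ** 2 / var[None]).sum(-1)
    return ll + np.log(prior)

def predict(model, X):
    return model[0][log_joint(model, X).argmax(1)]

# Step 1: filter the data by surprisal under a model fit on a small seed set.
seed = fit_gaussian_nb(X[:100], y[:100])
marginal = np.exp(log_joint(seed, X)).sum(1)     # p(x) = sum_c p(x, c)
surprisal = -np.log2(marginal)
keep = surprisal > np.quantile(surprisal, 0.75)  # keep most surprising 25%

# Step 2: train the classifier on the reduced set.
model = fit_gaussian_nb(X[keep], y[keep])

# Step 3: extract performance metrics on the full set.
accuracy = (predict(model, X) == y).mean()
```

The quantile cutoff stands in for the sensitivity threshold; in the actual technique the threshold is a free parameter rather than a fixed fraction.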

With a preliminary version of the technique, we can already observe interesting results, using a binary classifier as a test-bench:

Very Preliminary Results

This work is under development, and a detailed report is currently being considered for publication.

Extra Results

The remainder of this page contains extra results to support the paper entitled “Surprisal-driven filtering of learning data”. Results are split by dataset.

Multi-Class Datasets

A number of multi-class datasets, ranging from 3 to 100 output classes, were generated to study the behaviour of the system when dealing with this type of data. We include the performance of the system for several of these datasets. All of these datasets were generated with 10,000 examples.

4 Output Classes

5 Output Classes

25 Output Classes

100 Output Classes

We can observe that, for all examples, the system achieved very significant dataset reductions, effectively reducing each dataset to the few samples needed to adjust the Gaussians just enough to ensure high accuracy. It is interesting to note that, as the number of output classes increases:

  • the accuracy drops become more sudden;
  • the optimal sensitivity threshold becomes closer to zero.
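The dependence on the sensitivity threshold can be made concrete by sweeping a value of tau and recording how much of the dataset survives the filter. The exponential surprisal distribution below is purely illustrative, not drawn from the actual experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic per-sample surprisal values; the exponential shape is only
# an illustration of a heavy concentration at low surprisal.
surprisal = rng.exponential(1.0, 10000)

fractions = {}
for tau in [0.5, 1.0, 2.0, 4.0]:
    # Fraction of the dataset kept at this sensitivity threshold.
    fractions[tau] = (surprisal > tau).mean()
```

Even modest thresholds discard the bulk of the data, which is why, as noted above, the useful operating range of the threshold shrinks toward zero as problems get harder.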

Multilabel Dataset

A multilabel dataset was generated, based on the technique described in this paper.

Number of Inputs         2
Number of Input States   1000
Number of Outputs        1
Number of Output States  64

Learning Curve

The learning curve shows that surprisal does not settle at low values, instead showing occasional spikes due to misclassified examples.
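This behaviour can be reproduced in miniature by tracking the surprisal of each incoming label under a running empirical distribution: a rare (or consistently misclassified) label shows up as a spike. The sketch uses a simple Laplace-smoothed categorical model, not the actual multilabel classifier:

```python
import numpy as np

def surprisal_curve(labels, n_states):
    """Surprisal of each incoming label under Laplace-smoothed running counts."""
    counts = np.ones(n_states)  # Laplace prior: one pseudo-count per state
    curve = []
    for lab in labels:
        p = counts[lab] / counts.sum()
        curve.append(-np.log2(p))   # surprisal of this label, in bits
        counts[lab] += 1            # update the model with the observation
    return np.array(curve)
```

Feeding it a run of one label followed by a rarely-seen one produces exactly the pattern in the figure: a decaying curve punctuated by a spike.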

Reduction Results

Despite the increased surprisal in the data, the dataset can be reduced to less than 1% of its initial size with little impact on classifier performance.

USPS Dataset

The USPS dataset was used for comparison with previous work. We implemented a Naïve Bayes classifier to work on this dataset.

Number of Inputs         128
Number of Input States   1000
Number of Outputs        1
Number of Output States  10

Learning Curve

We can observe that the learning curve stagnates after approximately 1000 samples, roughly 10% of the total size of the dataset.
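The stagnation point of such a curve can also be estimated programmatically, e.g. by finding where a moving average of the per-sample surprisal stays below a small threshold. The window size and threshold here are illustrative choices, not values from the experiments:

```python
import numpy as np

def stagnation_point(curve, window=100, eps=0.1):
    """First index after which the windowed mean surprisal stays below eps."""
    means = np.convolve(curve, np.ones(window) / window, mode="valid")
    below = means < eps
    for i in range(len(below)):
        if below[i:].all():   # every later window is also quiet
            return i
    return None               # the curve never stagnates
```

Applied to a curve like the one above, this would locate the ~1000-sample mark where additional data stops carrying new information.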

Reduction Results

We can observe that, even for low values of tau, the dataset can be drastically reduced. This figure shows that, at approximately tau = 73, the system starts discarding valuable examples and classifier performance drops.