Abstract
Traditional decision tree classifiers work with data whose values are known and precise. We extend such classifiers
to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process.
Example sources of uncertainty include measurement/quantization errors, data staleness, and multiple repeated
measurements. With uncertainty, the value of a data item is often represented not by one single value, but by multiple values
forming a probability distribution. Rather than abstracting uncertain data by statistical derivatives (such as mean and median),
we discover that the accuracy of a decision tree classifier can be much improved if the “complete information” of a data item
(taking into account the probability density function (pdf)) is utilized.
We extend classical decision tree building algorithms to handle data tuples with uncertain values. Extensive
experiments have been conducted that show that the resulting classifiers are more accurate than those using value averages.
Since processing pdf’s is computationally more costly than processing single values (e.g., averages), decision tree construction
on uncertain data is more CPU demanding than that for certain data. To tackle this problem, we propose a series of pruning
techniques that can greatly improve construction efficiency.