Abstract
The importance of data sets in data mining for knowledge discovery is universally recognized. Raw data sets, as they are available, are not directly amenable to data mining analysis. Vertical and horizontal aggregations have been used extensively to transform the original data set for this purpose. However, vertical aggregations contribute only minimally to the preparation of data sets for data mining analysis, whereas horizontally aggregated data sets are used extensively. Horizontal data sets are generated directly using simple yet powerful methods such as CASE, PIVOT, and SPJ. Basic SQL aggregation is limited to returning one column per aggregated group using group functions; these three methods overcome that limitation by generating aggregated columns in a horizontal tabular layout suitable for data mining analysis. Of the three methods, SPJ employs relational operators to realize horizontal aggregations, which is a better strategy than CASE and PIVOT. However, SPJ is weak in performance. To enhance the performance of the SPJ method, which at the same time improves horizontal aggregation performance, we propose a technique that improves the SPJ methodology using join enumeration strategies, which include a query tree generation with quantifiers algorithm. We then attempt a further improvement in horizontal aggregation performance using secondary indexes on common grouping columns. We found that these two changes significantly improved the performance of horizontal aggregations and produced an efficient data set in horizontal format.
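As a rough illustration of the difference between a vertical aggregation and the CASE and SPJ forms of a horizontal aggregation, the following minimal sketch uses Python's built-in sqlite3 module. The table `sales`, its columns, the sample rows, and the index name `idx_sales_store` are hypothetical and chosen only for this example. SQLite has no PIVOT operator, so only CASE and SPJ are shown. The index merely illustrates a secondary index on the grouping column; the join enumeration and query tree generation proposed here are not reproduced.

```python
import sqlite3

# Hypothetical fact table: one row per sale (store, day, amount).
ROWS = [
    (1, "Mon", 10.0),
    (1, "Mon", 5.0),
    (1, "Tue", 7.0),
    (2, "Mon", 3.0),
    (2, "Tue", 4.0),
    (2, "Tue", 6.0),
    (3, "Tue", 8.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store INTEGER, day TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", ROWS)

# Secondary index on the common grouping column (illustrative only).
conn.execute("CREATE INDEX idx_sales_store ON sales (store)")

# Vertical aggregation: one aggregated column, one row per (store, day).
VERTICAL = """
SELECT store, day, SUM(amount)
FROM sales
GROUP BY store, day
ORDER BY store, day
"""

# CASE method: one aggregated column per day value, one row per store.
CASE_QUERY = """
SELECT store,
       SUM(CASE WHEN day = 'Mon' THEN amount END) AS mon,
       SUM(CASE WHEN day = 'Tue' THEN amount END) AS tue
FROM sales
GROUP BY store
ORDER BY store
"""

# SPJ method: one aggregated sub-table per day value, combined with
# left outer joins on the grouping column.
SPJ_QUERY = """
SELECT f0.store, f1.total AS mon, f2.total AS tue
FROM (SELECT DISTINCT store FROM sales) AS f0
LEFT OUTER JOIN (
    SELECT store, SUM(amount) AS total
    FROM sales WHERE day = 'Mon' GROUP BY store
) AS f1 ON f0.store = f1.store
LEFT OUTER JOIN (
    SELECT store, SUM(amount) AS total
    FROM sales WHERE day = 'Tue' GROUP BY store
) AS f2 ON f0.store = f2.store
ORDER BY f0.store
"""

vertical_rows = conn.execute(VERTICAL).fetchall()
case_rows = conn.execute(CASE_QUERY).fetchall()
spj_rows = conn.execute(SPJ_QUERY).fetchall()

for row in case_rows:
    print(row)
```

In this sketch both horizontal queries return one row per store with one column per day, and a store with no sales on a given day gets NULL (`None` in Python) in that column, whereas the vertical query returns one row per store and day combination.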