Abstract
According to the Google index, the number of web pages now exceeds 50 billion and is increasing by millions per day. The global population of internet users is also growing rapidly, which gives rise to a web page classification problem. An automatic classification method is therefore required to deal with this problem on the World Wide Web (WWW). Traditional methods that use text to determine the class of a document often retrieve unrelated web pages. To classify web pages effectively, we apply different feature extraction techniques with different web page classification methods to find an efficient combination. The three feature extraction methods used in the study are Term Occurrence (TO), Term Frequency (TF), and Term Frequency-Inverse Document Frequency (TF-IDF), and the three classifiers are K-nearest neighbor (K-NN), Naive Bayes (NB), and Decision Tree (DT). Each web page is represented using each of the three feature extraction methods. Because the number of unique words in the collection is large, principal component analysis (PCA) is used to select the most relevant features for classification. The final output of the PCA is passed to the three classifiers to find the best method for web page classification. The experimental evaluation demonstrates that the combination of Naive Bayes (NB) and TF-IDF provides better classification accuracy than the other combinations.
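
The abstract names the three term-weighting schemes without defining them. As a minimal sketch, assuming the commonly used definitions (TO as the raw count of a term in a page, TF as that count divided by the page length, and TF-IDF as TF multiplied by log(N / df), where N is the number of pages and df is the number of pages containing the term), the three representations might be computed as follows. The function name `build_features` and the toy pages are hypothetical and are not taken from the study; the PCA and classifier stages are not shown.

```python
import math
from collections import Counter


def build_features(pages):
    """Build TO, TF and TF-IDF vectors for a list of tokenized pages.

    Assumed definitions (not confirmed by the abstract):
      TO     = raw count of the term in the page
      TF     = TO / number of tokens in the page
      TF-IDF = TF * log(N / df)
    """
    # Vocabulary: every unique term in the collection, in sorted order.
    vocab = sorted({term for page in pages for term in page})
    n_pages = len(pages)

    # Document frequency: number of pages that contain each term.
    df = {term: sum(1 for page in pages if term in page) for term in vocab}

    to_rows, tf_rows, tfidf_rows = [], [], []
    for page in pages:
        counts = Counter(page)
        length = len(page) or 1  # avoid division by zero for empty pages
        to_row = [counts[term] for term in vocab]
        tf_row = [count / length for count in to_row]
        tfidf_row = [
            tf * math.log(n_pages / df[term])
            for tf, term in zip(tf_row, vocab)
        ]
        to_rows.append(to_row)
        tf_rows.append(tf_row)
        tfidf_rows.append(tfidf_row)

    return vocab, to_rows, tf_rows, tfidf_rows


# Hypothetical toy collection of two already-tokenized pages.
toy_pages = [["web", "page", "web"], ["page", "rank"]]
vocab, to_rows, tf_rows, tfidf_rows = build_features(toy_pages)
```

Under these assumed definitions, a term that appears in every page (here "page") receives a TF-IDF weight of zero, which illustrates why TF-IDF can separate classes better than raw counts when many terms are shared across the whole collection.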