Abstract
In recent years the WWW has become one of the most important distribution channels for private, scientific and business information. The reasons for this development are the relatively low cost of publishing a website and the ability to present an up-to-date view of a business to millions of users. As a result, the WWW has grown tremendously over the last five years. Google recently reported that it is currently indexing over 7 billion text documents, and the number of registered international top-level domains has increased more than ninefold over the last five years.
The main aim of this paper is to achieve effective and efficient retrieval of required documents from the pages of various websites. Here, effectiveness and efficiency are measured in terms of relevancy and similarity. To improve relevance when retrieving the required documents, KDD techniques can be used to extract specific knowledge from the WWW. To achieve relevancy and similarity, several classification methods have been applied, and we have analyzed various data mining classification methods to evaluate their performance. Precision and recall have been used to measure performance, to calculate accuracy, and to count the number of documents retrieved during a particular period of time. The level of accuracy plays an important role in identifying the level of relevancy and similarity. For this purpose we have defined a threshold for comparison: if the evaluated accuracy is greater than the threshold, the similarity level is high; otherwise it is low. Likewise, if the number of relevant documents retrieved during a particular period of time is large, the level of efficiency is high; otherwise it is low. For testing, we have taken 30,000 single HTML documents from 300 websites, and we compare four existing classification techniques against the newly developed classifier to evaluate its efficiency.
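The evaluation described above can be sketched as follows. This is a minimal illustration of how precision, recall, and the accuracy-versus-threshold comparison might be computed; the function names and the threshold value are illustrative assumptions, not the paper's actual implementation.

```python
def precision(relevant_retrieved, total_retrieved):
    """Fraction of retrieved documents that are relevant."""
    return relevant_retrieved / total_retrieved if total_retrieved else 0.0

def recall(relevant_retrieved, total_relevant):
    """Fraction of all relevant documents that were retrieved."""
    return relevant_retrieved / total_relevant if total_relevant else 0.0

def similarity_level(accuracy, threshold=0.75):
    """Compare evaluated accuracy against a fixed threshold
    (the threshold value here is a hypothetical choice)."""
    return "high" if accuracy > threshold else "low"

# Example: 80 relevant documents among 100 retrieved,
# out of 120 relevant documents in the collection.
p = precision(80, 100)   # 0.8
r = recall(80, 120)      # about 0.667
print(p, round(r, 3), similarity_level(p))
```

In this sketch, a classifier whose accuracy exceeds the threshold would be reported as having a high similarity level, matching the decision rule stated in the abstract.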