Abstract
The Internet is used more widely than ever, and data of all types are easily available on it. A user submits a query to a search engine, and thousands of related documents are returned as a result. Web documents contain different types of data, such as text, images, and videos, so they are unstructured and unorganized. It becomes very difficult for users to find a specific document among thousands. The solution to this problem is to cluster the web documents. Clustering groups together the documents that share a similar context with respect to the user query, so that similar documents are assembled in one cluster. Clustering thus reduces the user's task of discriminating among the thousands of documents returned for a query. Ranking can then be applied to bring the most relevant documents to the top: the documents in a cluster are ranked and arranged according to their similarity. Different functions can be used to calculate the similarity measure among the documents. We combine these two concepts and propose a tf-idf based apriori scheme for web document clustering and ranking. In this scheme, clustering is first applied to the documents using a modified tf-idf based apriori algorithm. Ranking is then performed to arrange the most pertinent documents at the top with regard to the user query. We use online web pages returned as results for a query as the dataset for our experimental work. The approach achieves an F-measure of 81%, and the proposed method is found to be superior to some traditional clustering approaches.
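As a minimal illustration of the tf-idf weighting and similarity-based ranking referred to above, the following Python sketch scores a small set of documents against a query. It is not the modified tf-idf based apriori algorithm proposed here, and the clustering step is omitted entirely. The function names, the whitespace tokenizer, the smoothed idf term, and the choice of cosine similarity as the similarity function are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one sparse tf-idf vector (dict) per document, plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    # Assumption: idf = ln(N / df) + 1, so terms found in every document keep weight.
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    vectors, idf = tf_idf_vectors(docs)
    q_counts = Counter(tokenize(query))
    q_total = sum(q_counts.values()) or 1
    # Query terms that appear in no document are ignored.
    q_vec = {t: (c / q_total) * idf[t] for t, c in q_counts.items() if t in idf}
    scored = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    # Sort by descending score; ties are broken by original position.
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


if __name__ == "__main__":
    pages = [
        "web document clustering with apriori",
        "cooking pasta at home",
        "ranking web pages for a query",
    ]
    print(rank_documents("web clustering", pages))
```

In the scheme described above, a ranking of this kind would be applied to the documents within a cluster rather than to the whole result set; any other similarity function could be substituted for `cosine`.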