Abstract
Clustering techniques are used to automatically organize or summarize large collections of text, and many approaches to clustering have been proposed. For the purposes of this work, we are particularly interested in two of them: coclustering and constrained clustering. This thesis proposes a novel constrained coclustering method to achieve two goals. First, it combines information-theoretic coclustering and constrained clustering to improve clustering performance. Second, it adopts both supervised and unsupervised constraints to demonstrate the effectiveness of the algorithm.
The unsupervised constraints are automatically derived from existing knowledge sources, thus saving the effort and cost of manually labeling constraints. To achieve the first goal, we develop a two-sided hidden Markov random field (HMRF) model to represent both document and word constraints, and then use an alternating expectation maximization (EM) algorithm to optimize the model. To support unsupervised constrained clustering, we also propose two novel methods that automatically construct and incorporate constraints: 1) automatic construction of document constraints and 2) automatic construction of word constraints. The evaluation results demonstrate the superiority of our approach over a number of existing approaches. Unlike existing approaches, this thesis applies stop word removal, stemming, and synonym replacement to capture semantic similarity between words in the documents. In addition, content can be retrieved from text files, HTML pages, and XML pages. Tags are eliminated from HTML files. In XML files, attribute names and values are treated as ordinary paragraph words, and the same preprocessing (stop word removal, stemming, and synonym replacement) is then applied.
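The preprocessing described above can be illustrated with a minimal Python sketch. It is not the implementation used in this thesis: the stop word list, the synonym table, and the suffix-stripping `toy_stem` function are small hypothetical stand-ins for a full stop word list, a real stemmer, and a thesaurus, and all function names are assumptions made for illustration. The sketch only shows the order of steps: extract text from HTML or XML, remove stop words, stem, and replace synonyms.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}
# Keys are stemmed forms, because synonyms are looked up after stemming.
SYNONYMS = {"automobile": "car", "vehicle": "car"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def xml_to_text(xml):
    """Return element text plus attribute names and values as plain words."""
    parts = []
    for elem in ET.fromstring(xml).iter():
        for name, value in elem.attrib.items():
            parts.extend([name, value])
        if elem.text:
            parts.append(elem.text)
        if elem.tail:
            parts.append(elem.tail)
    return " ".join(parts)


def toy_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Stop word removal, then stemming, then synonym replacement."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [toy_stem(t) for t in tokens]
    return [SYNONYMS.get(t, t) for t in tokens]
```

For example, `preprocess(html_to_text("<p>Fast <b>vehicles</b></p>"))` and `preprocess(xml_to_text('<doc lang="en"><title>Clustering texts</title></doc>'))` both yield plain token lists, so documents from all three input types can be fed to the same clustering step.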