Abstract
Sentiment analysis involves building a system to collect and examine opinions about a product expressed in blog posts, comments, reviews, or tweets. Automatic sentiment classification is important for applications such as opinion mining, opinion summarization, contextual advertising, and market analysis. Sentiment is expressed differently in different domains, and annotating data for each new domain is costly. In cross-domain sentiment classification, the features (words) that appear in the source domain do not always appear in the target domain. A classifier trained on one domain may therefore perform poorly on another, because it never learns the sentiment of the unseen words. One solution is to use a thesaurus that groups different words expressing the same sentiment, and to apply it for feature expansion: each feature vector is augmented with additional related features to reduce the mismatch between domains. The proposed method creates a thesaurus that is sensitive to the sentiment of words expressed in different domains. It uses both labeled and unlabeled data from the source domains, together with unlabeled data from the target domain. The relatedness measure used to build the thesaurus is computed from pointwise mutual information. Because pointwise mutual information is biased towards infrequent elements and features, it is multiplied by a discounting factor to correct for this bias. The method then uses the thesaurus to expand feature vectors. Using these expanded vectors, a binary classifier based on Lasso-regularized (L1) logistic regression is trained to classify the sentiment of reviews in the target domain. The method achieves higher prediction accuracy than an existing cross-domain sentiment classification system.
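As an illustration of the discounting step, the following is a minimal sketch of a discounted pointwise mutual information score computed from co-occurrence counts. The abstract does not state the exact form of the discounting factor, so the one used here (a product of two count-based ratios that approach 1 as counts grow) is an assumption, and the function name `discounted_pmi` and its input format are hypothetical.

```python
import math
from collections import Counter


def discounted_pmi(pairs):
    """Return a discounted PMI score for each (element, feature) pair.

    `pairs` is an iterable of (element, feature) co-occurrence observations.
    The discount below is an assumed form, not taken from the source.
    """
    pair_counts = Counter(pairs)
    elem_counts = Counter()
    feat_counts = Counter()
    for (u, w), c in pair_counts.items():
        elem_counts[u] += c
        feat_counts[w] += c
    total = sum(pair_counts.values())

    scores = {}
    for (u, w), c in pair_counts.items():
        # Plain PMI: log of joint probability over product of marginals.
        p_joint = c / total
        p_u = elem_counts[u] / total
        p_w = feat_counts[w] / total
        pmi = math.log(p_joint / (p_u * p_w))

        # Assumed discount: close to 1 for frequent pairs, small for rare ones.
        m = min(elem_counts[u], feat_counts[w])
        discount = (c / (c + 1)) * (m / (m + 1))

        scores[(u, w)] = pmi * discount
    return scores


# Toy data: one frequent pair and one pair observed only once.
observations = [("a", "x")] * 9 + [("b", "y")]
scores = discounted_pmi(observations)
```

In this toy data, the pair seen once has its raw score scaled by 0.25, while the pair seen nine times is scaled by 0.81, so a single chance co-occurrence contributes far less to the relatedness measure than it would under plain pointwise mutual information.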