Abstract
Big Data refers to data sets whose size ranges beyond exabytes (10^18 bytes). This growth is driven by the increasing capacity of online storage and processing systems. The evolution of web technology has made a huge amount of data available on the web to internet users. These data exist in the form of text, audio, video, images, etc. Text mining, also called knowledge discovery in text (KDT), is the process of extracting knowledge from the large amount of data available in the form of text. The rapid growth of internet usage in social media generates an ever larger volume of data. It has been estimated that 80% of business information is available in the form of unstructured and semi-structured text data. K-means is the most common clustering algorithm used to cluster similar content in a Hadoop environment. Because it is based on a “bag of words” representation, it does not use lexical semantic analysis. Our proposed RCDC algorithm is based on latent semantic analysis, which is more efficient, scalable and accurate.
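
The bag-of-words limitation mentioned above can be illustrated with a minimal Python sketch. It is neither the RCDC algorithm nor a Hadoop K-means implementation. The helper names (`bag_of_words`, `cosine`) and the toy documents are assumptions introduced only for illustration. The sketch shows that two texts with related meaning but no shared terms receive a similarity of zero under a pure term-count representation.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a text to term counts, ignoring word order and word meaning."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    shared = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy documents (not taken from the paper's data).
doc_a = bag_of_words("car engine repair")
doc_b = bag_of_words("automobile motor maintenance")  # related meaning, no shared terms
doc_c = bag_of_words("car engine noise")              # shares two terms with doc_a

sim_synonyms = cosine(doc_a, doc_b)
sim_overlap = cosine(doc_a, doc_c)

print(f"related meaning, no shared terms: {sim_synonyms:.3f}")
print(f"two shared terms:                 {sim_overlap:.3f}")
```

A distance-based clusterer such as K-means working on these raw term counts has no evidence that `doc_a` and `doc_b` are related. This is the gap that a latent semantic representation is intended to address.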