Abstract
Today, large amounts of data are generated every day from varied and heterogeneous sources. About 300 million users post images, messages, and other types of data on Facebook per day, and Google processes 3.5 billion searches per day. Traditional systems that handle data in megabytes and gigabytes cannot efficiently handle such high-volume, distributed data from heterogeneous sources. Data that is huge in volume, complex, distributed, hard to manage, and heterogeneous is called Big Data. In 2004, Google proposed MapReduce, a parallel processing model. When a user submits a query, the MapReduce model splits it and assigns the parts to parallel nodes, which process them concurrently. The results computed by all the nodes are then collected and delivered to the user. The open-source Apache framework “Hadoop” adopted the MapReduce model. This paper presents a system that applies collaborative filtering to big data that has been clustered. This way of mining and managing big data is more efficient than traditional systems. The main challenges of big data are storage, search, manipulation, and security. An efficient system should be able to overcome all of these challenges, and these are the parameters on which this paper focuses. The K-means algorithm is widely used for clustering by big data practitioners. Clustering divides big data into clusters, so that data with the same characteristics falls in one cluster. Clustering increases accuracy and reduces the time needed to compute results. Clustering techniques can reduce data size by a large factor by grouping similar services together. The proposed approach consists of two stages: clustering and collaborative filtering. Clustering is a pre-processing step that separates Big Data into manageable parts.
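The two stages can be illustrated with a minimal single-machine sketch in Python. It is not the system described in this paper: the service names, feature vectors, and ratings are invented for illustration, the deterministic centroid initialization is a simplification, and the use of cosine similarity with a similarity-weighted average is one common item-based collaborative filtering choice assumed here. The sketch also runs in one process rather than on Hadoop.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-means; returns one cluster index per point."""
    # Assumption: the first k points serve as initial centroids (deterministic).
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cosine(a, b):
    """Cosine similarity over the users who rated both services."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = math.sqrt(sum(a[u] ** 2 for u in common))
    norm_b = math.sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict(user, target, ratings, labels, names):
    """Similarity-weighted average of the user's ratings, restricted to
    services in the same cluster as the target service."""
    cluster = labels[names.index(target)]
    num = den = 0.0
    for name, label in zip(names, labels):
        if name == target or label != cluster or user not in ratings[name]:
            continue
        sim = cosine(ratings[target], ratings[name])
        num += sim * ratings[name][user]
        den += abs(sim)
    return num / den if den else None

# Hypothetical services, described by two made-up numeric features each.
names = ["maps", "routes", "music", "radio"]
features = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
# Hypothetical ratings: service -> {user: rating}.
ratings = {
    "maps":   {"ann": 5, "bob": 4},
    "routes": {"ann": 4, "bob": 5, "cid": 4},
    "music":  {"ann": 1, "cid": 5},
    "radio":  {"bob": 2, "cid": 4},
}

labels = kmeans(features, k=2)                      # stage 1: clustering
score = predict("cid", "maps", ratings, labels, names)  # stage 2: filtering
print(labels, score)
```

Because the prediction for a service only consults the other services in its cluster, the similarity computation touches a fraction of the full rating data, which is the size reduction that the clustering stage is meant to provide.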