Abstract
Big data is prevalent in both industry and scientific research applications, where data is generated at such high volume and velocity that it is difficult to process using on-hand database management tools or traditional data processing applications. Some techniques, such as Cloud Analytics, have been developed in recent years for processing large object data in the cloud. However, these techniques do not provide efficient support for parallel processing and cluster technology. Big data platforms often need to support emerging data sources and applications while accommodating existing ones. Since different data and applications have varying requirements, multiple types of data stores (e.g., file-based and object-based) frequently co-exist in the same solution today without proper integration. Hence cross-store data access, which is key to effective data analytics, cannot be achieved without laborious application reprogramming, prohibitively expensive data migration, and/or costly maintenance of multiple data copies. We address this vital issue by introducing a first unified big data platform over heterogeneous storage. In particular, we present a prototype that joins Apache Hadoop MapReduce and Flume technology. A retail data analysis application using data from a real Twitter application is employed to test and showcase our prototype. We have found that our prototype achieves 50% data capacity savings, eliminates data migration overhead, and offers stronger reliability and enterprise support. Through our case study, we have learned important theoretical lessons concerning performance and reliability, as well as practical ones related to platform configuration. We have also identified several potentially high-impact research directions.