Abstract
The increasing demand for large-scale data mining and data analysis applications drives both industry and academia to create new types of highly scalable data-intensive computing platforms. MapReduce is one of the most popular platforms in which the dataflow is in the form of a directed acyclic graph of operators. This paper presents a modified version of the MapReduce framework that is developed to manage unstructured and semi-structured data. Most database systems are designed to manage well-structured data and require users to design a schema before storing and querying data. However, a significant amount of unstructured and semi-structured data cannot be managed effectively this way. In this paper, we develop the engineering principles and practices to manage unstructured and semi-structured data in a MapReduce platform. Having a single data platform for managing well-structured, unstructured, and semi-structured data is beneficial to users: this approach significantly reduces integration, migration, development, maintenance, and operational issues. In the Hadoop environment, SQL/XML schemas are written first, and all commands are then translated into MapReduce jobs on Hadoop. The efficiency of using this method in MapReduce software is discussed and evaluated.