Abstract
Enterprises collect huge volumes of data from different sources such as web logs, clickstreams, social media, and sensors. This data includes both numeric and textual values and needs to be converted into a format (mostly numeric) that data mining and machine learning algorithms accept. Thus, we need an efficient, scalable approach that enables not only simple conversion but also transformations such as normalization. In this paper, we present the IGATE In-Place Transformation Engine (IPTE), which can be used for conversion and transformation operations. We present a comparative evaluation of commercially available ETL tools and justify the need for a custom tool (IPTE). Three distinct flavours of the IPTE implementation are described: (a) stand-alone, using Java multi-threading; (b) distributed, using Hadoop APIs; and (c) in-memory, using Apache Spark. By comparing the performance of the transformation process on different data sizes across these three flavours, we make specific recommendations on when each should be used. Some use cases where IPTE has been used effectively are also presented.