Abstract
The proliferation of big data and the widespread adoption of cloud computing have significantly transformed how organizations handle data ingestion, transformation, and analysis. In this evolving digital landscape, the optimization of data pipelines has become a cornerstone of operational efficiency and strategic decision-making. At the heart of this process lies the Extract, Transform, Load (ETL) mechanism, which plays a critical role in ensuring that data is processed and made analytics-ready in a timely, scalable, and cost-effective manner.
This paper conducts an in-depth comparative study of five modern ETL tools—Apache NiFi, Talend Data Integration, AWS Glue, Google Cloud Dataflow, and Azure Data Factory—with a focus on their performance within cloud-based big data ecosystems. The study evaluates each tool using six core metrics: latency, scalability, integration capabilities, streaming support, ease of use, and cost efficiency. By leveraging a combination of academic literature review, technical documentation, and industry benchmarks, the paper synthesizes both theoretical insights and practical findings.
The analysis is supported by tables and graphs that compare latency and cost per data volume, giving a transparent, data-driven view of each tool's suitability. The results indicate that AWS Glue and Google Cloud Dataflow outperform the other tools studied in latency and scalability, while open-source alternatives such as Apache NiFi offer the greatest flexibility and cost benefits for organizations seeking vendor-neutral solutions.
This study aims to guide data architects, engineers, and decision-makers in selecting the most appropriate ETL solution based on their cloud environment, data workload characteristics, and business priorities. The conclusions drawn underscore the importance of aligning ETL tool selection with the strategic goals of digital transformation, operational efficiency, and long-term scalability. Furthermore, the paper recommends future exploration into AI-enhanced ETL pipelines, containerized orchestration, and real-time observability as emerging frontiers in data engineering.