Real-Time Fraud ML on Spark Structured Streaming: Micro-Batch vs. Continuous Processing

Authors

DOI:

https://doi.org/10.58425/ajt.v4i4.465

Keywords:

Real-time fraud detection, Apache Spark Structured Streaming, micro-batch vs. continuous processing, machine learning inference, scalability, latency

Abstract

Aim: This study evaluates machine-learning-based real-time fraud detection on Apache Spark Structured Streaming, with specific emphasis on comparing the Micro-Batch and Continuous Processing modes.

Methods: The study develops and evaluates a real-time fraud detection pipeline trained on the IEEE-CIS Fraud Detection dataset. The methodology combines feature engineering, supervised machine learning models, and stream processing to support near-real-time fraud classification. The pipeline is implemented on Apache Spark Structured Streaming in both Micro-Batch and Continuous Processing modes. Experiments compare the two modes on latency, precision, recall, resource utilization, and system reliability.
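As a rough illustration of how three of the metrics named above can be computed, the following plain-Python sketch derives precision and recall for the fraud class and a nearest-rank latency percentile. The function names and sample values are hypothetical and are not taken from the study.

```python
import math
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for the positive (fraud = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate but flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def latency_percentile(latencies_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of per-event end-to-end latencies."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical labels, predictions, and latencies, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
p95_ms = latency_percentile([12.0, 15.0, 9.0, 40.0, 11.0], 95)
```

Tail percentiles are used here instead of the mean because alerting deadlines depend on the slowest events; this is an illustrative choice, not necessarily the one made in the study.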

Results: Micro-Batch mode provides strong analytical capabilities, robust fault tolerance, and support for complex transformations, at the cost of slightly higher processing latency. Continuous Processing mode delivers significantly lower latency and higher throughput, making it suitable for environments that require rapid fraud alerting, but it is limited in the transformations it supports and in its recovery mechanisms. Neither mode consistently outperforms the other across all evaluation criteria.
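In PySpark, the two modes compared here are selected through the trigger on the streaming writer. The sketch below shows one way to switch between them. The helper names, the console sink, and the default interval are assumptions made for illustration, and `scored_df` stands for a streaming DataFrame that already carries model predictions. None of this is taken from the study's implementation.

```python
from typing import Dict


def trigger_kwargs(mode: str, interval: str = "1 second") -> Dict[str, str]:
    """Map a processing mode to keyword arguments for DataStreamWriter.trigger()."""
    if mode == "micro-batch":
        # Interval between micro-batches.
        return {"processingTime": interval}
    if mode == "continuous":
        # In continuous mode the interval is the checkpoint interval.
        return {"continuous": interval}
    raise ValueError(f"unknown mode: {mode!r}")


def start_scoring_query(scored_df, mode: str, checkpoint_dir: str):
    """Start a streaming query in the requested mode.

    scored_df is assumed to be a PySpark streaming DataFrame built from a
    source and operations that the chosen mode supports.
    """
    return (
        scored_df.writeStream
        .format("console")  # hypothetical sink, chosen only for illustration
        .option("checkpointLocation", checkpoint_dir)
        .trigger(**trigger_kwargs(mode))
        .start()
    )
```

The Spark documentation describes continuous processing as experimental, restricted to map-like operations such as projections and selections, and providing at-least-once guarantees, which is consistent with the limitations reported above.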

Conclusion: The choice between Micro-Batch and Continuous Processing modes in Apache Spark Structured Streaming should be driven by specific application requirements rather than a general preference for one mode. Each mode presents distinct tradeoffs among latency, analytical flexibility, and reliability in real-time fraud detection systems.

Recommendations: The study recommends a use-case-driven approach to selecting Spark processing modes for fraud detection applications. Future research should explore hybrid architectures that combine the strengths of both modes, as well as advanced techniques such as active learning and graph-based machine learning, to improve adaptability and accuracy in real-time fraud mitigation systems.

Published

2025-12-27

How to Cite

Vadgama, B. (2025). Real-Time Fraud ML on Spark Structured Streaming: Micro-Batch vs. Continuous Processing. American Journal of Technology, 4(4), 60–86. https://doi.org/10.58425/ajt.v4i4.465