Cache-based global memory orchestration for data-intensive stream processing pipelines

Bibliographic details
Year of defense: 2022
Main author: Matteussi, Kassiano José
Advisor: Geyer, Claudio Fernando Resin
Defense committee: Not informed by the institution
Document type: Doctoral thesis
Access type: Open access
Language: eng
Defending institution: Not informed by the institution
Graduate program: Not informed by the institution
Department: Not informed by the institution
Country: Not informed by the institution
Access link: http://hdl.handle.net/10183/259649
Abstract: A significant rise in the adoption of streaming applications has changed decision-making processes over the last decade. This movement led to the emergence of several big data, in-memory, data-intensive processing technologies, such as Apache Storm, Spark, Heron, Samza, and Flink, across varied areas and domains such as financial services, healthcare, education, manufacturing, retail, social media, and sensor networks, among others. These streaming systems rely on the Java Virtual Machine (JVM) as their underlying processing environment for platform independence. Although it provides a high-level hardware abstraction, the JVM cannot efficiently manage data-intensive applications that cache data heavily in the JVM heap. Consequently, this may lead to data loss, throughput degradation, and high latency due to several processing overheads induced by data deserialization, object scattering in main memory, garbage collection (GC) operations, and others. The state of the art reinforces that efficient memory management plays a prominent role in real-time data analysis, since it represents a critical aspect of stream processing performance. Existing solutions have provided strategies for optimizing the shuffle-driven eviction process, job-level caching, and GC performance-based cache allocation models on top of Apache Spark and Flink. However, these studies do not present mechanisms for controlling the JVM state, relying instead on solutions that are unaware of streaming systems' processing and storage utilization. This thesis tackles this issue by considering the impact of overall JVM utilization for processing and storage operations, using a cache-based global memory orchestration model with well-defined memory utilization policies. It aims to improve the memory management of data-intensive stream processing pipelines, avoid memory-based performance issues, and keep application throughput stable. The proposed evaluation comprises real experiments on small and medium-sized data center infrastructures with fast network switches, provided by the French Grid'5000 testbed. The experiments use Spark Streaming and real-world streaming applications with representative in-memory execution and storage utilization (e.g., data cache operations, stateful processing, and checkpointing). The results reveal that the proposed solution kept throughput stable at a high rate (e.g., ~1 GB/s for small and medium-sized clusters) and may reduce global JVM heap memory utilization by up to 50% in the evaluated cases.
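
The abstract does not detail the orchestration model itself, so the following is only a hypothetical sketch of what a threshold-based memory utilization policy could look like: the cache's share of the heap shrinks when overall JVM utilization crosses a high-water mark and grows back once pressure subsides. Every name and threshold here (Policy, highWaterMark, cacheBudget) is an illustrative assumption, not the thesis's actual design.

```scala
// Hypothetical sketch of a threshold-based memory utilization policy.
// Written only to make the abstract's idea concrete; the thesis's real
// orchestration model and parameter names are not given in the abstract.
object MemoryPolicySketch {
  // Assumed knobs (not from the thesis): shrink the cache share above a
  // high-water mark of heap utilization, let it grow below a low-water mark.
  final case class Policy(highWaterMark: Double = 0.75, lowWaterMark: Double = 0.50)

  /** Decide what fraction of the heap the cache may use next, given the
    * current overall heap utilization and the cache's current share. */
  def cacheBudget(heapUsedFraction: Double, cacheFraction: Double, p: Policy): Double =
    if (heapUsedFraction >= p.highWaterMark)
      // Processing is under pressure: shrink the cache share to avoid GC storms.
      math.max(0.0, cacheFraction - (heapUsedFraction - p.highWaterMark))
    else if (heapUsedFraction <= p.lowWaterMark)
      // Plenty of headroom: allow the cache to grow back gradually.
      math.min(1.0 - heapUsedFraction, cacheFraction + 0.05)
    else
      // Steady state: keep the current allocation.
      cacheFraction
}
```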
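For concreteness, here is a minimal sketch of the kind of Spark Streaming workload the evaluation describes, exercising the three memory-intensive behaviors named above: data cache operations, stateful processing, and checkpointing. The host, port, checkpoint path, and application name are hypothetical placeholders; this is not the thesis's benchmark code.

```scala
// Illustrative sketch only: a minimal Spark Streaming job that exercises the
// heap-intensive behaviors the abstract names (caching, state, checkpointing).
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulWordCount")
    val ssc  = new StreamingContext(conf, Seconds(1))
    // Checkpointing persists state snapshots, adding storage pressure. (Path is hypothetical.)
    ssc.checkpoint("hdfs:///tmp/checkpoints")

    // Ingest text lines; received blocks are themselves stored in executor memory.
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split("\\s+")).map((_, 1L))
    words.cache() // explicit data cache operation in the JVM heap

    // Stateful processing: a running count per word, kept in executor memory.
    val spec = StateSpec.function(
      (word: String, one: Option[Long], state: State[Long]) => {
        val sum = one.getOrElse(0L) + state.getOption.getOrElse(0L)
        state.update(sum)
        (word, sum)
      })
    words.mapWithState(spec).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each of these operations competes for the same JVM heap: cached blocks, per-key state, and checkpoint buffers all add storage pressure on top of the processing working set, which is the contention the abstract's global memory orchestration model targets.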