Cloud Data Engineering Patterns Evolve

Cloud data engineering patterns have changed dramatically over the past decade. Batch ETL pipelines are no longer the default for data processing; streaming architectures and hybrid batch-streaming systems now carry much of the load. The cloud-native stack is settling into a recognizable shape, with growing agreement about which components work well together.

I once spent a 3 a.m. on-call shift recovering a corrupted bronze table that had been overwritten by a faulty ingestion job. With Delta Lake's time travel, I was able to roll back to a snapshot from 12 hours earlier in just a few seconds. The transaction log kept a record of every write, and the ACID guarantees meant the rollback did not leave orphaned files or corrupt metadata. Without that feature, I would have had to rebuild the entire 2 TB bronze layer from raw logs, a task that would have taken days and risked further data loss.

When deciding between Spark Structured Streaming and Flink for a real-time fraud detection pipeline, we weighed throughput against latency. In our setup, Spark ingested 50,000 events per second from Kafka but added a 5–10 second micro‑batch delay, while Flink processes events one at a time, with latencies typically in the millisecond range, at the cost of more complex state management. In a 2020 production run, we hit a checkpointing issue on Spark that caused a 15-minute outage: the checkpoint directory had filled up because we had not configured a retention policy, and the job could not recover. Switching to Flink resolved the latency problem, but we had to build a custom state snapshot mechanism to keep the state size below 4 GB.
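The latency side of that trade-off can be sketched with a toy model. The code below is a simplification, not Spark's scheduler: it assumes triggers fire at fixed multiples of a hypothetical `trigger_interval_ms` and ignores processing time, so it only captures how long an event waits for the next micro-batch to start.

```python
def micro_batch_wait_ms(arrival_times_ms, trigger_interval_ms):
    """Return how long each event waits for the next micro-batch trigger.

    Toy model: triggers fire at 0, T, 2T, ... and processing time is ignored.
    """
    waits = []
    for arrival in arrival_times_ms:
        # Round the arrival time up to the next multiple of the interval.
        # An event that arrives exactly on a trigger is picked up at once.
        next_trigger = -(-arrival // trigger_interval_ms) * trigger_interval_ms
        waits.append(next_trigger - arrival)
    return waits


# Hypothetical arrival times in milliseconds, with a 5-second trigger.
arrivals = [500, 2000, 4900, 5000, 7500]
waits = micro_batch_wait_ms(arrivals, trigger_interval_ms=5000)
print(waits)                    # [4500, 3000, 100, 0, 2500]
print(sum(waits) / len(waits))  # 2020.0
```

In this model the worst-case wait approaches one full trigger interval, which is why shortening the interval (or moving to per-event processing) is the main lever on latency.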

Event‑time windows in Structured Streaming can be tricky. In one deployment, we processed 10,000 events per second from a Kafka topic that carried timestamps from IoT sensors. We had set a 30‑minute watermark, but after a burst of events delayed by 5 minutes our 1‑hour aggregates were off by 12%. By tightening the watermark to 1 minute and routing late events to a separate output instead of dropping them, we reduced the error to less than 1% and kept the processing latency under 2 seconds. Structured Streaming has no built-in side output for late records, so that separate path had to be built alongside the main query. This tweak required a deeper understanding of the data’s arrival patterns and a willingness to trade a small amount of latency for accuracy.
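How a watermark decides which events count as late can be illustrated with a small simulation. This is a simplified per-event model under stated assumptions, not Spark's implementation: Spark advances the watermark between micro-batches and silently drops records that fall behind it, whereas this sketch updates the watermark after every event and collects the late ones in a list. The function name, the event tuples, and the parameter names are all hypothetical.

```python
from collections import defaultdict


def aggregate_with_watermark(events, window_size, allowed_lateness):
    """Count events per tumbling event-time window.

    events: (event_time_seconds, sensor_id) tuples in arrival order.
    Returns (counts keyed by window start, list of events judged too late).
    """
    counts = defaultdict(int)
    late = []
    max_seen = float("-inf")  # highest event time observed so far
    for event_time, sensor_id in events:
        window_start = (event_time // window_size) * window_size
        window_end = window_start + window_size
        watermark = max_seen - allowed_lateness
        if window_end <= watermark:
            # The window was already finalized: route the event aside.
            late.append((event_time, sensor_id))
        else:
            counts[window_start] += 1
        max_seen = max(max_seen, event_time)
    return dict(counts), late


# Two stragglers ("d" and "e") arrive after newer events.
EVENTS = [(3500, "a"), (3650, "b"), (3700, "c"), (3590, "d"), (3400, "e")]

strict_counts, strict_late = aggregate_with_watermark(EVENTS, 3600, 60)
loose_counts, loose_late = aggregate_with_watermark(EVENTS, 3600, 1800)
print(strict_counts, strict_late)  # {0: 1, 3600: 2} [(3590, 'd'), (3400, 'e')]
print(loose_counts, loose_late)    # {0: 3, 3600: 2} []
```

With the 60-second allowance both stragglers land in the late list and have to be handled separately; with the 1800-second allowance they are counted in the first window, at the cost of keeping that window open (and its state in memory) for longer.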

By around 2020, the medallion architecture had become a standard data lake pattern. It consists of three layers: bronze, silver, and gold. The bronze layer stores raw data from source systems without transformation or curation. The silver layer cleans, normalizes, and validates data with schema enforcement. The gold layer contains aggregated, business‑aligned datasets ready for analytics and machine learning. This layered approach provides a clear lineage from raw data to analytical output, and each layer can be rebuilt from the layer below it.
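The three layers can be sketched in a few lines of plain Python. This is a minimal illustration with made-up records and hypothetical function names (`to_silver`, `to_gold`); in practice each layer would be a table written by Spark rather than an in-memory list.

```python
from collections import defaultdict

# Bronze: raw records exactly as they arrived, including bad ones.
BRONZE = [
    {"user": " Alice ", "amount": "10.50", "country": "de"},
    {"user": "bob", "amount": "n/a", "country": "US"},   # unparseable amount
    {"user": "alice", "amount": "4.50", "country": "DE"},
    {"user": "", "amount": "3.00", "country": "US"},     # missing user
    {"user": "Carol", "amount": "7", "country": "us"},
]


def to_silver(bronze_rows):
    """Clean, normalize, and validate; rows that fail validation are skipped."""
    silver = []
    for row in bronze_rows:
        user = row.get("user", "").strip().lower()
        try:
            amount = float(row["amount"])
        except (KeyError, ValueError):
            continue  # enforce the schema: amount must be numeric
        if not user:
            continue  # enforce the schema: user is required
        silver.append({
            "user": user,
            "amount": amount,
            "country": row.get("country", "").upper(),
        })
    return silver


def to_gold(silver_rows):
    """Aggregate to a business-level view: total amount per country."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(BRONZE)
gold = to_gold(silver)
print(len(silver))  # 3
print(gold)         # {'DE': 15.0, 'US': 7.0}
```

Because `to_silver` and `to_gold` are pure functions of the layer below, either layer can be thrown away and rebuilt from bronze, which is the property the pattern relies on.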

Delta Lake, developed by Databricks and later open‑sourced, brings ACID transactions, schema enforcement, and time travel to data lakes stored as Parquet files on Azure Data Lake Storage or S3. Without a table format such as Delta Lake, a data lake is effectively append‑only: a DELETE or UPDATE means reading and rewriting the affected files, or entire partitions, with no transactional guarantee. Delta Lake's transaction log enables ACID writes, point‑in‑time queries, and schema evolution, making it a crucial component of Azure Synapse and Databricks data platforms.
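The idea behind the transaction log can be shown with a toy copy-on-write table. This is a sketch of the concept only, not Delta Lake's actual file layout or API: it assumes data files are immutable, each commit records which files were added and removed, and a snapshot is reconstructed by replaying the log up to a version. The class `ToyTable` and its methods are hypothetical names.

```python
class ToyTable:
    """Immutable data files plus an append-only log of add/remove actions."""

    def __init__(self):
        self.files = {}  # file name -> tuple of rows, never modified in place
        self.log = []    # one entry per commit

    def _commit(self, add, remove):
        self.log.append({"add": list(add), "remove": list(remove)})
        return len(self.log) - 1  # the new version number

    def _new_file(self, rows):
        name = f"part-{len(self.files):04d}"
        self.files[name] = tuple(rows)
        return name

    def live_files(self, version):
        """Replay the log up to `version` to find the files in that snapshot."""
        live = []
        for entry in self.log[: version + 1]:
            live = [f for f in live if f not in entry["remove"]] + entry["add"]
        return live

    def append(self, rows):
        return self._commit([self._new_file(rows)], [])

    def update(self, predicate, change):
        """Rewrite only the files that contain matching rows."""
        add, remove = [], []
        for name in self.live_files(len(self.log) - 1):
            rows = self.files[name]
            if any(predicate(r) for r in rows):
                add.append(self._new_file(
                    change(r) if predicate(r) else r for r in rows))
                remove.append(name)
        return self._commit(add, remove)

    def read(self, version=None):
        version = len(self.log) - 1 if version is None else version
        rows = [r for f in self.live_files(version) for r in self.files[f]]
        return sorted(rows, key=lambda r: r["id"])

    def restore(self, version):
        """Roll back by committing a new version that points at old files."""
        target = self.live_files(version)
        current = self.live_files(len(self.log) - 1)
        return self._commit([f for f in target if f not in current],
                            [f for f in current if f not in target])


table = ToyTable()
v0 = table.append([{"id": 1, "status": "new"}, {"id": 2, "status": "new"}])
v1 = table.append([{"id": 3, "status": "new"}])
v2 = table.update(lambda r: r["id"] == 2, lambda r: {**r, "status": "fraud"})
before_restore = table.read()
v3 = table.restore(v1)
print(table.read() == table.read(version=v1))  # True
```

Note that the restore is itself a commit: version 2 remains readable afterwards, and no data file is rewritten or deleted to perform the rollback, which is why it completes quickly regardless of table size.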

Apache Spark has become the de facto standard for large‑scale data transformation. The PySpark, Scala, and .NET for Spark APIs allow data transformations to be expressed as DataFrame operations that Spark optimizes into distributed execution plans. Managed Spark services, such as Azure Databricks, Azure Synapse Spark, and Amazon EMR, handle cluster provisioning and autoscaling. For data volumes beyond what a single machine can comfortably handle (as a rough guide, more than 100 GB), Spark is usually the appropriate tool.

Apache Spark Structured Streaming treats a stream (a Kafka topic or an Event Hubs instance, for example) as an unbounded DataFrame. Transformations written as batch DataFrame operations are applied incrementally to the stream, with at‑least‑once or exactly‑once processing semantics depending on the source and sink. For event‑time aggregations (how many events occurred in the last hour, according to the event timestamps), Structured Streaming's watermarking model bounds how long the engine waits for late‑arriving events and how much state it has to keep.
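The "unbounded table" idea can be sketched without Spark: the same aggregation gives the same answer whether it runs once over all the data or incrementally over micro-batches, as long as the running state is carried forward. The names below (`batch_counts`, `IncrementalCounts`) and the sample events are hypothetical, and the sketch leaves out checkpointing and fault tolerance entirely.

```python
from collections import Counter


def batch_counts(events):
    """Batch view: count events per type over the full dataset."""
    return Counter(kind for kind, _ in events)


class IncrementalCounts:
    """Streaming view: the same count, updated one micro-batch at a time."""

    def __init__(self):
        self.state = Counter()  # running totals carried between batches

    def process(self, micro_batch):
        self.state.update(kind for kind, _ in micro_batch)
        return dict(self.state)  # result table after this micro-batch


# (event type, event id) pairs, split into three micro-batches.
events = [("click", 1), ("view", 2), ("click", 3), ("purchase", 4), ("view", 5)]
micro_batches = [events[:2], events[2:4], events[4:]]

stream = IncrementalCounts()
results = [stream.process(batch) for batch in micro_batches]

print(results[0])                            # {'click': 1, 'view': 1}
print(stream.state == batch_counts(events))  # True
```

The state held in `IncrementalCounts` is what a real engine has to checkpoint, and it is also what a watermark lets the engine eventually discard for windows that can no longer change.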