Data engineering patterns have evolved from batch ETL pipelines to streaming architectures and, more recently, to hybrid batch-streaming systems. By 2020 the cloud-native data engineering stack has converged on a well-defined set of patterns.

The medallion architecture

The medallion architecture (bronze, silver, gold layers) is the dominant data lake pattern in 2020. Bronze layer: raw data landed from source systems as-is, with no transformation, append-only. Silver layer: cleaned, normalised, validated data with schema enforcement. Gold layer: aggregated, business-aligned datasets ready for analytics and ML. The layered approach provides clear lineage from raw data to analytical output and allows each layer to be rebuilt from the layer below it.
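The flow through the three layers can be sketched with plain Python structures (in practice each layer would be a table in the lake; the field names and cleaning rules here are illustrative, not from any particular source system):

```python
# Conceptual sketch of the medallion layers using plain Python
# structures; in a real pipeline each layer would be a lake table.

# Bronze: raw records landed as-is from a source system (illustrative data).
bronze = [
    {"order_id": "1", "amount": "10.50", "country": "gb"},
    {"order_id": "2", "amount": "bad",   "country": "GB"},
    {"order_id": "3", "amount": "4.25",  "country": "US"},
]

def to_silver(records):
    """Silver: validate and normalise; reject rows that fail parsing."""
    silver = []
    for r in records:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # drop/quarantine invalid rows at the silver boundary
        silver.append({"order_id": r["order_id"],
                       "amount": amount,
                       "country": r["country"].upper()})
    return silver

def to_gold(records):
    """Gold: business-aligned aggregate, e.g. revenue per country."""
    totals = {}
    for r in records:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
```

Because each function depends only on the layer below, gold can be rebuilt from silver, and silver from bronze, which is the rebuild property the pattern promises.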

Delta Lake for ACID on data lakes

Delta Lake (created at Databricks, open source) brings ACID transactions, schema enforcement, and time travel to data lakes stored as Parquet on Azure Data Lake Storage or S3. Without it, data lakes are append-only at best: DELETE and UPDATE require reading and rewriting entire partitions. Delta Lake's transaction log enables ACID writes, point-in-time queries, and schema evolution. It underpins the Databricks platform and is supported in Azure Synapse.
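The mechanism behind time travel can be illustrated with a toy append-only log of commits, each replayable to reconstruct a past table state. This is a deliberate simplification of Delta Lake's JSON commit log; the class and action names are hypothetical:

```python
# Toy model of a transaction log: each commit appends a list of actions,
# and any past table state is reconstructed by replaying the log up to a
# version number — a simplification of Delta Lake's JSON commit log.

class ToyTransactionLog:
    def __init__(self):
        self.commits = []  # commit i is a list of (op, row_id, row) actions

    def commit(self, actions):
        self.commits.append(actions)
        return len(self.commits) - 1  # version number of this commit

    def snapshot(self, version=None):
        """Replay actions up to `version` to get the table at that point."""
        if version is None:
            version = len(self.commits) - 1  # latest version
        table = {}
        for actions in self.commits[: version + 1]:
            for op, row_id, row in actions:
                if op == "add":
                    table[row_id] = row
                elif op == "remove":
                    table.pop(row_id, None)
        return table

log = ToyTransactionLog()
v0 = log.commit([("add", 1, {"qty": 5}), ("add", 2, {"qty": 3})])
v1 = log.commit([("remove", 2, None), ("add", 3, {"qty": 7})])
```

Reading `log.snapshot(v0)` returns the table before the second commit, which is the essence of a point-in-time query; because the log, not the data files, defines table state, writers can commit atomically while readers see a consistent snapshot.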

Apache Spark for large-scale transformation

Spark is the standard compute engine for large-scale data transformation. The PySpark, Scala, and .NET for Spark APIs allow data transformations to be expressed as DataFrame operations that Spark optimises into distributed execution plans. Managed Spark services (Azure Databricks, Azure Synapse Spark, EMR on AWS) handle cluster provisioning and autoscaling. For data volumes beyond what a single machine can handle (typically >100GB), Spark is the appropriate tool.

Streaming with Structured Streaming

Apache Spark Structured Streaming treats a stream (a Kafka topic, an Event Hub) as an unbounded DataFrame. Transformations written as batch DataFrame operations apply incrementally to the stream. The micro-batch execution model provides exactly-once processing semantics with replayable sources and idempotent sinks, and at-least-once otherwise. For event-time aggregations (e.g. how many events occurred in the last hour, keyed by event timestamp rather than arrival time), Structured Streaming's watermarking model bounds how long the engine waits for late-arriving events.
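The watermarking idea can be shown with a plain-Python simulation rather than a running stream: the watermark trails the maximum event time seen by an allowed-lateness threshold, and events that arrive behind it are discarded instead of aggregated. The 10-minute threshold and the event shapes are illustrative:

```python
# Plain-Python simulation of event-time watermarking: late events are
# accepted while they are within the allowed lateness, and dropped once
# the watermark (max event time seen minus the threshold) passes them.
from datetime import datetime, timedelta

LATENESS = timedelta(minutes=10)  # illustrative allowed-lateness threshold

def aggregate_with_watermark(events):
    """events: (event_time, key) pairs in arrival order.
    Returns per-key counts and the list of dropped late events."""
    counts, dropped = {}, []
    max_event_time = datetime.min
    for event_time, key in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - LATENESS
        if event_time < watermark:
            dropped.append((event_time, key))  # too late: discarded
        else:
            counts[key] = counts.get(key, 0) + 1
    return counts, dropped

t = datetime(2020, 6, 1, 12, 0)
events = [
    (t, "a"),
    (t + timedelta(minutes=30), "a"),  # advances the watermark to 12:20
    (t + timedelta(minutes=25), "b"),  # late but within threshold: kept
    (t + timedelta(minutes=5), "b"),   # behind the watermark: dropped
]
counts, dropped = aggregate_with_watermark(events)
```

The trade-off the watermark expresses is the same as in Structured Streaming: a longer threshold tolerates later data but forces the engine to hold aggregation state open for longer before emitting final results.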