AI Era Hits Modern Data Stack

The modern data stack, which emerged between 2018 and 2022, is adapting to the AI era. Tools that managed analytical data are becoming infrastructure for AI training data, feature stores, and model monitoring.

The modern data stack consists of cloud-native tools for analytical data workflows. It includes a cloud data warehouse like Snowflake, BigQuery, or Redshift for storage and query, dbt for data transformation and lineage, Fivetran or Airbyte for data ingestion, and Looker or Metabase for visualisation. This stack replaced complex on-premises ETL pipelines with composable cloud services connected by SQL.

dbt standardised SQL-based data transformation in the data warehouse. It treats transformation logic as code in a version-controlled repository, with built-in testing, documentation, and lineage tracking. dbt is now widely adopted among modern data teams. dbt Labs also offers dbt Cloud for managed execution and, in 2023, relaunched the Semantic Layer for consistent metric definitions across tools.
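To make the testing idea concrete: dbt's generic tests such as `not_null` and `unique` run a query that selects failing records, and the test passes when that query returns nothing. The sketch below imitates that pattern with Python's standard-library `sqlite3` module rather than dbt itself. The `orders` table, its columns, and the helper function names are hypothetical and exist only for illustration.

```python
import sqlite3

def failing_rows_not_null(conn, table, column):
    """Return rows where `column` is NULL (the idea behind a not_null test)."""
    # Identifiers are interpolated directly; acceptable only because this is a
    # sketch with trusted, hard-coded names.
    return conn.execute(
        f"SELECT * FROM {table} WHERE {column} IS NULL"
    ).fetchall()

def failing_values_unique(conn, table, column):
    """Return values of `column` that occur more than once (the idea behind a unique test)."""
    return conn.execute(
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"WHERE {column} IS NOT NULL "
        f"GROUP BY {column} HAVING COUNT(*) > 1"
    ).fetchall()

def build_demo_db():
    """Create an in-memory table with one NULL and one duplicate key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 10), (2, None), (2, 11), (3, 12)],
    )
    return conn

conn = build_demo_db()
null_failures = failing_rows_not_null(conn, "orders", "customer_id")
dup_failures = failing_values_unique(conn, "orders", "order_id")

# A test "passes" when its query returns no rows.
print("not_null passed:", not null_failures)
print("unique passed:", not dup_failures)
```

In a real dbt project these checks would be declared in the model's YAML file rather than written by hand; the point here is only that each test reduces to a query over the warehouse.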

One of the key challenges in building AI models is ensuring high-quality training data. I recall a project where we had to deal with inconsistent data formats, missing values, and incorrect data types. We used Great Expectations for schema validation and data quality checks, and Monte Carlo for data observability. This let us catch data quality issues early and train our AI models on reliable data.
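The three problems named above (inconsistent formats, missing values, and wrong types) can each be expressed as a small check. The following is a minimal sketch in plain Python, not the Great Expectations API; the `SCHEMA` mapping, the column names, and the `validate_record` helper are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical schema: column name -> (expected type, required?)
SCHEMA = {
    "user_id": (int, True),
    "signup_date": (str, True),   # expected as an ISO date, e.g. 2023-04-01
    "age": (int, False),
}

def validate_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for column, (expected_type, required) in SCHEMA.items():
        value = record.get(column)
        if value is None:
            # Missing values are only a problem for required columns.
            if required:
                problems.append(f"{column}: missing value")
            continue
        if not isinstance(value, expected_type):
            problems.append(
                f"{column}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    # Format check: only attempted when the date is present and is a string.
    date_value = record.get("signup_date")
    if isinstance(date_value, str):
        try:
            datetime.strptime(date_value, "%Y-%m-%d")
        except ValueError:
            problems.append("signup_date: not in YYYY-MM-DD format")
    return problems

records = [
    {"user_id": 1, "signup_date": "2023-04-01", "age": 34},
    {"user_id": "2", "signup_date": "01/04/2023", "age": None},
    {"user_id": 3, "signup_date": None, "age": 41},
]

report = {i: validate_record(r) for i, r in enumerate(records)}
for index, problems in report.items():
    print(index, problems or "ok")
```

A dedicated tool adds what this sketch lacks, such as reusable expectation suites, reporting, and integration with the pipeline, but the underlying checks are of this kind.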

The modern data stack's focus on data quality (testing in dbt, schema validation in Great Expectations, and data observability in Monte Carlo) is becoming crucial for AI development. Teams that invested in data quality for analytics have a head start on AI data infrastructure. Teams that haven't now have two reasons to invest.
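One common observability signal is table volume: a load that suddenly delivers far fewer rows than usual often points to an upstream failure. The sketch below flags such a day with a simple z-score over recent daily row counts. It is an illustrative assumption, not how Monte Carlo or any particular product implements its monitors; the `volume_anomaly` function, the threshold, and the sample counts are hypothetical.

```python
from statistics import mean, stdev

def volume_anomaly(history, today, threshold=3.0):
    """Flag today's row count if it lies more than `threshold` standard
    deviations from the mean of the historical counts."""
    if len(history) < 2:
        raise ValueError("need at least two historical counts")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No historical variation: any change at all is unusual.
        return today != mu
    return abs(today - mu) / sigma > threshold

# Hypothetical row counts for the last seven daily loads.
daily_row_counts = [10_120, 9_980, 10_050, 10_200, 9_900, 10_075, 10_010]

normal_day = volume_anomaly(daily_row_counts, 10_060)
broken_load = volume_anomaly(daily_row_counts, 2_300)
print("normal day flagged:", normal_day)
print("broken load flagged:", broken_load)
```

The same check serves analytics and AI alike: a dashboard built on a partial load is wrong, and a model retrained on it is wrong in a way that is harder to notice.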

Snowflake acquired Streamlit in 2022 and integrated it so that data scientists can build ML applications that query Snowflake directly. Snowflake's Cortex adds LLM capabilities like summarisation, classification, and translation directly in SQL queries. The goal is to have data and AI on the same platform, eliminating the ETL step between the data warehouse and ML infrastructure. For example, a team I worked with used Streamlit to build a data science application that queried Snowflake directly, reducing the need for data movement and improving performance.
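As a rough illustration of LLM calls living inside SQL, the sketch below builds a query that uses the Cortex functions `SNOWFLAKE.CORTEX.SUMMARIZE` and `SNOWFLAKE.CORTEX.TRANSLATE`. It only constructs the SQL text and does not connect to Snowflake. The `support_tickets` table, its columns, and the `build_cortex_query` helper are hypothetical, and function availability and exact arguments should be checked against Snowflake's documentation for your account and region.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_LANG_CODE = re.compile(r"[a-z]{2}")

def _identifier(name):
    """Allow only plain SQL identifiers, since they are interpolated into the query."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name

def build_cortex_query(table, id_column, text_column, source_lang, target_lang):
    """Build SQL that summarises and translates a text column in place.

    All table and column names are hypothetical inputs supplied by the caller.
    """
    for code in (source_lang, target_lang):
        if not _LANG_CODE.fullmatch(code):
            raise ValueError(f"expected a two-letter language code, got {code!r}")
    table, id_column, text_column = map(_identifier, (table, id_column, text_column))
    return (
        f"SELECT {id_column},\n"
        f"       SNOWFLAKE.CORTEX.SUMMARIZE({text_column}) AS summary,\n"
        f"       SNOWFLAKE.CORTEX.TRANSLATE({text_column}, "
        f"'{source_lang}', '{target_lang}') AS translated\n"
        f"FROM {table}"
    )

query = build_cortex_query("support_tickets", "ticket_id", "body", "de", "en")
print(query)
```

Because the model call is just another SQL expression, the text never leaves the warehouse, which is the "no ETL step" argument in practice.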

The modern data stack's tools are evolving to support AI workloads. Beyond Snowflake's Streamlit and Cortex integrations, tools like Apache Spark and Kubernetes are being used to run large-scale AI workloads. The trade-off is that these workloads demand specialised skills and infrastructure to manage.

Data quality remains a significant obstacle to AI adoption, which is why the modern data stack's emphasis on testing and observability carries over directly to building reliable AI models.

The convergence of data and AI is driving innovation in the modern data stack. As AI adoption grows, the stack will continue to evolve to support new use cases and workloads.