DE by DL.AI - C2: Source Systems, Data Ingestion, and Pipelines (W3 - DataOps)

Anh-Thi Dinh

DataOps - Automation

Overview

  • Recall:
    • DataOps: Set of practices and cultural habits centered around building robust data systems and delivering high quality data products.
    • DevOps: Set of practices and cultural habits that allow software engineers to efficiently deliver and maintain high quality software products.
  • DataOps → Automation, Observability & Monitoring, Incident Response (we don't cover incident response this week because it's the cultural-habits side of DataOps)
  • Automation → CI/CD
  • Infrastructure: AWS CloudFormation, Terraform

Conversation with Chris Bergh (DataOps — Automation)

  • DataOps Definition: Methodology for delivering data insights quickly, reliably, and with high quality.
  • Inspiration: Derived from lean manufacturing, focusing on efficiency and adaptability.
  • Goal: Build a “data factory” to produce consistent and modifiable data outputs.
  • Problem in Traditional Data Engineering: Failures are due to flawed systems, not technology or talent.
  • Key Principle: Build systems around code (e.g., testing, observability) for reliability.
  • Testing: Essential for minimizing future errors and ensuring code quality.
  • Iterative Development: Deliver small updates, get feedback, and iterate quickly.
  • DataOps vs. DevOps: Both focus on quick, quality delivery; DataOps specifically targets data workflows.
  • Don’t Be a Hero: Avoid taking on too much; build systems to support your work.
  • Automation and Validation: Always test and validate; measure everything for reliability.
  • Proactive Systems: Build environments that ensure long-term success and reduce stress.
  • Balance Optimism with Systems: Don’t rely on hope—verify and automate processes for efficiency.

DataOps Automation

  • CI/CD (Continuous Integration and Continuous Delivery)
  • No automation: run all processes manually.
  • Pure scheduling: run stages of your pipeline according to a schedule
  • DAG = a directed acyclic graph → use a tool like Airflow
  • Just like code, data also needs version control → track changes and be able to revert them.
  • The entire infrastructure also needs version control.
  • In some depictions of the data engineering lifecycle, DataOps sits alongside Data Management and Software Engineering as an undercurrent.

Infrastructure as Code

  • Infrastructure-as-code tools: Terraform, AWS CloudFormation (native to AWS), Ansible → allow you to provision and configure your infrastructure using code-based configuration files.
    • No need to manually run bash scripts or click through the console.
  • Terraform Language → Domain-Specific Language: HCL (HashiCorp Configuration Language)
    • HCL is a declarative language. You just have to declare what you want the infrastructure to look like.
  • Terraform is idempotent: if you repeatedly apply the same HCL configuration, your infrastructure keeps the same desired end state as after the first run.
    • vs. imperative/procedural languages like Bash.
    • e.g., a configuration declaring 5 EC2 instances → Terraform makes sure exactly 5 instances exist, while re-running a Bash script would create 5 more each time.
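  • A minimal sketch of this declarative style (hypothetical resource name and placeholder AMI ID, just to illustrate the 5-instance example):
    • # Declare the desired end state: exactly 5 web servers
      resource "aws_instance" "web" {
        count         = 5                       # re-running terraform apply still converges on exactly 5 instances
        ami           = "ami-0c55b159cbfafe1f0" # placeholder AMI ID
        instance_type = "t2.micro"
      }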

Terraform

  • Terraform structure
    • # Blocks of code
      keyword labels {
        arguments
        blocks
      }
  • Sample code
    • # terraform settings
      terraform {
        required_providers {
          aws = { # local name: https://developer.hashicorp.com/terraform/language/providers/requirements#local-names
            source  = "hashicorp/aws" # global identifier
            version = ">= 4.16"
          }
        }

        required_version = ">= 1.2.0" # version constraint for terraform
      }

      # providers
      # https://registry.terraform.io/browse/providers
      provider "aws" { # use local name
        region = "us-east-1"
      }

      # data source
      # In case you want to create an instance inside a subnet that is already created, you can use the following code:
      # https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/subnet
      data "aws_subnet" "selected_subnet" {
        id = "subnet-0c55b159cbfafe1f0" # https://docs.aws.amazon.com/vpc/latest/userguide/vpc-subnets.html
        # ☝ access this by: data.aws_subnet.selected_subnet.id
      }

      # Ask Terraform to get the latest Amazon Linux AMI with a specific name and architecture
      data "aws_ami" "latest_amazon_linux" {
        most_recent = true
        owners      = ["amazon"]
        filter {
          name   = "architecture"
          values = ["x86_64"]
        }
        filter {
          name   = "name"
          values = ["al2023-ami-2023*"] # Amazon Linux 2023 AMI name pattern
        }
      }

      # resources
      # https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance
      resource "aws_instance" "webserver" { # can be referenced as "aws_instance.webserver" in other parts of the code
        # ami = "ami-0c55b159cbfafe1f0" # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-ami.html
        ami           = data.aws_ami.latest_amazon_linux.id
        instance_type = "t2.micro"
        subnet_id     = data.aws_subnet.selected_subnet.id # Without this, the instance will be created in the default VPC.
        tags = {
          Name = var.server_name
        }
      }

      # input
      variable "region" { # to use this variable in the code, use "var.region"
        type        = string
        default     = "us-east-1"
        description = "region for aws resources"
      }

      variable "server_name" {
        type        = string
        # if default is not provided, it will be prompted during terraform apply
        # terraform apply -var server_name=ExampleServer
        # or use a terraform.tfvars file
        description = "name of the server running the website"
      }

      # output
      # https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance
      # terraform output to show all
      # terraform output server_id to show only server_id
      output "server_id" {
        value = aws_instance.webserver.id
      }

      output "server_arn" {
        value = aws_instance.webserver.arn
      }
    • terraform init to install all providers defined in the config file.
    • terraform plan to create execution plan: create / update / destroy based on the config file.
    • terraform apply to confirm and execute the planned tasks.
    • If we have some changes in the file, we can run terraform apply again.
  • Variables can be put inside a terraform.tfvars file.
  • We can split the .tf file into smaller files like variables.tf, outputs.tf, providers.tf, with all resources in main.tf. Terraform will automatically concatenate all .tf files into one.
  • Modules: you can put all the files created in the above sample inside a module "website" and then create some main files instead (check the codes); a minimal sketch of calling such a module is shown after this list.
    • To update changes in a module → run terraform init then terraform apply.
  • Data sources: Data blocks to reference resources created outside Terraform or in another Terraform workspace. Eg: Data source: aws_subnet
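  • A minimal sketch of calling the "website" module mentioned above (the source path and variable values are assumptions based on the sample code):
    • module "website" {
        source      = "./website"        # local path to the module's .tf files
        server_name = "ExampleServer"    # forwarded to the module's server_name variable
        region      = "us-east-1"
      }

      output "server_id" {
        value = module.website.server_id # re-export an output defined inside the module
      }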

Additional terraform configuration

You are given the ID of a VPC created in the us-east-1 region. You will need to create a MySQL RDS database instance inside a subnet of the given VPC. The VPC contains two private subnets; the IDs of these subnets are not given to you.
👉 Check the code.
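A possible sketch under assumed names (the actual variable names, credentials, and tags in the solution may differ): look up the VPC's private subnets with a data block, build a DB subnet group, then create the MySQL instance.
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [var.vpc_id] # the given VPC ID (assumed variable name)
  }
  # If the VPC also contained public subnets, an extra tag filter could select only the private ones.
}

# RDS requires a subnet group spanning at least two subnets
resource "aws_db_subnet_group" "database" {
  name       = "mysql-subnet-group"
  subnet_ids = data.aws_subnets.private.ids
}

resource "aws_db_instance" "mysql" {
  identifier           = "my-mysql-db"   # hypothetical identifier
  engine               = "mysql"
  instance_class       = "db.t3.micro"
  allocated_storage    = 20
  db_subnet_group_name = aws_db_subnet_group.database.name
  username             = var.db_username # assumed variables for the credentials
  password             = var.db_password
  skip_final_snapshot  = true
}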

Lab 1 - Implementing DataOps with Terraform

☝️ Check the code and read the markdown file there.
We are going to build a Bastion Host that users can connect to using an SSH connection.
  • The bastion host runs on an EC2 instance.
  • The main database is inside RDS, which only allows connections from the EC2 instance (the bastion host). An external user cannot connect to the database directly. ← use the aws_security_group resource
The VPC and the public and private subnets (2 of each) are already created using AWS CloudFormation. We will use data blocks to get these resources.
The VPC and subnets are provided, and we need to complete the Terraform files.
We will need an SSH key pair: the public key goes inside EC2 and the private key stays with the user outside.
To create the database and the EC2 instance, we need to complete the .tf files. ← the VPC and subnets were created using AWS CloudFormation (in CloudFormation, click on the stack name that doesn't start with cloud9 → Outputs → see the list of resources created → note the IDs of the VPC and subnets whose values we are going to use in the .tf files).
We need to find the IDs of the VPC and subnets given by AWS CloudFormation.
In providers.tf, we are going to use two security groups:
  • bastion_host → ingress: the bastion host can receive SSH connections from the public internet
  • database → ingress: receives traffic only from the bastion host
→ we will use these network resources in the RDS and EC2 configs
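A minimal sketch of these two security groups (hypothetical names; the port assumes the PostgreSQL database used later in the lab, and the lab files may differ):
resource "aws_security_group" "bastion_host" {
  vpc_id = var.vpc_id # assumed variable holding the VPC ID
  ingress {
    description = "SSH from the public internet"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1" # allow all outbound traffic
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "database" {
  vpc_id = var.vpc_id
  ingress {
    description     = "Database traffic only from the bastion host"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.bastion_host.id] # only traffic from the bastion's security group
  }
}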
Note: we have to specify 2 private subnets instead of 1 because RDS expects at least 2 subnets (in different Availability Zones) in case you later want to switch to a Multi-AZ deployment.
ec2.tf
  • key_name represents the name of the SSH key pair that you need to associate with the bastion host. ← to do so, you will use tls_private_key to generate the SSH key pair
  • local_file creates a file to store the private SSH key.
  • aws_key_pair registers the public key with AWS so that key_name can be used inside the aws_instance.
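A minimal sketch of these three resources (file and key names are assumptions; the lab uses its own naming):
resource "tls_private_key" "bastion" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

resource "local_file" "bastion_private_key" {
  content         = tls_private_key.bastion.private_key_pem
  filename        = "bastion-host-key.pem" # assumed file name for the private key
  file_permission = "0400"
}

resource "aws_key_pair" "bastion" {
  key_name   = "bastion-host-key" # referenced as key_name in the aws_instance
  public_key = tls_private_key.bastion.public_key_openssh
}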
The variables for the VPC and its subnets don't have a default value → you have to set them in the .tfvars file.
terraform.tfstate
  • keeps track of the state of the infrastructure and its configuration
  • contains information that maps the resource instances to the actual AWS objects
  • can be shared with your team by storing it in an S3 bucket, for example
We will use an S3 bucket to store the Terraform state. A DynamoDB table provides locking so that users cannot edit the state file simultaneously.
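A minimal sketch of such a backend configuration (bucket and table names are hypothetical):
terraform {
  backend "s3" {
    bucket         = "de-c2w3lab1-terraform-state" # hypothetical bucket name
    key            = "lab1/terraform.tfstate"      # path of the state file inside the bucket
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"        # hypothetical table used for state locking
  }
}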

Output values
# db_host of rds
terraform output db_host

# db master username
terraform output db_master_username

# db master password
terraform output db_master_password

Try to connect to RDS directly from outside (impossible because RDS only accepts connections from EC2):
psql -h <RDS-HOST> -U postgres_admin -p 5432 -d postgres --password
This command fails because RDS blocks outside connections; it is just a test!
We allow users to connect to RDS via the bastion host (EC2) using an SSH connection; we forward the connection from RDS to EC2 and then to the user with:
ssh -i de-c2w3lab1-bastion-host-key.pem -L 5432:<RDS-HOST>:<DATABASE-PORT> ec2-user@<BASTION-HOST-DNS> -N -f
  • -i: identity file; it specifies the file that contains the SSH private key.
  • -L: local port forwarding (-L LOCAL_PORT:DESTINATION_HOSTNAME:DESTINATION_PORT USER@SERVER_IP): this means you're forwarding the connection through the bastion host to the RDS instance.
  • -f: run the command in the background; -N means "do not execute a remote command".
  • If you'd like to learn more about SSH tunneling, you can check this article.
Then we can connect to RDS:
# please note that 5432 here is the local port that is forwarded
# to the RDS port through the SSH connection
psql -h localhost -U postgres_admin -p 5432 -d postgres --password

DataOps - Observability

Data Observability

  • Observability tools give you visibility into your system's health.
  • Metrics: CPU and RAM, response time
  • Purposes: Quickly detect anomalies, Identify problems, Prevent downtime, Ensure reliable software products
  • DE → monitor the health of data systems, the health and quality of data.
  • High quality data? → accurate, complete, discoverable, available in a timely manner. ← well-defined schema, data definitions.
  • Low quality data is worse than no data.
  • ☝️ When there's a disruption to your data system, how do you make sure you know what happened as soon as possible? → That's where data observability and monitoring come in!

Conversation with Barr Moses (Data Observability & Monitoring)

  1. Introduction to Data Observability:
      • Data observability is akin to software observability but focuses on ensuring data quality and reliability.
      • It addresses the pain point where data teams face inaccurate or inconsistent data, causing disruptions in decision-making.
  2. Importance of Trusted Data:
      • The goal is to help organizations trust and rely on their data, ensuring its accuracy and timeliness.
  3. Core Components of Data Systems:
      • Data systems consist of the data itself, the code transforming it, and the infrastructure managing it.
      • Issues can occur in any of these components, impacting downstream processes and outcomes.
  4. Key Metrics for Data Monitoring:
      • Number of incidents: Tracking how often data issues occur.
      • Time to detection: Measuring how long it takes to identify an issue.
      • Time to resolution: Evaluating the duration needed to fix data issues.
  5. Success Stories:
      • Example: JetBlue used data observability to improve internal team satisfaction (measured through Net Promoter Scores) and enhance operational efficiency.
  6. Importance of Engaging with Stakeholders:
      • Understanding stakeholders' needs is crucial for developing relevant and effective data solutions.
      • Testing solutions with real users is essential for refining products.
  7. Continuous Learning and Adaptation:
      • Staying updated in the rapidly changing data industry is vital.
      • Emphasizing the importance of curiosity and flexibility in learning.
  8. Advice for Learners:
      • Focus on customer needs and market trends rather than relying solely on general advice.
      • Maintain a passion for learning and adapting to new industry developments.

Monitoring Data Quality

  • The question is: where do you start monitoring data quality?
  • Based on metrics like: volume (in each batch), distribution (range of values), null values, freshness (difference between now and the most recent timestamp in the data)
Focus on:
  • The most important metrics.
  • Avoid creating confusion and “alert fatigue”
  • What do stakeholders care about the most?

Conversation with Abe Gong (about Great Expectations)

The conversation discusses the origins, goals, and functions of the open-source project “Great Expectations,” co-founded by Abe Gong and James Campbell. Abe explains that their shared experience with challenging data quality issues in fields like healthcare inspired the project. Great Expectations enables data quality testing similar to software testing by allowing users to set “expectations” about data at specific points in a pipeline to ensure reliability. Abe highlights flexibility in deployment, suggesting users tailor its use based on team and stakeholder requirements. He also emphasizes the importance of data quality for reliable data systems and hints at the upcoming launch of Great Expectations Cloud, which aims to improve accessibility for non-technical stakeholders.

Great Expectations (GX)

Great Expectations enables you to define expectations for your data and to automatically validate your data against these expectations. It can also notify you of any inconsistencies detected, and you can use it to validate your data at any stage of your data pipeline.
When working with GX, the workflow is: (1) specify the data → (2) define your expectations → (3) validate your data against the expectations.

Conversation with Chad Sanderson (Data contract)

There are also other ways to maintain data quality expectations with your upstream stakeholders, including using what's called a data contract.
  • Definition: Data contracts act like APIs, setting data quality expectations.
  • Purpose: Prevents data quality issues between data producers and consumers.
  • Traditional vs. Modern: Older, centralized systems need fewer contracts; cloud-based systems benefit more.
  • Components: Includes schema, business logic, SLAs (Service-level agreement), and compliance rules.
  • Enforcement: Primarily programmatic, not legal.
  • Implementation: Start with high-impact data pipelines; integrate into workflows.
  • Communication: Essential to make producers aware of data impact.

Amazon CloudWatch

Common metrics for RDS:
  • CPU Utilization
    • High value: your RDS instance might be under heavy load
    • Values over 80-90%: can lead to performance bottlenecks
  • RAM Consumption
    • High RAM consumption: can slow down performance
  • Disk Space
    • Value consistently above 85%: you may need to delete or archive data to free up some space
  • Database Connections
    • The number of active connections to your database
    • Number of connections approaching the maximum limit: can lead to connection errors and application failures
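Since the labs manage infrastructure with Terraform, an alarm on one of these metrics could be sketched in HCL as follows (the names, RDS identifier, and threshold are assumptions, not the lab's configuration):
resource "aws_cloudwatch_metric_alarm" "rds_cpu_high" {
  alarm_name          = "rds-cpu-utilization-high" # hypothetical name
  namespace           = "AWS/RDS"
  metric_name         = "CPUUtilization"
  dimensions = {
    DBInstanceIdentifier = "my-rds-instance" # hypothetical RDS identifier
  }
  statistic           = "Average"
  period              = 300 # evaluate the metric over 5-minute windows
  evaluation_periods  = 2
  threshold           = 80  # values over 80-90% can lead to performance bottlenecks
  comparison_operator = "GreaterThanThreshold"
  alarm_description   = "RDS instance might be under heavy load"
}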

Lab 2 — Amazon CloudWatch

Check this code.