DE by DL.AI - C2: Source Systems, Data Ingestion, and Pipelines (W3 - DataOps)

Anh-Thi Dinh
draft
⚠️
This is a quick & dirty draft, for me only!

DataOps - Automation

Overview

  • Recall:
    • DataOps: Set of practices and cultural habits centered around building robust data systems and delivering high quality data products.
    • DevOps: Set of practices and cultural habits that allow software engineers to efficiently deliver and maintain high quality software products.
  • DataOps → Automation, Observability & Monitoring, Incident Response (we don’t cover Incident Response this week because it’s the cultural-habits side of DataOps)
  • Automation → CD/CI
  • Infrastructure as code: AWS CloudFormation, Terraform

Conversation with Chris Bergh

  • DataOps Definition: Methodology for delivering data insights quickly, reliably, and with high quality.
  • Inspiration: Derived from lean manufacturing, focusing on efficiency and adaptability.
  • Goal: Build a “data factory” to produce consistent and modifiable data outputs.
  • Problem in Traditional Data Engineering: Failures are due to flawed systems, not technology or talent.
  • Key Principle: Build systems around code (e.g., testing, observability) for reliability.
  • Testing: Essential for minimizing future errors and ensuring code quality.
  • Iterative Development: Deliver small updates, get feedback, and iterate quickly.
  • DataOps vs. DevOps: Both focus on quick, quality delivery; DataOps specifically targets data workflows.
  • Don’t Be a Hero: Avoid taking on too much; build systems to support your work.
  • Automation and Validation: Always test and validate; measure everything for reliability.
  • Proactive Systems: Build environments that ensure long-term success and reduce stress.
  • Balance Optimism with Systems: Don’t rely on hope—verify and automate processes for efficiency.

DataOps Automation

  • CI/CD (Continuous Integration and Continuous Delivery)
  • No automation: run all processes manually.
  • Pure scheduling: run stages of your pipeline according to a schedule
  • DAG = a directed acyclic graph → use a tool like Airflow
  • Just like code, data also needs version control → track changes and be able to revert.
  • The entire infrastructure also needs version control.
  • In some respects, DataOps sits close to Data Management and Software Engineering.

Infrastructure as Code

  • Infrastructure-as-code tools: Terraform, AWS CloudFormation (native to AWS), Ansible → they let you provision and configure your infrastructure using code-based configuration files.
    • No need to manually run bash scripts or click through a console.
  • Terraform Language → Domain-Specific Language: HCL (HashiCorp Configuration Language)
    • HCL is a declarative language: you just declare what you want the infrastructure to look like.
  • Terraform is idempotent: if you repeatedly apply the same HCL configuration, your infrastructure keeps the same desired end state it reached the first time you applied it.
    • vs. an imperative/procedural language like Bash
    • e.g., a script that creates 5 EC2 instances → Terraform creates them once and thereafter ensures exactly 5 instances exist, while a Bash script would spin up 5 more instances on every run.
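A minimal sketch of that declarative style (the AMI id is a placeholder, not a real recommendation): you state the desired count, and every `terraform apply` reconciles reality to it instead of blindly creating more instances.

```hcl
# Declarative: "there should be exactly 5 instances."
# Re-running `terraform apply` never creates a 6th one;
# a Bash loop calling `aws ec2 run-instances` 5 times would
# add 5 new instances on every run.
resource "aws_instance" "worker" {
  count         = 5
  ami           = "ami-0c55b159cbfafe1f0" # placeholder AMI id
  instance_type = "t2.micro"
}
```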

Terraform

  • Sample code
```hcl
# terraform settings
terraform {
  required_providers {
    aws = { # local name: https://developer.hashicorp.com/terraform/language/providers/requirements#local-names
      source  = "hashicorp/aws" # global identifier
      version = ">= 4.16"
    }
  }

  required_version = ">= 1.2.0" # version constraint for terraform
}

# providers
# https://registry.terraform.io/browse/providers
provider "aws" { # use local name
  region = "us-east-1"
}

# data source
# In case you want to create an instance inside a subnet that already exists:
# https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/subnet
data "aws_subnet" "selected_subnet" {
  id = "subnet-0c55b159cbfafe1f0" # https://docs.aws.amazon.com/vpc/latest/userguide/vpc-subnets.html
  # ☝ access this by: data.aws_subnet.selected_subnet.id
}

data "aws_ami" "latest_amazon_linux" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "architecture"
    values = ["x86_64"]
  }
  filter {
    name   = "name"
    values = ["al202*-ami-202*"]
  }
}

# resources
# https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance
resource "aws_instance" "webserver" { # can be referenced as "aws_instance.webserver" elsewhere in the code
  # ami = "ami-0c55b159cbfafe1f0" # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-ami.html
  ami           = data.aws_ami.latest_amazon_linux.id
  instance_type = "t2.micro"
  subnet_id     = data.aws_subnet.selected_subnet.id # Without this, the instance is created in the default VPC.
  tags = {
    Name = var.server_name
  }
}

# input
variable "region" { # to use this variable in the code, use "var.region"
  type        = string
  default     = "us-east-1"
  description = "region for aws resources"
}

variable "server_name" {
  type        = string
  # if no default is provided, Terraform prompts for a value during terraform apply
  # terraform apply -var server_name=ExampleServer
  # or use a terraform.tfvars file
  description = "name of the server running the website"
}

# output
# https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance
# terraform output to show all outputs
# terraform output server_id to show only server_id
output "server_id" {
  value = aws_instance.webserver.id
}

output "server_arn" {
  value = aws_instance.webserver.arn
}
```
    • terraform init installs all providers defined in the config file.
    • terraform plan creates an execution plan: create / update / destroy based on the config file.
    • terraform apply confirms and executes the plan.
    • If the config file changes, run terraform apply again.
  • Variables can be put inside terraform.tfvars file.
  • We can split the .tf file into smaller files like variables.tf, outputs.tf, providers.tf, with all resources in main.tf → Terraform automatically concatenates all .tf files into one.
  • Modules: you can put all the files created in the sample above inside a module “website” and then create a few main files instead (check the code).
    • To apply changes in a module → run terraform init → terraform apply
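A sketch of what the root module call and terraform.tfvars might look like (the module path, variable names, and value are assumptions based on the sample above, not the course’s exact files):

```hcl
# main.tf — call the module that wraps the sample files
module "website" {
  source      = "./website"     # assumed local module directory
  server_name = var.server_name # pass root variables down into the module
}

# terraform.tfvars — values picked up automatically by terraform apply
# (shown here as a comment since it lives in a separate file):
#   server_name = "ExampleServer"
```

After adding or changing the module block, terraform init re-resolves the module before terraform apply.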

Additional terraform configuration

You are given the ID of a VPC created in the us-east-1 region. You will need to create a MySQL RDS database instance inside a subnet of the given VPC. The VPC contains two private subnets; the IDs of these subnets are not given to you.
👉 Check the code.
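One possible shape for that exercise (an untested sketch; the variable names, identifiers, and credentials are placeholders): look up the subnet IDs from the given VPC id, group them into a DB subnet group, and place the RDS instance in it.

```hcl
variable "vpc_id" {
  type        = string
  description = "ID of the given VPC in us-east-1"
}

variable "db_password" {
  type      = string
  sensitive = true # assumed: password supplied at apply time, not hard-coded
}

# The subnet IDs are not given → discover them from the VPC id.
# (Assumes the VPC contains only the two private subnets.)
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [var.vpc_id]
  }
}

# RDS requires a DB subnet group spanning the subnets.
resource "aws_db_subnet_group" "db" {
  name       = "mysql-db-subnet-group"
  subnet_ids = data.aws_subnets.private.ids
}

resource "aws_db_instance" "mysql" {
  identifier           = "my-mysql-db"
  engine               = "mysql"
  instance_class       = "db.t3.micro"
  allocated_storage    = 20
  username             = "admin"
  password             = var.db_password
  db_subnet_group_name = aws_db_subnet_group.db.name
  skip_final_snapshot  = true # for lab cleanup; not for production
}
```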

Lab 1 - Implementing DataOps with Terraform

DataOps - Observability