DE by DL.AI - C2: Source Systems, Data Ingestion, and Pipelines (W3 - DataOps)

Anh-Thi Dinh
draft
⚠️
This is a quick & dirty draft, for me only!

DataOps - Automation

Overview

  • Recall:
    • DataOps: Set of practices and cultural habits centered around building robust data systems and delivering high quality data products.
    • DevOps: Set of practices and cultural habits that allow software engineers to efficiently deliver and maintain high quality software products.
  • DataOps → Automation, Observability & Monitoring, Incident Response (we don’t cover Incident Response this week because it’s the cultural-habits side of DataOps)
  • Automation → CD/CI
  • Infrastructure as code: AWS CloudFormation, Terraform

Conversation with Chris Bergh

  • DataOps Definition: Methodology for delivering data insights quickly, reliably, and with high quality.
  • Inspiration: Derived from lean manufacturing, focusing on efficiency and adaptability.
  • Goal: Build a “data factory” to produce consistent and modifiable data outputs.
  • Problem in Traditional Data Engineering: Failures are due to flawed systems, not technology or talent.
  • Key Principle: Build systems around code (e.g., testing, observability) for reliability.
  • Testing: Essential for minimizing future errors and ensuring code quality.
  • Iterative Development: Deliver small updates, get feedback, and iterate quickly.
  • DataOps vs. DevOps: Both focus on quick, quality delivery; DataOps specifically targets data workflows.
  • Don’t Be a Hero: Avoid taking on too much; build systems to support your work.
  • Automation and Validation: Always test and validate; measure everything for reliability.
  • Proactive Systems: Build environments that ensure long-term success and reduce stress.
  • Balance Optimism with Systems: Don’t rely on hope—verify and automate processes for efficiency.

DataOps Automation

  • CI/CD (Continuous Integration and Continuous Delivery)
  • No automation: run all processes manually.
  • Pure scheduling: run stages of your pipeline according to a schedule
  • DAG = a directed acyclic graph → use a tool like Airflow
  • Just like code, data also needs version control → track changes and be able to revert.
  • The entire infrastructure also needs version control.
  • In some respects, DataOps sits close to Data Management and Software Engineering.

Infrastructure as Code

  • Infrastructure-as-code tools: Terraform, AWS CloudFormation (native to AWS), Ansible → they let you provision and configure your infrastructure using code-based configuration files.
    • No need to manually run bash scripts or click through a console.
  • Terraform Language → Domain-Specific Language: HCL (HashiCorp Configuration Language)
    • HCL is a declarative language: you just declare what you want the infrastructure to look like.
  • Terraform is idempotent: if you repeatedly apply the same HCL configuration, your infrastructure keeps the same desired end state it reached the first time you applied it.
    • vs. an imperative/procedural language like Bash
    • e.g., a script that creates 5 EC2 instances → Terraform creates them once and thereafter ensures exactly 5 instances exist, while a Bash script would spin up 5 more instances on every run.
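A minimal sketch of that declarative style (the AMI id is a placeholder, not a real recommendation): you state the desired count, and every `terraform apply` reconciles reality to it instead of blindly creating more instances.

```hcl
# Declarative: "there should be exactly 5 instances."
# Re-running `terraform apply` never creates a 6th one;
# a Bash loop calling `aws ec2 run-instances` 5 times would
# add 5 new instances on every run.
resource "aws_instance" "worker" {
  count         = 5
  ami           = "ami-0c55b159cbfafe1f0" # placeholder AMI id
  instance_type = "t2.micro"
}
```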

Terraform

  • Sample code
```hcl
# terraform settings
terraform {
  required_providers {
    aws = { # local name: https://developer.hashicorp.com/terraform/language/providers/requirements#local-names
      source  = "hashicorp/aws" # global identifier
      version = ">= 4.16"
    }
  }

  required_version = ">= 1.2.0" # version constraint for terraform
}

# providers
# https://registry.terraform.io/browse/providers
provider "aws" { # use local name
  region = "us-east-1"
}

# data source
# In case you want to create an instance inside a subnet that already exists:
# https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/subnet
data "aws_subnet" "selected_subnet" {
  id = "subnet-0c55b159cbfafe1f0" # https://docs.aws.amazon.com/vpc/latest/userguide/vpc-subnets.html
  # ☝ access this by: data.aws_subnet.selected_subnet.id
}

data "aws_ami" "latest_amazon_linux" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "architecture"
    values = ["x86_64"]
  }
  filter {
    name   = "name"
    values = ["al202*-ami-202*"]
  }
}

# resources
# https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance
resource "aws_instance" "webserver" { # can be referenced as "aws_instance.webserver" elsewhere in the code
  # ami = "ami-0c55b159cbfafe1f0" # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-ami.html
  ami           = data.aws_ami.latest_amazon_linux.id
  instance_type = "t2.micro"
  subnet_id     = data.aws_subnet.selected_subnet.id # Without this, the instance is created in the default VPC.
  tags = {
    Name = var.server_name
  }
}

# input
variable "region" { # to use this variable in the code, use "var.region"
  type        = string
  default     = "us-east-1"
  description = "region for aws resources"
}

variable "server_name" {
  type        = string
  # if no default is provided, Terraform prompts for a value during terraform apply
  # terraform apply -var server_name=ExampleServer
  # or use a terraform.tfvars file
  description = "name of the server running the website"
}

# output
# https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance
# terraform output to show all outputs
# terraform output server_id to show only server_id
output "server_id" {
  value = aws_instance.webserver.id
}

output "server_arn" {
  value = aws_instance.webserver.arn
}
```
    • terraform init installs all providers defined in the config file.
    • terraform plan creates an execution plan: create / update / destroy based on the config file.
    • terraform apply confirms and executes the plan.
    • If the config file changes, run terraform apply again.
  • Variables can be put inside terraform.tfvars file.
  • We can split the .tf file into smaller files like variables.tf, outputs.tf, providers.tf, with all resources in main.tf → Terraform automatically concatenates all .tf files into one.
  • Modules: you can put all the files created in the sample above inside a module “website” and then create a few main files instead (check the code).
    • To apply changes in a module → run terraform init → terraform apply
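A sketch of what the root module call and terraform.tfvars might look like (the module path, variable names, and value are assumptions based on the sample above, not the course’s exact files):

```hcl
# main.tf — call the module that wraps the sample files
module "website" {
  source      = "./website"     # assumed local module directory
  server_name = var.server_name # pass root variables down into the module
}

# terraform.tfvars — values picked up automatically by terraform apply
# (shown here as a comment since it lives in a separate file):
#   server_name = "ExampleServer"
```

After adding or changing the module block, terraform init re-resolves the module before terraform apply.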

Additional terraform configuration

You are given the ID of a VPC created in the us-east-1 region. You will need to create a MySQL RDS database instance inside a subnet of the given VPC. The VPC contains two private subnets; the IDs of these subnets are not given to you.
👉 Check the code.
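One possible shape for that exercise (an untested sketch; the variable names, identifiers, and credentials are placeholders): look up the subnet IDs from the given VPC id, group them into a DB subnet group, and place the RDS instance in it.

```hcl
variable "vpc_id" {
  type        = string
  description = "ID of the given VPC in us-east-1"
}

variable "db_password" {
  type      = string
  sensitive = true # assumed: password supplied at apply time, not hard-coded
}

# The subnet IDs are not given → discover them from the VPC id.
# (Assumes the VPC contains only the two private subnets.)
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [var.vpc_id]
  }
}

# RDS requires a DB subnet group spanning the subnets.
resource "aws_db_subnet_group" "db" {
  name       = "mysql-db-subnet-group"
  subnet_ids = data.aws_subnets.private.ids
}

resource "aws_db_instance" "mysql" {
  identifier           = "my-mysql-db"
  engine               = "mysql"
  instance_class       = "db.t3.micro"
  allocated_storage    = 20
  username             = "admin"
  password             = var.db_password
  db_subnet_group_name = aws_db_subnet_group.db.name
  skip_final_snapshot  = true # for lab cleanup; not for production
}
```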

Lab 1 - Implementing DataOps with Terraform

DataOps - Observability