DE by DL.AI - C2: Source Systems, Data Ingestion, and Pipelines (W3 - DataOps)

Anh-Thi Dinh

DataOps - Automation

Overview

  • Recall:
    • DataOps: Set of practices and cultural habits centered around building robust data systems and delivering high quality data products.
    • DevOps: Set of practices and cultural habits that allow software engineers to efficiently deliver and maintain high quality software products.
  • DataOps → Automation, Observability & Monitoring, Incident Response (we don't cover incident response this week because it's the cultural-habits side of DataOps)
  • Automation → CI/CD
  • Infrastructure: AWS CloudFormation, Terraform

Conversation with Chris Bergh (DataOps — Automation)

  • DataOps Definition: Methodology for delivering data insights quickly, reliably, and with high quality.
  • Inspiration: Derived from lean manufacturing, focusing on efficiency and adaptability.
  • Goal: Build a “data factory” to produce consistent and modifiable data outputs.
  • Problem in Traditional Data Engineering: Failures are due to flawed systems, not technology or talent.
  • Key Principle: Build systems around code (e.g., testing, observability) for reliability.
  • Testing: Essential for minimizing future errors and ensuring code quality.
  • Iterative Development: Deliver small updates, get feedback, and iterate quickly.
  • DataOps vs. DevOps: Both focus on quick, quality delivery; DataOps specifically targets data workflows.
  • Don’t Be a Hero: Avoid taking on too much; build systems to support your work.
  • Automation and Validation: Always test and validate; measure everything for reliability.
  • Proactive Systems: Build environments that ensure long-term success and reduce stress.
  • Balance Optimism with Systems: Don’t rely on hope—verify and automate processes for efficiency.

DataOps Automation

  • CI/CD (Continuous Integration and Continuous Delivery)
  • No automation: run all processes manually.
  • Pure scheduling: run stages of your pipeline according to a schedule
  • DAG = a directed acyclic graph → use a tool like Airflow
  • Just like code, data also needs version control → track changes and be able to revert them.
  • The entire infrastructure also needs version control.
  • In some depictions of the data engineering lifecycle, DataOps sits alongside Data Management and Software Engineering as an undercurrent.

Infrastructure as Code

  • Infrastructure-as-code tools: Terraform, AWS CloudFormation (native to AWS), Ansible → allow you to provision and configure your infrastructure using code-based configuration files.
    • No need to manually run bash scripts or click through the console.
  • Terraform Language → Domain-Specific Language: HCL (HashiCorp Configuration Language)
    • HCL is a declarative language. You just have to declare what you want the infrastructure to look like.
  • Terraform is idempotent: if you repeatedly apply the same HCL configuration, your infrastructure keeps the same desired end state as after the first run.
    • vs. imperative/procedural languages like Bash.
    • e.g., a configuration declaring 5 EC2 instances → Terraform makes sure exactly 5 instances exist, while re-running a Bash script would create 5 more each time.
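  • A minimal sketch of this declarative style (hypothetical resource name and placeholder AMI ID, just to illustrate the 5-instance example):
    • # Declare the desired end state: exactly 5 web servers
      resource "aws_instance" "web" {
        count         = 5                       # re-running terraform apply still converges on exactly 5 instances
        ami           = "ami-0c55b159cbfafe1f0" # placeholder AMI ID
        instance_type = "t2.micro"
      }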

Terraform

  • Terraform structure
    • # Blocks of code
      keyword labels {
        arguments
        blocks
      }
  • Sample code
    • # terraform settings
      terraform {
        required_providers {
          aws = { # local name: https://developer.hashicorp.com/terraform/language/providers/requirements#local-names
            source  = "hashicorp/aws" # global identifier
            version = ">= 4.16"
          }
        }

        required_version = ">= 1.2.0" # version constraint for terraform
      }

      # providers
      # https://registry.terraform.io/browse/providers
      provider "aws" { # use local name
        region = "us-east-1"
      }

      # data source
      # In case you want to create an instance inside a subnet that is already created, you can use the following code:
      # https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/subnet
      data "aws_subnet" "selected_subnet" {
        id = "subnet-0c55b159cbfafe1f0" # https://docs.aws.amazon.com/vpc/latest/userguide/vpc-subnets.html
        # ☝ access this by: data.aws_subnet.selected_subnet.id
      }

      # Ask Terraform to get the latest Amazon Linux AMI with a specific name and architecture
      data "aws_ami" "latest_amazon_linux" {
        most_recent = true
        owners      = ["amazon"]
        filter {
          name   = "architecture"
          values = ["x86_64"]
        }
        filter {
          name   = "name"
          values = ["al2023-ami-2023*"] # Amazon Linux 2023 AMI name pattern
        }
      }

      # resources
      # https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance
      resource "aws_instance" "webserver" { # can be referenced as "aws_instance.webserver" in other parts of the code
        # ami = "ami-0c55b159cbfafe1f0" # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-ami.html
        ami           = data.aws_ami.latest_amazon_linux.id
        instance_type = "t2.micro"
        subnet_id     = data.aws_subnet.selected_subnet.id # Without this, the instance will be created in the default VPC.
        tags = {
          Name = var.server_name
        }
      }

      # input
      variable "region" { # to use this variable in the code, use "var.region"
        type        = string
        default     = "us-east-1"
        description = "region for aws resources"
      }

      variable "server_name" {
        type        = string
        # if default is not provided, it will be prompted during terraform apply
        # terraform apply -var server_name=ExampleServer
        # or use a terraform.tfvars file
        description = "name of the server running the website"
      }

      # output
      # https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance
      # terraform output to show all
      # terraform output server_id to show only server_id
      output "server_id" {
        value = aws_instance.webserver.id
      }

      output "server_arn" {
        value = aws_instance.webserver.arn
      }
    • terraform init to install all providers defined in the config file.
    • terraform plan to create execution plan: create / update / destroy based on the config file.
    • terraform apply to confirm and execute the planned tasks.
    • If we have some changes in the file, we can run terraform apply again.
  • Variables can be put inside a terraform.tfvars file.
  • We can split the .tf file into smaller files like variables.tf, outputs.tf, providers.tf, with all resources in main.tf. Terraform will automatically concatenate all .tf files into one.
  • Modules: you can put all the files created in the above sample inside a module "website" and then create some main files instead (check the codes); a minimal sketch of calling such a module is shown after this list.
    • To update changes in a module → run terraform init then terraform apply.
  • Data sources: Data blocks to reference resources created outside Terraform or in another Terraform workspace. Eg: Data source: aws_subnet
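  • A minimal sketch of calling the "website" module mentioned above (the source path and variable values are assumptions based on the sample code):
    • module "website" {
        source      = "./website"        # local path to the module's .tf files
        server_name = "ExampleServer"    # forwarded to the module's server_name variable
        region      = "us-east-1"
      }

      output "server_id" {
        value = module.website.server_id # re-export an output defined inside the module
      }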

Additional terraform configuration

You are given the ID of a VPC created in the us-east-1 region. You will need to create a MySQL RDS database instance inside a subnet of the given VPC. The VPC contains two private subnets; the IDs of these subnets are not given to you.
👉 Check the code.
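A possible sketch under assumed names (the actual variable names, credentials, and tags in the solution may differ): look up the VPC's private subnets with a data block, build a DB subnet group, then create the MySQL instance.
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [var.vpc_id] # the given VPC ID (assumed variable name)
  }
  # If the VPC also contained public subnets, an extra tag filter could select only the private ones.
}

# RDS requires a subnet group spanning at least two subnets
resource "aws_db_subnet_group" "database" {
  name       = "mysql-subnet-group"
  subnet_ids = data.aws_subnets.private.ids
}

resource "aws_db_instance" "mysql" {
  identifier           = "my-mysql-db"   # hypothetical identifier
  engine               = "mysql"
  instance_class       = "db.t3.micro"
  allocated_storage    = 20
  db_subnet_group_name = aws_db_subnet_group.database.name
  username             = var.db_username # assumed variables for the credentials
  password             = var.db_password
  skip_final_snapshot  = true
}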

Lab 1 - Implementing DataOps with Terraform

☝️ Check the code and read the markdown file there.
We are going to build a Bastion Host that users can connect to using an SSH connection.
  • The bastion host runs on an EC2 instance.
  • The main database is inside RDS, which only allows connections from the EC2 instance (the bastion host). An external user cannot connect to the database directly. ← use the aws_security_group resource
The VPC and the public and private subnets (2 of each) are already created using AWS CloudFormation. We will use data blocks to get these resources.
The VPC and subnets are provided, and we need to complete the Terraform files.
We will need an SSH key pair: the public key goes inside EC2 and the private key stays with the user outside.
To create the database and the EC2 instance, we need to complete the .tf files. ← the VPC and subnets were created using AWS CloudFormation (in CloudFormation, click on the stack name that doesn't start with cloud9 → Outputs → see the list of resources created → note the IDs of the VPC and subnets whose values we are going to use in the .tf files).
We need to find the IDs of the VPC and subnets given by AWS CloudFormation.
In providers.tf, we are going to use two security groups:
  • bastion_host → ingress: the bastion host can receive SSH connections from the public internet
  • database → ingress: receives traffic only from the bastion host
→ we will use these network resources in the RDS and EC2 configs
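A minimal sketch of these two security groups (hypothetical names; the port assumes the PostgreSQL database used later in the lab, and the lab files may differ):
resource "aws_security_group" "bastion_host" {
  vpc_id = var.vpc_id # assumed variable holding the VPC ID
  ingress {
    description = "SSH from the public internet"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1" # allow all outbound traffic
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "database" {
  vpc_id = var.vpc_id
  ingress {
    description     = "Database traffic only from the bastion host"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.bastion_host.id] # only traffic from the bastion's security group
  }
}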
Note: we have to specify 2 private subnets instead of 1 because RDS expects at least 2 subnets (in different Availability Zones) in case you later want to switch to a Multi-AZ deployment.
ec2.tf
  • key_name represents the name of the SSH key pair that you need to associate with the bastion host. ← to do so, you will use tls_private_key to generate the SSH key pair
  • local_file creates a file to store the private SSH key.
  • aws_key_pair registers the public key with AWS so that key_name can be used inside the aws_instance.
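A minimal sketch of these three resources (file and key names are assumptions; the lab uses its own naming):
resource "tls_private_key" "bastion" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

resource "local_file" "bastion_private_key" {
  content         = tls_private_key.bastion.private_key_pem
  filename        = "bastion-host-key.pem" # assumed file name for the private key
  file_permission = "0400"
}

resource "aws_key_pair" "bastion" {
  key_name   = "bastion-host-key" # referenced as key_name in the aws_instance
  public_key = tls_private_key.bastion.public_key_openssh
}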
The variables for the VPC and its subnets don't have a default value → you have to set them in the .tfvars file.
terraform.tfstate
  • keeps track of the state of the infrastructure and its configuration
  • contains information that maps the resource instances to the actual AWS objects
  • can be shared with your team by storing it in an S3 bucket, for example
We will use an S3 bucket to store the Terraform state. A DynamoDB table provides locking so that users cannot edit the state file simultaneously.
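A minimal sketch of such a backend configuration (bucket and table names are hypothetical):
terraform {
  backend "s3" {
    bucket         = "de-c2w3lab1-terraform-state" # hypothetical bucket name
    key            = "lab1/terraform.tfstate"      # path of the state file inside the bucket
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"        # hypothetical table used for state locking
  }
}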

Output values
# db_host of rds
terraform output db_host

# db master username
terraform output db_master_username

# db master password
terraform output db_master_password

Try to connect to RDS directly from outside (impossible because RDS only accepts connections from EC2):
psql -h <RDS-HOST> -U postgres_admin -p 5432 -d postgres --password
This command fails because RDS blocks outside connections; it is just a test!
We allow users to connect to RDS via the bastion host (EC2) using an SSH connection; we forward the connection from RDS to EC2 and then to the user with:
ssh -i de-c2w3lab1-bastion-host-key.pem -L 5432:<RDS-HOST>:<DATABASE-PORT> ec2-user@<BASTION-HOST-DNS> -N -f
  • -i: identity file; it specifies the file that contains the SSH private key.
  • -L: local port forwarding (-L LOCAL_PORT:DESTINATION_HOSTNAME:DESTINATION_PORT USER@SERVER_IP): this means you're forwarding the connection through the bastion host to the RDS instance.
  • -f: run the command in the background; -N means "do not execute a remote command".
  • If you'd like to learn more about SSH tunneling, you can check this article.
Then we can connect to RDS:
# please note that 5432 here is the local port that is forwarded
# to the RDS port through the SSH connection
psql -h localhost -U postgres_admin -p 5432 -d postgres --password

DataOps - Observability

Data Observability

  • Observability tools give you visibility into your system's health.
  • Metrics: CPU and RAM, response time
  • Purposes: Quickly detect anomalies, Identify problems, Prevent downtime, Ensure reliable software products
  • DE → monitor the health of data systems, the health and quality of data.
  • High quality data? → accurate, complete, discoverable, available in a timely manner. ← well-defined schema, data definitions.
  • Low quality data is worse than no data.
  • ☝️ When there's a disruption to your data system, how do you make sure you know what happened as soon as possible? → That's where data observability and monitoring come in!

Conversation with Barr Moses (Data Observability & Monitoring)

  1. Introduction to Data Observability:
      • Data observability is akin to software observability but focuses on ensuring data quality and reliability.
      • It addresses the pain point where data teams face inaccurate or inconsistent data, causing disruptions in decision-making.
  2. Importance of Trusted Data:
      • The goal is to help organizations trust and rely on their data, ensuring its accuracy and timeliness.
  3. Core Components of Data Systems:
      • Data systems consist of the data itself, the code transforming it, and the infrastructure managing it.
      • Issues can occur in any of these components, impacting downstream processes and outcomes.
  4. Key Metrics for Data Monitoring:
      • Number of incidents: Tracking how often data issues occur.
      • Time to detection: Measuring how long it takes to identify an issue.
      • Time to resolution: Evaluating the duration needed to fix data issues.
  5. Success Stories:
      • Example: JetBlue used data observability to improve internal team satisfaction (measured through Net Promoter Scores) and enhance operational efficiency.
  6. Importance of Engaging with Stakeholders:
      • Understanding stakeholders' needs is crucial for developing relevant and effective data solutions.
      • Testing solutions with real users is essential for refining products.
  7. Continuous Learning and Adaptation:
      • Staying updated in the rapidly changing data industry is vital.
      • Emphasizing the importance of curiosity and flexibility in learning.
  8. Advice for Learners:
      • Focus on customer needs and market trends rather than relying solely on general advice.
      • Maintain a passion for learning and adapting to new industry developments.

Monitoring Data Quality

  • The question is: where do you start monitoring data quality?
  • Based on metrics like: volume (in each batch), distribution (range of values), null values, freshness (difference between now and the most recent timestamp in the data)
Focus on:
  • The most important metrics.
  • Avoid creating confusion and “alert fatigue”
  • What do stakeholders care about the most?

Conversation with Abe Gong (about Great Expectations)

The conversation discusses the origins, goals, and functions of the open-source project “Great Expectations,” co-founded by Abe Gong and James Campbell. Abe explains that their shared experience with challenging data quality issues in fields like healthcare inspired the project. Great Expectations enables data quality testing similar to software testing by allowing users to set “expectations” about data at specific points in a pipeline to ensure reliability. Abe highlights flexibility in deployment, suggesting users tailor its use based on team and stakeholder requirements. He also emphasizes the importance of data quality for reliable data systems and hints at the upcoming launch of Great Expectations Cloud, which aims to improve accessibility for non-technical stakeholders.

Great Expectations (GX)

Great Expectations enables you to define expectations for your data and to automatically validate your data against these expectations. It can also notify you of any inconsistencies detected, and you can use it to validate your data at any stage of your data pipeline.
When working with GX, the workflow is: (1) specify the data → (2) define your expectations → (3) validate your data against the expectations.

Conversation with Chad Sanderson (Data contract)

There are also other ways to maintain data quality expectations with your upstream stakeholders, including using what's called a data contract.
  • Definition: Data contracts act like APIs, setting data quality expectations.
  • Purpose: Prevents data quality issues between data producers and consumers.
  • Traditional vs. Modern: Older, centralized systems need fewer contracts; cloud-based systems benefit more.
  • Components: Includes schema, business logic, SLAs (Service-level agreement), and compliance rules.
  • Enforcement: Primarily programmatic, not legal.
  • Implementation: Start with high-impact data pipelines; integrate into workflows.
  • Communication: Essential to make producers aware of data impact.

Amazon CloudWatch

Common metrics for RDS:
  • CPU Utilization
    • High value: your RDS instance might be under heavy load
    • Values over 80-90%: can lead to performance bottlenecks
  • RAM Consumption
    • High RAM consumption: can slow down performance
  • Disk Space
    • Value consistently above 85%: you may need to delete or archive data to free up some space
  • Database Connections
    • The number of active connections to your database
    • Number of connections approaching the maximum limit: can lead to connection errors and application failures
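Since the labs manage infrastructure with Terraform, an alarm on one of these metrics could be sketched in HCL as follows (the names, RDS identifier, and threshold are assumptions, not the lab's configuration):
resource "aws_cloudwatch_metric_alarm" "rds_cpu_high" {
  alarm_name          = "rds-cpu-utilization-high" # hypothetical name
  namespace           = "AWS/RDS"
  metric_name         = "CPUUtilization"
  dimensions = {
    DBInstanceIdentifier = "my-rds-instance" # hypothetical RDS identifier
  }
  statistic           = "Average"
  period              = 300 # evaluate the metric over 5-minute windows
  evaluation_periods  = 2
  threshold           = 80  # values over 80-90% can lead to performance bottlenecks
  comparison_operator = "GreaterThanThreshold"
  alarm_description   = "RDS instance might be under heavy load"
}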

Lab 2 — Amazon CloudWatch

Check this code.