👉 Lecture notes & Repository.
- Recall:
- DataOps: Set of practices and cultural habits centered around building robust data systems and delivering high quality data products.
- DevOps: Set of practices and cultural habits that allow software engineers to efficiently deliver and maintain high quality software products.
- DataOps → Automation, Observability & Monitoring, Incident Response (we don't cover Incident Response this week because it is the cultural-habits side of DataOps)
- Automation → CI/CD
- Infrastructure: AWS CloudFormation, Terraform
Conversation with Chris Bergh (DataOps — Automation)
- DataOps Definition: Methodology for delivering data insights quickly, reliably, and with high quality.
- Inspiration: Derived from lean manufacturing, focusing on efficiency and adaptability.
- Goal: Build a “data factory” to produce consistent and modifiable data outputs.
- Problem in Traditional Data Engineering: Failures are due to flawed systems, not technology or talent.
- Key Principle: Build systems around code (e.g., testing, observability) for reliability.
- Testing: Essential for minimizing future errors and ensuring code quality.
- Iterative Development: Deliver small updates, get feedback, and iterate quickly.
- DataOps vs. DevOps: Both focus on quick, quality delivery; DataOps specifically targets data workflows.
- Don’t Be a Hero: Avoid taking on too much; build systems to support your work.
- Automation and Validation: Always test and validate; measure everything for reliability.
- Proactive Systems: Build environments that ensure long-term success and reduce stress.
- Balance Optimism with Systems: Don’t rely on hope—verify and automate processes for efficiency.
- CI/CD (Continuous Integration and Continuous Delivery)
- No automation: run all processes manually.
- Pure scheduling: run stages of your pipeline according to a schedule.
- DAG = directed acyclic graph → use a tool like Airflow
- Just like code, data also needs version control → track changes and be able to revert.
- The entire infrastructure also needs version control.
- In some respects, DataOps sits close to Data Management and Software Engineering.
- Infrastructure-as-code tools: Terraform, AWS CloudFormation (native to AWS), Ansible → allow you to provision and configure infrastructure using code-based configuration files.
- No need to manually run bash scripts or click through a console.
- Terraform Language → Domain-Specific Language: HCL (HashiCorp Configuration Language)
- HCL is a declarative language: you just declare what you want the infrastructure to look like.
- Terraform is idempotent (if you repeatedly apply the same HCL configuration, your infrastructure keeps the same desired end state as after the first run)
- vs. an imperative/procedural language like Bash
- e.g., a script to create 5 EC2 instances → Terraform creates them once and makes sure only 5 instances exist, while rerunning a Bash script may create 5 more instances each time (see the sketch below).
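A minimal sketch of this idempotency example, assuming the AWS provider is configured as in the sample configuration further down (the AMI ID here is the placeholder one from these notes):

```hcl
# Declare 5 EC2 instances. Re-running "terraform apply" on this same config
# keeps exactly 5 instances; it does not create 5 more each time.
resource "aws_instance" "example" {
  count         = 5
  ami           = "ami-0c55b159cbfafe1f0" # placeholder AMI ID from the notes
  instance_type = "t2.micro"
}
```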
- Terraform structure

```hcl
# Blocks of code
keyword labels {
  arguments
  blocks
}
```
- Sample code (full configuration block below):
    - `terraform init` to install all providers defined in the config file.
    - `terraform plan` to create the execution plan: create / update / destroy based on the config file.
    - `terraform apply` to confirm and execute the planned tasks.
    - If we make some changes in the file, we can run `terraform apply` again.
```hcl
# terraform settings
terraform {
  required_providers {
    aws = { # local name: https://developer.hashicorp.com/terraform/language/providers/requirements#local-names
      source  = "hashicorp/aws" # global identifier
      version = ">= 4.16"
    }
  }

  required_version = ">= 1.2.0" # version constraint for terraform
}

# providers
# https://registry.terraform.io/browse/providers
provider "aws" { # use local name
  region = "us-east-1"
}

# data source
# In case you want to create an instance inside a subnet that is already created, you can use the following code:
# https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/subnet
data "aws_subnet" "selected_subnet" {
  id = "subnet-0c55b159cbfafe1f0" # https://docs.aws.amazon.com/vpc/latest/userguide/vpc-subnets.html
  # ☝ access this by: data.aws_subnet.selected_subnet.id
}

# Ask Terraform to get the latest Amazon Linux AMI with a specific name and architecture
data "aws_ami" "latest_amazon_linux" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "architecture"
    values = ["x86_64"]
  }
  filter {
    name   = "name"
    values = ["al202*-ami-202*"]
  }
}

# resources
# https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance
resource "aws_instance" "webserver" { # can be referenced as "aws_instance.webserver" in other parts of the code
  # ami = "ami-0c55b159cbfafe1f0" # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-ami.html
  ami           = data.aws_ami.latest_amazon_linux.id
  instance_type = "t2.micro"
  subnet_id     = data.aws_subnet.selected_subnet.id # Without this, the instance will be created in the default VPC.
  tags = {
    Name = var.server_name
  }
}

# input
variable "region" { # to use this variable in the code, use "var.region"
  type        = string
  default     = "us-east-1"
  description = "region for aws resources"
}

variable "server_name" {
  type = string
  # if default is not provided, it will be prompted during terraform apply
  # terraform apply -var server_name=ExampleServer
  # or use terraform.tfvars file
  description = "name of the server running the website"
}

# output
# https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance
# terraform output to show all
# terraform output server_id to show only server_id
output "server_id" {
  value = aws_instance.webserver.id
}

output "server_arn" {
  value = aws_instance.webserver.arn
}
```
- Variables can be put inside the `terraform.tfvars` file.
- A provider is a plugin or binary file that Terraform uses to create your resources. → Browse Providers | Terraform Registry
- Local names: unique per module; you can choose any local name ← but you should use a provider's preferred local name.
- Resource: `aws_instance`
- Output Values - Configuration Language | Terraform | HashiCorp Developer
- Use `terraform output` to show all outputs, or `terraform output server_id` to show the value of a specific output.
- We can split the `.tf` file into smaller files like `variables.tf`, `outputs.tf`, `providers.tf`, keeping all resources in `main.tf` ← Terraform will automatically concatenate all `.tf` files into one.
- Modules: you can put all the files created in the sample above inside a module "website" and then create some main files instead (check the code; a minimal sketch of calling the module follows).
- To apply changes made in a module → run `terraform init` → `terraform apply`.
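A minimal sketch of a root configuration calling such a module, assuming the module's files live in `./modules/website` and expose the same variables and outputs as the sample above (the path and values are illustrative):

```hcl
# Root main.tf: call the "website" module and re-export one of its outputs.
module "website" {
  source      = "./modules/website"
  region      = "us-east-1"
  server_name = "ExampleServer"
}

output "website_server_id" {
  value = module.website.server_id
}
```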
- Data sources: `data` blocks to reference resources created outside Terraform or in another Terraform workspace. E.g., data source `aws_subnet`.
You are given the ID of a VPC created in the us-east-1 region. You will need to create a MySQL RDS database instance inside a subnet of the given VPC. The VPC contains two private subnets; the IDs of these subnets are not given to you.
👉 Check the code (a possible approach is sketched below).
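A hedged sketch of one way to approach this exercise, assuming the subnets can be discovered by filtering on the given VPC ID (variable, resource, and name choices here are illustrative, not taken from the lab):

```hcl
variable "vpc_id" {
  type        = string
  description = "ID of the pre-created VPC in us-east-1"
}

variable "db_password" {
  type      = string
  sensitive = true
}

# Look up the subnets that belong to the given VPC (the two private subnets).
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [var.vpc_id]
  }
}

# Group the discovered subnets so RDS can be placed inside the VPC.
resource "aws_db_subnet_group" "database" {
  name       = "exercise-db-subnet-group"
  subnet_ids = data.aws_subnets.private.ids
}

# MySQL RDS instance inside a subnet of the given VPC.
resource "aws_db_instance" "mysql" {
  identifier           = "exercise-mysql"
  engine               = "mysql"
  instance_class       = "db.t3.micro"
  allocated_storage    = 20
  username             = "dbadmin"
  password             = var.db_password # set via terraform.tfvars
  db_subnet_group_name = aws_db_subnet_group.database.name
  skip_final_snapshot  = true
}
```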
Check the code and read the markdown file there.
We are going to build a bastion host so that users can connect to it over SSH.
- The bastion host is an EC2 instance.
- The main database is in RDS, which only allows connections from the EC2 bastion host; an external user cannot connect to the database directly. ← use the resource `aws_security_group`
The VPC and the public and private subnets (2 of each) are already created using AWS CloudFormation. We will use `data` blocks to get these resources. We will need an SSH key pair: the public key goes inside the EC2 instance and the private key stays outside.
To create the database and the EC2 instance, we need to complete the `.tf` files. ← the networking resources were created with AWS CloudFormation (in CloudFormation, click on the stack name that doesn't start with cloud9 → Outputs → see the list of resources created → note the IDs of the VPC and subnets whose values we are going to reference in the `.tf` files).
- In `providers.tf`, we are going to use the following providers (a minimal `required_providers` sketch follows this list):
    - `hashicorp/local` to create a file to store the private SSH key.
    - `hashicorp/random` to generate a random password for the database.
    - `hashicorp/tls` to create the SSH key pair.
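A minimal sketch of what the provider requirements in `providers.tf` might look like with these providers (version constraints other than the AWS one are assumptions):

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.16"
    }
    local = {  # writes the private SSH key to a local file
      source = "hashicorp/local"
    }
    random = { # generates the random database password
      source = "hashicorp/random"
    }
    tls = {    # generates the SSH key pair
      source = "hashicorp/tls"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}
```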
- In `networks.tf`:
    - `bastion_host` → ingress: the bastion host can receive SSH connections from the public internet.
    - `database` → ingress: receives traffic only from the bastion host.
    - → we will use these network resources in the RDS and EC2 configs (a security-group sketch follows this list).
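A hedged sketch of the two security groups described above, assuming the CloudFormation-created VPC is looked up by ID and that the database listens on PostgreSQL's port 5432 (the variable, data source, and resource names are assumptions, not taken from the lab files):

```hcl
variable "vpc_id" {
  type = string
}

# Assumed lookup of the VPC created by CloudFormation.
data "aws_vpc" "main" {
  id = var.vpc_id
}

# Security group for the bastion host: SSH in from anywhere, all traffic out.
resource "aws_security_group" "bastion_host" {
  vpc_id = data.aws_vpc.main.id

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Security group for the database: accepts traffic only from the bastion host.
resource "aws_security_group" "database" {
  vpc_id = data.aws_vpc.main.id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.bastion_host.id]
  }
}
```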
- In `ec2.tf`:
    - `key_name` represents the name of the SSH key pair that you need to associate with the bastion host. ← to do so, you will use (see the sketch after this list):
        - `tls_private_key` to generate the SSH key pair.
        - `local_file` to create a file to store the private SSH key.
        - `aws_key_pair` to register the public key with AWS, so the `key_name` can be used inside the `aws_instance`.
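A minimal sketch of how those three resources can be wired together, reusing the `data.aws_ami.latest_amazon_linux` lookup from the sample configuration above and the `bastion_host` security group from the previous sketch (resource and file names are illustrative assumptions):

```hcl
# Generate the SSH key pair.
resource "tls_private_key" "bastion" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

# Store the private key locally so users can SSH into the bastion host.
resource "local_file" "bastion_private_key" {
  content         = tls_private_key.bastion.private_key_pem
  filename        = "bastion-host-key.pem"
  file_permission = "0400"
}

# Register the public key with AWS.
resource "aws_key_pair" "bastion" {
  key_name   = "bastion-host-key"
  public_key = tls_private_key.bastion.public_key_openssh
}

# Attach the key pair to the bastion host instance via key_name.
resource "aws_instance" "bastion_host" {
  ami                    = data.aws_ami.latest_amazon_linux.id
  instance_type          = "t2.micro"
  key_name               = aws_key_pair.bastion.key_name
  vpc_security_group_ids = [aws_security_group.bastion_host.id]
}
```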
The variables for the VPC and its subnets don't have default values → you have to set them in the `.tfvars` file.
- `terraform.tfstate`:
    - keeps track of the state of the architecture and its configuration
    - contains information that maps the resource instances to the actual AWS objects.
    - can be shared with your team by storing it in an S3 bucket, for example.
We will use an S3 bucket to store the Terraform state. A DynamoDB table will prevent users from editing the state file simultaneously (see the backend sketch below).
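A minimal sketch of such a remote backend configuration (the bucket, key, and table names are assumptions):

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"  # S3 bucket holding terraform.tfstate
    key            = "lab/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"       # DynamoDB table used for state locking
    encrypt        = true
  }
}
```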
Output values

```bash
# db_host of rds
terraform output db_host

# db master username
terraform output db_master_username

# db master password
terraform output db_master_password
```
Try to connect to RDS directly from outside (impossible because RDS only accepts connections from EC2):

```bash
psql -h <RDS-HOST> -U postgres_admin -p 5432 -d postgres --password
```

This is just a test; the command fails because RDS blocks outside connections!
We allow users to connect to RDS via the bastion host (EC2) using an SSH connection; the connection is forwarded from the user through EC2 to RDS using:

```bash
ssh -i de-c2w3lab1-bastion-host-key.pem -L 5432:<RDS-HOST>:<DATABASE-PORT> ec2-user@<BASTION-HOST-DNS> -N -f
```

- `-i`: identity file; it specifies the file that contains the SSH private key.
- `-L`: local port forwarding (`-L LOCAL_PORT:DESTINATION_HOSTNAME:DESTINATION_PORT USER@SERVER_IP`): this forwards the local connection through the bastion host to the RDS instance.
- `-f`: run the command in the background; `-N` means "do not execute a remote command".
- If you'd like to learn more about SSH tunneling, you can check this article.
Then we can connect to RDS:

```bash
# note that local port 5432 is forwarded to the same port
# on RDS through the SSH connection
psql -h localhost -U postgres_admin -p 5432 -d postgres --password
```
- Observability tools give visibility into the system's health.
- Metrics: CPU and RAM, response time
- Purposes: Quickly detect anomalies, Identify problems, Prevent downtime, Ensure reliable software products
- DE → monitor the health of data systems, the health and quality of data.
- High quality data? → accurate, complete, discoverable, available in a timely manner. ← well-defined schema, data definitions.
- Low quality data is worse than no data.
- ☝️ When there's a disruption to your data system, how do you make sure you know what happened as soon as possible? → This is where data observability and monitoring come in!
Conversation with Barr Moses (Data Observability & Monitoring)
- Introduction to Data Observability:
- Data observability is akin to software observability but focuses on ensuring data quality and reliability.
- It addresses the pain point where data teams face inaccurate or inconsistent data, causing disruptions in decision-making.
- Importance of Trusted Data:
- The goal is to help organizations trust and rely on their data, ensuring its accuracy and timeliness.
- Core Components of Data Systems:
- Data systems consist of the data itself, the code transforming it, and the infrastructure managing it.
- Issues can occur in any of these components, impacting downstream processes and outcomes.
- Key Metrics for Data Monitoring:
- Number of incidents: Tracking how often data issues occur.
- Time to detection: Measuring how long it takes to identify an issue.
- Time to resolution: Evaluating the duration needed to fix data issues.
- Success Stories:
- Example: JetBlue used data observability to improve internal team satisfaction (measured through Net Promoter Scores) and enhance operational efficiency.
- Importance of Engaging with Stakeholders:
- Understanding stakeholders' needs is crucial for developing relevant and effective data solutions.
- Testing solutions with real users is essential for refining products.
- Continuous Learning and Adaptation:
- Staying updated in the rapidly changing data industry is vital.
- Emphasizing the importance of curiosity and flexibility in learning.
- Advice for Learners:
- Focus on customer needs and market trends rather than relying solely on general advice.
- Maintain a passion for learning and adapting to new industry developments.
- The question is: where do you start monitoring data quality?
- Based on metrics like: volume (in each batch), distribution (range of values), null values, freshness (difference between now and the most recent timestamp in the data)
→ Focus on:
- The most important metrics.
- Avoid creating confusion and “alert fatigue”
- What do stakeholders care the most?
The conversation discusses the origins, goals, and functions of the open-source project “Great Expectations,” co-founded by Abe Gong and James Campbell. Abe explains that their shared experience with challenging data quality issues in fields like healthcare inspired the project. Great Expectations enables data quality testing similar to software testing by allowing users to set “expectations” about data at specific points in a pipeline to ensure reliability. Abe highlights flexibility in deployment, suggesting users tailor its use based on team and stakeholder requirements. He also emphasizes the importance of data quality for reliable data systems and hints at the upcoming launch of Great Expectations Cloud, which aims to improve accessibility for non-technical stakeholders.
Great Expectations enables you to define expectations for your data and to automatically validate your data against these expectations. It can also notify you of any inconsistencies detected, and you can use it to validate your data at any stage of your data pipeline.
When working with GX, the workflow: (1) Specify the data → (2) Define your expectations → (3) Validate your data against the expectations.
There are also other ways to maintain data quality expectations with your upstream stakeholders, including using what's called a data contract.
- Definition: Data contracts act like APIs, setting data quality expectations.
- Purpose: Prevents data quality issues between data producers and consumers.
- Traditional vs. Modern: Older, centralized systems need fewer contracts; cloud-based systems benefit more.
- Components: Includes schema, business logic, SLAs (Service-level agreement), and compliance rules.
- Enforcement: Primarily programmatic, not legal.
- Implementation: Start with high-impact data pipelines; integrate into workflows.
- Communication: Essential to make producers aware of data impact.
Common metrics for RDS (a CloudWatch alarm sketch in HCL follows this list):
- CPU Utilization
- High value: your RDS instance might be under heavy load
- Values over 80-90%: can lead to performance bottlenecks
- RAM Consumption
- High RAM consumption: can slow down performance
- Disk Space
- Value consistently above 85%: you may need to delete or archive data to free up some space
- Database Connections
- The number of active connections to your database
- Number of connections approaching the maximum limit: can lead to connection errors and application failures
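As an illustration of monitoring one of these metrics as code, here is a hedged sketch of a CloudWatch alarm on RDS CPU utilization, in the same HCL style as the labs above (the threshold, names, and the `db_instance_identifier` variable are assumptions):

```hcl
variable "db_instance_identifier" {
  type        = string
  description = "Identifier of the RDS instance to monitor"
}

# Alarm when average CPU utilization stays above 80% for two 5-minute periods.
resource "aws_cloudwatch_metric_alarm" "rds_cpu_high" {
  alarm_name          = "rds-cpu-utilization-high"
  namespace           = "AWS/RDS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    DBInstanceIdentifier = var.db_instance_identifier
  }
}
```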
Check this code.