👉 Lecture notes & Repository.
- Recall:
- DataOps: Set of practices and cultural habits centered around building robust data systems and delivering high quality data products.
- DevOps: Set of practices and cultural habits that allow software engineers to efficiently deliver and maintain high quality software products.
- DataOps → Automation, Observability & Monitoring, Incident Response (we don't cover Incident Response this week because it is the cultural-habits side of DataOps)
- Automation → CI/CD
- Infrastructure: AWS CloudFormation, Terraform
Conversation with Chris Bergh
- DataOps Definition: Methodology for delivering data insights quickly, reliably, and with high quality.
- Inspiration: Derived from lean manufacturing, focusing on efficiency and adaptability.
- Goal: Build a “data factory” to produce consistent and modifiable data outputs.
- Problem in Traditional Data Engineering: Failures are due to flawed systems, not technology or talent.
- Key Principle: Build systems around code (e.g., testing, observability) for reliability.
- Testing: Essential for minimizing future errors and ensuring code quality.
- Iterative Development: Deliver small updates, get feedback, and iterate quickly.
- DataOps vs. DevOps: Both focus on quick, quality delivery; DataOps specifically targets data workflows.
- Don’t Be a Hero: Avoid taking on too much; build systems to support your work.
- Automation and Validation: Always test and validate; measure everything for reliability.
- Proactive Systems: Build environments that ensure long-term success and reduce stress.
- Balance Optimism with Systems: Don’t rely on hope—verify and automate processes for efficiency.
- CI/CD (Continuous Integration and Continuous Delivery)
- No automation: run all processes manually.
- Pure scheduling: run stages of your pipeline according to a schedule.
- DAG = a directed acyclic graph → use a tool like Airflow.
- Just like code, data also needs version control → track changes and be able to revert.
- The entire infrastructure also needs version control.
- In some specifications, DataOps sits near Data Management and Software Engineering (SE).
- Infrastructure-as-code tools: Terraform, AWS CloudFormation (native to AWS), Ansible → these allow engineers to provision and configure infrastructure using code-based configuration files.
- No need to manually run bash scripts or click through a console.
- Terraform Language → Domain-Specific Language: HCL (HashiCorp Configuration Language)
- HCL is a declarative language. You just have to declare what you want the infrastructure to look like.
- Terraform is idempotent (if you repeatedly execute the same HCL configuration, your infrastructure will maintain the same desired end-state as the first time you ran it)
- vs an imperative/procedural language like Bash
- e.g., a script to create 5 EC2 instances → Terraform creates the instances and makes sure only 5 exist, while re-running a Bash script may create 5 more instances each time (see the sketch below).
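- A minimal sketch of the declarative style, assuming a placeholder AMI ID and resource name (not from the lesson):

```hcl
# Desired end-state: exactly 5 instances.
# Re-running "terraform apply" converges back to this state instead of
# creating 5 more instances, as re-running an imperative Bash loop would.
resource "aws_instance" "workers" {
  count         = 5                       # how many instances should exist
  ami           = "ami-0c55b159cbfafe1f0" # placeholder AMI ID
  instance_type = "t2.micro"
}
```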
- Sample code
- `terraform init` to install all providers defined in the config file.
- `terraform plan` to create an execution plan: create / update / destroy, based on the config file.
- `terraform apply` to confirm and execute the tasks.
- If we make some changes in the file, we can run `terraform apply` again.
```hcl
# terraform settings
terraform {
  required_providers {
    aws = { # local name: https://developer.hashicorp.com/terraform/language/providers/requirements#local-names
      source  = "hashicorp/aws" # global identifier
      version = ">= 4.16"
    }
  }

  required_version = ">= 1.2.0" # version constraint for terraform
}

# providers
# https://registry.terraform.io/browse/providers
provider "aws" { # use the local name
  region = "us-east-1"
}

# data sources
# In case you want to create an instance inside a subnet that is already created, you can use the following code:
# https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/subnet
data "aws_subnet" "selected_subnet" {
  id = "subnet-0c55b159cbfafe1f0" # https://docs.aws.amazon.com/vpc/latest/userguide/vpc-subnets.html
  # ☝ access this by: data.aws_subnet.selected_subnet.id
}

data "aws_ami" "latest_amazon_linux" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "architecture"
    values = ["x86_64"]
  }
  filter {
    name   = "name"
    values = ["al202*-ami-202*"]
  }
}

# resources
# https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance
resource "aws_instance" "webserver" { # can be referenced as "aws_instance.webserver" in other parts of the code
  # ami = "ami-0c55b159cbfafe1f0" # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-ami.html
  ami           = data.aws_ami.latest_amazon_linux.id
  instance_type = "t2.micro"
  subnet_id     = data.aws_subnet.selected_subnet.id # without this, the instance will be created in the default VPC
  tags = {
    Name = var.server_name
  }
}

# input variables
variable "region" { # to use this variable in the code, use "var.region"
  type        = string
  default     = "us-east-1"
  description = "region for aws resources"
}

variable "server_name" {
  type = string
  # if default is not provided, it will be prompted during terraform apply
  # terraform apply -var server_name=ExampleServer
  # or use a terraform.tfvars file
  description = "name of the server running the website"
}

# outputs
# https://developer.hashicorp.com/terraform/language/values/outputs
# terraform output            → show all outputs
# terraform output server_id  → show only server_id
output "server_id" {
  value = aws_instance.webserver.id
}

output "server_arn" {
  value = aws_instance.webserver.arn
}
```
- Variables can be put inside a `terraform.tfvars` file.
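- For example, a `terraform.tfvars` file matching the variables above might look like this (values are illustrative):

```hcl
# terraform.tfvars is picked up automatically by terraform plan/apply
region      = "us-east-1"
server_name = "ExampleServer"
```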
- A provider is a plugin or binary that Terraform uses to create your resources. → Browse Providers | Terraform Registry
- Local Names: unique per module; you can choose any local name ← but you should use a provider's preferred local name.
- Resource: `aws_instance`
- Output Values - Configuration Language | Terraform | HashiCorp Developer
- Use `terraform output` to show all outputs, or `terraform output server_id` to show the value of a specific output.
- We can split the `.tf` file into smaller files like `variables.tf`, `outputs.tf`, `providers.tf`, with all resources in `main.tf` ← Terraform automatically concatenates all `.tf` files into one.
- Modules: you can put all the files created in the above sample inside a module “website” and then create a few root files instead (check the code; a sketch follows below).
- To pick up changes in a module → run `terraform init` → `terraform apply`.
You are given the ID of a VPC created in the us-east-1 region. You will need to create a MySQL RDS database instance inside a subnet of the given VPC. The VPC contains two private subnets; the IDs of these subnets are not given to you.
👉 Check the code.
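- A hedged sketch of one possible approach (resource names, instance size, and the `db_password` variable are assumptions, not necessarily the linked solution):

```hcl
# Input: the VPC ID you are given
variable "vpc_id" {
  type        = string
  description = "ID of the VPC provided for the exercise"
}

variable "db_password" { # assumed variable; never hard-code credentials
  type      = string
  sensitive = true
}

# Look up the subnet IDs inside the given VPC, since they are not given directly
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [var.vpc_id]
  }
}

# RDS requires a subnet group spanning at least two subnets
resource "aws_db_subnet_group" "database" {
  name       = "database-subnet-group"
  subnet_ids = data.aws_subnets.private.ids
}

resource "aws_db_instance" "mysql" {
  identifier           = "exercise-mysql"
  engine               = "mysql"
  instance_class       = "db.t3.micro"
  allocated_storage    = 20
  username             = "admin"
  password             = var.db_password
  db_subnet_group_name = aws_db_subnet_group.database.name
  skip_final_snapshot  = true # allow terraform destroy without a final snapshot
}
```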