Enterprise Architecture: the design of systems to support change in an enterprise, achieved by flexible and reversible decisions reached through a careful evaluation of trade-offs
Data Architecture: the design of systems to support the evolving data needs of an enterprise, achieved by flexible and reversible decisions reached through a careful evaluation of trade-offs.
One-way and Two-way Decisions:
One-way: A decision that is almost impossible to reverse
Two-way: An easily reversible decision. Eg: S3 Object Storage Classes
You can often convert a one-way decision into multiple smaller two-way decisions.
If you're thinking like an architect in your role as a data engineer, you'll build technical solutions that exist not just for their own sake, but in direct support of business goals
Conway’s Law
A very important guiding principle that has shaped virtually every architecture and system.
Any organization that designs a system will produce a design whose structure is a copy of the organization’s communication structure.
As a data engineer, understanding your organization's communication structure is crucial for designing an effective data architecture that aligns with your company's needs and processes.
Common goals: facilitate team collaboration; break down silos.
“Wise” choice: Identify tools that benefit all teams; avoid a one-size-fits-all approach.
As a data engineer, I also recommend you seek mentorship from data architects in your organization or elsewhere because eventually, you may well occupy the role of architect yourself.
Always Architecting
Systems built around reversible decisions (two-way doors) allow you to always be architecting.
All teams must use APIs to communicate as well as to serve data and functionality.
Make reversible decisions ↔ Always be architecting ↔ Build loosely coupled systems
Loosely-coupled systems:
Building loosely-coupled systems helps you ensure your decisions are reversible.
Loosely-coupled systems give you the ability to always be architecting.
When your systems fail
Things will go wrong no matter what.
Even when your systems fail, your architecture should still serve the needs of your organization:
Plan for failure
Architect for scalability
Prioritize security
Embrace FinOps
Plan for failure: take a practical and quantitative approach with these metrics (a quick downtime calculation follows below):
Availability: The percentage of time an IT service or a component is expected to be in an operable state.
Reliability: The probability of a particular service or component performing its intended function during a particular time interval
Durability: The ability of a storage system to withstand data loss due to hardware failure, software errors, or natural disasters.
RTO vs RPO
Recovery Time Objective (RTO): The maximum acceptable time for a service or system outage. Ex: Consider the impact to customers.
Recovery Point Objective (RPO): A definition of the acceptable state after recovery. Ex: Consider the maximum acceptable data loss
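A quick, hedged back-of-the-envelope calculation (a minimal Python sketch; the percentages are just common availability targets) showing how an availability figure translates into allowed downtime per year:
import math  # not strictly needed, kept minimal

# Convert an availability target into the downtime budget it allows per year.
HOURS_PER_YEAR = 24 * 365

for availability in (0.99, 0.999, 0.9999):
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{availability:.2%} availability = about {downtime_hours:.1f} hours of downtime per year")
For example, a 99.9% target allows roughly 8.8 hours of downtime per year, which helps make RTO discussions concrete.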
Prioritize Security
3 aspects: Culture of security, Principle of least privilege and Zero-trust security.
(old and risky): the hardened-perimeter approach, where everything inside the network boundary is trusted by default.
Zero-trust security: every action requires authentication + you build your system such that no people or applications, internal or external are trusted by default
Embrace FinOps: optimize your day-to-day workloads in terms of cost and performance ← manage cloud cost, since cloud services are pay-as-you-go and readily scalable.
Batch Architectures
Data is ingested and processed as a series of large chunks (e.g., once per day or per hour).
Example: ETL (Extract-Transform-Load) and ELT (see the sketch after this list)
Example: using batches in a process
Sometimes, we also need data marts (subsets of the warehouse serving a specific team or department).
Plan for failure:
When data sources go down or schemas change → the first step is to talk to the source system owners.
Also check the availability and reliability specs.
Make reversible decisions: build flexibility into your system (e.g., the ability to ingest one day’s worth of data, two days’ worth, or some other amount).
Also run a cost-benefit analysis when building common components → make sure they deliver value for the business.
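A minimal sketch of a batch ETL step in Python, assuming a hypothetical daily CSV extract of orders (the file paths, column names, and the pandas/pyarrow dependency are all assumptions, not the lab's actual code):
import pandas as pd  # requires pandas (and pyarrow for Parquet output)

def run_daily_batch(input_path: str, output_path: str) -> None:
    # Extract: read one day's chunk of raw data
    orders = pd.read_csv(input_path)
    # Transform: drop incomplete rows and aggregate sales per product
    orders = orders.dropna(subset=["product_id", "amount"])
    daily_sales = orders.groupby("product_id", as_index=False)["amount"].sum()
    # Load: write the result where downstream consumers (e.g., a data mart) can read it
    daily_sales.to_parquet(output_path, index=False)

run_daily_batch("raw/orders_2025-01-01.csv", "marts/daily_sales.parquet")
Reversibility here means the same function could just as easily ingest two days of files or a different date range.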
Streaming Architectures
Data is produced as a series of events (clicks, sensor measurements, …) in a continuous fashion (near real-time); see the sketch after this list.
Lambda architecture: data engineers needed to figure out how to reconcile batch and streaming data into a single architecture.
Challenges: managing parallel systems with different code bases.
Kappa Architecture: Uses a stream processing platform for all data handling, ingestion, storage, and serving. It's a true event-based architecture where updates are automatically sent to relevant consumers for immediate reaction.
“You can think of batch processing as a special case of streaming” simply means that, since data is just information about events that are happening continuously out in the world, essentially all data is streaming at its source. Therefore, streaming ingestion could be thought of as the most natural or basic approach, while batch ingestion just imposes arbitrary boundaries on that same stream of data.
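A hedged sketch of how a single click event might be pushed into a stream with boto3 and Kinesis Data Streams (the stream name and event fields are hypothetical):
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"user_id": "u123", "action": "click", "product_id": "p456"}
kinesis.put_record(
    StreamName="user-activity",                 # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),     # each event is sent as it happens
    PartitionKey=event["user_id"],              # events from one user go to the same shard
)
In a Kappa-style architecture, a batch job would simply replay a bounded window of these same events.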
Architecting for compliance
Regulatory compliance is probably the most boring topic in these courses.
As a DE, you need to plan flexibly for different compliance requirements.
Constantly complying: not only regulations of today but also regulations of tomorrow
Use loosely coupled components (so they can be replaced at any time)
Some kinds:
General Data Protection Regulation (GDPR) in Europe (EU) → personal data → right to have your data deleted…
Regulations apply not only by region but also by company or industry type. For example: the Health Insurance Portability and Accountability Act (HIPAA) in the US.
Choosing the Right Technologies
Choosing tools and technologies
Different types: Open source software vs Managed open source software vs Proprietary software
Keep in mind the end goal! → Deliver high-quality data products. You focus on the data architecture (what, why, and when); the tools are the how.
Considerations
Location
On-Premises: company owns and maintains the hardware and software for their data stack.
Cloud provider: is responsible for building and maintaining the hardware in data centers.
You rent the compute and storage resources.
It’s easy to scale down and up.
You don’t need to maintain or provision any hardware.
Or a hybrid approach (a mix of on-premises and cloud).
Monolith vs Modular systems
Monolithic systems: self-contained systems made up of tightly coupled components.
Pros: easy to understand and reason about + you deal with only one technology ← good for simplicity
Cons: hard to maintain + updating one component often requires updating others too.
Modular systems: interoperability, flexible and reversible decisions, continuous improvement.
Cost optimization and Business value
Total Cost of Ownership (TCO): The total estimated cost of a solution, project or initiative over its entire lifecycle. — hardware & software, maintenance, training.
Direct costs: costs that are easy to identify and directly attributable to the development of a data product. Ex: salaries, cloud bills, software subscriptions,…
Indirect costs (overhead): expenses that are not directly attributable to the development of a data product. Ex: network downtime, IT support, loss of productivity.
For hardware & software: CapEx vs OpEx (a rough comparison follows this list)
Capital Expenses (CapEx): the payment made to purchase long-term fixed assets ← On-Premises
Operational Expenses (OpEx): Expense associated with running the day-to-day operations (pay-as-you-go) ← Cloud
Total Opportunity Cost of Ownership (TOCO): The cost of lost opportunities that you incur in choosing a particular tool or technology
TOCO = 0 when you choose the right, optimized stack of technologies. However, components change over time and TOCO will increase.
Minimize TOCO → choose loosely coupled components so that we can replace them with more optimized ones.
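A rough, hedged back-of-the-envelope comparison of CapEx vs OpEx in Python (every number here is made up purely for illustration):
# On-premises: large upfront CapEx amortized over several years, plus running costs
capex_hardware = 120_000          # hypothetical upfront server purchase
amortization_years = 4
yearly_maintenance = 15_000       # hypothetical power, cooling, support contracts
on_prem_per_year = capex_hardware / amortization_years + yearly_maintenance

# Cloud: pay-as-you-go OpEx
cloud_monthly_bill = 3_500        # hypothetical monthly pay-as-you-go bill
cloud_per_year = cloud_monthly_bill * 12

print(f"On-premises per year: ${on_prem_per_year:,.0f}")
print(f"Cloud per year:       ${cloud_per_year:,.0f}")
A real TCO comparison would also fold in indirect costs (downtime, IT support, lost productivity) on both sides.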
Build your own solution: be careful not to reinvent the wheel
Use an existing solution → choose between: open source (community) vs commercial open source (vendor) vs proprietary non-open-source.
Considerations:
Your team:
Does your team have the bandwidth and capabilities to implement an open source solution?
Are you a small team? Could using a managed or proprietary service free up your time?
Cost:
How much are licensing fees associated with managed or proprietary services?
What is the total cost to build and maintain a system?
Business value:
Do you get some advantage by building your own system compared to a managed service?
Are you avoiding undifferentiated heavy lifting?
Advice: for most teams (particularly small teams) → use open source, or purchase commercial open source → if you can’t find a suitable solution, buy a proprietary one. ← This lets your team focus on the main things.
Avoiding undifferentiated heavy lifting means avoiding doing work that costs significant time and resources but doesn’t add value in terms of cost savings or advantage for the business.
Server, Container, and Serverless Compute Options
Server: you set up and manage the server → update the OS, install/update packages. Ex: EC2 instance.
Container: Modular unit that packages code and dependencies to run on a server → lightweight & portable. You set up the application code and dependencies
Serverless: You don’t need to set up or maintain the server → automatic scaling, availability & fault-tolerance, pay-as-you-go
→ Recommendation: use serverless first, then containers and orchestration if possible (a minimal serverless sketch follows).
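A minimal sketch of what the serverless option looks like in practice, assuming a Python AWS Lambda handler (the event shape is hypothetical):
import json

def lambda_handler(event, context):
    # AWS invokes this function on demand; you supply only code and dependencies,
    # while scaling, availability, and fault tolerance are handled for you.
    records = event.get("records", [])
    return {
        "statusCode": 200,
        "body": json.dumps({"processed": len(records)}),
    }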
How the undercurrents impact your decisions
Last week, we talked about 6 undercurrents: Security, Data Management, DataOps, Data Architecture, Orchestration, and Software Engineering.
Security:
Check what are the security features of the tool.
Know where your tools come from.
Use tools from reputable sources.
Check the source codes of open-source tools.
Data Management: How are data governance practices implemented by the tool provider? → Data Breach, Compliance, Data Quality.
DataOps: What features does the tool offer in terms of automation and monitoring? → Automation, Monitoring, Service Level Agreement (SLA)
Data Architecture: does the tool provide modularity and interoperability? → choose loosely-coupled components.
The AWS Well-Architected Framework is built around six pillars.
Operational Excellence: How you can develop and run your workloads on AWS more effectively.
Monitor your systems to gain insight into your operations.
Continuously improve your processes and procedures to deliver business value.
Security: How to take advantage of cloud technologies to protect your data, systems, and assets.
Reliability: Everything from designing for reliability to planning for failure and adapting to change.
Performance Efficiency
Taking a data-driven approach to building high-performance architecture.
Evaluating the ability of computing resources to efficiently meet system requirements.
How you can maintain that efficiency as demand changes and technologies evolve.
Cost Optimization
Building systems to deliver maximum business value at the lowest possible price point.
Use AWS Cost Explorer and Cost Optimization Hub to make comparisons and get recommendations about how to optimize costs for your systems.
Sustainability
Consider the environmental impact of the workloads you're running on the cloud.
Reducing energy consumption and increasing efficiency across all components of your system.
Benefits of these pillars:
Set of principles and questions
Helps you design and operate reliable, secure, efficient, cost-effective, and sustainable systems in the cloud
Helps you think through the pros and cons of different architecture choices
A Lens is essentially an extension of the AWS Well-Architected Framework that focuses on a particular area, industry, or technology stack, and provides guidance specific to those contexts.
Configure the ALB (Application Load Balancer) to only receive certain types of requests.
A request needs the address and the port number.
A port number is a virtual identifier that applications use to differentiate between types of traffic.
Some private data can be leaked via an open port due to incorrect configuration. ← To fix this, we need to adjust the security rules, known as security groups, of the load balancer. ← Where: EC2 > Network & Security > Security Groups ← Here there are 2 groups:
One is for EC2 (…-ec2-…), which should only receive traffic from the ALB ← make sure of that!
Check in “Inbound rules” that the “Source” of this group is the ID of the ALB’s security group
The other (…-alb-…) is for the ALB, which receives traffic from outside users (via some configured ports)
Edit inbound rules: 0.0.0.0/0 = accepts all IP addresses (a scripted check follows below)
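If you prefer to check these rules from code, a hedged boto3 sketch (the group-name filter patterns are assumptions based on the naming above):
import boto3

ec2 = boto3.client("ec2")
resp = ec2.describe_security_groups(
    Filters=[{"Name": "group-name", "Values": ["*-ec2-*", "*-alb-*"]}]
)
for sg in resp["SecurityGroups"]:
    for rule in sg["IpPermissions"]:
        # Each inbound rule lists a port range plus either CIDR ranges (e.g. 0.0.0.0/0)
        # or source security groups (e.g. the ALB's security group)
        print(
            sg["GroupName"],
            rule.get("FromPort"), rule.get("ToPort"),
            [r["CidrIp"] for r in rule.get("IpRanges", [])],
            [g["GroupId"] for g in rule.get("UserIdGroupPairs", [])],
        )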
Check EC2 Availability Zone (AZ)
Eg: us-east-1a, us-east-1b
To change the instance type from t3.micro to t3.nano (for example) → modify the launch template of the auto scaling group → Where: EC2 > Auto Scaling > Auto Scaling Groups > Launch template > Edit.
Don’t forget to terminate all t3.micro instances that are running: EC2 > Instances > Instances > check all instances > (right click on any one of them) > Terminate instance
Enable the automatic scaling process: EC2 > Auto Scaling > Auto Scaling Groups > (group name) > Automatic scaling > Create dynamic scaling policy (a scripted equivalent follows this list)
target group: the group of EC2 instances behind the ALB that receives the requests
target value: e.g., 60 ← when there are more than 60 requests per target, the auto scaling group may scale out to add more instances to handle the increased load
instance warmup: this warm-up time refers to the period during which newly launched instances are allowed to fully initialize before being counted in the auto scaling evaluation metrics.
To monitor the auto scaling: EC2 > Auto Scaling > Auto Scaling Groups > (group name) > Monitoring > EC2 ← here we can see some activities in the CPU usage and also in the network metrics.
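The same dynamic scaling policy can also be created programmatically; a hedged boto3 sketch (the group name, policy name, and resource label are hypothetical placeholders):
import boto3

autoscaling = boto3.client("autoscaling")
autoscaling.put_scaling_policy(
    AutoScalingGroupName="de-c1w4-asg",               # hypothetical group name
    PolicyName="alb-request-count-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            "ResourceLabel": "app/my-alb/123abc/targetgroup/my-tg/456def",  # hypothetical
        },
        "TargetValue": 60.0,        # scale out when requests per target exceed ~60
    },
    EstimatedInstanceWarmup=300,    # seconds a new instance gets to warm up
)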
Stakeholder Management & Gathering Requirements
Overview
What we’ve learned so far
What we will learn this week (week 4):
Requirements
Requirements form a hierarchy of needs.
Conversation with Matt Housley / with the CTO
Matt Housley plays the role of the CTO (he is a co-author of the book “Fundamentals of Data Engineering”)
His advice:
First, learn and play with core data concepts.
Build your mental framework to think like a DE.
DE asks CTO about the overall goals of the business and technology ← eg. eCommerce:
Challenge: the market evolves and a lot of small brands are coming up. ← old code still running could cause outages.
→ We will do refactoring → but how will it affect you as a DE? ← DE: help modernize the systems
CTO: we want to make sure that the output of the refactored code is suitable for analytics with minimal processing. ← DE: what tools can I expect to be working with?
CTO: convert more from a batch-based approach to a streaming-based approach → using AWS Kinesis + Kafka
CTO: you’re going to work as a consultant with the software side to understand the data they’re producing and then help them to produce better quality data.
DE: Does the company have any goals w.r.t. AI? ← CTO: we’re working on a recommendation engine, so your job is to develop the data pipelines that feed that recommendation engine.
Conversation with Marketing
Back to previous weeks → 2 primary requests
delivering dashboards with metrics on product sales
product recommendation system
Key elements of requirements gathering when talking with Marketing
Learn what existing data systems or solutions are in place
Learn what pain points or problems there are with the existing solutions
Learn what actions stakeholders plan to take based on the data you serve them
Identify any other stakeholders you’ll need to talk to if you’re still missing information
Tip: after learning → try to repeat what you learned back to the stakeholder.
DE: tell me more about these demand spikes. ← demand rises sharply over a span of a few hours and then eventually drops off again after a day or two
…many more, but focus on the key elements above
Document the requirements you’ve gathered
Conversation with the Software Engineer
Open lines of communication with the source system owners and discuss how to best anticipate and deal with disruptions or changes in the data when they occur.
Key elements when talking to source system owners:
Learn what existing data systems or solutions are in place
Learn what pain points or problems there are with the existing solutions
Conversation takeaways
Documenting nonfunctional requirements: Nonfunctional requirements can be a little trickier than functional requirements in the sense that they won't, typically, be things that your stakeholders explicitly ask for.
Requirements Gathering Summary
“In my own experience, I've seen data engineering done wrong and more times, I've seen it done right. I just want to make sure that you're set up for success with the data systems that you build.”
Main takeaways:
Identify the stakeholders, understand their needs and the broader goals of the business
Ask open-ended questions
Document all of your findings in a hierarchy form: business goals > stakeholder needs > system requirements (functional and nonfunctional)
Evaluation of trade offs: timeline vs cost vs scope (features of the system) ← Iron Triangle
→ The fallacy of the iron triangle: that you can’t have all three: projects done as quickly as possible + projects done well + projects within budget constraints.
The way to break the iron triangle is through the application of principles and processes:
building loosely coupled systems
optimizing for two-way-door decisions
deeply understanding the needs of stakeholders.
Translating Requirements to Architecture
Conversation with the Data Scientist
In this section, your goal is to take a project from requirements gathering to implementation. ← a micro-simulation of what a project in the real world might look like.
Here are the key takeaways from this conversation. You are tasked with building two data pipelines:
A batch data pipeline that serves training data for the data scientists to train the recommender system;
A streaming data pipeline that provides the top product recommendations to users based on their online activity. The streaming data pipeline makes use of the trained recommender system to provide the recommendations.
The recommendation system will recommend products to customers based on their user information and based on the products the customer has been browsing or that are in their cart.
Details of the Recommender System
This section briefly explains how the content-based recommender system works and how it is trained.
Recommendation model
Extract functional and nonfunctional requirements
Functional requirements of batch pipeline:
The data pipeline needs to serve data in a format that is suitable for training the ML models.
How long the data pipeline should retain or store the training data.
How to handle data updates: Determine whether to replace old data with new data, keep only the latest rating for each customer-product pair, or calculate an average rating over time.
The file format the training data should be served in. (e.g. CSV, Parquet, etc.)
Nonfunctional requirements for batch pipeline:
The data system must be easy to maintain and update, and requires less operational overhead (Irregular / on demand run schedule)
The data system must be cost effective.
Functional requirements of streaming pipeline:
use the recommender system to find the set of products to recommend for a given user based on the user’s information and the products they interacted with online.
stream the products to the sales platform and store them for later analysis.
Nonfunctional requirements of streaming pipeline:
Latency: The system must deliver product recommendations to users in real-time as they browse or checkout. It should process user and product data, generate recommendations, and serve them to the sales platform within seconds.
Scalability & concurrency: The system must be able to scale up to ingest, transform and serve the data volume expected with the maximum level of user activity on the platform, while staying within the latency requirements.
AWS Services for Batch Pipelines
We consider Extract — Transform — Load (ETL) Pipeline
You could use an EC2 instance, but you would have to take responsibility for installing software, managing security, and managing cloud deployment complexity. ← In this lab, we use serverless options
☝️ Recommend: using serverless first, then containers and orchestration if possible!
One serverless option you could use is AWS Lambda ← it has limitations:
15-minute timeout for each function call.
Memory and CPU allocation for each function.
Requires you to write custom code for your use case. ← might not be the best use of your time!
We consider Amazon EMR vs AWS Glue ETL! ← the trade-off between them is control vs convenience
Normalized tabular data → star schema → could use another Amazon RDS instance.
Complex analytical queries on massive datasets → use Amazon Redshift (higher cost than RDS).
This week, we focus on the ML use case (data will be used for training a recommender model) → the best and cheapest storage and serving option is Amazon S3 (flexible + scalable + cost-effective). ← It allows you to store virtually any kind of data and integrates easily with other AWS services.
AWS Services for Streaming Pipelines
Just like the batch approach, if we use EC2, we are responsible for installing software, managing security, and managing cloud deployment complexity.
Amazon Data Firehose: simplifies the process of integrating with Kinesis Data Streams and allows you to take data from a stream and store it in a destination like S3 or Redshift, or send it to HTTP endpoints or providers like Datadog or Splunk.
Open terraform/main.tf and uncomment the “etl” section. Open terraform/outputs.tf and uncomment the “ETL” section. ← Remove the space before # ETL, otherwise the code won’t work!
Terraform
cd terraform/
terraform init
terraform plan
terraform apply   # then type "yes"
We get a deeper understanding of the various Terraform files in Course 2.
We can run this command again and again until we see “SUCCEEDED” instead of “RUNNING” (a hedged polling sketch follows below).
Go further into the details of AWS Glue in Course 4.
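As an aside, a hedged sketch of how you might poll the Glue job state with boto3 instead of re-running a command by hand (the job name and run ID are hypothetical; the lab provides the actual command):
import time
import boto3

glue = boto3.client("glue")

def wait_for_glue_job(job_name: str, run_id: str) -> str:
    # Poll the job run until it reaches a terminal state
    while True:
        state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            return state
        time.sleep(30)

print(wait_for_glue_job("de-c1w4-etl-job", "jr_0123abc"))  # hypothetical names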
Now, we can check the S3 bucket that contains the transformed training data. ← go back to the console and open S3 → choose the bucket that has “-datalake” in its name.
We are at this step!
Setting up the Vector Database
In this lab, you won’t train the recommender system yourself (Data Scientist’s job).
You will be provided with a model already trained (stored in S3 as ML artifacts).
Go back to S3 and choose the bucket with ml-artifacts in its name. There are 3 folders
embeddings: contains the embeddings of the users and items (or products) that were generated by the model. There are 2 files: item_embeddings.csv and user_embeddings.csv ← these will be used to retrieve similar products (see the sketch after this list).
model: contains the trained model that will be used for inference.
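To build intuition for how the embeddings are used, a hedged sketch that loads the two CSVs with pandas and ranks items by cosine similarity for one user (the CSV layout, with an ID column followed by embedding dimensions, is an assumption; in the lab the vector database does this lookup):
import numpy as np
import pandas as pd

users = pd.read_csv("user_embeddings.csv", index_col=0)   # assumed: first column is the user id
items = pd.read_csv("item_embeddings.csv", index_col=0)   # assumed: first column is the item id

def top_k_items(user_id, k=5):
    u = users.loc[user_id].to_numpy()
    m = items.to_numpy()
    # cosine similarity between the user vector and every item vector
    scores = (m @ u) / (np.linalg.norm(m, axis=1) * np.linalg.norm(u) + 1e-9)
    return items.index[np.argsort(scores)[::-1][:k]].tolist()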
Another way to check VectorDBHost: open RDS > Databases > click de-c1w4-vector-db > in Connectivity & Security > look for the endpoint (de-c1w4-vector-db.xxxx.us-….rds.amazonaws.com)
Add the embeddings to the vector database → open sql/embeddings.sql
To check and replace <BUCKET_NAME>, go to S3 console, open bucket (”…-ml-artifacts”) and then copy the entire name.
de-c1w4-471112715504-us-east-1-ml-artifacts   (an example bucket name)
Connect to the database
psql --host=<VectorDBHost> --username=postgres --password --port=5432
# then type the password above
To work with postgres database
\c postgres;
Run the embeddings.sql
\i '../sql/embeddings.sql'
Check all available tables
\dt *.*
# use the arrow-down key to show more
# Q to exit
To quit the psql prompt: \q
Implementing the Streaming Pipeline
Model inference (Lambda) will use the trained model stored in S3 and the embeddings from the vector database to provide recommendations based on the online user activity streamed by Kinesis Data Streams.
Before Firehose loads the data into S3, it will invoke the Lambda function labeled “stream transformation” to extract user and product features, then pass these to model inference to get the recommendations. → Firehose finally loads the recommendations into S3.
You will learn more about the underlying design of Kinesis Data Streams in Course 2.
Model inference is already implemented and provided, but it’s not completely configured. Go to Lambda > de-c1w4-model-inference > Configuration > Environment Variables > fill in the values of host, username, and password from the previous steps (see the sketch below for how the function might read these).
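Inside the function, those values would typically be read from the environment; a tiny hedged sketch (the variable names here are hypothetical, use whatever names the lab's Lambda expects):
import os

# Read the vector database connection settings injected via environment variables
DB_HOST = os.environ["VECTOR_DB_HOST"]          # hypothetical variable name
DB_USER = os.environ["VECTOR_DB_USER"]          # hypothetical variable name
DB_PASSWORD = os.environ["VECTOR_DB_PASSWORD"]  # hypothetical variable name
# These would then be passed to the database client when opening a connection.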
To implement the streaming pipeline, uncomment the streaming sections in terraform/main.tf and terraform/outputs.tf. Then run
terraform init
terraform plan
terraform apply   # then type "yes"
To check the recommendations that are created in S3: go to S3 > click the recommendations bucket