- Jay Kreps, “Questioning the Lambda Architecture,” O’Reilly Radar, July 2, 2014.
- “A Brief Introduction to Two Data Processing Architectures — Lambda and Kappa for Big Data” by Iman Samizadeh
- An article about Amazon's API Mandate
- This week, we dive deeper into what it means to build a good data architecture.
- How Data Architecture fits within Enterprise Architecture (entire org)
- Specific architecture examples
- Choosing technologies: how to translate from stakeholder needs to technology choices.
- Guiding architectural principles to think like an architect
- Lab to learn trade-off evaluation between cost, performance, scalability, and security
- Learn about AWS Well-Architected Framework.
- Enterprise Architecture: the design of systems to support change in an enterprise, achieved by flexible and reversible decisions reached through a careful evaluation of trade-offs
- Data Architecture: the design of systems to support the evolving data needs of an enterprise, achieved by flexible and reversible decisions reached through a careful evaluation of trade-offs.
- One-way and Two-way Decisions:
- One-way: A decision that is almost impossible to reverse
- Two-way: An easily reversible decision. Eg: S3 Object Storage Classes
- Can convert one-way to multiple smaller two-way decisions.
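- A minimal CLI sketch of the two-way door idea above (bucket and key names are made up): changing an S3 object's storage class is cheap to try and cheap to undo.

```bash
# Hypothetical example: move an object to a cheaper storage class by copying it
# onto itself with a new class (an easily reversible, two-way door decision)
aws s3 cp s3://my-data-bucket/archive/events.parquet \
         s3://my-data-bucket/archive/events.parquet --storage-class STANDARD_IA

# Changed your mind? Copy it back to the default class; the decision is reversible.
aws s3 cp s3://my-data-bucket/archive/events.parquet \
         s3://my-data-bucket/archive/events.parquet --storage-class STANDARD
```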
- If you're thinking like an architect in your role as a data engineer, you'll build technical solutions that exist not just for their own sake, but in direct support of business goals
- A very important guiding principle that has affected virtually every architecture and system (Conway's Law):
Any organization that designs a system will produce a design whose structure is a copy of the organization’s communication structure.
- The main takeaway for you as a data engineer: when it comes to understanding what kind of data architecture will work for your organization, you first need to pay attention to and understand the organization's communication structure.
- Check 9 principles in DE by DL.AI - Course 1: Introduction to DE (W1&2)
- Break into 3 groups:
- Common components, ex: object storage, git, observability & monitoring systems, processing engines.
- Common: facilitate team collaboration; break down silos
- “Wise” choice: Identify tools that benefit all teams; avoid a one-size-fits-all approach.
- As a data engineer, I also recommend you seek mentorship from data architects in your organization or elsewhere because eventually, you may well occupy the role of architect yourself.
- Systems built around reversible decisions (two-way doors) allow you to always be architecting
- All teams must use APIs to communicate as well as to serve data and functionality.
- Make reversible decisions ↔ Always be architecting ↔ Build loosely coupled systems
- Loosely-coupled systems:
- Building loosely-coupled systems helps you ensure your decisions are reversible.
- Loosely-coupled systems give you the ability to always be architecting.
- Things will go wrong no matter what.
- Even when your systems fail, they should still serve the needs of your organization
- Plan for failure
- Architect for scalability
- Prioritize security
- Embrace FinOps
- Plan for failure: take a practical, quantitative approach with these metrics (see the quick downtime calculation after this list):
- Availability: The percentage of time an IT service or a component is expected to be in an operable state.
- Reliability: The probability of a particular service or component performing its intended function during a particular time interval
- Durability: The ability of a storage system to withstand data loss due to hardware failure, software errors, or natural disasters.
- RTO vs RPO
- Recovery Time Objective (RTO): The maximum acceptable time for a service or system outage. Ex: Consider the impact to customers.
- Recovery Point Objective (RPO): A definition of the acceptable state after recovery. Ex: Consider the maximum acceptable data loss
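- A quick back-of-the-envelope check of an availability target (the 99.95% figure is just an assumed example, not from the course): how much downtime per year does it actually allow?

```bash
# Convert an availability percentage into minutes of allowed downtime per year
awk -v a=99.95 'BEGIN { printf "%.1f minutes of downtime allowed per year\n", (100 - a) / 100 * 365 * 24 * 60 }'
# prints: 262.8 minutes of downtime allowed per year (about 4.4 hours)
```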
- Prioritize Security
- 3 aspects: a culture of security, the principle of least privilege (see the CLI sketch after this block), and zero-trust security.
- (old and dangerous): Hardened-Perimeter Approach
- Zero-trust security: every action requires authentication + you build your system such that no people or applications, internal or external are trusted by default
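- A hedged sketch of the principle of least privilege in practice (policy and bucket names are made up): grant a role read access to exactly one prefix and nothing more.

```bash
# Hypothetical least-privilege policy: read-only access to a single bucket prefix
cat > readonly-sales-data.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::my-data-bucket",
        "arn:aws:s3:::my-data-bucket/sales/*"
      ]
    }
  ]
}
EOF

# Create the customer-managed policy so it can be attached to a role or user
aws iam create-policy --policy-name readonly-sales-data \
    --policy-document file://readonly-sales-data.json
```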
- Embrace FinOps: optimize daily jobs in terms of cost and performance ← manage cloud costs, which are pay-as-you-go and readily scalable.
- Example: ETL (Extract-Transform-Load) and ELT
- Example: using batches in a process
Sometimes, we need data marts.
- Plan for failure:
- When data sources die or schemas change → the 1st step is to talk to the owners.
- Also check the availability and reliability specs.
- Make reversible decisions: build flexibility into your system (ingest 1 day’s worth of data or 2 days or different amount of data)
- Also do a cost-benefit analysis when building common components → make sure they deliver value for the business.
- Data can be produced in a series of events such as clicks, sensor measurements, … ← in a continuous fashion (near real-time)
- Streaming frameworks: Apache Kafka (event streaming platform), Apache Storm & Apache Samza (streaming & real-time analytics).
- Lambda architecture: data engineers needed to figure out how to reconcile batch and streaming data into a single architecture.
- Challenges: managing parallel systems with different code bases.
- Kappa Architecture: The central idea for Kappa architecture is to use a stream processing platform as the backbone for all data handling, ingestion, storage, and serving ← a true event-based architecture ← information is automatically sent to relevant consumers that need the update so that these consumers can react more immediately to this information.
- Unifying batch and stream processing: Google Dataflow, Apache Beam, Apache Flink ← idea: data viewed as events
- “You can think of batch processing as a special case of streaming” simply means that, since data is just information about events that are happening continuously out in the world, essentially all data is streaming at its source. Therefore, streaming ingestion could be thought of as the most natural or basic approach, while batch ingestion just imposes arbitrary boundaries on that same stream of data.
- Regulatory compliance is probably the most boring topic in these courses.
- As a DE, you need to plan flexibly for different compliance regimes.
- Constantly complying: not only regulations of today but also regulations of tomorrow
- Use loosely coupled components (they can be replaced at any time)
- Some kinds:
- General Data Protection Regulation (GDPR) in Europe (EU) → personal data → right to have your data deleted…
- Not only in regions but also in company or industry types. For example: Health Insurance Portability and Accountability Act (HIPAA) in US.
- Different types: Open source software vs Managed open source software vs Proprietary software
- Keep in mind the end goal! → Deliver high-quality data products. You focus on data architecture (the what, why, and when); the tools are the how.
- Considerations
- On-Premises: company owns and maintains the hardware and software for their data stack.
- Cloud provider: is responsible for building and maintaining the hardware in data centers.
- You rent the compute and storage resources.
- It’s easy to scale down and up.
- You don’t need to maintain or provision any hardware.
- Or hybrid
- Monolithic systems: self-contained systems made up of tightly-coupled components.
- Pros: easy to reason about and to understand + you deal with only one technology ← good for simplicity.
- Cons: hard to maintain + if you update one component, you need to update others too.
- Modular systems: loosely-coupled components → micro-services.
- Interoperability, flexible and reversible decisions, continuous improvement.
- Total Cost of Ownership (TCO): The total estimated cost of a solution, project or initiative over its entire lifecycle. — hardware & software, maintenance, training.
- Direct costs: easy to identify costs, directly attributed to the dev of a data product. Ex: salaries, cloud bills, software subscriptions,…
- Indirect costs (overhead): expenses that are not directly attributed to the dev of a data product. Ex: network downtime, IT support, loss of productivity.
- For hardware & software: CapEx vs OpEx
- Capital Expenses (CapEx): the payment made to purchase long-term fixed assets ← On-Premises
- Operational Expenses (OpEx): Expense associated with running the day-to-day operations (pay-as-you-go) ← Cloud
- Total Opportunity Cost of Ownership (TOCO): The cost of the lost opportunities that you incur in choosing a particular tool or technology. TOCO = 0 when you choose the right, optimized stack of technologies. However, components change over time and TOCO will increase.
- Minimize TOCO → choose loosely-coupled components so that we can replace them with more optimized ones.
- Recognize components that are likely to change:
- Immutable technologies: Object storage, Networking, SQL
- Transitory technologies: stream processing, orchestration, AI.
- FinOps: Minimize TCO and TOCO & Maximize revenue generation opportunities
- Choose Cloud (OpEx first): flexible, pay-as-you-go technologies & Modular options
- Build your own solution: be careful not to reinvent the wheel
- Use an existing solution → choose between: open source (community) vs commercial open source (vendor) vs proprietary non-open-source software.
- Considerations:
- Your team:
- Does your team have the bandwidth and capabilities to implement an open source solution?
- Are you a small team? Could using a managed or proprietary service free up your time?
- Cost:
- How much are licensing fees associated with managed or proprietary services?
- What is the total cost to build and maintain a system?
- Business value:
- Do you get some advantage by building your own system compared to a managed service?
- Are you avoiding undifferentiated heavy lifting?
- Advice: for most teams (particularly small teams) → start with open source; then purchase commercial open source; if no suitable solution is found, buy a proprietary solution. ← your team can focus on the main things.
- Avoiding undifferentiated heavy lifting means avoiding doing work that costs significant time and resources but doesn’t add value in terms of cost savings or advantage for the business.
- Server: you set up and manage the server → update the OS, install/update packages. Ex: EC2 instance.
- Container: Modular unit that packages code and dependencies to run on a server → lightweight & portable. You set up the application code and dependencies
- Serverless: You don’t need to set up or maintain the server → automatic scaling, availability & fault-tolerance, pay-as-you-go
→ Recommend: using serverless first, then containers and orchestration if possible.
- Last week, we talked about 6 undercurrents: Security, Data Management, DataOps, Data Architecture, Orchestration, and Software Engineering.
- Security:
- Check what are the security features of the tool.
- Know where your tools come from.
- Use tools from reputable sources.
- Check the source codes of open-source tools.
- Data Management: How are data governance practices implemented by the tool provider? → Data Breach, Compliance, Data Quality.
- DataOps: What features does the tool offer in terms of automation and monitoring? → Automation, Monitoring, Service Level Agreement (SLA)
- Data Architecture: does the tool provide modularity and interoperability? → choose loosely-coupled components.
- Orchestration: Apache Airflow (most), dagster, prefect, mage.
- Software Engineering: how much do you want to do? → Team’s capabilities, business value → build vs buy → avoid undifferentiated heavy lifting.
The AWS Well-Architected framework (WAF)
- A set of principles and best practices that help you build scalable & robust architectures on AWS.
- The “Principles of good data architecture” in the book are inspired by this framework.
- WAF helps you evaluate and improve solutions you build on AWS.
- WAF contains 6 key pillars: Operational Excellence, Performance Efficiency, Security, Cost Optimization, Reliability, Sustainability.
- Operational Excellence
- How you can develop and run your workloads on AWS more effectively.
- Monitor your systems to gain insight into your operations.
- Continuously improve your processes and procedures to deliver business value.
- Security: How to take advantage of cloud technologies to protect your data, systems, and assets.
- Reliability: Everything from designing for reliability to planning for failure and adapting to change.
- Performance Efficiency
- Taking a data-driven approach to building high-performance architecture.
- Evaluating the ability of computing resources to efficiently meet system requirements.
- How you can maintain that efficiency as demand changes and technologies evolve.
- Cost Optimization
- Building systems to deliver maximum business value at the lowest possible price point.
- Use AWS Cost Explorer and Cost Optimization Hub to make comparisons and get recommendations about how to optimize costs for your systems.
- Sustainability
- Consider the environmental impact of the workloads you're running on the cloud.
- Reducing energy consumption and increasing efficiency across all components of your system.
- Benefits of these pillars:
- Set of principles and questions
- Helps you design and operate reliable, secure, efficient, cost-effective, and sustainable systems in the cloud
- Helps you think through the pros and cons of different architecture choices
- A lens is essentially an extension of the AWS Well-Architected Framework that focuses on a particular area, industry, or technology stack, and provides guidance specific to those contexts.
- In this lab, we’re going to build a web app on EC2 to serve the data to external clients.
- Ensure that the web app:
- Capable of scaling to meet the clients’ needs
- Uses computing resources efficiently
- Designed in a secure and reliable way
- Use Amazon CloudWatch to monitor the performance of this app.
- Following: Principles of good data architecture and AWS Well-Architected framework.
- The architectural Diagram ← Three-tier architecture overview
- Logic Tier
- To simulate traffic to the web application → use the Apache Benchmark tool:

```bash
sudo yum install httpd-tools -y

# simulate requests
ab -n 7000 -c 50 http://<ALB-DNS>/
```

- `-n` = number of requests, `-c` = number of concurrent requests
- Security:
- Prioritize Security & Security Pillar
- Configure the ALB to only receive certain types of requests.
- A request needs the address and the port number.
- A port number is a virtual identifier that applications use to differentiate between types of traffic.
- Some private data can be leaked via a port due to incorrect configuration. ← to fix this, we need to adjust the security rules, known as security groups, of the load balancer. ← Where: EC2 > Network & Security > Security Groups ← Here there are 2 groups:
    - One is for EC2 (`…-ec2-…`), which should only receive traffic from the ALB ← make sure of that! Check in “Inbound rules” that the “Source” of the group is the ID of the ALB’s security group.
    - The other (`…-alb-…`) is for the ALB, which receives traffic from outside users (via the configured ports). Edit the inbound rules: `0.0.0.0/0` = accepts all IP addresses.
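- A hedged CLI equivalent of the console steps above (all group IDs are placeholders); it performs the same tightening of inbound rules without the console.

```bash
# Remove an over-permissive rule from the EC2 instances' security group
aws ec2 revoke-security-group-ingress \
    --group-id <ec2-security-group-id> --protocol tcp --port 80 --cidr 0.0.0.0/0

# Allow port 80 traffic only from the ALB's security group
aws ec2 authorize-security-group-ingress \
    --group-id <ec2-security-group-id> --protocol tcp --port 80 \
    --source-group <alb-security-group-id>
```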
- Check the EC2 Availability Zones (AZs). Eg: `us-east-1a`, `us-east-1b`
- To change the instance type from `t3.micro` to `t3.nano` (for example) → modify the Launch template of the auto scaling group → Where: EC2 > Auto Scaling > Auto Scaling Groups > Launch template > Edit.
- Don’t forget to terminate all `t3.micro` instances that are running: EC2 > Instances > Instances > check all instances > (right click on any one of them) > Terminate instance.
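- A hedged CLI alternative to the console steps above (template name, group name, version number, and instance IDs are placeholders):

```bash
# Create a new launch template version that uses t3.nano
aws ec2 create-launch-template-version \
    --launch-template-name <launch-template-name> \
    --source-version <current-version-number> \
    --launch-template-data '{"InstanceType":"t3.nano"}'

# Point the auto scaling group at the latest launch template version
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name <asg-name> \
    --launch-template 'LaunchTemplateName=<launch-template-name>,Version=$Latest'

# Terminate the running t3.micro instances so the group replaces them
aws ec2 terminate-instances --instance-ids <instance-id-1> <instance-id-2>
```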
- Enable the automatic scaling process: EC2 > Auto Scaling > Auto Scaling Groups > (group name) > Automatic scaling > Create dynamic scaling policy
- target group (ports receiving the requests)
- target value: eg 60 ← when there are more than 60 requests, the auto scaling group may scale out to add more instances to handle the increased load
- instance warmup: This warm up time refers to the period during which newly launched instances are allowed to fully initialize before being considered in service for the auto scaling evaluation metrics.
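- A hedged CLI sketch of the dynamic scaling policy described above (group name, ARN suffixes, and warmup value are placeholders, not the lab's exact values):

```bash
# Target-tracking policy: keep request count per target around 60
aws autoscaling put-scaling-policy \
    --auto-scaling-group-name <asg-name> \
    --policy-name request-count-target-tracking \
    --policy-type TargetTrackingScaling \
    --estimated-instance-warmup 300 \
    --target-tracking-configuration '{
      "TargetValue": 60.0,
      "PredefinedMetricSpecification": {
        "PredefinedMetricType": "ALBRequestCountPerTarget",
        "ResourceLabel": "<alb-arn-suffix>/<target-group-arn-suffix>"
      }
    }'
```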
- To monitor the auto scaling: EC2 > Auto Scaling > Auto Scaling Groups > (group name) > Monitoring > EC2 ← here we can see some activities in the CPU usage and also in the network metrics.
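- A hedged sketch for pulling the same CPU metric from the terminal instead of the console (group name and time window are placeholders):

```bash
# Average CPU utilization for the auto scaling group, in 5-minute buckets
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions Name=AutoScalingGroupName,Value=<asg-name> \
    --start-time 2024-01-01T00:00:00Z --end-time 2024-01-01T01:00:00Z \
    --period 300 --statistics Average
```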
What we’ve learned so far:
What we will learn this week (Week 4):
Requirements form a hierarchy of needs.
- Matt Housley plays the role of the CTO (he is a co-author of the book “Fundamentals of Data Engineering”)
- His advice:
- First, learn and play with core data concepts.
- Build your mental framework to think like a DE.
- DE asks CTO about the overall goals of the business and technology ← eg. eCommerce:
- Challenge: the market evolves and a lot of small brands are coming up. ← old code that is still running could cause outages.
- → We will do refactors → but how will that affect you as a DE? ← DE: help modernize the systems
- CTO: we want to make sure that the output of the refactored code is suitable for analytics with minimal processing. ← DE: what tools can I expect to be working with?
- CTO: convert more from a batch-based approach to a streaming-based approach. → use AWS Kinesis + Kafka
- CTO: you’re going to work as a consultant with the software side to understand the data they’re producing and then help them to produce better quality data.
- DE: Does the company have any goals w.r.t. AI? ← CTO: we’re working on a recommendation engine, so your work is to develop data pipelines that feed that recommendation engine.
- Back to previous weeks → 2 primary requests
- delivering dashboards with metrics on product sales
- product recommendation system
- Key elements of requirements gathering when talking with Marketing
- Learn what existing data systems or solutions are in place
- Learn what pain points or problems there are with the existing solutions
- Learn what actions stakeholders plan to take based on the data you serve them
- Identify any other stakeholders you’ll need to talk to if you’re still missing information
- Tip: after learning → try to repeat what you learned back to the stakeholder.
- DE: tell me more about these demand spikes. ← demand rises sharply over a span of a few hours and then eventually drops off again after a day or two
- …many more, but focus on the key elements above
- Document the requirements you’ve gathered
- Open lines of communication with the source system owners and discuss how to best anticipate and deal with disruptions or changes in the data when they occur.
- Key elements when talking to source system owners:
- Learn what existing data systems or solutions are in place
- Learn what pain points or problems there are with the existing solutions
- Conversation takeaways
- Documenting nonfunctional requirements: Nonfunctional requirements can be a little trickier than functional requirements in the sense that they won't, typically, be things that your stakeholders explicitly ask for.
- “In my own experience, I've seen data engineering done wrong more times than I've seen it done right. I just want to make sure that you're set up for success with the data systems that you build.”
- Main takeaways:
- Identify the stakeholders, understand their needs and the broader goals of the business
- Ask open-ended questions
- Document all of your findings in a hierarchical form: business goals > stakeholder needs > system requirements (functional and nonfunctional)
- Evaluation of trade-offs: timeline vs cost vs scope (features of the system) ← the Iron Triangle
→ The fallacy of the iron triangle: projects done as quickly as possible + projects done well + projects within budget constraints.
- The way to break the iron triangle is through the application of principles and processes:
- building loosely coupled systems
- optimizing for two-way door decisions
- deeply understanding the needs of stakeholders.
In this section, your goal is to take a project from requirements gathering to implementation. ← a micro-simulation of what a project in the real world might look like.
Here are the key takeaways from this conversation. You are tasked with building two data pipelines:
- A batch data pipeline that serves training data for the data scientists to train the recommender system;
- A streaming data pipeline that provides the top product recommendations to users based on their online activity. The streaming data pipeline makes use of the trained recommender system to provide the recommendations.
The recommendation system will recommend products to customers based on their user information and based on the products the customer has been browsing or that are in their cart.
This section briefly explains how the content-based recommender system works and how it is trained.
- Functional requirements of batch pipeline:
- The data pipeline needs to serve data in a format that is suitable for training the ML models.
- The duration for which the data pipeline should retain or store the training data.
- How the data pipeline should combine the old with the new dataset. Should the old dataset be discarded? If the same customer updated their rating for a given product, should you keep the last rating or compute the average?
- The file format the training data should be served in. (e.g. CSV, Parquet, etc.)
- Nonfunctional requirements for batch pipeline:
- The data system must be easy to maintain and update, and require low operational overhead (irregular / on-demand run schedule)
- The data system must be cost effective.
- Functional requirements of streaming pipeline:
- use the recommender system to find the set of products to recommend for a given user based on the user’s information and the products they interacted with online.
- stream the products to the sales platform and store them for later analysis.
- Nonfunctional requirements of streaming pipeline:
- Latency: The system should provide the recommended products to the user instantaneously as they are browsing through different products or during the checkout process. For a given user, the system must be able to ingest and extract user and product data, provide them to the recommender system, and finally serve the product recommendations to the sales platform, all in a few seconds.
- Scalability & concurrency: The system must be able to scale up to ingest, transform and serve the data volume expected with the maximum level of user activity on the platform, while staying within the latency requirements.
We consider Extract — Transform — Load (ETL) Pipeline
- Data Source: Amazon RDS
- Extract & Transform:
- You can use an EC2 instance, but you have to take responsibility for installing software, managing security, and managing cloud deployment complexity. ← In this lab, we use serverless options
- One serverless option you could use is AWS Lambda ← It has limitations:
- 15-minute timeout for each function call.
- Memory and CPU allocation for each function.
- Requires you to write custom code for your use case. ← might not be the best use of your time!
- We consider Amazon EMR vs AWS Glue ETL! ← tradeoff between them is control vs convenience
- There are other tools too.
☝️ Recommend: using serverless first, then containers and orchestration if possible!
- Load & Serve:
- Normalized tabular data → star schema → could use another Amazon RDS instance.
- Complex analytical queries on massive datasets → use Amazon Redshift (higher cost than RDS).
- This week, we focus on ML use case (data will be used for training a recommender model) → the best and cheapest storage and serving option is Amazon S3 (flexible + scalable + cost-effective). ← It allows you to store virtually any kind of data that easily integrates with other AWS services
- Just like the batch approach, if we use EC2, we are responsible for installing software, managing security, and managing cloud deployment complexity.
- Using AWS Lambda is limited too.
You shouldn’t build your own streaming solution with EC2 or Lambda.
- Amazon Kinesis Data Streams: popular AWS service that enables real time data ingestion.
- Kinesis is data agnostic: data can be JSON, XML, structured, or unstructured
- Data is stored in Kinesis for a configurable retention period (24 hours by default, or a custom duration); see the CLI sketch after this list
- Amazon MSK (Amazon Managed Streaming for Apache Kafka) ← MSK runs open-source versions of Kafka
- MSK (more control) — Kinesis (more convenience)
- Amazon Data Firehose: simplifies the process of integrating with Kinesis Data Streams and allows you to get data from a stream and store it in a destination like S3 or Redshift, or send it to HTTP endpoints or providers like Datadog or Splunk.
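- A hedged CLI sketch of the Kinesis Data Streams basics noted above (stream name and payload are made up for illustration):

```bash
# Create a minimal stream with one shard
aws kinesis create-stream --stream-name clickstream-demo --shard-count 1

# Put a single JSON event onto the stream (raw-in-base64-out is an AWS CLI v2 option)
aws kinesis put-record --stream-name clickstream-demo \
    --partition-key user-123 \
    --cli-binary-format raw-in-base64-out \
    --data '{"event":"product_view","product_id":42}'

# Read it back: get a shard iterator for the first shard, then fetch records
ITERATOR=$(aws kinesis get-shard-iterator --stream-name clickstream-demo \
    --shard-id shardId-000000000000 --shard-iterator-type TRIM_HORIZON \
    --query 'ShardIterator' --output text)
aws kinesis get-records --shard-iterator "$ITERATOR"
```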
- We explore an additional table that was added to the `classicmodels` MySQL sample database.
- Get the endpoint of the database instance:

```bash
aws rds describe-db-instances --db-instance-identifier <MySQL-DB-name> --output text --query "DBInstances[].Endpoint.Address"
```
- Connect to the database (we learn more about connecting and working with databases in Course 2):

```bash
mysql --host=<MySQLEndpoint> --user=<DatabaseUserName> --password=<Password> --port=3306
```
- Check `classicmodels` → we should see a new table, “ratings”:

```sql
use classicmodels; show tables;

# check the 1st 20 rows
SELECT * FROM ratings LIMIT 20;

# exit
exit
```
- Open `terraform/main.tf` and uncomment the “etl” section. Open `terraform/outputs.tf` and uncomment the “ETL” section. ← Remove the space before `# ETL`, otherwise the code won’t work!
- Terraform (we will get a deeper understanding of the various Terraform files in Course 2):

```bash
cd terraform/
terraform init
terraform plan
terraform apply # then "yes"
```
- Start the AWS Glue job:

```bash
aws glue start-job-run --job-name de-c1w4-etl-job | jq -r '.JobRunId'
```

- Check the status of the AWS Glue job (we go further into the details of AWS Glue in Course 4):

```bash
aws glue get-job-run --job-name de-c1w4-etl-job --run-id <JobRunID> --output text --query "JobRun.JobRunState"
```

- We can run this command again and again until we see “SUCCEEDED” instead of “RUNNING”.
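- A hedged convenience for the “run it again and again” step: a small polling loop around the same command (it reuses the `<JobRunID>` placeholder above).

```bash
# Poll the job state every 30 seconds until it is no longer RUNNING
while true; do
  STATE=$(aws glue get-job-run --job-name de-c1w4-etl-job --run-id <JobRunID> \
          --output text --query "JobRun.JobRunState")
  echo "$(date +%T) $STATE"
  if [ "$STATE" != "RUNNING" ]; then break; fi
  sleep 30
done
```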
- Now, we can check the S3 bucket that contains the training data. ← go back to the console and open S3 → choose the bucket that has “-datalake” in its name.
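- A hedged alternative to clicking through the console (the bucket name is a placeholder):

```bash
# List everything the ETL job wrote into the data lake bucket
aws s3 ls s3://<bucket-name-with-datalake>/ --recursive
```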
- In this lab, you won’t train the recommender system yourself (Data Scientist’s job).
- You will be provided with a model already trained (stored in S3 as ML artifacts).
- Go back to S3 and choose the bucket with ml-artifacts in its name. There are 3 folders
- embeddings: contains the embeddings of the users and items (or products) that were generated by the model. There are 2 files: `item_embeddings.csv` and `user_embeddings.csv` ← they will be used to retrieve similar products.
- model: contains the trained model that will be used for inference.
- scalers: contains the objects used in the preprocessing step for training, such as One-Hot Encoders and Standard Scalers.
- Open `terraform/main.tf` and uncomment the “vector_db” section. Open `terraform/outputs.tf` and uncomment the “Vector-db” section.
- Terraform:

```bash
cd terraform/
terraform init
terraform plan
terraform apply # then "yes"
```

- After the last step → it takes ~7 minutes to create the PostgreSQL database!
- You will get the information needed to connect to the database, including sensitive values (username, password). To show these:

```bash
terraform output vector_db_master_username
terraform output vector_db_master_password
# ignore double quotes in the output
```

- An example:

```
# host (<VectorDBHost>)
de-c1w4-vector-db.cvaw8w688ut3.us-east-1.rds.amazonaws.com
# username
postgres
# password
V3W_VFGgSpg
```
- Another way to check `VectorDBHost`: Open RDS > Databases > click `de-c1w4-vector-db` > In Connectivity & Security > search for the endpoint (`de-c1w4-vector-db.xxxx.us-….rds.amazonaws.com`)
- Add the embeddings to the vector database → open `sql/embeddings.sql`
- To check and replace `<BUCKET_NAME>`, go to the S3 console, open the bucket (“…-ml-artifacts”) and then copy the entire name. An example: `de-c1w4-471112715504-us-east-1-ml-artifacts`
- Connect to the database:

```bash
psql --host=<VectorDBHost> --username=postgres --password --port=5432
# then type the password above
```

- To work with the postgres database:

```sql
\c postgres;
```

- Run `embeddings.sql`:

```sql
\i '../sql/embeddings.sql'
```

- Check all available tables:

```sql
\dt *.*
-- use the arrow down key to show more
-- Q to exit
```

- To quit the `psql` prompt: `\q`
- You will learn more about the underlying design of Kinesis Data Streams in Course 2.
- Model inference is already implemented and provided, but it is not completely configured. Go to Lambda > `de-c1w4-model-inference` > Configuration > Environment Variables > fill in the values of host, username and password from the previous steps.
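- A hedged CLI alternative for setting those values (the environment variable names below are placeholders; use the exact keys the lab's function expects):

```bash
# Note: update-function-configuration replaces ALL existing environment variables
aws lambda update-function-configuration \
    --function-name de-c1w4-model-inference \
    --environment 'Variables={DB_HOST=<VectorDBHost>,DB_USER=postgres,DB_PASSWORD=<password>}'
```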
- To implement the streaming pipeline, uncomment the streaming section in `terraform/main.tf` and `terraform/outputs.tf`. Then run:

```bash
terraform init
terraform plan
terraform apply # then "yes"
```
- To check the recommendations that are created in S3: go to S3 > click the recommendations bucket
- Check logs of transformation lambda function: Lambda > “…-transformation-lambda” > Monitor > View CloudWatch logs > Log streams.
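- Hedged sketches for doing both checks from the terminal (bucket and function names are placeholders):

```bash
# List the generated recommendation objects
aws s3 ls s3://<recommendations-bucket-name>/ --recursive

# Tail the transformation Lambda's CloudWatch logs (aws logs tail is an AWS CLI v2 command)
aws logs tail /aws/lambda/<transformation-lambda-name> --follow
```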