Enterprise Architecture: the design of systems to support change in an enterprise, achieved by flexible and reversible decisions reached through a careful evaluation of trade-offs
Data Architecture: the design of systems to support the evolving data needs of an enterprise, achieved by flexible and reversible decisions reached through a careful evaluation of trade-offs.
One-way and Two-way Decisions:
One-way: A decision that is almost impossible to reverse
Two-way: An easily reversible decision. Eg: S3 Object Storage Classes
You can often convert a one-way decision into multiple smaller two-way decisions.
If you're thinking like an architect in your role as a data engineer, you'll build technical solutions that exist not just for their own sake, but in direct support of business goals
Conway’s Law
A very important guiding principle that has shaped virtually every architecture and system.
Any organization that designs a system will produce a design whose structure is a copy of the organization’s communication structure.
As a data engineer, understanding your organization's communication structure is crucial for designing an effective data architecture that aligns with your company's needs and processes.
Common goals: facilitate team collaboration; break down silos.
“Wise” choice: Identify tools that benefit all teams; avoid a one-size-fits-all approach.
As a data engineer, I also recommend you seek mentorship from data architects in your organization or elsewhere because eventually, you may well occupy the role of architect yourself.
Always Architecting
Systems built around reversible decisions (two-way doors) allow you to always be architecting.
All teams must use APIs to communicate as well as to serve data and functionality.
Make reversible decisions ↔ Always be architecting ↔ Build loosely coupled systems
Loosely-coupled systems:
Building loosely-coupled systems helps you ensure your decisions are reversible.
Loosely-coupled systems give you the ability to always be architecting.
When your systems fail
Things will go wrong no matter what.
Even when your systems fail, your architecture should still serve the needs of your organization:
Plan for failure
Architect for scalability
Prioritize security
Embrace FinOps
Plan for failure: take a practical and quantitative approach with these metrics (a quick downtime calculation follows below):
Availability: The percentage of time an IT service or a component is expected to be in an operable state.
Reliability: The probability of a particular service or component performing its intended function during a particular time interval
Durability: The ability of a storage system to withstand data loss due to hardware failure, software errors, or natural disasters.
RTO vs RPO
Recovery Time Objective (RTO): The maximum acceptable time for a service or system outage. Ex: Consider the impact to customers.
Recovery Point Objective (RPO): A definition of the acceptable state after recovery. Ex: Consider the maximum acceptable data loss
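A quick, hedged back-of-the-envelope calculation (a minimal Python sketch; the percentages are just common availability targets) showing how an availability figure translates into allowed downtime per year:
import math  # not strictly needed, kept minimal

# Convert an availability target into the downtime budget it allows per year.
HOURS_PER_YEAR = 24 * 365

for availability in (0.99, 0.999, 0.9999):
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{availability:.2%} availability = about {downtime_hours:.1f} hours of downtime per year")
For example, a 99.9% target allows roughly 8.8 hours of downtime per year, which helps make RTO discussions concrete.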
Prioritize Security
3 aspects: Culture of security, Principle of least privilege and Zero-trust security.
(old and risky): the hardened-perimeter approach, where everything inside the network boundary is trusted by default.
Zero-trust security: every action requires authentication + you build your system such that no people or applications, internal or external are trusted by default
Embrace FinOps: optimize your day-to-day workloads in terms of cost and performance ← manage cloud cost, since cloud services are pay-as-you-go and readily scalable.
Batch Architectures
Data is ingested and processed as a series of large chunks (e.g., once per day or per hour).
Example: ETL (Extract-Transform-Load) and ELT (see the sketch after this list)
Example: using batches in a process
Sometimes, we also need data marts (subsets of the warehouse serving a specific team or department).
Plan for failure:
When data sources go down or schemas change → the first step is to talk to the source system owners.
Also check the availability and reliability specs.
Make reversible decisions: build flexibility into your system (e.g., the ability to ingest one day’s worth of data, two days’ worth, or some other amount).
Also run a cost-benefit analysis when building common components → make sure they deliver value for the business.
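A minimal sketch of a batch ETL step in Python, assuming a hypothetical daily CSV extract of orders (the file paths, column names, and the pandas/pyarrow dependency are all assumptions, not the lab's actual code):
import pandas as pd  # requires pandas (and pyarrow for Parquet output)

def run_daily_batch(input_path: str, output_path: str) -> None:
    # Extract: read one day's chunk of raw data
    orders = pd.read_csv(input_path)
    # Transform: drop incomplete rows and aggregate sales per product
    orders = orders.dropna(subset=["product_id", "amount"])
    daily_sales = orders.groupby("product_id", as_index=False)["amount"].sum()
    # Load: write the result where downstream consumers (e.g., a data mart) can read it
    daily_sales.to_parquet(output_path, index=False)

run_daily_batch("raw/orders_2025-01-01.csv", "marts/daily_sales.parquet")
Reversibility here means the same function could just as easily ingest two days of files or a different date range.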
Streaming Architectures
Data is produced as a series of events (clicks, sensor measurements, …) in a continuous fashion (near real-time); see the sketch after this list.
Lambda architecture: data engineers needed to figure out how to reconcile batch and streaming data into a single architecture.
Challenges: managing parallel systems with different code bases.
Kappa Architecture: Uses a stream processing platform for all data handling, ingestion, storage, and serving. It's a true event-based architecture where updates are automatically sent to relevant consumers for immediate reaction.
“You can think of batch processing as a special case of streaming” simply means that, since data is just information about events that are happening continuously out in the world, essentially all data is streaming at its source. Therefore, streaming ingestion could be thought of as the most natural or basic approach, while batch ingestion just imposes arbitrary boundaries on that same stream of data.
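A hedged sketch of how a single click event might be pushed into a stream with boto3 and Kinesis Data Streams (the stream name and event fields are hypothetical):
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"user_id": "u123", "action": "click", "product_id": "p456"}
kinesis.put_record(
    StreamName="user-activity",                 # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),     # each event is sent as it happens
    PartitionKey=event["user_id"],              # events from one user go to the same shard
)
In a Kappa-style architecture, a batch job would simply replay a bounded window of these same events.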
Architecting for compliance
Regulatory compliance is probably the most boring topic in these courses.
As a DE, you need to plan flexibly for different compliance requirements.
Constantly complying: not only regulations of today but also regulations of tomorrow
Use loosely coupled components (so they can be replaced at any time)
Some kinds:
General Data Protection Regulation (GDPR) in Europe (EU) → personal data → right to have your data deleted…
Regulations apply not only by region but also by company or industry type. For example: the Health Insurance Portability and Accountability Act (HIPAA) in the US.
Choosing the Right Technologies
Choosing tools and technologies
Different types: Open source software vs Managed open source software vs Proprietary software
Keep in mind the end goal! → Deliver high-quality data products. You focus on the data architecture (what, why, and when); the tools are the how.
Considerations
Location
On-Premises: company owns and maintains the hardware and software for their data stack.
Cloud provider: is responsible for building and maintaining the hardware in data centers.
You rent the compute and storage resources.
It’s easy to scale down and up.
You don’t need to maintain or provision any hardware.
Or a hybrid approach (a mix of on-premises and cloud).
Monolith vs Modular systems
Monolithic systems: self-contained systems made up of tightly coupled components.
Pros: easy to understand and reason about + you deal with only one technology ← good for simplicity
Cons: hard to maintain + updating one component often requires updating others too.
Modular systems: interoperability, flexible and reversible decisions, continuous improvement.
Cost optimization and Business value
Total Cost of Ownership (TCO): The total estimated cost of a solution, project or initiative over its entire lifecycle. — hardware & software, maintenance, training.
Direct costs: costs that are easy to identify and directly attributable to the development of a data product. Ex: salaries, cloud bills, software subscriptions,…
Indirect costs (overhead): expenses that are not directly attributable to the development of a data product. Ex: network downtime, IT support, loss of productivity.
For hardware & software: CapEx vs OpEx (a rough comparison follows this list)
Capital Expenses (CapEx): the payment made to purchase long-term fixed assets ← On-Premises
Operational Expenses (OpEx): Expense associated with running the day-to-day operations (pay-as-you-go) ← Cloud
Total Opportunity Cost of Ownership (TOCO): The cost of lost opportunities that you incur in choosing a particular tool or technology
TOCO = 0 when you choose the right, optimized stack of technologies. However, components change over time and TOCO will increase.
Minimize TOCO → choose loosely coupled components so that we can replace them with more optimized ones.
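A rough, hedged back-of-the-envelope comparison of CapEx vs OpEx in Python (every number here is made up purely for illustration):
# On-premises: large upfront CapEx amortized over several years, plus running costs
capex_hardware = 120_000          # hypothetical upfront server purchase
amortization_years = 4
yearly_maintenance = 15_000       # hypothetical power, cooling, support contracts
on_prem_per_year = capex_hardware / amortization_years + yearly_maintenance

# Cloud: pay-as-you-go OpEx
cloud_monthly_bill = 3_500        # hypothetical monthly pay-as-you-go bill
cloud_per_year = cloud_monthly_bill * 12

print(f"On-premises per year: ${on_prem_per_year:,.0f}")
print(f"Cloud per year:       ${cloud_per_year:,.0f}")
A real TCO comparison would also fold in indirect costs (downtime, IT support, lost productivity) on both sides.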
Build your own solution: be careful not to reinvent the wheel
Use an existing solution → choose between: open source (community) vs commercial open source (vendor) vs proprietary non-open-source.
Considerations:
Your team:
Does your team have the bandwidth and capabilities to implement an open source solution?
Are you a small team? Could using a managed or proprietary service free up your time?
Cost:
How much are licensing fees associated with managed or proprietary services?
What is the total cost to build and maintain a system?
Business value:
Do you get some advantage by building your own system compared to a managed service?
Are you avoiding undifferentiated heavy lifting?
Advice: for most teams (particularly small teams) → use open source, or purchase commercial open source → if you can’t find a suitable solution, buy a proprietary one. ← This lets your team focus on the main things.
Avoiding undifferentiated heavy lifting means avoiding doing work that costs significant time and resources but doesn’t add value in terms of cost savings or advantage for the business.
Server, Container, and Serverless Compute Options
Server: you set up and manage the server → update the OS, install/update packages. Ex: EC2 instance.
Container: Modular unit that packages code and dependencies to run on a server → lightweight & portable. You set up the application code and dependencies
Serverless: You don’t need to set up or maintain the server → automatic scaling, availability & fault-tolerance, pay-as-you-go
→ Recommendation: use serverless first, then containers and orchestration if possible (a minimal serverless sketch follows).
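A minimal sketch of what the serverless option looks like in practice, assuming a Python AWS Lambda handler (the event shape is hypothetical):
import json

def lambda_handler(event, context):
    # AWS invokes this function on demand; you supply only code and dependencies,
    # while scaling, availability, and fault tolerance are handled for you.
    records = event.get("records", [])
    return {
        "statusCode": 200,
        "body": json.dumps({"processed": len(records)}),
    }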
How the undercurrents impact your decisions
Last week, we talked about 6 undercurrents: Security, Data Management, DataOps, Data Architecture, Orchestration, and Software Engineering.
Security:
Check what are the security features of the tool.
Know where your tools come from.
Use tools from reputable sources.
Check the source codes of open-source tools.
Data Management: How are data governance practices implemented by the tool provider? → Data Breach, Compliance, Data Quality.
DataOps: What features does the tool offer in terms of automation and monitoring? → Automation, Monitoring, Service Level Agreement (SLA)
Data Architecture: does the tool provide modularity and interoperability? → choose loosely-coupled components.
The AWS Well-Architected Framework is built around six pillars.
Operational Excellence: How you can develop and run your workloads on AWS more effectively.
Monitor your systems to gain insight into your operations.
Continuously improve your processes and procedures to deliver business value.
Security: How to take advantage of cloud technologies to protect your data, systems, and assets.
Reliability: Everything from designing for reliability to planning for failure and adapting to change.
Performance Efficiency
Taking a data-driven approach to building high-performance architecture.
Evaluating the ability of computing resources to efficiently meet system requirements.
How you can maintain that efficiency as demand changes and technologies evolve.
Cost Optimization
Building systems to deliver maximum business value at the lowest possible price point.
Use AWS Cost Explorer and Cost Optimization Hub to make comparisons and get recommendations about how to optimize costs for your systems.
Sustainability
Consider the environmental impact of the workloads you're running on the cloud.
Reducing energy consumption and increasing efficiency across all components of your system.
Benefits of these pillars:
Set of principles and questions
Helps you design and operate reliable, secure, efficient, cost-effective, and sustainable systems in the cloud
Helps you think through the pros and cons of different architecture choices
A Lens is essentially an extension of the AWS Well-Architected Framework that focuses on a particular area, industry, or technology stack, and provides guidance specific to those contexts.
Configure the ALB (Application Load Balancer) to only receive certain types of requests.
A request needs the address and the port number.
A port number is a virtual identifier that applications use to differentiate between types of traffic.
Some private data can be leaked via an open port due to incorrect configuration. ← To fix this, we need to adjust the security rules, known as security groups, of the load balancer. ← Where: EC2 > Network & Security > Security Groups ← Here there are 2 groups:
One is for EC2 (…-ec2-…), which should only receive traffic from the ALB ← make sure of that!
Check in “Inbound rules” that the “Source” of this group is the ID of the ALB’s security group
The other (…-alb-…) is for the ALB, which receives traffic from outside users (via some configured ports)
Edit inbound rules: 0.0.0.0/0 = accepts all IP addresses (a scripted check follows below)
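If you prefer to check these rules from code, a hedged boto3 sketch (the group-name filter patterns are assumptions based on the naming above):
import boto3

ec2 = boto3.client("ec2")
resp = ec2.describe_security_groups(
    Filters=[{"Name": "group-name", "Values": ["*-ec2-*", "*-alb-*"]}]
)
for sg in resp["SecurityGroups"]:
    for rule in sg["IpPermissions"]:
        # Each inbound rule lists a port range plus either CIDR ranges (e.g. 0.0.0.0/0)
        # or source security groups (e.g. the ALB's security group)
        print(
            sg["GroupName"],
            rule.get("FromPort"), rule.get("ToPort"),
            [r["CidrIp"] for r in rule.get("IpRanges", [])],
            [g["GroupId"] for g in rule.get("UserIdGroupPairs", [])],
        )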
Check EC2 Availability Zone (AZ)
Eg: us-east-1a, us-east-1b
To change the instance type from t3.micro to t3.nano (for example) → modify the launch template of the auto scaling group → Where: EC2 > Auto Scaling > Auto Scaling Groups > Launch template > Edit.
Don’t forget to terminate all t3.micro instances that are running: EC2 > Instances > Instances > check all instances > (right click on any one of them) > Terminate instance
Enable the automatic scaling process: EC2 > Auto Scaling > Auto Scaling Groups > (group name) > Automatic scaling > Create dynamic scaling policy (a scripted equivalent follows this list)
target group: the group of EC2 instances behind the ALB that receives the requests
target value: e.g., 60 ← when there are more than 60 requests per target, the auto scaling group may scale out to add more instances to handle the increased load
instance warmup: this warm-up time refers to the period during which newly launched instances are allowed to fully initialize before being counted in the auto scaling evaluation metrics.
To monitor the auto scaling: EC2 > Auto Scaling > Auto Scaling Groups > (group name) > Monitoring > EC2 ← here we can see some activities in the CPU usage and also in the network metrics.
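The same dynamic scaling policy can also be created programmatically; a hedged boto3 sketch (the group name, policy name, and resource label are hypothetical placeholders):
import boto3

autoscaling = boto3.client("autoscaling")
autoscaling.put_scaling_policy(
    AutoScalingGroupName="de-c1w4-asg",               # hypothetical group name
    PolicyName="alb-request-count-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            "ResourceLabel": "app/my-alb/123abc/targetgroup/my-tg/456def",  # hypothetical
        },
        "TargetValue": 60.0,        # scale out when requests per target exceed ~60
    },
    EstimatedInstanceWarmup=300,    # seconds a new instance gets to warm up
)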
Stakeholder Management & Gathering Requirements
Overview
What we’ve learned so far
What we will learn this week (week 4):
Requirements
Requirements form a hierarchy of needs.
Conversation with Matt Housley / with the CTO
Matt Housley plays the role of the CTO (he is a co-author of the book “Fundamentals of Data Engineering”)
His advice:
First, learn and play with core data concepts.
Build your mental framework to think like a DE.
DE asks CTO about the overall goals of the business and technology ← eg. eCommerce:
Challenge: the market evolves and a lot of small brands are coming up. ← old code still running could cause outages.
→ We will do refactoring → but how will it affect you as a DE? ← DE: help modernize the systems
CTO: we want to make sure that the output of the refactored code is suitable for analytics with minimal processing. ← DE: what tools can I expect to be working with?
CTO: convert more from a batch-based approach to a streaming-based approach → using AWS Kinesis + Kafka
CTO: you’re going to work as a consultant with the software side to understand the data they’re producing and then help them to produce better quality data.
DE: Does the company have any goals w.r.t. AI? ← CTO: we’re working on a recommendation engine, so your job is to develop the data pipelines that feed that recommendation engine.
Conversation with Marketing
Back to previous weeks → 2 primary requests
delivering dashboards with metrics on product sales
product recommendation system
Key elements of requirements gathering when talking with Marketing
Learn what existing data systems or solutions are in place
Learn what pain points or problems there are with the existing solutions
Learn what actions stakeholders plan to take based on the data you serve them
Identify any other stakeholders you’ll need to talk to if you’re still missing information
Tip: after learning → try to repeat what you learned back to the stakeholder.
DE: tell me more about these demand spikes. ← demand rises sharply over a span of a few hours and then eventually drops off again after a day or two
…many more, but focus on the key elements above
Document the requirements you’ve gathered
Conversation with the Software Engineer
Open lines of communication with the source system owners and discuss how to best anticipate and deal with disruptions or changes in the data when they occur.
Key elements when talking to source system owners:
Learn what existing data systems or solutions are in place
Learn what pain points or problems there are with the existing solutions
Conversation takeaways
Documenting nonfunctional requirements: Nonfunctional requirements can be a little trickier than functional requirements in the sense that they won't, typically, be things that your stakeholders explicitly ask for.
Requirements Gathering Summary
“In my own experience, I've seen data engineering done wrong and more times, I've seen it done right. I just want to make sure that you're set up for success with the data systems that you build.”
Main takeaways:
Identify the stakeholders, understand their needs and the broader goals of the business
Ask open-ended questions
Document all of your findings in a hierarchy form: business goals > stakeholder needs > system requirements (functional and nonfunctional)
Evaluation of trade offs: timeline vs cost vs scope (features of the system) ← Iron Triangle
→ The fallacy of the iron triangle: that you can’t have all three: projects done as quickly as possible + projects done well + projects within budget constraints.
The way to break the iron triangle is through the application of principles and processes:
building loosely coupled systems
optimizing for two-way-door decisions
deeply understanding the needs of stakeholders.
Translating Requirements to Architecture
Conversation with the Data Scientist
In this section, your goal is to take a project from requirements gathering to implementation. ← a micro-simulation of what a project in the real world might look like.
Here are the key takeaways from this conversation. You are tasked with building two data pipelines:
A batch data pipeline that serves training data for the data scientists to train the recommender system;
A streaming data pipeline that provides the top product recommendations to users based on their online activity. The streaming data pipeline makes use of the trained recommender system to provide the recommendations.
The recommendation system will recommend products to customers based on their user information and based on the products the customer has been browsing or that are in their cart.
Details of the Recommender System
This section briefly explains how the content-based recommender system works and how it is trained.
Recommendation model
Extract functional and nonfunctional requirements
Functional requirements of batch pipeline:
The data pipeline needs to serve data in a format that is suitable for training the ML models.
How long the data pipeline should retain or store the training data.
How to handle data updates: Determine whether to replace old data with new data, keep only the latest rating for each customer-product pair, or calculate an average rating over time.
The file format the training data should be served in. (e.g. CSV, Parquet, etc.)
Nonfunctional requirements for batch pipeline:
The data system must be easy to maintain and update, and requires less operational overhead (Irregular / on demand run schedule)
The data system must be cost effective.
Functional requirements of streaming pipeline:
use the recommender system to find the set of products to recommend for a given user based on the user’s information and the products they interacted with online.
stream the products to the sales platform and store them for later analysis.
Nonfunctional requirements of streaming pipeline:
Latency: The system must deliver product recommendations to users in real-time as they browse or checkout. It should process user and product data, generate recommendations, and serve them to the sales platform within seconds.
Scalability & concurrency: The system must be able to scale up to ingest, transform and serve the data volume expected with the maximum level of user activity on the platform, while staying within the latency requirements.
AWS Services for Batch Pipelines
We consider Extract — Transform — Load (ETL) Pipeline
You could use an EC2 instance, but you would have to take responsibility for installing software, managing security, and managing cloud deployment complexity. ← In this lab, we use serverless options
☝️ Recommend: using serverless first, then containers and orchestration if possible!
One serverless option you could use is AWS Lambda ← it has limitations:
15-minute timeout for each function call.
Memory and CPU allocation for each function.
Requires you to write custom code for your use case. ← might not be the best use of your time!
We consider Amazon EMR vs AWS Glue ETL! ← the trade-off between them is control vs convenience
Normalized tabular data → star schema → could use another Amazon RDS instance.
Complex analytical queries on massive datasets → use Amazon Redshift (higher cost than RDS).
This week, we focus on the ML use case (data will be used for training a recommender model) → the best and cheapest storage and serving option is Amazon S3 (flexible + scalable + cost-effective). ← It allows you to store virtually any kind of data and integrates easily with other AWS services.
AWS Services for Streaming Pipelines
Just like the batch approach, if we use EC2, we are responsible for installing software, managing security, and managing cloud deployment complexity.
Amazon Data Firehose: simplifies the process of integrating with Kinesis Data Streams and allows you to take data from a stream and store it in a destination like S3 or Redshift, or send it to HTTP endpoints or providers like Datadog or Splunk.
Open terraform/main.tf and uncomment the “etl” section. Open terraform/outputs.tf and uncomment the “ETL” section. ← Remove the space before # ETL, otherwise the code won’t work!
Terraform
cd terraform/
terraform init
terraform plan
terraform apply   # then type "yes"
We get a deeper understanding of the various Terraform files in Course 2.
We can run this command again and again until we see “SUCCEEDED” instead of “RUNNING” (a hedged polling sketch follows below).
Go further into the details of AWS Glue in Course 4.
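As an aside, a hedged sketch of how you might poll the Glue job state with boto3 instead of re-running a command by hand (the job name and run ID are hypothetical; the lab provides the actual command):
import time
import boto3

glue = boto3.client("glue")

def wait_for_glue_job(job_name: str, run_id: str) -> str:
    # Poll the job run until it reaches a terminal state
    while True:
        state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            return state
        time.sleep(30)

print(wait_for_glue_job("de-c1w4-etl-job", "jr_0123abc"))  # hypothetical names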
Now, we can check the S3 bucket that contains the transformed training data. ← go back to the console and open S3 → choose the bucket that has “-datalake” in its name.
We are at this step!
Setting up the Vector Database
In this lab, you won’t train the recommender system yourself (Data Scientist’s job).
You will be provided with a model already trained (stored in S3 as ML artifacts).
Go back to S3 and choose the bucket with ml-artifacts in its name. There are 3 folders
embeddings: contains the embeddings of the users and items (or products) that were generated by the model. There are 2 files: item_embeddings.csv and user_embeddings.csv ← these will be used to retrieve similar products (see the sketch after this list).
model: contains the trained model that will be used for inference.
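To build intuition for how the embeddings are used, a hedged sketch that loads the two CSVs with pandas and ranks items by cosine similarity for one user (the CSV layout, with an ID column followed by embedding dimensions, is an assumption; in the lab the vector database does this lookup):
import numpy as np
import pandas as pd

users = pd.read_csv("user_embeddings.csv", index_col=0)   # assumed: first column is the user id
items = pd.read_csv("item_embeddings.csv", index_col=0)   # assumed: first column is the item id

def top_k_items(user_id, k=5):
    u = users.loc[user_id].to_numpy()
    m = items.to_numpy()
    # cosine similarity between the user vector and every item vector
    scores = (m @ u) / (np.linalg.norm(m, axis=1) * np.linalg.norm(u) + 1e-9)
    return items.index[np.argsort(scores)[::-1][:k]].tolist()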
Another way to check VectorDBHost: open RDS > Databases > click de-c1w4-vector-db > in Connectivity & Security > look for the endpoint (de-c1w4-vector-db.xxxx.us-….rds.amazonaws.com)
Add the embeddings to the vector database → open sql/embeddings.sql
To check and replace <BUCKET_NAME>, go to S3 console, open bucket (”…-ml-artifacts”) and then copy the entire name.
de-c1w4-471112715504-us-east-1-ml-artifacts   (an example bucket name)
Connect to the database
psql --host=<VectorDBHost> --username=postgres --password --port=5432
# then type the password above
To work with postgres database
\c postgres;
Run the embeddings.sql
\i '../sql/embeddings.sql'
Check all available tables
\dt *.*
# use the arrow-down key to show more
# Q to exit
To quit the psql prompt: \q
Implementing the Streaming Pipeline
Model inference (Lambda) will use the trained model stored in S3 and the embeddings from the vector database to provide recommendations based on the online user activity streamed by Kinesis Data Streams.
Before Firehose loads the data into S3, it will invoke the Lambda function labeled “stream transformation” to extract user and product features, then pass these to model inference to get the recommendations. → Firehose finally loads the recommendations into S3.
You will learn more about the underlying design of Kinesis Data Streams in Course 2.
Model inference is already implemented and provided, but it’s not completely configured. Go to Lambda > de-c1w4-model-inference > Configuration > Environment Variables > fill in the values of host, username, and password from the previous steps (see the sketch below for how the function might read these).
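Inside the function, those values would typically be read from the environment; a tiny hedged sketch (the variable names here are hypothetical, use whatever names the lab's Lambda expects):
import os

# Read the vector database connection settings injected via environment variables
DB_HOST = os.environ["VECTOR_DB_HOST"]          # hypothetical variable name
DB_USER = os.environ["VECTOR_DB_USER"]          # hypothetical variable name
DB_PASSWORD = os.environ["VECTOR_DB_PASSWORD"]  # hypothetical variable name
# These would then be passed to the database client when opening a connection.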
To implement the streaming pipeline, uncomment the streaming sections in terraform/main.tf and terraform/outputs.tf. Then run
terraform init
terraform plan
terraform apply   # then type "yes"
To check the recommendations that are created in S3: go to S3 > click the recommendations bucket