List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.
- As a data engineer, you’ll write queries to extract data and set up storage solutions for stakeholders to query directly, considering how storage choices affect query speed and performance.
- Query: A statement that you write in a specific query language to retrieve or
act on data.
- In previous course, you’ve worked with some kinds of queries:
- Week 3 plan:
- Database Management System (DBMS)
- Transport System: when you send a query to DBMS, your request arrives in TS and then to Query Processor.
- Query Processor: Query Parser + Query Optimizer
- Execution Engine: excution plan → query result
- Storage Engine: already mentioned in prev courses.
- Use
EXPLAIN
keyword before any query to see the cost.
- Common vs Advanced SQL statements
- The data we will work with
- Use
DISTINCT
and SQL Functions
CASE
- Use
CASE
andIN
- Common Table Expressions (CTE) → Use
WITH...AS
- Compute the percentage of on-time payments for each customer (from the prev query)
- → Improve prev codes: find the maximum of the “percent on time payment” column
- You can also incorporate some temporary results within your main query using subqueries
- Compute the average duration in days that a customer spent on a film category
- SQL Window Functions: Allows you to apply an aggregate or ranking function over a
particular window or a set of rows. Does not group rows into a single output row: each row remains separate - Partitions rows to form a set of rows. (
PARTITION BY
clause is used) - Orders rows within those partitions into a particular order. (
ORDER BY
clause is used)
Window functions apply to aggregate and ranking functions over a particular window (set of rows).
OVER
clause is used with window functions to define that window. OVER
clause does two things : Note: If partitions aren’t done, then
ORDER BY
orders all rows of the table. - Add a column that shows for each customer the running sum over each window
- ⭐ Check the notebook.
- After this assignment you'll be able to:
- Build SQL queries using several techniques to manipulate data.
- Distinguish between Common Table Expressions and Subqueries.
- Apply the
DATE_PART
function to manipulate date values. - Make use of the
LAG
andLEAD
functions to access values from other rows when making a calculation. - Utilize the
CASE
statement to return different values based on conditions. - Use window functions on SQL to aggregate data within a specific window.
- Understand pivot tables and identify how they are built.
- As a data engineer, understanding how indexes are implemented can help you design better index structures and optimize your queries to improve SQL performance.
- Index: A separate data structure that has its own disk space and contains information that refers to the actual table.
- Balanced Search Tree (B-Tree)
CREATE INDEX
to improve the query performance
- Columnar Storage. Cloud DB like Amazon Redshift use Sort Key to sort one or more column and stores the sorted data on disk.
- If
employee_id
is the primary key , then an index is automatically created on the primary key.
- It’s worst when you write a query that not only scans the entire table, but also returns everything. Eg:
SELECT * FROM order
← avoid with noWHERE
- Large amounts of data need to be transferred from disk.
SELECT *
on cloud pas-as-you-go databases can be expensive.
- TIP → Using Pruning (tỉa, lược bớt)
- Row-based pruning
- Column-based pruning
- Partition pruning
JOIN
operation is one of the most time-consuming query operations.
JOIN
can use several methods:- Nested loop join: For each row in “orders”`, go throw all rows in “customers”
- Index-based nested loop:
- Hash join: using hash to map “customer_id” in to bucket
- Another challenge is when 2 tables have a many-to-many relationship. ← row explosion (return more rows than expected)
- Aggregate queries: used to compute summary value of a column like sum, average, max, min, count,…
- Using index if available.
- It’s quicker if using columnar storage vs row storage.
- We'll look at how Redshift queries data and some considerations to keep in mind for table design that can optimize query performance.
- Amazon Redshift is a columnar storage ← speed up query performance ← best suit for Data Warehousing and large-scale data analytics.
- Redshift uses Data compression. ← save storage space + read less data from disk.
- Support Massive Parallel Processing (MPP) → depend much on distribution style and sort key.
- Check the codes for this lab.
- Compare 2 types of storages.
- Row-based storage → PostgreSQL, column-based → Amazon Redshift
- TPC-H benchmark: The goal of this benchmark is to simulate a set of basic scenarios to examine a large dataset and execute queries to answer business questions. It's designed to evaluate the performance of various database systems and how they execute complex queries.
- Leverage Query Caching:
- Prioritize Readability:
- Data resouces → table bloat (data size on disk exceeds the actual data size)
- We can use system like Apache Flink or Apache Spark to apply SQL queries continuously on streaming data.
- To query on streaming data, we can use Windowed Query:
- To join data streams, we have several ways
- You can use several services on AMZ to work with Apache Flink or you can use Amz Managed Service for Apache Flink.
- AMSAF creates the hosted environment for you to run your applications.
- On AMSAF (Console), when creating a new environment with AMSAF:
- Streaming applications: allows you to host your Apache Flink application and run it in a production environment.
- Studio notebooks: Ideal for development and interactive data exploration. ← Apache Zeppelin
- AMSAF uses connectors to interface with databases, storages, message queues,…
- Notebook → Apache Zeppelin: code written in SQL, Python, Java. Data visualization as tables and charts. Integration with tools like Flink.
- In case you wanna setup notebook in your way (not using blueprint)
Check the codes for this lab.