👉 Lecture notes & Repository.
- Storage and queries are more complex than they appear.
- Various decisions in data storage impact performance and efficiency.
- Storage solution considerations: data type, data size, data format, access and update pattern.
- Storage hierarchy:
- Management system: organizes data on the raw physical components and lets you interact with the stored data
- OLTP (Online Transaction Processing) systems: focus on performing read and write queries with low latency ← row-oriented storage is more suitable
- OLAP (Online Analytical Processing) systems: focus on applying analytical operations to data (e.g. aggregation, summarization) ← column-oriented storage is more suitable
- Plan:
- Week 1: Trade-off between storage cost and performance.
- Cloud storage paradigms (block, object and file storage)
- Data storage in databases
- Row vs column-oriented databases
- Graph and vector databases
- Characteristics of physical components
- Serialization and compression
- Week 2: How to choose the appropriate abstractions for storing your data.
- Week 3: Queries
- How queries work
- How different storage solutions affect query performance
- Techniques for improving query performance
- Persistent storage media: HDD, SSD.
- Volatile Memory: RAM, CPU cache.
- Serialization: data needs to be serialized before being saved to storage and deserialized before being read (see the sketch after this list).
- Serialization formats
- Compression
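- A minimal sketch of the serialize → compress → store round trip, using only the Python standard library (the record is made up for illustration):

```python
import gzip
import json
import pickle

record = {"id": 1, "name": "Ada", "scores": [0.9, 0.8]}

# Serialization: turn the in-memory object into bytes before storing it.
as_json = json.dumps(record).encode("utf-8")  # human-readable, portable
as_pickle = pickle.dumps(record)              # Python-specific binary format

# Compression: shrink the serialized bytes before writing them to disk.
compressed = gzip.compress(as_json)

# Reading goes the other way: decompress, then deserialize.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
assert restored == record
```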
- File Storage:
- Organizes data in a hierarchical directory structure.
- Suitable for centralized access and easy sharing among users.
- Lower read/write performance due to the overhead of traversing the file hierarchy.
- Block Storage:
- Stores data in fixed-size blocks.
- High performance and low latency for frequent read/write operations.
- Ideal for transactional workloads and virtual machine storage.
- Object Storage:
- Stores data as immutable objects in a flat structure.
- Highly scalable, supporting petabytes of data.
- Best for analytical workloads, data lakes, and large unstructured data storage.
- Which one to choose?
- Storage as a cluster (a group of nodes), where each node is an HDD or SSD.
- Easy horizontal scaling, plus higher fault tolerance and data durability (if one node dies, the others still hold the data).
- Methods for distributing data: replication (duplicate the same data across many nodes) or partitioning (split the data and put the pieces on multiple nodes).
- Weakness of distributing data: coordination takes time (e.g., keeping replicas in sync adds latency).
- CAP Theorem: you can only choose 2 of 3: consistency, availability, and partition tolerance. Partition tolerance is effectively fixed (network partitions will happen), so the real choice is between consistency and availability!
- RDBMS → consistency, NoSQL → availability
- You want to distribute your data across multiple nodes → you need to split your data into partitions or shards. You can use a database sharding method or rule to construct a shard key that indicates how the data will be partitioned.
- Common sharding methods (a small sketch follows this list):
- Range-based sharding: Splits rows based on a range of values, such as first letters of customer names, but may lead to unbalanced data distribution.
- Hashed sharding: Uses a hash function to partition data, resulting in a more balanced distribution across nodes.
- Geo sharding: Partitions data by geographical location, allowing data to be stored closer to customers for faster retrieval.
- Other methods: split data based on meaningful attributes like occupation or favorite color.
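- A toy sketch contrasting range-based and hashed sharding (the shard count and letter ranges are made up; a stable hash like MD5 is used because Python's built-in `hash()` changes between runs):

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size for this sketch

def hashed_shard(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Hashed sharding: a hash of the shard key picks the node,
    spreading keys roughly evenly across shards."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def range_shard(name: str) -> int:
    """Range-based sharding on the first letter of a customer name;
    simple, but popular letters can overload a single shard."""
    first = name[:1].upper()
    if first <= "F":
        return 0
    if first <= "L":
        return 1
    if first <= "R":
        return 2
    return 3

for customer in ["Alice", "Bob", "Santos", "Zoe"]:
    print(customer, "-> range shard:", range_shard(customer),
          "| hashed shard:", hashed_shard(customer))
```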
- Check the code examples below.
- Object storage:
- Using Amazon S3 buckets and `boto3` to upload data.
- An object key looks like it contains folders and subfolders, but the structure is actually flat (all objects live in the same namespace).
- Enable versioning: if you upload to the same object key, a new version of that file is created → use `list_object_versions` to check. A sketch of the whole flow follows below.
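- A sketch of that flow with `boto3` (the bucket name and local file are hypothetical, and AWS credentials are assumed to be configured):

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket"  # hypothetical bucket name

# Turn on versioning so overwrites create new versions instead of replacing data.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# The key looks hierarchical, but S3 stores all objects in a flat namespace.
s3.upload_file(Filename="report.csv", Bucket=bucket, Key="sales/2024/report.csv")

# Uploading to the same key again creates a second version of the object.
s3.upload_file(Filename="report.csv", Bucket=bucket, Key="sales/2024/report.csv")

# List every version stored under that key.
response = s3.list_object_versions(Bucket=bucket, Prefix="sales/2024/report.csv")
for version in response.get("Versions", []):
    print(version["Key"], version["VersionId"], version["IsLatest"])
```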
- File storage
- Block storage: a server emulates the behavior of block storage.
- Memory: using the `cache-pandas` package with the `@timed_lru_cache()` decorator to cache data in RAM (sketch below).
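- A minimal sketch of the RAM-caching idea; the decorator's signature is assumed from the `cache-pandas` README, and the URL is hypothetical:

```python
import pandas as pd
from cache_pandas import timed_lru_cache  # signature assumed from the package README

@timed_lru_cache(seconds=60, maxsize=None)  # keep results in RAM for 60 seconds
def load_data(url: str) -> pd.DataFrame:
    # The expensive read only runs on a cache miss; repeated calls within
    # the expiry window return the cached DataFrame straight from memory.
    return pd.read_csv(url)

df1 = load_data("https://example.com/data.csv")  # hypothetical URL; hits the source
df2 = load_data("https://example.com/data.csv")  # served from the in-memory cache
```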
- Databases store data via a DBMS (Database Management System). It contains several components:
- Transport System
- Query Processor
- Execution Engine
- Storage Engine
- The Storage Engine is responsible for:
- Serialization
- Arrangement of data on disk
- Indexing
- Indexing helps query data much faster: an index stores the locations of records keyed by a value, so the engine can jump to a row instead of scanning everything. Indexes can be built on key values or foreign key values.
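- A toy sketch (plain Python, not a real storage engine) of why an index speeds up lookups:

```python
# Rows as they might sit in storage.
rows = [
    {"id": 17, "name": "Ada"},
    {"id": 3, "name": "Grace"},
    {"id": 42, "name": "Edsger"},
]

# Build the index once: key value -> position of the row in storage.
index = {row["id"]: pos for pos, row in enumerate(rows)}

# Without an index: O(n) scan over every row.
scan_hit = next(r for r in rows if r["id"] == 42)

# With the index: jump straight to the row's location.
index_hit = rows[index[42]]
assert scan_hit == index_hit
```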
- Row-oriented storage: more suitable for transactional workloads that require low-latency read and write operations, but inefficient for analytical workloads.
- Column-oriented storage: suitable for analytical workloads that aggregate over columns, but not suitable for reading, writing, or updating individual rows.
- Hybrid approach: combine the column-oriented and row-oriented layouts by partitioning rows into groups and storing each row group in a column-wise format.
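- A toy illustration of the three physical layouts on the same made-up table:

```python
table = [
    {"id": 1, "price": 10.0},
    {"id": 2, "price": 12.5},
    {"id": 3, "price": 9.0},
    {"id": 4, "price": 11.0},
]

# Row-oriented: each row's values sit together (fast point reads and writes).
row_store = [list(row.values()) for row in table]

# Column-oriented: each column's values sit together (fast aggregations).
col_store = {col: [row[col] for row in table] for col in table[0]}
avg_price = sum(col_store["price"]) / len(col_store["price"])

# Hybrid (Parquet-style): partition rows into row groups,
# then store each group column-wise.
GROUP_SIZE = 2
row_groups = [
    {col: [row[col] for row in table[i:i + GROUP_SIZE]] for col in table[0]}
    for i in range(0, len(table), GROUP_SIZE)
]
```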
- Parquet can be used for storing tabular data as well as nested data such as JSON-like structures.
- Parquet's portability is another key advantage. It offers improved performance when working with external tools, unlike proprietary columnar formats from cloud data warehouses that need conversion for compatibility.
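- A small pandas example of writing and selectively reading Parquet (requires a Parquet engine such as pyarrow; the file name and data are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "category": ["Beverages", "Produce", "Beverages"],
    "price": [4.5, 3.0, 7.25],
})

# Write the table in Parquet's columnar format.
df.to_parquet("orders.parquet")

# Column pruning: read back only the columns the query needs.
prices = pd.read_parquet("orders.parquet", columns=["category", "price"])
print(prices.groupby("category")["price"].mean())
```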
- Plain key-value databases (NoSQL)
- Wide-column databases
- Wide-column databases don't enforce a strict table schema: adding columns is very flexible, and a column is only written if there's data for it.
- Amazon Keyspaces (for Apache Cassandra)
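- A toy sketch (plain Python dicts, not a real wide-column engine) of the flexible-schema idea:

```python
# Each row stores only the columns it actually has values for.
wide_column_table = {
    "user:1": {"name": "Ada", "email": "ada@example.com"},
    "user:2": {"name": "Grace", "phone": "555-0101"},  # no email column written
    "user:3": {"name": "Edsger", "email": "ew@example.com",
               "favorite_color": "blue"},  # new column, no schema migration
}

# A column is simply absent when there's no data for it.
print(wide_column_table["user:2"].get("email"))  # -> None
```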
- With relational databases, things get complicated when relationships are complex (lots of joins) → graph databases help.
- Use cases of Graph Database:
- Recommending products
- Modeling social networks
- Representing network and IT operations
- Simulating supply chain logistics
- Tracing data lineage
- Fraud detection
- Knowledge graphs (used with the RAG technique in LLMs)
- Graph databases: Neo4j, ArangoDB, Amazon Neptune.
- Vector databases enable you to efficiently query data based on semantic similarities.
- You could use a vector database for storing any kind of data that is in the form of numbers in an array.
- A vector database uses a distance metric to find similar vectors: Euclidean distance, cosine distance, or Manhattan distance.
- Similarity search: KNN (inefficient as data size increases; suffers from the curse of dimensionality), ANN (more efficient than KNN; vector databases are built to support ANN). A brute-force sketch follows below.
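- A brute-force KNN sketch with NumPy and cosine distance (the vectors are made up); this linear scan over every stored vector is exactly what stops scaling and what ANN indexes avoid:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity: 0 for identical directions, up to 2 for opposite."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# A tiny "vector database": five 4-dimensional embeddings.
vectors = np.array([
    [0.9, 0.1, 0.0, 0.2],
    [0.1, 0.8, 0.3, 0.0],
    [0.85, 0.15, 0.05, 0.25],
    [0.0, 0.2, 0.9, 0.4],
    [0.3, 0.3, 0.3, 0.3],
])

query = np.array([0.88, 0.12, 0.02, 0.22])
k = 2

# KNN: compute the distance to every stored vector, then keep the k smallest.
distances = [cosine_distance(query, v) for v in vectors]
top_k = np.argsort(distances)[:k]
print("nearest indices:", top_k)
```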
Given a query point, the search (this is the procedure used by HNSW, Hierarchical Navigable Small World graphs) starts at the entry node of the top layer and navigates through the graph of that layer, each time moving to the neighboring node that is closest to the query point. It stops at a node that has no neighbor closer to the query point. The algorithm then jumps to that same node in the next lower layer and begins searching again, repeating the process until it finds the nearest node at the bottom layer.
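A toy sketch of that greedy layered search (not a real HNSW implementation: index construction and candidate-list refinements are omitted, and the layer graphs are hand-made):

```python
import numpy as np

def greedy_layer_search(query, entry, graph, vectors):
    """Greedy search within one layer: keep moving to the neighbor
    closest to the query until no neighbor improves on the current node."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for neighbor in graph[current]:
            d = np.linalg.norm(vectors[neighbor] - query)
            if d < current_dist:
                current, current_dist = neighbor, d
                improved = True
    return current

def layered_search(query, layers, entry, vectors):
    """Descend from the top layer to the bottom, using the node found
    in each layer as the entry point of the layer below."""
    node = entry
    for graph in layers:  # layers ordered top -> bottom
        node = greedy_layer_search(query, node, graph, vectors)
    return node

vectors = {i: np.array(v) for i, v in enumerate(
    [[0.0, 0.0], [1.0, 0.0], [0.95, 0.05], [5.0, 5.0]])}
layers = [
    {0: [3], 3: [0]},                              # sparse top layer
    {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]},  # dense bottom layer
]
print(layered_search(np.array([0.95, 0.05]), layers, entry=0, vectors=vectors))  # -> 2
```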
- Graph databases such as Neo4j allow you to model your data as a graph and interact with it similar to how you interact with relational databases.
- This focuses on the Property Graph Model.
- The model describes what types of nodes exist in the graph and how they are linked together.
- Edges between nodes are called relationships, and each relationship has a type.
- Each relationship type has a source node and a target node.
- An example:
- To create a graph database in Neo4j:
- Cypher Query Language → use `MATCH pattern RETURN result`
```cypher
// Return all nodes
MATCH (n) RETURN n

// Get the total number of nodes
MATCH (n) RETURN count(n)

// Count only nodes with a specific label
MATCH (n:Order) RETURN count(n)

// Explore node labels
// DISTINCT ensures the returned labels aren't repeated
MATCH (n) RETURN DISTINCT labels(n)

// Explore the properties of a specific node
MATCH (n:Order) RETURN properties(n) LIMIT 1  // limit to the 1st node
```
Pattern notation: node → `()`, relationship → `[]`, path → `(source node)-[r]->(target node)`
```cypher
// Count all the directed paths
MATCH ()-[r]->() RETURN count(r)

// List the distinct types of all relationships in the graph
MATCH ()-[r]->() RETURN DISTINCT type(r)

// Specify the type of the relationship -> return properties
// AS defines an alias for the returned value
MATCH ()-[r:ORDERS]->()
RETURN avg(r.quantity * r.unitPrice) AS average_price

// Get the average price for all orders grouped by product category
MATCH ()-[r:ORDERS]->()-[part:PART_OF]->(c:Category)
RETURN c.categoryName, avg(r.quantity * r.unitPrice) AS average_price
```
```cypher
// Filter: retrieve the product name and unit price of all products
// that belong to the category "Meat/Poultry"
MATCH (p:Product)-[:PART_OF]->(c:Category)
WHERE c.categoryName = "Meat/Poultry"
RETURN p.productName, p.unitPrice

// OR: inline the property filter
MATCH (p:Product)-[:PART_OF]->(c:Category {categoryName: "Meat/Poultry"})
RETURN p.productName, p.unitPrice
```
```cypher
// Retrieve the product name of all products ordered by the
// customer with customerID "QUEDE"
MATCH (c1:Customer {customerID: "QUEDE"})-[:PURCHASED]->()-[:ORDERS]->(p:Product)
RETURN p.productName

// Get the IDs of other customers who ordered the same products as "QUEDE"
MATCH (c1:Customer {customerID: "QUEDE"})-[:PURCHASED]->()-[:ORDERS]->(p:Product)<-[:ORDERS]-()<-[:PURCHASED]-(c2:Customer)
RETURN c2.customerID
```
```cypher
// Retrieve the orders that contain at most 2 products
// STEP 1: get the total number of products for each order
MATCH (o:Order)-[:ORDERS]->(p:Product)
RETURN o.orderID AS ID, count(p) AS countProd

// STEP 2: filter the orders having at most 2 products
MATCH (o:Order)-[:ORDERS]->(p:Product)
WITH o.orderID AS ID, count(p) AS countProd  // replace RETURN with WITH
WHERE countProd <= 2
RETURN ID, countProd
```
- ⭐ Read the notebook; there are detailed explanations there!