DE by DL.AI - C3 W1 - Storage Ingredients & Storage Systems

Anh-Thi Dinh
List of notes for this specialization + Lecture notes & Repository + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.

Course 3 overview

  • Storage and queries are more complex than they appear.
  • Various decisions in data storage impact performance and efficiency.
  • Storage solution considerations: data type, data size, data format, access and update pattern.
  • Storage hierarchy:
    • Management system: Organizes data in the raw components and allows you to interact with the stored data
    • OLTP (Online Transactional Processing Systems): focus on performing read and write queries with low latency. ← row oriented storage is more suitable
    • OLAP systems (Online Analytical Processing Systems): Focus on applying analytical activities on data (e.g. aggregation, summarization) ← column oriented storage is more suitable
  • Plan:
    • Week 1: Trade off between storage cost and performance.
      • Cloud storage paradigms (block, object and file storage)
      • Data storage in databases
        • Row vs column-oriented databases
        • Graph and vector databases
      • Characteristics of physical components
      • Serialization and compression
    • Week 2: How to choose the appropriate abstractions for storing your data.
    • Week 3: Queries
      • How queries work
      • How different storage solutions affect query performance
      • Techniques for improving query performance

Data Storage Deep Dive

Storage Raw Ingredients - Physical Components of Data Storage

  • Persistent Storage Medium: HDD, SSD.
  • Volatile Memory: RAM, CPU cache.

Storage Raw Ingredients - Processes Required for Data Storage

  • Serialization: Data need to be serialized before saving in storage and need to be de-serialized to be read.
  • Serialization formats
  • Compression
    • Compression ratio = compressed files/original uncompressed files.

Cloud Storage Options: Block, Object and File storage

  • File Storage:
      • Organizes data in a hierarchical directory structure.
      • Suitable for centralized access and easy sharing among users.
      • Lower read/write performance due to tracking file hierarchy.
  • Block Storage:
      • Stores data in fixed-size blocks.
      • High performance and low latency for frequent read/write operations.
      • Ideal for transactional workloads and virtual machine storage.
  • Object Storage:
      • Stores data as immutable objects in a flat structure.
      • Highly scalable, supporting petabytes of data.
      • Best for analytical workloads, data lakes, and large unstructured data storage.
  • Which one to choose?

Storage Tiers - Hot, Warm & Cold Data

Distributed Storage Systems

  • Storage as cluster (group of nodes). Each node is HDD or SSD.
  • Easy to horizontal scaling + higher fault tolerance and data durability (die one, there are others)
  • Methods for distributing data: replication (duplicate the same as many) or partitioning (split and put in mutiple nodes)
  • Weakness of distributing data: takes time
  • CAP Theorem: can only choose 2 of 3 — consistency, availability and partition tolerance. The last is almost fix → problem is to choose consistency or availability!
    • ACID = Atomicity + Consistency + Isolation + Durability, BASE = Basically Available + Soft-state Eventual Consistency.
    • RDBMS → consistency, NoSQL → Availability

Database Partitioning / Sharding Methods

  • You want to distribute your data across multiple nodes → you need to split your data into partitions or shards. You can use a database sharding method or rule to construct a shard key that indicates how the data will be partitioned.
  • Common sharding methods:
    • Range-based sharding: Splits rows based on a range of values, such as first letters of customer names, but may lead to unbalanced data distribution.
    • Hashed sharding: Uses a hash function to partition data, resulting in a more balanced distribution across nodes.
    • Geo sharding: Partitions data by geographical location, allowing data to be stored closer to customers for faster retrieval.
    • Other methods: Can split data based on meaningful attributes like occupation or favorite color.

🧑‍💻 Lab 1 - Comparing Cloud Storage Options

  • Object storage:
    • Using Amazon S3 buckets and boto3 to upload data.
    • Object key looks like containing folder, subfolders but actually it’s a flat structure (all are in the same place)
    • Enable versioning: in case you upload to the same object key, a new version of this file will be created. → use list_object_version to check.
  • File storage
  • Block storage: there is a server that emulates the behavior of block storage.
  • Memory: using cache-pandas package with @timed_lru_cache() decorator to store the data into RAM cache.

Storage in Databases

How Databases Store Data

  • Databases store data via a DBMS (Database Management System). It contains several components:
    • Transport System
    • Query Processor
    • Execution Engine
    • Storage Engine
  • The Storage Engine is responsible for:
    • Serialization
    • Arrangement of data on disk
    • Indexing
  • Indexing helps query data much faster by storing the locations of IDs via indexes. Indexes can be based on key values or foreign key values.

Row vs column storage

  • Row-oriented storage: more suitable for transactional workloads that require low latency read and write operation, but not efficient analytical workloads
  • Column-oriented storage: suitable for analytical workloads that involve aggregating operations to columns but not suitable for reading, writing or updating individual rows

The parquet format

  • Combine both column-oriented and row-oriented approaches. In this hybrid approach, rows are partitioned into groups and each row group is stored in a column-wise format.
  • Parquet can be used for storing tabular data as well as nested data such as JSON file.
  • Parquet's portability is another key advantage. It offers improved performance when working with external tools, unlike proprietary columnar formats from cloud data warehouses that need conversion for compatibility.

Wide-column databases

  • Normal key-value database (NoSQL)
  • Wide-column databases
    • Each cell value is versioned and uniquely identified by its version number (timestamp)
  • Wide-column database doesn’t enforce a strict table schema, adding columns becomes very flexible, and a column is only written if there’s data for it.

Graph databases

  • With relational database, things get complicated with complex relationships → graph database helps.
  • Use cases of Graph Database:
    • Recommending products
    • Modeling social networks
    • Representing network and IT operations
    • Simulating supply chains logistics
    • Tracing data lineage
    • Fraud detection
    • Knowledge graph (used with RAG technique in LLM)

Vector databases

  • Vector databases enable you to efficiently query data based on semantic similarities.
  • You could use a vector database for storing any kind of data that is in the form of numbers in an array.
  • Vector database uses a distance metric to find similar vectors: Euclidean distance, Cosin distance or Manhattan distance.
  • Similarity Search: KNN (inefficient when data size increases or curse of dimensionality), ANN (more efficient than KNN, vector database buit to support ANN).

ANN algorithm: hierarchical navigable small world (HNSW)

Given a query point (e.g. the green node in the image above), the algorithm starts at the entry node of the top layer (i.e. the red node) and navigates through the graph of that layer, each time choosing the neighboring node that is closest to the query point. It stops at the node that does not have any neighboring nodes closer to the query point. At this point, the algorithm shifts to the current node in the next lower layer and begins searching again. It repeats the process until it finds the nearest node at the bottom layer.
References:

Neo4j and Cypher query language

  • Graph databases such as Neo4j allow you to model your data as a graph and interact with it similar to how you interact with relational databases.
  • This focuses on Property Graph Model.
  • The model describes what types of node exist in the graph and how these are linked together.
    • 5 types of nodes (cicles)
    • Edges between each nodes are call relationships and each has a type.
    • Each relationship type has a source node and a target node.
  • AN example
  • To create a graph database in neo4j
  • Cypher Query Language → use MATH pattern RETURN result
    • 1# return all nodes
      2Match (n) return n
      3
      4# Get total number of nodes
      5Match (n) return count(n)
      6
      7# with specific label
      8Math (n:Order) return count(n)
      9
      10# Explore node labels
      11# "distinct" ensures the returned labels aren't repeated
      12Match (n) return distinct labels(n)
      13
      14# Explore properties of a specific node
      15Math (n:Order) return Properties(n) limit 1 # limit only 1st node
      node → ()
      We need this graph for the following commands. Using this script and these CSV files from Northwind dataset (source) to generate this graph.
      1# Count all the directed paths
      2Match ()-[r]->() return count(r)
      3
      4# List distinct types of all relationships in the graph
      5Match ()-[r]->() return distinct type(r)
      6
      7# Specify the type of the relationship -> return properties
      8# "as" <- define an alias for the return value
      9Match ()-[r:ORDERS]->()
      10return AVG(r.quantity*r.unitPrice) as average_price
      11
      12# Get the average price for all orders grouped by product category
      13Match ()-[r:ORDERS]->()-[part:PART_OF]->(c:Category)
      14return c.categoryName, AVG(r.quantity*r.unitPrice) as average_price
      relationship → [], path → (source node)-[r]->(target node)
      1# Filter: Retrieve the product name and unit price of all products
      2# that belong to category "Meat/Poultry"
      3Match (p:Product)-[:PART_OF]->(c:Category)
      4where c.categoryName="Meat/Poultry"
      5return p.productName, p.unitPrice
      6
      7# OR
      8Match (p:Product)-[:PART_OF]->(c:Category {categoryName:"Meat/Poultry"})
      9return p.productName, p.unitPrice
      1# Retrieve the product name of all products ordered by the 
      2# customer with customerId "QUEDE"
      3Match (c1:Customer {customerID:"QUEDE"}) -[:PURCHASED]->()-[:ORDERS]->(p:Product)
      4return p.productName
      5
      6# Get the ID of other customers who ordered the same products as "QUEDE"
      7Match (c1:Customer {customerID:"QUEDE"}) -[:PURCHASED]->()-[:ORDERS]->(p:Product) <-[:ORDERS]-()<-[:PURCHASED]-(c2:Customer)
      8return c2.customerID
      1# Retrieve the orders that contain at most 2 products
      2# STEP 1: Get the total #products for each order
      3Match (o:Order)-[:ORDERS]->(p:Product)
      4return o.orderID as ID, count(p) as countProd
      5# STEP 2: filter the orders having at most 2 products
      6Match (o:Order)-[:ORDERS]->(p:Product)
      7with o.orderID as ID, count(p) as countProd # replace "return" with "with"
      8where countProd <= 2
      9return ID, countProd

🧑‍💻 Lab 2 - Graph databases & vector search with Neo4j

Conversation with Juan Sequeda

I don’t note this.