DP-900: Data Fundamentals
- Azure SQL Family of Services
- SQL
- ACID
- Database Administrator Tools
- JSON
- Data Engineering Tools
- Batch vs Stream Processing
- Normalized vs Denormalized data
- Star vs. Snowflake Schema
- Clustered vs Nonclustered Index
- Strong vs Eventual Consistency
- Synchronous vs Asynchronous Data Transmission
- OLAP vs OLTP
- SaaS vs PaaS vs IaaS
- Data Storage
- Data Structure
- ELT vs ETL
- Data Analytics
- Azure Synapse Analytics
- Pipelines
- Azure Storage Accounts
- Storage Tiers
- MongoDB
- Cosmos DB
- Azure Tables vs. Cosmos DB for Tables
- Read Replica
- Connectivity Architecture
- Public vs Private Endpoint
- Role-Based Access Control (RBAC)
- Transparent Data Encryption (TDE)
- Dynamic Data Masking
- Private Links
- Apache Hadoop
- Apache Kafka
- Apache Spark
- Databricks
Azure SQL Family of Services
https://azure.microsoft.com/en-us/blog/the-azure-sql-family-innovation-and-value-in-the-cloud/
SQL
Data Definition Language (DDL)
DDL defines data structures. Use these statements to create, alter, or drop data structures in a database.
- ALTER
- Collations
- CREATE
- DROP
- DISABLE TRIGGER
- ENABLE TRIGGER
- RENAME
- UPDATE STATISTICS
- TRUNCATE TABLE
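A minimal T-SQL sketch of the core DDL statements; the dbo.Customers table and its columns are made up for illustration.

-- Create a new table
CREATE TABLE dbo.Customers (
    CustomerID INT PRIMARY KEY,
    FirstName  NVARCHAR(50) NOT NULL,
    LastName   NVARCHAR(50) NOT NULL
);

-- Alter the structure: add a column
ALTER TABLE dbo.Customers ADD Email NVARCHAR(100) NULL;

-- Remove all rows but keep the table definition
TRUNCATE TABLE dbo.Customers;

-- Remove the table entirely
DROP TABLE dbo.Customers;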
Data Manipulation Language (DML)
DML statements affect the information stored in the database. Use these statements to insert, update, and delete rows in the database.
- BULK INSERT
- DELETE
- INSERT
- SELECT
- UPDATE
- MERGE
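A short T-SQL sketch of the main DML statements, reusing the hypothetical dbo.Customers table from the DDL example above.

-- Insert a row
INSERT INTO dbo.Customers (CustomerID, FirstName, LastName)
VALUES (1, 'Sammy', 'Shark');

-- Read rows
SELECT CustomerID, FirstName, LastName
FROM dbo.Customers
WHERE LastName = 'Shark';

-- Change existing rows
UPDATE dbo.Customers
SET FirstName = 'Sally'
WHERE CustomerID = 1;

-- Remove rows
DELETE FROM dbo.Customers
WHERE CustomerID = 1;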
ACID
- Atomicity: each transaction is treated as a single unit. Either all of its statements are executed, or none of them are.
- Consistency: ensures that transactions only make changes to tables in predefined, predictable ways.
- Isolation: when multiple users are reading and writing from the same table at once, each transaction behaves as though it were running alone, even though they are actually occurring simultaneously.
- Durability: ensures that changes to your data made by successfully executed transactions will be saved, even in the event of system failure.
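A T-SQL sketch of atomicity and durability using a hypothetical dbo.Accounts table: the two updates inside the transaction either both succeed or are both rolled back.

BEGIN TRY
    BEGIN TRANSACTION;
    -- Move 100 between two accounts; both updates form one atomic unit
    UPDATE dbo.Accounts SET Balance = Balance - 100 WHERE AccountID = 1;
    UPDATE dbo.Accounts SET Balance = Balance + 100 WHERE AccountID = 2;
    COMMIT TRANSACTION;   -- Durability: committed changes survive a system failure
END TRY
BEGIN CATCH
    -- Atomicity: on any error, neither update is applied
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
END CATCH;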
Database Administrator Tools
- Azure Data Studio
- Graphical interface for managing on-premises and cloud-based data services
- Runs on Windows, macOS, and Linux
- SQL Server Management Studio
- Runs on Windows
- More mature than Azure Data Studio
- Azure Portal and CLI
- Manage SQL database configurations
JSON
"Customer": {
"first_name" : "Sammy",
"last_name" : "Shark",
"location" : "Ocean",
"websites" : [
{
"description" : "work",
"URL" : "https://www.digitalocean.com/"
},
{
"desciption" : "tutorials",
"URL" : "https://www.digitalocean.com/community/tutorials"
}
],
"social_media" :
{
"description" : "twitter",
"link" : "https://twitter.com/digitalocean"
}
}
- Customer is the root object
- websites is a nested array, as indicated by the square brackets [ ]
- social_media is a nested object
Data Engineering Tools
- Azure Synapse Studio
- A central portal to manage Azure Synapse, Azure Data Factory (data ingestion) and Synapse assets (SQL/Spark pools)
- Requires SQL knowledge
- Create databases, tables, views, etc
- Azure CLI
- Supports SQL commands to connect to SQL Server
- HDInsight
- Streaming data or applying ELT jobs via Hive, Pig, Apache Spark
- Azure Databricks
- Streaming data or applying ELT jobs via Apache Spark
Batch vs Stream Processing
- Batch Processing
- Not real time
- Ideal for large processing workloads
- Cost effective
- Stream Processing
- Offers real-time analytics
- More expensive
Normalized vs Denormalized data
- Normalized data
- Easier to maintain data integrity
- Little to no redundant data
- Optimized for data storage
- Querying is slower
- Increases throughput for writing
- Better suited for transactional databases
- Denormalized data
- Harder to maintain data integrity
- Redundant data is common
- Storage is less optimal
- Querying is fast
- Increases throughput for reading
- Better suited for analytical databases
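A small T-SQL sketch of the same order data modeled both ways (hypothetical tables): the normalized design stores customer details once and references them by key, while the denormalized design repeats them on every order row.

-- Normalized: customer details stored once, referenced by a foreign key
CREATE TABLE dbo.Customer (
    CustomerID   INT PRIMARY KEY,
    CustomerName NVARCHAR(100),
    City         NVARCHAR(50)
);
CREATE TABLE dbo.OrderHeader (
    OrderID    INT PRIMARY KEY,
    CustomerID INT REFERENCES dbo.Customer (CustomerID),
    OrderDate  DATE
);

-- Denormalized: customer details repeated on every order row (faster reads, redundant data)
CREATE TABLE dbo.OrderDenormalized (
    OrderID      INT PRIMARY KEY,
    CustomerName NVARCHAR(100),
    City         NVARCHAR(50),
    OrderDate    DATE
);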
Star vs. Snowflake Schema
| Star | Snowflake |
| --- | --- |
| Fact tables, dimension tables | Fact tables, dimension tables, sub-dimension tables |
| Top-down model | Bottom-up model |
| Has redundancy, more storage space | No redundancy, less storage space |
| Fewer JOINs, less complex queries | More JOINs, more complex queries |
| Shorter execution time | Longer execution time |
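A hedged T-SQL sketch of a typical star-schema query (the fact and dimension table names are made up): the fact table joins directly to each dimension. In a snowflake schema, DimProduct would itself join to a sub-dimension such as a category table, adding another JOIN.

-- Total sales by product category and calendar year
SELECT d.CalendarYear,
       p.Category,
       SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales AS f
JOIN dbo.DimDate    AS d ON f.DateKey = d.DateKey
JOIN dbo.DimProduct AS p ON f.ProductKey = p.ProductKey
GROUP BY d.CalendarYear, p.Category;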
Clustered vs Nonclustered Index
- Clustered index
- Sort and store the data rows in the table based on their key values.
- There can be only one clustered index per table, because the data rows themselves can be stored in only one order.
- When there is no clustered index, data rows are stored in an unordered structure called a heap.
- Nonclustered index
- Contains the key values, with row locators that point to the data rows
- A table can have multiple nonclustered indexes
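A minimal T-SQL sketch of both index types on a hypothetical dbo.Orders table.

-- Clustered index: the table's rows are physically ordered by OrderID (only one per table)
CREATE CLUSTERED INDEX IX_Orders_OrderID
    ON dbo.Orders (OrderID);

-- Nonclustered index: a separate structure of key values plus row locators (many allowed per table)
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID);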
Strong vs Eventual Consistency
- Strongly consistent
- Each query returns the most up-to-date (consistent) data
- At the cost of higher latency
- Eventually consistent
- Query may return stale data at first
- Time taken to become consistent may or may not be defined
- Low latency
Synchronous vs Asynchronous Data Transmission
- Synchronous
- Sender and receiver are permanently synchronized using a common timing signal
- Guarantee consistency of data at time of access
- Can only access data once transfer is complete
- e.g. chat app, streaming services
- Asynchronous
- Sender and receiver are synchronized only during transmission, by a starting bit and a stop bit
- No guarantee of consistency
- Can access data anytime
- e.g. email
OLAP vs OLTP
- OLAP: Online Analytical Processing
- Multiple data sources
- Long and complex queries, with an emphasis on read (SELECT)
- Few transactions
- Throughput sensitive
- OLTP: Online Transaction Processing
- Single data source
- Short queries, with an emphasis on write (INSERT, UPDATE, DELETE)
- Many transactions
- Latency sensitive
SaaS vs PaaS vs IaaS
- IaaS (Infrastructure as a Service): rent VMs, storage and networking; you manage the OS and everything above it (e.g. SQL Server on an Azure VM)
- PaaS (Platform as a Service): the platform is managed for you; you manage only the databases and data (e.g. Azure SQL Database)
- SaaS (Software as a Service): complete, ready-to-use software managed by the provider (e.g. Microsoft 365)
Data Storage
Data Warehouse
- A central repository of relational data designed for analytical purposes
- Can return queries very fast
Data Mart
- A subset of a data warehouse, allowing teams or departments to have control over their own datasets
- Generally read only
Data Lake
- A centralized repository that holds vast amounts of raw data in structured, semi-structured, or unstructured form
- Mainly designed for machine learning and big data analysis
Data Lakehouse
- Compared to a data warehouse:
- Support video, audio and text files
- Support data science and ML workloads
- Support both ELT and streaming
- Compared to a data lake:
- Perform BI tasks well
- Easier to set up and maintain
- More management features
Data Structure
Unstructured data
- Loose data with no predefined structure or data model (e.g. images, video, documents)
- MS products: SharePoint, Azure Blob Storage, Azure Files, Azure Data Lake
Semi-structured data
- Has some form of relationship
- XML: a markup language that looks similar to HTML
- JSON: a text file that is composed of dictionaries and arrays
- RCFile: a storage format designed for the MapReduce framework
- ORC: a columnar data structure that is about 75% more efficient than RCFile; works well with Hive
- Avro: a row-wise data structure for Hadoop systems
- Parquet: a columnar data structure for Hadoop systems
- MS products: Azure Tables, Azure Cosmos DB, MongoDB, Apache Cassandra
Structured data
- Most common structure is tabular
- MS products: MySQL, Postgres, Azure SQL, Azure Synapse Analytics
ELT vs ETL
- ETL: Extract, Transform and Load
- Loads data first into a staging server and then to the target system
- Used for on-premises, relational and structured data
- Commonly used for small amounts of data
- Doesn't provide data lake support
- Easy to implement
- ELT: Extract, Load and Transform
- Loads data directly into the target system
- Used for scalable structured and unstructured data
- Used for large amounts of data
- Provides data lake support
- Requires specialized skills to implement
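A hedged T-SQL sketch of the "T" in ELT happening inside the target system: raw rows are assumed to have already been loaded into a staging table, and the transform runs afterwards (all object names are hypothetical).

-- Transform after load: clean and reshape the raw staging data inside the target system
SELECT CustomerID,
       UPPER(LTRIM(RTRIM(CustomerName))) AS CustomerName,
       TRY_CAST(OrderDate AS DATE)       AS OrderDate
INTO dbo.CleanOrders
FROM staging.RawOrders
WHERE TRY_CAST(OrderDate AS DATE) IS NOT NULL;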
Data Analytics
- Descriptive analytics - what happened?
- Diagnostic analytics - why did it happen?
- Predictive analytics - what will happen?
- Prescriptive analytics - how can we make it happen?
- Cognitive analytics - what if this happens?
Azure Synapse Analytics
- A data warehouse and unified analytics platform
- Synapse SQL is a distributed version of T-SQL with built-in streaming capabilities to load data from cloud data sources
- Offers both serverless and dedicated resource models
- Serverless: for unpredictable workloads
- Dedicated: for predictable workloads
- Can seamlessly integrate with Apache Spark
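A sketch of querying data lake files from a Synapse serverless SQL pool with OPENROWSET; the storage account URL and path are placeholders.

-- Query Parquet files in place, without loading them into a table first
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/data/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;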
Pipelines
- A logical grouping of activities that together perform a task.
- Azure Data Factory and Azure Synapse Analytics have three groupings of activities: data movement activities, data transformation activities, and control activities
Storage Tiers
- Hot tier: An online tier optimized for storing data that is accessed or modified frequently; highest storage costs, but the lowest access costs.
- Cool tier: An online tier optimized for storing data that is infrequently accessed or modified. Data in the cool tier should be stored for a minimum of 30 days.
- Archive tier: An offline tier optimized for storing data that is rarely accessed, and that has flexible latency requirements, on the order of hours. Data in the archive tier should be stored for a minimum of 180 days.
Azure Blob
- Blob storage is an object store optimized for storing massive amounts of unstructured data, such as image files
- Azure Storage supports 3 types of blobs:
- Block blobs
- Store text and binary data
- Can be managed individually
- Store up to 4.75 TB of data
- Append blobs
- Optimized for append operations
- Ideal for scenarios such as logging data
- Page blobs
- Store random-access files up to 8 TB
- Store virtual hard drive files and serve as disks for Azure virtual machines
Azure Files
- A fully managed file share in the cloud
- Can be mounted concurrently by cloud or on-premises deployments of Windows, Linux, and macOS.
Azure Tables
- Stores non-relational structured data - NoSQL key/value data - with a schemaless design
- Can be hosted in an Azure Storage account - designed for a single region and a single table
- Can be hosted in Cosmos DB - designed for scale across multiple regions
MongoDB
- An open-source database that stores JSON-like documents
- The primary data structure is called Binary JSON (BSON)
- Has more data-types than JSON
- Designed to be efficient both in storage space and scan speed
Cosmos DB
- Supports different kinds of NoSQL database engines through APIs
- Core SQL (document data)
- Azure Table (key/value data)
- API for MongoDB (document data)
- API for Gremlin (graph data)
Azure Tables vs. Cosmos DB for Tables
- Azure Tables (storage account): single region, single table, lower cost
- Cosmos DB for Tables: scales across multiple regions, dedicated throughput, lower latency
Read Replica
- A copy of the database that is kept in sync with the primary database
- Used to offload read requests or analytics traffic from the primary instance
Connectivity Architecture
- Proxy
- Connections are proxied through a gateway
- Increased latency and reduced throughput
- Intended for workloads from outside the Azure network
- Redirect
- Direct connection
- Reduced latency and improved throughput
- Intended for workloads from inside the Azure network
Public vs Private Endpoint
- Public: reachable outside the Azure Network over the Internet
- Private: only reachable within the Azure Network
Role-Based Access Control (RBAC)
- SQL DB Contributor
- Manage SQL databases
- Cannot give access to others
- SQL Managed Instance Contributor
- Manage SQL Managed Instances and required network configuration
- Cannot give access to others
- SQL Server Contributor
- Manage SQL servers and databases
- Cannot give access to others
- SQL Security Manager
- Manage the security-related policies of SQL servers and databases
- Cannot give access to others
Transparent Data Encryption (TDE)
- Encrypts data-at-rest
- Uses a database encryption key (DEK), which is a symmetric key (the same key is used for both encryption and decryption)
- Can be applied to
- SQL Server
- Azure SQL Database
- Azure Synapse Analytics
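On Azure SQL Database, TDE is enabled by default for new databases. For SQL Server, the classic setup looks roughly like this sketch (the certificate and database names are hypothetical).

-- One-time server-side setup: a master key and certificate to protect the DEK
USE master;
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';
CREATE CERTIFICATE TDECert WITH SUBJECT = 'TDE certificate';

-- Per database: create the symmetric DEK and turn encryption on
USE MyDatabase;
CREATE DATABASE ENCRYPTION KEY
    WITH ALGORITHM = AES_256
    ENCRYPTION BY SERVER CERTIFICATE TDECert;
ALTER DATABASE MyDatabase SET ENCRYPTION ON;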
Dynamic Data Masking
- A central data masking policy acts directly on sensitive fields in the database.
- Designate privileged users or roles that do have access to the sensitive data.
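A minimal T-SQL sketch of adding a mask to an existing column (the table, column, and user names are made up): non-privileged users see masked values in query results, while the stored data itself is unchanged.

-- Mask an email column using the built-in email() masking function
ALTER TABLE dbo.Customers
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Allow a specific user to see the unmasked data
GRANT UNMASK TO DataAnalystUser;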
Private Links
- Allows secure connections between Azure resources so traffic remains within the Azure network
Apache Hadoop
- An open source framework for distributed processing of large datasets
- Hadoop Distributed File System (HDFS): a resilient and redundant file store distributed across clusters of commodity hardware
- Hadoop MapReduce: a framework for writing apps that process multi-terabyte datasets in parallel on large clusters of commodity hardware
- HBase: an open-source non-relational distributed database
- YARN: manages resources, nodes and containers, and performs scheduling
- Hive: generate reports using SQL-like queries
- Pig: a high-level scripting language for writing complex data transformations
Apache Kafka
- An open source streaming platform to create high-performance data pipelines, streaming analytics, data integration…
- Written in Java and Scala
Apache Spark
- An open source unified analytics engine for big data and machine learning
- Runs workloads much faster than Hadoop MapReduce
Databricks
- Provides fully managed Apache Spark clusters