Skip to content

Latest commit

 

History

History
125 lines (113 loc) · 5.87 KB

File metadata and controls

125 lines (113 loc) · 5.87 KB

Storing in cloud

Durability of data

  • A transaction is set of operations.
    • Seek to achieve some or all ACID properties.
      • Atomic
        • A transaction is executed only once; all work completes or none does.
        • Why?
          • Operations in a transaction often share common intent or depend on each other.
          • Performing only subset => intent can be missed.
      • Consistent
        • A transaction preserves the consistency of data.
          • Performed on consistent state and leads to consistent state.
          • Typically, developers are responsible for maintaining consistency.
      • Isolated
        • Concurrent transactions behave as if each were the only transaction running in the system.
        • Some applications reduce isolation level for better throughput
          • High isolation => limits number of concurrent transactions
      • Durable
        • A transaction must be recoverable.
        • It must be persisted if e.g. computer crashes.
          • Special logging solves this.
    • In relational database systems (RDBMS) it's a single unit of work.
    • All-or-none => If it fails, DB is rolled back, all modification are erased.

Caching

  • Caching aims to improve performance & scalability of a system.
  • It's done by temporarily copying frequently accessed data to a fast storage, close to application.
  • Most effective when
    • Same data is repeatedly read.
    • Original data store =>
      • Relatively static
      • Slow compared to cache's speed
      • Subject to significant level of contention
        • Contention in DB systems =>
          • multiple processes or instances competing for access to the same index or data block at the same time
      • It's far away & network latency cause access to be slow.
  • Distributed applications typically implement either or both when caching data:
    • Private cache : Locally held on computer that's running application.
      • In-memory store: Accessed by single process.
        • Quick & affective, size is typically constrained to host machine.
      • Local file system
        • Slower than in-memory, but faster than retrieving across network.
        • Each application holds its own copy of the data.
      • Problem:
        • Snapshot of the original data at a point of past.
          • Different application instance can hold different versions.
    • Shared cache : Common source which multiple processes/machines can access.
      • All instances see same view of data as opposed to in-memory.
      • It's highly scalable
        • Cache services uses cluster of servers and software for distribution.
        • Easy to scale by adding to / removing from a cluster.
      • Disadvantages:
        • Slower to access => Held locally to each application instance.
        • Implementing separate cache service => increases complexity.
  • Caching considerations
    • When?
      • The more data you have, the larger number of users that need to access this data => minimum load on the original data store.
      • If original data store is unavailable, cache can be used.
    • How to cache data effectively?
      • Determine the post appropriate data to cache
      • Cache it at the appropriate time.
        • Add data to the cache on demand when it's retrieved first time.
        • Populate in advance
          • Seeding: when the application start.
          • Not good for large cache as it can cause sudden high load.
    • Manage data expiration.
      • Cached data becomes stale after a while.
      • Expire caches so they're removed, and retrieved on next read.
      • Set a default policy, many cache services you can set period for individual objects while storing them programmatically.
  • Redis Cache
    • Recommended by Azure, replaces Azure Cache (deprecated).
    • NoSQL key-value database.
      • Unique: Allows complex data structure for its keys.
    • SKUs: Basic (single node), Standard (2 nodes + SLA)

Measuring throughput

  • Normalized units
    • Relative performance guarantees by cloud vendors.
    • your application uses 20 units, 40 unit will give you appr. double performance.
  • DTUs – Database throughput units (Azure SQL Database)
    • Based on compute, storage and IO.
    • DTUs for single databases, eDTUs for elastic pools.
    • Fixed per pricing plans, e.g.: Basic = 5 DTU, Standard 2 = 50 DTU
  • RUs – Request unit processing per second (Azure Cosmos DB)
    • Each operation incurs a request charge, which is expressed in Rus.
      • Single request unit (normalized) => 1 read of 1 KB document.
      • Create, replace, delete consumes more processing = more request units.

Structure of data

  • Polyglot persistence
    • Solutions that uses mix of data store technologies.

Structured data stores

  • Most vendors use SQL.
  • Have RDMS (relational database management system)
    • Conforms to be ACID.
    • Supports schema-on-write
      • You define data structure, all read+write use same schema.
  • Hard to scale out.
  • E.g. Azure SQL Database, Azure Database for MySQL, Azure Database for PostgreSQL

Unstructured / semi-structured data stores

  • Doesn't use tabular schema of rows & columns.
  • Can store as key/value pairs, JSON documents, or as a graph (edges + vertices)
  • Have no relational model.
  • Graph databases => • Cosmos DB • Gremlin API
    • Optimized for exploring weighted relationships between entities.
    • Stores edges (entities) and nodes (relationship between nodes).
  • Document databases => Azure Cosmos DB
  • NoSQL => Most systems supports SQL compatible queries, but non-SQL DBs.
  • Column family: HBase in HDInsights
    • Key-value pair, where key is mapped to a value that's a set of column.
  • Massively parallel & distributed solutions for ingesting, storing, and analyzing data
    • Azure Synapse Analytics (older name: SQL Data Warehouse)
    • Azure Data Lake
    • Time series data stores => Time Series Insights
      • Optimized for queries over time-based sequences of data, indexed by datetime.
  • Others: Object storage => Blob storage, Shared files => File storage