- A transaction is a set of operations.
- Transactions seek to achieve some or all of the ACID properties.
- Atomic
- A transaction is executed only once; all work completes or none does.
- Why?
- Operations in a transaction often share common intent or depend on each other.
- Performing only a subset => the intent can be missed.
- Consistent
- A transaction preserves the consistency of data.
- Performed on a consistent state and leads to a consistent state.
- Typically, developers are responsible for maintaining consistency.
- Isolated
- Concurrent transactions behave as if each were the only transaction running in the system.
- Some applications reduce the isolation level for better throughput.
- High isolation => limits the number of concurrent transactions.
- Durable
- A transaction must be recoverable.
- Its effects must be persisted even if e.g. the computer crashes.
- Special logging solves this.
- Atomic
- In relational database systems (RDBMS), a transaction is a single unit of work.
- All-or-none => if it fails, the DB is rolled back and all modifications are erased.
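- A minimal sketch of the all-or-none behaviour, using Python's built-in sqlite3 module; the accounts table and the transfer scenario are illustrative assumptions, not part of any specific system discussed here.

```python
import sqlite3

# Illustrative schema: two accounts, transfer money between them atomically.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
        # Consistency rule enforced by the developer: no negative balances.
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = 1").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
except ValueError:
    pass  # whole transaction rolled back: no partial transfer remains

print(conn.execute("SELECT id, balance FROM accounts").fetchall())  # [(1, 100), (2, 0)]
```

- Either both updates are persisted or neither is; a partial transfer never becomes visible.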
- Caching aims to improve performance & scalability of a system.
- It's done by temporarily copying frequently accessed data to fast storage close to the application.
- Most effective when
- Same data is repeatedly read.
- Original data store =>
- Relatively static
- Slow compared to cache's speed
- Subject to significant level of contention
- Contention in DB systems =>
- multiple processes or instances competing for access to the same index or data block at the same time
- It's far away and network latency causes access to be slow.
- Distributed applications typically implement either or both when caching data:
- Private cache: locally held on the computer that's running the application.
- In-memory store: accessed by a single process.
- Quick & effective; size is typically constrained by the host machine.
- Local file system
- Slower than in-memory, but faster than retrieving across network.
- Each application holds its own copy of the data.
- Problem:
- Snapshot of the original data at a point in the past.
- Different application instances can hold different versions.
- Shared cache: a common source that multiple processes/machines can access.
- All instances see the same view of the data, as opposed to a private in-memory cache.
- It's highly scalable
- Cache services use a cluster of servers and software to distribute the data transparently.
- Easy to scale by adding to / removing from a cluster.
- Disadvantages:
- Slower to access => the data is not held locally to each application instance.
- Implementing a separate cache service => increases complexity.
- Caching considerations
- When?
- The more data you have and the larger the number of users that need to access it, the greater the benefit: caching keeps the load on the original data store to a minimum.
- If the original data store is unavailable, the cache can be used.
- How to cache data effectively?
- Determine the most appropriate data to cache.
- Cache it at the appropriate time.
- Add data to the cache on demand, when it's retrieved for the first time.
- Populate in advance
- Seeding: when the application starts.
- Not good for a large cache, as it can cause a sudden high load.
- Manage data expiration.
- Cached data becomes stale after a while.
- Expire cached items so they're removed and then retrieved again on the next read.
- Set a default expiration policy; many cache services also let you set the expiration period for individual objects programmatically when storing them.
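- A small sketch of on-demand population (cache-aside) with a default expiration policy, using a plain in-process dictionary; load_from_store, the key names, and the TTL value are illustrative assumptions rather than any specific cache service's API.

```python
import time

_cache = {}          # key -> (value, expires_at); a private, in-memory cache
DEFAULT_TTL = 60.0   # default expiration policy, in seconds

def load_from_store(key):
    """Stand-in for a slow read from the original data store."""
    time.sleep(0.1)
    return f"value-for-{key}"

def get(key, ttl=DEFAULT_TTL):
    entry = _cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if time.monotonic() < expires_at:
            return value            # fresh hit: served from the cache
        del _cache[key]             # stale: expire, refresh on this read
    value = load_from_store(key)    # miss: add on demand, on first retrieval
    _cache[key] = (value, time.monotonic() + ttl)
    return value

print(get("customer:42"))  # slow: populates the cache
print(get("customer:42"))  # fast: served from the cache until it expires
```

- Seeding would instead call get() (or a bulk loader) for the expected keys at application start-up, trading a burst of start-up load for warm reads.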
- Redis Cache
- Recommended by Azure; replaces the earlier Azure Managed Cache Service (now deprecated).
- NoSQL key-value database.
- Unique: allows complex data structures (lists, hashes, sets, sorted sets) as values.
- SKUs: Basic (single node), Standard (2 nodes + SLA)
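- A sketch of storing structured values and per-object expiration with the redis-py client; the host name, port, access key, and key names are placeholders (an Azure Cache for Redis endpoint is assumed), so treat this as an outline rather than working connection code.

```python
import redis  # assumes the redis-py package and a reachable Redis endpoint

r = redis.Redis(host="<cache-name>.redis.cache.windows.net", port=6380,
                password="<access-key>", ssl=True)

r.hset("session:42", mapping={"user": "alice", "cart_items": 3})  # hash value
r.rpush("recent:42", "product-1", "product-7")                    # list value
r.zadd("leaderboard", {"alice": 120, "bob": 95})                  # sorted set
r.expire("session:42", 3600)   # per-object expiration period, in seconds

print(r.hgetall("session:42"))
print(r.zrange("leaderboard", 0, -1, withscores=True))
```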
- Normalized units
- Relative performance guarantees by cloud vendors.
- If your application uses 20 units, 40 units will give you approximately double the performance.
- DTUs – Database transaction units (Azure SQL Database)
- Based on compute, storage and IO.
- DTUs for single databases, eDTUs for elastic pools.
- Fixed per pricing tier, e.g.: Basic = 5 DTUs, Standard S2 = 50 DTUs.
- RUs – Request unit processing per second (Azure Cosmos DB)
- Each operation incurs a request charge, which is expressed in RUs.
- A single request unit (normalized) => 1 read of a 1 KB document.
- Create, replace, and delete consume more processing => more request units.
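- A back-of-the-envelope sketch of sizing provisioned throughput from RU charges; the 1 RU for a 1 KB read matches the note above, while the write charge and the workload numbers are illustrative assumptions.

```python
# Assumed per-operation charges and workload; only the 1 RU / 1 KB read
# figure comes from the notes above, the rest is illustrative.
ru_per_read, ru_per_write = 1, 5      # write charge is an assumption
reads_per_second = 500                # 1 KB point reads
writes_per_second = 100               # 1 KB writes

required_rus = reads_per_second * ru_per_read + writes_per_second * ru_per_write
print(f"Provision roughly {required_rus} RU/s")  # 500*1 + 100*5 = 1000 RU/s
```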
- Polyglot persistence
- Solutions that use a mix of data store technologies.
- Most vendors use SQL.
- RDBMS (relational database management system)
- ACID-compliant.
- Supports schema-on-write
- You define data structure, all read+write use same schema.
- Hard to scale out.
- E.g. Azure SQL Database, Azure Database for MySQL, Azure Database for PostgreSQL
- NoSQL stores don't use the tabular schema of rows & columns.
- Can store data as key/value pairs, JSON documents, or as a graph (edges + vertices); see the sketch after this list.
- Have no relational model.
- Graph databases => Azure Cosmos DB Gremlin API
- Optimized for exploring weighted relationships between entities.
- Stores nodes (entities) and edges (relationships between the nodes).
- Document databases => Azure Cosmos DB
- NoSQL => non-relational (non-SQL) DBs, though most systems support SQL-compatible queries.
- Column family: HBase in HDInsights
- Key-value pairs, where a key is mapped to a value that's a set of columns.
- Massively parallel & distributed solutions for ingesting, storing, and analyzing data
- Azure Synapse Analytics (older name: SQL Data Warehouse)
- Azure Data Lake
- Time series data stores => Time Series Insights
- Optimized for queries over time-based sequences of data, indexed by datetime.
- Others: Object storage => Blob storage, Shared files => File storage
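- A sketch of the same order data modelled in the main non-relational shapes named above (key/value, document, graph, column family); all field names and structures are illustrative, not a particular database's API.

```python
import json

# Key/value: an opaque value looked up by its key.
kv_store = {"order:1001": "status=shipped;total=59.90"}

# Document: a self-describing JSON document, queryable by its fields.
document = json.dumps({"id": "1001", "status": "shipped",
                       "items": [{"sku": "A-17", "qty": 2}], "total": 59.90})

# Graph: nodes/vertices are entities, edges are relationships between them.
nodes = {"customer:7": {"name": "Alice"}, "order:1001": {"total": 59.90}}
edges = [("customer:7", "PLACED", "order:1001")]

# Column family: a row key mapped to a set of named columns.
column_family = {"1001": {"order:status": "shipped", "order:total": "59.90"}}

print(kv_store, document, nodes, edges, column_family, sep="\n")
```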