\section{Usage Scenarios}
\label{sec:scenarios}
Quantitative consistency metrics are useful for a variety of services
that can tolerate eventual consistency. In this section, we outline
three use cases via short vignettes in the context of a hypothetical
microblogging service, {\systemname}.
\subsection{Dynamic Reconfiguration}
\label{sec:dynamic}
Without quantitative guidance, choosing the correct values for
replication parameters is a difficult task. We believe that, instead
of reasoning about low-level replication settings, application writers
should declaratively specify their objectives in the form of
higher-level service agreements and allow the system to adjust its
parameters to meet those objectives. For example, in Dynamo-style systems
like Cassandra, applications can choose the minimum number of replicas
to read from ($R$) and write to ($W$). If $R$+$W$ is greater than the
number of replicas ($N$), operations will achieve ``strong''
consistency (regular register semantics), and, if not, the system
provides eventual consistency. With a replication factor of $N=3$, we
have several options for eventually consistent operation: for example,
$\langle R=W=1\rangle$ is guaranteed to be fastest, but it is
also the ``least consistent,'' while $\langle R=2, W=1\rangle$ is
``more consistent'' but slower. Choosing the right
configuration is non-trivial, especially without data regarding the
effect of a given choice. Instead of reasoning about $R$ and $W$,
service operators should express their requirements in terms of tolerable staleness:
\vignette{{\systemname}'s data scientists have learned that their users
respond negatively to slow update propagation, and the company sets
a $t$-visibility target of 500ms at the 99.9th percentile for their
back-end data store requests (while minimizing latency).}
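To make the quorum configuration space concrete, the following is a
minimal sketch in Python (the helper is purely illustrative and not
part of any Dynamo-style implementation) that enumerates the settings
for $N=3$ and classifies each one under the $R+W>N$ rule:
\begin{verbatim}
N = 3  # replication factor

def classify(r, w, n=N):
    # Overlapping read and write quorums (R + W > N) yield "strong"
    # (regular register) semantics; otherwise, consistency is eventual.
    return "strong" if r + w > n else "eventual"

for r in range(1, N + 1):
    for w in range(1, N + 1):
        print(f"R={r}, W={w}: {classify(r, w)}")
\end{verbatim}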
How should we configure the replication parameters for a given SLA?
One approach is to manually tune the system---this is straightforward
but is not always robust to changing conditions:
\vignette{Based on feedback from their infrastructure team, the
developers set $R=W=1$ for their Dynamo-based data store. This
configuration meets the $t$-visibility SLA under normal load, but,
during peak traffic, the SLA is sporadically violated.}
Instead of making a static assignment, metrics-enabled systems can
auto-tune the replication parameters for each request. Consistency
monitoring allows the system to detect SLA violations, while
consistency prediction allows it to test replication parameter changes
before making them:
\vignette{A consistency auto-tuner chooses $R=W=1$ for most traffic but
switches to $R=2$, $W=1$ during peak workload times. While $R=1$,
$W=2$ is also a viable solution, the auto-tuner determines that it
would hurt 99.9th percentile operation latency more: most reads are
served from the data store's buffer cache, so the extra read replica
contacted under $R=2$ adds little delay.}
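The auto-tuner's decision procedure might look like the following
minimal sketch in Python; \texttt{predict} is a hypothetical interface
(not an existing API) that returns the expected 99.9th-percentile
staleness and operation latency for a candidate $\langle R, W\rangle$
setting, e.g., derived from recent measurements and consistency
prediction:
\begin{verbatim}
SLA_STALENESS_MS = 500   # 99.9th-percentile t-visibility target
N = 3                    # replication factor

def choose_quorum(predict):
    # `predict(r, w)` returns a hypothetical pair
    # (p999_staleness_ms, p999_latency_ms) for that configuration.
    candidates = []
    for r in range(1, N + 1):
        for w in range(1, N + 1):
            staleness, latency = predict(r, w)
            # Keep configurations that are strongly consistent
            # (R + W > N) or predicted to meet the staleness SLA.
            if r + w > N or staleness <= SLA_STALENESS_MS:
                candidates.append((latency, r, w))
    # Among SLA-compliant settings, pick the lowest predicted latency.
    _, r, w = min(candidates)
    return r, w
\end{verbatim}
In the vignette above, such a predictor would report that the extra
read required by $R=2$ adds little latency because most reads are
served from the buffer cache, so $\langle R=2, W=1\rangle$ is
preferred over $\langle R=1, W=2\rangle$ at peak load.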
While the literature suggests several dynamic replication
schemes~\cite{vahdat-article}, consistency metrics and prediction
allow these techniques to become a reality for production-ready data
stores and real-world services.
\subsection{DBA Tooling and Diagnoses}
\label{sec:dba}
Consistency metrics can also be used for diagnostic tasks and to
understand \textit{why} a system is misbehaving. A number of system
parameters affect the performance and observed consistency of a
distributed data store. However, system administrators currently face
two major challenges: limited visibility into real-time consistency
properties and a lack of mechanisms for understanding how the system
will behave as parameters are varied.
\vignette{The database administrators at {\systemname} have received
reports that a high-profile user is seeing very stale data.}
Data store administrators have a host of consistency-related
configuration options at their disposal. Taking Cassandra as an
example, administrators can configure read repair rates, perform active
anti-entropy value exchange, and enable or disable replicas. Moreover,
there are many causes for inconsistent reads: there may be slow nodes
in the cluster or certain keys may form ``hot spots.''
\vignette{The administrators inspect the consistency metrics for each
data store shard and identify a misbehaving set of nodes
corresponding to a bad top-of-rack switch. They temporarily increase
the rate of background version exchange for the shard and begin to
spin up a new replica set on a different rack before rebooting the
switch.}
We believe consistency metrics should allow standard analytics such as
fine-grained drill-downs and roll-ups across both logical data items
and physical-layer attributes like placement and hardware. If, as
in our Cassandra implementation (Section~\ref{sec:architecture}),
monitoring is performed as a white-box, in-database service, low-level
details like network topologies and per-operation latencies will be
available to the administrator. Of course, it may be sufficient for
most operators to simply experiment with common configuration
parameters via prediction, but exposing such advanced analytics
functionality is likely useful for power users.
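As a sketch of the kind of roll-up we have in mind (in Python; the
observation format and field names are hypothetical), per-read
staleness measurements tagged with their physical placement can be
aggregated by rack to surface problems like the faulty top-of-rack
switch above:
\begin{verbatim}
import math
from collections import defaultdict

def p999(values):
    # Nearest-rank 99.9th percentile of a non-empty sample.
    s = sorted(values)
    rank = max(1, math.ceil(0.999 * len(s)))
    return s[rank - 1]

def staleness_by_rack(observations):
    # `observations` is a list of dicts such as
    # {"key": "...", "node": "n12", "rack": "r3", "staleness_ms": 740.0}
    by_rack = defaultdict(list)
    for obs in observations:
        by_rack[obs["rack"]].append(obs["staleness_ms"])
    return {rack: p999(vals) for rack, vals in by_rack.items()}
\end{verbatim}
The same measurements could be drilled down by node or rolled up by
keyspace, depending on whether the administrator suspects hardware
faults or hot spots.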
\subsection{Monitoring and Alerts}
\label{sec:monitoring}
Consistency metrics allow new approaches to monitoring:
\vignette{{\systemname} has a large number of DevOps staff who are
responsible for keeping the service online. Currently, their
monitoring and pager service is triggered when operation latency is
high or when servers fail. Because user experience is negatively
impacted by inconsistent reads, the CTO configures custom
consistency alerts for the DevOps staff.}
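Such an alert could be as simple as the following sketch in Python
(the threshold and the paging callback are placeholders, not part of
any existing monitoring system):
\begin{verbatim}
SLA_STALENESS_MS = 500    # t-visibility SLA bound
SLA_QUANTILE = 0.999      # 99.9th percentile

def check_consistency_sla(samples_ms, page):
    # `samples_ms`: recent per-read t-visibility measurements (ms).
    # `page`: a placeholder callback that notifies the on-call staff.
    if not samples_ms:
        return
    violating = sum(1 for s in samples_ms if s > SLA_STALENESS_MS)
    # Approximately: the 99.9th-percentile staleness exceeds the bound
    # when more than 0.1% of recent reads do.
    if violating / len(samples_ms) > 1 - SLA_QUANTILE:
        page(f"t-visibility SLA violated: {violating} of "
             f"{len(samples_ms)} reads exceeded {SLA_STALENESS_MS} ms")
\end{verbatim}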
As we discussed in Section~\ref{sec:dynamic}, some parameters, such as
per-request quorum settings, are amenable to SLA-based automatic
control. However, there are a number of scenarios where traditional,
manual monitor-and-respond is an acceptable approach. If SLAs cannot
be met under any circumstances (e.g., operation latency and
$t$-visibility bounds are too restrictive), the correct response is to
revise the SLAs or perform more invasive operations (e.g., adding more
replicas) that may require active human oversight.