After a short survey of the design and implementation of rqlite and actordb, TableDB with Raft replication will start from the NPL Raft implementation, which is (probably) the foundation of this project. Once that is finished, we could consider adding more functional features like those in actordb.
TableDB with the Raft implementation will be much like rqlite; the only difference is that TableDB's replication is at the table (collection) level, while rqlite's is at the level of the whole SQLite database (which may contain multiple tables), so TableDB's replication sits in a higher layer.
A rough roadmap:

- Implement Raft consensus in NPL, following the Raft paper and jraft:
    - Leader election
    - Log replication
    - Cluster membership changes
    - Log compaction
    - Client interaction
- Add a more abstract interface to TableDB to adapt it to the Raft consensus, with reference to rqlite.
- Test the correctness and performance of the implementation, and fix issues.
In order to get a quick (within one month), full-featured and correct NPL Raft implementation, it will be helpful to refer to an existing (full-featured and correct) implementation. After several days of digging into NPL and the various Raft implementations listed on the Raft consensus website, I chose jraft, a Java implementation, for several reasons:
- NPL code is recommended to be written in an OO style
- jraft is full-featured, correct and still under maintenance
- jraft is straightforward and easy to understand, thanks largely to its clean Java OO style
Building on the NPL Raft implementation, the TableDB Raft implementation will be much easier. But implementations can differ considerably, as a comparison between actordb and rqlite shows:
- rqlite is simple: it uses the SQL statement itself as the Raft log entry.
- actordb is more complicated:
> Actors are replicated using the Raft distributed consensus protocol. Raft requires a write log to operate. Because our two engines are connected through the SQLite WAL module, Raft replication is a natural fit. Every write to the database is an append to WAL. For every append we send that data to the entire cluster to be replicated. Pages are simply inserted to WAL on all nodes. This means the leader executes the SQL, but the followers just append to WAL.
Because we don't have an SQLite WAL hook in NPLRuntime, and we also want to keep TableDB's existing features, neither actordb's nor rqlite's approach is feasible here. But we can borrow the consistency levels from rqlite.
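For reference, rqlite exposes three read-consistency levels (none, weak, strong). A sketch of how they might be represented on the NPL side is below; the table and names are illustrative, not the actual implementation:

```lua
-- Sketch of rqlite-style read-consistency levels (illustrative names only).
local ConsistencyLevel = {
    NONE = "none",     -- serve the read from the local node; may be stale
    WEAK = "weak",     -- the leader checks locally that it is still leader before reading
    STRONG = "strong", -- the read goes through the Raft log like a write
};
-- e.g. raftSqliteStore:findOne(query, callback, ConsistencyLevel.WEAK)
```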
It is still not hard: we reuse the msg in IORequest:Send, and the log entry looks like below:
```lua
function RaftLogEntryValue:new(query_type, collection, query, cb_index, serverId)
    local o = {
        query_type = query_type,
        collection = collection:ToData(),
        query = query,
        cb_index = cb_index,
        serverId = serverId,
    };
    setmetatable(o, self);
    return o;
end
```
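For example, a findOne request could be packed into a log entry value like this; the collection object, the toBytes helper and the callback-index bookkeeping are illustrative assumptions:

```lua
-- Hypothetical usage: pack a findOne query into a Raft log entry value.
local entry = RaftLogEntryValue:new(
    "findOne",                     -- query_type: the collection method invoked on commit
    collection,                    -- the TableDB collection object
    { query = { name = "npl" } },  -- the query payload
    nextCallbackIndex,             -- cb_index: routes the response back to the caller
    localServerId);                -- serverId: the server that owns the callback
local data = entry:toBytes();      -- serialized and appended to the Raft log
```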
and commit in the state machine looks like below:
```lua
--[[
 * Commit the log data at the given logIndex
 * @param logIndex the log index in the logStore
 * @param data the serialized RaftLogEntryValue
]]--
function RaftTableDB:commit(logIndex, data)
    -- data is logEntry.value
    local raftLogEntryValue = RaftLogEntryValue:fromBytes(data);
    -- 'collection' is the TableDB collection referenced by the entry (lookup omitted here)
    local cbFunc = function(err, data)
        local msg = {
            err = err,
            data = data,
            cb_index = raftLogEntryValue.cb_index,
        };
        -- send the response back to the server that owns the callback
        RTDBRequestRPC(nil, raftLogEntryValue.serverId, msg)
    end;
    -- invoke the collection method named by query_type (e.g. findOne, updateOne)
    collection[raftLogEntryValue.query_type](collection, raftLogEntryValue.query.query,
        raftLogEntryValue.query.update or raftLogEntryValue.query.replacement,
        cbFunc);
    self.commitIndex = logIndex;
end
```
The client interface stays unchanged: we provide script/TableDB/RaftSqliteStore.lua. This also requires adding a StorageProvider:SetStorageClass(raftSqliteStore) method to StorageProvider. In each interface method, RaftSqliteStore sends the log entry above to the Raft cluster, and it could also take the consistency levels into account.
Like the original interfaces, callbacks are implemented in an async way, but we make connect synchronous to mitigate the effects of the asynchrony.
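As an illustration of how RaftSqliteStore could forward a call, here is a minimal sketch assuming a RaftClient:appendEntries method and a hypothetical callback registry; it is not the actual RaftSqliteStore.lua:

```lua
-- Sketch only: forward a findOne call to the Raft cluster as a log entry.
function RaftSqliteStore:findOne(query, callback)
    local cb_index = self:registerCallback(callback);   -- hypothetical callback registry
    local entry = RaftLogEntryValue:new("findOne", self.collection,
                                        { query = query }, cb_index, self.localServerId);
    -- the serialized entry is appended to the Raft log; the state machine answers
    -- through RTDBRequestRPC once the entry is committed (see commit above)
    self.raftClient:appendEntries(entry:toBytes(), function(response)
        if not (response and response.accepted) then
            -- retry or surface an error to the caller (omitted)
        end
    end);
end
```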
Like rqlite, we use SQLite's Online Backup API to take snapshots.
The Raft core logic is in RaftServer. TableDB is implemented as a Raft state machine; see RaftTableDBStateMachine.
Below is Figure 1 from the Raft thesis; it shows the architecture of a Raft replicated state machine.
Server side:

- TableDBApp:App -> create RpcListener and TableDB StateMachine -> set RaftParameters -> create RaftContext -> RaftConsensus.run
- RaftConsensus.run -> create RaftServer -> start TableDB StateMachine -> RpcListener:startListening
- RaftServer: holds several PeerServers and sends heartbeats at a small random interval when it is the leader.
- RpcListener: at the Raft level, routes messages to the RaftServer.
- TableDB StateMachine: at the TableDB level, calls messageSender to send messages to the RaftServer. A sketch of this startup sequence is shown below.
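As a rough sketch, the server-side startup described above might look like the following; the constructor arguments mirror jraft's API and are assumptions, not the exact TableDBApp code:

```lua
-- Sketch of the server-side bootstrap; names and signatures follow jraft and are assumptions.
local rpcListener = RpcListener:new("tcp://localhost:9001", servers);
local stateMachine = RaftTableDBStateMachine:new(baseDir);
local raftParameters = RaftParameters:new();   -- election timeout, heartbeat interval, etc.
local raftContext = RaftContext:new(serverStateManager, stateMachine,
                                    raftParameters, rpcListener, loggerFactory);
RaftConsensus.run(raftContext);                -- creates the RaftServer and starts listening
```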
Client side:

- TableDBApp:App -> create RaftSqliteStore and TableDB StateMachine -> create RaftClient -> RaftSqliteStore:setRaftClient -> send commands to the cluster
- Commands: appendEntries, addServer, removeServer. Sending commands to the cluster will retry on failure.
- RaftSqliteStore: uses the RaftClient to send the various commands to the cluster and handles the responses.
- TableDB StateMachine: sends the response to the client when the command is committed. A sketch of this setup follows the list.
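Correspondingly, the client-side setup could look roughly like this; the RaftClient constructor follows jraft's shape and the callback signature is an assumption:

```lua
-- Sketch of the client-side setup; argument lists are assumptions based on jraft.
local raftClient = RaftClient:new(rpcClientFactory, clusterConfiguration, loggerFactory);
raftSqliteStore:setRaftClient(raftClient);

-- membership changes go through the same client; sends are retried until the
-- current leader accepts the command
raftClient:addServer(newServer, function(response, err)
    -- handle acceptance, or redirection to the current leader, here
end);
```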
At startup, each server should know all of its peers; this is specified in a JSON config file like the one below (setup/init-cluster.json):
```json
{
    "logIndex": 0,
    "lastLogIndex": 0,
    "servers": [
        {
            "id": 1,
            "endpoint": "tcp://localhost:9001"
        },
        {
            "id": 2,
            "endpoint": "tcp://localhost:9002"
        },
        {
            "id": 3,
            "endpoint": "tcp://localhost:9003"
        }
    ]
}
```
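For illustration, a server could load this config at startup roughly as follows; commonlib.Json.Decode and ParaIO.open are standard NPL helpers, but the surrounding code is a sketch rather than the project's actual bootstrap:

```lua
-- Sketch: read the cluster configuration and enumerate the peers.
NPL.load("(gl)script/ide/Json.lua");
local file = ParaIO.open("setup/init-cluster.json", "r");
if(file:IsValid()) then
    local config = commonlib.Json.Decode(file:GetText());
    file:close();
    -- each peer is identified by its id and reachable at its endpoint
    for _, server in ipairs(config.servers) do
        echo({server.id, server.endpoint});
    end
end
```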
There are several scripts (setup.bat, addsrv.bat, stopNPL.bat) to facilitate deployment; see the setup folder.
The communication uses npl_mod/Raft/Rpc.lua, which includes RaftRequestRPC and RTDBRequestRPC. RaftRequestRPC is used at the Raft level and RTDBRequestRPC at the TableDB level. Both are used in a full-duplex way, that is, not only to send requests but also to receive responses. For RaftRequestRPC, see RpcListener:startListening and PeerServer:SendRequest. For RTDBRequestRPC, see RaftTableDBStateMachine:start, RaftClient:tryCurrentLeader and RaftTableDBStateMachine:start2.
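To illustrate the full-duplex idea, here is a minimal sketch of an NPL activation file that handles both requests and responses over the same RPC; the field names (is_response, cb_index, callbackFile) and helpers (pendingCallbacks, handleRequest) are illustrative assumptions, not the actual Rpc.lua protocol:

```lua
-- Sketch: one activation handler serves both directions of the RPC.
NPL.this(function()
    local msg = msg;
    if msg.is_response then
        -- this activation carries a response: look up and fire the pending callback
        local callback = pendingCallbacks[msg.cb_index];
        if callback then callback(msg.err, msg.data); end
    else
        -- this activation carries a request: handle it and send the result back
        local result = handleRequest(msg);
        result.is_response = true;
        NPL.activate(msg.callbackFile, result);
    end
end);
```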