-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
STORM-166:Nimbus HA solution based on Zookeeper #61
Conversation
{:text "Nimbus uptime" :attr {:class "tip right" | ||
:title (:nimbus-uptime tips)}} | ||
{:text "Nimbus leader uptime" :attr {:class "tip right" | ||
:title (:nimbus-leader-uptime tips)}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please actually define an entry in tips for ':nimbus-leader-uptime'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix the tips soon
I have done a quick pass through the code. It looks like there are several places that the code is leaking connections to ZK. I am also concerned about the extra load that this may be placing on ZK. ZK is already the bottleneck for scalability of the cluster, a new client connecting does a write operation to the cluster, and having every client make a connection is bad, but not a deal-breaker. However, also having the existing daemons constantly making new connections to ZK feels like it will cause a lot of scalability issues. |
@@ -38,7 +38,6 @@ storm.thrift.transport: "backtype.storm.security.auth.SimpleTransportPlugin" | |||
storm.messaging.transport: "backtype.storm.messaging.netty.Context" | |||
|
|||
### nimbus.* configs are for the master | |||
nimbus.host: "localhost" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if nimbus.host is not used any more we should either deprecate it or just remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nimbus.host relative config should be removed all over the source codes. I will fix this.
As @revans2 alluded to, some of the challenges with code distribution (supervisor downloads from nimbus) will be alleviated by using bittorrent for topology distribution. I have a branch that switches to using bittorrent for code distribution, but I've held off on submitting a pull request because there's a bug with multi-lang topologies that I haven't had time to track down yet (the resource directory gets deleted). I'll submit the pull request for reference with the caveat that it shouldn't be merged until that bug is fixed. Here's the original pull request: |
FYI... aforementioned issue with multi-lang has been fixed. |
@yveschina, any update on this PR? Looks like it needs an up-merge to master. |
@yveschina any update on the concerns raised? |
Conflicts: storm-core/src/clj/backtype/storm/config.clj storm-core/src/clj/backtype/storm/thrift.clj storm-core/src/clj/backtype/storm/ui/core.clj
@yveschina The main concern I have is with catastrophic failure of a nimbus node during code distribution. I'm not sure it's acceptable to force users to resubmit a topology in that event. I'm working with @Parth-Brahmbhatt on a similar solution that involves a pluggable code distribution interface (either bittorrent or a distributed FS) that will also be compatible with the security work being done (e.g. code distribution backed by a secure HDFS). More details of that work are available in the JIRA, and we will be posting a much more detailed design doc in the future. For the time being, let's keep this pull request open. |
…utho Correct authorization check in nimbus methods
@yveschina @ptgoetz , it seems #354 has been merged for STORM-166. Should this pull request be closed? (Did we address everything this pull request fixes?) |
@d2r Yes, I think this pull request can be closed. |
Nimbus HA feature is quite important for our application running on the storm cluster. So, we've been working on the problem for some time and now a solution seems not that perfect but be enough to apply has comed out.
1.Nimbus Servers now can register themselves in Zookeeper. They perform a leader election using "InterProcessMutex" interact with Zookeeper to ensure that there is only one nimbus responsible for launching and monitoring topologies.
2.Every Nimbus Server is running a timer to compare and find if there are topology codes which are not exists on it's local disk. They would download lcoal missing topology codes from the Nimbus leader through the thrift RPC just like Supervisors do.With this feature, any numbers of Nimbus Server can be launched through out the cluster.
3.StormSubmitter,Supervisor,Non-leader Nimbus and Storm UI now are able to find and connect to the Nimbus leader via Zookeeper.A Nimbus leadership table is also added to Storm-UI on the main page to show every Nimbus's leader-election state and it's host in addition.
PS: Some implementation of the Nimbus-Election part has taken @Frostman's solution for reference(link: nathanmarz/storm#422).