Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STORM-166:Nimbus HA solution based on Zookeeper #61

Closed
wants to merge 9 commits into from

Conversation

yveschina
Copy link

Nimbus HA feature is quite important for our application running on the storm cluster. So, we've been working on the problem for some time and now a solution seems not that perfect but be enough to apply has comed out.

1.Nimbus Servers now can register themselves in Zookeeper. They perform a leader election using "InterProcessMutex" interact with Zookeeper to ensure that there is only one nimbus responsible for launching and monitoring topologies.

2.Every Nimbus Server is running a timer to compare and find if there are topology codes which are not exists on it's local disk. They would download lcoal missing topology codes from the Nimbus leader through the thrift RPC just like Supervisors do.With this feature, any numbers of Nimbus Server can be launched through out the cluster.

3.StormSubmitter,Supervisor,Non-leader Nimbus and Storm UI now are able to find and connect to the Nimbus leader via Zookeeper.A Nimbus leadership table is also added to Storm-UI on the main page to show every Nimbus's leader-election state and it's host in addition.

PS: Some implementation of the Nimbus-Election part has taken @Frostman's solution for reference(link: nathanmarz/storm#422).

{:text "Nimbus uptime" :attr {:class "tip right"
:title (:nimbus-uptime tips)}}
{:text "Nimbus leader uptime" :attr {:class "tip right"
:title (:nimbus-leader-uptime tips)}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please actually define an entry in tips for ':nimbus-leader-uptime'

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix the tips soon

@revans2
Copy link
Contributor

revans2 commented Apr 7, 2014

I have done a quick pass through the code. It looks like there are several places that the code is leaking connections to ZK. I am also concerned about the extra load that this may be placing on ZK. ZK is already the bottleneck for scalability of the cluster, a new client connecting does a write operation to the cluster, and having every client make a connection is bad, but not a deal-breaker. However, also having the existing daemons constantly making new connections to ZK feels like it will cause a lot of scalability issues.

@@ -38,7 +38,6 @@ storm.thrift.transport: "backtype.storm.security.auth.SimpleTransportPlugin"
storm.messaging.transport: "backtype.storm.messaging.netty.Context"

### nimbus.* configs are for the master
nimbus.host: "localhost"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if nimbus.host is not used any more we should either deprecate it or just remove it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nimbus.host relative config should be removed all over the source codes. I will fix this.

@ptgoetz
Copy link
Member

ptgoetz commented Apr 10, 2014

As @revans2 alluded to, some of the challenges with code distribution (supervisor downloads from nimbus) will be alleviated by using bittorrent for topology distribution.

I have a branch that switches to using bittorrent for code distribution, but I've held off on submitting a pull request because there's a bug with multi-lang topologies that I haven't had time to track down yet (the resource directory gets deleted).

I'll submit the pull request for reference with the caveat that it shouldn't be merged until that bug is fixed.

Here's the original pull request:

nathanmarz/storm#629

@ptgoetz
Copy link
Member

ptgoetz commented Apr 11, 2014

FYI... aforementioned issue with multi-lang has been fixed.

@d2r
Copy link

d2r commented Jul 10, 2014

@yveschina, any update on this PR? Looks like it needs an up-merge to master.

@ptgoetz
Copy link
Member

ptgoetz commented Jul 24, 2014

@yveschina any update on the concerns raised?

@yveschina yveschina changed the title nimbus ha solution for issue STORM-166 STORM-166:Nimbus HA solution based on Zookeeper Sep 11, 2014
@yveschina
Copy link
Author

@d2r @ptgoetz I've updated the pullrequest according to the conversions with @revans2 . Should we consider to close this pullrequest ? Is there any plan to merge this pullrequest or other solutions on Nimbus HA?

@ptgoetz
Copy link
Member

ptgoetz commented Oct 1, 2014

@yveschina The main concern I have is with catastrophic failure of a nimbus node during code distribution. I'm not sure it's acceptable to force users to resubmit a topology in that event.

I'm working with @Parth-Brahmbhatt on a similar solution that involves a pluggable code distribution interface (either bittorrent or a distributed FS) that will also be compatible with the security work being done (e.g. code distribution backed by a secure HDFS).

More details of that work are available in the JIRA, and we will be posting a much more detailed design doc in the future.

For the time being, let's keep this pull request open.

knusbaum pushed a commit to knusbaum/incubator-storm that referenced this pull request Feb 11, 2015
…utho

Correct authorization check in nimbus methods
@d2r
Copy link

d2r commented Oct 8, 2015

@yveschina @ptgoetz , it seems #354 has been merged for STORM-166. Should this pull request be closed? (Did we address everything this pull request fixes?)

@ptgoetz
Copy link
Member

ptgoetz commented Oct 8, 2015

@d2r Yes, I think this pull request can be closed.

@asfgit asfgit closed this in 26f966c Oct 16, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants