This is a “what I learned about Apache Zeppelin” post.
To be honest, we have quite a few issues with Apache Zeppelin:
- With many concurrent users, it becomes unresponsive
- It accumulates zombie processes
- Once an interpreter has been started, it cannot be stopped from the Zeppelin interface and does not free up executor resources, which clutters the system
- User permissions are an issue as well
So here are a couple of learnings and/or insights.
1.) Unresponsiveness & zombie processes
The issue is the following: Zeppelin starts Spark in client mode, meaning that the Spark driver process is not distributed over the cluster but runs on the submitting machine, which in our case is hdp-master. So the number of Spark interpreters that can be started depends on the resources available on hdp-master. At least according to one of the main developers, this resource restriction should be more of a problem than the Zeppelin daemon itself: http://apache-zeppelin-users-incubating-mailing-list.75479.x6.nabble.com/Limit-on-multiple-concurrent-interpreters-isolated-notebooks-tp4732p4737.html
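To get a feeling for this limit, here is a back-of-the-envelope sketch in Python. All numbers are made up for illustration and are not measurements from hdp-master. The 10% overhead on top of the driver heap is an assumption as well, and the function name is hypothetical.

```python
def max_concurrent_interpreters(host_memory_gb, reserved_gb, driver_memory_gb,
                                overhead_fraction=0.10):
    """Rough upper bound on how many Spark drivers fit on the submitting host.

    host_memory_gb    -- total RAM of the host (hypothetical value)
    reserved_gb       -- RAM kept free for the OS, the Zeppelin daemon, other services
    driver_memory_gb  -- heap configured per Spark driver
    overhead_fraction -- assumed extra JVM footprint per driver (not a measured value)
    """
    if driver_memory_gb <= 0:
        raise ValueError("driver_memory_gb must be positive")
    usable_gb = host_memory_gb - reserved_gb
    if usable_gb <= 0:
        return 0
    per_driver_gb = driver_memory_gb * (1 + overhead_fraction)
    return int(usable_gb // per_driver_gb)


# Hypothetical host: 64 GB RAM, 16 GB reserved, 4 GB heap per driver
print(max_concurrent_interpreters(64, 16, 4))
```

Memory is only one dimension. CPU and open connections on the submitting host limit the number of interpreters too, so the real ceiling is likely lower.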
Concerning the zombie processes after shutdown of the daemon: this should not happen. There are even mechanisms within Zeppelin that should prevent it. However, it seems we are not the only ones experiencing this problem. A bug ticket has been filed: http://apache-zeppelin-users-incubating-mailing-list.75479.x6.nabble.com/Interpreter-zombie-processes-tp4738.html and https://issues.apache.org/jira/browse/ZEPPELIN-1832
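Until this is fixed, leftover interpreter processes can at least be listed. The sketch below parses the output of `ps -eo pid,ppid,args` and flags interpreter processes whose parent is PID 1, i.e. processes that were orphaned when the daemon went away. Two assumptions: that the interpreter JVMs can be recognized by `RemoteInterpreterServer` in their command line, and that orphans are re-parented to PID 1 (on some systems they are adopted by a different process). Check both on your own machine before relying on this.

```python
import subprocess

# Assumed marker in the command line of a Zeppelin interpreter JVM
MARKER = "RemoteInterpreterServer"


def parse_ps(ps_output, marker=MARKER):
    """Return interpreter processes found in `ps -eo pid,ppid,args` output."""
    found = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # pid, ppid, full command line
        if len(parts) < 3:
            continue
        pid, ppid, args = parts
        if not (pid.isdigit() and ppid.isdigit()):
            continue  # header line or unexpected format
        if marker in args:
            found.append({
                "pid": int(pid),
                "ppid": int(ppid),
                "orphaned": int(ppid) == 1,  # parent is gone, re-parented to init
                "args": args,
            })
    return found


def list_interpreter_processes():
    """Run ps on the local machine and return the matching processes."""
    result = subprocess.run(["ps", "-eo", "pid,ppid,args"],
                            capture_output=True, text=True, check=True)
    return parse_ps(result.stdout)
```

The function only reports. Whether an orphaned process can safely be killed is still a manual decision.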
2.) Different interpreters & cluster resources
There are several issues with this one, especially if you do not use Spark dynamic resource allocation. If you do use it (which works fine, I tried it on the new gcloud instance), an idle interpreter only consumes resources for the driver (and maybe one executor, depending on config). Only when you actually start a computation does it ask for more resources. So that’s one option to mitigate this problem at least a little bit. Second, a “stop” button in Zeppelin would be a very useful feature, and there actually is such a feature request already, so it might come soon(ish) (I can’t find the ticket anymore).
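For reference, these are the Spark properties involved in dynamic allocation, written as a small Python snippet that prints them in `spark-defaults.conf` style. The property names are standard Spark settings. The values are illustrative assumptions, not the config of our cluster. On YARN, the external shuffle service also has to be set up on the node managers, which is not shown here.

```python
# Illustrative values only; tune them for the actual cluster.
DYNAMIC_ALLOCATION = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",            # required by dynamic allocation
    "spark.dynamicAllocation.minExecutors": "0",        # idle interpreter keeps only the driver
    "spark.dynamicAllocation.maxExecutors": "10",       # hypothetical cap per interpreter
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}


def render_properties(props):
    """Render properties as aligned 'key value' lines."""
    width = max(len(key) for key in props)
    return "\n".join(f"{key.ljust(width)}  {value}" for key, value in props.items())


print(render_properties(DYNAMIC_ALLOCATION))
```

With `minExecutors` set to 0, an idle notebook should hold on to nothing but its driver. Setting it to 1 gives the “one executor” behaviour mentioned above.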
3.) Another note on unresponsiveness
We say that a Zeppelin notebook is unresponsive if the status bar does not change. This is a bad definition of “unresponsive”, as there might be many reasons why the status bar is not progressing: for example, the Spark operation might take long (e.g. many shuffles), or Zeppelin is waiting for cluster resources. In an ideal world, Zeppelin would have a more verbose status bar, telling the user what it is actually doing at the moment. The immediate reaction of restarting the interpreter, killing all other processes and restarting Zeppelin might not always be necessary, and combined with issue 1.) it might worsen the problem globally. I’ll create a feature request for this one.
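Until Zeppelin tells us more, one way to tell “slow” from “stuck” is to look at the state YARN reports for the interpreter's application. The sketch below takes the parsed JSON from the ResourceManager REST endpoint `/ws/v1/cluster/apps` and turns each application state into a hint. The hint texts and the function name are my own invention, and the mapping is deliberately coarse. A RUNNING application can still be waiting for additional executors, which this sketch does not detect.

```python
# YARN states in which an application has not started running yet
WAITING_STATES = {"NEW", "NEW_SAVING", "SUBMITTED", "ACCEPTED"}


def describe_apps(payload):
    """Map each YARN application id in the payload to a human-readable hint."""
    apps = (payload.get("apps") or {}).get("app") or []
    hints = {}
    for app in apps:
        state = app.get("state")
        if state in WAITING_STATES:
            hint = "waiting for cluster resources"
        elif state == "RUNNING":
            hint = "running; a slow cell may simply be a long Spark job"
        else:
            hint = "finished, failed or killed; a restart is probably justified"
        hints[app.get("id")] = hint
    return hints


# Hypothetical payload in the shape returned by /ws/v1/cluster/apps
example = {"apps": {"app": [
    {"id": "application_1_0001", "state": "ACCEPTED"},
    {"id": "application_1_0002", "state": "RUNNING"},
]}}
for app_id, hint in describe_apps(example).items():
    print(app_id, "->", hint)
```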
In Zeppelin 0.6, many improvements have been made. Bugs have been fixed and features have been added. For example, user authentication is now possible, as is hooking Zeppelin up with GitHub in order to version (and share) notebooks. There is also the possibility to start a new interpreter for each notebook automatically.
Concerning multitenancy for Zeppelin: there is this project: http://help.zeppelinhub.com/zeppelin_multitenancy/ However, it is in beta and only runs on a Spark standalone cluster. It’s not clear when Spark on YARN will be supported. We’ll have to wait.
Some general notes: Zeppelin is still incubating at the Apache Software Foundation, and has been for quite some time; the whole project is roughly four years old. The (dev) community is rather small, although many people seem to use it. Not sure whether it will ever gain real traction. I would not bet too much on Zeppelin for the future. Using it internally for analysis & prototyping purposes is certainly fine, if we can live with the drawbacks. At this moment, however, I would not include it in any “production” workflows (especially not for external customers).