RP - Random Pondering: 2016

Wednesday, December 28, 2016

From "The DevOps Handbook"

In an age where competitive advantage requires fast time to market and relentless experimentation, organizations that are unable to replicate these outcomes are destined to lose in the marketplace to more nimble competitors and could potentially go out of business entirely, much like the manufacturing organizations that did not adopt Lean principles.

Monday, December 19, 2016

Apache Zeppelin .... again

This is a "what I learned about Apache Zeppelin" post.

To be honest, we have quite a few issues with Apache Zeppelin:

- With many concurrent users, it becomes unresponsive

- It accumulates zombie processes

- Once an interpreter has been started, it cannot be stopped from the Zeppelin interface and does not free up executor resources, which clutters the system

- User permissions are an issue as well

So here are a couple of learnings and/or insights.

1.) Unresponsiveness & zombie processes

The issue is the following: Zeppelin starts Spark in client mode, meaning that the Spark driver process is not distributed over the cluster but runs on the submitting machine, which in our case is hdp-master. So the number of Spark interpreters that can be started depends on the resources available on hdp-master. At least according to one of the main developers, this resource restriction should be more of a problem than the Zeppelin daemon itself: http://apache-zeppelin-users-incubating-mailing-list.75479.x6.nabble.com/Limit-on-multiple-concurrent-interpreters-isolated-notebooks-tp4732p4737.html

Concerning the zombie processes after shutdown of the daemon: this should not happen. There are even mechanisms within Zeppelin that should prevent it. However, it seems we are not the only ones experiencing this problem. A bug ticket has been filed: http://apache-zeppelin-users-incubating-mailing-list.75479.x6.nabble.com/Interpreter-zombie-processes-tp4738.html and https://issues.apache.org/jira/browse/ZEPPELIN-1832

2.) Different interpreters & cluster resources

There are several issues with this one, especially if you do not use Spark dynamic resource allocation. If you do (which works fine, I tried it on the new gcloud instance), an idle interpreter simply consumes resources for the driver (and maybe one executor, depending on the config). Only when you actually start a computation does it ask for more free resources. So that’s one option to circumvent this problem at least a little bit. Second, a “stop” button in Zeppelin would be a very useful feature, and there actually is such a feature request already, so it might come soon(ish) (I can’t find the bug ticket anymore).
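A minimal sketch of what enabling dynamic allocation could look like. The property names are standard Spark settings; the values, the function name `dynamic_allocation_conf` and the idea of appending the output to `spark-defaults.conf` are assumptions for illustration, not our actual configuration.

```shell
#!/bin/bash
# Prints Spark properties that enable dynamic resource allocation.
# The values below are illustrative assumptions, not tuned recommendations.
dynamic_allocation_conf() {
  cat <<'EOF'
spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         0
spark.dynamicAllocation.maxExecutors         10
spark.dynamicAllocation.executorIdleTimeout  60s
EOF
}

# Usage (hypothetical path): append the properties to the Spark defaults file.
# dynamic_allocation_conf >> "$SPARK_CONF_DIR/spark-defaults.conf"
```

On YARN, dynamic allocation also needs the external shuffle service running on the NodeManagers. The same properties can alternatively be set in the Spark interpreter settings in Zeppelin.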

3.) Another note on unresponsiveness

We say that a Zeppelin notebook is unresponsive if the status bar does not change. This is a bad definition of “unresponsive”, as there might be many reasons why the status bar is not progressing: for example, the Spark operation might take long (e.g. many shuffles), or Zeppelin might be waiting for cluster resources. In an ideal world, Zeppelin would have a more verbose status bar, telling the user what it is actually doing at the moment. The immediate reaction of restarting the interpreter, killing all other processes and restarting Zeppelin might not always be necessary, and combined with issue 1) it might worsen the problem globally. I’ll create a feature request for this one.

In Zeppelin 0.6, many improvements have been made: bugs have been fixed and features have been added. For example, user authentication is now possible, as is hooking Zeppelin up with GitHub in order to version (and share) notebooks. There is also now the possibility to start a new interpreter for each notebook automatically.

Concerning multitenancy for Zeppelin: there is this project http://help.zeppelinhub.com/zeppelin_multitenancy/ However, it is in beta and only runs on a Spark standalone cluster. It’s not clear when Spark on YARN will be supported. We’ll have to wait.

Some general notes: Zeppelin is still incubating at the Apache Foundation, and has been for quite some time; the whole project is roughly four years old. The (dev) community is rather small, although many people seem to use it. I am not sure whether it will ever gain real traction, and I would not bet too much on Zeppelin for the future. Using it internally for analysis & prototyping purposes is certainly fine, if we can live with the drawbacks. At this moment, however, I would not include it in any “production” workflows (especially not for external customers).

Alternatives to Zeppelin are: https://github.com/andypetrella/spark-notebook and http://jupyter.org/ . Jupyter is great if you want to use the python interpreter. There are Scala bindings as well, but I did not dig deeper and test how it works.

Friday, December 9, 2016

Bose Hearphones

http://hearphones.bose.com

That will be very useful -- especially the "reduce World Volume" option. Looking forward to trying them.

Friday, December 2, 2016

DBeaver -- Free Universal SQL Client

If you are looking for a good, universal (meaning: cross-database) SQL (& NoSQL) client, look no further. I highly recommend DBeaver:

http://dbeaver.jkiss.org/

Note: If you want the Cassandra driver as well, you have to download the Enterprise Edition (which is also free, but not OSS).

Docker ♡

I can only repeat myself ...

#docker is the best thing since *add previous favorite things here*
— René Pfitzner (@RenePfitznerZH) November 16, 2016

Tuesday, November 15, 2016

Sneak peak ...

I always have to look this up ...

http://theoatmeal.com/comics/sneak_peek

Saturday, November 12, 2016

Google DNI and the "democracy-promoting element"

In response to: https://jourtagblog.wordpress.com/2016/11/02/googles-geld-gutes-geld/

"The democracy-promoting element is lost, which can certainly have negative effects." Urs Bühler, quoted in the blog post above.

That it will have negative effects if "the democracy-promoting element" of journalism is lost is undisputed. Whether the DNI leads to this element being lost, however, is very much open to debate. Urs Bühler's position is one. Mine is another. The media industry is doing badly. Some media companies are in the fortunate position of having a) liquid funds and b) a smart management, so that they can and want to invest in innovation. The reality is: for the vast majority of media companies, at least one of the two conditions is not met. Yet innovation is necessary in order to reach people with our content on new channels and to convince them that it is worth paying for this content. Since the funds that Google provides through the DNI Fund are paid out to implement projects whose goal is to achieve exactly that by means of innovative technological ideas, the DNI promotes "the democratic element" rather than abolishing it.

https://www.digitalnewsinitiative.com/

Thursday, October 6, 2016

Zeppelin Zombies ...

As I discussed earlier, we are (semi-happily) using Apache Zeppelin as a Spark notebook. However, at some point Zeppelin notebooks were responding so slowly and running into so many time-out errors that they were impossible to work with. Restarting the Zeppelin server did not help -- and for quite some time we were clueless about what had suddenly happened. At some point we figured out that Zeppelin has severe problems shutting down processes when errors occur -- and starts accumulating zombie processes over time. We had a couple of hundred that cluttered our system. Killing these zombie processes and restarting the Zeppelin server did the trick -- now everything is running as smoothly as before.
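A rough sketch of how such leftover processes could be listed before killing them. It assumes that "zombie" here means interpreter JVMs that are still running after Zeppelin has stopped, and that they can be recognised by the class name `RemoteInterpreterServer` in their command line; the function names are made up for this example.

```shell
#!/bin/bash
# Assumption: leftover interpreter JVMs show this class name in their command line.
PATTERN="${PATTERN:-RemoteInterpreterServer}"

# Reads "PID PPID COMMAND..." lines on stdin and prints the PID of every line
# whose command contains $PATTERN. The pattern is passed via the environment so
# that awk's own command line never matches it.
leftover_pids() {
  PATTERN="$PATTERN" awk 'index($0, ENVIRON["PATTERN"]) > 0 { print $1 }'
}

# Lists candidate PIDs on this machine; review the list before killing anything.
list_leftovers() {
  ps -e -o pid= -o ppid= -o args= | leftover_pids
}

# Usage, run manually after stopping the Zeppelin daemon:
# list_leftovers | xargs -r kill
```

Printing the PIDs first and killing them in a separate, manual step keeps the destructive part under your control.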

Monday, August 29, 2016

Get your sh** together Pro Tips -- vol. I

Next time you receive a newsletter/campaign email from sender x, and you haven't actually read the last five emails of sender x:

Immediately open that mail, scroll down to the bottom and click unsubscribe.

Voilà: potentially hundreds fewer emails per year.

Friday, August 26, 2016

An Apache Spark pyspark setup script, incl. virtualenv

Here is a little script that I employed to get pyspark running on our cluster. Why is this necessary? Well, if you want to use the ML libraries within Apache Spark from the Python API, you need Python 2.7. However, in case your cluster runs on CentOS, it comes with Python 2.6, which the system itself depends on. DO NOT REMOVE IT. Otherwise bad things will happen.

Instead, it's best practice to have a separate Python 2.7 installation. And to be completely isolated, best practice is to create a virtualenv, which you will use to install all packages you are going to use with pyspark.

Also, if you plan to run pyspark within Zeppelin, you have to be sure that the virtualenv is accessible to user Zeppelin. This is why I install the whole thing in /etc. Also, make sure to run this on all cluster nodes, otherwise Spark executors cannot launch the local Python processes.

#!/bin/bash
# Run as root, on every cluster node.
set -e  # stop at the first failing step

# info on python2.7 req's here: http://toomuchdata.com/2014/02/16/how-to-install-python-on-centos/
# info on installing python for spark: http://blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/
# info on python on local environment http://stackoverflow.com/questions/5506110/is-it-possible-to-install-another-version-of-python-to-virtualenv

# install needed system libraries (-y answers the yum prompts automatically)
yum -y groupinstall "Development tools"
yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel

# set up a local Python 2.7 installation, separate from the system Python 2.6
mkdir -p /etc/spark-python/python
cd /etc/spark-python/python
wget http://www.python.org/ftp/python/2.7.9/Python-2.7.9.tgz
tar -zxvf Python-2.7.9.tgz
cd Python-2.7.9
# no "make clean" here: a fresh source tree has no Makefile before ./configure
./configure --prefix=/etc/spark-python/.localpython
make
make install

# install virtualenv into the local Python and create the venv
cd /etc/spark-python/python
wget "https://pypi.python.org/packages/8b/2c/c0d3e47709d0458816167002e1aa3d64d03bdeb2a9d57c5bd18448fd24cd/virtualenv-15.0.3.tar.gz#md5=a5a061ad8a37d973d27eb197d05d99bf"
tar -zxvf virtualenv-15.0.3.tar.gz
cd virtualenv-15.0.3/
/etc/spark-python/.localpython/bin/python setup.py install
cd /etc/spark-python
/etc/spark-python/.localpython/bin/virtualenv spark-venv-py2.7 --python=/etc/spark-python/.localpython/bin/python2.7

# activate venv
cd /etc/spark-python/spark-venv-py2.7/bin
source ./activate

# pip install packages of your choice
/etc/spark-python/spark-venv-py2.7/bin/pip install --upgrade pip
/etc/spark-python/spark-venv-py2.7/bin/pip install py4j
/etc/spark-python/spark-venv-py2.7/bin/pip install numpy
/etc/spark-python/spark-venv-py2.7/bin/pip install scipy
/etc/spark-python/spark-venv-py2.7/bin/pip install scikit-learn
/etc/spark-python/spark-venv-py2.7/bin/pip install pandas

After you have done this, make sure to set the variable PYSPARK_PYTHON in /etc/spark-env.sh to the path of the new binary, in this case /etc/spark-python/spark-venv-py2.7/bin/python

Also, if you use Zeppelin, make sure to set the correct Python path in the interpreter settings. Simply alter/add the property zeppelin.pyspark.python and set its value to the Python binary as above.
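As a small sketch of that step: the helper names below are hypothetical, and the path is the one created by the script above. One function prints the line for spark-env.sh, the other checks that the interpreter really exists on a node before you point Spark at it.

```shell
#!/bin/bash
# Path created by the setup script above.
PYSPARK_PYTHON_BIN="/etc/spark-python/spark-venv-py2.7/bin/python"

# Prints the export line that belongs in spark-env.sh for the given interpreter.
pyspark_env_line() {
  printf 'export PYSPARK_PYTHON=%s\n' "$1"
}

# Succeeds only if the given interpreter path exists and is executable.
check_python_bin() {
  [ -x "$1" ]
}

# Usage on each node (spark-env.sh location as mentioned in the text):
# check_python_bin "$PYSPARK_PYTHON_BIN" && pyspark_env_line "$PYSPARK_PYTHON_BIN" >> /etc/spark-env.sh
```

Running the check on every node first catches the case where the script was not executed everywhere, which would otherwise only show up when an executor fails to launch Python.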

Tags: Apache Spark, Python, pyspark, Apache Zeppelin, Ambari, Hortonworks HDP

Friday, August 19, 2016

Yann LeCun: "Deep Learning is a conspiracy ... really."

Thursday, August 18, 2016

Virtual Reality Hackdays 2015

In order for it not to be forgotten, here is the webpage of the "Virtual Reality Hackdays" I co-organized last year.

http://hackdays.nzz.ch/

Monday, August 1, 2016

OK, now it's official. My side-project roound.io is online ... check it out, sign up for the newsletter and stay tuned. We are going to launch in beta soon!

Thursday, July 7, 2016

Apache Zeppelin autocomplete / code completion

For those of you using Apache Zeppelin as an interactive Spark notebook: if you have been wondering whether there is an autocompletion function, the answer is "yes". No, it's not "tab", it's

Ctrl + .

It's not optimal (as of now), but works fairly well.
Tags: Zeppelin, Apache Zeppelin, autocompletion, auto-completion, code completion

Tuesday, June 21, 2016

Everytime you use groupByKey ...

Saturday, June 18, 2016

DAO vulnerability -- Ethereum

Yesterday someone exploited the so-called "DAO vulnerability" to steal some 3 million Ether. This incident, of course, caused panic among many people trading Ether, which resulted in Ether prices plummeting. This article by The Verge was even titled "How an experimental cryptocurrency lost (and found) $53 million". So here is the catch: the author of this article, and in the same vein everyone else already proclaiming the death of the "Ethereum cryptocurrency", actually misses the point about Ethereum. Ethereum is not a cryptocurrency, but "... a decentralized platform that runs smart contracts: applications that run exactly as programmed without any possibility of downtime, censorship, fraud or third party interference." The execution of these smart contracts is fueled by Ether, but Ether is not a cryptocurrency like bitcoin. It was never meant to be yet another cryptocurrency. So don't blame the project if you lost "real" money yesterday. Ether is not there to be traded in the first place; it's a commodity, to be used in the Ethereum network. And to be sure: Buterin nicely explains that the attack is not a bug in Ethereum itself, but a mistake in the code powering the DAO project. As it seems, a common bug, though.

So, what will happen? Well, I think the Ethereum community learned a valuable lesson. The attack might foster the creation of long-awaited "best-practices" for smart-contracts, maybe even projects to "security check" your own smart-contract code. Learning the hard way is often the only way to learn. In this sense: no, the project certainly is not dead -- quite the opposite, it might never have been more alive.

Thursday, June 9, 2016

Scraping tables from websites (the easy way)

Just need to scrape one table from one website? Use Google Docs as described here.

Wednesday, May 25, 2016

Aha-moment of the morning

Short blog posts can be worthwhile, too.

Monday, April 18, 2016

You are a physicist. And you are working at a newspaper. But you don't write articles. What do you do? And why?

As I am getting this question a lot, I am trying to give an answer here.

To be quick: As a data scientist at Neue Zürcher Zeitung I am dealing with predictive analytics, statistical modeling, advanced data analysis tasks as well as everything "algorithms" (e.g. recommendation & personalization). I am mainly using R for tasks involving small data and Apache Spark for tasks dealing with not-so-small data. Python and bash are my favorite scripting languages. I just happen to have a background in theoretical physics -- could be engineering, math or computer science as well.

But why media? I have always been looking for challenges and opportunities, intellectual and societal. And, well, being in media these days one finds both. The publishing business is super exciting, because everything is changing: how news is made, the way stories are told, distribution channels, the audience, the technology, the business models, ... Indeed, many of these issues are still open and are just being explored, and the future of many publishing houses is still uncertain. So why is this?

Not so long ago news was mainly distributed using one medium: paper. For the individual there were exactly two possibilities: either you wanted to be informed daily, in which case you had to pay for a newspaper subscription, or you didn't. If you (or your parents) happened to belong to the first group, chances are you only had one daily newspaper subscription. After all, they are not cheap (an NZZ subscription, for example, is roughly 600 CHF/year) and you chose the one that suited you best. However, fast-forward less than 10 years and you find that reality today looks quite different. If you want to be informed, you can do this mainly for "free" on the www. Also, as you have immediate access to all these manifold resources and they do not cost you a dime, you have a much wider variety at hand. No need to stick to one newspaper.
So newspapers are not only suffering from the problem that technology has been changing fast (from print to web 1.0 to web 2.0 to mobile, ...), but also from the fact that this change undermines the very basis of the (news) publishing industry: loyal customers who pay for the service you provide and, given this loyal, well-known customer base, the ability to earn money on the advertising market.
Actually, from a balance-sheet perspective, the publishing industry has mainly been an advertising industry -- only 20%-40% of revenue came from subscriptions, the rest was advertising.

So what do you do, when the very basis of your business model is eroding? You innovate -- and this is where, among others, people like me come in.
Innovation comes in two parts. First you innovate in the sense that you optimize your current operations: you cut costs where possible and increase efficiency. And data, clearly, should be the basis for this: understand the numbers, then you can optimize. For example in marketing: use predictive analytics to help you decide where best to put your marketing budget. Or use customer analytics to better understand the needs of your readership and to improve the customer experience accordingly.
The second part is what I like to call "true innovation". True innovation for me is not mere optimization, but novelty -- doing things that have not been done before. For this, on the one hand, data can be used as a decision criterion ("where to innovate"). On the other hand, data can also be the very basis for innovation. Here I am mainly thinking about data-driven / algorithmic products & services: things like smarter search, automated recommendations or personalization, in all its facets, that have the potential to greatly improve the customer experience, explore new ways of news consumption and reach a more tech-savvy audience.

I am contributing my part to this transformation at NZZ. Founded in 1780, NZZ is one of the oldest newspapers in the world that is still being published. A heritage like this comes with a lot of responsibility -- balancing tradition with the modern is a worthwhile challenge. In the end, a diverse and well-functioning media landscape is the basis for democracy. And I am glad to be part of it.

Sunday, April 3, 2016

Why you should go to college

April 2016: I just found this blog post that I drafted ... I don't know when. Certainly some time at the end of 2012. I don't know why I did not publish it at that time. I still agree. So here it is:

I just read the following Slashdot story

http://news.slashdot.org/story/1212/03/1317234/just-say-no-to-college

commenting on an article from the NYTimes

http://www.nytimes.com/2012/12/02/fashion/saying-no-to-college.html?pagewanted=all

And, as an academic, I simply cannot not comment on this. First of all: Kids, please, think twice! Especially the part where he talks about not attending college at all. Let me explain why.
Often people confuse correlation and causality. Example: because Einstein played the violin, if I play the violin I will be super smart. There is no causation here (at least not in this direction). Dropping out of college, because Mark Zuckerberg dropped out of college and now is who he is, will not bring you big bucks. Not seriously attending college at all and instead traveling through India will not make you a legend.

In general I disagree with how college education is judged in the article. Maybe what follows is simply my European perspective, but anyway. Attending university is about more than just "getting a degree at the end". It is about developing your mind, in an environment where free thinking is allowed and, even more, specifically wanted. You are surrounded by smart people 24/7, somewhat isolated from reality. This enclave permits you to read and learn and work on the things you would not be able to in "the corporate world" - simply because in reality you would have to think about surviving. University life is different - and it is supposed to be. It is a period in your life to not worry about these things - because you have a scholarship, your parents can afford to pay or (if you happen to be in the USA) you got a student loan. But you won't need a lot of money anyway: you share a flat, you ride a bike, you eat noodles every day. But you are free from all worldly hassles. Free to think. Free to learn. Free to transform yourself into a beautiful and sharp mind. In classes (sure, not in all) you will be exposed to cutting-edge research or crazy theories you will never ever need in real life but that are simply fascinating and mind-boggling. You will spend nights awake with your mates discussing Darwin and Freud and Einstein and this fu**** integral that took you the whole night to solve. College is about suffering on many levels: intellectually, financially and even physically. You will be some kind of ascetic, living only for the mere purpose of embedding yourself in an intellectual world and filling your head with knowledge. College will lead you to the edge of wisdom, to the edge of your mind, and will push you beyond. Sure, a hacking course will teach you how to program Angry Birds and eventually to become a millionaire. But attending university is a once-in-a-lifetime cultural experience. An experience you will only be able to appreciate at a young age.
An experience of and exposure to human culture you should not miss. Sure, if during this experience you realize that you have had enough and instead are inspired to found an awesome company, then dropping out might be the right choice. But remember (and this was the case for Zuckerberg and Brin and many others): the college atmosphere most likely was the reason that you had this spark of inspiration in the first place.
