Adding libraries to the PySpark kernel on Jupyter/JupyterHub on EMR

I'm trying to use Matplotlib with the PySpark3 kernel in JupyterHub (0.9.4) running in a Docker container on an AWS EMR cluster (5.20). There are 4 kernels preinstalled on that JupyterHub: Python, PySpark, PySpark3, and Spark.
There was no problem importing Matplotlib with the Python kernel. However, when I tried "import matplotlib as plt" with either the PySpark or PySpark3 kernel, I got back the message "matplotlib not found". I have been trying to find a guide on that but no luck.
Could you please help?
Thanks and regards,
Averell

Further reading showed that I was wrong: Using the PySpark kernels will actually have the code run on the Spark cluster (the EMR itself), while using the Python kernel will have the code run on the JupyterHub server (the docker image).
Matplotlib came preinstalled on the docker image, not the EMR.
Installing matplotlib on the EMR master node would solve that import issue in PySpark kernels. However, that doesn't help further (at least for me now) in plotting graphs using dataframes from Spark.
I could finally get what I wanted by following this guide - transferring the result to "local" (here "local" means the JupyterHub server - the docker image) and using matplotlib locally via the %%local magic: https://github.com/jupyter-incubator/sparkmagic/blob/master/examples/Pyspark%20Kernel.ipynb
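For illustration, the pattern from that notebook looks roughly like the following two notebook cells (the table and column names are made up). The first cell runs on the Spark cluster, and the -o flag copies the result back to the local container as a pandas dataframe named top_rows; the second cell then runs entirely on the JupyterHub container:

%%sql -o top_rows
SELECT day, count(*) AS cnt FROM events GROUP BY day

%%local
%matplotlib inline
import matplotlib.pyplot as plt
top_rows.plot(x='day', y='cnt', kind='bar')
plt.show()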

Related

Shared Python Packages Among Docker Containers

I have multiple docker containers that host some Flask apps which run some machine learning services. Let's say container 1 is using PyTorch, and container 2 is also using PyTorch. When I build the images, each PyTorch installation takes up space on disk. For some reasons we split these 2 services into different containers; if I insist on this way, is it possible to only build PyTorch once so that both containers can import it? Thanks in advance, I appreciate any help and suggestions!
You can build one docker image and install PyTorch on it, then use that image as the base image for those two apps (a minimal sketch of this layout follows below). This way PyTorch only takes up disk space once, and you save time by not installing PyTorch twice.
You can also build only one image and copy your code into two different directories, for example /app1 and /app2. Then, in your docker-compose file, change the working directory for each app.
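A sketch of the base-image approach mentioned above, with made-up image names and app layouts (Dockerfile.base is built once, and both app images start FROM it, so the PyTorch layer is stored on disk only once):

# Dockerfile.base -- build once with: docker build -t ml-base -f Dockerfile.base .
FROM python:3.9-slim
RUN pip install --no-cache-dir torch

# Dockerfile.app1 (Dockerfile.app2 is analogous) -- reuses the ml-base layers
FROM ml-base
WORKDIR /app1
COPY app1/ /app1/
CMD ["python", "app.py"]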

DASK CUDA on multi node EMR cluster is unable to detect nodes

I have set up an AWS EMR cluster using 10 core nodes of type g4dn.xlarge (each machine/node contains 1 GPU). When I run the following commands in a Zeppelin notebook, I see only 1 worker allotted in my LocalCUDACluster:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
cluster = LocalCUDACluster()
client = Client(cluster)
I tried passing n_workers=10 explicitly but it resulted in an error.
How do I make sure my LocalCUDACluster utilizes all of my other 9 nodes? What is the right way to set up a multi-node Dask-CUDA cluster?
Any help regarding this is appreciated.
There are a few options to set up a multi-worker cluster (with or without GPUs), described here; one manual pattern is sketched below.
The docs don't seem to mention third-party solutions, but right now there are two companies offering these services: Coiled and Saturn Cloud.
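For illustration, that manual pattern (the scheduler address 10.0.0.1 is made up here, and dask-cuda has to be installed on every node) is to start a scheduler on the master node and a dask-cuda-worker on each GPU node, then point the Client at the scheduler instead of at a LocalCUDACluster:

# on the master node:        dask-scheduler                        (listens on port 8786 by default)
# on each of the GPU nodes:  dask-cuda-worker tcp://10.0.0.1:8786  (starts one worker per visible GPU)
# then, in the notebook:
from dask.distributed import Client
client = Client("tcp://10.0.0.1:8786")
print(client)  # should now list one worker per GPU node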

difference between host and docker container

I have been trying to train a 3D CNN with a specific architecture. I wanted to create a Dockerfile with all the steps necessary to have the network working. The issue is that if I run the neural network on the host I have no problem, everything works fine. But doing almost the same in a docker container I always get the "segmentation fault (core dumped)" error.
Both installations are not exactly the same, but the variations (maybe some extra package installed) shouldn't be a problem, right? Besides, I don't get any error until it starts iterating, so it seems like it is a memory problem. The GPU works in the docker container and is the same GPU as on the host. The Python code is the same.
The neural network in the Docker container starts training with the data, but at epoch 1 it gets the "segmentation fault (core dumped)" error.
So my question is the following: is it possible to have critical differences between the host and a docker container even if they have exactly the same packages installed? Especially in relation to TensorFlow and the GPU, because the error must come from outside the code, given that the code works in a similar environment.
Hope I explained myself enough to give the idea of my question, thank you.
A docker container, at runtime, resolves its system calls through the host kernel.
See "How can Docker run distros with different kernels?".
In your case, your error is:
Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1, SSE4.2
See "How to compile Tensorflow with SSE4.2 and AVX instructions?"
(referenced by tensorflow/tensorflow issue 8037)
You could try to build an image from a TensorFlow built from source, using a docker multi-stage build.
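A rough sketch of that idea follows; the image tags, source path, and configure/bazel invocation are assumptions and will need adjusting (for example for a GPU build):

# builder stage: compile a TensorFlow wheel optimized for the build machine's CPU
FROM tensorflow/tensorflow:devel AS builder
WORKDIR /tensorflow_src
# accept default ./configure answers non-interactively; --config=opt builds with
# -march=native, enabling SSE4.1/SSE4.2/AVX where the CPU supports them
RUN yes "" | ./configure && \
    bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package && \
    ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/pkg

# runtime stage: only the built wheel is carried over
FROM python:3.6-slim
COPY --from=builder /tmp/pkg/ /tmp/pkg/
RUN pip install /tmp/pkg/tensorflow-*.whl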

Getting Openshift 3 container to install numpy and scipy

I have a working pod for a deployment in OpenShift 3 Starter. This is based off an image stream from a Docker image. However, I cannot get it to build in OpenShift with the built-in S2I.
The Docker option is not good, as I cannot find a setting anywhere to make an image stream update and cause a redeployment.
I tried setting it up so that a webhook would trigger an OpenShift build, but the server needs Python 3 with numpy and scipy, which makes the build get stuck. The best I could do was, inelegantly, get a Python 3 cartridge to install numpy based on requirements.txt and the rest via setup.py, but this still got stuck. I have a working webhook going for a different app that runs on basically the same layout except for the requirements (Python 3 Pyramid with Waitress).
Github: https://github.com/matteoferla/pedel2
Docker: https://hub.docker.com/r/matteoferla/pedel2/
Openshift: http://pedel2-git-matteo-ferla.a3c1.starter-us-west-1.openshiftapps.com
UPDATE: I have made an OpenShift Pyramid starter template.
I would first suggest going back to using the built-in Python S2I builder. If you are doing anything with numpy/pandas, you will need to increase the amount of memory available during the build phase of your application, as the compiler runs out of memory when building those packages (a rough example of the relevant BuildConfig setting is sketched below). See:
Pandas on OpenShift v3
See if that helps; if need be, we can then look at what your other options are around using an externally built container image.
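For reference, roughly what that BuildConfig change looks like (the 1Gi figure is just an example):

# in the app's BuildConfig (e.g. via oc edit bc/<name>):
spec:
  resources:
    limits:
      memory: 1Gi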

distributed tensorflow using docker

I am experimenting with distributed tensorflow and an example project.
Running the project in the same docker container seems to work well. As soon as you run the application in different containers, they cannot connect to each other.
I don't really know the cause, but I think this is because docker and tensorflow each open ports which have to be concatenated to connect to the application, like localhost:[docker-port]:[tf-port].
Do you think my guess is correct? And how can I solve this problem?
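For context, a distributed TensorFlow setup addresses each task by host:port in a cluster spec; when the tasks live in separate containers, those hosts have to be reachable from inside each container (for example container names on a shared docker network), not localhost with a mapped port. A minimal TF 1.x-style sketch, with made-up worker names and port:

import tensorflow as tf

# addresses the other containers can actually reach (container names on a
# shared docker network), rather than localhost
cluster = tf.train.ClusterSpec({"worker": ["worker0:2222", "worker1:2222"]})

# each container starts its own server with its own task index
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()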
