Been playing with PySpark on Jupyter all day with no issues. Simply by using the Docker image jupyter/pyspark-notebook, 90% of everything I need is packaged (yay!)
I would like to start exploring GraphFrames, which sits on top of GraphX, which sits on top of Spark. Has anyone gotten this combination to work?
Essentially, according to the documentation, I just need to pass "--packages graphframes:xxyyzz" when running pyspark to download and run GraphFrames. The problem is that Jupyter is already running as soon as the container comes up.
I've tried passing the "--packages" line as an environment variable (-e) for both JUPYTER_SPARK_OPTS and SPARK_OPTS when running docker run, and that didn't work. I found that I can do pip install graphframes from a terminal, which gets me part of the way -- the Python libraries are installed, but the Java ones are not: "java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI".
The image specifics documentation does not appear to offer any insights on how to deploy a Spark Package to the image.
Is there a certain place to throw the graphframes .jar? Is there a command to install a spark package post-docker? Is there a magic argument to docker run that would install this?
I bet there's a really simple answer to this --Or am I in high cotton here?
References:
No module named graphframes Jupyter Notebook
How do I run pyspark with jupyter notebook?
So the answer was quite simple:
From the gist here, we simply need to tell Jupyter to add the --packages line to the Spark submit arguments, with something like this at the top of the notebook. Spark goes out and installs the package when grabbing the context:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 pyspark-shell'
Keep an eye on the versions available on the graphframes package page, which, for now, means graphframes 0.8.1 on Spark 3.0 on Scala 2.12.
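Put together, a minimal sketch of the top-of-notebook cell. The SparkSession lines are commented out and shown only for context; they assume a standard pyspark install like the one in the jupyter/pyspark-notebook image:

```python
import os

# Must be set BEFORE any SparkContext/SparkSession is created, because the
# submit arguments are read once at JVM startup.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 pyspark-shell'
)

# Then build the session as usual; Spark fetches the jar on startup:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# from graphframes import GraphFrame
```

If the variable is set after a context already exists, the package is silently ignored, which reproduces the ClassNotFoundException above.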
Related
I'm working on Ubuntu 20. I've installed docker and nvidia-docker2. In PyCharm, I've followed the JetBrains guide, but the advanced steps aren't consistent with what I see in my setup. I use PyCharm Professional 2022.2.
In this step:
In the run options I additionally put --runtime=nvidia and --gpus=all.
Step 4 finishes the same as in the guide (almost, but it doesn't seem to affect anything, so more on that later), and in step 5 I manually put the path to the interpreter in the virtual environment I created using the Dockerfile.
That way I am able to run nvidia-smi and see the GPU correctly, but I don't see any of the packages I installed during the Dockerfile build.
There is another option to connect the interpreter a little differently, in which I do see the packages, but I can't run the nvidia-smi command, and torch.cuda.is_available() returns False.
The way is instead of doing this as in the guide:
I press the little down arrow to the left of the Add Interpreter button and then click on Show all:
After that I can press the + button:
and then I can choose Docker:
which results in the difference in functionality mentioned above and also in the path displayed (the first is the first remote interpreter, top to bottom, and the second is the second, correspondingly):
Here, of course, is the effect of the first and the second, correspondingly:
Here are the results of the interpreter run with the interpreter connected via the first method:
and here is the second:
Of the following code:
Here is the Dockerfile file if you want to take a look:
Has anyone configured this correctly and can help?
Thank you in advance.
P.S.: if I run the Docker container from Services and enter the terminal, the nvidia-smi command works fine, importing torch works, and torch.cuda.is_available() returns True.
P.S.2:
What has worked for me for now is to change the Dockerfile to install torch directly with pip, without creating a conda environment.
Then I set the path to the python2.7 and I can run the code, but not debug it.
For running, the result is as expected (the packages list, as shown before, is still empty, but it works; I guess my IDE somehow cannot access the remote interpreter's package list in that case, I don't know why):
But the debugger outputs the following error:
Any suggestions for the debugger issue also will be welcome, although it is a different issue.
Please update to 2022.2.1, as this looks like a known regression that has since been fixed.
Let me know if it still does not work well.
I'm running Julia on the Raspberry Pi 4. For what I'm doing, I need Julia 1.5, and thankfully there is a Docker image of it here: https://github.com/Julia-Embedded/jlcross
My challenge is that, because this is a work-in-progress development I find myself adding packages here and there as I work. What is the best way to persistently save the updated environment?
Here are my problems:
I'm having a hard time wrapping my mind around volumes that will save packages from Julia's package manager and keep them around the next time I run the container
It seems kludgy to commit my docker container somehow every time I install a package.
Is there a consensus on the best way or maybe there's another way to do what I'm trying to do?
You can persist the state of downloaded & precompiled packages by mounting a dedicated volume into /home/your_user/.julia inside the container:
$ docker run --mount source=dot-julia,target=/home/your_user/.julia [OTHER_OPTIONS]
Depending on how (and by which user) julia is run inside the container, you might have to adjust the target path above to point to the first entry in Julia's DEPOT_PATH.
You can control this path by setting it yourself via the JULIA_DEPOT_PATH environment variable. Alternatively, you can check whether it is in a nonstandard location by running the following command in a Julia REPL in the container:
julia> println(first(DEPOT_PATH))
/home/francois/.julia
You can manage the packages and their versions via a Julia Project.toml file.
This file keeps both the list of your dependencies and their version constraints.
Here is a sample Julia session:
julia> using Pkg
julia> pkg"generate MyProject"
Generating project MyProject:
MyProject\Project.toml
MyProject\src/MyProject.jl
julia> cd("MyProject")
julia> pkg"activate ."
Activating environment at `C:\Users\pszufe\myp\MyProject\Project.toml`
julia> pkg"add DataFrames"
Now the last step is to add package version information to your Project.toml file. We start by checking the version number that works well:
julia> pkg"st DataFrames"
Project MyProject v0.1.0
Status `C:\Users\pszufe\myp\MyProject\Project.toml`
[a93c6f00] DataFrames v0.21.7
Now you want to edit the [compat] section of the Project.toml file to pin that version number to always be v0.21.7:
name = "MyProject"
uuid = "5fe874ab-e862-465c-89f9-b6882972cba7"
authors = ["pszufe <pszufe#******.com>"]
version = "0.1.0"
[deps]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
[compat]
DataFrames = "= 0.21.7"
Note that in the last line the equality operator appears twice, to pin the exact version number; see also https://julialang.github.io/Pkg.jl/v1/compatibility/.
Now, in order to reuse that environment (e.g. in a different Docker container, or when moving between systems), all you do is:
cd("MyProject")
using Pkg
pkg"activate ."
pkg"instantiate"
Additional note
Also have a look at the JULIA_DEPOT_PATH variable (https://docs.julialang.org/en/v1/manual/environment-variables/).
When moving installations between Docker containers, it can also be convenient to control where your packages are actually installed. For example, you might want to copy the JULIA_DEPOT_PATH folder between two containers with the same Julia installation to avoid the time spent installing packages, or you might be building the Docker image with no internet connection, etc.
In my Dockerfile I simply install the packages just like you would do with pip:
FROM jupyter/datascience-notebook
RUN julia -e 'using Pkg; Pkg.add.(["CSV", "DataFrames", "DataFramesMeta", "Gadfly"])'
Here I start with a base data science notebook which includes Julia, then call Julia from the command line, instructing it to execute the code needed to install the packages. The only downside for now is that package precompilation is triggered each time I load the container in VS Code.
If I need new packages, I simply add them to the list.
I'm trying to learn PyTorch, but whenever I try any online tutorial (https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py), I get errors when running certain functions, but only in Jupyter Notebook.
When running
x = torch.empty(5, 3)
I get an error:
module 'torch' has no attribute 'empty'
Furthermore, when running
x = torch.zeros(5, 3, dtype=torch.long)
I get the error:
module 'torch' has no attribute 'long'
Some other functions work fine like:
x = torch.rand(5, 3)
But generally, most code I try runs into an error really quickly. I couldn't find any resolution online.
When I go into my docker container and simply run python in the shell, I can run these lines just fine with no errors.
I'm running pytorch in a Docker image that I extended from a fastai image, as it already included things like jupyter notebook and pytorch. I used anaconda to update everything, and committed it to a new image for myself.
I have absolutely no idea what the issue could be. I've tried updating packages through anaconda, pip, and aptitude in my Docker container, making sure to commit my changes, but nothing seems to work. I also tried creating a new kernel with Python 3.7, as I noticed that my Jupyter notebook runs 3.6.4, while python in the shell is 3.7.
I've also tried getting different docker images and extending them with what I need, but all images that I've tried have had errors with anaconda where it gets stuck on "Solving environment" step.
OK, so the fix for me was to update pytorch through conda using the following command:
conda update pytorch
If it's not installed yet, I've gotten it to work in other environments by simply installing it through conda:
conda install pytorch
Kind of stupid that I didn't try this earlier, but I was confused about the difference between conda and pip.
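A quick way to confirm this kind of kernel/shell mismatch, using only the standard library: run the same snippet in a notebook cell and in the shell's python. Different paths or versions mean the kernel picked up another environment, likely with an older torch that predates torch.empty and torch.long. The torch line is commented out since torch may not be installed where this runs:

```python
import sys

# Run this in BOTH the Jupyter kernel and the shell interpreter and compare.
print("interpreter:", sys.executable)
print("python:", ".".join(map(str, sys.version_info[:3])))

# Also compare the torch build in each environment:
# import torch; print("torch:", torch.__version__)
```

If the two executables differ (e.g. a 3.6.4 conda env for the kernel vs 3.7 in the shell), updating torch in the kernel's environment, as above, is the fix.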
So I'm trying to run OpenAI Gym in a Docker container, but the rendering looks like this:
Notice the pong window has a weird render issue where it's repeating things and the colors are off. Here is space invaders:
NOTE FOR "NOT A PROGRAMMING ISSUE" PEOPLE: The solution involves the correct bash script code to call the right API methods to render the arrays of pixels correctly. Also only a graphics programmer is likely to "recognize the render glitch".
My setup is very simple.
- I'm on a local ubuntu 16.04 install with an Nvidia gtx1060 and corei7
- I installed the Nvidia runfile driver with --no-opengl-files (as per instructions from Nvidia and many places).
- Specifically, I'm running floydhub/pytorch docker image.
Does anyone recognize the particular render glitch and what it could mean? It almost looks like a StackOverflow of a frame buffer! What can I do to track down the bug?
EDIT: I have eliminated all the extra dependencies I had been installing and am just doing simple x-forwarding according to the ROS GUI guide.
You can easily reproduce this as follows:
docker run -it --user=$(id -u) --env="DISPLAY" --workdir="/home/$USER" --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" floydhub/pytorch:0.1.11-gpu-py3.6 bash
Now, inside the container, type python and then the following:
import gym
gym.make('Pong-v0').render()
That should open up an x-forwarded window on your machine, but the display is corrupt (at least for me)
Above I actually used SpaceInvaders-v0
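Before chasing the glitch itself, a small sanity check (standard library only; the socket path and variable name mirror the docker run flags above) can confirm the X-forwarding pieces are at least wired through inside the container:

```python
import os

# X forwarding only works if DISPLAY was propagated into the container
# (--env="DISPLAY") and the X socket directory was mounted
# (--volume="/tmp/.X11-unix:/tmp/.X11-unix:rw").
display = os.environ.get("DISPLAY")
x_socket_mounted = os.path.isdir("/tmp/.X11-unix")
print("DISPLAY set:", display is not None)
print("/tmp/.X11-unix mounted:", x_socket_mounted)
```

If either check fails, no window appears at all; if both pass but the output is corrupt, the problem is in the GL/render path rather than the forwarding setup.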
I would like to build a Dockerfile on Linux which
1. compiles vim with Python support
2. installs the Python stack (such as numpy, scipy, ipython, etc.)
3. creates an SSL certificate for ipython-notebook, so the notebooks can be viewed from the host machine
It seemed straightforward enough. But I have run into problems despite a variety of approaches, such as linking separate containers, using anaconda, as well as with a single unified image vs separate layers, or creating a user or running all as a root.
In order to run vim, simply installing as root does not activate the pathogen bundle vim-ipython. Creating a user allows pathogen bundles to install (i.e. NERDTree works), but :IPython throws an error:
:IPython failed
^-- failed '' not found .
I've tried the above with no layers/one large Dockerfile, and with separate layers for the Python stack, vim, and the ipython notebook.
Dockerfile
What am I not seeing here ?
What is the ^-- failed '' not found error referring to?
I've tried running the ipython notebook with --no-browser & and then running vim, and also running two shells on the same container... but I can't get past this error.
Here is a working Dockerfile for anyone trying to get vim-ipython working in Docker.
issues:
- a user/shared home is needed for vim, despite runtimepath in .vimrc pointing to pathogen/bundle
- %connect_info is required with containers
- I am running as root; not sure why vim required a USER to install packages, but changing to USER would throw errors with CMD
--best