How to run a jar example (word count) on Docker with Hadoop and Spark installed?

I have installed Hadoop and Spark on Docker. Now I want to verify that the installations were successful, using a simple word-count example jar. But I do not know how to do it. Any ideas?
Thank you in advance.

You don't need Hadoop. Spark can run word count on plain-text local files.
You can also use the pyspark shell rather than a JAR. Run the word-count code from the documentation, but use a file:/// path so it reads from the local filesystem:
# Read plain-text files from a local folder (no HDFS needed)
text_file = sc.textFile("file:///some_local_folder")
# Split each line into words, pair each word with 1, and sum the pairs per word
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
# Write the (word, count) results back to the local filesystem
counts.saveAsTextFile("file:///tmp/results")
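If you specifically want to verify the setup with a JAR, one option is to submit the word-count example that ships with the Spark distribution. A minimal sketch (the jar name and paths depend on your Spark version and image, so adjust them):
# Paths are illustrative; adjust SPARK_HOME and the examples jar name to your installation
$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.JavaWordCount \
  --master "local[*]" \
  "$SPARK_HOME"/examples/jars/spark-examples_*.jar \
  file:///some_local_folder/input.txt
If the installation is healthy, this should print the word counts to the console.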

Related

Download spark-submit without the whole Spark framework to make a lite Docker image

Most Docker images that embed Apache Spark contain the whole Spark distribution.
Also, most of the time we submit the Spark application to Kubernetes, so the Spark job runs in other Docker containers.
As such, I am wondering: how can I embed only the spark-submit feature, in order to make the Docker image smaller?
That's a great question! I had a look (at the latest version on the Downloads page, 3.3.1) and found the following:
Looking at the contents of $SPARK_HOME/bin/spark-submit, you can see the following line:
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$#"
Ok, so it looks like the $SPARK_HOME/bin/spark-submit script calls the $SPARK_HOME/bin/spark-class script. Let's have a look at that one.
Similar to spark-submit, spark-class calls the load-spark-env.sh script like so:
. "${SPARK_HOME}"/bin/load-spark-env.sh
This load-spark-env.sh script calls other scripts of its own as well. But there is also a bit about Spark jars in spark-class:
# Find Spark jars.
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi
So as you can see, it is referencing the Spark jars directory (288MB of the total 324MB for Spark 3.3.1) and putting that on the launch classpath. Now, it's very possible that not all of those jars are needed when submitting an application to Kubernetes. But at the very least you need some kind of library to translate a Spark application into Kubernetes configuration that your Kubernetes API server can understand.
So my conclusion from this bit is:
We can quite easily trace which files are actually needed. At first glance, I would say everything in $SPARK_HOME/bin and $SPARK_HOME/conf, but that is not an issue since they are all very small scripts/config files.
Some of those scripts, though, put the jars directory on the classpath for the final java command.
Maybe they don't need all the jars, but they will need some kind of library to connect to the Kubernetes API server. So I would expect some jar to be needed there. I see there is a jar called kubernetes-model-core-5.12.2.jar. Maybe this one?
Since most of the size of this $SPARK_HOME folder comes from those jars, you can try to delete some jars and run your spark-submit jobs to see what happens. I would think that, amongst others, jars like commons-math3-3.6.1.jar or spark-mllib_2.12-3.3.1.jar would not be necessary for a simple spark-submit to a Kubernetes API Server.
(All those specific jar names just come from the one Spark version I talked about at the start of this post.)
Really interesting question, I hope this helps you a bit! Just try deleting some jars, run your spark-submit and see what happens!
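To make the experiment concrete, here is a rough, untested sketch of that pruning idea (directory and jar names are taken from the 3.3.1 distribution mentioned above; which jars can safely go is exactly the trial-and-error part):
# Assumed distribution folder name for Spark 3.3.1
SPARK_DIST=spark-3.3.1-bin-hadoop3
# Keep the small scripts and config files, plus a jars directory
mkdir -p spark-slim/jars
cp -r "$SPARK_DIST/bin" "$SPARK_DIST/conf" spark-slim/
cp "$SPARK_DIST"/jars/*.jar spark-slim/jars/
# Then experimentally delete jars your spark-submit does not need, e.g.
rm spark-slim/jars/spark-mllib_*.jar spark-slim/jars/commons-math3-*.jar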

How should I persistently save Julia packages in a Docker container?

I'm running Julia on a Raspberry Pi 4. For what I'm doing, I need Julia 1.5, and thankfully there is a Docker image of it here: https://github.com/Julia-Embedded/jlcross
My challenge is that, because this is work-in-progress development, I find myself adding packages here and there as I work. What is the best way to persistently save the updated environment?
Here are my problems:
I'm having a hard time wrapping my mind around volumes that will save packages from Julia's package manager and keep them around the next time I run the container
It seems kludgy to commit my docker container somehow every time I install a package.
Is there a consensus on the best way or maybe there's another way to do what I'm trying to do?
You can persist the state of downloaded & precompiled packages by mounting a dedicated volume into /home/your_user/.julia inside the container:
$ docker run --mount source=dot-julia,target=/home/your_user/.julia [OTHER_OPTIONS]
Depending on how (and by which user) julia is run inside the container, you might have to adjust the target path above to point to the first entry in Julia's DEPOT_PATH.
You can control this path by setting it yourself via the JULIA_DEPOT_PATH environment variable. Alternatively, you can check whether it is in a nonstandard location by running the following command in a Julia REPL in the container:
julia> println(first(DEPOT_PATH))
/home/francois/.julia
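One way to avoid guessing the user-specific home directory (a sketch, assuming you are free to set environment variables for the container) is to force the depot to a fixed path and mount the volume there:
# Hypothetical: force the depot to /depot and persist it in a named volume
$ docker run -e JULIA_DEPOT_PATH=/depot --mount source=dot-julia,target=/depot [OTHER_OPTIONS]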
You can manage the packages and their versions via a Julia Project.toml file.
This file keeps both the list of your dependencies and their version constraints.
Here is a sample Julia session:
julia> using Pkg
julia> pkg"generate MyProject"
Generating project MyProject:
MyProject\Project.toml
MyProject\src\MyProject.jl
julia> cd("MyProject")
julia> pkg"activate ."
Activating environment at `C:\Users\pszufe\myp\MyProject\Project.toml`
julia> pkg"add DataFrames"
Now the last step is to provide package version information in your Project.toml file. We start by checking the version number that works well:
julia> pkg"st DataFrames"
Project MyProject v0.1.0
Status `C:\Users\pszufe\myp\MyProject\Project.toml`
[a93c6f00] DataFrames v0.21.7
Now edit the [compat] section of the Project.toml file to pin that version number to always be v0.21.7:
name = "MyProject"
uuid = "5fe874ab-e862-465c-89f9-b6882972cba7"
authors = ["pszufe <pszufe#******.com>"]
version = "0.1.0"
[deps]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
[compat]
DataFrames = "= 0.21.7"
Note that in the last line the equals sign appears twice: the one inside the string pins the exact version number. See also https://julialang.github.io/Pkg.jl/v1/compatibility/.
Now, in order to reuse that structure (e.g. in a different Docker container, when moving between systems, etc.), all you need to do is:
cd("MyProject")
using Pkg
pkg"activate ."
pkg"instantiate"
Additional note
Also have a look at the JULIA_DEPOT_PATH variable (https://docs.julialang.org/en/v1/manual/environment-variables/).
When moving installations between Docker containers, it can also be convenient to control where all your packages are actually installed. For example, you might want to copy the JULIA_DEPOT_PATH folder between two containers with the same Julia installation to avoid the time spent installing packages, or you might be building the Docker image without an internet connection, etc.
In my Dockerfile I simply install the packages just like you would do with pip:
FROM jupyter/datascience-notebook
RUN julia -e 'using Pkg; Pkg.add.(["CSV", "DataFrames", "DataFramesMeta", "Gadfly"])'
Here I start with a base data science notebook image which includes Julia, and then call Julia from the command line, instructing it to execute the code needed to install the packages. The only downside for now is that package precompilation is triggered each time I load the container in VS Code.
If I need new packages, I simply add them to the list.
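One possible way to move that precompilation cost into the image build (a sketch, not tested against this exact base image) is to precompile right after installing:
# Hypothetical: precompile while building the image so containers start faster
RUN julia -e 'using Pkg; Pkg.add.(["CSV", "DataFrames", "DataFramesMeta", "Gadfly"]); Pkg.precompile()'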

After Drake source installation on macOS, how do I run an example?

After using "Source installation on macOS" to install drake, "Bazel built//..." and " Bazel test//..." are done. The question is: how I run an example , for examples/acrobot/run_swing_up ? Should I input a command like: Bazel-bin/examples/acrobot/run_swing_up ?
Yup, you can either run it via bazel run or ./bazel-bin (the latter being better for running multiple processes, having stdin access, etc.):
https://drake.mit.edu/bazel.html
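For the acrobot example from the question, that would look something like this (run from the Drake source root; the target name is assumed from the path given):
# Via bazel run:
bazel run //examples/acrobot:run_swing_up
# ...or, after bazel build //..., directly from bazel-bin:
./bazel-bin/examples/acrobot/run_swing_up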
Some of the examples also have brief READMEs or docs on how to run it; e.g.:
jaco arm
inclined plane

Copying an exe into a Docker image and making it platform independent

I need to create a Docker image which, when run, installs an exe into the directory specified in my Dockerfile.
Basically, I need the ImageMagick application. The Dockerfile should be platform independent: if I run it on Windows it should use the Windows distribution, and on Linux the Linux distribution. It would be great if it also added an environment variable to the system. I browsed for a solution, but I couldn't find an appropriate one.
I know it's a bit late but maybe someone (like me) was still searching.
I ended up using a java-imagemagick docker version from https://hub.docker.com/r/cpaitsupport/java-imagemagick/dockerfile
You can run docker pull cpaitsupport/java-imagemagick to get this docker image to your docker machine.
Now comes the tricky part: I needed to run ImageMagick inside a Docker container for my main app. You can COPY the files from cpaitsupport/java-imagemagick into your custom container. Example:
COPY --from=cpaitsupport/java-imagemagick:latest . ./some/dir/imagemagick
Now you should have the file structure for your custom app, and also, under some/dir/imagemagick/, the file structure for ImageMagick. There you will find all the ImageMagick-related files (convert, magick, the libraries, etc.).
Now, if you want to use ImageMagick in your code, you need to set up some ENV variables in your Docker container pointing to the "new" path of the ImageMagick directory. Example:
ENV IM4JAVA_TOOLPATH=/some/dir/imagemagick/usr/bin \
    LD_LIBRARY_PATH=/usr/lib:/some/dir/imagemagick/usr/lib \
    MAGICK_CONFIGURE_PATH=/some/dir/imagemagick/etc/ImageMagick-7 \
    MAGICK_CODER_MODULE_PATH=/some/dir/imagemagick/usr/lib/ImageMagick-7.0.5/modules-Q16HDRI/coders \
    MAGICK_HOME=/some/dir/imagemagick/usr
Now, in your Java code, delete the ProcessStarter.setGlobalSearchPath(imPath); call if it is set, so that the IM4JAVA_TOOLPATH is used instead.
Now ConvertCmd cmd = new ConvertCmd(); and cmd.run(op); should work.
Maybe it's not the best way, but it worked for me after a lot of struggling.
Hope this helps!
Feel free to correct or add additional info.
You can install (extract) files onto the host system using Docker mounts or volumes;
however, you cannot change host system settings by updating environment variables of the host from inside a container.
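For example, a rough sketch of extracting the ImageMagick files from that image onto the host through a bind mount (paths are illustrative, and this assumes the image provides a shell):
# Hypothetical: copy files out of the container into a host directory via a bind mount
docker run --rm -v "$PWD/imagemagick-files:/out" cpaitsupport/java-imagemagick:latest \
  sh -c "cp -r /usr/bin /usr/lib /out/"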

Upload Files with Space in Name on Google Cloud SDK Shell

I'm new to Google Cloud Storage but have managed to find every answer to my questions online so far. Now I'm trying to upload files using the Google Cloud SDK. It works great for files with no spaces ("this_file_001.txt"), but if I try to upload a file with spaces ("this file 001.txt") the system won't recognize the command. The command I'm using that works is:
gsutil -m cp -r this_file_001.txt gs://this_file_001.txt
The same command with spaces doesn't work:
gsutil -m cp -r this file 001.txt gs://this file 001.txt
Is there any way to accomplish this task?
Thanks in advance.
Putting the argument into quotes should help. I just tried the commands below in the Google Cloud Shell terminal and they worked fine:
$ gsutil mb gs://my-test-bucket-55
Creating gs://my-test-bucket-55/...
$ echo "hello world" > "test file.txt"
$ gsutil cp "test file.txt" "gs://my-test-bucket-55/test file.txt"
Copying file://test file.txt [Content-Type=text/plain]...
Uploading gs://my-test-bucket-55/test file.txt: 12 B/12 B
$ gsutil cat "gs://my-test-bucket-55/test file.txt"
hello world
That said, I'd avoid file names with spaces if I could.
Alexey's suggestion about quoting is good. If you're on Linux or a Mac, you can likely also escape with a backslash (\). On Windows, you should be able to use a caret (^).
Linux example:
$> gsutil cp test\ file.txt gs://bucket
Windows example:
c:\> gsutil cp test^ file.txt gs://bucket
Quotes work for both platforms, I think.
