Download spark-submit without the whole Spark framework to make a lite Docker image

Most of the Docker images that embed Apache Spark have the whole Spark archive in them.
Also, most of the time we submit the Spark application to Kubernetes, so the Spark job runs in other Docker containers.
As such, I am wondering: in order to make the Docker image smaller, how can I embed only the spark-submit feature?

That's a great question! I had a look (at the latest version on the Downloads page, 3.3.1) and found the following:
Looking at the contents of $SPARK_HOME/bin/spark-submit, you can see the following line:
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$#"
Ok, so it looks like the $SPARK_HOME/bin/spark-submit script calls the $SPARK_HOME/bin/spark-class script. Let's have a look at that one.
Similar to spark-submit, spark-class calls the load-spark-env.sh script like so:
. "${SPARK_HOME}"/bin/load-spark-env.sh
This load-spark-env.sh script calls other scripts of its own as well. But there is also a bit about Spark jars in spark-class:
# Find Spark jars.
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi
So as you can see, it is referencing the Spark jars directory (288MB of the total 324MB for Spark 3.3.1) and putting that on the launch classpath. Now, it's very possible that not all of those jars are needed in the case of submitting an application on Kubernetes. But you at least need some kind of library to translate a Spark application into a Kubernetes configuration that your Kubernetes API server can understand.
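As a quick sanity check of where that size comes from, a couple of shell commands are enough (a sketch; the unpacked directory name spark-3.3.1-bin-hadoop3 is an assumption):

# compare the size of the whole distribution with its jars directory
du -sh spark-3.3.1-bin-hadoop3        # ~324MB in total
du -sh spark-3.3.1-bin-hadoop3/jars   # ~288MB of that is jars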
So my conclusion from this bit is:
We can quite easily follow which files are exactly needed. In the first instance, I would say anything in $SPARK_HOME/bin and $SPARK_HOME/conf. But that is no issue, since they are all very small scripts/conf files.
Some of those scripts, though, put the jars directory on your classpath for the final java command.
Maybe they don't need all the jars, but they will need some kind of library to connect to the Kubernetes API server. So I would expect some jar to be needed there. I see there is a jar called kubernetes-model-core-5.12.2.jar. Maybe this one?
Since most of the size of this $SPARK_HOME folder comes from those jars, you can try to delete some jars and run your spark-submit jobs to see what happens. I would think that, amongst others, jars like commons-math3-3.6.1.jar or spark-mllib_2.12-3.3.1.jar would not be necessary for a simple spark-submit to a Kubernetes API Server.
(All those specific jar names come from the one Spark version I mentioned at the start of this post.)
Really interesting question, I hope this helps you a bit! Just try deleting some jars, run your spark-submit and see what happens!
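If you want to try that, here is a minimal sketch of the trial-and-error approach. It assumes SPARK_HOME points at the 3.3.1 install; the jar picked, the API server address, the application class and the application jar path are placeholders, not tested values:

# park a candidate jar somewhere it can be restored from
mkdir -p /tmp/removed-jars
mv "$SPARK_HOME"/jars/spark-mllib_2.12-3.3.1.jar /tmp/removed-jars/

# re-run a submission against the Kubernetes API server
"$SPARK_HOME"/bin/spark-submit \
  --master k8s://https://<kubernetes-api-server>:6443 \
  --deploy-mode cluster \
  --class org.example.MyApp \
  local:///opt/app/my-app.jar

# if the submit still works, try the next jar; if not, move the jar back from /tmp/removed-jars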

Related

How to access scala shell using docker image for spark

I just downloaded this Docker image to set up a Spark cluster with two worker nodes. The cluster is up and running; however, I want to submit my Scala file to this cluster, and I am not able to start spark-shell in it.
When I was using another Docker image, I was able to start it using spark-shell.
Can someone please explain whether I need to install Scala separately in the image or whether there is a different way to start it?
UPDATE
Here is the error:
bash: spark-shell: command not found
root@a7b0682ff17d:/opt/spark# ls /home/shangupta/Scripts/
ProfileData.json demo.scala queries.scala
TestDataGeneration.sql input.scala
root@a7b0682ff17d:/opt/spark# spark-shell /home/shangupta/Scripts/input.scala
bash: spark-shell: command not found
root@a7b0682ff17d:/opt/spark#
You're getting "command not found" because PATH isn't correctly established.
Use the absolute path /opt/spark/bin/spark-shell.
Also, I'd suggest packaging your Scala project as an uber jar to submit, unless you have no external dependencies or don't mind adding --packages/--jars manually.
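For example (a sketch assuming Spark lives under /opt/spark in that image, as the shell prompt above suggests):

export PATH="$PATH:/opt/spark/bin"                    # or call /opt/spark/bin/spark-shell directly
spark-shell -i /home/shangupta/Scripts/input.scala    # -i loads the script into the REPL

For anything with dependencies, packaging an uber jar and submitting it with spark-submit, as suggested above, is the more robust route.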

Can I run scripts from a docker build context without a copy?

I want to build on top of a Windows Docker container by installing a couple of programs. The files total 0.5 GB and I want to keep the layers as small as possible. I am hoping I can run the setup files from the build-context, and then have the build-context swept away at the end so I don't have a needless copy of the source files for the setup.exe embedded in my container layers. However, I have not found an example where this is the case. Instead I mostly see people run a COPY command to a temporary build folder, run their setup, then remove the folder. Won't those files still be in the container layers, because the COPY command creates a new layer when it's done?
I don't know if the container can see the build-context directly. I was hoping for some magical folder filled with the build-context files so I could run a script using it, but haven't found anything.
It seems like the alternative is to create a private file-server and perform a RUN that can download them from that private server and unpack them, run the install, and remove them (all as 1 docker step). I understand this would make it more available to others who need to rerun the build, but I'm not convinced we'll need to rerun it. It's not likely to change as the container will build patches for a legacy application. Just seems like a lot to host files on a private, public-facing server for something that will get called once every couple years if ever.
So are these my two options?
Make a container with needless copies of source files embedded within
Host the files on a private file server and download/install/remove them
Or am I missing another option or point about how the containers work?
It's a long shot as Windows is tricky with its file system, but you could do it this way:
In your Dockerfile, use a COPY command, install, then RUN del ... to remove the installation files
Build your image docker build -t my-large-image:latest .
Run your image docker run --name my-large-container my-large-image:latest
Stop the container
Export your container filesystem docker export my-large-container > my-large-container.tar
Import the filesystem to a new image cat my-large-container.tar | docker import - my-small-image
The caveat is that you need to run the container once, which might not be what you want. Also, I haven't tested this with Windows containers, sorry.
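Collected as one shell sketch, using the image and container names from the steps above:

docker build -t my-large-image:latest .
docker run --name my-large-container my-large-image:latest
docker stop my-large-container
docker export my-large-container > my-large-container.tar
cat my-large-container.tar | docker import - my-small-image

docker export dumps only the container's filesystem, so the imported my-small-image ends up as a single flattened layer without the intermediate build history.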
I usually do the download or copy in one step, then in the next step I do the silent installation and remove the installer.
# escape=`
FROM mcr.microsoft.com/dotnet/framework/wcf:4.8-windowsservercore-ltsc2016
SHELL ["powershell", "-Command", "$ErrorActionPreference = 'Stop'; $ProgressPreference = 'SilentlyContinue';"]
ADD https://download.visualstudio.microsoft.com/download/pr/6afa582f-fa26-4a73-8cb9-194321e85f8d/ecea51ead62beb7acc73ad9799511ffdb3083ad384fe04ec50e2cbecfb426482/VS_RemoteTools.exe VS_RemoteTools_x64.exe
RUN Start-Process .\\VS_RemoteTools_x64.exe -ArgumentList @('/install','/quiet','/norestart') -NoNewWindow -Wait; `
Remove-Item -Path C:/VS_RemoteTools_x64.exe -Force;
But otherwise, I don't think you can mount a custom volume while it's being built.
I didn't find a satisfactory answer to this. Docker seems designed only for the modern era and assumes you'll be able to download what you need via scripts and tools hitting APIs and file servers. The easiest option I found, and the one I eventually went with, was to host the files on a private file server or service (in my case, AWS S3).
I really wish there were a way to have files hosted by the Docker daemon in some way, e.g. if it acted like a temporary server that you could get data from via HTTP instead of needing to COPY the files and create a layer. Alas, I found no such feature.
Taking this route made my container about a GB smaller.

Using JMeter plugins with justb4/jmeter Docker image results in error

Goal
I am using Docker to run JMeter in Azure Devops. I am trying to use Blazemeter's Parallel Controller, which is not native to JMeter. So, according to the justb4/jmeter image documentation, I used the following command to get the image going and run the JMeter test:
docker run --name jmetertest -i -v /home/vsts/work/1/s/plugins:/plugins -v $ROOTPATH:/test -w /test justb4/jmeter ${@:2}
Error
However, it produces the following error while trying to accommodate the plugin (I know the plugin makes the difference, because I tested without it):
cp: can't create '/test/lib/ext': No such file or directory
As far as I understand, this is an error produced when one of the parent directories of the directory you are trying to make does not exist. Is there something I am doing wrong, or is there actually something wrong with the image?
References
For reference, I will include links to the image documentation and the repository.
Image: https://hub.docker.com/r/justb4/jmeter
Repository: https://github.com/justb4/docker-jmeter
Looking into the Dockerfile:
ENV JMETER_HOME /opt/apache-jmeter-${JMETER_VERSION}
Looking into entrypoint.sh:
if [ -d /plugins ]
then
    for plugin in /plugins/*.jar; do
        cp $plugin $(pwd)/lib/ext
    done;
fi
It basically copies the plugins from the /plugins folder (if present) into the lib/ext folder relative to the current working directory.
I don't know why you added the -w /test stanza to your command line, but it explicitly "tells" the container that the working directory is /test, not /opt/apache-jmeter-xxxx; that's why the script fails to copy the files.
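If you just want the original command to work as-is, here is a sketch of it without the -w /test override (assuming, as implied above, that the image's default working directory is the JMeter home; any test-plan or output paths passed in ${@:2} then need to be absolute /test/... paths):

docker run --name jmetertest -i \
  -v /home/vsts/work/1/s/plugins:/plugins \
  -v $ROOTPATH:/test \
  justb4/jmeter ${@:2}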
In general I don't think that the approach is very valid because:
In Azure DevOps you won't have your "local" folder (unless you want to put the plugin binaries under the version control system)
Some JMeter plugins have other .jars as dependencies, so when you're installing a plugin you should:
put the plugin itself under /lib/ext folder of your JMeter installation
put the plugin dependencies under /lib folder of your JMeter installation
So I would recommend amending the Dockerfile to download the JMeter Plugins Manager and install the plugin(s) you need from the command line.
Something like:
RUN wget https://jmeter-plugins.org/get/ -O /opt/apache-jmeter-${JMETER_VERSION}/lib/ext/jmeter-plugins-manager.jar
RUN wget https://repo1.maven.org/maven2/kg/apc/cmdrunner/2.2/cmdrunner-2.2.jar -P /opt/apache-jmeter-${JMETER_VERSION}/lib/
RUN java -cp /opt/apache-jmeter-${JMETER_VERSION}/lib/ext/jmeter-plugins-manager.jar org.jmeterplugins.repository.PluginManagerCMDInstaller
RUN /opt/apache-jmeter-${JMETER_VERSION}/bin/./PluginsManagerCMD.sh install bzm-parallel

Set line-buffering in container output

I use the Java S2I image for a container running in OpenShift (on premise). My problem is that the output of the image is page-buffered and oc logs ... does not show me the latest logs.
I could probably spin up my docker image that would do stdbuf -oL -e0 java ... but I would prefer to stick to the 'official' image (just adding the jar to /deployments). Is there any way to reduce buffering (use line-buffering instead of page-buffering), or flush the output on demand?
EDIT: It seems that I could update the deployment config and pass stdbuf in there, but that means I'd have to compose all the args myself. The ideal solution would be passing --tty to Docker, but I can't see how custom arguments could be passed that way in OpenShift.
In your repo, try creating the file .s2i/bin/run. In it add:
#!/bin/bash
exec stdbuf -oL -e0 /usr/local/s2i/run
I always forget where the S2I assemble and run scripts are in the Java S2I image, so you may need to replace /usr/local/s2i with the correct path.
What adding this file does is cause it to be run as the startup command instead of the original run script. You can then run the original script with stdbuf. Ensure you use exec so that the sub-process replaces the current one, else signals will not be propagated through properly.
Even though this might work, I am surprised logging isn't working in unbuffered mode already. I would expect there to be a better way of controlling it through some Java config instead.

Need to know how to use Groovy to automate a Docker build & runtime

I have a task to containerize a Spring & React web-app so that non-technical staff can make use of the container to demo the app to clients. Currently we develop on OSX & deploy to Tomcat on AWS managed by a 3rd party firm, and the non-technical staff use Windows laptops for their stuff.
So far I have bash scripts in OSX which will create a Packager container that has a Java 8 SDK & maven installed, & which will compile the app into a war file. A second script creates and initializes a mongodb container & gives it a name, and the third script creates a Tomcat/Java 8 container, loads the war file into it, links it to the mongodb container & sets it running. In bash on OSX this works fine, but I found it didn't work if I tried it in cygwin on Windows 10, and my CMD/Powershell-fu is too weak to script it in a Windows native fashion.
So, I'm trying to do the script in something that'll run on OSX, an AWS Linux server & Windows 10, & being a Java developer myself I thought of Groovy. This is my first time scripting Docker using Groovy, so I've ended up resorting to structures like:
println "docker build -f Dockerfile.packager -t mycontainer .".execute().text
I wonder if Docker has a Java or Groovy API that I could plug into & do things like:
docker.build("Dockerfile.packager").tag("mycontainer")
Currently my script is determining the location of the project root & building up the Docker run command as a string, like:
File emToo = new File(System.getProperty("user.dir")+"/.m2")
String currentDirectory = new File(".").getCanonicalPath()
String projectRoot = new File(currentDirectory+"/../").getCanonicalPath()
I get an option string from the user via a command line prompt, "Do you want QA or Dev?" & then:
String dockerRunCmd = "docker run -it -v $projectRoot/:/usr/local/build/myproject:cached -v ${emToo.getCanonicalPath()}:/root/.m2:cached mycontainer $option"
println dockerRunCmd.execute().text
Currently it doesn't seem to do anything after asking for the option - it's kinda bombing out. I get the run command output to screen, & if I copy/paste that into a command line in the scripts directory it falls over saying that the parent pom can't be found. Remember though that if I run the OSX bash script to do this, it works just fine. The bash script is basically:
#! /usr/bin/env bash
CWD=`pwd`
options=$1
docker run -it -v $CWD/../:/usr/local/build/myproject:cached -v ~/.m2:/root/.m2:cached --rm mycontainer $options
...which I think amounts to the same thing, right? Where's it going wrong?
UPDATE: I've found a bug - I should have been setting emToo to
new File(System.getProperty("user.home")+"/.m2"). user.dir just picks up the current directory, & the maven .m2 directory is in the user's home, usually. Currently though, the script gives me a run command that works if I cut/paste into a command line, but which doesn't allow me to call .execute() on the string in Groovy. If I can get that to work, there'll be no need for the docker-client projects suggested.
There are different ways to communicate with Docker from Groovy or Java (SDKs are listed here: https://docs.docker.com/engine/api/sdks/#other-languages):
Groovy (https://github.com/gesellix/docker-client)
Java (https://github.com/docker-java/docker-java)
Many others can also be found on GitHub.
But since you are using Maven, it will probably be easier for you to use the awesome docker-maven-plugin (https://dmp.fabric8.io), which can build and push images, run containers, etc.
