Install pyspark + pytest in a Docker container

I'm trying to unit test my pyspark code using pytest but can't figure out the proper steps and method of installation. I was able to get this working locally on my Mac using this tutorial. I've tried 2 methods to accomplish this:
1. Try to replicate what I did on my Mac in the Dockerfile, i.e. install pyspark, apache-spark, Java 8, Scala, and pytest, and make sure I get the ENV paths correct.
2. Use a prebuilt image from Docker Hub, like Bitnami's Spark image.
I attempted (1) but could not find the right RUN command to install Java properly.
For (2), is there any way in the Dockerfile to install pytest on top of the Bitnami image, given that Bitnami does not give root access?
Note:
Bitnami does not put py4j on the PYTHONPATH, so I had to add this line to the Dockerfile:
ENV PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:${PYTHONPATH}"
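For context, a minimal Dockerfile sketch along those lines might look like the following; the py4j zip version under ${SPARK_HOME}/python/lib depends on the Spark release inside the image, so treat that path as an assumption and check it first:
FROM bitnami/spark:latest
# Assumption: the py4j version below matches the one shipped in this image
ENV PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:${PYTHONPATH}"
RUN pip install pytest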

How about building your image FROM bitnami/spark and adding pytest?
I created test_spark.py:
from pyspark.sql import SparkSession
def test1():
    spark = SparkSession.builder.getOrCreate()
    data = spark.sql("SELECT 1").collect()
    assert data == [(1,)]
and a Dockerfile:
FROM bitnami/spark:latest
RUN pip install pytest py4j
COPY test_spark.py .
CMD python -m pytest test_spark.py
Now I can build and run my container and execute the pytests:
docker build . -t pytest_spark && docker run pytest_spark
[+] Building 0.1s (8/8) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 36B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/bitnami/spark:latest 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 35B 0.0s
=> [1/3] FROM docker.io/bitnami/spark:latest 0.0s
=> CACHED [2/3] RUN pip install pytest py4j 0.0s
=> CACHED [3/3] COPY test_spark.py . 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:33b5f945afb750aecb0a8e1b2e811eb71b2bb2e67752e1b73a2c321bcc433841 0.0s
=> => naming to docker.io/library/pytest_spark 0.0s
Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them
08:13:35.34
08:13:35.34 Welcome to the Bitnami spark container
08:13:35.35 Subscribe to project updates by watching https://github.com/bitnami/containers
08:13:35.35 Submit issues and feature requests at https://github.com/bitnami/containers/issues
08:13:35.35
============================= test session starts ==============================
platform linux -- Python 3.8.15, pytest-7.2.0, pluggy-1.0.0
rootdir: /opt/bitnami/spark
collected 1 item
test_spark.py . [100%]
============================== 1 passed in 10.11s ==============================

Related

Error when trying to build my golang application in docker while using the mysql driver

I have a simple application that uses github.com/go-sql-driver/mysql to connect to a MySQL database and execute simple queries. This all works fine on my local machine; however, when I try to build it using docker build I get the following output:
[+] Building 4.1s (9/10)
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 104B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/library/golang:onbuild 1.3s
=> [auth] library/golang:pull token for registry-1.docker.io 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 5.63kB 0.0s
=> CACHED [1/2] FROM docker.io/library/golang:onbuild@sha256:c0ec19d49014d604e4f62266afd490016b11ceec103f0b7ef44 0.0s
=> [2/2] COPY . /go/src/app 0.1s
=> [3/2] RUN go-wrapper download 2.0s
=> ERROR [4/2] RUN go-wrapper install 0.6s
------
> [4/2] RUN go-wrapper install:
#8 0.465 + exec go install -v
#8 0.535 github.com/joho/godotenv
#8 0.536 github.com/go-sql-driver/mysql
#8 0.581 # github.com/go-sql-driver/mysql
#8 0.581 ../github.com/go-sql-driver/mysql/driver.go:88: undefined: driver.Connector
#8 0.581 ../github.com/go-sql-driver/mysql/driver.go:99: undefined: driver.Connector
#8 0.581 ../github.com/go-sql-driver/mysql/nulltime.go:36: undefined: sql.NullTime
------
executor failed running [/bin/sh -c go-wrapper install]: exit code: 2
My Go version is up to date and I am using the following Dockerfile:
FROM golang:onbuild
To my knowledge this should go get all the packages it requires. I've also tried it this way:
FROM golang:onbuild
RUN go get "github.com/go-sql-driver/mysql"
This had the same output.
Note that in my code I import the package like this:
import _ "github.com/go-sql-driver/mysql"
I also use other packages from GitHub; these seem to work fine.
The Docker community has generally been steering away from the Dockerfile ONBUILD directive, since it makes it very confusing what will actually happen in derived images (see the various comments around "is that really the entire Dockerfile?"). If you search Docker Hub for the golang:onbuild image you'll discover that this is Go 1.7 or 1.8; Go modules were introduced in Go 1.11.
You'll need to update to a newer base image, and that means writing out the Dockerfile steps by hand. For a typical Go application this would look like
FROM golang:1.18 AS build
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY ./ ./
RUN go build -o myapp .
FROM ubuntu:20.04
COPY --from=build /app/myapp /usr/local/bin
CMD ["myapp"]
(In the final stage you may need to RUN apt-get update && apt-get install ... a MySQL client library or other tools.)
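For example, if your application does need extra tools at runtime, the final stage might gain a line like this (default-mysql-client is an assumption; substitute whatever your app actually requires):
FROM ubuntu:20.04
# Assumption: swap default-mysql-client for the client library/tools you actually need
RUN apt-get update \
 && apt-get install -y --no-install-recommends default-mysql-client \
 && rm -rf /var/lib/apt/lists/*
COPY --from=build /app/myapp /usr/local/bin
CMD ["myapp"]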

Docker errors when trying to build in ARM64 Apple M1: "Failed to resolve full path of the current executable [/proc/self/exe]"

I'm having trouble building Docker containers on an Apple M1.
The project uses SDK 2.2, which is incompatible with the arm64 architecture, so I changed the SDK and ASP.NET Core images to 2.2-alpine3.8. They seem to build OK, but the process fails when it needs to publish.
The Dockerfile:
FROM mcr.microsoft.com/dotnet/core/sdk:2.2-alpine3.8 AS build-env
WORKDIR /app
# Copy csproj and restore as distinct layers
COPY . .
RUN dotnet publish src/admin/MyContainer.Admin.csproj -c Release -o ./publish
# Build runtime image
FROM mcr.microsoft.com/dotnet/core/aspnet:2.2-alpine3.8
WORKDIR /app
COPY --from=build-env /app/src/admin/publish .
# Copy script that allows waiting for the database on docker-compose
# As referenced here https://docs.docker.com/compose/startup-order/
# This was done because admin starts up very quickly on e2e env
# and the sqlserver container wasn't getting ready fast enough
COPY --from=build-env /app/docker-wait-for-it.sh .
EXPOSE 80
CMD ["dotnet", "MyContainer.Admin.dll"]
the build command I'm using is:
docker buildx build --platform linux/arm64/v8 -t MyContainer.azurecr.io/admin-api-web:development ./backend -f ./backend/Admin.Api.Web.Dockerfile
and the log output is:
[+] Building 0.5s (11/13)
=> [internal] load build definition from Admin.Api.Web.Dockerfile 0.0s
=> => transferring dockerfile: 46B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 34B 0.0s
=> [internal] load metadata for mcr.microsoft.com/dotnet/core/aspnet:2.2-alpine3.8 0.1s
=> [internal] load metadata for mcr.microsoft.com/dotnet/core/sdk:2.2-alpine3.8 0.1s
=> [build-env 1/4] FROM mcr.microsoft.com/dotnet/core/sdk:2.2-alpine3.8@sha256:1299ac0379146c1de7b588196727dbd56eace3abfbdd4321c547c9ff4a18a2f7 0.0s
=> => resolve mcr.microsoft.com/dotnet/core/sdk:2.2-alpine3.8@sha256:1299ac0379146c1de7b588196727dbd56eace3abfbdd4321c547c9ff4a18a2f7 0.0s
=> [stage-1 1/4] FROM mcr.microsoft.com/dotnet/core/aspnet:2.2-alpine3.8@sha256:4d6e528f4c09c55804b6032ecc5d60565a3ee16f68bb08d2cf337dff99cdb8c3 0.0s
=> => resolve mcr.microsoft.com/dotnet/core/aspnet:2.2-alpine3.8@sha256:4d6e528f4c09c55804b6032ecc5d60565a3ee16f68bb08d2cf337dff99cdb8c3 0.0s
=> [internal] load build context 0.2s
=> => transferring context: 473.91kB 0.1s
=> CACHED [stage-1 2/4] WORKDIR /app 0.0s
=> CACHED [build-env 2/4] WORKDIR /app 0.0s
=> CACHED [build-env 3/4] COPY . . 0.0s
=> ERROR [build-env 4/4] RUN dotnet publish src/admin/MyContainer.Admin.csproj -c Release -o ./publish 0.1s
------
> [build-env 4/4] RUN dotnet publish src/admin/MyContainer.Admin.csproj -c Release -o ./publish:
#11 0.127 Failed to resolve full path of the current executable [/proc/self/exe]
I've tried different values for the --platform flag, and also without any platform specified. If I try to build with core 2.2 I get this output:
[+] Building 0.5s (4/4) FINISHED
=> [internal] load build definition from Admin.Api.Web.Dockerfile 0.0s
=> => transferring dockerfile: 745B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 34B 0.0s
=> ERROR [internal] load metadata for mcr.microsoft.com/dotnet/core/aspnet:2.2 0.4s
=> CANCELED [internal] load metadata for mcr.microsoft.com/dotnet/core/sdk:2.2 0.4s
------
> [internal] load metadata for mcr.microsoft.com/dotnet/core/aspnet:2.2:
------
error: failed to solve: failed to solve with frontend dockerfile.v0: failed to create LLB definition: no match for platform in manifest sha256:08xxx: not found
I've found out this is because .NET Core 2.2 has no support for arm64. This file builds without any errors on an Intel MacBook, but with the M1 I can't get past the build error. Any ideas?
This is not an answer but a followup to the problem,
which might shed some more light on the issue.
I'm using aspnet:6 and it passes the build.
The problem later on is that the runnable .dll throws
cannot execute binary file: Exec format error
which seems to be a platform issue, using a Mac with the Apple M1 (Apple Silicon) chipset.
I have no issue with other Docker images that are made for linux/amd64.
The Docker daemon seems to deal with this fine (having installed Rosetta 2 and the latest Docker Desktop for M1):
https://docs.docker.com/desktop/mac/apple-silicon/
Have you managed to resolve the issue and execute the dll?
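One thing worth trying, though this is an assumption on my part rather than something confirmed in this thread, is forcing the amd64 platform so Docker Desktop runs the image under emulation. Roughly (the image tag is illustrative):
docker buildx build --platform linux/amd64 -t admin-api-web:development ./backend -f ./backend/Admin.Api.Web.Dockerfile
docker run --platform linux/amd64 admin-api-web:development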

file not found error when running container

I have a test automation project which uses the code built as part of a jar file, and that jar gets invoked via a bat file. All these files are stored within my project folder.
Contents of my Dockerfile:
FROM maven:3.8.1-adoptopenjdk-11
#WORKDIR C:/Work/Kickstart_TEM/Prefs
COPY Prefs /home/Prefs
COPY KickStart.jar /home/Prefs/KickStart.jar
CMD home\prefs\run.bat && cmd
docker build generates following output
[+] Building 0.3s (8/8) FINISHED
=> [internal] load build definition from Dockerfile 0.1s
=> => transferring dockerfile: 210B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/library/maven:3.8.1-adoptopenjdk-11 0.0s
=> [1/3] FROM docker.io/library/maven:3.8.1-adoptopenjdk-11 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 390B 0.0s
=> CACHED [2/3] COPY Prefs /home/Prefs 0.0s
=> CACHED [3/3] COPY KickStart.jar /home/Prefs/KickStart.jar 0.0s
=> exporting to image 0.1s
=> => exporting layers 0.0s
=> => writing image sha256:4c878e8a895b2fad307e00f1b2fb5c9b5df7dc630e87414230d1989b75a5ee17 0.0s
=> => naming to docker.io/library/demo2
Docker run generates following error:
PS C:\Work\Docker_POC> docker run -i -p 4044:4044 demo2
/bin/sh: 1: homeprefsrun.bat: not found
My container stops right away, so I am not even able to figure out whether my files and folders got copied successfully or not, and I am unsure of how to resolve this error.
First of all, you're trying to run a Windows batch script under Linux (the Docker image you're using is Linux-based).
In general, your CMD statement should look like CMD ["/bin/sh", "-c", "/home/Prefs/run.sh && cmd"] (although I'm not sure what cmd is or why you want to run it).
You should convert this batch script (run.bat) to a shell script. Also, there is a difference between home and /home and filenames are case-sensitive (thus it's Prefs and not prefs).
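Putting that together, a corrected Dockerfile might look roughly like this, assuming run.bat has been rewritten as a POSIX shell script named run.sh inside the Prefs folder (and dropping the && cmd part, which only makes sense on Windows):
FROM maven:3.8.1-adoptopenjdk-11
COPY Prefs /home/Prefs
COPY KickStart.jar /home/Prefs/KickStart.jar
# Assumption: Prefs/run.sh is the shell-script rewrite of run.bat
RUN chmod +x /home/Prefs/run.sh
CMD ["/bin/sh", "-c", "/home/Prefs/run.sh"]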

Docker fails to build image with exit code 139

I'm trying to build an image from CentOS 6.9 using this Dockerfile:
FROM centos:6.9
RUN ls
But it keeps failing with exit code 139 with the following output:
$ docker build -t centos-6.9 .
[+] Building 1.1s (7/7) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 72B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/library/centos:6.9 0.6s
=> [internal] load build context 0.1s
=> => transferring context: 72B 0.0s
=> CACHED [1/3] FROM docker.io/library/centos:6.9@sha256:6fff0a9edc920968351eb357c5b84016000fec6956e6d745f695e5a34f18ecd2 0.0s
=> [2/3] COPY . . 0.0s
=> ERROR [3/3] RUN ls 0.3s
------
> [3/3] RUN ls:
------
executor failed running [/bin/sh -c ls]: exit code: 139
I'm running:
Windows 10 Enterprise Version 2004
Docker Desktop 3.0.0
This appears to be an issue with WSL 2 and older base images, not Docker or the image itself.
1. Create a %userprofile%\.wslconfig file.
2. Add the following:
   [wsl2]
   kernelCommandLine = vsyscall=emulate
3. Restart WSL: wsl --shutdown
4. Restart Docker Desktop.
References:
https://github.com/microsoft/WSL/issues/4694#issuecomment-556095344
https://github.com/docker/for-win/issues/7284#issuecomment-646910923
https://github.com/microsoft/WSL/issues/4694#issuecomment-558335829

Check if stage exists in Dockerfile

I have a CI script that builds Dockerfiles. My plan is that unit tests should be run in a test stage in each Dockerfile, for example:
FROM alpine AS build
WORKDIR /app
COPY src .
...
FROM build AS test
RUN mvn clean test
FROM build AS package
COPY --from=build ...
So, for a given Dockerfile, I would like to check if it has a test stage and, if so, run docker build --target test .... If it doesn't have a test stage, I don't want to run docker build (which would fail).
How can I check if a Dockerfile contains a certain stage without actually building it?
I do realize this question has some XY problem vibes to it, so feel free to enlighten me. But I also think the question can be generally useful anyway.
I'm going to shy away from trying to parse the Dockerfile since there are a lot of ways to inject false positives or negatives. E.g.
RUN echo \
FROM base as test
or
FROM base \
as test
So instead, I'm going to favor letting docker do the hard work, and modifying the file so it doesn't fail on a missing test stage. This can be done by adding a test stage to a file even when it already has a test stage. Whether you want to put this at the beginning or end of the Dockerfile depends on whether you are running buildkit:
$ cat df.dup-target
FROM busybox as test
RUN exit 1
FROM busybox as test
RUN exit 0
$ DOCKER_BUILDKIT=0 docker build --target test -f df.dup-target .
Sending build context to Docker daemon 20.99kB
Step 1/2 : FROM busybox as test
---> be5888e67be6
Step 2/2 : RUN exit 1
---> Running in 9f96f42bc6d8
The command '/bin/sh -c exit 1' returned a non-zero code: 1
$ DOCKER_BUILDKIT=1 docker build --target test -f df.dup-target .
[+] Building 0.1s (6/6) FINISHED
=> [internal] load build definition from df.dup-target 0.0s
=> => transferring dockerfile: 114B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 34B 0.0s
=> [internal] load metadata for docker.io/library/busybox:latest 0.0s
=> [test 1/2] FROM docker.io/library/busybox 0.0s
=> CACHED [test 2/2] RUN exit 0 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:8129063cb183c1c1aafaf3eef0c8671e86a54f795092fa7a918145c14da3ec3b 0.0s
Then you could append the always successful test at the beginning or end, passing that modified Dockerfile to stdin for the docker build to process:
$ cat df.simple
FROM busybox as build
RUN exit 0
$ cat - df.simple <<EOF | DOCKER_BUILDKIT=1 docker build --target test -f - .
FROM busybox as test
RUN exit 0
EOF
[+] Building 0.1s (6/6) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 109B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 34B 0.0s
=> [internal] load metadata for docker.io/library/busybox:latest 0.0s
=> [test 1/2] FROM docker.io/library/busybox 0.0s
=> CACHED [test 2/2] RUN exit 0 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:8129063cb183c1c1aafaf3eef0c8671e86a54f795092fa7a918145c14da3ec3b 0.0s
This is a simple grep invocation:
egrep -i -q '^FROM .* AS test$' Dockerfile
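If it helps, this is roughly how that check could gate the build in a CI script (hypothetical usage):
if egrep -i -q '^FROM .* AS test$' Dockerfile; then
    docker build --target test .
fi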
You also might consider running your unit tests outside of Docker, before you start building containers. (Or, if your CI system supports running steps inside containers, use a container to get a language runtime, but not necessarily run the Dockerfile.) You'll still need a Docker-based setup to run larger integration tests, but you can run these on your built production-ready containers.
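As a rough sketch of that split, assuming a Maven project like the one in the question's test stage (names are illustrative):
mvn clean test          # unit tests run directly on the CI host
docker build -t myapp . # then build the production image
# larger integration tests can then run against the built image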
