scriptA.py contents:

import shlex
import subprocess
from dask.distributed import Client

def my_task(params):
    print("params[1]", params[1])  # prints: python scriptB.py arg1 arg2
    child = subprocess.Popen(shlex.split(params[1]), shell=False)
    child.communicate()

if __name__ == '__main__':
    clienta = Client("192.168.1.3:8786")
    params = ["dummy_arguments", "python scriptB.py arg1 arg2"]
    future = clienta.submit(my_task, params)
    print(future.result())
    print("over.!")
scriptB.py contents:

import file1, file2
from folder1 import file4
import time

for _ in range(3):
    file1.do_something()
    file4.try_something()
    print("sleeping for 1 sec")
    time.sleep(1)
    print("waked up..")
scriptA.py runs on node-1 (192.168.23.12:9784), the dask-worker runs on another node, node-2 (198.168.54.86:4658), and the dask-scheduler is on a different node, node-3 (198.168.1.3:8786).
The question is: how do I pass the dependencies that scriptB.py needs (such as folder1, file1, file2, etc.) from node-1, where scriptA.py runs, to the dask-worker on node-2?
You might want to look at the Client.upload_file method.
client.upload_file('/path/to/file1.py')
For any larger dependency, though, you are generally expected to handle dependencies yourself. In larger deployments people typically rely on some other mechanism, such as Docker or a network file system, to ensure uniform software dependencies.
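For the files in this question, a minimal sketch might look like the following (the paths are assumptions; upload_file accepts .py, .egg, and .zip files, so a package directory such as folder1 can be zipped first):

# sketch only: ship scriptB.py and its dependencies to the workers
import shutil

clienta.upload_file('scriptB.py')
clienta.upload_file('file1.py')
clienta.upload_file('file2.py')

# bundle the folder1 package into folder1.zip and upload it; uploaded
# .zip files are placed where the workers can import from them
shutil.make_archive('folder1', 'zip', base_dir='folder1')
clienta.upload_file('folder1.zip')

Since scriptB.py is launched as a subprocess rather than imported by the worker, you may still need to make sure the subprocess runs with a working directory and PYTHONPATH that can see the uploaded files.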
Related
I have one powerful machine (a remote machine), accessible through SSH. My data is stored on that remote machine.
I want to run the computation on, and access data from, the remote machine. For this, I ran a dask-scheduler and a dask-worker on the remote machine. Then I started a Jupyter notebook on my laptop (the local machine) with client = Client('scheduler-ip:8786'), but it still refers to data on the local machine, not the remote machine.
How do I refer to data on the remote machine from a notebook running on the local machine?
import dask.dataframe as dd
from dask.distributed import Client

client = Client('remote-ip:8786')

ddf = dd.read_csv(
    'remote-machine-file.csv',
    header=None,
    assume_missing=True,
    dtype=object,
)
It fails with
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-37-17d26dadb3a8> in <module>
----> 1 ddf = dd.read_csv('remote-machine-file.csv', header=None, assume_missing=True, dtype=object)
/usr/local/conda/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read(urlpath, blocksize, lineterminator, compression, sample, sample_rows, enforce, assume_missing, storage_options, include_path_column, **kwargs)
735 storage_options=storage_options,
736 include_path_column=include_path_column,
--> 737 **kwargs,
738 )
739
/usr/local/conda/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read_pandas(reader, urlpath, blocksize, lineterminator, compression, sample, sample_rows, enforce, assume_missing, storage_options, include_path_column, **kwargs)
520
521 # Infer compression from first path
--> 522 compression = infer_compression(paths[0])
523
524 if blocksize == "default":
IndexError: list index out of range
When using dask.dataframe with a distributed Client, the majority of the I/O is done by remote workers, but dask still relies on the client machine being able to access the data for scheduling.
To run anything purely on the worker, you can always have the worker schedule the operation, e.g. with:
# connect to the scheduler, e.g. Client('remote-ip:8786')
client = Client()

# path to the file as the remote worker sees it
fp = 'remote-machine-file.csv'

# use the client to have the worker run the dask.dataframe command!
f = client.submit(dd.read_csv, fp)

# because the worker is holding a dask dataframe object, requesting
# the result brings the dask.dataframe object/metadata to the
# local client, while leaving the data on the remote machine
df = f.result()
Alternatively, you can partition the job manually: e.g. if you have many files, have the workers read them into memory with pandas, and then construct the dask dataframe locally with dask.dataframe.from_delayed:
import pandas as pd

files_on_remote = ['data/file_{}.csv'.format(i) for i in range(100)]

# have the workers read the data with pandas
futures = client.map(pd.read_csv, files_on_remote)

# use dask.dataframe.from_delayed to construct a dask.dataframe from the
# remote pandas objects
df = dd.from_delayed(futures)
I'm trying to run a perf test in my CI environment using the k6 Docker image, and a simple single-file script works fine. However, I want to break my tests down into multiple JS files. To do this, I need to mount a volume in Docker so I can import local modules.
The volume seems to mount correctly with my command
docker run --env-file ./test/performance/env/perf.list -v \
`pwd`/test/performance:/perf -i loadimpact/k6 run - /perf/index.js
k6 seems to start, but immediately errors with
time="2018-01-17T13:04:17Z" level=error msg="accepts 1 arg(s), received 2"
Locally, my file system looks something like
/toychicken
  /test
    /performance
      /env
        - perf.list
      - index.js
      - something.js
And the index.js looks like this
import { check, sleep } from 'k6'
import http from 'k6/http'
import something from '/perf/something'

export default () => {
  const r = http.get(`https://${__ENV.DOMAIN}`)
  check(r, {
    'status is 200': r => r.status === 200
  })
  sleep(2)
  something()
}
You need to remove the "-" after run in the Docker command. The "-" instructs k6 to read the script from stdin, but in this case you want to load the main JS file from the file system. That's why it complains about receiving two args: one is the "-" and the second is the path to index.js (the error message could definitely be more descriptive).
You'll also need to add .js to the '/perf/something' import.
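Putting both fixes together, the command and the import would look roughly like this (same paths as above):

docker run --env-file ./test/performance/env/perf.list -v \
  `pwd`/test/performance:/perf -i loadimpact/k6 run /perf/index.js

and in index.js:

import something from '/perf/something.js'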
I want to run a nix-shell with the following packages installed:
aspell
aspellDicts.en
hello
I can't simply do nix-shell -p aspell aspellDicts.en hello --pure, as this will not correctly install the aspell dictionaries. Nix provides an aspellWithDicts function that can be used to build aspell with dictionaries:
nix-build -E 'with import <nixpkgs> {}; aspellWithDicts (d: [d.en])'
I want to use the result of this build as a dependency in another local package (foo). This is how I'm currently achieving this:
./pkgs/aspell-with-dicts/default.nix:
with import <nixpkgs> {};
aspellWithDicts (d: [d.en])
./pkgs/foo/default.nix:

{ stdenv, aspellWithDicts, hello }:

stdenv.mkDerivation rec {
  name = "foo";
  buildInputs = [ aspellWithDicts hello ];
}
./custom-packages.nix:

{ system ? builtins.currentSystem }:
let
  pkgs = import <nixpkgs> { inherit system; };
in
rec {
  aspellWithDicts = import ./pkgs/aspell-with-dicts;
  foo = import ./pkgs/foo {
    aspellWithDicts = aspellWithDicts;
    hello = pkgs.hello;
    stdenv = pkgs.stdenv;
  };
}
Running the shell works as expected: nix-shell ./custom-packages.nix -A foo --pure
So my solution works, but could this outcome be achieved in a more succinct idiomatic way?
Do you need to build foo? What in foo will you use?
Assuming you only want to use the shell via nix-shell and don't want to build or install anything using nix-build or nix-env -i, this should work.
The following shell.nix
with import <nixpkgs> {};
with pkgs;

let
  myAspell = aspellWithDicts (d: [ d.en ]);
in
stdenv.mkDerivation {
  name = "myShell";
  buildInputs = [ myAspell hello ];
  shellHooks = ''
    echo Im in $name.
    echo aspell is locate at ${myAspell}
    echo hello is locate at ${hello}
  '';
}
will give you a shell with aspell and hello
$ nix-shell
Im in myShell.
aspell is locate at /nix/store/zcclppbibcg4nfkis6zqml8cnrlnx00b-aspell-env
hello is locate at /nix/store/gas2p68jqbzgb7zr96y5nc8j7nk61kkk-hello-2.10
If foo does have some code to build and install, then mkDerivation in foo/default.nix must have a src field, which could be src = ./.; or something like fetchurl or fetchFromGitHub (see the documentation for examples).
Then you can use callPackage or import (depending on how the Nix expression was written) with foo/default.nix as the argument to bring what foo provides into this shell.
If you try to build this shell.nix (or foo/default.nix), it will fail with a missing src:
$ nix-build shell.nix
these derivations will be built:
/nix/store/20h8cva19irq8vn39i72j8iz40ivijhr-myShell.drv
building path(s) ‘/nix/store/r1f6qpxz91h5jkj7hzrmaymmzi9h1yml-myShell’
unpacking sources
variable $src or $srcs should point to the source
builder for ‘/nix/store/20h8cva19irq8vn39i72j8iz40ivijhr-myShell.drv’ failed with exit code 1
error: build of ‘/nix/store/20h8cva19irq8vn39i72j8iz40ivijhr-myShell.drv’ failed
To make this code more idiomatic, I have the following suggestions:
callPackage
Use the pkgs.callPackage function. It will take care of passing the arguments that your derivation needs. This is why many files in NixPkgs look like { dependency, ...}: something. The first argument is the function you want to inject dependencies into and the second argument is an attribute set that you can use to pass some dependencies manually.
By using callPackage you do not need import <nixpkgs> {} everywhere, so your code will be easier to use in new contexts where <nixpkgs> can't be used, and it will evaluate a bit faster because the NixPkgs fix-point only has to be evaluated once.
(Of course you have to import <nixpkgs> once to get started, but after that, there should be no need.)
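As a sketch of what that could look like for the custom-packages.nix above (the attribute names follow the question; the configured aspell is passed in manually because the bare pkgs.aspellWithDicts attribute is a function, not a package):

{ system ? builtins.currentSystem }:
let
  pkgs = import <nixpkgs> { inherit system; };
in
{
  # callPackage fills in stdenv and hello automatically from pkgs
  foo = pkgs.callPackage ./pkgs/foo {
    aspellWithDicts = pkgs.aspellWithDicts (d: [ d.en ]);
  };
}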
with
In pkgs/aspell-with-dicts/default.nix you use the with keyword, which is ok, but in this case it does not really add value. I prefer to refer to variables explicitly, so I'd rather read pkgs.something when it's used once or twice, or inherit (pkgs) something if it's used more often. This way the reader can easily tell where a variable comes from.
I do use it when experimenting with unfamiliar packages or functions, because maintenance is not an issue then.
pkgs/aspell-with-dicts/default.nix
Unless you expect that your instantiation of aspell is something you want to reuse, it's probably easier to just construct it where you use it.
If you do expect to reuse a specific configuration of a package, you might want to make it a first class package by constructing it in an overlay.
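A minimal sketch of such an overlay, assuming you want to expose the configured aspell under a name like myAspell (the name is just an example):

self: super: {
  # make the dictionary-enabled aspell a first-class package attribute
  myAspell = super.aspellWithDicts (d: [ d.en ]);
}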
That's it. I think the most important point is avoiding <nixpkgs> and apart from that it's already pretty idiomatic.
I don't know what your mysterious foo is, but if it's open source, please consider upstreaming it into NixPkgs. Nix has a very welcoming community in my experience.
I have a single Python script called myscript.py and would like to package it up as a nix derivation with mkDerivation.
The only requirement is that my Python script has a run-time dependency, say, on the consul Python library (which itself depends on the requests and six Python libraries).
For example for myscript.py:
#!/usr/bin/env python3
import consul
print('hi')
How to do that?
I can't figure out how to pass mkDerivation a single script (its src seems to always want a directory, or fetchgit or similar), and I also can't figure out how to make the dependency libraries available at runtime.
When you have a single Python file as your script, you don't need src in your mkDerivation and you also don't need to unpack any source code.
The default mkDerivation will try to unpack your source code; to prevent that, simply set dontUnpack = true.
myscript-package = pkgs.stdenv.mkDerivation {
  name = "myscript";

  propagatedBuildInputs = [
    (pkgs.python36.withPackages (pythonPackages: with pythonPackages; [
      consul
      six
      requests2
    ]))
  ];

  dontUnpack = true;

  installPhase = "install -Dm755 ${./myscript.py} $out/bin/myscript";
};
If your script is executable (which we ensure with install -m above), Nix will automatically replace your #!/usr/bin/env python3 line with one that invokes the right specific Python interpreter (the one for python36 in the example above), and that does so in an environment that has the Python packages you've specified in propagatedBuildInputs available.
If you use NixOS, you can then also put your package into environment.systemPackages, and myscript will be available in shells on that NixOS machine.
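For illustration, assuming myscript-package is in scope in your configuration (for example via an overlay or a let binding), the NixOS option would look something like:

environment.systemPackages = [ myscript-package ];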
This helper function is really nice:
pkgs.writers.writePython3Bin "github-owner-repos" { libraries = [ pkgs.python3Packages.PyGithub ]; } ''
  import os
  import sys

  from github import Github

  if __name__ == '__main__':
      gh = Github(os.environ['GITHUB_TOKEN'])
      for repo in gh.get_user(login=sys.argv[1]).get_repos():
          print(repo.ssh_url)
''
https://github.com/nixos/nixpkgs/blob/master/pkgs/build-support/writers/default.nix#L319
I submit my code to a Spark standalone cluster. The submit command is below:
nohup ./bin/spark-submit \
--master spark://ES01:7077 \
--executor-memory 4G \
--num-executors 1 \
--total-executor-cores 1 \
--conf "spark.storage.memoryFraction=0.2" \
./myCode.py 1>a.log 2>b.log &
I specify that the executor should use 4G of memory in the above command, but when I use the top command to monitor the executor process, I notice that the memory usage keeps growing. The current top output is below:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12578 root 20 0 20.223g 5.790g 23856 S 61.5 37.3 20:49.36 java
My total memory is 16G, so 37.3% is already bigger than the 4G I specified, and it is still growing.
Using the ps command, you can see that it is the executor process.
[root@ES01 ~]# ps -awx | grep spark | grep java
10409 ? Sl 1:43 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip ES01 --port 7077 --webui-port 8080
10603 ? Sl 6:16 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://ES01:7077
12420 ? Sl 10:16 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit --master spark://ES01:7077 --conf spark.storage.memoryFraction=0.2 --executor-memory 4G --num-executors 1 --total-executor-cores 1 /opt/flowSpark/sparkStream/ForAsk01.py
12578 ? Sl 21:03 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms4096M -Xmx4096M -Dspark.driver.port=52931 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler#10.79.148.184:52931 --executor-id 0 --hostname 10.79.148.184 --cores 1 --app-id app-20160511080701-0013 --worker-url spark://Worker#10.79.148.184:52660
Below is the code. It is very simple, so I do not think there is a memory leak.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    dataDirectory = '/stream/raw'

    sc = SparkContext(appName="Netflow")
    ssc = StreamingContext(sc, 20)

    # Read CSV File
    lines = ssc.textFileStream(dataDirectory)
    lines.foreachRDD(process)

    ssc.start()
    ssc.awaitTermination()
The code for the process function is below. Please note that I am using HiveContext, not SQLContext, because SQLContext does not support window functions.
from pyspark.sql import HiveContext, Row
from pyspark.sql import functions as func
from pyspark.sql.window import Window

def getSqlContextInstance(sparkContext):
    if 'sqlContextSingletonInstance' not in globals():
        globals()['sqlContextSingletonInstance'] = HiveContext(sparkContext)
    return globals()['sqlContextSingletonInstance']

def process(time, rdd):
    if rdd.isEmpty():
        return sc.emptyRDD()

    sqlContext = getSqlContextInstance(rdd.context)

    # Convert CSV File to Dataframe
    parts = rdd.map(lambda l: l.split(","))
    rowRdd = parts.map(lambda p: Row(router=p[0], interface=int(p[1]), flow_direction=p[9], bits=int(p[11])))
    dataframe = sqlContext.createDataFrame(rowRdd)

    # Get the top 2 interfaces of each router
    dataframe = dataframe.groupBy(['router', 'interface']).agg(func.sum('bits').alias('bits'))
    windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
    rank = func.dense_rank().over(windowSpec)
    ret = dataframe.select(dataframe['router'], dataframe['interface'], dataframe['bits'], rank.alias('rank')).filter("rank<=2")

    ret.show()
    dataframe.show()
Actually, I found that the code below causes the problem:
# Get the top 2 interface of each router
dataframe = dataframe.groupBy(['router','interface']).agg(func.sum('bits').alias('bits'))
windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
rank = func.dense_rank().over(windowSpec)
ret = dataframe.select(dataframe['router'],dataframe['interface'],dataframe['bits'], rank.alias('rank')).filter("rank<=2")
ret.show()
If I remove these 5 lines, the code can run all night without showing a memory increase, but adding them back causes the memory usage of the executor to grow to a very high number.
Basically, the above code is just some window + groupBy operations in Spark SQL. So is this a bug?
Disclaimer: this answer isn't based on debugging, but more on observations and the documentation Apache Spark provides
I don't believe that this is a bug to begin with!
Looking at your configurations, we can see that you are focusing mostly on the executor tuning, which isn't wrong, but you are forgetting the driver part of the equation.
Looking at the Spark cluster overview diagram in the Apache Spark documentation: each worker runs an executor; however, in your case the worker node is the same as the driver node, which is frankly the case whenever you run locally or on a standalone cluster on a single node.
Further, the driver takes 1G of memory by default unless tuned using the spark.driver.memory flag. You should also not forget about the heap usage of the JVM itself, and the Web UI, which is handled by the driver too, AFAIK.
When you delete the lines of code you mentioned, your code is left without any actions, since the map function is just a transformation; hence there is no execution, and therefore you don't see the memory increase at all.
The same applies to groupBy, as it is just a transformation that will not be executed unless an action is called, which in your case is agg and show further down the stream.
That said, try to minimize your driver memory and the overall number of cores in Spark (which is defined by spark.cores.max) if you want to control the number of cores on this process, and then cascade down to the executors. Moreover, I would add spark.python.profile.dump to your list of configuration options so you can see a profile of your Spark job execution, which can help you understand the case better and tune your cluster more closely to your needs.
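For illustration only, the submit command from the question might then look something like this (the added values are placeholders, not recommendations):

nohup ./bin/spark-submit \
    --master spark://ES01:7077 \
    --executor-memory 4G \
    --num-executors 1 \
    --total-executor-cores 1 \
    --conf "spark.storage.memoryFraction=0.2" \
    --conf "spark.driver.memory=1G" \
    --conf "spark.cores.max=1" \
    --conf "spark.python.profile=true" \
    --conf "spark.python.profile.dump=/tmp/spark-profiles" \
    ./myCode.py 1>a.log 2>b.log &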
As far as I can see in your 5 lines, maybe the groupBy is the issue; would you try reduceBy instead and see how it performs?
See here and here.