Read with dask.dataframe when file not accessible from local machine - dask

I have one powerful machine (the remote machine), accessible through SSH, and my data is stored there.
I want to run computations on and access data from the remote machine. To do this, I started a dask-scheduler and a dask-worker on the remote machine, then opened a Jupyter notebook on my laptop (the local machine) with client = Client('scheduler-ip:8786'). However, it still refers to data on the local machine, not the remote one.
How do I refer to the remote machine's data from a notebook running on the local machine?
import dask.dataframe as dd
from dask.distributed import Client

client = Client('remote-ip:8786')
ddf = dd.read_csv(
    'remote-machine-file.csv',
    header=None,
    assume_missing=True,
    dtype=object,
)
It fails with
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-37-17d26dadb3a8> in <module>
----> 1 ddf = dd.read_csv('remote-machine-file.csv', header=None, assume_missing=True, dtype=object)
/usr/local/conda/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read(urlpath, blocksize, lineterminator, compression, sample, sample_rows, enforce, assume_missing, storage_options, include_path_column, **kwargs)
735 storage_options=storage_options,
736 include_path_column=include_path_column,
--> 737 **kwargs,
738 )
739
/usr/local/conda/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read_pandas(reader, urlpath, blocksize, lineterminator, compression, sample, sample_rows, enforce, assume_missing, storage_options, include_path_column, **kwargs)
520
521 # Infer compression from first path
--> 522 compression = infer_compression(paths[0])
523
524 if blocksize == "default":
IndexError: list index out of range

When using dask.dataframe with a distributed.Client, although most of the I/O is done by the remote workers, dask still relies on the client machine being able to access the data for scheduling purposes.
To run anything purely on the workers, you can have a worker execute the operation itself, e.g. with:
client = Client()

# use the client to have the worker run the dask.dataframe command,
# so the file path (fp) is resolved on the worker's filesystem
f = client.submit(dd.read_csv, fp)

# because the worker is holding a dask dataframe object, requesting
# the result brings the dask.dataframe object/metadata to the
# local client, while leaving the data on the remote machine
df = f.result()
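As a quick sanity check (my own sketch, not part of the original answer), subsequent operations on df are still executed on the remote workers, and only small results travel back to the client:

# continues the session above: computing the first few rows runs on the
# workers that can see the file; only the resulting pandas object is
# returned to the local client
print(df.head())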
Alternatively, you can partition the job manually yourself: if you have many files, have the workers read them into memory, and then construct the dask dataframe locally with dask.dataframe.from_delayed:
import pandas as pd
import dask.dataframe as dd

files_on_remote = ['data/file_{}.format(i)'.replace('.format(i)', '').format(i) for i in range(100)]
files_on_remote = ['data/file_{}.csv'.format(i) for i in range(100)]

# have the workers read the data with pandas
futures = client.map(pd.read_csv, files_on_remote)

# use dask.dataframe.from_delayed to construct a dask.dataframe from the
# remote pandas objects
df = dd.from_delayed(futures)
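If dask struggles to infer the column metadata from the futures, from_delayed also accepts a meta hint (the column names and dtypes below are hypothetical placeholders, not from the original question):

# optional: describe the expected columns and dtypes up front so dask
# does not have to infer them (names here are made up for illustration)
meta = pd.DataFrame({'col_a': pd.Series(dtype='float64'),
                     'col_b': pd.Series(dtype='object')})
df = dd.from_delayed(futures, meta=meta)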

Related

vscode dev container python interactive (`tkagg`) plots

Expected Behavior (local environment: fresh MacOS 12.4 installation)
With no environment updates except $ pip3 install matplotlib, I can successfully run this simple plot from the Matplotlib documentation:
Example Code:
# testplot.py
import matplotlib.pyplot as plt
import numpy as np

# Data for plotting
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

fig, ax = plt.subplots()
ax.plot(t, s)
ax.set(xlabel='time (s)', ylabel='voltage (mV)',
       title='About as simple as it gets, folks')
ax.grid()

fig.savefig("test.png")
plt.show()
Run $ python3 testplot.py in the terminal. Actual output (saved to a .png after the window opens): [plot image omitted]
Observed Behavior (vscode python 3.8 dev container)
Disclaimer: This post does not address notebook-based plots (which work fine but are not always preferred)
However, when I run this in my dev container, I get the following error:
testplot.py:16: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.
plt.show()
First Attempted Solution:
Following this previously posted solution, I specified the backend (export MPLBACKEND=TKAgg) before running the interpreter, but the error persists.
Second Attempted Solution:
Following the comments, I added the following lines to the script:
import matplotlib
matplotlib.use('tkagg')
In the v3.8 dev container, this addition changes the error to:
Traceback (most recent call last):
File "testplot.py", line 5, in <module>
matplotlib.use('tkagg')
File "/usr/local/python/lib/python3.8/site-packages/matplotlib/__init__.py", line 1144, in use
plt.switch_backend(name)
File "/usr/local/python/lib/python3.8/site-packages/matplotlib/pyplot.py", line 296, in switch_backend
raise ImportError(
ImportError: Cannot load backend 'TkAgg' which requires the 'tk' interactive framework, as 'headless' is currently running
Note: adding these two lines broke the local script as well. The point of the local example was to show that it plots stuff without installing anything except matplotlib.
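One way to keep the same script usable both locally and inside a headless container (a sketch of my own, assuming an X11-style DISPLAY check, not something from this post) is to fall back to the non-GUI Agg backend and only call plt.show() when a display is available:

# testplot_headless.py -- hypothetical variant of the script above
import os
import matplotlib

# inside a dev container there is usually no display, so fall back to
# the non-interactive Agg backend instead of TkAgg
if not os.environ.get("DISPLAY"):
    matplotlib.use("Agg")

import matplotlib.pyplot as plt
import numpy as np

t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

fig, ax = plt.subplots()
ax.plot(t, s)
fig.savefig("test.png")

# only try to open a window when a GUI backend is actually active
if matplotlib.get_backend().lower() != "agg":
    plt.show()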

Cannot find files that should be inside my running docker container

I'm doing some work with the reverse engineering tool angr, and I'm trying to run it in a container.
My current directory looks like this:
ask@Garsy:~/Notes/ethHack/wetransfer-85179d/Export$ ls
angry.py impossible_password_location.csv report.md
impossible_password.bin impossible_password_strings.txt test.txt
I then run a specific angr image like so:
ask@Garsy:~/Notes/ethHack/wetransfer-85179d/Export$ sudo docker run -it --rm -v $pwd:/local angr/angr
where I believe that using $pwd:/local should give me access to the files shown above inside the container (following [this][1] guide [5:40]).
I run the container, and try to write some python:
(angr) angr@38b067fffc2d:~$ ipython3
Python 3.8.5 (default, Jan 27 2021, 15:41:15)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.22.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import angr
In [2]: angr.Project("/impossible_password.bin")
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-2-3f794a899665> in <module>
----> 1 angr.Project("/impossible_password.bin")
~/angr-dev/angr/angr/project.py in __init__(self, thing, default_analysis_mode, ignore_functions, use_sim_procedures, exclude_sim_procedures_func, exclude_sim_procedures_list, arch, simos, engine, load_options, translation_cache, support_selfmodifying_code, store_function, load_function, analyses_preset, concrete_target, **kwargs)
124 self.loader = cle.Loader(thing, **load_options)
125 elif not isinstance(thing, str) or not os.path.exists(thing) or not os.path.isfile(thing):
--> 126 raise Exception("Not a valid binary file: %s" % repr(thing))
127 else:
128 # use angr's loader, provided by cle
Exception: Not a valid binary file: '/impossible_password.bin'
where it can't find the file. The same goes for "local/impossible_password.bin". How do I make the files of my current directory available when I spin up the container?
[1]: https://www.youtube.com/watch?v=9dQFM5O4KFk
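One quick way to check whether the bind mount worked at all (a sketch of mine, not from the original post) is to list /local from inside the container before pointing angr at the binary:

# run inside the container's ipython session; the paths are the ones
# used in the question
import os
print(os.listdir("/local"))                               # expect the host files here
print(os.path.exists("/local/impossible_password.bin"))
If /local turns out to be empty, a likely (though unconfirmed here) culprit is that $pwd expands to an empty string in bash; the variable is spelled $PWD, or you can use $(pwd), in the docker run -v option.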

When using pyspark pandas_udf, Python workers use too much memory and exceed the memory limit

The Spark version is 2.4.0. My cluster has four nodes, each with 16 CPUs and 128 GB of RAM.
I am using a Jupyter notebook to connect to pyspark. The workflow is to read Kudu data with Spark and then do the calculation with a pandas UDF. I start pyspark from the terminal with:
PYSPARK_DRIVER_PYTHON="/opt/anaconda2/envs/py3/bin/jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark2 --jars kudu-spark2_2.11-1.7.0-cdh5.15.0.jar
--conf spark.executor.memory=40g --conf spark.executor.memoryOverhead=5g --num-executors=4 --executor-cores=8 --conf yarn.nodemanager.vmem-check-enabled=false
My dataset is only 6 GB with 32 partitions. When running, I can see that each node has one executor containing 8 Python workers, and each Python worker uses 6 GB of memory! The container is killed by YARN because of the memory limit:
Container killed by YARN for exceeding memory limits. 45.1 GB of 45 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead
I'm confused: why does it take up so much memory? The data size for each partition is only ~200 MB. Isn't pandas_udf supposed to avoid serialization and deserialization overhead by using pyarrow? Maybe Jupyter causes the issue?
I would be very grateful if anyone could help me.
This is my code.
from pyspark.sql import functions as F
from pyspark.sql.functions import rand, randn

df = spark.range(0, 800000000)
df = df.select("id",
               rand(seed=10).alias("uniform"),
               randn(seed=27).alias("normal"),
               randn(seed=27).alias("normal1"),
               randn(seed=1).alias("normal3"))

# assign each row a random flag between "0" and "22"
# (equivalent to listing F.lit("0") ... F.lit("22") explicitly)
df = df.withColumn(
    "flag",
    F.array([F.lit(str(i)) for i in range(23)]).getItem(
        (F.rand() * 23).cast("int")
    )
)
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

schema = StructType([
    StructField("flag", IntegerType()),
])

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def del_data(data):
    import os
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
    return data[["flag"]]

df.groupBy('flag').apply(del_data).write.csv('/tmp/')
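For a rough sense of scale (my own back-of-the-envelope estimate, not something stated in the question), a GROUPED_MAP pandas UDF has to materialize each whole group as a single pandas DataFrame inside the Python worker, and with only 23 distinct flag values the groups here are large:

# rough per-group size estimate; the real footprint is higher because of
# pandas/Arrow overhead and the string flag column
rows_total = 800_000_000
n_groups = 23
rows_per_group = rows_total / n_groups        # ~34.8 million rows per group

bytes_per_row = 5 * 8                         # id + 4 double columns, 8 bytes each
group_size_gb = rows_per_group * bytes_per_row / 1e9
print(f"~{rows_per_group:,.0f} rows and ~{group_size_gb:.1f} GB per group (lower bound)")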

how to pass client side dependency to the dask-worker node

scriptA.py contents:
import shlex, subprocess
from dask.distributed import Client

def my_task(params):
    print("params[1]", params[1])  # prints: python scriptB.py arg1 arg2
    child = subprocess.Popen(shlex.split(params[1]), shell=False)
    child.communicate()

if __name__ == '__main__':
    clienta = Client("192.168.1.3:8786")
    params = ["dummy_arguments", "python scriptB.py arg1 arg2"]
    future = clienta.submit(my_task, params)
    print(future.result())
    print("over.!")
scriptB.py contents:
import file1, file2
from folder1 import file4
import time

for _ in range(3):
    file1.do_something();
    file4.try_something();
    print("sleeping for 1 sec")
    time.sleep(1)
    print("waked up..")
scriptA.py runs on node-1 (192.168.23.12:9784), the dask-worker runs on another node, node-2 (198.168.54.86:4658), and the dask-scheduler is on a different node, node-3 (198.168.1.3:8786).
The question is how to pass the dependencies needed by scriptB.py (such as folder1, file1, file2, etc.) to the dask-worker on node-2 from scriptA.py, which is running on node-1.
You might want to look at the Client.upload_file method.
client.upload_file('/path/to/file1.py')
For anything larger, though, you are generally expected to handle dependencies yourself. In larger deployments people typically rely on some other mechanism, such as Docker or a network file system, to ensure uniform software dependencies.
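As a sketch of how this might look for the files in the question (Client.upload_file accepts .py, .egg, and .zip files, so a package directory such as folder1 can be zipped first; the archive name below is just an example):

import shutil
from dask.distributed import Client

client = Client("192.168.1.3:8786")

# single modules can be uploaded directly
client.upload_file("file1.py")
client.upload_file("file2.py")

# zip the package directory and ship it as one archive
shutil.make_archive("folder1", "zip", root_dir=".", base_dir="folder1")
client.upload_file("folder1.zip")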

Is this a bug in Spark Streaming or a memory leak?

I submit my code to a Spark standalone cluster. The submit command is like below:
nohup ./bin/spark-submit \
--master spark://ES01:7077 \
--executor-memory 4G \
--num-executors 1 \
--total-executor-cores 1 \
--conf "spark.storage.memoryFraction=0.2" \
./myCode.py 1>a.log 2>b.log &
I specify 4G of executor memory in the above command, but when I monitor the executor process with the top command, I notice the memory usage keeps growing. The current top output is below:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12578 root 20 0 20.223g 5.790g 23856 S 61.5 37.3 20:49.36 java
My total memory is 16 GB, so 37.3% is already bigger than the 4 GB I specified, and it is still growing.
Using the ps command, you can see that it is the executor process:
[root@ES01 ~]# ps -awx | grep spark | grep java
10409 ? Sl 1:43 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip ES01 --port 7077 --webui-port 8080
10603 ? Sl 6:16 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://ES01:7077
12420 ? Sl 10:16 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit --master spark://ES01:7077 --conf spark.storage.memoryFraction=0.2 --executor-memory 4G --num-executors 1 --total-executor-cores 1 /opt/flowSpark/sparkStream/ForAsk01.py
12578 ? Sl 21:03 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms4096M -Xmx4096M -Dspark.driver.port=52931 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.79.148.184:52931 --executor-id 0 --hostname 10.79.148.184 --cores 1 --app-id app-20160511080701-0013 --worker-url spark://Worker@10.79.148.184:52660
Below is the code. It is very simple, so I do not think there is a memory leak.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    dataDirectory = '/stream/raw'

    sc = SparkContext(appName="Netflow")
    ssc = StreamingContext(sc, 20)

    # Read CSV File
    lines = ssc.textFileStream(dataDirectory)
    lines.foreachRDD(process)

    ssc.start()
    ssc.awaitTermination()
The code for the process function is below. Please note that I am using HiveContext, not SQLContext, here, because SQLContext does not support window functions.
from pyspark.sql import HiveContext, Row
from pyspark.sql import functions as func
from pyspark.sql.window import Window

def getSqlContextInstance(sparkContext):
    if ('sqlContextSingletonInstance' not in globals()):
        globals()['sqlContextSingletonInstance'] = HiveContext(sparkContext)
    return globals()['sqlContextSingletonInstance']

def process(time, rdd):
    if rdd.isEmpty():
        return sc.emptyRDD()

    sqlContext = getSqlContextInstance(rdd.context)

    # Convert CSV File to Dataframe
    parts = rdd.map(lambda l: l.split(","))
    rowRdd = parts.map(lambda p: Row(router=p[0], interface=int(p[1]), flow_direction=p[9], bits=int(p[11])))
    dataframe = sqlContext.createDataFrame(rowRdd)

    # Get the top 2 interface of each router
    dataframe = dataframe.groupBy(['router','interface']).agg(func.sum('bits').alias('bits'))
    windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
    rank = func.dense_rank().over(windowSpec)
    ret = dataframe.select(dataframe['router'],dataframe['interface'],dataframe['bits'], rank.alias('rank')).filter("rank<=2")
    ret.show()
    dataframe.show()
Actually, I found that the code below causes the problem:
# Get the top 2 interface of each router
dataframe = dataframe.groupBy(['router','interface']).agg(func.sum('bits').alias('bits'))
windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
rank = func.dense_rank().over(windowSpec)
ret = dataframe.select(dataframe['router'],dataframe['interface'],dataframe['bits'], rank.alias('rank')).filter("rank<=2")
ret.show()
If I remove these lines, the code can run all night without showing any memory increase, but adding them back causes the executor's memory usage to grow to a very high number.
Basically, the above code is just some window + groupBy operations in Spark SQL. So is this a bug?
Disclaimer: this answer isn't based on debugging, but more on observations and the documentation Apache Spark provides
I don't believe that this is a bug to begin with!
Looking at your configurations, we can see that you are focusing mostly on the executor tuning, which isn't wrong, but you are forgetting the driver part of the equation.
Looking at the Spark cluster overview diagram from the Apache Spark documentation:
As you can see, each worker has an executor; however, in your case the worker node is the same as the driver node, which is what happens when you run locally or on a single-node standalone cluster.
Further, the driver takes 1 GB of memory by default unless tuned with the spark.driver.memory flag. Furthermore, you should not forget the heap usage of the JVM itself, and the Web UI, which is also handled by the driver, AFAIK!
When you delete the lines of code you mentioned, your code is left without actions; map is just a transformation, so nothing is executed and you don't see any memory increase at all.
The same applies to groupBy: it is just a transformation that will not be executed until an action is called, which in your case is agg and show further down the stream.
That said, try to minimize your driver memory and the overall number of cores in Spark (defined by spark.cores.max if you want to control the number of cores for this process), then cascade down to the executors. Moreover, I would add spark.python.profile.dump to your list of configurations so you can see a profile of your Spark job execution, which can help you understand the case better and tune your cluster to your needs.
As far as I can see in your 5 lines, maybe the groupBy is the issue; would you try with reduceBy and see how it performs?
See here and here.
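As an illustration of the tuning suggested above (the option names are real Spark settings, but the values are placeholders of mine, not recommendations from this answer):

from pyspark import SparkConf, SparkContext

# illustrative values only; note that spark.driver.memory has to be set on
# the spark-submit command line (e.g. --driver-memory), since it cannot be
# changed after the driver JVM has started
conf = (SparkConf()
        .set("spark.cores.max", "1")                        # cap total cores for the app
        .set("spark.python.profile", "true")                # enable the Python profiler
        .set("spark.python.profile.dump", "/tmp/profiles")) # directory for profile dumps

sc = SparkContext(appName="Netflow", conf=conf)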
