Dask script fails on a large CSV file in a localhost environment

We are trying to use Dask to clean up some data as part of an ETL process.
The original file is a CSV of over 3 GB.
When we run the code on a 1 GB subset, it runs successfully (with a few user warnings triggered by cleaning steps such as the following):
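# Keep only the first run of digits in the id column
ddf[id1] = ddf[id1].str.extract(r'(\d+)')
# Match ids that contain the same digit six or more times in a row
repeater = re.compile(r'((\d)\2{5,})')
mask_repeater = ddf[id1].str.contains(repeater, regex=True)
ddf = ddf[~mask_repeater]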
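For reference, the repeater pattern can be checked outside Dask with the standard re module. This is only a minimal sketch: it assumes the intended pattern is ((\d)\2{5,}), and the helper name has_long_repeat and the sample ids are made up for illustration.

```python
import re

# Assumed intent of the pattern: one digit followed by at least five
# more copies of itself, i.e. six or more identical digits in a row.
repeater = re.compile(r"((\d)\2{5,})")


def has_long_repeat(value):
    """Return True if the string contains a digit repeated 6+ times."""
    return repeater.search(value) is not None


# Hypothetical sample ids, not taken from the original data.
samples = ["4711", "11111", "111111", "9000000012"]
flags = [has_long_repeat(s) for s in samples]
```

Because the pattern contains capture groups, pandas-style str.contains may emit a UserWarning about match groups, which could be one of the warnings mentioned above.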
On the 3 GB file the process nearly completes (only one task, drop-duplicates-agg, is left) and then restarts from the middle (that is what I can see on the Bokeh status page). We also see the following warning, which is the same one that appears when the script starts to run:
RuntimeWarning: Couldn't detect a suitable IP address for reaching '8.8.8.8', defaulting to '127.0.0.1'...
I'm running on an offline, single 64-bit Windows workstation with 24 cores.
Any suggestions?

Related

Dask with TLS connection cannot end the program when using the to_parquet method

I am using Dask to process 10 files, each about 142 MB in size. I built a method with the delayed decorator; the following is an example:
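import os

import dask
import pandas as pd


@dask.delayed
def process_one_file(input_file_path, save_path):
    res = []
    for line in open(input_file_path):
        res.append(line)
    # Build the frame from all collected lines (the original passed only the
    # last line) and name the column, since Parquet needs string column names.
    df = pd.DataFrame(res, columns=["line"])
    df.to_parquet(save_path + os.path.basename(input_file_path))


if __name__ == '__main__':
    # ClusterClient is not defined in the post; presumably it is the asker's
    # own helper that connects to the TLS-enabled cluster.
    client = ClusterClient()
    input_dir = ""
    save_dir = ""
    print("start to process")
    csvs = [process_one_file(input_dir + filename, save_dir) for filename in os.listdir(input_dir)]
    dask.compute(csvs)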
However, the program does not always finish cleanly. After processing all files, it often hangs.
I run the program from the command line. It often hangs after printing "start to process". I know the processing itself is correct, since I can see all output files after a while.
But the program never exits. If I disable TLS, the program runs successfully.
It seems strange that Dask cannot stop the program when the TLS connection is enabled. How can I solve this?
I also found that the program fails to stop only when I call the to_parquet method; if I remove that call, it finishes successfully.
I have found the problem. I set memory-limit=10GB, that is, 10 GB for each process. In total I configured 2 workers, each with 2 processes, and each process has 2 threads.
Thus, each machine has 4 processes, which together may occupy 40 GB. However, my machine only has 32 GB. If I lower the memory limit, the program runs successfully!
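The arithmetic behind this answer can be written down as a quick sanity check. This is only a sketch using the numbers quoted above; the function name total_memory_limit_gb is hypothetical and not part of Dask.

```python
def total_memory_limit_gb(processes_per_machine, limit_per_process_gb):
    """Sum of the per-process memory limits on one machine, in GB."""
    return processes_per_machine * limit_per_process_gb


# Numbers from the answer: 4 processes per machine, 10 GB limit each,
# 32 GB of physical memory.
requested_gb = total_memory_limit_gb(4, 10)
physical_gb = 32
oversubscribed = requested_gb > physical_gb
```

When oversubscribed is true, the combined limits exceed what the machine can provide, which matches the situation the answer describes.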

Dask: same tasks are not running in parallel on a cluster of Ubuntu machines

I have 3 Ubuntu machines (CPU only). My Dask scheduler and client both run on the same machine, whereas the two Dask workers run on the other two machines. When I launch the first task, it gets scheduled on the first worker, but when I launch a second task while the first one is still executing, it does not get scheduled on the second worker. Here is the sample client code that I tried.
# client.py
from dask.distributed import Client
import time, sys, os, random


def my_task(arg):
    print("doing something in my_task")
    time.sleep(2)
    print("inside my task..", arg)
    print("again doing something in my_task")
    time.sleep(2)
    print("return some random value")
    value = random.randint(1, 100)
    print("value::", value)
    return value


client = Client("172.25.49.226:8786")
print("client::", client)
future = client.submit(my_task, "hi")
print("future result::", future.result())
print("closing the client..")
client.close()
I am running "python client.py" twice at almost the same time from two different terminals/machines. Both clients seem to execute, but they produce exactly the same output, which should not happen because my_task() returns a random value. I tested this on Ubuntu machines.
However, a month ago I was able to run the same tasks in parallel on CentOS machines. Now, when I go back and run the same two tasks from those CentOS machines, the problem persists: they do not run in parallel. This is strange, and I cannot figure out this Dask behavior. Am I missing an OS-level setting or something else?
Run the commands below at almost the same time:
python client.py # from one machine/terminal
python client.py # from another machine/terminal
These two tasks should run in parallel, each on a different worker (we have two free workers available), but this is not happening. I can't see any log on the second worker's console or on the scheduler while the first task continues to execute. At the end I noticed that both tasks finish at exactly the same time with exactly the same output.
However, the above client code does run in parallel on Windows, with each task started from a separate terminal. But I would like to run it on Ubuntu machines.
By default, if you call the same function on the same inputs, Dask assumes that it will produce the same value and computes it only once. You can override this behavior with the pure=False keyword:
future = client.submit(func, *args, pure=False)
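The effect of pure=False can be pictured with a toy model of how a task gets its name. This is a simplified sketch, not Dask's actual implementation: the function task_key and the hashing scheme are assumptions made for illustration. The idea is that a pure task gets a name derived from the function and its arguments, so two identical submissions collapse into one task, while an impure task gets a fresh random name each time.

```python
import hashlib
import uuid


def task_key(func, args, pure=True):
    """Toy model of task naming (hypothetical, not Dask's real code)."""
    if pure:
        # Same function and same arguments give the same key.
        digest = hashlib.md5(repr((func.__name__, args)).encode()).hexdigest()
    else:
        # A random suffix makes every submission a distinct task.
        digest = uuid.uuid4().hex
    return f"{func.__name__}-{digest}"


def my_task(arg):
    return arg


shared_a = task_key(my_task, ("hi",))
shared_b = task_key(my_task, ("hi",))
fresh_a = task_key(my_task, ("hi",), pure=False)
fresh_b = task_key(my_task, ("hi",), pure=False)
```

In this model shared_a and shared_b are identical, which would explain why two clients submitting my_task("hi") see one execution and the same "random" result, while fresh_a and fresh_b differ.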

MySQL connection pool in python?

I'm trying to process a large amount of data using Python while maintaining the processing status in MySQL. However, I'm surprised that there is no standard connection pool for python-mysql (like HikariCP in Java).
I initially started with PyMySQL, and things were great for the first few hours of the run. After a few hours, things started to fail. I was getting a lot of errors like:
pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 99] Cannot assign requested address)")
Moreover, a lot of ports were stuck in the TIME_WAIT state, because without connection pooling I'm opening and closing connections too frequently:
/d/p/950 ❯❯❯ netstat -nt | wc -l
84752
Per this and this, I tried to set tcp_fin_timeout and ip_local_port_range, but hardly anything improved.
echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
echo 15000 65000 > /proc/sys/net/ipv4/ip_local_port_range
Then I found out that MySQL provides mysql.connector, which comes with pooling functionality. After switching to it, performance actually deteriorated, and more processes started to fail. I'm using Python's multiprocessing module to run 29 processes simultaneously (multiprocessing.Pool picked this number by default) on a 24-core machine. The code was as follows; of course, I was using .my.cnf to pass all the credentials to avoid committing them to git:
import mysql.connector
from mysql.connector import pooling

conn_pool = pooling.MySQLConnectionPool(pool_name="mypool1",
                                        pool_size=pooling.CNX_POOL_MAXSIZE,
                                        option_files=MYSQL_CONFIG,
                                        option_groups=MYSQL_GROUP_NODE1,
                                        allow_local_infile=True)
conn = conn_pool.get_connection()
Finally, I reverted to the old code. I'm still using PyMySQL, and although errors are less frequent, they are still a major problem. I looked at SQLAlchemy and couldn't really find much documentation about pooling.
I'm wondering how everyone else is dealing with the mysql-python connection pooling issue. I really believe there should be something out there so that I don't have to reinvent the wheel.
Any pointers are much appreciated.
DBUtils implements, for MySQL (and, it claims, for arbitrary DB-API 2 compliant database interfaces), the user-sized connection pool PooledDB, the thread-mapped pool PersistentDB, and SteadyDB (see the functionality section). The latter should fit your case, where multiprocessing.Pool creates worker processes, each with its own managed persistent database connection. It is described as:
DBUtils.SteadyDB is a module implementing "hardened" connections to a database, based on ordinary connections made by any DB-API 2 database module. A "hardened" connection will transparently reopen upon access when it has been closed or the database connection has been lost or when it is used more often than an optional usage limit.
You can use it with PyMySQL like:
import pymysql
# DBUtils 1.x import path; from DBUtils 2.0 on the module is dbutils.steady_db
from DBUtils.SteadyDB import connect

db = connect(
    creator=pymysql,  # the remaining keyword arguments are passed to pymysql
    user='guest', password='', database='name',
    autocommit=True, charset='utf8mb4',
    cursorclass=pymysql.cursors.DictCursor)
Also see this related question for more examples.
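To make the general idea of pooling concrete, here is a minimal sketch of a fixed-size pool that reuses open connections instead of opening and closing one per query. The class SimplePool is hypothetical and written for illustration only, and sqlite3 stands in for the MySQL driver so the sketch runs without a server. Unlike SteadyDB, it does not reopen connections that have been lost.

```python
import queue
import sqlite3
from contextlib import contextmanager


class SimplePool:
    """Fixed-size pool that hands out already-open connections."""

    def __init__(self, connect, size):
        # Open all connections up front and keep them in a queue.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks while every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand the connection back instead of closing it


# Assumption: sqlite3 replaces the real MySQL driver for this sketch.
pool = SimplePool(lambda: sqlite3.connect(":memory:"), size=2)

with pool.connection() as conn:
    answer = conn.execute("SELECT 1 + 1").fetchone()[0]
```

Because connections are returned to the queue rather than closed, no new TCP socket would be needed per query with a real network driver, which is what keeps ports out of TIME_WAIT.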

IPython.parallel client is hanging while waiting for result of map_async

I am running 7 worker processes on a single machine with 4 cores. I may have made a poor choice with this loop while waiting for the result of map_async:
while not result.ready():
    time.sleep(10)
    for out in result.stdout:
        print(out)
rec_file_list = result.get()
result.stdout keeps growing with all the printed output from the 7 processes running, and it caused the console that initiated the map to hang. The activity monitor on my MacBook Pro shows the 7 processes are still running, and the terminal running the Controller is still active. What are my options here? Is there any way to acquire the result once the processes have completed?
I found an answer:
Remote introspection of AsyncResult objects is possible from another client as long as a 'database backend' has been enabled by the controller with:
ipcontroller --dictdb # or --mongodb or --sqlitedb
Then, it is possible to create a new client instance and retrieve the results with:
client.get_result(task_id)
where the task_ids can be retrieved with:
client.hub_history()
Also, a simple way to avoid the buffer overflow I encountered is to periodically print just the last few lines from each engine's stdout history, and to flush the buffer like:
from IPython.display import clear_output
import sys
import time

while not result.ready():
    clear_output()
    for stdout in result.stdout:
        if stdout:
            # Show only the last few lines of each engine's output
            lines = stdout.split('\n')
            for line in lines[-4:-1]:
                if line:
                    print(line)
    sys.stdout.flush()
    time.sleep(30)
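The "last few lines" step can be isolated into a small helper that is easy to test without a running cluster. This is a sketch only; the function name tail is an assumption, and it differs slightly from the loop above in that it drops empty lines before slicing.

```python
def tail(stdout_text, n=3):
    """Return the last n non-empty lines of a captured stdout string."""
    if n <= 0:
        return []
    lines = [line for line in stdout_text.split("\n") if line]
    return lines[-n:]


# Hypothetical captured output from one engine.
captured = "step 1\nstep 2\nstep 3\nstep 4\nstep 5\n"
recent = tail(captured)
```

Printing only tail(stdout) for each engine keeps the console output bounded no matter how much the engines have written.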

SQLite occasionally fails to create :memory: database

In our unit testing suite, we create and destroy a large number of SQLite databases that use the path of ":memory:". Occasionally, and only when running on the iOS simulator, the creation of those databases fails with the rather enigmatic message:
Database ":memory:": unable to open database file
99% of the time, these requests succeed. (Subsequent tests within the same test run typically succeed after this failure occurs.) But when you're using this in an automated build-acceptance test, you want 100%.
We've instrumented for memory consumption (it's within normal limits) and disk-space availability (20GB+ available).
Any ideas?
UPDATE: Captured this happening with extra logging per Richard's suggestion below. Here's the log output:
SQLITE ERROR: (28) attempt to open "/Users/xxx/Library/Developer/CoreSimulator/Devices/CF762060-7D23-4C79-A466-7F20AB6233E7/data/Containers/Data/Application/582E1ED0-81E0-4CC7-A6F6-DBEBC101BBE8/tmp/etilqs_1ghbf1MSTa8ilSj" as
SQLITE ERROR: (14) cannot open file at line 30595 of [f66f7a17b7]
SQLITE ERROR: (14) os_unix.c:30595: (17) open(/Users/xxx/Library/Developer/CoreSimulator/Devices/CF762060-7D23-4C79-A466-7F20AB6233E7/data/Containers/Data/Application/582E1ED0-81E0-4CC7-A6F6-DBEBC101BBE8/tmp/etilqs_1ghbf1MST
I've noticed that even a :memory: database will create files on disk if you create a temporary table. The temporary file names on Unix systems are generated by a PRNG, so there is a non-zero chance of a name collision if lots and lots of temporary files are created simultaneously. The create could also fail if the disk is full, or if for some reason the Unix temp directory is not accessible, either because it has been deleted or because its permissions are invalid.
For example, I turned on several loggers in the sqlite3 command-line tool by adding these arguments to the llvm-gcc command line: -DSQLITE_DEBUG_OS_TRACE=1 -DSQLITE_TEST=1 -DSQLITE_DEBUG=1. I then observed temp files being created when running this SQL from the command line:
$ ./sqlite3
SQLite version 3.8.8.2 2015-01-30 14:30:45
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> create temporary table t( x );
OPENX 3 /var/folders/nf/l1cw8sn1707b73zy5nqycrpw0000gn/T//etilqs_fvwR6KbMm518S4w 01002
OPEN 3
WRITE 3 512 0 0
OPENX 4 /var/folders/nf/l1cw8sn1707b73zy5nqycrpw0000gn/T//etilqs_OJJJ1lrTtQIFnUO 05402
OPEN 4
WRITE 4 1024 0 0
WRITE 4 1024 1024 0
WRITE 3 28 0 0
sqlite>
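One setting that relates to this observation, and which is not part of the original answer, is SQLite's temp_store pragma, which asks SQLite to keep temporary tables and indices in memory rather than in temp files. The sketch below uses Python's sqlite3 module purely for illustration; whether this removes the intermittent failure on the iOS simulator is untested, and a library compiled to always use temp files would ignore the pragma.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Ask SQLite to keep temporary tables and indices in memory.
conn.execute("PRAGMA temp_store = MEMORY")
temp_store = conn.execute("PRAGMA temp_store").fetchone()[0]

# A temporary table like the one in the trace above.
conn.execute("CREATE TEMPORARY TABLE t (x)")
conn.execute("INSERT INTO t VALUES (1)")
rows = conn.execute("SELECT x FROM t").fetchall()
conn.close()
```

The pragma reports its setting as a number, where 2 corresponds to MEMORY.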
No ideas. But perhaps if you turned on the error and warning log, it would give some clues.
