Slowly increasing memory usage of Dask Scheduler

I'm running a test:
import time
from random import randint
from dask.distributed import Client

client = Client('127.0.0.1:8786')

def x(i):
    return {}

while True:
    start = time.time()
    # submit x with a random argument (matching the 'x-...' keys in the
    # transition log shown further down)
    a = client.submit(x, randint(0, 1000000))
    res = a.result()
    del a
    end = time.time()
    print("Ran in %s with res %s" % (end - start, res))

client.shutdown()
del client
I used it (with more code) to get an estimate of my queries' performance, but for this example I've stripped out everything I could think of.
The above code leaks roughly 0.1 MB per second, which I would guesstimate at roughly 0.3 MB per 1000 calls.
Am I doing something wrong in my code?

My Python debugging skills are a bit rusty (and by "a bit" I mean I last used objgraph on Orbited, the precursor to WebSockets, in 2009: https://pypi.python.org/pypi/orbited), but from what I can see, checking the number of references before and after:
Counting objects in the scheduler, before and after, using objgraph.show_most_common_types():
| What        | Before | After | Diff  |
|-------------|--------|-------|-------|
| function    | 33318  | 33399 | 81    |
| dict        | 17988  | 18277 | 289   |
| tuple       | 16439  | 28062 | 11623 |
| list        | 10926  | 11257 | 331   |
| OrderedDict | N/A    | 7168  | 7168  |
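A minimal sketch of how such counts can be taken inside the scheduler process, assuming objgraph is installed there and reusing the client from the snippet above (Client.run_on_scheduler runs a function in the scheduler):

def most_common():
    import objgraph
    return objgraph.most_common_types(limit=10)

before = client.run_on_scheduler(most_common)
# ... run the submit loop for a while ...
after = client.run_on_scheduler(most_common)
print(before)
print(after)   # diff the two listings by hand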
It's not a huge amount of RAM in any case, but digging deeper I found that scheduler._transition_counter is 11453 and scheduler.transition_log is filled with:
('x-25ca747a80f8057c081bf1bca6ddd481', 'released', 'waiting',
OrderedDict([('x-25ca747a80f8057c081bf1bca6ddd481', 'processing')]), 4121),
('x-25ca747a80f8057c081bf1bca6ddd481', 'waiting', 'processing', {}, 4122),
('x-25cb592650bd793a4123f2df39a54e29', 'memory', 'released', OrderedDict(), 4123),
('x-25cb592650bd793a4123f2df39a54e29', 'released', 'forgotten', {}, 4124),
('x-25ca747a80f8057c081bf1bca6ddd481', 'processing', 'memory', OrderedDict(), 4125),
('x-b6621de1a823857d2f206fbe8afbeb46', 'released', 'waiting', OrderedDict([('x-b6621de1a823857d2f206fbe8afbeb46', 'processing')]), 4126)
First error on my part
This of course led me to realise my first error: I had not configured transition-log-length.
After setting the transition-log-length configuration option to 10:
| What           | Before | After | Diff |
|----------------|--------|-------|------|
| function       | 33323  | 33336 | 13   |
| dict           | 17987  | 18120 | 133  |
| tuple          | 16530  | 16342 | -188 |
| list           | 10928  | 11136 | 208  |
| _lru_list_elem | N/A    | 5609  | 5609 |
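For reference, this is a distributed configuration option; the exact key and file have moved around between versions (it used to be a top-level transition-log-length entry in ~/.dask/config.yaml, while later releases nest it under the scheduler section of the distributed YAML config), so check the configuration reference for your version. Roughly:

distributed:
  scheduler:
    transition-log-length: 10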
A quick Google search found that _lru_list_elem is created by functools.lru_cache, which in turn is used in key_split (in distributed/utils.py), an LRU cache of up to 100,000 items.
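A quick sketch of why such a bounded lru_cache flat-lines rather than leaks; the function below is a hypothetical stand-in, not distributed's actual key_split:

from functools import lru_cache

@lru_cache(maxsize=100000)       # same bound as mentioned above
def prefix(key):                 # stand-in for key_split
    return key.rsplit('-', 1)[0]

for i in range(250000):
    prefix('x-%d' % i)

# currsize stops at 100000 even though 250000 distinct keys were seen
print(prefix.cache_info())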
Second try
Based on the code, it appears that Dask should climb up to roughly 100k _lru_list_elem objects.
After running my script again and watching the memory, it climbs quite fast until I approach 100k _lru_list_elem, after which it stops climbing almost entirely.
This appears to be the case: memory pretty much flat-lines after 100k.
So: no leak, but it was fun to get my hands dirty in the Dask source code and Python memory profilers.

For diagnostic, logging, and performance reasons, the Dask scheduler keeps records of many of its interactions with workers and clients in fixed-sized deques. These records do accumulate, but only to a finite extent.
We also try to ensure that we don't keep around anything that would be too large.
Seeing memory use climb up to a nice round number like the one you've seen, and then hold steady, is consistent with this.
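As a rough illustration of that bounded-growth behaviour (a sketch, not the scheduler's actual data structures):

from collections import deque

# a fixed-size log: the oldest entries are dropped once maxlen is reached
log = deque(maxlen=100000)

for i in range(1000000):
    log.append(('x-%d' % i, 'released', 'waiting'))

print(len(log))  # 100000, no matter how long the loop runs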

Related

Finding which child is using up all my memory in Erlang

I am troubleshooting a crashing Erlang program. It runs out of memory. It has several children started by OTP (one_for_one in the supervisor), and some started with spawn.
I am starting the program and dropping into the Erlang prompt (test@test)1>. I'd like to see how much memory each of these children is using from here. I've searched online and not found anything, but this seems like a common enough need to already have a solution.
How can I find the memory utilization of each child, in Erlang, from the system prompt?
Did you try observer?
When you get the prompt, type observer:start(). Then, in the Applications tab, you can see all the applications and, for each of them, their processes. For each process you can get the memory usage by opening the process-information sub-window.
Try erlang:process_info/2 with memory in ItemList
process_info(Pid, ItemList) -> InfoTupleList | [] | undefined
Types
Pid = pid()
ItemList = [Item]
Item = process_info_item()
InfoTupleList = [InfoTuple]
InfoTuple = process_info_result_item()
process_info_item() =
backtrace |
binary |
catchlevel |
current_function |
current_location |
current_stacktrace |
dictionary |
error_handler |
garbage_collection |
garbage_collection_info |
group_leader |
heap_size |
initial_call |
links |
last_calls |
memory |
message_queue_len |
messages |
min_heap_size |
min_bin_vheap_size |
monitored_by |
monitors |
message_queue_data |
priority |
reductions |
registered_name |
sequential_trace_token |
stack_size |
status |
suspending |
total_heap_size |
trace |
trap_exit
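For example, from the shell, for any child Pid you want to inspect (a minimal sketch):

%% memory footprint of a single process, in bytes
erlang:process_info(Pid, memory).

%% or several items at once
erlang:process_info(Pid, [memory, message_queue_len]).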

How to import an Apple Core Motion dataset into Turi Create?

I've recently discovered that Apple Core Motion data (accelerometer, gyroscope, etc.) can be used to create learning models. The link below shows an example:
https://github.com/apple/turicreate/blob/master/userguide/activity_classifier/introduction.md
This example uses data from a large dataset (HAPT). In my situation, I'm the creator of my own dataset, built from recordings of Core Motion data while performing different activities (e.g. jumping, walking, sitting). The next step is to import my dataset into Turi Create to build a model. How can this be achieved? Could anyone provide a list of steps to follow?
Thank you
Ideally, you would have recorded your motion data into some standard format. Let's assume it is in CSV format.
walking,jumping,sitting
82,309635,1
82,309635,1
25,18265403,1
30,18527312,8
30,17977769,40
30,18375422,37
30,18292441,38
30,303092,7
85,18449654,3
You can read the file using any file reader. To simplify your life, pandas or SFrame may rescue you.
In [14]: import turicreate as tc
In [15]: sf = tc.SFrame.read_csv('/tmp/activity.csv')
Finished parsing file /tmp/activity.csv
Parsing completed. Parsed 9 lines in 0.13823 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as
column_type_hints=[int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /tmp/activity.csv
Parsing completed. Parsed 9 lines in 0.113868 secs.
In [16]: sf.head()
Out[16]:
Columns:
walking int
jumping int
sitting int
Rows: 9
Data:
+---------+----------+---------+
| walking | jumping | sitting |
+---------+----------+---------+
| 82 | 309635 | 1 |
| 82 | 309635 | 1 |
| 25 | 18265403 | 1 |
| 30 | 18527312 | 8 |
| 30 | 17977769 | 40 |
| 30 | 18375422 | 37 |
| 30 | 18292441 | 38 |
| 30 | 303092 | 7 |
| 85 | 18449654 | 3 |
+---------+----------+---------+
[9 rows x 3 columns]
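If you prefer pandas, a roughly equivalent sketch (same hypothetical /tmp/activity.csv as above) is to read with pandas and convert to an SFrame:

import pandas as pd
import turicreate as tc

df = pd.read_csv('/tmp/activity.csv')  # read the recording with pandas
sf = tc.SFrame(data=df)                # convert to an SFrame for Turi Create

From there you can follow the activity classifier steps in the user guide linked in the question.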

Return MAX value with VLOOKUP from list

I have a Google Sheet with data on different players' attacks and their corresponding damage.
Sheet1
| Player | Attack | Damage |
|:------------|:-----------:|------------:|
| Iron Man | Melee | 50 |
| Iron Man | Missile | 2500 |
| Iron Man | Unibeam | 100 |
| Superman | Melee | 9000 |
| Superman | Breath | 200 |
| Superman | Laser | 1500 |
In my second sheet, I want to list each player and display their best attack and the corresponding damage. Like this:
Sheet2
| Player | Best attack | Damage |
|:------------|:-----------:|------------:|
| Iron Man | Missile | 2500 |
| Superman | Melee | 9000 |
I have tried to add the following in the damage column (third column) of Sheet2:
=MAX(IF(Sheet1!A:A=A2;Sheet1!C:C))
But I get 9000 for Superman and 0 for Iron Man. For the best attack (second column), I guess MAX should be used together with VLOOKUP, but I don't know how to apply it.
Edit:
=ArrayFormula(MAX(IF(Sheet1!A:A=A3;Sheet1!C:C))) seems to fix the first issue, giving correct values in the damage column (third column). But I still don't know how to apply this to return which attack is the best.
You could use FILTER.
Damage:
=MAX(FILTER(Sheet1!C:C,Sheet1!A:A=A2))
Then Best Attack:
=JOIN(",",FILTER(Sheet1!B:B,Sheet1!A:A=A2,Sheet1!C:C=C2))
The JOIN will combine two or more attack names if several attacks tie for the same damage.
Assuming the data is in the range A2:C, try this formula. It sorts by damage (column 3) in descending order, then keeps only the first row for each distinct player (column 1):
=sortn(sort(A2:C,3,0),9^9,2,1,0)

Does endianness refer to ordering within a defined array of memory, or also to the actual memory used?

I'm having trouble expressing my question in words, but I think I can express it visually quite simply. Storing the string abcd, is the difference between Big and Little Endian this:
memory address | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ...
little endian  | d | c | b | a |
big endian     | a | b | c | d |
Or this:
memory address | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ...
little endian  | d | c | b | a |
big endian     |   |   |   | a | b | c | d |
My attempt in words: does "endianness" refer to the ordering of bytes within a specific memory "array", where in both cases the array begins at the same point in memory, or does it refer to both the ordering and the actual array used?
Endianness refers to the ordering of bytes used to store a single multi-byte numerical value; it does not change which addresses that value occupies, so your first picture is the accurate one. The "big endian" system in your second image is storing its 4-byte values unaligned, which no system would normally do.
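A small Python sketch of the same idea: the value occupies the same four bytes either way, only the byte order within them differs.

# the bytes 'a', 'b', 'c', 'd' interpreted as one 32-bit integer
value = 0x61626364

print(value.to_bytes(4, 'big'))     # b'abcd'
print(value.to_bytes(4, 'little'))  # b'dcba'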

Neo4j CSV import query super slow, when setting relationships

I am trying to evaluate Neo4j (using the community version).
I am importing some data (1 million rows) using the LOAD CSV process. It needs to match previously imported nodes to create a relationship between them.
Here is my query:
//Query #3
//create edges between Tr and Ad nodes
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///1M.txt'
AS line
FIELDTERMINATOR '\t'
//find appropriate tx and ad
MATCH (tx:Tr { txid: TOINT(line.txid) }), (ad:Ad {p58: line.p58})
//create the edge (relationship)
CREATE (tx)-[out:OUT_TO]->(ad)
//set properties on the edge
SET out.id= TOINT(line.id)
SET out.n = TOINT(line.n)
SET out.v = TOINT(line.v)
I have indexes on:
Indexes
ON :Ad(p58) ONLINE (for uniqueness constraint)
ON :Tr(txid) ONLINE
ON :Tr(h) ONLINE (for uniqueness constraint)
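For reference, in Neo4j 3.x these would have been created with statements along these lines (a uniqueness constraint creates the backing index automatically):

CREATE CONSTRAINT ON (ad:Ad) ASSERT ad.p58 IS UNIQUE;
CREATE INDEX ON :Tr(txid);
CREATE CONSTRAINT ON (tx:Tr) ASSERT tx.h IS UNIQUE;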
This query has been running for 5 days now and has so far created 270K relationships (out of 1M).
The Java heap is 4 GB.
The machine has 32 GB of RAM and an SSD, and runs only Linux and Neo4j.
Any hints to speed this process up would be highly appreciated.
Should I try the enterprise edition?
Query Plan:
+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
If a part of a query contains multiple disconnected patterns,
this will build a cartesian product between all those parts.
This may produce a large amount of data and slow down query processing.
While occasionally intended,
it may often be possible to reformulate the query that avoids the use of this cross product,
perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (ad))
20 ms
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+---------------------------------+----------------+---------------------+----------------------------+
| Operator | Estimated Rows | Variables | Other |
+---------------------------------+----------------+---------------------+----------------------------+
| +ProduceResults | 1 | | |
| | +----------------+---------------------+----------------------------+
| +EmptyResult | | | |
| | +----------------+---------------------+----------------------------+
| +Apply | 1 | line -- ad, out, tx | |
| |\ +----------------+---------------------+----------------------------+
| | +SetRelationshipProperty(4) | 1 | ad, out, tx | |
| | | +----------------+---------------------+----------------------------+
| | +CreateRelationship | 1 | out -- ad, tx | |
| | | +----------------+---------------------+----------------------------+
| | +ValueHashJoin | 1 | ad -- tx | ad.p58; line.p58 |
| | |\ +----------------+---------------------+----------------------------+
| | | +NodeIndexSeek | 1 | tx | :Tr(txid) |
| | | +----------------+---------------------+----------------------------+
| | +NodeUniqueIndexSeek(Locking) | 1 | ad | :Ad(p58) |
| | +----------------+---------------------+----------------------------+
| +LoadCSV | 1 | line | |
+---------------------------------+----------------+---------------------+----------------------------+
OK, so splitting the MATCH statement into two sped up the query immensely. Thanks @William Lyon for pointing me to the plan; I noticed the warning about the cartesian product.
Old MATCH statement:
MATCH (tx:Tr { txid: TOINT(line.txid) }), (ad:Ad {p58: line.p58})
split into two:
MATCH (tx:Tr { txid: TOINT(line.txid) })
MATCH (ad:Ad {p58: line.p58})
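With that change, the full revised query looks like this (everything else as above):

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///1M.txt'
AS line
FIELDTERMINATOR '\t'
MATCH (tx:Tr { txid: TOINT(line.txid) })
MATCH (ad:Ad {p58: line.p58})
CREATE (tx)-[out:OUT_TO]->(ad)
SET out.id= TOINT(line.id)
SET out.n = TOINT(line.n)
SET out.v = TOINT(line.v)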
On 750K relationships the query took 83 seconds.
Next up: the 9 million row CSV LOAD.

Resources