Get individual dask dataframe partition status - dask

I would like to run an asynchronous dask dataframe computation with dd.persist() and then be able to track the status of an individual partition. The goal is to get access to partial results in a non-blocking way.
Here is the desired pseudocode:
dd = dd.persist()
if dd.partitions[0].__dask_status__ == 'finished':
    # Partial non-blocking result access
    df = dd.partitions[0].compute()
Using dask futures works well, but submitting many individual partitions is very slow compared to a single dd.persist() and having one future per partition breaks the dashboard "groups" tab by showing too many blocks.
futures = list(map(client.compute, dd.partitions))
[Image: broken dask dashboard "groups" tab]

The function you probably want is distributed.futures_of, which lists the futures that a collection depends on. You can either examine this list yourself, looking at the status of each future, or pass it to distributed.as_completed and use a for-loop to process the partitions as they become available. The keys of the futures look like (collection-name, partition-index), so you know which partition each one belongs to.
The reason dd.partitions[i] (or looping over these with list) doesn't work is that it creates a new graph for each partition, so you end up submitting much more to the scheduler than with the single call to .persist().
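A minimal sketch of that approach, assuming a running distributed.Client and a dask dataframe passed in as ddf. The helper names partition_index and iter_finished_partitions are made up for illustration. Depending on the distributed version, a future's key may be the tuple itself or its string representation, so the helper accepts both.

```python
import ast


def partition_index(key):
    """Return the partition index from a key shaped like (collection-name, index).

    Assumption: some distributed versions expose the key as the string form
    of that tuple, others as the tuple itself, so both are handled.
    """
    if isinstance(key, str):
        key = ast.literal_eval(key)
    _name, index = key
    return index


def iter_finished_partitions(ddf):
    """Persist ddf once and yield (partition index, pandas DataFrame) pairs
    in completion order. Requires an active distributed.Client."""
    # Imported lazily so partition_index stays usable without distributed.
    from distributed import as_completed, futures_of

    persisted = ddf.persist()        # a single submission for the whole collection
    futures = futures_of(persisted)  # the futures backing the persisted partitions
    for future in as_completed(futures):
        yield partition_index(future.key), future.result()
```

For a non-blocking check of one partition rather than a loop, the futures returned by futures_of can be polled with future.done() or future.status instead.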

Related

Join of multiple streams with the Python SDK

I would like to join multiple streams on a common key and trigger a result either as soon as all of the streams have contributed at least one element or at the end of the window. CoGroupByKey seems to be the appropriate building block, but there does not seem to be a way to express the early trigger condition (count trigger applies per input collection)?
I believe CoGroupByKey is implemented as Flatten + GroupByKey under the hood. Once multiple streams are flattened into one, a data-driven trigger (or any other trigger) can no longer tell the inputs apart, so it won't have enough control to achieve what you want.
Instead of using CoGroupByKey, you can use Flatten and a StatefulDoFn that fills an object backed by State for each key. In this case the StatefulDoFn also gets the chance to decide what to do when, say, two elements have arrived from stream A but stream B has not produced any element yet.
Another potential solution that comes to mind is (a stateless) DoFn that filters the CoGBK results to remove those that don't have at least one occurrence for each joined stream. For the end of window result (which does not have the same restriction), it would then be necessary to have a parallel CoGBK and its result would not go through the filter. I don't think there is a way to tag results with the trigger that emitted it?
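The per-key bookkeeping that such a StatefulDoFn would do can be modelled in plain Python. This is not Beam API code: JoinState, its fields and the stream tags are hypothetical, and in a real pipeline the buffer would live in Beam state, with the end-of-window flush presumably driven by a timer.

```python
from dataclasses import dataclass, field


@dataclass
class JoinState:
    """Per-key buffer, standing in for the state a stateful DoFn would keep."""
    stream_tags: tuple                       # tags of all streams being joined
    buffered: dict = field(default_factory=dict)
    fired_early: bool = False

    def add(self, tag, element):
        """Buffer one element; return the joined result the first time every
        stream has contributed at least one element, otherwise None."""
        self.buffered.setdefault(tag, []).append(element)
        complete = all(self.buffered.get(t) for t in self.stream_tags)
        if complete and not self.fired_early:
            self.fired_early = True
            return self.snapshot()
        return None

    def snapshot(self):
        """Current contents per stream; an end-of-window flush would emit this."""
        return {t: list(self.buffered.get(t, [])) for t in self.stream_tags}
```

Because the early result is emitted by add() and the final one by snapshot(), the two kinds of output are distinguishable without needing to know which trigger fired.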

Safe & performant way to modify dask dataframe

As part of a data workflow I need to modify values in a subset of dask dataframe columns and pass the results on for further computation. In particular, I'm interested in 2 cases: mapping columns and mapping partitions. What is the recommended safe & performant way to act on the data? I'm running it in a distributed setup on a cluster with multiple worker processes on each host.
Case1.
I want to run:
res = dataframe.column.map(func, ...)
this returns a data series so I assume that original dataframe is not modified. Is it safe to assign a column back to the dataframe e.g. dataframe['column']=res? Probably not. Should I make a copy with .copy() and then assign result to it like:
dataframe2 = dataframe.copy()
dataframe2['column'] = dataframe.column.map(func, ...)
Any other recommended way to do it?
Case2
I need to map partitions of the dataframe:
df.map_partitions(mapping_func, meta=df)
Inside the mapping_func() I want to modify values in chosen columns, either by using partition[column].map or simply by creating a list comprehension. Again, how do I modify the partition safely and return it from the mapping function?
Partition received by mapping function is a Pandas dataframe (copy of original data?) but while modifying data in-place I'm seeing some crashes (no exception/error messages though). Same goes for calling partition.copy(deep=False), it doesn't work. Should partition be deep copied and then modified in-place? Or should I always construct a new dataframe out of new/mapped column data and original/unmodified series/columns?
You can safely modify a dask.dataframe
Operations like the following are supported and safe
df['col'] = df['col'].map(func)
This modifies the task graph in place but does not modify the data in place (assuming that the function func creates a new series).
You can not safely modify a partition
Your second case, where you map_partitions a function that modifies a pandas dataframe in place, is not safe. Dask expects to be able to reuse data, call functions twice if necessary, etc. If you have such a function, you should create a copy of the pandas dataframe first within that function.
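A minimal sketch of such a copy-first function, assuming a column named col and a doubling transformation, both of which are placeholders:

```python
import pandas as pd


def mapping_func(partition: pd.DataFrame) -> pd.DataFrame:
    """Return a modified copy; the incoming partition is left untouched."""
    out = partition.copy()                        # deep copy by default
    out["col"] = out["col"].map(lambda v: v * 2)  # hypothetical transformation
    return out
```

It would be passed to dask as in the question, for example df.map_partitions(mapping_func, meta=df), which is valid here because the output keeps the same columns and dtypes.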

Using complex objects in DataFlow

We have several BigQuery tables that we're reading from through DataFlow. At the moment those tables are flattened and a lot of the data is repeated. In Dataflow, all operations must be idempotent, so any output only depends on the input to the function, there's no state kept anywhere else. This is why it makes sense to first group all the records together that belong together and in our case, this probably means creating complex objects.
Example of A complex object (there are many other types like this). We can have millions of instances of each type obviously:
Customer{
    customerId
    address {
        street
        zipcode
        region
        ...
    }
    first_name
    last_name
    ...
    contactInfo: {
        "phone1": {type, number, ... },
        "phone2": {type, number, ... }
    }
}
The examples we found for DataFlow only process very simple objects and the examples demonstrate counting, summing and averaging.
In our case, we eventually want to use DataFlow to perform more complicated processing in accordance with sets of rules. Those rules apply to the full contact of a customer, invoice or order for example and eventually produce a whole set of indicators, sums and other items.
We considered doing this 100% in BigQuery, but this gets very messy very quickly due to the rules that apply per entity.
At this time I'm still wondering whether DataFlow is really the right tool for this job. There are almost no examples for DataFlow that demonstrate how it's used for these types of more complex objects with one or two collections. The closest I found was the use of a "LogMessage" object for log processing, but this didn't have any collections and therefore didn't do any hierarchical processing.
The biggest problem we're facing is hierarchical processing. We're reading data like this:
customerid  ...  street  zipcode  region  ...  phoneid  type  number
1                a       b        c            phone1   1     555-2424
1                a       b        c            phone2   1     555-8181
The first operation should be to group those rows together to construct a single entity, so we can make our operations idempotent. What is the best way to do that in DataFlow, or can you point us to an example that does that?
You can use any object as the elements in a Dataflow pipeline. The TrafficMaxLaneFlow example uses a complex object (although it doesn't have a collection).
In your example you would do a GroupByKey to group the elements. The result is a KV<K, Iterable<V>>. The KV here is just an object and has a collection-like value inside. You could then take that KV<K, Iterable<V>> and turn it into whatever kind of objects you wanted.
The only thing to be aware of is that if you have very few elements that are really big you may run into some parallelism limits. Specifically, each element needs to be small enough to be processed on a single machine.
You may also be interested in withoutFlatteningResults on BigQueryIO. It only supports reading from a query (rather than a table) but it should provide the results without flattening.
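The group-then-build step described above can be sketched in plain Python using the sample rows from the question. This is not Dataflow code: the dictionary grouping only stands in for GroupByKey, and the Customer class and group_rows function are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Customer:
    customer_id: int
    address: dict
    contact_info: dict = field(default_factory=dict)


def group_rows(rows):
    """Group flattened rows by customer id, then build one Customer per group."""
    grouped = defaultdict(list)              # plays the role of GroupByKey
    for row in rows:
        grouped[row["customerid"]].append(row)

    customers = []
    for customer_id, group in grouped.items():
        first = group[0]                     # address columns repeat on every row
        customer = Customer(
            customer_id=customer_id,
            address={k: first[k] for k in ("street", "zipcode", "region")},
        )
        for row in group:
            customer.contact_info[row["phoneid"]] = {
                "type": row["type"],
                "number": row["number"],
            }
        customers.append(customer)
    return customers
```

In a pipeline, the second loop would be the function applied to each grouped element, so each Customer is built from exactly the rows that share its key.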

Bulbflow Neo4j Graph Database Slow

I am trying to create 500,000 nodes in a graph database. I plan to add edges as per my requirements later. I have a text file with 500,000 lines representing the data to be stored in each node.
from bulbs.neo4jserver import Graph, Config, NEO4J_URI
config = Config(NEO4J_URI)
g = Graph(config)
def get_or_create_node(text, crsqid):
    # look the node up in the index first; create it only if it is missing
    v = g.vertices.index.lookup(crsqid=crsqid)
    if v is None:
        v = g.vertices.create(crsqid=crsqid)
        print(text + " - node created")
    v.text = text
    v.save()
    return v
I then loop over each line in the text file,
count = 1
with open('titles-sorted.txt') as f:
    for line in f:
        get_or_create_node(line, count)
        count += 1
This is terribly slow. This gives me 5000 nodes in 10 minutes. Can this be improved? Thanks
I don't see any transaction code in there, establishing one, or signaling transaction success. You should look into that -- if you're doing one transaction for every single node creation, that's going to be slow. You should probably create one transaction, insert thousands of nodes, then commit the whole batch.
I'm not familiar with bulbs, so I can't tell you how to do that with this python framework, but here is a place to start: this page suggests you can use a coding style like this, with some python/neo bindings:
with db.transaction:
    foo()
Also, if you're trying to load mass amounts of data and you need performance, you should check this page for information on bulk importing. It's unlikely that doing it in your own script is going to be the most performant. You might instead consider using your script to generate cypher queries, which get piped to the neo4j-shell.
Finally a thing to consider is indexes. Looks like you're indexing on crsqid - if you get rid of that index, creates may go faster. I don't know how your IDs are distributed, but it might be better to break records up into batches to test if they exist, rather than using the get_or_create() pattern.
Batch loading 500k nodes individually via REST is not ideal. Use Michael's batch loader or the Gremlin shell -- see Marko's movie recommendation blog post for an example of how to do this from the Gremlin shell.

The right way to hydrate lots of entities in py2neo

This is more of a best-practices question. I am implementing a search back-end for highly structured data that, in essence, consists of ontologies, terms, and a complex set of mappings between them. Neo4j seemed like a natural fit and after some prototyping I've decided to go with py2neo as a way to communicate with neo4j, mostly because of its nice support for batch operations.
What frustrates me is that I'm having trouble introducing the kind of higher-level abstraction I would like in my code. I can use the objects directly as a mini-ORM, but then I'm making lots and lots of atomic REST calls, which kills performance (I have a fairly large data set).
What I've been doing instead is getting my query results and using get_properties on them to batch-hydrate my objects. This performs great, which is why I went down this route in the first place, but it makes me pass tuples of (node, properties) around in my code, which gets the job done but isn't pretty at all.
So I guess what I'm asking is whether there's a best practice somewhere for working with a fairly rich object graph in py2neo, getting the niceties of an ORM-like layer while retaining performance (which in my case means doing as much as possible as batch queries).
I am not sure whether I understand what you want, but I had a similar issue. I wanted to make a lot of calls and create a lot of nodes, indexes and relationships (around 1.2 million). Here is an example of adding nodes, relationships, indexes and labels in batches using py2neo:
from py2neo import neo4j, node, rel
gdb = neo4j.GraphDatabaseService("<url_of_db>")
batch = neo4j.WriteBatch(gdb)
a = batch.create(node(name='Alice'))
b = batch.create(node(name='Bob'))
batch.set_labels(a,"Female")
batch.set_labels(b,"Male")
batch.add_indexed_node("Name","first_name","alice",a) #this will create an index 'Name' if it does not exist
batch.add_indexed_node("Name","first_name","bob",b)
batch.create(rel(a,"KNOWS",b)) #adding a relationship in batch
batch.submit() #this will now listen to the db and submit the batch records. Ideally around 2k-5k records should be sent
Since you're asking for best practices, here is an issue I ran into:
When adding a lot of nodes (~1M) with py2neo in a batch, my program often gets slow or crashes when the neo4j server runs out of memory. As a workaround, I split the submit in multiple batches:
from py2neo import neo4j
def chunker(seq, size):
    """
    Chunker gets a list and returns slices
    of the input list with the given size.
    """
    for pos in xrange(0, len(seq), size):
        yield seq[pos:pos + size]
def submit(graph_db, list_of_elements, size):
    """
    Batch submit lots of nodes.
    """
    # chunk data
    for chunk in chunker(list_of_elements, size):
        batch = neo4j.WriteBatch(graph_db)
        for element in chunk:
            n = batch.create(element)
            batch.add_labels(n, 'Label')
        # submit batch for chunk
        batch.submit()
        batch.clear()
I tried this with different chunk sizes. For me, it's fastest with ~1000 nodes per batch. But I guess this depends on the RAM/CPU of your neo4j server.
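The chunker above needs a sequence with a known length. A variant that also works on generators or open files, so the elements do not all have to be in memory at once, could look like this sketch for Python 3. The name chunked is made up here and is not part of py2neo.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items from any iterable, without needing len()."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

It can replace chunker in the submit function, for example by iterating over chunked(list_of_elements, 1000).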
