Why do dask divisions need to be unique? - dask

I want to set the index for a dask dataframe (from_delayed) using already known divisions. However, dask complains that the divisions are required to be unique. This constraint causes trouble for me, since the partitions would turn out to be about 5 GB in size, which is a bit too much for my taste.
Is there a way to work around this constraint or loosen it for certain operations?

You should view the divisions as an optimisation, which allows dask to know which data is expected in which partition for some operations (groupby, fetch particular index row, etc.).
If your data is not organised in a way that the divisions on the index are unique, you have a simple option: do not provide divisions at all. You will then lose those optimisations, which are not appropriate to your case anyway. Alternatively, you could reorganise your data, either within dask or before passing it to dask.
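To see why repeated division values get in the way, it may help to model divisions as sorted boundaries used for a binary-search lookup. The sketch below is a simplified pure-Python model and not dask's actual implementation; `partition_for` is a hypothetical helper, and the half-open interval convention is an assumption made for illustration.

```python
from bisect import bisect_right

def partition_for(divisions, value):
    """Return the index of the partition expected to hold `value`.

    Simplified model (an assumption, not dask internals): partition i
    covers [divisions[i], divisions[i + 1]), and the last partition also
    includes its upper bound.
    """
    if not divisions[0] <= value <= divisions[-1]:
        raise KeyError(value)
    npartitions = len(divisions) - 1
    return min(bisect_right(divisions, value) - 1, npartitions - 1)

unique_divisions = (0, 10, 20, 30)
duplicate_divisions = (0, 10, 10, 30)

# With unique boundaries every index value maps to exactly one partition.
first = partition_for(unique_divisions, 10)

# With a repeated boundary, partition 1 covers the empty range [10, 10):
# the lookup skips it, so rows with index 10 stored there would be missed.
second = partition_for(duplicate_divisions, 10)
```

In this model a repeated boundary leaves a partition that the lookup can never select, which is one way to picture why the optimisation needs unique divisions.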

Related

Checking if two Dasks are the same

What is the right way to determine if two Dask objects refer to the same result? Is it as simple as comparing the name attributes of both or are there other checks that need to be run?
In the case of any of the dask collections in the main library (array, bag, delayed, dataframe) yes, equal names should imply equal values.
However, the converse is not always true. We don't use deterministic hashing everywhere; sometimes we use UUIDs instead. For example, random arrays always get random UUIDs for keys, but two random arrays might be equal by chance.
No guarantees are given for collections made outside of the Dask library. No enforcement is made at the scheduler level.
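The difference between the two naming schemes can be illustrated with a toy model. This is only a sketch of the idea and not how dask computes its keys; `deterministic_name` and `random_name` are hypothetical helpers.

```python
import hashlib
import uuid

def deterministic_name(op, *args):
    """Name derived from the operation and its inputs: same inputs, same name."""
    payload = repr((op, args)).encode()
    return f"{op}-{hashlib.sha256(payload).hexdigest()[:16]}"

def random_name(op):
    """Name drawn from a fresh UUID: different on every call."""
    return f"{op}-{uuid.uuid4().hex}"

# Equal names imply equal inputs, hence equal results.
a = deterministic_name("add", 1, 2)
b = deterministic_name("add", 1, 2)

# UUID-based names never match, even if the results happen to be equal.
c = random_name("random")
d = random_name("random")
```

So comparing names can confirm that two objects are the same, but differing names do not prove that they differ.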

Detect common features in multidimensional data

I am designing a system for anomaly detection.
There are multiple approaches to building such a system. I chose to implement one facet of it by detecting features shared by the majority of samples. I acknowledge the possible insufficiencies of such a method, but for my specific use case: (1) it suffices to know that a new sample contains (or lacks) features shared by the majority of past data to make a quick decision; (2) I'm interested in the insights such a method will offer into the data.
So, here is the problem:
Consider a large data set with M data points, where each data point may include any number of {key:value} features. I choose to model a training dataset by grouping all the features observed in the data (the set of all unique keys) and setting it as the model's feature space. I define each sample by setting its values for existing keys and None for values in features it does not include.
Given this training data set, I want to determine which features recur in the data and, for such recurring features, whether they mostly share a single value.
My question:
A simple solution would be to count everything: for each of the N features (the unique keys), calculate the distribution of values. However, as M and N are potentially large, I wonder if there is a more compact way to represent the data or a more sophisticated method to make claims about the features' frequencies.
Am I reinventing an existing wheel? If there's an online approach for accomplishing such a task, it would be even better.
If I understand your question correctly, you need to go over all the data anyway, so why not use hashing?
Actually, two hash tables:
Inner hash table for the distribution of feature values.
Outer hash table for feature existence.
This way, the counts in each inner hash table indicate how common the feature is in your data, and its keys show how the values differ from one another. Also note that you go over your data only once, and the time complexity of (almost) every operation on a hash table is O(1), provided you allocate enough space from the beginning.
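A minimal sketch of the two-level hash table, assuming each sample is a dict of hashable values and that absent features are simply missing keys. The function names and the sample data are made up for illustration.

```python
from collections import Counter, defaultdict

def feature_value_counts(samples):
    """Outer table: feature key -> inner table (value -> occurrence count)."""
    outer = defaultdict(Counter)
    for sample in samples:  # single pass over the data
        for key, value in sample.items():
            outer[key][value] += 1
    return outer

def summarize(outer, n_samples):
    """Per feature: how often it occurs and how dominant its top value is."""
    summary = {}
    for key, counts in outer.items():
        total = sum(counts.values())
        top_value, top_count = counts.most_common(1)[0]
        summary[key] = {
            "frequency": total / n_samples,   # share of samples with the key
            "top_value": top_value,
            "top_share": top_count / total,   # share of the dominant value
        }
    return summary

samples = [
    {"os": "linux", "port": 22},
    {"os": "linux", "port": 80},
    {"os": "linux"},
    {"os": "windows", "port": 22},
]
counts = feature_value_counts(samples)
summary = summarize(counts, len(samples))
```

Because the counters are updated one sample at a time, the same structure can be kept up to date as new samples arrive, which fits the online setting mentioned in the question.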
Hope it helps

How to apportion between BatchInserterIndex cache and MMIO?

In a batch insertion using Lucene indexes, given a set of nodes and relationships so large that the node and relationship stores cannot fit completely in mapped memory (hence the need for Lucene index caching), how should one divide memory between MMIO and the Lucene index caches to achieve optimal performance? Having read the documentation, I am already somewhat familiar with how to divide memory within the mapped-memory schema. I am interested in the overall allotment of memory between MMIO and the Lucene caches. Since I am working on a prototype with whatever hardware happens to be available, and the future resources and data volume are undetermined, I would prefer the answer to be in general terms (I think this would also make the answer more useful to the rest of the Neo4j community). So it would be good if I could pose the question like this:
Given
rwN nodes and rwR relationships that are written and must be read later in the batch insertion,
woN nodes and woR relationships that are only written,
G gigabytes of RAM (not including what is required for the operating system)
What is the optimal division of G between Lucene index caches and MMIO?
If more details are needed I can supply them for my particular case.
All these considerations are only relevant when importing (multiple) billions of nodes and relationships.
Usually, when you do lookups, it depends on the "hot dataset size" of your index lookups.
By default that's all nodes, but if you know your domain better, you can probably devise some paging that results in smaller caches being needed (e.g. by pre-sorting your input data for relationship creation by the start- and end-node lookup property). You then have a kind of moving window over your node data, during which each node is accessed frequently.
I usually even sort by min(start,end).
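As a rough illustration of that pre-sorting step, assuming relationships are held as hypothetical `(start, end)` tuples of comparable lookup keys:

```python
# Hypothetical relationship records: (start_key, end_key).
rels = [(7, 2), (1, 9), (5, 3), (2, 4)]

# Sort by the smaller of the two endpoints so that relationships touching
# nearby nodes are processed close together (the "moving window").
rels_sorted = sorted(rels, key=lambda rel: min(rel))
```

For real imports the data would not fit in a Python list; the same ordering would have to come from an external sort of the input files.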
Usually you try to use most of the RAM for MMIO mapping of the relationship store and node store. The property stores are only written to, but the others have to be updated as well.
The index cache lookup is only a HashMap behind the scenes, so it is quite wasteful. What I found to work better is to use a different approach, e.g. a multi-pass one.
Use a string array: put all your lookup properties in there, sort it, and use the array index (Arrays.binarySearch) as the node ID. The lookup in that array alone is then quite efficient.
Another way is to make multiple passes over the source data, so that you create the node IDs needed for the relationships as part of the source. Friso and Kris from Xebia did something like that in their Hadoop-based solution, especially the monotonically increasing parallel IDs.
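The sorted-array idea can be sketched as follows. The answer refers to Java's Arrays.binarySearch; this sketch uses Python's `bisect` module as a stand-in, and `build_lookup` and `node_id` are hypothetical helpers.

```python
from bisect import bisect_left

def build_lookup(keys):
    """Sort the unique lookup properties; a key's position becomes its node ID."""
    return sorted(set(keys))

def node_id(sorted_keys, key):
    """Binary-search the sorted array and return the position of `key`."""
    i = bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return i
    raise KeyError(key)

keys = build_lookup(["carol", "alice", "bob", "alice"])
alice_id = node_id(keys, "alice")
carol_id = node_id(keys, "carol")
```

Compared with a hash map, the array stores only the keys themselves, at the cost of an O(log n) lookup instead of O(1).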

How to improve the performance in big table join?

Please help me out with this big data problem.
I have a very large table (500 GB) that stores cookie information collected from one website, and I am trying to provide a service to many other clients. Each client has its own cookies, so in the end I need to query 500 GB + 300 GB (client_data).
Since some queries use both my cookie data and the client's cookie data, I may need to join my table with theirs, and the performance of that join is bad. To solve this problem, I put the entire 800 GB of data into one giant table. Since there is no join, the performance is good. But when I expand my service to multiple clients, it takes too much storage.
Currently I am using Vertica as my data source, and I use bitmaps to store my information.
Any suggestions that can maintain my current performance but also support around 40 clients? My storage is about 12 TB and each client in the current solution takes 1.5 TB.
What I want is either a replacement for Vertica that supports bitmap operations and quick table joins, or a better way to represent my data.
"My storage is about 12 TB and each client in the current solution takes 1.5 TB."
If you have 40 * 1.5TB worth of non-duplicated cookie data to store, there's no magic to make that fit into 12TB.
This will be an imprecise answer due to the lack of details about definitions, etc. But I would add the following about performance:
Look at your projection definitions. You may be able to get performance gains depending on what you put in the order by clause of the projection.
You have a few ways forward, depending on the specifics of your case. Points 1 and 3 are the easiest to deal with:
1. You can set up your projections properly, to make sure that both tables are identically segmented: https://my.vertica.com/docs/6.1.x/HTML/index.htm#12549.htm
2. You can set up pre-join projections, where the join cost is paid during data load rather than during data retrieval; see https://my.vertica.com/docs/6.1.x/HTML/index.htm#1299.htm
3. Make sure that your data types are the best possible. Matching on ints is faster than matching on strings, and matching columns with low cardinality is faster than matching columns with high cardinality.
If 1 and 3 are well set, Vertica can actually apply filters before decompression, which speeds up your query considerably and uses a lot less memory.

Neo4j 2.0: Indexing array-valued properties with schema indexing

I have nodes with multiple "sourceIds" in one array-valued property called "sourceIds", just because there could be multiple resources a node could be derived from (I'm assembling multiple databases into one Neo4j model).
I want to be able to look up nodes by any of their source IDs. With legacy indexing this was no problem, I would just add a node to the index associated with each element of the sourceIds property array.
Now I wanted to switch to indexing with labels and I'm wondering how that kind of index works here. I can do
CREATE INDEX ON :<label>(sourceIds)
but what does that actually do? I hoped it would just create index entries for each array element, but that doesn't seem to be the case. With
MATCH n:<label> WHERE "testid" in n.sourceIds RETURN n
the query takes between 300ms and 500ms which is too long for an index lookup (other schema indexes work three to five times faster). With
MATCH n:<label> WHERE n.sourceIds="testid" RETURN n
I don't get a result. That's expected because it's an array property, but I gave it a try since it would make sense if array properties were broken down into their elements for indexing purposes.
So, is there a way to handle array properties with schema indexing or are there plans or will I just have to stick to legacy indexing here? My problem with the legacy Lucene index was that I hit the max number of boolean clauses (1024). Another question thus would be: Can I raise this number? Lucene allows that, but can I do this with the Lucene index used by Neo4j?
Thanks and best regards!
Edit: A bit more elaboration on why I hit the boolean clauses max limit: I need to export specific parts of the database into custom file formats for text processing pipelines. These pipelines use components I cannot (be it for the sake of accessibility or time) change to query Neo4j directly, so I'd rather stay with the defined required file format(s). I do the export via the pattern "give me all IDs in the DB; now, for batches of IDs, query the desired information (e.g. specific paths) from Neo4j and store the results to file". Why do I use batches at all? Well, if I don't, things are slowed down significantly by the connection overhead. Thus, large batches are a kind of optimization here.
Schema indexes can only do exact matches right now. Your "testid" in n.sourceIds does not use the index (as shown by your query times). I think there are plans to make this behave better, but I'm waiting for them just as eagerly as you are.
I've actually hit a lower max in the Lucene query: 512. If there is a way to increase it, I'd love to hear of it. The way I got around it is just doing more than one query if I have one of the rare cases that actually goes over 512 IDs. What query are you doing where you need more?
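Splitting a long ID list into several queries can be sketched as below. The limit of 512 is taken from the answer above; `batches` is a hypothetical helper, and the actual query call is left out because it depends on the client in use.

```python
def batches(ids, max_clauses=512):
    """Yield consecutive slices of `ids`, each no longer than `max_clauses`."""
    for start in range(0, len(ids), max_clauses):
        yield ids[start:start + max_clauses]

ids = list(range(1200))
chunks = list(batches(ids))

# Each chunk would be sent as one index query, and the results concatenated.
chunk_sizes = [len(chunk) for chunk in chunks]
```

This keeps every query under the clause limit while still sending far fewer requests than one query per ID.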

Resources