Using dask.compute with an array of delayed items

Currently, I can create (nested) lists of objects that are a mix of eagerly computed items and delayed items.
If I pass that list to dask.compute, it creates the graph and computes the result as a new list, replacing the delayed items with their computed counterparts.
The list has a very well-defined structure that I would like to exploit; before using Dask, I had been using NumPy arrays with dtype=object for this.
Can I pass these NumPy arrays to dask.compute?
Are there other collections that support ND slicing à la NumPy that I could use instead?
My current workaround is to use either dictionaries or nested lists, but the ability to slice NumPy arrays is really nice and I would not like to lose that.
Thanks,
Mark
code example as notebook

Dask.compute currently only searches through core Python data structures like lists and dictionaries; it does not search through NumPy arrays.
You might consider using NumPy arrays until the very end, then calling .tolist(), and then calling np.array on the computed result:
result = dask.compute(*x.tolist())
result = np.array(result)
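For example, a minimal end-to-end sketch of that round trip (the 2×2 shape and the inc function here are made up for illustration):

import numpy as np
import dask

# Hypothetical delayed function for the example
inc = dask.delayed(lambda v: v + 1)

# Object-dtype array mixing eager values and delayed items
x = np.empty((2, 2), dtype=object)
x[0, 0] = 1
x[0, 1] = inc(1)
x[1, 0] = inc(2)
x[1, 1] = 4

# dask.compute traverses nested lists, so convert first,
# then rebuild the array from the computed result
result = dask.compute(*x.tolist())
result = np.array(result)  # array([[1, 2], [3, 4]])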

Related

How to "reindex" with Dask DataFrame

I'm looking into using Dask for time-series research with large volumes of data. One common operation I use is realignment of data to a different index (the reindex operation on pandas DataFrames). I noticed that the reindex function is not currently supported in the Dask DataFrame API, but it is in the DataArray API. Are there plans to add this function?
I believe you could use the DataFrame.set_index() method combined with .resample() for the same purpose.
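A rough sketch of that suggestion (the ts and value column names and the 15-minute grid are made up for illustration):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    'ts': pd.date_range('2017-01-01', periods=100, freq='1min'),
    'value': range(100),
})
ddf = dd.from_pandas(pdf, npartitions=4)

# Index by timestamp, then realign onto a coarser time grid
realigned = ddf.set_index('ts').resample('15min').mean()
print(realigned.compute())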

Safe & performant way to modify dask dataframe

As part of a data workflow I need to modify values in a subset of Dask dataframe columns and pass the results on for further computation. In particular, I'm interested in two cases: mapping columns and mapping partitions. What is the recommended safe and performant way to act on the data? I'm running in a distributed setup on a cluster with multiple worker processes on each host.
Case 1.
I want to run:
res = dataframe.column.map(func, ...)
This returns a Series, so I assume the original dataframe is not modified. Is it safe to assign the column back to the dataframe, e.g. dataframe['column'] = res? Probably not. Should I make a copy with .copy() and then assign the result to it, like:
dataframe2 = dataframe.copy()
dataframe2['column'] = dataframe.column.map(func, ...)
Any other recommended way to do it?
Case 2.
I need to map partitions of the dataframe:
df.map_partitions(mapping_func, meta=df)
Inside mapping_func() I want to modify values in chosen columns, either by using partition[column].map or simply with a list comprehension. Again, how do I modify the partition safely and return it from the mapping function?
The partition received by the mapping function is a pandas dataframe (a copy of the original data?), but when modifying data in place I'm seeing some crashes (with no exception/error messages, though). The same goes for calling partition.copy(deep=False); it doesn't work. Should the partition be deep-copied and then modified in place? Or should I always construct a new dataframe out of the new/mapped column data and the original/unmodified series/columns?
You can safely modify a dask.dataframe
Operations like the following are supported and safe:
df['col'] = df['col'].map(func)
This modifies the task graph in place but does not modify the data in place (assuming that the function func creates a new series).
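For example, a small self-contained sketch of the safe column rebind (the toy data and the ×10 function are made up):

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({'col': [1, 2, 3, 4]}), npartitions=2)

# Rebinds 'col' in the task graph; the mapped function returns a
# new Series, so no partition data is mutated in place
ddf['col'] = ddf['col'].map(lambda v: v * 10, meta=('col', 'int64'))
print(ddf.compute())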
You cannot safely modify a partition
Your second case, where you map_partitions a function that modifies a pandas dataframe in place, is not safe. Dask expects to be able to reuse data, call functions twice if necessary, etc. If you have such a function, you should first create a copy of the pandas dataframe within that function.
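A sketch of that copy-first pattern for the second case (the column name col is hypothetical):

import pandas as pd
import dask.dataframe as dd

def mapping_func(partition):
    # Copy first: Dask may reuse or recompute partitions, so the
    # input frame must never be mutated in place
    partition = partition.copy()
    partition['col'] = partition['col'].map(lambda v: v + 1)
    return partition

ddf = dd.from_pandas(pd.DataFrame({'col': [1, 2, 3, 4]}), npartitions=2)
ddf = ddf.map_partitions(mapping_func, meta=ddf)
print(ddf.compute())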

Mapping flat fields to sequential records

I have a source schema that defines a "ShippingCharge" and a "DiscountAmount". My destination schema is an EDI X12 850 message.
I need to create two "fake" iterations for the SAC loop, and a way to specify that the first iteration uses the ShippingCharge and the second uses the DiscountAmount. There are a few additional "default values" that I need to set on SAC01 that also depend on the iteration (1 or 2).
What functoid should I be using? Any suggestions?
Have you tried the Table Looping functoid? You can use it to define multiple rows using input links (ShippingCharge and DiscountAmount) and constants (the SAC01 values). The output then loops through these rows and creates the two SACLoop1 elements.
You will need to use the Table Extractor functoid as well to deal with each data value in the table.
Complete instructions on using Table Looping and Table Extractor can be found here: http://msdn.microsoft.com/en-us/library/aa559310%28v=bts.20%29.aspx

bulk insert using OCI

I am currently inserting records one by one into a table from C++ code using OCI. The data is in a hashmap of structs; I iterate over the elements of the map, binding the attributes of the struct to the columns of a record in the table, e.g.:
define insert query
use OCIBindByName() for all the columns of the record
iterate over map:
    assign bind variables from the attributes of the struct
    call OCIStmtExecute()
end
This is pretty slow, so I'd like to speed things up by doing a bulk insert. What is a good way to do this? Should I use an array of structs to insert all the records in one OCIStmtExecute? Do you have any example code which shows how to do this?
Here is some sample code showing how I implemented this in OCI*ML. In summary, the way to do it is (say, for a table with one column of integers):
1. malloc() a block of memory of sizeof(int) × the number of rows and populate it. This could be an array.
2. Call OCIBindByPos() with that pointer for *valuep and the size for value_sz.
3. Call OCIStmtExecute() with iters set to the number of rows from step 1.
In my experience, speedups of 100× are certainly possible.
What you probably want is a bulk (array) insert. Array binds work by binding the data of the first row to the first structure of the array and setting the skip size, which is generally the size of the structure; after that you execute the statement once with the number of rows in the array. Binding and executing row by row creates overhead, which is why bulk inserts are used.
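The same array-bind idea can be sketched from Python with the python-oracledb driver, whose executemany() performs the array binding for you (the connection details and the people table are hypothetical; this illustrates the technique rather than the raw OCI calls):

import oracledb

rows = [(i, f'name-{i}') for i in range(10000)]

with oracledb.connect(user='scott', password='tiger',
                      dsn='localhost/XEPDB1') as conn:
    with conn.cursor() as cur:
        # One execute binds the whole array instead of one row at a time
        cur.executemany('INSERT INTO people (id, name) VALUES (:1, :2)', rows)
    conn.commit()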
bulk insert example.txt
by
{
    delimiter=','  // or any delimiter used in your text file
    size=200kb     // or the size of your text file
}
Use DPL (Direct Path Loading).
Refer to docs.oracle.com for more info.

Sorted TStringList, how does the sorting work?

I'm simply curious: lately I have been seeing the use of hash maps in Java, and I wonder if Delphi's sorted TStringList is similar at all.
Does the TStringList object generate a hash to use as an index for each item? And how does the search string get checked against the list of strings via the Find function?
I make use of sorted TStringLists very often and I would just like to understand what is going on a little bit more.
Please assume I don't know how a hash map works, because I don't :)
Thanks
I'm interpreting this question, quite generally, as a request for an overview of lists and dictionaries.
A list, as almost everyone knows, is a container that is indexed by contiguous integers.
A hash map, dictionary or associative array is a container whose index can be of any type. Very commonly, a dictionary is indexed with strings.
For sake of argument let us call our lists L and our dictionaries D.
Lists have true random access: an item can be looked up in constant time if you know its index. This is not the case for dictionaries, which usually resort to hash-based algorithms to achieve efficient random access.
A sorted list can perform a binary search when you attempt to find a value. Finding a value V is the act of obtaining the index I such that L[I] = V. Binary search is very efficient. If the list is not sorted then it must perform a linear search, which is much less efficient. A sorted list can maintain its order on insertion: when a new item is added, it is inserted at the correct location.
You can think of a dictionary as a list of <Key, Value> pairs. You can iterate over all pairs, but more commonly you use index notation to look up a value for a given key: D[Key]. Note that this is not the same operation as finding a value in a list; it is the analogue of reading L[I] when you know the index I.
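The distinction is language-agnostic; here is a short Python illustration of the three lookups discussed (linear find, binary search on a sorted list, and keyed dictionary access):

import bisect

L = ['pear', 'apple', 'plum', 'fig']

# Unsorted list: finding a value is a linear scan, O(n)
i = L.index('plum')

# Sorted list: finding a value is a binary search, O(log n),
# and new items are inserted at the correct position
S = sorted(L)
j = bisect.bisect_left(S, 'plum')
assert S[j] == 'plum'
bisect.insort(S, 'kiwi')

# Dictionary: look up a value by key via hashing, O(1) on average
D = {'plum': 42}
v = D['plum']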
In older versions of Delphi it was common to coax dictionary behaviour out of string lists. The performance was terrible. There was little flexibility in the contents.
With modern Delphi, there is TDictionary, a generic class that can hold anything. The implementation uses a hash and although I have not personally tested its performance I understand it to be respectable.
There are commonly used algorithms that optimally use all of these containers: unsorted lists, sorted lists, dictionaries. You just need to use the right one for the problem at hand.
TStringList holds the strings in an array.
If you call Sort on an otherwise unsorted (Sorted property = false) string list then a QuickSort is performed to sort the items.
The same happens if you set Sorted to true.
If you call Find (or IndexOf, which calls Find) on an unsorted string list (Sorted property = false; even if you explicitly called Sort, the list is considered unsorted unless the Sorted property is true), then a linear search is performed, comparing all strings from the start until a match is found.
If you call Find on a sorted string list (Sorted property = true) then a binary search is performed (see http://en.wikipedia.org/wiki/Binary_search for details).
If you add a string to a sorted string list, a binary search is performed to determine the correct insertion position, all following elements in the array are shifted by one, and the new string is inserted.
Because of this, insertion performance gets a lot worse the larger the string list is. If you want to insert a large number of entries into a sorted string list, it's usually better to turn sorting off, insert the strings, and then set Sorted back to true, which performs a QuickSort.
The disadvantage of that approach is that you will not be able to prevent the insertion of duplicates.
EDIT: If you want a hash map, use TDictionary from the unit Generics.Collections.
You could look at the source code, since that comes with Delphi. Ctrl-Click on the "sort" call in your code.
It's a simple alphabetical sort in non-Unicode Delphi, and a slightly more complex Unicode one in later versions. You can supply your own comparison for custom sort orders. Unfortunately I don't have a recent version of Delphi so I can't confirm, but I expect that under the hood there's a proper Unicode-aware and locale-aware string comparison routine. Unicode sorting/string comparison is not trivial, and a little web searching will point out some of the pitfalls.
Supplying your own comparison routine is often done when you have delimited text in the strings or objects attached to them (the Objects property). In those cases you often want to sort by a property of the object or by something other than the first field in the string. Or it might be as simple as wanting a numerical sort on the strings (so "2" comes after "1" rather than after "19").
There is also a THashedStringList, which could be an option (especially in older Delphi versions).
BTW, the Unicode sort routines for TStringList are quite slow. If you override the TStringList.CompareStrings method and the strings contain only ANSI characters (which they will if you use English exclusively), you can use customised ANSI string comparisons. I use my own customised TStringList class that does this, and it is 4 times faster than the standard TStringList for a sorted list, for both reading and writing strings from/to the list.
Delphi's dictionary type (in generics-enabled versions of Delphi) is the closest thing to a hash map that ships with Delphi. THashedStringList makes lookups faster than they would be in a sorted string list. You can do lookups in a sorted string list using a binary search, so it's faster than a brute-force search, but not as fast as a hash.
The general theory of a hash is that it is unordered but very fast on lookup and insertion. A sorted list is reasonably fast on insertion until the size of the list gets large, although it's not as efficient as a dictionary for insertion.
The big benefit of a list is that it is ordered; a hash-lookup dictionary is not.
