Checking if two Dask objects are the same

What is the right way to determine if two Dask objects refer to the same result? Is it as simple as comparing the name attributes of both or are there other checks that need to be run?

For any of the Dask collections in the main library (array, bag, delayed, dataframe), yes: equal names should imply equal results.
However, the converse is not true. Dask does not use deterministic hashing everywhere; sometimes it uses UUIDs instead. For example, random arrays always get random UUIDs as keys, even though two random arrays might happen to be equal by chance.
No guarantees are given for collections created outside of the Dask library, and nothing is enforced at the scheduler level.
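For example (a minimal sketch; the exact name strings and the random-key behaviour may vary across Dask versions), deterministically constructed arrays share a name, while random arrays do not:

import dask.array as da

# Deterministic constructors hash their arguments, so identical
# inputs produce identical names, and therefore identical results.
x = da.ones((1000,), chunks=(100,))
y = da.ones((1000,), chunks=(100,))
assert x.name == y.name

# Random arrays are keyed with random tokens, so two arrays that
# might happen to be equal by chance still get different names.
a = da.random.random((1000,), chunks=(100,))
b = da.random.random((1000,), chunks=(100,))
assert a.name != b.name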

Why do dask divisions need to be unique?

I want to set the index for a dask dataframe (built with from_delayed) using already-known divisions. However, dask complains that the divisions are required to be unique. This constraint causes trouble for me, since the partitions would then turn out to be about 5 GB each, which is a bit too much for my taste.
Is there a way to work around this constraint or loosen it for certain operations?
You should view divisions as an optimisation that lets dask know which data to expect in which partition for certain operations (groupby, fetching a particular index row, and so on).
If your data is not organised in a way that makes the divisions on the index unique, you have a simple option: do not provide divisions at all. You then lose out on those particular optimisations, which are not appropriate to your case anyway. Alternatively, you could reorganise your data, either within Dask or before handing it to Dask.
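For instance (a hedged sketch; make_part and the column names are placeholders for however you actually build your partitions), simply omit the divisions when constructing the frame:

import pandas as pd
import dask
import dask.dataframe as dd

@dask.delayed
def make_part(i):
    # Stand-in for whatever actually produces each partition.
    return pd.DataFrame({"key": [i, i, i + 1], "value": range(3)})

parts = [make_part(i) for i in range(4)]

# No divisions supplied: dask records the partition boundaries as
# unknown and simply skips the optimisations that would need them.
df = dd.from_delayed(parts)
print(df.divisions)  # (None, None, None, None, None)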

Initialisation order in Lua Table Constructor

So, a table constructor has two components, list-like and record-like. Do list-like entries always take precedence over record-like ones? I mean, consider the following scenario:
a = {[1]=1, [2]=2, 3}
print(a[1]) -- 3
a = {1, [2]=2, [1]=3}
print(a[1]) -- 1
Is the index 1 always associated with the first list-like entry, 2 with the second, and so on? Or is there something else?
Lua actually has only one table type; the "arrays" and "dictionaries" (your "lists" and "records") are two ways of using it. Implementations typically keep consecutive integer keys in an internal array part, which makes ordered iteration and appending fast, and everything else in a hash part, which lets you use almost any value as a key.
When you construct a table you have two syntaxes: you can separate plain values with commas, e.g. {2,4,6,8}, which assigns them to the keys 1 through n, or you can write explicit key-value pairs, e.g. {[1]=2,[58]=4,[368]=6,[48983]=8}. You can usually mix the two without any problems, but not in your scenario.
What you are doing is assigning the same key twice within a single constructor. The Lua reference manual states that "the order of the assignments in a constructor is undefined", so which of the two values ends up at index 1 is unspecified and may differ between versions and implementations. (Your observed results fall out of how the reference implementation batches the list-style stores into one instruction at the end of the constructor, but nothing guarantees this.)
As such, you should not rely on this behaviour in code you ship or share with other people. When in doubt, construct the table first and assign the conflicting keys afterwards, where ordinary statement order applies.

What kind of sort does Cocoa use?

I'm always amazed by the abstractions our modern languages or frameworks create, even the ones considered relatively low level such as Objective-C/Cocoa.
Here I'm interested in the kind of sort executed when one calls sortedArrayUsingComparator: on an NSArray. Is it dynamic, analyzing the current constraints of the environment (particularly free memory) and the attributes of the array (length, unique values) and picking the best sort accordingly, or does it always use the same algorithm, like quicksort or merge sort?
It should be possible to test this by analyzing the running time of the method relative to N; I'm just wondering whether anyone has already bothered to.
This was described at an Apple developers conference. The sort works in place, without allocating extra memory, and it checks whether there is an already-sorted range at the start of the array, at the end, or both, and takes advantage of that. Ask yourself how you would sort a 100,000-entry array if the first 50,000 entries were already sorted in descending order.
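To illustrate the idea (a minimal Python sketch of the general technique, not Cocoa's actual, private implementation; unlike the described sort, this version allocates scratch memory): detect an already-sorted run at the front, sort only the remainder, and merge the two runs.

import heapq

def sort_exploiting_sorted_prefix(xs):
    # Find the longest non-decreasing prefix.
    i = 1
    while i < len(xs) and xs[i - 1] <= xs[i]:
        i += 1
    # Sort only the remaining tail, then merge the two sorted runs.
    tail = sorted(xs[i:])
    return list(heapq.merge(xs[:i], tail))

print(sort_exploiting_sorted_prefix([1, 2, 3, 9, 4, 0]))
# [0, 1, 2, 3, 4, 9]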

C linked list or hash table for matrix operations

I have a matrix in C of size m x n, where the size isn't known in advance. I need operations on the matrix such as: delete the first element and find the i-th element. (The size won't be too big, from 10 to 50 columns.) Which is more efficient to use, a linked list or a hash table? And how can I map a column of the matrix to one element of the linked list or hash table, depending on which I choose?
Thanks
Linked lists don't provide good random access, so from that perspective you might not want to use them to represent a matrix: every element lookup takes a hit while you traverse the list.
Hash tables are very good for looking up elements, since they provide near-constant-time lookup for any given key, assuming the hash function is decent (using a well-established hash-table implementation would be wise).
Given your constraints, though, a hash table of linked lists might be a workable solution, but it still leaves you with the problem of finding the i-th element: you get O(1) lookup for the row, but you'd still have to walk the linked list for the column, which is O(n) where n is the column count.
Furthermore, you'd have to make sure every list in your hash table is updated with the appropriate number of nodes as the number of columns grows and shrinks, so you're not buying yourself much in terms of space complexity either.
A 2D array is probably best suited for representing a matrix, with some logic for letting the matrix grow by efficiently managing memory allocation and copying; see the sketch below.
An alternative would be something like C++'s std::vector in lieu of the linked list: like an array it's contiguous in memory, but it gives you the flexibility of growing dynamically.
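To make the 2D-array suggestion concrete (sketched in Python for brevity; in C the same layout is a single malloc'd block indexed as data[row * ncols + col]), here is a row-major matrix with O(1) element access, where "delete the first element" is modelled by advancing a head offset rather than shifting memory:

class FlatMatrix:
    # Row-major matrix in one contiguous buffer, i.e. what a C 2D array is.
    def __init__(self, nrows, ncols, fill=0):
        self.nrows, self.ncols = nrows, ncols
        self.data = [fill] * (nrows * ncols)
        self.head = 0  # logical start; lets us "delete" the front in O(1)

    def get(self, i):
        # Return the i-th remaining element in O(1).
        return self.data[self.head + i]

    def get_rc(self, row, col):
        # Row/column access via the usual row-major arithmetic
        # (assumes no elements have been deleted yet).
        return self.data[row * self.ncols + col]

    def delete_first(self):
        # Drop the logical first element without moving any memory.
        self.head += 1

m = FlatMatrix(3, 4)
m.data = list(range(12))  # fill with 0..11 for the demo
print(m.get_rc(1, 2))     # row 1, col 2 -> 6
m.delete_first()
print(m.get(0))           # 1: element 0 is now "deleted"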
If it's for real work, use a hash table; the average lookup time is O(1).
For deletion/get/set at given indices, though, a 2D array is optimal, all at O(1).

Is objectForKey slow for big NSDictionary?

Assume we have a very big NSDictionary. When we call the objectForKey: method, will it have to perform lots of operations internally to find the value, or will it point to the value in memory directly?
How does it work under the hood?
The CFDictionary section of the Collections Programming Topics for Core Foundation (which you should look into if you want to know more) states:
A dictionary—an object of the CFDictionary type—is a hashing-based collection whose keys for accessing its values are arbitrary, program-defined pieces of data (or pointers to data). Although the key is usually a string (or, in Core Foundation, a CFString object), it can be anything that can fit into the size of a pointer—an integer, a reference to a Core Foundation object, even a pointer to a data structure (unlikely as that might be).
This is what Wikipedia has to say about hash tables:
Ideally, the hash function should map each possible key to a unique slot index, but this ideal is rarely achievable in practice (unless the hash keys are fixed; i.e. new entries are never added to the table after it is created). Instead, most hash table designs assume that hash collisions—different keys that map to the same hash value—will occur and must be accommodated in some way. In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at constant average (indeed, amortized) cost per operation.
The performance therefore depends on the quality of the hash function. If it is good, then accessing an element should be an O(1) operation (i.e. not dependent on the number of elements).
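As a quick illustration (in Python rather than Cocoa, but the same hash-table principle applies; absolute timings are machine-dependent), lookup time in a hash-based mapping stays roughly flat as the collection grows, while a linear scan does not:

import timeit

for n in (1_000, 100_000, 1_000_000):
    d = {i: i for i in range(n)}
    lst = list(range(n))
    # Hash lookup of the last key vs. a linear search for it.
    t_dict = timeit.timeit(lambda: d[n - 1], number=100_000)
    t_list = timeit.timeit(lambda: lst.index(n - 1), number=100)
    print(f"n={n:>9}: dict lookup {t_dict:.4f}s, list scan {t_list:.4f}s")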
EDIT:
In fact, after reading further into the Collections Programming Topics for Core Foundation, Apple answers your question directly:
The access time for a value in a CFDictionary object is guaranteed to be at worst O(log N) for any implementation, but is often O(1) (constant time). Insertion or deletion operations are typically in constant time as well, but are O(N*log N) in the worst cases. It is faster to access values through a key than accessing them directly. Dictionaries tend to use significantly more memory than an array with the same number of values.
NSDictionary is essentially a hash-table structure, so the big-O cost of a lookup is O(1). However, to avoid rehashing while populating a large dictionary (and to keep that O(1) in practice), you can create a mutable dictionary with dictionaryWithCapacity:, sizing the hint appropriately for your dataset.
