How to access a huge map in Apache Spark? - memory

I am calculating the skyline of a set of points, where each point has a number of dimensions and each dimension holds a double value. My input can be anything from thousands to millions of points with multiple dimensions. As part of my algorithm I build a huge map as an RDD, which looks like this:
JavaPairRDD<Tuple2<Integer, Integer>, BitSet> map
where the first integer is the dimension, the second is the ranking of a double value within that dimension, and the BitSet is a bit representation of that value in that dimension. This map is smaller than the input but can still be very large (usually 10000+ elements), and I need to distribute it to all nodes. The first thing I tried was a broadcast:
Broadcast<Map<Tuple2<Integer, Integer>, BitSet>> broadcasted = sparkContext.broadcast(map.collectAsMap());
This of course fails for big data, because the map consumes far too much memory. The map needs to be accessed by every node in read-only mode in order to perform some calculations; for example, a node will do (with the broadcast):
BitSet bitSet = (BitSet) broadcasted.value().get(key).clone();
// perform calculations
This access happens inside a map or a filter operation, so I can't keep the map as an RDD and call lookup on it. Specifically, the second RDD that needs to access the map RDD looks like this:
JavaPairRDD<Point, Iterable<Tuple2<Integer, Integer>>> values;
where the Iterable<Tuple2<Integer, Integer>> is the list of keys that the Point needs during a 'filter' in order for me to see if the point is in the skyline or not.
JavaRDD<Point2D> skylines = pointsWithKeys
.filter(p -> isSkyline(p._2()))
.map(Tuple2::_1);
Inside isSkyline I need access to the map in order to access the bitsets depending on the keys I give it.
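For context, here is a rough sketch of how the broadcast map is consumed inside isSkyline today. The signature (with the broadcast handle passed in explicitly) and the placeholder return value are my own assumptions; the real dominance check is omitted.

// Hypothetical shape of the helper: it receives the point's keys plus the broadcast handle.
boolean isSkyline(Iterable<Tuple2<Integer, Integer>> keys,
                  Broadcast<Map<Tuple2<Integer, Integer>, BitSet>> broadcasted) {
    for (Tuple2<Integer, Integer> key : keys) {
        // Read-only access: clone the BitSet so the shared broadcast value is never mutated.
        BitSet bitSet = (BitSet) broadcasted.value().get(key).clone();
        // perform calculations on bitSet to decide whether the point is dominated
    }
    return true; // placeholder result
}

It would then be called from the filter as .filter(p -> isSkyline(p._2(), broadcasted)).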
What is a more efficient way to distribute it between the nodes?

Related

How does hash table work for dynamically lengthed array?

In cases where the length of an array is fixed, it makes sense to me that the time complexity of hash tables is O(1). However, I don't get how hash tables work for dynamically lengthed arrays such as a linked list. Direct indexing is clearly not possible when elements of an array are scattered all over the memory.
You are correct: a hash table whose key lookup is backed by a linked list would not be O(1), because that search is O(n). However, linked lists are not the only expandable structure.
You could, for example, use a resizable vector, such as one that doubles in size each time it needs to expand. That is directly addressable without an O(n) search so satisfies that O(1) condition.
Keep in mind that resizing the vector would almost certainly change the formula that allocates items into individual buckets of that vector, meaning there's a good chance you'd have to recalculate the buckets into which every existing item is stored.
That would still amortise to O(1), even with a single insert operation possibly having to do an O(n) reallocation, since the reallocations would be infrequent, and likely to become less frequent over time as the vector gets larger.
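As a rough illustration of that behaviour (a sketch under my own assumptions, not a production hash table), here is a bucket array that doubles once a load-factor threshold is passed and re-buckets every existing key:

import java.util.LinkedList;

// Sketch only: a hash set whose bucket array doubles once the load factor passes 0.75,
// re-bucketing every existing key after each resize (the occasional O(n) step that
// still amortises to O(1) per insert).
class DoublingHashSet {
    private LinkedList<Integer>[] buckets = newBuckets(8);
    private int size = 0;

    void add(int key) {
        if (size + 1 > buckets.length * 0.75) resize();       // occasional O(n) step
        buckets[Math.floorMod(key, buckets.length)].add(key);
        size++;
    }

    private void resize() {
        LinkedList<Integer>[] old = buckets;
        buckets = newBuckets(old.length * 2);                  // double the number of buckets
        for (LinkedList<Integer> bucket : old)
            for (int key : bucket)                             // recompute every item's bucket
                buckets[Math.floorMod(key, buckets.length)].add(key);
    }

    @SuppressWarnings("unchecked")
    private static LinkedList<Integer>[] newBuckets(int n) {
        LinkedList<Integer>[] b = new LinkedList[n];
        for (int i = 0; i < n; i++) b[i] = new LinkedList<>();
        return b;
    }
}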
You can still map the elements of a linked list to a hash table. Yes, it's true we do not know the size of the list beforehand, so we cannot use a C-style or non-expandable array to represent our hash table. This is where vectors come into play (or ArrayList if you're from Java).
A crash course on vectors: if there is no more space in the current array, make a new array of double the size and copy the existing elements into it. More formally, if we want to insert the (n+1)-th element into an array of size n, a new array of size 2n is created. On the next overflow an array of size 4n is created, and so on.
The following code maps the values of a linked list into a hash table (here used to count how many times each value occurs):
#include <iostream>
#include <vector>

struct Node { int val; Node* next; };

// Count how many times each value occurs in the list (assumes non-negative values).
void map(Node* root) {
    std::vector<int> hash;
    while (root) {
        if (root->val >= (int)hash.size())
            hash.resize(root->val + 1, 0); // grow the table to fit the new key
        hash[root->val]++;
        root = root->next;
    }
    for (size_t i = 0; i < hash.size(); i++)
        std::cout << hash[i] << " ";
}

Coredata performance: is there a penalty for loading many individual entities?

I'm working on an app that will collect a set of points from CLLocationManager and draw them as a path on a map. I'll never really have a need for each point as an individual entity; they only have meaning in the context of the path.
Instead of creating a model representing the points, I could just store the path as a big JSON (or other more efficient string format) and thereby read only the single entity when it's time to pull the data out. It seems to me this could save overhead, is that true?
This is something that would need some testing. Fetching the path directly, with its points embedded, is probably faster than fetching all the points that belong to a certain path, but the part about writing them into strings seems a bit off: parsing those strings will be slow (JSON being a string).
For saving the points into paths I would suggest one of two approaches. One is to add a point entity that is linked to the path through a relationship. The alternative is to use transformable (binary) data: each point is represented by 2 or 3 double values, which can be written directly into a buffer (NSData, for instance). The length of the saved data then defines the number of points as data.length/(sizeof(double)*dimensions). This is very easy to do in Objective-C, while in Swift you may lose some hair working with raw data and unsafe pointers.
It really depends on what you are implementing, but if you plan to have very many paths in the database you can still expect a large delay when fetching the data. You might want to consider creating sectors. Each sector would be described by the same data as a region (MKCoordinateRegion); on database initialization you would iterate over the whole earth to create the sectors. Then, when inserting a path, you check which regions the path intersects and assign the path to those regions (a many-to-many relationship). When you show the map, you check which regions are visible, fetch only those regions, and then extract the paths from them.

C linked list or hash table for matrix operations

I have a matrix in C with size m x n. The size isn't known in advance. I need to support operations on the matrix such as: delete the first element and find the i-th element (the size wouldn't be too big, from 10 to 50 columns). What is more efficient to use, a linked list or a hash table? And how can I map a column of the matrix to one element of the linked list or hash table, depending on which I choose to use?
Thanks
Linked lists don't provide very good random access, so from that perspective, you might not want to look in to using them to represent a matrix, since your lookup time will take a hit for each element you attempt to find.
Hashtables are very good for looking up elements, as they can provide near constant-time lookup for any given key, assuming the hash function is decent (using a well-established hashtable implementation would be wise).
Provided with the constraints that you have given though, a hashtable of linked lists might be a suitable solution, though it would still present you with the problem of finding the ith element, as you'd still need to iterate through each linked list to find the element you want. This would give you O(1) lookup for the row, but O(n) for the column, where n is the column count.
Furthermore, this is difficult because you'd have to make sure EVERY list in your hashtable is updated with the appropriate number of nodes as the number of columns grows/shrinks, so you're not buying yourself much in terms of space complexity.
A 2D array is probably best suited for representing a matrix, where you provide some capability of allowing the matrix to grow by efficiently managing memory allocation and copying.
An alternate method would be to look at something like the std::vector in lieu of the linked list, which acts like an array in that it's contiguous in memory, but will allow you the flexibility of dynamically growing in size.
If it's for lookup work, use a hash table; the average runtime would be O(1).
For deletion/get/set at given indices in O(1), a 2D array would be optimal.
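To make the 2D-array recommendation concrete, here is a rough sketch (in Java rather than C, with hypothetical names and a flat row-major reading of "i-th element"): the matrix is stored as one growable list of fixed-length rows, so the i-th element is reached by index arithmetic in O(1), and deleting the first element just advances a logical offset.

import java.util.ArrayList;
import java.util.List;

// Sketch only: a growable matrix stored as a list of fixed-length rows.
class Matrix {
    private final int cols;
    private final List<double[]> rows = new ArrayList<>();
    private int offset = 0; // how many leading elements have been logically deleted

    Matrix(int cols) { this.cols = cols; }

    // Append one row; the backing list grows itself (amortised O(1)).
    void addRow(double[] row) { rows.add(row); }

    // Find the i-th element (flat row-major index, counted after deletions) in O(1).
    double get(int i) {
        int flat = i + offset;
        return rows.get(flat / cols)[flat % cols];
    }

    // "Delete" the first element in O(1) by moving the logical start forward
    // instead of shifting everything; fully consumed rows could be dropped lazily.
    void deleteFirst() { offset++; }
}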

Why does a hash table take up more memory than other data-structures?

I've been doing some reading about hash tables, dictionaries, etc. All the literature and videos that I have read or watched describe hash tables as having a space/time trade-off property.
I am struggling to understand why a hash table takes up more space than, say, an array or a list with the same number of total elements (values)? Does it have something to do with actually storing the hashed keys?
As far as I understand and in basic terms, a hash table takes a key identifier (say some string), passes it through some hashing function, which spits out an index to an array or some other data-structure. Apart from the obvious memory usage to store your objects (values) in the array or table, why does a hash table use up more space? I feel like I am missing something obvious...
Like you say, it's all about the trade-off between lookup time and space. The larger the number of spaces (buckets) the underlying data structure has, the greater the number of locations the hash function has where it can potentially store each item, and so the chance of a collision (and therefore worse than constant-time performance) is reduced. However, having more buckets obviously means more space is required. The ratio of number of items to number of buckets is known as the load factor, and is explained in more detail in this question: What is the significance of load factor in HashMap?
In the case of a minimal perfect hash function, you can achieve O(1) performance storing n items in n buckets (a load factor of 1).
As you mentioned, the underlying structure of a hash table is an array, which is the most basic type in the data-structure world.
To make a hash table fast and support O(1) operations, the underlying array's capacity must be more than enough. The load factor is the term used to evaluate this: it is the ratio of the number of elements in the hash table to the number of cells in the table, i.e. it measures how full the table is.
To keep the hash table fast, the load factor can't be greater than some threshold value. For example, with the quadratic probing collision-resolution method, the load factor should not be greater than 0.5. When the load factor approaches 0.5 while inserting new elements, we need to rehash the table to meet the requirement.
So a hash table's high run-time performance is bought with more space usage. This is the time and space trade-off.
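As a small sketch of that rule (a hypothetical class with integer keys, using a triangular variant of quadratic probing), the table below doubles and rehashes as soon as an insert would push the load factor above 0.5:

// Sketch only: an open-addressing hash set with quadratic (triangular) probing.
// It keeps the load factor at or below 0.5 by doubling the array and rehashing every key.
class ProbingHashSet {
    private Integer[] table = new Integer[16];
    private int size = 0;

    void add(int key) {
        if (size + 1 > table.length / 2) rehash();     // keep the load factor <= 0.5
        if (insert(table, key)) size++;
    }

    boolean contains(int key) {
        for (int i = 0; i < table.length; i++) {
            int slot = Math.floorMod(key + i * (i + 1) / 2, table.length);
            if (table[slot] == null) return false;      // hit an empty cell: not present
            if (table[slot] == key) return true;
        }
        return false;
    }

    private void rehash() {
        Integer[] old = table;
        table = new Integer[old.length * 2];            // twice the cells, half the load factor
        for (Integer k : old)
            if (k != null) insert(table, k);            // every key gets a new slot
    }

    // Probe slots h, h+1, h+3, h+6, ... (triangular offsets) until a free cell is found.
    private static boolean insert(Integer[] t, int key) {
        for (int i = 0; ; i++) {
            int slot = Math.floorMod(key + i * (i + 1) / 2, t.length);
            if (t[slot] == null) { t[slot] = key; return true; }
            if (t[slot] == key) return false;           // already present
        }
    }
}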

Given collection of points and polygons, determine which point lies in which polygon (or not)

My question is similar to this one, but in my case the polygons are not necessarily touching/overlapping each other. They are spread all over the space.
I have a big set of such polygons. Similarly, I have a huge set of points. I am currently running a RoR module that takes 1 point at a time and checks the intersection with respect to 1 polygon at a time. The database is PostGIS. The performance is quite slow.
Is there a faster or optimal way of doing this?
This can be done as one select statement, but for performance, look into a GiST index on your polygons. For simplicity, let's say I have a table with a polygon field (geom data type) and a table with a point field (geom data type). If you are checking a list of points against a list of polygons, do a cross join so that each polygon and each point are compared.
select *
from t1 inner join t2 on 1=1
where st_contains(t1.poly,t2.point) = 't'
(Modified to include the table join example. I'm using a cross join, which means every polygon will be joined to every point and compared. If we're talking about a large record set, get those GiST indexes going.)
I'm currently doing this to locate a few million points within a few hundred polygons. If you have overlapping polygons, this will return multiple rows for every point that's located in 2 or more polygons.
This may fail depending on the data type your points are stored as. If they are in a geom field, it'll flow fine. If you are using text values, you'll need to use st_geomfromtext to turn your characters into a point. This will look more like:
st_contains(poly, st_geomfromtext('POINT('||lon||' ' ||lat ||')')) = 't'
I used a lat/lon example; the only thing to watch for here is that st_geomfromtext requires you to build the point string from your fields using the || concatenation operator. Let me know if you need assistance with the st_geomfromtext concept.
