spatial set operations in Apache Spark - spatial

Has anyone been able to do spatial operations with #ApacheSpark? e.g. intersection of two sets that contain line segments?
I would like to intersect two sets of lines.
Here is a 1-dimensional example:
The two sets are:
A = {(1,4), (5,9), (10,17),(18,20)}
B = {(2,5), (6,9), (10,15),(16,20)}
The result intersection would be:
intersection(A,B) = {(1,1), (2,4), (5,5), (6,9), (10,15), (16,17), (18,20)}
A few more details:
- sets have ~3 million items
- the lines in a set cover the entire range
Thanks.

One approach to parallelize this would be to create a grid of some size, and group line segments by the grids they belong to.
So for a grid of size n, you could flatMap pairs of coordinates (the two endpoints of each line segment) to create (gridId, ((x,y), (x,y))) key-value pairs.
The segment (1,3), (5,9) would be mapped to ( (1,1), ((1,3),(5,9)) ) for a grid size of 10: that line segment only exists in grid "slot" 1,1 (the grid from 0-10,0-10). If you chose a smaller grid size, the line segment would be flatMapped to multiple key-value pairs, one for each grid slot it belongs to.
Having done that, you can groupByKey, and for each group, calculate intersections as normal.
It wouldn't exactly be the most efficient way of doing things, especially if you've got long line segments spanning multiple grid "slots", but it's a simple way of splitting the problem into subproblems that'll fit in memory.
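As a rough illustration of the bucketing idea, here is a minimal single-machine sketch in plain Python for the 1-dimensional case from the question. It is not Spark code: the generator stands in for flatMap and the dictionary stands in for groupByKey. The function names, the choice of closed integer intervals, and the rule of reporting an overlap only in the cell that holds its start point (to avoid duplicates) are all assumptions made for this example.

```python
from collections import defaultdict

def bucket(intervals, tag, cell_size):
    # Stand-in for flatMap: emit one (cell_id, (tag, interval)) pair
    # for every grid cell the interval touches.
    for start, end in intervals:
        for cell_id in range(start // cell_size, end // cell_size + 1):
            yield cell_id, (tag, (start, end))

def grid_overlaps(a, b, cell_size):
    # Stand-in for groupByKey: collect the A and B intervals of each cell.
    groups = defaultdict(lambda: {"A": [], "B": []})
    for source, tag in ((a, "A"), (b, "B")):
        for cell_id, (t, interval) in bucket(source, tag, cell_size):
            groups[cell_id][t].append(interval)

    result = []
    for cell_id, group in groups.items():
        # Within one cell, compare every A interval with every B interval.
        for a_start, a_end in group["A"]:
            for b_start, b_end in group["B"]:
                lo, hi = max(a_start, b_start), min(a_end, b_end)
                # Intervals spanning several cells meet more than once;
                # keep the overlap only in the cell containing its start.
                if lo <= hi and lo // cell_size == cell_id:
                    result.append((lo, hi))
    return sorted(result)

A = [(1, 4), (5, 9), (10, 17), (18, 20)]
B = [(2, 5), (6, 9), (10, 15), (16, 20)]
print(grid_overlaps(A, B, cell_size=5))
```

The per-cell loop is still quadratic, so the cell size trades the number of groups against the work inside each group.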

You could solve this with a full cartesian join of the two RDDs, but this would become incredibly slow at large scale. If your problem is smallish, sure, this is an easy and cheap approach. Just emit the overlap, if any, between every pair in the join.
To do better, I imagine that you can solve this by sorting the sets by start point, and then walking through both at the same time, matching one's current interval versus another and emitting overlaps. Details left to the reader.
You can almost solve this by first mapping each tuple (x,y) in A to something like ((x,y),'A') or something, and the same for B, and then taking the union and sortBy the x values. Then you can mapPartitions to encounter a stream of labeled segments and implement your algorithm.
This doesn't quite work though since you would miss overlaps between values at the ends of partitions. I can't think of a good simple way to take care of that off the top of my head.
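For reference, the "sort and walk through both" idea looks roughly like this on a single machine. This is a sketch in plain Python, not Spark code, and it does not solve the partition-boundary problem mentioned above. It assumes closed intervals and that the intervals within each set do not overlap one another; the function name is made up for the example.

```python
def sweep_overlaps(a, b):
    # Sort both sets by start point, then advance two cursors in step.
    a, b = sorted(a), sorted(b)
    i = j = 0
    overlaps = []
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            overlaps.append((lo, hi))
        # Drop whichever interval ends first; it cannot overlap anything later.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return overlaps

A = [(1, 4), (5, 9), (10, 17), (18, 20)]
B = [(2, 5), (6, 9), (10, 15), (16, 20)]
print(sweep_overlaps(A, B))
```

After the sort, the walk itself is linear in the total number of intervals.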


Tableau: How to display size as the number of records for the same lat,long

On a Tableau Desktop map, I want the size of the dot at each (lat,long) to be proportional to the number of records at that location. I tried the built-in COUNT/SUM(Number of Records) measure, and I also created a SeqId and tried COUNT(SeqId) for Size; neither worked. Here is a sample of my data. As you can see:
(44.92810490,-74.89186500) has one Record
(44.69948730,-73.45291240) has five Records
(44.72143010,-73.72375280) has 10 records
I would like each point to be proportional to the number of records at that location. Help is much appreciated.
Musa
Seq Id,Census,Gender,Lat,Long
1,1860,F,44.92810490,-74.89186500
2,1870,M,44.69948730,-73.45291240
3,1870,F,44.69948730,-73.45291240
4,1870,M,44.69948730,-73.45291240
5,1870,F,44.69948730,-73.45291240
6,1870,F,44.69948730,-73.45291240
7,1870,M,44.72143010,-73.72375280
8,1870,M,44.72143010,-73.72375280
9,1870,M,44.72143010,-73.72375280
10,1870,M,44.72143010,-73.72375280
11,1870,M,44.72143010,-73.72375280
12,1870,M,44.72143010,-73.72375280
13,1870,M,44.72143010,-73.72375280
14,1870,M,44.72143010,-73.72375280
15,1870,M,44.72143010,-73.72375280
16,1870,M,44.72143010,-73.72375280
Can you try this?
Create a calculated field "Geo" with this definition:
IFNULL(STR([Lat]),"") + "," + IFNULL(STR([Long]),"")
Place this field on the Size mark as COUNT([Geo]).
This should give you the desired result.
Put Latitude on the Rows shelf, then right-click the pill and convert it to a dimension. Make sure it stays continuous.
Likewise, put Longitude on the Columns shelf and convert it to a dimension.
Put SUM(Number of Records) on the Size shelf.
Important: don't put any other dimensions on any shelves, and leave SeqId off.
This approach will make one mark for each unique latitude/longitude pair and size that mark according to how many times that pair appears in the data set.
A problem you will probably notice is that two latitudes that differ only in the final decimal place are treated as distinct latitudes. That may not make the most useful visualization. You can bin nearby latitudes together by making a calculated field to round values to the degree you wish. If you do that, be sure to make your field a continuous dimension, and also set its geographic role. It has the effect of snapping lat/long pairs to a grid. As an alternative to rounding, you can look into the hexbinx() and hexbiny() functions.
For a heat map based on square or hex grids, you may want to try using (partially transparent) colors instead of size to indicate density.
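To make the aggregation concrete outside Tableau, here is a small Python sketch of what "one mark per unique latitude/longitude pair, sized by record count" amounts to, with optional rounding to snap nearby points to a grid. The function name and the `decimals` parameter are invented for this illustration; the coordinates are taken from the sample data in the question.

```python
from collections import Counter

def count_by_location(points, decimals=None):
    # Count how many records share each (lat, long) pair.
    # If decimals is given, round first so nearby points fall into one bin.
    counts = Counter()
    for lat, lon in points:
        if decimals is not None:
            lat, lon = round(lat, decimals), round(lon, decimals)
        counts[(lat, lon)] += 1
    return counts

points = (
    [(44.92810490, -74.89186500)]
    + [(44.69948730, -73.45291240)] * 5
    + [(44.72143010, -73.72375280)] * 10
)
for location, n in count_by_location(points).items():
    print(location, n)
```

Each key corresponds to one mark on the map, and the count is the value that drives its size.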

OpenCV: Generating points from image after thinning

I've run into an issue concerning generating floating-point coordinates from an image.
The original problem is as follows:
the input image is handwritten text. From this I want to generate a set of points (just x,y coordinates) that make up the individual characters.
At first I used findContours to generate the points. Since this finds the edges of the characters, the image first needs to be run through a thinning algorithm, because I'm not interested in the shape of the characters, only the lines or, as in this case, points.
Input:
thinning:
So, I run my input through the thinning algorithm and all is fine; the output looks good. Running findContours on this, however, does not work out so well: it skips a lot of stuff and I end up with something unusable.
The second idea was to generate bounding boxes (with findContours), use these bounding boxes to grab the characters from the thinned image, collect all non-white pixel indices as "points", and offset them by the bounding box position. This generates even worse output and seems like a bad method.
Horrible code for this:
Mat temp = new Mat(edges, bb);
byte roi_buff[] = new byte[(int) (temp.total() * temp.channels())];
temp.get(0, 0, roi_buff);
int COLS = temp.cols();
List<Point> preArrayList = new ArrayList<Point>();
for (int i = 0; i < roi_buff.length; i++)
{
    if (roi_buff[i] != 0)
    {
        Point tempP = bb.tl();
        tempP.x += i % COLS;
        tempP.y += i / COLS;
        preArrayList.add(tempP);
    }
}
Are there any alternatives, or am I overlooking something?
UPDATE:
I overlooked the fact that I need the points (pixels) to be ordered. In the method above I simply do scanline approach to grabbing all the pixels. If you look at the 'o' for example, it would grab first the point on the left hand side, then the one on the right hand side. I would need them to be ordered by their neighbouring pixels since I want to draw paths with the points later on (outside of opencv).
Is this possible?
You should look into implementing your own connected components labelling. The concept is very simple: you scan the first line and assign unique labels to each horizontally connected strip of pixels. You basically check for every pixel if it is connected to its left neighbour and assign it either that neighbour's label or a new label. In the second row you do the same, but you also check against the pixels above it. Sometimes you need a label merge: two strips that were not connected in the previous row are joined in the current row. The way to deal with this is either to keep a list of label equivalences or use pointers to labels (so you can easily do a complete label change for an object).
This is basically what findContours does, but if you implement it yourself you have the freedom to go for 8-connectedness and even bridge a single-pixel or two-pixel gap. That way you get "almost-connected components labelling". It looks like you need this for the "w" in your example picture.
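A minimal sketch of the two-pass labelling described above, written in plain Python on a list-of-lists binary image rather than an OpenCV Mat. It uses 8-connectedness and keeps label equivalences in a small union-find table; gap bridging is not included. The function name and the image representation are assumptions made for this example.

```python
def label_components(image):
    # image: list of rows containing 0 (background) or 1 (foreground).
    # Returns a same-shaped list of labels, with 0 for background.
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # parent[k] is the equivalence parent of label k

    def find(k):
        # Follow equivalences to the representative label.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            # Already-scanned 8-neighbours: left, and the three pixels above.
            found = []
            for dr, dc in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                nr, nc = r + dr, c + dc
                if nr >= 0 and 0 <= nc < cols and labels[nr][nc]:
                    found.append(find(labels[nr][nc]))
            if found:
                root = min(found)
                for k in found:
                    parent[k] = root  # label merge
                labels[r][c] = root
            else:
                parent.append(len(parent))  # new label is its own parent
                labels[r][c] = len(parent) - 1

    # Second pass: replace every label by its representative.
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                labels[r][c] = find(labels[r][c])
    return labels

u_shape = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
]
print(label_components(u_shape))
```

In the U-shaped example the two vertical strokes start with different labels and are merged when the bottom row joins them.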
Once you have the image labelled this way, you can push all the pixels of a single label to a vector, and order them something like this. Find the top left pixel, push it to a new vector and erase it from the original vector. Now find the pixel in the original vector closest to it, push it to the new vector and erase from the original. Continue until all pixels have been transferred.
It will not be very fast this way, but it should be a start.
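The ordering step can be sketched as a greedy nearest-neighbour walk. This is a plain Python illustration, assuming the pixels of one label are given as (x, y) tuples; the function name is made up. As noted above, it is quadratic in the number of pixels.

```python
def order_pixels(pixels):
    # Greedy ordering: start at the top-left pixel, then repeatedly
    # move to the closest pixel that has not been used yet.
    remaining = list(pixels)
    # Top-left is taken here as smallest y, then smallest x.
    current = min(remaining, key=lambda p: (p[1], p[0]))
    remaining.remove(current)
    ordered = [current]
    while remaining:
        current = min(
            remaining,
            key=lambda p: (p[0] - current[0]) ** 2 + (p[1] - current[1]) ** 2,
        )
        remaining.remove(current)
        ordered.append(current)
    return ordered

print(order_pixels([(2, 0), (0, 0), (1, 0), (2, 1)]))
```

On closed shapes such as an 'o' the walk follows the stroke around, but at junctions it may jump between branches, so the result is a starting point rather than a guaranteed path.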

Boardgame-Map with crossroads etc

I have a little logical problem over here.
As the title says, I am trying to build a board game as a computer program (maybe with internet support, but that's another story).
For now, I have a map with some crossroads in it, so I cannot simply number the fields '1, 2, 3, 4, ...': if there is a crossroad at field 10, more than one field would have to be labeled 11 (because there is a field to the left and to the right of field 10, for example).
So the problem is: if I cannot define the board with numbers, then I cannot simply get the possible positions a player can reach on a 2d6 roll by calculating 'Field-Nr. + RandomRange(1,6) + RandomRange(1,6)'.
Does anybody have an idea how to define a map like this in another way, so that I can still calculate the possible new fields for Player X after a 2d6 roll?
Thanks in advance.
If I understand correctly (I don't think I do), this might help you. Just use dynamic arrays for your board game field and change your actions according to the dimensions x,y. Look at this: "type Name = array of {array of ...} Base type; // Dynamic array"
It sounds like you have a graph of connected vertices. When a player is at a particular vertex with N edges, assuming N < 12, the new field will be reached by traversing edge number ( rand(6) + rand(6) ) % N.
You could also just do rand(12), but that would have an even distribution, unlike 2d6.
Instead of dynamic arrays, I would recommend using a linked-list of records to describe the surrounding cells, and traverse the player's location and possible moves using that linked-list.
First, define a record that describes each cell in your board's playable grid (the cells on the grid can be four-sided like a chessboard, or hexagonal like in Civilization V) ... each cell record should contain info such as coordinates, which players are also in that cell, any rewards/hazards/etc that would affect gameplay, etc. (you get the idea).
Finally, the linked-list joins all of these cells, effectively pointing to any connected cells. That way, all you'd need is the cell location of Player X and calculate possible moves over n amount of cells (determined by the dice roll), traversing the adjoining cells (that don't have hazards, for example).
If all you want is to track the possible roads, you can also use this approach to identify possible paths (instead of cells) Player X can travel on.
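A small sketch of the linked-cells idea, using a Python dictionary as the adjacency structure in place of linked records. The board layout, the function names, and the rule that a player may not step straight back onto the cell just left are assumptions for this example, not rules taken from the question.

```python
import random

def roll_2d6():
    # Two six-sided dice, so 7 is more likely than 2 or 12.
    return random.randint(1, 6) + random.randint(1, 6)

def reachable(board, start, steps):
    # board maps each cell to the list of cells it is connected to.
    # Track (cell, previous cell) so a move never reverses immediately.
    frontier = {(start, None)}
    for _ in range(steps):
        frontier = {
            (nxt, cell)
            for cell, prev in frontier
            for nxt in board[cell]
            if nxt != prev
        }
    return {cell for cell, _ in frontier}

# A tiny board with a crossroad at cell 3 (branches via 4 and via 5).
board = {
    1: [2],
    2: [1, 3],
    3: [2, 4, 5],
    4: [3, 6],
    5: [3, 7],
    6: [4],
    7: [5],
}
print(reachable(board, start=1, steps=3))
```

With this rule a player who runs into a dead end before using up the roll has no legal destination on that path; a real game would need its own rule for that case.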

Efficient Way to Zero Rows with OpenCV

I am using OpenCV 2.2 on Windows 7.
I am making a mask where the rows are all 1 up to row 400 and 0 for rows beyond that. I initialize the mask with cv::Mat::ones() and was wondering what would be the most efficient way to zero the rows beyond 400. I could use for loops but was wondering if there was a more efficient, tidier way to do it.
Thanks,
Peter.
There is more than one way to do it:
First, sub-matrices
Mat bigImg(height, width, CV_8UC3); // the Mat constructor takes (rows, cols, type)
bigImg(Rect(0,0,width, height/2)) = Scalar::all(1); // upper half ones
bigImg(Rect(0,height/2,width, height/2)) = Scalar::all(0); // lower half zeros
Or you can use the RowRange and ColRange for the same effect
bigImg(rowRange, colRange) = Scalar::all(n);
Just check the docs on how to use ranges
The only way I know of is, for an n x m mask, to create a 400 x m matrix with cv::Mat::ones() and an (n-400) x m matrix with cv::Mat::zeros() and then join the two together. However, this has the overhead of making the two matrices and then resizing one to be big enough to contain the other.
I think looping is definitely more efficient. It's C/C++ anyway I assume, and that's about the fastest way for this particular sort of operation.

How to detect if a frame is odd or even on an interlaced image?

I have a device that is taking TV screenshots at precise times (it doesn't take incomplete frames).
Still, this screenshot is an interlaced image made from two different original frames.
Now, the question is whether, and how, it is possible to identify which of the lines are newer/older.
I have to mention that I can take several sequential screenshots if needed.
Take two screenshots one after another, yielding a sequence of two images (1,2). Split each screenshot into two fields (odd and even) and treat each field as a separate image. If you assume that the images are interlaced consistently (pretty safe assumption, otherwise they would look horrible), then there are two possibilities: (1e, 1o, 2e, 2o) or (1o, 1e, 2o, 2e). So at the moment it's 50-50.
What you could then do is use optical flow to improve your chances. Say you go with the first option: (1e, 1o, 2e, 2o). Calculate the optical flow f1 between (1e, 2e). Then calculate the flow f2 between (1e, 1o) and f3 between (1o, 2e). If f1 is approximately the same as f2 + f3, then things are moving in the right direction and you've picked the right arrangement. Otherwise, try the other arrangement.
Optical flow is a pretty general approach and can be difficult to compute for the entire image. If you want to do things in a hurry, replace optical flow with video tracking.
EDIT
I've been playing around with some code that can do this cheaply. I've noticed that if 3 fields are consecutive and in the correct order, the absolute error due to smooth, constant motion will be minimized. Conversely, if they are out of order (or not consecutive), this error will be greater. So one way to do this is to take groups of 3 fields, check the error for each of the two orderings described above, and go with the ordering that yields the lower error.
I've only got a handful of interlaced videos here to test with, but it seems to work. The only downside is that it's not very effective unless there is substantial smooth motion, or when the number of frames used is low (fewer than 20-30).
Here's an interlaced frame:
Here's some sample output from my method (same frame):
The top image is the odd-numbered rows. The bottom image is the even-numbered rows. The number in the brackets is the number of times that image was picked as the most recent. The number to the right of that is the error. The odd rows are labeled as the most recent in this case because the error is lower than for the even-numbered rows. You can see that out of 100 frames, it (correctly) judged the odd-numbered rows to be the most recent 80 times.
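The code used above was not posted, so the following is only a guess at how such a three-field check could look. It assumes that the "error" is the summed absolute difference between the middle field and the average of the two fields around it, which is zero for perfectly smooth, constant change. Frames are plain lists of pixel rows, and all names are invented for this sketch.

```python
def split_fields(frame):
    # Even-numbered rows and odd-numbered rows as two separate images.
    return frame[0::2], frame[1::2]

def motion_error(first, middle, last):
    # How far the middle field is from the average of its neighbours.
    total = 0.0
    for row_f, row_m, row_l in zip(first, middle, last):
        for f, m, l in zip(row_f, row_m, row_l):
            total += abs(m - (f + l) / 2.0)
    return total

def even_field_is_older(frame1, frame2):
    e1, o1 = split_fields(frame1)
    e2, o2 = split_fields(frame2)
    # Ordering (1e, 1o, 2e, 2o): use the triple 1e, 1o, 2e.
    error_even_first = motion_error(e1, o1, e2)
    # Ordering (1o, 1e, 2o, 2e): use the triple 1o, 1e, 2o.
    error_odd_first = motion_error(o1, e1, o2)
    return error_even_first < error_odd_first

# Synthetic test: brightness rises by 1 per field, even rows captured first.
frame1 = [[0, 0], [1, 1], [0, 0], [1, 1]]
frame2 = [[2, 2], [3, 3], [2, 2], [3, 3]]
print(even_field_is_older(frame1, frame2))
```

Running the comparison over many frame pairs and taking the majority vote mirrors the counting described above.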
You have several fields, F1, F2, F3, F4, etc. Weave F1-F2 for the hypothesis that F1 is an even field. Weave F2-F3 for the hypothesis that F2 is an even field. Now measure the amount of combing in each frame. Assuming that there is motion, there will be some combing with the correct interlacing but more combing with the wrong interlacing. You will have to do this at several times in order to find some fields when there is motion.
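As a sketch of the weave-and-measure idea, here are two helper functions in plain Python. The combing metric used here (how far each interior row is from the average of the rows directly above and below it) is one simple choice among several and is an assumption, as are the function names and the list-of-rows field format.

```python
def weave(even_field, odd_field):
    # Interleave two fields into one frame: even rows first.
    frame = []
    for even_row, odd_row in zip(even_field, odd_field):
        frame.append(even_row)
        frame.append(odd_row)
    return frame

def combing(frame):
    # Sum over interior rows of |2*row - above - below|.
    # Alternating bright/dark lines (combing) give a large value.
    total = 0
    for above, row, below in zip(frame, frame[1:], frame[2:]):
        for a, x, b in zip(above, row, below):
            total += abs(2 * x - a - b)
    return total

smooth = weave([[5, 5], [5, 5]], [[5, 5], [5, 5]])
combed = weave([[0, 0], [0, 0]], [[9, 9], [9, 9]])
print(combing(smooth), combing(combed))
```

Comparing `combing(weave(F1, F2))` with `combing(weave(F2, F3))` over several field triples then gives the evidence for one hypothesis or the other.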
