Calculating the bearing change from latitude/longitude - geolocation

I have a dataset of GPS logs that also contains GPS speeds. Here is what the dataset looks like:
id | gpstime | lat | lon | speed
--------+------------+------------+------------+---------
157934 | 1530099776 | 41.1825026 | -8.5996864 | 3.40901
157934 | 1530099777 | 41.1825114 | -8.599722 | 3.43062
157934 | 1530099778 | 41.1825233 | -8.5997594 | 3.45739
157934 | 1530099779 | 41.1825374 | -8.5997959 | 3.40025
157934 | 1530099780 | 41.1825519 | -8.5998337 | 3.41673
(5 rows)
Now I want to compute the bearing change for each point with respect to true north.
But I have a few questions I have not yet found answers to:
From my reading, I came across this formula (as in this answer):
Bearing = atan(y,x)
where x and y are the quantities
y = sin(Blon-Alon) * cosBlat
x = cosAlat * sinBlat -sinAlat * cosBlat * cos(Blon-Alon)
for points A and B. Another source (the formula here) writes it as:
Bearing = atan2(y,x)
So I'm confused: which of the two formulas should I use?
lat and lon should be converted from degrees to radians before being passed to the quantities x and y. Since the lon values in my dataset are negative, should I take the absolute value of each?

I think for GPS tracks this would be overkill. If the distance between two points is not too big (let's say a few hundred meters), I assume this simplified calculation is sufficient.
The latitude/longitude differences are approximately
Δlat = 111km * (lat1 - lat2)
Δlon = 111km * cos(lat) * (lon1 - lon2)
So bearing would be
bearing = atan(Δlon / Δlat) * 180/π
bearing = atan(cos(lat) * (lon1 - lon2) / (lat1 - lat2)) * 180/ACOS(-1)
For lat, use either lat1 or lat2, or the midpoint if you like:
lat = (lat1 + lat2)/2 * π/180 = (lat1 + lat2)/2 * ACOS(-1)/180
Keep in mind that Δlat or Δlon could be 0; a zero Δlat makes the division above undefined.
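As a rough sketch of both approaches in Python (the function names `bearing_flat` and `bearing_great_circle` are made up for this illustration): using `atan2` instead of `atan` keeps the result in the correct quadrant and avoids the division when Δlat is 0. Longitudes keep their sign; they are only converted from degrees to radians. The sketch measures the bearing from point A to point B, so differences are taken as B minus A.

```python
from math import atan2, cos, degrees, radians, sin


def bearing_flat(lat_a, lon_a, lat_b, lon_b):
    """Approximate bearing A -> B, degrees clockwise from north (short distances only)."""
    mid_lat = radians((lat_a + lat_b) / 2)
    d_lat = lat_b - lat_a
    d_lon = (lon_b - lon_a) * cos(mid_lat)
    # atan2(east, north) copes with d_lat == 0 and picks the right quadrant.
    return degrees(atan2(d_lon, d_lat)) % 360


def bearing_great_circle(lat_a, lon_a, lat_b, lon_b):
    """Initial bearing A -> B from the spherical formula quoted in the question."""
    lat_a, lon_a, lat_b, lon_b = map(radians, (lat_a, lon_a, lat_b, lon_b))
    d_lon = lon_b - lon_a
    y = sin(d_lon) * cos(lat_b)
    x = cos(lat_a) * sin(lat_b) - sin(lat_a) * cos(lat_b) * cos(d_lon)
    return degrees(atan2(y, x)) % 360


# First two rows of the dataset shown in the question.
a = (41.1825026, -8.5996864)
b = (41.1825114, -8.599722)
flat = bearing_flat(*a, *b)
exact = bearing_great_circle(*a, *b)
print(flat, exact)
```

For points this close together the two results should agree to a small fraction of a degree, which is the point of the simplification.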


How can I efficiently transform a two-column range into an expanded table?

I'm trying to use geo IP data in Snowflake. This involves several things:
1) A source table with a CIDR IP range and a geoname_ID and its lat/long coords
2) I've used the parse_ip function and extracted the range_start and range_end values as simple integer columns in the ipv4 0-4.2bn range. Some ranges consist of 1 IP, some may have as many as 16.7 million.
So, the 3.1 million rows in the intermediary table data look something like this :
RANGE_START RANGE_END GEONAME_ID LATITUDE LONGITUDE
214690946 214690946 4556793 39.84980011 -75.37470245
214690947 214690947 6252001 37.75099945 -97.82199860
214690948 214690951 6252001 37.75099945 -97.82199860
214690952 214690959 6252001 37.75099945 -97.82199860
214690960 214690975 6252001 37.75099945 -97.82199860
As you can see, a geoname ID can have multiple ranges associated with it.
The problem is that joining an IP (parsed into an integer value) with this table requires non-equality joins, which are painfully slow in Snowflake at the moment (about 1000x slower, empirically). So I would like to expand the table above to have one row per IP in each range, i.e. the last row, with the range 214690960 to 214690975, would turn into 16 rows, while preserving the geoname ID and lat/long for each of the new rows. The only way I could think of to do this was a non-equi join to a generator table, but this took 30 minutes on a 3XL for 1000 rows, generating about 1.2m result rows. I have 3.1 million rows to flatten, so that won't work.
Any ideas, anyone?
Here is what I tried so far:
create OR REPLACE table GENERATOR_TABLE (IP INT);
INSERT INTO GENERATOR_TABLE SELECT ROW_NUMBER() over (ORDER BY NULL) AS IP FROM TABLE(GENERATOR(ROWCOUNT => 4228250627)) ORDER BY IP;
create or replace table GEO_INTERMEDIARY as
(select network_parsed:ipv4_range_start::number as range_start, network_parsed:"ipv4_range_end"::number range_end, geoname_id, latitude, longitude from GEO_SOURCE order by range_start, range_end);
CREATE OR REPLACE TABLE EXPANDED_GEO AS
select * from (select * from GEO_INTERMEDIARY order by geoname_id limit 1000 offset 0) A
JOIN GENERATOR_TABLE B ON B.IP >= A.RANGE_START AND B.IP <= A.RANGE_END
ORDER BY IP;
For such a pattern you could indeed try using a generator, but I usually end up using JavaScript UDTFs.
Here's an example function and usage on your data:
create or replace table x(
RANGE_START int,
RANGE_END int,
GEONAME_ID int,
LATITUDE double,
LONGITUDE double
) as
select * from values
(214690946,214690946,4556793,39.84980011,-75.37470245),
(214690947,214690947,6252001,37.75099945,-97.82199860),
(214690948,214690951,6252001,37.75099945,-97.82199860);
create or replace function magic(
range_start double,
range_end double,
geoname_id double,
latitude double,
longitude double
)
returns table (
ip double,
geoname_id double,
latitude double,
longitude double
) language javascript as
$$
{
processRow: function(row, rowWriter, context) {
let start = row.RANGE_START
let end = row.RANGE_END
while (start <= end) {
rowWriter.writeRow({
IP: start,
GEONAME_ID: row.GEONAME_ID,
LATITUDE: row.LATITUDE,
LONGITUDE: row.LONGITUDE,
});
start++;
}
}
}
$$;
select m.* from x,
table(magic(range_start::double, range_end::double,
geoname_id::double, latitude, longitude)) m;
-----------+------------+-------------+--------------+
IP | GEONAME_ID | LATITUDE | LONGITUDE |
-----------+------------+-------------+--------------+
214690946 | 4556793 | 39.84980011 | -75.37470245 |
214690947 | 6252001 | 37.75099945 | -97.8219986 |
214690948 | 6252001 | 37.75099945 | -97.8219986 |
214690949 | 6252001 | 37.75099945 | -97.8219986 |
214690950 | 6252001 | 37.75099945 | -97.8219986 |
214690951 | 6252001 | 37.75099945 | -97.8219986 |
-----------+------------+-------------+--------------+
The only gotcha here is that JS only supports double types, but for this data that's OK: you will not see any precision loss.
I tested it on 1M ranges producing 10M IPs, it finished in seconds.
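To illustrate what the UDTF does per input row, here is a minimal Python sketch of the same expansion (the name `expand_range` is hypothetical and not part of Snowflake). It also checks the precision remark: the largest IPv4 integer, 2^32 - 1, is far below 2^53, so it survives a round trip through a double unchanged.

```python
def expand_range(range_start, range_end, geoname_id, latitude, longitude):
    """Yield one (ip, geoname_id, latitude, longitude) tuple per IP in the inclusive range."""
    for ip in range(range_start, range_end + 1):
        yield (ip, geoname_id, latitude, longitude)


# Same sample rows as in the table x above.
rows = [
    (214690946, 214690946, 4556793, 39.84980011, -75.37470245),
    (214690947, 214690947, 6252001, 37.75099945, -97.82199860),
    (214690948, 214690951, 6252001, 37.75099945, -97.82199860),
]
expanded = [out for row in rows for out in expand_range(*row)]

# Doubles represent every integer up to 2**53 exactly, so IPv4 values are safe.
MAX_IPV4 = 2**32 - 1
round_trip_ok = int(float(MAX_IPV4)) == MAX_IPV4
print(len(expanded), round_trip_ok)
```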

Interpreting rawPrediction from Spark ML LinearSVC

I am using Spark ML's LinearSVC in a binary classification model. The transform method creates two columns, prediction and rawPrediction. Spark's docs don't provide any way of interpreting the rawPrediction column for this particular classifier. This question has been asked and answered for other classifiers, but not specifically for LinearSVC.
The relevant column from my predictions dataframe:
+------------------------------------------+
|rawPrediction |
+------------------------------------------+
|[0.8553257800650063,-0.8553257800650063] |
|[0.4230977574196645,-0.4230977574196645] |
|[0.49814263303537865,-0.49814263303537865]|
|[0.9506355050332026,-0.9506355050332026] |
|[0.5826887000450813,-0.5826887000450813] |
|[1.057222808292026,-1.057222808292026] |
|[0.5744214192446275,-0.5744214192446275] |
|[0.8738081933835614,-0.8738081933835614] |
|[1.418173816502859,-1.418173816502859] |
|[1.0854125533426737,-1.0854125533426737] |
+------------------------------------------+
Clearly this isn't simply the probability of belonging to each class. What is it?
Edit: Since the input code has been requested, here's a model built on a subset of features in the original dataset. Fitting any data with Spark's LinearSVC will produce this column.
var df = sqlContext
.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/FileStore/tables/full_frame_20180716.csv")
var assembler = new VectorAssembler()
.setInputCols(Array("oy_length", "ah_length", "ey_length", "vay_length", "oh_length",
"longest_word_length", "total_words", "repeated_exact_words",
"repeated_bigrams", "repeated_lemmatized_words",
"repeated_lemma_bigrams"))
.setOutputCol("features")
df = assembler.transform(df)
var Array(train, test) = df.randomSplit(Array(.8,.2), 42)
var supvec = new LinearSVC()
.setLabelCol("written_before_2004")
.setMaxIter(10)
.setRegParam(0.001)
var supvecModel = supvec.fit(train)
var predictions = supvecModel.transform(test)
predictions.select("rawPrediction").show(20, false)
Output:
+----------------------------------------+
|rawPrediction |
+----------------------------------------+
|[1.1502868455791242,-1.1502868455791242]|
|[0.853488887006264,-0.853488887006264] |
|[0.8064994501574174,-0.8064994501574174]|
|[0.7919862003563363,-0.7919862003563363]|
|[0.847418035176922,-0.847418035176922] |
|[0.9157433788236442,-0.9157433788236442]|
|[1.6290888181913814,-1.6290888181913814]|
|[0.9402461917731906,-0.9402461917731906]|
|[0.9744052798627367,-0.9744052798627367]|
|[0.787542624053347,-0.787542624053347] |
|[0.8750602657901001,-0.8750602657901001]|
|[0.7949414037722276,-0.7949414037722276]|
|[0.9163545832998052,-0.9163545832998052]|
|[0.9875454213431247,-0.9875454213431247]|
|[0.9193015302646135,-0.9193015302646135]|
|[0.9828623328048487,-0.9828623328048487]|
|[0.9175976004208621,-0.9175976004208621]|
|[0.9608750388820302,-0.9608750388820302]|
|[1.029326217566756,-1.029326217566756] |
|[1.0190290910146256,-1.0190290910146256]|
+----------------------------------------+
only showing top 20 rows
It is (-margin, margin).
override protected def predictRaw(features: Vector): Vector = {
val m = margin(features)
Vectors.dense(-m, m)
}
As mentioned by arpad, it is the margin.
And the margin is:
margin = coefficients * feature + intercept
or
y = w * x + b
If you divide the margin by the norm of the coefficients, you will get the distance to the hyperplane for each data point.
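A small worked example in plain Python may make this concrete. The coefficients, intercept and feature vector below are made-up numbers, and the helper names are hypothetical; only the (-margin, margin) layout and the division by the norm come from the answers above.

```python
from math import sqrt


def margin(coefficients, intercept, features):
    """w . x + b for a single data point."""
    return sum(w * x for w, x in zip(coefficients, features)) + intercept


def raw_prediction(coefficients, intercept, features):
    """Same layout as the rawPrediction column: (-margin, margin)."""
    m = margin(coefficients, intercept, features)
    return (-m, m)


def distance_to_hyperplane(coefficients, intercept, features):
    """Signed distance: the margin divided by the norm of the coefficients."""
    norm = sqrt(sum(w * w for w in coefficients))
    return margin(coefficients, intercept, features) / norm


w = [3.0, 4.0]   # illustrative coefficients, norm is 5
b = -5.0         # illustrative intercept
x = [1.0, 2.0]   # illustrative feature vector

raw = raw_prediction(w, b, x)
dist = distance_to_hyperplane(w, b, x)
print(raw, dist)
```

Read this way, a row such as [0.855, -0.855] in the question's output has a negative margin, because the first entry is -margin.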

Detect words and graphs in image and slice image into 1 image per word or graph

I'm building a web app to help students with learning Maths.
The app needs to display Maths content that comes from LaTeX files.
These LaTeX files render (beautifully) to PDF, which I can convert cleanly to SVG thanks to pdf2svg.
The image (SVG, PNG or whatever image format) looks something like this:
_______________________________________
| |
| 1. Word1 word2 word3 word4 |
| a. Word5 word6 word7 |
| |
| ///////////Graph1/////////// |
| |
| b. Word8 word9 word10 |
| |
| 2. Word11 word12 word13 word14 |
| |
|_______________________________________|
Real example:
The web app intent is to manipulate and add content to this, leading to something like this:
_______________________________________
| |
| 1. Word1 word2 | <-- New line break
|_______________________________________|
| |
| -> NewContent1 |
|_______________________________________|
| |
| word3 word4 |
|_______________________________________|
| |
| -> NewContent2 |
|_______________________________________|
| |
| a. Word5 word6 word7 |
|_______________________________________|
| |
| ///////////Graph1/////////// |
|_______________________________________|
| |
| -> NewContent3 |
|_______________________________________|
| |
| b. Word8 word9 word10 |
|_______________________________________|
| |
| 2. Word11 word12 word13 word14 |
|_______________________________________|
Example:
A single large image does not give me the flexibility to do this kind of manipulation.
But if the image file were broken down into smaller files, each holding a single word or a single graph, I could do these manipulations.
What I think I need to do is detect whitespace in the image, and slice the image into multiple sub-images, looking something like this:
_______________________________________
| | | | |
| 1. Word1 | word2 | word3 | word4 |
|__________|_______|_______|____________|
| | | |
| a. Word5 | word6 | word7 |
|_____________|_______|_________________|
| |
| ///////////Graph1/////////// |
|_______________________________________|
| | | |
| b. Word8 | word9 | word10 |
|_____________|_______|_________________|
| | | | |
| 2. Word11 | word12 | word13 | word14 |
|___________|________|________|_________|
I'm looking for a way to do this.
What do you think is the way to go?
Thank you for your help!
I would use horizontal and vertical projection to first segment the image into lines, and then each line into smaller slices (e.g. words).
Start by converting the image to grayscale, and then invert it, so that gaps contain zeros and any text/graphics are non-zero.
img = cv2.imread('article.png', cv2.IMREAD_COLOR)
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img_gray_inverted = 255 - img_gray
Calculate horizontal projection -- mean intensity per row, using cv2.reduce, and flatten it to a linear array.
row_means = cv2.reduce(img_gray_inverted, 1, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
Now find the row ranges for all the contiguous gaps. You can use the function provided in this answer.
row_gaps = zero_runs(row_means)
Finally calculate the midpoints of the gaps, that we will use to cut the image up.
# Integer division, so the cutpoints can be used as array indices (Python 3).
row_cutpoints = (row_gaps[:,0] + row_gaps[:,1] - 1) // 2
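To see the gap detection and midpoint calculation on something small, here is a plain-Python stand-in (no NumPy) applied to a toy projection. The loop-based `zero_runs` below is an illustrative rewrite, not the NumPy function used in the script, and the sample values are invented. Runs are half-open `[start, end)` pairs.

```python
def zero_runs(values):
    """Return (start, end) pairs, end exclusive, for each contiguous run of zeros."""
    runs = []
    start = None
    for i, v in enumerate(values):
        if v == 0 and start is None:
            start = i                 # a run of zeros begins here
        elif v != 0 and start is not None:
            runs.append((start, i))   # the run ended just before i
            start = None
    if start is not None:
        runs.append((start, len(values)))
    return runs


def cutpoints(runs):
    """Midpoint index of each gap, using integer division."""
    return [(start + end - 1) // 2 for start, end in runs]


# Toy "mean intensity per row": zeros are blank rows, non-zeros contain ink.
projection = [0, 0, 5, 7, 0, 0, 0, 3, 0]
gaps = zero_runs(projection)
cuts = cutpoints(gaps)
print(gaps, cuts)  # [(0, 2), (4, 7), (8, 9)] [0, 5, 8]
```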
You end up with something like this situation (gaps are pink, cutpoints red):
Next step would be to process each identified line.
bounding_boxes = []
for n,(start,end) in enumerate(zip(row_cutpoints, row_cutpoints[1:])):
    line = img[start:end]
    line_gray_inverted = img_gray_inverted[start:end]
Calculate the vertical projection (average intensity per column), find the gaps and cutpoints. Additionally, calculate gap sizes, to allow filtering out the small gaps between individual letters.
column_means = cv2.reduce(line_gray_inverted, 0, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
column_gaps = zero_runs(column_means)
column_gap_sizes = column_gaps[:,1] - column_gaps[:,0]
column_cutpoints = (column_gaps[:,0] + column_gaps[:,1] - 1) // 2
Filter the cutpoints.
filtered_cutpoints = column_cutpoints[column_gap_sizes > 5]
And create a list of bounding boxes for each segment.
for xstart,xend in zip(filtered_cutpoints, filtered_cutpoints[1:]):
    # Plain ints, so the corners can be passed straight to cv2.rectangle later.
    bounding_boxes.append(((int(xstart), int(start)), (int(xend), int(end))))
Now you end up with something like this (again gaps are pink, cutpoints red):
Now you can cut up the image. I'll just visualize the bounding boxes found:
The full script:
import cv2
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec

def plot_horizontal_projection(file_name, img, projection):
    fig = plt.figure(1, figsize=(12,16))
    gs = gridspec.GridSpec(1, 2, width_ratios=[3,1])
    ax = plt.subplot(gs[0])
    im = ax.imshow(img, interpolation='nearest', aspect='auto')
    ax.grid(which='major', alpha=0.5)
    ax = plt.subplot(gs[1])
    ax.plot(projection, np.arange(img.shape[0]), 'm')
    ax.grid(which='major', alpha=0.5)
    plt.xlim([0.0, 255.0])
    plt.ylim([-0.5, img.shape[0] - 0.5])
    ax.invert_yaxis()
    fig.suptitle("FOO", fontsize=16)
    gs.tight_layout(fig, rect=[0, 0.03, 1, 0.97])
    fig.set_dpi(200)
    fig.savefig(file_name, bbox_inches='tight', dpi=fig.dpi)
    plt.clf()

def plot_vertical_projection(file_name, img, projection):
    fig = plt.figure(2, figsize=(12, 4))
    gs = gridspec.GridSpec(2, 1, height_ratios=[1,5])
    ax = plt.subplot(gs[0])
    im = ax.imshow(img, interpolation='nearest', aspect='auto')
    ax.grid(which='major', alpha=0.5)
    ax = plt.subplot(gs[1])
    ax.plot(np.arange(img.shape[1]), projection, 'm')
    ax.grid(which='major', alpha=0.5)
    plt.xlim([-0.5, img.shape[1] - 0.5])
    plt.ylim([0.0, 255.0])
    fig.suptitle("FOO", fontsize=16)
    gs.tight_layout(fig, rect=[0, 0.03, 1, 0.97])
    fig.set_dpi(200)
    fig.savefig(file_name, bbox_inches='tight', dpi=fig.dpi)
    plt.clf()

def visualize_hp(file_name, img, row_means, row_cutpoints):
    row_highlight = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    row_highlight[row_means == 0, :, :] = [255,191,191]
    row_highlight[row_cutpoints, :, :] = [255,0,0]
    plot_horizontal_projection(file_name, row_highlight, row_means)

def visualize_vp(file_name, img, column_means, column_cutpoints):
    col_highlight = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    col_highlight[:, column_means == 0, :] = [255,191,191]
    col_highlight[:, column_cutpoints, :] = [255,0,0]
    plot_vertical_projection(file_name, col_highlight, column_means)

# From https://stackoverflow.com/a/24892274/3962537
def zero_runs(a):
    # Create an array that is 1 where a is 0, and pad each end with an extra 0.
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    # Runs start and end where absdiff is 1.
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return ranges

img = cv2.imread('article.png', cv2.IMREAD_COLOR)
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img_gray_inverted = 255 - img_gray

row_means = cv2.reduce(img_gray_inverted, 1, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
row_gaps = zero_runs(row_means)
# Integer division, so the cutpoints can be used as array indices (Python 3).
row_cutpoints = (row_gaps[:,0] + row_gaps[:,1] - 1) // 2
visualize_hp("article_hp.png", img, row_means, row_cutpoints)

bounding_boxes = []
for n,(start,end) in enumerate(zip(row_cutpoints, row_cutpoints[1:])):
    line = img[start:end]
    line_gray_inverted = img_gray_inverted[start:end]
    column_means = cv2.reduce(line_gray_inverted, 0, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
    column_gaps = zero_runs(column_means)
    column_gap_sizes = column_gaps[:,1] - column_gaps[:,0]
    column_cutpoints = (column_gaps[:,0] + column_gaps[:,1] - 1) // 2
    filtered_cutpoints = column_cutpoints[column_gap_sizes > 5]
    for xstart,xend in zip(filtered_cutpoints, filtered_cutpoints[1:]):
        # Plain ints, so the corners can be passed straight to cv2.rectangle.
        bounding_boxes.append(((int(xstart), int(start)), (int(xend), int(end))))
    visualize_vp("article_vp_%02d.png" % n, line, column_means, filtered_cutpoints)

result = img.copy()
for bounding_box in bounding_boxes:
    cv2.rectangle(result, bounding_box[0], bounding_box[1], (255,0,0), 2)
cv2.imwrite("article_boxes.png", result)
The image is top quality: perfectly clean, not skewed, with well-separated characters. A dream!
First perform binarization and blob detection (standard in OpenCV).
Then cluster the characters by grouping those with an overlap in the ordinates (i.e. facing each other in a row). This will naturally isolate the individual lines.
Now in every row, sort the blobs left-to-right and cluster by proximity to isolate the words. This will be a delicate step, because the spacing of characters within a word is close to the spacing between distinct words. Don't expect perfect results. This should work better than a projection.
The situation is worse with italics, as the horizontal spacing is even narrower. You may also have to look at the "slanted distance", i.e. find the lines that are tangent to the characters in the direction of the italics. This can be achieved by applying a reverse shear transform.
Thanks to the grid, the graphs will appear as big blobs.
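The proximity clustering within a row could be sketched like this in plain Python. It assumes blob detection has already produced one horizontal extent `(x_start, x_end)` per character on a line; the function name `group_into_words` and the `max_gap` threshold are illustrative choices, and picking a good threshold is exactly the delicate part mentioned above.

```python
def group_into_words(blobs, max_gap):
    """Merge character extents (x_start, x_end) whose horizontal gap is at most max_gap."""
    words = []
    for x_start, x_end in sorted(blobs):      # left-to-right order
        if words and x_start - words[-1][1] <= max_gap:
            # Close enough to the previous blob: extend the current word.
            words[-1] = (words[-1][0], max(words[-1][1], x_end))
        else:
            # Gap too wide (or first blob): start a new word.
            words.append((x_start, x_end))
    return words


# Four made-up character blobs forming two words.
blobs = [(26, 30), (0, 5), (6, 10), (20, 25)]
words = group_into_words(blobs, max_gap=3)
print(words)  # [(0, 10), (20, 30)]
```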

PySpark: conditional join on calculation

I have a dataframe that contains locations and their GPS coordinates as longitude and latitude. Now I want to find the locations that are within 500 m of one another. To do that, I'm trying to join the dataframe with itself, not as a full join but only for the rows where the condition is met, which keeps the join small. But I get this error:
Py4JJavaError: An error occurred while calling o341.join. :
java.lang.RuntimeException: Invalid PythonUDF
PythonUDF#(latitude#1655,longitude#1657,lng#1665,ltd#1666),
requires attributes from more than one child.
Any idea how to solve that? I know that you can do conditional joins based on the values of columns. But I need it based on a calculation that needs values of 4 columns.
Here's what I did:
The original dataframe looks like this:
df
|-- listing_id: integer (nullable = true)
|-- latitude: float (nullable = true)
|-- longitude: float (nullable = true)
|-- price: integer (nullable = true)
|-- street_address: string (nullable = true)
From this I'm creating a copy while renaming some columns. This is a pre-requisite since the join operation doesn't like two columns of the same name.
df2 = df.select(df.listing_id.alias('id'),
df.street_address.alias('address'),
df.longitude.alias('lng'),
df.latitude.alias('ltd'),
df.price.alias('prc')
)
Then I have the haversine function, which calculates the distance between two geolocations in kilometers:
from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r
That's the function I would like to apply to the conditional join:
from pyspark.sql.types import *
from pyspark.sql.functions import udf, col
#berlin_lng = 13.41053
#berlin_ltd = 52.52437
#hav_distance_udf = udf(lambda lng1, ltd1: haversine(lng1, ltd1, berlin_lng, berlin_ltd), FloatType())
#df3 = df.withColumn("distance_berlin", hav_distance_udf(df.longitude, df.latitude))
hav_distance_udf = udf(lambda lng1, ltd1, lng2, ltd2: haversine(lng1, ltd1, lng2, ltd2), FloatType())
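# Arguments follow haversine's (lon1, lat1, lon2, lat2) order; "within 500 m" means a distance below 0.5 km.
in_range = hav_distance_udf(col('longitude'), col('latitude'), col('lng'), col('ltd')) < 0.5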
df3 = df.join(df2, in_range)
The commented-out withColumn variant works fine, but the conditional join raises the error shown above. Any idea how to fix that?
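Independently of the Spark error, the distance logic itself can be sanity-checked in plain Python. The sketch below restates the haversine helper so it runs on its own; `within_range` and the offsets of 0.004 and 0.01 degrees are invented for illustration, while the Berlin coordinates are the ones from the commented-out lines. Two things to watch: the helper expects longitude before latitude, and "within 500 m" corresponds to a result below 0.5 km.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers; note the (lon, lat) argument order."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371  # mean Earth radius in km


def within_range(lon1, lat1, lon2, lat2, max_km=0.5):
    """True when the two points are closer than max_km kilometers."""
    return haversine_km(lon1, lat1, lon2, lat2) < max_km


berlin_lng, berlin_ltd = 13.41053, 52.52437
# About 0.44 km further north: should count as in range.
near = within_range(berlin_lng, berlin_ltd, berlin_lng, berlin_ltd + 0.004)
# About 1.1 km further north: should not.
far = within_range(berlin_lng, berlin_ltd, berlin_lng, berlin_ltd + 0.01)
print(near, far)
```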

CGAffineTransform: How to calculate multiply CGAffineTransform?

I need to transform a view from origin (250, 250) to origin (352, 315), with its width/height changing from (100.0, 100.0) to (68, 68).
I know I can combine several CGAffineTransform functions, such as scale, rotate and translate.
But I don't know how to work out the order of those transformations, or their exact parameters.
I have tried several times, but can't move the view to the correct position.
Can anyone help?
A little understanding about what is happening behind the scenes is always nice in these matrix transformations.
Apple's docs have great documentation about transforms, so let's use it.
A translation matrix looks like :
| 1 0 0 |
| 0 1 0 |
| tx ty 1 |
where (tx, ty) is your translation vector.
A scaling matrix looks like :
| sx 0 0 |
| 0 sy 0 |
| 0 0 1 |
where sx and sy are the scale factors along the X and Y axes.
You want to concatenate these matrices using CGAffineTransformConcat, but according to its doc:
Note that matrix operations are not commutative—the order in which you
concatenate matrices is important. That is, the result of multiplying
matrix t1 by matrix t2 does not necessarily equal the result of
multiplying matrix t2 by matrix t1.
You have to translate your view before scaling it, otherwise your translation vector will be scaled according to sx and sy coefficients.
Let's show it easily :
let scaleMatrix = CGAffineTransformMakeScale(0.68, 0.68)
let translateMatrix = CGAffineTransformMakeTranslation(102, 65)
let translateThenScaleMatrix = CGAffineTransformConcat(scaleMatrix, translateMatrix)
NSLog("translateThenScaleMatrix : \(translateThenScaleMatrix)")
// outputs : CGAffineTransform(a: 0.68, b: 0.0, c: 0.0, d: 0.68, tx: 102.0, ty: 65.0)
// the translation is the same
let scaleThenTranslateMatrix = CGAffineTransformConcat(translateMatrix, scaleMatrix)
NSLog("scaleThenTranslateMatrix : \(scaleThenTranslateMatrix)")
// outputs : CGAffineTransform(a: 0.68, b: 0.0, c: 0.0, d: 0.68, tx: 69.36, ty: 44.2)
// the translation has been scaled too
And let's prove it mathematically. Please note that when you perform an operation A and then an operation B, the related matrix is computed as matB*matA: the first operation is on the right. Since matrix multiplication is not commutative, this matters.
// Translate then scaling :
| sx 0 0 | | 1 0 0 | | sx 0 0 |
| 0 sy 0 | . | 0 1 0 | = | 0 sy 0 |
| 0 0 1 | | tx ty 1 | | tx ty 1 |
// The resulting matrix has the same value for translation
// Scaling then translation :
| 1 0 0 | | sx 0 0 | | sx 0 0 |
| 0 1 0 | . | 0 sy 0 | = | 0 sy 0 |
| tx ty 1 | | 0 0 1 | | sx.tx sy.ty 1 |
// The translation values are affected by scaling coefficient
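The two products can be checked numerically with a few lines of plain Python. This sketch only reproduces the matrix arithmetic with the values used above (scale 0.68, translation (102, 65)); the helper names are hypothetical and nothing here calls Core Graphics.

```python
def matmul3(a, b):
    """Product of two 3x3 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def scale(sx, sy):
    return [[sx, 0, 0], [0, sy, 0], [0, 0, 1]]


def translate(tx, ty):
    return [[1, 0, 0], [0, 1, 0], [tx, ty, 1]]


S = scale(0.68, 0.68)
T = translate(102, 65)

s_times_t = matmul3(S, T)   # last row keeps (tx, ty) unchanged
t_times_s = matmul3(T, S)   # last row becomes (sx*tx, sy*ty)

print(s_times_t[2])  # approximately [102, 65, 1]
print(t_times_s[2])  # approximately [69.36, 44.2, 1]
```

The last rows match the tx/ty values logged by the two CGAffineTransformConcat calls above.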
struct CGAffineTransform {
CGFloat a, b, c, d;
CGFloat tx, ty;
};
You can read the parameters from this struct. Also note that transforms always override one another; in other words, they are not superposed.
