PySpark: conditional join on calculation

I have a DataFrame that contains locations and their GPS coordinates as longitude and latitude. I want to find the locations that are within 500 m of another one. To do that I'm trying to join the DataFrame with itself, not as a full join but only for the rows where the condition is met, which should reduce the size of the join. But I get this error:
Py4JJavaError: An error occurred while calling o341.join. :
java.lang.RuntimeException: Invalid PythonUDF
PythonUDF#(latitude#1655,longitude#1657,lng#1665,ltd#1666),
requires attributes from more than one child.
Any idea how to solve that? I know that you can do conditional joins based on the values of columns. But I need it based on a calculation that needs values of 4 columns.
Here's what I did:
The original dataframe looks like this:
df
|-- listing_id: integer (nullable = true)
|-- latitude: float (nullable = true)
|-- longitude: float (nullable = true)
|-- price: integer (nullable = true)
|-- street_address: string (nullable = true)
From this I'm creating a copy while renaming some columns. This is a pre-requisite since the join operation doesn't like two columns of the same name.
df2 = df.select(df.listing_id.alias('id'),
df.street_address.alias('address'),
df.longitude.alias('lng'),
df.latitude.alias('ltd'),
df.price.alias('prc')
)
Then I have the haversine function, which calculates the distance between two geographic locations in kilometers:
from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r
That's the function I would like to apply to the conditional join:
from pyspark.sql.types import *
from pyspark.sql.functions import udf, col
#berlin_lng = 13.41053
#berlin_ltd = 52.52437
#hav_distance_udf = udf(lambda lng1, ltd1: haversine(lng1, ltd1, berlin_lng, berlin_ltd), FloatType())
#df3 = df.withColumn("distance_berlin", hav_distance_udf(df.longitude, df.latitude))
hav_distance_udf = udf(lambda lng1, ltd1, lng2, ltd2: haversine(lng1, ltd1, lng2, ltd2), FloatType())
# haversine expects (lon, lat, lon, lat); "within 500 m" means a distance below 0.5 km
in_range = hav_distance_udf(col('longitude'), col('latitude'), col('lng'), col('ltd')) < 0.5
df3 = df.join(df2, in_range)
The commented-out withColumn variant works fine, but the conditional join raises the error shown above. Any idea how to fix that?
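The thread contains no answer to this question. As a way to check the join predicate itself outside Spark, here is a minimal plain-Python sketch of the "pairs within 500 m" logic. The sample rows and the names haversine_km and pairs_within are assumptions made for illustration. In Spark, one possible route (an untested assumption here) would be to apply the same predicate as a filter on a cross join rather than as the join condition.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers; all arguments are in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))


def pairs_within(rows, max_km=0.5):
    """Return (id_a, id_b) for every unordered pair of rows closer than max_km.

    Each row is (listing_id, latitude, longitude).
    """
    return [
        (a[0], b[0])
        for a, b in combinations(rows, 2)
        if haversine_km(a[2], a[1], b[2], b[1]) < max_km
    ]


# Hypothetical sample rows: (listing_id, latitude, longitude)
listings = [
    (1, 52.52437, 13.41053),
    (2, 52.52500, 13.41053),  # roughly 70 m north of listing 1
    (3, 52.60000, 13.50000),  # several kilometers away
]
print(pairs_within(listings))
```

Using unordered pairs avoids matching a listing with itself and avoids reporting each pair twice, which a naive self-join would do.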

Related

Calculating the bearing change from latitude/longitude

I have a dataset of GPS logs that also contains GPS speeds. Here's what the dataset looks like:
id | gpstime | lat | lon | speed
--------+------------+------------+------------+---------
157934 | 1530099776 | 41.1825026 | -8.5996864 | 3.40901
157934 | 1530099777 | 41.1825114 | -8.599722 | 3.43062
157934 | 1530099778 | 41.1825233 | -8.5997594 | 3.45739
157934 | 1530099779 | 41.1825374 | -8.5997959 | 3.40025
157934 | 1530099780 | 41.1825519 | -8.5998337 | 3.41673
(5 rows)
Now I want to compute the bearing change, for each point with respect to the true north.
But I have these questions I am yet to find answers to:
Based on my reading, I come across the formula (as in this answer):
Bearing = atan(y,x)
where x and y are the quantities
y = sin(Blon-Alon) * cosBlat
x = cosAlat * sinBlat -sinAlat * cosBlat * cos(Blon-Alon)
for points A and B. Another source writes the formula as:
Bearing = atan2(y,x)
So I'm confused: which of the two formulas should I use?
lat and lon should be converted from degrees to radians before being passed to x and y. Since the lon values in my dataset are negative, should I take the absolute value of each?

I think for GPS tracks this would be overkill. If the distance between two points is not too big (say, a few hundred meters), I assume this simplified calculation is sufficient.
The latitude/longitude differences are approximately
Δlat = 111km * (lat1 - lat2)
Δlon = 111km * cos(lat) * (lon1 - lon2)
So bearing would be
bearing = atan(Δlon / Δlat) * 180/π
bearing = atan(cos(lat) * (lon1 - lon2) / (lat1 - lat2)) * 180/ACOS(-1)
for lat use either lat1 or lat2 or the middle if you like.
lat = (lat1 + lat2)/2 * π/180 = (lat1 + lat2)/2 * ACOS(-1)/180
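Keep in mind that Δlat or Δlon could be 0. When Δlat is 0 the division is undefined, so atan2(Δlon, Δlat) is the safer form; it also returns the correct quadrant. With the differences defined as above (point 1 minus point 2), the result is the bearing from point 2 towards point 1.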
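To make the atan versus atan2 question concrete, here is a small Python sketch of the full formula from the question. The function names initial_bearing and bearing_change are illustrative assumptions. It uses atan2, which keeps the quadrant, and it passes longitudes with their sign: taking the absolute value of a negative longitude would mirror the track east-west.

```python
from math import atan2, cos, degrees, radians, sin


def initial_bearing(lat_a, lon_a, lat_b, lon_b):
    """Bearing from A to B in degrees clockwise from true north, in [0, 360)."""
    lat_a, lon_a, lat_b, lon_b = map(radians, (lat_a, lon_a, lat_b, lon_b))
    dlon = lon_b - lon_a
    y = sin(dlon) * cos(lat_b)
    x = cos(lat_a) * sin(lat_b) - sin(lat_a) * cos(lat_b) * cos(dlon)
    # atan2 resolves the quadrant; plain atan(y / x) cannot and fails when x == 0
    return (degrees(atan2(y, x)) + 360.0) % 360.0


def bearing_change(b1, b2):
    """Signed turn from bearing b1 to bearing b2 in degrees, in [-180, 180)."""
    return (b2 - b1 + 180.0) % 360.0 - 180.0


# First three fixes from the sample dataset (lat, lon); longitudes stay negative
p1 = (41.1825026, -8.5996864)
p2 = (41.1825114, -8.5997220)
p3 = (41.1825233, -8.5997594)
b12 = initial_bearing(*p1, *p2)
b23 = initial_bearing(*p2, *p3)
print(b12, b23, bearing_change(b12, b23))
```

The wrap-around in bearing_change matters for tracks that cross north, for example a turn from 350 to 10 degrees is +20, not -340.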

how to compare complex decimal number with integer value

I want to get all clinics that are near my latitude and longitude, and I did that with the following method. The resulting dist is a long value that I need to compare with an integer value. I don't know why I get this error when comparing dist with distance, which is an integer value.
This is my error:
NoMethodError (undefined method `<=' for (-2.693846638591123+0.0i):Complex):
and this is everything that I did:
def get_clinic_list
  ulat=params[:lat]
  ulang=params[:lang]
  distance=params[:distance]
  @clinic=[]
  Clinic.all.each do |clinic|
    if clinic_distance(ulat,ulang,distance,clinic.id)
      @doctor=DoctorProfile.find_by(user_id: clinic.user_id)
    end
  end
end
def clinic_distance(ulat, ulang,distance,clinic)
  @clinic=Clinic.find(clinic)
  diff_lat= ulat.to_f - @clinic.latitude.to_f
  diff_lang= ulang.to_f - @clinic.longitude.to_f
  #disc=Math.sqrt(((diff_lat*119.574)**2)+(diff_lang * Math.cos(diff_lat) * 111.320))
  a=(diff_lat * 119.574) ** 2
  b= diff_lang * Math.cos(diff_lat) * 111.320
  c=a+b
  logger.info "the c parameter is #{c}"
  dist=Math.sqrt(c)
  dist = dist ** 2
  if dist <= distance
    return true
  else
    return false
  end
end

Complex numbers don't support <= or >= (although they do support ==).
The simplest solution is to take the absolute value of the number:
if dist.abs <= distance

There was a mistake in pretty much every line of your clinic_distance method. I tried my best to correct it, but I cannot test it without your data.
The problem isn't really about Complex numbers. I don't know where this Complex number comes from, possibly from a negative c in your Math.sqrt(c): b is not squared, so it can be negative.
EarthRadius = 6371 # km
OneDegree = EarthRadius * 2 * Math::PI / 360 # 1° latitude in km
def get_clinic_list
  lat = params[:lat]
  lon = params[:lang] # :lang???
  max_distance = params[:distance].to_f # params are strings; :distance should probably be :max_distance
  @clinic = [] # What do you do with this empty array?
  Clinic.all.each do |clinic|
    if distance_in_km(lat, lon, clinic.latitude, clinic.longitude) < max_distance
      # Do you really want to keep overriding @doctor every time a clinic is found?
      @doctor = DoctorProfile.find_by(user_id: clinic.user_id)
    end
  end
  # You return every clinic, even ones far away...
end
def distance_in_km(lat1, lon1, lat2, lon2)
  diff_lat = lat1.to_f - lat2.to_f
  diff_lon = lon1.to_f - lon2.to_f
  lat_km = diff_lat * OneDegree
  lon_km = diff_lon * OneDegree * Math.cos(lat1.to_f * Math::PI / 180) # Math.cos expects a radian angle
  Math.sqrt(lat_km**2 + lon_km**2)
end
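For readers who want to check the arithmetic independently of Rails, here is a hedged Python sketch of the same flat-earth approximation. The clinic tuples and the helper clinics_within are hypothetical. Because both terms are squared before the square root, the result is always a non-negative real number, so no Complex value can appear.

```python
from math import cos, hypot, pi, radians

EARTH_RADIUS_KM = 6371.0
KM_PER_DEGREE = EARTH_RADIUS_KM * 2 * pi / 360  # length of 1 degree of latitude


def distance_in_km(lat1, lon1, lat2, lon2):
    """Approximate distance for nearby points; inputs are in degrees."""
    lat_km = (lat1 - lat2) * KM_PER_DEGREE
    # A degree of longitude shrinks with cos(latitude); cos() needs radians
    lon_km = (lon1 - lon2) * KM_PER_DEGREE * cos(radians(lat1))
    return hypot(lat_km, lon_km)  # sqrt(lat_km**2 + lon_km**2)


def clinics_within(clinics, lat, lon, max_km):
    """Return the names of clinics no farther than max_km from (lat, lon)."""
    return [
        name
        for name, c_lat, c_lon in clinics
        if distance_in_km(lat, lon, c_lat, c_lon) <= max_km
    ]


# Hypothetical data: (name, latitude, longitude)
clinics = [("near", 35.701, 51.401), ("far", 36.5, 52.5)]
print(clinics_within(clinics, 35.7, 51.4, 5.0))
```

Unlike the controller in the question, this returns only the matching clinics instead of iterating over all of them and discarding the result.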

How to convert GPS coordinates to decimal in Lua?

I need to convert GPS coordinates from WGS84 to decimal using Lua.
I am sure it's been done before, so I am looking for a hint to a code snippet.
Corrected question: is there code to convert DMS (Degrees Minutes Seconds) to DEG (decimal degrees) in Lua?
examples:
Vienna: dms: 48°12'30" N 16°22'28" E
or
Zurich: dms: 47°21'7" N 8°30'37" E
The difficulty I find is to get the numbers out of these strings.
Especially how to handle the signs for degree (°) minutes (') and seconds (").
So that I would have for example a table coord{} per coordinate to deal with.
coord {1} [48]
coord {2} [12]
coord {3} [30]
coord {4} [N]
coord {5} [16]
coord {6} [22]
coord {7} [28]
coord {8} [E]
Suggestions are appreciated, thanks.

Parse the string latlon = '48°12'30" N 16°22'28" E' into DMS+heading components:
This is your string (note the escaped single-quote):
latlon = '48°12\'30" N 16°22\'28" E'
Break it down into two steps: the lat/lon, then the components of each. You need captures "()", and you ignore the spaces around the heading (N and E) with "%s*". The non-greedy ".-" keeps the trailing space out of the captures:
lat, ns, lon, ew = string.match(latlon, '(.-)%s*(%a)%s*(.-)%s*(%a)')
The lat is now 48°12'30", ns is 'N', lon is 16°22'28", ew is 'E'. For components of lat, step by step:
-- string.match(lat, '48°12'30"') -- oops the ' needs escaping or us
-- string.match(lat, '48°12\'30"')
-- ready for the captures:
-- string.match(lat, '(48)°(12)\'(30)"') -- ready for generic numbers
d1, m1, s1 = string.match(lat, '(%d+)°(%d+)\'(%d+)"')
d2, m2, s2 = string.match(lon, '(%d+)°(%d+)\'(%d+)"')
Now that you know (d1, m1, s1, ns) and (d2, m2, s2, ew), you have:
sign = 1
if ns=='S' then sign = -1 end
decDeg1 = sign*(d1 + m1/60 + s1/3600)
sign = 1
if ew=='W' then sign = -1 end
decDeg2 = sign*(d2 + m2/60 + s2/3600)
For your values of lat, you get decDeg1 = 48.208333 which is the correct value according to online calculators (like http://www.satsig.net/degrees-minutes-seconds-calculator.htm).
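The same two-step idea (pull out the numbers and the hemisphere letter, then combine them) can be cross-checked with a short Python sketch. The function name dms_to_decimal and the regular expression are assumptions for illustration, not a port of any existing library.

```python
import re

# degrees ° minutes ' seconds " hemisphere, with optional spaces between parts
_DMS = re.compile(r"""(\d+)°\s*(\d+)'\s*(\d+(?:\.\d+)?)"\s*([NSEW])""")


def dms_to_decimal(text):
    """Return a list of signed decimal degrees for every DMS group found in text."""
    result = []
    for deg, minutes, seconds, hemisphere in _DMS.findall(text):
        value = int(deg) + int(minutes) / 60 + float(seconds) / 3600
        if hemisphere in "SW":  # south and west are negative by convention
            value = -value
        result.append(value)
    return result


vienna = dms_to_decimal("48°12'30\" N 16°22'28\" E")
zurich = dms_to_decimal("47°21'7\" N 8°30'37\" E")
print(vienna, zurich)
```

For the Vienna example this yields a latitude of about 48.208333, matching the value computed in Lua.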

combine time series plot by using R

I want to combine three graphs in one plot. The data is the built-in R dataset "nottem". Can someone help me write code that puts the seasonal means model, the harmonic (cosine) model and the time series plot together, using different colors? I already wrote the model code; I just don't know how to combine them for comparison.
Code:
library(TSA)
nottem
month.=season(nottem)
model=lm(nottem~month.-1)
summary(nottem)
har.=harmonic(nottem,1)
model1=lm(nottem~har.)
summary(model1)
plot(nottem,type="l",ylab="Average monthly temperature at Nottingham castle")
points(y=nottem,x=time(nottem), pch=as.vector(season(nottem)))

Just put your time series inside a matrix:
x = cbind(serie1 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2)),
serie2 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2)))
plot(x)
Or configure the plot region:
par(mfrow = c(2, 1)) # 2 rows, 1 column
serie1 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2))
serie2 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2))
require(zoo)
plot(serie1)
lines(rollapply(serie1, width = 10, FUN = mean), col = 'red')
plot(serie2)
lines(rollapply(serie2, width = 10, FUN = mean), col = 'blue')
hope it helps.
PS.: zoo package is not needed in this example, you could use the filter function.
You can extract the seasonal mean with:
s.mean = tapply(serie, cycle(serie), mean)
# January, assuming serie is monthly data
print(s.mean[1])
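To spell out what the tapply(serie, cycle(serie), mean) call computes, here is a minimal Python sketch of the same grouping. The function name seasonal_means and its parameters are illustrative assumptions: every observation is assigned its position within the cycle (1 to 12 for monthly data) and the values are averaged per position.

```python
from collections import defaultdict
from statistics import mean


def seasonal_means(values, period=12, start_position=1):
    """Average the values that share the same position within the cycle.

    start_position is the cycle position of the first observation
    (for example 2 if a monthly series starts in February).
    """
    groups = defaultdict(list)
    for i, value in enumerate(values):
        position = (start_position - 1 + i) % period + 1
        groups[position].append(value)
    return {position: mean(vals) for position, vals in sorted(groups.items())}


# Tiny made-up series with a cycle of length 3
print(seasonal_means([1, 2, 3, 3, 4, 5], period=3))
```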

This graph is pretty hard to read, because your three sets of values are so similar. Still, if you simply want to graph all of these on the same plot, you can do it pretty easily by using the coefficients generated by your models.
Step 1: Plot the raw data. This comes from your original code.
plot(nottem,type="l",ylab="Average monthly temperature at Nottingham castle")
Step 2: Set up x-values for the mean and cosine plots.
x <- seq(1920, (1940 - 1/12), by=1/12)
Step 3: Plot the seasonal means by repeating the coefficients from the first model.
lines(x=x, y=rep(model$coefficients, 20), col="blue")
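Step 4: Calculate the y-values for the harmonic fit using the coefficients from the second model, and then plot. harmonic(nottem, 1) produces a cosine and a sine column, so both terms are needed.
y <- model1$coefficients[1] + model1$coefficients[2] * cos(2 * pi * x) + model1$coefficients[3] * sin(2 * pi * x)
lines(x=x, y=y, col="red")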
ggplot variant: If you decide to switch to the popular 'ggplot2' package for your plot, you would do it like so (melt comes from the 'reshape2' package):
library(ggplot2)
library(reshape2)
x <- seq(1920, (1940 - 1/12), by=1/12)
y.seas.mean <- rep(model$coefficients, 20)
# intercept + cosine term + sine term of the harmonic model
y.har.cos <- model1$coefficients[1] + model1$coefficients[2] * cos(2 * pi * x) + model1$coefficients[3] * sin(2 * pi * x)
plot_Data <- melt(data.frame(x=x, temp=as.numeric(nottem), seas.mean=y.seas.mean, har.cos=y.har.cos), id="x")
ggplot(plot_Data, aes(x=x, y=value, col=variable)) + geom_line()

List comprehensions with float iterator in F#

Consider the following code:
let dl = 9.5 / 11.
let min = 21.5 + dl
let max = 40.5 - dl
let a = [ for z in min .. dl .. max -> z ] // should have 21 elements
let b = a.Length
"a" should have 21 elements but has only 20. The last value (max, i.e. 40.5 - dl) is missing. I understand that floating-point numbers are not precise, but I hoped that F# could cope with that. If not, why does F# support list comprehensions with a float iterator? To me, it is a source of bugs.
Online trial: http://tryfs.net/snippets/snippet-3H

Converting to decimals and looking at the numbers, it seems the 21st item would 'overshoot' max:
let dl = 9.5m / 11.m
let min = 21.5m + dl
let max = 40.5m - dl
let a = [ for z in min .. dl .. max -> z ] // should have 21 elements
let b = a.Length
let lastelement = List.nth a 19
let onemore = lastelement + dl
let overshoot = onemore - max
That is probably due to lack of precision in let dl = 9.5m / 11.m?
To get rid of this compounding error, you'll have to use another number system, i.e. Rational. F# Powerpack comes with a BigRational class that can be used like so:
let dl = 95N / 110N
let min = 215N / 10N + dl
let max = 405N / 10N - dl
let a = [ for z in min .. dl .. max -> z ] // Has 21 elements
let b = a.Length

Properly handling float precision issues can be tricky. You should not rely on float equality (that's what the list comprehension implicitly does for the last element). List comprehensions on floats are useful when you generate an infinite stream. In other cases, you should pay attention to the last comparison.
If you want a fixed number of elements, and include both lower and upper endpoints, I suggest you write this kind of function:
let range from to_ count =
    assert (count > 1)
    let count = count - 1
    [ for i = 0 to count do yield from + float i * (to_ - from) / float count]
range 21.5 40.5 21
When I know the last element should be included, I sometimes do:
let a = [ for z in min .. dl .. max + dl*0.5 -> z ]

I suspect the problem is with the precision of floating point values. F# adds dl to the current value each time and checks if current <= max. Because of precision problems, it might jump over max and then check if max+ε <= max (which will yield false). And so the result will have only 20 items, and not 21.

After running your code, if you do:
> compare a.[19] max;;
val it : int = -1
It means max is greater than a.[19]
If we do calculations the same way the range operator does but grouping in two different ways and then compare them:
> compare (21.5+dl+dl+dl+dl+dl+dl+dl+dl) ((21.5+dl)+(dl+dl+dl+dl+dl+dl+dl));;
val it : int = 0
> compare (21.5+dl+dl+dl+dl+dl+dl+dl+dl+dl) ((21.5+dl)+(dl+dl+dl+dl+dl+dl+dl+dl));;
val it : int = -1
In this sample you can see how adding 7 times the same value in different order results in exactly the same value but if we try it 8 times the result changes depending on the grouping.
You're doing it 20 times.
So if you use the range operator with floats you should be aware of the precision problem.
But the same applies to any other calculation with floats.
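The two remedies suggested in these answers, exact rational arithmetic and generating the elements by index instead of by repeated addition, can be illustrated with a short Python sketch. The helper names stepped_range and linspace are assumptions for illustration; Python's fractions.Fraction stands in for BigRational.

```python
from fractions import Fraction


def stepped_range(start, step, stop):
    """Inclusive range computed as start + i * step (no accumulated additions).

    Intended for exact types such as Fraction, where the element count is exact.
    """
    count = int((stop - start) // step)
    return [start + i * step for i in range(count + 1)]


def linspace(start, stop, count):
    """Exactly `count` floats from start to stop, both endpoints included."""
    assert count > 1
    return [start + i * (stop - start) / (count - 1) for i in range(count)]


# Exact rationals: the same numbers as in the question
dl = Fraction(95, 110)
lo = Fraction(215, 10) + dl
hi = Fraction(405, 10) - dl
exact = stepped_range(lo, dl, hi)
print(len(exact), exact[-1] == hi)

# Floats by index: the number of elements is fixed up front
floats = linspace(21.5, 40.5, 21)
print(len(floats), floats[0], floats[-1])
```

With rationals the upper bound is hit exactly, and with the index-based float version the element count no longer depends on how rounding errors accumulate.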
