Clarification on a previous post: Geohashing string length and accuracy? - geolocation

Hello, I'm a little confused about the answer I found here. It says that by increasing the length of the string you can increase the accuracy, which I understand. What I don't understand is how he arrives at the accuracy figures he gives: he goes from a 110 km x 110 km area to a 10 km x 10 km area by adding a digit. I want to get down to a 5 m x 5 m area. Can someone give a more in-depth explanation of how he got these figures?

Basically, he is talking about the accuracy of a specified point.
For example, if we have a location we are interested in, and we have two points defining that same location
- point1 (156.34, -23.34), and
- point2 (156.342, -23.343)
then effectively the first point is somewhere in the ranges 156.335-156.345 and -23.345 to -23.335, while the second point is in 156.3415-156.3425 and -23.3435 to -23.3425: each extra decimal digit narrows the range by a factor of ten.
Basically, normal GPS might get near that 5-metre accuracy, though most of the time it is going to be less accurate; so what you need to do is simply use the whole value you get from the positioning module.
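To make the arithmetic behind those figures concrete: one degree of latitude is roughly 111 km, and every extra decimal digit shrinks the cell by a factor of ten. A rough sketch (Python, purely for illustration):

KM_PER_DEGREE = 111.0  # approximate; longitude cells also shrink by cos(latitude)

for decimals in range(0, 7):
    cell_m = KM_PER_DEGREE * 1000 / 10 ** decimals
    print(f"{decimals} decimal places -> cell about {cell_m:,.2f} m across")

Zero decimals gives the ~110 km x 110 km figure from the linked answer and one decimal gives the ~10 km x 10 km figure; four decimals gives roughly an 11 m cell and five gives roughly a 1 m cell, so around five decimal places (or a geohash long enough to encode the same precision) comfortably covers a 5 m x 5 m target.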


How to scale % change based features so that they are viewed "similarly" by the model

I have some features that are zero-centered values and are supposed to represent the change between a current value and a previous value. Generally speaking, I believe there should be some symmetry between these values, i.e. there should be roughly the same number of positive values as negative values, and the values should operate on roughly the same scale.
When I try to scale my samples using MaxAbsScaler, I notice that my negative values for this feature get almost completely drowned out by the positive values. And I don't really have any reason to believe my positive values should be that much larger than my negative values.
So what I've noticed is that, fundamentally, the magnitudes of percentage-change values are not symmetrical in scale. For example, if I have a value that goes from 50 to 200, that would result in a 300.0% change. If I have a value that goes from 200 to 50, that would result in a -75.0% change. I get that there is a reason for this, but in terms of my feature, I don't see a reason why a change from 50 to 200 should be treated as roughly 4x as "important" as the same change in the opposite direction.
Given this information, I do not believe there is any reason to want my model to treat a change of 200 to 50 as a "lesser" change than a change of 50 to 200. Since I am trying to represent the change of a value over time, I want to abstract this pattern so that my model can "visualize" the change of a value over time the same way a person would.
Right now I am solving this by using this formula:
def symmetric_change(prev, curr):
    # Express the change relative to the smaller of the two values,
    # signed by the direction of the change.
    if curr > prev:
        return curr / prev - 1
    else:
        return -(prev / curr - 1)
And this does seem to treat changes in value similarly regardless of direction, i.e. from the example above, 50 to 200 gives 300% and 200 to 50 gives -300%. Is there a reason why I shouldn't be doing this? Does this accomplish my goal? Has anyone run into similar dilemmas?
This is a discussion question and it's difficult to know the right answer without knowing the physical relevance of your feature. You are calculating a percentage change, and a percent change depends on the original value. I am not a big fan of a custom formula just to make percent change symmetric, since it adds a layer of complexity that is, in my opinion, unnecessary.
If you want change to be symmetric, you can try a direct difference or a factor change. There is nothing to suggest that difference or factor change is less correct than percent change. So, depending on the physical relevance of your feature, each of the following symmetric measures would be a correct way to measure change (a small sketch follows the list):
Difference change -> 50 to 200 yields 150, 200 to 50 yields -150
Factor change with logarithm -> 50 to 200 yields log(4), 200 to 50 yields log(1/4) = -log(4)
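For instance, a minimal sketch of the log-factor change in Python (the function name is just for illustration):

import math

def log_factor_change(prev, curr):
    # Symmetric by construction: swapping prev and curr only flips the sign.
    return math.log(curr / prev)

print(log_factor_change(50, 200))   # log(4)   ~  1.386
print(log_factor_change(200, 50))   # log(1/4) ~ -1.386

The difference change is simply curr - prev, which is likewise symmetric (150 versus -150 in the example above).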
You're having trouble because you haven't brought the abstract questions into your paradigm.
"... my model can "visualize" ... same way a person would."
In this paradigm, you need a metric for "same way". There is no such empirical standard. You've dropped both of the simple standards -- relative error and absolute error -- and you posit some inherently "normal" standard that doesn't exist.
Yes, we run into these dilemmas: choosing a success metric. You've chosen a classic example from "How To Lie With Statistics"; depending on the choice of starting and finishing proportions and the error metric, you can "prove" all sorts of things.
This brings us to your central question:
Does this accomplish my goal?
We don't know. First of all, you haven't given us your actual goal. Rather, you've given us an indefinite description and a single example of two data points. Second, you're asking the wrong entity. Make your changes, run the model on your data set, and examine the properties of the resulting predictions. Do those properties satisfy your desired end result?
For instance, given your posted data points, (200, 50) and (50, 200), how would other examples fit in, such as (1, 4), (1000, 10), etc.? If you're simply training on the proportion of change over the full range of values involved in that transaction, your proposal is just what you need: use the higher value as the basis. Since you didn't post any representative data, we have no idea what sort of distribution you have.

How do I calculate the nearest country on a given heading?

I'd like to calculate the closest country (as viewed on a world map) in a given direction (provided in degrees) from a user's current location.
I realize one way of doing this is to use the formula provided here to step in, for example, 5-mile increments from point to point until I finally reach a country that is not the user's starting country. However, that seems horribly inefficient with regard to use of geocoding resources.
Do any of you know of a better algorithm I could use for this?
Thanks in advance.
One way to reduce the number of reverse-geocoding operations is to treat this problem as a search for the border. If you use a binary search and reverse geocode each probe point, you find where the country changes from your current country to the adjacent country with a minimal number of reverse-geocode operations.
In the binary search your heading is constant, you have a minimum range (5 miles) and a maximum range (12,000 miles), and you are searching for the range at which the border lies. Then you reverse geocode a position just beyond the border to find out what country is there. One problem is that just beyond the border might be ocean.
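Roughly, the search might look like this (a Python sketch; reverse_geocode(lat, lon) is a stand-in for whatever reverse-geocoding call you actually make, and the sketch assumes the path leaves the home country only once, so whatever lies just past the first crossing found, ocean included, is what gets reported):

import math

EARTH_RADIUS_MI = 3959.0

def destination(lat, lon, bearing_deg, dist_mi):
    # Standard "destination point given start, bearing and distance" formula.
    lat1, lon1, brng = map(math.radians, (lat, lon, bearing_deg))
    d = dist_mi / EARTH_RADIUS_MI
    lat2 = math.asin(math.sin(lat1) * math.cos(d) +
                     math.cos(lat1) * math.sin(d) * math.cos(brng))
    lon2 = lon1 + math.atan2(math.sin(brng) * math.sin(d) * math.cos(lat1),
                             math.cos(d) - math.sin(lat1) * math.sin(lat2))
    return math.degrees(lat2), math.degrees(lon2)

def nearest_country_on_heading(lat, lon, bearing, home_country, reverse_geocode,
                               lo=0.0, hi=12000.0, tolerance_mi=5.0):
    # Binary search along the heading for the range at which the country changes.
    while hi - lo > tolerance_mi:
        mid = (lo + hi) / 2.0
        if reverse_geocode(*destination(lat, lon, bearing, mid)) == home_country:
            lo = mid   # still inside the home country: move outward
        else:
            hi = mid   # past the border (or over water): move inward
    # hi is now within tolerance_mi of the border; look just beyond it.
    return reverse_geocode(*destination(lat, lon, bearing, hi))

This uses on the order of log2(12000 / 5), i.e. about 11 reverse-geocode calls, instead of thousands of fixed 5-mile steps.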
I would use MKReverseGeocoding. Check this SO question for code examples.

Mahout Recommender: What relative preference values are suitable for a GenericUserBasedRecommender?

In Mahout, I'm setting up a GenericUserBasedRecommender; pretty straightforward for now, typical settings.
In generating a "preference" value for an item, we have the following 5 data points:
Positive interest
- User converted on item (highest possible sign of interest)
- Normal like (user expressed interest, e.g. like buttons)
- Indirect expression of interest (clicks, cursor movements, measuring "eyeballs")
Negative interest
- Indifference (items the user ignored when active on other items, a vague expression of disinterest)
- Active dislike (thumbs down, remove item from my view, etc.)
Over what range should I express these different attributes? Let's use a 1-100 scale for discussion.
Should I be keeping the 'Active dislike' and 'Indifference' clustered close together, for example, at 1 and 5 respectively, with all the likes clustered in the 90-100 range?
Should 'Indifference' and 'Indirect expressions of interest' be closer to the center? As in 'Indifference' in the 20-35 range and 'Indirect like' in the 60-70 range?
Should 'User conversion' blow the scale away and be head and shoulders above the others? As in: 'User Conversion' at 100, 'Lesser likes' at ~65, 'Dislikes' clustered in the 1-10 range?
On a scale of 1-100, is 50 effectively "null", i.e. equivalent to no data point at all?
I know the final answer lies in trial and error and in the meaning of our data, but as far as the algorithm goes, I'm trying to understand at what point I need to tip the scales between interest and disinterest for the algorithm to function properly.
The actual range does not matter, not for this implementation. 1-100 is OK, 0-1 is OK, etc. The relative values are all that really matters here.
These values are estimated by a simple (linearly) weighted average. Therefore the response ought to be "linear". It ought to match an intuition that if action X gets a score 2x higher than action Y, then X should be an indicator of twice as much interest in real life.
A decent place to start is to simply size them relative to their frequency. If click-to-conversion rate is 2%, you might make a click worth 2% of a conversion.
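As a purely illustrative example of sizing by frequency (the numbers are assumptions, not recommendations), on an arbitrary 0-1 scale:

preference = {
    "conversion": 1.00,  # strongest signal of interest
    "like":       0.10,  # assuming ~10% of likes lead to a conversion
    "click":      0.02,  # e.g. a 2% click-to-conversion rate
    "dislike":    0.00,  # explicit disinterest pinned to the bottom of the scale
}

The absolute numbers matter far less than whether the ratios between them reflect real differences in interest.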
I would ignore the "Indifference" signal you propose. It is likely going to be too noisy to be of use.

Calculating a lot of Lat/Lngs to a set of 2000 Lat/Lngs in Ruby

I am trying to find the best way to solve the problem below:
Problem
I have (up to) 100,000 Lat/Lng points in Set A
I have (up to) 2000 Lat/Lng points in Set B
I need to find, for each point in Set B, its nearest neighbour in Set A.
Once they have been paired, I then need to calculate their distances, which will be:
the 2000 matched Set A points to the 2000 Set B points.
These points are "in memory"; they do not come from a database. They are the result of other calculations done in the system.
Current Solution
Using a KDTree implementation in Ruby I can create a KDTree lookup that will match the points I have. I then use a haversine method in Ruby to calculate the distance of the points when they are paired.
KDtree code: Ruby KDTree Code
haversine Code: Haversine Code
Platform
I am running JRuby, with Rails as the web framework.
Issue
It's slow! Like 30 to 40 seconds slow... I think the main bottleneck is in the KDTree, but the point lookup takes a long time too (I think). At smaller numbers in Set B it's quick, but the higher the number of points in Set B, the slower it gets.
The Question
Would anyone think of doing this differently? Is there something I am missing? I think a Java library might be a lot quicker, but how would I implement this, and which one would I use? (I'm not strong in Java; I use JRuby for multithreading Ruby code in the JVM.)
Is it possible to persist the information to a database? Because then you can use GeoKit, which leverages a geo-aware database (MySQL, Postgres > 8.1, etc) so that you can do this:
Location.find(:all, :origin =>[37.792,-122.393], :within=>10, :order=>"distance asc")
Also, you can find the distance between two points, etc. The response time will be more on par with a DB query, and much faster than what you're seeing.
Just an idea: if you round your lat/longs to two decimal places, then all the points within roughly 1.11 km end up with the same rounded value. See this for more details. I'm not 100% sure about it, but maybe it works for you. Of course, for areas near the poles this will not work well, as degrees of longitude shrink there.
To speed up the distance calculation between two lat/longs, you can calculate a Euclidean distance using the simple distance formula rather than the geographical (haversine) distance. This distance will not be accurate, of course, but it will speed up your process.
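For what it's worth, here is what those two ideas look like as a sketch (in Python just to show the math; the same formulas carry over to Ruby):

import math

def bucket_key(lat, lon, places=2):
    # Rounding to two decimal places groups points into cells of roughly
    # 1.11 km in latitude (longitude cells shrink toward the poles).
    return (round(lat, places), round(lon, places))

def approx_distance_km(lat1, lon1, lat2, lon2):
    # Equirectangular approximation: treat the area as locally flat and
    # correct longitude by cos(latitude). Much cheaper than haversine and
    # fine for nearby points, but increasingly wrong over long distances.
    mean_lat = math.radians((lat1 + lat2) / 2.0)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat)
    dy = math.radians(lat2 - lat1)
    return 6371.0 * math.sqrt(dx * dx + dy * dy)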

Can longitude and latitude change?

I'm working on a geotargeting application. I'm curious whether the longitude and latitude of a point on the earth can change.
If you know the exact position of the Statue of Liberty, how sure is it that its longitude and latitude will stay the same?
Does it change with the season or time of year, or slowly over time?
Wikipedia to the rescue:
The surface layer of the Earth, the lithosphere, is broken up into several tectonic plates. Each plate moves in a different direction, at speeds of about 50 to 100 mm per year. As a result, for example, the longitudinal difference between a point on the equator in Uganda (on the African Plate) and a point on the equator in Ecuador (on the South American Plate) is increasing by about 0.0014 arcseconds per year.
It depends on the geodetic datum you use; currently WGS-84 is the most widely used.
The same point can have different coordinates depending on the datum. They do not differ by a lot; I remember the difference between EUR-50 (or something like that) and WGS-84 was at most 50 meters or so.
You're tangentially referring to geodesy, the science of modelling (representing) the shape of the earth. So while a physical location may not change, the datum (model) used by a geodetic coordinate system can change; fortunately, this does not happen frequently.
In North America NAD83 is the most widely used datum; it replaced NAD27.
Did I mention that Geographic Information Systems (GIS) was my foray into software development?
Yes. Zip codes get split all the time, and doing so would move the center of the zip code to a new location.
47.554 always equals 47.554.
But if the shape of the earth changes, or you are using different methods of calculation (there are plenty), or the input data changes in precision, or your compiler treats floating point differently...
you'll end up with a different long/lat.
