Let's say I have 1000 objects that have a latitude/longitude value.
What I'd like to do is send up a user's latitude/longitude and return the 50 closest objects. I know how I could do that with one value - if I had just a latitude, but location seems different with the two values so I'm not sure how I'd do that.
What would be the best way to accomplish that? Feel free to point me to a tutorial or anything - I'm very new to using location in Rails :).
There isn't any detail in the original question regarding data representation, but let's suppose you have a list, locations, of lat/long values in the form [lat, long]. Let's suppose further you have a metric method for the distance between two such points, dist(p1, p2).
Then this will collect all of the location pair combinations and their respective distances as a collection of triples, [dist, p1, p2]:
locations.combination(2).collect { |p| [dist(p[0], p[1]), p[0], p[1]] }
You can then sort this, which will order by distance, and pick off the top 50:
locations.combination(2).collect { |p| [dist(p[0], p[1]), p[0], p[1]] }.sort.first(50)
You'd have to see if it works with 1,000 objects, as this will create 499,500 combinations (1000 × 999 / 2) initially and I don't know what the Ruby array capacity is.
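For the case in the question, where a single user position is compared against every stored object, only one distance per object is needed rather than every pair. A minimal sketch, assuming each location is a [lat, long] pair in degrees and taking the haversine formula as the dist metric (the names haversine_km and nearest are made up for illustration):

```ruby
# Mean Earth radius in kilometres (an approximation; the Earth is not a perfect sphere).
EARTH_RADIUS_KM = 6371.0

# Great-circle (haversine) distance between two [lat, long] pairs given in degrees.
def haversine_km(a, b)
  lat1, lon1 = a.map { |deg| deg * Math::PI / 180 }
  lat2, lon2 = b.map { |deg| deg * Math::PI / 180 }
  dlat = lat2 - lat1
  dlon = lon2 - lon1
  h = Math.sin(dlat / 2)**2 + Math.cos(lat1) * Math.cos(lat2) * Math.sin(dlon / 2)**2
  # Clamp to 1.0 so rounding error cannot push asin out of its domain.
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt([h, 1.0].min))
end

# Returns the n locations closest to the user, nearest first.
def nearest(locations, user, n = 50)
  locations.sort_by { |loc| haversine_km(user, loc) }.first(n)
end
```

With 1,000 objects this computes and sorts only 1,000 distances per request. For much larger tables it would probably be worth pushing the filtering into the database instead.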
This is a classic computer science problem called the nearest neighbor problem. See this Wikipedia article for two popular solutions.
The article describes the point with x, y coordinates. You can just substitute longitude for x and latitude for y.
Related
I have a dataset which contains information about houses worldwide with the following features: house size, number of bedrooms, city name, country name, garden or not, ... (and many other typical house information). And the target variable is the price of the house.
I know that strings are not acceptable as input in a Machine Learning or Neural Network model so instead of doing one hot encoding for the city name and the country name (because I would end up with a few hundred columns) I decided to replace the city name with its geographical coordinates (one column with longitude and one column with latitude).
The city where a house is located will obviously help determine the price of the house.
So does replacing the city name with its longitude and latitude preserve this important information? Is it alright to replace the city name with its longitude and latitude?
Cartesian coordinates can be useful for the model to some extent. However, for certain models such as decision trees, properly modeling the dependency of the target variable on geographical coordinates can require overly complex models. For a clear and visual understanding of this you may check this.
A common approach in these cases is to transform the coordinates into polar coordinates, and add them as new features. When you think about it, you're adding a new way of expressing the same thing, just in a different scale or system. That way a tree will require fewer splits to be able to model this spatial dependency of the samples.
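As a rough illustration of that transformation, the sketch below turns a longitude/latitude pair into a radius and an angle around a reference point and appends both to a feature row. The function names, the :longitude and :latitude keys, and the default reference point (0, 0) are all assumptions. It treats degrees as flat planar coordinates and ignores wrap-around at ±180° longitude.

```ruby
# Converts a longitude/latitude pair into [radius, angle] around a reference point.
# The angle is in radians, measured counter-clockwise from the positive longitude axis.
def to_polar(lon, lat, center_lon = 0.0, center_lat = 0.0)
  dx = lon - center_lon
  dy = lat - center_lat
  [Math.hypot(dx, dy), Math.atan2(dy, dx)]
end

# Returns a copy of the feature row (a Hash) with the polar features added
# alongside the original coordinates.
def add_polar_features(row)
  radius, angle = to_polar(row[:longitude], row[:latitude])
  row.merge(radius: radius, angle: angle)
end
```

The original longitude and latitude columns are kept, so the model can use whichever representation splits the data more easily.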
That being said, I would not completely replace the existing geolocation data with coordinates. It would probably also be interesting to add some aggregates/statistics based on the country or city data, rather than one hot encoding them or just replacing them with coordinates.
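One possible aggregate of that kind is the mean price per city. The sketch below is only an illustration: the :city and :price keys and both function names are assumptions, and the means should be computed from training rows only so that the target does not leak into evaluation data.

```ruby
# Mean price per city, computed from the training rows only.
def mean_price_by_city(rows)
  rows.group_by { |row| row[:city] }
      .transform_values { |group| group.sum { |row| row[:price] }.to_f / group.size }
end

# Adds the city's mean price as a feature; cities missing from the training
# data fall back to a caller-supplied value such as the overall mean.
def add_city_mean(row, city_means, fallback)
  row.merge(city_mean_price: city_means.fetch(row[:city], fallback))
end
```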
Is it possible to sort returned objects from Backand based on how near the location field of type "point" is to the querying user's current location?
From the Backand docs I have only seen support for querying based on a maximum distance from a point but nothing about sorting by geo points.
I was able to create a custom query in Backand which I can hit from the Backand API. Unfortunately in order to sort on the distance of nearby users I need to calculate the distance from the current user to every other user in the database and then sort based on this. Seems very complex - a lot of calculations every time the query is called! Will probably see big performance hits as the database gets larger. Guess it answers this question, but I am hopeful still of finding a better alternative.
I'm a software engineering student and new to data mining. I want to implement a solution to find similar users based on their interests and skills (sets of strings).
I don't think I can use k-nearest neighbors with an edit distance (Levenshtein or similar).
Could someone help with that, please?
The first thing you should do is convert your data into some reasonable representation, so that you will have a well-defined notion of distance between suitably represented users.
I would recommend converting all strings into some canonical form, then sorting all n distinct skills and interest strings into a dictionary D. Now for each user u, construct a vector v(u) with n components, which has i-th component set to 1 if the property in dictionary entry i is present, and 0 otherwise. Essentially we represented each user with a characteristic vector of her interests/skills.
Now you can compare users with Jaccard index (it's just an example, you'll have to figure out what works best for you). With the notion of a distance in hand, you can start trying out various approaches. Here are some that spring to mind:
apply hierarchical clustering if the number of users is sufficiently small;
apply association rule learning (I'll leave you to think out the details);
etc.
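The representation and the Jaccard comparison described above can be sketched as follows. Working with sets of canonical strings gives the same result as comparing the 0/1 characteristic vectors. The canonical form used here (trimmed, lower-cased) and the function names are assumptions.

```ruby
require "set"

# Canonical form of a skill or interest string (an assumed normalisation).
def canonical(tag)
  tag.strip.downcase
end

# Jaccard index of two collections of strings: |A ∩ B| / |A ∪ B|.
# Returns 0.0 when both collections are empty.
def jaccard(a, b)
  set_a = a.map { |tag| canonical(tag) }.to_set
  set_b = b.map { |tag| canonical(tag) }.to_set
  union = set_a | set_b
  return 0.0 if union.empty?

  (set_a & set_b).size.to_f / union.size
end

# Names of the k users most similar to target; users is a Hash of name => tags.
def most_similar(users, target, k = 3)
  target_tags = users.fetch(target)
  users.reject { |name, _| name == target }
       .sort_by { |_, tags| -jaccard(target_tags, tags) }
       .first(k)
       .map(&:first)
end
```

The value 1 - jaccard(a, b) can serve as the distance for a nearest-neighbour search or for hierarchical clustering.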
I have an app that displays information about certain venues. Each venue is awarded a rating on a scale from 0-100. The app includes a map, and on the map I'd like to show the best nearby venues. (The point is to recommend to the user alternative venues that they might like.)
What is the best way to approach this problem?
If I fetch the nearest x venues, many bad venues (i.e. those with a low rating) show.
If I fetch the highest rated venues, many of them will be too far away to be useful as recommendations.
This seems like a pretty common challenge for any geolocation app, so I'm interested to know what approach other people have taken.
I have considered "scoring" each possible venue by taking into account its rating and its distance in miles.
I've also considered fetching the highest rated venues within a y mile radius, but this gets problematic because in some cities there are a lot of venues in a small area (e.g. New York) and in others it's reasonable to recommend venues that are farther away.
(This is a Rails app, and I'm using Solr with the Sunspot gem to retrieve the data. But I'm not necessarily looking for answers in code here, more just advice about the logic.)
Personally, I would implement a few formulas and use some form of A/B testing to get an idea as to which ones yield the best results on some outcome metric. What exactly that metric is is up to you. It could be clicks, or it could be something more complicated.
Start out with the simplest formula you can think of (ideally one that is computationally cheap as well) to establish a baseline. From there, you can iterate, but the absolute key concept is that you'll have hard data to tell you if you're getting better or worse, not just a hunch (perhaps that a more complicated formula is better). Even if you got your hands on Yelp's formula, it might not work for you.
For instance, as you mentioned, a single score calculated based on some linear combination of inverse distance and establishment quality would be a good starting point and you can roll it out in a few minutes. Make sure to normalize each component score in some way. Here's a possible very simple algorithm you could start with:
Filter venues as much as possible on fast-to-query attributes (by type, country, etc.)
Filter remaining venues within a fairly wide radius (you'll need to do some research into exactly how to do this in a performant way; there are plenty of posts on Stack Overflow and elsewhere on this. You'll want to index your database table on latitude and longitude, and follow a number of other best practices).
Score the remaining venues using some weights that seem intuitive to you (I arbitrarily picked 0.25 and 0.75, but they should add up to 1):
score = 0.25 * (1 - (distance - distance of closest venue) / (distance of furthest venue - distance of closest venue)) + 0.75 * (quality score / highest quality score in remaining set)
Sort them by score and take the top n
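The scoring and sorting steps might look roughly like this. It is a sketch under assumptions: venues are hashes with hypothetical :distance and :quality keys, and distance is min-max normalised over the remaining set so that the closest venue gets a closeness of 1 and the furthest gets 0.

```ruby
# venues: array of hashes with :name, :distance (any unit) and :quality (0-100).
# Returns the top n venues with a :score added, best first.
def top_venues(venues, n, distance_weight: 0.25, quality_weight: 0.75)
  return [] if venues.empty?

  d_min, d_max = venues.map { |v| v[:distance] }.minmax
  q_max = venues.map { |v| v[:quality] }.max
  span = d_max - d_min

  scored = venues.map do |v|
    # When all venues are equally far away, distance cannot discriminate.
    closeness = span.zero? ? 1.0 : 1.0 - (v[:distance] - d_min).to_f / span
    quality = q_max.zero? ? 0.0 : v[:quality].to_f / q_max
    v.merge(score: distance_weight * closeness + quality_weight * quality)
  end

  scored.sort_by { |v| -v[:score] }.first(n)
end
```

Keeping the weights as keyword arguments makes it easy to run different weightings side by side in an A/B test.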
I would put money on Yelp using some fancy-pants version of this simple idea. They may be using machine learning to actually select the weights for each component score, but the conceptual basis is similar.
While there are plenty of possibilities for calculating formulas of varying complexity, the only way to truly know which one works best is to gather data.
I would fix the number of venues returned at say 7.
Discard all venues with scores in the lowest quartile of reviewers' scores, to avoid bad customer experiences, then return the top 7 within a postcode. If this results in fewer than 7 entries, then look to the neighboring postcodes to find the best scores to complete the list.
This would result in a list of top to mediocre scores locally, perhaps with some really good scores only a short distance away.
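A sketch of that procedure, assuming venues are hashes with hypothetical :postcode and :rating keys and that a lookup table of neighbouring postcodes already exists. The quartile cut-off is approximated by position in the sorted ratings.

```ruby
# venues: hashes with :postcode and :rating.
# neighbours: Hash of postcode => array of adjacent postcodes (assumed to exist).
def recommend(venues, postcode, neighbours, limit = 7)
  ratings = venues.map { |v| v[:rating] }.sort
  cutoff = ratings[ratings.size / 4] # approximate first-quartile boundary
  decent = venues.select { |v| v[:rating] >= cutoff }

  by_rating = ->(list) { list.sort_by { |v| -v[:rating] } }

  local = by_rating.call(decent.select { |v| v[:postcode] == postcode }).first(limit)
  return local if local.size >= limit

  # Top up the list with the best venues from adjacent postcodes.
  nearby_codes = neighbours.fetch(postcode, [])
  nearby = by_rating.call(decent.select { |v| nearby_codes.include?(v[:postcode]) })
  local + nearby.first(limit - local.size)
end
```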
From a UX perspective this would easily allow users to either select a postcode/area they are interested in or allow the app to determine its location.
From a data perspective, you already have addresses. The only "tricky" bit is determining what the neighboring postcodes/areas are, but I'm sure someone has figured that out already.
As an aside, I'm a great believer in things changing. Like restaurants changing hands or the owners waking up and getting better. I would consider offering a "dangerous" list of sub-standard eateries "at your own risk" as another form of evening entertainment. Personally I have found some of my worst dining experiences have formed some of my best dining out stories :-) And if the place has been harshly judged in the past you can sometimes find it is now a gem in the making.
First I suggest that you use a Bayesian average to maintain an overall rating for all the venues, more info here: https://github.com/tyrauber/acts_rateable
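For reference, one common formulation of a Bayesian average pulls each venue's mean rating toward the global mean, with the pull weakening as the venue collects more ratings. The sketch below illustrates that formulation only and is not necessarily what the gem implements. The function name and the prior_weight parameter are assumptions.

```ruby
# Bayesian average: (prior_weight * global_mean + sum of ratings) / (prior_weight + count).
# prior_weight behaves like a number of "virtual" ratings at the global mean.
def bayesian_average(ratings, global_mean, prior_weight = 5)
  (prior_weight * global_mean + ratings.sum).to_f / (prior_weight + ratings.size)
end
```

A venue with a single rating of 100 then ranks below a venue with fifty ratings averaging 90, which is usually the desired behaviour.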
Then you can retrieve the nearest venues ordered by distance and then by rating, using two ORDER BY clauses in your query.
I'm working on an iOS game where players' villages are displayed on a large kingdom map. Each village has an x,y location on that map, and each village is stored as an object in a database on a server (Parse.com).
What I want to be able to do is pull down all the villages around the current player's village. Usually this would be straightforward, as you would just use the shortest distance algorithm, but to use that I would need to download all of the villages in the database, run the algorithm on each one, and then sort them according to distance from the player, which is not exactly a quick or efficient way of doing it. So does anyone know of a more refined or efficient way of doing the above? What would be great would be to pull down the villages around the current player in the query to the database itself, killing two birds with one stone so to speak, but I can't see any way to do that. I suspect the answer lies in perhaps storing more information about the location of the village in the database, so a query could pull the closest ones down without having to run an algorithm to make it happen.
Any ideas?
I'll leave this question up as I'm still interested in how to do it with basic math. The Manhattan distance approach should be OK, but for anyone using Parse.com it might be possible to use geoPoints maybe? It's a nutty idea, but I'm going to try it.
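You can easily get all villages within a square of half-width D centered on point (X,Y) by querying for all villages matching the constraints X - D < x and x < X + D and Y - D < y and y < Y + D. This box contains every village within Manhattan distance D of (X,Y), plus some extras near the corners.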
You can then do further filtering based on Euclidean distance on the client if you want to.
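A sketch of the two stages, assuming villages are hashes with :x and :y keys. Here in_box stands in for the range query that would really run on the server, and both function names are made up for illustration.

```ruby
# Stage 1: the bounding-box constraints, which map directly onto simple
# range conditions in a database query.
def in_box(villages, x, y, d)
  villages.select { |v| (v[:x] - x).abs < d && (v[:y] - y).abs < d }
end

# Stage 2: client-side refinement by Euclidean distance, nearest first.
def nearest_villages(candidates, x, y, d)
  candidates
    .select { |v| Math.hypot(v[:x] - x, v[:y] - y) <= d }
    .sort_by { |v| Math.hypot(v[:x] - x, v[:y] - y) }
end
```

The box query only needs comparisons on x and y, so ordinary indexes on those two columns can serve it. The more expensive distance calculation then runs on a much smaller candidate set.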