Detecting Possible Duplicates in Rails

I have a Rails 3 application with a model that has a name and a geographic location (lat/lng). How would I go about searching for possible duplicates in my model? I want to create a cron job or something similar that checks whether two objects have a similar name and are less than 0.5 miles away from each other. If both conditions match, we'll flag the objects.
I am using Ruby Geocoder and ThinkingSphinx in my application.

Levenshtein distance is as good a way as any of judging the similarity of two text strings, i.e. the names.
What I would suggest is to store the latitude and longitude separately (as well as, or instead of, the single "lat;long" string). Then you can run an SQL query to find other records that are within a certain distance, and THEN run Levenshtein on their names. You want to run Levenshtein as few times as possible, as it's slow.
Then you could do something like this. Let's say your model name is "Place":
class Place < ActiveRecord::Base
  def nearby_places
    range = 0.005 # adjust this to get the proximity you want
    # lat and long are fields that hold the latitude and longitude as floats
    Place.where("id <> ? AND lat > ? AND lat < ? AND long > ? AND long < ?",
                id, lat - range, lat + range, long - range, long + range)
  end

  def similars
    nearby_places.select do |place|
      # Levenshtein logic here: return true if name and place.name
      # are similar according to your criteria
    end
  end
end
I've set range to 0.005, but I had no idea what it should be for half a mile. Let's work it out: Google says one degree of latitude is 69.13 miles, so half a mile in degrees would be 1/(69.13 * 2), which gives about 0.0072, so not a bad guess :)
Note that my search logic would return places that are anywhere within a square which is a mile per side, with our current place in the centre. This would potentially include more places than a circle with 1/2 mile radius with our current place in the centre, but it's probably fine as a quick way of getting some nearby places.
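The comment inside similars leaves the Levenshtein step open. Here is a minimal sketch of one way to fill it in, in plain Ruby with no gems. The method names levenshtein and similar_names? and the 0.3 threshold are assumptions made for illustration, not part of the answer above:

```ruby
# Classic dynamic-programming Levenshtein distance, keeping one row at a time.
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical criterion: names count as similar when the edit distance is
# at most 30% of the longer name's length (case and outer whitespace ignored).
def similar_names?(name_a, name_b, max_ratio = 0.3)
  a = name_a.to_s.strip.downcase
  b = name_b.to_s.strip.downcase
  longest = [a.length, b.length].max
  return true if longest.zero?

  levenshtein(a, b).to_f / longest <= max_ratio
end
```

The block body in similars could then be similar_names?(name, place.name). The threshold would need tuning against real data.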

Related

Improve Neo4j query performance

I have a Neo4j query that searches across multiple entities, and I would like to pass parameters in a batch using a nodes object. However, the query executes slowly. How can I optimize this query and improve its performance?
WITH $nodes AS nodes
UNWIND nodes AS node
WITH node.id AS id, node.lon AS lon, node.lat AS lat
MATCH
  (m:Member)-[mtg_r:MT_TO_MEMBER]->(mt:MemberTopics)-[mtt_r:MT_TO_TOPIC]->(t:Topic),
  (t1:Topic)-[tt_r:GT_TO_TOPIC]->(gt:GroupTopics)-[tg_r:GT_TO_GROUP]->(g:Group)-[h_r:HAS]->
  (e:Event)-[a_r:AT]->(v:Venue)
WHERE mt.topic_id = gt.topic_id AND
  distance(point({ longitude: lon, latitude: lat }), point({ longitude: v.lon, latitude: v.lat })) < 4000 AND
  mt.member_id = id
RETURN
  DISTINCT id AS member_id,
  lat AS member_lat,
  lon AS member_lon,
  g.group_name AS group_name,
  e.event_name AS event_name,
  v.venue_name AS venue_name,
  v.lat AS venue_lat,
  v.lon AS venue_lon,
  distance(point({ longitude: lon, latitude: lat }), point({ longitude: v.lon, latitude: v.lat })) AS distance
Query profiling looks like this:

So, your current plan has 3 parallel threads. One we can ignore for now because it has 0 DB hits.
The biggest hit you are taking is the match for (mt:MemberTopics) ... WHERE mt.member_id = id. I'm guessing member_id is a unique id, so you will want to create an index on it: CREATE INDEX ON :MemberTopics(member_id). That will allow Cypher to do an index lookup instead of a node scan, which will reduce the DB hits from ~30 million to ~1. (Also, in some cases, inlining property matches is faster for more complex queries, so (mt:MemberTopics {member_id: id}) is better. It makes explicit that this condition must always be true while matching, and will reinforce the use of the index lookup.)
The second biggest hit is the point-distance check. Right now, this is being done independently, because the node scan takes so long. Once you make the changes for MemberTopics, the planner should switch to finding all connected Venues and then doing the distance check only on those, so that should become cheaper as well.
Also, it looks like mt and gt are linked by a topic, and you are using a topic id to align them. If t and t1 are supposed to be the same Topic node, you could just use t for both nodes to enforce that, and then you don't need the id check to link mt and gt. If t and t1 are not the same node, the use of a foreign key in your nodes' properties is a sign that you should have a relationship between the two nodes and just travel along that edge. (Relationships can have properties too, but the context looks a lot like t and t1 are supposed to be the same node. You can also enforce this by saying WHERE t = t1, but at that point, you should just use t for both nodes.)
Lastly, depending on the number of rows your query returns, you may want to use LIMIT and SKIP to page your results. This looks like info going to a user, and I doubt they need the full dump. So only return the top results, and only process the rest if the user wants to see more. (This becomes useful as results approach a metric ton.) Since you only have 21 results so far, this won't be an issue right now, but keep it in mind as you scale to 100,000+ results.

rails Geocoder gem order by location from me

I have a Place model with lat and long fields.
I need to select all places and order them by distance from my geoposition.
I think I need to use the Geocoder gem.
The user sends me their lat & long,
and I need to do something like
Place.where(status: 1) and then order the results using that lat/long.
How can I do this?
Thanks

You would use the near method:
Place.where(status: 1).near("Party City, Utah", 20) # finds places within 20 miles
This will grab all Place records that are within the given radius of the given location; in this case, the location is "Party City, Utah". Since you already have the user's coordinates, you can pass them as an array instead of an address, e.g. near([lat, long], 20). By default the results come back ordered by distance. You can pass additional arguments to the near method to limit the number of results and to change the radius of your query.
More information is in the Geocoder docs.

PostGIS: How to find N closest sets of points to a given set?

I am using PostGIS/Rails and have sets of points with geolocations.
class DataSet < ActiveRecord::Base # these are the sets containing the points
has_many :raw_data
# attributes: id , name
end
class RawData < ActiveRecord::Base # these are the data points
belongs_to :data_set
# attributes: id, location which is "Point(lon,lat)"
end
For a given set of points I need to find the N closest sets and their distance;
or alternatively:
For a given max distance and set of points I need to find the N closest sets.
What is the best way to do this with PostGIS?
My versions are PostgreSQL 9.3.4 with PostGIS 2.1.2

The answer on how to find the N closest neighbours in PostGIS is given here:
Postgis SQL for nearest neighbors
To summarize the answer there:
You need to create a geometry object for your points. If you are using latitude and longitude, you need to use SRID 4326.
UPDATE season SET geom = ST_PointFromText ('POINT(' || longitude || ' ' || latitude || ')' , 4326 ) ;
Then you create an index on the geom field
CREATE INDEX [indexname] ON [tablename] USING GIST ( [geometryfield] );
Then you get the kNN neighbors:
SELECT *,ST_Distance(geom,'SRID=4326;POINT(newLon newLat)'::geometry)
FROM yourDbTable
ORDER BY
yourDbTable.geom <->'SRID=4326;POINT(newLon newLat)'::geometry
LIMIT 10;
where newLon and newLat are the query point's coordinates.
This query will take advantage of kNN functionality of the gist index (http://workshops.boundlessgeo.com/postgis-intro/knn.html).
Still the distance returned will be in degrees, not meters (projection 4326 uses degrees).
To fix this:
SELECT *,ST_Distance(geography(geom),ST_GeographyFromText('POINT(newLon newLat)'))
FROM yourDbTable
ORDER BY
yourDbTable.geom <->'SRID=4326;POINT(newLon newLat)'::geometry
LIMIT 10;
When you calculate ST_Distance, use the geography type; there the distance is always in meters:
http://workshops.boundlessgeo.com/postgis-intro/geography.html
All this functionality will probably need a recent Postgis version (2.0+). I am not sure though.
Check this for reference https://gis.stackexchange.com/questions/91765/improve-speed-of-postgis-nearest-neighbor-query/
EDIT: This covers the necessary steps for one point. For a set of points:
SELECT n1.*, n2.*, ST_Distance(n1.geom, n2.geom)
FROM yourDbTable n1, yourDbTable n2
WHERE n1.setId=1 AND n2.setId=2 -- your condition here for the separate sets
AND n1.id<>n2.id -- in case the same object belongs to 2 sets
ORDER BY n1.geom <->n2.geom
LIMIT 20;

finding locations within a particular distance using db2

I am using the HTML5 Geolocation API to get my position as latitude and longitude. I want to store positions in a table of locations and retrieve the locations within a particular distance.
My current latitude, longitude and search distance are stored in the variables "latval", "longval" and "distance".
My table is "location".
Its columns are "location", "lat" and "long".
I am using DB2 Express-C as the database, and the latitude and longitude columns are currently of type double. What type should I use to store these values, and what would the query be to get the location names within a distance?
Thank you.

It looks like there's an extension for Express C that includes Spatial processing. I've never used it (and can't seem to get access at the moment), so I can't speak to it. I'm assuming that you'd want to use that (find all locations within a radius is a pretty standard query).
If for some reason you can't use the extension, here's what I would do:
Keep your table as-is, or maybe use a float data-type, although please use full attribute names (there's no reason to truncate them). For simple needs, the name of the 'location' can be stored in the table, but you may want to give it a numeric id if more than one thing is at the same location (so actual points are only in there once).
You're also going to want indices covering latitude and longitude (probably one each way, or one covering each column).
Then, given a starting position and distance, use this query:
SELECT name, latitude, longitude
FROM location
WHERE (latitude >= :currentLatitude - :distance
AND latitude <= :currentLatitude + :distance)
AND (longitude >= :currentLongitude - :distance
AND longitude <= :currentLongitude + :distance)
-- The previous four lines reduce the points selected to a box.
-- This is, of course, not completely correct, but should allow
-- the query to use the indices to greatly reduce the initial
-- set of points evaluated.
-- You may wish to flip the condition and use ABS(), but
-- I don't think that would use the index...
AND POWER(latitude - :currentLatitude, 2) + POWER(longitude - :currentLongitude, 2)
<= POWER (:distance, 2)
-- This uses the pythagorean theorem to find all points within the specified
-- distance. This works best if the points have been pre-limited in some
-- way, because an index would almost certainly not be used otherwise.
-- Note that, on a spherical surface, this isn't completely accurate
-- - namely, distances between longitude points get shorter the farther
-- from the equator the latitude is -
-- but for your purposes is likely to be fine.
EDIT:
Found this after searching for 2 seconds on Google, which also reminded me that :distance would be in the wrong units. The revised query is:
WITH Nearby (name, latitude, longitude, dist) as (
SELECT name, latitude, longitude,
ACOS(SIN(RADIANS(latitude)) * SIN(RADIANS(:currentLatitude)) +
COS(RADIANS(latitude)) * COS(RADIANS(:currentLatitude)) *
COS(RADIANS(:currentLongitude - longitude))) * :RADIUS_OF_EARTH as dist
FROM location
WHERE (latitude >= :currentLatitude - DEGREES(:distance / :RADIUS_OF_EARTH)
AND latitude <= :currentLatitude + DEGREES(:distance / :RADIUS_OF_EARTH))
AND (longitude >= :currentLongitude -
DEGREES(:distance / :RADIUS_OF_EARTH / COS(RADIANS(:currentLatitude)))
AND longitude <= :currentLongitude +
DEGREES(:distance / :RADIUS_OF_EARTH / COS(RADIANS(:currentLatitude))))
)
SELECT *
FROM Nearby
WHERE dist <= :distance
Please note that wrapping the distance function in a UDF marked DETERMINISTIC would allow it to be placed in both the SELECT and HAVING portions, but only actually be called once, eliminating the need for the CTE.
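To sanity-check the numbers the revised query produces, the same spherical-law-of-cosines expression can be evaluated outside the database. Here is a minimal Ruby sketch. The method name great_circle_distance and the Earth radius of 3959 miles are assumptions (use 6371 for kilometres), and the clamp is an addition that guards against rounding pushing the ACOS argument slightly outside [-1, 1]:

```ruby
EARTH_RADIUS_MILES = 3959.0 # assumed mean radius; swap in 6371.0 for km

def radians(degrees)
  degrees * Math::PI / 180.0
end

# Same expression as the dist column in the query, with coordinates in degrees.
def great_circle_distance(lat1, lon1, lat2, lon2, radius = EARTH_RADIUS_MILES)
  cos_angle = Math.sin(radians(lat1)) * Math.sin(radians(lat2)) +
              Math.cos(radians(lat1)) * Math.cos(radians(lat2)) *
              Math.cos(radians(lon2 - lon1))
  # Clamp so floating-point rounding cannot make acos raise a domain error.
  Math.acos([[cos_angle, 1.0].min, -1.0].max) * radius
end
```

The result is in whatever unit the radius is given in, so :distance and :RADIUS_OF_EARTH in the query must use the same unit as well.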

zipcode distance range formula. Which formula is correct?

I need to find all the zipcodes within a certain range of a zipcode. I have all the zipcode lat/lon in a DB.
I have found two formulas online that vary slightly from each other. Which one is correct?
Formula 1:
def latRange = range/69.172
def lonRange = Math.abs(range/(Math.cos(Math.toRadians(zip.latitude)) * 69.172));
def minLat = zip.latitude - latRange
def maxLat = zip.latitude + latRange
def minLon = zip.longitude - lonRange
def maxLon = zip.longitude + lonRange
Formula 2: (is identical to formula one except for the following:)
def lonRange = Math.abs(range/(Math.cos(zip.latitude) * 69.172));
(The second one does not have Math.toRadians )
After I get the min/max Lat/Lon, I intend to search a DB table using a between criteria. Which formula is correct?

It depends on the units of your lat/long. If they are in degrees, you need to convert to radians.
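As a concrete check of why the conversion matters, here is Formula 1 transcribed to Ruby. This is a sketch only: the method name bounding_box and the hash keys are made up for illustration, and 69.172 miles per degree is the value from the question. Ruby's Math.cos, like Java's, expects radians, so a latitude stored in degrees has to be converted first:

```ruby
MILES_PER_DEGREE = 69.172 # value used in the question

# Returns the min/max latitude and longitude (in degrees) of a box that
# extends range_miles in each direction from the given point.
def bounding_box(lat_deg, lon_deg, range_miles)
  lat_range = range_miles / MILES_PER_DEGREE
  lat_rad = lat_deg * Math::PI / 180.0 # degrees -> radians before Math.cos
  lon_range = (range_miles / (Math.cos(lat_rad) * MILES_PER_DEGREE)).abs
  {
    min_lat: lat_deg - lat_range, max_lat: lat_deg + lat_range,
    min_lon: lon_deg - lon_range, max_lon: lon_deg + lon_range
  }
end
```

At latitude 60 the longitude range comes out about twice the latitude range, as expected since the cosine of 60 degrees is 0.5. Passing 60 straight to Math.cos would treat it as 60 radians and give a meaningless width.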

I would suggest letting the db do all the heavy lifting. Several DBs have geo add-ons, but here are a couple examples: MongoDB, postgres+postgis

If your latitude and longitude data isn't already in radians then you'll need to use the one that converts. You should be aware though that the way things are set up now you'll end up with a square range.
You'll be better off doing the distance calculation in your MySQL query, using the Pythagorean theorem to find the distance, so that you end up with a circular range.
This question should help you get started with that: mySQL select zipcodes within x km/miles within range of y .
If you need to improve its performance, you can index it using Sphinx.
