I've just integrated the Sunspot gem into my application and I really like it, except for the fact that when I do a location search it seems to be excluding some results. For example: I live in Columbus, Ohio, so if I search for "Columbus Ohio" my application translates that into a lat/lng and I do:
@search = Skatepark.search do
  with(:coordinates).near lat, lng, :precision => 3
  fulltext text
  paginate :page => params[:page], :per_page => 15
end
This returns some records that are geocoded on the west side of Columbus, but none of the records in my DB that are on the east side. Am I doing something wrong with my search?
You can try it out for yourself at http://skateparks.co/search
If you search for "Columbus Ohio" you'll get totally different results than if you search for "Lancaster Ohio" which is only a few miles to the southeast.
This is because Sunspot depends on the pr_geohash gem, which generates the geospatial index.
A geohash is a form of z-order curve.
You are attempting to solve a nearest neighbor problem using this index, which by its nature can only produce approximate results. The author would have chosen this approach since Solr is designed to deal with large datasets, which suffer from the curse of dimensionality.
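To see why nearby points can end up in different index cells, here is a minimal sketch using pr_geohash directly; the coordinates are rough approximations of west and east Columbus and the exact hash values will vary:

require 'pr_geohash'

# Two points a few miles apart can sit on opposite sides of a geohash cell
# boundary, so their hashes stop sharing a prefix at low precision.
west = GeoHash.encode(39.96, -83.10, 6)  # west side of Columbus (approximate)
east = GeoHash.encode(39.96, -82.85, 6)  # east side of Columbus (approximate)

# Sunspot's :precision option effectively matches on a hash prefix of that
# length, so points whose prefixes diverge early get filtered out even
# though they are geographically close to the search centre.
puts west[0, 3] == east[0, 3]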
Depending on your requirements, perhaps you should try:
a siloless solution, e.g. Freebase (there are several gems),
a SaaS solution,
a spatial database,
a geodatabase,
or your own approach?
Proposed Solution
Use GeoJSON objects for cities, towns, and villages rather than plain lat/long. Use semantic autocompletion so that users can select the polygon directly from Freebase (interface example: the Quora search box). This way, all results in the requested city will be returned, because you would then be doing a polygonal bounding search rather than a radial search.
I urge you to make your data available as Freebase Locations where the parent is a City/Town/Village if this is at all possible given your business model. Turns out there is both a location type and a location topic all ready for you.
Update 1
I notice that you're on Heroku.
If you have a dedicated db you could use PostGIS.
If not, read How do you do GIS queries on Heroku using the shared database?
I have a search record that stores an all_words column. It is a list of words separated by a space. I have another model named Lead. I want to search all the columns of all the rows of the leads table for the values in all_words. And any record that produces a match in any of its columns will be retrieved. Kind of like this:
possible_values = search.all_words.split
Lead.where(first_name: possible_values)
    .where(last_name: possible_values)
    .where(status: possible_values)
...
But this doesn't look clean. How can I go about this?
Indexing
You'll be much better suited to using an index-based search solution.
I wrote about this the other day: if you're going to "search" your database (especially across multiple attributes), you really need to use a third-party solution to provide access to the data you require.
The "common" way to search databases, across all flavours of SQL, is to use full-text search (which basically looks up information within an array of different attributes / columns, rather than exact matches on specific columns).
The following solutions are popular for Rails based projects:
Thinking Sphinx
Sunspot Solr
ElasticSearch
The magic of these is that they will index any of the data you wish to search, storing the data in a semi-persistent data set.
This is vitally important, as one of the main reasons full-text searching your database directly is a bad idea is the performance hit it causes. You'll be best served using one of the aforementioned gems to get it working correctly.
There's a good Railscast covering this topic as well.
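To make the idea concrete, here is a minimal Sunspot sketch for the Lead model from the question; the indexed columns and the all_words accessor come from the question, everything else is illustrative:

class Lead < ActiveRecord::Base
  searchable do
    # Index every column you want the keyword search to cover.
    text :first_name, :last_name, :status
  end
end

# One fulltext query across all indexed columns; a match on any word
# in any indexed field is enough to return the record.
matches = Lead.search { fulltext search.all_words }.results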
What you are looking for is full-text search. Depending on the type of database you have, you will use different strategies.
You will be able to create a search index on as many columns as you like.
For PostgreSQL
The good thing is that PostgreSQL already has full-text search capabilities. You can use the following gems to take advantage of them.
PG Search
Textacular
For MySQL
Dusen (uses FULLTEXT index capabilities in MySQL and LIKE queries)
Thinking Sphinx (uses the Sphinx search server)
Sunspot (uses the Solr search server)
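For example, a minimal pg_search setup for the Lead model from the question could look like this; the scope name is made up, and the any_word option makes it match records containing any of the query words:

# Gemfile: gem 'pg_search'
class Lead < ActiveRecord::Base
  include PgSearch::Model

  # Hypothetical scope searching several columns at once.
  pg_search_scope :matching_any_word,
                  against: [:first_name, :last_name, :status],
                  using: { tsearch: { any_word: true } }
end

Lead.matching_any_word(search.all_words)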
I am currently using Solr 1.4 (soon to upgrade to 3.3). The friendship table is pretty standard:
id | follower_id | user_id
I would like to perform a regular keyword Solr search and order the results by degrees of separation as well as the standard score ordering. Within the result set, any immediate friends that match the keyword would show up first, then friends of my friends, and then friends at the third degree of separation. All other results would come after.
I am pretty sure Solr doesn't offer any 'pre-baked' way of doing this therefore I would likely have to do a join on MySQL to properly order the results. Curious if anyone has done this before and/or has some insights.
It's simply not possible in Solr. However, if you aren't too restricted and could use another platform for this, consider Neo4j.
These "connections" and degrees of separation are exactly where Neo4j steps in.
http://neo4j.org/
One way might be to create fields like degree_1, degree_2 etc. and store the list of friends at degree x in the field degree_x. Then you could fire multiple queries - the first restricting the results to those who have you in degree_1, the second restricting the results to those who have you in degree_2 and so on.
It is a bit complicated, but it is the only solution I could think of using Solr.
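A rough Sunspot sketch of that idea, assuming hypothetical multi-valued degree_1 / degree_2 fields that list the IDs of users who are first- or second-degree friends of the indexed user:

class User < ActiveRecord::Base
  searchable do
    text :name
    integer :degree_1, multiple: true  # IDs of users who are 1st-degree friends
    integer :degree_2, multiple: true  # IDs of users who are 2nd-degree friends
  end
end

# One query per degree, each restricted to documents that list the searcher.
first_degree = User.search do
  fulltext params[:q]
  with(:degree_1, current_user.id)
end

second_degree = User.search do
  fulltext params[:q]
  with(:degree_2, current_user.id)
end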
I haven't represented a graph in Solr before, but at a high level this is what you could do: represent people as nodes and the social network as a graph in the database, implement a transitive-closure function in SQL so you can walk the graph, and then index the result into Solr with the social-network info stored in payloads, for example.
I was able to achieve this by performing multiple queries, using the "with" scope to restrict each one to the IDs of first-, second-, and third-degree colleagues, with MySQL doing the select that produces those IDs.
@search_1 = perform_search(1, options)
@search_2 = perform_search(2, options)

if degree == 1
  with(:id).any_of(options[:colleague_ids])
elsif degree == 2
  with(:id).any_of(options[:second_degree_colleagues])
end
It's kind of a dirty solution, since I have to perform multiple Solr queries, but until I can use dynamic field sorting options (Solr 3.3, not currently supported by Sunspot) I don't know any other way to achieve this.
I'd like to be able to order my search results by score and location. Each user in the DB has lat/lng and I am currently indexing:
location :coordinates do
  Sunspot::Util::Coordinates.new latlon[0], latlon[1]
end
The model I would be performing the search against is indexed in the same manner. Essentially, I want the results ordered by score and then by location, so if I search for Walmart, I would like to see all Walmarts ordered by their geographic proximity to my location.
I remember reading something about Solr's new geo-sort, but I'm not sure if it is out of alpha and/or if Sunspot has implemented a wrapper.
What would you recommend?
Because of the way that Sunspot calculates location types, you'll need to do some extra legwork to have it sort by distance from your target as well. The way it works is that it creates a geohash for each point and then searches using regular fulltext search on that geohash. The result is that you probably won't be able to determine whether a point 10km away is further than a point 5km away, but you will be able to tell that a point 50km away is further than a point 1-2km away. The exact distances are arbitrary, but the upshot is that you probably won't get as fine-grained a result as you would like; the search acts more as a way to filter out points that aren't within an acceptable proximity.
After you have filtered your points using the built-in location search, there are three ways to accomplish what you want:
Upgrade to Solr 3.1 or later and upgrade your schema.xml to use the new spatial search columns. You'll then need to make custom modifications to Sunspot to create fields and orderings that work with these new data types. As far as I know these aren't available in Sunspot yet, so you'll have to make those connections on your own and you'll have to dig around in Solr to do some manual configurations.
Leverage the Spatial Solr Plugin. You'll have to install a new JAR into your Solr directory and you'll have to make some modifications to Sunspot, but they are relatively painless and the full instructions can be found here.
Leverage your DB. If your DB is also indexed on the location columns, you can use Sunspot's built-in location search to filter your results down to a reasonably sized set, then query the DB for those results and order them by proximity to your location using your own distance function.
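A minimal sketch of that third option, assuming the records carry plain lat and lng columns; the model and helper names are illustrative:

# Great-circle distance in kilometres between two lat/lng pairs (haversine).
def distance_km(lat1, lng1, lat2, lng2)
  rad  = Math::PI / 180
  dlat = (lat2 - lat1) * rad
  dlng = (lng2 - lng1) * rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * rad) * Math.cos(lat2 * rad) * Math.sin(dlng / 2)**2
  6371 * 2 * Math.asin(Math.sqrt(a))
end

my_lat, my_lng = 39.96, -83.00  # searcher's location (illustrative)

# 1. Let Sunspot's geohash search cut the candidate set down.
candidates = Store.search {
  fulltext params[:q]
  with(:coordinates).near(my_lat, my_lng)
}.results

# 2. Order the survivors by true distance in Ruby (or push the same formula into SQL).
ordered = candidates.sort_by { |s| distance_km(my_lat, my_lng, s.lat, s.lng) }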
I currently have a Postgres DB with approximately 300,000 records of moving vehicles all over the world. My most frequently repeated query is: give me all vehicles within a 5/10/20-mile radius. Currently I spend around 600 to 1200 ms in the DB to prepare the set of located vehicle objects.
I am looking to vastly improve this time, ideally by one or two orders of magnitude if possible. I am working in a Ruby on Rails 3.0 beta environment, if that is relevant.
Any ideas how to architect the whole system to accelerate this query? Any NoSQL database able to deliver this kind of geolocation performance? I know of MongoDB working on an extension to facilitate this scenario but haven't tried it yet. Any intelligent use of Redis to achieve this?
One problem with SQL DBs here seems to be that I can't really use indexes, because my vehicles are mostly moving around, meaning I would have to constantly re-create the DB indexes, which by itself is probably more expensive than just doing the search without an index.
Looking forward to your thoughts. Thanks!
If you use the right algorithm for organizing your data, you will be able to use a spatial index which can dramatically speed up your queries.
The best practice for the geolocation domain is to use a geohash, quad-tree, R-tree or similar data structure (R-trees are the most generic, but it sounds like you're querying point data, so that may not matter). In each case, you can create a spatial index that uses a single, linear column where each value represents a bounding box of varying size and shape. This should let you answer most queries with a single range query in your database. Spatial indices can be implemented in SQL (PostGIS, MS SQL, MySQL all have spatial datatypes and spatial indices which use one of these techniques) or NoSQL (popular for its horizontal scalability; AppEngine has geomodel, SimpleGeo uses Cassandra, Foursquare uses MongoDB).
Using an index can be complicated by constantly moving points, but I would suspect that writes, even slightly heavier writes that update indices, wouldn't be your bottleneck.
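As a concrete illustration of the PostGIS route, the radius query becomes a single indexed call; this is a rough sketch assuming a hypothetical geom geometry column (SRID 4326) with a GiST index on it:

# Vehicles within `miles` of (lat, lng), using PostGIS's ST_DWithin.
# The ::geography casts make the distance argument metres; a GiST index
# on the geom column lets Postgres answer this without a full scan.
def vehicles_within(lat, lng, miles)
  metres = miles * 1609.34
  Vehicle.where(
    "ST_DWithin(geom::geography, ST_SetSRID(ST_MakePoint(?, ?), 4326)::geography, ?)",
    lng, lat, metres
  )
end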
Even though your vehicles are moving around all the time, I assume they have some kind of speed limit. What you can do is create some kind of discrete coordinate system; one example would be the integer part of the lat/long coordinates. Then you put those values in separate columns, keeping the exact location in another column. You should then be able to index the integer columns, as the vehicles won't move so much that they change those values very often.
When doing a search, you first find out which "squares" are interesting and restrict your query to the vehicles within those squares, using the indexed columns. Then you do a full search of all vehicles within each square. The number of vehicles you have to do a full search over should now be only a small fraction of all vehicles. The efficiency of this strategy of course depends on the distribution of your vehicles. If 50% of them are in a certain city somewhere this will not work, but assuming the largest group of vehicles in one place is 5-10%, it should improve performance.
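A small sketch of this grid idea in ActiveRecord terms; the lat_cell / lng_cell columns and the helper names are assumptions rather than an existing API:

class Vehicle < ActiveRecord::Base
  # lat_cell / lng_cell hold the integer part of the coordinates and are
  # both indexed; lat / lng keep the exact position.
  before_save { self.lat_cell, self.lng_cell = lat.floor, lng.floor }

  # Coarse filter on the indexed grid cells, then an exact distance check.
  def self.near(lat, lng, radius_miles)
    where(lat_cell: (lat.floor - 1)..(lat.floor + 1),
          lng_cell: (lng.floor - 1)..(lng.floor + 1))
      .select { |v| v.miles_to(lat, lng) <= radius_miles }
  end

  def miles_to(other_lat, other_lng)
    # Rough planar approximation; good enough for a few-mile radius.
    Math.sqrt(((lat - other_lat) * 69.0)**2 +
              ((lng - other_lng) * 69.0 * Math.cos(lat * Math::PI / 180))**2)
  end
end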
Consider the following site:
http://maps.google.com
It has a main text input, where the user can type businesses, countries, provinces, cities, addresses and zip codes. I wonder what the best way is to implement a search like this. I suspect Google Maps uses full-text search with all kinds of data in the same table, and probably has a parser that classifies the input (i.e. numeric, like zip codes and coordinates, versus textual, like businesses and addresses).
With the data spread across many tables and systems, a parser is essential. The parser could be built from regular expressions, or with AI tools like artificial neural networks and genetic algorithms.
Which approach would you recommend?
It might be best to aggregate the data from all of your tables into a search index. Lucene is a free search engine, similar to how Google's search engine works (inverted index), and it should allow you to search by any of those values or any combination of them with relative ease.
http://lucene.apache.org/java/docs/
Lucene comes with its own query language (again, very similar to Google's or any other Internet search site's syntax). The only drawback of using something like Lucene is that you need to build its index. You wouldn't be querying your database directly (which could get very complicated... inverted indexes are pretty much designed for what you're trying to do), so you need to periodically gather up new information from your database and add it to your index. It might also be necessary to rebuild your index to remove unneeded data.
With Lucene, you get a pretty flexible query syntax that most people are familiar with (because pretty much everyone searches the internet), it performs very well, and is not terribly complicated. By using Lucene, you avoid the hit of using regular expressions (which are not the most performant text searching mechanism), and you don't have to write your own parser. Should be a win-win, aside from a little learning curve to build a Lucene index generator and figure out how to query that index.
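To give a flavour of that query syntax, a single aggregated index could answer mixed queries like the following (the field names are purely illustrative):

name:argos AND city:manchester
zip:43085 OR business:"argos superstore"
address:"high street" AND province:lanca*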
I'd have the data in one database. If the data got too big or I knew it would be huge, I'd assign an ID to each business, address, etc., then have other tables that reference this data.
Regular Expressions would only be necessary if the user could define what they want to search for:
business: Argos
But then what happens if they want an Argos in Manchester (sorry, I'm English)? Maybe you then get the user's location based on their IP, but what happens if they say:
business: Argos Scotland
Now you don't know whether the company name has two words or whether there is a location after it. All of this has to be taken into consideration.
P.S. Sorry if that made no sense.
You will need to preprocess the query before doing a full-text search on it. If you are using a GIS database, you will already have columns like city, area code, country, etc. Convert your query into tokens separated on spaces, commas, or both, then hit the individual columns to check for a match. This way you will know which part of the query is the city, which is the area code, and so on.
You could also try some naive approximation approaches; for example, six consecutive digits will probably be an area code. Look for common words like "road", "restaurant", "street", etc. that will be part of many queries, and then use some approximation to figure out what the user is looking for. Hope this helps.
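Here is a minimal sketch of that naive token-classification idea; the categories and patterns are illustrative, not taken from any library:

# Split the query on spaces/commas and tag each token with a guess.
def classify_tokens(query)
  query.split(/[\s,]+/).map do |token|
    case token
    when /\A\d{5,6}\z/                          then [:area_code, token]
    when /\A(road|street|avenue|restaurant)\z/i then [:keyword, token]
    else                                             [:name, token]  # candidate city or business name
    end
  end
end

classify_tokens("Argos, Manchester 560037")
# => [[:name, "Argos"], [:name, "Manchester"], [:area_code, "560037"]]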