RoR Location Search - ruby-on-rails

Sorry for the fairly open question, but I was wondering whether anyone had any advice on the best way to create an app that searches for properties within a particular radius.
The best example of what I am looking to achieve is RightMove.
I was wondering what the best setup would be for adding city, town and postcode data and making it searchable.
I have been reading about Geocoder, but I was wondering whether this would be the best option for such an app or whether there are good alternatives. For example, would you recommend storing all the location data in my own database, or using an API to feed in this information?
Any advice or links people can offer really would be appreciated! Thanks.

The approach depends purely on your requirements and on the availability of geocoded data for the locations you care about.
Using Geocoder gives you the advantage that you don't have to worry about maintaining a geo-database for a given location yourself. It has its own downsides (request timeouts, data not being available for a particular location, licensing, query limits, etc.), but they can be addressed.
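For illustration, a minimal sketch of how a radius search with the geocoder gem typically looks (the Property model, its columns and the coordinates below are assumptions, not something from the question):

    # Gemfile: gem 'geocoder'
    class Property < ActiveRecord::Base
      # expects float columns :latitude and :longitude
      geocoded_by :address
      after_validation :geocode, if: ->(p) { p.address.present? }
    end

    Property.near('SW1A 1AA', 5, units: :km)          # properties within 5 km of a postcode
    Property.near([51.5007, -0.1246], 3, units: :km)  # within 3 km of a lat/lng point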
If you are okay with storing the data in your own DB, then you can achieve the same thing using a PostgreSQL + PostGIS setup. The PostGIS module gives you the ability to do spatial querying in terms of radius, checking whether a given geo-point falls within a pre-defined polygon, etc., and since these queries are executed inside the DB, the performance is also very good. This approach has two advantages: you don't have to sign up for any service, and there are no timeout errors. The downside is that you have to maintain/update the location data yourself.
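A rough sketch of that kind of radius query from Rails, assuming a properties table with a geography(Point, 4326) column named lonlat (all names here are illustrative):

    class Property < ActiveRecord::Base
      # All properties within radius_m metres of a point (PostGIS ST_DWithin).
      def self.within_radius(lat, lng, radius_m)
        where('ST_DWithin(lonlat, ST_MakePoint(?, ?)::geography, ?)', lng, lat, radius_m)
      end
    end

    Property.within_radius(51.5007, -0.1246, 5_000)  # 5 km radius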
I have done a handful of RoR projects with the second approach, and it has worked quite well for us.
Hope this helps.

Related

Geodata Querying Optimisations

I am planning to write a Node.js-powered RESTful web service that I will use for a mobile application providing some sort of location-based features. The most basic use case is going to look something like this:
the user can create a resource by sending a request to the web service containing the resource's name and the user's current location (latitude and longitude)
the web service will store the metadata about this resource internally in some sort of collection
the user can query the web service for a list of resources within 5km of his current location
One of the first problems that came to mind was scalability. Let's suppose that at some point in the future the server will hold metadata for 1 million resources. When a user queries for nearby results, looping through 1 million entries to compute distances will take forever.
There are many services out there that have the same flow, so I thought implementing something like this is not going to take me a lot of time. I might have been wrong.
I am now two days into researching proven methods and algorithms. By now I have read everything I could get my hands on about QuadTrees, Geohashes, databases with spatial indexing support, formulas and so on. However, I still can't get the whole picture of how everything is going to work.
I was hoping that maybe someone who has worked on something similar could share his insight on what approach might be the most suitable considering this use case and the technologies that I am planning to use. Also, a short description of how it can be implemented would help me a lot!
For those who are also looking for more information on this topic out of curiosity, my answer might not provide much clarity. However, some of the answers here might help you understand how you could achieve proximity searches using Geohashes.
My approach, after doing a little research on Redis, will be not to overcomplicate things and simply use the tools that are already out there. Redis has out-of-the-box support for spatial indexing and will most probably meet all my persistence requirements for this project.
Apparently MongoDB also comes with built-in support for geodata. In fact, even RDBMSs like MySQL and SQLite come with such capabilities.
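As a rough illustration of the Redis route (shown in Ruby to match the rest of this document; the same GEOADD/GEORADIUS commands are available from any Redis client, and the key and member names below are invented):

    require 'redis'

    redis = Redis.new

    # Index each resource by longitude, latitude and an identifier.
    redis.geoadd('resources', -0.1246, 51.5007, 'resource:42')
    redis.geoadd('resources', -0.1337, 51.5101, 'resource:43')

    # Everything within 5 km of the user's current position, nearest first.
    nearby = redis.georadius('resources', -0.1278, 51.5074, 5, 'km', sort: 'asc')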

Using machine learning to de-duplicate data

I have the following problem and was thinking I could use machine learning but I'm not completely certain it will work for my use case.
I have a data set of around a hundred million records containing customer data, including names, addresses, emails, phones, etc., and I would like to find a way to clean this customer data and identify possible duplicates in the data set.
Most of the data has been manually entered using an external system with no validation so a lot of our customers have ended up with more than one profile in our DB, sometimes with different data in each record.
For instance, we might have 5 different entries for a customer John Doe, each with different contact details.
We also have cases where multiple records that represent different customers match on key fields like email. For instance, when a customer doesn't have an email address but the data entry system requires one, our consultants will use a random email address, resulting in many different customer profiles sharing the same email address; the same applies to phones, addresses, etc.
All of our data is indexed in Elasticsearch and stored in a SQL Server database. My first thought was to use Mahout as the machine learning platform (since this is a Java shop) and maybe HBase to store our data (just because it fits with the Hadoop ecosystem; I'm not sure whether it will be of any real value), but the more I read about it the more confused I am about how it would work in my case. For starters, I'm not sure what kind of algorithm I could use, since I'm not sure where this problem falls: could I use a clustering algorithm or a classification algorithm? And of course certain rules will have to be defined as to what constitutes a profile's uniqueness, i.e. which fields.
The idea is to have this deployed initially as a Customer Profile de-duplicator service of sorts that our data entry systems can use to validate and detect possible duplicates when entering a new customer profile and in the future perhaps develop this into an analytics platform to gather insight about our customers.
Any feedback will be greatly appreciated :)
Thanks.
There has actually been a lot of research on this, and people have used many different kinds of machine learning algorithms for it. I've personally tried genetic programming, which worked reasonably well, but I still prefer to tune matching manually.
I have a few references to research papers on this subject. Stack Overflow doesn't want too many links, but here is bibliographic info that should be sufficient to find them via Google:
Unsupervised Learning of Link Discovery Configuration, Andriy Nikolov, Mathieu d’Aquin, Enrico Motta
A Machine Learning Approach for Instance Matching Based on Similarity Metrics, Shu Rong, Xing Niu, Evan Wei Xiang, Haofen Wang, Qiang Yang, and Yong Yu
Learning Blocking Schemes for Record Linkage, Matthew Michelson and Craig A. Knoblock
Learning Linkage Rules using Genetic Programming, Robert Isele and Christian Bizer
That's all research, though. If you're looking for a practical solution to your problem, I've built an open-source engine for this type of deduplication, called Duke. It indexes the data with Lucene and then searches for candidate matches before doing a more detailed comparison. It requires manual setup, although there is a script that can use genetic programming (see the paper above) to create a setup for you. There's also someone who wants to make an Elasticsearch plugin for Duke (see thread), but nothing's been done so far.
Anyway, that's the approach I'd take in your case.
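To make that block-then-compare idea concrete, here is a deliberately simplified Ruby sketch (the blocking key, the fields compared and the threshold are all invented for illustration; a real setup would use proper string-similarity measures and tuned weights):

    def blocking_key(rec)
      # cheap key so we only compare records that could plausibly match
      "#{rec[:last_name].to_s.downcase[0, 4]}|#{rec[:postcode].to_s[0, 3]}"
    end

    def similarity(a, b)
      fields = %i[first_name last_name email phone]
      matches = fields.count { |f| a[f] && a[f].to_s.downcase == b[f].to_s.downcase }
      matches.to_f / fields.size
    end

    def candidate_duplicates(records, threshold: 0.75)
      pairs = []
      records.group_by { |r| blocking_key(r) }.each_value do |block|
        block.combination(2) { |a, b| pairs << [a, b] if similarity(a, b) >= threshold }
      end
      pairs
    end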
I just came across a similar problem, so I did a bit of Googling and found a library called "Dedupe Python Library":
https://dedupe.io/developers/library/en/latest/
The documentation for this library covers common problems and solutions when de-duplicating entries, as well as papers in the de-duplication field. So even if you don't end up using it, the documentation is still worth reading.

Storing Facebook friend list in Core Data?

I am developing an app that uses a user's Facebook friends for some sort of interaction.
I am using Core Data to store some user data, and I am not sure whether I should store the user's friends in the database as well, for caching.
It's a speed-versus-storage situation: storing the list costs O(n) space, versus spending connection time fetching the friends list on every run and then manipulating it as I need to.
Of course there has to be a handler to check whether the friend list has grown or shrunk, but let's assume that validation happens lazily and in the background while the application loads.
Any thoughts? Would it be wise to save it to the Core Data database, or should I just fetch it and re-populate the database every time the application runs?
Your question asks for thoughts on what is "wise" in this situation. Actually, my answer is the same for every situation.
Write code that is simple for humans to understand.
Then do lots of performance analysis to determine where you may need to focus on performance. Fortunately, Xcode ships with a pretty nice tool for that purpose (Instruments).
So, IMO, it would be wise to implement it in the way that is easiest and most straightforward. Then run performance analysis, and address the needs that the performance tools tell you need to be addressed.

How to Design Eventing System in Rails 3.1

I'm building something like Facebook's Wall in Rails. It will look something like this:
Stacey S. Wants to be Friends
You've been invited to the Summer Social
Pat Replied to your Message: Hey!!!
American Pet Society has a new Post: Love Your Cat!
There are two ways to do this. I could have each of these different events write to an events table when they are created, or I could pull from the relationships, invitations, inbox and posts tables and create the events on the fly.
I'm leaning towards the events-table approach because it seems cleaner to query that one table than to hit all the other tables and then sort the results correctly. Is this how you would do it?
I'm building a system with similar requirements now, and I think you'll find that the performance characteristics of the latter approach make it extremely untenable. Depending on how much usage you intend to get out of the system, you may find the events table to be a performance hog during the request cycle as well. What I'm doing is using an architecture that's basically CQS with event sourcing, which builds the feeds for a given user in the background and caches them in a thoroughly denormalized fashion to make the request cycle very short.
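If you do go with the events table, a bare-bones sketch looks something like this (the model, the recipient/verb columns and the sample calls are assumptions for illustration, not from the question):

    class Event < ActiveRecord::Base
      belongs_to :recipient, class_name: 'User'
      belongs_to :subject, polymorphic: true   # friend request, invitation, message, post...
    end

    # Whenever something feed-worthy happens, write one row:
    Event.create!(recipient_id: user.id, subject: friend_request, verb: 'friend_request')

    # The wall is then a single ordered query against one table:
    Event.where(recipient_id: user.id).order('created_at DESC').limit(20)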
Another approach you should look at is using Chronologic: https://github.com/gowalla/chronologic. It may save you quite a bit of effort.
By all means, it will save you from a lot of complicated queries and sorting. Go for the event table approach.

Opinion Mining - What Database Type?

I am starting a project on Opinion Mining (Data Mining -> Web Mining -> Opinion Mining) to determine the semantic orientation of the words contained. We will use a crawler to fetch the pages containing the opinions. Now the question is: what type of database (object-oriented, relational, hierarchical, etc.) is best to use for this type of project?
I know this is a specific question; I'm not expecting everybody to respond, but hearing from someone who has already done it would help.
Regards!
If you need something large-scale and responsive, you would probably need to go for Google's BigTable or something of that nature. At the prototype level, I am sure you can use traditional relational databases, but at a certain point you'd hit the performance wall. See Brewer's CAP theorem.
In my experience with this kind of scenario, a relational database can serve your purpose pretty well. You need to be extra careful with the web content part of it - whether you want to store it in the database at all, or whether something as simple as the file system will do. BLOBs especially require extra care and increase your maintenance work.
Also, given the nature of the project, you will certainly be using a lot of already-built components, many of which already support (or can easily be extended to use) a relational DB as a data store.
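As a small illustration of that split - metadata in a relational table, page content on the filesystem rather than in BLOBs - here is a rough Ruby sketch (the table name, columns and paths are all invented):

    require 'sqlite3'
    require 'digest'
    require 'fileutils'
    require 'time'

    db = SQLite3::Database.new('opinions.db')
    db.execute <<~SQL
      CREATE TABLE IF NOT EXISTS pages (
        id INTEGER PRIMARY KEY,
        url TEXT UNIQUE,
        fetched_at TEXT,
        body_path TEXT   -- page content lives on disk, not as a BLOB
      )
    SQL

    def store_page(db, url, html)
      FileUtils.mkdir_p('pages')
      path = File.join('pages', Digest::SHA1.hexdigest(url))
      File.write(path, html)
      db.execute('INSERT OR REPLACE INTO pages (url, fetched_at, body_path) VALUES (?, ?, ?)',
                 [url, Time.now.utc.iso8601, path])
    end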
