I'm looking for a good dataset of business locations, hopefully for all of the USA. I'd love to have "name," "business type," and "lat/long," although I'd settle for "street address" rather than "lat/long," and I could geocode the points myself.
Are there any free or relatively cheap data sources for business locations? Can I get this information from Google?
You might want to look at the Yelp academic dataset. It seems to be the closest to what you are looking for. You'll need to be associated with an academic institution to qualify for access, though. Check out this link: http://www.yelp.com/academic_dataset
I am looking for competent people to create a platform for scientific studies and globalized surveys. I would like each human being to have an identification number (identity verification is important; otherwise the study will not be valid), and with this system each person will be able to give their opinion on different issues. The objective is therefore large-scale: the system will have to perform well anywhere, with many people connected at the same time.
The data war has already begun. My goal is to counterbalance that power by giving it to humanity. The platform will therefore be disruptive: many hacking attacks are to be expected, and its precious data, which we must protect, will be coveted.
It is a project of the heart; the dimension is not pecuniary, even if we will earn money, but humanistic. Let's stop the mass manipulation, let's put people with real practical solutions in the spotlight, and let's be aware of the real percentages behind an idea.
I'm going on a world tour in a few days, and I'd like to take the opportunity to promote the project around the globe, but for that I need the interface to be functional.
I hope you like the idea. I look forward to talking to you about it and finding out how and how quickly it can be done.
Sincerely, Louise
I have the following problem and was thinking I could use machine learning, but I'm not completely certain it will work for my use case.
I have a data set of around a hundred million records containing customer data, including names, addresses, emails, phones, etc., and I would like to find a way to clean this customer data and identify possible duplicates in the data set.
Most of the data has been manually entered using an external system with no validation, so a lot of our customers have ended up with more than one profile in our DB, sometimes with different data in each record.
For instance, we might have five different entries for a customer John Doe, each with different contact details.
We also have the case where multiple records that represent different customers match on key fields like email. For instance, when a customer doesn't have an email address but the data entry system requires one, our consultants will use a random email address, resulting in many different customer profiles using the same email address; the same applies to phones, addresses, etc.
All of our data is indexed in Elasticsearch and stored in a SQL Server database. My first thought was to use Mahout as the machine learning platform (since this is a Java shop), and maybe HBase to store our data (just because it fits the Hadoop ecosystem; I'm not sure it would be of any real value). But the more I read about it, the more confused I am about how it would work in my case. For starters, I'm not sure what kind of algorithm I could use, since I'm not sure what class of problem this falls into: could I use a clustering algorithm or a classification algorithm? And of course, certain rules will have to define what constitutes a profile's uniqueness, i.e. which fields.
The idea is to deploy this initially as a customer-profile de-duplication service of sorts that our data entry systems can use to validate and detect possible duplicates when entering a new customer profile, and in the future perhaps develop it into an analytics platform to gather insight about our customers.
Any feedback will be greatly appreciated :)
Thanks.
There has actually been a lot of research on this, and people have used many different kinds of machine learning algorithms for it. I've personally tried genetic programming, which worked reasonably well, but I still prefer to tune matching manually.
I have a few references for research papers on this subject. StackOverflow doesn't want too many links, but here is bibliographic info that should be sufficient to find them via Google:
Unsupervised Learning of Link Discovery Configuration, Andriy Nikolov, Mathieu d’Aquin, Enrico Motta
A Machine Learning Approach for Instance Matching Based on Similarity Metrics, Shu Rong, Xing Niu, Evan Wei Xiang, Haofen Wang, Qiang Yang, and Yong Yu
Learning Blocking Schemes for Record Linkage, Matthew Michelson and Craig A. Knoblock
Learning Linkage Rules using Genetic Programming, Robert Isele and Christian Bizer
That's all research, though. If you're looking for a practical solution to your problem I've built an open-source engine for this type of deduplication, called Duke. It indexes the data with Lucene, and then searches for matches before doing more detailed comparison. It requires manual setup, although there is a script that can use genetic programming (see link above) to create a setup for you. There's also a guy who wants to make an ElasticSearch plugin for Duke (see thread), but nothing's done so far.
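If you want to see the shape of that approach in code, here's a minimal, hypothetical sketch in Python (not Duke's actual API): cheap blocking to generate candidate pairs, then a more detailed per-field comparison. The field names and threshold are made up for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def normalize_phone(phone):
    """Keep digits only, so formatting differences don't prevent a match."""
    return "".join(ch for ch in phone if ch.isdigit())

def blocking_key(record):
    """Cheap key: first 4 letters of surname + zip code. Records that
    don't share a block are never compared, which keeps this tractable."""
    return (record["last_name"][:4].lower(), record["zip"])

def similarity(a, b):
    """Detailed comparison: average of per-field string similarities,
    with an exact (normalized) phone match counted as a full point."""
    fields = ["first_name", "last_name", "street"]
    scores = [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
              for f in fields]
    if normalize_phone(a["phone"]) == normalize_phone(b["phone"]):
        scores.append(1.0)
    return sum(scores) / len(scores)

def find_duplicates(records, threshold=0.85):
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    pairs = []
    for block in blocks.values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                if similarity(block[i], block[j]) >= threshold:
                    pairs.append((block[i], block[j]))
    return pairs
```

The blocking key and the per-field comparators are exactly the parts you end up tuning by hand (or learning, e.g. with genetic programming).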
Anyway, that's the approach I'd take in your case.
I just came across a similar problem, so I did a bit of Googling and found a library called "Dedupe Python Library":
https://dedupe.io/developers/library/en/latest/
The documentation for this library covers common problems and solutions when de-duplicating entries, as well as papers in the de-duplication field. So even if you end up not using it, the documentation is still worth reading.
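For a sense of what using it looks like, here is a rough sketch following the workflow in the library's documentation (field names are illustrative, and method names vary somewhat between versions):

```python
import dedupe

# Records keyed by id; the fields here are illustrative, not a real schema.
data = {
    1: {"name": "John Doe", "address": "123 Main St",
        "email": "jdoe@example.com", "phone": "555-0100"},
    2: {"name": "Jon Doe", "address": "123 Main Street",
        "email": "jdoe@example.com", "phone": "5550100"},
    # ... in practice, loaded from your SQL Server / Elasticsearch store
}

# Tell dedupe which fields to compare and how.
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
    {"field": "email", "type": "Exact"},
    {"field": "phone", "type": "Exact"},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)
dedupe.console_label(deduper)  # interactively label pairs as duplicate / distinct
deduper.train()

# Each cluster is a tuple of (record ids, per-record confidence scores).
for record_ids, scores in deduper.partition(data, threshold=0.5):
    print(record_ids, scores)
```

The interactive labeling step is the interesting part: instead of hand-writing matching rules, you answer a few dozen "are these the same customer?" questions and it learns the weights.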
Sorry for the fairly open question, but I was wondering whether anyone has any advice on the best way to create an app that searches for properties within a particular radius.
The best example of what I am looking to achieve is RightMove.
I was wondering what the best setup would be for adding city, town and postcode data and making it searchable.
I have been reading about Geocoder but was wondering whether this would be the best option for such an app, or whether there are good alternatives. For example, would you recommend storing all the location data in my own database, or using an API to feed in this information?
Any advice or links people can offer really would be appreciated! Thanks.
Which approach is best depends purely on your requirements and on the availability of geocoded data for the locations you care about.
Using Geocoder gives you the advantage that you don't have to bother with updating your own geo-database for a given location. It has its downsides (request timeouts, data not available for a particular location, licensing, query limits, etc.), but they can be addressed.
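As an illustration of the API-based approach (sketched here in Python with the geopy library rather than the Rails Geocoder gem, but the workflow is the same: hand an address to an external service, get coordinates back):

```python
from geopy.geocoders import Nominatim

# Nominatim's usage policy requires a user_agent; the value is illustrative.
geolocator = Nominatim(user_agent="property-search-demo")

location = geolocator.geocode("10 Downing Street, London")
if location is not None:  # the service may not know the address
    print(location.latitude, location.longitude)
```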
If you are okay with storing the data in your own DB, then you can achieve the same thing with a PostgreSQL + PostGIS setup. The PostGIS module gives you the ability to do spatial querying: searching by radius, checking whether a given geo-point falls within a pre-defined polygon, etc., and since these queries are executed inside the DB, performance is very good. This approach has two advantages: you don't have to sign up for any service, and there are no timeout errors. The downside is that you have to maintain/update the location data yourself.
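A minimal sketch of what the radius query could look like, assuming a hypothetical properties table with a geography(Point, 4326) column (shown here from Python with psycopg2):

```python
import psycopg2

# ST_DWithin on the geography type takes a distance in meters and can use
# a spatial (GiST) index, which is what keeps radius searches fast.
RADIUS_QUERY = """
    SELECT id, name
    FROM properties
    WHERE ST_DWithin(
        geom,
        ST_SetSRID(ST_MakePoint(%(lng)s, %(lat)s), 4326)::geography,
        %(radius_m)s
    );
"""

def properties_within_radius(conn, lat, lng, radius_miles):
    with conn.cursor() as cur:
        cur.execute(RADIUS_QUERY,
                    {"lat": lat, "lng": lng,
                     "radius_m": radius_miles * 1609.34})
        return cur.fetchall()

conn = psycopg2.connect("dbname=estates")  # hypothetical connection string
print(properties_within_radius(conn, 51.5074, -0.1278, radius_miles=10))
```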
I have done a handful of RoR projects with the second approach, and it's working quite well for us.
Hope this helps.
Good Afternoon,
I'm currently planning a web-app/service project with a geolocation-enabled user model (lat/lng etc.), and I was wondering what the best approach would be to find the n biggest 'hot spots', i.e. geolocations with a given radius (e.g. 10 miles) where the most users are located.
Does anyone know a good, practical clustering algorithm or other (existing) solutions? This is a pretty bird's-eye-view kind of question, I know... but backend-technology-wise I'm still open to anything, as this particular feature is obviously only one part of the whole feature set, but it might help in deciding on a particular set of tools/languages/environments.
Cheers & thanks,
-J
SQL Server's spatial data types would be worth a look. They allow you to index on a geography column and run distance queries. I'm not sure how easy it would be to group by radius, but at least having the geography data type and building indexes on it should help a lot with this type of problem.
Geography Methods Supported by Spatial Indexes
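As a rough sketch (hypothetical users table, queried here from Python with pyodbc), this counts users within a 10-mile radius of a candidate point:

```python
import pyodbc

# Hypothetical table: users(id INT, location GEOGRAPHY), with a spatial
# index on the location column. STDistance on the geography type returns
# meters, so the radius is converted from miles first.
HOTSPOT_QUERY = """
    SELECT COUNT(*) AS users_nearby
    FROM users
    WHERE location.STDistance(geography::Point(?, ?, 4326)) <= ?
"""

def users_within(conn, lat, lng, radius_miles=10):
    cursor = conn.cursor()
    cursor.execute(HOTSPOT_QUERY, lat, lng, radius_miles * 1609.34)
    return cursor.fetchone().users_nearby
```

Finding the top-n hot spots would then mean evaluating this count over a set of candidate centers (e.g. a coarse grid, or the user locations themselves) and keeping the n largest.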
I've been hunting through previous questions on geocoding, and while many are helpful, I'm not finding one that fits my needs.
I need to group multiple addresses to the nearest city centers. My only address information is city, country, and state (if applicable). For example, all addresses in San Francisco and within a given number of miles should be listed as San Francisco. I'll also need to know the count of addresses rolled up to San Francisco.
I'm open to suggestions on how to approach this. I don't particularly want to manually identify a list of major cities if possible. Is there a list of these I can start from?
What about using the average lat/long of all addresses within a given number of miles? Granted, the final 'center point' would move around a bit as the average is computed, but perhaps that's an acceptable approximation. I'm not quite sure how to do this, so again, input is appreciated!
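Something like this greedy running-average grouping is what I have in mind (a rough sketch; the radius is a placeholder):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance between two lat/lng points, in miles."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 3959 * 2 * asin(sqrt(h))

def group_addresses(points, radius_miles):
    """Greedy roll-up: assign each point to the first cluster whose running
    average center is within the radius, updating that average as we go."""
    clusters = []  # each: {"lat": ..., "lng": ..., "count": ...}
    for lat, lng in points:
        for c in clusters:
            if haversine_miles(lat, lng, c["lat"], c["lng"]) <= radius_miles:
                n = c["count"]  # update the running average center
                c["lat"] = (c["lat"] * n + lat) / (n + 1)
                c["lng"] = (c["lng"] * n + lng) / (n + 1)
                c["count"] = n + 1
                break
        else:
            clusters.append({"lat": lat, "lng": lng, "count": 1})
    return clusters
```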
Great question. I think more generally what you want is some standard way of rolling up cities into metropolitan areas and you're exactly right that you don't want to create or maintain a list of your own.
Yahoo! GeoPlanet provides a geographic ontology with a pretty thorough hierarchy. If you were happy with standard administrative divisions (like county or state), it would be easy, but I think you're looking for something a little more general than that. GeoPlanet also provides zones, which in the US often include the town's Metropolitan Statistical Area (MSA).
If you have each city name, you could use GeoPlanet to find any MSA zones that the city belongs to and roll up to that (and GeoPlanet provides a bounding box and centroid for each MSA, so you can easily place it on a map). For rural towns that aren't part of a US Census Bureau MSA, you may not need to group them to the nearest city (which may be far away anyway).
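Once you have the metro centroids (from GeoPlanet or any similar source), the roll-up itself is just a nearest-centroid assignment. A minimal sketch with hypothetical data:

```python
from collections import Counter
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance between two lat/lng points, in miles."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 3959 * 2 * asin(sqrt(h))

# Hypothetical centroids; in practice these would come from the MSA
# records (GeoPlanet gives you a centroid for each zone).
METRO_CENTROIDS = {
    "San Francisco": (37.7749, -122.4194),
    "Los Angeles": (34.0522, -118.2437),
}

def nearest_metro(lat, lng):
    return min(METRO_CENTROIDS,
               key=lambda name: haversine_miles(lat, lng, *METRO_CENTROIDS[name]))

# Roll up already-geocoded addresses and count them per metro.
addresses = [(37.8044, -122.2712), (34.1478, -118.1445)]  # Oakland, Pasadena
counts = Counter(nearest_metro(lat, lng) for lat, lng in addresses)
print(counts)  # Counter({'San Francisco': 1, 'Los Angeles': 1})
```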