Google geocoding API Inner Workings - machine-learning

I'm currently working with some large datasets that include location-based information but lack the direct latitude and longitude measurements I need to create visualizations.
To resolve this, I've been using geocoding APIs that take addresses or address-like information as input and return latitude and longitude as output.
I started with the Nominatim API. Unfortunately, due to the nature of the address-like data I have, many of my queries failed, so I switched to the Google geocoding API. The Google API gives me a significantly higher success rate, but it is a paid API, which is not ideal.
I realize that, given the incredible resources Google has, it would be virtually impossible to build a system that rivals their geocoding API within a reasonable amount of time, but it has made me wonder what's going on under the hood.
Is a BERT-like translational system at work? What happens to the text after it's sent off?

I'm using n-grams for a similar use case, by building an index and an inverted index. See the Python ngram package:
import csv
import ngram

...  # setup elided: `ind` and `inv` are dicts built per country,
     # `filename`/`stream` refer to the CSV file currently being read
country = filename.replace('.csv', '')
ind[country] = ngram.NGram()
inv[country] = {}
s_csv = csv.reader(stream, delimiter=';')
next(s_csv)  # skip the header row
for row in s_csv:
    coord = tuple(map(float, row[0:2]))   # (latitude, longitude)
    ad = ' '.join(row[2:]).lower()        # normalized address string
    ind[country].add(ad)
    inv[country][ad] = (coord, ad)
Then you can use the find (or search) function to match a query address against the index.
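For example, here is a minimal lookup sketch, assuming the index built above and a hypothetical country key 'fr'; NGram.search returns (match, similarity) pairs sorted by decreasing similarity:

query = '12 rue de la paix paris'.lower()     # hypothetical noisy address
matches = ind['fr'].search(query, threshold=0.3)
if matches:
    best_address, similarity = matches[0]     # best fuzzy match in the index
    coord, _ = inv['fr'][best_address]        # recover (lat, lon) via the inverted index
    print(coord, best_address, similarity)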
Take care with memory consumption: about 16 GB of RAM for a country like France with OSM data.
To see an implementation of this approach, check the OpenGeoCode HTTP API Service source code.

Related

'GetPositionLowerLimits' and 'GetPositionUpperLimits' do not support get by ModelInstance as other functions do

What I find in the Drake API is that GetPositionLowerLimits and GetPositionUpperLimits do not support querying by ModelInstance the way other functions do.
Does anybody know how to query this when I have multiple robots and I am interested in the lower and upper limits for every robot?
I believe you can take the result of GetPositionLowerLimits (or GetPositionUpperLimits) and feed it through GetPositionsFromArray to select just one model instance at a time.
The model instances documentation has some more information.
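As a rough pydrake sketch of that suggestion (assuming a finalized MultibodyPlant called plant; "robot_a" is a placeholder for whatever your model instances are named):

def position_limits_for(plant, model_name):
    # Slice the whole-plant limit vectors down to one model instance.
    instance = plant.GetModelInstanceByName(model_name)
    lower = plant.GetPositionsFromArray(instance, plant.GetPositionLowerLimits())
    upper = plant.GetPositionsFromArray(instance, plant.GetPositionUpperLimits())
    return lower, upper

lower, upper = position_limits_for(plant, "robot_a")   # hypothetical model name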

Get location data when out of service

I'm working on an iPhone app that stores location data from a user. However, sometimes the user doesn't have service.
Is there an API that estimates location data for the gap once the phone gets back into service? Or any other suggestions?
No, there is no such API, because it would produce incorrect locations.
You have to write such a method yourself, one that hopefully works within the scope of your application's demands.
For example, you could do a linear interpolation when the GPS service has an outage of a few seconds.
A linear interpolation of latitude and longitude values works without special geographic calculations.
It just would not work if you cross the antimeridian (where longitude jumps from 180°E to 180°W), and maybe not if you cross the poles, but in practice neither situation will happen.
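As a minimal sketch of that idea (plain Python rather than iOS code; t0 and t1 are the timestamps of the last fix before and the first fix after the outage):

def interpolate_fix(t, t0, lat0, lon0, t1, lat1, lon1):
    # Estimate (lat, lon) at time t, with t0 <= t <= t1, by linear interpolation
    # between the last known fix and the first fix after the outage.
    frac = (t - t0) / (t1 - t0)
    return (lat0 + frac * (lat1 - lat0),
            lon0 + frac * (lon1 - lon0))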

How to get nearby city or state name of a geopoint in water in ios?

I am developing a location-based application in which I need to get the name of a location near any geopoint selected by the user. I'm using the Google Places API, which is working fine for me.
The only problem is that the service returns null for geopoints in water. Is there any way I can retrieve nearby locations for a geopoint in water or the ocean?
AFAIK the API has no way to do that.
So you've got two options, in order of the effort they take:
When the user taps water, just show a dialog saying "Please select a point on land". Next to no effort, and it will slightly annoy the user.
Try to find the closest land geopoint yourself and run the API request on it instead of the original point. Below are some ideas on that.
A good approach can be based on this answer: basically you can get a KML file with water polygons. For performance reasons, you can simplify the polygons to the extent that makes sense for your zoom levels. Now if your point is inside one of those polygons, it's sea. You can then iterate over all polygon edges and pick the one closest to your point, pick the point on that edge closest to your point, and take one little epsilon-sized step towards the outside of the polygon to get a land point you can run a geocode request on. The original author also suggests you can use the Haversine formula to determine the nearest land point; I'm not really familiar with that application of it.
The downside is that you have to deal with KML, iterate over a lot of polygons, and simplify them (losing precision in the process, in addition to possible differences between the marineregions.org data and the Google Places data).
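Here is a rough planar sketch of the closest-edge idea above, assuming the water polygons have already been loaded from the KML as lists of (lat, lon) vertices; it ignores the antimeridian and the poles:

def closest_point_on_segment(p, a, b):
    # Project point p onto segment a-b and clamp to the segment ends.
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return a
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)

def nearest_land_point(tap, water_polygon, eps=1e-4):
    # Find the boundary point closest to the tapped point (assumed to lie inside
    # the water polygon), then step a tiny bit away from the tap so the result
    # lands just outside the polygon, i.e. on land.
    best, best_d2 = None, float('inf')
    n = len(water_polygon)
    for i in range(n):
        q = closest_point_on_segment(tap, water_polygon[i], water_polygon[(i + 1) % n])
        d2 = (q[0] - tap[0]) ** 2 + (q[1] - tap[1]) ** 2
        if d2 < best_d2:
            best, best_d2 = q, d2
    d = best_d2 ** 0.5
    if d == 0:
        return best                      # the tap was exactly on the boundary
    return (best[0] + eps * (best[0] - tap[0]) / d,
            best[1] + eps * (best[1] - tap[1]) / d)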
Another trick you could try is running a Sobel filter (edge detection) on the visible map fragment to determine where the coastline is (although you will get some false positives), then tracing it (raster to vector) to get points and edges you can use to calculate the closest land position, in a manner similar to the former approach.
For Sobel edge detection, consider the GPUImage lib: it has the filter implemented, and it will probably run very fast since the lib does all its calculations on the GPU.
Update: it turns out there's also a service called Koordinates that has coastline data available; check the answer here.

How does "DHT search engine" work?

I'm interested in Btdigg.org, which is called a "DHT search engine". According to this article, it doesn't store any content and even has no database. Then how does it work? Doesn't it need to gather meta info and store it in a database like other normal search engines? After a user submits a query, does it scan the DHT network and return the results in "real time"? Is this even possible?
I don't have specific insight into BTDigg, but I believe the claim that there is no database (or something that acts like one) is false. The author of that article might have been referring to something more specific that you would encounter on a traditional torrent site, where actual .torrent files are stored, for instance.
This is how a BTDigg-like site works:
Run a bunch of DHT nodes, specifically for the purpose of "eavesdropping" on DHT traffic, to be introduced to info-hashes that people talk about.
Join those swarms and download the metadata (the .torrent file) using the ut_metadata extension.
Index the information you find there and map it to the info-hash (a sketch of such an index follows below).
Provide a front-end for that index.
If you want to luxury it up a bit, you can also periodically scrape the info-hashes you know about to gather stats over time and figure out when swarms die out and should be removed from the index.
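A toy sketch of the indexing step (assumed data shapes only, not BTDigg's actual schema): map each keyword found in a torrent's name and file list to the set of info-hashes containing it.

from collections import defaultdict

keyword_index = defaultdict(set)   # keyword -> set of info-hashes
metadata = {}                      # info-hash -> {"name": ..., "files": [...]}

def add_torrent(info_hash, name, files):
    # Index one torrent's metadata (fetched via ut_metadata) under every keyword.
    metadata[info_hash] = {"name": name, "files": files}
    text = " ".join([name] + files).lower()
    for keyword in set(text.replace(".", " ").replace("_", " ").split()):
        keyword_index[keyword].add(info_hash)

def search(query):
    # Return the info-hashes whose metadata contains every keyword in the query.
    keywords = query.lower().split()
    results = keyword_index[keywords[0]].copy() if keywords else set()
    for kw in keywords[1:]:
        results &= keyword_index[kw]
    return results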
So the claim that it stores neither .torrent files nor any content is true.
It is not realistic to search the DHT in real time, because the DHT is not organized around keyword searches; you need to build and maintain the index continuously, "in the background".
EDIT:
Since this answer was written, an optimization (BEP 51) has been implemented in some DHT clients that lets you query which info-hashes they are hosting, significantly reducing the cost of indexing.
For a deep understanding of the DHT and its applications, see Scott Wolchok's paper and presentation "Crawling BitTorrent DHTs for Fun and Profit". He presents the autonomous search engine idea as a side note to his study of DHT security:
PDF of his paper:
https://www.usenix.org/legacy/event/woot10/tech/full_papers/Wolchok.pdf
His presentation at DEFCON 18 (parts 1 & 2)
http://www.youtube.com/watch?v=v4Q_F4XmNEc
http://www.youtube.com/watch?v=mO3DfLtKPGs
The method used in Section 3 seems to require a database to store all the torrent data. While its performance is better, it may not be a true DHT search engine.
The approach in Section 8, while less efficient, does seem to be a DHT search engine, as long as the keywords are the stored values.
From Section 3, Bootstrapping BitTorrent Search:
"The system handles user queries by treating the concatenation of each torrent's filenames and description as a document in the typical information retrieval model and using an inverted index to match keywords to torrents. This has the advantage of being well supported by popular open-source relational DBMSs. We rank the search results according to the popularity of the torrent, which we can infer from the number of peers listed in the DHT."
From Section 8, Related Work:
"The usual approach to distributing search using a DHT is with an inverted index, by storing each (keyword, list of matching documents) pair as a key-value pair in the DHT. Joung et al. [17] describe this approach and point out its performance problems: the Zipf distribution of keywords among files results in very skewed load balance, document information is replicated once for each keyword in the document, and it is difficult to rank documents in a distributed environment."
It is divided into two steps.
Step one: use the BEP 5 (DHT) protocol to collect info-hashes. You do not need to implement the whole protocol; only find_node (request), get_peers (response), and announce_peer (response) are required. Here is one of my open-source projects, dhtspider.
Step two: use the BEP 9 protocol to fetch the metainfo for each info-hash and index it. Here is my own BitTorrent search engine; it collects more than 3 million unique info-hashes and more than 500,000 valid metainfo records per day.
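For illustration, here is a self-contained sketch of what a BEP 5 get_peers query looks like on the wire (the minimal bencoder is included only to keep the example runnable; the node ID and info-hash are random placeholders):

import os

def bencode(x):
    # Encode ints, byte strings, lists and dicts in bencode format.
    if isinstance(x, int):
        return b"i%de" % x
    if isinstance(x, bytes):
        return b"%d:%s" % (len(x), x)
    if isinstance(x, list):
        return b"l" + b"".join(bencode(i) for i in x) + b"e"
    if isinstance(x, dict):
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in sorted(x.items())) + b"e"
    raise TypeError(type(x))

node_id = os.urandom(20)      # this crawler's DHT node ID
info_hash = os.urandom(20)    # placeholder for an info-hash heard on the network

get_peers_query = bencode({
    b"t": b"aa",              # transaction ID
    b"y": b"q",               # message type: query
    b"q": b"get_peers",
    b"a": {b"id": node_id, b"info_hash": info_hash},
})
# get_peers_query would then be sent over UDP to a known DHT node.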

Blackberry cache reverse geocode address info with proximity

Most people are limited to about 5 or 6 locations on a daily basis (work, home, school, store, etc.). I want to speed up address display by caching a few of these most-visited locations. I've been able to get the address info using both Google Maps (GPS and JSON) and Locator.reverseGeocode. What would be the best way to cache this information and to check proximity quickly? I found this GPS distance calculation example and have it working. Is there a faster way to check for proximity?
Please see this similar question first: Optimization of a distance calculation function
There are several things we can change in the distance calculation to improve performance (see the sketch after these notes):
Measure the device speed and decrease or increase the period of the proximity test accordingly.
Trigonometric calculations take most of the time, but they can be done much faster. First make a rough distance calculation using a lookup-table method; then, if the distance is less than the proximity limit plus an uncertainty margin, use the CORDIC method for a more precise calculation.
Use constants for Math.PI/180.0 and 180.0/Math.PI.
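Here is a small sketch of the caching idea in Python (the BlackBerry code would be Java, but the logic carries over): keep the handful of most-visited locations with their reverse-geocoded addresses and use a cheap equirectangular distance for the proximity test, with the degree-to-radian constant precomputed as suggested above.

import math

DEG_TO_RAD = math.pi / 180.0      # precomputed constant, as suggested above
EARTH_RADIUS_M = 6371000.0

class AddressCache:
    def __init__(self, proximity_m=150.0):
        self.proximity_m = proximity_m
        self.entries = []          # (lat_rad, lon_rad, cos_lat, address) tuples

    def add(self, lat, lon, address):
        # Cache one reverse-geocoded location (lat/lon in degrees).
        lat_r, lon_r = lat * DEG_TO_RAD, lon * DEG_TO_RAD
        self.entries.append((lat_r, lon_r, math.cos(lat_r), address))

    def lookup(self, lat, lon):
        # Return a cached address within proximity_m of (lat, lon), else None.
        lat_r, lon_r = lat * DEG_TO_RAD, lon * DEG_TO_RAD
        for e_lat, e_lon, e_cos, address in self.entries:
            x = (lon_r - e_lon) * e_cos        # equirectangular approximation
            y = lat_r - e_lat
            if math.hypot(x, y) * EARTH_RADIUS_M <= self.proximity_m:
                return address
        return None                            # cache miss: call reverseGeocode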
Several links that may be helpful:
Very useful explanations of CORDIC, especially the "for dummies" doc from Parallax
Fast transcendent / trigonometric functions for Java
Cordic.java at Trac by Thomas B. Preusser
Cordic.java at seng440 proj
Sin/Cos look-up table source at processing.org by toxi
