Where does raw geoip data come from?

This question is a general version of a more specific question asked here. However, those answers were unusable.
Question: What is the raw source for geoIP data?
Many websites will tell me where my IP is, but they all appear to be using databases from fewer than 5 companies (most are using a database from MaxMind). These companies offer limited free versions of their databases, but I'm trying to determine what they use as their source data.
I've tried using Linux/Unix commands such as ping, traceroute, dig, whois, etc., but they don't provide predictably accurate information.

Preamble: I believe this is actually a very valid question for the SO website, as understanding how such things work is important to understanding how such datasets can be used in software. However, the answer to this question is rather complex and full of historical remarks.
First, it is worth mentioning that there is NO unified raw geoip data. Such a thing simply does not exist. Second, the data comes from multiple sources and is often unreliable and/or outdated.
To understand how that came to be, one needs to know how the Internet came into existence and spread around the world. A short summary is below:
IANA is a global non-profit organization which manages the assignment of IP blocks to regional organizations: https://www.iana.org/numbers. This happens upon request, and the regional organization requests a specific block size.
Regional organizations may assign those IP blocks either to ISPs directly or to country-level sub-organizations (which then assign them to ISPs).
ISPs assign IP addresses to local branches, etc.
From the above you can easily see that:
There is no single body responsible for assigning IP blocks to a particular location.
Decisions on how (and whether) to release information about which IP belongs to which location are not made uniformly; instead, each organization decides for itself how, and whether, to release that information.
All of the above creates a whole lot of mess. It takes a lot of dedication and a long time to obtain, aggregate, and sort this data, which is why the most up-to-date and detailed geoip datasets are a commercial commodity.
Whoever takes on the challenge of building their own dataset should be able to obtain this information directly from the end users of the blocks (the ISPs), because higher-level organizations do not know which location each IP address will be assigned to. Higher-level organizations only distribute IP blocks among applicants (and keep some in reserve for faster processing); it is the lowest-level organizations who decide which location gets which IP address, and they are not obligated to release this information publicly.
Update:
To start building your own dataset, you can begin with this list of blocks and how they are assigned.
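Another publicly available starting point is the regional registries' daily "delegated" statistics files (e.g. delegated-ripencc-extended-latest), which are pipe-delimited and map each allocated block to a registration country. Below is a minimal Python sketch of turning such a file into (network, country) pairs, assuming that standard format; note this only gives the country of registration, not necessarily where the addresses are actually used.

```python
import ipaddress

def parse_delegated(path):
    """Parse an RIR 'delegated' statistics file (pipe-delimited:
    registry|cc|type|start|value|date|status|...) into (network, country) pairs."""
    blocks = []
    with open(path) as f:
        for line in f:
            if line.startswith("#") or "|summary" in line:
                continue  # skip comment and summary lines
            fields = line.strip().split("|")
            if len(fields) < 7 or fields[2] != "ipv4":
                continue  # also skips the version header line
            registry, cc, _, start, value, _, status = fields[:7]
            if status not in ("allocated", "assigned"):
                continue
            # 'value' is the number of addresses in the block
            last = ipaddress.IPv4Address(int(ipaddress.IPv4Address(start)) + int(value) - 1)
            for net in ipaddress.summarize_address_range(ipaddress.IPv4Address(start), last):
                blocks.append((net, cc))
    return blocks

# Example: blocks = parse_delegated("delegated-ripencc-extended-latest")
```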

Related

Is Hyperledger Fabric considered a centralized blockchain?

Hyperledger has some classic/old-world mechanisms that bring up the question: is it really decentralized?
Having a REST server to communicate with the blockchain resembles the cloud model of behaviour.
Even though the Hyperledger ledger is distributed, someone calling a REST API may be written to the server logs along with data such as IP address, geo info, and more.
So, is Hyperledger Fabric considered a centralized blockchain, or maybe a decentralized one?
Thanks
If you mean decentralised as in not controlled by any one entity, and anonymous, then no, it is not that.
Fabric is the backbone of blockchain applications, and certain things are plug-and-play. It is not meant to be anonymous; it is meant to be controlled, and only known parties have access to anything.
It's simply an ecosystem which allows anyone to build blockchain-based applications. Imagine a system where your bank holds your money and you want to pay someone. The bank needs to make sure that you are who you say you are and no one else can authorise payments from your account. That's what the permissions mean. It's not meant to bypass the man in the middle like Bitcoin or other cryptocurrencies are. Those, however, are just one implementation of a blockchain system; they are not the only way you can use such a system.
The immutability of the ledger offers certain advantages. Imagine an audit system where every action is recorded and can't be changed. If your audit records are in a SQL database, for example, anyone with access can go in and change or delete that data. What goes on the ledger stays there forever and can't be changed. This doesn't mean that your asset data cannot change. That is a fundamental thing to understand. Underlying data can be changed via a new transaction against the same asset, but the history of the asset is clearly visible and can't be modified.
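To make the "history is immutable, but current state can still change" point concrete, here is a toy Python sketch. This is not Fabric's chaincode API, just an illustration of an append-only log with the current state derived from it.

```python
# Toy illustration (not Fabric's API): an append-only history of transactions
# against an asset, with the asset's current value derived from that history.
class AssetLedger:
    def __init__(self):
        self.history = []  # append-only list of transactions

    def submit(self, asset_id, new_value):
        # A new transaction changes the asset's current value...
        self.history.append({"asset": asset_id, "value": new_value})

    def current_value(self, asset_id):
        # ...but every previous value remains visible in the history.
        values = [tx["value"] for tx in self.history if tx["asset"] == asset_id]
        return values[-1] if values else None

ledger = AssetLedger()
ledger.submit("order-42", "baking")
ledger.submit("order-42", "delivered")
assert ledger.current_value("order-42") == "delivered"
assert len(ledger.history) == 2  # the full history is preserved
```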
In this world, you build something, someone controls it and gives access to other organisations and the people within them, and every action has its source identified.
It is decentralised in the sense that the ledger does not live in one place only, a copy of the ledger exists on every peer that is joined to a channel.
However, it is not meant to be anonymous: all the participants are known and their access levels are controlled; that's the whole point.

Erlang fault-tolerant application: PA or CA of CAP?

I have already asked a question regarding a simple fault-tolerant soft real-time web application for a pizza delivery shop.
I have gotten really nice comments and answers there, but I disagree that it is a true web service. Rather than a web service, it is more of a real-time system that accepts orders from customers, controls the dispatching of these orders, and controls the vehicles that deliver those orders in real time.
Moreover, unlike a 'true' web service, this system is not intended to have many users - it is just a few dispatchers (telephone operators) and a few delivery drivers that will use it (for now I have no requirement to provide direct access to the service to the actual customers; only the dispatchers and delivery drivers will have direct access).
Hence this question is a bit more general.
I have found that in order to make the right choice of a NoSQL data storage option for this application, the first thing I have to do is make a choice between CA, PA and CP according to the CAP theorem.
Now, the Building Web Applications with Erlang book says that "while it [Mnesia] is not a SQL database, it is a CA database like a SQL database. It will not handle network partition". The same book says that the CouchDB database is a PA database.
Having that in mind, I think that the very first thing I need to do with my application is decide what the term 'fault tolerance' means with regard to CAP.
The simple requirement that I have is that the application should be available 24/7 (R1). The other is that there is no need to scale: the application will have a very modest number of users (it is probably not possible to have thousands of dispatchers) (R2).
Now, does R1 require the application to provide Consistency, Availability and Partition Tolerance and with what priorities?
What type of data storage option will better handle the following issues:
Providing 24/7 availability for a dispatcher (a person who accepts phone calls from customers and who uses a CRM) to look up customer records and put orders into the system;
Looking up current ongoing served orders and their status (placed, baking, dispatched, delivering, delivered) in real time;
Keep track of all working vehicles' locations and their payloads in real time;
Recover any part of the system after system crash or network crash to continue providing 1,2 and 3;
To sum it up: What kind of data storage (CA, PA or CP) will suit the system described above better? What kind of data storage will better satisfy the R1 requirement?
For your 24/7 requirement you are looking for a database with (high) availability, because you want your requests to succeed every time (even if they only return error results).
A netsplit would bring your whole system down if you have no partition tolerance.
Consistency is nice to have, but you can only have 2 of 3.
Your best bet will be a PA solution. I highly recommend a solution inspired by Amazon Dynamo. The best-known Dynamo implementations are Riak and CouchDB. Riak even allows you to move away from pure PA by tuning the read and write replica counts.
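As a rough illustration of what tuning the read and write replicas buys you in a Dynamo-style store (a generic sketch of the quorum rule, not Riak's actual configuration API):

```python
# N = number of replicas, R = replicas that must answer a read,
# W = replicas that must acknowledge a write. Overlapping quorums (R + W > N)
# give read-your-writes consistency; smaller R/W favour availability.
def quorum_properties(n, r, w):
    return {
        "overlapping_quorums": r + w > n,
        "read_survives_failures": n - r,   # replicas that may be down for reads
        "write_survives_failures": n - w,  # replicas that may be down for writes
    }

# Availability-leaning (PA-ish): reads/writes succeed even with 2 of 3 replicas down.
print(quorum_properties(n=3, r=1, w=1))
# Consistency-leaning: quorums overlap, but fewer failures can be tolerated.
print(quorum_properties(n=3, r=2, w=2))
```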
First, don't confuse CAP "Availability" with "High Availability". They have nothing to do with each other. The A in CAP simply means "All DB nodes can answer queries". To get High Availability, you must be in multiple data centers, you must have robust documented procedures for maintenance, expansion, etc. None of that depends on your CAP choice.
Second, be realistic about your requirements. A stock-trading application might have a requirement for 100% uptime, because every second of downtime could lose millions of dollars. On the other hand, I'm guessing your pizza joint might lose tens of dollars for every minute it's down. So it doesn't make sense to spend millions trying to keep it up. Try to compute your actual costs.
Third, always evaluate your choice against the mainstream. You could just go CA (MySQL) and quickly fail over to the slaves when problems happen. Be realistic about the costs (and risks) of building on new technology. If you really expect your system to run for 5 years without downtime, ask for proof that someone else has run that database for 5 years without downtime.
If you go "AP" and have remote people (drivers, etc.), then you'll need to write an app that stores their data on their phone and sends it in the background (with retries). Of course, you could do this regardless of whether your database was CA or AP.
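A minimal sketch of that store-and-forward idea, in Python for brevity (a real driver app would use the phone platform's own storage and networking; post_to_server is a hypothetical stand-in for the upload call):

```python
import json
import time

class OutBox:
    """Buffer updates locally and retry sending them in the background."""
    def __init__(self, path="outbox.json"):
        self.path = path
        try:
            with open(path) as f:
                self.pending = json.load(f)
        except FileNotFoundError:
            self.pending = []

    def enqueue(self, update):
        self.pending.append(update)
        self._persist()

    def flush(self, post_to_server, max_retries=5):
        while self.pending:
            update = self.pending[0]
            for attempt in range(max_retries):
                if post_to_server(update):      # returns True on success
                    self.pending.pop(0)
                    self._persist()
                    break
                time.sleep(2 ** attempt)        # exponential backoff
            else:
                return                          # still offline; try again later

    def _persist(self):
        with open(self.path, "w") as f:
            json.dump(self.pending, f)
```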
If you want high uptimes, you can either:
Increase MTBF (Mean Time Between Failures) - Buy redundant power supplies, buy dual ethernet cards, etc..
Decrease MTTR (Mean Time To Recovery) - Just make sure when failure happens you can recover quickly. (Fail over to slave)
I've seen people spend tens of thousands of dollars on MTBF, only to be down for 8 hours while they restore their backup. It makes more sense to ensure MTTR is low before attacking MTBF.
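The arithmetic behind that advice: availability is roughly MTBF / (MTBF + MTTR), so shaving MTTR often buys more uptime than expensive MTBF hardware. Illustrative numbers only:

```python
# availability = MTBF / (MTBF + MTTR)
def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

# One failure a year, followed by an 8-hour restore from backup:
print(f"{availability(8760, 8):.4%}")      # ~99.91%
# Four failures a year, but a 5-minute fail-over to a slave:
print(f"{availability(2190, 5 / 60):.4%}") # ~99.996%
```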

Is it safe to assume uniform geo-ip resolution for same first-16-bit IP addresses?

I have a geo-sensitive webapp for which I send a request's IP to a remote, commercial ip-to-location service, and get back the country, city, ISP, etc. for the IP.
I currently cache the IP lookups in my database in order to make subsequent lookups faster and free (the commercial service charges per lookup).
I wonder if I can further optimize my caching by assuming that the first 16 bits (i.e. the aaa.bbb in an aaa.bbb.ccc.ddd address) always map to a uniform location. That way I would have at most 2^16 records to cache.
I don't mind so much about uniformity of ISP but that info would be helpful as well.
I'd recommend going down to at least /24 resolution. Oftentimes a /16 will tell you the ISP but not the city, or vice versa.
If you want a good idea of what the maps really look like, you can spend 49 USD on a developer license to Geobytes's GeoNetMap database. A developer license allows you to download the entire map from IP blocks to locations as a bunch of CSV files, but doesn't cover deploying it onto a production server. Geobytes has the added advantage of being entirely local, so lookups are liquid fast.
MaxMind also has a free downloadable map offering, although it is somewhat cut down from the full map, producing approximately double the error rate.
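A minimal sketch of caching at /24 granularity as suggested above (remote_lookup is a hypothetical stand-in for the commercial service call; the worst case is 2^24 cache entries, but real traffic touches far fewer prefixes):

```python
import ipaddress

_cache = {}

def cached_location(ip_str, remote_lookup):
    # Key the cache on the /24 the address falls in, e.g. 216.34.181.45 -> 216.34.181.0/24
    prefix = ipaddress.ip_network(f"{ip_str}/24", strict=False)
    if prefix not in _cache:
        _cache[prefix] = remote_lookup(ip_str)  # one billable lookup per /24
    return _cache[prefix]
```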
No, it's not safe. For example, if you do a GeoIP lookup on 216.34.181.45 (Slashdot) you get Mountain View, California. If you do a lookup on 216.34.1.1 you get Chesterfield, Missouri.
With respect to your caching, keep in mind that IPs can move around spatially. If an ISP goes bankrupt and its block gets bought by someone else, that block of IPs will move location.

Exactly how accurate is IP Geolocation?

I'm setting up an iPhone tracking system for my friends, so they can submit their location to my website from their iPhones, anywhere, anytime, via WiFi or cellular data.
The website will use Google Maps to plot their coordinates so that my other friends can track where they are. However, what I'm concerned about is the accuracy of turning an IP address into coordinates on Google Maps: exactly how accurate is locating someone by their IP address?
I was thinking around 95%; I tested this in a village and it was fairly accurate, but what happens in a city? Would that cause inaccurate locations?
Any kind of help is appreciated.
IP geolocation is really hit-or-miss, depending on both how the user's ISP assigns IPs and on the IP geolocation database you're using. For instance, I made a simple PHP script, IP2FireEagle, which looks up your IP. I found that the database kept placing me 10+ km to the west of where I really was. Updating my entry in Host IP wasn't the greatest either, as it soon got reverted, presumably by someone else who is occasionally assigned that IP by my ISP! That being said, I found that Clarke has very accurate coordinates (though it isn't using IP geolocation per se, but rather Skyhook's API and their WiFi geolocation database).
If it's a website for your friends and you know they have iPhones, I would suggest using the browser's support for navigator.geolocation.getCurrentPosition(). That is, get the location via JavaScript and submit it to your server via an AJAX call. Even better, since you want to use Google Maps: they give you a short tutorial on how to get your friends' locations and then update a map.
Excerpt From:
http://www.clickz.com/822881
IP targeting has been around since the early days of ad serving. It's not very hard to write code that will strip the IP address from a request, compare it to a database, and deliver an ad accordingly. The true difficulty, as we shall see, is building and maintaining an IP database.
One of the first applications of information in an IP database was targeting to specific geographic regions. Most commercial ad management systems have IP databases that can make geographic targeting possible. However, there are a couple weaknesses in this method. The first (and biggest) problem is that, for various reasons, not all IPs can be mapped to an accurate location.
Take all the IPs associated with AOL users, for instance. Anybody who has seen a WebTrends report knows that all AOL users appear to be coming from somewhere in Virginia. This is caused by AOL's use of proxy servers to handle their web requests.
In the interest of saving space, we won't get into the reasons why AOL makes use of proxy servers. The important thing is that AOL does use them, and as a result, all its users appear to be accessing the web from Virginia. Thus, it is impossible to attach meaningful geographic location data to an AOL IP, and those IPs must be discarded from any database that wants to maintain a reasonable degree of accuracy.
Other ISPs and networks may use a method known as dynamic IP allocation for its users. In other words, a user might have a different IP address every time he visits the Internet. You can see how this might affect the accuracy of a database.
But the real difficulty in discerning geography from an IP address has to do with the level of specificity that a media planner might expect from this targeting method. The first few geo-targeted campaigns that I put together early in my career had to be accurate to the ZIP code level. This level of specificity is not practical via IP targeting.
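The "strip the IP, compare it to a database, deliver accordingly" flow from the start of the excerpt really is simple; here is a hedged Python sketch, with a hypothetical in-memory GEO_DB and a hypothetical request object standing in for a real ad server's data.

```python
import ipaddress

# Hypothetical block-to-region map; a real database would hold many thousands of entries.
GEO_DB = {
    ipaddress.ip_network("203.0.113.0/24"): "US-CA",
    ipaddress.ip_network("198.51.100.0/24"): "GB",
}

def region_for_request(request):
    # Client IP may arrive via a proxy header; fall back to the socket address.
    ip_str = request.headers.get("X-Forwarded-For", request.remote_addr).split(",")[0].strip()
    ip = ipaddress.ip_address(ip_str)
    for network, region in GEO_DB.items():
        if ip in network:
            return region
    return None  # unknown: e.g. proxied AOL-style traffic discarded from the database
```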

How to specifically identify server issues from a load test result (using LoadRunner)?

How do you isolate a performance issue to a specific component of the application infrastructure? Specifically, are there distinct markers in the result logs that distinguish between bottlenecks at web, application and/or database server levels?
I was asked this question in an interview and went blank on it. Seems this information is not available anywhere.
In addition to SiteScope and other agentless monitoring of system components, you need to make sure your scenario and scripts are working as expected. This includes proper error checking and use of transactions (and a host of other things). If the transactions are granular enough, this will give you insight into at least the requests that have performance issues. Once you have these indicators, work with the infrastructure team to review logs and other information. Being an iterative process, tests can be made to focus on a smaller and smaller section of the infrastructure.
In addition, LoadRunner scripts don't have to be written strictly to come in 'through the front door'. If you have a multi-tiered system, scripts can be made to hit the web/app/database servers directly.
As for what to look for, focus on any measurements that show 'knee' or 'hockey stick' behaviour. You can hook into any of the server resource measurements directly in the Controller and integrate other teams' stats in the analysis phase. Compare against benchmarks at lower virtual-user levels to determine what is acceptable and unacceptable.
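One rough way to flag such a 'knee' programmatically from exported response-time data (an illustrative Python sketch with made-up numbers and an arbitrary threshold):

```python
# Flag the point where response time starts growing much faster per added
# virtual user than it did at low load. Data below is purely illustrative.
vusers       = [10, 20, 30, 40, 50, 60]
resp_time_ms = [210, 220, 235, 250, 410, 900]

baseline_slope = (resp_time_ms[1] - resp_time_ms[0]) / (vusers[1] - vusers[0])
for i in range(1, len(vusers)):
    slope = (resp_time_ms[i] - resp_time_ms[i - 1]) / (vusers[i] - vusers[i - 1])
    if slope > 5 * baseline_slope:  # arbitrary threshold for the example
        print(f"Knee around {vusers[i]} vusers (slope {slope:.0f} ms/user)")
        break
```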
Good luck!
If the interview is focused on LoadRunner and SiteScope is under consideration, I'd come to the conclusion that it's more focused on HP/Mercury solutions. In that case I'd suggest you look into HP Diagnostics and its LoadRunner integration capabilities.
This type of information is usually not available by just looking at the standard results from a performance test.
Parts of the information you are looking for MAY be found by using SiteScope to monitor all the relevant servers in the test. SiteScope offers many counters to look at such as CPU, Memory, Disk I/O and Network I/O - as seen on each server.
This information perhaps gives clues as to where the bottleneck is, and the more counters you add to SiteScope, the bigger the chance of pinpointing the bottleneck.
It is a very common misconception that AppServer and DBServer bottlenecks can be identified by just looking at the raw response times or hits, pages, etc. (web protocol), unless of course the URI accessed identifies the exact component(s) in the system...
