What is the accuracy at the country level of the GeoLite2 Country DB? - geolocation

What is the accuracy at the country level of the GeoLite2 Country DB? Also, is there any data about country-level accuracy for the other databases (both paid and free) available on the market?

You can find the accuracy data on the vendors' websites.
GeoIP2 Accuracy:
https://www.maxmind.com/en/geoip2-city-database-accuracy
IP2Location Accuracy:
https://www.ip2location.com/data-accuracy

Related

Which accuracy (training or test) is being reported in journal articles?

I am new to neural networks. When I read articles, they often say "we noted a 98% accuracy". I read the articles carefully (see the two articles below), but there is no further information on whether the accuracy refers to the training set or the test (validation) set. Please let me know which accuracy the authors are reporting.
Grinblat, G. L., Uzal, L. C., Larese, M. G., & Granitto, P. M. (2016). Deep learning for plant identification using vein morphological patterns. Computers and Electronics in Agriculture, 127, 418-424.
Satti, V., Satya, A., & Sharma, S. (2013). An automatic leaf recognition system for plant identification using machine vision technology. International Journal of Engineering Science and Technology, 5(4), 874.
From what I have read, the accuracy refers to the test set. When you test with a large amount of data, you give your machine learning model the opportunity to demonstrate high accuracy. Of course, it is the test that ultimately determines whether your work gives the expected result.

Why are data not split in training and testing for unsupervised learning algorithms?

We know that for prediction and classification problems the data can be split according to a training ratio (generally 70:30 or 80:20), where the training data are passed to a model to be fit and the model's output is tested against the test data.
Say I have a dataset with two columns:
First column: Employee Age
Second Column: Employee Salary Type
With 100 records similar to this:
Employee Age    Employee Salary Type
25              low
35              medium
26              low
37              medium
44              high
45              high
If the data are split in the ratio 70:30, let the target variable be Employee Salary Type and the predictor variable be Employee Age.
The model is trained on 70 records and tested against the remaining 30 records with their target variables hidden.
Let's say 25 out of the 30 records are predicted correctly.
Accuracy of the model = (25/30)*100 = 83.33%
which suggests the model is good.
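As a concrete illustration, here is a minimal sketch of that 70:30 supervised workflow in scikit-learn. The synthetic ages, the threshold rule used to generate salary types, and the added label noise are all assumptions for illustration, not part of the question:

# Minimal sketch of the 70:30 supervised workflow described above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
ages = rng.integers(20, 60, size=100)            # predictor: Employee Age
# Hypothetical rule to generate labels: <30 low, <40 medium, else high
salary = np.where(ages < 30, "low", np.where(ages < 40, "medium", "high"))
# Flip ~15% of the labels so the task is not trivially learnable
noisy = rng.random(100) < 0.15
salary[noisy] = rng.choice(["low", "medium", "high"], size=noisy.sum())

X_train, X_test, y_train, y_test = train_test_split(
    ages.reshape(-1, 1), salary, test_size=0.30, random_state=42)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {acc:.2%}")                    # analogous to (25/30)*100 above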
Let's apply the same idea to an unsupervised method like clustering.
Here there is no target variable; only clustering variables are present.
Let's consider both Employee Age and Employee Salary as clustering variables.
Then the data will be automatically clustered into:
Employees with low age and low salary
Employees with medium age and medium salary
Employees with high age and high salary
If the training ratio is applied here, we can cluster 70 random records and use the remaining 30 records for testing/validating the above model, instead of testing with some other data (and their records).
That is, we fit the model on the 70% of records and fit it again on the remaining 30%, and then compare the characteristics of cluster 1 of the 70% data with the characteristics of cluster 1 of the 30% data. If the characteristics are similar, we can infer that the clustering model is good.
Hence accuracy can be measured here as well.
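For concreteness, a minimal sketch of that split-and-compare procedure (KMeans and the synthetic age/salary data are assumptions; nothing in the question fixes the algorithm):

# Sketch of the split-and-compare idea: cluster each part separately,
# then compare the resulting cluster centers.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic (age, salary) pairs forming three rough groups
data = np.vstack([
    rng.normal([25, 30_000], [3, 3_000], size=(40, 2)),
    rng.normal([36, 60_000], [3, 5_000], size=(40, 2)),
    rng.normal([45, 90_000], [3, 8_000], size=(20, 2)),
])

part_70, part_30 = train_test_split(data, test_size=0.30, random_state=42)
km_70 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(part_70)
km_30 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(part_30)

# Cluster IDs are arbitrary, so centers must be matched up before comparing;
# sorting per column is only a crude alignment for illustration.
print(np.sort(km_70.cluster_centers_, axis=0))
print(np.sort(km_30.cluster_centers_, axis=0))

Note that nothing here tells you whether a cluster is "correct"; that labeling gap is exactly what the answer below discusses.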
Why don't people prefer a train/test split for unsupervised analyses like clustering, association rules, forecasting, etc.?
I believe you have a few misconceptions; here is a quick review:
Review
Unsupervised learning
This is when you have data inputs but no labels, and you learn something about the structure of the inputs
Semi-supervised learning
This is when you have data inputs and some labels, and you learn something about the inputs and their relationship to the labels
Supervised learning
This is when you have data inputs and labels, and you learn which input maps to which label
Questions
Now, a few things you mention don't seem right:
Then the data will be automatically clustered into:
Employees with low age and low salary
Employees with medium age and medium salary
Employees with high age and high salary
This is only guaranteed if your features represent employees by their age and salary. Since you are using a clustering algorithm, you also need to define a distance metric under which similar ages and salaries end up close to one another.
You also mention:
If the training ratio is applied here, we can cluster 70 random records and use the remaining 30 records for testing/validating the above model instead of testing with some other data (and their records). Hence accuracy can be accurately measured here.
How do you know the labels? If you are clustering, you do not know what each cluster means, since clusters are induced only by your distance metric. A cluster usually only signifies that some points are closer to each other than to the rest.
You can never know what the correct label is unless you know that a cluster represents a certain label. And the features used to cluster and to compute distances cannot also be used for validation:
you would always get 100% accuracy that way, because a feature would then double as its own label.
A semi-supervised example
I think your misconception arises because you may be confusing the learning types, so let's build an example using some fake data.
Let's say you have a table of data with Employee entries like the following:
Employee
Name
Age
Salary
University degree
University graduation date
Address
Now let's say some employees don't want to state their age, since it is not mandatory, but some do. Then you can use a semi-supervised learning approach to cluster employees and infer information about their age.
Since we want to get the age, we can approximate it by clustering.
Let's make features that are related to employee age to help us cluster employees together:
employee_vector = [salary, graduation, address]
With our input, we are making the claim that age can be determined by salary, graduation date and address, which might be true.
Let's say we have represented all these values numerically, then we can cluster items together.
What would these clusters mean under a standard distance metric such as Euclidean distance?
People with less distant salaries, graduation dates and addresses would be clustered together.
Then we could look at the clusters they are in and look at information about the ages we do know.
# For each cluster, collect the ages we do know; get_known_ages is a
# hypothetical helper that returns the known ages of the given employees.
for cluster_id, employees in clusters.items():
    ages = get_known_ages(employees)
Now we could use these ages in lots of ways to guess the missing employee ages, for example by fitting a normal distribution or simply reporting a min/max range.
We could never know what the exact age is, since the clustering does not know that.
We could never test for age, since it is not always known, and is not used in the feature vectors for the employees.
This is why you could not use purely unsupervised approaches since you have no labels.
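Here is a minimal end-to-end sketch of that idea (the synthetic salary/graduation data, the dropped address field, and KMeans as the clustering algorithm are all assumptions for illustration):

# Sketch: cluster employees on (salary, graduation year), then use the
# known ages inside each cluster to estimate the unknown ones.
# Address is omitted here because it is hard to encode numerically.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic employees: [salary, graduation_year]; age is known for only some.
X = np.array([[30_000, 2018], [32_000, 2017], [90_000, 1995],
              [95_000, 1993], [60_000, 2005], [62_000, 2006]], dtype=float)
known_ages = {0: 27, 2: 52, 4: 40}  # row index -> age, where reported

X_scaled = StandardScaler().fit_transform(X)  # put features on one scale
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

for cluster_id in np.unique(labels):
    members = np.where(labels == cluster_id)[0]
    ages = [known_ages[i] for i in members if i in known_ages]
    if ages:  # report a range as a rough guess for members with unknown age
        print(f"cluster {cluster_id}: members {members}, "
              f"age range {min(ages)}-{max(ages)}")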
I do not know whom you refer to with "why don't people prefer ...", but usually if you are doing an unsupervised analysis you do not have label data and therefore you cannot measure accuracy. In this case, you can use methods like the silhouette score or the L-curve (elbow) method to estimate the performance of the model.
On the other hand, if you have a supervised task with labeled data (as in your example), you can compute the accuracy with cross-validation on a train/test split.
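For instance, a minimal sketch of scoring a label-free clustering with the silhouette coefficient (KMeans and the synthetic blobs are assumptions):

# Score a label-free clustering with the silhouette coefficient
# (ranges from -1 to 1; higher means better-separated clusters).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # pick k with the highest score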
Because most unsupervised algorithms are not optimization-based (k-means is an exception!).
Examples: Apriori, DBSCAN, Local Outlier Factor.
And if you do not optimize, how are you going to overfit? (And if you do not use labels, you certainly cannot overfit to those labels.)

Find hot regions (spots) in spatial data

I have a dataset of restaurant incomes.
Each row of the dataset looks like:
[restaurant_id, longitude, latitude, income]
Now I want to find the geographical regions with the highest restaurant income, e.g. the top 5 such regions.
I don't have a criterion for income, nor a criterion for what counts as a 'region'.
I have no experience dealing with this kind of data. I've thought about first building a heat map of income and then doing some image segmentation to find the hottest regions. Any suggestion would be appreciated!
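A minimal sketch of that heat-map idea (the grid resolution, the Gaussian smoothing width, and the synthetic coordinates and incomes are all assumptions, not a definitive method):

# Bin restaurant income onto a lon/lat grid, smooth it, and report the
# top-5 hottest cells as candidate "regions".
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
lon = rng.uniform(-74.1, -73.9, 1000)   # synthetic restaurant coordinates
lat = rng.uniform(40.6, 40.9, 1000)
income = rng.gamma(2.0, 50_000, 1000)   # synthetic incomes

# 2D histogram weighted by income, then Gaussian smoothing as a cheap heat map
heat, lon_edges, lat_edges = np.histogram2d(lon, lat, bins=50, weights=income)
heat = gaussian_filter(heat, sigma=1.5)

# Indices of the 5 hottest cells, converted back to grid coordinates
flat_top5 = np.argsort(heat, axis=None)[-5:][::-1]
for i, j in zip(*np.unravel_index(flat_top5, heat.shape)):
    print(f"lon {lon_edges[i]:.3f}..{lon_edges[i+1]:.3f}, "
          f"lat {lat_edges[j]:.3f}..{lat_edges[j+1]:.3f}, heat {heat[i, j]:.0f}")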

mutual information and prediction accuracy

What's the relation between mutual information and prediction accuracy for classification, or MSE for regression? Is it possible to have high accuracy / low MSE despite low mutual information in data mining?
Mutual information is defined for pairs of probability distributions. Much of what can be said regarding its relationship to other quantities depends heavily on how you compute and represent these probability distributions (e.g. discrete versus continuous probability distributions).
Given a set of probability distributions, the relationship between classification accuracy and mutual information has been studied in the literature. In short, each quantity puts bounds on the other, at least for discrete probability distributions; Fano's inequality, for instance, turns low mutual information into a lower bound on the achievable classification error.
I don't know of any formal studies looking at the relationship between the MSE and mutual information.
All of that being said, if I had a concrete data set and got low mutual information scores for two variables but also a very low MSE in a regression model, I would take a hard look at how the mutual information was computed. 99 out of 100 times this occurs because the original formulation of Shannon entropy (and by extension mutual information) is used on continuous / floating point data, even though this method only applies to discrete data.
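As a quick illustration of that last point, here is a sketch (using scikit-learn's mutual_info_score on synthetic data; the bin edges and sample sizes are assumptions) of how a naive "discrete" MI estimate on continuous values misleads, and how binning first behaves sensibly:

# Naively treating continuous values as discrete symbols makes every value
# a unique symbol and inflates the MI estimate; binning first is sensible.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = x + 0.1 * rng.normal(size=5000)        # strong dependence, low noise

print(mutual_info_score(x, y))             # inflated: every float is its own "symbol"
x_bins = np.digitize(x, np.linspace(-3, 3, 20))
y_bins = np.digitize(y, np.linspace(-3, 3, 20))
print(mutual_info_score(x_bins, y_bins))   # a sane discrete estimate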

Stanford sentiment - Cannot replicate same experiments with same accuracy - I get 79% instead of 85%

I am using the Stanford NLP library for sentiment analysis,
but after training the model for more or less 24 hours, the session ended because the maximum training time was exceeded.
After running the evaluation of the created models, I found that the accuracy results are far below the ones reported in the Stanford paper.
These are the results of the Evaluation:
Tested 82600 labels
65166 correct
17434 incorrect
0.788935 accuracy
Tested 2210 roots
828 correct
1382 incorrect
0.374661 accuracy
Approximate Negative label accuracy: 0.595578
Approximate Positive label accuracy: 0.663263
Combined approximate label accuracy: 0.634001
Approximate Negative root label accuracy: 0.665570
Approximate Positive root label accuracy: 0.601760
Combined approximate root label accuracy: 0.633718
I decided to retrain the model with MaximumTrainTimeSeconds set to 3 days, hoping to get better accuracy.
Has anyone encountered the same issue?
Do you think that retraining the algorithm for a longer period would make me achieve the expected accuracy?
Moreover, I am not entirely sure how the score embedded in the model (e.g. 79.30 in the model in the picture) relates to the best accuracy achieved by the model.
I'm very new to NLP so if I am missing any required information or anything at all please let me know! Thank you!
