Find hot regions (hotspots) in spatial data

I have a dataset about restaurant income data.
Each row of the dataset looks like:
[restaurant_id, longitude, latitude, income]
Now I want to find the geographical regions with the highest restaurant income, e.g. the top 5 highest-income regions.
I don't have a criterion for income, nor a criterion for what counts as a 'region'.
I have no experience dealing with this kind of data. I've thought about first building a heat map of income and then doing some image segmentation to find the hottest regions. Any suggestion would be appreciated!
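A minimal sketch of that heat-map idea, assuming the data sits in a CSV with the columns listed above (the file name, the 50x50 grid resolution and the "top 5 cells" cutoff are all placeholder choices to tune):

import numpy as np
import pandas as pd

# Hypothetical file and column names matching the row layout above.
df = pd.read_csv("restaurants.csv",
                 names=["restaurant_id", "longitude", "latitude", "income"])

# Bin the map into a coarse grid and sum income per cell (a crude heat map).
income_grid, lon_edges, lat_edges = np.histogram2d(
    df["longitude"], df["latitude"], bins=50, weights=df["income"])

# Report the 5 hottest cells as candidate regions, with their bounding boxes.
flat_idx = np.argsort(income_grid, axis=None)[-5:][::-1]
for i, j in zip(*np.unravel_index(flat_idx, income_grid.shape)):
    print(f"lon [{lon_edges[i]:.3f}, {lon_edges[i + 1]:.3f}], "
          f"lat [{lat_edges[j]:.3f}, {lat_edges[j + 1]:.3f}], "
          f"total income {income_grid[i, j]:.0f}")

Merging adjacent hot cells (or smoothing the grid before ranking) would get closer to the image-segmentation step described above.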

Related

Why are data not split in training and testing for unsupervised learning algorithms?

We know that prediction and classification problems split the data according to a training ratio (generally a 70-30 or 80-20 split), where the model is fit on the training data and its output is evaluated against the test data.
Let's say I have data with 2 columns:
First column: Employee Age
Second column: Employee Salary Type
With 100 records similar to this:
Employee Age    Employee Salary Type
25              low
35              medium
26              low
37              medium
44              high
45              high
If the data is split by the ratio 70:30, let the target variable be Employee Salary Type and the predictor variable be Employee Age.
The model is trained on 70 records and tested against the remaining 30 records while their target variables are hidden.
Let's say 25 out of 30 records are predicted correctly.
Accuracy of the model = (25/30)*100 = 83.33%
Which means the model is good.
Let's apply the same thing to an unsupervised learning method like clustering.
Here there is no target variable; only clustering variables are present.
Let's consider both Employee Age and Employee Salary Type as clustering variables.
Then data will be automatically clustered according to
Employees with low age and low salary
Employees with medium age and medium salary
Employees with high age and high salary
If the training ratio is applied here, we can cluster 70 random records and use the remaining 30 records for testing/validating the above model instead of testing with some other data (and their records).
Here we would fit the model on the 70% of records and fit it again on the remaining 30% of records, and then compare the characteristics of cluster 1 of the 70% data with the characteristics of cluster 1 of the 30% data. If the characteristics are similar, we can infer that the clustering model was good.
Hence accuracy can be measured here as well.
Why don't people prefer train/test splits for unsupervised analysis like clustering, association rules, forecasting, etc.?
I believe you have a few misconceptions; here is a quick review:
Review
Unsupervised learning
This is when you have data inputs but no labels, and learn something about the inputs
Semi-supervised learning
This is when you have data inputs and some labels, and learn something about the inputs and their relationship to the labels
Supervised learning
This is when you have data inputs and labels, and learn what input maps to which label
Questions
Now, a few things you mention don't seem right:
Then data will be automatically clustered according to
Employees with low age and low salary
Employees with medium age and medium salary
Employees with high age and high salary
This is only guaranteed if your features represent employees using age and salary and, since you are using a clustering algorithm, you define a distance metric under which similar ages and salaries end up close to one another.
You also mention:
If the Training ratio is applied here,
We can cluster 70 random records and use rest of the
30 records for testing/validating
the above model instead of testing with
some other data (and their records).
Hence accuracy can be accurately measured here.
How do you know the labels? If you are clustering, you would not know what each cluster means as they are assigned only by your distance metric. A cluster usually only signifies distances being either closer or farther away.
You can never know what the correct label is unless you know that a cluster represents a certain label, but the features you use to cluster and compute distances on cannot also be used for validation.
This is because you would always get 100% accuracy, since the feature would also be the label.
A semi-supervised example
I think your misconception comes as you may be confusing learning types, so let's make an example using some fake data.
Let's say you have a table of data with Employee entries like the following:
Employee
Name
Age
Salary
University degree
University graduation date
Address
Now let's say some employees don't want to state their age, since it is not mandatory, but some do. Then you can use a semi-supervised learning approach to cluster employees and get information about their age.
Since we want to get the age, we can approximate by clustering.
Let's make features that represent the Employee age to help us cluster them together:
employee_vector = [salary, graduation, address]
With our input, we are making the claim that age can be determined by salary, graduation date and address, which might be true.
Let's say we have represented all these values numerically, then we can cluster items together.
What would these clusters mean with a standard distance metric such as Euclidean distance?
People whose salaries, graduation dates and addresses are less distant would be clustered together.
Then we could look at the clusters they are in and look at information about the ages we do know.
for cluster_id, employees in clusters:
    # ages that are actually known for the employees in this cluster
    ages = get_known_ages(employees)
Now we could use these ages in lots of ways to guess the missing employee ages, like fitting a normal distribution or just reporting a min/max range.
We could never know what the exact age is, since the clustering does not know that.
We could never test for age, since it is not always known, and is not used in the feature vectors for the employees.
This is why you could not use purely unsupervised approaches since you have no labels.
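To make the example concrete, here is a rough sketch with made-up numbers, using k-means as the clustering step (the feature values, the two-cluster choice and the min/max summary are all assumptions for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical numeric feature vectors [salary, graduation_year, address_code];
# ages are only known for some employees (np.nan where unknown).
# In practice the features should be scaled so salary does not dominate the distance.
X = np.array([[45000, 2015, 3],
              [47000, 2016, 3],
              [90000, 1995, 7],
              [88000, 1993, 7]])
ages = np.array([28, np.nan, 52, np.nan])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# For each cluster, summarise the known ages and use that range as a rough
# estimate for the employees whose age is missing.
for c in np.unique(labels):
    known = ages[(labels == c) & ~np.isnan(ages)]
    if known.size:
        print(f"cluster {c}: known ages between {known.min():.0f} and {known.max():.0f}")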
I do not know to whom you refer with "why don't people prefer ...", but usually if you are doing an unsupervised analysis you do not have label data and therefore you cannot measure accuracy. In this case, you can use methods like the silhouette score or the elbow curve to estimate the performance of the model.
On the other hand, if you have a supervised task with label data (as in this example) you can compute the accuracy with cross-validation (train-test split).
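For example, a label-free check of clustering quality with the silhouette score might look like this (synthetic data and scikit-learn's KMeans are used purely for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Compare a few cluster counts by silhouette score (higher is better);
# no labels are needed for this kind of internal validation.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))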
Because most unsupervised algorithms are not optimization based. (K-means is an exception!)
Examples: Apriori, DBSCAN, Local Outlier Factor.
And if you do not optimize, how are you going to overfit? (And if you do not use labels, you in particular cannot overfit to these labels).

Before clustering should i do an analysis on time series?

I have a question. I have a lot of different items, different articles of a company (26,000), and I have the sales quantity for the 52 weeks of 2017. I need to build a forecasting model for the future, so I decided to cluster the items.
The goal is to group the items that were sold in similar quantities during 2017; for the new collection of items I would do a classification based on the clusters and build a specific forecasting model per cluster. It's my first time using machine learning, so I need help.
Do I need to do a correlation analysis before I do the clustering?
I could create a correlation-based metric and pass it to my clustering function as the distance metric.
Clustering time series data on the raw values usually does not yield useful results.
Time series data is about trends and not actual values.
Try transforming your data to reflect the trends and then do the clustering.
For example, suppose your data looks like 5, 10, 45, 23.
Transform it to 0, 1, 1, 0 (1 means an increase over the previous value; the first entry is padded with 0). By doing so you can cluster together the items that increase or decrease together, as in the sketch below.
This is just an opinion; you will have to try out various transformations and see what works on your data. https://datascience.stackexchange.com/ is a relevant place to ask such questions.
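A minimal version of that transformation, assuming the weekly quantities are in a NumPy array (the zero-padding of the first entry is just one convention):

import numpy as np

series = np.array([5, 10, 45, 23])

# 1 where the value increased over the previous week, 0 otherwise;
# the first entry is padded with 0 since it has no predecessor.
trend = np.concatenate(([0], (np.diff(series) > 0).astype(int)))
print(trend)  # [0 1 1 0]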

Clustering origin/destination points

I have 1000 geo-points (lat, long) as origin/destination points. There is also historical data that shows the cost of traveling between some of the O-D pairs. For some of the O-Ds there is no record in the dataset and some have multiple records with different costs (e.g. because of seasonality).
I want to cluster these 1000 points into a few clusters (e.g. 20), not only based on their location (lat, long) but also considering the average cost of travel and shared destination points.
I would appreciate any suggestions on clustering these data.
You have to deal with missing values somehow: assign some given label to them or impute a mean/median value. Then you can use any algorithm you want (different types of features can be used together as input to the algorithm).
If the data does not have too many dimensions and you know more or less how many clusters there may be, the k-means algorithm should work well.
If you want to visualize your data and clusters in 2D or 3D and you have more features, you will have to apply dimensionality reduction (PCA, t-SNE).
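As a sketch of that pipeline, assuming each point is described by [latitude, longitude, average cost] with NaN where no cost record exists (the imputation strategy, the scaling and k=2 are all placeholder choices; k would be closer to 20 for the 1000 points in the question):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical feature matrix: [latitude, longitude, average travel cost].
X = np.array([[52.1, 4.3, 12.0],
              [52.2, 4.4, np.nan],
              [48.8, 2.3, 30.0],
              [48.9, 2.4, 28.0]])

# Fill missing costs with the median, put all features on a comparable scale,
# then run k-means on the combined features.
X_filled = SimpleImputer(strategy="median").fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_filled)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)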

Finding similarity between two user profiles

I have user profiles with the following attributes.
U={age,sex,country,race}
What is the best way to find similarity between two users?
For example, I have the following 2 users:
u1={25,M,USA,White}
u2={30,M,UK,black}
I have searched and found that cosine similarity is mentioned a lot. Is it good for my problem, or are there other suggestions?
Similarity measures between objects in clustering analysis are a broad subject.
What I would suggest is to consider a 'divide and conquer' approach: treat the similarity between two user profiles as a weighted average of the per-attribute similarities. Just remember to use normalized values for the attribute similarities before averaging. The weights for the average should be decided based on the data and the use case. If you consider one of the dimensions more important when it matches between two profiles, it should get more weight in the overall result.
For the attribute distances you can try: age -> simple Euclidean; sex, race, country -> 0/1. If you have time, the distance between two countries can be defined better based on geolocation or cultural similarity (e.g. language, religion, political system, GDP, ...). But experimenting with the weights for the final average and analysing the resulting clusters would probably give you more payoff ;-)
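A rough sketch of that weighted-average similarity (the weights and the age-scaling constant are assumptions to tune):

# Profiles are (age, sex, country, race) tuples as in the question.
def profile_similarity(u1, u2, weights=(0.4, 0.2, 0.2, 0.2), age_scale=50.0):
    age1, sex1, country1, race1 = u1
    age2, sex2, country2, race2 = u2
    # Normalized age similarity in [0, 1]; categorical attributes match as 0/1.
    sims = [
        max(0.0, 1.0 - abs(age1 - age2) / age_scale),
        1.0 if sex1 == sex2 else 0.0,
        1.0 if country1 == country2 else 0.0,
        1.0 if race1 == race2 else 0.0,
    ]
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)

print(profile_similarity((25, "M", "USA", "White"), (30, "M", "UK", "black")))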

What does dimensionality reduction mean?

What does dimensionality reduction mean exactly?
I searched for its meaning and only found that it means transforming raw data into a more useful form. So what is the benefit of having data in a more useful form? I mean, how can I use it in a practical application?
Dimensionality reduction is about converting data of very high dimensionality into data of much lower dimensionality such that each of the lower dimensions conveys much more information.
This is typically done while solving machine learning problems to get better features for a classification or regression task.
Here's a contrived example. Suppose you have a list of 100 movies and 1000 people, and for each person you know whether they like or dislike each of the 100 movies. So for each instance (which in this case means each person) you have a binary vector of length 100 [position i is 0 if that person dislikes the i'th movie, 1 otherwise].
You can perform your machine learning task on these vectors directly, but instead you could decide upon 5 genres of movies and, using the data you already have, figure out whether the person likes or dislikes each entire genre. In this way you reduce your data from a vector of size 100 to a vector of size 5 [position i is 1 if the person likes genre i].
The vector of length 5 can be thought of as a good representative of the vector of length 100, because most people might like movies only in their preferred genres.
However, it's not going to be an exact representative, because there might be cases where a person hates all movies of a genre except one.
The point is, that the reduced vector conveys most of the information in the larger one while consuming a lot less space and being faster to compute with.
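A toy numeric version of that movies-to-genres reduction (the random likes, the assumption that each block of 20 consecutive movies forms one genre, and the 0.5 threshold are all made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1000 people x 100 movies, 1 = liked, 0 = disliked.
likes = rng.integers(0, 2, size=(1000, 100))

# Reduce each person to 5 numbers: the fraction of liked movies per genre,
# then threshold to a binary "likes this genre" vector of length 5.
genre_scores = likes.reshape(1000, 5, 20).mean(axis=2)
likes_genre = (genre_scores > 0.5).astype(int)
print(likes.shape, "->", likes_genre.shape)  # (1000, 100) -> (1000, 5)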
Your question is a little vague, but there's an interesting statistical technique that may be what you're thinking of, called Principal Component Analysis, which does something similar (and, incidentally, plotting its results was my first real-world programming task).
It's a neat but clever technique which is remarkably widely applicable. I applied it to similarities between protein amino acid sequences, but I've seen it used to analyse everything from relationships between bacteria to malt whisky.
Consider a graph of some attributes of a collection of things where one has two independent variables: to analyse the relationship between these, one obviously plots in two dimensions and you might see a scatter of points. If you have three variables you can use a 3D graph, but after that one starts to run out of dimensions.
In PCA one might have dozens or even a hundred or more independent factors, all of which need to be plotted on perpendicular axes. Using PCA one does this, then analyses the resulting multidimensional graph to find the set of two or three axes within the graph which contain the largest amount of information. For example, the first principal coordinate will be a composite axis (i.e. at some angle through n-dimensional space) which has the most information when the points are plotted along it. The second axis is perpendicular to this (remember this is n-dimensional space, so there are a lot of perpendiculars) and contains the second largest amount of information, etc.
Plotting the resulting graph in 2D or 3D will typically give you a visualization of the data which contains a significant amount of the information in the original dataset. It is usual to consider the technique valid if you find a representation that contains around 70% of the original information: enough to visualize relationships with some confidence that would otherwise not be apparent in the raw statistics. Notice that the technique requires that all factors have the same weight, but given that, it's an extremely widely applicable method that deserves to be more widely known and is available in most statistical packages (I did my work on an ICL 2700 in 1980, which is about as powerful as an iPhone).
http://en.wikipedia.org/wiki/Dimension_reduction
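For completeness, a minimal scikit-learn sketch of the projection just described (random data is used only to show the call pattern; the two-component choice is an assumption):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))   # 200 samples described by 30 features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)      # coordinates along the first two principal axes

# Fraction of the original variance retained by the 2-D view.
print(X_2d.shape, pca.explained_variance_ratio_.sum())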
Maybe you have heard of PCA (principal component analysis), which is a dimensionality reduction algorithm.
Others include LDA, matrix-factorization-based methods, etc.
Here's a simple example. You have a lot of text files, and each file consists of some words. These files can be classified into two categories. You want to visualize each file as a point in 2D/3D space so that you can see the distribution clearly. So you need dimensionality reduction to transform a file containing a lot of words into only 2 or 3 dimensions.
The dimensionality of a measurement of something is the number of numbers required to describe it. So, for example, the number of numbers needed to describe the location of a point in space is 3 (x, y and z).
Now let's consider the location of a train along a long but winding track through the mountains. At first glance this may appear to be a 3-dimensional problem, requiring a longitude, latitude and height measurement to specify. But these 3 dimensions can be reduced to one if you just take the distance travelled along the track from the start instead.
If you were given the task of using a neural network or some statistical technique to predict how far a train could get given a certain quantity of fuel, then it would be far easier to work with the 1-dimensional data than the 3-dimensional version.
It's a technique of data mining. Its main benefit is that it allows you to produce a visual representation of many-dimensional data. The human brain is peerless at spotting and analyzing patterns in visual data, but can process a maximum of three dimensions (four if you use time, i.e. animated displays), so any data with more than 3 dimensions needs to be somehow compressed down to 3 (or 2, since plotting data in 3D can often be technically difficult).
BTW, a very simple form of dimensionality reduction is the use of color to represent an additional dimension, for example in heat maps.
Suppose you're building a database of information about a large collection of adult human beings. It's also going to be quite detailed. So we could say that the database is going to have large dimensions.
As a matter of fact, each database record will actually include a measure of the person's IQ and shoe size. Now let's pretend that these two characteristics are quite highly correlated. Compared to IQs, shoe sizes are easy to measure, and we want to populate the database with useful data as quickly as possible. One thing we could do would be to forge ahead and record shoe sizes for new database records, postponing the task of collecting IQ data until later. We would still be able to estimate IQs using shoe sizes, because the two measures are correlated.
We would be using a very simple form of practical dimension reduction by leaving IQ out of records initially. Principal components analysis, various forms of factor analysis and other methods are extensions of this simple idea.
