Can you use sampling to normalize data?

Context
I have a retail data set that contains sales for a large number of customers. Some of these customers received a marketing treatment (i.e. saw a TV ad or similar) while others did not. The data is very messy: most customers have $0 in sales, some have negative sales, some positive, and there are many outliers/influential cases. Ultimately I am trying to "normalize" the data so that the assumptions of the General Linear Model (GLM) are met and I can thus use various well-known statistical tools (regression, t-test, etc.). Transformations have failed to normalize the data.
Question
Is it appropriate to sample groups of these customers so that the data starts to become more normal? Would doing so violate any assumptions for the GLM? Are you aware of any literature on this subject?
Clarification
For example, instead of looking at 20,000 individual customers (20,000 groups of 1) I could pool customers into groups of 10 (2,000 groups of 10) and calculate their mean sale. Theoretically, the data should begin to normalize as all of these random draws from the population begin to cluster around the population mean with some standard error. I could keep breaking them into larger groups (i.e. 200 groups of 100) until the data is relatively normal and then proceed with my analysis.
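To make the clarification concrete, here is a minimal sketch of the pooling idea using made-up, zero-inflated sales figures (the real data, column names, and group sizes are of course different). It only illustrates the central-limit effect described above: as the groups grow, the distribution of group means becomes less skewed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real data: 20,000 customers with zero-inflated, skewed sales.
n = 20_000
sales = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "sales": np.where(rng.random(n) < 0.7, 0.0, rng.lognormal(3.0, 1.0, n) - 5.0),
})

def group_means(df, group_size, seed=42):
    """Randomly pool customers into groups of `group_size` and return each group's mean sales."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    groups = shuffled.index // group_size
    return shuffled.groupby(groups)["sales"].mean()

# As the pooling gets coarser, the skew of the group means shrinks (central limit theorem).
for k in (1, 10, 100):
    means = group_means(sales[sales["treated"] == 1], k)
    print(f"group size {k:>3}: n_groups={len(means):>5}, skew={means.skew():.2f}")
```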

Related

Any Statistical or Machine Learning Method to Predict Salary

I work at a FinTech company. We provide loans to our customers. Customers who want to apply for a loan must fill in some information in our app, and one piece of that information is their salary. Using web scraping, we are able to grab our customers' bank transaction data for the last 3-7 months.
Using any statistical or machine learning technique, how can I easily spot whether the salary amount (or something close to it) appears in a customer's bank transaction data? Should I make one model (logic) for each customer, or should one model apply to all customers?
Please advise
I don't think you need machine learning for this.
1. Out of the list of all transactions, keep only those that add money to the account, rather than subtract money from it.
2. Round all amounts to a certain accuracy (e.g. 2510 USD -> 2500 USD).
3. Build a dataset that contains the total amount added to the account for each day. In other words, group transactions by day, and add 0's wherever needed.
4. Apply a discrete Fourier transform to find the periodic components in this time series.
5. There should be only one dominant periodic component, repeating every 30ish days.
6. Set the values of all other periodically repeating components to 0.
7. Apply the inverse discrete Fourier transform to recover only the information that repeats every 28/30 days.
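A rough sketch of these steps (not a production pipeline), assuming a pandas DataFrame `transactions` with hypothetical `date` and `amount` columns:

```python
import numpy as np
import pandas as pd

def estimate_salary_component(transactions: pd.DataFrame) -> pd.Series:
    """Sketch of the steps above: keep credits, bucket by day, and isolate the
    single strongest periodic component with an FFT. Column names are assumed."""
    credits = transactions[transactions["amount"] > 0].copy()          # step 1: credits only
    credits["amount"] = (credits["amount"] / 100).round() * 100        # step 2: round to ~100s
    credits["date"] = pd.to_datetime(credits["date"])
    daily = credits.set_index("date")["amount"].resample("D").sum()    # step 3: daily totals, 0-filled
    spectrum = np.fft.rfft(daily.to_numpy())                           # step 4: discrete Fourier transform
    # steps 5-6: keep only the strongest non-constant frequency (hopefully the ~monthly salary)
    keep = np.zeros_like(spectrum)
    keep[0] = spectrum[0]                                              # keep the mean (DC) term
    k = np.argmax(np.abs(spectrum[1:])) + 1
    keep[k] = spectrum[k]
    # step 7: inverse transform -> the roughly monthly component as a daily series
    return pd.Series(np.fft.irfft(keep, n=len(daily)), index=daily.index)
```

In practice you would also want to check that the dominant frequency really corresponds to a roughly 28/30-day period before trusting it as the salary.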
For more information on the Fourier transform, check out https://en.wikipedia.org/wiki/Fourier_transform
For a practical example (using MATLAB), check out https://nl.mathworks.com/help/matlab/examples/fft-for-spectral-analysis.html?requestedDomain=www.mathworks.com
It shows how to compute a frequency decomposition of a time signal. If you apply the same logic, you can use this frequency decomposition to figure out which frequencies are dominant (typically the salary will be one of them).

Feature selection with a wrapper method vs. information filtering?

I came across the following example in an old midterm exam by Tom Mitchell:
Consider learning a classifier in a situation with 1000 features total.
50 of them are truly informative about class. Another 50 features are
direct copies of the first 50 features. The final 900 features are not
informative. Assume there is enough data to reliably assess how useful
features are, and the feature selection methods are using good
thresholds.
How many features will be selected by mutual information filtering?
Solution: 100
How many features will be selected by a wrapper method?
Solution: 50
My question is how these solutions are arrived at. I have tried many approaches, but I cannot understand the idea behind them.
How many features will be selected by mutual information filtering?
Mutual information feature selection evaluates each feature's candidacy independently. Since there are essentially 100 features that are individually informative about the class (the 50 originals and their 50 copies), we end up with 100 features after mutual information filtering.
How many features will be selected by a wrapper method?
A wrapper method evaluates a subset of features, so it takes the interactions between features into account. Since 50 features are direct copies of the other 50 features, the wrapper method is able to find out that, conditioned on the first 50 features, the second set of 50 features adds no extra information at all. We end up with 50 features after selection. Suppose the first set of 50 features is A1, A2, ..., A50 and the copies are C1, C2, ..., C50. The final set of selected features might look like:
A1, C2, A3, A4, C5, C6, ..., A48, A49, C50.
Thus each unique feature should have only one occurrence (either from set A or from set C).
How many features will be selected by mutual information filtering?
Going only by which features carry information, you might expect just 50 features to be selected. But this filtering is based on each feature's correlation with the variable to predict, and one of the major drawbacks of mutual information filtering is that it tends to select redundant variables, because it does not consider the relationships between the features themselves. So it keeps both the originals and their copies, i.e. 100 features.
How many features will be selected by a wrapper method?
Think of it as a heuristic search over the space of all possible feature subsets. By definition, "a wrapper method evaluates a subset of features, so it takes the interactions between features into account."
Example: hill climbing, i.e. keep adding features one at a time until no further improvement can be achieved.
Since we have 50 features that carry the information, another 50 that are copies of the former, and 900 features that are of no use, we end up with only 50 features.
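A small synthetic check of both answers, shrunk from 1000 features to 30 so it runs quickly. The column counts, the greedy forward-selection loop, and its stopping tolerance are illustrative choices, not part of the exam question; on this toy data the wrapper typically keeps one column per unique signal and no copies.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Scaled-down version of the exam setup: 5 informative features, 5 exact copies, 20 noise features.
n = 2000
informative = rng.normal(size=(n, 5))
X = np.hstack([informative, informative.copy(), rng.normal(size=(n, 20))])
y = (informative.sum(axis=1) > 0).astype(int)

# Filter: each feature is scored on its own, so a copy scores just as well as its original.
mi = mutual_info_classif(X, y, random_state=0)
print("MI of feature 0, its copy, and a noise feature:",
      mi[0].round(3), mi[5].round(3), mi[10].round(3))

# Wrapper: greedy forward selection; a copy adds nothing once its original is already in the subset.
selected, best = [], 0.0
while True:
    candidates = [j for j in range(X.shape[1]) if j not in selected]
    scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [j]], y, cv=3).mean()
              for j in candidates}
    j_best, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best + 0.01:      # stop once the improvement is marginal
        break
    selected.append(j_best)
    best = score
print("wrapper keeps one column per unique signal:", sorted(selected))
```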

Using a decision tree in recommender systems

I have a decision tree trained on the columns (Age, Sex, Time, Day, Views, Clicks) that classifies into two classes, Yes or No, representing the buying decision for an item X.
Using these values, I'm trying to predict the buying probability for 1000 samples (customers) which look like:
('12','Male','9:30','Monday','10','3'),
('50','Female','10:40','Sunday','50','6'),
........
I want to get an individual probability or score that will help me recognize which customers are most likely to buy item X. So I want to be able to sort the predictions and show a particular item to only the 5 customers most likely to buy item X.
How can I achieve this ?
Will a decision tree serve the purpose?
Is there any other method?
I'm new to ML so please forgive me for any vocabulary errors.
Using a decision tree with a small sample set, you will definitely run into overfitting problems, especially at the lower levels of the tree, where you will have exponentially less data to train your decision boundaries. Your data set should have a lot more samples than the number of categories, and enough samples for each category.
Speaking of decision boundaries, make sure you understand how you are handling the data type of each dimension. For example, 'Sex' is categorical data, whereas 'Age', 'Time of day', etc. are real-valued inputs (discrete/continuous), so different parts of your tree will need different formulations. Otherwise, your model might end up handling 9:30, 9:31, 9:32, ... as separate classes.
Try some other algorithms, starting with simple ones like k-nearest neighbours (KNN). Have a validation set to test each algorithm. Use MATLAB (or similar software) where you can use libraries to quickly try different methods and see which one works best. There is not enough information here to recommend something very specific.
Plus, I suggest you try KNN in particular. KNN is able to capture affinity in the data. Say product X is bought by people around age 20, during evenings, after about 5 clicks on the product page. KNN will be able to tell you how close each new customer is to the customers who bought the item. Based on this you can just pick the top 5. It is very easy to implement and works great as a benchmark for more complex methods.
(Assuming Views and Clicks mean the number of views and clicks by each customer for product X.)
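A minimal sketch of that KNN idea, with made-up arrays standing in for the real customer data (`X_past`, `y_past`, `X_new` are hypothetical names, and the categorical columns are assumed to be encoded to numbers already):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical arrays: `X_past` holds encoded features of past customers, `y_past` is 1 if they
# bought item X, and `X_new` holds the 1000 customers to score. Sex/Day are assumed one-hot
# encoded and Time converted to minutes since midnight before this point.
rng = np.random.default_rng(0)
X_past, y_past = rng.normal(size=(5000, 6)), rng.integers(0, 2, 5000)
X_new = rng.normal(size=(1000, 6))

scaler = StandardScaler().fit(X_past)
buyers = scaler.transform(X_past[y_past == 1])        # only customers who actually bought X

# Affinity = average distance to the 10 nearest past buyers (smaller = more similar to buyers).
nn = NearestNeighbors(n_neighbors=10).fit(buyers)
dist, _ = nn.kneighbors(scaler.transform(X_new))
affinity = dist.mean(axis=1)

top5 = np.argsort(affinity)[:5]                       # the 5 new customers closest to past buyers
print("show item X to customers:", top5)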
A decision tree is a classifier, and in general it is not suitable as a basis for a recommender system. But, given that you are only predicting the likelihood of buying one item, not tens of thousands, it kind of makes sense to use a classifier.
You simply score all of your customers and retain the 5 whose probability of buying X is highest, yes. Is there any more to the question?
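A corresponding sketch of the score-and-sort approach with a decision tree; the arrays and hyperparameters below are purely illustrative, not the poster's actual data or settings:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoded data: X_past/y_past are past customers and their Yes/No buying decisions,
# X_new holds the 1000 customers to rank (same kind of numeric encoding as in the KNN sketch).
rng = np.random.default_rng(1)
X_past, y_past = rng.normal(size=(5000, 6)), rng.integers(0, 2, 5000)
X_new = rng.normal(size=(1000, 6))

# Limiting depth and requiring a minimum leaf size keeps the leaf probabilities from overfitting.
tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=50).fit(X_past, y_past)

p_buy = tree.predict_proba(X_new)[:, 1]        # probability of the "Yes" class for each new customer
top5 = np.argsort(p_buy)[::-1][:5]             # indices of the 5 most likely buyers
print("show item X to customers:", top5, "with P(buy) =", p_buy[top5].round(2))
```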

Predicting the item to sell, given a list of items

We have a data set that maps each customer to the products they buy, like
c1->{P1, P2, p5}
c2->{P3, P5, p4}
c3->{P5, P2, p3}
....
On that basis we need to recommend a product for the customer.
Say for customer cx we need to recommend a product. Since we have the data on what cx is buying from the above set, we run Apriori to figure out the recommendation, but for a big data set it is very slow.
Can someone please give us some suggestions on how to crack this problem?
I assume the items the merchant is selling are your training data and a random item is your testing data. So the most probable item to sell will depend upon the "features" of the items the merchant is currently selling. "Features" means things like the price of the item and its category; these are the details you will have. Then, to decide on an algorithm, I recommend you have a look at the feature space. If there are small clusters, then even a nearest-neighbour search would work well. If the distribution is complex, then you can go for an SVM. There are various data visualization techniques; applying PCA and visualizing the first two dimensions can be a good choice.
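A quick way to take that look at the feature space, assuming you can build some numeric item-feature matrix (price, encoded category, ...); the blob data below is only a stand-in for it:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Stand-in for a real item-feature matrix: a few clusters in 10 dimensions.
X, labels = make_blobs(n_samples=500, n_features=10, centers=4, random_state=0)

# Project onto the first two principal components to eyeball the structure.
X2 = PCA(n_components=2).fit_transform(X)
plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=10)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.title("If clear clusters show up, a nearest-neighbour approach is a reasonable start")
plt.show()
```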

What does dimensionality reduction mean?

What does dimensionality reduction mean exactly?
I searched for its meaning and only found that it means the transformation of raw data into a more useful form. So what is the benefit of having the data in a more useful form? I mean, how can I use it in practical life (in an application)?
Dimensionality reduction is about converting data of very high dimensionality into data of much lower dimensionality, such that each of the lower dimensions conveys much more information.
This is typically done while solving machine learning problems to get better features for a classification or regression task.
Here's a contrived example. Suppose you have a list of 100 movies and 1000 people, and for each person you know whether they like or dislike each of the 100 movies. So for each instance (which in this case means each person) you have a binary vector of length 100 (position i is 0 if that person dislikes the i-th movie, 1 otherwise).
You can perform your machine learning task on these vectors directly, but instead you could decide upon 5 genres of movies and, using the data you already have, figure out whether the person likes or dislikes each entire genre. In this way you reduce your data from a vector of size 100 to a vector of size 5 (position i is 1 if the person likes genre i).
The vector of length 5 can be thought of as a good representative of the vector of length 100, because most people might like movies only in their preferred genres.
However, it's not going to be an exact representative, because there might be cases where a person hates all movies of a genre except one.
The point is that the reduced vector conveys most of the information in the larger one while consuming a lot less space and being faster to compute with.
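A tiny version of that example, with a made-up movie-to-genre mapping and a simple "likes at least half of the genre" rule; both are illustrative assumptions, not the only sensible choices:

```python
import numpy as np

rng = np.random.default_rng(0)

likes = rng.integers(0, 2, size=(1000, 100))      # person x movie: 1 = likes, 0 = dislikes
genre_of_movie = rng.integers(0, 5, size=100)     # assumed known mapping: each movie belongs to one of 5 genres

# For each person and genre: 1 if they like at least half of that genre's movies, else 0.
genre_likes = np.column_stack([
    (likes[:, genre_of_movie == g].mean(axis=1) >= 0.5).astype(int)
    for g in range(5)
])
print(likes.shape, "->", genre_likes.shape)       # (1000, 100) -> (1000, 5)
```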
Your question is a little vague, but there's an interesting statistical technique that may be what you're thinking of, called Principal Component Analysis (PCA), which does something similar (and, incidentally, plotting the results from which was my first real-world programming task).
It's a neat but clever technique which is remarkably widely applicable. I applied it to similarities between protein amino acid sequences, but I've seen it used to analyse everything from relationships between bacteria to malt whisky.
Consider a graph of some attributes of a collection of things where one has two independent variables. To analyse the relationship between these, one obviously plots in two dimensions and might see a scatter of points. If you have three variables you can use a 3D graph, but after that one starts to run out of dimensions.
In PCA one might have dozens or even a hundred or more independent factors, all of which need to be plotted on perpendicular axes. Using PCA one does this, then analyses the resulting multidimensional graph to find the set of two or three axes within it which contain the largest amount of information. For example, the first principal coordinate will be a composite axis (i.e. at some angle through n-dimensional space) along which the plotted points carry the most information. The second axis is perpendicular to this (remember this is n-dimensional space, so there are a lot of perpendiculars) and contains the second largest amount of information, and so on.
Plotting the resulting graph in 2D or 3D will typically give you a visualization of the data which contains a significant amount of the information in the original dataset. It's usual to consider the technique valid when the representation captures around 70% of the original variation: enough to visualize relationships, with some confidence, that would otherwise not be apparent in the raw statistics. Note that the technique requires all factors to have the same weight, but it's an extremely widely applicable method that deserves to be more widely known and is available in most statistical packages (I did my work on an ICL 2700 in 1980, which is about as powerful as an iPhone).
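If you want to check that "around 70%" figure in practice, scikit-learn exposes the per-component variance directly; the iris data below is just a stand-in for your own matrix (and in practice you would often standardize the columns first so all factors get equal weight):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # stand-in for a many-variable data matrix

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("cumulative variance explained:", cumulative.round(3))

# Keep the smallest number of principal components that captures ~70% of the variance.
n_keep = int(np.searchsorted(cumulative, 0.70) + 1)
print("components needed for ~70%:", n_keep)
```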
http://en.wikipedia.org/wiki/Dimension_reduction
Maybe you have heard of PCA (principal component analysis), which is a dimension reduction algorithm.
Others include LDA, matrix factorization based methods, etc.
Here's a simple example. You have a lot of text files, and each file consists of some words. These files can be classified into two categories. You want to visualize each file as a point in 2D/3D space so that you can see the distribution clearly. So you need to do dimension reduction to transform a file containing a lot of words into only 2 or 3 dimensions.
The dimensionality of a measurement of something is the number of numbers required to describe it. So, for example, the number of numbers needed to describe the location of a point in space is 3 (x, y and z).
Now let's consider the location of a train along a long but winding track through the mountains. At first glance this may appear to be a 3-dimensional problem, requiring a longitude, latitude and height measurement to specify. But these 3 dimensions can be reduced to one if you just take the distance travelled along the track from the start instead.
If you were given the task of using a neural network or some statistical technique to predict how far a train could get on a certain quantity of fuel, it would be far easier to work with the 1-dimensional data than the 3-dimensional version.
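A small illustration of that reduction, using made-up track coordinates: the ordered 3-D positions collapse to a single "distance along the track" number per point.

```python
import numpy as np

# Made-up ordered track points as (x, y, z) in metres; in reality these would be
# projected longitude/latitude/height samples taken along the line.
track = np.array([[0, 0, 100], [400, 300, 120], [900, 300, 180], [1200, 700, 260]], dtype=float)

# 3-D positions -> 1-D coordinate: cumulative distance travelled along the track.
steps = np.linalg.norm(np.diff(track, axis=0), axis=1)
distance_along_track = np.concatenate([[0.0], np.cumsum(steps)])
print(distance_along_track)   # approximately [0, 500, 1004, 1510]
```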
It's a data mining technique. Its main benefit is that it allows you to produce a visual representation of many-dimensional data. The human brain is peerless at spotting and analyzing patterns in visual data, but can process a maximum of three dimensions (four if you use time, i.e. animated displays), so any data with more than 3 dimensions needs to somehow be compressed down to 3 (or 2, since plotting data in 3D can often be technically difficult).
BTW, a very simple form of dimensionality reduction is the use of color to represent an additional dimension, for example in heat maps.
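For instance, a heat map drawn with matplotlib uses colour to carry a third value over a two-dimensional grid (toy data below):

```python
import numpy as np
import matplotlib.pyplot as plt

# A value defined over a 2-D grid: colour carries the third dimension.
x, y = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
z = np.exp(-(x**2 + y**2))

plt.imshow(z, extent=(-3, 3, -3, 3), origin="lower", cmap="viridis")
plt.colorbar(label="z value (the 'extra' dimension)")
plt.show()
```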
Suppose you're building a database of information about a large collection of adult human beings. It's also going to be quite detailed, so we could say that the database is going to have a large number of dimensions.
As a matter of fact, each database record will actually include a measure of the person's IQ and shoe size. Now let's pretend that these two characteristics are quite highly correlated. Compared to IQs, shoe sizes are easy to measure, and we want to populate the database with useful data as quickly as possible. One thing we could do would be to forge ahead and record shoe sizes for new database records, postponing the task of collecting IQ data until later. We would still be able to estimate IQs using shoe sizes, because the two measures are correlated.
We would be using a very simple form of practical dimension reduction by leaving IQ out of records initially. Principal components analysis, various forms of factor analysis and other methods are extensions of this simple idea.
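A toy version of that shoe-size/IQ idea, with deliberately correlated, entirely made-up numbers: fit the relationship on the records where both were measured, then use it to estimate the missing IQs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Entirely made-up, deliberately correlated data for illustration only.
n = 500
shoe_size = rng.normal(42, 2, n)
iq = 100 + 4 * (shoe_size - 42) + rng.normal(0, 5, n)

# Fit IQ ~ shoe size on the records where both were measured ...
slope, intercept = np.polyfit(shoe_size[:400], iq[:400], deg=1)

# ... then estimate IQ for new records where only shoe size was recorded.
iq_estimate = intercept + slope * shoe_size[400:]
print("correlation:", np.corrcoef(shoe_size, iq)[0, 1].round(2))
print("first few estimates:", iq_estimate[:3].round(1))
```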
