Dataset Sample
Can I use any algorithm to train on the above dataset?
Each row (Id) has a dependent variable (Status), but each "Id" has multiple rows, one per set of feature values.
You can think of it as "each Id has multiple transactions and all of its transactions share a common Status".
Will machine learning find patterns in these transactions?
Is there any other approach to solving this type of problem?
Just fill your ID column with the value from the row above, and do the same for the Status column. This will lead to:
df
ID Feature1 Feature2 Feature3 Status
8079 100 Asia High Approved
8079 200 Africa Low Approved
When you run a classification algorithm, you can use ID, Feature1, Feature2, and Feature3 as features and Status as the target. A classifier will learn from this, and everything is completely the same as before.
The features are still independent. You would only have dependent features if the variables somehow depended on each other; in your case, the ID 8079 does not determine Feature2: Africa. They are independent.
You can fill your cells with:
import numpy as np
# replace empty strings with NaN, then forward-fill from the row above
df[df == ''] = np.nan
df = df.ffill()
Based on your comments, the approach can be slightly different: you need to pivot your entries into new features (see Python pandas convert rows to columns where multiple columns exist):
The dataframe then should look like:
ID Feature1 Feature2 Feature3 Feature1a .... Feature3z Status
8079 100 Asia High 200 ... Approved
You can either assume that each row is independent and ignore the ID column, or, if every ID has the same number of rows (say 3), you could extend the dataset with more features, as sketched below.
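A minimal sketch of that reshaping with pandas, assuming the column names from the example above (the txn helper column is mine):
import pandas as pd

# toy frame matching the example above: two transactions for one ID
df = pd.DataFrame({
    'ID': [8079, 8079],
    'Feature1': [100, 200],
    'Feature2': ['Asia', 'Africa'],
    'Feature3': ['High', 'Low'],
    'Status': ['Approved', 'Approved'],
})

# number the transactions within each ID, then pivot to one row per ID
df['txn'] = df.groupby('ID').cumcount()
wide = df.pivot(index='ID', columns='txn',
                values=['Feature1', 'Feature2', 'Feature3'])
wide.columns = [f'{name}_{i}' for name, i in wide.columns]

# all transactions of an ID share the same Status, so take the first
wide['Status'] = df.groupby('ID')['Status'].first()
print(wide.reset_index())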
I am trying to build a model that can predict insurance names based on the insurance ID.
Before putting the question to this forum, I tried KNN and Decision Tree, but the accuracy did not exceed 60%.
In my data frame, I have one column as a feature and the other as a label.
I can also extract other features from this data, like whether the ID is numeric, its length, etc.
I have 2.8M rows of data in this shape:
insurance_id     insurance_name
XOH830990804     Medicare
XOH01179276      Medicare
H55575577        Medicare
H71096147        WELLMED
IBPW01981926     BCBS
MT25110S         Aetna
WXQQ07123        Aetna
6WU7NSSGY63      Oxford
MX7ZZ35T         Oxford
DU00079Z         Welcare
PB95800M         UHC
Please guide me on which approach or model can help me achieve an accuracy of more than 80%.
You can try to diversify your inputs.
As an example, you can pass additional features to the network, such as:
Length of the insurance_id
Quantity of numbers in the insurance_id
Quantity of letters in the insurance_id
Sum of all numbers in the insurance_id
And any other transform you might think of.
As the output layer of your network, you might want to use Dense(n_of_different_insurance_names, activation='softmax') and a categorical_crossentropy loss function when compiling the model.
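A minimal sketch of that setup in Keras (assuming TensorFlow is available); the hand-crafted features follow the list above, and the tiny toy dataframe with insurance_id/insurance_name columns mirrors the question, not your real data:
import pandas as pd
from tensorflow import keras

# toy data standing in for the 2.8M-row frame
df = pd.DataFrame({'insurance_id': ['XOH830990804', 'H71096147', 'MT25110S'],
                   'insurance_name': ['Medicare', 'WELLMED', 'Aetna']})

# hand-crafted features derived from the ID string
feats = pd.DataFrame({
    'length':    df['insurance_id'].str.len(),
    'n_digits':  df['insurance_id'].str.count(r'\d'),
    'n_letters': df['insurance_id'].str.count(r'[A-Za-z]'),
    'digit_sum': df['insurance_id'].apply(
        lambda s: sum(int(c) for c in s if c.isdigit())),
})

# one class per distinct insurance name
labels, names = pd.factorize(df['insurance_name'])
y = keras.utils.to_categorical(labels, num_classes=len(names))

model = keras.Sequential([
    keras.layers.Input(shape=(feats.shape[1],)),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(len(names), activation='softmax'),  # one unit per name
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(feats.to_numpy(dtype='float32'), y, epochs=5, verbose=0)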
I have two text datasets. Each dataset consists of multiple sequences and each sequence can have more than one sentence.
How do I measure if both datasets are from same distribution?
The purpose is to attempt transfer learning from one distribution to the other only if the difference between the distributions is statistically significant.
I am planning to use a chi-square test, but I am not sure whether it will help for text data, considering the high degrees of freedom.
Update:
Example:
Suppose I want to train a sentiment classification model. I train a model on the IMDb dataset and evaluate it on the IMDb and Yelp datasets. I found that my model trained on IMDb still does well on Yelp. But the question is: how different are these datasets?
Train Dataset : https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format?select=Train.csv
Eval 1: https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format?select=Valid.csv
Eval 2: https://www.kaggle.com/omkarsabnis/sentiment-analysis-on-the-yelp-reviews-dataset
Now,
How different are train and eval 1?
How different are train and eval 2?
Is the dissimilarity between train and eval 2 due to chance? What is the statistical significance and p-value?
The question "are text A and text B coming from the same distribution?" is somehow poorly defined. For example, these two questions (1,2) can be viewed as generated from the same distribution (distribution of all questions on StackExchange) or from different distributions (distribution of two different subdomains of StackExchange). So it's not clear what is the property that you want to test.
Anyway, you can come up with any test statistic of your choice, approximate its distribution in case of "single source" by simulation, and calculate the p-value of your test.
As a toy example, let's take two small corpora: two random articles from the English Wikipedia. I'll do it in Python:
import requests
from bs4 import BeautifulSoup
urls = [
'https://en.wikipedia.org/wiki/Nanjing_(Liao_dynasty)',
'https://en.wikipedia.org/wiki/United_States_Passport_Card'
]
texts = [BeautifulSoup(requests.get(u).text, 'html.parser')
         .find('div', {'class': 'mw-parser-output'}).text for u in urls]
Now I use a primitive tokenizer to count individual words in texts, and use root mean squared difference in word relative frequencies as my test statistic. You can use any other statistic, as long as you calculate it consistently.
import re
from collections import Counter
TOKEN = re.compile(r'([^\W\d]+|\d+|[^\w\s])')
counters = [Counter(re.findall(TOKEN, t)) for t in texts]
print([sum(c.values()) for c in counters])
# [5068, 4053]: texts are of approximately the same size
def word_freq_rmse(c1, c2):
    result = 0
    vocab = set(c1.keys()).union(set(c2.keys()))
    n1, n2 = sum(c1.values()), sum(c2.values())
    n = len(vocab)
    for word in vocab:
        result += (c1[word]/n1 - c2[word]/n2)**2 / n
    return result**0.5
print(word_freq_rmse(*counters))
# rmse is 0.001178, but is this a small or large difference?
I get a value of 0.001178, but I don't know whether this is a large difference. So I need to simulate the distribution of this test statistic under the null hypothesis that both texts come from the same distribution. To simulate it, I merge the two texts into one, then split them randomly, and calculate my statistic when comparing these two random parts.
import random
tokens = [tok for t in texts for tok in re.findall(TOKEN, t)]
split = sum(counters[0].values())
distribution = []
for i in range(1000):
    random.shuffle(tokens)
    c1 = Counter(tokens[:split])
    c2 = Counter(tokens[split:])
    distribution.append(word_freq_rmse(c1, c2))
Now I can see how unusual the value of my observed test statistic is under the null hypothesis:
observed = word_freq_rmse(*counters)
p_value = sum(x >= observed for x in distribution) / len(distribution)
print(p_value) # it is 0.0
print(observed, max(distribution), sum(distribution) / len(distribution)) # 0.0011 0.0006 0.0004
We see that when the texts are from the same distribution, my test statistic averages 0.0004 and almost never exceeds 0.0006, so the value of 0.0011 is very unusual, and the null hypothesis that my two texts originate from the same distribution should be rejected.
I wrote an article that is similar to your problem, though not exactly the same:
https://towardsdatascience.com/a-new-way-to-bow-analysis-feature-engineering-part1-e012eba90ef
The problem that I was trying to solve is to check if a word has different (significant) distributions across categories or labels.
There are a few similarities between your problem and the one I had mentioned above.
You want to compare two sources of datasets, which can be taken as two different categories
Also, to compare the data sources, you will have to compare the words, as sentences can't be directly compared
So, my proposed solution would be as follows (a rough code sketch follows these steps):
Create word features across the two datasets using a count vectorizer and get the top X words from each
Let's say you have N total distinct words. Now initialize count = 0 and start comparing the distribution of each word; if the difference is significant, increment the counter. Also, there could be cases where a word exists in only one of the datasets, and that is good news: it shows a distinguishing feature, so increment the count for this as well
Let's say the total count is n. The lower the n/N ratio, the more similar the two texts are, and vice versa
Also, to verify this methodology: split the data from a single source into two (by random sampling) and run the above analysis. If the n/N ratio is close to 0, it indicates that the two data sources are similar, which indeed is the case.
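A rough sketch of that word-by-word comparison; the choice of a two-proportion z-test (via statsmodels) and the toy corpora are my assumptions, not prescribed by the steps above:
from sklearn.feature_extraction.text import CountVectorizer
from statsmodels.stats.proportion import proportions_ztest

# toy corpora standing in for the two datasets
corpus_a = ["the movie was great", "great acting and plot"]
corpus_b = ["the food was awful", "terrible service and food"]

vec = CountVectorizer(max_features=1000)          # keep the top X words
counts = vec.fit_transform(corpus_a + corpus_b).toarray()
a = counts[:len(corpus_a)].sum(axis=0)            # word counts in dataset A
b = counts[len(corpus_a):].sum(axis=0)            # word counts in dataset B
na, nb = a.sum(), b.sum()

n_sig = 0                                         # this is the "n" above
for wa, wb in zip(a, b):
    if wa == 0 or wb == 0:                        # word unique to one dataset
        n_sig += 1
        continue
    _, p = proportions_ztest([wa, wb], [na, nb])  # compare relative frequencies
    if p < 0.05:
        n_sig += 1

print(n_sig / counts.shape[1])                    # low n/N => similar datasets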
Please let me know if this approach worked or not, also if you think there are any flaws in this, I would love to think and try evolving it.
I am trying to work on a classification problem. The data consists of reviews of a particular product category from an e-commerce platform. Please find below the description of each attribute:
id: Unique identifier for each tuple.
category: The reviews have been categorized into two categories representing positive and negative reviews. 0 represents positive reviews and 1 represents negative reviews.
text: Tokenized text content of the review.
The sample dataset is attached in the picture.
I am thinking of trying TF-IDF; however, given the text format, I don't know how to apply it.
I expect to predict the category based on the text column provided.
You can use the column text as several features. I would recommend you split that column (see How do I split a string into several columns in a dataframe with pandas Python?):
# first load the dataframe (I assume it is in Excel format)
import pandas as pd
df = pd.read_excel('YourPath', header=0)
# split the tokenized text on whitespace (adjust the separator to your tokenization)
df['text'].str.split(' ', expand=True)
Then you can convert it to a (0, 1) dataframe:
df1 = (pd.get_dummies(df.set_index(['id', 'category']).stack())
         .groupby(level=0).max()  # equivalent to the deprecated .max(level=0)
         .rename(columns=int)
         .reset_index())
This will lead to something like:
id category 5002 7400 ....
1 A 1 0 .....
2 B 0 1
where the columns are the values from your dataframe, and a cell is filled with 1 only if that value occurs in the corresponding row.
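Since you mention TF-IDF, here is a minimal sketch of that route with scikit-learn, assuming the same df with id, category, and text columns as above; TfidfVectorizer does the tokenization itself:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# TF-IDF features from the raw text, category as the target
X = TfidfVectorizer().fit_transform(df['text'].astype(str))
y = df['category']

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # accuracy on the held-out reviews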
Hi, I was following the Machine Learning course by Andrew Ng.
I found that in regression problems, especially logistic regression, they used integer values for the features, which could be plotted in a graph. But there are many use cases where the feature values may not be integers.
Let's consider the following example:
I want to build a model to predict whether a particular person will take a leave today or not. From my historical data, I may find the following features helpful for building the training set:
name of the person, day of the week, number of leaves left for them so far (which may be a continuously decreasing variable), etc.
So here are my questions based on the above:
How do I go about designing the training set for my logistic regression model?
In my training set, I find some variables are continuously decreasing (e.g. the number of leaves left). Would that create any problem? I know continuously increasing or decreasing variables are used in linear regression. Is that true?
Any help is really appreciated. Thanks !
Well, there is a lot of missing information in your question; for example, it would be much clearer if you had provided all the features you have, but let me dare to make some assumptions!
ML modeling for classification always requires dealing with numerical inputs, and you can easily encode each unique input as an integer, especially the classes!
Now let me try to answer your questions:
How do I go about designing the training set for my logistic regression model?
The way I see it, you have two options (not necessarily both practical; you should decide according to the dataset you have and the problem). Either you predict the probability, for all employees in the company, of being off on a certain day according to the historical data you have (i.e. previous observations); in this case, each employee represents a class (an integer from 0 up to the number of employees you want to include). Or you create a model for each employee; in this case the classes are either off (i.e. leave) or on (i.e. present).
Example 1
I created a dataset example of 70 cases and 4 employees, which looks like this:
Here each name is associated with the day and month they took off, along with how many annual leaves were left for them!
The implementation (using Scikit-Learn) would be something like this (N.B. the date contains only day and month):
Now we can do something like this:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
# read dataset example
df = pd.read_csv('leaves_dataset.csv')
# assign unique integer to every employee (i.e. a class label)
mapping = {'Jack': 0, 'Oliver': 1, 'Ruby': 2, 'Emily': 3}
df.replace(mapping, inplace=True)
y = np.array(df[['Name']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])
# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0, solver='liblinear')  # liblinear supports both l1 and l2
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)
# Example: probability of all employees who have 10 days left today
# warning: date must be same format
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Jack': prob[0,0], 'Oliver': prob[0,1], 'Ruby': prob[0,2], 'Emily': prob[0,3]})
Result
{'Ruby': 0.27545, 'Oliver': 0.15032,
'Emily': 0.28201, 'Jack': 0.29219}
N.B.
To make this work reasonably well, you need a really big dataset!
Also, this can be better than the second option if there are other informative features in the dataset (e.g. the health status of the employee on that day, etc.).
The second option is to create a model for each employee; here the result would be more accurate and more reliable, however, it's almost a nightmare if you have too many employees!
For each employee, you collect all their leaves from the past years and concatenate them into one file. In this case you have to cover all days of the year; in other words: every day that the employee never took off should be labeled as on (or numerically speaking, 1), and the days off should be labeled as off (or numerically speaking, 0).
Obviously, in this case, the classes will be 0 and 1 (i.e. off and on) for each employee's model!
For example, consider this dataset example for the particular employee Jack:
Example 2
Then you can do for example:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
# read dataset example
df = pd.read_csv('leaves_dataset2.csv')
# assign unique integer to every on and off (i.e. a class label)
mapping = {'off': 0, 'on': 1}
df.replace(mapping, inplace=True)
y = np.array(df[['Type']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])
# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0, solver='liblinear')  # liblinear supports both l1 and l2
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)
# Example: probability of the employee "Jack" who has 10 days left today
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Off': prob[0,0], 'On': prob[0,1]})
Result
{'On': 0.33348, 'Off': 0.66651}
N.B. In this case you have to create a dataset for each employee + train a separate model + fill in all the days never taken off in the past years as on!
In my training set, I find some variables are continuously decreasing (e.g. the number of leaves left). Would that create any problem? Because I know continuously increasing or decreasing variables are used in linear regression. Is that true?
Well, there is nothing preventing you from using continuous values as features (e.g. the number of leaves) in logistic regression; actually, it doesn't make any difference whether they're used in linear or logistic regression, but I believe you got confused between the features and the response:
The thing is, discrete values should be used as the response of logistic regression, and continuous values as the response of linear regression (a.k.a. the dependent variable, or y).
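A toy illustration of that distinction (the numbers are made up): the feature can be continuous in both models; only the type of the response changes:
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[25.0], [14.5], [9.0], [3.5]])   # continuous feature: leaves left
y_discrete = np.array([1, 1, 0, 0])            # on/off classes -> logistic regression
y_continuous = np.array([0.9, 0.8, 0.3, 0.1])  # continuous response -> linear regression

LogisticRegression().fit(X, y_discrete)
LinearRegression().fit(X, y_continuous)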
I have around 2-3 million products. Each product follows this structure:
{
  "sku": "Unique ID of product (string of 20 chars)",
  "title": "Title of product, e.g. Oneplus 5 - 6GB + 64GB",
  "brand": "Brand of product, e.g. OnePlus",
  "cat1": "First category of product, e.g. Phone",
  "cat2": "Second category of product, e.g. Mobile Phones",
  "cat3": "Third category of product, e.g. Smart Phones",
  "price": 500.00,
  "shortDescription": "Short description about the product (around 8-10 lines)",
  "longDescription": "Long description about the product (around 50-60 lines)"
}
The problem statement is:
Find similar products based on content or product data only. So when an e-commerce user clicks on a product (SKU), I will show similar products to that SKU in the recommendations.
For example, if the user clicks on apple iphone 6s silver, I will show these products in "Similar Products Recommendation":
1) apple iphone 6s gold or other color
2) apple iphone 6s plus options
3) apple iphone 6s options with other configurations
4) other apple iphones
5) other smart-phones in that price range
What I have tried so far
A) I have tried to use 'user view events' to recommend similar products, but we do not have good enough data. It gives fine results, but only for a few products. So this template is not suitable for my use case.
B) One-hot encoder + Singular Value Decomposition (SVD) + cosine similarity
I have trained my model for around 250 thousand products with dimension = 500, with a modification of this PredictionIO template. It is giving good results. I have not included the long description of the product in the training.
But I have some questions here:
1) Is using a one-hot encoder and SVD the right approach in my use case?
2) Is there any way or trick to give extra weight to the title and brand attributes in the training?
3) Do you think it is scalable? I am trying to increase the product size to 1 million and dimension = 800-1000, but it is taking a lot of time and the system hangs/stalls or goes out of memory. (I am using Apache PredictionIO.)
4) What should my dimension value be when I want to train for 2 million products?
5) How much memory would I need to deploy the SVD-trained model to find in-memory cosine similarity for 2 million products?
What should I use in my use case so that I can also give some weight to my important attributes and get good results with reasonable resources? What would be the best machine learning algorithm to use in this case?
I will give some guidance on the questions:
"Right Approach" often doesn't exist in ML. The supreme arbiter is whether the result has the characteristics you need. Most important, is the accuracy what you need, and can you find a better method? We can't tell without having a significant subset of your data set.
Yes. Most training methods will adjust whatever factors improve the error (loss) function. If your chosen method (SVD or other) doesn't do this automatically, then alter the error function.
Yes, it's scalable. The basic inference process is linear on the data set size. You got poor results because you didn't scale up the hardware when you enlarged the data set; that's part of "scale up". You might also consider scaling out (more compute nodes).
Well, how should a dimension scale with the data base size? I believe that empirical evidence supports this being a log(n) relationship ... you'd want 600-700 dimension. However, you should determine this empirically.
That depends on how you use the results. From what you've described, all you'll need is a sorted list of N top matches, which requires only the references and the similarity (a simple float). That's trivial memory compared to the model size, a matter of N*8 bytes.
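As a rough sketch of what such a weighted one-hot + SVD + cosine-similarity pipeline could look like (the per-field weights and the toy products are my assumptions, not a definitive recipe):
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# toy catalog; in practice this would be the full product feed
products = [
    {'title': 'apple iphone 6s silver', 'brand': 'apple',   'cat3': 'Smart Phones'},
    {'title': 'apple iphone 6s gold',   'brand': 'apple',   'cat3': 'Smart Phones'},
    {'title': 'oneplus 5 64gb',         'brand': 'oneplus', 'cat3': 'Smart Phones'},
]

# one bag-of-words block per field, scaled by a per-field weight
# (scaling a block up makes its features dominate the cosine similarity)
weights = {'title': 3.0, 'brand': 2.0, 'cat3': 1.0}
blocks = []
for field, w in weights.items():
    vec = CountVectorizer()
    blocks.append(w * vec.fit_transform(p[field] for p in products))
X = hstack(blocks)

# reduce to a dense low-dimensional embedding, then compare by cosine
emb = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(cosine_similarity(emb[[0]], emb)[0])  # similarity of product 0 to all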