Questionnaire-based dataset machine learning classification problem

I am attempting to classify a given dataset, in the form of a questionnaire, into sick and normal conditions. The responses to each question are numerical (1, 2, 3, ...), and the distribution of answers from participants is diverse. This can result in class imbalance within each question, and one possible way to handle it is resampling, although that may lead to an unequal number of samples per question. How should I handle this issue?
EX)
CATEGORY 1   SICK   NORMAL
  Q1         3555     3433
  Q2           32       56
  Q3            1        8
  Q4           30       30
Nothing so far; I haven't found a solution yet.
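Not having seen the actual data, a minimal sketch of per-question oversampling (the dataframe, column names, and use of sklearn.utils.resample below are illustrative assumptions, not from the original post):

import pandas as pd
from sklearn.utils import resample

# Hypothetical long format: one row per participant response, with a
# 'question' column and a binary 'condition' label (sick/normal).
df = pd.DataFrame({
    "question":  ["Q2"] * 88,
    "response":  [1, 2, 3, 4] * 22,
    "condition": ["sick"] * 32 + ["normal"] * 56,
})

balanced_parts = []
for question, group in df.groupby("question"):
    majority_size = group["condition"].value_counts().max()
    # Oversample each class up to the majority size within this
    # question, so every question is internally balanced.
    for _, sub in group.groupby("condition"):
        balanced_parts.append(
            resample(sub, replace=True, n_samples=majority_size, random_state=0)
        )

balanced_df = pd.concat(balanced_parts, ignore_index=True)
print(balanced_df.groupby(["question", "condition"]).size())

Note that questions will still differ in total count after this; whether that matters depends on whether each question is modelled separately or all questions are pooled into one feature vector per participant.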

Related

Ideas for model selection for predicting sales at locations based on time component and class column

I am trying to build a model for sales prediction across three different storages based on previous sales. However, there is an extra (and very important) component: a column with the values A and B. These letters indicate a price category, where A signifies a comparatively cheaper price relative to similar products. Here is a mock example of the table:
week  Letter  Storage1 sales  Storage2 sales  Storage3 sales
1     A       50              28              34
2     A       47              29              19
3     B       13              11              19
4     B       14              19              8
5     B       21              13              3
6     A       39              25              23
I have previously worked with both types of prediction problems separately, namely time series analysis and regression, using both classical methods and machine learning, but I have not built a model that takes both prediction types into account.
I am writing this to hear suggestions on how to tackle such a prediction problem. I am thinking of converting the three storage sales columns into one, in order to have a single feature column, with three one-hot encoded columns to indicate the storage, as sketched below. However, I am not sure how to tackle this with a machine learning approach and would like to hear if anyone knows where to start with such a prediction problem.
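The reshaping described above might look like this (a sketch; df mirrors the mock table, and all column names are assumptions):

import pandas as pd

df = pd.DataFrame({
    "week":   [1, 2, 3, 4, 5, 6],
    "Letter": ["A", "A", "B", "B", "B", "A"],
    "Storage1 sales": [50, 47, 13, 14, 21, 39],
    "Storage2 sales": [28, 29, 11, 19, 13, 25],
    "Storage3 sales": [34, 19, 19, 8, 3, 23],
})

# Stack the three storage columns into a single 'sales' target column...
long_df = df.melt(
    id_vars=["week", "Letter"],
    value_vars=["Storage1 sales", "Storage2 sales", "Storage3 sales"],
    var_name="storage",
    value_name="sales",
)

# ...then one-hot encode the storage and price-category indicators.
long_df = pd.get_dummies(long_df, columns=["storage", "Letter"])
print(long_df.head())

From there, lagged values of sales (e.g. sales at weeks t-1, t-2, ...) added as feature columns are one common way to give an ordinary regression model access to the time component.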

Machine Learning Training Perspective [closed]

I just need some guidance. I'm seeing a lot of directions to go and I want to see which would be my best avenue. So essentially I have a pandas dataframe of groups similar to this (groups are in 4's):
      Name                 Role     XP    Acumen
0     Johnny Tsunami       Driver   1000  39
1     Michael B. Jackson   Pistol   2500  46
2     Bobby Zuko           Pistol   3000  50
3     Greg Ritcher         Lookout  200   25
4     Johnny Tsunami       Driver   1000  39
5     Michael B. Jackson   Pistol   2500  46
6     Bobby Zuko           Pistol   3000  50
7     Appa Derren          Lookout  250   30
8     Baby Hitsuo          Driver   950   35
9     Michael B. Jackson   Pistol   2500  46
10    Bobby Zuko           Pistol   3000  50
11    Appa Derren          Lookout  250   30
So basically I want to train a model to pick similar groups based on the dataframe above. The end goal is to give it a massive dataset and have it pick out rows to create groups similar to the ones above, perhaps refined so that it picks out rows with similar values.
What's the best route to take? Supervised or unsupervised? Linear models? k-means clustering? Where do I need to point my research, and what are the best steps to take?
The first step I would take is to decide how you want to calculate similarity in the above-mentioned data, which looks fairly categorical. The most basic approach would be to run a clustering/classification algorithm (mostly unsupervised in your case). Personally, I find even k-means runs fairly quickly and accurately if you have no idea how to proceed (DBSCAN is my favourite). I would also do an exploratory analysis (Self-Organizing Maps/Kohonen Maps may be useful in your case) to understand how the data is distributed.
You want to create groups and compare the groups to one another after your clustering/classification, right? You will also need to come up with a similarity metric, such as the KL divergence, to compare them.
The main issue is coming up with a 'k' that will cluster your data well; you will need to try out different values, and your intuition will play an important role!
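A minimal sketch of that first pass (assuming a dataframe like the one in the question; the encoding and parameter choices are illustrative):

import pandas as pd
from sklearn.cluster import DBSCAN, KMeans
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "Name": ["Johnny Tsunami", "Michael B. Jackson", "Bobby Zuko", "Greg Ritcher"],
    "Role": ["Driver", "Pistol", "Pistol", "Lookout"],
    "XP":   [1000, 2500, 3000, 200],
    "Acumen": [39, 46, 50, 25],
})

# One-hot encode the categorical Role column, scale the numeric ones.
pre = ColumnTransformer([
    ("role", OneHotEncoder(), ["Role"]),
    ("num", StandardScaler(), ["XP", "Acumen"]),
])
X = pre.fit_transform(df)

# k-means needs k chosen up front; DBSCAN instead needs eps/min_samples tuned.
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
print(DBSCAN(eps=1.5, min_samples=2).fit_predict(X))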
Links:
SOM: https://www.ncbi.nlm.nih.gov/pubmed/16566459
DBSCAN: https://scikit-learn.org/stable/modules/clustering.html#dbscan
KL Divergence/ Cross-Entropy Loss: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html

Unbalanced model, confused as to what steps to take

This is my first data mining project. I am using SAS Enterprise miner to train and test a classifier.
I have 3 files at my disposal:
Training file: 85 input variables and 1 target variable, with 5800+ observations
Prediction file: 85 input variables with 4000 observations
Verification file: 1 variable containing the correct predictions for the second file. Since this is an academic project, this file is here to tell us whether we are doing a good job or not.
My problem is that the dataset is unbalanced (95% 0s and 5% 1s for the target variable in the training file). So naturally, I tried to re-sample the data using the "sampling node" as described in the following link.
Here are the 2 approaches I used; they give slightly different results, but this is the generally unsatisfactory result I am getting:
Without resampling: the model predicts fewer than ten solicited individuals (target variable = 1) out of 4000 observations.
With resampling: the model predicts about 1500 solicited individuals out of 4000 observations.
I am looking for 100 to 200 solicited individuals for the model to be considered acceptable.
Why do you think our predictions are so far off, and how can we remedy this situation?
Here is a screenshot of both models.
There are some techniques to deal with unbalanced data. One that I remember from many years ago is this approach:
Say you have 100 solicited (minority) observations, making up 5% of all your observations.
Cluster the other, non-solicited (majority) class into 20 groups (each with roughly 100 non-solicited observations) using a clustering algorithm such as k-means, mean-shift, or DBSCAN.
Then, for each clustered group of majority observations, create a dataset containing all 100 solicited (minority) observations. That gives you 20 datasets, each balanced with 100 solicited and about 100 non-solicited observations.
Train on each balanced group, creating one model per group.
At prediction time, run all 20 models and vote; for example, if 15 out of 20 models say an observation is solicited, it is solicited (see the sketch below).
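A rough sketch of that scheme (the use of k-means and a logistic regression base model here are illustrative assumptions; X and y are NumPy arrays):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def train_cluster_ensemble(X, y, n_groups=20, random_state=0):
    X_min, X_maj = X[y == 1], X[y == 0]
    # Split the majority class into n_groups clusters...
    labels = KMeans(n_clusters=n_groups, n_init=10,
                    random_state=random_state).fit_predict(X_maj)
    models = []
    for g in range(n_groups):
        # ...and pair each cluster with ALL minority samples, giving
        # n_groups roughly balanced training sets, one model per set.
        X_g = np.vstack([X_maj[labels == g], X_min])
        y_g = np.concatenate([np.zeros((labels == g).sum()), np.ones(len(X_min))])
        models.append(LogisticRegression(max_iter=1000).fit(X_g, y_g))
    return models

def predict_by_vote(models, X, min_votes=15):
    # Majority vote: e.g. flag as solicited if 15 of 20 models agree.
    votes = sum(model.predict(X) for model in models)
    return (votes >= min_votes).astype(int)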

LibSVM - Multi class classification with unbalanced data

I tried to play with libsvm and 3D descriptors in order to perform object recognition. So far I have 7 categories of objects, and for each category I have its number of objects (and its percentage):
Category 1. 492 (14%)
Category 2. 574 (16%)
Category 3. 738 (21%)
Category 4. 164 (5%)
Category 5. 369 (10%)
Category 6. 123 (3%)
Category7. 1025 (30%)
So I have in total 3585 objects.
I have followed the practical guide of libsvm.
Here, as a reminder:
A. Scaling the training and the testing
B. Cross validation
C. Training
D. Testing
I separated my data into training and testing.
By doing 5-fold cross-validation, I was able to determine good values for C and gamma.
However, I obtained poor results (cross-validation accuracy is about 30-40% and my test accuracy is about 50%).
Then I thought about my data and saw that some of it is unbalanced (categories 4 and 6, for example). I discovered that libSVM has an option for class weights, which is why I would now like to set appropriate weights.
So far I'm doing this :
svm-train -c cValue -g gValue -w1 1 -w2 1 -w3 1 -w4 2 -w5 1 -w6 2 -w7 1
However, the results are the same. I'm sure this is not the right way to do it, which is why I'm asking for help.
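For what it's worth, a common heuristic (an assumption, not from the libsvm guide) is to set each weight inversely proportional to its class frequency rather than hand-picking 1s and 2s; a small sketch using the counts above:

# weight_i = total / (n_classes * count_i); the file names are placeholders
counts = {1: 492, 2: 574, 3: 738, 4: 164, 5: 369, 6: 123, 7: 1025}
total, k = sum(counts.values()), len(counts)
flags = " ".join(f"-w{c} {total / (k * n):.2f}" for c, n in sorted(counts.items()))
print(f"svm-train -c cValue -g gValue {flags} data.train data.model")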
I saw some topics on the subject but they were related to binary classification and not multiclass classification.
I know that libSVM does "one against one" (so binary classifiers), but I don't know how to handle the weights when I have multiple classes.
Could you please help me ?
Thank you in advance for your help.
I've met the same problem before. I also tried giving the classes different weights, which didn't work.
I recommend training with a subset of the dataset.
Try to use approximately equal numbers of samples from the different classes. You can use all of the category 4 and 6 samples, and then pick about 150 samples from each of the other categories, as in the sketch below.
I used this method and the accuracy did improve. Hope this helps!
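A sketch of that subsampling in Python (pandas-based; the dataframe and column name are assumptions):

import pandas as pd

def balanced_subset(df, label_col="category", cap=150, random_state=0):
    # Keep every sample of the small classes (e.g. categories 4 and 6)
    # and draw at most `cap` samples from each larger class.
    parts = [
        group if len(group) <= cap else group.sample(cap, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts, ignore_index=True)

With cap=150, category 6 (123 samples) stays whole and each large category contributes a comparable share; a cap of about 164 would also keep all of category 4.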

What is wrong with 20newsgroup-18828 dataset?

I am currently using the 20NewsGroup-18828 dataset in Weka. I have selected a subset of documents with 100 per category (2000 documents in total), which I divided into a 70% (training) / 30% (testing) split. When I tried classification with naive Bayes, SVM, and k-NN, the accuracy was very low. Here is the list of operations I am performing on the dataset:
StringToWordVector (indexing and term weighting with TF-IDF, SMART stopword list, Snowball stemmer)
Dimensionality reduction with feature selection (InformationGain)
Dimensionality reduction with feature transformation (Random Projection)
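For readers outside Weka, a roughly equivalent pipeline in scikit-learn terms (the estimators and parameters below are analogies, not the Weka filters themselves; stemming is omitted since scikit-learn has no built-in stemmer):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.random_projection import SparseRandomProjection
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    # StringToWordVector analogue: TF-IDF weighting with stopword removal
    ("tfidf", TfidfVectorizer(stop_words="english")),
    # InformationGain analogue: mutual-information feature selection
    ("select", SelectKBest(mutual_info_classif, k=1000)),
    # Random Projection step
    ("project", SparseRandomProjection(n_components=300, random_state=0)),
    ("clf", LinearSVC()),
])
# pipeline.fit(train_texts, train_labels); pipeline.score(test_texts, test_labels)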
When I use the original dataset with 20,000 docs it performs well, but it has duplications, e.g. some documents are classified in multiple categories.
Has anyone used this dataset, or can someone tell me what I am doing wrong?
Regarding differences between datasets
The main difference between 20newsgroup (the original dataset, "o") and 20newsgroup-18828 (the modified one, "m") is:
o contains duplicates; m does not
o contains a trivial problem, as it includes the newsgroup identification header; m includes only the From and Subject headers (so it is still an easy version of the problem, but harder than o); for example:
FILE 51126 regarding atheism
in original form:
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!noc.near.net!news.centerline.com!uunet!olivea!sgigate!sgiblab!adagio.panasonic.com!nntp-server.caltech.edu!keith
From: keith#cco.caltech.edu (Keith Allan Schneider)
Newsgroups: alt.atheism
Subject: Re: >>>>>>Pompous ass
Message-ID: <1pi9btINNqa5#gap.caltech.edu>
Date: 2 Apr 93 20:57:33 GMT
References: <1ou4koINNe67#gap.caltech.edu> <1p72bkINNjt7#gap.caltech.edu> <93089.050046MVS104#psuvm.psu.edu> <1pa6ntINNs5d#gap.caltech.edu> <1993Mar30.210423.1302#bmerh85.bnr.ca> <1pcnqjINNpon#gap.caltech.edu>
Organization: California Institute of Technology, Pasadena
Lines: 9
NNTP-Posting-Host: punisher.caltech.edu
kmr4#po.CWRU.edu (Keith M. Ryan) writes:
>>Then why do people keep asking the same questions over and over?
>Because you rarely ever answer them.
Nope, I've answered each question posed, and most were answered
multiple times.
keith
In modified form (-18828 version)
From: keith#cco.caltech.edu (Keith Allan Schneider)
Subject: Re: >>>>>>Pompous ass
kmr4#po.CWRU.edu (Keith M. Ryan) writes:
>>Then why do people keep asking the same questions over and over?
>Because you rarely ever answer them.
Nope, I've answered each question posed, and most were answered
multiple times.
keith
As you can see, the original data is so simple that you can actually find the name of the label inside the file... this is why you will always get good scores on such data, even if your whole processing concept is very, very wrong.
So the question is not "what is wrong with 20newsgroup-18828?" but rather "what is wrong with the original dataset?"
General ideas
First, why would you assume that anything is wrong? You are performing fairly arbitrary data representation processing (two different dimensionality reduction steps) on a very small dataset (70 training vectors per class). There is nothing wrong with this data; it is simple NLP data which, like most NLP tasks, requires large amounts of data, and "naive" (not NLP-based) dimensionality reduction techniques come with no guarantee of actually helping.
Second, even when you do something wrong, in 90% of cases (an arbitrarily high number) the error lies between what the user thinks they are doing and what they are actually doing. So describing what you do won't lead to any help; you have to show exactly what you do (by giving a reproducible example).
