Clinical characterisation p-value calculation

I have 3 groups of cancer patients and I want to check whether clinical variables such as sex, treatment, or age have any effect on the results. Which statistical test would be suitable? Is it one-way ANOVA? And what would be the fastest way to calculate the p-values?
For example:
|parameter|level|group1|group2|group3|
|:--|:-|:----|:-----|:------|
|Sex|F|10|11|17|
|Sex|M|13|9|9|
|treatment|none|22|1|18|
|treatment|IT|0|1|3|
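
As a hedged sketch (no answer is given above, and the right test depends on the variable type): for categorical variables like sex and treatment compared across three groups, a chi-squared test of independence is a common choice, while one-way ANOVA fits a continuous variable like age. Assuming SciPy, the p-values for the counts above could be computed like this:

```python
# Hedged sketch: chi-squared tests of independence on the two contingency
# tables above, using SciPy. Counts are copied from the table.
import numpy as np
from scipy.stats import chi2_contingency

# Rows are the levels (F/M, none/IT); columns are group1..group3.
sex = np.array([[10, 11, 17],
                [13,  9,  9]])
treatment = np.array([[22, 1, 18],
                      [ 0, 1,  3]])

for name, table in [("Sex", sex), ("Treatment", treatment)]:
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"{name} vs. group: chi2={chi2:.3f}, dof={dof}, p={p:.4f}")

# Caveat: the treatment table has expected counts well below 5, so the
# chi-squared approximation is unreliable there; an exact test is safer.
```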

Related

ML/DL Train Model on Single Column Feature DataSet

I am trying to build a model that can predict insurance names based on the insurance id.
Before putting the question to this forum, I tried KNN and Decision Tree, but the accuracy does not exceed 60%.
In my data frame, I have one column as a feature and the other as a label.
I can also extract other features from this data, such as Is Numeric, length, etc.
I have 2.8M rows of data in this shape.
|insurance_id|insurance_name|
|:--|:--|
|XOH830990804|Medicare|
|XOH01179276|Medicare|
|H55575577|Medicare|
|H71096147|WELLMED|
|IBPW01981926|BCBS|
|MT25110S|Aetna|
|WXQQ07123|Aetna|
|6WU7NSSGY63|Oxford|
|MX7ZZ35T|Oxford|
|DU00079Z|Welcare|
|PB95800M|UHC|
Please guide me on which approach or model can help me to achieve an accuracy of more than 80%.
You can try to diversify your inputs.
As an example, you can pass additional features to the network, such as:

- Length of the insurance_id
- Quantity of numbers in the insurance_id
- Quantity of letters in the insurance_id
- Sum of all numbers in the insurance_id
- Any other transform you might think of

As the output layer of your network, you might want to use Dense(n_of_different_insurance_names, activation='softmax') and a categorical_crossentropy loss function when compiling the model.
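
A minimal sketch of this suggestion, assuming TensorFlow/Keras and scikit-learn; the toy DataFrame, hidden-layer size, and epoch count are placeholders, and sparse_categorical_crossentropy is used because the labels are integer-encoded (equivalent to categorical_crossentropy on one-hot labels):

```python
# Minimal sketch, assuming TensorFlow/Keras and scikit-learn.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tensorflow import keras
from tensorflow.keras import layers

df = pd.DataFrame({"insurance_id": ["XOH830990804", "H71096147"],
                   "insurance_name": ["Medicare", "WELLMED"]})  # toy rows

# Engineered features derived from the id string, as suggested above
X = pd.DataFrame({
    "length":    df["insurance_id"].str.len(),
    "n_digits":  df["insurance_id"].str.count(r"\d"),
    "n_letters": df["insurance_id"].str.count(r"[A-Za-z]"),
    "digit_sum": df["insurance_id"].map(
        lambda s: sum(int(c) for c in s if c.isdigit())),
}).to_numpy(dtype="float32")
y = LabelEncoder().fit_transform(df["insurance_name"])

model = keras.Sequential([
    layers.Input(shape=(X.shape[1],)),
    layers.Dense(64, activation="relu"),
    layers.Dense(len(np.unique(y)), activation="softmax"),  # one unit per name
])
# sparse_categorical_crossentropy because y holds integer class ids
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32)
```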

How to do classification based on the correlation of multiple features in a supervised scenario

I have 2 features, 'Contact_Last_Name' and 'Account_Last_Name', based on which I want to classify my data.
The logic is that if the 2 features are the same, i.e. Contact_Last_Name is the same as Account_Last_Name, then the result is 'Success'; otherwise it is 'Denied'.
So, for example: if Contact_Last_Name is 'Johnson' and Account_Last_Name is 'Eigen', the result is classified as 'Denied'. If both are equal, say 'Edison', then the result is 'Success'.
How can I have a classification algorithm for this set of data?
[Please note that usually we discard highly correlated columns, but here the correlation between the columns seems to carry the logic for classification.]
I have tried Decision Tree (C5.0) and Naive Bayes (naiveBayes) in R, but both fail to classify the dataset correctly.
First of all, this is not a good use case for machine learning, because it can be solved with a simple string match. But if you still want to give it to a classification algorithm, create a table with the columns 'Contact_Last_Name', 'Account_Last_Name', and 'Result', feed it to a decision tree, and predict the third column.
Note that you should partition your data into training and testing sets.
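
A hedged sketch of this answer in Python/scikit-learn (rather than the R packages mentioned), with a toy table standing in for the real data; the key point is to encode the string match itself as a feature:

```python
# Encode whether the two last names match as a feature, then let a
# decision tree learn the Success/Denied rule from it.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "Contact_Last_Name": ["Johnson", "Edison", "Smith", "Lee"],
    "Account_Last_Name": ["Eigen",   "Edison", "Smith", "Kim"],
})
df["Result"] = (df["Contact_Last_Name"] == df["Account_Last_Name"]).map(
    {True: "Success", False: "Denied"})

# Raw surnames would not generalise to unseen names; the match flag does.
X = (df["Contact_Last_Name"] == df["Account_Last_Name"]).astype(int).to_frame("names_match")
y = df["Result"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.score(X_test, y_test))
```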

Predict price range of houses

I have a dataset with several features of houses including type, location, the number of bedrooms, etc. For example:
Type: Apartment, Semi-detached House, Single-detached House
Location: (Lat, Lon) Pairs like (40.7128° N, 74.0059° W)
Number of Bedrooms: 1, 2, 3, 4 ...
The target variable I want to predict is the house price. However, the house price given in the original dataset is the intervals of prices instead of numeric values, for example:
House Price: [0,100000), [100000,150000), [150000,200000), [200000,250000), etc.
So my question is: what model should I use if I want to predict the range of the house price? Simple regression models seem not to work because we are predicting intervals instead of continuous numeric values.
Thanks in advance.
I would use the median of the price range and run a linear regression. In your case the labels would be {50000, 125000, 175000, 225000, ...}. After you get the predicted price just pick the range it falls into.
Alternatively, if the price ranges are fixed, you can use a one-vs-all logistic regression, although I am sure this is not the best approach.
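
A minimal sketch of the midpoint idea, assuming scikit-learn and NumPy; the toy features and bin edges are placeholders, and clamping of out-of-range predictions is omitted:

```python
# Replace each price interval with its midpoint, fit a regression,
# then map predictions back to the interval they fall in.
import numpy as np
from sklearn.linear_model import LinearRegression

bins = [0, 100_000, 150_000, 200_000, 250_000]             # interval edges
mids = [(a + b) / 2 for a, b in zip(bins[:-1], bins[1:])]  # 50000, 125000, ...

X = np.array([[1], [2], [3], [4]])     # e.g. number of bedrooms (toy data)
interval_idx = np.array([0, 1, 2, 3])  # which price interval each house is in
y = np.array(mids)[interval_idx]       # regression target = interval midpoint

reg = LinearRegression().fit(X, y)
pred = reg.predict(np.array([[2]]))[0]
pred_bin = np.digitize(pred, bins) - 1  # map the prediction back to an interval
print(f"predicted price {pred:.0f} -> range [{bins[pred_bin]}, {bins[pred_bin + 1]})")
```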

What is the difference between a feature and a label? [closed]

I'm following a tutorial about machine learning basics and there is mentioned that something can be a feature or a label.
From what I know, a feature is a property of the data being used. I can't figure out what the label is; I know the meaning of the word, but I want to know what it means in the context of machine learning.
Briefly, feature is input; label is output. This applies to both classification and regression problems.
A feature is one column of the data in your input set. For instance, if you're trying to predict the type of pet someone will choose, your input features might include age, home region, family income, etc. The label is the final choice, such as dog, fish, iguana, rock, etc.
Once you've trained your model, you will give it sets of new input containing those features; it will return the predicted "label" (pet type) for that person.
Feature:
In machine learning, a feature is a property of your training data; in other words, a column in your training dataset.
Suppose this is your training dataset
|Height|Sex|Age|
|:--|:--|:--|
|61.5|M|20|
|55.5|F|30|
|64.5|M|41|
|55.5|F|51|
|...|...|...|
Then here Height, Sex and Age are the features.
Label:
The output you get from your model after training is called a label.
Suppose you feed the above dataset to some algorithm and it generates a model to predict the gender as Male or Female. To this model you pass features like Age and Height.
After computing, it returns the gender, Male or Female. That is called a label.
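
As a tiny hedged illustration of the dataset above, assuming scikit-learn (the classifier choice is arbitrary): Height and Age are the feature columns, Sex is the label column.

```python
# Features in, label out: fit on the toy table, predict for a new person.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({"Height": [61.5, 55.5, 64.5, 55.5],
                   "Sex":    ["M", "F", "M", "F"],
                   "Age":    [20, 30, 41, 51]})

X = df[["Height", "Age"]]   # features
y = df["Sex"]               # label

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict(pd.DataFrame({"Height": [62.0], "Age": [25]})))
```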
Here comes a more visual approach to explain the concept. Imagine you want to classify the animal shown in a photo.
The possible classes of animals are e.g. cats or birds.
In that case the label would be the possible class associations e.g. cat or bird, that your machine learning algorithm will predict.
The features are patterns, colors, and forms that are part of your images, e.g. fur, feathers, or, at a lower level of interpretation, pixel values.
Label: Bird
Features: Feathers
Label: Cat
Features: Fur
Prerequisite: Basic Statistics and exposure to ML (Linear Regression)
It can be answered in a sentence -
They are alike, but which is which changes according to what you need to predict.
Explanation
Let me explain my statement. Suppose that you have a dataset; for this purpose, consider exercise.csv. Each column in the dataset is called a feature. Gender, Age, Height, Heart_Rate, Body_Temp, and Calories might be among its various columns. Each column represents a distinct feature or property.
exercise.csv
|User_ID|Gender|Age|Height|Weight|Duration|Heart_Rate|Body_Temp|Calories|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|14733363|male|68|190.0|94.0|29.0|105.0|40.8|231.0|
|14861698|female|20|166.0|60.0|14.0|94.0|40.3|66.0|
|11179863|male|69|179.0|79.0|5.0|88.0|38.7|26.0|
To solidify the understanding and clear up the puzzle, let us take two different prediction problems.
CASE 1: Here we might use Gender, Height, and Weight to predict the Calories burnt during exercise. The prediction (Y), Calories, is the label: the column you want to predict using features like x1: Gender, x2: Height, and x3: Weight.
CASE 2: Here we might instead predict Heart_Rate, using Gender and Weight as features. Now Heart_Rate is the label, predicted using the features x1: Gender and x2: Weight.
Once you have understood the above, you won't be confused by labels and features anymore.
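
A short sketch of the two cases, assuming exercise.csv has the columns shown above (the model is just illustrative):

```python
# Same dataset, different choice of features and label.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("exercise.csv")
df["Gender"] = (df["Gender"] == "male").astype(int)  # encode for the model

X = df[["Gender", "Height", "Weight"]]   # features (CASE 1)
y = df["Calories"]                       # label (CASE 1)
model = LinearRegression().fit(X, y)

# CASE 2 swaps roles: X = df[["Gender", "Weight"]], y = df["Heart_Rate"]
```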
Let's take an example where we want to detect the alphabet in handwritten photos. We feed these sample images to the program, and the program classifies the images on the basis of the features it finds.
An example of a feature in this context: the letter 'C' can be thought of as a concave shape opening to the right.
A question now arises: how do we store these features? We need to name them. This is where the label comes in: a label is given to such features to distinguish them from other features.
Thus, we obtain labels as output when provided with features as input.
Labels are not associated with unsupervised learning.
A feature, briefly explained, is the input you feed to the system, and the label is the output you expect. For example, you feed many features of a dog, such as its height and fur color, and after computing, the model returns the breed of the dog you want to know.
Suppose you want to predict the climate: the features given to you would be historic climate data, current weather, temperature, wind speed, etc., and the labels would be the months.
The above combination can help you derive predictions.

How many kinds of criteria are there to measure which feature distinguishes the label better?

I have a data set like this:
|label|feature1|feature2|feature3|feature4|...|
|:--|:--|:--|:--|:--|:--|
|0|value11|value21|value31|...|...|
|1|value12|value22|...|...|...|
|4|value13|value23|...|...|...|
|2|value14|value24|...|...|...|
|1|value15|value25|...|...|...|
|3|value16|value26|...|...|...|
The label takes values in {0, 1, 2, 3, 4}.
feature1 ranges from 0 to 10000.
feature2 ranges from -4 to 3.
And so on.
For feature1 and feature2, I want to check which feature distinguishes the label better. How many ways are there to do this?
I have thought of the following plans:

- check the Pearson correlation between the label and each feature
- check the variance of feature1 and feature2? But they have different ranges.
- simultaneously use feature1 and feature2 to split a decision tree and check which feature has the larger information gain (see the sketch after this question)
- do a linear regression using feature1 and feature2 and check the coefficients?
- plot the distribution of feature1 and feature2, but without the information of the label

I want to know which of these methods is solid enough. Are there any other, better methods? Which is the best? Thanks in advance.
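
As a hedged sketch of the information-gain idea from the list, assuming scikit-learn: mutual information between each feature and the label is scale-invariant, so the different ranges of feature1 and feature2 do not matter. Toy data stands in for the real dataset here.

```python
# Score each feature by its mutual information with the label.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 5, size=500)                    # labels in {0,...,4}
feature1 = y * 2000 + rng.normal(0, 500, size=500)  # informative, range ~0-10000
feature2 = rng.uniform(-4, 3, size=500)             # uninformative, range -4..3

X = np.column_stack([feature1, feature2])
mi = mutual_info_classif(X, y, random_state=0)
print(dict(zip(["feature1", "feature2"], mi.round(3))))
```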
A very common approach is to use a cross-validation set and perform "model selection", measuring performance with metrics like precision, recall, and F1 score. Your workflow would be (in pseudocode, not real code):

list_of_models_to_evaluate = the model candidates you define, for example: one feature, two features, polynomial features
for every model m in list_of_models_to_evaluate:
    train the model m on your training dataset
    obtain the performance metrics on the cross-validation set
select your optimal model based on the performance metrics (obtained from the cross-validation set)

This is a very common and powerful approach. You can find more info in Andrew Ng's videos on this subject on YouTube.
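
A runnable version of this pseudocode, assuming scikit-learn; the candidate models, dataset, and metric are placeholders:

```python
# Score several candidate models with cross-validation and keep the best.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

candidates = {
    "one feature":  (X[:, :1], LogisticRegression()),
    "two features": (X[:, :2], LogisticRegression()),
    "polynomial":   (X, make_pipeline(PolynomialFeatures(2),
                                      LogisticRegression(max_iter=1000))),
}

scores = {}
for name, (features, model) in candidates.items():
    # F1 on held-out folds, as suggested; precision/recall work the same way
    scores[name] = cross_val_score(model, features, y, cv=5, scoring="f1").mean()

best = max(scores, key=scores.get)
print(scores, "-> best:", best)
```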
