What steps should we take to analyze a dataset? [closed]


Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I am new to the field of data science, and I want to know the key steps for getting the properties of any dataset used for machine learning tasks.

What you ask is very general and not well defined, but I'll try to give you a short introduction to get you started.
Knowledge required (as I see it):
Statistics and probability
Basic knowledge in mathematics
Basic knowledge of AI techniques and algorithms
The first step in any research is to define the problem: what are you trying to do?
For instance:
"I would like to predict if the next person who buys this car is a male or a female"
This kind of problem is a classification problem, which means the solution should correctly label the "input" person as male or female.
The solution is called a model. A model is a representation of the real world and its properties, and we use ML tools to create it.
We do that by looking at historical data. For example, let's say that out of 1000 male customers and 1000 female customers, 850 males bought car X while the rest bought car Y, and 760 females bought car Y while the rest bought car X.
Now, if I tell you the next customer bought car X, can you tell me their gender?
You are probably thinking male. There's still a chance it is a female, but the probability of a male is higher (850 of the 1090 buyers of car X, about 78%), since we already know the pattern of male and female choices.
That's basically how it works: given a dataset such as yours, you use it to predict something.
Note: whether your dataset is fit for this, or how much information gain you'll get from it, is another story.
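The car example can be checked with a few lines of Python. This is only a sketch of the counting argument, not a real ML model; the counts come from the example, while the names counts and gender_given_car are made up for illustration.

```python
# Counts from the example: 1000 male and 1000 female customers.
counts = {
    ("male", "X"): 850,
    ("male", "Y"): 150,
    ("female", "X"): 240,
    ("female", "Y"): 760,
}

def gender_given_car(car):
    """Estimate P(gender | car) from the historical counts."""
    total = sum(n for (_, c), n in counts.items() if c == car)
    return {g: n / total for (g, c), n in counts.items() if c == car}

probs = gender_given_car("X")
prediction = max(probs, key=probs.get)  # most likely gender for a buyer of car X
print(prediction, round(probs[prediction], 3))  # male 0.78
```

The prediction is simply the gender with the highest estimated probability, which is the same reasoning as in the text above.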
Now, each piece of data you can learn from is called a record:
first_name: 'LEROY', last_name: 'JENKINS', age: '25', gender: 'male', car_bought: 'x'
and each property is called a feature.
Some features can be useless to you. In our example only the gender and the car bought matter and the rest are useless; learning from useless features may cause your model to pick up patterns that are not real.
Also, some records may contain invalid data such as NULLs and missing values, so the first thing to do is pre-process your data and get it ready for learning.
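As a rough sketch of that pre-processing step, the snippet below keeps only the useful features and drops records with missing values. Everything except the LEROY JENKINS record is hypothetical data, and KEEP and preprocess are illustrative names; real projects often use a library such as pandas for this.

```python
records = [
    {"first_name": "LEROY", "last_name": "JENKINS", "age": "25",
     "gender": "male", "car_bought": "x"},
    # Hypothetical extra records, made up to show missing values.
    {"first_name": "ANNA", "last_name": "SMITH", "age": None,
     "gender": "female", "car_bought": "y"},
    {"first_name": "JOHN", "last_name": "DOE", "age": "41",
     "gender": None, "car_bought": "x"},
]

KEEP = ("gender", "car_bought")  # features assumed to be useful here

def preprocess(rows, keep=KEEP):
    """Keep only the chosen features and drop rows where any of them is missing."""
    cleaned = []
    for row in rows:
        selected = {name: row.get(name) for name in keep}
        if all(value is not None for value in selected.values()):
            cleaned.append(selected)
    return cleaned

clean = preprocess(records)
print(clean)
```

Note that a missing value in a dropped feature (the age of the second record) does not cost you the record; only missing values in the kept features do.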
Once your data is ready, you can start training. For that you'll have to choose the most suitable algorithm for your problem. I won't go over the algorithms because there are a lot of them and you'll have to learn more about them yourself, but there are many libraries that implement them and you can find them with a quick search.
To get you started, I'll give you a short code example of a simple neural network that predicts the outcome of a simple mathematical function: F(x) = 2*x
# prepare the dataset
import numpy as np

def make_dataset():
    # inputs 0.0 .. 999.0 and their targets F(x) = 2*x
    X = np.arange(0.0, 1000.0, 1.0)
    Y = 2.0 * X
    return X, Y

X, Y = make_dataset()
And a simple neural network using Keras:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(5, input_shape=(1,)))
model.add(Dense(1, activation='linear'))
# compile model (accuracy is not meaningful for regression, so only the loss is tracked)
model.compile(loss='mean_absolute_error', optimizer='adam')
# train model on the X, Y arrays prepared above
m = model.fit(X, Y, epochs=500, batch_size=1)
Predicting with the trained model:
for i in np.arange(2000.0, 2010.0, 1.0):
    x = np.array([[i]])  # one sample with one feature
    print('f(', i, ') = ', model.predict(x)[0][0])
This will output values close to:
f(2000.0) = 4000.0
f(2001.0) = 4002.0
f(2002.0) = 4004.0
.
.
Even though the model never saw these numbers before, it can now predict the output because it learned the pattern from the dataset.
I don't expect you to understand how Keras works or what it does; this is only meant to give you a feel for what it is like to use an ML algorithm.
I hope that answers your question and helps you get started.

Your question is too general; you need to be more specific. What do you mean by the properties of the dataset?
Nevertheless I'll try to answer what I understood from your question.
After choosing what kind of problem you have (classification or regression) you'll want to try and visualize your data to get a better sense of what you are doing.
Facets (https://pair-code.github.io/facets/) is an excellent tool for this. It will help you better understand how your data is distributed and may give you some extra insight into how to tackle your problem, but how you use it depends on the problem you have.
You should also visualize your correlation matrix to see whether you have features that are heavily correlated and thus you can remove unnecessary features.
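As a small sketch of the correlation check, the snippet below computes Pearson correlations between feature columns and lists the heavily correlated pairs. The column names, values, and the 0.95 threshold are all made up for illustration, and in practice you would more likely use a library routine and plot the matrix as a heatmap.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def correlated_pairs(columns, threshold=0.95):
    """Return (first, second, r) for every pair of columns with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) >= threshold:
                pairs.append((first, second, r))
    return pairs

# Hypothetical features: the two size columns carry the same information.
columns = {
    "size_m2": [50, 70, 90, 120, 150],
    "size_sqft": [538, 753, 969, 1292, 1615],
    "age_years": [30, 5, 12, 40, 8],
}
pairs = correlated_pairs(columns)
print(pairs)
```

Here only the two size columns are flagged, so one of them could be removed without losing much information.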
I remember that when I started working on my first machine learning project things were overwhelming. The best tip I can give you is to find a step-by-step guide that deals with a problem similar to the one you are facing; I'm sure you'll find plenty. Also try to clarify your question so that we can give you more insight.


Fine tuning GPT2 for generative question answering

I am trying to finetune gpt2 for a generative question answering task.
Basically I have my data in a format similar to:
Context : Matt wrecked his car today.
Question: How was Matt's day?
Answer: Bad
I was looking on the huggingface documentation to find out how I can finetune GPT2 on a custom dataset and I did find the instructions on finetuning at this address:
https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling
The issue is that they do not provide any guidance on how your data should be prepared so that the model can learn from it. They give different datasets that they have available, but none is in a format that fits my task well.
I would really appreciate if someone with more experience could help me.
Have a nice day!

Your task right now is ambiguous; it could be any of:
QnA via Classification (answer is categorical)
QnA via Extraction (answer is in the text)
QnA via Language Modeling (answer can be anything)
Classification
If all your examples have Answer: X, where X is categorical (i.e. always "Good", "Bad", etc ...), you can do classification.
In this setup, you would have text-label pairs:
Text
Context: Matt wrecked his car today.
Question: How was Matt's day?
Label
Bad
For classification, you're probably better off just fine-tuning a BERT-style model (something like RoBERTa).
Extraction
If all your examples have Answer: X, where X is a word (or consecutive words) in the text (for example), then it's probably best to do a SQuAD-style fine-tuning with a BERT-style model. In this setup, your input is (basically) a set of text, start_pos, end_pos triplets:
Text
Context: In early 2012, NFL Commissioner Roger Goodell stated that the league planned to make the 50th Super Bowl "spectacular" and that it would be "an important game for us as a league".
Question: Who was the NFL Commissioner in early 2012?
Start Position, End Position
6, 8
Note: The start/end position values are of course positions of tokens, so these values will depend on how you tokenize your inputs.
In this setup, you're also better off using a BERT-style model. In fact, there are already models on huggingface hub trained on SQuAD (and similar datasets). They should already be good at these tasks out of the box (but you can always fine-tune on top of this).
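To make the start/end positions concrete, here is a minimal sketch that locates an answer inside the text. It assumes plain whitespace tokenization and an exclusive end index, which is only one possible convention (real BERT-style tokenizers split text differently); find_answer_span is a made-up helper name.

```python
def find_answer_span(text, answer):
    """Return (start, end) token positions of answer in text, end exclusive.

    Tokens are produced by splitting on whitespace, which is an assumption
    for illustration only. Returns None if the answer is not in the text.
    """
    tokens = text.split()
    answer_tokens = answer.split()
    width = len(answer_tokens)
    for start in range(len(tokens) - width + 1):
        if tokens[start:start + width] == answer_tokens:
            return start, start + width
    return None

text = (
    'Context: In early 2012, NFL Commissioner Roger Goodell stated that '
    'the league planned to make the 50th Super Bowl "spectacular" and that '
    'it would be "an important game for us as a league". '
    'Question: Who was the NFL Commissioner in early 2012?'
)
span = find_answer_span(text, "Roger Goodell")
print(span)  # (6, 8)
```

Under these assumptions the span matches the 6, 8 of the example above; with a different tokenizer the numbers would change, as the note says.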
Language Modeling
If all your examples have Answer: X, where X can basically be anything (it need not be contained in the text, and is not categorical), then you'd need to do language modeling.
In this setup, you have to use a GPT-style model, and your input would just be the whole text as is:
Context: Matt wrecked his car today.
Question: How was Matt's day?
Answer: Bad
There is no need for labels, since the text itself is the label (we're asking the model to predict the next word, for each word). Larger models like GPT-3 and https://cohere.com (full disclosure, I work at Cohere) should be good at these tasks without any finetuning (if you give it the right prompt + examples), but of course, these are accessed behind APIs. These platforms also allow you to fine-tune models (via language modeling), so you don't need to run any code yourself. Not sure how much mileage you'll get with finetuning a smaller model like GPT-2. If this project is for learning, then yeah, definitely go ahead and fine-tune a GPT-2 model! But if performance is key, I highly recommend using a solution like https://cohere.com, which will just work out of the box.
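For the language modeling setup, preparing the data can be as simple as turning each example into one string. The sketch below shows one possible way to do that; the dictionary keys and the helper names to_training_text and to_prompt are assumptions, and appending GPT-2's end-of-text token after the answer is a common convention rather than something the linked instructions prescribe.

```python
EOS = "<|endoftext|>"  # GPT-2's end-of-text token

def to_training_text(example):
    """Format one example as the full text the model is trained on."""
    return (
        f"Context: {example['context']}\n"
        f"Question: {example['question']}\n"
        f"Answer: {example['answer']}{EOS}"
    )

def to_prompt(example):
    """Format the same example for inference: everything up to the answer."""
    return (
        f"Context: {example['context']}\n"
        f"Question: {example['question']}\n"
        f"Answer:"
    )

examples = [
    {
        "context": "Matt wrecked his car today.",
        "question": "How was Matt's day?",
        "answer": "Bad",
    },
]
training_texts = [to_training_text(e) for e in examples]
print(training_texts[0])
```

At inference time you would feed the model the prompt and let it generate until it produces the end-of-text token.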

Confusion about test & validation set labels in machine learning [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I have a question with regards to the training and validation of a dataset.
I understand the concept of labels for training data, i.e. y_train. What I don't get is why our testing/validation samples should have labels as well.
I assume that by giving labels to the test samples, we define what they are before putting them through the algorithm right?
Let me put it this way, if I have a dataset of pictures of dogs and cats, and I label them 1 and 2, respectively. Then if I want to throw a picture (dog) to test my model, which was not in my training dataset, why should I label it? If I label it 1, then I'm telling beforehand that it's a dog and if I label it 2, then it is a cat already.
Can I have a testing/validation dataset without labels?

The validation dataset is used to tune the hyperparameters of your model, while the test set is used to check its accuracy. Without labels, how can you claim that your model is correct? This concept applies to supervised learning, so you need labels for both the testing and validation datasets.

The purpose of a test set is, as its name implies, to test the performance of your model on data that were not seen during training. And in order to get this performance indication, you certainly need data with known labels, in order to compare these labels (ground truth) with the corresponding model predictions, and to arrive at some quantitative measure (e.g. accuracy) of your model performance - something you can certainly not do without these labels being available in the test set.
if I want to throw a picture (dog) to test my model, why should I label it? If I label it 1, then I'm telling beforehand that it's a dog and if I label it 2, then it is a cat already.
You are using the term "test" very loosely here - this is not its meaning in the context of a test set (which context I just described above). Notice also that, the fact that the test labels are available, does not mean that they are being used by the model during prediction (they are certainly not - they are only used for comparison with the model predictions, as described above). Plus, you are referring to a very specific problem where the answer (cat/dog) is obvious to a human observer - try using the same rationale e.g. in a genomics problem, or in one that requests numeric predictions for, say, house prices, and you'll see that the situation is not that simple and straightforward (could you possibly name the price of a house by just looking at a row of numbers?)...
The same applies for a validation set, only the objective here is different (i.e. not model assessment, but model tuning).
Admittedly, some people use the term "test data" to mean any unseen data in general, but this is not correct; after you have built and assessed your model using your training, validation, and test sets, you deploy it and feed it new and obviously unseen data, for which the labels are certainly not expected to be known already...
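The role of the test labels can be sketched in a few lines. The "model" below is a hypothetical stand-in (a fixed weight threshold) and the features and labels are made up; the point is only that prediction uses the features alone, while the labels are used afterwards for scoring.

```python
def predict(features):
    """Hypothetical stand-in for a trained model; it sees only the features."""
    return 1 if features["weight_kg"] >= 10 else 2  # 1 = dog, 2 = cat

def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up test set: features and their known labels are kept separately.
test_features = [{"weight_kg": 25}, {"weight_kg": 4}, {"weight_kg": 12}, {"weight_kg": 8}]
test_labels = [1, 2, 2, 2]

predictions = [predict(x) for x in test_features]  # labels are not used here
test_accuracy = accuracy(predictions, test_labels)  # labels are used only here
print(test_accuracy)  # 0.75
```

Without test_labels the last step would be impossible, which is exactly why a test set needs labels.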
There are literally dozens of online tutorials on the subject, and SO is arguably not the most appropriate forum for this kind of question - I just hope I have given you a good-enough first general idea...

Machine Learning Two class classification [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
I built this ML model in Azure ML studio with 4 features including a date column.
Trying to predict if the price is going to be higher tomorrow than it is today. Higher = 1, not higher = 0
It is a Two class neural network (with a Tune model hyperparameters).
When I test it I expect to get an answer between 0 and 1, which I do. The problem comes when I change the feature from 1 to 0 and get almost the same answer.
I thought that if 1 = a score probabilities of 0.6
Then a 0 (with the same features) should give a score of 0.4
A snapshot of the data (yes I know I need more)
Model is trained/tuned on the "Over5" feature, and I hope to get an answer from the Two class neural network module in the range between 0 -1.
The Score module also produces results between 0 and 1. Everything looks correct.
I changed the normalization method (after a recommendation from a commenter) but it does not change the output much.
Everything seems to be in order but my goal is to get a prediction of the likelihood that a day would finish "Over5" and result in a 1.
When I test the model by using a "1" in the Over5 column I get a prediction of 0.55... then I tested the model with the same settings only changing the 1 to a 0 and I still get the same output 0.55...
I do not understand why this is since the model is trained/tuned on the Over5 feature. Changing input in that column should produce different results?

Outputs of a neural network are not probabilities (generally), so that could be a reason that you're not getting the "1 - P" result you're looking for.
Now, if it's simple logistic regression, you'd get probabilities as output, but I'm assuming what you said is true and you're using a super-simple neural network.
Also, what you may be changing is the bias "feature", which could also lead to the model giving you the same result after training. Honestly there's too little information in this post to say for certain what's going on. I'd advise you try normalizing your features and trying again.
EDIT: Do you know if your neural network actually has 2 output nodes, or if it's just one output node? If there are two, then the raw output doesn't matter quite as much as which node had the higher output. If it's just one, I'd look into thresholding it somewhere (like >0.5 means the price will rise, but <=0.5 means the price will fall, or however you want to threshold it.) Some systems used in applications where false positives are more acceptable than false negatives threshold at much lower values, like 0.2.
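For the single-output case, thresholding can be sketched like this. The function name to_class is made up, and the scores are just example values; the default threshold of 0.5 and the lower value of 0.2 follow the text above.

```python
def to_class(score, threshold=0.5):
    """Map a single network output to a class: 1 = price rises, 0 = it does not."""
    return 1 if score > threshold else 0

print(to_class(0.55))                 # 1
print(to_class(0.55, threshold=0.6))  # 0
print(to_class(0.3, threshold=0.2))   # 1
```

Lowering the threshold makes the system predict the positive class more often, which trades more false positives for fewer false negatives.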

Is Machine Learning the relationship between input & output [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
According to an article I read here, Machine Learning is to do with teaching a machine how to do certain tasks through 'learning' input/output relations.
What is a more accurate definition of machine learning?

Machine Learning is to do with teaching a machine how to do certain tasks through input/output relations. Is this kind of correct?
The short answer is yes, kind of. Read on.
Definition of Machine Learning
To understand what Machine Learning is let's first define the term Learning. The often quoted definition by Tom M. Mitchell (1) is as follows:
A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P if its performance at
tasks in T, as measured by P, improves with experience E
Meaning?
This sounds quite formal, however it just says computers learn from experience that they are presented with in terms of data. The data to enable learning exists relative to a specific task and consists of several parameters:
T, a task to accomplish, e.g. predicting housing prices
E, some value of experience, e.g. prices observed
P, some value of performance, e.g. how many prices are predicted correctly
Example: Housing prices
Once a program has learnt from these inputs, it can take a new, previously unseen experience and from that predict, in our example, the specific housing price. The housing price might be strongly correlated to say location, age and size of house or apartment, and the luxury of its interiors.
What is the result of a learning algorithm?
In its simplest form then a machine learning algorithm for housing prices might implement a multi-variate regression analysis. It takes as input a body of data that relates real, observed prices to the four features location, age, size, luxury. The process of learning produces a regression model that in essence assigns a weight to each feature, of the form
y^ = w_location * location + w_age * age + w_size * size + w_luxury * luxury
That is, the weights w_* are learned from the input data, y^ is the predicted price. The learning is considered successful once the formula is able to successfully predict housing prices based on a list of features alone. Usually a prediction is considered successful if it falls within a certain bound (%-range) of the real price.
Note that the definition of successful very much depends on the kind of task that the program must learn, however the result needs to be substantially better than a pure random guess (that is, the ratio of correct results needs to be statistically significant).
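The formula and the success criterion can be written out as a short sketch. The weights below are invented for illustration and are not learned from any data, the feature encodings (for example a numeric location score) are assumptions, and the 10% bound is just one possible choice.

```python
def predict_price(weights, features):
    """Weighted sum of the features, as in the formula for y^ above."""
    return sum(weights[name] * value for name, value in features.items())

def within_bound(predicted, actual, tolerance=0.10):
    """True if the prediction is within the given %-range of the real price."""
    return abs(predicted - actual) <= tolerance * actual

# Hypothetical weights; in practice these are what the learning step produces.
weights = {"location": 50000.0, "age": -1000.0, "size": 2000.0, "luxury": 20000.0}
house = {"location": 3.0, "age": 20.0, "size": 100.0, "luxury": 2.0}

predicted = predict_price(weights, house)
print(predicted)                          # 370000.0
print(within_bound(predicted, 350000.0))  # True
print(within_bound(predicted, 300000.0))  # False
```

Counting how often within_bound is true over many houses gives one possible performance measure P for this task.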
Is there more to it?
Yes, a lot. Some pointers can be found in this Wikipedia article. If you are keen to get into the subject, professor Andrew Ng's Stanford lecture is quite famous, although there are many more courses if you look for them. Pick the one that best suits your interests.
References
(1): Mitchell, T. (1997). Machine Learning, McGraw Hill. ISBN 0-07-042807-7, p.2. as referenced by Wikipedia

Using Naive Bayes Classification to Identify a Twitter User's Gender [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I have become part of a project at school that has been a lot of fun so far and it just got a little bit more interesting. I have roughly 600,000 tweets in my possession (each contains screen name, geo location, text, etc.) and my goal is to try to classify each user as either male or female. Now using Twitter4J I can get the user's full name, number of friends, re-tweets, etc. So I was wondering if a combination of looking at a user's name and also doing text analysis would be a possible answer. I was originally thinking I could make this a rule-based classifier where I would first look at the user's name, then analyze their text and attempt to come to a conclusion of M or F. I'm guessing I would have trouble using something such as naive bayes since I don't have the real truth values?
Also with the names, I would be checking some kind of dictionary to interpret whether the name was male or female. I know there are cases where it's hard to tell but that's why I'd be looking at their tweet texts as well. I also forgot to mention; with these 600,000 tweets, I have at minimum two tweets per user available to me.
Any ideas or input on classifying a user's gender would be greatly appreciated! I don't have a ton of experience in this area and I'm looking to learn anything I can get my hands on.

I'm guessing I would have trouble using something such as naive bayes since I don't have the real truth values?
Any supervised learning algorithm, such as Naive Bayes, requires a prepared training set. Without the actual gender for at least some of the data you cannot build such a model. On the other hand, if you come up with some rule-based system (like one based on the users' names) you can try a semi-supervised approach. Let's say that your rule-based classifier is RC and can answer "Male", "Female", or "Do not know". You can then label your data X using RC in a natural way:
X_m = { x in X : RC(x)="Male" }
X_f = { x in X : RC(x)="Female" }
Once you have done that, you can create a training set for the supervised learning model using all of your data except the part used for creating RC, which in this case is the users' names (I assume that RC answers "Male" or "Female" iff it is entirely "sure" about it). As a result, you will train a classifier that tries to generalize the concept of gender from all the additional data (words used, location, etc.). Let's call it SC. After that, you can simply create a "complex" classifier:
C(x) = "Male"   iff RC(x)="Male" or
                (RC(x)="Do not know" && SC(x)="Male")
       "Female" iff RC(x)="Female" or
                (RC(x)="Do not know" && SC(x)="Female")
This way you can use the most valuable information (the user name) in a rule-based way, and at the same time exploit the power of supervised learning for the "hard cases", without having the "ground truth" in the first place.
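A minimal sketch of the scheme above could look as follows. The name dictionary, the user records, and the always_female stand-in for the trained classifier SC are all hypothetical placeholders; a real SC would be a model (for example Naive Bayes) trained on the pairs produced by label_with_rules.

```python
NAME_TO_GENDER = {"maria": "Female", "john": "Male"}  # hypothetical name dictionary

def rc(user):
    """Rule-based classifier RC: looks only at the first name."""
    parts = user["name"].split()
    first_name = parts[0].lower() if parts else ""
    return NAME_TO_GENDER.get(first_name, "Do not know")

def label_with_rules(users):
    """Training pairs for SC: users RC is sure about, with the name left out."""
    return [({"tweets": u["tweets"]}, rc(u)) for u in users if rc(u) != "Do not know"]

def combined(user, rule_classifier, supervised_classifier):
    """Classifier C: trust RC when it is sure, otherwise fall back to SC."""
    answer = rule_classifier(user)
    if answer != "Do not know":
        return answer
    return supervised_classifier(user)

users = [
    {"name": "John Smith", "tweets": ["hello world"]},
    {"name": "xX_gamer_Xx", "tweets": ["gg"]},
]
training_pairs = label_with_rules(users)

def always_female(user):
    """Dummy stand-in for a trained SC, used only to show the control flow."""
    return "Female"

print(combined(users[0], rc, always_female))  # Male (decided by RC)
print(combined(users[1], rc, always_female))  # Female (decided by the SC stand-in)
```

The supervised classifier is passed in as an argument so that the fallback logic stays the same whichever model you end up training.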

You need to develop a vocabulary linking name and gender.
Then you have to define features for each tweet.
Finally, you can use Weka (Java), Matlab, or Python to build the learning set.
Main issues:
Your language? Identifying sex from a name is easy in Italian (-a female, -o male, except Andrea and Luca), or have a look at: Does anyone know of a good library for mapping a person's name to his or her gender?
The second issue is a bit more complicated: you need a semantic dictionary, or you can analyse only the destination of the tweet (#to) or the presence of a URL or image.
