Using Naive Bayes Classification to Identify a Twitter User's Gender [closed] - twitter

I have become part of a project at school that has been a lot of fun so far, and it just got a little bit more interesting. I have roughly 600,000 tweets in my possession (each contains screen name, geo location, text, etc.) and my goal is to try to classify each user as either male or female. Using Twitter4J I can get the user's full name, number of friends, re-tweets, etc. So I was wondering if a combination of looking at a user's name and also doing text analysis would be a possible answer. I was originally thinking I could make this a rule-based classifier where I first look at the user's name, then analyze their text and attempt to come to a conclusion of M or F. I'm guessing I would have trouble using something such as naive bayes since I don't have the real truth values?
Also, with the names, I would be checking some kind of dictionary to interpret whether the name is male or female. I know there are cases where it's hard to tell, but that's why I'd be looking at their tweet texts as well. I also forgot to mention: with these 600,000 tweets, I have at minimum two tweets per user available to me.
Any ideas or input on classifying a user's gender would be greatly appreciated! I don't have a ton of experience in this area and I'm looking to learn anything I can get my hands on.

I'm guessing I would have trouble using something such as naive bayes since I don't have the real truth values?
Any supervised learning algorithm, such as Naive Bayes, requires preparing a training set. Without the actual gender for at least some of the data you cannot build such a model. On the other hand, if you come up with some rule-based system (like the one based on the users' names), you can try a semi-supervised approach. Using your rule-based system, you can create a labelling of your data. Let's say your rule-based classifier is RC and can answer "Male", "Female", or "Do not know"; you can then create a labelling of your data X using RC in a natural way:
X_m = { x in X : RC(x)="Male" }
X_f = { x in X : RC(x)="Female" }
Once you have done that, you can create a training set for the supervised learning model using all your data except the part used for creating RC - so in this case, the users' names (I assume that RC answers "Male" or "Female" iff it is entirely "sure" about it). As a result, you will train a classifier which tries to generalize the concept of gender from all the additional data (like words used, location, etc.). Let's call it SC. After that, you can simply create a "complex" classifier:
C(x) = "Male" iff RC(x)= Male" or
(RC(x)="Do not know" && SC(x)="Male")
"Female" iff RC(x)= Female" or
(RC(x)="Do not know" && SC(x)="Female")
This way you can, on the one hand, use the most valuable information (the user name) in a rule-based way, while at the same time exploiting the power of supervised learning for the "hard cases", despite not having the "ground truth" in the first place.
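To make this concrete, here is a minimal sketch in Python of the RC/SC combination described above, assuming scikit-learn is available; the name dictionary, the user records and their texts are made-up placeholders:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder dictionary standing in for a real name->gender lookup.
NAME_GENDER = {'alice': 'Female', 'bob': 'Male'}

def rc(user):
    """Rule-based classifier: answers only when the first name is unambiguous."""
    return NAME_GENDER.get(user['name'].split()[0].lower(), 'Do not know')

users = [
    {'name': 'Alice Smith', 'text': 'shopping for makeup and a new dress today'},
    {'name': 'Bob Jones',   'text': 'watching the football game with a cold beer'},
    {'name': 'Sam Lee',     'text': 'grabbing a beer before the big game tonight'},
]

labelled = [u for u in users if rc(u) != 'Do not know']     # X_m and X_f from above

vec = CountVectorizer()
sc = MultinomialNB()
sc.fit(vec.fit_transform([u['text'] for u in labelled]),
       [rc(u) for u in labelled])

def c(user):
    """Complex classifier C(x): trust RC when it answers, otherwise fall back to SC."""
    label = rc(user)
    if label != 'Do not know':
        return label
    return sc.predict(vec.transform([user['text']]))[0]

print([c(u) for u in users])   # e.g. ['Female', 'Male', 'Male']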

You need to develop a vocabulary linking name and gender.
Then you have to define features for each tweet.
Finally you can use Weka (Java), Matlab, or Python to build the learning set.
Main issues:
What is your language? Identifying sex from a name is easy in Italian (-a female, -o male, except for names like Andrea and Luca), or have a look here: Does anyone know of a good library for mapping a person's name to his or her gender? (A small sketch follows after these issues.)
The second issue is a bit more complicated: you need a semantic dictionary, or you can only analyse the destination of the tweet (the @-mention) or the presence of a URL or image.
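A rough sketch of such a name-to-gender lookup in Python, with a simplified Italian suffix heuristic; the dictionary entries are placeholders:
# Hypothetical sketch: a name->gender vocabulary, falling back to the Italian
# suffix heuristic (-a female, -o male), with known exceptions kept ambiguous.
NAME_TO_GENDER = {'giulia': 'F', 'marco': 'M', 'andrea': None, 'luca': None}

def gender_from_name(first_name):
    name = first_name.strip().lower()
    if name in NAME_TO_GENDER:
        return NAME_TO_GENDER[name]   # the dictionary overrides the heuristic
    if name.endswith('a'):
        return 'F'
    if name.endswith('o'):
        return 'M'
    return None                       # unknown: defer to the tweet-text analysis

print(gender_from_name('Giulia'), gender_from_name('Paolo'), gender_from_name('Andrea'))
# F M None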

Related

What steps should we take to analyze a dataset? [closed]

I am new in the field of data science, and I want to know about the key steps to get the properties of any dataset used for machine learning tasks.
What you ask is very general and your request is not well defined, but I'll try to give you a short introduction to get you started.
Knowledge required (as I see it):
Statistics and probability
Basic knowledge in mathematics
Basic knowledge of AI techniques and algorithms
The first step in any research is to define the problem: what are you trying to do?
For instance:
"I would like to predict if the next person who buys this car is a male or a female"
This kind of problem is a classification problem, which means the solution will correctly label the "input" person as a male or a female.
This is called a model; a model is a representation of the real world and its properties, and using ML tools we wish to create it.
We do that by looking at historical data. For example, let's say that out of 1000 male customers and 1000 female customers, 850 males bought car X while the rest bought car Y, and 760 females bought car Y while the rest bought car X.
Now, if I tell you the next customer bought car X, can you tell me their gender?
You are probably thinking it's a male. There's still a chance it's a female, yet there's a higher probability that it is in fact a male, since we already know the pattern of males' and females' choices.
That's basically how it works: given a dataset such as yours, you need to use it in order to predict something from it.
Note: whether your dataset is fit for this or not, or how much information gain you'll get from it, is another story.
Now, each piece of data you can learn from is called a record:
first_name: 'LEROY', last_name: 'JENKINS', age: '25', gender: 'male', car_bought: 'x'
and each property is called a feature.
Some features can be useless to you; in our example, only the gender and the car bought matter, and the rest are useless. Learning from a useless feature may cause your model to learn invalid data.
Also, some records may contain invalid data such as NULLs and missing values, so the first thing to do is to pre-process your data and get it ready for learning.
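As a minimal illustration of this pre-processing step (a sketch only, using pandas and the hypothetical columns from the record above):
import pandas as pd

records = pd.DataFrame([
    {'first_name': 'LEROY', 'last_name': 'JENKINS', 'age': 25,   'gender': 'male',   'car_bought': 'x'},
    {'first_name': 'JANE',  'last_name': None,      'age': None, 'gender': 'female', 'car_bought': 'y'},
])

# Drop features that carry no useful signal for this task.
features = records.drop(columns=['first_name', 'last_name'])

# Fill missing values (here: the median age), or drop records that stay incomplete.
features['age'] = features['age'].fillna(features['age'].median())
features = features.dropna()
print(features)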
Once your data is ready, you can start training. For that, you'll have to choose the algorithm most suitable for you. I won't go over the algorithms because there are a lot of them and you'll have to gain more knowledge about them yourself, but there are many libraries for them and you should just google it.
I'll give you a short code example of a simple neural network, to get you started, which predicts the output of a simple mathematical function: F(x) = 2*x
# prepare the dataset: X = 0..999, Y = 2*X
import numpy as np

X = np.arange(0.0, 1000.0, 1.0)
Y = np.empty(shape=0, dtype=float)
for x in X:
    Y = np.append(Y, float(2 * x))
and a simple neural network using keras:
# build a small network: one hidden layer of 5 units and a linear output
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(5, input_shape=(1,)))
model.add(Dense(1, activation='linear'))
# compile model (mean absolute error is a reasonable loss for this regression)
model.compile(loss='mean_absolute_error', optimizer='adam')
# train model on the X/Y pairs prepared above
m = model.fit(X.reshape(-1, 1), Y, epochs=500, batch_size=1)
Predicting using the output model:
import pandas

for i in np.arange(2000.0, 2010.0, 1.0):
    df = pandas.DataFrame()
    df['X'] = [i]
    print('f(', i, ') = ', model.predict(df)[0][0])
will output approximately:
f(2000.0) = 4000.0
f(2001.0) = 4002.0
f(2002.0) = 4004.0
.
.
Even though the model never saw these numbers before, it can now predict the output, because it learned the pattern from the dataset.
I don't expect you to understand how Keras works or what it does, only to give you a feel for what it is like to use an ML algorithm.
I hope that answered your question and it can help you get started yourself.
Your question is too general; you need to be more specific. What do you mean by the properties of the dataset?
Nevertheless I'll try to answer what I understood from your question.
After choosing what kind of problem you have (classification or regression) you'll want to try and visualize your data to get a better sense of what you are doing.
Facets is an excellent tool to do this: https://pair-code.github.io/facets/ . It will help you better comprehend how your data is distributed and maybe give you some extra insight into how to tackle your problem, but how you use it depends on the problem you have.
You should also visualize your correlation matrix to see whether you have features that are heavily correlated and thus you can remove unnecessary features.
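For example, a short sketch of that correlation check, assuming pandas, seaborn and matplotlib are available; the toy features here are made up:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({'size': rng.normal(100, 20, 200)})
df['rooms'] = df['size'] / 25 + rng.normal(0, 0.5, 200)   # strongly correlated with 'size'
df['age'] = rng.uniform(0, 50, 200)

# Pairs with |correlation| close to 1 (here 'size' and 'rooms') are candidates for removal.
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()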
I remember that when I started working on my first machine learning project, things were overwhelming. The best tip I can give you is to find a step-by-step guide that deals with a problem similar to the one you are facing; I'm sure you'll find plenty. Also, try to clarify your question more so we can give you more insight.

Confusion about test & validation set labels in machine learning [closed]

I have a question with regards to the training and validation of a dataset.
I understand the concept of labels for training data, i.e. y_train. What I don't get is why our testing/validation samples should have labels as well.
I assume that by giving labels to the test samples, we define what they are before putting them through the algorithm, right?
Let me put it this way, if I have a dataset of pictures of dogs and cats, and I label them 1 and 2, respectively. Then if I want to throw a picture (dog) to test my model, which was not in my training dataset, why should I label it? If I label it 1, then I'm telling beforehand that it's a dog and if I label it 2, then it is a cat already.
Can I have a testing/validation dataset without label?
The validation dataset is used to fine-tune the parameters of your model, while the test set is used to check its accuracy. Without the labels, how can you claim the correctness of your model? This concept applies to supervised learning, so one needs to have labels for both the testing and validation datasets.
The purpose of a test set is, as its name implies, to test the performance of your model on data that were not seen during training. And in order to get this performance indication, you certainly need data with known labels, in order to compare these labels (ground truth) with the corresponding model predictions, and to arrive at some quantitative measure (e.g. accuracy) of your model's performance - something you certainly cannot do without these labels being available in the test set.
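A tiny sketch of that comparison, with hypothetical labels and scikit-learn assumed:
from sklearn.metrics import accuracy_score

y_test = [1, 2, 1, 1, 2]   # ground-truth labels of the test pictures (1 = dog, 2 = cat)
y_pred = [1, 2, 2, 1, 2]   # what the trained model predicted for those same pictures

# The labels are never shown to the model; they only score its predictions.
print(accuracy_score(y_test, y_pred))   # 0.8 -- impossible to compute without y_test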
if I want to throw a picture (dog) to test my model, why should I label it? If I label it 1, then I'm telling beforehand that it's a dog and if I label it 2, then it is a cat already.
You are using the term "test" very loosely here - this is not its meaning in the context of a test set (the context I just described above). Notice also that the fact that the test labels are available does not mean that they are used by the model during prediction (they are certainly not - they are only used for comparison with the model predictions, as described above). Plus, you are referring to a very specific problem where the answer (cat/dog) is obvious to a human observer - try using the same rationale e.g. in a genomics problem, or in one that requires numeric predictions for, say, house prices, and you'll see that the situation is not that simple and straightforward (could you possibly name the price of a house just by looking at a row of numbers?)...
The same applies for a validation set, only the objective here is different (i.e. not model assessment, but model tuning).
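A minimal sketch of the three splits and their different roles (scikit-learn, toy data, arbitrary split sizes):
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(-1, 1), np.arange(100) % 2

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# y_train: fits the model; y_val: tunes hyperparameters; y_test: scores the final model.
# All three splits carry labels -- only genuinely new data at deployment time does not.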
Admittedly, some people use the term "test data" to mean any unseen data in general, but this is not correct; after you have built & assessed your model using your training, validation, and test sets, you deploy it by feeding it new and obviously unseen data, for which you are certainly not expected to already know the labels...
There are literally dozens of online tutorials on the subject, and SO is arguably not the most appropriate forum for this kind of question - I just hope I have given you a good-enough first general idea...

Handling high cardinality features with supervised ratio and weight of evidence [closed]

Say a data set has a categorical feature with high cardinality, such as zip codes or cities. Encoding this feature would give hundreds of feature columns. Different approaches such as supervised_ratio and weight of evidence (WOE) seem to give better performance.
The question is, these supervised_ratio and WOE values are to be calculated on the training set, right? So I take the training set, process it, calculate the SR and WOE, update the training set with the new values, and keep the calculated values to be used on the test set as well. But what happens if the test set has zip codes which were not in the training set, so there is no SR or WOE value to be used? (Practically this is possible if the training data set does not cover all possible zip codes, or if there are only one or two records from certain zip codes, which might fall into either the training set or the test set.)
(The same will happen with the encoding approach as well.)
I am more interested in the question: are SR and/or WOE the recommended way to handle a feature with high cardinality? If so, what do we do when there are values in the test set which were not in the training set?
If not, what are the recommended ways of handling high cardinality features, and which algorithms are more robust to them? Thank you.
This is a well-known problem when applying value-wise transformations to a categorical feature. The most common workaround is to have a set of rules to translate unseen values into values known by your training set.
This can be just a single 'NA' value (or 'others', as another answer suggests), or something more elaborate (e.g. in your example, you could map unseen zip codes to the closest known one in the training set).
Another possible solution in some scenarios is to have the model refuse to make a prediction in those cases and just return an error.
For your second question, there is not really a recommended way of encoding high cardinality features (there are many methods, and some may work better than others depending on the other features, the target variable, etc.); what we can recommend is that you implement a few and experiment to see which one is more effective for your problem. You can consider the preprocessing method used as just another parameter in your learning algorithm.
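As a sketch of that workaround, with made-up data and pandas assumed:
import pandas as pd

train_df = pd.DataFrame({'zipcode': ['10001', '10001', '94105']})   # seen while computing SR/WOE
test_df = pd.DataFrame({'zipcode': ['94105', '60601']})             # '60601' was never seen

known = set(train_df['zipcode'])

# Unseen zip codes collapse to a single fallback value that has its own SR/WOE entry.
test_df['zipcode'] = test_df['zipcode'].map(lambda z: z if z in known else 'others')
print(test_df['zipcode'].tolist())   # ['94105', 'others']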
That's a great question, thanks for asking!
When approaching this kind of problem of handling a feature with high cardinality, like zip codes, I keep just the most frequent values in my training set and put all the others into a new category, "others"; then I calculate their WOE or any other metric.
If some unseen zip codes are found in the test set, they fall into the 'others' category. In general, this approach works well in practice.
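A rough sketch of this frequency-plus-WOE idea, assuming a binary target and one common WOE convention; the column names, threshold and smoothing constant are made up:
import numpy as np
import pandas as pd

def woe_map(df, cat_col, target_col, min_count=2, smoothing=0.5):
    """Collapse rare categories into 'others', then map each category to its WOE."""
    counts = df[cat_col].map(df[cat_col].value_counts())
    col = df[cat_col].where(counts >= min_count, 'others')
    stats = pd.crosstab(col, df[target_col])   # per-category counts of target 0/1
    events = stats[1] + smoothing              # assumes both classes appear in the data
    non_events = stats[0] + smoothing
    return np.log((events / events.sum()) / (non_events / non_events.sum())).to_dict()

df = pd.DataFrame({'zipcode': ['10001', '10001', '94105', '94105', '60601'],
                   'target':  [1, 0, 1, 1, 0]})
print(woe_map(df, 'zipcode', 'target'))   # keys: '10001', '94105', 'others'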
I hope this naive solution can help you!

Is Machine Learning the relationship between input & output [closed]

According to an article I read here, Machine Learning is to do with teaching a machine how to do certain tasks through 'learning' input/output relations.
What is a more accurate definition of machine learning?
Machine Learning is to do with teaching a machine how to do certain tasks through input/output relations. Is this kind of correct?
The short answer is yes, kind of. Read on.
Definition of Machine Learning
To understand what Machine Learning is let's first define the term Learning. The often quoted definition by Tom M. Mitchell (1) is as follows:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
Meaning?
This sounds quite formal; however, it just says that computers learn from the experience they are presented with in the form of data. The data to enable learning exists relative to a specific task and consists of several parameters:
T, a task to accomplish, e.g. predicting housing prices
E, some measure of experience, e.g. the prices observed
P, some measure of performance, e.g. how many prices are predicted correctly
Example: Housing prices
Once a program has learnt from these inputs, it can take a new, previously unseen experience and from that predict, in our example, the specific housing price. The housing price might be strongly correlated to say location, age and size of house or apartment, and the luxury of its interiors.
What is the result of a learning algorithm?
In its simplest form then a machine learning algorithm for housing prices might implement a multi-variate regression analysis. It takes as input a body of data that relates real, observed prices to the four features location, age, size, luxury. The process of learning produces a regression model that in essence assigns a weight to each feature, of the form
y^ = w_location * location + w_age * age + w_size * size + w_luxury * luxury
That is, the weights w_* are learned from the input data, y^ is the predicted price. The learning is considered successful once the formula is able to successfully predict housing prices based on a list of features alone. Usually a prediction is considered successful if it falls within a certain bound (%-range) of the real price.
Note that the definition of successful very much depends on the kind of task that the program must learn, however the result needs to be substantially better than a pure random guess (that is, the ratio of correct results needs to be statistically significant).
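For concreteness, a small sketch of such a multi-variate regression with scikit-learn; the feature values and prices below are invented purely for illustration:
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: location score, age (years), size (m^2), luxury score.
X = np.array([[8, 10, 120, 7],
              [3, 40,  60, 2],
              [9,  5, 200, 9],
              [5, 25,  90, 4],
              [6, 15, 100, 5]])
y = np.array([450_000, 120_000, 900_000, 230_000, 310_000])   # observed prices

model = LinearRegression().fit(X, y)
print(model.coef_)                        # the learned weights w_location, w_age, w_size, w_luxury
print(model.predict([[7, 15, 110, 6]]))   # y^ for a previously unseen house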
Is there more to it?
Yes, a lot. Some pointers can be found in this Wikipedia article. If you are keen to get into the subject, Professor Andrew Ng's Stanford lecture is quite famous, although there are many more courses if you look for them. Pick the one that best suits your interests.
References
(1): Mitchell, T. (1997). Machine Learning. McGraw Hill. ISBN 0-07-042807-7, p. 2, as referenced by Wikipedia.

Is it unsupervised learning if I don't have labeled data to train my models on

I would like to collect user information to determine whether they are male or female. I have zero labeled data for my users, but I know some features that can easily predict their gender. An example would be texts created by the users that contain words strongly associated with one gender (ex: Male: beer, football game, boxers. Female: facial, makeup, bra).
Would this be considered unsupervised learning, since I don't have labelled data to train my models on?
This is neither supervised nor unsupervised. You are just applying some predefined rules to classify between male/female.
This is also not machine learning, because you don't use any learning method...
A supervised learning method would use all of the text produced by the users and let the machine determine which words are important and by how much, by trying to guess the user's gender and then correcting itself with the label.
An unsupervised method would be to provide the machine with all the text by the users and allow it to try and create different pattern groups out of it. However, there are many more ways to group users than 'male' and 'female' so this is not exactly an ideal unsupervised problem.
Telling the system which words are important and separating the system into groups based on that would just be a regular program and can be accomplished by any programming language that can match text and provide an output.
pro·gram
noun
1.
a planned series of future events, items, or performances.
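For illustration, a minimal sketch of such a plain rule-based program in Python; the keyword lists come straight from the question and are obviously crude:
MALE_WORDS = {'beer', 'football', 'boxers'}
FEMALE_WORDS = {'facial', 'makeup', 'bra'}

def guess_gender(text):
    # Plain text matching -- no training, no labels, no learning.
    words = set(text.lower().split())
    male_hits, female_hits = len(words & MALE_WORDS), len(words & FEMALE_WORDS)
    if male_hits == female_hits:
        return 'unknown'
    return 'male' if male_hits > female_hits else 'female'

print(guess_gender('watching the football game with a beer'))   # male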

Resources