Machine Learning technique for learning string patterns - machine-learning

I am new to machine learning and I am looking for a technique to learn string patterns based on a training data set.
My problem:
I have different types of words, belonging to different categories. Each category has some kind of pattern of its own (for example, one has a fixed length and contains only special characters; another consists of characters that occur only in that category of "word").
For example:
"ABC" -> type1
"ACC" -> type1
"a8 219" -> type2
"c 827" -> type2
"ASDF 123" -> type2
...
I am searching for a machine learning technique that learns these patterns on its own, based on training data. I already tried defining some predictor variables myself (for example word length, number of special characters, ...) and then used a neural network to learn and predict the category. But that is actually not what I want: I want a technique that learns the pattern for each category on its own, even patterns I never thought of.
I want to give the algorithm the training data (consisting of the word-category examples) and have it learn patterns for each category, so that it can predict the category of similar or identical words later in production.
Is there a state-of-the-art way to do it?
Thanks for your help

Since you have tagged weka, the process would be:
1. Create the ARFF file
Example
@relation weka_mymodel_model
@attribute text string
@attribute @@class@@ {type1,type2}
@data
'boy am I stupid. I mean, wow, that was a major oversight. let\'s blame it on monday.',type1
..... all your data
2. Load the file in weka software
In the Preprocess tab you can filter (transform) the data, for example with a StringToWordVector filter whose output can be used with classifiers such as J48, but we will leave this for now and only use classifiers that can handle your input directly.
3. Classify
In the "Classify" tab, select the attribute @@class@@ and then select a classifier that can handle text directly; a good start is NaiveBayesMultinomialText.
In the classifier's options, configure your settings: Stemmer, StopWords, Tokenizer, etc.
Which classifier to use, and with which settings, depends on the data, but you can test-run the classifier "Using training set", on a "Supplied test set" or with "Cross-validation" to understand the effect of your different settings.
4. Create the model
When you are happy with your settings, export the model (right click the result>>Save model).
5. Use the model
Load the model in Java, create the Instance, pass it to the model and retrieve your result.
Conclusion
The Weka software lets you test different classifier algorithms with different settings. The best way to find the best classifier is to test-run the different classifiers (use filters, select attributes, etc.) with different settings on a "Supplied test set" and check the outcome.
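Outside of Weka, the same end-to-end idea can be sketched with scikit-learn and character n-gram features, using the example words from the question (an illustration, not part of the Weka workflow above):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training data taken from the question.
words = ["ABC", "ACC", "a8 219", "c 827", "ASDF 123"]
labels = ["type1", "type1", "type2", "type2", "type2"]

# Character n-grams let the model pick up patterns (digits, spaces,
# casing, length cues) without hand-crafted predictor variables.
clf = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3), lowercase=False),
    MultinomialNB(),
)
clf.fit(words, labels)
print(clf.predict(["XYZ", "b7 443"]))  # likely ['type1', 'type2']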

Related

Is it a bad idea to use the cluster ID from clustering text data using K-means as feature to your supervised learning model?

I am building a model that will predict the lead time of products flowing through a pipeline.
I have a lot of different features; one is a string containing a few words about the purpose of the product (often abbreviations, the name of the application it will be part of, and so forth). I have previously not used this field at all when doing feature engineering.
I was thinking that it would be nice to do some type of clustering on this data, and then use the cluster ID as a feature for my model, perhaps the lead time is correlated with the type of info present in that field.
Here was my line of thinking (sketched in code below):
1) Cleaning & tokenizing text.
2) TF-IDF
3) Clustering
But after thinking more about it, is it a bad idea? Because the clustering was based on the old data, if new words are introduced in the new data this will not be captured by the clustering algorithm, and the data should perhaps be clustered differently now. Does this mean that I would have to retrain the entire model (k-means model and then the supervised model) whenever I want to predict new data points? Are there any best practices for this?
Are there better ways of finding clusters for text data to use as features in a supervised model?
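For concreteness, the proposed pipeline could look roughly like this with scikit-learn (a sketch; the descriptions and parameter values are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy product descriptions standing in for the real pipeline data.
descriptions = [
    "billing app integration",
    "ETL job for sales dashboard",
    "billing report generator",
    "dashboard frontend widget",
]

tfidf = TfidfVectorizer(stop_words="english")             # steps 1-2: tokenize + TF-IDF
X = tfidf.fit_transform(descriptions)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)  # step 3: clustering
cluster_ids = kmeans.fit_predict(X)                       # candidate feature column
print(cluster_ids)

# For new data you would call kmeans.predict(tfidf.transform(new_texts));
# this is exactly where unseen words are silently ignored by the fitted
# vocabulary, which is the concern raised above.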
I understand the urge to first use an unsupervised clustering algorithm to see for yourself which clusters are found, and of course you can try whether such an approach helps your task.
But since you have labeled data, you can pass the product description to your model without an intermediate clustering step. Your supervised algorithm can then learn for itself if and how this feature helps with your task (of course, preprocessing such as stopword removal, cleaning, tokenizing and feature extraction still needs to be done).
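A minimal sketch of that idea, assuming scikit-learn and a regression target such as lead time (names and numbers are illustrative):

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

descriptions = ["billing app integration", "ETL job for sales dashboard",
                "billing report generator", "dashboard frontend widget"]
lead_times = [12.0, 30.0, 14.0, 25.0]  # made-up targets, in days

# The supervised model consumes the TF-IDF features directly;
# no intermediate clustering step is needed.
model = make_pipeline(TfidfVectorizer(stop_words="english"),
                      RandomForestRegressor(n_estimators=100, random_state=0))
model.fit(descriptions, lead_times)
print(model.predict(["new billing integration"]))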
Depending on your text descriptions, I could also imagine that some simple sequence embeddings could work as feature extraction. An embedding is a vector of, for example, 300 dimensions, which describes a word in such a way that "hp office printer" and "canon ink jet" end up close to each other, while "nice leather bag" ends up farther away from the other two phrases. For example, fastText word embeddings are already pre-trained for English. To get a single embedding for a sequence such as "hp office printer", one can take the average of the three word vectors (there are more ways to get an embedding for a whole sequence, for example doc2vec).
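A tiny sketch of the averaging idea (the 4-dimensional vectors below are made up; in practice they would come from pre-trained embeddings such as fastText):

import numpy as np

# Hypothetical word vectors; real ones would have ~300 dimensions.
vectors = {
    "hp":      np.array([0.1, 0.8, 0.2, 0.0]),
    "office":  np.array([0.2, 0.7, 0.1, 0.1]),
    "printer": np.array([0.1, 0.9, 0.3, 0.0]),
}

def sequence_embedding(phrase):
    # Average the word vectors of the phrase -- a simple sequence embedding.
    words = [w for w in phrase.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

print(sequence_embedding("hp office printer"))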
But in the end you need to run tests to choose your features and methods!

Difference between parameters, features and class in Machine Learning

I am a newbie in Machine learning and Natural language processing.
I always get confused about these three terms.
From my understanding:
class: The various categories our model outputs. Given the name of a person, identify whether he/she is male or female.
Let's say I am using a Naive Bayes classifier.
What would be my features and parameters?
Also, what are some aliases for the above words that are used interchangeably?
Thank you
Let's use the example of classifying the gender of a person. Your understanding about class is correct! Given an input observation, our Naive Bayes Classifier should output a category. The class is that category.
Features: Features in a Naive Bayes classifier, or any general ML classification algorithm, are the data points we choose to define our input. For the example of a person, we can't possibly input all data points about a person; instead, we pick a few features to define a person (say "Height", "Weight", and "Foot Size"). Specifically, in a Naive Bayes classifier, the key assumption we make is that these features are independent (they don't affect each other): a person's height doesn't affect their weight, which doesn't affect their foot size. This assumption may or may not be true, but in Naive Bayes we assume that it is. In the particular case of your example, where the input is just the name, features might be the frequency of letters, the number of vowels, the length of the name, or suffixes/prefixes.
Parameters: Parameters in Naive Bayes are the estimates of the true distribution of whatever we're trying to classify. For example, we could say that roughly 50% of people are male, and the distribution of male height is a Gaussian distribution with mean 5' 7" and standard deviation 3". The parameters would be the 50% estimate, the 5' 7" mean estimate, and the 3" standard deviation estimate.
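To make that concrete, here is a tiny numeric sketch using the illustrative estimates above (50% prior, mean height 5' 7" = 67 inches, standard deviation 3 inches):

import math

p_male = 0.5            # parameter: prior estimate
mu, sigma = 67.0, 3.0   # parameters: mean and std of male height, in inches

def gaussian_pdf(x, mu, sigma):
    # Density of the assumed Gaussian height distribution.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Unnormalized posterior for "male" given a height feature of 70 inches.
print(p_male * gaussian_pdf(70.0, mu, sigma))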
Aliases: Features are also referred to as attributes. I'm not aware of any common replacements for 'parameters'.
I hope that was helpful!
@txizzle explained the case of Naive Bayes well. In a more general sense:
Class: The output category of your data; these are also simply called categories. The labels on your data point to one of the classes (if it's a classification problem, of course).
Features: The characteristics that define your problem. These are also called attributes.
Parameters: The variables your algorithm is trying to tune to build an accurate model.
As an example, let us say you are trying to decide whether to admit a student to grad school based on various factors like their undergrad GPA, test scores, scores on recommendations, projects, etc. In this case, the factors mentioned above are your features/attributes, whether the student is admitted or not are your two classes, and the numbers that decide how these features combine to produce your output are your parameters. What the parameters actually represent depends on your algorithm. For a neural net, they are the weights on the synaptic links; similarly, for a regression problem, the parameters are the coefficients of your features when they are combined.
Take a simple linear classification problem:
y = 0 if 5x - 3 >= 0, else 1
Here y is the class, x is the feature, and 5 and 3 are the parameters.
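The same rule in a few lines of code (purely illustrative):

def classify(x):
    # 5 and 3 are the parameters; x is the feature; the return value is the class.
    return 0 if 5 * x - 3 >= 0 else 1

print(classify(0.2), classify(1.0))  # prints: 1 0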
I just wanted to add a definition that distinguishes between attributes and features, as these are often used interchangeably, and it may not be correct to do so. I'm quoting 'Hands-On Machine Learning with Scikit-Learn and TensorFlow':
In Machine Learning an attribute is a data type (e.g., “Mileage”), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., “Mileage = 15,000”). Many people use the words attribute and feature interchangeably, though.
I like the definition in “Hands-On Machine Learning with Scikit-Learn and TensorFlow” (by Aurélien Géron), where
ATTRIBUTE = DATA TYPE (e.g., Mileage)
FEATURE = DATA TYPE + VALUE (e.g., Mileage = 50000)
Regarding FEATURE versus PARAMETER: based on the definition in Géron's book, I used to interpret the FEATURE as the variable and the PARAMETER as the weight or coefficient, as in the model below
Y = a + b*X
X is the FEATURE
a, b are the PARAMETERS
However, in some publications I have seen the following interpretation:
X is the PARAMETER
a, b are the WEIGHTS
So, lately, I’ve begun to use the following definitions:
FEATURE = variables of the RAW DATA (e.g., all columns in the spreadsheet)
PARAMETER = variables used in the MODEL (i.e., after selecting the features that will be in the model)
WEIGHT = coefficients of the parameters of the MODEL
Thoughts?
Let's see if this works :)
Imagine you have an Excel spreadsheet with data about a specific product and the presence of 7 atomic elements in it.
[product] [calcium] [magnesium] [zinc] [iron] [potassium] [nitrogen] [carbon]
Features - every column except the product, because all the other columns are independent, coexist, and have a measurable impact on the target, i.e. the product. You can even choose to combine some of them into, say, "essential elements", i.e. a dimension reduction to make the data more appropriate for analysis. (The term "dimension reduction" is used loosely here for explanation, not to be confused with the PCA technique in unsupervised learning.) Features are relevant for supervised learning techniques.
Now, imagine a cool machine that has the capability of looking at the data above and inferring what the product is.
Parameters are like levers and stopcocks specific to that machine, which you can adjust so that when the machine says "it's soap scum", it really/truly is. If you think of yourself doing dart-board practice, what are the things you would adjust to get closer to the bullseye (balancing bias/variance)?
Hyperparameters are like parameters, BUT external to the machine we're talking about. What if the machine's parts/mechanical elements were made of a specific compound, e.g. carbon fibre or a magnesium poly-alloy? How would this change what the machine can and can't do?
I suppose it's an oversimplification of what things are, but hopefully acceptable?
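In scikit-learn terms the distinction looks like this (a sketch, not part of the analogy above): hyperparameters are constructor arguments you set from the outside, while parameters are what fit() learns from the data.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression(C=1.0)  # C is a hyperparameter, chosen externally
model.fit(X, y)

# coef_ and intercept_ are the parameters learned during training.
print(model.coef_, model.intercept_)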

One class SVM to detect outliers

My problem is
I want to build a one-class SVM classifier to identify the nouns/aspects in a test file.
The training file has a list of nouns. The test file has a list of words.
This is what I've done:
I'm using the Weka GUI and I've trained a one-class SVM (libSVM) to get a model.
Now the model classifies as nouns those words in the test file that match nouns seen in the generated model; others are classified as outliers. (So it is just working like a lookup: if a word was identified as a noun in the trained model, then 'yes', else 'no'.)
So how do I build a proper classifier? (I mean: what format should the input have, and what information should it contain?)
Note:
I don't give negative examples in the training file since it is one-class.
My input format is arff
Format of training file is a set of word,yes
Format of test file is a set of word,?
EDIT
My test file will have noun phrases, so my classifier's job is to pick out the noun words from the candidates in the test file.
Your data is not formatted appropriately for this problem.
If you put
word,class
pairs into a SVM, what you are really putting into the SVM are sparse vectors that consist of a single one, corresponding to your word, i.e.
0,0,0,0,0,...,0,0,1,0,0,0,...,0,0,0,0,yes
Anything a classifier can do on such data is overfit and memorize. On unknown new words, the result will be useless.
If you want your classifier to be able to abstract and generalize, then you need to carefully extract features from your words.
Possible features would be n-grams. So the word "example" could be represented as
exa:1, xam:1, amp:1, mpl:1, ple:1
Now your classifier/SVM could learn that having the n-gram "ple" is typical for nouns.
Results will likely be better if you add "beginning-of-word" and "end-of-word" symbols,
^ex:1, exa:1, xam:1, amp:1, mpl:1, ple:1, le$:1
and maybe also use more than one n-gram length, e.g.
^ex:1, ^exa:1, exa:1, exam:1, xam:1, xamp:1, amp:1, ampl:1, mpl:1, mple:1, ple:1, ple$:1, le$:1
but of course, the more you add, the larger your data set and search space grow, which again may lead to overfitting.
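If you do not want to write the n-gram extraction yourself, scikit-learn's CountVectorizer can produce such character n-grams (a sketch; 'char_wb' pads words with spaces, which plays roughly the role of the ^ and $ boundary symbols above):

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(3, 4) mixes two n-gram lengths, as suggested above.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 4))
X = vectorizer.fit_transform(["example", "printer"])
print(vectorizer.get_feature_names_out())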

Classification in Weka fails, caused by case sensitivity of nominal values?

I made a classifier to classify search queries into one of the following classes: {Artist, Actor, Politician, Athlete, Facility, Geo, Definition, QA}. I have two CSV files: one for training the classifier (containing 300 queries) and one for testing it (currently containing about 200 queries). When I use the training set and test set for training/evaluating the classifier with the Weka KnowledgeFlow, most classes reach pretty good accuracy.
After training I saved the MultilayerPerceptron classifier from the KnowledgeFlow into classifier.model, which I used in Java code to classify queries.
When I deserialize this model in Java code and use it to classify all the queries of the test-set CSV file (using the distributionForInstance() method on the deserialized classifier), it classifies all 'Geo' queries as 'Facility' queries and all 'QA' queries as 'Definition' queries. This surprised me a bit, as the ClassifierPerformanceEvaluator in the KnowledgeFlow showed me a confusion matrix in which 'Geo' and 'QA' queries scored really well, and the test queries are the same (the same CSV file was used). All other query classifications using the distributionForInstance() method seem to work normally and show the behavior that could be expected from the confusion matrix in the KnowledgeFlow. Does anyone know possible causes for the classification difference between the distributionForInstance() method in the Java code and the KnowledgeFlow evaluation results?
One thing that I can think of is the following:
The testing CSV file contains, among other attributes, a lot of nominal attributes whose values are in all-capital casing. When I print out the values of all attributes of the instances before classification in the Java code, these values seem to have been converted to lowercase (the DataSource.getDataSet() method appears to behave like this). Could the casing of these attributes be the reason that some instances of my testing CSV file get classified differently? I read in the Weka documentation that nominal attribute values are case sensitive. I do change these values back to uppercase in the Java code, though, as Weka otherwise throws an exception that the values are not predefined for the nominal attribute.
Weka is likely using the same class in the KnowledgeFlow as in your Java code to interpret the CSV. This is why it works (producing data sets, i.e. Instances objects, that match) without tweaking, and fails when you change things: the items don't match any more. That is to say, Weka is handling the case of the input strings consistently and does not require you to change it.
Check that you are looking at the Error on Test Data value and not the Error on Training Data value in the knowledge flow output, because the second one will be artificially high given that you built the model using those exact examples. It is possible that your classifier is performing the same in both places, but you are looking at different statistics.

NLP text tagging

I am a newbie in NLP, just doing it for the first time.
I am trying to solve a problem.
My problem is that I have some documents which are manually tagged like this:
doc1 - categoryA, categoryB
doc2 - categoryA, categoryC
doc3 - categoryE, categoryF, categoryG
...
docN - categoryX
Here I have a fixed set of categories and any document can have any number of tags associated with it.
I want to train the classifier using this input, so that this tagging process can be automated.
Thanks
What you are trying to do is called multi-way supervised text categorization (or classification). Knowing the right question to ask is half the problem.
As for how this can be done, here are two references:
RCV1: A New Benchmark Collection for Text Categorization Research
Improved Nearest Neighbor Methods for Text Classification with Language Modeling and Harmonic Functions
Most classifiers work on the bag-of-words model. There are multiple approaches you can try to get the expected result.
Start with the most general option, the multinomial naive Bayes classifier, vary its input parameters, and check the results.
Try variants of naive Bayes (http://scikit-learn.org/0.11/modules/naive_bayes.html).
You can also look into sentence classification that takes sentence structure into account. Using n-gram concepts, you can try 2-, 3-, 4- and 5-gram models and check how the results vary. CountVectorizer supports n-grams; see this link for an example: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
Given the characteristics of your dataset, no single classifier will be best for your scenario; you have to test the different approaches and find which fits best for you.
The most straightforward initial approach is to get started with a simple classifier using scikit-learn (see the sketch after this list):
Use each category as a training class and train the classifier on these classes.
For any input docX, run the trained model.
You will get a probability for each category.
Now apply a threshold, for example on the probability differences between the three highest-scoring categories; the categories that pass the threshold become the result for that input document.
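A minimal scikit-learn sketch of that recipe (the data, the threshold value and the duplicate-per-tag training trick are illustrative assumptions):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A document with several tags simply appears once per tag.
texts = ["jaguar engine specs", "jaguar habitat range", "engine oil change"]
labels = ["cars", "animals", "cars"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(texts, labels)

probs = clf.predict_proba(["jaguar top speed"])[0]
threshold = 0.30  # illustrative; tune on held-out data
print([c for c, p in zip(clf.classes_, probs) if p >= threshold])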
It's not clear what you have tried or what programming language you are using, but as most have suggested, try text classification with document vectors / bag of words (as long as there are words in the documents that can help with classification).
Here are some simple tools that can help get you started
Weka http://www.cs.waikato.ac.nz/ml/weka/ (GUI & Java)
NLTK http://www.nltk.org (Python)
Mallet http://mallet.cs.umass.edu/ (command line & Java)
NUML http://numl.net/ (C#)
