text2vec - Do topics' words update with new data? - text2vec

I'm currently performing a topic modelling using LDA from text2vec package. I managed to create a dtm matrix and then apply LDA and its fit_transform method with n_topics=50.
While looking at the top words from each topic, a question popped into my mind. I plan to apply the model to new data afterwards and there's a possibility of occurence of new words, which were not encountered by the model before. Will the model still be able to assign each word to its respective topic? Moreover, will these words also be added to the topic, so that I will be able to locate them using get_top_words?
Thank you for answering!

Idea of statistical learning is that underlying distributions of "train" data and "test" data are more or less the same. So if your new documents contains totally different distribution you can't expect LDA will magically work. This is true for any other model.
During inference time topic-word distribution is fixed (it was learned at training stage). So get_top_words will always return same words after model trained.
And of course new words won't be included automatically - Document-Term matrix constructed from a vocabulary (which you learn before construction of DTM) and new documents will also contain only words from fixed vocabulary.

Related

Using Keras to create a model that can generate new, similar data

I am working with Keras and experimenting with AI and Machine Learning. I have a few projects made already and now I'm looking to replicate a dataset. What direction do I go to learn this? What should I be looking up to begin learning about this model? I just need an expert to point me in the right direction.
To clarify; by replicating a dataset I mean I want to take a series of numbers with an easily distinguishable pattern and then have the AI generate new data that is similar.
There are several ways to generate new data similar to a current dataset, but the most prominent way nowadays is to use a Generative Adversarial Network (GAN). This works by pitting two models against one another. The generator model attempts to generate data, and the discriminator model attempts to tell the difference between real data and generated data. There are plenty of tutorials out there on how to do this, though most of them are probably based on image data.
If you want to generate labels as well, make a conditional GAN.
The only other common method for generating data is a Variational Autoencoder (VAE), but the generated data tend to be lower-quality than what a GAN can generate. I don't know if that holds true for non-image data, though.
You can also use Conditional Variational Autoencoder which produces new data with label.

Can doc2vec be used if my text data is incrementally increasing?

I am new to Doc2vec use. In case I could get some advice before I start on it, it will save a LOT of time.
My data is an stream of text data (such as tweets) continuously coming in time. For clustering these tweets, I was thinking of using doc2vec to reduce the text content into a fixed size vector and use that to compare between documents.
So in this case, the text data is getting accumulated over time, can this be still used with Doc2Vec, I may have to learn the model again and again (may be!) or could I use some large corpus such as Wikipedia or a large newscorpus to train the Doc2Vec model.
Any suggestions will help!
Thanks in Advance.
The gensim Doc2Vec class does not support adjusting the model with new documents, but it can 'infer' and report a vector for new documents, based on the model learned from an earlier bulk training.
So, you can use that new inferred vector to compare the new document to older ones, or feed it to a trained classifier, etc.
If new documents continue to arrive, and especially if the balance of topics/meaning in your documents drifts over time, you would likely at some point want to discard a model based on older data, and create a new model based on your larger (or more recent) data.
(Note that vectors from the old model and new model won't be directly comparable. Training sessions involve a lot of randomness, and the meanings of dimensions/directions in any one model are somewhat arbitrary. It's the relative positions of vectors, from within the same model, that has some interpretive power.)

metric learning for information retrieval in semi-structured text?

I am interested in parsing semi-structured text. Assuming I have a text with labels of the kind: year_field, year_value, identity_field, identity_value, ..., address_field, address_value and so on.
These fields and their associated values can be everywhere in the text, but usually they are near to each other, and more generally the text in organized in a (very) rough matrix, but rather often the value is just after the associated field with eventually some non-interesting information in between.
The number of different format can be up to several dozens, and is not that rigid (do not count on spacing, moreover some information can be added and removed).
I am looking toward machine learning techniques to extract all those (field,value) of interest.
I think metric learning and/or conditional random fields (CRF) could be of a great help, but I have not practical experience with them.
Does anyone have already encounter a similar problem?
Any suggestion or literature on this topic?
Your task, if I understand correctly, is to extract all pre-defined entities from a text. What you describe here is exactly named entity recognition.
Stanford has a Stanford Named Entity Recognizer that you can download and use (python/java and more)
Regarding the models you considers (CRF for example) - the hard thing here is to get the training data - sentences with the entities already labeled. This is why you should consider getting a trained model, or use someone else's data to train your model (again, the model will recognize only entities it saw in the training part)
A great choice for already train model in python is nltk's Information Extraction module.
Hope this sums it up

Already having a decision tree model with binary class, how can I get a probability when i test a new instance?

I hava construct a decision tree model for a binary classification problem. What is bothering me is that when i have a new test instance, how can i get the probability or score which it belongs to.(not the specific classify result)
A simple way can be to use the frequencies attached to the leaves, but this frequentist approach suffers from issues related to data quantities, so you can smooth those estimates in various ways.
Also, have a look at this question about C4.5.

A few implementation details for a Support-Vector Machine (SVM)

In a particular application I was in need of machine learning (I know the things I studied in my undergraduate course). I used Support Vector Machines and got the problem solved. Its working fine.
Now I need to improve the system. Problems here are
I get additional training examples every week. Right now the system starts training freshly with updated examples (old examples + new examples). I want to make it incremental learning. Using previous knowledge (instead of previous examples) with new examples to get new model (knowledge)
Right my training examples has 3 classes. So, every training example is fitted into one of these 3 classes. I want functionality of "Unknown" class. Anything that doesn't fit these 3 classes must be marked as "unknown". But I can't treat "Unknown" as a new class and provide examples for this too.
Assuming, the "unknown" class is implemented. When class is "unknown" the user of the application inputs the what he thinks the class might be. Now, I need to incorporate the user input into the learning. I've no idea about how to do this too. Would it make any difference if the user inputs a new class (i.e.. a class that is not already in the training set)?
Do I need to choose a new algorithm or Support Vector Machines can do this?
PS: I'm using libsvm implementation for SVM.
I just wrote my Answer using the same organization as your Question (1., 2., 3).
Can SVMs do this--i.e., incremental learning? Multi-Layer Perceptrons of course can--because the subsequent training instances don't affect the basic network architecture, they'll just cause adjustment in the values of the weight matrices. But SVMs? It seems to me that (in theory) one additional training instance could change the selection of the support vectors. But again, i don't know.
I think you can solve this problem quite easily by configuring LIBSVM in one-against-many--i.e., as a one-class classifier. SVMs are one-class classifiers; application of an SVM for multi-class means that it has been coded to perform multiple, step-wise one-against-many classifications, but again the algorithm is trained (and tested) one class at a time. If you do this, then what's left after step-wise execution against the test set, is "unknown"--in other words, whatever data is not classified after performing multiple, sequential one-class classifications, is by definition in that 'unknown' class.
Why not make the user's guess a feature (i.e., just another dependent variable)? The only other option is to make it the class label itself, and you don't want that. So you would, for instance, add a column to your data matrix "user class guess", and just populate it with some value most likely to have no effect for those data points not in the 'unknown' category and therefore for which the user will not offer a guess--this value could be '0' or '1', but really it depends on how you have your data scaled and normalized).
Your first item will likely be the most difficult, since there are essentially no good incremental SVM implementations in existence.
A few months ago, I also researched online or incremental SVM algorithms. Unfortunately, the current state of implementations is quite sparse. All I found was a Matlab example, OnlineSVR (a thesis project only implementing regression support), and SVMHeavy (only binary class support).
I haven't used any of them personally. They all appear to be at the "research toy" stage. I couldn't even get SVMHeavy to compile.
For now, you can probably get away with doing periodic batch training to incorporate updates. I also use LibSVM, and it's quite fast, so it sould be a good substitute until a proper incremental version is implemented.
I also don't think SVM's can model the concept of an "unknown" sample by default. They typically work as a series of boolean classifiers, so a sample ends up as positively being classified as something, even if that sample is drastically different from anything seen previously. A possible workaround would be to model the ranges of your features, and randomly generate samples that exist outside of these ranges, and then add these to your training set.
For example, if you have an attribute called "color", which has a minimum value of 4 and a maximum value of 123, then you could add these to your training set
[({'color':3},'unknown'),({'color':125},'unknown')]
to give your SVM an idea of what an "unknown" color means.
There are algorithms to train an SVM incrementally, but I don't think libSVM implements this. I think you should consider whether you really need this feature. I see no problem with your current approach, unless the training process is really too slow. If it is, could you retrain in batches (i.e. after every 100 new examples)?
You can get libSVM to produce probabilities of class membership. I think this can be done for multiclass classification, but I'm not entirely sure about that. You will need to decide some threshold at which the classification is not certain enough and then output 'Unknown'. I suppose something like setting a threshold on the difference between the most likely and second most likely class would achieve this.
I think libSVM scales to any number of new classes. The accuracy of your model may well suffer by adding new classes, however.
Even though this question is probably out of date, I feel obliged to give some additional thoughts.
Since your first question has been answered by others (there is no production-ready SVM which implements incremental learning, even though it is possible), I will skip it. ;)
Adding 'Unknown' as a class is not a good idea. Depending on it's use, the reasons are different.
If you are using the 'Unknown' class as a tag for "this instance has not been classified, but belongs to one of the known classes", then your SVM is in deep trouble. The reason is, that libsvm builds several binary classifiers and combines them. So if you have three classes - let's say A, B and C - the SVM builds the first binary classifier by splitting the training examples into "classified as A" and "any other class". The latter will obviously contain all examples from the 'Unknown' class. When trying to build a hyperplane, examples in 'Unknown' (which really belong to the class 'A') will probably cause the SVM to build a hyperplane with a very small margin and will poorly recognizes future instances of A, i.e. it's generalization performance will diminish. That's due to the fact, that the SVM will try to build a hyperplane which separates most instances of A (those officially labeled as 'A') onto one side of the hyperplane and some instances (those officially labeled as 'Unknown') on the other side .
Another problem occurs if you are using the 'Unknown' class to store all examples, whose class is not yet known to the SVM. For example, the SVM knows the classes A, B and C, but you recently got example data for two new classes D and E. Since these examples are not classified and the new classes not known to the SVM, you may want to temporarily store them in 'Unknown'. In that case the 'Unknown' class may cause trouble, since it possibly contains examples with enormous variation in the values of it's features. That will make it very hard to create good separating hyperplanes and therefore the resulting classifier will poorly recognize new instances of D or E as 'Unknown'. Probably the classification of new instances belonging to A, B or C will be hindered as well.
To sum up: Introducing an 'Unknown' class which contains examples of known classes or examples of several new classes will result in a poor classifier. I think it's best to ignore all unclassified instances when training the classifier.
I would recommend, that you solve this issue outside the classification algorithm. I was asked for this feature myself and implemented a single webpage, which shows an image of the object in question and a button for each known class. If the object in question belongs to a class which is not known yet, the user can fill out another form to add a new class. If he goes back to the classification page, another button for that class will magically appear. After the instances have been classified, they can be used for training the classifier. (I used a database to store the known classes and reference which example belongs to which class. I implemented an export function to make the data SVM-ready.)

Resources