Combine N-gram model with Bag of Word approach - machine-learning

I'm trying to classify offensive and non-offensive sentences using N-gram model with a set of training data. I would like to know if there are any approach to add bad word dictionary to the n-gram model.
I'm very new to machine learning and recently started this project. I want to know what kind of the methods are available for this problem if the above approach is not possible.
Thanks in Advance.

Related

Is it a bad idea to use the cluster ID from clustering text data using K-means as feature to your supervised learning model?

I am building a model that will predict the lead time of products flowing through a pipeline.
I have a lot of different features, one is a string containing a few words about the purpose of the product (often abbreviations, name of the application it will be a part of and so forth). I have previously not used this field at all when doing feature engineering.
I was thinking that it would be nice to do some type of clustering on this data, and then use the cluster ID as a feature for my model, perhaps the lead time is correlated with the type of info present in that field.
Here was my line of thinking)
1) Cleaning & tokenizing text.
2) TF-IDF
3) Clustering
But after thinking more about it, is it a bad idea? Because the clustering was based on the old data, if new words are introduced in the new data this will not be captured by the clustering algorithm, and the data should perhaps be clustered differently now. Does this mean that I would have to retrain the entire model (k-means model and then the supervised model) whenever I want to predict new data points? Are there any best practices for this?
Are there better ways of finding clusters for text data to use as features in a supervised model?
I understand the urge to use an unsupervised clustering algorithm first to see for yourself, which clusters were found. And of course you can try if such a way helps your task.
But as you have labeled data, you can pass the product description without an intermediate clustering. Your supervised algorithm shall then learn for itself if and how this feature helps in your task (of course preprocessing such as removal of stopwords, cleaining, tokenizing and feature extraction needs to be done).
Depending of your text descriptions, I could also imagine that some simple sequence embeddings could work as feature-extraction. An embedding is a vector of for example 300 dimensions, which describes the words in a manner that hp office printer and canon ink jet shall be close to each other but nice leatherbag shall be farer away from the other to phrases. For example fasText-Word-Embeddings are already trained in english. To get a single embedding for a sequence of hp office printerone can take the average-vector of the three vectors (there are more ways to get an embedding for a whole sequence, for example doc2vec).
But in the end you need to run tests to choose your features and methods!

Using Keras to create a model that can generate new, similar data

I am working with Keras and experimenting with AI and Machine Learning. I have a few projects made already and now I'm looking to replicate a dataset. What direction do I go to learn this? What should I be looking up to begin learning about this model? I just need an expert to point me in the right direction.
To clarify; by replicating a dataset I mean I want to take a series of numbers with an easily distinguishable pattern and then have the AI generate new data that is similar.
There are several ways to generate new data similar to a current dataset, but the most prominent way nowadays is to use a Generative Adversarial Network (GAN). This works by pitting two models against one another. The generator model attempts to generate data, and the discriminator model attempts to tell the difference between real data and generated data. There are plenty of tutorials out there on how to do this, though most of them are probably based on image data.
If you want to generate labels as well, make a conditional GAN.
The only other common method for generating data is a Variational Autoencoder (VAE), but the generated data tend to be lower-quality than what a GAN can generate. I don't know if that holds true for non-image data, though.
You can also use Conditional Variational Autoencoder which produces new data with label.

Incorporating feedback to retrain WordToVec for finding document similarity

I have trained Gensim's WordToVec on a text corpus,converted it to DocToVec and then used cosine similarity to find the similarity between documents. I need to suggest similar documents. Now suppose among the top 5 suggestions for a particular document, we manually find that 3 of them are not similar.Can this feedback be incorporated in retraining the model?
It's not quite clear what you mean by "converted [a Word2Vec model] to DocToVec". The gensim Doc2Vec class doesn't use or require a Word2Vec model as input.
But, if you have many sets of hand-curated "this is a good suggestion" or "this is a bad suggestion" pairs for your corpus, you can use the model's scoring against all those to compare models, and train many variant models (with different model parameter values like size, window, min_count, sample, etc), picking the one that scores best on your tests.
That sort of automated-parameter-search is the most straightforward way to use performance on real evaluation data to adjust an unsupervised model like Word2Vec.
(Depending on the specifics of your data and problem-domain, you might also start to notice patterns in where the model is better or worse, that help you hand-tune parts of the data preprocessing. For example, a different handling of capitalization or tokenization might be suggested by error cases.)

text2vec - Do topics' words update with new data?

I'm currently performing a topic modelling using LDA from text2vec package. I managed to create a dtm matrix and then apply LDA and its fit_transform method with n_topics=50.
While looking at the top words from each topic, a question popped into my mind. I plan to apply the model to new data afterwards and there's a possibility of occurence of new words, which were not encountered by the model before. Will the model still be able to assign each word to its respective topic? Moreover, will these words also be added to the topic, so that I will be able to locate them using get_top_words?
Thank you for answering!
Idea of statistical learning is that underlying distributions of "train" data and "test" data are more or less the same. So if your new documents contains totally different distribution you can't expect LDA will magically work. This is true for any other model.
During inference time topic-word distribution is fixed (it was learned at training stage). So get_top_words will always return same words after model trained.
And of course new words won't be included automatically - Document-Term matrix constructed from a vocabulary (which you learn before construction of DTM) and new documents will also contain only words from fixed vocabulary.

metric learning for information retrieval in semi-structured text?

I am interested in parsing semi-structured text. Assuming I have a text with labels of the kind: year_field, year_value, identity_field, identity_value, ..., address_field, address_value and so on.
These fields and their associated values can be everywhere in the text, but usually they are near to each other, and more generally the text in organized in a (very) rough matrix, but rather often the value is just after the associated field with eventually some non-interesting information in between.
The number of different format can be up to several dozens, and is not that rigid (do not count on spacing, moreover some information can be added and removed).
I am looking toward machine learning techniques to extract all those (field,value) of interest.
I think metric learning and/or conditional random fields (CRF) could be of a great help, but I have not practical experience with them.
Does anyone have already encounter a similar problem?
Any suggestion or literature on this topic?
Your task, if I understand correctly, is to extract all pre-defined entities from a text. What you describe here is exactly named entity recognition.
Stanford has a Stanford Named Entity Recognizer that you can download and use (python/java and more)
Regarding the models you considers (CRF for example) - the hard thing here is to get the training data - sentences with the entities already labeled. This is why you should consider getting a trained model, or use someone else's data to train your model (again, the model will recognize only entities it saw in the training part)
A great choice for already train model in python is nltk's Information Extraction module.
Hope this sums it up

Resources