training fasttext models with social generated content - machine-learning

I am currently learning about text classification using Facebook FastText. I have found some data from Kaggle that contains characters such as �� or twitter username and hashtags. I tried searching the web however there is no clarification of how you really need to clean/pre-process your text before training a model.
In some blogs I've seen authors writing about tokenisation however its not mentioned in fasttext. Another point it that fasttext git has examples of clean data, such as stackoverflow but nothing for twitter or such platform.
Question is, what is the best practice to pre-process user(social) generated content before training a model? What needs to be redacted?
Thanks

Since the FastText-Classifier does not work with pretrained embeddings, you can pretty much choose your own way how to clean your data. I would suggest you:
convert everything to lower case (or upper case if you want, it shouldn't matter).
And I would remove special characters beside # and #.
Everything else is up to you. You can decide to keep hashtags, or to remove them, the same is true for usernames. I would probably remove usernames because I guess there isn't a lot information in them. But in some cases it could be informative: Think about tweets about and answers to Donald Trump, his username is often used I guess. Just try what works best for your case. FastText is super fast, so a few experiments won't be much of a problem.

Related

Is there any model/classifier that works best for NLP based projects like this?

I've written a program to analyze a given piece of text from a website and make conclusory classifications as to its validity. The code basically vectorizes the description (taken from the HTML of a given webpage in real-time) and takes in a few inputs from that as features to make its decisions. There are some more features like the domain of the website and some keywords I've explicitly counted.
The highest accuracy I've been able to achieve is with a RandomForestClassifier, (>90%). I'm not sure what I can do to make this accuracy better except incorporating a more sophisticated model. I tried using an MLP but for no set of hyperparameters does it seem to exceed the previous accuracy. I have around 2000 data points available for training.
Is there any classifier that works best for such projects? Does anyone have any suggestions as to how I can bring about improvements? (If anything needs to be elaborated, I'll do so.)
Any suggestions on how I can improve on this project in general? Should I include the text on a webpage as well? How should I do so? I tried going through a few sites, but the next doesn't seem to be contained in any specific element whereas the description is easy to obtain from the HTML. Any help?
What else can I take as features? If anyone could suggest any creative ideas, I'd really appreciate it.
You can search with keyword NLP. The task you are facing is a hot topic among those study deep learning, and is called natural language processing.
RandomForest is a machine learning algorithm, and probably works quite well. Using other machine learning algorithms might improve your accuracy, or maybe not. If you want to try out other machine learning algorithms that are light, it's fine.
Deep Learning most likely will outperform your current model, and starting with keyword NLP, you'll find out many models, hopefully Word2Vec, Bert, and so on. You can find out all the codes on github.
One tip for you, is to think carefully whether you can train the model or not. Trying to train BERT from scratch is a crazy thing to do for a starter, even for an expert. Try to bring pretrained model and finetune it, or just bring the word vectors.
I hope that this works out.

What's a good ML model/technique for breaking down large documents/text/HTML into segments?

I'd like to break down HTML documents into small chunks of information. With sources like Wikipedia articles (as an example) this is reasonably easy to do, without machine learning, because the content is structured in a highly predictable way.
When working with something like a converted Word Doc or a Blog post, the HTML is a bit more unpredictable. For example, sometimes there are no DIVs, more than one H1 in a document, or no Headers at all etc.
I'm trying to figure out a decent/reliable way of automatically putting content breaks into my content, in order to break it down into chunks of an acceptable size.
I've had a little dig around for existing trained models for this application but I couldn't find anything off-the-shelf. I've considered training my own model, but I'm not confident of the best way to structure the training data. One option I've considered in relation to training data is providing a sample of where section breaks are numerically likely to exist within a document but I don't think that's the best possible approach...
How would you approach this problem?
P.s. I'm currently using Tensorflow but happy to go down a different path.
I've found the GROBID library quite robust for different input documents (since it's based on ML models trained on a large variety of documents). The standard model parses input PDF documents into structured XML/TEI encoded files, which are much easier to deal with. https://grobid.readthedocs.io/en/latest/Introduction/
If your inputs are HTML documents the library also offers the possibility to train your own models. Have a look at: https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/

Extracting information from web-pages using NER

My task is to extract information from a various web-pages of a particular site. Now, the information to be extracted can be of the form as product name, product id, price, etc. The information is given in text using natural language. Also, I have been asked to extract that information using some Machine Learning algorithm. I thought of using NER (Named Entity Recognition) and training it on custom training data (which I can prepare using the scraped data and manually labeling the integers/data as required). I wanted to know if the model can even work this way?
Also, let me know if I can improve this question further.
You say a particular site. I am assuming that it means you have some fair idea of what the structures of webpages are, if the data is in table form or a free text form, how the website generally looks. In this case, a simple regex (prices, ids etc) supported by some POS tagger to extract product names and all is enough for you. A supervised approach is definitely an overkill and might underperform than the simpler regex.

Twitter data topical classification

So I have a data set which consists of tweets from various news organizations. I've loaded it into RapidMiner, tokenized it, and produced some n-grams of it. Now I want to be able to have RapidMiner automatically classify my data into various categories based on the topic of the tweets.
I'm pretty sure RapidMiner can do this, but according to the research I've done into it, I need a training data set to be able to show RapidMiner how I want things classified. So I need a training data set, though given the categories I wanted to classify things into, I might have to create my own.
So my questions are these:
1) Is there a training data set for twitter data that focuses more on the topic of the tweet as opposed to a sentiment analysis publicly available?
2) If there isn't one publicly available, how can I create my own? My idea to do it was to go through the tweets themselves and associate the tokens and n-grams with the categories I want. Some concerns I have with that are that I won't be able to manually classify enough tweets to create a training data set comprehensive enough so that I can get a good accuracy rate for the automatic classifier.
3) Any general advice for topical classification of text data would be great. This is the first time that I've done a project like this, and I'm sure there are things I could improve on. :)
There may be training corpora that work for you, but you need to say what your topic or categories are to identify it. The fact that this is Twitter may be relevant, but the data source is likely to be much less relevant to the classification accuracy you will achieve than the topic is. So if you take the infamous 20 newsgroups data set this is likely to work on Twitter as well, but only if the categories you are after are the 20 categories from that data set. If you want to classify cats vs dogs or Android vs iPhone you need to find a data set for that.
In most cases you will have to create initial labels manually, which is, as you say, a lot of work. One workaround might be to start with something simpler like a keyword search to create subsets of your tweets for which you know they deal with a particular category. Then you create the model on top of that and hope that it generalizes to identify the same categories even though the original keywords do not occur.
Alternatively, depending on your application (and if you actually want to build an applicaion), you may as well start with only a small data set and accept that you have poor classification. Then you generate classifications, show them to the users of your apps, and collect some form of explicit or implicit feedback on the classification (e.g. users can flag tweets as incorrectly classified). This way you improve your training corpus and periodically update your model.
Finally, if you do not know what your topics are and you want RapidMiner to identify the topics, you may want to try clustering as opposed to classification. Just create a few clusters and look at the top words for each cluster. They may well be quite dissimilar and describe what the respective clusters are about.
I believe your third question may be a bit broad for stackoverflow and is probably better answered by a text book.

Parsing nonuniform data

I am trying to parse a collection of data that has two (or one) useful pieces, but may be organized in many different ways:
V01C01
Vol 1 Chapter 1
Chapter 1 Volume 1 - Alt title
V1.1
etc.
I don't want to use a massive collection of regexs, because there is no way to predict all of the combinations of how things will be organized (also some will have extraneous text). I feel like there is a branch of machine learning that may be perfect for this, but I'm not experienced in it enough to know.
Well that is an interesting problem for sure and there are a couple of things you could try.
Making the assumption that you don't have labels on your data, then the first thing I would try to do, is to check the connections between each instance using a clustering algorithm like k-means (http://en.wikipedia.org/wiki/K-means_clustering), keep in mind that this wouldn't solve your problem but would help you to explore your data and hopefully find a set of features to train a supervised learning classifier.
In the case that you do have labels on your data, or you could manually tag your set. Then you are in front a more manageable problem. At first glance, it would look a lot like a text or document classification problem (like classify emails as Spam/NoSpam), in which case a naive bayes classifier could be a good first attempt to attack the problem since is a easy algorithm to implement and can provide reasonable good results.
About Naives Bayes Classifier (https://www.bionicspirit.com/blog/2012/02/09/howto-build-naive-bayes-classifier.html)
I made some assumptions here and I might be wrong based on that. Maybe if you clarify some points (like if you are able to manually tag the data) we would be able to help you further.

Resources