Using GANs to generate data for sparse datasets

I was recently introduced to the applications of ML in cybersecurity, and I am interested in working on an application of GANs to generate data for sparse datasets
(Something like this https://becominghuman.ai/deep-learning-can-tackle-the-problem-of-sparse-data-by-augmenting-available-data-e9a4e0f1fc92)
However, I am not aware of what sort of datasets can be used for this purpose. Could someone point me to a few example datasets I can train a GAN on to generate data? Are text datasets any good for GAN-based generation?
My objective here is simply to understand how the whole process works. Any help would be appreciated.

Have you come across this repository? I guess it contains datasets and more!

Related

Is there any model/classifier that works best for NLP-based projects like this?

I've written a program that analyzes a given piece of text from a website and classifies its validity. The code vectorizes the description (taken from the HTML of a given webpage in real time) and uses a few of the resulting values as features for its decisions. There are some additional features, such as the domain of the website and counts of some keywords I've explicitly chosen.
The highest accuracy I've been able to achieve is with a RandomForestClassifier (>90%). I'm not sure what I can do to improve this accuracy other than incorporating a more sophisticated model. I tried an MLP, but no set of hyperparameters seems to exceed the previous accuracy. I have around 2000 data points available for training.
Is there any classifier that works best for such projects? Does anyone have any suggestions as to how I can bring about improvements? (If anything needs to be elaborated, I'll do so.)
Any suggestions on how I can improve this project in general? Should I include the text on a webpage as well? How should I do so? I tried going through a few sites, but the text doesn't seem to be contained in any specific element, whereas the description is easy to obtain from the HTML. Any help?
What else can I take as features? If anyone could suggest any creative ideas, I'd really appreciate it.
You can search with the keyword NLP. The task you are facing is a hot topic among those who study deep learning; it is called natural language processing.
RandomForest is a classical machine learning algorithm and probably works quite well. Using other classical algorithms might improve your accuracy, or it might not; if you want to try out other lightweight algorithms, that's fine.
Deep learning will most likely outperform your current model. Starting from the keyword NLP, you'll find many models, such as Word2Vec, BERT, and so on. You can find code for all of them on GitHub.
One tip: think carefully about whether you can actually train the model. Trying to train BERT from scratch is a crazy thing to do for a starter, and hard even for an expert. Instead, take a pretrained model and fine-tune it, or just use the pretrained word vectors.
I hope that this works out.
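For example, here is a minimal sketch of the "just use the pretrained word vectors" route: average pretrained GloVe vectors over each description and feed them to the RandomForest you already have. The descriptions and labels below are hypothetical toy stand-ins for your scraped data.

```python
import numpy as np
import gensim.downloader as api
from sklearn.ensemble import RandomForestClassifier

glove = api.load("glove-wiki-gigaword-50")  # pretrained 50-d word vectors

def embed(text):
    # Average the vectors of all in-vocabulary tokens; zeros if none match.
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(glove.vector_size)

# Hypothetical toy stand-ins for the scraped descriptions and validity labels.
descriptions = ["peer reviewed medical study", "miracle cure click here now",
                "official government statistics", "shocking secret they hide"]
labels = [1, 0, 1, 0]

X = np.vstack([embed(d) for d in descriptions])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict(embed("new clinical study results").reshape(1, -1)))
```

On your 2000 real data points, concatenating these embedding features with your existing domain/keyword features is an easy experiment to run.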

Looking for a machine learning dataset that contains string values

I'm learning machine learning with TensorFlow. I've worked with a few datasets like the Iris flower data and Boston Housing, but all of their values were floats.
Now I'm looking for a dataset whose values are in string format, to practice on. Can you give me some suggestions?
Thanks
Here are two easy places to start:
The TensorFlow website has three very good tutorials dealing with word embeddings, language modeling, and sequence-to-sequence models. I don't have enough reputation to link them directly, but you can easily find them on the site. They provide TensorFlow code for dealing with human language.
Moreover, if you want to build a model from scratch and only need the dataset, try the NLTK corpora. They are easy to download directly from code.
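For instance, a minimal sketch of pulling a string-valued NLTK corpus straight from code (using the movie_reviews corpus as an example):

```python
import nltk
nltk.download("movie_reviews")  # fetches the corpus on first run
from nltk.corpus import movie_reviews

print(movie_reviews.categories())      # ['neg', 'pos']
first = movie_reviews.fileids()[0]
print(movie_reviews.raw(first)[:200])  # raw review text, i.e. string data
```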
Facebook's ParlAI project lists a good amount of datasets for Natural Language Processing tasks
IMDB's reviews dataset is also a classic example, as is Amazon's reviews dataset for sentiment analysis. If you look at the kernels posted on Kaggle, you'll get a lot of insight into the dataset and the task.
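Since you're already on TensorFlow, here is a minimal sketch of loading the IMDB reviews as raw strings via TensorFlow Datasets (assuming the tensorflow_datasets package is installed):

```python
import tensorflow_datasets as tfds

ds = tfds.load("imdb_reviews", split="train")
for example in ds.take(1):
    print(example["text"].numpy()[:120])  # raw review bytes: string data
    print(example["label"].numpy())       # 0 = negative, 1 = positive
```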

Tensorflow: Use case for determining a dose of medication

I'm new to machine learning and trying to figure out where to start and how to apply it to my app.
My app pulls a bunch of health metrics and, based on all of them, suggests a dose of medication (some abstract medication, it doesn't matter) to take. Taking the medication affects the health metrics, so I can see whether my suggestion was right or whether it needs adjustment to be more precise next time. Medication is taken constantly, so I have a lot of results and data to work with.
Does that seem like a good case for machine learning, using some kind of neural network to train and make better predictions? If so, could you recommend an example for TensorFlow or Keras?
So far I only found image recognition examples and not sure how to apply similar algorithms to my problem.
I'm also a beginner in machine learning, but based on my knowledge, one way would be to use supervised learning with Keras, which uses TensorFlow as a backend. Keras is a lot easier to program than TensorFlow, but plain TensorFlow can do the trick as well (depending on your familiarity with machine learning libraries).
You mentioned that your algorithm suggests medication based on data (from the patient).
One way to predict medication is to store all your preexisting data in a CSV file, and use the CSV module to read it. This tutorial covers the basics of reading CSV files (https://pythonprogramming.net/reading-csv-files-python-3/).
Next, you can store the data in a multi-dimensional array and train a neural network on it. Just make sure that you have enough data (the more the better) relative to the size of your neural network.
Another option would be convolutional neural networks, which theoretically could work here, but I have very little experience programming them, so I'm afraid I can't give you any advice on that (you can build CNNs in both Keras and TensorFlow).
I do wish you good luck in your project!
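To make the supervised-regression framing concrete, here is a minimal Keras sketch: health metrics in, suggested dose out. The random arrays below are toy stand-ins for your real logs (metrics paired with the dose that turned out to be right), which you would load from your CSV instead.

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))    # 8 hypothetical health metrics per record
y = X @ rng.normal(size=8) + 5.0  # fake "correct dose" for demo purposes only

model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1),        # single continuous output: the dose
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=20, batch_size=32, verbose=0)
print(model.predict(X[:3]))       # predicted doses for three records
```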

Classifying research papers on the basis of their titles

Dear all, I am working on a project in which I have to categorize research papers into their appropriate fields using the titles of the papers. For example, if the phrase "computer network" occurs somewhere in the title, then the paper should be tagged as related to the concept "computer network". I have 3 million research paper titles, so I want to know how I should start. I have tried tf-idf but could not get useful results. Does someone know of a library that makes this task easy? Kindly suggest one. I shall be thankful.
If you don't know the categories in advance, then it's not classification but clustering. Basically, you need to do the following:
Select algorithm.
Select and extract features.
Apply algorithm to features.
Quite simple. You only need to choose the combination of algorithm and features that fits your case best.
When talking about clustering, there are several popular choices. K-means is considered one of the best and has an enormous number of implementations, even in libraries not specialized in ML. Another popular choice is the Expectation-Maximization (EM) algorithm. Both of them, however, require an initial guess about the number of clusters. If you can't estimate the number of clusters even approximately, other algorithms, such as hierarchical clustering or DBSCAN, may work better for you (see the discussion here).
As for features, the words themselves normally work fine for clustering by topic. Just tokenize your text, then normalize and vectorize the words (see this if you don't know what all of that means).
Some useful links:
Clustering text documents using k-means
NLTK clustering package
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Note: all links in this answer are about Python, since it has really powerful and convenient tools for this kind of tasks, but if you have another language of preference, you most probably will be able to find similar libraries for it too.
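As a concrete illustration, here is a minimal scikit-learn sketch of the whole pipeline: TF-IDF features plus k-means, on a handful of toy titles (the titles and n_clusters=2 are stand-ins for your 3 million titles and the real cluster count):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

titles = [
    "A survey of computer network protocols",
    "Routing in wireless computer networks",
    "Deep learning for image segmentation",
    "Convolutional networks for image recognition",
]
X = TfidfVectorizer(stop_words="english").fit_transform(titles)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster id per title, e.g. [0 0 1 1]
```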
For Python, I would recommend NLTK (Natural Language Toolkit), as it has some great tools for converting your raw documents into features you can feed to a machine learning algorithm. To start out, you can try a simple word frequency model (bag of words) and later move on to more complex feature extraction methods (string kernels). You can classify the data with SVMs (Support Vector Machines) using LibSVM, a widely used SVM package.
Since you do not know the number of categories in advance, you could use a tool called OntoGen. The tool takes a set of texts, does some text mining, and tries to discover clusters of documents. It is a semi-supervised tool, so you must guide the process a little, but it does wonders. The final product of the process is an ontology of topics.
I encourage you to give it a try.

How to perform linear/logistic regression on the predictions of different models (say random forest, GBM, SVM, etc.)?

Basically, it is done to improve the predictions by creating an ensemble. But how do we do that? Could somebody please explain using sample code in R? I am just a learner. Any help would be greatly appreciated.
Thank you.
Prediction aggregation in ensembles can be done in a large variety of ways. The simplest approach is majority voting (classification) or averaging the predictions of all base models (regression).
Often, complex aggregation schemes are not much better than the basic ones (and are very sensitive to overfitting). This is why specialized packages such as EnsembleSVM only permit very basic aggregation (a linear combination at best).
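What you describe, fitting a linear/logistic regression on the predictions of other models, is known as stacking: each base model produces out-of-fold predictions, which become the features of a meta-model. You asked for R, where packages such as caretEnsemble implement this; as a language-agnostic sketch of the mechanics, here is the equivalent in Python using scikit-learn's StackingClassifier on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("gbm", GradientBoostingClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # the regression on predictions
    cv=5,  # out-of-fold predictions keep the meta-model from overfitting
)
print(stack.fit(Xtr, ytr).score(Xte, yte))
```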
