Deep Learning Algorithm to Predict Bash Commands - machine-learning

I'm new to machine learning, and I want to develop an application that takes the data from multiple users' bash histories and predicts the next command of another user based on the commands the others have executed.
I searched a lot but didn't find a good answer. I'd appreciate help from ML experts: if you know of sample code for something similar, or have any useful comments, such as which algorithms I should look into, please share.

Look into language modeling, which predicts the next word in a sequence given the words that precede it. You would probably work with RNN- or LSTM-based networks for language modeling.
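Before reaching for an RNN or LSTM, it may help to see the same "predict the next token from what came before" idea in its simplest form. The sketch below is a count-based bigram model, not a neural network, and the histories are hand-made for illustration; real input would have to be parsed from each user's history file. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(histories):
    """Count how often each command follows another across all users' histories."""
    following = defaultdict(Counter)
    for history in histories:
        for prev_cmd, next_cmd in zip(history, history[1:]):
            following[prev_cmd][next_cmd] += 1
    return following


def predict_next(model, last_command, k=3):
    """Return up to k most frequent successors of last_command (empty if unseen)."""
    if last_command not in model:
        return []
    return [cmd for cmd, _ in model[last_command].most_common(k)]


# Hypothetical, hand-made histories used only to exercise the code.
histories = [
    ["cd project", "git status", "git add .", "git commit", "git push"],
    ["git status", "git add .", "git commit", "git push"],
    ["cd project", "ls", "git status", "git diff"],
]

model = train_bigram_model(histories)
print(predict_next(model, "git status"))
```

A model like this only looks one command back. An RNN or LSTM language model plays the same role but can condition on a longer prefix of the session.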

Related

Is there any model/classifier that works best for NLP based projects like this?

I've written a program to analyze a given piece of text from a website and make conclusory classifications as to its validity. The code basically vectorizes the description (taken from the HTML of a given webpage in real-time) and takes in a few inputs from that as features to make its decisions. There are some more features like the domain of the website and some keywords I've explicitly counted.
The highest accuracy I've been able to achieve is with a RandomForestClassifier (>90%). I'm not sure what I can do to improve this accuracy other than using a more sophisticated model. I tried an MLP, but no set of hyperparameters seems to exceed the previous accuracy. I have around 2000 data points available for training.
Is there any classifier that works best for such projects? Does anyone have any suggestions as to how I can bring about improvements? (If anything needs to be elaborated, I'll do so.)
Any suggestions on how I can improve this project in general? Should I include the text of the webpage as well? How should I do so? I tried going through a few sites, but the text doesn't seem to be contained in any specific element, whereas the description is easy to obtain from the HTML. Any help?
What else can I take as features? If anyone could suggest any creative ideas, I'd really appreciate it.
Search for the keyword NLP. The task you are facing is called natural language processing, and it is a hot topic among people who study deep learning.
RandomForest is a classical machine learning algorithm and probably works quite well here. Other classical algorithms might improve your accuracy, or might not; trying a few lightweight ones is fine.
Deep learning will most likely outperform your current model. Starting from the keyword NLP, you will come across many models, such as Word2Vec and BERT, and you can find code for them on GitHub.
One tip: think carefully about whether you can actually train the model. Training BERT from scratch is a crazy thing to attempt for a beginner, and even for an expert. Instead, take a pretrained model and fine-tune it, or just use pretrained word vectors.
I hope this works out.
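To make the "just bring the word vectors" option concrete, here is a minimal sketch of turning a piece of text into a fixed-length feature vector by averaging word vectors. The three-dimensional vectors below are invented for illustration; in practice they would be loaded from a pretrained Word2Vec or GloVe file, and `embed_text` is a hypothetical helper name.

```python
# Toy 3-dimensional "pretrained" vectors, invented for illustration only.
# Real vectors would be loaded from a pretrained Word2Vec or GloVe file.
WORD_VECTORS = {
    "free":   [0.9, 0.1, 0.0],
    "prize":  [0.8, 0.2, 0.1],
    "report": [0.1, 0.9, 0.2],
    "study":  [0.0, 0.8, 0.3],
}


def embed_text(text, vectors=WORD_VECTORS, dim=3):
    """Average the vectors of known words; unknown words are skipped."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


features = embed_text("Free prize inside")
print(features)
```

The resulting vector could be appended to the features already in use (domain, keyword counts) and passed to the existing RandomForestClassifier, so nothing has to be trained from scratch.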

How to "teach" TensorFlow to compute expected output?

So as a fun project, I've been messing with the TensorFlow API (in Java unfortunately.. but I should be able to get some results out anyways). My first goal is to develop a model for 2D point cloud filtering. So I have written code that generates random clouds in 224x172 resolution, computes the result of a neighbor density filter, and stores both (see images below).
So basically I have generated data for both an input and expected output, which can be done as much as needed for a massive dataset.
I have both the input and output arrays stored as 224x172 binary arrays (0 for no point at index, 1 for a point at that index). So my input and output are both 224x172. At this point, I'm not sure how to translate my input to my expected result. I'm not sure how to weight each "pixel" of my cloud, or how to "teach" the program the expected result. Any suggestions/guidance on whether this is even possible for my given scenario would be appreciated!
Please don't be too hard on me... I'm a complete noob when it comes to machine learning.
Imagine that TensorFlow is a set of building blocks (like LEGO) for constructing machine learning models. Once a model is constructed, it can be trained and evaluated.
So your question can be divided into three parts:
1. I'm new to machine learning. Please guide me on how to choose a model that fits the task.
2. I'm new to TensorFlow. I have an idea of the model (see 1) and I want to construct it with TensorFlow.
3. I'm new to TensorFlow's Java API. I know how to build a model with TensorFlow (2), but I'm stuck in Java.
This sounds scary, but it's really not too bad. I'd suggest the following plan:
1. You need to look through machine learning models to find one that suits your case. So ask yourself: which models could be used for point cloud filtering? And more fundamentally, do you really need a machine learning model? Why not use simple math formulas?
Those are the questions to ask yourself.
OK, assume you've found a model. For example, you've found a paper describing a neural network that can solve your task. Then go to the next step.
2. TensorFlow has a lot of examples and code snippets, so you may even find code with your model already implemented.
The bad news is that most code examples and APIs are Python-based. But since you want to get into machine learning, I'd suggest learning Python. It's easy to pick up, and it's very common in the scientific world because you don't waste time on wrappers, configuration, etc. (as Java requires). You just start solving your task from the first line of the script.
Going back to the LEGO analogy, I'd add that there are higher-level libraries built on top of TensorFlow, so you work not with individual building blocks but with whole layers of blocks.
One example is tflearn. It's very good, especially if you don't have a deep math or machine learning background. It lets you build machine learning models in a simple and understandable way. Need to add a neural network layer? Here you are. And all of that without complex low-level tensor operations.
The disadvantage is that you won't be able to load a tflearn model from Java.
Anyway, assume that by the end of this step you can build your model, train it, and evaluate the model and its prediction quality.
3. Now you have your machine learning model and you understand TensorFlow's mechanics, so if you still need to work with Java, that should be much easier.
Note again that you won't be able to load a tflearn model from Java. You could try Jython to call Python functions directly from Java, though I haven't tried it.
Along the way (1-3) you will definitely have more questions. So welcome to SO.
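On the "why not use simple math formulas?" point: a neighbor density filter on a binary grid can be written directly, with no learning at all. The sketch below is one plausible version in plain Python. The asker's actual filter is not shown, so the window `radius` and the `min_neighbors` threshold are assumptions, and the tiny 4x5 grid stands in for the 224x172 arrays.

```python
def neighbor_density_filter(grid, radius=1, min_neighbors=2):
    """Keep a point only if at least min_neighbors other points lie within
    the (2*radius+1) x (2*radius+1) window centered on it."""
    rows, cols = len(grid), len(grid[0])
    result = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c]:
                continue  # empty cells stay empty
            neighbors = 0
            for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    if (rr, cc) != (r, c) and grid[rr][cc]:
                        neighbors += 1
            if neighbors >= min_neighbors:
                result[r][c] = 1
    return result


cloud = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],  # isolated point, expected to be filtered out
    [0, 0, 0, 0, 0],
]
filtered = neighbor_density_filter(cloud)
for row in filtered:
    print(row)
```

If the goal is still to have a network learn this mapping, a function like this is also what generates the (input, expected output) training pairs described in the question.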

Incorporating user feedback in a ML model

I have developed an ML model for a binary (0/1) NLP classification task and deployed it in a production environment. The model's prediction is displayed to users, and users have the option to give feedback (whether the prediction was right or wrong).
How can I continuously incorporate this feedback into my model? From a UX standpoint, you don't want a user to correct/teach the system more than two or three times for a specific input; the system should learn fast, i.e. the feedback should be incorporated quickly. (Google Priority Inbox does this seamlessly.)
How does one build this "feedback loop" through which my system can improve? I have searched a lot online but could not find relevant material. Any pointers would be a great help.
Please don't say "retrain the model from scratch, including the new data points". That's surely not how Google and Facebook build their smart systems.
To explain my question further: think of Google's spam detector, their Priority Inbox, or their recent "smart replies" feature. It's well known that they are able to learn from and incorporate user feedback quickly.
Such a system incorporates user feedback fast (i.e. the user has to teach the system the correct output at most 2-3 times per data point before the system starts giving the correct output for that data point), AND it retains what it learned before: while incorporating the new data point, it does not start giving wrong outputs on older data points that it previously got right.
I have not found any blog/literature/discussion on how to build such systems, i.e. anything that explains in detail the "feedback loop" in ML systems.
I hope my question is a little clearer now.
Update: Some related questions I found are:
Does the SVM in sklearn support incremental (online) learning?
https://datascience.stackexchange.com/questions/1073/libraries-for-online-machine-learning
http://mlwave.com/predicting-click-through-rates-with-online-machine-learning/
https://en.wikipedia.org/wiki/Concept_drift
Update: I still don't have a concrete answer, but such a recipe does exist. Read the section "Learning from the feedback" in the blog post "Machine Learning != Learning Machine", in which Jean talks about "adding a feedback ingestion loop to machine". The same idea appears here, here, and here.
There are a couple of ways to do this:
1) You can use the feedback from the user to train only the last layer of your model, keeping the weights of all other layers intact. Intuitively, in the case of a CNN for example, this means you keep extracting features with your model but slightly adjust the classifier to account for the peculiarities of your specific user.
2) Another way is to have a global model (trained on your large training set) and a simple user-specific logistic regression. For final predictions, you combine the two models' outputs. See this paper by Google on how they do it for their Priority Inbox.
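A rough sketch of option 2, under stated assumptions: the global model is frozen and only supplies a logit, and a small per-user logistic regression is added on top and updated with one gradient step per piece of feedback. This is not the method from the Google paper, only an illustration of the general idea; the class name, the learning rate of 1.0, and the example numbers are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class UserCorrection:
    """Per-user logistic regression layered on a frozen global model's logit."""

    def __init__(self, n_features, learning_rate=1.0):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate  # assumed value; larger adapts faster

    def predict_proba(self, global_logit, features):
        user_logit = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return sigmoid(global_logit + user_logit)

    def update(self, global_logit, features, label):
        """One SGD step on the log loss for a single piece of feedback."""
        error = self.predict_proba(global_logit, features) - label
        self.bias -= self.learning_rate * error
        self.weights = [w - self.learning_rate * error * x
                        for w, x in zip(self.weights, features)]


# Hypothetical case: the global model gives this input a logit of 1.0
# (probability about 0.73), but the user keeps marking it as class 0.
user = UserCorrection(n_features=2)
features, global_logit = [1.0, 0.0], 1.0
for _ in range(3):
    user.update(global_logit, features, label=0)
print(user.predict_proba(global_logit, features))
```

Because the global model is never touched, feedback from one user cannot degrade predictions for other users. How aggressively a single user's corrections generalize to that user's other inputs depends on the learning rate and the features chosen.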
Build one or more simple, light models that can be updated on each piece of feedback. Online machine learning offers a number of candidates for this.
Most good online classifiers are linear. In that case we can have a couple of them and achieve non-linearity by combining them via a small, shallow neural net.
https://stats.stackexchange.com/questions/126546/nonlinear-dynamic-online-classification-looking-for-an-algorithm

Online machine learning for obstacle crossing or bypassing

I want to program a robot which will sense obstacles and learn whether to cross over them or bypass around them.
Since my project must be completed within a week and a half, I must use an online learning algorithm (a GA or similar would take a long time to test, because the robot needs to try to cross the obstacle in order to determine whether crossing is possible).
I'm really new to online learning so I don't really know which online learning algorithm to use.
It would be a great help if someone could recommend a few algorithms that would suit my problem; links to examples wouldn't hurt either.
Thanks!
I think you could start with A* (A-star).
It's simple, robust, and widely used.
There are some nice tutorials on the web, like this one: http://www.raywenderlich.com/4946/introduction-to-a-pathfinding
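For reference, a minimal A* sketch on a 4-connected grid is shown below, assuming unit step costs and a Manhattan-distance heuristic. Note that A* plans a route around obstacles that are already known; it does not learn anything from the robot's attempts. The grid layout and the function name are made up for the example.

```python
import heapq


def a_star(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) and 1 (blocked) cells.
    Returns the list of cells from start to goal, or None if unreachable."""
    def heuristic(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])  # Manhattan distance

    rows, cols = len(grid), len(grid[0])
    open_heap = [(heuristic(start), 0, start)]  # (estimated total, cost so far, cell)
    came_from = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, current = heapq.heappop(open_heap)
        if current == goal:
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        if cost > best_cost[current]:
            continue  # stale heap entry, a cheaper route was found already
        r, c = current
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    came_from[(nr, nc)] = current
                    heapq.heappush(
                        open_heap,
                        (new_cost + heuristic((nr, nc)), new_cost, (nr, nc)),
                    )
    return None


grid = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],  # a wall with a gap on the right
    [0, 0, 0, 0],
]
path = a_star(grid, (0, 0), (2, 0))
print(path)
```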
An online algorithm is simply one that can collect new data and update a model incrementally without retraining on the full dataset (i.e. it can be used in an online service that runs all the time). What you are probably looking for is reinforcement learning.
RL itself is not a method but rather a general approach to the problem, and many concrete methods can be used with it. Neural networks have been shown to do well in this field (useful course). See, for example, this paper.
However, to build a real robot that can bypass obstacles, you will need much more than knowledge of neural networks. You will need to set up sensors carefully, preprocess their data, work out your model, and collect a dataset. I'm not sure it's even possible to learn all of that in a week and a half.
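To give a feel for the reinforcement learning framing without neural networks, here is a tabular, one-step value-learning sketch (effectively a contextual bandit, the simplest RL setting). Everything about the environment is an assumption made up for the example: obstacle heights 1 to 4, a hidden rule that heights up to 2 can be crossed, and the reward values. A real robot would replace `simulated_reward` with the outcome of an actual attempt.

```python
import random

ACTIONS = ("cross", "bypass")
HEIGHTS = range(1, 5)


def simulated_reward(height, action):
    """Hypothetical environment: only low obstacles (height <= 2) can be crossed."""
    if action == "bypass":
        return 0.2  # always works, but is slower
    return 1.0 if height <= 2 else -1.0


def train(episodes=500, epsilon=0.1, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    # Optimistic initial values encourage trying each action at least once.
    q = {(h, a): 1.0 for h in HEIGHTS for a in ACTIONS}
    for _ in range(episodes):
        height = rng.randint(1, 4)
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore
        else:
            action = max(ACTIONS, key=lambda a: q[(height, a)])  # exploit
        reward = simulated_reward(height, action)
        # Incremental update after every single attempt (online learning).
        q[(height, action)] += learning_rate * (reward - q[(height, action)])
    return q


q_values = train()
policy = {h: max(ACTIONS, key=lambda a: q_values[(h, a)]) for h in HEIGHTS}
print(policy)
```

The value table is updated after each attempt, which is what makes this online in the sense described above. With continuous sensor readings instead of four discrete heights, the table would be replaced by a function approximator such as a small neural network.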

Machine learning/information retrieval project

I’m reading for an M.Sc. in Computer Science and have just completed the first year of the course. (This is a two-year course.) Soon I have to submit a proposal for the M.Sc. project. I have selected the following topic.
“Suitability of machine learning for document ranking in information retrieval system”. Researchers have been using various machine learning algorithms for ranking documents. So as the first phase of the project I will be doing a complete literature survey and finding out advantages/disadvantages of current approaches. In the second phase of the project I will be proposing a new (modified) algorithm in order to overcome the limitations of current approaches.
Actually, my question is whether this type of project is suitable as an M.Sc. project. Moreover, if somebody has interesting ideas in the information retrieval field, would you be willing to share them with me?
Thanks
Ranking is always the hardest part of any information retrieval system. I think it is a very good topic, but you have to take care to define the scope of the work as soon as possible. You will probably not be able to develop a new IR engine, but you could build a prototype based on, e.g., Apache Lucene.
Currently there are many datasets available, including the Stack Overflow data dump, which provide all the information you need to define a rich feature vector (number of points, time, popularity of a tag, topics mined from previous questions, etc.) for your machine learning ranking algorithm. In this part of the work you could, e.g., classify the types of features (e.g. user-specific features, semantic features such as a software name in the title) and perform a series of experiments to learn which features are most important for a given dataset and which are not.
A second direction for such a project is how to perform learning efficiently. The motivation is the quantity of data on the web or in community forums, and the changes a forum undergoes (this matters if you use community-specific features), e.g. changes in technologies, new software releases, etc.
There are many other topics related to search and machine learning. The best idea is to search scholar.google.com for recent survey papers on ranking, machine learning, and search to learn what the state of the art is. The very next step would be to talk with your M.Sc. supervisor.
Good luck!
Everything you said is good and should be done, but you forgot the most important part:
Prove that your algorithm is better and/or faster than other algorithms, with good experiments and maybe some statistics (p-value, confidence interval).
If you do that and convince people that your algorithm is useful you surely will not fail :)
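As one way to get the p-value mentioned above, the sketch below runs an exact paired sign-flip permutation test on per-query scores of two ranking algorithms. The scores are invented for illustration, and this test is just one reasonable choice; a paired t-test or a bootstrap confidence interval would be alternatives. The function name is hypothetical.

```python
from itertools import product


def paired_permutation_p_value(scores_a, scores_b):
    """Exact two-sided sign-flip test on per-query score differences.
    Enumerates all 2**n sign assignments, so it is only practical for small n."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        permuted = abs(sum(s * d for s, d in zip(signs, diffs)))
        if permuted >= observed - 1e-12:  # small tolerance for float rounding
            count += 1
    return count / total


# Made-up per-query scores (e.g. a ranking metric) for eight queries.
new_algorithm = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.59, 0.69]
baseline      = [0.58, 0.66, 0.53, 0.74, 0.66, 0.70, 0.60, 0.63]

p_value = paired_permutation_p_value(new_algorithm, baseline)
print(p_value)
```

The test asks how often a difference at least as large as the observed one would arise if the two algorithms were interchangeable on each query. Pairing by query matters because some queries are simply harder than others for every algorithm.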

Resources