Random forest regression from scratch - machine-learning

Is there a link to a from-scratch implementation of random forest regression in Python? If so, please share it.

Here's one: https://github.com/amstuta/random-forest
That said, you'll have better luck searching GitHub directly.
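For a rough idea of what a from-scratch version involves, here is a minimal sketch using only the Python standard library. It is not taken from the linked repository. All function names (`build_tree`, `fit_forest`, `predict_forest`) and default parameters are made up for illustration. It assumes numeric features, picks splits by minimising the squared error of the two children, trains each tree on a bootstrap sample, and draws a random subset of features at every node.

```python
import random
from statistics import mean


def _sse(ys):
    """Sum of squared errors of ys around their mean."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((v - m) ** 2 for v in ys)


def _best_split(X, y, feature_ids):
    """Return the (feature, threshold) that lowers the SSE the most, or None."""
    best, best_score = None, _sse(y)
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = _sse(left) + _sse(right)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(X, y, max_depth, min_samples, n_features, rng):
    """Grow one regression tree. Leaves are numbers, inner nodes are tuples."""
    if max_depth == 0 or len(y) < min_samples:
        return mean(y)
    # Random feature subset at every node (the "random" in random forest).
    feature_ids = rng.sample(range(len(X[0])), n_features)
    split = _best_split(X, y, feature_ids)
    if split is None:
        return mean(y)
    f, t = split
    left = [(row, yi) for row, yi in zip(X, y) if row[f] <= t]
    right = [(row, yi) for row, yi in zip(X, y) if row[f] > t]
    left_X, left_y = zip(*left)
    right_X, right_y = zip(*right)
    return (
        f,
        t,
        build_tree(left_X, left_y, max_depth - 1, min_samples, n_features, rng),
        build_tree(right_X, right_y, max_depth - 1, min_samples, n_features, rng),
    )


def predict_tree(node, row):
    """Walk down the tree until a leaf value is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(X, y, n_trees=25, max_depth=5, min_samples=2,
               n_features=None, seed=0):
    """Train n_trees trees, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(X)
    # Assumed default: about one third of the features per split.
    n_features = n_features or max(1, len(X[0]) // 3)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        sample_X = [X[i] for i in idx]
        sample_y = [y[i] for i in idx]
        forest.append(
            build_tree(sample_X, sample_y, max_depth, min_samples, n_features, rng)
        )
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean([predict_tree(tree, row) for tree in forest])
```

Usage is `forest = fit_forest(X, y)` followed by `predict_forest(forest, row)`, where `X` is a list of feature lists and `y` a list of numbers. The split search is brute force, so this is only meant for reading and small experiments.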

Related

Multicollinearity check (perturbation test) in logistic regression?

I am learning ML and want to know how to check for multicollinearity in logistic regression, with code, an explanation, and any prerequisites for the check. A link would also do. Please also cover what "dummies" are, since the YouTube videos I watched all discussed them. Please cover it from scratch if possible.

Is there any model/classifier that works best for NLP-based projects like this?

I've written a program to analyze a given piece of text from a website and make conclusory classifications as to its validity. The code basically vectorizes the description (taken from the HTML of a given webpage in real-time) and takes in a few inputs from that as features to make its decisions. There are some more features like the domain of the website and some keywords I've explicitly counted.
The highest accuracy I've been able to achieve is with a RandomForestClassifier (>90%). I'm not sure what I can do to improve this accuracy other than using a more sophisticated model. I tried an MLP, but it doesn't exceed the previous accuracy for any set of hyperparameters. I have around 2000 data points available for training.
Is there any classifier that works best for such projects? Does anyone have any suggestions as to how I can bring about improvements? (If anything needs to be elaborated, I'll do so.)
Any suggestions on how I can improve this project in general? Should I include the text on a webpage as well? How should I do so? I tried going through a few sites, but the text doesn't seem to be contained in any specific element, whereas the description is easy to obtain from the HTML. Any help?
What else can I take as features? If anyone could suggest any creative ideas, I'd really appreciate it.
Try searching with the keyword "NLP". The task you are facing is called natural language processing, and it is a hot topic among people who study deep learning.
Random forest is a classical machine learning algorithm and probably works quite well here. Other classical algorithms may or may not improve your accuracy; trying out a few lightweight ones is fine.
Deep learning will most likely outperform your current model. Searching for NLP will turn up many models, such as Word2Vec and BERT, and you can find code for them on GitHub.
One tip: think carefully about whether you can actually train the model. Training BERT from scratch is unrealistic for a beginner, and hard even for an expert. Instead, take a pretrained model and fine-tune it, or just use pretrained word vectors.
I hope this works out.

Binary Classification Task on Very Similar Patterns

I'm trying to do a binary classification task on a set of sentences that are very similar to each other. I'm not sure how to handle such similarity between samples. Here are some of my questions:
(1). Which classification technique will be more suitable in this case?
(2). Will feature selection help in this case?
(3). Could sequence classification algorithms based on recurrent neural networks (LSTM) be a potential approach to follow?
I'd be glad to see any hint or help regarding this problem, thank you!
(Only a potential answer to 3.)
Assuming you only have to classify whether the sentences belong to a certain category, you wouldn't want to use RNNs unless you actually want the model to generate something new from them (sequence-to-sequence).
That said, classification is possible if you end the network with a sequence flattener and a fully connected layer.

Using scikit-learn to decide if a given text is similar to previously learnt texts

I am a newbie to scikit-learn.
What I want to do is quite simple - just feed my model with a bunch of similar texts.
Then, I want to be able to give it a new text, and see if it is similar to the existing texts in the dataset.
How should this be done?
Thanks very much in advance.
One good approach might be cosine similarity. This is a very good tutorial to start with:
Machine Learning :: Cosine Similarity for Vector Space Models (Part III)
Another good approach would be a Bayesian classifier, like the ones used for spam detection. Take a look at this link to learn more about them.
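To make the cosine similarity idea concrete, here is a minimal sketch in plain Python so the arithmetic is visible. It is not the tutorial's code. The function names and the smoothed IDF formula are choices made for this illustration. In scikit-learn, `TfidfVectorizer` and `sklearn.metrics.pairwise.cosine_similarity` serve the same purpose. The sketch builds TF-IDF vectors from the known texts and scores a new text by its highest cosine similarity to any of them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(corpus):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(tokenize(doc)))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


def tfidf(text, idf):
    """Sparse TF-IDF vector as a dict; terms unseen during fitting are ignored."""
    counts = Counter(tokenize(text))
    return {term: n * idf[term] for term, n in counts.items() if term in idf}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def max_similarity(new_text, corpus, idf):
    """Highest cosine similarity between new_text and any known text."""
    query = tfidf(new_text, idf)
    return max(cosine(query, tfidf(doc, idf)) for doc in corpus)


if __name__ == "__main__":
    known = [
        "the cat sat on the mat",
        "a cat and a dog played",
        "dogs and cats are pets",
    ]
    idf = fit_idf(known)
    print(max_similarity("a cat sat on a mat", known, idf))
    print(max_similarity("stock prices fell sharply", known, idf))
```

To decide whether a new text is "similar", compare the score against a threshold. The right threshold depends on the data and has to be chosen by experiment.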

How does StackOverflow's tag suggestions work?

I've got a database of hundreds of thousands of forum posts, and would like to tag them in an unsupervised way.
I noticed that StackOverflow's tag system suggests tags as I go. How does this algorithm work?
I also found this, which implies it is SVM-based. Is it official? http://dl.acm.org/citation.cfm?id=2660970&dl=ACM&coll=DL&CFID=522960920&CFTOKEN=15091676
You could also try a shallow (though the authors call it deep) inverse regression approach using Gensim and word embeddings for document classification. Ideally, using both the titles and the text of the forum posts, you should be able to build a pretty decent classification system. Follow along in this notebook and paper.
