ML models can only be used for making future predictions? [closed] - machine-learning

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 7 months ago.
I'm aware that ML models are used to make future predictions, but can they also be used to make predictions about the past?
I have a model that predicts accident-prone zones for a given location, date, and time. It was developed by studying the previous two years of data (2020 and 2021). I have a few datasets from 2019 that I am required to predict on, to verify whether the predictions actually tally.
Now, would it be feasible to use this ML model to test on the dataset for the year 2019?
I'm using sklearn, and the model is a random forest.

Theoretically it is possible; it doesn't matter which direction you go. For example, if a trend is seen to increase into the future, the trend was probably decreasing going into the past, so the model will simply predict a decrease. However, how relevant your prediction is remains a separate question worth investigating: the model assumes the patterns it learned from 2020-2021 also held in 2019.
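Mechanically, nothing in scikit-learn stops you: a fitted model will predict on any rows that have the right feature columns, regardless of which period they come from. A minimal sketch with made-up stand-in features (the column layout here is an assumption, not the asker's real schema):

```python
# Sketch: a random forest trained on "2020-2021" rows can be applied to
# "2019" rows, provided the 2019 data has the same feature columns.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in features, e.g. [day_of_year, hour, location_id] -- placeholders only
X_train = rng.uniform(0, 1, size=(500, 3))          # "2020-2021" rows
y_train = (X_train[:, 0] + X_train[:, 1] > 1).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# "2019" rows: predict() does not know or care which year they come from
X_2019 = rng.uniform(0, 1, size=(50, 3))
preds = model.predict(X_2019)
print(preds.shape)  # one prediction per 2019 row
```

Whether those predictions are *valid* for 2019 is the statistical question the answer raises: you are assuming the 2020-2021 relationships were stable back in time.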

Related

Project Using Machine Learning for Optimization [closed]

Closed 6 months ago.
I want to do a website project that uses machine learning to optimize car throughput in a city. This would be a cartoonish grid of dots attempting to navigate through a grid of streets with stoplights at each intersection. However, I have not been able to find the right resources for learning about this type of ML optimization.
The idea to start is that the grid of stoplights is given the same set of cars each epoch and the stoplights guess their own frequency of green/red to maximize traffic flow. So the metric that the model will learn against is number of cars through the light (or time for all cars to clear the city, not sure yet).
I have done the Google ML Crash Course and the book A Programmer's Guide to Artificial Intelligence, but I have yet to find the right type of ML I am looking for. I am looking for a learning resource on training a model with no labeled data, with a metric for optimization.
Reinforcement learning was what I was looking for, and I'm now looking into the TensorFlow documentation on how a virtual light signal can take actions and receive rewards from a model.

Improving Machine learning model for trading and trend prediction [closed]

Closed 1 year ago.
I am working on making predictions and decisions based on stocks and crypto data.
First I implemented a decision tree model and got a model accuracy of 0.5. After that I did some research and found out that a single decision tree is not enough, so I tried to improve on it with a random forest and AdaBoost.
After that I noticed that I have the three above-mentioned algorithms with the same training and test data, and I get three different results.
Now the question is: is it possible to make the three algorithms work together by combining them in some way, and benefit from their individual results?
You can combine classifiers, yes. This is considered an ensemble. It's a bit weird to make an ensemble from a decision tree and a random forest, though. A random forest is an ensemble of decision trees. That's why it's called a forest.
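If you still want to combine them, scikit-learn's VotingClassifier does exactly this. A minimal sketch on synthetic data (in practice the features and labels would come from the asker's stock/crypto dataset):

```python
# Sketch: combine the three models from the question into a voting ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("ada", AdaBoostClassifier(random_state=0)),
    ],
    voting="soft",  # average predicted probabilities rather than majority vote
)
ensemble.fit(X_train, y_train)
acc = ensemble.score(X_test, y_test)
print(round(acc, 3))
```

As noted, the tree inside the ensemble is largely redundant with the forest; the combination mainly helps when the base models make *different* kinds of errors.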

Random Forest Calibration [closed]

Closed 1 year ago.
I am experimenting with random forests for a classification problem and am wondering about calibration for individual features. For example, consider a 2-level factor such as home-field advantage in a sporting event, which we can be certain has an average effect of around +5% on win rate, and whose effect is not captured by any other feature in the data.
It appears as though the nature of random forests (selecting N random features to consider at each split) will not allow the model to fully capture the effect of any one particular feature like this. Setting the max_features parameter to None, so that all features are considered at every split, appears to solve the problem, but then the model loses the power of diversity between trees.
I am wondering if there are any good methods of dealing with this type of feature that we want to be fully captured based on some sort of domain knowledge we have about the problem?
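One way to see the effect the question describes is to compare feature_importances_ under different max_features settings, on synthetic data where a single feature carries all the signal. A rough sketch (the data here is invented for illustration):

```python
# Sketch: how max_features affects how fully one dominant feature is captured.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X[:, 0] > 0).astype(int)   # only feature 0 actually matters

for mf in ("sqrt", None):
    clf = RandomForestClassifier(n_estimators=100, max_features=mf,
                                 random_state=0)
    clf.fit(X, y)
    # importance credited to the one truly predictive feature
    print(mf, round(clf.feature_importances_[0], 3))
```

With max_features=None the dominant feature is available at every split, so nearly all importance concentrates on it; with "sqrt" it is only in the candidate set part of the time. An intermediate max_features value, tuned by cross-validation, is one common compromise between the two extremes.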

How to choose which model to fit to data? [closed]

Closed 2 years ago.
My question is: given a particular dataset and a binary classification task, is there a way we can choose a particular type of model that is likely to work best? For example, consider the Titanic dataset on Kaggle here: https://www.kaggle.com/c/titanic. Just by analyzing graphs and plots, are there any general rules of thumb to pick random forests vs. KNNs vs. neural nets, or do I just need to test them out and then pick the best-performing one?
Note: I'm not talking about image data, since CNNs are obviously best suited for those.
No, you need to test different models to see how they perform.
The top algorithms, based on papers and Kaggle, seem to be boosting methods (XGBoost, LightGBM, AdaBoost), stacks of those together, or just random forests in general. But there are instances where logistic regression can outperform them all.
So just try them all. Unless the dataset is very large (more than ~100k rows), you're not going to lose that much time, and you might learn something valuable about your data.
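"Just try them all" can be as simple as looping candidate models through cross_val_score on the same data. A sketch with a synthetic stand-in dataset:

```python
# Sketch: cross-validate several candidate models and compare mean accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=15, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Cross-validation (rather than a single train/test split) keeps the comparison from being decided by one lucky split.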

Survey to determine satisfaction: how to find the questions that mattered? [closed]

Closed 10 years ago.
If a survey is given to determine overall customer satisfaction, and there are 20 general questions and a final summary question: "What's your overall satisfaction 1-10", how could it be determined which questions are most significantly related to the summary question's answer?
In short, which questions actually mattered and which ones were just wasting space on the survey...
Information about the relevance of individual features is given by the linear classification or regression weights associated with those features.
For your specific application, you could try training an L1 or L0 regularized regressor (http://en.wikipedia.org/wiki/Least-angle_regression, http://en.wikipedia.org/wiki/Matching_pursuit). These regularizers force many of the regression weights to zero, which means that the features associated with these weights can be effectively ignored.
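A minimal sketch of that L1 idea using scikit-learn's Lasso, on simulated survey data where only the first two questions actually drive the overall score (the data and the alpha value are assumptions for illustration):

```python
# Sketch: L1-regularized regression of the overall score on the 20 answers;
# questions whose weights survive the penalty are the ones that "mattered".
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
answers = rng.integers(1, 11, size=(200, 20)).astype(float)  # 20 questions, 1-10
# By construction, only questions 0 and 1 drive overall satisfaction
overall = 0.6 * answers[:, 0] + 0.4 * answers[:, 1] + rng.normal(0, 0.5, 200)

lasso = Lasso(alpha=0.5)     # stronger alpha zeroes out more weights
lasso.fit(answers, overall)
kept = np.flatnonzero(lasso.coef_)
print(kept)                  # indices of questions with nonzero weight
```

Alpha controls how aggressively weights are pushed to zero; in practice it would be chosen by cross-validation (e.g. LassoCV).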
There are many different approaches for answering this question and at varying levels of sophistication. I would start by calculating the correlation matrix for all pair-wise combinations of answers, thereby indicating which individual questions are most (or most negatively) correlated with the overall satisfaction score. This is pretty straightforward in Excel with the Analysis ToolPak.
Next, I would look into clustering techniques starting simple and moving up in sophistication only if necessary. Not knowing anything about the domain to which this survey data applies it is hard to say which algorithm would be the most effective, but for starters I would look at k-means and variants if your clusters are likely to all be similarly-sized. However, if a vast majority of the responses are very similar, I would look into expectation-maximization-based algorithms. A good open-source toolkit for exploring data and testing the efficacy of various algorithms is called Weka.
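The correlation-matrix step translates directly to pandas if Excel isn't available. A sketch on simulated responses (only q1 and q2 matter by construction):

```python
# Sketch: correlate each question with the overall score and rank them.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 11, size=(200, 20)),
                  columns=[f"q{i}" for i in range(1, 21)])
# Simulated summary question: driven by q1 and q2 plus noise
df["overall"] = (0.6 * df["q1"] + 0.4 * df["q2"]
                 + rng.normal(0, 1, 200)).round()

corr_with_overall = (df.corr()["overall"]
                       .drop("overall")
                       .sort_values(ascending=False))
print(corr_with_overall.head())   # questions most correlated with overall score
```

This mirrors the Analysis ToolPak output: one correlation per question against the summary answer, sorted so the most related questions surface first.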
