Predicting next week's incidents using text analysis on incident reports - machine-learning

I want to do a project, but I'm not sure whether it's really viable or which path I should take to solve it.
I have a dataset with numerous incidents from different places and their risk classification. For example:
Incident: "John stumbled and fell down the stairs"
Risk: Severe
Warehouse: A
Date: 2020-07-11
---
Incident: "Mary left the door open"
Risk: Low
Warehouse: B
Date: 2020-07-10
My idea is to aggregate the incidents by warehouse and by week, and to estimate the probability of an incident occurring in each warehouse for each risk level.
Warehouse A
Probability of low risk incidents next week: 60%
Probability of severe risk incidents next week: 30%
But I'm not really sure how to approach this problem. It's not really text classification, because I already know the classification (risk) of every report. Is there a way to use this dataset to get a prediction for the next week?

Well, I think these incidents are not correlated, so it will probably be hard to find a useful pattern.
That problem can be framed as a regression problem, where the probability of incidents is what you want the model to predict. Consider using a Recurrent Neural Network (RNN), which is well suited to time-series problems.
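Before training any model, it may help to have a naive baseline to compare against. The following is only a sketch in plain Python: the records, the field names, and the function `weekly_incident_rate` are hypothetical, and the "probability" is simply the historical fraction of weeks in which at least one incident of a given risk occurred in a given warehouse. It ignores trends and the report text entirely.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical records in the shape described in the question.
incidents = [
    {"warehouse": "A", "risk": "Severe", "date": date(2020, 7, 11)},
    {"warehouse": "B", "risk": "Low", "date": date(2020, 7, 10)},
    {"warehouse": "A", "risk": "Low", "date": date(2020, 7, 1)},
    {"warehouse": "A", "risk": "Low", "date": date(2020, 7, 15)},
]


def iso_week(day):
    """Return (ISO year, ISO week number) for a date."""
    return tuple(day.isocalendar())[:2]


def weekly_incident_rate(records, first_day, last_day):
    """Fraction of weeks in [first_day, last_day] that had at least one
    incident, for every (warehouse, risk) pair seen in the records."""
    all_weeks = set()
    day = first_day
    while day <= last_day:
        all_weeks.add(iso_week(day))
        day += timedelta(days=1)

    weeks_with_incident = defaultdict(set)
    for rec in records:
        if first_day <= rec["date"] <= last_day:
            key = (rec["warehouse"], rec["risk"])
            weeks_with_incident[key].add(iso_week(rec["date"]))

    return {key: len(weeks) / len(all_weeks)
            for key, weeks in weeks_with_incident.items()}


# Four full weeks of (made-up) history, Monday to Sunday.
rates = weekly_incident_rate(incidents, date(2020, 6, 29), date(2020, 7, 26))
for (warehouse, risk), rate in sorted(rates.items()):
    print(f"Warehouse {warehouse}, {risk}: {rate:.0%} of past weeks")
```

Any regression model or RNN would need to beat these historical rates on held-out weeks to be worth the extra complexity.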

Related

Which machine learning algorithm should I use to predict whether a particular parking space will be occupied?

I'm working on an idea for my Master's thesis topic.
I have a dataset with millions of records from on-street parking sensors.
The data I have:
- vehicle present on a particular sensor (true or false)
It is normal for several consecutive parking events to have False values, each with a different duration.
- arrival time and departure time (month, day, hour, minute, and even second)
- duration in minutes
There are a few more columns, but I don't have any idea how to capture "continuity of time" in my analysis and reflect it in the calculations for a certain future time, based on the times when the parking space was usually free or occupied.
Any ideas?
You can take two approaches:
If you want to predict whether a particular space will be occupied and you take the order of the events (time) into account, this looks like a time series problem. You should start by trying simple time-series methods such as a moving average or ARIMA models. There are more sophisticated methods that take long- and short-term relationships into account, such as recurrent neural networks, especially LSTM (Long Short-Term Memory) networks, which have shown good performance on time series problems.
Alternatively, you can take all the variables into account and use them to train a clustering algorithm such as K-means, or a classifier such as an SVM.
As you pointed out:
And few more columns, but I don't have any idea how to show in my analysis that "continuity of time" and reflect this in the calculations for a certain future time based on the time when the parking space was usually free or occupied.
I recommend that you treat this as a time series problem.
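As a starting point before ARIMA or an LSTM, a seasonal baseline can capture "when the space was usually free or occupied". The sketch below is only an illustration in plain Python: it assumes the arrival/departure events have already been resampled into timestamped occupied/free observations, and the names `occupancy_profile` and `predict_occupied` are made up for this example.

```python
from collections import defaultdict
from datetime import datetime


def occupancy_profile(observations):
    """Map each (weekday, hour) slot to the fraction of past
    observations in that slot where the space was occupied."""
    counts = defaultdict(lambda: [0, 0])  # slot -> [occupied, total]
    for timestamp, occupied in observations:
        slot = (timestamp.weekday(), timestamp.hour)
        counts[slot][0] += int(occupied)
        counts[slot][1] += 1
    return {slot: occ / total for slot, (occ, total) in counts.items()}


def predict_occupied(profile, when, default=0.5):
    """Estimated probability that the space is occupied at `when`.
    Falls back to `default` for slots with no history."""
    return profile.get((when.weekday(), when.hour), default)


# Made-up history for one sensor (all four dates are Mondays).
history = [
    (datetime(2020, 7, 6, 9, 0), True),
    (datetime(2020, 7, 13, 9, 30), True),
    (datetime(2020, 7, 20, 9, 15), False),
    (datetime(2020, 7, 6, 23, 0), False),
]

profile = occupancy_profile(history)
p_monday_9 = predict_occupied(profile, datetime(2020, 7, 27, 9, 45))
p_unseen = predict_occupied(profile, datetime(2020, 7, 29, 14, 0))
print(f"Monday 09:00-10:00: {p_monday_9:.2f}, unseen slot: {p_unseen:.2f}")
```

A time series model is worth its complexity only if it predicts held-out periods better than this kind of weekday/hour profile.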
Time series modelling is the better option for this kind of problem. As you said, you want to predict a binary output at different time intervals, i.e. whether the parking slot will be occupied during a particular time interval or not. You can use an LSTM for this purpose.
Time series is definitely an option here. If you are really going with LSTMs, why not look into Transformers and take advantage of the attention mechanism for time series forecasting? I don't know them thoroughly yet; I just have a vague idea of how they work and of their performance benefits over RNNs and LSTMs.

Removing bias in user ratings

I have a dataset of user ratings on images. I am normalizing the ratings using mean and standard deviation normalization (z-scoring) to remove bias in the dataset due to user-specific preferences. Is this a correct way to handle bias, or is there another way to remove bias in user ratings?
This is certainly wrong on a couple of points:
If you 'normalise' the input by the standard deviation in this way, what you are saying is "low variability doesn't matter much, only the outliers really count", because only the outliers deviate from the mean by more than one standard deviation.
You are dealing with 'votes' of user satisfaction, not 'measurements'. Bias is, by definition, information about satisfaction, and you are throwing it away. For example, 150 years ago people used to find the "No dogs, no Irish" thing acceptable; these days, not so much. If you want to predict how well a restaurant is likely to be regarded after a visit, you can't discount 0-star votes merely because the people objected to the sign!
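To make the second point concrete, here is a minimal sketch in plain Python with made-up ratings (the function name `zscore` is just for illustration). After per-user mean and standard deviation normalisation, a user who rated everything low and a user who rated everything high become indistinguishable, so the overall level of satisfaction is lost.

```python
from statistics import mean, pstdev


def zscore(ratings):
    """Per-user normalisation: subtract the user's mean rating and
    divide by the user's (population) standard deviation."""
    mu = mean(ratings)
    sigma = pstdev(ratings)
    if sigma == 0:
        # A user who gives the same rating every time carries no spread.
        return [0.0 for _ in ratings]
    return [(r - mu) / sigma for r in ratings]


dissatisfied_user = [1, 2, 3]  # hypothetical: mostly low stars
satisfied_user = [3, 4, 5]     # hypothetical: mostly high stars

z_low = zscore(dissatisfied_user)
z_high = zscore(satisfied_user)
print(z_low)
print(z_high)
```

Both lists come out the same, which is exactly the information the answer argues should not be thrown away.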
When it comes to star ratings as a prediction for how likely something is to be "enjoyed" or "regretted" you might want to read this article: https://www.evanmiller.org/how-not-to-sort-by-average-rating.html
Note that the linked article is primarily interested in modelling "given past ratings, does the current vote indicate: (a) a continuation of past 'satisfaction', (b) a shifting trend towards increasing 'satisfaction', (c) a shifting trend towards decreasing 'satisfaction'" in terms of stars to award.
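For reference, the ranking score that the linked article recommends is the lower bound of the Wilson score confidence interval for a proportion of positive ratings. The sketch below assumes the star ratings have first been reduced to positive/negative votes (for example, 4 to 5 stars counted as positive), which is an assumption made for this illustration and not something stated in the question.

```python
from math import sqrt


def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for the fraction of
    positive votes; z=1.96 corresponds to roughly 95% confidence."""
    if total == 0:
        return 0.0
    phat = positive / total
    centre = phat + z * z / (2 * total)
    margin = z * sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
    return (centre - margin) / (1 + z * z / total)


# A single positive vote scores lower than 90 positive out of 100,
# even though its raw average (100%) is higher.
one_vote = wilson_lower_bound(1, 1)
many_votes = wilson_lower_bound(90, 100)
print(f"1/1: {one_vote:.3f}   90/100: {many_votes:.3f}")
```

The point of the lower bound is that items with few votes are not ranked above items with many consistently good votes.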

XGBoost forecasting model missing holiday period

I am building a forecasting system to predict the number of cable subscribers who will disconnect at a given point in time. I am using Python, and of the different models I tried, XGBoost performs the best.
I have a self-referential system in place that works in a moving-window fashion: as I run out of actuals, I start using forecasted figures in my lags.
To build the forecasting system, I used the previous 800 days of lags (disconnects per day), moving averages, ratios, seasonality, and indicators for year, month, day, week, etc. Holidays, however, are where it gets a little messed up. Initially I used just one column to indicate holidays of all sorts, but I later figured out that different holidays may have a different impact (some holidays cause high sales, some cause churn), so I added a column for each holiday. I also added indicators for long weekends, for holidays that fall on a Sunday, etc., and a 'season' column indicating festive periods such as Thanksgiving and the New Year holidays.
Even after adding so many holiday-related columns, I largely miss the Thanksgiving and New Year period. Although the model does take care of holidays to some extent, it completely misses the spike. As can be seen from the chart, the spikes recur every year (orange). My forecast (grey) does address the holidays in Dec 17, but it under-forecasts. Any idea how this can be taken care of?
P.S. I tuned the XGBoost hyperparameters using grid search.
As I understand it, if you cleaned your data and removed outliers, your model will give a more stable set of predictions overall, but it will fail to predict those outliers.
If you did clean the data, I'd play with the outlier threshold and see whether the larger errors on regular days are an acceptable trade for the ability to predict the higher spikes.
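One hypothetical way to "play with the threshold" is sketched below in plain Python. It assumes the cleaning step caps values at some percentile of the series, which may not be how the data was actually cleaned; the function name and the numbers are made up. The idea is to see how much of the holiday spike survives at each threshold before retraining the model on the result.

```python
def clip_at_percentile(values, percent):
    """Cap every value at the given percentile (0-100) of the series.
    Returns the clipped series and the cap that was applied."""
    if not values:
        return [], None
    ordered = sorted(values)
    idx = min(len(ordered) - 1, (percent * len(ordered)) // 100)
    cap = ordered[idx]
    return [min(v, cap) for v in values], cap


# Made-up daily disconnect counts with two holiday spikes (400 and 390).
daily_disconnects = [100, 110, 95, 105, 400, 98, 102, 390, 101, 99]

for percent in (70, 85, 100):
    clipped, cap = clip_at_percentile(daily_disconnects, percent)
    print(f"percentile {percent}: cap={cap}, largest value kept={max(clipped)}")
```

With an aggressive threshold the spikes disappear from the training target, so the model has nothing to learn them from; with no clipping they are kept at the cost of noisier fits on regular days.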

Assistance regarding model choice

I'm new to machine learning and am investigating it. I have a use case and data, but I am unsure of a few things, mainly how my model will run and what model to start with. Details of the use case and my questions are below. Any advice is appreciated.
My main questions are:
When basing a result on scores that are accumulated over time, is it possible to design a model that runs on a continuous basis, so that it gives a best guess at all times, whether it is run on day one or 3 months into the semester?
What model should I start with? I was thinking of a classifier, but ranking might also be interesting.
Use Case Details
Apprentices take a semesterized course, 4 semesters long, each 6 months in duration. Over the course of a semester, apprentices perform various operations and processes and are scored on how well they do. After each semester, the apprentices either have a sufficient score to move on to the next semester, or they fail.
We are investigating building a model that will help identify apprentices who are in danger of failing, with enough time for them to receive help.
Each procedure is assigned a complexity code of simple, intermediate or advanced, and scores are weighted by complexity.
Regarding features, we have the following:
Initial interview scores
Entry Exam Scores
Total number of simple procedures each apprentice performed
Total number of intermediate procedures each apprentice performed
Total number of advanced procedures each apprentice performed
Average score for each complexity level
Demographic information (nationality, age, gender)
What I am unsure of is how the model will work and when we will run it. For example, if we run it on day one of the semester, I assume everyone will be predicted to fail, as everyone has procedure scores of 0.
The current plan is to run the model 2-3 months into each semester, so that there is enough score data and also enough time to help any apprentices who are in danger of failing.
This definitely looks like a classification model problem:
y = f(x[0],x[1], ..., x[N-1])
where y (boolean output) = {pass, fail} and x[i] are different features.
There is a plethora of ML classification models, such as Naive Bayes, neural networks, decision trees, etc., which can be used depending on the type of data. If you are looking for an answer that suggests a particular ML model, I would need more data. In general, however, this flow-chart can be helpful in selecting one. You can also read about model selection in the 5th lecture of Andrew Ng's CS229.
Now coming back to the basic methodology, some of these features like initial interview scores, entry exam scores, etc. you already know in advance. Whereas, some of them like performance in procedures are known over the semester.
So, there is no harm in saying that the model will always predict better towards the end of each semester.
However, I can make a few suggestions to make it even better:
Instead of taking the initial procedure-scores as 0, take them as a mean/median of the past performances in other procedures by the subject-apprentice.
You can even build a sub-model to analyze the relation between procedure-scores and interview-scores as they are not completely independent. (I will explain this sentence in the later part of the answer)
However, if the semester is the very first semester of the subject-apprentice, then you won't have such data for that apprentice. In that case, you might need to consider the average performance of other apprentices with profiles similar to the subject-apprentice's. If the data-set is not very large, a K Nearest Neighbors approach can be quite useful here. For large data-sets, however, KNN becomes slow at prediction time, and with many features it suffers from the curse of dimensionality.
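The nearest-neighbour idea above can be sketched as follows. This is only an illustration in plain Python (it needs Python 3.8+ for `math.dist`): the feature choice (interview score, entry exam score), the numbers, and the function name `knn_expected_score` are all hypothetical.

```python
from math import dist


def knn_expected_score(target, past, k=3):
    """Estimate a new apprentice's procedure score as the mean score of
    the k past apprentices whose features are closest (Euclidean)."""
    if not past:
        raise ValueError("need at least one past apprentice")
    nearest = sorted(past, key=lambda item: dist(item[0], target))[:k]
    return sum(score for _, score in nearest) / len(nearest)


# ((interview score, entry exam score), mean procedure score) for
# apprentices from earlier intakes; values are made up.
past_apprentices = [
    ((70.0, 65.0), 72.0),
    ((72.0, 60.0), 68.0),
    ((40.0, 45.0), 50.0),
    ((90.0, 95.0), 91.0),
]

new_apprentice = (71.0, 63.0)
estimate = knn_expected_score(new_apprentice, past_apprentices, k=2)
print(f"Estimated starting procedure score: {estimate}")
```

In practice the features would need to be put on a comparable scale first, since Euclidean distance is dominated by whichever feature has the largest range.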
Also, plot a graph between y and different variables x[i], so as to see the independent variation of y with respect to each variable.
Most probably (although it's just a hypothesis), y will depend more on the initial variables than on the variables obtained later, because the later variables are not completely independent of the earlier ones.
My point is, if a model can be created to predict the output of a semester, then, a similar model can be created to predict just the output of the 1st procedure-test.
In the end, as the model might be heavily based on demographic factors and other things, it might not be a very successful model. For the same reason, we cannot accurately predict election results, soccer match results, etc., as they depend heavily on real-time dynamic data.
For dynamic predictions based on different procedure performances, time series analysis can be somewhat helpful. But in any case, the final result will depend heavily on the apprentice's continuity in motivation and performance, which will become clearer towards the end of each semester.

Similarity of trends in time series analysis

I am new to time series analysis. I am trying to find the trend of a short (1 day) temperature time series and have tried different approximations. The sampling interval is 2 minutes. The data were collected at different stations, and I will compare the trends to see whether they are similar or not.
I am facing three challenges in doing this:
Q1 - How can I extract the pattern?
Q2 - How can I quantify the trend, since I will compare trends belonging to two different places?
Q3 - When can I say two trends are similar or not similar?
Q1 - How can I extract the pattern?
You would start by performing time series analysis on both your data sets. You will need a statistical library to do the tests and comparisons.
If you can use Python, pandas is a good option.
In R, the forecast package is great. Start by running ets on both data sets.
Q2 - How can I quantify the trend, since I will compare trends belonging to two different places?
The idea behind quantifying trend is to start by looking for a (linear) trend line. All stats packages can assist with this. For example, if you are assuming a linear trend, then the trend line is the one that minimizes the sum of squared deviations from your data points (ordinary least squares), and its slope quantifies the trend.
The Wikipedia article on trend estimation is quite accessible.
Also, keep in mind that trend can be linear, exponential or damped. Different trending parameters can be tried to take care of these.
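For the linear case, the least-squares trend line can be computed directly. The sketch below is plain Python with made-up temperature readings; the function name `linear_trend` is just for illustration, and the 2-minute spacing comes from the question. The slope (degrees per minute) is one number you can compare between stations.

```python
def linear_trend(values, step_minutes=2.0):
    """Ordinary least-squares line through equally spaced readings.
    Returns (slope per minute, intercept at time zero)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two readings")
    times = [i * step_minutes for i in range(n)]
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = v_mean - slope * t_mean
    return slope, intercept


# Made-up readings from two stations, one value every 2 minutes.
station_a = [20.0, 20.5, 21.0, 21.5]
station_b = [18.0, 18.1, 18.4, 18.3]

slope_a, intercept_a = linear_trend(station_a)
slope_b, intercept_b = linear_trend(station_b)
print(f"A: {slope_a:.3f} deg/min, B: {slope_b:.3f} deg/min")
```

For exponential or damped trends the same fit can be applied after transforming the data, or a dedicated model such as ets in R can be used instead.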
Q3 - When can I say two trends are similar or not similar?
Run ARIMA on both data sets. The basic idea here is to see whether the same set of parameters (which make up the ARIMA model) can describe both of your temperature time series. If you run auto.arima() in forecast (R), it will select the parameters p, d, q for your data, which is a great convenience.
Another thought is to perform a 2-sample t-test on your two series and check the p-value for significance. (Caveat: I am not a statistician. The standard t-test assumes independent observations, which consecutive temperature readings are not, and it compares means rather than trends, so treat the result with caution.)
While researching I came across the Granger Test – where the basic idea is to see if one time series can help in forecasting another. Seems very applicable to your case.
So these are just a few things to get you started. Hope that helps.
