Filter documents - machine-learning

I have many documents of different kinds and in different formats, and I want to keep only those that are candidates' resumes, with high accuracy.
How can I do that?
I am not sure whether to use a keyword-based approach or a machine-learning approach. If machine learning is the better choice, which model would give high accuracy on this problem?

Related

ML algorithm suggestion for databases that change a lot with time after model training

I have a classification problem and I'm using a logistic regression (I tested it among other models and this one was the best). I look for information from game sites and test if a user has the potential to be a buyer of certain games.
The problem is that some of the sites I get this information from (the same sites that supplied the training data) have lately been changing weekly, so part of the database I use for prediction is "partially" different from the one used for training (with different information for each user, in this case). Since these sites started to change, the model's predictive ability has dropped considerably.
To solve this, one alternative would be, of course, to retrain the model. It's something we're considering, although we would have to do it fairly often, given that the sites change considerably every couple of weeks.
Another solution we considered was to use algorithms that could adapt to these changes, so that we could retrain the model less frequently.
The two options raised were using neural networks for classification or adapting some genetic algorithm. However, I have read that genetic algorithms would be very expensive and are not a good option for classification problems, given that they may not converge.
Does anyone have any suggestions for a modeling approach that we can test?

How to give a logical reason for choosing a model

I used machine learning to classify depression-related sentences, and LinearSVC performed best. I also experimented with MultinomialNB and LogisticRegression, and chose the model with the highest accuracy of the three. What I would like, though, is to be able to judge in advance which model will fit, as with the ml_map provided by Scikit-learn. Where can I get this information? I searched a few papers, but found nothing more detailed than that SVMs are suitable for text classification. How do I study to get prior knowledge like this ml_map?
How do I study to get prior knowledge like this ml_map?
Try working with different example datasets, of different data types, using different algorithms. There are hundreds to explore. Once you get a good grasp of how they work, it will become clearer. And do not forget to try googling something like "advantages of algorithm X"; it helps a lot.
Here are my thoughts; I used to ask such questions myself, and I hope this helps if you are struggling. The more you work with different machine learning models on a specific problem, the sooner you will realize that data and feature engineering matter more than the algorithms themselves. The road map provided by scikit-learn gives you a good view of which group of algorithms to use for certain types of data, and that is a good start. The boundaries between them, however, are rather subtle. In other words, one problem can be solved by different approaches depending on how you organize and engineer your data.
To sum up, to achieve good out-of-sample performance (i.e., good generalization), you need to examine the training/testing process under different combinations of settings and be mindful of your data (for example, ask yourself: does it cover most of the distribution found in the wild, or just a portion of it?).
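As a rough illustration of trying several algorithms under the same conditions, here is a minimal sketch that compares the three models from the question using cross-validation in scikit-learn. The tiny corpus and its labels are made up for the example, and the helper name compare_models is hypothetical; real data would need far more sentences for the scores to mean anything.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up sentences, for illustration only (1 = depression-related, 0 = not).
texts = [
    "I feel hopeless and tired all the time",
    "nothing makes me happy anymore",
    "I cannot sleep and I feel empty",
    "I feel worthless and alone",
    "everything feels pointless lately",
    "I am sad and exhausted every day",
    "we had a great picnic in the park",
    "the new phone has a very good camera",
    "I enjoyed the football match yesterday",
    "the recipe needs two eggs and flour",
    "our train arrived on time this morning",
    "she bought a new bike for the summer",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

candidates = {
    "LinearSVC": LinearSVC(),
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}


def compare_models(texts, labels, candidates, folds=3):
    """Return the mean cross-validated accuracy of each candidate model."""
    scores = {}
    for name, model in candidates.items():
        # The vectorizer sits inside the pipeline, so it is refit on the
        # training part of every fold and never sees the held-out part.
        pipeline = make_pipeline(TfidfVectorizer(), model)
        fold_scores = cross_val_score(
            pipeline, texts, labels, cv=folds, scoring="accuracy"
        )
        scores[name] = fold_scores.mean()
    return scores


scores = compare_models(texts, labels, candidates)
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Keeping the feature extraction and the folds identical across models means that any difference in score comes from the algorithm rather than from the setup, which is one way to build the kind of intuition the question asks about.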

How to decide a predictive model for sales forecasting

I would like to know which model I should choose to forecast monthly sales. Should I go for regression approaches or time-series methods, given a small dataset covering 1.5 years?
One of the first steps I would take is to determine clearly how many features you have.
In the case of univariate forecasting (observations over time of a single variable), you would most likely resort to statistical approaches such as ARIMA/SARIMA. (I assume the concept of seasonality is known; if not, please read about the properties of time series here: https://www.dummies.com/programming/big-data/data-science/key-properties-of-a-time-series-in-data-analysis/.)
If you have multiple features (observations over time of multiple variables), you could first try a VAR (vector autoregression).
Try these models first, and only then proceed to more complicated ones such as LSTMs/CNNs.
Supporting @Nicolae Petridean's point, the principle of Occam's razor should always be applied: start with simple models, and progress to deep learning techniques only after having tried several simpler ones.
Also, bear in mind that deep learning needs much more data than simpler statistical/mathematical models or even classical machine learning ones.
Depending on the data you have, either one or the other might work, or some other technique altogether. Try two simple models, one from each of the two families, and validate them against a common validation dataset. That will give you your answer. Nobody can answer your question without good insight into the data you have for training. My gut feeling would be to start with a regression, but I assume you will end up using something else. It is always a good idea to start with simple models to better understand the problem, and then progressively fine-tune them or move to more complicated models, depending on what the models you already have do or do not learn.
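The "two simple models, one common validation set" idea can be sketched in plain Python as follows. The 18 monthly figures are invented for the example, the function names are hypothetical, and the two models (a least-squares trend line and a naive last-value forecast) are merely the simplest stand-ins I could think of for the regression and time-series families.

```python
# Hypothetical 18 months of sales, invented for illustration.
sales = [120, 132, 128, 141, 150, 158, 163, 170, 166,
         181, 190, 204, 198, 210, 215, 226, 231, 240]

# Hold out the last 3 months; both models are judged on the same months.
train, valid = sales[:15], sales[15:]


def fit_linear_trend(series):
    """Least-squares fit of series[t] = intercept + slope * t."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    slope = (
        sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
        / sum((t - t_mean) ** 2 for t in range(n))
    )
    return y_mean - slope * t_mean, slope


def forecast_trend(series, horizon):
    """Regression approach: extrapolate the fitted trend line."""
    intercept, slope = fit_linear_trend(series)
    n = len(series)
    return [intercept + slope * (n + h) for h in range(horizon)]


def forecast_naive(series, horizon):
    """Naive time-series baseline: repeat the last observed value."""
    return [series[-1]] * horizon


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


errors = {
    "linear trend": mean_absolute_error(valid, forecast_trend(train, len(valid))),
    "naive last value": mean_absolute_error(valid, forecast_naive(train, len(valid))),
}
for name, err in errors.items():
    print(f"{name}: MAE = {err:.1f}")
```

With only 18 points the validation set is tiny, so the comparison is noisy; treat the winner as a hint about which direction to explore, not as a final verdict.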
Have a look at this Kaggle competition : https://www.kaggle.com/c/competitive-data-science-predict-future-sales
Check several notebooks from there and maybe you will understand more about what works and what does not in this kind of prediction.
Link to notebooks : https://www.kaggle.com/c/competitive-data-science-predict-future-sales/notebooks

When true positives are rare

Suppose you're trying to use machine learning for a classification task like, let's say, looking at photographs of animals and distinguishing horses from zebras. This task would seem to be within the state of the art.
But if you take a bunch of labelled photographs and throw them at something like a neural network or support vector machine, what happens in practice is that zebras are so much rarer than horses that the system just ends up learning to say 'always a horse' because this is actually the way to minimize its error.
Minimal error that may be but it's also not a very useful result. What is the recommended way to tell the system 'I want the best guess at which photographs are zebras, even if this does create some false positives'? There doesn't seem to be a lot of discussion of this problem.
One of the things I usually do with imbalanced classes (or skewed data sets) is simply generate more data. I think this is the best approach. You could go out into the real world and gather more data for the under-represented class (e.g. find more pictures of zebras). You could also generate more data by copying the existing examples and applying transformations to the copies (e.g. flipping them horizontally).
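A minimal sketch of the copy-and-transform idea, assuming images are stored as plain lists of pixel rows. The toy 2x3 "images" and the function names are made up for the example; a real pipeline would use an image library instead.

```python
def flip_horizontal(image):
    """Mirror an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def augment_minority(images, labels, minority_label):
    """Return the dataset plus a flipped copy of every minority-class image."""
    new_images, new_labels = list(images), list(labels)
    for image, label in zip(images, labels):
        if label == minority_label:
            new_images.append(flip_horizontal(image))
            new_labels.append(label)
    return new_images, new_labels


# Toy 2x3 "images", for illustration only.
horse = [[1, 2, 3], [4, 5, 6]]
zebra = [[7, 8, 9], [0, 1, 2]]
images = [horse, horse, horse, zebra]
labels = ["horse", "horse", "horse", "zebra"]

aug_images, aug_labels = augment_minority(images, labels, "zebra")
print(aug_labels.count("zebra"), "zebra examples after augmentation")
```

Augmentation like this should be applied to the training set only; flipped copies in the test set would make the evaluation look better than it is.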
You could also evaluate the classifier with a metric other than the one usually used, accuracy. Look at precision, recall and the F1 score.
Week 6 of Andrew Ng's ML course talks about this topic: link
Here is another good web page I found on handling imbalanced classes: link
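To make the point about metrics concrete, here is a small sketch in plain Python. The 95/5 class split and the predictions of the second classifier are invented for the example. It shows how an "always a horse" classifier can win on accuracy while being useless at finding zebras.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# 95 horses followed by 5 zebras (made-up proportions).
y_true = ["horse"] * 95 + ["zebra"] * 5

# Classifier A: always answers "horse".
always_horse = ["horse"] * 100

# Classifier B (hypothetical): 6 false alarms among the horses,
# but it finds 4 of the 5 zebras.
zebra_finder = ["horse"] * 89 + ["zebra"] * 6 + ["zebra"] * 4 + ["horse"] * 1

for name, y_pred in [("always horse", always_horse), ("zebra finder", zebra_finder)]:
    p, r, f1 = precision_recall_f1(y_true, y_pred, positive="zebra")
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.2f} "
          f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Classifier A has the higher accuracy, yet its recall on zebras is zero; classifier B accepts some false positives in exchange for actually finding zebras, which is what the question asks for.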
With this type of unbalanced-data problem, a good approach is to learn the patterns associated with each class rather than simply comparing classes. This can be done by applying unsupervised learning first (such as with autoencoders). A good article on this is available at https://www.r-bloggers.com/autoencoders-and-anomaly-detection-with-machine-learning-in-fraud-analytics/amp/. Another suggestion: after running the classifier, the confusion matrix can be used to determine where additional data should be gathered (i.e., where there are many zebra errors).

Working with inaccurate (incorrect) dataset

This is my problem description:
"According to the Survey on Household Income and Wealth, we need to find the top 10% of households with the highest income and expenditures. However, we know that the collected data are not reliable because of many misstatements. Despite these misstatements, some features in the dataset are certainly reliable. But these reliable features are only a small part of the information about each household's wealth."
Unreliable data means that households lie to the government: they misstate their income and wealth in order to unfairly obtain more government services. These fraudulent statements in the original data will therefore lead to incorrect results and patterns.
I have the following questions:
How should we deal with unreliable data in data science?
Is there any way to figure out these misstatements and then report the top 10% rich people with better accuracy using Machine Learning algorithms?
How can we evaluate our errors in this study? Since we have an unlabeled dataset, should I look for labeling techniques? Or should I use unsupervised methods? Or should I work with semi-supervised learning methods?
Is there any idea or application in Machine Learning which tries to improve the quality of collected data?
Please point me to any ideas or references that can help with this issue.
Thanks in advance.
Q: How should we deal with unreliable data in data science?
A: Use feature engineering to fix unreliable data (apply transformations that make it reliable) or drop it completely; bad features can significantly decrease the quality of the model.
Q: Is there any way to figure out these misstatements and then report the top 10% rich people with better accuracy using Machine Learning algorithms?
A: ML algorithms are not magic wands; they can't figure out anything unless you tell them what you are looking for. Can you describe what 'unreliable' means? If yes, you can, as I mentioned, use feature engineering or write code that fixes the data. Otherwise, no ML algorithm will be able to help you without a description of what exactly you want to achieve.
Q: Is there any idea or application in Machine Learning which tries to improve the quality of collected data?
A: I don't think so, simply because the question itself is too open-ended. What does 'the quality of the data' mean?
Generally, here are a couple of things for you to consider:
1) Spend some time googling feature engineering guides. They cover how to prepare your data for your ML algorithms, refine it and fix it. Good data with good features dramatically improves the results.
2) You don't need to use all the features from the original data. Some features of the original dataset are meaningless, and you don't need them. Try running a gradient boosting machine or a random forest classifier from scikit-learn on your dataset to perform classification (or regression, if that is your task). These algorithms also estimate the importance of each feature of the original dataset. Some of your features will have extremely low importance, so you may wish to drop them completely or try to combine unimportant features to produce something more important.
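As a rough sketch of point 2, the snippet below fits scikit-learn's RandomForestClassifier on synthetic data and ranks the features by the importances it reports. The data and the feature names are made up: one column determines the label and the other two are pure noise, so the ranking is easy to check.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only: the label depends on the first
# column alone; the other two columns are random noise.
rng = np.random.default_rng(0)
n_samples = 500
informative = rng.normal(size=n_samples)
noise_a = rng.normal(size=n_samples)
noise_b = rng.normal(size=n_samples)

X = np.column_stack([informative, noise_a, noise_b])
y = (informative > 0).astype(int)
feature_names = ["informative", "noise_a", "noise_b"]  # hypothetical names

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ holds one impurity-based score per column,
# normalized so that the scores sum to 1.
importances = dict(zip(feature_names, model.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can be misleading for some kinds of features (for example, ones with many distinct values), so it is worth confirming that dropping a low-ranked feature does not hurt the validation score before removing it for good.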

Resources