Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I would like to know the best available algorithms for text classification. I want to classify documents into categories such as sports, banking, technology, etc. Please suggest good algorithms to get the highest accuracy.
There is no best algorithm. See the 4th Law of Data Mining, "NFL-DM": http://khabaza.codimension.net/index_files/9laws.htm
You do want an algorithm that can handle many columns. More columns than rows if need be. This rules out matrix-based algorithms.
Naive Bayes and SVM are popular choices for text classification.
Good accuracy is not based only on the machine-learning algorithm. It also depends on the feature selection.
Try to define task specific features or analyze your feature space.
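As a concrete starting point for the Naive Bayes / SVM suggestion above, here is a small scikit-learn sketch using TF-IDF features. The documents and labels below are hypothetical stand-ins for your own corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for real labeled documents.
docs = [
    "the striker scored a goal in the final match",
    "the team won the league championship game",
    "the bank raised interest rates on savings accounts",
    "the loan was approved by the central bank",
    "the new smartphone ships with a faster processor",
    "the software update improves battery life",
]
labels = ["sports", "sports", "bank", "bank", "technology", "technology"]

# TF-IDF features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["the goalkeeper saved a penalty in the match"]))
```

Swapping `MultinomialNB` for `sklearn.svm.LinearSVC` in the same pipeline gives the SVM variant; with a real corpus you would compare both on held-out data.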
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 10 months ago.
I have to train a reinforcement learning agent (represented by a neural network) whose environment provides a dataset that contains outliers.
How should I handle normalization, given that I want to scale the data to the range [-1, 1]?
I need to keep the outliers in the dataset because they are critical: they can actually be significant in some circumstances despite being outside the normal range.
So the option of deleting rows entirely is excluded.
Currently, I'm trying to normalize the dataset using the IQR method.
I fear that with outliers still present, the agent will take some actions only when it encounters them.
I have already observed that a trained agent always took the same actions, to the exclusion of others.
What does your experience suggest?
After some tests, I took this road:
1. Applied a z-score normalization with the "Robust" option; in this way, I have mean = 0 and sd = 1.
2. Calculated the midrange of each feature: (min_range(feature) + max_range(feature)) / 2.
3. Divided all the feature data by the midrange calculated in step 2.
The agent learned pretty well.
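A minimal NumPy sketch of those steps, assuming "robust" standardization means centering on the median and scaling by the IQR (the data below are made up):

```python
import numpy as np

def robust_scale(x):
    """Center on the median and scale by the IQR (one reading of 'robust' z-score)."""
    q1, q3 = np.percentile(x, [25, 75])
    return (x - np.median(x)) / (q3 - q1)

# Hypothetical feature with one large outlier that must be kept.
x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])

z = robust_scale(x)              # step 1: robust standardization
mid = (z.min() + z.max()) / 2.0  # step 2: midrange of the scaled feature
scaled = z / mid                 # step 3: divide by the midrange
print(scaled)
```

Note that dividing by the midrange does not by itself guarantee a [-1, 1] range; if a strict range is required, you would still need to clip or rescale afterwards.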
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
Can someone please give me some suggestions on which feature selection techniques I should use for gene classification?
The major difficulties in working with gene expression data are the large number of dimensions and the small sample size. Instead of standard feature extraction/selection algorithms, kernel-based feature selection algorithms are generally applied to gene expression data, such as KBMTL (kernelized Bayesian multitask learning) and NDR (nonlinear dimensionality reduction), as well as regularized linear methods such as LASSO and Elastic-net.
You can check these papers to learn more about how to perform efficient feature selection on gene expression data.
paper1
paper2
paper3
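As an illustration of the regularized-linear route, here is a LASSO fit on synthetic data with far more "genes" than samples; the data and the choice of `alpha` are arbitrary, not a recommendation:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_genes = 30, 200  # many more "genes" than samples, as in expression data

X = rng.normal(size=(n_samples, n_genes))
# Hypothetical ground truth: only the first three genes drive the outcome.
y = 5 * X[:, 0] - 4 * X[:, 1] + 3 * X[:, 2] + 0.1 * rng.normal(size=n_samples)

# The L1 penalty drives most coefficients to exactly zero, acting as feature selection.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(len(selected), "genes selected out of", n_genes)
```

In practice `alpha` would be chosen by cross-validation (e.g. `LassoCV`), and Elastic-net (`ElasticNet`) adds an L2 term that behaves better when informative genes are correlated.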
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I want to train an output vector (from a deep learning model) to match a fixed target vector. Hence, I chose the cosine similarity between the two vectors as the objective function. However, I don't know whether that is a correct approach for my need.
No. The cosine similarity is a measure of how similar two items (samples in your dataset) are.
In contrast, the objective function when training a neural network should be a definition of the current estimation error over the data - so they are different things.
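To make the distinction concrete: the similarity itself is not an error, but one common way to derive an error measure from it is the cosine distance, 1 minus the similarity, which is 0 for aligned vectors and 2 for opposite ones. A small NumPy sketch:

```python
import numpy as np

def cosine_similarity(a, b):
    """How aligned two vectors are, in [-1, 1]; ignores magnitude."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance_loss(pred, target):
    """An error measure built from the similarity: 0 when aligned, 2 when opposite."""
    return 1.0 - cosine_similarity(pred, target)

target = np.array([1.0, 0.0, 0.0])
print(cosine_distance_loss(target, target))  # aligned: no error
```

Note that this loss only penalizes direction, not magnitude; if the length of the output vector also matters for your task, a distance such as mean squared error would be more appropriate.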
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I am trying to predict tags for Stack Overflow questions, and I am not able to decide which machine learning algorithm would be a correct approach for this.
Input: As a dataset, I have mined Stack Overflow questions; I have tokenized the dataset and removed stopwords and punctuation from it.
Things I have tried:
TF-IDF
Trained Naive Bayes on the dataset and then gave user-defined input to predict tags, but it's not working correctly
Linear SVM
Which ML approach should I use, supervised or unsupervised? If possible, please suggest a correct ML approach from scratch. PS: I have the list of all tags present on Stack Overflow; will this help in any way? Thanks
I would try an MLP. To begin, I would choose a reasonably small set of keywords as input, encode them (as 1..100, for example), and train on a reasonably small set of output tags.
PS: Unsupervised learning is unfavorable for this task in general, because many questions that refer to different tags have very similar content and are very likely to get clustered together.
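Since a question can carry several tags at once, this is naturally a supervised multi-label problem rather than single-label classification, which may explain why plain Naive Bayes "is not working correctly". A scikit-learn sketch, using made-up questions and tags in place of the mined dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical stand-ins for mined Stack Overflow questions and their tag sets.
questions = [
    "how do I merge two dataframes in pandas",
    "pandas groupby aggregation on multiple columns",
    "null pointer exception when starting an android activity",
    "android fragment lifecycle explained",
    "read a file line by line in python",
    "python list comprehension with condition",
]
tags = [
    ["python", "pandas"],
    ["python", "pandas"],
    ["java", "android"],
    ["java", "android"],
    ["python"],
    ["python"],
]

# Encode tag sets as a binary indicator matrix (one column per known tag).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# One linear SVM per tag over shared TF-IDF features.
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
clf.fit(questions, Y)

pred = mlb.inverse_transform(clf.predict(["pandas merge two dataframes"]))
print(pred)
```

The list of all Stack Overflow tags helps here: it fixes the label space for `MultiLabelBinarizer` up front, so every classifier column corresponds to a known tag.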
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I am working on a practical machine learning problem as an exercise. I just need help formulating my problem.
I have the text of 20 books by a famous old author. There are 5 more books whose attribution to the same author has been debated throughout history.
I am thinking about the best way to represent this problem, and I am considering a bag-of-words approach to find the most significant words used by the author.
Should I treat it as a Naive Bayes (spam/ham) problem, or should I use KNN classification (author/non-author) to detect the class of each document? Is there another way of doing it?
I think Naive Bayes can give you insights. Another way is to find features that separate such books, for example:
1. Complexity of words: some writers are easy to understand and use common words; I am hinting at IDF (inverse document frequency).
2. Some words may not even have existed in his time, like "selfie", "mobile", etc.
Try to find many features like that; you can then also train a discriminative classifier.