Retrieval Based Q/A bot - machine-learning

I am trying to train a retrieval-based Q/A chatbot using an RNN (treating it as classification). I trained for about 1000 steps but have hardly got any meaningful results (accuracy < 10%). Basically, I was trying to map the TensorFlow DBpedia example onto my dataset (so I turned my Q/A problem into a classification one). DBpedia is a clean and grammatically correct dataset. However, my dataset is full of short forms, grammatical errors and spelling mistakes. I have tried to correct many of them using (right/wrong) word pairs and stemming.
I have read that sequence-to-sequence models work best for such problems. However, I had not expected the RNN to fail so miserably.
Any ideas why it did?
[EDIT]: Even a character-level CNN gives similar results.
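For reference, a minimal sketch of the kind of normalization described above (fixing known short forms via word pairs, then stemming), assuming NLTK's PorterStemmer; the correction pairs here are made up:

import re
from nltk.stem import PorterStemmer  # assumes nltk is installed

# Hypothetical (wrong -> right) pairs of the kind mentioned in the question.
corrections = {"plz": "please", "u": "you", "wat": "what"}
stemmer = PorterStemmer()

def normalize(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [corrections.get(t, t) for t in tokens]  # apply word-pair corrections
    return [stemmer.stem(t) for t in tokens]          # reduce inflected forms

print(normalize("Plz tell me wat u mean"))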

Related

How to deal with negative numbers while creating the log features?

I have some questions on a data set for ML:
There are only 2 features, input and output, in my data, which has 1700 examples. The input and output have a non-linear relationship and neither is normally distributed; the scatter plot is shown below. What can we say the relationship between them is? How do I approach a solution to this kind of problem? How do I create features? I have built features like the log and sqrt of the input, which gave me a good correlation with the output, but these functions don't work for negative numbers, so what should I do about the negative numbers?
[Scatter plot of input vs. output]
I have tried using the cube root (power 1/3), but it does not correlate well with the output.
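One standard workaround (not mentioned in the post, but commonly used for log features on data with negative values) is a sign-preserving log transform; a small sketch with NumPy and a hypothetical column name:

import numpy as np
import pandas as pd

# Toy data containing negative values; "input" is a hypothetical column name.
df = pd.DataFrame({"input": [-120.0, -3.5, 0.0, 2.1, 450.0]})

# Signed log: keeps the sign, compresses the magnitude, and is defined at 0.
df["log_input"] = np.sign(df["input"]) * np.log1p(np.abs(df["input"]))

# Cube root is another sign-preserving option (the transform tried in the post).
df["cbrt_input"] = np.cbrt(df["input"])
print(df)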

Machine learning models don't work with continuous data

I'm attempting to get a machine learning model to predict a baseball player's Batting Average based on their At Bats and Hits. Since:
Batting Average = Hits/At Bats
I would think this relationship would be relatively easy to discover. However, since Batting Average is a float (e.g. 0.300), all the models I try return the following error:
ValueError: Unknown label type: 'continuous'
I'm using sklearn's models. I've tried LogisticRegression, RandomForestClassifier and LinearRegression. They all have the same problem.
From reading other StackOverflow posts on this error, I began doing this:
from sklearn import preprocessing
import pandas as pd
lab_enc = preprocessing.LabelEncoder()
y = pd.DataFrame(data=lab_enc.fit_transform(y))
This seems to change values such as 0.227 to 136, which looks odd to me, probably just because I don't quite understand what the transform is doing. I would, if possible, prefer to just use the actual Batting Average values.
Is there a way to get the models I tried to work when predicting continuous values?
The problem you are trying to solve falls into the regression (i.e. numeric prediction) context, and it can certainly be dealt with using ML algorithms.
I'm using sklearn's models. I've tried LogisticRegression, RandomForestClassifier and LinearRegression. They all have the same problem.
The first two algorithms you mention here (Logistic Regression and Random Forest Classifier) are for classification problems and thus are not suitable for your (regression) setting; they expectedly produce the error you mention. Linear Regression, however, is suitable and should work fine here.
For starters, please stick to Linear Regression, in order to convince yourself that it can indeed handle the problem; you can subsequently extend to other scikit-learn algorithms like RandomForestRegressor etc. If you face any issues, open a new question with the specific code and error(s).
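A minimal sketch of that advice, with hypothetical toy data since the post does not show how the dataframe is built:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data; the real post loads its own At Bats / Hits data.
df = pd.DataFrame({
    "at_bats": [500, 420, 610, 305],
    "hits":    [150, 110, 200,  70],
})
df["batting_average"] = df["hits"] / df["at_bats"]  # float target, e.g. 0.300

X = df[["at_bats", "hits"]]
y = df["batting_average"]  # no LabelEncoder needed for a regression target

reg = LinearRegression().fit(X, y)
print(reg.predict(X))  # continuous predictions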

Deep learning classification with no labels

I have to participate in a research project regarding a deep learning application for classification. I have a huge dataset containing over 35000 features - these are good values, taken from a laboratory.
The idea is that I should create a classifier that tells, given a new input, whether the data looks good or not. I have to use deep learning with Keras and TensorFlow.
The problem is that the data is not labelled. I will add a new column with 1 for good and 0 for bad. The problem is: how can I find out whether an entry is bad, given that the whole training set is good?
I have thought about generating some garbage data, but I don't know if this is a good idea - I don't even know how to generate it. Do you have any tips?
I would start with anomaly detection. You can first reduce the features with, e.g., a (stacked) autoencoder and then use Local Outlier Factor from sklearn: https://scikit-learn.org/stable/modules/outlier_detection.html
The reason you need to reduce the features first is that your LOF will be much more stable.
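A minimal sketch of that pipeline, assuming the samples have already been compressed by the autoencoder into X_reduced (random placeholder arrays stand in for the real data here):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Placeholder for the autoencoder bottleneck output of the "good" lab samples.
rng = np.random.default_rng(0)
X_reduced = rng.normal(size=(1000, 32))

# novelty=True: fit on the clean training data, then score unseen samples later.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(X_reduced)

# Encode new measurements the same way, then ask LOF whether they fit in.
X_new = rng.normal(loc=5.0, size=(5, 32))  # deliberately shifted, i.e. "bad"
print(lof.predict(X_new))  # +1 = looks like the training data, -1 = outlier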

Overfitting my model over my training data of a single sample

I am trying to overfit my model on training data that consists of only a single sample. The training accuracy comes out to be 1.00. But when I predict the output for my test data, which consists of the same single training sample, the results are not accurate. The model has been trained for 100 epochs and the loss is ~1e-4.
What could be the possible sources of error?
As mentioned in the comments of your post, it isn't possible to give specific advice without you first providing more details.
Generally speaking, your approach of overfitting a tiny batch (in your case, one image) in essence provides three sanity checks, i.e. that:
backprop is functioning
the weight updates are doing their job
the learning rate is in the correct order of magnitude
As Andrej Karpathy points out in Lecture 5 of the CS231n course at Stanford: "if you can't overfit on a tiny batch size, things are definitely broken".
This means, given your description, that your implementation is incorrect. I would start by checking each of the three points listed above. For example, alter your test by picking several different images, or a batch size of 5 images instead of one. You could also revise your predict function, as that is where there is definitely some discrepancy, given that you are getting zero error during training (and presumably validation?).
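For illustration, a minimal sketch of this sanity check in Keras, with a single hypothetical 32x32x3 sample (not the asker's actual model or data):

import numpy as np
import tensorflow as tf

# One hypothetical sample: a 32x32x3 "image" with a binary label.
rng = np.random.default_rng(0)
x = rng.random((1, 32, 32, 3)).astype("float32")
y = np.array([1.0], dtype="float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=100, verbose=0)  # loss should approach ~0

# Predicting on the *same* sample, with the *same* preprocessing, should
# reproduce the training result; if it does not, the predict path is suspect.
print(model.evaluate(x, y, verbose=0))
print(model.predict(x, verbose=0))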

Clustering or other mechanisms for implementing generic spam detection

In the normal case, I had earlier tried out Naive Bayes and a linear SVM to classify a specific type of comments on a page, where I had access to training data manually labelled as spam or ham.
Now I am being told to check whether there are any ways to classify comments as spam when we don't have training data. Something like getting two clusters of the data, which would then be marked as spam or ham, given any data.
I need to know some ways to approach this problem and what a good way to implement it would be.
I am still learning and experimenting. Any help will be appreciated.
Are the new comments very different from the old comments in terms of vocabulary? Because words are almost all that the classifiers for this task look at.
You can always try using your old training data and applying the classifier to the new domain. You would have to label a few examples from your new domain in order to measure performance (or better, let others do the labelling in order to get more reliable results).
If this doesn't work well, you could try domain adaptation or look for datasets more similar to your new domain, using Google or this list of spam/ham corpora.
Finally, there may be some regularity or pattern in your new setting, e.g. downvotes on a comment, which may indicate spam/ham. In such cases, you could compile training data yourself. This would then be called distant supervision (you can search for papers using this keyword).
The best I could find was a research paper that discusses active learning. So what I came up with is this: I first perform k-means clustering and take the central clusters (assuming 5 clusters, I take 3 clusters in descending order of size) and take 1000 messages from each. These are then given to the user to label. The next step is training a logistic regression model on the labelled data and getting probabilities for the unlabelled data; if a probability is close to 0.5 (in the range 0.4 to 0.6), meaning the model is uncertain, that message is sent back for labelling, and the process continues.
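A rough sketch of that loop (cluster, hand-label a seed set, train, route uncertain predictions back for labelling), using toy comments and a keyword heuristic standing in for the human labelling step; the cluster-based seed selection and the seed size of 1000 per cluster from the post are simplified here:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy comments; in practice these would be the real unlabelled messages.
rng = np.random.default_rng(0)
templates = ["win cash now", "free money click here", "cheap pills online",
             "great article thanks", "nice post very helpful", "good point about clustering"]
comments = [str(rng.choice(templates)) + f" {i}" for i in range(300)]

vec = TfidfVectorizer()
X = vec.fit_transform(comments)

# Step 1: cluster the messages (the post picked the 3 largest of 5 clusters
# and took 1000 messages from each; a random seed set is used here instead).
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
seed_idx = rng.choice(len(comments), size=60, replace=False)

# Stand-in for human labelling of the seed set: 1 = spam, 0 = ham.
seed_labels = np.array([1 if any(w in comments[i] for w in ("cash", "free", "pills")) else 0
                        for i in seed_idx])

# Step 2: train logistic regression on the labelled seed messages.
clf = LogisticRegression(max_iter=1000).fit(X[seed_idx], seed_labels)

# Step 3: score the rest; probabilities near 0.5 (0.4 to 0.6) are uncertain
# and get sent back to the human for labelling, then the loop repeats.
rest_idx = np.setdiff1d(np.arange(len(comments)), seed_idx)
proba = clf.predict_proba(X[rest_idx])[:, 1]
uncertain = rest_idx[(proba > 0.4) & (proba < 0.6)]
print(len(uncertain), "messages routed back for labelling")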
