I have some questions on a data set for ML::-->
so, there are only 2 features input and output for my data which has 1700 examples, And input and output are having non-linear relationship ship..they are not normally distributed respectfully, and the scatter is shown below..what can we say is the relationship between them? How to approach a solution to this kind of problem? how to create features? - I have built features like log and sqrt of input gave me a good correlation with output but this function doesn't apply for negative no.'s so what to do for -ve no.'s?enter image description here
I have tried using cubicroot(1/3) but not badly correlated with output.
Related
I have a file of raw feedbacks that needs to be labeled(categorized) and then work as the training input for SVM Classifier(or any classifier for that matter).
But the catch is, I'm not assigning whole feedback to a certain category. One feedback may belong to more than one category based on the topics it talks about (noun n-grams are extracted). So, I'm labeling the topics(terms) not the feedbacks(documents). And so, I've extracted the n-grams using TFIDF while saving their features so i could train my model on. The problem with that is, using tfidf, it returns a document-term matrix that's train_x, but on the other side, I've got train_y; The labels that are assigned to each n-gram (not the whole document). So, I've ended up with a document to frequency matrix that contains x number of rows(no of documents) against a label of y number of n-grams(no of unique topics extracted).
Below is a sample of what the data look like. Blue is the n-grams(extracted by TFIDF) while the red is the labels/categories (calculated for each n-gram with a function I've manually made).
Instead of putting code, this is my strategy in implementing my concept:
The problem lies in that part where TFIDF producesx_train = tf.Transform(feedbacks), which is a document-term matrix and it doesn't make sense for it to be an input for the classifier against y_train, which is the labels for the terms and not the documents. I've tried to transpose the matrix, it gave me an error. I've tried to input 1-D array that holds only feature values for the terms directly, which also gave me an error because the classifier expects from X to be in a (sample, feature) format. I'm using Sklearn's version of SVM and TfidfVectorizer.
Simply, I want to be able to use SVM classifier on a list of terms (n-grams) against a list of labels to train the model and then test new data (after cleaning and extracting its n-grams) for SVM to predict its labels.
The solution might be a very technical thing like using another classifier that expects a different format or not using TFIDF since it's document focused (referenced) or even broader, a whole change of approach and concept (if it's wrong).
I'd very much appreciate it if someone could help.
I'm trying to use the approach described in this paper https://arxiv.org/abs/1712.01815 to make the algorithm learn a new game.
There is only one problem that does not directly fit into this approach. The game I am trying to learn has no fixed board size. So currently the input tensor has dimensions m*n*11, where m and n are the dimensions of the game board and can vary each time the game is played. So first of all I need a neural network able to make use of such varying input sizes.
The size of the output is also a function of the board size, as it has a vector with entries for every possible move on the board, and so the output vector will be bigger if the board size increases.
I have read about recurrent and recursive neural networks but they all seem to relate to NLP, and I'm not sure on how to translate that to my problem.
Any ideas on NN architectures able to handle my case would be welcome.
What you need is Pointer Networks (https://arxiv.org/abs/1506.03134)
Here is a introduction quote from a post about it:
Pointer networks are a new neural architecture that learns pointers to positions in an input sequence. This is new because existing techniques need to have a fixed number of target classes, which isn't generally applicable— consider the Travelling Salesman Problem, in which the number of classes is equal to the number of inputs. An additional example would be sorting a variably sized sequence.
- https://finbarr.ca/pointer-networks/
Its an attention based model.
Essentially a pointer network is used to predict pointers back to the input, meaning your output layer isn't actually fixed, but variable.
A use case where I have used them is for translating raw text into SQL queries.
Input: "HOW MANY CARS WERE SOLD IN US IN 1983"
Output: SELECT COUNT(Car_id) FROM Car_table WHERE (Country='US' AND
Year=='1983')
The issue with raw text such as this is that it will only make sense w.r.t to a specific table (in this case car table with a set of variables around car sales, similar to your different boards for board games). Meaning, that if the question cant be the only input. So the input that actually goes into the pointer network is a combination of -
Input -
Query
Metadata of the table (column names)
Token vocabulary for all categorical columns
Keywords from SQL syntax (SELECT, WHERE etc..)
All of these are appended together.
The output layer then simply points back to specific indexes of the input. It points to Country and Year (from column names in metadata), it points to US and 1983 (from tokens in vocabulary of categorical columns), it points to SELECT, WHERE etc from the SQL syntax component of the input.
The sequence of these indexes in the appended index is then used as the output of your computation graph, and optimized using a training dataset that exists as WIKISQL dataset.
Your case is quite similar, you need to pass the inputs, metadata of the game, and the stuff you need as part of your output as an appended index. Then the pointer network simply makes selections from the input (points to them).
You need to go back to a fixed input / output problem.
A common way to fix this issue when applying to images / time series... is to use sliding windows to downsize. Perhaps this can be applied to your game.
Fully convolutional neural network is able to do that. Parameters of conv layers are convolutional kernels. Convolutional kernel not so much care about input size(yes there are certain limitations related to stride, padding input and kernel size).
Typical use case is some convlayers followed by the maxpooling and repeated again and again to some point where are filters flattened and connected to dense layer. Dense layer is problem because he expect input at fixed size. If there is another conv2 layer, your output will be another feature map of appropriate size.
Example of such network could be YOLOv3. If you feed it for example with image 416x416x3 output can be for example 13x13xnumber of filters(I know YOLOv3 has more output layers but I will discuss only one because of simplicity). If you feed YOLOv3 with image 256x256x3, output will be feature map 6x6xnumber of filters.
So network don't crash and produce results. Will be results good? I don't know, maybe yes, maybe no. I never use it at such manner I always resize image to recommended size or retrain network.
Here is the problem statement:
I have 2 datasets from different years(2013 dataset and 2014 dataset), the data is multivariate with each dataset containing 38 attributes, I want to find out any difference/delta that might have occured in between two datasets in these consecutive years, this difference should be a numerical value.
So far I have applied following techniques:
1)ANOVA (This tells me that difference is there but it doesn't tell me how much the difference is)
2)Wilcoxon-Mann-Whitney U test (Same problem as ANOVA)
3)Finding the Mean Square Error between the mean of the datasets.
Questions:
1) Is their any other method/test that can be applied which would give me a numerical value of the difference between datasets?
2) If I label the 2013 dataset as "1" and 2014 dataset as "2" then can the weight's of neural network trained to classify these dataset be used to somehow find the difference between datasets?
Note: Due to confidentiality agreement I cannot share the data here.
Don't know if you have found an answer or not.
Have you tried using RMSE? You can create a score for every column of a dataset and then combine them to get an average score for the whole data.
It's not a perfect method but it should give a scale of difference when comparing multiple dataset to eachother.
If you did find a better answer than what I suggested, please so let me know as I would be interested in it.
All the best.
I have a set of 3-5 black box scoring functions that assign positive real value scores to candidates.
Each is decent at ranking the best candidate highest, but they don't always agree--I'd like to find how to combine the scores together for an optimal meta-score such that, among a pool of candidates, the one with the highest meta-score is usually the actual correct candidate.
So they are plain R^n vectors, but each dimension individually tends to have higher value for correct candidates. Naively I could just multiply the components, but I hope there's something more subtle to benefit from.
If the highest score is too low (or perhaps the two highest are too close), I just give up and say 'none'.
So for each trial, my input is a set of these score-vectors, and the output is which vector corresponds to the actual right answer, or 'none'. This is kind of like tech interviewing where a pool of candidates are interviewed by a few people who might have differing opinions but in general each tend to prefer the best candidate. My own application has an objective best candidate.
I'd like to maximize correct answers and minimize false positives.
More concretely, my training data might look like many instances of
{[0.2, 0.45, 1.37], [5.9, 0.02, 2], ...} -> i
where i is the ith candidate vector in the input set.
So I'd like to learn a function that tends to maximize the actual best candidate's score vector from the input. There are no degrees of bestness. It's binary right or wrong. However, it doesn't seem like traditional binary classification because among an input set of vectors, there can be at most 1 "classified" as right, the rest are wrong.
Thanks
Your problem doesn't exactly belong in the machine learning category. The multiplication method might work better. You can also try different statistical models for your output function.
ML, and more specifically classification, problems need training data from which your network can learn any existing patterns in the data and use them to assign a particular class to an input vector.
If you really want to use classification then I think your problem can fit into the category of OnevsAll classification. You will need a network (or just a single output layer) with number of cells/sigmoid units equal to your number of candidates (each representing one). Note, here your number of candidates will be fixed.
You can use your entire candidate vector as input to all the cells of your network. The output can be specified using one-hot encoding i.e. 00100 if your candidate no. 3 was the actual correct candidate and in case of no correct candidate output will be 00000.
For this to work, you will need a big data set containing your candidate vectors and corresponding actual correct candidate. For this data you will either need a function (again like multiplication) or you can assign the outputs yourself, in which case the system will learn how you classify the output given different inputs and will classify new data in the same way as you did. This way, it will maximize the number of correct outputs but the definition of correct here will be how you classify the training data.
You can also use a different type of output where each cell of output layer corresponds to your scoring functions and 00001 means that the candidate your 5th scoring function selected was the right one. This way your candidates will not have to be fixed. But again, you will have to manually set the outputs of the training data for your network to learn it.
OnevsAll is a classification technique where there are multiple cells in the output layer and each perform binary classification in between one of the classes vs all others. At the end the sigmoid with the highest probability is assigned 1 and rest zero.
Once your system has learned how you classify data through your training data, you can feed your new data in and it will give you output in the same way i.e. 01000 etc.
I hope my answer was able to help you.:)
I've got a problem where I've potentially got a huge number of features. Essentially a mountain of data points (for discussion let's say it's in the millions of features). I don't know what data points are useful and what are irrelevant to a given outcome (I guess 1% are relevant and 99% are irrelevant).
I do have the data points and the final outcome (a binary result). I'm interested in reducing the feature set so that I can identify the most useful set of data points to collect to train future classification algorithms.
My current data set is huge, and I can't generate as many training examples with the mountain of data as I could if I were to identify the relevant features, cut down how many data points I collect, and increase the number of training examples. I expect that I would get better classifiers with more training examples given fewer feature data points (while maintaining the relevant ones).
What machine learning algorithms should I focus on to, first,
identify the features that are relevant to the outcome?
From some reading I've done it seems like SVM provides weighting per feature that I can use to identify the most highly scored features. Can anyone confirm this? Expand on the explanation? Or should I be thinking along another line?
Feature weights in a linear model (logistic regression, naive Bayes, etc) can be thought of as measures of importance, provided your features are all on the same scale.
Your model can be combined with a regularizer for learning that penalises certain kinds of feature vectors (essentially folding feature selection into the classification problem). L1 regularized logistic regression sounds like it would be perfect for what you want.
Maybe you can use PCA or Maximum entropy algorithm in order to reduce the data set...
You can go for Chi-Square tests or Entropy depending on your data type. Supervized discretization highly reduces the size of your data in a smart way (take a look into Recursive Minimal Entropy Partitioning algorithm proposed by Fayyad & Irani).
If you work in R, the SIS package has a function that will do this for you.
If you want to do things the hard way, what you want to do is feature screening, a massive preliminary dimension reduction before you do feature selection and model selection from a sane-sized set of features. Figuring out what is the sane-size can be tricky, and I don't have a magic answer for that, but you can prioritize what order you'd want to include the features by
1) for each feature, split the data in two groups by the binary response
2) find the Komogorov-Smirnov statistic comparing the two sets
The features with the highest KS statistic are most useful in modeling.
There's a paper "out there" titled "A selctive overview of feature screening for ultrahigh-dimensional data" by Liu, Zhong, and Li, I'm sure a free copy is floating around the web somewhere.
4 years later I'm now halfway through a PhD in this field and I want to add that the definition of a feature is not always simple. In the case that your features are a single column in your dataset, the answers here apply quite well.
However, take the case of an image being processed by a convolutional neural network, for example, a feature is not one pixel of the input, rather it's much more conceptual than that. Here's a nice discussion for the case of images:
https://medium.com/#ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721