Modeling a neural net for inputs with varying properties - machine-learning

I'm looking to use Tensorflow to set up a neural network to score items based on various properties they have. The amount of properties a given item can have is small (let's say 10 is the max) but the amount of possible properties is in the hundreds. For example, imagine we were scoring different kinds of vehicle, each with various attributes ("wheels", "engine horsepower", "wings", etc.) and a numerical value for that attribute (2, 600, 4).
My question is: is there a way to model the neural network for this to have a relatively low number of inputs, on the order of the max number of properties the item can have (in this example, 10)? Or does each possible property need to be an input, resulting in hundreds of total inputs, most of which (>90%) would be blank for any given item?

Just have all of the possible properties as inputs, but set them to 0 when they are not present. Hundreds of inputs to a NN is not uncommon anyways.

Related

Is this problem a classification or regression?

In a lecture from Andrew Ng, he asked whether the problem below is a classification or a regression problem. Answer: It is a regression problem.
You have a large inventory of identical items. You want to predict how
many of these items will sell over the next 3 months.
Looks like I am missing something. Per my understanding it should be classification problem. Reason is we have to classify each item in two categories i.e it can be sold or not, which are discrete value not the continuous ones.
Not sure where is the gap in my understanding.
Your thinking is that you have a database of items with their respective features and want to predict if each item will be sold. At the end, you would simply count the number of items that can be sold. If you frame the problem this way, then it would be a classification problem indeed.
However, note the following sentence in your question:
You have a large inventory of identical items.
Identical items means that all items will have exactly the same features. If you come up with a binary classifier that tells whether a product can be sold or not, since all feature values are exactly the same, your classifier would put all items in the same category.
I would guess that, to solve this problem, you would probably have access to the time-series of sold items per month for the past 5 years, for instance. Then, you would have to crunch this data and interpolate to the future. You won't be classifying each item individually but actually calculating a numerical value that indicates the number of sold items for 1, 2, and 3 months in the future.
According to Pattern Recognition and Machine Learning (Christopher M. Bishop, 2006):
Cases such as the digit recognition example, in which the aim is to assign each input vector to one of a finite number of discrete categories, are called classification problems. If the desired output consists of one or more continuous variables, then the task is called regression.
On top of that, it is important to understand the difference between categorical, ordinal, and numerical variables, as defined in statistics:
A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories.
(...)
An ordinal variable is similar to a categorical variable. The difference between the two is that there is a clear ordering of the variables. For example, suppose you have a variable, economic status, with three categories (low, medium and high). In addition to being able to classify people into these three categories, you can order the categories as low, medium and high.
(...)
An numerical variable is similar to an ordinal variable, except that the intervals between the values of the numerical variable are equally spaced. For example, suppose you have a variable such as annual income that is measured in dollars, and we have three people who make $10,000, $15,000 and $20,000.
Although your end result will be an integer (a discrete set of numbers), note it is still a numerical value, not a category. You can manipulate mathematically numerical values (e.g. calculate the average number of sold items in the next year, find the peak number of sold items in the next 3 months...) but you cannot do that with discrete categories (e.g. what would be the average of a cellphone and a telephone?).
Classification problems are the ones where the output is either categorical or ordinal (discrete categories, as per Bishop). Regression problems output numerical values (continuous variables, as per Bishop).
Your system might be restricted to outputting integers, instead of real numbers, but won't change the nature of the variable from being numerical. Therefore, your problem is a regression problem.

multi-label text classification with zero or more labels

I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc). My problem is handling text that isn't one of these labels.
I tried ML libraries (maxent, naive bayes), but they match "other" text incorrectly with one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad and it's not possible to pick a representative sample.
Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term frequency count, using a predefined list of terms to match for each label. But with the counts, how do I determine a relevancy score, i.e. if the text is actually that label? I don't have a corpus and can't use tf-idf, etc.
Another idea , is to user neural networks with softmax output function, softmax will give you a probability for every class, when the network is very confident about a class, will give it a high probability, and lower probabilities to the other classes, but if its insecure, the differences between probabilities will be low and none of them will be very high, what if you define a treshold like : if the probability for every class is less than 70% , predict "other"
Whew! Classic ML algorithms don't combine both multi-classification and "in/out" at the same time. Perhaps what you could do would be to train five models, one for each class, with a one-against-the-world training. Then use an uber-model to look for any of those five claiming the input; if none claim it, it's "other".
Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set. Train a second one as a 5-class SVM (for instance) within those five. The first model finds "other"; everything else gets passed to the second.
What about creating histograms? You could use a bag of words approach using significant indicators of for e.g. Tech and Finance. So, you could try to identify such indicators by analyzing the certain website's tags and articles or just browse the web for such inidicators:
http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html
Let's say your input vactor X has n dimensions where n represents the number of indicators. For example Xi then holds the count for the occurence of the word "asset" and Xi+k the count of the word "big data" in the current article.
Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.
If you must match the zero or more category, train a model which returns probability scores (such as a neural net as Luis Leal suggested) per label/class. You could than rate your output by that score and say that every class with a score higher than some threshold t is a matching category.
Try this NBayes implementation.
For identifying "Other" categories, dont bother much. Just train on your required categories which clearly identifies them, and introduce a threshold in the classifier.
If the values for a label does not cross a threshold, then the classifier adds the "Other" label.
It's all in the training data.
AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.
Classify4J appears to be the best solution for our needs because the model looks easy to train and it doesn't require training of non-matches.
http://classifier4j.sourceforge.net/usage.html

Where do dimensions in Word2Vec come from?

I am using word2vec model for training a neural network and building a neural embedding for finding the similar words on the vector space. But my question is about dimensions in the word and context embeddings (matrices), which we initialise them by random numbers(vectors) at the beginning of the training, like this https://iksinc.wordpress.com/2015/04/13/words-as-vectors/
Lets say we want to display {book,paper,notebook,novel} words on a graph, first of all we should build a matrix with this dimensions 4x2 or 4x3 or 4x4 etc, I know the first dimension of the matrix its the size of our vocabulary |v|. But the second dimension of the matrix (number of vector's dimensions), for example this is a vector for word “book" [0.3,0.01,0.04], what are these numbers? do they have any meaning? for example the 0.3 number related to the relation between word “book" and “paper” in the vocabulary, the 0.01 is the relation between book and notebook, etc.
Just like TF-IDF, or Co-Occurence matrices that each dimension (column) Y has a meaning - its a word or document related to the word in row X.
The word2vec model uses a network architecture to represent the input word(s) and most likely associated output word(s).
Assuming there is one hidden layer (as in the example linked in the question), the two matrices introduced represent the weights and biases that allow the network to compute its internal representation of the function mapping the input vector (e.g. “cat” in the linked example) to the output vector (e.g. “climbed”).
The weights of the network are a sub-symbolic representation of the mapping between the input and the output – any single weight doesn’t necessarily represent anything meaningful on its own. It’s the connection weights between all units (i.e. the interactions of all the weights) in the network that gives rise to the network’s representation of the function mapping. This is why neural networks are often referred to as “black box” models – it can be very difficult to interpret why they make particular decisions and how they learn. As such, it's very difficult to say what the vector [0.3,0.01,0.04] represents exactly.
Network weights are traditionally initialised to random values for two main reasons:
It prevents a bias being introduced to the model before training begins
It allows the network to start from different points in the search space after initialisation (helping reduce the impact of local minima)
A network’s ability to learn can be very sensitive to the way its weights are initialised. There are more advanced ways of initialising weights today e.g. this paper (see section: Weights initialization scaling coefficient).
The way in which weights are initialised and the dimension of the hidden layer are often referred to as hyper-parameters and are typically chosen according to heuristics and prior knowledge of the problem space.
I have wondered the same thing and put in a vector like (1 0 0 0 0 0...) to see what terms it was nearest to. The answer is that the results returned didn't seem to cluster around any particular meaning, but were just kind of random. This was using Mikolov's 300-dimensional vectors trained on Google News.
Look up NNSE semantic vectors for a vector space where the individual dimensions do seem to carry specific human-graspable meanings.

What type of ML is this? Algorithm to repeatedly choose 1 correct candidate from a pool (or none)

I have a set of 3-5 black box scoring functions that assign positive real value scores to candidates.
Each is decent at ranking the best candidate highest, but they don't always agree--I'd like to find how to combine the scores together for an optimal meta-score such that, among a pool of candidates, the one with the highest meta-score is usually the actual correct candidate.
So they are plain R^n vectors, but each dimension individually tends to have higher value for correct candidates. Naively I could just multiply the components, but I hope there's something more subtle to benefit from.
If the highest score is too low (or perhaps the two highest are too close), I just give up and say 'none'.
So for each trial, my input is a set of these score-vectors, and the output is which vector corresponds to the actual right answer, or 'none'. This is kind of like tech interviewing where a pool of candidates are interviewed by a few people who might have differing opinions but in general each tend to prefer the best candidate. My own application has an objective best candidate.
I'd like to maximize correct answers and minimize false positives.
More concretely, my training data might look like many instances of
{[0.2, 0.45, 1.37], [5.9, 0.02, 2], ...} -> i
where i is the ith candidate vector in the input set.
So I'd like to learn a function that tends to maximize the actual best candidate's score vector from the input. There are no degrees of bestness. It's binary right or wrong. However, it doesn't seem like traditional binary classification because among an input set of vectors, there can be at most 1 "classified" as right, the rest are wrong.
Thanks
Your problem doesn't exactly belong in the machine learning category. The multiplication method might work better. You can also try different statistical models for your output function.
ML, and more specifically classification, problems need training data from which your network can learn any existing patterns in the data and use them to assign a particular class to an input vector.
If you really want to use classification then I think your problem can fit into the category of OnevsAll classification. You will need a network (or just a single output layer) with number of cells/sigmoid units equal to your number of candidates (each representing one). Note, here your number of candidates will be fixed.
You can use your entire candidate vector as input to all the cells of your network. The output can be specified using one-hot encoding i.e. 00100 if your candidate no. 3 was the actual correct candidate and in case of no correct candidate output will be 00000.
For this to work, you will need a big data set containing your candidate vectors and corresponding actual correct candidate. For this data you will either need a function (again like multiplication) or you can assign the outputs yourself, in which case the system will learn how you classify the output given different inputs and will classify new data in the same way as you did. This way, it will maximize the number of correct outputs but the definition of correct here will be how you classify the training data.
You can also use a different type of output where each cell of output layer corresponds to your scoring functions and 00001 means that the candidate your 5th scoring function selected was the right one. This way your candidates will not have to be fixed. But again, you will have to manually set the outputs of the training data for your network to learn it.
OnevsAll is a classification technique where there are multiple cells in the output layer and each perform binary classification in between one of the classes vs all others. At the end the sigmoid with the highest probability is assigned 1 and rest zero.
Once your system has learned how you classify data through your training data, you can feed your new data in and it will give you output in the same way i.e. 01000 etc.
I hope my answer was able to help you.:)

Optimizing Neural Network Input for Convergence

I'm building a neural network for Image classificaion/recognition. There are 1000 images (30x30 greyscale) for each of the 10 classes. Images of different classes are placed in different folders. I'm planning to use Back-propagation algorithm to train the net.
Does the order in which I feed training examples into the net affect it's convergence?
Should I feed training examples in random order?
First I will answer your question
Yes it will affect it's convergence
Yes it's encouraged to do that, it's called randomized arrangement
But why?
referenced from here
A common example in most ANN software is IRIS data, where you have 150 instances comprising your dataset. These are about three different types of Iris flowers (Versicola, Virginics, and Setosa). The data set contains measurements of four variables (sepal length and width, and petal length and width). The cases are arranged so that the first case 50 cases belong to Setosa, while cases 51-100 belong to Versicola, and the rest belong to Virginica. Now, what you do not want to do is present them to the network in that order. In other words, you do not want the network to see all 50 instances in Versicola class, then all 50 in Virginics class, then all 50 in Setosa class. Without randomization your training set wont represent all the classes and, hence, no convergence, and will fail to generalize.
Another example, in the past I also have 100 images for each Alphabets (26 classes),
When I trained them ordered (per alphabet), it failed to converged but after I randomized it got converged easily because the neural network can generalize the alphabets.

Resources