Neural network feature combinatorics - machine-learning

Suppose we have a neural network with enough hidden layers, hidden units per layer, and training iterations that these parameters are not the limiting factor in the network's predictions.
Given features x1, x2, ..., xn, is it possible to prove whether a range of potential derived features is redundant given this set of features (x1 through xn)? That is, given these features, can a neural network discern other features such as:
Differences or sums (x1 - x49, or x17 + xn)?
Products and ratios (x1*x1, or x47/xn)?
Higher-order polynomials (or products of sequences, like ∏(x1 through xn))?
Trigonometric functions of the original features (sin(x1*xn) + x17)?
Logarithmic functions (ln(x2*x4)/x6)?
Along this line of inquiry, I am wondering whether there are situations where, for the network to predict accurately, you would need to add higher-order or otherwise transformed features.
In general, given some adequate number of features, is it possible for the network to model ANY graph, and if not, what functional domains can neural networks not predict?
Furthermore, is there any research someone could point me to that covers this topic?
Thanks!

Given features x1, x2, ..., xn, is it possible to prove whether a range of potential derived features is redundant given this set of features (x1 through xn)? That is, given these features, can a neural network discern other features
It seems as if you are looking for dimensionality reduction with neural networks. Autoencoders can do that:
You have inputs x1, x2, ..., xn.
You create a network which gets those inputs (n input nodes). It has some hidden layers, a bottleneck (k nodes, with k < n) and an output layer (n nodes).
The target is to recreate the input.
When it is trained, you discard the layers after the bottleneck. If the network was able to restore the inputs from the bottleneck, then the k bottleneck activations carry the information of the original n features, and the decoder layers are no longer necessary.
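To make this concrete, here is a minimal autoencoder sketch. It assumes Keras as the framework and uses placeholder sizes for n, k and the hidden layers (all assumptions, not part of the question):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n = 10   # number of original features x1..xn (placeholder)
k = 3    # bottleneck size, k < n (placeholder)

inputs = keras.Input(shape=(n,))
h = layers.Dense(16, activation="relu")(inputs)
bottleneck = layers.Dense(k, activation="relu", name="bottleneck")(h)
h = layers.Dense(16, activation="relu")(bottleneck)
outputs = layers.Dense(n, activation="linear")(h)

# Target = input: the network learns to reconstruct x from the k-dimensional code.
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1000, n)               # stand-in for your feature matrix
autoencoder.fit(X, X, epochs=20, batch_size=32, verbose=0)

# Keep only the encoder part: inputs -> k-dimensional code.
encoder = keras.Model(inputs, bottleneck)
codes = encoder.predict(X)
print(codes.shape)                        # (1000, k)

If the reconstruction error stays low, the k-dimensional code is a sufficient summary of the original features.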
In general, given some adequate number of features, is it possible for the network to model ANY graph, and if not what functional domains can neural networks not predict?
I guess you are looking for the universal approximation theorem. In short: neural networks can approximate any continuous function on a compact subset of R^n arbitrarily closely, as long as you give them enough nodes and at least one hidden layer.
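As a rough hands-on illustration (my own sketch, assuming scikit-learn's MLPRegressor; the target function and layer size are arbitrary), a single-hidden-layer network can fit a continuous function very closely on a compact interval:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 1))            # a compact subset of R
y = np.sin(2.0 * X[:, 0]) + 0.1 * X[:, 0] ** 3    # some continuous target

mlp = MLPRegressor(hidden_layer_sizes=(100,), activation="tanh",
                   max_iter=5000, random_state=0)
mlp.fit(X, y)
print("training R^2:", mlp.score(X, y))           # should approach 1 with enough units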

Long story short:
No neural network for regression tasks will ever -- yes, sorry, NO ANN WILL EVER -- be able to reasonably predict y_target(s) for problem domains that fundamentally do not match the mathematics implemented in the NN model.
Trying to predict y_target(s) via what is (almost) just a linear combination of the input-layer state-vector components (the observed features in X[:]), even though their scaled sums do get some non-linear handling further down the network, has to -- and will -- fail to remain precise.
Too dense to read? Let me give an example.
One may train an ANN on such a linear combination of inputs to best approximate a problem-domain behaviour that is cubic by nature. The mathematics of the minimiser search will yield the ANN coefficients that give the lowest penalty of all possible coefficient settings.
So far so good.
But still, such a "tuned" ANN will never get any closer to the cubic nature of the underlying real-world phenomenon. Not because I say so here, but because the linear combination -- however the non-linearised factors get tweaked on the way through the layers, until the final sum and output transformation take place -- principally cannot reproduce cubic behaviour across the full domain ranges of the inputs, which is exactly what Mother Nature does natively in the real-world problem domain. This is where the Devil hides, and the reason why it cannot get any better this way -- and it is not so hard to create an easy simulation of this principal failure to meet the cubic reality in code.
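Here is one such easy simulation (my own sketch, assuming scikit-learn's MLPRegressor; the ranges and layer sizes are arbitrary). A network trained on a bounded slice of a cubic target fits well inside that range, but its prediction falls apart outside it, where the cubic keeps growing:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X_train = rng.uniform(-2, 2, size=(3000, 1))      # training only covers [-2, 2]
y_train = X_train[:, 0] ** 3                      # cubic "real-world" behaviour

mlp = MLPRegressor(hidden_layer_sizes=(64, 64), activation="tanh",
                   max_iter=5000, random_state=1)
mlp.fit(X_train, y_train)

x_inside, x_outside = 1.5, 6.0
print("inside  the training range:", mlp.predict([[x_inside]])[0], "vs", x_inside ** 3)
print("outside the training range:", mlp.predict([[x_outside]])[0], "vs", x_outside ** 3)
# The saturating activations keep the prediction bounded, nowhere near 216.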
A problem-domain-agnostic, universal ANN might be a nice wish, but it would be overkill to implement and even worse to attempt to .fit().
Yes, you are right: one can spend some amount of creativity to create and connect a pre-ANN black box right in front of the ANN input layer, in which all possible mathematics over the native X[:] feature observations takes place, feeding the ANN input layer with every possible derived-feature semi-product, so as to allow the ANN to learn any kind of real problem-domain behaviour.
This would seem to be a way forward -- until you implement such a magic Universal-Behaviour-Model black box and realise the scales it forces onto the back-to-back connection, making the input layer and all the hidden layers grow statically so wide that the resulting O( N^k ) scaling will most probably nail such an attempt into ethereal waiting, independently of any imaginable parallel-computing effort, due to the [PTIME,PSPACE] complexities and the fact that no member of the Turing SEQ-computing complexity taxonomy gets any better even if successfully translated into the PAR-computing domain ( for the reasoning behind this C2-boundary problem, see the comments and citations referenced here ).
Even if one raises the claim that Universal Quantum Computers ( ref. U-QC-device ) will make this scenario feasible in [CTIME,CSPACE], I would be reserved about expecting such a U-QC-device to become reasonably available for practical deployment any time soon ( FYI: the biggest published non-U-QC-device CSPACE scales were about 1024 qubits in 2016 and about 2048+ qubits in 2017 ). Even if this progress could keep that pace forever, such CSPACE constraints will keep the magic Universal-Behaviour-Model black box piggy-backed onto an ANN restrictively small compared to the expectations expressed above:
Published constant-rate CSPACE problem scaling, extrapolated until the end of this century:
As of the technical details published up to the end of July 2017, the current (non)-U-QC-devices available do not seem to allow a [CSPACE]-constrained problem to have more than about 11 input-layer neurons -- so imagine having just 11 feature inputs possible in 2017 for such pioneering, attractive and promising [CTIME] ANN answers, and even that only with a QUBO simplification of what is actually an R^m continuous-domain minimisation problem ( details intentionally excluded, due to additional complexities behind the QC curtains ).
2011:                       128 neurons   ( from 1x "extended" input layer,
2015:                     1,024             across however many hidden layers,
2016:                     2,048             up to the output layer )
2017:                     4,096
  ...      ( doubling every two years )
2033:                 1,048,576 neurons  -- IN[300] features
2035:                 2,097,152 neurons  -- first able to compute a trivial ANN
                                            with just an elementary architecture of
                                            QuantFX.NN_mapper( ( 300,   # IN[300]
                                                                 1200,  # H1[1200]
                                                                 600,   # H2[600]
                                                                 300,   # H3[300]
                                                                 1 ) )  # OUT[1]
2047:               134,217,728 neurons  -- IN[3096] features
2057:             4,294,967,296 neurons  -- IN[17520] features
2067:           137,438,953,472 neurons  -- IN[99080] features
  ...      ( doubling every two years )
2099:     9,007,199,254,740,992 neurons  -- IN[25365000] features
Reality-check:
Given the above technology limits ( be it a [PTIME,PSPACE] eternity for the O( N^k ) scaling of a .fit(), where k >= 2, or the [CTIME,CSPACE] problem-scale constraints ), there is not much advantage in creating such a divine black-box, super-universal ANN device ( and then still having to wait decades, if not centuries, before it could be used to get the first answers from an ANN-on-steroids ).
The very opposite is closer to reality.
One may -- and shall -- invest all due problem-domain analysis effort to properly identify the native real-world behaviour ( ref. Technical Cybernetics: System Identification ), so as to know in advance how to design a just-enough feature-rich input layer ( in which synthetic features -- higher-order powers and cross-products, sums, products, harmonics, log-/exp-transforms, complex / discrete magics et al. -- take their place just where necessary to meet, not exceed, the performed system identification ). This way the ANN model scaling can remain a right-sized structure with the following pair of systematic certainties:
a) Removing any single part would damage the model ( dropping an indispensable feature will principally cause the predictions to fail to match the system-identified diversity of behaviours ).
b) Adding any single part would not improve the model ( adding a feature that is not involved in the identified diversity of system behaviour adds zero new power to the current prediction capabilities ).
Just-enough-complex feature engineering plus right-sizing is the way to go:
|>>> nnMAP, thetaVEC, thetaGRAD, stateOfZ, stateOfA, biasIDX = QuantFX.NN_mapper( ( 300, 1200, 600, 300, 1 ), True )
INF: NN_mapper has found: 5 Layers in total ( 3 of which HIDDEN ), 300 INPUTs, 1 OUTPUTs
INF: NN_mapper has found: 300 Neurons in INPUT-Layer
INF: NN_mapper has found: 1200 Neurons in HIDDEN-Layer_1
INF: NN_mapper has found: 600 Neurons in HIDDEN-Layer_2
INF: NN_mapper has found: 300 Neurons in HIDDEN-Layer_3
INF: NN_mapper has found: 1 Neuron in OUTPUT-Layer
INF: NN_mapper : will return a COMMON-block for nn_MAP__VEC
INF: NN_mapper : will return a COMMON-block for ThetaIJ_VEC having 1262401 cells, being all random.
INF: NN_mapper : will return a COMMON-block for ThetaIJGRAD having 1262401 cells,
INF: NN_mapper : will return a COMMON-block for Z_state_VEC having 2405 cells,
INF: NN_mapper : will return a COMMON-block for A_state_VEC having 2405 cells, with BIAS units == +1
INF: NN_mapper : will return BIAS units' linear addresses in biasIDX vector
: for indirect DMA-access to
: {Z|A}_state_VEC[biasIDX[LayerN]]
: cells, representing the LayerN's BIAS unit
So one might be happy with a just-enough ANN, feasible for training and operation in the realm of classical computing, without having to wait the next 20 years until -- and if -- Universal Quantum Computing devices become able to produce and deliver results in a snap, in [CTIME], once the current [CSPACE] constraints stop blocking such promising services.
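As a tiny illustration of the feature-engineering point above (my own sketch, assuming scikit-learn; the cubic target and its coefficients are made up), supplying the system-identified x**3 term as an input feature lets even a plain linear model capture the cubic behaviour across the whole range, whereas the raw feature alone cannot:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x = rng.uniform(-10, 10, size=5000)
y = 0.5 * x ** 3 - 2.0 * x + 1.0                # cubic "real-world" behaviour

X_raw = x.reshape(-1, 1)                         # native feature only
X_eng = np.column_stack([x, x ** 3])             # plus the engineered cubic feature

print("raw feature only  R^2:", LinearRegression().fit(X_raw, y).score(X_raw, y))  # ~0.8
print("with x**3 feature R^2:", LinearRegression().fit(X_eng, y).score(X_eng, y))  # 1.0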

Related

How to squish a continuous cosine-theta score to a discrete (0/1) output?

I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar, their words should overlap quite a bit. However, a cosine-theta score of 0.54 does not mean "related" or "not related". I should end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I would have to find the optimal parameters to give to such functions, and I do not know whether these functions are satisfactory solutions. I was thinking that, besides the cosine-theta score, I can calculate the percentage of overlap between two articles (e.g. the number of overlapping words divided by the number of words in the article) and maybe some more interesting things. Then, with that data, I could maybe write a function (what type of function I do not know, and that is part of the question!), after which I can minimize the error via the SciPy library. This means that I should do some sort of supervised learning, and I am willing to label article pairs with labels (0/1) in order to train a network. Is this worth the effort?
# Count the words of the two strings.
v1, v2 = self.word_count(s1), self.word_count(s2)
# Calculate the intersection of the words in both strings.
v3 = set(v1.keys()) & set(v2.keys())
# Calculate a ratio between the overlap and the (shorter) article length
# (since 1 overlapping word out of 2 words is more important
# than 4 overlapping words in articles of 492 words).
p = len(v3) / min(len(v1), len(v2)) if v3 else 0.0
numerator = sum([v1[w] * v2[w] for w in v3])
w1 = sum([v1[w]**2 for w in v1.keys()])
w2 = sum([v2[w]**2 for w in v2.keys()])
denominator = math.sqrt(w1) * math.sqrt(w2)
# Calculate the cosine similarity.
if not denominator:
    return 0.0
else:
    return float(numerator) / denominator
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. It is up to you to choose how the overlap affects whether or not two strings are "matching", unless you have a labelled data set. If you have a labelled data set (i.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or an SVM due to the potentially high-dimensional, categorical nature of your problem.
Even the optimisation, however, is a subjective measure. For example, suppose you have a model which, out of 100 samples, makes only 1 prediction (giving 99 unknowns). Technically, if that one answer is correct, that is a model with 100% precision, but very low recall. Generally in machine learning you will find a trade-off between recall and precision.
Some people like to go for metrics which combine the two (the most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about precision: I would only want to target consumers who are likely to buy my product. If, however, we are looking to test for a deadly disease or for markers of bank fraud, then it is acceptable for that test to be precise only 10% of the time, as long as its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is just to define some cut-off value which you believe indicates a good match. This would then be more analogous to a binary clustering problem, and you could use some more abstract measure, such as distance to a centroid, to test which cluster (either the "related" or the "unrelated" cluster) the point belongs to. Note, however, that here your features feel like they would be incredibly hard to define.
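If you do end up labelling some article pairs, a minimal supervised sketch could look like the following (assuming scikit-learn; the feature values and labels here are made-up placeholders): feed the cosine-theta score together with your overlap ratio p into a simple classifier and let it learn the cut-off for you.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [cosine_theta, overlap_ratio_p]; y: 1 = related, 0 = not related.
X = np.array([[0.91, 0.60], [0.54, 0.20], [0.12, 0.05], [0.78, 0.45],
              [0.30, 0.10], [0.85, 0.55], [0.40, 0.15], [0.95, 0.70]])
y = np.array([1, 0, 0, 1, 0, 1, 0, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.54, 0.25]]))         # discrete 0/1 label for a new pair
print(clf.predict_proba([[0.54, 0.25]]))   # class probabilities, if you want a score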

Does the Izhikevich neuron model use weights?

I've been working a bit with neural networks and I'm interested on implementing a spiking neuron model.
I've read a fair number of tutorials, but most of them seem to be about generating pulses, and I haven't found any that apply the model to a given input train.
Say, for example, I have the input train:
Input[0] = [0,0,0,1,0,0,1,1]
It enters the Izhikevich neuron; is the input multiplied by a weight, or does the neuron only make use of the parameters a, b, c and d?
Izhikevich equations are:
v[n+1] = 0.04*v[n]^2 + 5*v[n] + 140 - u[n] + I
u[n+1] = a*(b*v[n] - u[n])
where v[n] is input voltage and u[n] is a general recovery variable.
Are there any texts on implementations of Izhikevich or similar spiking neuron models on a practical problem? I'm trying to understand how information is encoded in these models, but it looks different from what's done with standard second-generation neurons. The only tutorial I've found that deals with a spike train and a set of weights is [1], but I haven't seen the same for Izhikevich.
[1] https://msdn.microsoft.com/en-us/magazine/mt422587.aspx
The plain Izhikevich model, by itself, does not include weights.
The two equations you mentioned model the membrane potential (v[]) of a point neuron over time. To use weights, you could connect two or more such cells with synapses.
Each synapse could include some sort of spike-detection mechanism on the source (pre-synaptic) cell, and a synaptic current mechanism on the target (post-synaptic) cell side. That synaptic current could then be multiplied by a weight term and become part of the I term (in the 1st equation above) for the target cell.
As a very simple example of a two-cell network: at every time step, you could check whether the pre- cell's v is above (say) 0 mV. If so, inject (say) 0.01 pA * weightPrePost into the post- cell. weightPrePost would range from 0 to 1, and could be modified in response to things like firing rate, or Hebbian-like spike synchrony as in STDP.
With multiple synaptic currents going into a cell, you could devise various schemes for summing them. The simplest would be a plain sum; more complicated ones could include things like distance and dendrite diameters (e.g. simulated neural morphology).
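To make the two-cell scheme above concrete, here is a bare-bones sketch (my own minimal reading of it, not a reference implementation; the current amplitudes, the 20 ms-per-bit encoding and the spike threshold are assumptions):

import numpy as np

# Regular-spiking parameters from Izhikevich (2003).
a, b, c, d = 0.02, 0.2, -65.0, 8.0
weight_pre_post = 0.5                        # synaptic weight in [0, 1]

v = np.array([-65.0, -65.0])                 # membrane potentials [pre, post]
u = b * v                                    # recovery variables  [pre, post]
spikes = np.array([0, 0])                    # spike counters      [pre, post]

input_train = [0, 0, 0, 1, 0, 0, 1, 1]       # each bit is presented for 20 ms

for bit in input_train:
    for ms in range(20):                     # 1 ms Euler steps
        fired = v >= 30.0                    # cells that spiked on the last step
        spikes += fired
        v[fired] = c                         # after-spike reset
        u[fired] += d

        I = np.array([10.0 * bit, 0.0])      # external drive into the pre- cell
        if fired[0]:                         # pre- spike -> weighted synaptic
            I[1] += 15.0 * weight_pre_post   # current pulse into the post- cell

        # v is stepped twice with dt = 0.5 ms for numerical stability, as in
        # Izhikevich's published reference code; u is stepped once per ms.
        v += 0.5 * (0.04 * v**2 + 5.0 * v + 140.0 - u + I)
        v += 0.5 * (0.04 * v**2 + 5.0 * v + 140.0 - u + I)
        u += a * (b * v - u)

print("pre- spikes:", spikes[0], "post- spikes:", spikes[1])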
This chapter is a nice introduction to other ways to model synapses: Modelling Synaptic Transmission

How to normalize tf-idf vectors for SVMs?

I am using Support Vector Machines for document classification. My feature set for each document is a tf-idf vector. I have M documents, each with a tf-idf vector of size N,
giving an M * N matrix.
M is just 10 documents and each tf-idf vector has 1000 words, so I have far more features than documents. Also, each word occurs in only 2 or 3 documents. When I normalize each feature (word), i.e. column normalization into [0,1], with
val_feature_j_row_i = ( val_feature_j_row_i - min_feature_j ) / ( max_feature_j - min_feature_j)
it of course gives me either 0 or 1.
And it gives me bad results. I am using libsvm with an RBF kernel, C = 0.0312, gamma = 0.007815.
Any recommendations?
Should I include more documents? Or other kernel functions, like sigmoid, or better normalization methods?
The list of things to consider and correct is quite long, so first of all I would recommend some machine-learning reading before trying to tackle the problem itself. There are dozens of great books (e.g. Haykin's "Neural Networks and Learning Machines") as well as online courses, which will help you with such basics, like those listed here: http://www.class-central.com/search?q=machine+learning .
Getting back to the problem itself:
10 documents is orders of magnitude too small to get any significant results and/or insight into the problem,
there is no universal method of data preprocessing; you have to analyze it through numerous tests and data analytics,
SVMs are parametric models; you cannot use a single pair of C and gamma values and expect any reasonable results. You have to check dozens of them to even get a clue "where to search". The simplest method for doing so is the so-called grid search (see the sketch after this list),
1000 features is a large number of dimensions; this suggests that using a kernel which implies an infinite-dimensional feature space is quite... redundant. It would be a better idea to first analyze simpler kernels, which have a smaller chance of overfitting (linear or low-degree polynomial),
finally, is tf*idf a good choice if "each word occurs in 2 or 3 documents"? That is doubtful, unless what you actually mean is 20-30% of the documents,
finally, regarding the simple feature squashing:
it of course gives me either 0 or 1.
it should result in values across the whole [0,1] interval, not just its endpoints, so if that is really what you get, you probably have an error in your implementation.
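As promised above, a minimal grid-search sketch (assuming scikit-learn; the random matrix below only stands in for your 10 x 1000 tf-idf matrix, and the parameter ranges are just a starting point):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X = np.random.rand(10, 1000)     # stand-in for the 10 x 1000 tf-idf matrix
y = np.array([0, 1] * 5)         # stand-in labels

param_grid = {"C": 10.0 ** np.arange(-2, 5),
              "gamma": 10.0 ** np.arange(-5, 2)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=2)  # cv is tiny because M is tiny
search.fit(X, y)
print(search.best_params_, search.best_score_)

With only 10 documents, even cross-validated scores will be very noisy, which is exactly the first point above.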

How to use Odds ratio feature selection with Naive bayes Classifier

I want to classify documents (composed of words) into 3 classes (Positive, Negative, Unknown/Neutral). A subset of the document words becomes the features.
Until now, I have programmed a Naive Bayes Classifier using as a feature selector Information gain and chi-square statistics. Now, I would like to see what happens if I use Odds ratio as a feature selector.
My problem is that I don't know how to implement the odds ratio. Should I:
1) Calculate Odds Ratio for every word w, every class:
E.g. for w:
Prob of word as positive Pw,p = #positive docs with w/#docs
Prob of word as negative Pw,n = #negative docs with w/#docs
Prob of word as unknown Pw,u = #unknown docs with w/#docs
OR(Wi,P) = log( Pw,p*(1-Pw,p) / (Pw,n + Pw,u)*(1-(Pw,n + Pw,u)) )
OR(Wi,N) ...
OR(Wi,U) ...
2) How should I decide whether or not to choose the word as a feature?
Thanks in advance...
Since it took me a while to independently wrap my head around all this, let me explain my findings here for the benefit of humanity.
Using the (log) odds ratio is a standard technique for filtering features prior to text classification. It is a 'one-sided metric' [Zheng et al., 2004] in the sense that it only discovers features which are positively correlated with a particular class. As a log-odds-ratio for the probability of seeing a feature 't' given the class 'c', it is defined as:
LOR(t,c) = log [Pr(t|c) / (1 - Pr(t|c))] : [Pr(t|!c) / (1 - Pr(t|!c))]
= log [Pr(t|c) (1 - Pr(t|!c))] / [Pr(t|!c) (1 - Pr(t|c))]
Here I use '!c' to mean a document where the class is not c.
But how do you actually calculate Pr(t|c) and Pr(t|!c)?
One subtlety to note is that feature selection probabilities, in general, are usually defined over a document event model [McCallum & Nigam 1998, Manning et al. 2008], i.e., Pr(t|c) is the probability of seeing term t one or more times in the document given the class of the document is c (in other words, the presence of t given the class c). The maximum likelihood estimate (MLE) of this probability would be the proportion of documents of class c that contain t at least once. [Technically, this is known as a Multivariate Bernoulli event model, and is distinct from a Multinomial event model over words, which would calculate Pr(t|c) using integer word counts - see the McCallum paper or the Manning IR textbook for more details, specifically on how this applies to a Naive Bayes text classifier.]
One key to using LOR effectively is to smooth these conditional probability estimates, since, as #yura noted, rare events are problematic here (e.g., the MLE of Pr(t|!c) could be zero, leading to an infinite LOR). But how do we smooth?
In the literature, Forman reports smoothing the LOR by "adding one to any zero count in the denominator" (Forman, 2003), while Zheng et al (2004) use "ELE [Expected Likelihood Estimation] smoothing" which usually amounts to adding 0.5 to each count.
To smooth in a way that is consistent with probability theory, I follow standard practices in text classification with a Multivariate Bernoulli event model. Essentially, we assume that we have seen each presence count AND each absence count B extra times. So our estimate for Pr(t|c) can be written in terms of #(t,c), the number of documents of class c containing t, and #(!t,c), the number of documents of class c not containing t, as follows:
Pr(t|c) = [#(t,c) + B] / [#(t,c) + #(!t,c) + 2B]
        = [#(t,c) + B] / [#(c) + 2B]
If B = 0, we have the MLE. If B = 0.5, we have ELE. If B = 1, we have the Laplacian prior. Note this looks different than smoothing for the Multinomial event model, where the Laplacian prior leads you to add |V| in the denominator [McCallum & Nigam, 1998]
You can choose 0.5 or 1 as your smoothing value, depending on which prior work most inspires you, and plug this into the equation for LOR(t,c) above, and score all the features.
Typically, you then decide on how many features you want to use, say N, and then choose the N highest-ranked features based on the score.
In a multi-class setting, people have often used 1 vs All classifiers and thus did feature selection independently for each classifier and thus each positive class with the 1-sided metrics (Forman, 2003). However, if you want to find a unique reduced set of features that works in a multiclass setting, there are some advanced approaches in the literature (e.g. Chapelle & Keerthi, 2008).
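To tie the recipe together, here is a small sketch of the smoothed-LOR scoring and top-N selection described above (my own code; the toy counts and the helper name lor_score are made up for illustration):

import math

def lor_score(n_t_c, n_c, n_t_notc, n_notc, B=0.5):
    # Smoothed log odds ratio of term t for class c (ELE smoothing for B = 0.5).
    # n_t_c    : number of docs of class c containing t
    # n_c      : number of docs of class c
    # n_t_notc : number of docs NOT of class c containing t
    # n_notc   : number of docs NOT of class c
    p_t_c    = (n_t_c    + B) / (n_c    + 2 * B)   # smoothed Pr(t|c)
    p_t_notc = (n_t_notc + B) / (n_notc + 2 * B)   # smoothed Pr(t|!c)
    return math.log(p_t_c * (1 - p_t_notc) / (p_t_notc * (1 - p_t_c)))

# Toy counts: term -> (# Positive docs containing it, # non-Positive docs containing it)
counts = {"great": (40, 5), "terrible": (2, 30), "the": (90, 95)}
n_pos, n_rest = 100, 100                     # docs in / not in the Positive class

scores = {t: lor_score(tc, n_pos, tnc, n_rest) for t, (tc, tnc) in counts.items()}
N = 2                                        # how many features to keep
selected = sorted(scores, key=scores.get, reverse=True)[:N]
print(selected)                              # -> ['great', 'the']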
References:
Zheng, Wu, Srihari, 2004
McCallum & Nigam 1998
Manning, Raghavan & Schütze, 2008
Forman, 2003
Chapelle & Keerthi, 2008
The odds ratio is not a good measure for feature selection, because it only shows what happens when a feature is present and says nothing about when it is absent. So it does not work well for rare features, and since almost all features are rare, it does not work well for almost all features. For example, a feature that occurs in only 0.01% of documents is useless for classification, even if it indicates the positive class with 100% confidence. Therefore, if you still want to use the odds ratio, add a threshold on feature frequency, e.g. keep only features present in at least 5% of cases. But I would recommend a better approach: use chi-square or information-gain metrics, which solve those problems automatically.
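For completeness, a minimal sketch of that recommended route (assuming scikit-learn; the toy documents and labels are placeholders), using chi-square feature selection in front of a Naive Bayes classifier:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["great product, works well", "terrible support, broken on arrival",
        "not sure yet, still testing", "excellent quality", "awful experience"]
labels = [1, 0, 2, 1, 0]                 # 1 = Positive, 0 = Negative, 2 = Unknown/Neutral

model = make_pipeline(CountVectorizer(),
                      SelectKBest(chi2, k=5),   # keep the 5 best-scoring words
                      MultinomialNB())
model.fit(docs, labels)
print(model.predict(["broken and awful"]))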

Probability and Neural Networks

Is it good practice to use sigmoid or tanh output layers in neural networks directly to estimate probabilities?
i.e. the probability of a given input occurring is the output of the sigmoid function in the NN
EDIT
I want to use a neural network to learn and predict the probability of a given input occurring.
You may consider the input as a State1-Action-State2 tuple.
Hence the output of the NN is the probability that State2 happens when applying Action on State1.
I hope that clears things up.
EDIT
When training the NN, I perform a random Action on State1 and observe the resultant State2; I then teach the NN that the input State1-Action-State2 should result in an output of 1.0.
First, just a couple of small points on the conventional MLP lexicon (might help for internet searches, etc.): 'sigmoid' and 'tanh' are not 'output layers' but functions, usually referred to as "activation functions". The return value of the activation function is indeed the output from each layer, but they are not the output layer themselves (nor do they calculate probabilities).
Additionally, your question recites a choice between two "alternatives" ("sigmoid and tanh"), but they are not actually alternatives, rather the term 'sigmoidal function' is a generic/informal term for a class of functions, which includes the hyperbolic tangent ('tanh') that you refer to.
The term 'sigmoidal' is probably due to the characteristic shape of the function--the return (y) values are constrained between two asymptotic values regardless of the x value. The function output is usually normalized so that these two values are -1 and 1 (or 0 and 1). (This output behavior, by the way, is obviously inspired by the biological neuron which either fires (+1) or it doesn't (-1)). A look at the key properties of sigmoidal functions and you can see why they are ideally suited as activation functions in feed-forward, backpropagating neural networks: (i) real-valued and differentiable, (ii) having exactly one inflection point, and (iii) having a pair of horizontal asymptotes.
In turn, the sigmoidal function is one category of functions used as the activation function (aka "squashing function") in FF neural networks solved using backprop. During training or prediction, the weighted sum of the inputs (for a given layer, one layer at a time) is passed in as an argument to the activation function, which returns the output for that layer. Another group of functions apparently used as activation functions is the piecewise linear functions. The step function is the binary variant of a PLF:
def step_fn(x):
    if x <= 0:
        y = 0
    if x > 0:
        y = 1
    return y
(On practical grounds, I doubt the step function is a plausible choice for the activation function, but perhaps it helps understand the purpose of the activation function in NN operation.)
I suppose there is an unlimited number of possible activation functions, but in practice you only see a handful; in fact, just two account for the overwhelming majority of cases (both are sigmoidal). Here they are (in Python) so you can experiment for yourself, given that the primary selection criterion is a practical one:
import math

# logistic function
def sigmoid2(x):
    return 1 / (1 + math.exp(-x))

# hyperbolic tangent
def sigmoid1(x):
    return math.tanh(x)
what are the factors to consider in selecting an activation function?
First, the function has to give the desired behavior (arising from, or as evidenced by, the sigmoidal shape). Second, the function must be differentiable. This is a requirement for backpropagation, which is the optimization technique used during training to 'fill in' the weights of the hidden layers.
For instance, the derivative of the hyperbolic tangent is (in terms of the output, which is how it is usually written) :
def dsigmoid(y):
    return 1.0 - y**2
Beyond those two requirements, what makes one function better than another is how efficiently it trains the network -- i.e., which one causes convergence (reaching the local minimum error) in the fewest epochs?
#-------- Edit (see OP's comment below) ---------#
I am not quite sure I understood -- sometimes it's difficult to communicate details of a NN without the code, so I should probably just say that it's fine, subject to this proviso: what you want the NN to predict must be the same as the dependent variable used during training. So, for instance, if you train your NN using two states (e.g., 0, 1) as the single dependent variable (which is obviously missing from your testing/production data), then that's what your NN will return when run in "prediction mode" (post-training, or with a competent weight matrix).
You should choose the right loss function to minimize.
The squared error does not lead to the maximum likelihood hypothesis here.
The squared error is derived from a model with Gaussian noise:
P(y|x,h) = k1 * e**-(k2 * (y - h(x))**2)
You estimate the probabilities directly. Your model is:
P(Y=1|x,h) = h(x)
P(Y=0|x,h) = 1 - h(x)
P(Y=1|x,h) is the probability that event Y=1 will happen after seeing x.
The maximum likelihood hypothesis for your model is:
h_max_likelihood = argmax_h product(
h(x)**y * (1-h(x))**(1-y) for x, y in examples)
This leads to the "cross entropy" loss function.
See chapter 6 of Mitchell's Machine Learning for the loss function and its derivation.
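A small numeric sketch of the point above (my own example; the network outputs h and labels y are made up), comparing the cross-entropy loss with the squared error for a sigmoid output interpreted as P(Y=1|x):

import numpy as np

def cross_entropy(y, h):
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def squared_error(y, h):
    return np.mean((y - h) ** 2)

y = np.array([1, 0, 1, 1, 0])            # observed outcomes (did State2 occur?)
h = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # network outputs interpreted as P(Y=1|x)

print("cross-entropy:", cross_entropy(y, h))
print("squared error:", squared_error(y, h))
# For a sigmoid output unit trained with cross-entropy, the gradient w.r.t. the
# pre-activation simplifies to (h - y), which is what makes this pairing convenient.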
There is one problem with this approach: if you have vectors from R^n and your network maps those vectors into the interval [0, 1], it is not guaranteed that the network represents a valid probability density function, since the integral of the network over R^n is not guaranteed to equal 1.
E.g., a neural network could map every input from R^n to 1.0, but that is clearly not a valid probability density.
So the answer to your question is: no, you can't.
However, you can just say that your network never sees "unrealistic" input samples and thus ignore this fact. For a discussion of this (and also some more cool information on how to model PDFs with neural networks) see contrastive backprop.

Resources