How to use regression algorithm to predict attribute? - machine-learning

I am creating a simple AI which predicts mac numbers. Problem is I couldn't find any regression algorithm to find the correct relation between com, mac and result. It only has %3 prediction ratio. What is the problem with my code?
DataSource source = new DataSource("mydata.arff");
Instances traindata = source.getDataSet();
traindata.setClassIndex(traindata.numAttributes() - 2);
REPTree repTree = new REPTree();
repTree.buildClassifier(traindata);
Instance dataToPredict = traindata.lastInstance();
double result = repTree.classifyInstance(dataToPredict);
Here is my sample Data;
#attribute com numeric
#attribute mac numeric
#attribute result {yes, no}
#data
12,28,no
22,58,no
30,70,yes
37,63,yes
32,68,yes
31,69,yes
7,27,no
30,?,yes
EDIT: Relation is so easy.
If result is true then mac = 100 - com

Related

Stata timeseries rolling forecast

I'm new to Stata and have a question about its command language. I want to use my ARIMA model to forecast, ie use x[t], x[t-1]... to produce an estimate xhat[t+1], and then roll forward one time step, to make the next forecast, rebuilding the model every N time steps.
i can duplicate code, something like the following code for T, T+1, T+2, etc.:
arima x if t<=T, arima(2,0,2)
predict xhat
to produce a series of xhats to compare with in-sample x observations. There must be a more natural way to do this in the command language. any suggestions, pointers would be very much appreciated.
Posting a working solution provided by Stata tech support:
webuse dfex
tsset month
generate int id = _n
capture program drop forecarima
program forecarima, rclass
syntax [if]
tempvar yhat
arima unemp `if', arima(1,1,0)
local T = e(tmax)
local T1 = `T' + 1
summarize id if month == `T1'
local h = r(max)
predict `yhat', y dynamic(`T')
return scalar y = unemp[`h']
return scalar yhat = `yhat'[`h']
end
rolling unemp = r(y) unemp_hat = r(yhat), window(400) recursive ///
saving(results,replace): forecarima
use results,clear
browse
this provides output with the prediction and observed both available. the dates are off by one step, but easier left to post-processing.

Seq2Seq for string reversal

If I have a string, say "abc" and target of that string in reverse, say "cba".
Can a neural network, in particular an encoder-decoder model, learn this mapping? If so, what is the best model to accomplish this.
I ask, as this is a structural translation rather than a simple character mapping as in normal machine translation
If your network is an old-fashioned encoder-decoder model (without attention), then, as #Prune said, it has memory bottleneck (encoder dimensionality). Thus, such a network cannot learn to reverse strings of arbitrary size. However, you can train such an RNN to reverse strings of limited size. For example, the following toy seq2seq LSTM is able to reverse sequences of digits with length up to 10. Here is how you train it:
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Embedding
import numpy as np
emb_dim = 20
latent_dim = 100 # Latent dimensionality of the encoding space.
vocab_size = 12 # digits 0-9, 10 is for start token, 11 for end token
encoder_inputs = Input(shape=(None, ), name='enc_inp')
common_emb = Embedding(input_dim=vocab_size, output_dim=emb_dim)
encoder_emb = common_emb(encoder_inputs)
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_emb)
encoder_states = [state_h, state_c]
decoder_inputs = Input(shape=(None,), name='dec_inp')
decoder_emb = common_emb(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_emb, initial_state=encoder_states)
decoder_dense = Dense(vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
def generate_batch(length=4, batch_size=64):
x = np.random.randint(low=0, high=10, size=(batch_size, length))
y = x[:, ::-1]
start = np.ones((batch_size, 1), dtype=int) * 10
end = np.ones((batch_size, 1), dtype=int) * 11
enc_x = np.concatenate([start, x], axis=1)
dec_x = np.concatenate([start, y], axis=1)
dec_y = np.concatenate([y, end], axis=1)
dec_y_onehot = np.zeros(shape=(batch_size, length+1, vocab_size), dtype=int)
for row in range(batch_size):
for col in range(length+1):
dec_y_onehot[row, col, dec_y[row, col]] = 1
return [enc_x, dec_x], dec_y_onehot
def generate_batches(batch_size=64, max_length=10):
while True:
length = np.random.randint(low=1, high=max_length)
yield generate_batch(length=length, batch_size=batch_size)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['categorical_accuracy'])
model.fit_generator(generate_batches(), steps_per_epoch=1000, epochs=20)
Now you can apply it to reverse a sequence (my decoder is very inefficient, but it does illustrate the principle)
input_seq = np.array([[10, 2, 1, 2, 8, 5, 0, 6]])
result = np.array([[10]])
next_digit = -1
for i in range(100):
next_digit = model.predict([input_seq, result])[0][-1].argmax()
if next_digit == 11:
break
result = np.concatenate([result, [[next_digit]]], axis=1)
print(result[0][1:])
Hoorray, it prints [6 0 5 8 2 1 2] !
Generally, you can think of such a model as a weird autoencoder (with a reversal side-effect), and choose architecture and training procedure suitable for autoencoders. And there is quite a vast literature about text autoencoders.
Moreover, if you make an encoder-decoder model with attention, then, it will have no memory bottleneck, so, in principle, it is possible to reverse a sequence of any length with a neural network. However, attention requires quadratic computational time, so in practice even neural networks with attention will be very inefficient for long sequences.
I doubt that a NN will learn the abstract structural transformation. Since the string is of unbounded input length, the finite NN won't have the info necessary. NLP processes generally work with identifying small blocks and simple context-sensitive shifts. I don't think they'd identify the end-to-end swaps needed.
However, I expect that an image processor, adapted to a single dimension, would learn this quite quickly. Some can learn how to rotate a sub-image.

ValueError: Found input variables with inconsistent numbers of samples : [1, 14048]

I am trying to run MultinomiaL Naive bayes and receiving the below error. Sample training data is given. Test data is exactly similar.
def main():
text_train, targets_train = read_data('train')
text_test, targets_test = read_data('test')
classifier1 = MultinomialNB()
classifier1.fit(text_train, targets_train)
prediction1 = classifier1.predict(text_test)
Sample Data:
Train:
category, text
Family, I love you Mom
University, I hate this course
Sometimes I face this question and find most of reason from the error is the input data should be 2-D array, such as if you want to build a regression model. you write this code and then you will face this error!
for example:
a = np.array([1,2,3]).T
b = np.array([4,5,6]).T
regr = linear_model.LinearRegression()
regr.fit(a, b)
then you should add something!
a = np.array([[1,2,3]]).T
b = np.array([[4,5,6]]).T
lastly you will be run normally!
so it is just my empirical!
This is just a reference, not a standard answer!
i am from Chinese as a student in learning English and python!

Weka Confusion Matrix

I have the following data set
weather.nominal.arff
#relation weather.symbolic
#attribute outlook {sunny, overcast, rainy}
#attribute temperature {hot, mild, cool}
#attribute humidity {high, normal}
#attribute windy {TRUE, FALSE}
#attribute play {yes, no}
#data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
I wrote a simple program as following.
if (data.classIndex() == -1)
data.setClassIndex(4);
J48 classifier = new J48();
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(classifier, data, 10, new Random(1));
System.out.println(eval.toMatrixString());
I got the following result.
=== Confusion Matrix ===
a b <-- classified as
0 5 | a = no
0 9 | b = yes
Now when i use the same weather.nominal data set in Weka GUI, I get the following result.
=== Confusion Matrix ===
a b <-- classified as
9 0 | a = yes
5 0 | b = no
What changes I need to make in the program to get the result as generated by Weka GUI? Values are crossed in the resulting confusion matrices.
Update
I mentioned the arff file, but I am getting the same values from database, as I have stored the same values in database. The problem only appears when I use database, not when I use normal arff file.
Code to load the values from database.
InstanceQuery query = new InstanceQuery();
query.setUsername("postgres");
query.setPassword("*****");
query.setQuery("SELECT * from weather");
Instances data = query.retrieveInstances();

Supervised Latent Dirichlet Allocation for Document Classification?

I have a bunch of already human-classified documents in some groups.
Is there a modified version of lda which I can use to train a model and then later classify unknown documents with it?
For what it's worth, LDA as a classifier is going to be fairly weak because it's a generative model, and classification is a discriminative problem. There is a variant of LDA called supervised LDA which uses a more discriminative criterion to form the topics (you can get source for this in various places), and there's also a paper with a max margin formulation that I don't know the status of source-code-wise. I would avoid the Labelled LDA formulation unless you're sure that's what you want, because it makes a strong assumption about the correspondence between topics and categories in the classification problem.
However, it's worth pointing out that none of these methods use the topic model directly to do the classification. Instead, they take documents, and instead of using word-based features use the posterior over the topics (the vector that results from inference for the document) as its feature representation before feeding it to a classifier, usually a Linear SVM. This gets you a topic model based dimensionality reduction, followed by a strong discriminative classifier, which is probably what you're after. This pipeline is available
in most languages using popular toolkits.
You can implement supervised LDA with PyMC that uses Metropolis sampler to learn the latent variables in the following graphical model:
The training corpus consists of 10 movie reviews (5 positive and 5 negative) along with the associated star rating for each document. The star rating is known as a response variable which is a quantity of interest associated with each document. The documents and response variables are modeled jointly in order to find latent topics that will best predict the response variables for future unlabeled documents. For more information, check out the original paper.
Consider the following code:
import pymc as pm
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
train_corpus = ["exploitative and largely devoid of the depth or sophistication ",
"simplistic silly and tedious",
"it's so laddish and juvenile only teenage boys could possibly find it funny",
"it shows that some studios firmly believe that people have lost the ability to think",
"our culture is headed down the toilet with the ferocity of a frozen burrito",
"offers that rare combination of entertainment and education",
"the film provides some great insight",
"this is a film well worth seeing",
"a masterpiece four years in the making",
"offers a breath of the fresh air of true sophistication"]
test_corpus = ["this is a really positive review, great film"]
train_response = np.array([3, 1, 3, 2, 1, 5, 4, 4, 5, 5]) - 3
#LDA parameters
num_features = 1000 #vocabulary size
num_topics = 4 #fixed for LDA
tfidf = TfidfVectorizer(max_features = num_features, max_df=0.95, min_df=0, stop_words = 'english')
#generate tf-idf term-document matrix
A_tfidf_sp = tfidf.fit_transform(train_corpus) #size D x V
print "number of docs: %d" %A_tfidf_sp.shape[0]
print "dictionary size: %d" %A_tfidf_sp.shape[1]
#tf-idf dictionary
tfidf_dict = tfidf.get_feature_names()
K = num_topics # number of topics
V = A_tfidf_sp.shape[1] # number of words
D = A_tfidf_sp.shape[0] # number of documents
data = A_tfidf_sp.toarray()
#Supervised LDA Graphical Model
Wd = [len(doc) for doc in data]
alpha = np.ones(K)
beta = np.ones(V)
theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])
z = pm.Container([pm.Categorical('z_%s' % d, p = theta[d], size=Wd[d], value=np.random.randint(K, size=Wd[d])) for d in range(D)])
#pm.deterministic
def zbar(z=z):
zbar_list = []
for i in range(len(z)):
hist, bin_edges = np.histogram(z[i], bins=K)
zbar_list.append(hist / float(np.sum(hist)))
return pm.Container(zbar_list)
eta = pm.Container([pm.Normal("eta_%s" % k, mu=0, tau=1.0/10**2) for k in range(K)])
y_tau = pm.Gamma("tau", alpha=0.1, beta=0.1)
#pm.deterministic
def y_mu(eta=eta, zbar=zbar):
y_mu_list = []
for i in range(len(zbar)):
y_mu_list.append(np.dot(eta, zbar[i]))
return pm.Container(y_mu_list)
#response likelihood
y = pm.Container([pm.Normal("y_%s" % d, mu=y_mu[d], tau=y_tau, value=train_response[d], observed=True) for d in range(D)])
# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d,i), p = pm.Lambda('phi_z_%i_%i' % (d,i), lambda z=z[d][i], phi=phi: phi[z]),
value=data[d][i], observed=True) for d in range(D) for i in range(Wd[d])])
model = pm.Model([theta, phi, z, eta, y, w])
mcmc = pm.MCMC(model)
mcmc.sample(iter=1000, burn=100, thin=2)
#visualize topics
phi0_samples = np.squeeze(mcmc.trace('phi_0')[:])
phi1_samples = np.squeeze(mcmc.trace('phi_1')[:])
phi2_samples = np.squeeze(mcmc.trace('phi_2')[:])
phi3_samples = np.squeeze(mcmc.trace('phi_3')[:])
ax = plt.subplot(221)
plt.bar(np.arange(V), phi0_samples[-1,:])
ax = plt.subplot(222)
plt.bar(np.arange(V), phi1_samples[-1,:])
ax = plt.subplot(223)
plt.bar(np.arange(V), phi2_samples[-1,:])
ax = plt.subplot(224)
plt.bar(np.arange(V), phi3_samples[-1,:])
plt.show()
Given the training data (observed words and response variables), we can learn the global topics (beta) and regression coefficients (eta) for predicting the response variable (Y) in addition to topic proportions for each document (theta).
In order to make predictions of Y given the learned beta and eta, we can define a new model where we do not observe Y and use the previously learned beta and eta to obtain the following result:
Here we predicted a positive review (approx 2 given review rating range of -2 to 2) for the test corpus consisting of one sentence: "this is a really positive review, great film" as shown by the mode of the posterior histogram on the right.
See ipython notebook for a complete implementation.
Yes you can try the Labelled LDA in the stanford parser at
http://nlp.stanford.edu/software/tmt/tmt-0.4/

Resources