I am trying to classify a dataset of reviews into two classes, say class A and class B, using LightGBM as the classifier.
I have changed the parameters for the classifier many times, but I can't get a big difference in the results.
I think the problem is with the pre-processing step. I defined the function shown below to take care of pre-processing: I apply stemming and remove stopwords. I don't know what I am missing. I have tried both LancasterStemmer and PorterStemmer.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stops = set(stopwords.words("english"))

def cleanData(text, lowercase=False, remove_stops=False, stemming=False, lemm=False):
    txt = str(text)
    txt = re.sub(r'[^A-Za-z0-9\s]', r'', txt)   # strip punctuation and special characters
    txt = re.sub(r'\n', r' ', txt)              # collapse newlines
    if lowercase:
        txt = " ".join([w.lower() for w in txt.split()])
    if remove_stops:
        txt = " ".join([w for w in txt.split() if w not in stops])
    if stemming:
        st = PorterStemmer()
        txt = " ".join([st.stem(w) for w in txt.split()])
    if lemm:
        wordnet_lemmatizer = WordNetLemmatizer()
        txt = " ".join([wordnet_lemmatizer.lemmatize(w) for w in txt.split()])
    return txt
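For example, I apply it to the raw reviews roughly like this (assuming they are in a pandas DataFrame column df['review']; the column names are just placeholders):

df['clean_review'] = df['review'].apply(
    lambda t: cleanData(t, lowercase=True, remove_stops=True, stemming=True))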
Are there any more pre-processing steps I should do to get better accuracy?
URL for the dataset: Dataset
EDIT:
The parameters I used are listed below.
params = {'task': 'train',
          'boosting_type': 'gbdt',
          'objective': 'binary',
          'metric': 'binary_logloss',
          'learning_rate': 0.01,
          'max_depth': 22,
          'num_leaves': 78,
          'feature_fraction': 0.1,
          'bagging_fraction': 0.4,
          'bagging_freq': 1}
I have altered max_depth and num_leaves along with other parameters, but the accuracy stays stuck at roughly the same level.
There are a few things to consider. First of all, your training set is not balanced: the class distribution is roughly 70%/30%, and you need to account for that during training. Also, what types of features are you using? Using the right set of features could improve your performance; see the sketch below.
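For instance, a rough sketch of that direction (the variable names, split, and parameter values here are placeholder assumptions, not tuned for your data): feed TF-IDF n-gram features into LightGBM and let it correct for the class imbalance via is_unbalance:

import lightgbm as lgb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical inputs: `texts` is a list of cleaned reviews, `labels` is 0/1.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=3)
X = vectorizer.fit_transform(texts)
X_tr, X_val, y_tr, y_val = train_test_split(X, labels, test_size=0.2, stratify=labels)

params = {'objective': 'binary',
          'metric': 'binary_logloss',
          'learning_rate': 0.05,
          'num_leaves': 31,
          'is_unbalance': True}   # reweights the minority class during training

train_set = lgb.Dataset(X_tr, label=y_tr)
val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)
booster = lgb.train(params, train_set, num_boost_round=500, valid_sets=[val_set])

If you prefer to set the weight of the positive class explicitly, scale_pos_weight is an alternative to is_unbalance.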
I've been trying to fit a system of differential equations to some data I have. There are 18 parameters to fit, but ideally some of these parameters should be zero (or go to zero). While googling, one thing I came across was building DE layers into neural networks, and I have found a few GitHub repos with Julia code examples; however, I am new to both Julia and neural ODEs. In particular, I have been modifying the code from this example:
https://computationalmindset.com/en/neural-networks/experiments-with-neural-odes-in-julia.html
Differences: I have a system of 3 DEs (not 2), I have 18 parameters, and I import two CSVs with data to fit instead of generating a toy dataset.
My dilemma: while googling I came across LASSO/L1 regularization, and I hope that by adding an L1 penalty to the cost function I can "zero out" some of the parameters. The problem is that I don't understand how to modify the cost function to incorporate it. My loss function right now is just
function loss_func()
    pred = net()
    sum(abs2, truth[1] .- pred[1,:]) +
        sum(abs2, truth[2] .- pred[2,:]) +
        sum(abs2, truth[3] .- pred[3,:])
end
but I would like to incorporate the L1 penalty into this. For L1 regularization, I came across the cost function J′(θ; X, y) = J(θ; X, y) + aΩ(θ), "where θ denotes the trainable parameters, X the input... y [the] target labels. a is a hyperparameter that weights the contribution of the norm penalty", and for L1 regularization the penalty is Ω(θ) = ||w||₁ = ∑|w| (source: https://theaisummer.com/regularization/). I understand that the first term on the RHS is the loss J(θ; X, y), which is what I already have; that a is a hyperparameter I choose and could be 0.001, 0.1, 1, 100000000, etc.; and that the L1 penalty is the sum of the absolute values of the parameters. What I don't understand is how to add the a∑|w| term to my current function. I want to edit it to be something like this:
function cost_func(lambda)
    pred = net()
    # intended L1 penalty over the parameter groups; param is not defined anywhere yet
    penalty(lambda) = lambda * (sum(abs, param[1]) +
                                sum(abs, param[2]) +
                                sum(abs, param[3]))
    sum(abs2, truth[1] .- pred[1,:]) +
        sum(abs2, truth[2] .- pred[2,:]) +
        sum(abs2, truth[3] .- pred[3,:]) +
        penalty(lambda)
end
where param[1], param[2], param[3] refer to the parameters for DEs u[1], u[2], u[3] that I'm trying to learn. I don't know if this logic is correct, though, or the proper way to implement it, and I also don't know how/where I would access the learned parameters. I suspect the answer may lie somewhere in this chunk of code:
callback_func = function ()
    loss_value = loss_func()
    println("Loss: ", loss_value)
end
fparams = Flux.params(p)
Flux.train!(loss_func, fparams, data, optimizer, cb = callback_func);
but I don't know for certain, or even how to use it if it were the answer.
I've been messing with this, and looking at some other NODE implementations (this one in particular) and have adjusted my cost function so that it is:
function cost_fnct(param)
    prob = ODEProblem(model, u0, tspan, param)
    prediction = Array(concrete_solve(prob, Tsit5(), p = param, saveat = trange))
    loss = Flux.mae(prediction, data)
    penalty = sum(abs, param)    # L1 penalty: sum of the absolute parameter values
    loss + lambda*penalty
end;
where lambda is the tuning parameter, using the definition that the L1 penalty is the sum of the absolute values of the parameters. Then, for training:
lambda = 0.01
resinit = DiffEqFlux.sciml_train(cost_fnct, p, ADAM(), maxiters = 3000)
res = DiffEqFlux.sciml_train(cost_fnct, resinit.minimizer, BFGS(initial_stepnorm = 1e-5))
where p is just my initial parameter "guesses", i.e., a vector of ones with the same length as the number of parameters I am attempting to fit.
If you're looking at the first link I had in the original post (here), you can redefine the loss function to add this penalty term and then define lambda before the callback function and subsequent training:
lambda = 0.01
callback_func = function ()
    loss_value = cost_fnct()
    println("Loss: ", loss_value)
    println("\nLearned parameters: ", p)
end
fparams = Flux.params(p)
Flux.train!(cost_fnct, fparams, data, optimizer, cb = callback_func);
None of this, of course, includes any sort of cross-validation or tuning-parameter optimization! I'll go ahead and accept my own response, since my understanding is that unanswered questions get bumped to encourage answers and I want to avoid clogging the tag, but if anyone has a different solution or wants to comment, please feel free to do so.
Hello guys, I am new to machine learning. I am implementing federated learning with an LSTM to predict the next label in a sequence. My sequences look like this: [2,3,5,1,4,2,5,7]. For example, the intention is to predict the 7 in this sequence. So I tried a simple federated learning setup with Keras. I used this approach for another model (not an LSTM) and it worked for me, but here it always overfits on 2: it always predicts 2 for any input. I made the input data balanced, meaning there are almost equal numbers of each label at the last index (here, the 7). I tested this data with a simple (non-federated) deep learning model and it works well, so it seems to me that either this data is not suitable for an LSTM or there is some other issue. Please help me. This is the code for my federated learning. Please let me know if more information is needed, I really need it. Thanks.
def get_lstm(units):
    """LSTM (Long Short-Term Memory)
    Build LSTM model.
    # Arguments
        units: List(int), number of input, hidden and output units.
    # Returns
        model: Model, nn model.
    """
    inp = layers.Input((units[0], 1))
    x = layers.LSTM(units[1], return_sequences=True)(inp)
    x = layers.LSTM(units[2])(x)
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(units[3], activation='softmax')(x)
    model = Model(inp, out)
    return model

optimizer = keras.optimizers.Adam(lr=0.01)
seqLen = 8 - 1
global_model = Mymodel.get_lstm([seqLen, 64, 64, 15])  # 15 output units: 14 categories, plus index 0 which is never predicted
global_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
                     metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1)])
def main(argv):
    for comm_round in range(comms_round):
        print("round_%d" % comm_round)
        scaled_local_weight_list = list()
        global_weights = global_model.get_weights()
        np.random.shuffle(train)
        temp_data = train[:]
        # data divided among ten users and shuffled
        for user in range(10):
            user_data = temp_data[user * userDataSize: (user + 1) * userDataSize]
            X_train = user_data[:, 0:seqLen]
            X_train = np.asarray(X_train).astype(np.float32)
            Y_train = user_data[:, seqLen]
            Y_train = np.asarray(Y_train).astype(np.float32)
            local_model = Mymodel.get_lstm([seqLen, 64, 64, 15])
            X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
            local_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
                                metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1)])
            local_model.set_weights(global_weights)
            local_model.fit(X_train, Y_train)
            scaling_factor = 1 / 10  # 10 is the number of users
            scaled_weights = scale_model_weights(local_model.get_weights(), scaling_factor)
            scaled_local_weight_list.append(scaled_weights)
            K.clear_session()
        # federated averaging: sum the scaled local weights and update the global model
        average_weights = sum_scaled_weights(scaled_local_weight_list)
        global_model.set_weights(average_weights)
    predictions = global_model.predict(X_test)
    for i in range(len(X_test)):
        print('%d,%d' % (np.argmax(predictions[i]), Y_test[i]), file=f2)
I found some reasons for my problem, so I thought I would share them:
1- The proportion of different items in the sequences is not balanced. For example, I have 1000 occurrences of "2" but only about 100 of each of the other numbers, so after a few rounds the model fits on 2 because there is much more data for that specific number.
2- I changed my sequences so that no two items in a sequence have the same value. That way I could remove some repetitive data from the sequences and make them more balanced. Maybe this is not a complete representation of the activities, but in my case it makes sense. A rough sketch of both fixes is below.
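As a rough sketch (the variable names here are placeholders, not my actual code): point 1 can also be handled by passing per-class weights to fit, and point 2 by dropping repeated values when building each sequence:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical arrays: Y_train holds the next-item labels, seq is one raw event sequence.
classes = np.unique(Y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=Y_train)
class_weight = dict(zip(classes, weights))
local_model.fit(X_train, Y_train, class_weight=class_weight)  # point 1: reweight rare labels

deduped_seq = list(dict.fromkeys(seq))  # point 2: keep only the first occurrence of each value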
I am using the mlr package's resample() function to subsample a random forest model 4000 times (code snippet below).
As you can see, to create the random forest models within resample() I'm using the randomForest package.
I want to get the random forest model's importance results (mean decrease in accuracy over all classes) for each of the subsample iterations. What I can get right now as the importance measure is the mean decrease in the Gini index.
From the mlr source code, the getFeatureImportanceLearner.classif.randomForest() function (line 69) in makeRLearner.classif.randomForest uses the randomForest::importance() function (line 83) to get the importance value from the resulting object of class randomForest. But as you can see from the source code (line 73), it uses 2L as the default value. I want it to use 1L (line 75) instead (mean decrease in accuracy).
How can I pass the value 1L to the resample() function (the "extract = getFeatureImportance" line in the code below), so that getFeatureImportanceLearner.classif.randomForest() gets that value and sets ctrl$type = 1L (line 73)?
rf_task <- makeClassifTask(id = 'task',
                           data = data[, -1], target = 'target_var',
                           positive = 'positive_var')
rf_learner <- makeLearner('classif.randomForest', id = 'random forest',
                          par.vals = list(ntree = 1000, importance = TRUE),
                          predict.type = 'prob')
# rf_boot_desc is a resample description created earlier (not shown)
base_subsample_instance <- makeResampleInstance(rf_boot_desc, rf_task)
rf_subsample_result <- resample(rf_learner, rf_task,
                                base_subsample_instance,
                                extract = getFeatureImportance,
                                measures = list(acc, auc, tpr, tnr,
                                                ppv, npv, f1, brier))
My solution: I downloaded the source code of the mlr package, changed line 73 of the source file to 1L (https://github.com/mlr-org/mlr/blob/v2.15.0/R/RLearner_classif_randomForest.R), installed the package from the command line, and used it. Not an optimal solution, but a solution.
You provide a lot of specifics that do not actually relate to your question, at least as I understood it.
So I wrote a simple MWE that includes the answer.
The idea is that you have to write a short wrapper for getFeatureImportance so that you can pass your own arguments. Fans of purrr can do that with purrr::partial(getFeatureImportance, type = 2) but here I wrote myExtractor manually.
library(mlr)

rf_learner <- makeLearner('classif.randomForest', id = 'random forest',
                          par.vals = list(ntree = 100, importance = TRUE),
                          predict.type = 'prob')
measures = list(acc, auc, tpr, tnr,
                ppv, npv, f1, brier)

myExtractor = function(.model, ...) {
  getFeatureImportance(.model, type = 2, ...)
}

res = resample(rf_learner, sonar.task, cv10,
               measures = measures, extract = myExtractor)

# first feature importance result:
res$extract[[1]]

# all values in a matrix:
sapply(res$extract, function(x) x$res)
If you want to build a bootstrapped learner, you might also have a look at makeBaggingWrapper instead of solving this problem through resample.
I am trying to run Multinomial Naive Bayes and am receiving the error below. Sample training data is given; the test data has exactly the same format.
from sklearn.naive_bayes import MultinomialNB

def main():
    # read_data() is defined elsewhere and returns the texts and their labels
    text_train, targets_train = read_data('train')
    text_test, targets_test = read_data('test')
    classifier1 = MultinomialNB()
    classifier1.fit(text_train, targets_train)
    prediction1 = classifier1.predict(text_test)
Sample Data:
Train:
category, text
Family, I love you Mom
University, I hate this course
I sometimes run into this question, and most of the time the reason for the error is that the input data needs to be a 2-D array. For example, if you want to build a regression model and you write the following code, you will get this error:
import numpy as np
from sklearn import linear_model

a = np.array([1,2,3]).T   # still 1-D: .T has no effect on a 1-D array
b = np.array([4,5,6]).T
regr = linear_model.LinearRegression()
regr.fit(a, b)            # raises an error because a is 1-D
To fix it, you need to add an extra pair of brackets so the arrays are 2-D:
a = np.array([[1,2,3]]).T   # now shape (3, 1)
b = np.array([[4,5,6]]).T
and then it runs normally.
This is just from my own experience and is only a reference, not a definitive answer. (I am a student from China, still learning English and Python!)
I have a bunch of already human-classified documents in some groups.
Is there a modified version of LDA which I can use to train a model and then later classify unknown documents with it?
For what it's worth, LDA as a classifier is going to be fairly weak because it's a generative model, and classification is a discriminative problem. There is a variant of LDA called supervised LDA which uses a more discriminative criterion to form the topics (you can get source for this in various places), and there's also a paper with a max margin formulation that I don't know the status of source-code-wise. I would avoid the Labelled LDA formulation unless you're sure that's what you want, because it makes a strong assumption about the correspondence between topics and categories in the classification problem.
However, it's worth pointing out that none of these methods use the topic model directly to do the classification. Instead, they take documents and, instead of using word-based features, use the posterior over the topics (the vector that results from inference for the document) as the feature representation before feeding it to a classifier, usually a linear SVM. This gets you a topic-model-based dimensionality reduction, followed by a strong discriminative classifier, which is probably what you're after. This pipeline is available in most languages using popular toolkits; a rough sketch follows.
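For instance, here is a minimal sketch with scikit-learn (the corpus variables and hyperparameters are placeholders, and this uses plain unsupervised LDA for the dimensionality-reduction step rather than supervised LDA):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical data: `docs` is a list of strings, `labels` their human-assigned groups.
topic_clf = make_pipeline(
    CountVectorizer(stop_words='english'),
    LatentDirichletAllocation(n_components=50),  # document-topic posteriors become the features
    LinearSVC())                                 # discriminative classifier on top
topic_clf.fit(docs, labels)
predicted = topic_clf.predict(new_docs)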
You can implement supervised LDA (sLDA) with PyMC, using a Metropolis sampler to learn the latent variables of the sLDA graphical model.
The training corpus consists of 10 movie reviews (5 positive and 5 negative) along with the associated star rating for each document. The star rating is known as a response variable, which is a quantity of interest associated with each document. The documents and response variables are modeled jointly in order to find latent topics that will best predict the response variables for future unlabeled documents. For more information, check out the original paper.
Consider the following code:
import pymc as pm
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
train_corpus = ["exploitative and largely devoid of the depth or sophistication ",
                "simplistic silly and tedious",
                "it's so laddish and juvenile only teenage boys could possibly find it funny",
                "it shows that some studios firmly believe that people have lost the ability to think",
                "our culture is headed down the toilet with the ferocity of a frozen burrito",
                "offers that rare combination of entertainment and education",
                "the film provides some great insight",
                "this is a film well worth seeing",
                "a masterpiece four years in the making",
                "offers a breath of the fresh air of true sophistication"]
test_corpus = ["this is a really positive review, great film"]
train_response = np.array([3, 1, 3, 2, 1, 5, 4, 4, 5, 5]) - 3
#LDA parameters
num_features = 1000 #vocabulary size
num_topics = 4 #fixed for LDA
tfidf = TfidfVectorizer(max_features = num_features, max_df=0.95, min_df=0, stop_words = 'english')
#generate tf-idf term-document matrix
A_tfidf_sp = tfidf.fit_transform(train_corpus) #size D x V
print "number of docs: %d" %A_tfidf_sp.shape[0]
print "dictionary size: %d" %A_tfidf_sp.shape[1]
#tf-idf dictionary
tfidf_dict = tfidf.get_feature_names()
K = num_topics # number of topics
V = A_tfidf_sp.shape[1] # number of words
D = A_tfidf_sp.shape[0] # number of documents
data = A_tfidf_sp.toarray()
#Supervised LDA Graphical Model
Wd = [len(doc) for doc in data]
alpha = np.ones(K)
beta = np.ones(V)
theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])
z = pm.Container([pm.Categorical('z_%s' % d, p = theta[d], size=Wd[d], value=np.random.randint(K, size=Wd[d])) for d in range(D)])
@pm.deterministic
def zbar(z=z):
    zbar_list = []
    for i in range(len(z)):
        hist, bin_edges = np.histogram(z[i], bins=K)
        zbar_list.append(hist / float(np.sum(hist)))
    return pm.Container(zbar_list)
eta = pm.Container([pm.Normal("eta_%s" % k, mu=0, tau=1.0/10**2) for k in range(K)])
y_tau = pm.Gamma("tau", alpha=0.1, beta=0.1)
@pm.deterministic
def y_mu(eta=eta, zbar=zbar):
    y_mu_list = []
    for i in range(len(zbar)):
        y_mu_list.append(np.dot(eta, zbar[i]))
    return pm.Container(y_mu_list)
#response likelihood
y = pm.Container([pm.Normal("y_%s" % d, mu=y_mu[d], tau=y_tau, value=train_response[d], observed=True) for d in range(D)])
# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d,i),
                                 p = pm.Lambda('phi_z_%i_%i' % (d,i), lambda z=z[d][i], phi=phi: phi[z]),
                                 value=data[d][i], observed=True) for d in range(D) for i in range(Wd[d])])
model = pm.Model([theta, phi, z, eta, y, w])
mcmc = pm.MCMC(model)
mcmc.sample(iter=1000, burn=100, thin=2)
#visualize topics
phi0_samples = np.squeeze(mcmc.trace('phi_0')[:])
phi1_samples = np.squeeze(mcmc.trace('phi_1')[:])
phi2_samples = np.squeeze(mcmc.trace('phi_2')[:])
phi3_samples = np.squeeze(mcmc.trace('phi_3')[:])
ax = plt.subplot(221)
plt.bar(np.arange(V), phi0_samples[-1,:])
ax = plt.subplot(222)
plt.bar(np.arange(V), phi1_samples[-1,:])
ax = plt.subplot(223)
plt.bar(np.arange(V), phi2_samples[-1,:])
ax = plt.subplot(224)
plt.bar(np.arange(V), phi3_samples[-1,:])
plt.show()
Given the training data (observed words and response variables), we can learn the global topics (beta) and regression coefficients (eta) for predicting the response variable (Y) in addition to topic proportions for each document (theta).
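For example, point estimates of the learned quantities can be pulled out of the sampler above as posterior means over the retained samples (just a sketch):

# Posterior means of the regression coefficients eta (one per topic)
eta_mean = np.array([mcmc.trace('eta_%i' % k)[:].mean() for k in range(K)])

# Posterior mean topic proportions theta for each training document
theta_mean = [np.squeeze(mcmc.trace('theta_%i' % d)[:]).mean(axis=0) for d in range(D)]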
In order to make predictions of Y given the learned beta and eta, we can define a new model where we do not observe Y and use the previously learned beta and eta to run inference.
Here we predicted a positive review (approximately 2, given a review rating range of -2 to 2) for the test corpus consisting of the single sentence "this is a really positive review, great film", as shown by the mode of the posterior histogram over Y.
See the IPython notebook for a complete implementation.
Yes, you can try the Labeled LDA implementation in the Stanford Topic Modeling Toolbox at
http://nlp.stanford.edu/software/tmt/tmt-0.4/