Hi, I am new to Python, scikit-learn, and ML in general. I'm encountering a MemoryError when using MultinomialNB's partial_fit. I'm trying to do multi-label classification on the DMOZ directory data.
My questions:
What am I doing wrong? Is it my lack of memory, or is the data wrong?
Am I using the right approach?
Is there anything I can do to improve my approach?
Approach:
Store DMOZ DB directories into MongoDB/TokuMX
{
    "_id": {
        "$oid": "54e758c91d41c804d8ace196"
    },
    "docs": [
        {
            "url": "http://www.awn.com/",
            "description": "Provides information resources to the international animation community. Features include searchable database archives, monthly magazine, web animation guide, the Animation Village, discussion forums and other useful resources.",
            "title": "Animation World Network"
        }
    ],
    "labels": [
        "Top",
        "Arts",
        "Animation"
    ]
}
Iterate over the docs array and pass each docs element into my classifier function, as sketched below.
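A rough sketch of that iteration step using pymongo (the connection details and the database/collection names are assumptions of mine; classify is the function defined further below):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)              # assumed connection details
collection = client['dmoz']['categories']             # hypothetical db/collection names

all_classes = collection.distinct('labels')           # every unique label, passed only once

first_call = True
for record in collection.find():
    for doc in record['docs']:
        if first_call:
            classify(doc, record['labels'], classifier, vectorizer, all_classes)
            first_call = False
        else:
            classify(doc, record['labels'], classifier, vectorizer)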
Vectorizer and Classifier
classifier = MultinomialNB()
vectorizer = HashingVectorizer(
    stop_words='english',
    strip_accents='unicode',
    norm='l2'
)
My classifier function
def classify(doc, labels, classifier, vectorizer, *args):
    r = requests.get(doc['url'], verify=False)
    print "Retrieving URL = {0}\n".format(doc['url'])
    if r.status_code == 200:
        html = lxml.html.fromstring(r.text)
        doc['content'] = []
        tags = ['font', 'td', 'h1', 'h2', 'h3', 'p', 'title']
        for tag in tags:
            for x in html.xpath('//' + tag):
                try:
                    bag_of_words = nltk.word_tokenize(x.text_content())
                    pos_tagged = nltk.pos_tag(bag_of_words)
                    for word, pos in pos_tagged:
                        if pos[:2] == 'NN':
                            doc['content'].append(word)
                except AttributeError as e:
                    print e
        x_train = vectorizer.fit_transform(doc['content'])
        # if we are the first one to run partial_fit, pass all classes
        if len(args) == 1:
            classifier.partial_fit(x_train, labels, classes=args[0])
        else:
            classifier.partial_fit(x_train, labels)
    return doc
X: doc['content'] is an array of nouns (600 entries).
Y: labels is the array of labels from the Mongo document shown above (3 entries).
Classes: args[0] is an array of all the unique labels in the database (17490 entries).
This runs inside VirtualBox on a quad-core laptop, with 4 GB of RAM assigned to the VM.
What are the 17490 unique labels? There will be one coefficient for each label and each feature, which is likely where your memory error comes from.
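To put numbers on that: MultinomialNB keeps dense per-class statistics (feature_count_ and feature_log_prob_, each of shape n_classes x n_features), and HashingVectorizer's default n_features is 2**20. A quick back-of-the-envelope check:

n_classes = 17490
n_features = 2 ** 20                      # HashingVectorizer default n_features
bytes_per_value = 8                       # float64

approx_bytes = n_classes * n_features * bytes_per_value
print(approx_bytes / 1024.0 ** 3)         # ~136 GiB for a single such array, far beyond 4 GB of RAM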
Related
I've created a custom docker container to deploy my model on Vertex AI (the model uses LightGBM, so I can't use the pre-built containers provided by Vertex AI for TF/SKL/XGBoost).
I'm getting errors while trying to get explainable predictions from the model (I deployed the model and normal predictions are working just fine). I have gone through the official Vertex AI guide(s) for getting predictions/explanations, and also tried different ways of configuring the explanation parameters and metadata, but it's still not working. The errors are not very informative, and this is what they look like:
400 {"error": "Unable to explain the requested instance(s) because: Invalid response from prediction server - the response field predictions is missing. Response: {'error': '400 Bad Request: The browser (or proxy) sent a request that this server could not understand.'}"}
This notebook provided by Google for Vertex AI has some examples of how to configure the explanation parameters and metadata for models trained with different frameworks. I'm trying something similar.
https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage4/get_started_with_vertex_xai.ipynb
The model is a classifier that takes tabular input with 5 features (2 string, 3 numeric), and output value from model.predict() is 0/1 for each input instance. My custom container returns predictions in this format:
# Input for prediction
raw_input = request.get_json()
input = raw_input['instances']
df = pd.DataFrame(input, columns = ['A', 'B', 'C', 'D', 'E'])
# Prediction from model
predictions = model.predict(df).tolist()
response = jsonify({"predictions": predictions})
return response
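For context, that snippet sits inside the container's prediction route; a minimal version of the whole handler might look like the following (the Flask setup, route path, and model loading are my assumptions and are not shown above):

import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
# model = ...  # the trained LightGBM classifier, loaded once at container startup (assumption)

@app.route("/predict", methods=["POST"])
def predict():
    raw_input = request.get_json()
    df = pd.DataFrame(raw_input["instances"], columns=["A", "B", "C", "D", "E"])
    predictions = model.predict(df).tolist()
    return jsonify({"predictions": predictions})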
This is how I am configuring the explanation parameters and metadata for the model:
# Explanation parameters
PARAMETERS = {"sampled_shapley_attribution": {"path_count": 10}}
exp_parameters = aip.explain.ExplanationParameters(PARAMETERS)
# Explanation metadata (this is probably the part that is causing the errors)
COLUMNS = ["A", "B", "C", "D", "E"]
exp_metadata = aip.explain.ExplanationMetadata(
    inputs={
        "features": {"index_feature_mapping": COLUMNS, "encoding": "BAG_OF_FEATURES"}
    },
    outputs={"predictions": {}},
)
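For reference, these objects get attached to the model at upload time, roughly like this (a sketch based on the google-cloud-aiplatform SDK; the display name, container image URI, and routes are placeholders, not values from the original post):

from google.cloud import aiplatform as aip

model = aip.Model.upload(
    display_name="lightgbm-classifier",                       # placeholder
    serving_container_image_uri="<my-custom-container-uri>",  # placeholder
    serving_container_predict_route="/predict",
    serving_container_health_route="/health",
    explanation_parameters=exp_parameters,
    explanation_metadata=exp_metadata,
)
endpoint = model.deploy(machine_type="n1-standard-2")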
For getting predictions/explanations, I tried using the below format, among others:
instance_1 = {"A": <value>, "B": <>, "C": <>, "D": <>, "E": <>}
instance_2 = {"A": <value>, "B": <>, "C": <>, "D": <>, "E": <>}
inputs = [instance_1, instance_2]
predictions = endpoint.predict(instances=inputs) # Works fine
explanations = endpoint.explain(instances=inputs) # Returns error
Could you please suggest how to correctly configure the explanation metadata, or how to provide input in the right format to the explain API, so that I can get explanations from Vertex AI? I have tried many different formats, but nothing has worked so far. :(
I am developing a text topic classifier that can label sentences or small questions.
So far it can label around 30 known subjects.
It works well, but it begins to confuse similar questions with each other.
For example these 3 labels:
1) Label - backup_proxy_intranet:
How to set up a backup proxy for intranet app?
... and 140 similar questions containing 'backup proxy for intranet app'...
2) Label - smartphone_intranet:
How to use intranet app in my smartphone? and
... and 140 similar questions containing 'intranet app in my smartphone'...
3) Label - ticket_intranet: How to relate a ticket order with the intranet app?
... and 140 similar questions containing 'ticket order with the intranet app'...
After training, these 3 always return the label backup_proxy_intranet.
What can I do to separate them?
series = series.dropna()
series = shuffle(series)

X_stemmed = []
for x_t in series['phrase']:
    stemmed_text = [stemmer.stem(i) for i in word_tokenize(x_t)]
    X_stemmed.append(' '.join(stemmed_text))

x_normalized = []
for x_t in X_stemmed:
    temp_corpus = x_t.split(' ')
    corpus = [token for token in temp_corpus if token not in stops]
    x_normalized.append(' '.join(corpus))

X_train, X_test, y_train, y_test = train_test_split(x_normalized, series['target'], random_state=0, test_size=0.20)

vect = CountVectorizer(ngram_range=(1, 3)).fit(X_train)
X_train_vectorized = vect.transform(X_train)

sampler = SMOTE()
model = make_pipeline(sampler, LogisticRegression())

print()
print("-->Model: ")
print(model)
print()
print("-->Training... ")
model.fit(X_train_vectorized, y_train)

filename = '/var/www/html/python/intraope_bot/lib/textTopicClassifier.model'
pickle.dump(model, open(filename, 'wb'))

filename2 = '/var/www/html/python/intraope_bot/lib/textTopicClassifier.vector'
pickle.dump(vect, open(filename2, 'wb'))
Best Regards!
I think you might want to use a TfidfVectorizer from sklearn: it should help you increase the score!
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
...     "Label - backup_proxy_intranet: How to set up a backup proxy for intranet app? ... and 140 similar questions containing 'backup proxy for intranet app'",
...     "Label - smartphone_intranet: How to use intranet app in my smartphone? ... and 140 similar questions containing 'intranet app in my smartphone'",
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
I got the best results using Multinomial Naive Bayes, also from scikit-learn.
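Putting the two suggestions together, a minimal sketch could look like this (it reuses x_normalized and series['target'] from your code above; the n-gram range is just an example, not a tuned value):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(
    x_normalized, series['target'], random_state=0, test_size=0.20)

# TF-IDF features feeding a Multinomial Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))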
Suppose I wanted to train a machine learning algorithm on some dataset that includes categorical parameters. (I'm new to machine learning, but my thinking is...) Even if I converted all the categorical data to one-hot-encoded vectors, how will this encoding map be "remembered" after training?
E.g. converting the initial dataset to use one-hot encoding before training: say the universe of categories for some column c is {"good", "bad", "ok"}, so convert rows to
[1, 2, "good"] ---> [1, 2, [1, 0, 0]],
[3, 4, "bad"]  ---> [3, 4, [0, 1, 0]],
...
Then, after training the model, all future prediction inputs would need to use the same encoding scheme for column c.
How then, during future predictions, will data inputs remember that mapping (where "good" maps to index 0, etc.)? (Specifically, when planning on using a Keras RNN or LSTM model.) Do I need to save it somewhere (e.g. with python pickle)? (If so, how do I get the explicit mapping?) Or is there a way to have the model automatically handle categorical inputs internally, so I can just input the original label data during training and future use?
If anything in this question shows any serious confusion on my part about something, please let me know (again, I'm very new to ML).
** I wasn't sure if this belongs on https://stats.stackexchange.com/, but posted here since I specifically wanted to know how to deal with the actual code implementation of this problem.
What I've been doing is the following:
After you use StringIndexer.fit(), you can save its metadata (this includes the actual encoder mapping, like "good" being the first column).
This is the code I use (in Java, but it can be adjusted to Python):
StringIndexerModel sim = new StringIndexer()
        .setInputCol(field)
        .setOutputCol(field + "_INDEX")
        .setHandleInvalid("skip")
        .fit(dataset);
sim.write().overwrite().save("IndexMappingModels/" + field + "_INDEX");
and later, when trying to make predictions on a new dataset, you can load the stored metadata:
StringIndexerModel sim = StringIndexerModel.load("IndexMappingModels/" + field + "_INDEX");
dataset = sim.transform(dataset);
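Since the question is about Python, the same save/load flow in PySpark might look roughly like this (a sketch; dataset and field are assumed to be your DataFrame and column name):

from pyspark.ml.feature import StringIndexer, StringIndexerModel

# Fit the indexer and persist it (the saved model contains the label-to-index mapping)
sim = StringIndexer(inputCol=field, outputCol=field + "_INDEX", handleInvalid="skip").fit(dataset)
sim.write().overwrite().save("IndexMappingModels/" + field + "_INDEX")

# Later, before making predictions on a new dataset, reload and reuse the same mapping
sim = StringIndexerModel.load("IndexMappingModels/" + field + "_INDEX")
dataset = sim.transform(dataset)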
I imagine you have already solved this issue, since it was posted in 2018, but I haven't found this solution anywhere else, so I believe it's worth sharing.
My thought would be to do something like this on the training/testing dataset D (using a mix of Python and plain pseudo-code):
Do something like
# Before: D.schema == {num_col_1: int, cat_col_1: str, cat_col_2: str, ...}
# assign a unique index to each distinct label of a categorical column and store it in a new column
# http://spark.apache.org/docs/latest/ml-features.html#stringindexer
label_indexer = StringIndexer(inputCol="cat_col_i", outputCol="cat_col_i_index").fit(D)
D = label_indexer.transform(D)
# After: D.schema == {num_col_1: int, cat_col_1: str, cat_col_2: str, ..., cat_col_1_index: int, cat_col_2_index: int, ...}
for all the categorical columns
Then, for all of these categorical name and index columns in D, make a map of the form
map = {}
for all categorical column names colname in D:
    map[colname] = []
    # create the mapping dict of all categorical values for this column
    # see https://spark.apache.org/docs/latest/sql-programming-guide.html#untyped-dataset-operations-aka-dataframe-operations
    for all rows r in D.select(colname, '%s_index' % colname).drop_duplicates():
        enc_from = r['%s' % colname]
        enc_to = r['%s_index' % colname]
        map[colname].append((enc_from, enc_to))
    # for categories that may appear later but have not yet been seen
    # (IDK if this is best practice, there may be another way, see https://medium.com/@vaibhavshukla182/how-to-solve-mismatch-in-train-and-test-set-after-categorical-encoding-8320ed03552f)
    map[colname].append(('NOVEL_CAT', map[colname].len))
    # sort by index encoding
    map[colname].sort(key=lambda pair: pair[1])
to end up with something like
{
    'cat_col_1': [('orig_label_11', 0), ('orig_label_12', 1), ...],
    'cat_col_2': [(), (), ...],
    ...
    'cat_col_n': [('orig_label_n1', 0), ...]
}
which can then be used to generate one-hot-encoded vectors for each categorical column in any later data sample row ds. E.g.
for all categorical column names colname in ds:
    enc_from = ds[colname]
    # make a zero vector for the 1-hot encoding of this category
    col_onehot = zeros(size=map[colname].len)
    for label, index in map[colname]:
        if label == enc_from:
            col_onehot[index] = 1
            # make a new column in the sample for the 1-hot vector
            ds['%s_onehot' % colname] = col_onehot
            break
You can then save this structure with pickle, e.g. pickle.dump(map, open("cats_map.pkl", "wb")), to compare against categorical column values when making actual predictions later.
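A small, runnable plain-Python version of the same idea (no Spark; the column and category names are purely illustrative):

import pickle

train_rows = [{'cat_col_1': 'good'}, {'cat_col_1': 'bad'}, {'cat_col_1': 'ok'}]  # illustrative data

# build a category -> index map per categorical column
cat_map = {}
for colname in ['cat_col_1']:
    categories = sorted({row[colname] for row in train_rows})
    cat_map[colname] = {cat: i for i, cat in enumerate(categories)}
    cat_map[colname]['NOVEL_CAT'] = len(cat_map[colname])      # slot for unseen categories

with open('cats_map.pkl', 'wb') as f:
    pickle.dump(cat_map, f)

def one_hot(colname, value, mapping):
    """Encode a single value of a categorical column using the saved mapping."""
    vec = [0] * len(mapping[colname])
    vec[mapping[colname].get(value, mapping[colname]['NOVEL_CAT'])] = 1
    return vec

# later, at prediction time, reload the mapping and reuse it
with open('cats_map.pkl', 'rb') as f:
    cat_map = pickle.load(f)
print(one_hot('cat_col_1', 'good', cat_map))    # [0, 1, 0, 0]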
** There may be a better way, but I think I would need to better understand this article first (https://medium.com/@satnalikamayank12/on-learning-embeddings-for-categorical-data-using-keras-165ff2773fc9). I will update the answer if I find anything.
I used gensim to fit a doc2vec model, with tagged documents (length > 10) as training data. The goal is to get the doc vectors of all training docs, but only 10 vectors can be found in model.docvecs.
An example of the training data (length > 10):
docs = ['This is a sentence', 'This is another sentence', ....]
with some preprocessing:
doc_ = [d.strip().split(" ") for d in doc]
doc_tagged = []
for i in range(len(doc_)):
    tagd = TaggedDocument(doc[i], str(i))
    doc_tagged.append(tagd)
tagged docs
TaggedDocument(words=array(['a', 'b', 'c', ..., ],
dtype='<U32'), tags='117')
fit a doc2vec model
model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(doc_tagged)
model.train(doc_tagged, total_examples= model.corpus_count, epochs= model.iter)
Then I get the final model:
len(model.docvecs)
the result is 10...
I tried other datasets (length > 100, 1000) and got the same result for len(model.docvecs).
So, my question is:
How can I use model.docvecs to get the vectors of all training docs (without using model.infer_vector)?
Is model.docvecs designed to provide all training docvecs?
The bug is in this line:
tagd = TaggedDocument(doc[i],str(i))
Gensim's TaggedDocument accepts a sequence of tags as a second argument. When you pass a string '123', it's turned into ['1', '2', '3'], because it's treated as a sequence. As a result, all of the documents are tagged with just 10 tags ['0', ..., '9'], in various combinations.
Another issue: you're defining doc_ and never actually using it, so your documents will be split incorrectly as well.
Here's the proper solution:
docs = [doc.strip().split(' ') for doc in docs]
tagged_docs = [doc2vec.TaggedDocument(doc, [str(i)]) for i, doc in enumerate(docs)]
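For completeness, a minimal end-to-end sketch of the corrected flow, using the same pre-4.0 gensim API as the question (size and model.iter; in gensim 4.x these become vector_size, epochs, and model.dv):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ['This is a sentence', 'This is another sentence', 'And one more sentence']
docs = [doc.strip().split(' ') for doc in docs]
tagged_docs = [TaggedDocument(doc, [str(i)]) for i, doc in enumerate(docs)]

model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.iter)

print(len(model.docvecs))        # now equals len(docs)
print(model.docvecs['0'])        # vector for the first document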
I am trying to classify a dataset of reviews into two classes, say class A and class B. I am using LightGBM for classification.
I have changed the parameters of the classifier many times, but I can't get a big difference in the results.
I think the problem is with the pre-processing step. I defined a function as shown below to take care of pre-processing. I used stemming and removed stopwords. I don't know what I am missing. I have tried both LancasterStemmer and PorterStemmer.
stops = set(stopwords.words("english"))

def cleanData(text, lowercase=False, remove_stops=False, stemming=False, lemm=False):
    txt = str(text)
    txt = re.sub(r'[^A-Za-z0-9\s]', r'', txt)
    txt = re.sub(r'\n', r' ', txt)
    if lowercase:
        txt = " ".join([w.lower() for w in txt.split()])
    if remove_stops:
        txt = " ".join([w for w in txt.split() if w not in stops])
    if stemming:
        st = PorterStemmer()
        txt = " ".join([st.stem(w) for w in txt.split()])
    if lemm:
        wordnet_lemmatizer = WordNetLemmatizer()
        txt = " ".join([wordnet_lemmatizer.lemmatize(w) for w in txt.split()])
    return txt
Are there any more pre-processing steps I should do to get better accuracy?
URL for the dataset: Dataset
EDIT:
Parameters that I used are as mentioned below.
params = {'task': 'train',
'boosting_type': 'gbdt',
'objective': 'binary',
'metric': 'binary_logloss',
'learning_rate': 0.01,
'max_depth': 22,
'num_leaves': 78,
'feature_fraction': 0.1,
'bagging_fraction': 0.4,
'bagging_freq': 1}
I have altered the max_depth and num_leaves parameters along with others, but the accuracy seems to be stuck at a certain level.
There are a few things to consider. First of all, your training set is not balanced: the class distribution is roughly 70%/30%. You need to account for this during training. Also, what types of features are you using? Using the right set of features could improve your performance.
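For example, the imbalance can be addressed directly with LightGBM's built-in parameters (a sketch; train_data and valid_data are assumed to be lgb.Dataset objects from your own split, and you should use only one of the two options):

import lightgbm as lgb

params_balanced = dict(params)
params_balanced['is_unbalance'] = True        # let LightGBM re-weight the minority class
# ...or set the weight explicitly (negative/positive ratio, assuming the positive class is the ~30% minority):
# params_balanced['scale_pos_weight'] = 0.7 / 0.3

booster = lgb.train(params_balanced, train_data, num_boost_round=1000, valid_sets=[valid_data])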