I am new to LSI in Python with the Gensim and Scikit-learn libraries. I was able to run topic modeling on a corpus with LSI in both libraries; however, with the Gensim approach I was not able to display a document-to-topic mapping.
Here is my work using Scikit-learn LSI, where I successfully displayed the document-to-topic mapping:
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
transformed_vector = tfidf_transformer.fit_transform(transformed_vector)
NUM_TOPICS = 14
lsi_model = TruncatedSVD(n_components=NUM_TOPICS)
lsi = lsi_model.fit_transform(transformed_vector)

# Assign each document to the topic with the highest weight
topic_to_doc_mapping = {}
topic_list = []
topic_names = []
for i in range(len(dbpedia_df.index)):
    most_likely_topic = lsi[i].argmax()
    if most_likely_topic not in topic_to_doc_mapping:
        topic_to_doc_mapping[most_likely_topic] = []
    topic_to_doc_mapping[most_likely_topic].append(i)
    topic_list.append(most_likely_topic)
    topic_names.append(topic_id_topic_mapping[most_likely_topic])
dbpedia_df['Most_Likely_Topic'] = topic_list
dbpedia_df['Most_Likely_Topic_Names'] = topic_names
print(topic_to_doc_mapping[0][:100])

# Show the first few documents assigned to one topic
topic_of_interest = 1
doc_ids = topic_to_doc_mapping[topic_of_interest][:4]
for doc_index in doc_ids:
    print(X.iloc[doc_index])
Using Gensim I was unable to proceed to display the document to topic mapping:
from gensim.corpora import Dictionary
from gensim.models import LsiModel
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Tokenize, remove stop words, and lemmatize each document
processed_list = []
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
for doc in documents_list:
    tokens = word_tokenize(doc.lower())
    stopped_tokens = [token for token in tokens if token not in stop_words]
    lemmatized_tokens = [lemmatizer.lemmatize(i, pos="n") for i in stopped_tokens]
    processed_list.append(lemmatized_tokens)

# Build the bag-of-words corpus and train the LSI model
term_dictionary = Dictionary(processed_list)
document_term_matrix = [term_dictionary.doc2bow(document) for document in processed_list]
NUM_TOPICS = 14
model = LsiModel(corpus=document_term_matrix, num_topics=NUM_TOPICS, id2word=term_dictionary)
lsi_topics = model.show_topics(num_topics=NUM_TOPICS, formatted=False)
lsi_topics
How can I display the document to topic mapping here?
To get a document's topic vector from a trained LsiModel, index the model with the document's bag-of-words representation, using Python dict-style bracket access (model[bow]).
For example, to get the topics for the 1st item in your training data, you can use:
first_doc = document_term_matrix[0]
first_doc_lsi_topics = model[first_doc]
You can also supply a list of docs, as in training, to get the LSI topics for an entire batch at once. For example:
all_doc_lsi_topics = model[document_term_matrix]
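Building on that, the document-to-topic mapping from the Scikit-learn version can be rebuilt from these vectors. The following is a minimal sketch rather than a definitive approach. It assumes each document's result is a list of (topic_id, weight) pairs, which is what iterating over model[document_term_matrix] yields, and it picks the topic with the largest weight to mirror the argmax used in the question. The helper name map_docs_to_topics and the sample data are hypothetical; in practice you would pass all_doc_lsi_topics instead.

```python
from collections import defaultdict


def map_docs_to_topics(doc_topic_vectors):
    """Group document indices by their highest-weighted topic.

    doc_topic_vectors: iterable with one entry per document, each a list of
    (topic_id, weight) pairs, e.g. the result of model[document_term_matrix].
    """
    topic_to_docs = defaultdict(list)
    for doc_index, topic_weights in enumerate(doc_topic_vectors):
        if not topic_weights:
            # Empty documents can yield an empty vector; skip them.
            continue
        best_topic, _ = max(topic_weights, key=lambda pair: pair[1])
        topic_to_docs[best_topic].append(doc_index)
    return dict(topic_to_docs)


# Hypothetical stand-in for all_doc_lsi_topics (three documents, two topics).
sample_vectors = [
    [(0, 0.9), (1, 0.1)],
    [(0, -0.2), (1, 0.7)],
    [(0, 0.4), (1, 0.3)],
]
topic_to_doc_mapping = map_docs_to_topics(sample_vectors)
print(topic_to_doc_mapping)
```

Because LSI weights can be negative, ranking by abs(weight) instead of the signed weight is a reasonable variation.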
Related
I'm running a machine learning model that requires multiple transformations. I applied polynomial transformations, interactions, and also a feature selection using SelectKBest:
transformer = ColumnTransformer(
    transformers=[("cat", ce.cat_boost.CatBoostEncoder(y_train), cat_features),]
)
X_train_transformed = transformer.fit_transform(X_train, y_train)
X_test_transformed = transformer.transform(X_test)
poly = PolynomialFeatures(2)
X_train_polynomial = poly.fit_transform(X_train_transformed)
X_test_polynomial = poly.transform(X_test_transformed)
interaction = PolynomialFeatures(2, interaction_only=True)
X_train_interaction = interaction.fit_transform(X_train_polynomial)
X_test_interaction = interaction.transform(X_test_polynomial)
feature_selection = SelectKBest(chi2, k=55)
train_features = feature_selection.fit_transform(X_train_interaction, y_train)
test_features = feature_selection.transform(X_test_interaction)
model = lgb.LGBMClassifier()
model.fit(train_features, y_train)
However, I want to get the feature names, and I have no idea how to get them.
My problem is time series anomaly detection, and I use the Facebook Prophet library. I have a function called "fit_predict_model" and 90 different dataframes that I keep in a dictionary, so I have 90 different models, and training them takes a long time. I wanted to use multiprocessing to train faster, but I am getting a memory error. How can I solve this problem?
def fit_predict_model(dataframe, model_name, interval_width=0.95, changepoint_range=0.88):
    model = Prophet(yearly_seasonality=False, daily_seasonality=True,
                    seasonality_mode="multiplicative", changepoint_range=changepoint_range)
    model = model.fit(dataframe)
    # In-sample prediction on the training dataframe
    forecast = model.predict(dataframe)
    return forecast

pred = {}

def run(key):
    pred[key] = fit_predict_model(train[key], model_name=key)

pool = Pool(cpu_count())
pool.map(run, list(train.keys()))
pool.close()
pool.join()
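One detail in the code above is worth illustrating: each worker process gets its own copy of the module-level pred dictionary, so assignments made inside run never reach the parent process. A hedged sketch of an alternative follows. Workers return (key, result) pairs, the parent assembles the dictionary, and the pool is kept small with maxtasksperchild=1 so that each worker is replaced after one task, which can help limit memory growth. The names fit_one and run_all and the stand-in computation are hypothetical; the real worker would call fit_predict_model. Whether this removes the memory error depends on the size of the data and models, which is an assumption here.

```python
from multiprocessing import Pool


def fit_one(item):
    """Worker: receives (key, data) and returns (key, result)."""
    key, data = item
    # Stand-in for fit_predict_model(data, model_name=key).
    result = sum(data) / len(data)
    return key, result


def run_all(train, processes=2):
    """Fit every entry of `train` in a small pool and collect the results."""
    pred = {}
    # maxtasksperchild=1 restarts each worker after one task, releasing its memory.
    with Pool(processes=processes, maxtasksperchild=1) as pool:
        # imap_unordered hands results back one at a time as they finish.
        for key, result in pool.imap_unordered(fit_one, train.items()):
            pred[key] = result
    return pred


if __name__ == "__main__":
    # Hypothetical data standing in for the 90 dataframes.
    train = {"series_a": [1.0, 2.0, 3.0], "series_b": [10.0, 20.0]}
    print(run_all(train))
```

Using fewer processes than cpu_count() trades speed for a lower peak memory footprint, since fewer models are held in memory at the same time.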
I am trying to reverse engineer hypothesis test results from an online calculator using Microsoft Excel or Google Sheets. The inputs and outputs of the online calculator are shown in the screenshot.
I have used the following Excel functions to replicate the conversion rates and p-value:
control_exposures = 917
control_conversions = 126
variant_exposures = 1002
variant_conversions = 142
control_cvr_rate = control_conversions/control_exposures = 13.74%
variant_cvr_rate = variant_conversions/variant_exposures = 14.17%
control_std_error = sqrt((control_cvr_rate*(1-control_cvr_rate)/control_exposures)) = 1.14%
variant_std_error = sqrt((variant_cvr_rate*(1-variant_cvr_rate)/variant_exposures)) = 1.10%
z_score = (control_cvr_rate- variant_cvr_rate)/sqrt(power(control_std_error,2)+power(variant_std_error,2)) = -0.2724
p_value = normdist(z_score,0,1,TRUE) = 0.3927
Based on this, how do I derive the statistical power value of 5.9% shown in the screenshot?
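A hedged sketch of one way to arrive at a figure close to 5.9%: treat the observed difference as the true effect and compute the post-hoc power of a two-sided z-test at a 5% significance level. Whether the online calculator uses exactly this formula and this significance level is an assumption. The variable names follow the ones above; alpha, z_critical and power are introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist

control_exposures, control_conversions = 917, 126
variant_exposures, variant_conversions = 1002, 142
alpha = 0.05  # assumed two-sided significance level

control_cvr_rate = control_conversions / control_exposures
variant_cvr_rate = variant_conversions / variant_exposures
control_std_error = sqrt(control_cvr_rate * (1 - control_cvr_rate) / control_exposures)
variant_std_error = sqrt(variant_cvr_rate * (1 - variant_cvr_rate) / variant_exposures)

z_score = (control_cvr_rate - variant_cvr_rate) / sqrt(
    control_std_error**2 + variant_std_error**2
)

std_normal = NormalDist()
z_critical = std_normal.inv_cdf(1 - alpha / 2)

# Probability that a two-sided test rejects if the observed gap were the true gap.
power = std_normal.cdf(abs(z_score) - z_critical) + std_normal.cdf(
    -abs(z_score) - z_critical
)
print(f"z = {z_score:.4f}, power = {power:.1%}")
```

Under the same assumptions, the spreadsheet equivalent would be NORMSDIST(ABS(z_score) - NORMSINV(0.975)) + NORMSDIST(-ABS(z_score) - NORMSINV(0.975)).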
I am trying to run XGBoost for time series analysis. This is my code, which is also used elsewhere:
xgb1 = xgb.XGBRegressor(learning_rate=0.1,n_estimators=n_estimators,max_depth=max_depth,min_child_weight=min_child_weight,gamma=0,subsample=0.8,colsample_bytree=0.8,
reg_alpha=reg_alpha,objective='reg:squarederror', nthread=4, scale_pos_weight=1, seed=27)
xgb_param = xgb1.get_xgb_params()
dmatrix = xgb.DMatrix(data=X_train, label=y_train)
cv_folds = 5
early_stopping_rounds = 50
cvresults = xgb.cv(dtrain=dmatrix, params = xgb_param,num_boost_round=xgb1.get_params()['n_estimators'], nfold=cv_folds,
metrics='rmse', early_stopping_rounds=early_stopping_rounds)
The obvious issue here is that I want to cross-validate time series data and hence can't use cv_folds = 5.
(How) can I use the TimeSeriesSplit function within xgb.cv?
thanks,
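xgb.cv has a folds parameter that accepts an explicit list of (train_indices, test_indices) pairs, so time-ordered splits can be supplied that way instead of nfold. The sketch below builds expanding-window folds by hand so that it runs without extra dependencies. It is meant to mirror the default behaviour of TimeSeriesSplit (no gap, no maximum training size), and in practice list(TimeSeriesSplit(n_splits=5).split(X_train)) should serve the same purpose. The helper name expanding_window_folds is hypothetical, and the xgb.cv call is shown only as a comment because it depends on the variables from the question.

```python
def expanding_window_folds(n_samples, n_splits=5):
    """Return (train_indices, test_indices) pairs in which training data always
    precedes test data, with the training window growing fold by fold."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    folds = []
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_indices = list(range(test_start))
        test_indices = list(range(test_start, test_start + test_size))
        folds.append((train_indices, test_indices))
    return folds


folds = expanding_window_folds(n_samples=12, n_splits=3)
for train_indices, test_indices in folds:
    print(train_indices, test_indices)

# Assumed usage with the variables from the question (not executed here):
# cvresults = xgb.cv(dtrain=dmatrix, params=xgb_param,
#                    num_boost_round=xgb1.get_params()['n_estimators'],
#                    folds=folds, metrics='rmse',
#                    early_stopping_rounds=early_stopping_rounds)
```

When folds is given, the rows of X_train need to be in chronological order for the splits to respect time.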
I am trying to classify text with an SVC classifier.
#self.vectorizer = HashingVectorizer(non_negative=True)
#self.vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
self.hasher = FeatureHasher(input_type='string',non_negative=True)
from sklearn.svm import SVC
self.clf = SVC(probability=True)
for text in self.data_train.data:
    text = self.modifyQuery(text.decode('utf-8','ignore'))
    training_data.append(text)
raw_X = (self.token_ques(text) for text in training_data)
#X_train = self.vectorizer.transform(training_data)
X_train = self.hasher.transform(raw_X)
y_train = self.data_train.target
self.clf.fit(X_train, y_train)
test classifier:
raw_X = (self.token_ques(text) for text in test_data)
X_test = self.hasher.transform(raw_X)
#X_test = self.vectorizer.transform(test_data)
pred = self.clf.predict(X_test)
print("pred=>", pred)
self.categories = self.data_train.target_names
for doc, category in zip(test_data, pred):
    print('%r => %s' % (doc, self.categories[category]))
index = 1
predict_prob = self.clf.predict_proba(X_test)
for doc, category_list in zip(test_data, predict_prob):
    #print values
I tried the hashing vectorizer, the feature hasher, and the TF-IDF vectorizer, but the classifier still gives the wrong answer for all queries (the class with the most data always comes out as the answer). With naive Bayes, I get the correct class for each input query.
Am I doing anything wrong in the code?
Update
I have 8 classes in total. Each class has 100-200 lines of sentences, except one class with 480 lines, which currently always comes out as the answer.
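The class sizes described in the update are imbalanced, which is one plausible reason (an assumption, not a confirmed diagnosis) for a classifier favouring the largest class. The sketch below computes inverse-frequency class weights with the formula n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for class_weight='balanced'. The per-class counts are hypothetical values chosen to match the description (seven classes with 100 to 200 lines and one with 480), and the helper name balanced_class_weights is also hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical label counts: class 0 is the large class.
class_sizes = {0: 480, 1: 100, 2: 120, 3: 140, 4: 160, 5: 180, 6: 200, 7: 150}
labels = [label for label, size in class_sizes.items() for _ in range(size)]

weights = balanced_class_weights(labels)
for label in sorted(weights):
    print(label, round(weights[label], 3))
```

With scikit-learn, passing class_weight='balanced' to SVC applies this weighting during fitting, so errors on the smaller classes are penalised more heavily.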