I want to use Mallet as part of an expert-finding project. I'm fairly new to Mallet, but I know that it trains topics from a set of documents. Let's say that I have 50 topics trained by Mallet. I want to calculate one of these probabilities: p(topic|q) or p(q|topic).
q is the query: a word (such as algorithm, android, etc.) naming the area in which I want to find experts.
I read this post: how to get word-topic probability using mallet. One of the users said we can calculate the probability using the --word-topic-counts-file option. Let's say that I have generated this file with Mallet. It has the following structure:
0 android 2:21
1 is 3:3
.
.
.
I know the semantics of this structure, but I don't know how to calculate the probability of a topic given the query, i.e. p(topic|q), or alternatively p(q|topic).
P.S.: I say "either" because I'm not sure which of them Mallet calculates.
Any help would be appreciated.
Take this example line from GlieBrt's answer to the linked question:
1 needham 19:2 17:1
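Each topic:count pair gives the number of times the word was assigned to that topic, so p(topic|q) is that count divided by the word's total count (here 2 + 1 = 3):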
p(19|needham) = 2/3 = 0.67
and
p(17|needham) = 1/3 = 0.33
With your own example, it is even simpler, because the word appears in only one topic:
0 android 2:21
p(2|android) = 21/21 = 1.0
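The same normalisation can be sketched in Python for a whole counts file. This is only a rough sketch under a few assumptions: the function names are made up for illustration, the third sample line is hypothetical data added so that topic 2 contains more than one word, and the probabilities are plain count ratios that ignore the smoothing Mallet applies internally. It also shows the other direction, p(q|topic), which needs the total count of the topic over all words rather than the total count of the word.

```python
def parse_word_topic_counts(lines):
    """Parse lines such as '1 needham 19:2 17:1' into {word: {topic: count}}."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        word = parts[1]
        counts[word] = {}
        for pair in parts[2:]:
            topic, count = pair.split(":")
            counts[word][int(topic)] = int(count)
    return counts


def p_topic_given_word(counts, word):
    """Normalise one word's counts over its topics: p(topic | word)."""
    total = sum(counts[word].values())
    return {topic: c / total for topic, c in counts[word].items()}


def p_word_given_topic(counts, word, topic):
    """Normalise over every word assigned to the topic: p(word | topic)."""
    topic_total = sum(tc.get(topic, 0) for tc in counts.values())
    if topic_total == 0:
        return 0.0
    return counts.get(word, {}).get(topic, 0) / topic_total


# The first two lines come from the examples above; the third is hypothetical.
sample = ["0 android 2:21", "1 needham 19:2 17:1", "2 phone 2:7 19:1"]
counts = parse_word_topic_counts(sample)

print(p_topic_given_word(counts, "needham"))
print(p_word_given_topic(counts, "android", 2))
```

In a real run you would pass the lines of the file written by --word-topic-counts-file instead of the sample list.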
First, sorry for my bad English.
Short version: Can anyone tell me how to use only the first n layers of XLNet for classification?
Long version:
I have a dataset composed of texts and their summaries. The goal is to detect whether a summary was generated by a bot.
So I thought of using BERT, giving it "[CLS] "+Text+" [SEP]"+summary as input, then taking the representation of the "[CLS]" token and using a classifier to detect whether the summary was written by a bot.
The problem is that BERT takes no more than 512 tokens as input.
So I thought of using XLNet. But then another problem appeared: my GPU (RTX 2060) can't handle even a batch of size 1.
So I thought of using only, say, the first 4 layers of XLNet, but I don't know how to do it.
My code to load the model is model=XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels = 2)
Can anyone tell me what to add to use only part of the network, please?
I am doing a college project where I need to compare a string with a list of other strings. I want to know whether there is a library that can do this.
Suppose I have a table called: DOCTORS_DETAILS
Other table names are: HOSPITAL_DEPARTMENTS, DOCTOR_APPOINTMENTS, PATIENT_DETAILS, PAYMENTS, etc.
Now I want to calculate which of those are most relevant to DOCTOR_DETAILS.
The expected output could be:
DOCTOR_APPOINTMENTS - more relevant, because the term DOCTOR appears in both strings
PATIENT_DETAILS - the term DETAILS is present in both strings
HOSPITAL_DEPARTMENTS - least relevant
PAYMENTS - least relevant
So I want to compute relevance based on the number of terms that the two strings have in common.
Ex: DOCTOR_DETAILS -> DOCTOR_APPOINTMENT (1/2) > DOCTOR_ADDRESS_INFORMATION (1/3) > DOCTOR_SPECIALIZATION_DEGREE_INFORMATION (1/4) > PATIENT_INFO (0/2)
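The ratio in the example above (shared terms divided by the number of terms in the candidate name) can be sketched in a few lines of plain Python, without any library. This is only a minimal sketch: the function names are made up for illustration, terms are obtained by splitting on underscores, and matching is exact, so DOCTORS and DOCTOR would count as different terms.

```python
def terms(name):
    """Split a table name such as 'DOCTOR_DETAILS' into a set of upper-case terms."""
    return {t for t in name.upper().split("_") if t}


def overlap_score(main, candidate):
    """Number of shared terms divided by the number of terms in the candidate."""
    candidate_terms = terms(candidate)
    if not candidate_terms:
        return 0.0
    return len(terms(main) & candidate_terms) / len(candidate_terms)


def rank_by_overlap(main, candidates):
    """Order candidates from most to least relevant; ties keep their input order."""
    return sorted(candidates, key=lambda c: overlap_score(main, c), reverse=True)


tables = ["HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS", "PATIENT_DETAILS", "PAYMENTS"]
for table in rank_by_overlap("DOCTOR_DETAILS", tables):
    print(table, overlap_score("DOCTOR_DETAILS", table))
```

Exact term matching is cheap and easy to explain, but it cannot see that DOCTORS and DOCTOR, or PATIENT and PATIENTS, are related; that is the gap that vector-based similarity tries to close.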
Semantic similarity is a common NLP problem. There are multiple approaches to look into, but at their core they all boil down to:
1. Turn each piece of text into a vector
2. Measure the distance between vectors, and call closer vectors more similar
Three possible ways to do step 1 are:
tf-idf
fasttext
bert-as-service
To do step 2, you almost certainly want to use cosine similarity. It is pretty straightforward in Python; here is an implementation from a blog post:
import numpy as np

def cos_sim(a, b):
    """Takes 2 vectors a, b and returns the cosine similarity according
    to the definition of the dot product
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)
For your particular use case, my instinct is to use fasttext. The official site shows how to download some pretrained word vectors, but you will want to download a pretrained model instead (see this GH issue, use https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.bin.zip).
You'd then want to do something like:
import fasttext

model = fasttext.load_model("model_filename.bin")

def order_tables_by_name_similarity(main_table, candidate_tables):
    '''Note: we use a fasttext model, not just pretrained vectors, so we get subword information
    you can modify this to also output the distances if you need them
    '''
    main_v = model[main_table]
    similarity_to_main = lambda w: cos_sim(main_v, model[w])
    return sorted(candidate_tables, key=similarity_to_main, reverse=True)

order_tables_by_name_similarity("DOCTORS_DETAILS", ["HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS", "PATIENT_DETAILS", "PAYMENTS"])
# outputs: ['PATIENT_DETAILS', 'DOCTOR_APPOINTMENTS', 'HOSPITAL_DEPARTMENTS', 'PAYMENTS']
If you need to put this in production, the giant model size (6.7GB) might be an issue. At that point, you'd want to build your own model, and constrain the model size. You can probably get roughly the same accuracy out of a 6MB model!
I have an ML.NET project, and so far everything has gone great. I have a motor that collects a power reading 256 times during each rotation, and I push those readings into a model. Right now it determines the state of the motor nearly perfectly. The motor itself only has room for 38 values at a time, so I have been spending several rotations to collect the full 256 samples for my training data.
I would like to cut the sample size down to 38 so that I can determine the motor's state on every rotation. If I just pick 38 evenly spaced samples, my model degrades a lot. I know I am not feeding the model the features it considers most important; I am effectively guessing which data to select.
Is there a way I can see the importance of each value in the array during the training process? I was thinking I could use IDataView for this, and I found the statement below about it (link).
Standard ML schema: The IDataView system does not define, nor prescribe, standard ML schema representation. For example, it does not dictate representation of nor distinction between different semantic interpretations of columns, such as label, feature, score, weight, etc. However, the column metadata support, together with conventions, may be used to represent such interpretations.
Does this mean I can print out things such as the weight for each column, and if so, how would I do that?
I have only been working with ML.NET for a couple of weeks, so I apologize if the question is naive; I assure you I have googled this in as many ways as I can think of. Any advice would be appreciated. Thanks in advance.
EDIT:
Thank you for the answer; I was going down a completely useless path. I have been trying to get it to work following the example you linked to. I have 260 columns of numbers and one column holding the condition as one of five text strings. This condition is what I am trying to predict.
The first time I tried it, it threw the error "expecting single but got string". No problem: I used .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")) to convert the labels to key values, and then it threw the error "expected Single, got Key UInt32". Any ideas on how to push that into this function?
At any rate, thank you for the reply. I guess my upvotes don't count yet, sorry; hopefully I can upvote it later, or someone else here can. Below is the code example.
//Create MLContext
MLContext mlContext = new MLContext();

//Load Data
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(TRAIN_DATA_FILEPATH, separatorChar: ',', hasHeader: true);

// 1. Get the column name of input features.
string[] featureColumnNames =
    data.Schema
        .Select(column => column.Name)
        .Where(columnName => columnName != "Label").ToArray();

// 2. Define estimator with data pre-processing steps
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", featureColumnNames)
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label"));

// 3. Create transformer using the data pre-processing estimator
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data);//error here

// 4. Pre-process the training data
IDataView preprocessedTrainData = dataPrepTransformer.Transform(data);

// 5. Define Stochastic Dual Coordinate Ascent machine learning estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();

// 6. Train machine learning model
var sdcaModel = sdcaEstimator.Fit(preprocessedTrainData);

ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
    mlContext
        .Regression
        .PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount: 3);

// Order features by importance
var featureImportanceMetrics =
    permutationFeatureImportance
        .Select((metric, index) => new { index, metric.RSquared })
        .OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));

Console.WriteLine("Feature\tPFI");
foreach (var feature in featureImportanceMetrics)
{
    Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.RSquared.Mean:F6}");
}
I believe what you are looking for is called Permutation Feature Importance (PFI). It tells you which features matter most to the model by shuffling the values of each feature in isolation and then measuring how much that change affects the model's performance metrics. The larger the drop in the metric, the more the model relies on that feature.
"Interpret model predictions using Permutation Feature Importance" is the doc that describes how to use this API in ML.NET.
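To make the idea concrete, here is a rough, library-independent sketch of permutation importance in Python with NumPy. It is not the ML.NET API: the function names, the toy data, and the stand-in "model" (a fixed rule that only reads feature 0) are all assumptions chosen so that the expected ranking is obvious.

```python
import numpy as np


def permutation_importance(predict, X, y, metric, n_repeats=3, seed=0):
    """Mean drop in `metric` when each feature column is shuffled in isolation."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle only column j, leaving every other feature untouched.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j] += baseline - metric(y, predict(X_perm))
    return drops / n_repeats


def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))


# Hypothetical toy data: only feature 0 determines the label.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)

# Stand-in for a trained model: a fixed rule that reads feature 0 only.
def predict(data):
    return (data[:, 0] > 0).astype(int)


importance = permutation_importance(predict, X, y, accuracy)
ranking = np.argsort(-importance)  # feature indices, most important first
print(ranking, importance)
```

For the motor data, the same kind of ranking over the 256 readings would be the starting point for choosing which 38 to keep, with the caveat that strongly correlated readings can share importance between them.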
You can also use a set of open-source packages that are much more sophisticated than what is found in ML.NET. I have an example on my GitHub of how to use R with advanced explainer packages to explain ML.NET models. You can get local (per-instance) explanations as well as global model breakdowns, details, diagnostics, feature interactions, etc.
https://github.com/bartczernicki/BaseballHOFPredictionWithMlrAndDALEX
I'm working with around 8k documents, all of which are about a single topic. However, the documents cover various events that happened across the world related to that topic. I want to find these subtopics (or events) in the documents. To achieve this, I'm using the gensim LDA model:
corpus = [dictionary.doc2bow(doc) for doc in docTrain]
model = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=17, chunksize=10000, id2word=dictionary,random_state=123, alpha = 0.01, eta = 0.9, passes = 10 )
coherencemodel = gensim.models.CoherenceModel(model=model, texts=data, dictionary=dictionary, coherence='c_v')
Since I did not know the number of topics in advance, I used the elbow method to determine the optimal number, which comes out to be 17 or 18. Also, the coherence score does not increase beyond 0.4.
I want to know what is going wrong and whether there is any other approach that would help me solve this problem in a better way. Please let me know if any other information regarding my approach is required.
If we choose 20 topics in LDA and then choose 30 topics, will the 30-topic result contain those same 20 topics, or at least produce similar results?
Short answer: no. LDA is usually fit with a randomized approximate inference procedure (for example, a Gibbs sampler) that estimates Dirichlet-distributed topic mixtures for the documents. Topic allocations are then made from these samples, so they will differ between runs, both because of sampling randomness and because of allocation uncertainty, unless you set an explicit random seed and run with the same number of topics k. Take a look at the original paper, Blei et al. 2003, to see how k is defined.
UPDATE (in response to a comment): Hierarchical LDA (hLDA) tries to solve the problem of retaining topics and subtopics by constructing levels of topics following the (nested) Chinese restaurant process. But it's still not perfect.
Flat LDA, however, looks at documents rather than topics to produce further results. Say you get topic 0 (the first table in the restaurant) and all documents try to sit there, but there isn't really enough space, so you create another topic 1 where some docs feel more comfortable, and so on. You are right about how these tables are created. But there is one critical point: topic 0 CHANGES when you create a new table (topic 1), because some documents have left the first table and taken their words (or the co-occurrence probabilities thereof) with them to the new table, and all words in topic 0 get reshuffled given the new situation. The same happens when you create more tables/topics: all the previous ones are re-estimated as well. Hence, you will never get the same 20 topics when rerunning with 30.