How do I get multiple answers from a text using BertForQuestionAnswering? For example, for the question below there are two possible answers:
a nice puppet
a software engineer
Below is the code snippet:
from transformers import BertTokenizer, BertForQuestionAnswering
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet. Jim Henson was a software engineer."
input_ids = tokenizer.encode(question, text)
# segment ids: 0 for the question tokens (up to and including the first [SEP], id 102), 1 for the text tokens
token_type_ids = [0 if i <= input_ids.index(102) else 1 for i in range(len(input_ids))]
# older transformers releases return a (start_scores, end_scores) tuple;
# newer releases return an output object, so use outputs.start_logits / outputs.end_logits there
start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
# take the single most likely span
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
print(answer)
Output: 'a software engineer'
Thanks in advance!!
You are simply outputting the most likely answer according to BERT. If you want multiple answers, you need to actually select multiple answers, in this case the first and second most likely. To do this, create new answer variables and select the second most likely start and end positions from the start_scores and end_scores tensors. I recommend you use torch.topk().
# torch.topk needs k and returns (values, indices); take the two best start and end positions
_, start_indices = torch.topk(start_scores, k=2)
_, end_indices = torch.topk(end_scores, k=2)
answer1_start, answer2_start = start_indices[0].tolist()
answer1_end, answer2_end = end_indices[0].tolist()
answer1 = ' '.join(all_tokens[answer1_start : answer1_end + 1])
answer2 = ' '.join(all_tokens[answer2_start : answer2_end + 1])
print(answer1, answer2)
Let me know how this works.
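One caveat with pairing the top-k starts and ends position by position: the second-best start and second-best end do not necessarily belong to the same span, and the end index can even land before the start index. A minimal sketch of a more robust variant (reusing start_scores, end_scores, and all_tokens from the snippet above; the choice of k = 5 and the span-length limit of 30 are arbitrary) is to score all top-k start/end combinations and keep the best valid spans:

import itertools

# consider every combination of the top-k start and end positions,
# keep only valid spans, and rank them by total score
k = 5
top_starts = torch.topk(start_scores, k=k).indices[0].tolist()
top_ends = torch.topk(end_scores, k=k).indices[0].tolist()

candidates = []
for s, e in itertools.product(top_starts, top_ends):
    if s <= e and e - s < 30:  # discard reversed or overly long spans
        score = start_scores[0, s].item() + end_scores[0, e].item()
        candidates.append((score, ' '.join(all_tokens[s : e + 1])))

# print the two best candidate answers
for score, span in sorted(candidates, reverse=True)[:2]:
    print(span)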
I have a question regarding concatenating datasets in PyTorch. I could not find any answers that could help me solve this. I have this code:
train_val_folder = root + "train_val_data/"
train_set1_original = datasets.ImageFolder(train_val_folder, transform=train_val_transform_croponly)
train_val_modified = datasets.ImageFolder(train_val_folder, transform=train_val_transform)
train_val_dataset = torch.utils.data.ConcatDataset([train_set1_original, train_val_modified])
classes = train_val_dataset.classes  # this line raises the AttributeError below
train_val_length = len(train_val_dataset)
print(train_val_length)
AttributeError: 'ConcatDataset' object has no attribute 'classes'
Is there an easier way to concatenate the original and transformed datasets while preserving .classes from ImageFolder?
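For reference, ConcatDataset keeps the wrapped datasets in its .datasets attribute, so one possible workaround (a minimal sketch, assuming both ImageFolder datasets are built from the same folder and therefore share the same class list) is to read .classes from one of the underlying datasets:

# ConcatDataset has no .classes, but it stores the wrapped datasets in .datasets;
# since both ImageFolders point at the same folder, their class lists are identical
classes = train_val_dataset.datasets[0].classes
train_val_length = len(train_val_dataset)
print(classes, train_val_length)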
Let's say I have the following ParamSet object:
my_ps = paradox::ps(
minsplit = p_int(1, 64, logscale = TRUE),
cp = p_dbl(1e-04, 1, logscale = TRUE))
Is it possible to rename minsplit to survTree.minsplit without changing anything else?
The reason for this is that I use some learners as part of a GraphLearner, so their parameter names changed, and I would like to have some code that adds the learner$id in front of the parameter names to use later for tuning (rather than rewriting them from scratch with the new names).
I think I have a partial solution here. It is only partial, because it does not support the transformation.
Where it works:
library(paradox)
my_ps = paradox::ps(
minsplit = p_int(1, 64),
cp = p_dbl(1e-04, 1)
)
my_ps$set_id = "john"
my_psc = ParamSetCollection$new(list(my_ps))
print(my_psc)
#> <ParamSetCollection>
#> id class lower upper nlevels default value
#> 1: john.minsplit ParamInt 1e+00 64 64 <NoDefault[3]>
#> 2: john.cp ParamDbl 1e-04 1 Inf <NoDefault[3]>
Created on 2022-12-07 by the reprex package (v2.0.1)
Where it does not:
library(paradox)
my_ps = paradox::ps(
minsplit = p_int(1, 64, logscale = TRUE),
cp = p_dbl(1e-04, 1)
)
my_ps$set_id = "john"
my_psc = ParamSetCollection$new(list(my_ps))
#> Error in .__ParamSetCollection__initialize(self = self, private = private, : Building a collection out sets, where a ParamSet has a trafo is currently unsupported!
Created on 2022-12-07 by the reprex package (v2.0.1)
The underlying problem is that we have not yet solved how to reconcile the parameter transformations of individual ParamSets with a possible parameter transformation of the ParamSetCollection.
I fear that there is currently no neat solution for your problem.
Sorry, I cannot comment yet. This is not exactly the solution you are looking for, but I hope it will fix the problem you are having.
You can set the tuning search space in the learner before putting it into the graph, i.e. keep your original parameter names. After you create the GraphLearner as usual, it will have the desired search space.
A concrete example:
library(mlr3verse)
learner = lrn("regr.rpart", cp = to_tune(0.1, 0.2))
glrn = as_learner(po("pca") %>>% po("learner", learner))
at = auto_tuner(
"random_search",
glrn,
rsmp("holdout"),
term_evals = 10
)
task = tsk("mtcars")
at$train(task)
I am working on a binary classification machine learning problem and I am trying to balance the training set, as I have an imbalanced target class variable. I am using PySpark to build the model.
Below is the code that balances the data:
train_initial, test = new_data.randomSplit([0.7, 0.3], seed = 2018)
train_initial.groupby('label').count().toPandas()
label count
0 0.0 712980
1 1.0 2926
train_new = train_initial.sampleBy('label', fractions={0: 2926./712980, 1: 1.0}).cache()
The above code performs under-sampling, but I think this might lead to loss of information. However, I am not sure how to perform up-sampling. I also tried to use the sample function as below:
train_up = train_initial.sample(True, 10.0, seed = 2018)
Although it increases the count of 1s in my dataset, it also increases the count of 0s and gives the result below.
label count
0 0.0 7128722
1 1.0 29024
Can someone please help me achieve up-sampling in PySpark?
Thanks a lot in advance!!
The problem is that you are oversampling the whole DataFrame. You should filter the data from the two classes first:
import pandas as pd

# pandas version: split by class, then oversample the minority class with replacement
df_class_0 = df_train[df_train['label'] == 0]
df_class_1 = df_train[df_train['label'] == 1]
count_class_0 = len(df_class_0)
df_class_1_over = df_class_1.sample(count_class_0, replace=True)
df_test_over = pd.concat([df_class_0, df_class_1_over], axis=0)
The example comes from: https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Please note that there are better ways to perform oversampling (e.g. SMOTE).
For anyone trying to do random oversampling on an imbalanced dataset in PySpark: the following code will get you started (in this snippet, 0 is the majority class and 1 is the class to be oversampled):
df_a = df.filter(df['label'] == 0)
df_b = df.filter(df['label'] == 1)
a_count = df_a.count()
b_count = df_b.count()
# sample the minority class with replacement so its size roughly matches the majority class
ratio = a_count / b_count
df_b_oversampled = df_b.sample(withReplacement=True, fraction=ratio, seed=1)
df = df_a.unionAll(df_b_oversampled)
I might be quite late to the rescue here. But this is what I would recommend:
Step 1. Sample only for label = 1
train_1= train_initial.where(col('label')==1).sample(True, 10.0, seed = 2018)
Step 2. Merge this data with the label = 0 data
train_0=train_initial.where(col('label')==0)
train_final = train_0.union(train_1)
PS: please import col with
from pyspark.sql.functions import col
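As a quick sanity check (a small addition, assuming the train_final DataFrame built above), you can verify the new class distribution the same way as in the question:

# check the class distribution after up-sampling
train_final.groupBy('label').count().show()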
I am using pc-stable from the package 'pcalg' version 2.0-10 to learn the structure. As I understand it, this algorithm should not be affected by the order of the input data because it is order-independent. However, when I run it with different column orders, I get different graphs. Can anyone help me with this issue? Here is my code.
library(pracma)
randindexMatriax <- matrix(0,10,ncol(TrainData))
numberUnique_val_col = vector()
pdf("Graph for Test PC Stable with random order.pdf")
par(mfrow=c(2,1))
for (i in 1:10)
{
randindex<-randperm(1:ncol(TrainData))
randindexMatriax[i,]<-randindex
TrainDataRandOrder<-data[,randindex]
V <- colnames( TrainDataRandOrder)
UD <-data.frame(TrainDataRandOrder)
numberUnique_val_col= sapply(UD,function(x)length(unique(x)))
suffStat <- list(dm = TrainDataRandOrder,
                 nlev = numberUnique_val_col[1:20],
                 adaptDF = FALSE)
pc.fit <- pc(suffStat, indepTest = disCItest, alpha = 0.01, labels = V,
             fixedGaps = NULL, fixedEdges = NULL, NAdelete = TRUE, m.max = Inf,
             skel.method = "stable", conservative = TRUE, solve.confl = TRUE,
             verbose = TRUE)
}
The "Stable" part of PC-Stable only affects the Skeleton phase of the algorithm. The Orientation phase is still order-dependent. Do the two graphs have identical "skeletons"? That is, if you convert all directed edges into undirected edges, are the two graphs identical?
If not, you may have uncovered a bug in pcalg! Please post a sample dataset and two orderings of the columns that produce graphs with different skeletons.
I get the following error in caret trainControl() using the custom methods syntax documented in the package vignette (pdf) on page 46. Does anyone know if this document is out of date or incorrect? It seems at odds with the caret documentation page, where the "custom" parameter is not used.
> fitControl <- trainControl(custom=list(parameters=lpgrnn$grid, model=lpgrnn$fit,
prediction=lpgrnn$predict, probability=NULL,
sort=lpgrnn$sort, method="cv"),
number=10)
Error in trainControl(custom = list(parameters = lpgrnn$grid, model = lpgrnn$fit, :
unused argument (custom = list(parameters = lpgrnn$grid, model = lpgrnn$fit,
prediction = lpgrnn$predict, probability = NULL, sort = lpgrnn$sort, method = "cv",
number = 10))
The cited pdf is out of date. The caret website is the canonical source of documentation.