I am trying to run code to get stacked embeddings from Flair and BERT, and I am getting the following error. One of the suggestions was to reduce the batch size, but how do I pass the data in batches? Here are the code and the error.
from tqdm import tqdm ## tracks progress of loop ##
import torch
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings
from flair.embeddings import TransformerDocumentEmbeddings
from flair.embeddings import DocumentPoolEmbeddings

# flair character-level LM embeddings (forward and backward)
flair_forward = FlairEmbeddings('news-forward')
flair_backward = FlairEmbeddings('news-backward')
bert_embeddings = TransformerDocumentEmbeddings('bert-base-uncased')

### initialize the document embeddings, mode = mean ###
document_embeddings = DocumentPoolEmbeddings([
    flair_forward,
    flair_backward,
    bert_embeddings
])
# Storing size of the embedding (assumes a sample sentence has already been embedded) #
z = sentence.embedding.size()[0]
print(z)
### Vectorising text ###
# creating a tensor for storing sentence embeddings
sen = torch.zeros(0,z)
print(sen)
# iterating Sentences #
for tweet in tqdm(txt):
    sentence = Sentence(tweet)
    document_embeddings.embed(sentence)  # *****this line is giving error*****
    # Adding Document embeddings to list #
    if torch.cuda.is_available():
        sen = sen.cuda()
    sen = torch.cat((sen, sentence.embedding.view(-1, z)), 0)
And this is the error I am getting:
RuntimeError Traceback (most recent call last)
<ipython-input-24-1eee00445350> in <module>()
24 for tweet in tqdm(txt):
25 sentence = Sentence(tweet)
---> 26 document_embeddings.embed(sentence)
27 # Adding Document embeddings to list #
28 if(torch.cuda.is_available()):
7 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py in forward(self, input, hx)
580 if batch_sizes is None:
581 result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
--> 582 self.dropout, self.training, self.bidirectional, self.batch_first)
583 else:
584 result = _VF.lstm(input, batch_sizes, hx, self._flat_weights, self.bias,
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.43 GiB total capacity; 6.54 GiB already allocated; 10.94 MiB free; 6.70 GiB reserved in total by PyTorch)
Flair added a chars_per_chunk parameter to FlairEmbeddings to avoid exactly this problem; smaller chunks need less GPU memory at the cost of speed:

embeddings = FlairEmbeddings('news-forward', chars_per_chunk=128)

For example:

from flair.embeddings import WordEmbeddings, FlairEmbeddings

embedding_types = [
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward', chars_per_chunk=128),
    FlairEmbeddings('news-backward'),
]

Edit your code accordingly and see if this new option works for you. From my own experience, Google Colab is not well suited to large transformer models for tasks such as NER.
See the GitHub documentation for more examples for your specific task!
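As for how to pass the data in batches: below is a minimal sketch (the batch size of 32 is an arbitrary assumption, and txt is the list of tweets from the question) that embeds the tweets chunk by chunk, keeps the document embeddings on the CPU, and clears the GPU copies after each batch:

import torch
from flair.data import Sentence

batch_size = 32  # assumption: lower this further if you still run out of memory

chunks = []
with torch.no_grad():  # no gradients needed when only extracting embeddings
    for start in range(0, len(txt), batch_size):
        batch = [Sentence(t) for t in txt[start:start + batch_size]]
        document_embeddings.embed(batch)  # embed() also accepts a list of Sentences
        for s in batch:
            chunks.append(s.embedding.detach().cpu().view(1, -1))
            s.clear_embeddings()  # free the GPU tensors held by this sentence

sen = torch.cat(chunks, 0)  # one row per tweet, same as before

Combined with the chars_per_chunk setting above, this should keep peak GPU memory roughly constant no matter how many tweets you have.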
I need to calculate the average gap size of a univariate time-series data set. The imputeTS package generates plots from such data. Is it possible to extract the 'gap size' and the 'number of occurrences' from either statsNA or ggplot_na_gapsize?
Or is there any other way to find the average size of gaps in a time-series data set?
(You could use the tsNH4 data set from the imputeTS package.)
(This is my first time asking a question here and I'm fairly new to R.)
At the moment you can only get the average gap size indirectly, with some extra work, using the CRAN version of imputeTS.
But I made a quick update to the development version on GitHub, so now you can also get the average gap size from the statsNA function.
To use it, you first have to install the new version from GitHub (since it is not on CRAN yet):
library("devtools")
install_github("SteffenMoritz/imputeTS")
If you do not have "devtools" installed, install it first:
install.packages("devtools")
Afterwards just use the imputeTS package as usual.
library("imputeTS")
#Example with the tsNH4 dataset
statsNA(tsNH4)
This will now print the following:
> statsNA(tsNH4)
[1] "Length of time series:"
[1] 4552
[1] "-------------------------"
[1] "Number of Missing Values:"
[1] 883
[1] "-------------------------"
[1] "Percentage of Missing Values:"
[1] "19.4%"
[1] "-------------------------"
[1] "Number of Gaps:"
[1] 155
[1] "-------------------------"
[1] "Average Gap Size:"
[1] 5.696774
[1] "-------------------------"
[1] "Stats for Bins"
[1] " Bin 1 (1138 values from 1 to 1138) : 233 NAs (20.5%)"
[1] " Bin 2 (1138 values from 1139 to 2276) : 433 NAs (38%)"
[1] " Bin 3 (1138 values from 2277 to 3414) : 135 NAs (11.9%)"
[1] " Bin 4 (1138 values from 3415 to 4552) : 82 NAs (7.21%)"
[1] "-------------------------"
[1] "Longest NA gap (series of consecutive NAs)"
[1] "157 in a row"
[1] "-------------------------"
[1] "Most frequent gap size (series of consecutive NA series)"
[1] "1 NA in a row (occuring 68 times)"
[1] "-------------------------"
[1] "Gap size accounting for most NAs"
[1] "157 NA in a row (occuring 1 times, making up for overall 157 NAs)"
As you can see, 'Number of Gaps' and 'Average Gap Size' have now been added to the output.
You can also access the output as a variable:
library("imputeTS")
# To actually get an output object, set print_only to FALSE
out <- statsNA(tsNH4, print_only = FALSE)
# Average gap size
out$average_size_na_gaps
# Number of Gaps
out$number_na_gaps
#Number of NAs
out$number_NAs
The update will also be in the next CRAN release. (Thanks for the suggestion!)
Just be a little careful, since it is a development version and thus not as thoroughly tested as the CRAN version.
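For completeness, if you want to stay on the CRAN version for now, you can also compute these numbers yourself; a minimal sketch using base R's rle() on the NA indicator (assuming a univariate series such as tsNH4):

library("imputeTS")

x <- as.vector(tsNH4)
runs <- rle(is.na(x))
gap_sizes <- runs$lengths[runs$values]  # lengths of the runs of consecutive NAs

length(gap_sizes)  # number of gaps (155 for tsNH4)
mean(gap_sizes)    # average gap size (about 5.7 for tsNH4)
table(gap_sizes)   # gap size vs. number of occurrences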
I have a Spark dataframe 'mydataframe' with many columns. I am trying to run k-means on only two of them, lat and long (latitude & longitude), using them as simple values. I want to extract 7 clusters based on just those 2 columns and then attach the cluster assignment to my original dataframe. I've tried:
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
# Prepare a data frame with just 2 columns:
data = mydataframe.select('lat', 'long')
data_rdd = data.rdd # needs to be an RDD
data_rdd.cache()
# Build the model (cluster the data)
clusters = KMeans.train(data_rdd, 7, maxIterations=15, initializationMode="random")
But I am getting an error after a while:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5191.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5191.0 (TID 260738, 10.19.211.69, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last)
I've tried to detach and re-attach the cluster. Same result. What am I doing wrong?
Since, based on another recent question of yours, I guess you are taking your very first steps with Spark clustering (you are even importing sqrt & array without ever using them, probably because that is how the docs example looks), let me offer advice at a more general level rather than only on the specific question you are asking here (hopefully also saving you from subsequently opening 3-4 more questions while trying to get your cluster assignments back into your dataframe)...
Since you already have your data in a dataframe, and you want to attach the cluster membership back to that initial dataframe, you have no reason to revert to an RDD and use the (soon to be deprecated) MLlib package; you will do your job much more easily, elegantly, and efficiently using the (now recommended) ML package, which works directly with dataframes.
Step 0 - make some toy data resembling yours:
spark.version
# u'2.2.0'
df = spark.createDataFrame([[0, 33.3, -17.5],
                            [1, 40.4, -20.5],
                            [2, 28.0, -23.9],
                            [3, 29.5, -19.0],
                            [4, 32.8, -18.84]],
                           ["other", "lat", "long"])
df.show()
# +-----+----+------+
# |other| lat| long|
# +-----+----+------+
# | 0|33.3| -17.5|
# | 1|40.4| -20.5|
# | 2|28.0| -23.9|
# | 3|29.5| -19.0|
# | 4|32.8|-18.84|
# +-----+----+------+
Step 1 - assemble your features
In contrast to most ML packages out there, Spark ML requires your input features to be gathered in a single column of your dataframe, usually named features; and it provides a specific method for doing this, VectorAssembler:
from pyspark.ml.feature import VectorAssembler
vecAssembler = VectorAssembler(inputCols=["lat", "long"], outputCol="features")
new_df = vecAssembler.transform(df)
new_df.show()
# +-----+----+------+-------------+
# |other| lat| long| features|
# +-----+----+------+-------------+
# | 0|33.3| -17.5| [33.3,-17.5]|
# | 1|40.4| -20.5| [40.4,-20.5]|
# | 2|28.0| -23.9| [28.0,-23.9]|
# | 3|29.5| -19.0| [29.5,-19.0]|
# | 4|32.8|-18.84|[32.8,-18.84]|
# +-----+----+------+-------------+
As you may have already guessed, the inputCols argument tells VectorAssembler which particular columns in our dataframe are to be used as features.
Step 2 - fit your KMeans model
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=2, seed=1) # 2 clusters here
model = kmeans.fit(new_df.select('features'))
select('features') here serves to tell the algorithm which column of the dataframe to use for clustering - remember that, after Step 1 above, your original lat & long features are no longer used directly.
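If you want a quick sanity check at this point, the fitted model also exposes the centroids directly (just an optional peek at the toy model above):

centers = model.clusterCenters()  # list with one array per cluster, in (lat, long) order
print(centers)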
Step 3 - transform your initial dataframe to include cluster assignments
transformed = model.transform(new_df)
transformed.show()
# +-----+----+------+-------------+----------+
# |other| lat| long| features|prediction|
# +-----+----+------+-------------+----------+
# | 0|33.3| -17.5| [33.3,-17.5]| 0|
# | 1|40.4| -20.5| [40.4,-20.5]| 1|
# | 2|28.0| -23.9| [28.0,-23.9]| 0|
# | 3|29.5| -19.0| [29.5,-19.0]| 0|
# | 4|32.8|-18.84|[32.8,-18.84]| 0|
# +-----+----+------+-------------+----------+
The last column of the transformed dataframe, prediction, shows the cluster assignment - in my toy case, I have ended up with 4 records in cluster #0 and 1 record in cluster #1.
You can further manipulate the transformed dataframe with select statements, or even drop the features column (which has now fulfilled its function and may no longer be necessary)...
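For instance (a couple of hypothetical follow-ups on the transformed dataframe above):

transformed.drop('features').show()               # the original columns plus the cluster label
transformed.groupBy('prediction').count().show()  # how many records ended up in each cluster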
Hopefully you are much closer now to what you actually wanted to achieve in the first place. For extracting cluster statistics etc., another recent answer of mine might be helpful...
Despite my other general answer, and in case you, for whatever reason, must stick with MLlib & RDDs, here is what causes your error using the same toy df.
When you select columns from a dataframe to convert to RDD, as you do, the result is an RDD of Rows:
df.select('lat', 'long').rdd.collect()
# [Row(lat=33.3, long=-17.5), Row(lat=40.4, long=-20.5), Row(lat=28.0, long=-23.9), Row(lat=29.5, long=-19.0), Row(lat=32.8, long=-18.84)]
which is not suitable as an input to MLlib KMeans. You'll need a map operation for this to work:
df.select('lat', 'long').rdd.map(lambda x: (x[0], x[1])).collect()
# [(33.3, -17.5), (40.4, -20.5), (28.0, -23.9), (29.5, -19.0), (32.8, -18.84)]
So, your code should be like this:
from pyspark.mllib.clustering import KMeans, KMeansModel
rdd = df.select('lat', 'long').rdd.map(lambda x: (x[0], x[1]))
clusters = KMeans.train(rdd, 2, maxIterations=10, initializationMode="random") # works OK
clusters.centers
# [array([ 40.4, -20.5]), array([ 30.9 , -19.81])]
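If you also need the per-point assignments with this MLlib model, predict() accepts an RDD as well (a sketch continuing from the code above):

assignments = clusters.predict(rdd)  # RDD of cluster indices (0..k-1), one per input point
assignments.collect()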
Here is my code:
set.seed(1)
#Boruta on the HouseVotes84 data from mlbench
library(mlbench) #has HouseVotes84 data
library(h2o) #has rf
#spin up h2o
myh20 <- h2o.init(nthreads = -1)
#read in data, throw some away
data(HouseVotes84)
hvo <- na.omit(HouseVotes84)
#move from R to h2o
mydata <- as.h2o(x = hvo,
                 destination_frame = "mydata")
#RF columns (input vs. output)
idxy <- 1
idxx <- 2:ncol(hvo)
#split data
splits <- h2o.splitFrame(mydata, c(0.8, 0.1))
train <- h2o.assign(splits[[1]], key = "train")
valid <- h2o.assign(splits[[2]], key = "valid")
# make random forest
my_imp.rf <- h2o.randomForest(y = idxy, x = idxx,
                              training_frame = train,
                              validation_frame = valid,
                              model_id = "my_imp.rf",
                              ntrees = 200)
# find importance
my_varimp <- h2o.varimp(my_imp.rf)
my_varimp
The output that I am getting is "variable importance".
The classic measures are "mean decrease in accuracy" and "mean decrease in Gini coefficient".
My results are:
> my_varimp
Variable Importances:
variable relative_importance scaled_importance percentage
1 V4 3255.193604 1.000000 0.410574
2 V5 1131.646484 0.347643 0.142733
3 V3 921.106567 0.282965 0.116178
4 V12 759.443176 0.233302 0.095788
5 V14 492.264954 0.151224 0.062089
6 V8 342.811554 0.105312 0.043238
7 V11 205.392654 0.063097 0.025906
8 V9 191.110046 0.058709 0.024105
9 V7 169.117676 0.051953 0.021331
10 V15 135.097076 0.041502 0.017040
11 V13 114.906586 0.035299 0.014493
12 V2 51.939777 0.015956 0.006551
13 V10 46.716656 0.014351 0.005892
14 V6 44.336708 0.013620 0.005592
15 V16 34.779987 0.010684 0.004387
16 V1 32.528778 0.009993 0.004103
From this, the relative importance of "Vote #4", aka V4, is ~3255.2.
Questions:
What units is that in?
How is that derived?
I tried looking in the documentation, but am not finding the answer. I tried the help documentation. I tried using Flow to look at the parameters to see if anything in there indicated it. In none of them do I find "gini" or "decrease accuracy". Where should I look?
The answer is in the docs.
[ In the left pane, click on "Algorithms", then "Supervised", then "DRF". The FAQ section answers this question. ]
For convenience, the answer is also copied and pasted here:
"How is variable importance calculated for DRF? Variable importance is determined by calculating the relative influence of each variable: whether that variable was selected during splitting in the tree building process and how much the squared error (over all trees) improved as a result."
I am new to SVMlight. I downloaded the source code and compiled SVMlight.
I created training and testing data sets and ran
[command]
which created a model file. Using this model file, I ran svm_classify, which created a prediction file. The prediction file contains some values.
What do these numbers represent? I would like to classify my data into -1 and +1, but I see no such values in the prediction file.
model file :
SVM-light Version V6.02
0 # kernel type
3 # kernel parameter -d
1 # kernel parameter -g
1 # kernel parameter -s
1 # kernel parameter -r
empty# kernel parameter -u
9947 # highest feature index
2000 # number of training documents
879 # number of support vectors plus 1
-0.13217617 # threshold b, each following line is a SV (starting with alpha*y)
-1.0000000005381390888459236521157 6:0.013155501 9:0.10063701 27:0.038305663 41:0.12115256 63:0.056871183 142:0.020468477 206:0.12547429 286:0.073713586 406:0.12335037 578:0.40131235 720:0.13097784 960:0.30321017 1607:0.17021149 2205:0.5118736 3177:0.54580438 4507:0.27290219 #
-0.61395623101405172317157621364458 6:0.019937159 27:0.019350741 31:0.025329925 37:0.031444062 42:0.11928168 83:0.03443896 127:0.066094264 142:0.0086166598 162:0.035993244 190:0.056980081 202:0.16503957 286:0.074475288 323:0.056850906 386:0.052928429 408:0.039132856 411:0.049789339 480:0.048880257 500:0.068775021 506:0.037179198 555:0.076585822 594:0.063632675 663:0.062197074 673:0.067195281 782:0.075720288 834:0.066969693 923:0.44677126 1146:0.076086208 1191:0.5542227 1225:0.059279677 1302:0.094811738 1305:0.060443446 1379:0.070145406 1544:0.087077379 1936:0.089480147 2451:0.31556693 2796:0.1145037 2833:0.20080972 6242:0.1545693 6574:0.28386003 7639:0.29435158 #
etc...
prediction file:
1.0142989
1.3699419
1.4742762
0.52224801
0.41167112
1.3597693
0.91790572
1.1846312
1.5038173
-1.7641716
-1.4615855
-0.75832723
etc...
In your training file, did you provide the known classes (+1, -1)? e.g.:
-1 1:0.43 3:0.12 9284:0.2 # abcdef
Can you provide an excerpt of this file as well as the commands you ran?
The prediction file holds the value the trained model assigns to each data point. You may consider that values below 0 classify the data point into the -1 category and values above 0 into the +1 category.
When you run the classification on the training set, you will see where the model works and where it fails.
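If you just want hard -1/+1 labels, you can threshold the values from the prediction file yourself; a minimal Python sketch (the file name predictions.txt is an assumption, use whatever output file you passed to svm_classify):

# read the svm_classify output and map each score to a hard class label
with open("predictions.txt") as f:
    scores = [float(line) for line in f if line.strip()]

labels = [1 if score > 0 else -1 for score in scores]
print(labels[:6])  # for the excerpt above this would be [1, 1, 1, 1, 1, 1]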