NER model training with IOB encoding fails (Stanford CoreNLP) - machine-learning

I am trying to train a NER model for Stanford CoreNLP, but as soon as the training process reaches the 8th or 9th iteration, it stops and nothing else happens.
The corpus is annotated with IOB/BIO encoding like this:
How O
to O
play O
a O
video O
in O
Java B-Fram
Swing I-Fram
? O
My properties file:
trainFile = C:\\Data\\corpora\\train\\train.tsv
serializeTo = C:\\Data\\ner-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=2
maxRight=2
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useGazettes=true
sloppyGazette=true
gazette=C:\\Data\\gazetteers\\gaz1.txt,C:\\Data\\gazetteers\\gaz2.txt
entitySubclassification=bio
The content of my Gazetteers:
Fram LiteDB
Fram RavenDB
Fram MongoDB
Fram Cassandra
Fram Couchbase
...
The command for the training process:
java -mx8g -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop C:\\Data\\ner.prop -readerAndWriter edu.stanford.nlp.sequences.CoNLLDocumentReaderAndWriter
Why does the training process suddenly stop? Does this have something to do with incorrect properties? Or do the gazetteers have to have the same labels as the annotated corpus?
In the end I want the entities to be tagged with just "Fram" instead of "B-Fram" or "I-Fram". How is that possible?
Thank you in advance.

Related

(SAS)how to make prediction to new data using a trained logistic regression model?

I have a simulated dataset for personal loans; it contains borrowers' financial history and their requested loans. I'm trying to write a logistic regression model to assess loan status - current (0) or default (1).
I have already split the dataset into 70% train and 30% test.
My code looks like:
/*Logistic regression*/
ods graphics on;

proc logistic data=train outmodel=model.log plots=all;
    class purpose term grade yearsemployment homeownership incomeVerified;
    model bad_good (event='0') = purpose term grade yearsemployment homeownership incomeVerified
                                 date
                                 isJointApplication
                                 loanAmount
                                 interestRate
                                 monthlyPayment
                                 annualIncome
                                 dtiRatio
                                 lengthCreditHistory
                                 numTotalCreditLines
                                 numOpenCreditLines
                                 numOpenCreditLines1Year
                                 revolvingBalance
                                 revolvingUtilizationRate
                                 numDerogatoryRec
                                 numDelinquency2Years
                                 numChargeoff1year
                                 numInquiries6Mon
                                 / selection=stepwise
                                   details
                                   lackfit;
    score data=test out=score1;
    store log_model;
run;

/*Score model*/
proc logistic inmodel=model.log;
    score data=train out=score2 fitstat;
run;

proc logistic inmodel=model.log;
    score data=test out=score3 fitstat;
run;

/*Confusion matrix*/
proc freq data=score2;
    tables f_bad_good*i_bad_good / nocol norow;
run;

proc freq data=score3;
    tables f_bad_good*i_bad_good / nocol norow;
run;
My next step is to use this trained model to make predictions on new production data, update that data, and store it. How would I do that?
Also, I wonder if anyone could take a look at my code and see if there's anything I should improve on. I'm new to SAS and statistics; any help is much appreciated!
You're very close, and your code is looking great so far. When scoring data in production, there are two things that you need:
An input dataset
A model to apply to the data
It looks like you are also storing your model as a binary item store that can be processed with proc plm, but you do not need to do it this way, since you've already saved your model with the outmodel= option in proc logistic. The store statement is just another way to save the model if you'd like to use it that way, but I would stick with outmodel= since it's a little more straightforward. Let's look at a really simple example using sashelp.class:
data train
     prod;
    set sashelp.class;
    if(_N_ LE 15) then output train;
    else output prod;
run;

proc logistic data=train outmodel=sasuser.logmodel;
    model sex = age height weight;
run;
We've saved our model into sasuser.logmodel. Now we want to score new production data. In a new SAS program, you'll use code that looks like this:
proc logistic inmodel=sasuser.logmodel;
    score data=prod out=predictions;
run;
Assume prod is your new production data coming in.
Let's take a look at the predictions output dataset:
Name     Sex  Age  Height  Weight  F_Sex  I_Sex  P_F           P_M
Robert   M    12   64.8    128     M      M      0.0023352346  0.9976647654
Ronald   M    15   67      133     M      M      0.1822442826  0.8177557174
Thomas   M    11   57.5    85      M      M      0.148103678   0.851896322
William  M    15   66.5    112     M      F      0.7322326277  0.2677673723
The column I_Sex (which stands for Into) is the prediction. The columns starting with P_ are the probabilities of predicting male or female, and the column starting with F_ (which stands for From) is the actual value. In reality you will not have this actual value, since production data is used to predict an unknown value.
It's generally a good practice to always append your predictions to a final master dataset and give them a timestamp. You'll want to keep a history of your predictions and see how they change over time, especially if you need to debug something in the future. This may be a production database, or it could even be a SAS dataset. Below is an example of how you could do this.
/* This ensures you're always using the exact same timestamp down to the ms */
%let now = %sysfunc(datetime());

/* Add a timestamp and clean up the dataset */
data predictions;
    set predictions;
    prediction_ts = &now;
    format prediction_ts datetime.;
    keep name age height weight i_sex prediction_ts;
    rename i_sex = predicted_sex;
run;

/* Append to the master dataset if it exists */
%if(%sysfunc(exist(master_dataset))) %then %do;
    proc append base=master_dataset data=predictions force;
    run;
%end;

/* Otherwise, create it */
%else %do;
    data master_dataset;
        set predictions;
    run;
%end;
You can then pull the most recent prediction for any given primary key. For example:
proc sql;
    select *
    from master_dataset
    having prediction_ts = max(prediction_ts)
    ;
quit;
You could also have a separate process that applies actual values, to see how the predictions compare to reality. That extends beyond the scope of what you're asking, but this is a fantastic question you've raised, and it is very, very important for productionizing a model.

Using Bloom AI Model on Mac M1 for continuing prompts (Pytorch)

I'm trying to run the bigscience BLOOM AI model on my MacBook M1 Max (64 GB), with PyTorch for Mac M1 chips freshly installed and Python 3.10.6 running.
I'm not able to get any output at all.
I have the same issue with other AI models and I really don't know how I should fix it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "mps" if torch.backends.mps.is_available() else "cpu"
if device == "cpu" and torch.cuda.is_available():
    device = "cuda"  # if the device is cpu and cuda is available, set the device to cuda
print(f"Using {device} device")  # print the device

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom").to(device)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
I've tried it with other models (smaller BERT models) and also tried letting it run on the CPU alone, without using the mps device at all.
Maybe someone could help.
It might simply be taking too long to produce the output. Do you want to break it down into serial calls involving
a) the embedding layer, b) the 70 BLOOM blocks, c) the output layer norm, and d) the token decoding?
An example to run this code is available at https://nbviewer.org/urls/arteagac.github.io/blog/bloom_local.ipynb .
It basically boils down to:
def forward(input_ids):
    # 1. Create attention mask and position encodings
    attention_mask = torch.ones(len(input_ids)).unsqueeze(0).bfloat16().to(device)
    alibi = build_alibi_tensor(input_ids.shape[1], config.num_attention_heads,
                               torch.bfloat16).to(device)
    # 2. Load and use word embeddings
    embeddings, lnorm = load_embeddings()
    hidden_states = lnorm(embeddings(input_ids))
    del embeddings, lnorm
    # 3. Load and use the BLOOM blocks sequentially
    for block_num in range(70):
        load_block(block, block_num)
        hidden_states = block(hidden_states, attention_mask=attention_mask, alibi=alibi)[0]
        print(".", end='')
    hidden_states = final_lnorm(hidden_states)
    # 4. Load and use the language model head
    lm_head = load_causal_lm_head()
    logits = lm_head(hidden_states)
    # 5. Compute the next token
    return torch.argmax(logits[:, -1, :], dim=-1)
Please refer to the linked notebook for the implementations of the helper functions used in the forward call.
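To make the serial approach concrete, here is a minimal sketch of a greedy generation loop built around the forward function above. It assumes the notebook's forward and its helpers are already defined and that device is set as in your script; the prompt and the 20-token limit are just placeholders.

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
input_ids = tokenizer("translate English to German: How old are you?",
                      return_tensors="pt").input_ids.to(device)
for _ in range(20):                         # generate up to 20 new tokens
    next_token = forward(input_ids)         # shape (1,): argmax token id of the last position
    input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=-1)
print(tokenizer.decode(input_ids[0]))

Note that every new token repeats the full 70-block load-and-forward pass, so each step takes a while; that is also why a plain model.generate call can look like it produces no output at all.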

How to get vocabulary size of word2vec?

I have a pretrained word2vec model in pyspark and I would like to know how big its vocabulary is (and perhaps get a list of the words in it).
Is this possible? I would guess it has to be stored somewhere since it can predict for new data, but I couldn't find a clear answer in the documentation.
I tried w2v_model.getVectors().count(), but the result (970) seems too small for my use case. In case it is relevant: I'm using short-text data, my dataset has tens of millions of messages, each with roughly 10 to 40 words, and I am using min_count=50.
I'm not quite sure why you doubt the result of .getVectors().count(), which indeed gives the desired result, as shown in the documentation link you provided yourself.
Here is the example posted there, with a vocabulary of just three (3) tokens - a, b, and c:
from pyspark.ml.feature import Word2Vec
sent = ("a b " * 100 + "a c " * 10).split(" ") # 3-token vocabulary
doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
model = word2Vec.fit(doc)
So, unsurprisingly, it is
model.getVectors().count()
# 3
and asking for the vectors themselves
model.getVectors().show()
gives
+----+--------------------+
|word| vector|
+----+--------------------+
| a|[0.09511678665876...|
| b|[-1.2028766870498...|
| c|[0.30153277516365...|
+----+--------------------+
In your case, with min_count=50, every word that appears fewer than 50 times in your corpus will not be represented; reducing this number will give you more vectors.
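As for the second part of your question (listing the words): getVectors() returns a DataFrame with word and vector columns, so a small sketch along these lines, assuming the toy model above, collects the vocabulary itself:

# List the vocabulary, not just its size
vocab = [row.word for row in model.getVectors().select("word").collect()]
print(len(vocab))  # 3
print(vocab)       # ['a', 'b', 'c'] (order is not guaranteed)

Even with millions of messages this collect() stays cheap, since only one short string per vocabulary word is pulled to the driver.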

Flink ML - SVM Learning Runtime is way below expectations

My goal is to evaluate the learning runtime of the SVMs from Apache Flink and Apache Spark.
Some time ago this was already done in this article: https://link.springer.com/article/10.1186/s41044-016-0020-2#Sec12 .
They use datasets containing from 6.5 million to 65 million instances with 631 features.
The first result shows the runtime of Apache Flink on the 6.5-million-instance dataset:
there, Flink needs 111 seconds to complete the learning (step size of 0.01 and a regularization parameter of 0.01).
In the following code snippet I am trying to reproduce this test, but it only takes 14 seconds, which seems unrealistically fast compared to the article.
object ScalabilityTest extends App {
  val env = ExecutionEnvironment.getExecutionEnvironment
  val pathToTrainingFile = "hdfs:///datasets/vectorized-data-7-mio"
  val input: DataSet[LabeledVector] = env.readLibSVM(pathToTrainingFile)

  val model = SVM()
    .setBlocks(env.getParallelism)
    .setIterations(100)
    .setRegularization(0.01)
    .setStepsize(0.01)
    .setOutputDecisionFunction(true)

  model.fit(input)
  input.output(new DiscardingOutputFormat[LabeledVector])

  env.execute("flink svm scalability")
}
The dataset looks like:
0 2:0.0956973405770521 3:0.04302176671363839 63:0.22238596493564314 70:0.10967685914251926 139:0.23401113066755871 252:0.2197483438988253 555:0.4133093410994923 566:0.38936213517756474 1078:0.4758559946101788 1732:0.5005412038471097 3640:0.5518088141146315 5017:0.7249851063151311 7793:0.5518088141146315
0 2:0.08886181625011981 3:0.039948783376949924 5:0.08410395321456549 82:0.18319049693258793 90:0.1919885302924892 256:0.21925479670962622 303:0.4053787361547381 325:0.22255697201807345 353:0.27518945356563856 530:0.31997134952306455 562:0.43392159248729284 785:0.49570712373633596 9066:0.6077511178730391 15357:0.7227109687611892
0 1:0.10076144986545423 2:0.0691147459723154 3:0.06214255191969988 7:0.18335119128172658 20:0.08097367795302803 35:0.25318177398006236 37:0.04713864367763213 70:0.0792110649362639 111:0.17495017690637232 137:0.2826397578480119 153:0.17929691347274906 160:0.34589629247477904 256:0.1705315085519315 1091:0.26748048081097636 1281:0.28156097530039514 4054:0.36002041127390355
containing 7 million rows with 26k features.
Question:
Am I missing something here? How can it be that my learning runtime, with a dataset containing more rows and features, is so much faster?

Writing Dask/XArray to NetCDF - Parallel IO

I am using Dask/Xarray with a ~150 GB dataset on a distributed cluster on an HPC system. The computation component is complete and takes ~30 minutes, but writing the final result to a NetCDF4 file is quite slow (~3 hrs) and does not seem to run in parallel. It is unclear to me whether the to_netcdf function in Xarray is supposed to support parallel writes. Currently my approach is to write an empty NetCDF file with netCDF4 and then append the data from the Xarray dataset:
f_mosaic = 't1.nc'

meta = {'width': dat_f.shape[1],
        'height': dat_f.shape[2],
        'crs': rasterio.crs.CRS(init='epsg:' + fi['CPER']['Reflectance']['Metadata']['Coordinate_System']['EPSG Code'].value.decode("utf-8")),
        'transform': aff_final,
        'count': dat_f.shape[0]}

with netCDF4.Dataset(f_mosaic, mode='w', format="NETCDF4") as t1:
    # Create spatial dimensions
    y = t1.createDimension('y', meta['width'])
    x = t1.createDimension('x', meta['height'])
    wl_dim = t1.createDimension('wl', meta['count'])
    reflectance = t1.createVariable("reflectance", "int16", ("wl", "y", "x",),
                                    fill_value=null_val, zlib=True)
    reflectance.setncattr('grid_mapping', 'crs')
    crs = t1.createVariable('crs', 'c')
    crs.spatial_ref = meta['crs'].wkt
    crs.epsg_code = meta['crs'].to_string()
    crs.GeoTransform = " ".join(str(x) for x in meta['transform'].to_gdal())

dat_f.to_netcdf(path=f_mosaic, mode='a', format='NETCDF4',
                encoding={'reflectance': {'zlib': True}})
Overall, the question is: how can I write this data to a NetCDF4 file quickly? Does Dask/Xarray support parallel writes with NetCDF4? If so, what am I doing incorrectly?
Thanks!
