TF-IDF extracting keywords - machine-learning

Working on function somewhat like this:
def get_feature_name_by_tfidf(text_to_process):
with open(master_path + '\\additional_stopwords.txt', 'r') as f:
additional_stop_words = ast.literal_eval(f.read())
stop_words = text.ENGLISH_STOP_WORDS.union(set(additional_stop_words))
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 4), min_df=0, stop_words=stop_words)
tfidf_matrix = tf.fit_transform(text_to_process.split(','))
tagged = nltk.pos_tag(tf.get_feature_names())
feature_names_with_tags = {k: v for k, v in dict(tagged).items() if v != 'VBP'}
return list(feature_names_with_tags.keys())
Which return the list of keywords in the passed text.
Is there any way to get the keywords in the same case as it is provided?
Like passed string
Input:
a = "TIME is the company where I work"
Instead of getting keyword list as:
['time', 'company']
I like to get:
['TIME', 'company']

By default, TfidfVectorizer converts words to lowercase.Use this line:
tf = TfidfVectorizer(analyzer='word',lowercase=False, ngram_range=(1, 4), min_df=0, stop_words=stop_words)
and it should work. Use this link for ref. TfidfVectorizer

Related

Roberta with multi-class and OneHotEncoding

I have a dataset for fake news, which has 4 different classes: true, false, partially true, other.
Currently my code uses LabelEncoding to these labels but I would like to switch to OneHot Encoding.
So no I am trying to turn these labels into OneHot vectors. How can I achieve that in a way, where it will be possible after that to pass these labels to RoBERTa model.
Here I will share my current code :
Firstly I convert the labels to numerical values (0-3)
le = LabelEncoder()
df['label'] = le.fit_transform(df['label'])
After that I split the data to train and validation sets
texts = []
labels = []
for i in range(len(df)):
text = df["text"].iloc[i]
label = df["label"].iloc[i]
text = df["title"].iloc[i] + " - " + text
texts.append(text)
labels.append(label)
train_test_split(texts, labels, test_size=test_size)
Finally I make the data compatible for Bert model:
tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)
train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)
Where NewsGroupsDataset looks like this:
class NewsGroupsDataset(torch.utils.data.Dataset):
def __init__(self, encodings, labels):
self.encodings = encodings
self.labels = labels
def __getitem__(self, idx):
item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
item["labels"] = torch.tensor([self.labels[idx]])
return item
def __len__(self):
return len(self.labels)
How can I switch to OneHotEncoding, because I do not want the model to assume that there is a natural order between the labels?

how to speed up "POS tag" with StanfordPOSTagger?

I wanted to take none phrases of tweets, code is following. The problem is that it only process 300 tweets at a time and spend 5 minutes, how to speed up?
by the way, some code edited according to text blob.
I use dataset of gate-EN-twitter(https://gate.ac.uk/wiki/twitter-postagger.html) and NLTK interface to the Stanford POS tagger to tag tweets
from nltk.tag import StanfordPOSTagger
from nltk.tokenize import word_tokenize
import time,nltk
start_time = time.time()
CFG = {
('NNP', 'NNP'): 'NNP',
('NN', 'NN'): 'NNI',
('NNI', 'NN'): 'NNI',
('JJ', 'JJ'): 'JJ',
('JJ', 'NN'): 'NNI',
}
st = StanfordPOSTagger('/models/gate-EN-twitter.model','/twitie_tagger/twitie_tag.jar', encoding='utf-8')
def _normalize_tags(chunk):
'''Normalize the corpus tags.
("NN", "NN-PL", "NNS") -> "NN"
'''
ret = []
for word, tag in chunk:
if tag == 'NP-TL' or tag == 'NP':
ret.append((word, 'NNP'))
continue
if tag.endswith('-TL'):
ret.append((word, tag[:-3]))
continue
if tag.endswith('S'):
ret.append((word, tag[:-1]))
continue
ret.append((word, tag))
return ret
def noun_phrase_count(text):
matches1=[]
print('len(text)',len(text))
for i in range(len(text)//1000):
tokenized_text = word_tokenize(text[i*1000:i*10000+1000])
classified_text = st.tag(tokenized_text)
tags = _normalize_tags(classified_text)
merge = True
while merge:
merge = False
for x in range(0, len(tags) - 1):
t1 = tags[x]
t2 = tags[x + 1]
key = t1[1], t2[1]
value = CFG.get(key, '')
if value:
merge = True
tags.pop(x)
tags.pop(x)
match = '%s %s' % (t1[0], t2[0])
pos = value
tags.insert(x, (match, pos))
break
matches = [t[0] for t in tags if t[1] in ['NNP', 'NNI']]
matches1+=matches
print("--- %s seconds ---" % (time.time() - start_time))
fdist = nltk.FreqDist(matches1)
return [(tag,num) for (tag, num) in fdist.most_common()]
noun_phrase_count(tweets)
Looks like a duplicate of Stanford POS tagger with GATE twitter model is slow so you may find more info there.
Additionally; if there's any chance of stumbling upon identical inputs (tweets) twice (or more), you can consider a dictionary with the tweet (plain str) as key, and tagged as value, so that when you encounter a tweet, you first check if it's in your dict already. If not, tag it and put it there (and if this route is viable, why not pickle/unpickle that dictionary so that debugging/subsequent runs of your code go faster as well).

How to use prepare_analogy_questions and check_analogy_accuracy functions in text2vec package?

Following code:
library(text2vec)
text8_file = "text8"
if (!file.exists(text8_file)) {
download.file("http://mattmahoney.net/dc/text8.zip", "text8.zip")
unzip ("text8.zip", files = "text8")
}
wiki = readLines(text8_file, n = 1, warn = FALSE)
# Create iterator over tokens
tokens <- space_tokenizer(wiki)
# Create vocabulary. Terms will be unigrams (simple words).
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)
# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
RcppParallel::setThreadOptions(numThreads = 4)
glove_model = GloVe$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10, learning_rate = .25)
word_vectors_main = glove_model$fit_transform(tcm, n_iter = 20)
word_vectors_context = glove_model$components
word_vectors = word_vectors_main + t(word_vectors_context)
causes error:
qlst <- prepare_analogy_questions("questions-words.txt", rownames(word_vectors))
> Error in (function (fmt, ...) :
invalid format '%d'; use format %s for character objects
File questions-words.txt from word2vec sources https://github.com/nicholas-leonard/word2vec/blob/master/questions-words.txt
This was a small bug in information message formatting (after introduction of futille.logger). Just fixed it and pushed to github.
You can install updated version of the package with devtools::install_github("dselivanov/text2vec"

Emulate string to label dict

Since Bazel does not provide a way to map labels to strings, I am wondering how to work around this via Skylark.
Following my partial horrible "workaround".
First the statics:
_INDEX_COUNT = 50
def _build_label_mapping():
lmap = {}
for i in range(_INDEX_COUNT):
lmap ["map_name%s" % i] = attr.string()
lmap ["map_label%s" % i] = attr.label(allow_files = True)
return lmap
_LABEL_MAPPING = _build_label_mapping()
And in the implementation:
item_pairs = {}
for i in range(_INDEX_COUNT):
id = getattr(ctx.attr, "map_name%s" % i)
if not id:
continue
mapl = getattr(ctx.attr, "map_label%s" % i)
if len(mapl.files):
item_pairs[id] = list(mapl.files)[0].path
else:
item_pairs[id] = ""
if item_pairs:
arguments += [
"--map", str(item_pairs), # Pass json data
]
And then the rule:
_foo = rule(
implementation = _impl,
attrs = dict({
"srcs": attr.label_list(allow_files = True, mandatory = True),
}.items() + _LABEL_MAPPING.items()),
Which needs to be wrapped like:
def foo(map={}, **kwargs):
map_args = {}
# TODO: Check whether order of items is defined
for i, item in enumerate(textures.items()):
key, value = item
map_args["map_name%s" % i] = key
map_args["map_label%s" % i] = value
return _foo(
**dict(map_args.items() + kwargs.items())
)
Is there a better way of doing that in Skylark?
To rephrase your question, you want to create a rule attribute mapping from string to label?
This is currently not supported (see list of attributes), but you can file a feature request for this.
Do you think using "label_keyed_string_dict" is a reasonable workaround? (it won't work if you have duplicated keys)

Learning2Search (vowpal-wabbit) for NER gives weird results

We are trying to use Learning2Search from vowpal-wabbit for NER
We are using ATIS dataset.
In ATIS there are 127 Entities (including Others category)
Training set has 4978 and test has 893 sentences.
How ever when we run it on test set it is mapping everything either class 1(Airline name) or class 2(Airport code)
Which is wired.
We tried another dataset (https://github.com/glample/tagger/tree/master/dataset), same behavior.
Looks like I am not using it the right way. Any pointers will be of great help.
Code snippet :
with open("/tweetsdb/ner/datasets/atis.pkl") as f:
train, test, dicts = cPickle.load(f)
idx2words = {v: k for k, v in dicts['words2idx'].iteritems()}
idx2labels = {v: k for k, v in dicts['labels2idx'].iteritems()}
idx2tables = {v: k for k, v in dicts['tables2idx'].iteritems()}
#Convert the dataset into a format compatible with Vowpal Wabbit
training_set = []
for i in xrange(len(train[0])):
zip_label_ent_idx = zip(train[2][i], train[0][i])
label_ent_actual = [(int(i[0]), idx2words[i[1]]) for i in zip_label_ent_idx]
training_set.append(label_ent_actual)
# Do like wise to get test chunk
class SequenceLabeler(pyvw.SearchTask):
def __init__(self, vw, sch, num_actions):
pyvw.SearchTask.__init__(self, vw, sch, num_actions)
sch.set_options( sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES )
def _run(self, sentence):
output = []
for n in range(len(sentence)):
pos,word = sentence[n]
with self.vw.example({'w': [word]}) as ex:
pred = self.sch.predict(examples=ex, my_tag=n+1, oracle=pos, condition=[(n,'p'), (n-1, 'q')])
output.append(pred)
return output
vw = pyvw.vw("--search 3 --search_task hook --ring_size 1024")
Code for training the model:
#Training
sequenceLabeler = vw.init_search_task(SequenceLabeler)
for i in xrange(3):
sequenceLabeler.learn(training_set[:10])
Code for Prediction:
pred = []
for i in random.sample(xrange(len(test_set)), 10):
test_example = [ (999, word[1]) for word in test_set[i] ]
test_labels = [ label[0] for label in test_set[i] ]
print 'input sentence:', ' '.join([word[1] for word in test_set[i]])
print 'actual labels:', ' '.join([str(label) for label in test_labels])
print 'predicted labels:', ' '.join([str(pred) for pred in sequenceLabeler.predict(test_example)])
To see the full code, pls refer to this notebook:
https://github.com/nsanthanam/ner/blob/master/vowpal_wabbit_atis.ipynb
I am also new to this algorithm, but did some pilot studies recently.
To your problem, the answer is that you set a wrong parameter in
vw = pyvw.vw("--search 3 --search_task hook --ring_size 1024")
Here, the search should be set as '127', and in this way, vw will use your 127 tags.
vw = pyvw.vw("--search 127 --search_task hook --ring_size 1024")
Also, my feeling is that vw doesn't work really well with so many tags. I might be wrong, please let me know your result :)

Resources