Machine learning model to segregate input data

Consider these references as EXAMPLES:
1) Cahn, R. S.; Ingold, C.; Prelog, V. Specification of Molecular Chirality. Angew. Chem. Int. Ed. 1966, 5, 385-415.
2) Christie, G. H.; Kenner, J. The Molecular Configurations of Polynuclear Aromatic Compounds. J. Chem. Soc., Trans. 1922, 121, 614-620.
3) Kuhn, R. Molekulare Asymmetrie. In Stereochemie, 1933, 803.
4) Oki, M. Recent Advances in Atropisomerism. Topics in Stereochemistry 1983, 14, 1-81.
5) Miyashita, A.; Yasuda, A.; Takaya, H.; Toriumi, K.; Ito, T.; Souchi, T.; Noyori, R. Synthesis of 2,2'-bis(diphenylphosphino)-1,1'-binaphthyl (BINAP), an atropisomeric chiral bis(triaryl)phosphine, and its use in the rhodium(I)-catalyzed asymmetric hydrogenation of α-(acylamino)acrylic acids. J. Am. Chem. Soc. 1980, 102, 7932-7934.
I have a large number of references (like those shown above), and I want to segregate the data from each of those references (five in this case) into separate parts, such as
1) Name of the author(s),
2) Title of the topic,
3) Date/year of publication,
4) Pages, and
5) Any other information.
I want to create a machine learning model that is capable of learning the reference formats by itself and extracting the right data from each reference into the separate parts above. The algorithm should be as efficient as possible.
Question :
What algorithm and approach should I use to implement the above functionality?
Would I be required to use multiple algorithms to create a model for the above scenario?


Understanding FastRP vs scaleProperties

I am trying to understand the difference between these two steps and the error I am receiving. I followed this tutorial to practice KNN with my own data (https://towardsdatascience.com/create-a-similarity-graph-from-node-properties-with-neo4j-2d26bb9d829e).
During the process we project our graph of interest, which in my case contains three node properties: bd_load, weight, and length of organisms. In the example we use the code below to create the scaledProperties vector from the 3 variables.
Project graph
//(5) project graph of interest
CALL gds.graph.project('bd_graph',
'node_sim',
'*',
{nodeProperties:['bd_load', 'weight', 'length']})
Scale variables of interest between 0-1 for future Euclidean distance calculation
//(6) add scalar 0-1
CALL gds.alpha.scaleProperties.mutate('bd_graph',
{nodeProperties:['bd_load', 'weight', 'length'],
scaler:'MinMax',
mutateProperty:'scaledProperties'})
YIELD nodePropertiesWritten
We can then run KNN based on Euclidean distance:
//(8) project relationship to graph
CALL gds.knn.mutate("bd_graph",
{nodeProperties: {scaledProperties: "EUCLIDEAN"},
topK: 15,
mutateRelationshipType: "IS_SIMILAR",
mutateProperty: "similarity",
similarityCutoff: 0.6409912109375,
sampleRate:1,
randomSeed:42,
concurrency:1}
)
However, as I continue the learning curve with Neo4j and FastRP, I am trying to understand the difference between scaleProperties and FastRP. Today I tried to create graph embeddings for my 3 variables using FastRP with 8 dimensions on my projected graph, without running the scaleProperties step first. My thought was that increasing the dimensions would be better for finding similarities between nodes. The code below runs fine and produces an embedding vector with 8 elements.
FastRP
CALL gds.fastRP.mutate(
'bd_graph',
{
embeddingDimension: 8,
mutateProperty: 'fastrp-embedding',
featureProperties: ['bd_load', 'weight', 'length']
}
)
YIELD nodePropertiesWritten
But when I run the below code
CALL gds.knn.stats("bd_graph",
{
nodeProperties:{fastrp-embedding:"EUCLIDEAN"},
topK:10,
sampleRate:1,
randomSeed:42,
concurrency:1
}
) YIELD similarityDistribution
RETURN similarityDistribution
I receive an error:
Invalid input '{': expected "+" or "-" (line 4, column 22 (offset: 97))
nodeProperties:{fastrp-embedding:"EUCLIDEAN"},
Does the embedding length have to match the number of variables in the node? Am I using FastRP correctly, and is my understanding right that I can create embeddings within nodes and then calculate Euclidean distance on them for a similarity score?
I am glad you are finding the tutorial helpful and getting into GDS!
Map keys in Cypher must be strings. https://neo4j.com/docs/cypher-manual/current/syntax/maps/
The - in your property name fastrp-embedding is not a valid character in a plain map key. If you enclose the property name in backticks, GDS will know to treat the special character as part of the map key. This should work for you:
CALL gds.knn.stats("bd_graph",
{
nodeProperties:{`fastrp-embedding`:"EUCLIDEAN"},
topK:10,
sampleRate:1,
randomSeed:42,
concurrency:1
}
) YIELD similarityDistribution
RETURN similarityDistribution
The recommended format for Neo4j property names is camel case. If you name your property fastrpEmbedding instead of fastrp-embedding, you will not need the backticks.

How to get vocabulary size of word2vec?

I have a pretrained word2vec model in pyspark and I would like to know how big is its vocabulary (and perhaps get a list of words in the vocabulary).
Is this possible? I would guess it has to be stored somewhere since it can predict for new data, but I couldn't find a clear answer in the documentation.
I tried w2v_model.getVectors().count(), but the result (970) seems too small for my use case. In case it is relevant, I am using short-text data and my dataset has tens of millions of messages, each with roughly 10 to 30-40 words. I am using min_count=50.
Not quite sure why you doubt the result of .getVectors().count(), which indeed gives the desired result, as shown in the documentation link you provided yourself.
Here is the example posted there, with a vocabulary of just three (3) tokens - a, b, and c:
from pyspark.ml.feature import Word2Vec
sent = ("a b " * 100 + "a c " * 10).split(" ") # 3-token vocabulary
doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
model = word2Vec.fit(doc)
So, unsurprisingly, it is
model.getVectors().count()
# 3
and asking for the vectors themselves
model.getVectors().show()
gives
+----+--------------------+
|word| vector|
+----+--------------------+
| a|[0.09511678665876...|
| b|[-1.2028766870498...|
| c|[0.30153277516365...|
+----+--------------------+
In your case, with min_count=50, every word that appears fewer than 50 times in your corpus will not be represented; reducing this number will result in more vectors.
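If you also want the actual list of words, here is a minimal sketch (assuming model is a fitted pyspark Word2Vec model, as in the example above):
vocab_df = model.getVectors()                  # DataFrame with columns "word" and "vector"
vocab_size = vocab_df.count()                  # number of words in the vocabulary
vocab_words = [row.word for row in vocab_df.select("word").collect()]
print(vocab_size, vocab_words[:10])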

How to deal with array of string features in traditional machine learning?

Problem
Let's say we have a dataframe that looks like this:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
...
If we are interested in training a classifier that predicts label using age, job, and friends as features, how would we go about transforming the features into a numerical array which can be fed into a model?
Age is pretty straightforward since it is already numerical.
Job can be hashed / indexed since it is a categorical variable.
Friends is a list of categorical variables. How would I go about representing this feature?
Approaches:
Hash each element of the list. Using the example dataframe, let's assume our hashing function has the following mapping:
NULL -> 0
engineer -> 42069
World of Warcraft -> 9001
Netflix -> 14
9gag -> 9
manager -> 250
Let's further assume that the maximum length of friends is 5. Anything shorter gets zero-padded on the right hand side. If friends size is larger than 5, then the first 5 elements are selected.
Approach 1: Hash and Stack
dataframe after feature transformation would look like this:
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
Limitations
Consider the following:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
26 'engineer' ['Netflix', '9gag', 'World of Warcraft'] 1
...
Compare the features of the first and third record:
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
[26, 42069, 14, 9, 9001, 0, 0] 1
Both records have the same set of friends, but are ordered differently resulting in a different feature hashing even though they should be the same.
Approach 2: Hash, Order, and Stack
To solve the limitation of Approach 1, simply order the hashes from the friends feature. This would result in the following feature transform (assuming descending order):
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
[26, 42069, 9001, 14, 9, 0, 0] 1
This approach has a limitation too. Consider the following:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
26 'engineer' ['Netflix', '9gag', 'World of Warcraft'] 1
42 'manager' ['Netflix', '9gag'] 1
...
Applying feature transform with ordering we get:
row feature label
1 [23, 42069, 9001, 14, 9, 0, 0] 1
2 [35, 250, 0, 0, 0, 0, 0] 0
3 [26, 42069, 9001, 14, 9, 0, 0] 1
4 [42, 250, 14, 9, 0, 0, 0] 1
What is the problem with the above features? Well, the hashes for Netflix and 9gag in rows 1 and 3 have the same index in the array, but not in row 4. This would mess up the training.
Approach 3: Convert Array to Columns
What if we convert friends into a set of 5 columns and deal with each of the resulting columns just like we deal with any categorical variable?
Well, let's assume the friends vocabulary size is large (>100k). It would then be madness to go and create >100k columns where each column is responsible for the hash of the respective vocab element.
Approach 4: One-Hot-Encoding and then Sum
How about this: convert each hash to a one-hot vector, and add up all these vectors.
In this case, the feature in row one for example would look like this:
[23, 42069, 01x8, 1, 01x4, 1, 01x8986, 1, 01x(max_hash_size-8987)]
Where 01x8 denotes a row of 8 zeros.
The problem with this approach is that these vectors will be very huge and sparse.
Approach 5: Use Embedding Layer and 1D-Conv
With this approach, we feed each word in the friends array to the embedding layer, then convolve. Similar to the Keras IMDB example: https://keras.io/examples/imdb_cnn/
Limitation: this requires using deep learning frameworks. I want something that works with traditional machine learning; I want to use logistic regression or a decision tree.
What are your thoughts on this?
As another answer mentioned, you've already listed a number of alternatives that could work, depending on the dataset and the model and such.
For what it's worth, a typical logistic regression model that I've encountered would use Approach 3, and convert each of your friends strings into a binary feature. If you're opposed to having 100k features, you could treat these features like a bag-of-words model and discard the stopwords (very common features).
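As a minimal sketch of that binary-feature encoding (using scikit-learn's MultiLabelBinarizer purely as one convenient way to build the columns):
from sklearn.preprocessing import MultiLabelBinarizer

friends = [
    ['World of Warcraft', 'Netflix', '9gag'],
    [],                                        # NULL -> empty friends list
    ['Netflix', '9gag', 'World of Warcraft'],
]
mlb = MultiLabelBinarizer()
X_friends = mlb.fit_transform(friends)         # one binary column per distinct friend string
print(mlb.classes_)
print(X_friends)                               # rows 1 and 3 are identical
In a real dataset you would cap this to a manageable number of columns, e.g. by keeping only frequent strings or dropping the most common ones as mentioned above.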
I'll also throw a hashing variant into the mix:
Bloom Filter
You could store the strings in question in a bloom filter for each training example, and use the bits of the bloom filter as a feature in your logistic regression model. This is basically a hashing solution like you've already mentioned, but it takes care of some of the indexing/sorting issues, and provides a more principled tradeoff between sparsity and feature uniqueness.
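For illustration, here is a minimal sketch of that idea (the filter size m, the number of hashes k, and the helper name are all made up for the example):
import hashlib

import numpy as np

def bloom_features(items, m=256, k=3):
    # Map a list of strings to an m-bit Bloom-filter feature vector using k hashes.
    bits = np.zeros(m, dtype=np.float32)
    for item in items or []:                   # treat NULL/None as an empty friends list
        for seed in range(k):
            h = int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            bits[h % m] = 1.0                  # set this string's k bits
    return bits

# Example: the friends column of the first record
print(int(bloom_features(['World of Warcraft', 'Netflix', '9gag']).sum()))  # at most 3 * k bits set
Each training example then contributes an m-dimensional binary feature vector to the logistic regression, and m controls the sparsity/collision tradeoff.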
First, there is no definitive answer to this problem. You presented five alternatives, and all five are valid; it all depends on the dataset you are using.
Considering this, I will list the options that I find most advantageous. For me, option 5 is the best, but since in your case you want to use traditional machine learning techniques, I will discard it. So I would go for option 4, but here I would need to know whether you have the hardware to deal with this problem. If the answer is yes, I would go with that option; if the answer is no, I would try approach 2. As you pointed out, the hashes for Netflix and 9gag in rows 1 and 3 have the same index in the array but not in row 4; however, that won't be a problem if you have enough data for training (again, it all depends on the data available). Even if I had some problems with this approach, I would apply a data augmentation technique before discarding it.
Option 1 seems to me the worst: with it you have a great chance of overfitting, and it certainly uses a lot of computational resources.
Hope this helps!
Approaches 1 (Hash and Stack) and 2 (Hash, Order, and Stack) resolve their limitations if the result of the hashing function is treated as the index of a sparse vector that is set to 1, instead of as the value stored at each position of the vector.
Then, whenever "World of Warcraft" is in the friends array, the feature vector will have a value of 1 in position 9001, regardless of the position of "World of Warcraft" in the friends array (limitation of approach 1) and regardless of which other elements are in the friends array (limitation of approach 2). If "World of Warcraft" is not in the friends array, then the value of the feature vector in position 9001 will most likely be 0 (look up hashing trick collisions to learn more).
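As a sketch of this hashing-trick encoding, scikit-learn's FeatureHasher can do the hash-and-set-to-1 step directly (the dimension 2**14 is an arbitrary choice for the example):
from sklearn.feature_extraction import FeatureHasher

# Hash each friend name into a fixed-size sparse binary vector; the position of a
# name in the list and the other names present no longer change where it lands.
hasher = FeatureHasher(n_features=2**14, input_type='string', alternate_sign=False)
friends = [
    ['World of Warcraft', 'Netflix', '9gag'],
    [],                                        # NULL -> empty friends list
    ['Netflix', '9gag', 'World of Warcraft'],
]
X = hasher.transform(friends)                  # scipy sparse matrix of shape (3, 16384)
print((X[0] != X[2]).nnz)                      # 0: rows 1 and 3 get identical vectors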
Using a word2vec representation (as the feature values) and then doing supervised classification can also be a good idea.
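A rough sketch of that direction, assuming Gensim 4 and treating each friends list as a "sentence" of tokens (all names and parameters here are illustrative):
from gensim.models import Word2Vec
import numpy as np

friends_lists = [
    ['World of Warcraft', 'Netflix', '9gag'],
    [],                                        # NULL -> empty friends list
    ['Netflix', '9gag', 'World of Warcraft'],
]

# Train a small word2vec model on the friends lists themselves.
w2v = Word2Vec([f for f in friends_lists if f], vector_size=16, window=2, min_count=1)

def friends_vector(friends, model, dim=16):
    # Average the friends' vectors: fixed length and independent of ordering.
    vecs = [model.wv[f] for f in friends if f in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)

X = np.vstack([friends_vector(f, w2v) for f in friends_lists])
print(X.shape)  # (3, 16) -- these columns can then be concatenated with age and job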

How do I use Conll 2003 corpus in python crfsuite

I have downloaded the CoNLL 2003 corpus ("eng.train"). I want to use it to extract entities with python-crfsuite training, but I don't know how to load this file for training.
I found this example, but it is not for English.
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))
Also, in the future I would like to train new entity types other than POS or location. How can I add those?
Please also suggest how to handle entities that span multiple words.
You can use ConllCorpusReader.
Here is the general form:
ConllCorpusReader('file path', 'file name', columntypes=['','',''])
Here is a list of the column types you can use: 'WORDS', 'POS', 'TREE', 'CHUNK', 'NE', 'SRL', 'IGNORE'
Example:
from nltk.corpus.reader import ConllCorpusReader
train = ConllCorpusReader('CoNLL-2003', 'eng.train', ['words', 'pos', 'ignore', 'chunk'])
test = ConllCorpusReader('CoNLL-2003', 'eng.testa', ['words', 'pos', 'ignore', 'chunk'])
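From there, a minimal usage sketch to get the sentences into the (word, POS, IOB-tag) format that python-crfsuite examples expect (assuming the files live in a local 'CoNLL-2003' directory, as above):
train_sents = list(train.iob_sents())   # each sentence is a list of (word, pos, iob_tag) tuples
test_sents = list(test.iob_sents())
print(train_sents[0][:3])
Because the NE column of eng.train is mapped to 'chunk' in the reader above, the third element of each tuple is the named-entity IOB tag.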

How to fetch vectors for a word list with Word2Vec?

I want to create a text file that is essentially a dictionary, with each word being paired with its vector representation through word2vec. I'm assuming the process would be to first train word2vec and then look up each word from my list and find its representation (and then save it in a new text file)?
I'm new to word2vec and I don't know how to go about doing this. I've read from several of the main sites, and several of the questions on Stack, and haven't found a good tutorial yet.
The direct access model[word] is deprecated and will be removed in Gensim 4.0.0 in order to separate the training and the embedding. The command should be replaced with, simply, model.wv[word].
Using Gensim in Python, after the vocabulary is built and the model trained, you can find the word count and sampling information already mapped in model.wv.vocab, where model is the variable name of your Word2Vec object.
Thus, to create a dictionary object, you may:
my_dict = {}
for idx, key in enumerate(model.wv.vocab):
    my_dict[key] = model.wv[key]
    # Or my_dict[key] = model.wv.get_vector(key)
    # Or my_dict[key] = model.wv.word_vec(key, use_norm=False)
Now that you have your dictionary, you can write it to a file with whatever means you like. For example, you can use the pickle library. Alternatively, if you are using Jupyter Notebook, they have a convenient 'magic command' %store my_dict > filename.txt. Your filename.txt will look like:
{'one': array([-0.06590105, 0.01573388, 0.00682817, 0.53970253, -0.20303348,
-0.24792041, 0.08682659, -0.45504045, 0.89248925, 0.0655603 ,
......
-0.8175681 , 0.27659689, 0.22305458, 0.39095637, 0.43375066,
0.36215973, 0.4040089 , -0.72396156, 0.3385369 , -0.600869 ],
dtype=float32),
'two': array([ 0.04694849, 0.13303463, -0.12208422, 0.02010536, 0.05969441,
-0.04734801, -0.08465996, 0.10344813, 0.03990637, 0.07126121,
......
0.31673026, 0.22282903, -0.18084198, -0.07555179, 0.22873943,
-0.72985399, -0.05103955, -0.10911274, -0.27275378, 0.01439812],
dtype=float32),
'three': array([-0.21048863, 0.4945509 , -0.15050395, -0.29089224, -0.29454648,
0.3420335 , -0.3419629 , 0.87303966, 0.21656844, -0.07530259,
......
-0.80034876, 0.02006451, 0.5299498 , -0.6286509 , -0.6182588 ,
-1.0569025 , 0.4557548 , 0.4697938 , 0.8928275 , -0.7877308 ],
dtype=float32),
'four': ......
}
You may also wish to look into the native save / load methods of Gensim's word2vec.
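For instance, a minimal sketch of those native methods (the file names are just examples):
model.save("word2vec.model")                     # full model; training can be resumed later
model = Word2Vec.load("word2vec.model")
model.wv.save_word2vec_format("vectors.txt")     # plain-text "word v1 v2 ..." dump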
The Gensim tutorial explains it very clearly.
First, you should create word2vec model - either by training it on text, e.g.
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
or by loading pre-trained model (you can find them here, for example).
Then iterate over all your words and check for their vectors in the model:
for word in words:
    vector = model[word]
Having that, just write word and vector formatted as you want.
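For example, a simple sketch that writes one word and its vector per line (the file name and format are arbitrary):
with open("word_vectors.txt", "w") as f:
    for word in words:
        vector = model[word]                     # model.wv[word] in Gensim 4+
        f.write(word + " " + " ".join(str(x) for x in vector) + "\n")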
You can directly get the vectors through
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
model.wv.vectors
and words through
model.wv.vocab.keys()
Hope it helps!
If you are willing to use Python with the gensim package, then building upon this answer and the Gensim Word2Vec documentation, you could do something like this:
from gensim.models import Word2Vec
# Take some sample sentences
tokenized_sentences = [["here","is","one"],["and","here","is","another"]]
# Initialise model, for more information, please check the Gensim Word2vec documentation
model = Word2Vec(tokenized_sentences, size=100, window=2, min_count=0)
# Get the ordered list of words in the vocabulary
words = model.wv.vocab.keys()
# Make a dictionary
we_dict = {word:model.wv[word] for word in words}
Gensim 4.0 updates: the vocab attribute is deprecated, and the way to access a word's vector has changed.
Get the ordered list of words in the vocabulary
words = list(w for w in model.wv.index_to_key)
Get the vector for 'also'
print(model.wv['also'])
Using basic Python:
all_vectors = []
for index, vector in enumerate(model.wv.vectors):
    vector_object = {}
    vector_object[list(model.wv.vocab.keys())[index]] = vector
    all_vectors.append(vector_object)
For gensim 4.0:
my_dict = {}
for word in word_list:                           # word_list is the list of words you want vectors for
    my_dict[word] = model.wv.get_vector(word, norm=True)
I would suggest this; there you can find anything you need, including Word2Vec, FastText, Doc2Vec, KeyedVectors, and so on...
