Which machine learning methods can I use to predict DNA Sequences? - machine-learning

I have a dataset of DNA Sequences related to Covid-19 and I simply want to predict possible future sequences based on the existing sequences.
DNA sequences consist of 4 letters and 4 letters only: A, G, T and C. So a chunk of a sequence would look like
"ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAA"
Any advice or help regarding how to predict future mutations based on these existing data would be a lot of help.

Related

Word Embedding Model

I have been searching and attempting to implement a word embedding model to predict similarity between words. I have a dataset made up of 3,550 company names. The idea is that the user can provide a new name (which would not be in the vocabulary) and calculate the similarity between the new name and the existing ones.
During preprocessing I got rid of stop words and punctuation (hyphens, dots, commas, etc). In addition, I applied stemming and separated prefixes with the hope to get more precision. Then words such as BIOCHEMICAL ended up as BIO CHEMIC which is the word divided in two (prefix and stem word)
The average company name is about 3 words long, with the following frequency:
The tokens that are the result of preprocessing are sent to word2vec:
from gensim.models import Word2Vec  # gensim < 4.0 API (size=, wv.index2word)

# window: maximum distance between the current and predicted word within a sentence
# min_count: ignores all words with total frequency lower than this
# workers: number of worker threads used to train the model
# sg: the training algorithm, either CBOW (0) or skip-gram (1); the default is 0
word2vec_model = Word2Vec(prepWords, size=300, window=2, min_count=1, workers=7, sg=1)
After the model has included all the words in the vocab, the average sentence vector is calculated for each company name:
df['avg_vector']=df2.apply(lambda row : avg_sentence_vector(row, model=word2vec_model, num_features=300, index2word_set=set(word2vec_model.wv.index2word)).tolist())
Then, the vector is saved for further lookups:
##Saving name and vector values in file
df.to_csv('name-submission-vectors.csv',encoding='utf-8', index=False)
If a new company name is not included in the vocab after preprocessing (removing stop words and punctuation), then I proceed to create the model again and calculate the average sentence vector and save it again.
I have found this model is not working as expected. As an example, asking for the words most similar to pet gives the following results:
ms=word2vec_model.most_similar('pet')
('fastfood', 0.20879755914211273)
('hammer', 0.20450574159622192)
('allur', 0.20118337869644165)
('wright', 0.20001833140850067)
('daili', 0.1990675926208496)
('mgt', 0.1908089816570282)
('mcintosh', 0.18571510910987854)
('autopart', 0.1729743778705597)
('metamorphosi', 0.16965581476688385)
('doak', 0.16890916228294373)
In the dataset, I have words such as paws or petcare, but the model relates other, unrelated words to pet instead.
This is the distribution of the nearer words for pet:
On the other hand, when I used the GoogleNews-vectors-negative300.bin.gz, I could not add new words to the vocab, but the similarity between pet and words around was as expected:
ms=word2vec_model.most_similar('pet')
('pets', 0.771199643611908)
('Pet', 0.723974347114563)
('dog', 0.7164785265922546)
('puppy', 0.6972636580467224)
('cat', 0.6891531348228455)
('cats', 0.6719794869422913)
('pooch', 0.6579219102859497)
('Pets', 0.636363685131073)
('animal', 0.6338439583778381)
('dogs', 0.6224827170372009)
This is the distribution of the nearest words:
I would like to get your advice about the following:
Is this dataset appropriate to proceed with this model?
Is the length of the dataset enough to allow word2vec "learn" the relationships between the words?
What can I do to improve the model to make word2vec create relationships of the same type as GoogleNews where for instance word pet is correctly set among similar words?
Is it feasible to implement another alternative such as fasttext considering the nature of the current dataset?
Do you know any public dataset that can be used along with the current dataset to create those relationships?
Thanks
3500 texts (company names) of just ~3 words each is only around 10k total training words, with a much smaller vocabulary of unique words.
That's very, very small for word2vec & related algorithms, which rely on lots of data, and sufficiently-varied data, to train-up useful vector arrangements.
You may be able to squeeze some meaningful training from limited data by using far more training epochs than the default epochs=5, and far smaller vectors than the default size=100. With those sorts of adjustments, you may start to see more meaningful most_similar() results.
But, it's unclear that word2vec, and specifically word2vec in your averaging-of-a-name's-words comparisons, is matched to your end goals.
Word2vec needs lots of data, doesn't look at subword units, and can't say anything about word-tokens not seen during training. An average-of-many-word-vectors can often work as an easy baseline for comparing multiword texts, but might also dilute some word's influence compared to other methods.
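The averaging baseline described above can be sketched in plain Python. This is only an illustration under stated assumptions: the two-dimensional `toy_vectors` and the helper names `average_vector` and `cosine` are made up here, and in practice the vectors would come from a trained model.

```python
import math

# Hypothetical toy vectors; a real model would supply these.
toy_vectors = {
    "pet": [1.0, 0.0],
    "care": [0.0, 1.0],
    "paws": [0.9, 0.1],
}

def average_vector(tokens, vectors):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token of this name was seen during training
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

name_a = average_vector(["pet", "care"], toy_vectors)
name_b = average_vector(["paws", "unknownword"], toy_vectors)
print(name_a, cosine(name_a, name_b))
```

Note how the unseen token contributes nothing to the average, and a name made up entirely of unseen tokens has no vector at all.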
Things to consider might include:
Word2vec-related algorithms like FastText that also learn vectors for subword units, and can thus bootstrap not-so-bad guess vectors for words not seen in training. (But, these are also data hungry, and to use on a small dataset you'd again want to reduce vector size, increase epochs, and additionally shrink the number of buckets used for subword learning.)
More sophisticated comparisons of multi-word texts, like "Word Mover's Distance". (That can be quite expensive on longer texts, but for names/titles of just a few words may be practical.)
Finding more data that's compatible with your aims for a stronger model. A larger database of company names might help. If you just want your analysis to understand English words/roots, more generic training texts might work too.
For many purposes, a mere lexicographic comparison - edit distances, count of shared character-n-grams – may be helpful too, though it won't detect all synonyms/semantically-similar words.
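As a rough sketch of the lexicographic comparisons mentioned in the last point, the following uses only the standard library. The function names and the choice of 3-character n-grams with a Jaccard ratio are illustrative assumptions, not something the answer prescribes.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_jaccard(a, b, n=3):
    """Shared n-grams divided by all distinct n-grams of the two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0

def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]

print(ngram_jaccard("petcare", "pet care"), edit_distance("petcare", "pet care"))
```

Both measures work on new names without any training, which is why they make a reasonable baseline next to an embedding model.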
Word2vec does not generalize to unseen words.
It does not even work well for words that are seen but rare. It really depends on having many, many examples of word usage. Furthermore, you need enough context to the left and right, but you only use company names - these are too short. That is likely why your embeddings perform so poorly: too little data and too short texts.
Hence, it is the wrong approach for you. Retraining the model with the new company name is not enough - you still only have one data point. You may as well leave out unseen words, word2vec cannot work better than that even if you retrain.
If you only want to compute similarity between words, probably you don't need to insert new words in your vocabulary.
By eye, I think you can also use FastText without the need to stem the words. It also computes vectors for unknown words.
From FastText FAQ:
One of the key features of fastText word representation is its ability
to produce vectors for any words, even made-up ones. Indeed, fastText
word vectors are built from vectors of substrings of characters
contained in it. This allows to build vectors even for misspelled
words or concatenation of words.
FastText seems to be useful for your purpose.
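To make the "vectors of substrings" idea in the quoted FAQ concrete, here is a minimal sketch of the decomposition step only. It assumes the word is wrapped in `<` and `>` boundary markers and split into n-grams of 3 to 6 characters; the function name `subword_ngrams` is hypothetical, and the real library additionally hashes the n-grams into buckets and combines learned vectors, which is not shown.

```python
def subword_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

print(subword_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word still shares pieces with words that were seen in training.
print(set(subword_ngrams("pet")) & set(subword_ngrams("petcare")))
```

Because "petcare" shares n-grams with "pet", a subword model can place an unseen compound near its parts, which plain word2vec cannot do.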
For your task, you can follow FastText supervised tutorial.
If your corpus proves to be too small, you can build your model starting from available pretrained vectors (pretrainedVectors parameter).

Data augmentation for text classification

What is the current state-of-the-art data augmentation technique for text classification?
I did some research online about how I can extend my training set by applying some data transformations, the same way we do for image classification.
I found some interesting ideas such as:
Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random place in the sentence. Do this n times.
Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times.
Random Deletion: Randomly remove each word in the sentence with probability p.
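Two of the operations listed above need no external resources and can be sketched with the standard library. The function names are illustrative, and keeping at least one token after deletion is a design choice assumed here, not part of the description above. Synonym replacement and random insertion are left out because they need a thesaurus.

```python
import random

def random_swap(tokens, n, rng):
    """Swap two randomly chosen positions, n times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    # Assumed safeguard: keep one random token rather than return nothing.
    return kept if kept else [rng.choice(tokens)]

rng = random.Random(0)  # fixed seed so runs are repeatable
sentence = "the service at this place was really slow".split()
print(random_swap(sentence, 2, rng))
print(random_deletion(sentence, 0.2, rng))
```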
But nothing about using pre-trained word vector representation model such as word2vec. Is there a reason?
Data augmentation using a word2vec might help the model to get more data based on external information. For instance, replacing a toxic comment token randomly in the sentence by its closer token in a pre-trained vector space trained specifically on external online comments.
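The replacement idea described above could look roughly like this. The sketch assumes a small made-up embedding table (`toy_vectors`) in place of a real pre-trained model, and the helper names are hypothetical.

```python
import math
import random

# Hypothetical toy embedding table standing in for a pre-trained model.
toy_vectors = {
    "stupid": [0.9, 0.1],
    "dumb": [0.85, 0.2],
    "idiot": [0.8, 0.3],
    "great": [-0.9, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_token(token, vectors):
    """Closest other token by cosine similarity, or None if the token is unknown."""
    if token not in vectors:
        return None
    others = [t for t in vectors if t != token]
    return max(others, key=lambda t: cosine(vectors[token], vectors[t]), default=None)

def replace_one_token(tokens, vectors, rng):
    """Replace one randomly chosen in-vocabulary token with its nearest neighbour."""
    candidates = [i for i, t in enumerate(tokens) if t in vectors]
    if not candidates:
        return list(tokens)
    i = rng.choice(candidates)
    augmented = list(tokens)
    augmented[i] = nearest_token(tokens[i], vectors) or tokens[i]
    return augmented

print(replace_one_token("you are stupid".split(), toy_vectors, random.Random(0)))
# ['you', 'are', 'dumb']
```

One caveat worth checking on your own data: nearest neighbours in an embedding space are not always synonyms (antonyms often appear in similar contexts), so a replacement can silently change the label.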
Is it a good method, or am I missing some important drawbacks of this technique?
Your idea of using word2vec embeddings usually helps. However, that is a context-free embedding. To go one step further, the state of the art (SOTA) as of today (2019-02) is to use a language model trained on a large corpus of text and fine-tune your own classifier with your own training data.
The two SOTA models are:
GPT-2 https://github.com/openai/gpt-2
BERT https://github.com/google-research/bert
The data augmentation methods you mentioned might also help (depending on your domain and the number of training examples you have). Some of them are actually used in language model training (for example, in BERT there is one task that randomly masks out words in a sentence at pre-training time). If I were you, I would first adopt a pre-trained model and fine-tune your own classifier with your current training data. Taking that as a baseline, you could try each of the data augmentation methods you like and see if they really help.

Mnemonic Generation Using LSTM's | How do I make sure my Model Generates Meaningful Sentence Using a Loss Function? [closed]

I'm working on a project that generates mnemonics. I have a problem with my model.
My question is: how do I make sure my model generates meaningful sentences using a loss function?
Aim of the project
To generate mnemonics for a list of words. Given a list of words the user wants to remember, the model will output a meaningful, simple and easy-to-remember sentence whose words start with the first one or two letters of the words that the user wants to remember. My model will receive only the first two letters of each word the user wants to remember, as that carries all the information needed for the mnemonic to be generated.
Dataset
I'm using Kaggle's 55000+ song lyrics data. The sentences in those lyrics contain 5 to 10 words, and the mnemonics I want to generate contain the same number of words.
Input/Output Preprocessing.
I iterate through all the sentences after removing punctuation and numbers, extract the first 2 letters of each word in a sentence, and assign a unique number to each pair of letters from a predefined dictionary that maps each letter pair (key) to a unique number (value).
The list of these unique numbers acts as the input, and the GloVe vectors of the corresponding words act as the output. At each time step, the LSTM model takes the unique number assigned to a word and outputs that word's GloVe vector.
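The input side of this preprocessing might be sketched as follows. The post uses a predefined dictionary; building it from the data, as done here, and reserving 0 for padding are assumptions made for illustration, as are the function names and the example sentences.

```python
import string

def letter_pairs(sentence):
    """First two letters of each word, after dropping punctuation and digits."""
    cleaned = "".join(
        ch for ch in sentence.lower()
        if ch not in string.punctuation and not ch.isdigit()
    )
    return [word[:2] for word in cleaned.split()]

def build_pair_ids(sentences):
    """Give each distinct letter pair a unique integer, starting at 1 (0 is kept for padding)."""
    pair_to_id = {}
    for sentence in sentences:
        for pair in letter_pairs(sentence):
            pair_to_id.setdefault(pair, len(pair_to_id) + 1)
    return pair_to_id

def encode(sentence, pair_to_id):
    """Sequence of pair ids for one sentence."""
    return [pair_to_id[pair] for pair in letter_pairs(sentence)]

sentences = ["Hello, world 42!", "Walking down the road"]
pair_to_id = build_pair_ids(sentences)
print(letter_pairs(sentences[0]), encode(sentences[0], pair_to_id))
# ['he', 'wo'] [1, 2]
```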
Model Architecture
I'm using LSTM's with 10 time steps.
At each time step the unique number associated with the pair of letters will be fed and the output will be the GloVe vector of the corresponding word.
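from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dropout, TimeDistributed, Dense
from keras.optimizers import RMSprop

optimizer = RMSprop(lr=0.0008)
model = Sequential()
# input: letter-pair ids (vocabulary of 733), embedded into 40 dimensions
model.add(Embedding(input_dim=733, output_dim=40, input_length=12))
model.add(Bidirectional(LSTM(250, return_sequences=True, activation='relu'), merge_mode='concat'))
model.add(Dropout(0.4))  # dropout between the recurrent layers
model.add(Bidirectional(LSTM(350, return_sequences=True, activation='relu'), merge_mode='concat'))
model.add(Dropout(0.4))
# one 332-dimensional output vector per time step
model.add(TimeDistributed(Dense(332, activation='tanh')))
model.compile(loss='cosine_proximity', optimizer=optimizer, metrics=['acc'])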
Results:
My model is outputting Mnemonics which match the first two letters of each word in the input. But the mnemonic generated carries little to no meaning.
I have realized this problem is caused by the way I'm training. During training, the letter pairs extracted from the words are already in an order that forms a sentence. This is not the case at test time: the order in which I feed the letter pairs may not have a high probability of forming a sentence.
So I built a bigram for my data and feed that permutation that had the highest probability of sentence formation into my mnemonic generator model. Though there were some improvements, the sentence as a whole didn’t make any sense.
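The reordering step described above can be sketched as a brute-force search over permutations scored by bigram counts. This assumes the bigrams are counted over letter pairs and uses add-one smoothing; both are assumptions, since the post does not say how the bigram model was built, and the tiny `training` list is made up.

```python
from collections import Counter
from itertools import permutations

def bigram_counts(sequences):
    """Count adjacent pairs across all training sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts

def best_order(items, counts):
    """Permutation with the highest product of add-one smoothed bigram counts."""
    def score(order):
        product = 1
        for pair in zip(order, order[1:]):
            product *= counts[pair] + 1  # add-one so unseen pairs do not zero the score
        return product
    return max(permutations(items), key=score)

# Hypothetical training sequences of letter pairs.
training = [["th", "ca", "sa"], ["th", "ca", "ra"], ["a", "ca", "sa"]]
counts = bigram_counts(training)
print(best_order(["sa", "th", "ca"], counts))
# ('th', 'ca', 'sa')
```

Enumerating every permutation grows factorially, so for 10 items a beam search or greedy ordering would be the more practical choice.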
I’m stuck at this point.
Input
Output
My question is: how do I make sure my model generates meaningful sentences using a loss function?
First, I have a couple of unrelated suggestions. I do not think you should output the GloVe vector of each word. Why? Word2Vec approaches are meant to encapsulate word meanings and would probably not contain information about their spelling. However, the meaning is also helpful in order to produce a meaningful sentence. Thus, I would instead have the LSTM produce its own hidden state after reading the first two letters of each word (just as you currently do). I would then have that sequence be unrolled (as you currently do) into sequences of dimension one (indexes into an index-to-word map). I would then take that output, process it through an embedding layer that maps the word indexes to their GloVe embeddings, and run that through another output LSTM to produce more indexes. You can stack this as much as you want, but 2 or 3 levels will probably be good enough.
Even with these changes, it is unlikely you will see any success in generating easy-to-remember sentences. For that main issue, I think there are generally two ways you can go. The first is to augment your loss with some measure of whether the resulting sentence is a 'valid English sentence'. You can do this programmatically, with some accuracy, by POS tagging the output sentence and adding a loss relative to whether it follows a standard sentence structure (subject, predicate, adverbs, direct objects, etc.). Though this might be easier than the following alternative, it might not yield actually natural results.
I would recommend, in addition to training your model in its current fashion, using a GAN to judge whether the output sentences are natural sentences. There are many resources on Keras GANs, so I do not think you need specific code in this answer. However, here is an outline of how your model should train logically:
Augment your current training with two additional phases.
First, train the discriminator to judge whether or not the output sentence is natural. You can do this by having an LSTM model read sentences and give a sigmoid output (0/1) for whether or not they are 'natural'. You can then train this model on some dataset of real sentences with 1 labels and your generated sentences with 0 labels, at roughly a 50/50 split.
Then, in addition to the current loss function for actually generating the mnemonics, add the loss that is the binary cross-entropy score for your generated sentences with 1 (true) labels. Be sure to freeze the discriminator model while doing this.
Continue iterating over these two steps (training each for 1 epoch at a time) until you start to see more reasonable results. You may need to play with how much each loss term is weighted in the generator (your model) in order to get the correct trade-off between a correct mnemonic and an easy-to-remember sentence.
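The weighted generator loss from the second step can be written out numerically as below. This is plain arithmetic rather than Keras code; `adversarial_weight` and the function names are hypothetical, and the discriminator scores are made-up sigmoid outputs.

```python
import math

def binary_cross_entropy(prediction, label, eps=1e-7):
    """BCE for one sigmoid output in (0, 1) against a 0/1 label."""
    p = min(max(prediction, eps), 1 - eps)  # clip to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def generator_loss(mnemonic_loss, discriminator_scores, adversarial_weight=0.5):
    """Existing mnemonic loss plus a weighted adversarial term.

    The adversarial term scores generated sentences against label 1 ("natural"),
    so it is small when the frozen discriminator is fooled.
    """
    adversarial = sum(binary_cross_entropy(s, 1) for s in discriminator_scores)
    adversarial /= len(discriminator_scores)
    return mnemonic_loss + adversarial_weight * adversarial

print(generator_loss(0.30, [0.9, 0.8]))  # discriminator mostly fooled
print(generator_loss(0.30, [0.1, 0.2]))  # discriminator not fooled: larger loss
```

Raising `adversarial_weight` pushes the generator toward natural-sounding output at the expense of matching the requested letters, which is the trade-off to tune.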

Character-Word Embeddings from lm_1b in Keras

I would like to use some pre-trained word embeddings in a Keras NN model, which have been published by Google in a very well known article. They have provided the code to train a new model, as well as the embeddings here.
However, it is not clear from the documentation how to retrieve an embedding vector for a given string of characters (word) with a simple Python function call. Much of the documentation seems to center on dumping vectors to a file for an entire sentence, presumably for sentiment analysis.
So far, I have seen that you can feed in pretrained embeddings with the following syntax:
embedding_layer = Embedding(number_of_words??,
output_dim=128??,
weights=[pre_trained_matrix_here],
input_length=60??,
trainable=False)
However, converting the different files and their structures to pre_trained_matrix_here is not quite clear to me.
They have several softmax outputs, so I am uncertain which one would belong, and furthermore how to align the words in my input with the dictionary of words that they have.
Is there a simple manner to use these word/char embeddings in keras and/or to construct the character/word embedding portion of the model in keras such that further layers may be added for other NLP tasks?
The Embedding layer only looks up embeddings (rows of the weight matrix) for the integer indices of the input words; it does not know anything about the strings. This means you first need to convert your input sequence of words to a sequence of indices using the same vocabulary as was used in the model you take the embeddings from.
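The conversion step described above might look like this in plain Python. The vocabulary here is hypothetical, and reserving index 0 for padding and index 1 for unknown words is an assumption: the pre-trained matrix would need matching entries at those positions.

```python
def build_word_index(vocabulary):
    """Map each word to an integer index; 0 is reserved for padding, 1 for unknown words."""
    word_index = {"<pad>": 0, "<unk>": 1}
    for word in vocabulary:
        word_index.setdefault(word, len(word_index))
    return word_index

def encode(words, word_index, length):
    """Convert words to indices, then truncate or right-pad to a fixed length."""
    ids = [word_index.get(w, word_index["<unk>"]) for w in words][:length]
    return ids + [word_index["<pad>"]] * (length - len(ids))

# Hypothetical vocabulary, listed in the same order as the pre-trained embeddings.
vocabulary = ["the", "cat", "sat"]
word_index = build_word_index(vocabulary)
print(encode(["the", "dog", "sat"], word_index, length=5))
# [2, 1, 4, 0, 0]
```

The fixed `length` corresponds to the `input_length` argument of the Embedding layer.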
For NLP applications that are related to word or text encoding I would use CountVectorizer or TfidfVectorizer. Both are announced and described in a brief way for Python in the following reference: http://www.bogotobogo.com/python/scikit-learn/files/Python_Machine_Learning_Sebastian_Raschka.pdf
CountVectorizer can be used for simple applications such as a spam/ham detector, while TfidfVectorizer gives deeper insight into how relevant each term (word) is, based on its frequency in the document and the number of documents in which it appears. This results in an interesting metric of how discriminative the terms are. These text feature extractors can be combined with stop-word removal and lemmatization to improve the feature representations.

Am I using word-embeddings correctly?

Core question : Right way(s) of using word-embeddings to represent text ?
I am building sentiment classification application for tweets. Classify tweets as - negative, neutral and positive.
I am doing this using Keras on top of Theano and using word embeddings (Google's word2vec or Stanford's GloVe).
To represent tweet text I have done as follows:
used a pre-trained model (such as word2vec-twitter model) [M] to map words to their embeddings.
Use the words in the text to query M to get the corresponding vectors. So if the tweet (T) is "Hello world", M gives vectors V1 and V2 for the words 'Hello' and 'World'.
The tweet T can then be represented (V) as either V1+V2 (add vectors) or V1V2 (concatenate vectors). [These are 2 different strategies. Concatenation means juxtaposition, so if V1 and V2 are d-dimensional vectors, in my example T is a 2d-dimensional vector.]
Then, the tweet T is represented by vector V.
If I follow the above, then My Dataset is nothing but vectors (which are sum or concatenation of word vectors depending on which strategy I use).
I am training a deep net such as an FFN or LSTM on this dataset, but my results aren't coming out great.
Is this the right way to use word-embeddings to represent text ? What are the other better ways ?
Your feedback/critique will be of immense help.
I think that, for your purpose, it is better to think about another way of composing those vectors. The literature on word embeddings contains examples of criticisms to these kinds of composition (I will edit the answer with the correct references as soon as I find them).
I would also suggest considering other possible approaches, for instance:
Using the single word vectors as input to your net (I do not know your architecture, but the LSTM is recurrent so it can deal with sequences of words).
Using a full paragraph embedding (i.e. https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
Summing them doesn't make much sense, to be honest: the sum is another vector which I don't think represents the semantics of "Hello World". Maybe it does, but that surely won't hold for longer sentences in general.
Instead, it would be better to feed them as a sequence, since that at least preserves the word order in a meaningful way, which seems to fit your problem better.
E.g. "A hates apple" vs. "Apple hates A": this difference would be captured when you feed them as a sequence into an RNN, but their sums will be the same.
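The point about word order can be checked directly with a small sketch. The three-dimensional `toy_vectors` are made up for illustration; any real embedding table would behave the same way.

```python
# Hypothetical toy vectors standing in for pre-trained embeddings.
toy_vectors = {
    "a": [1.0, 0.0, 0.0],
    "hates": [0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

def as_sum(tokens, vectors):
    """One vector per text: element-wise sum of the word vectors."""
    return [sum(column) for column in zip(*(vectors[t] for t in tokens))]

def as_sequence(tokens, vectors):
    """One vector per word, in order: the shape an RNN/LSTM consumes."""
    return [vectors[t] for t in tokens]

first = "a hates apple".split()
second = "apple hates a".split()
print(as_sum(first, toy_vectors) == as_sum(second, toy_vectors))            # True
print(as_sequence(first, toy_vectors) == as_sequence(second, toy_vectors))  # False
```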
I hope you get my point!

Resources