I am designing a neural network to try to generate music. The network would be a two-layer LSTM (Long Short-Term Memory).
I am hoping to encode the music in a many-hot format for training, i.e. a position would be 1 if that note is playing and 0 if that note is not playing.
Here is an excerpt of what this data would look like:
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000011010100100001010000000000000000000000
There are 88 columns, which represent the 88 notes, and each row represents a new beat. The output will be at a character level.
I am just wondering: since there are only 2 characters in the vocabulary, would the probability of a 0 being next always be higher than the probability of a 1 being next?
I know that for a large vocabulary a large training set is needed, but I only have a small vocabulary. I have 229 files, which correspond to about 50,000 lines of text. Is this enough to prevent the output being all 0s?
Also, would it be better to have 88 nodes, 1 for each note, or just one node for one character at a time?
Thanks in advance
A small vocabulary is fine as long as your dataset is not skewed overwhelmingly toward one of the "words".
As to "would it be better to have 88 nodes, 1 for each note, or just one node for one character at a time?", each timestep is represented as 88 characters. Each character is a feature of that timestep. Your LSTM should be outputting the next timestep, so you should have 88 nodes. Each node should output the probability of that node being present in that timestep.
Finally, since you are building a char-RNN, I would strongly suggest using ABC notation to represent your data. A song in ABC notation looks like this:
X:1
T:Speed the Plough
M:4/4
C:Trad.
K:G
|:GABc dedB|dedB dedB|c2ec B2dB|c2A2 A2BA|
GABc dedB|dedB dedB|c2ec B2dB|A2F2 G4:|
|:g2gf gdBd|g2f2 e2d2|c2ec B2dB|c2A2 A2df|
g2gf g2Bd|g2f2 e2d2|c2ec B2dB|A2F2 G4:|
This is perfect for char-RNNs because it represents every song as a sequence of characters, and you can run conversions from MIDI to ABC and vice versa. All you have to do is train your model to predict the next character in this sequence instead of dealing with 88 output nodes.
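Preparing an ABC string for a char-RNN is then just a vocabulary lookup. A minimal sketch in plain Python (the corpus here is only the excerpt above):
# Build a character vocabulary from an ABC corpus (excerpt only)
abc_text = "X:1\nT:Speed the Plough\nM:4/4\nK:G\n|:GABc dedB|dedB dedB|"

chars = sorted(set(abc_text))
char_to_idx = {c: i for i, c in enumerate(chars)}

# Encode the song as integers; the model learns to predict
# encoded[t+1] from the characters up to position t
encoded = [char_to_idx[c] for c in abc_text]
inputs, targets = encoded[:-1], encoded[1:]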
I am trying to build a classifier to classify some files into 150 categories based on the name of those files. Here are some examples of file names in my dataset (~700k files):
104932489 - urgent - contract validation for xyz limited.msg
treatment - an I l - contract n°4934283 received by partner.pdf
- invoice_8843238_1_europe services_business 8592342sid paris.xls
140159498736656.txt
140159498736843.txt
fsk_000000001090296_sdiselacrusefeyre_2000912.xls
fsk_000000001091293_lidlsnd1753mdeas_2009316.xls
You can see that the filenames can really be anything, but there is always some pattern that is respected within the same category. It can be in the numbers (which are sometimes close), in the special characters (spaces, -, °), sometimes the length, etc.
Extracting all those patterns one by one will take ages because I have approximately 700k documents. Also, I am not interested in 100% accuracy, 70% can be good enough.
The real problem is that I don't know how to encode this data. I have tried many methods:
Tokenizing character by character and feeding the tokens to an LSTM model with an embedding layer. However, I wasn't able to implement it and got dimension errors.
Adapting Word2Vec to convert the characters into vectors. However, this automatically drops all punctuation and space characters, and I also lose the numeric data. Another problem is that it creates useless extra dimensions: if the size is 20, my data sits in 20 dimensions, but if I look closely there are always the same 150 vectors in those 20 dimensions, so it's really useless. I could use a size of 2 dimensions, but I would still need the numeric data and the special characters.
Generating n-grams from each path, in the range 1-4, then using a CountVectorizer to compute the frequencies. I checked and the special characters were not dropped, but this gave me around 400,000 features! I am running a dimensionality reduction using UMAP (n_components=5, metric='hellinger'), but the reduction runs for 2 hours and then the kernel crashes. A sketch of this pipeline is below.
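For reference, that last attempt looks roughly like this (a sketch; the parameters are the ones mentioned above, and the three filenames stand in for the ~700k real ones):
from sklearn.feature_extraction.text import CountVectorizer
import umap  # umap-learn

filenames = [
    "104932489 - urgent - contract validation for xyz limited.msg",
    "140159498736656.txt",
    "fsk_000000001090296_sdiselacrusefeyre_2000912.xls",
]  # in reality ~700k names

# Character n-grams in the range 1-4; analyzer='char' keeps spaces
# and special characters such as '-' and '°'
vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 4))
X = vectorizer.fit_transform(filenames)  # ~400,000 sparse features on the full data

reducer = umap.UMAP(n_components=5, metric='hellinger')
X_reduced = reducer.fit_transform(X)  # this is the step that crashes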
Any ideas?
I am currently also working on a character-level LSTM, and it works exactly the same way as when you use words. You need a vocabulary, for example a-z, and then you just take the index of each letter as its integer representation. For example:
"bad" -> "b", "a", "d" -> [1, 0, 3]
Now you can create an embedding lookup table (for example using PyTorch's nn.Embedding module). You just have to create a random vector for every index of your vocab. For example:
"a" -> 0 > [-0.93, 0.024, -.0.73, ..., -0.12]
You said that you tried this but encountered dimension errors? Maybe show us the code!
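In the meantime, a minimal working nn.Embedding setup looks like this (the 26-letter vocab and the 32 dimensions are arbitrary choices, not requirements):
import torch
import torch.nn as nn

vocab = "abcdefghijklmnopqrstuvwxyz"
char_to_idx = {c: i for i, c in enumerate(vocab)}

# One trainable 32-dim vector per character, randomly initialized
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=32)

indices = torch.tensor([char_to_idx[c] for c in "bad"])  # tensor([1, 0, 3])
vectors = embedding(indices)                             # shape: (3, 32)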
Or you can create non-random embeddings with Word2Vec using the Gensim library:
from gensim.models import Word2Vec

# 'total_words' is a list containing every word of your dataset
# split into its characters, e.g. [['b', 'a', 'd'], ['c', 'a', 'b'], ...]
total_words = [...]
save_model_file = "char_embeddings.model"

model = Word2Vec(total_words, min_count=1, vector_size=32)  # 'size' in gensim < 4.0
model.save(save_model_file)

# let's test it for the character 'a'
embedder = Word2Vec.load(save_model_file)
v = embedder.wv["a"]
# v is now the embedding vector of 'a', a numpy array of size 32
I hope I could make clear how to create embeddings for characters.
You can treat characters in single-word-classification the exact same way you would treat words in sentence-classification.
I am working on a dataset with more than 100,000 records.
This is what the data looks like:
email_id cust_id campaign_name
123 4567 World of Zoro
123 4567 Boho XYz
123 4567 Guess ABC
234 5678 Anniversary X
234 5678 World of Zoro
234 5678 Fathers day
234 5678 Mothers day
345 7890 Clearance event
345 7890 Fathers day
345 7890 Mothers day
345 7890 Boho XYZ
345 7890 Guess ABC
345 7890 Sale
I am trying to understand the campaign sequence and predict the next possible campaign for the customers.
Assume I have processed my data and stored it in 'camp'.
With Word2Vec-
from gensim.models import Word2Vec
model = Word2Vec(sentences=camp, size=100, window=4, min_count=5, workers=4, sg=0)
The problem with this model is that it accepts tokens and, when looking for similarities, returns text tokens with probabilities.
Word2Vec accepts this form of input-
['World','of','Zoro','Boho','XYZ','Guess','ABC','Anniversary','X'...]
And gives this form of output -
model.wv.most_similar('Zoro')
[Guess,0.98],[XYZ,0.97]
Since I want to predict the campaign sequence, I was wondering if there is any way I can give the input below to the model and get the campaign name in the output.
My input would be:
[['World of Zoro', 'Boho XYZ', 'Guess ABC'],
 ['Anniversary X', 'World of Zoro', 'Fathers day', 'Mothers day'],
 ['Clearance event', 'Fathers day', 'Mothers day', 'Boho XYZ', 'Guess ABC', 'Sale']]
Output -
model.wv.most_similar('World of Zoro')
[Sale,0.98],[Mothers day,0.97]
I am also not sure if there is any functionality within Word2Vec, or any similar algorithm, that can help predict campaigns for individual users.
Thank you for your help.
I don't believe that word2vec is the right approach to model your problem.
Word2vec uses two possible approaches: Skip-gram (given a target word predict its surrounding words) or CBOW (given the surrounding words predict the target word). Your case is similar to the context of CBOW, but there is no reason why the phenomenon that you want to model would respect the linguistic "rules" for which word2vec has been developed.
word2vec tends to predict the word that occurs most frequently in combination with the target word within the moving window (in your code: window=4). So it won't predict the best possible next choice, but the one that occurred most often within the window span of the given word.
In your call to word2vec (Word2Vec(sentences=camp, size=100, window=4, min_count=5, workers=4, sg=0)) you are also using min_count=5, so the model ignores words that occur fewer than 5 times. Depending on your dataset size, that could mean a loss of relevant information.
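A quick way to see what that cutoff costs you (a sketch, assuming gensim >= 4.0, where size became vector_size):
from gensim.models import Word2Vec

camp = [...]  # your processed campaign sequences, as in your code

strict = Word2Vec(sentences=camp, vector_size=100, window=4, min_count=5, sg=0)
loose = Word2Vec(sentences=camp, vector_size=100, window=4, min_count=1, sg=0)

# Campaigns silently dropped by min_count=5
dropped = set(loose.wv.key_to_index) - set(strict.wv.key_to_index)
print(len(dropped), "campaign tokens were ignored")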
I suggest having a look at forecasting techniques and time series analysis methods. I have the feeling that you will obtain better predictions using these rather than word2vec (https://otexts.org/fpp2/index.html).
I hope it helps.
I am currently working on a machine learning project and am in the process of building the dataset. The dataset will be composed of a number of different textual features, varying in length from 1 sentence to around 50 sentences (including punctuation). What is the best way to store this data to then pre-process and use for machine learning with Python?
In most cases you can use a method called bag-of-words. However, for more complicated tasks like similarity extraction, or when you want to make comparisons between sentences, you should use Word2Vec.
Bag of Words
You may use the classical bag-of-words representation, in which you encode each sample into a long vector indicating the count of all the words across all samples. For example, if you have two samples:
"I like apple, and she likes apple and banana.",
"I love dogs but Sara prefer cats.".
Then all the possible words are (order doesn't matter here):
I she Sara like likes love prefer and but apple banana dogs cats , .
Then the two samples will be encoded to
First: 1 1 0 1 1 0 0 2 0 2 1 0 0 1 1
Second: 1 0 1 0 0 1 1 0 1 0 0 1 1 0 1
If you are using sklearn, the task would be as simple as:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
# Now you can feed X into any other machine learning algorithm.
Word2Vec
Word2Vec is a more complicated method, which attempts to find relationships between words by training an embedding neural network underneath. An embedding, in plain English, can be thought of as the mathematical representation of a word in the context of all the samples provided. The core idea is that words are similar if their contexts are similar.
The result of Word2Vec is a vector representation (embedding) of every word that appears in the samples. The amazing thing is that we can perform arithmetic operations on these vectors. A cool example: Queen - Woman + Man = King.
To use Word2Vec, we can use a package called gensim; here is a basic setup:
from gensim.models import Word2Vec

model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)  # 'size' in gensim < 4.0
model.wv.most_similar(positive=['woman', 'king'], negative=['man'])
>>> [('queen', 0.50882536), ...]
Here sentences is your data and vector_size is the dimension of the embeddings: the larger it is, the more space is used to represent a word, and the more overfitting we should think about. window is the size of the context we care about: the number of words around the target word that we look at when predicting the target from its context during training.
One common way is to create your dictionary (all the possible words) and then encode each of your examples with respect to this dictionary. For example (this is a very small and limited dictionary, just for illustration), you could have the dictionary: hello, world, from, python. Every word is associated with a position, and for each of your examples you define a vector with 0 for absence and 1 for presence. The example "hello python" would therefore be encoded as: 1, 0, 0, 1.
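A minimal sketch of that encoding in Python (using the toy four-word dictionary above):
dictionary = ["hello", "world", "from", "python"]

def encode(text, dictionary):
    # 1 if the dictionary word occurs in the text, 0 otherwise
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in dictionary]

print(encode("hello python", dictionary))  # [1, 0, 0, 1]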
In a made-up card game there are 2 players, each of whom is dealt 5 cards (from a standard 52-card deck), after which some arbitrary function decides the winning player. The goal is to predict the outcome of the game given the 5 cards each player is holding. The training data could look something like this:
Player A Player B Winner
AsKs5d3h2d JcJd8d7h6s 1
7h5d8s9sTh 2c3c4cAhAs 0
6d6s6h6cQd AsKsQsJsTs 0
Where the 'Player' columns are 5 card hands, and the 'Winner' column is 1 when player A has won, and 0 when player A has lost.
The network should be indifferent to the order of the hands, such that after training, feeding it mirrored input data like:
Player A Player B
2d3d6h7s9s TsTdJsQc3h
and
Player A Player B
TsTdJsQc3h 2d3d6h7s9s
will always predict opposite outcomes.
It should also be indifferent to the order of the cards within the hands themselves, such that AsKsQsJsTs is the same as JsTsAsKsQs, which is the same as JsQsTsAsKs etc.
What are some reasonable ways to structure a neural net and its training data to tackle such a problem?
You are going to need a network with 104 inputs (2 players × 52 cards).
The first 52 inputs correspond to player A, the next 52 correspond to player B.
Initialize all inputs to 0, then for each card each player has, set the corresponding input to 1.
For the output layer there are usually two options for binary classification. You can have one output neuron, and if the output of this neuron is greater than a certain threshold, player A wins, else player B wins. Or you can have two output neurons and just look at which one produces the highest output. Both generally work fine.
For the training data, instead of something like "AsKs5d3h2d", you will need a one-hot encoding, something like "0001000001000000100100000100000000011001000000001001" (pretend there are 104 numbers, 10 of which are 1s and the rest are 0s). For the output data you just need a 1 or a 0 corresponding to who won (in the case of having one output neuron).
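A sketch of that encoding in Python (the two-characters-per-card parsing of strings like "AsKs5d3h2d", rank then suit, is my assumption about your format):
RANKS = "23456789TJQKA"
SUITS = "shdc"

def card_index(card):
    # e.g. "As" -> rank 'A', suit 's' -> a fixed slot in 0..51
    return RANKS.index(card[0]) * 4 + SUITS.index(card[1])

def encode_hands(hand_a, hand_b):
    # 104 inputs: first 52 for player A, next 52 for player B;
    # the order of cards within a hand does not matter
    x = [0] * 104
    for i in range(0, 10, 2):
        x[card_index(hand_a[i:i + 2])] = 1
        x[52 + card_index(hand_b[i:i + 2])] = 1
    return x

x = encode_hands("AsKs5d3h2d", "JcJd8d7h6s")  # 10 ones, 94 zeros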
This will make your network invariant to the order of the cards (all possible orders of a given hand create the same input). As for swapping player A's and B's hands and getting the opposite result, that is something that should come naturally to any well-trained network.
First, you should understand what a neural network (NN) is useful for before going ahead with this problem. An NN tries to find complex relationships between input and output (here, your input is the five cards of each player and your output is the predicted class).
In this question, though, the relationship between input and output can easily be formulated, i.e. you can easily write down a set of rules that declares the final winner.
Still, like any other problem, this one can also be dealt with using an NN. First you need to prepare your data.
There are 52 possible cards in total, so take 52 columns in the dataset. Each column holds one of three categorical values: 'A' if the card belongs to player A, 'B' if it belongs to player B, or 'C' if it belongs to nobody. The output is the winner.
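A quick sketch of that table layout (assuming pandas; the fixed ordering of the 52 card columns is arbitrary, and the hand-string parsing is the same assumption as above):
import pandas as pd

CARDS = [r + s for r in "23456789TJQKA" for s in "shdc"]  # 52 fixed columns

def make_row(hand_a, hand_b):
    in_a = {hand_a[i:i + 2] for i in range(0, 10, 2)}
    in_b = {hand_b[i:i + 2] for i in range(0, 10, 2)}
    # 'A'/'B' if the card is in that player's hand, 'C' otherwise
    return {c: "A" if c in in_a else "B" if c in in_b else "C" for c in CARDS}

df = pd.DataFrame([make_row("AsKs5d3h2d", "JcJd8d7h6s")])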
Now you can train it using NN.
I'm working with SVMs in R and I would appreciate any input you may provide.
I have a data set that I need to train an SVM on; the format of the data is the following:
ToPredict Data1 Data2 Data3 Data4 DNA
S 1 12 1 11 000000000100
B -1 17 14 3 11011110111110111
S 1 4 0 4 0000
The question that I have is regarding the DNA column.
Is an SVM able to take an input like the DNA column and still calculate reliable predictions?
For my data set, 0 ≠ 00 and 1 ≠ 001; therefore, the values cannot be treated as integers. Every value represents information that needs to be processed, and the order is very important: it's a string of binary values, each either 1 or 0.
The 0101 information could equally be displayed as ABAB, etc. (A=0, B=1).
How can I train an SVM with the data above?
Thank you.
For SVMs to work, "all" you need is a kernel function.
So what is a sensible kernel function for your "DNA strings"? You probably don't need to be able to prove it is a proper kernel; you can get away with a good similarity measure.
How would you evaluate the similarity of your sequences? I cannot help you there, because I don't know what the data means; this is up to the user (i.e. you) to specify.
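Purely to illustrate the mechanics (one common choice for strings, not a recommendation specific to your data): a k-mer "spectrum" kernel plugged into an SVM via a precomputed Gram matrix, sketched here in Python with scikit-learn. R's kernlab offers the same idea through its stringdot kernel.
import numpy as np
from sklearn.svm import SVC

def kmer_counts(s, k=3):
    # Count all length-k substrings of a binary string
    counts = {}
    for i in range(len(s) - k + 1):
        kmer = s[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

def spectrum_kernel(a, b, k=3):
    # Similarity = dot product of the two k-mer count vectors;
    # an inner product, hence a valid kernel
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * cb.get(kmer, 0) for kmer, n in ca.items())

dna = ["000000000100", "11011110111110111", "0000"]
labels = ["S", "B", "S"]

# Precompute the Gram matrix and hand it to the SVM
K = np.array([[spectrum_kernel(a, b) for b in dna] for a in dna], dtype=float)
clf = SVC(kernel="precomputed").fit(K, labels)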