Applying neural network algorithms to encrypted data - machine-learning

I have an encrypted text dataset and I want to classify it using a neural network algorithm. I know that there is a pattern in the encrypted data.
An example of my input data:
diss%^ghghE(t dffd$#KL*vb xod##:n>did ....
My question is: should I treat the encrypted data as if it were normal text, create a vocabulary, and transform my data into sequences of indices?
Should I clean my data of all the special characters first?
What I tried: I cleaned all the data of special characters, then created a vocabulary and transformed my data into sequences. However, I am getting very low accuracy, even though my model works well when my data is in natural language.
Any help is appreciated.

By definition, a good encryption algorithm will not allow you to learn anything[*] from the encrypted data.
So, unless you suspect that the encryption algorithm is weak, I suggest you abandon this idea.
[*] apart from the approximate size of the original text

Related

one hot encoding of output labels

While I understand the need to one-hot encode features in the input data, how does one-hot encoding of output labels actually help? The TensorFlow MNIST tutorial encourages one-hot encoding of output labels. The first assignment in CS231n (Stanford), however, does not suggest one-hot encoding. What's the rationale behind choosing / not choosing to one-hot encode output labels?
Edit: Not sure about the reason for the downvote, but to elaborate: I left out mentioning the softmax function along with the cross-entropy loss function, which is normally used in multinomial classification. Does it have something to do with the cross-entropy loss function?
Having said that, one can calculate the loss even without the output labels being one-hot encoded.
A one-hot vector is used in cases where the output is not ordinal. Let's assume you encode your output as integers, giving each label a number.
Integer values have a natural ordered relationship between each other, and machine learning algorithms may be able to understand and harness this relationship, but your labels may be unrelated, with no similarity between them. For categorical variables where no such ordinal relationship exists, integer encoding is not a good choice.
In fact, using this encoding and allowing the model to assume a natural ordering between categories may produce unexpected results, where model predictions fall halfway between categories.
What do I mean by that?
The idea is that if we train an ML algorithm - for example a neural network - it's going to think that a cat (which is 1) is halfway between a dog and a bird, because they are 0 and 2 respectively. We don't want that; it's not true and it's an extra thing for the algorithm to learn.
The same may happen when data is encoded in an n-dimensional space and the vector has continuous values. The result may be hard to interpret and map back to labels.
In this case, a one-hot encoding can be applied to the label representation, as it has a clear interpretation and its values are separated: each is in a different dimension.
If you need more information, or would like to see the reasoning behind one-hot encoding from the perspective of the loss function, see https://www.linkedin.com/pulse/why-using-one-hot-encoding-classifier-training-adwin-jahn/
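
As a minimal sketch of the difference (assuming NumPy and the three-class cat/dog/bird example above; the label array is made up):

    import numpy as np

    # Hypothetical integer labels: 0 = dog, 1 = cat, 2 = bird
    labels = np.array([0, 1, 2, 1])
    num_classes = 3

    # One-hot encode: each label becomes an indicator row with a single 1,
    # so no category sits "between" any other two.
    one_hot = np.eye(num_classes)[labels]
    print(one_hot)
    # [[1. 0. 0.]
    #  [0. 1. 0.]
    #  [0. 0. 1.]
    #  [0. 1. 0.]]

Note that losses such as Keras's sparse_categorical_crossentropy accept the integer labels directly and do the equivalent work internally, which is one reason some courses skip the explicit one-hot step.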

Using text sentiment as a feature in a machine learning model?

I am researching what features I'll have for my machine learning model, given the data I have. My data contains a lot of text data, so I was wondering how to extract valuable features from it. Contrary to my previous belief, this often consists of a representation such as bag-of-words, or something like word2vec (http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).
Because my understanding of the subject is limited, I don't understand why I can't analyze the text first to get numeric values (for example: textBlob.sentiment, https://textblob.readthedocs.io/en/dev/, or Google Cloud's Natural Language API, https://cloud.google.com/natural-language/).
Are there problems with this, or could I use these values as features for my machine learning model?
Thanks in advance for all the help!
Of course, you can convert text input into a single number with sentiment analysis and then use this number as a feature in your machine learning model. There is nothing wrong with this approach.
The question is what kind of information you want to extract from the text data. Sentiment analysis converts text input into a number between -1 and 1, where the number represents how positive or negative the text is. For example, you may want sentiment information from customers' comments about a restaurant to measure their satisfaction. In this case, it is fine to use sentiment analysis to preprocess the text data.
But again, sentiment analysis only gives an idea of how positive or negative the text is. You may want to cluster text data, and sentiment information is not useful in that case, since it does not provide any information about the similarity of texts. Thus, other approaches such as word2vec or bag-of-words are used to represent text data in those tasks, because those algorithms provide a vector representation of the text instead of a single number.
In conclusion, the approach depends on what kind of information you need to extract from the data for your specific task.
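
As a rough sketch of the first approach (assuming the TextBlob library mentioned in the question; the review texts are invented):

    from textblob import TextBlob

    reviews = [
        "The food was fantastic and the staff were friendly.",
        "Terrible service, I will never come back.",
    ]

    # polarity is a float in [-1, 1]: negative to positive
    sentiment_features = [TextBlob(text).sentiment.polarity for text in reviews]
    print(sentiment_features)  # one number per review; the sign indicates sentiment

Each polarity score can then be appended as a single column alongside the rest of your feature matrix.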

Features of vector form of sentences for opinion finding.

I want to find the opinion of a sentence, either positive or negative. For example, take just one sentence:
The play was awesome
If I change it to vector form:
[0,0,0,0]
After searching through the bag of words
bad
naughty
awesome
the vector form becomes
[0,0,0,1]
and the same for other sentences. Now I want to pass these to a machine learning algorithm for training. How can I train the network using these multiple vectors, so that it can find the opinion of unseen sentences? The input size of a neural network is fixed, so this obviously cannot work with raw sentences of varying length. Is there any way? The above procedure is just my thinking. Kindly correct me if I am wrong. Thanks in advance.
Your intuitive input format is a "sentence", which is a string of tokens of arbitrary length. Representing sentences as raw token sequences is not a good choice, because many existing algorithms only work on inputs of a fixed format.
Hence, I suggest running a tokenizer over your entire training set. This gives you vectors whose length equals the size of the dictionary, which is fixed for a given training set (see the sketch below).
Even when the lengths of the sentences vary drastically, the size of the dictionary stays stable.
Then you can apply neural networks (or other algorithms) to the tokenized vectors.
However, the vectors generated by the tokenizer are extremely sparse, because you are working with sentences rather than articles.
You can try LDA (linear discriminant analysis, which is supervised, unlike PCA) to reduce the dimensionality as well as amplify the differences between classes.
That will keep the essential information of your training data while expressing your data at a fixed size, and this "size" need not be too large.
By the way, you may not have to label each word by its attitude, since the opinion of a sentence also depends on other kinds of words.
Simple arithmetic on the number of opinion-expressing words may leave your model highly biased. Better to label the sentences and leave the rest of the job to the classifiers.
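
A minimal sketch of the tokenizer step (assuming scikit-learn; the sentences are invented):

    from sklearn.feature_extraction.text import CountVectorizer

    sentences = [
        "The play was awesome",
        "The play was bad",
        "What a naughty, awesome twist",
    ]

    # Build the dictionary from the whole training set; each sentence then
    # becomes a count vector with one dimension per dictionary word.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(sentences)

    print(vectorizer.get_feature_names_out())  # the fixed dictionary
    print(X.toarray())                         # one fixed-length vector per sentence

Every sentence, whatever its length, maps to a vector of the same dimensionality, which is exactly the fixed-size input a neural network expects.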
For the confusion between PCA and LDA: both are dimensionality-reduction techniques. The difference is as follows. Let each sample be denoted as x (a 1-by-p vector), where p is too large for our liking. We look for a matrix A (p-by-k) in which k is fairly small, so that reduced_x = x*A and, most importantly, reduced_x is still able to represent the characteristics of x. Given labeled data, LDA can provide a proper A that maximizes the distance between the reduced_x of different classes while minimizing the distance within identical classes.
In simple words: compress the data, keep the information.
When you've got reduced_x, you can define the training data as (reduced_x | y), where y is 0 or 1.
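
A hedged sketch of the LDA step (assuming scikit-learn, continuing from the count vectors X in the sketch above, with made-up labels):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Hypothetical sentence labels: 1 = positive opinion, 0 = negative
    y = np.array([1, 0, 1])

    # With two classes, LDA projects onto at most n_classes - 1 = 1 dimension
    lda = LinearDiscriminantAnalysis(n_components=1)
    reduced_X = lda.fit_transform(X.toarray(), y)
    print(reduced_X.shape)  # (3, 1): compressed data that still separates the classes

The pairs (reduced_X, y) are then the training data you feed to the classifier.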

How to prepare feature vectors for text classification when the words in the text are not frequently repeated?

I need to perform text classification on a set of emails, but the words in my text are thinly spread, i.e. the frequency of each word across all the documents is very low; words are not repeated very frequently. Because of this, I think a document-term matrix with frequency as the weighting is not suitable for training the classifiers. Can you please suggest what other kinds of methods I should use?
Thanks
The real problem is that if your words are that sparse, a learned classifier will not generalise to real-world data. However, there are several solutions:
1.) Use more data. This is kind of a no-brainer. However, you cannot only add labeled data; you can also use unlabelled data in a semi-supervised setting.
2.) Use more data (part b). You can look into the transfer-learning setting, where you build a classifier on a large data set with similar characteristics (this might be Twitter streams) and then adapt this classifier to your domain.
3.) Get your processing pipeline right. Your problem might originate from a suboptimal processing pipeline. Are you doing stemming? In an email, a word like "stemming" should be mapped onto "stem" (see the sketch below). This can be pushed even further by using synonym matching against a dictionary.
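
A minimal sketch of the stemming idea (assuming NLTK; the word list is invented):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    words = ["stemming", "stemmed", "stems"]

    # Map inflected forms onto a common stem so their counts are pooled,
    # which densifies the document-term matrix.
    print([stemmer.stem(w) for w in words])  # e.g. ['stem', 'stem', 'stem']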

Encoding large numbers of categorical variables as input data

One-hot encoding doesn't sound like a great idea when you're dealing with hundreds of categories, e.g. a data set where one of the columns is "first name". What's the best approach to encoding this sort of data?
I recommend the hashing trick:
https://en.wikipedia.org/wiki/Feature_hashing#Feature_vectorization_using_the_hashing_trick
It's cheap to compute, easy to use, allows you to specify the dimensionality, and often serves as a very good basis for classification.
For your specific application, I would hash feature-value pairs, like ('FirstName','John'), then increment the bucket for the hashed value.
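
A minimal sketch of this (assuming scikit-learn's FeatureHasher; the names and the dimensionality are made up):

    from sklearn.feature_extraction import FeatureHasher

    # With dict input, string values are hashed as feature-value pairs
    # such as 'FirstName=John', and the matching bucket is incremented.
    samples = [
        {"FirstName": "John", "City": "Boston"},
        {"FirstName": "Jane", "City": "Austin"},
    ]

    hasher = FeatureHasher(n_features=32)  # you choose the dimensionality
    X = hasher.transform(samples)
    print(X.shape)  # (2, 32)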
If you have a large number of categories, a classification algorithm may not work well. Instead, there is another approach: apply a regression algorithm to the data and then train an offset on those outputs. It may give you better results.
A sample code can be found here.
