Compression by comprehension using machine learning techniques - machine-learning

I'm an aspiring data scientist (presently just a software developer) and I've just had this (foolish?) idea.
So far (as far as I know), we've been using compression algorithms based on replacing the standard encoding of data with a smarter one. What if we could compress data by comprehension? For example, by generating a kind of abstract from which we can recover the original data.
Just think of how our minds work, by associating ideas with one another.
Can machine learning techniques learn and understand the data (and how it is represented on disk) so that it can be regenerated from an abstract produced by the algorithm?

Sure, but then you would have to transfer a representation of the associations and "comprehension" to the other end in order to decompress. That representation will likely be much larger than the data you were trying to compress.

There are actually similar ideas that are, at least to some degree, already realized.
For instance, an autoencoder allows for the compression (encoder part) and reconstruction of the original data (decoder part).
This technique, coupled with the idea of a thought vector, which, in a sense, "encodes" a concept/meaning/comprehension in a single vector, would result in something like what you have described.
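A minimal sketch of that autoencoder idea, in Python with Keras (the layer sizes and data shapes are illustrative assumptions, not tuned values):

```python
from tensorflow import keras
from tensorflow.keras import layers

input_dim = 784   # e.g. a flattened 28x28 image
code_dim = 32     # the compact "abstract" the encoder produces

inputs = keras.Input(shape=(input_dim,))
code = layers.Dense(code_dim, activation="relu")(inputs)       # encoder
outputs = layers.Dense(input_dim, activation="sigmoid")(code)  # decoder

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

# Train the network to reproduce its own input from the 32-dim code:
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)
```

Note that the reconstruction is approximate, so this amounts to lossy compression, unlike classic entropy coders, which are exact.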

Related

Practical advice on dealing with very long inputs using an LSTM model?

I built a character-level LSTM model on text data, but ultimately I'm looking to apply this model to very long text documents (such as a novel), where it's important to understand contextual information, such as where in the novel the text occurs.
For these large-scale NLP tasks, is the data usually cut into smaller pieces and concatenated with metadata (such as position within the document, detected topic, etc.) before being fed into the model? Or are there more elegant techniques?
Personally, I have not used LSTMs at the level of depth you are trying to attain, but I do have some suggestions.
One solution to your problem, which you mentioned above, could be to split your document into smaller pieces and analyze those separately. You'll probably have to be creative about it.
Another solution that might be of interest to you is to use a Tree-LSTM model to get that level of depth. Here's the link to the paper. Using the tree model, you could feed in individual characters or words at the lowest level and then feed them upward to higher levels of abstraction. Again, I am not completely familiar with the model, so don't take my word on it, but it could be a possible solution.
Adding a few more ideas to the answer posted by bhaskar, which are used to handle this problem.
You can use an attention mechanism, which is designed to deal with long-term dependencies. Over a long sequence an LSTM tends to forget information, and its next prediction may not depend on all of the sequence information held in its cell state. An attention mechanism helps find reasonable weights for the characters the prediction actually depends on. For more info you can check this link.
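As a rough illustration of what those attention weights compute, here is a minimal NumPy sketch of dot-product attention over a sequence of hidden states (the shapes are arbitrary assumptions):

```python
import numpy as np

def dot_product_attention(query, keys, values):
    """Weighted sum of `values`, weighted by how well each time step's key matches the query."""
    scores = keys @ query                    # one score per time step, shape (T,)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                  # context vector, shape (d,)

# Example: 10 time steps with 64-dimensional hidden states.
T, d = 10, 64
hidden_states = np.random.randn(T, d)
query = np.random.randn(d)
context = dot_product_attention(query, hidden_states, hidden_states)
```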
There is potentially lots of research on this problem. This is a very recent paper on it.
You can also break the sequence and use a seq2seq model, which encodes the features into a low-dimensional space from which the decoder reconstructs them. This is a short article on it.
My personal advice is to break the sequence and then train on the pieces, because a sliding window over the complete sequence is largely able to capture the correlation between consecutive segments; a minimal sketch of such windowing follows.
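A sketch of that sequence-breaking idea, splitting a long document into overlapping character windows (the window and stride sizes are arbitrary assumptions):

```python
def sliding_windows(text, window=1000, stride=500):
    """Yield (relative position, chunk) pairs of overlapping character windows."""
    for start in range(0, max(len(text) - window, 0) + 1, stride):
        # The relative position can be fed to the model as extra metadata.
        yield start / max(len(text), 1), text[start:start + window]
```

The overlap between consecutive windows is what lets the model see shared context across segment boundaries.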

When true positives are rare

Suppose you're trying to use machine learning for a classification task like, let's say, looking at photographs of animals and distinguishing horses from zebras. This task would seem to be within the state of the art.
But if you take a bunch of labelled photographs and throw them at something like a neural network or support vector machine, what happens in practice is that zebras are so much rarer than horses that the system just ends up learning to say 'always a horse' because this is actually the way to minimize its error.
Minimal error that may be, but it's also not a very useful result. What is the recommended way to tell the system 'I want the best guess at which photographs are zebras, even if this creates some false positives'? There doesn't seem to be a lot of discussion of this problem.
One of the things I usually do with imbalanced classes (or skewed data sets) is simply generate more data. I think this is the best approach. You could go out in the real world and gather more data for the rare class (e.g. find more pictures of zebras). You could also generate more data by duplicating examples, possibly with transformations (e.g. flipping horizontally).
You could also pick a classifier that uses an alternative evaluation (performance) metric instead of the usual one, accuracy. Look at precision/recall/F1 score (a concrete sketch follows the links below).
Week 6 of Andrew Ng's ML course talks about this topic: link
Here is another good web page I found on handling imbalanced classes: link
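To make the duplicate-the-rare-class and precision/recall suggestions concrete, here is a minimal scikit-learn sketch on synthetic placeholder data (the model choice is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Imbalanced toy data: roughly 5% "zebras".
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Oversample the rare class by duplicating its examples (the "make copies" idea).
X_rare, y_rare = X_train[y_train == 1], y_train[y_train == 1]
X_dup, y_dup = resample(X_rare, y_rare,
                        n_samples=(y_train == 0).sum(), random_state=0)
X_bal = np.vstack([X_train, X_dup])
y_bal = np.concatenate([y_train, y_dup])

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Evaluate with precision/recall/F1 per class instead of plain accuracy.
print(classification_report(y_test, clf.predict(X_test),
                            target_names=["horse", "zebra"]))
```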
With this type of unbalanced data problem, it is a good approach to learn patterns associated with each class, as opposed to simply comparing classes - this can be done via unsupervised learning first (such as with autoencoders). A good article on this is available at https://www.r-bloggers.com/autoencoders-and-anomaly-detection-with-machine-learning-in-fraud-analytics/amp/. Another suggestion: after running the classifier, the confusion matrix can be used to determine where additional data should be pursued (i.e. many zebra errors).

Improving K Means on some data sets

Does anyone have an idea of how a simple k-means algorithm could be tuned to handle data sets of this form?
The most direct way to handle data of that form while still using k-means is to use a kernelized version of k-means. Two implementations of it exist in the JSAT library (see here https://github.com/EdwardRaff/JSAT/blob/67fe66db3955da9f4192bb8f7823d2aa6662fc6f/JSAT/src/jsat/clustering/kmeans/ElkanKernelKMeans.java).
As Nicholas said, another option is to create a new feature space on which you run k-means. However, this takes some prior knowledge of what kind of data you will be clustering.
Beyond that, you really just need to move to a different algorithm. k-means is a simple algorithm that makes simple assumptions about the world, and when those assumptions are too strongly violated (non-linearly-separable clusters being one of them), you just have to accept that and pick a more appropriate algorithm.
One possible solution to this problem is to add another dimension to your data set, for which there is a split between the two classes.
Obviously this is not applicable in many cases, but if you have applied some sort of dimensionality reduction to your data, then it may be something worth investigating; a toy sketch follows.
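As a toy illustration of both suggestions (a new feature space / an extra dimension), concentric rings defeat plain k-means in (x, y), but an engineered radius feature separates them trivially; a minimal scikit-learn sketch:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

# Two concentric rings: not linearly separable, so k-means on (x, y) fails.
X, _ = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

# The engineered feature: distance from the origin.
radius = np.linalg.norm(X, axis=1, keepdims=True)

# Clustering on the radius alone separates the rings cleanly. In practice
# you might instead append it to the original features and rescale.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(radius)
```

The catch, as noted above, is that you need prior knowledge (here, that the clusters are radial) to know which feature to engineer.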

Neural Network for File Decryption - Possible?

I have already worked with neural networks and know most of the basics about them. In particular, I have experience with regular multi-layer perceptrons. Someone asked me whether the following is possible, and I somehow feel challenged to master the problem :)
The Situation
Let's assume I have a program that can encrypt and decrypt regular ASCII-coded files. I have no idea at all about the specific encryption method, nor about the key used. All I know is that the program can reverse the encryption and thus read the original content.
What I want?
Now my question is: do you think it is possible to train (some kind of) neural network that replicates the exact decryption algorithm with acceptable effort?
My ideas and work so far
I don't have much experience with encryption. Someone suggested simply assuming AES encryption, so I could write a little program to batch-encrypt ASCII-coded files. This would cover the gathering of learning data for supervised learning. Using the encrypted files as input for the neural network and the original files as training targets, I could train any net. But now I am stuck: how would you suggest feeding the input and output data to the neural network? How many input and output neurons would you use?
Since I have no idea what the encrypted files will look like, it might be best to pass the data in binary form. But I can't just use thousands of input and output neurons and pass all bits at the same time. Maybe recurrent networks, feeding one bit after another? That doesn't sound very efficient either.
Another problem is that you can't decrypt partially - meaning you can't be roughly correct. You either got it right or not. In other words, in the end the net error has to be zero. From what I have experienced so far with ANNs, this is nearly impossible to achieve for big networks. So is this problem solvable?
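For the data-gathering step described above, here is a minimal sketch of batch-generating plaintext/ciphertext training pairs, assuming AES-128 in ECB mode via the pycryptodome package (the mode, key size, and sample length are arbitrary choices for illustration):

```python
import os
import random
import string

from Crypto.Cipher import AES          # pycryptodome
from Crypto.Util.Padding import pad

KEY = os.urandom(16)                   # one fixed 128-bit key for all pairs

def random_ascii(n=64):
    """A random printable-ASCII 'file' of n characters."""
    return "".join(random.choices(string.printable, k=n)).encode("ascii")

def make_pair(plaintext):
    """Return (ciphertext, plaintext) as one supervised training example."""
    cipher = AES.new(KEY, AES.MODE_ECB)
    return cipher.encrypt(pad(plaintext, AES.block_size)), plaintext

pairs = [make_pair(random_ascii()) for _ in range(10000)]
```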
Another problem is that you can't decrypt partially - meaning you can't be roughly correct. You either got it right or not.
That's exactly the problem. Neural networks can approximate continuous functions, meaning that a small change in the input values causes a small change in the output value, while encryption functions/algorithms are designed to be as discontinuous as possible.
I think if that worked, people would be doing it. As far as I know, they aren't.
Seriously, if you could just throw a lot of plaintext/ciphertext pairs at a neural network and construct a decrypter, it would be a very effective known-plaintext or chosen-plaintext attack. Yet the attacks of that kind we have against current ciphers are not very effective at all. That means either the entire open cryptographic community has missed the idea, or it doesn't work. I realise this is far from a conclusive argument (it's effectively an argument from authority), but I would suggest it's indicative that this approach won't work.
Say you have two keys A and B that translate ciphertext K into Pa and Pb respectively. Pa and Pb are both "correct" decryptions of ciphertext K. So if your neural network has only K as input, it has no means of actually predicting the correct answer. Most approaches to encryption cracking involve looking at the result to see if it looks like what you're after. For example, readable text is more likely to be the plaintext than apparently random junk. A neural network would need to be good at guessing whether it got the right answer according to what the user would expect the contents to be, which could never be 100% correct.
However, neural networks can in theory learn any function. So if you have enough ciphertext/plaintext pairs for a particular encryption key, then a sufficiently complex neural network can learn to be exactly the decryption algorithm for that particular key.
Also, regarding the continuous-vs-discrete problem, this is basically solved: the outputs pass through something like the sigmoid function, so you just pick a threshold for 1 vs 0 (0.5 could work). With enough training you could, in theory, get the correct 1-vs-0 answer 100% of the time.
The above assumes that you have one network big enough to process the entire file at once. For arbitrarily sized ciphertext, you would probably need to process blocks at a time with an RNN, but I don't know whether that retains the same "compute any function" properties as a traditional network.
None of this is to say that such a solution is practically doable.

NLP and Ruby to characterize quality of writing

I'd like to take a shot at characterizing incoming documents in my app as either "well" or "poorly" written. I realize this is no easy task, but even a rough idea would be useful. I feel like the way to do this would be a naïve Bayes classifier with two classes, but am open to suggestions. So, two questions:
1. Is this method the optimal (taking simplicity into account) way to do this, assuming a large enough training DB?
2. Are there libraries in Ruby (or anything integrable via JRuby or whatever) that I can plug into my Rails app to make this happen with little fuss?
Thanks!
You might try vocabulary vector analysis. Covered some here:
http://en.wikipedia.org/wiki/Semantic_similarity
Basically, you build up a corpus of texts that you deem "well-written" or "poorly-written" and count the frequency of certain words. Make a normalized vector for each corpus, and then compute the distance from those to the vector of each incoming document. I am not a statistician, but I'm told it's similar to Bayesian filtering, though it seems to deal with misspellings and outliers better.
This is not perfect, by any means. Depending on how accurate you need it to be, you will probably still need humans to make the final judgement. But we've had good luck using it as a pre-filter to reduce the number of documents that need human review.
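A minimal sketch of that corpus-vector approach, shown in Python for brevity (the same idea ports to Ruby); the corpora here are placeholder strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpora: replace with real documents you have judged.
good_corpus = ["An example of prose that you judge to be well written."]
bad_corpus = ["an exampel of porly ritten prose u collected"]

vectorizer = TfidfVectorizer()
vectorizer.fit(good_corpus + bad_corpus)

good_vec = vectorizer.transform([" ".join(good_corpus)])
bad_vec = vectorizer.transform([" ".join(bad_corpus)])

def score(document):
    """Positive when closer to the well-written corpus than to the poor one."""
    v = vectorizer.transform([document])
    return (cosine_similarity(v, good_vec)[0, 0]
            - cosine_similarity(v, bad_vec)[0, 0])
```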
Another simple algorithm to check out is the Flesch-Kincaid readability metric. It is quite widely used and should be easy to implement. I assume one of the Ruby NLP libraries has syllable-counting methods.
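For reference, the Flesch-Kincaid grade level is just a formula over word, sentence, and syllable counts. A rough sketch, with a naive vowel-group syllable counter that a real NLP library's syllable method would replace:

```python
import re

def naive_syllables(word):
    """Crude syllable estimate: count runs of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(naive_syllables(w) for w in words)
    return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59
```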
You may find this paper by Burstein, Chodorow, and Leacock on the Criterion essay-evaluation system interesting; it gives a high-level overview of how one particular system handled essay evaluation as well as style correction.
