What is the absolute best theoretical lossless compression of data possible?

To start with:
Assume that the algorithm takes finite space.
Assume that the computational resources are infinite.
What form would the result of such compression take? My intuition tells me it would be some form of a pRNG-like algorithm with an irreducible seed that gives rise to the compressed data. Could there be something even more efficient?
Now what if we assume all resources are finite? Would the problem of perfect compression equate to the problem of perfect pattern recognition? What form would the result of such compression take? Factorization into primes? Something else? And would having such an algorithm imply that the problem of AI has been cracked?
As a side question, have there been successful attempts to use machine learning for data compression?

There is a mathematical proof that your question cannot be answered in general. The best compression possible is not computable. See Kolmogorov complexity.
Compression only works when the data can be modeled in some way to expose redundancy.
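To make the point about redundancy concrete, here is a minimal sketch using Python's standard `zlib` module. The buffer sizes, the seed and the helper name `compressed_size` are arbitrary choices for illustration. The size a real compressor reaches is only an upper bound on the shortest possible description, never the Kolmogorov complexity itself.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed form: an upper bound on the shortest description."""
    return len(zlib.compress(data, 9))

n = 99_999
redundant = b"abc" * (n // 3)        # highly regular: the model "repeat abc" exposes the redundancy
rng = random.Random(0)               # fixed seed so the run is repeatable
noisy = bytes(rng.getrandbits(8) for _ in range(n))   # no structure that zlib can model

print(len(redundant), compressed_size(redundant))
print(len(noisy), compressed_size(noisy))
```

Note that the second buffer does have a short description (the generator plus its seed), which is the pRNG intuition from the question. zlib cannot discover it, and no algorithm can find the shortest description for arbitrary input.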

Can a neural network optimize traditional image processing algorithms?

I don't mean that a neural network should do the work of a traditional image processing algorithm. What I want to know is whether there is a kind of neural network that can take the parameters of the traditional method as input and output more universal parameters that don't require manual adjustment. Intuitively, my idea is less efficient than using neural networks directly, but I don't know much about the mathematics of neural networks.

If I understood correctly, you have a traditional method (let's say thresholding) and you want to find its best parameters using an ANN. It is possible, but you have to supply a lot of training data, which needs to be created, processed and evaluated, so it will take a lot of time. AFAIK many mobile phones with AI-assisted cameras use this method to find the best aperture, exposure, etc.
First of all, thank you very much. I still have two things to figure out. First, if I wanted to get a relatively optimal parameter (or set of parameters), what data set would I need to build (such as some kind of error between input and output, and the threshold)? Second, taking your example, is selecting the optimal threshold through a neural network more efficient or better in practice than traversal or Otsu? To be honest, I wonder if this is really more efficient than training on input and output directly using neural networks.
For your second question, Otsu works well only when the histogram has two distinct peaks. Thresholding is a simple function, but the cut-off value depends on your objective; there is no single "best" value valid for every case. So if you want to train a model for thresholding, I think you have to come up with separate models for each case (a model for thresholding bright objects, another for darker ones, etc.). Maybe an additional output parameter for determining the aim works, but I am not sure. Will it be more efficient and better? That depends on the case (and your definition of better). Otsu, traversal and adaptive thresholding do not work all the time (Otsu in particular has very specific use cases). If they work for your case, excellent. If not, things get messy. So to answer your question: it depends on the problem at hand.
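For reference, Otsu's method picks the threshold that maximizes the between-class variance of the intensity histogram, which is why it relies on two separable peaks. The following is a minimal plain-Python sketch; the function name `otsu_threshold` and the toy histogram are illustrative and not from this thread.

```python
def otsu_threshold(hist):
    """Return the threshold t maximizing between-class variance (pixels <= t form class 0)."""
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0        # number of pixels in class 0 so far
    sum0 = 0      # sum of intensities in class 0 so far
    for t, h in enumerate(hist):
        w0 += h
        sum0 += t * h
        if w0 == 0:
            continue              # class 0 still empty, nothing to compare
        w1 = total - w0
        if w1 == 0:
            break                 # class 1 is empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2   # proportional to between-class variance
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Made-up bimodal histogram: one dark peak and one bright peak.
hist = [0] * 256
hist[50], hist[200] = 100, 100
print(otsu_threshold(hist))
```

On a histogram with a single broad peak the maximum is shallow and the returned value says little, which matches the caveat above.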
For the first question, TBF, it is quite difficult to work with images in traditional ANNs. Images have a lot of pixels, so standard ANNs struggle with the input size. Moreover, when the location or scale of an object in the image changes, the whole pixel data changes even though the content is the same (these are the reasons why CNNs are superior to plain ANNs for images). For these reasons it is better to use processed metrics that contain condensed, location-invariant information. E.g. for thresholding, you can feed in the histogram and get back a threshold value. For that you need an ANN with 256 input neurons (for the intensity histogram of an 8-bit grayscale image), 1 output neuron, and 1-2 hidden layers of fully connected neurons (128 maybe?). Your training data will be a set of histograms as input and the corresponding best threshold value for each histogram. Once training is finished, you can give the ANN a histogram it has never seen before and it will return a threshold value based on its training.
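The architecture described above can be sketched as a forward pass in plain Python. Everything here is an assumption made for illustration: the weights are random stand-ins for trained ones, the ReLU hidden layer and the sigmoid output scaled to the 8-bit range are one possible choice, and `predict_threshold` is a hypothetical name. In practice you would build and train this with a neural network library.

```python
import math
import random

N_BINS, N_HIDDEN = 256, 128      # sizes suggested above; 128 is a guess, not a tuned value

rng = random.Random(0)
# Hypothetical, untrained parameters: small random weights stand in for learned ones.
W1 = [[rng.uniform(-0.05, 0.05) for _ in range(N_BINS)] for _ in range(N_HIDDEN)]
b1 = [0.0] * N_HIDDEN
W2 = [rng.uniform(-0.05, 0.05) for _ in range(N_HIDDEN)]
b2 = 0.0

def predict_threshold(hist):
    """Forward pass: normalized histogram -> one hidden ReLU layer -> value in [0, 255]."""
    total = sum(hist) or 1
    x = [h / total for h in hist]            # normalize so the image size does not matter
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    z = sum(w * h for w, h in zip(W2, hidden)) + b2
    return 255.0 / (1.0 + math.exp(-z))      # sigmoid squashes the output into the 8-bit range

# Made-up histogram with two peaks.
hist = [0] * N_BINS
hist[60], hist[190] = 500, 500
print(predict_threshold(hist))               # meaningless until the weights are trained
```

Training would then adjust `W1`, `b1`, `W2` and `b2` so that the output matches the labelled best threshold for each training histogram.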
What I want is a model that can output different parameters (parameter sets) for different input images, so I think that with a good enough data set it should be somewhat universal.
Most likely, but your data set should be quite inclusive of expected images (in terms of metrics and features), which means it has to be large.
Also, I don't know much about modeling. Can I use a function of the output parameters (which might be a function of the result of the traditional method) as the error in back-propagation by creating a custom loss function?
I think so, but training the model will be more involved than using predefined loss functions because, well, you have to write them. Also you have to test they work as expected.
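One caveat when writing such a loss: back-propagation needs the loss to be differentiable with respect to the network output. A hard threshold is a step function, so a loss computed on its result has zero gradient almost everywhere. A common workaround, which is an assumption here and not something proposed in this thread, is to soften the step with a sigmoid. The names `soft_threshold_loss` and `sharpness`, and all data values, are made up for the sketch.

```python
import math

def soft_threshold_loss(pixels, mask, t, sharpness=0.1):
    """Mean squared error between a sigmoid-softened threshold and a 0/1 target mask.

    Returns (loss, d loss / d t) so the gradient can be passed back to the network.
    """
    loss, grad = 0.0, 0.0
    for p, m in zip(pixels, mask):
        s = 1.0 / (1.0 + math.exp(-sharpness * (p - t)))       # close to 1 where the pixel is above t
        loss += (s - m) ** 2
        grad += 2.0 * (s - m) * s * (1.0 - s) * (-sharpness)   # chain rule through the sigmoid
    n = len(pixels)
    return loss / n, grad / n

pixels = [40, 50, 60, 180, 190, 200]   # made-up intensities
mask = [0, 0, 0, 1, 1, 1]              # made-up target: the three bright pixels are foreground
t = 20.0                               # deliberately bad starting threshold
for _ in range(2000):
    loss, grad = soft_threshold_loss(pixels, mask, t)
    t -= 500.0 * grad                  # plain gradient descent on the threshold itself
print(t, loss)
```

In a real setup `t` would be the network output and the returned gradient would be propagated into the weights. As the reply above notes, it is worth checking a hand-written gradient against a finite-difference estimate before trusting it.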

Are high values for c or gamma problematic when using an RBF kernel SVM?

I'm using WEKA/LibSVM to train a classifier for a term extraction system. My data is not linearly separable, so I used an RBF kernel instead of a linear one.
I followed the guide from Hsu et al. and iterated over several values for both c and gamma. The parameters which worked best for classifying known terms (test and training material differ of course) are rather high, c=2^10 and gamma=2^3.
So far the high parameters seem to work ok, yet I wonder if they may cause any problems further on, especially regarding overfitting. I plan to do another evaluation by extracting new terms, yet those are costly as I need human judges.
Could anything still be wrong with my parameters, even if both evaluations turn out positive? Do I perhaps need another kernel type?
Thank you very much!

In general you have to perform cross-validation to find out whether the parameters are all right or whether they lead to overfitting.
From the "intuition" perspective, it looks like a highly overfitted model. A high value of gamma means that your Gaussians are very narrow (concentrated around each point), which combined with a high C value will result in memorizing most of the training set. If you check the number of support vectors, I would not be surprised if it were 50% of your whole data. Another possible explanation is that you did not scale your data. Most ML methods, especially SVMs, require the data to be properly preprocessed. In particular, you should normalize (standardize) the input data so that it is more or less contained in the unit sphere.
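The interaction between gamma and the data scale can be seen directly from the RBF kernel, k(x, y) = exp(-gamma * ||x - y||^2). This is a small plain-Python sketch with made-up feature rows; `rbf` and `standardize` are illustrative helpers, not WEKA or LibSVM functions.

```python
import math
import statistics

def rbf(x, y, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]   # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

# Hypothetical feature rows on very different scales; the values are made up.
rows = [[120.0, 3.0], [150.0, 4.0], [900.0, 12.0], [950.0, 11.0]]
gamma = 2 ** 3

raw_similarity = rbf(rows[0], rows[1], gamma)            # unscaled: even near neighbours look unrelated
scaled = standardize(rows)
scaled_similarity = rbf(scaled[0], scaled[1], gamma)     # scaled: the same gamma gives a usable similarity
print(raw_similarity, scaled_similarity)
```

With unscaled features, every point is similar only to itself, which is the memorizing behaviour described above. After scaling, the same gamma can be perfectly reasonable.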

RBF seems like a reasonable choice, so I would keep using it. A high value of gamma is not necessarily a bad thing; it depends on the scale of your data. While a high C value can lead to overfitting, it is also affected by the scale, so in some cases it might be just fine.
If you think that your dataset is a good representation of the whole data, then you could use cross-validation to test your parameters and have some peace of mind.
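The cross-validation both answers recommend can be sketched as follows. The splitter is generic; `majority_baseline` is only a stand-in for "train the SVM with a given (C, gamma) on the training indices and return accuracy on the held-out indices", and the labels are made up.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation over n samples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]            # k disjoint folds of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def cross_validate(fit_and_score, n, k=5):
    """Average the held-out score of fit_and_score(train_idx, test_idx) over k folds."""
    scores = [fit_and_score(tr, te) for tr, te in k_fold_indices(n, k)]
    return sum(scores) / len(scores)

labels = [0] * 70 + [1] * 30     # made-up labels

def majority_baseline(train_idx, test_idx):
    """Stand-in model: predict the majority class of the training fold."""
    ones = sum(labels[i] for i in train_idx)
    guess = 1 if ones * 2 > len(train_idx) else 0
    return sum(labels[i] == guess for i in test_idx) / len(test_idx)

cv_score = cross_validate(majority_baseline, len(labels))
print(cv_score)
```

Running this once per (C, gamma) pair and comparing the averaged scores is the grid search procedure the question already follows. A large gap between training accuracy and the cross-validated score is the usual sign of overfitting.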

How to verify what's noise what's real data?

I am wondering how I can claim that I have correctly captured the "noise" in my data.
To be more specific, take Principal Component Analysis as an example: in PCA, after doing the SVD, we can zero out the small singular values and reconstruct the original matrix using a low-rank approximation.
Can I then claim that what has been discarded is indeed noise in the data?
Is there any evaluation metric for this?
The only method I can come up with is to simply subtract the reconstructed data from the original data.
Then try to fit a Gaussian to the residual and see whether the fit is good.
Is that a conventional method in fields like DSP?
BTW, I think in typical machine learning tasks the measure would be the follow-up classification performance, but since I am building a purely generative model, there are no labels attached.

The way I see it, the definition of noise would depend on the domain of the problem. Therefore the strategy for reducing it would be different on each domain.
For instance, having a noisy signal in problems like seismic formation classification, or a noisy image in a face classification problem, is drastically different from the noise produced by improperly tagged data in a medical diagnostic problem, or the noise caused by similar words with different meanings in a language classification problem for documents.
When the noise comes from a given data point (or a set of them), the solution is as simple as ignoring those data points (although identifying them is the challenging part most of the time).
From your example I guess you are more concerned about the case where the noise is embedded in the features (as in the seismic example). Sometimes people pre-process the data with a noise reduction filter such as the median filter (http://en.wikipedia.org/wiki/Median_filter). Others reduce the dimension of the data to reduce noise, and PCA is used in this scenario.
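As an illustration of the first strategy, a one-dimensional median filter fits in a few lines of plain Python. The window size and the signal are made up, and truncating the window at the edges is one of several possible boundary conventions.

```python
import statistics

def median_filter(signal, window=3):
    """Replace each sample with the median of its neighbourhood (truncated window at the edges)."""
    half = window // 2
    return [statistics.median(signal[max(0, i - half): i + half + 1])
            for i in range(len(signal))]

noisy = [1, 1, 9, 1, 1, 2, 2, -7, 2, 2]    # made-up signal with two impulse spikes
print(median_filter(noisy))                # the spikes 9 and -7 are removed, the step from 1 to 2 is kept
```

The median filter suits impulse-like noise. It is not a substitute for PCA when the noise is spread across many correlated features.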
Both strategies are valid, and normally people try both and cross-validate them to see which one gave better results.
What you did is a good metric for checking Gaussian noise. However, for non-Gaussian noise your metric can give you false negatives (a bad fit even though the noise reduction is good).
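A quick way to check whether a residual looks Gaussian is to compare its skewness and excess kurtosis with zero, the values a Gaussian has. The sketch below uses simulated residuals because the thread contains no data; the helper name and sample sizes are arbitrary. Formal normality tests such as Shapiro-Wilk or Kolmogorov-Smirnov are the more rigorous route.

```python
import random
import statistics

def skew_and_excess_kurtosis(xs):
    """Sample skewness and excess kurtosis; both are close to 0 for Gaussian data."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    z = [(x - m) / s for x in xs]
    skew = statistics.fmean([v ** 3 for v in z])
    kurt = statistics.fmean([v ** 4 for v in z]) - 3.0
    return skew, kurt

rng = random.Random(0)
gaussian_residual = [rng.gauss(0.0, 1.0) for _ in range(20_000)]      # simulated Gaussian residual
uniform_residual = [rng.uniform(-1.0, 1.0) for _ in range(20_000)]    # simulated non-Gaussian residual

print(skew_and_excess_kurtosis(gaussian_residual))
print(skew_and_excess_kurtosis(uniform_residual))
```

In the PCA setting the input would be the flattened difference between the original matrix and its low-rank reconstruction. A clearly non-zero kurtosis shows the residual is not Gaussian, but as noted above that alone does not prove the discarded part was signal.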

Personally, if you want to prove the efficacy of the noise reduction, I'd use a task-based evaluation. I assume you're doing this for some purpose, to solve some problem? If so, solve the task with the original noisy matrix and the new clean one. If the latter works better, what was discarded was noise, for the purposes of the task you're interested in. I think some objective measure of noise is pretty hard to define.

I found this resource. It is very useful, but it takes a good amount of time to understand.
https://sci2s.ugr.es/noisydata#Introduction%20to%20Noise%20in%20Data%20Mining

Compression algorithm for a bit stream

I am looking for a good algorithm for bit stream compression (packet payload compression).
I would like to avoid algorithms that are based on symbol probability. I have already tried the LZ family algorithms, and found none of them useful, even with BWT.
I am trying to accomplish a minimum compression percentage of 30%, but have only managed 3-5% using RLE.
What is a good algorithm that has a compression above 30%?

If you have no knowledge about your input data, it's hard to achieve good compression (just like a general purpose compressor).
But at least you can try a context-based model: use several preceding bits as context to predict the probability of the next bit, then pass that probability to a range coder.
Further compression can be achieved with a context mixing model without byte alignment. See http://mattmahoney.net/dc/dce.html#Section_43.
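Before writing a range coder, it can be worth estimating what such a context model would achieve on your packets. The sketch below sums the ideal code length, -log2(p), of each bit under a simple adaptive model; a range coder driven by the same probabilities would land close to that figure. The function name, the context order and the test streams are assumptions for illustration.

```python
import math
import random

def estimated_compressed_bits(bits, order=8):
    """Ideal code length, in bits, under an adaptive model with `order` preceding bits as context."""
    counts = {}                       # context -> [times a 0 followed, times a 1 followed]
    ctx = 0
    mask = (1 << order) - 1
    total = 0.0
    for b in bits:
        c0, c1 = counts.setdefault(ctx, [1, 1])     # start every context from a uniform prior
        p = (c1 if b else c0) / (c0 + c1)           # probability the model gave to the actual bit
        total += -math.log2(p)
        counts[ctx][b] += 1                         # update after coding so a decoder can mirror it
        ctx = ((ctx << 1) | b) & mask               # slide the context window by one bit
    return total

structured = [1, 1, 0, 1, 0, 0, 0, 1] * 1000        # made-up stream with a repeating pattern
rng = random.Random(0)
patternless = [rng.getrandbits(1) for _ in range(8000)]

print(len(structured), estimated_compressed_bits(structured))
print(len(patternless), estimated_compressed_bits(patternless))
```

If the estimate for your real payloads stays near the input size for every context order you try, the data has little redundancy for this kind of model and a 30% target is unlikely to be reachable with it.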

What algorithm would you use for clustering based on people attributes?

I'm pretty new in the field of machine learning (even if I find it extremely interesting), and I wanted to start a small project where I'd be able to apply some stuff.
Let's say I have a dataset of persons, where each person has N different attributes (only discrete values, each attribute can be pretty much anything).
I want to find clusters of people who exhibit the same behavior, i.e. who have a similar pattern in their attributes ("look-alikes").
How would you go about this? Any thoughts to get me started?
I was thinking about using PCA to reduce the dimensionality, since we can have an arbitrary number of dimensions. K-Means? I'm not sure in this case. Any ideas on what would be best suited to this situation?
I do know how to code all those algorithms, but I'm truly missing some real world experience to know what to apply in which case.

K-means using the n-dimensional attribute vectors is a reasonable way to get started. You may want to play with your distance metric to see how it affects the results.

The first step in pretty much any clustering algorithm is to find a suitable distance function. Many algorithms, such as DBSCAN, can then be parameterized with this distance function (at least in a decent implementation; some, of course, only support Euclidean distance).
So start with considering how to measure object similarity!
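For purely discrete attributes, one simple starting point is the normalized Hamming distance: the fraction of attributes on which two people differ. This is a sketch with made-up attribute values, not a recommendation for any particular dataset. Euclidean distance on arbitrary category codes is generally not meaningful, which is one reason to think about the distance first.

```python
def hamming_distance(a, b):
    """Fraction of attributes on which two people differ (0 = identical, 1 = nothing in common)."""
    if len(a) != len(b):
        raise ValueError("attribute vectors must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Made-up people with three discrete attributes each.
alice = ("engineer", "cyclist", "vegetarian")
bob = ("engineer", "cyclist", "omnivore")
carol = ("teacher", "driver", "omnivore")

print(hamming_distance(alice, bob), hamming_distance(alice, carol))
```

A function like this can be plugged into algorithms that accept a custom distance, such as DBSCAN or hierarchical clustering. If some attributes matter more than others, a weighted variant is a natural next step.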

In my opinion you should also try the expectation-maximization (EM) algorithm. On the other hand, you must be careful when using PCA, because it may discard dimensions that are relevant to clustering.
