Are checksums unique? - checksum

Checksums, as we know, are a small amount of data that are computed from a larger set of data (usually to provide error detection). But how are they unique if there's less data? If they're not unique then multiple inputs would, by definition, produce the same checksum, and I would think that would defeat their whole purpose.
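To illustrate the point raised in the question, here is a toy checksum (sum of bytes modulo 256, purely illustrative and not any real algorithm) showing that two different inputs can share the same checksum. Collisions are unavoidable by the pigeonhole principle; checksums only aim to make *likely* transmission errors change the value.

```python
# Toy checksum: sum of the bytes modulo 256 (illustrative only, not a real algorithm).
def toy_checksum(data: bytes) -> int:
    return sum(data) % 256

# Two different inputs, same checksum: collisions must exist because there are
# far more possible inputs than 256 possible checksum values (pigeonhole principle).
print(toy_checksum(b"ab"), toy_checksum(b"ba"))   # both print 195
```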

Related

What can we do with a dataset where 98 percent of the columns are null values?

I want to predict downtime of the servers before it happens. To achieve this, I collected data from different data sources.
One of the data sources is metric data containing cpu-time, cpu-percentage, memory-usage, etc. However, most of the values in this dataset are null; roughly 98% of the many columns are null.
What kind of data preparation techniques can be used to prepare the data before applying it to a prediction algorithm?
I appreciate any help.
If I were in your situation, my first option would be to ignore this data source. With that much missing data, it is unlikely to be a relevant source of information for any ML algorithm.
That being said, if you still want to use this source, you will have to fill the gaps. Inferring the missing data from only 2% of available values is hardly possible, but for that amount of missing data I would advise having a look at Non-Negative Matrix Factorization (NMF) here.
A few versions of this algorithm are implemented in R. To get better results when inferring such a large amount of missing data, you could also read this paper, which combines time-series information (which could be your case) with NMF. I ran some tests with up to 95% of missing data and the results were not so bad; hence, as discussed earlier, you could discard some of your data so that only 80% or 90% is missing, then apply NMF for time series.
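As a rough illustration of this idea, here is a minimal NumPy sketch of masked (weighted) NMF via multiplicative updates, which only fits the observed cells and uses the low-rank reconstruction to fill the missing ones. It assumes the metrics are non-negative; the function name and the toy matrix are mine, not taken from the R packages or the paper mentioned above.

```python
import numpy as np

def masked_nmf(X, mask, rank=5, n_iter=200, eps=1e-9, seed=0):
    """Masked (weighted) NMF via multiplicative updates.

    X    : (n, m) non-negative matrix; missing cells may hold NaN or any value
    mask : (n, m) array, 1.0 where X is observed, 0.0 where missing
    Returns W (n, rank) and H (rank, m); W @ H gives estimates for missing cells.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    Xm = np.nan_to_num(X) * mask          # zero out the unobserved cells
    for _ in range(n_iter):
        WH = (W @ H) * mask               # reconstruction restricted to observed cells
        H *= (W.T @ Xm) / (W.T @ WH + eps)
        WH = (W @ H) * mask
        W *= (Xm @ H.T) / (WH @ H.T + eps)
    return W, H

# Toy usage: a 10x6 non-negative metrics matrix where only ~10% of cells are "observed".
rng = np.random.default_rng(1)
X_full = rng.random((10, 6))
mask = (rng.random((10, 6)) < 0.10).astype(float)   # 1 = observed, 0 = missing
W, H = masked_nmf(X_full * mask, mask, rank=2)
X_imputed = W @ H                                   # estimates for the missing cells
```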
Normally various data imputation techniques could be applied, but with 98% null values I don't think that would be a correct approach: you would be inferring the empty data from just 2% of available information, which would introduce an enormous amount of bias. I would go for an option like this: sort your rows in descending order of the number of non-null columns, so that the rows with the most non-null values come first. Then determine a cutoff from the beginning of the sorted list such that, for example, only 20% of the data is missing in the selected subset. Then apply data imputation. Of course, this assumes that you will have enough data points (rows) left after applying the cutoff, which you may not have, and that the data is not missing at random across rows (if data is missing at random for each row, you cannot use this sorting method at all).
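A minimal pandas sketch of this sorting-and-cutoff idea; the file name "metrics.csv", the 20% threshold, and the mean imputation at the end are placeholders, not a prescribed recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical metrics table with many nulls; "metrics.csv" is a placeholder.
df = pd.read_csv("metrics.csv")

# Sort rows so that those with the most non-null columns come first.
order = df.notna().sum(axis=1).sort_values(ascending=False).index
df_sorted = df.loc[order]

# Keep the longest prefix of rows whose overall missing rate stays below ~20%.
target_missing = 0.20
n_cols = df_sorted.shape[1]
missing_per_row = df_sorted.isna().sum(axis=1).to_numpy()
cum_missing_rate = missing_per_row.cumsum() / (np.arange(1, len(df_sorted) + 1) * n_cols)
cutoff = int((cum_missing_rate <= target_missing).sum())
subset = df_sorted.iloc[:cutoff]

# On this denser subset, a standard imputation is less of a stretch, e.g. column means.
subset = subset.fillna(subset.mean(numeric_only=True))
```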
In any case, I can hardly see a concrete way of getting a meaningful model built by using such a high amount of missing data.
First, there can be many reasons why your data are null. For example, collecting those metrics may not have been planned in the previous version of the project; you then upgraded, but the change is not retroactive, so you only have data from the new version. In that case the 2% are perfectly fine data that simply represent very little of the total volume, because the new version has only been up for X days; and so on.
Anyway, even if you have only 2% non-null data, that does not really matter by itself. What matters is how much data those 2% represent. If it is 2% of 5 billion rows, then it is enough to take "just" the 2% of non-null rows as training data and ignore the rest!
If, on the other hand, the 2% amounts to only a few records, then I really advise you NOT to fill the null values from them, because it will create enormous bias. Furthermore, it means your current process is not ready for a machine learning project: adapt it so you can collect more data.

Should 'deceptive' training cases be given to a Naive Bayes Classifier

I am setting up a Naive Bayes Classifier to try to determine sameness between two records of five string properties. I am only comparing each pair of properties exactly (i.e., with a java .equals() method). I have some training data, both TRUE and FALSE cases, but let's just focus on the TRUE cases for now.
Let's say there are some TRUE training cases where all five properties are different. That means every comparator fails, but the records are actually determined to be the 'same' after some human assessment.
Should this training case be fed to the Naive Bayes Classifier? On the one hand, considering that NBC treats each variable separately, these cases shouldn't totally break it. On the other hand, it certainly seems that feeding in enough of these cases wouldn't be beneficial to the classifier's performance. I understand that seeing a lot of these cases would mean better comparators are required, but I'm wondering what to do in the meantime. Another consideration is that the flip side is impossible; that is, there's no way all five properties could be the same between two records and still have them be 'different' records.
Is this a preferential issue, or is there a definitive accepted practice for handling this?
Usually you will want a training data set that is as representative as feasible of the domain from which you hope to classify observations (which is often difficult). An unrepresentative set may lead to a poorly functioning classifier, particularly in a production environment where varied data are received. That being said, preprocessing may be used to limit the exposure of a classifier trained on a particular subset of data, so it quite depends on the purpose of the classifier.
I'm not sure why you wish to exclude some elements though. Parameter estimation/learning should account for the fact that two different inputs may map to the same output --- that is why you would use machine learning instead of simply using a hashmap. Considering that you usually don't have 'all data' to build your model, you have to rely on this type of inference.
Have you had a look at NLTK? It is in Python, but it seems OpenNLP may be a suitable substitute in Java. You could employ better feature extraction techniques that lead to a model accounting for minor variations in the input strings (see here).
Lastly, it seems to me that you want to learn a mapping from input strings to the classes 'same' and 'not same', i.e. you effectively want to infer a distance measure (just checking). It might make more sense to invest the effort in directly finding a better measure (e.g. for character-transposition issues you could use edit distances). I'm not sure that NB is well suited to your problem, as it attempts to determine a class given an observation (or its features). This class will have to be discernible over many different strings (I'm assuming you are going to concatenate string1 & string2 and offer them to the classifier). Will there be enough structure present to derive such a widely applicable property? This classifier will basically need to deal with all pair-wise 'comparisons', unless you build an NB for each one-vs-many pairing. That does not seem like a simple approach.
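For concreteness, here is a minimal sketch of the comparator-based setup described in the question, using scikit-learn's BernoulliNB on per-field equality indicators. The field names, records, and labels are invented for illustration; this is not the asker's actual pipeline.

```python
from sklearn.naive_bayes import BernoulliNB

# Hypothetical field names; each feature is a per-field exact-match indicator
# (1 = the two values are equal, 0 = they differ), mirroring the .equals() comparators.
FIELDS = ["first_name", "last_name", "street", "city", "phone"]

def match_features(rec_a, rec_b):
    return [int(rec_a[f] == rec_b[f]) for f in FIELDS]

# Tiny invented training set: X = comparator outcomes, y = human 'same record?' verdict.
X = [
    [1, 1, 1, 1, 0],   # four fields agree      -> judged the same record
    [1, 1, 0, 1, 0],   # partial agreement      -> still the same record
    [0, 0, 0, 0, 0],   # nothing agrees         -> judged different
    [0, 1, 0, 0, 0],   # only last name matches -> judged different
]
y = [1, 1, 0, 0]

clf = BernoulliNB()    # one Bernoulli likelihood per comparator, estimated independently
clf.fit(X, y)

rec_a = {"first_name": "Jon", "last_name": "Smith", "street": "Main St", "city": "Oslo", "phone": "123"}
rec_b = {"first_name": "Jon", "last_name": "Smith", "street": "Elm St", "city": "Oslo", "phone": "999"}
print(clf.predict_proba([match_features(rec_a, rec_b)]))   # [P(different), P(same)]
```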

What are other ways to define Normalized Data?

I know normalization means reducing redundancy in data sets, but what is the definition of normalized data?
Can I describe it as the "simplest form" of a data set?
Normalization is not necessarily related to redundancy; it is related to reduction. For instance, in my daily code, normalizing is mostly about mapping large intervals onto [0, 1]. Although data normalization has a particular meaning for database administrators, it doesn't have a single definition if you look at it without a context.
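A minimal sketch of the interval-mapping sense of normalization mentioned above (min-max scaling onto [0, 1]); the function name is illustrative.

```python
import numpy as np

def min_max_normalize(x):
    """Map the values of x linearly onto [0, 1] (the 'reduction' sense of normalization)."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:                      # constant column: avoid division by zero
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)

print(min_max_normalize([10, 20, 45, 100]))   # approximately [0. 0.111 0.389 1.]
```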

When are uni-grams more suitable than bi-grams (or higher N-grams)?

I am reading about n-grams and I am wondering whether there are cases in practice where uni-grams are preferred over bi-grams (or higher N-grams). As I understand it, the bigger N is, the greater the complexity of computing the probabilities and establishing the vector space. But apart from that, are there other reasons (e.g. related to the type of data)?
This boils down to data sparsity: as your n-gram length increases, the number of times you will see any given n-gram decreases. In the most extreme example, if you have a corpus where the maximum document length is n tokens and you are looking for an m-gram where m = n + 1, you will of course have no data points at all, because a sequence of that length simply cannot occur in your data set. The sparser your data set, the worse you can model it. For this reason, even though a higher-order n-gram model in theory contains more information about a word's context, it does not generalize easily to other data sets (known as overfitting), because the number of events (i.e. n-grams) it has seen during training becomes progressively smaller as n increases. On the other hand, a lower-order model lacks contextual information and so may underfit your data.
For this reason, if you have a relatively large number of token types (i.e. the vocabulary of your text is very rich) but each of these types has a very low frequency, you may get better results with a lower-order n-gram model. Similarly, if your training data set is very small, you may do better with a lower-order n-gram model. However, assuming that you have enough data to avoid over-fitting, you then get better separability of your data with a higher-order model.
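A small illustration of the sparsity argument: counting n-gram types in a toy corpus shows how the average number of occurrences per type drops as n grows (the corpus and the resulting numbers are purely illustrative).

```python
from collections import Counter

text = "the cat sat on the mat the dog sat on the rug".split()   # tiny toy corpus

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

for n in (1, 2, 3):
    counts = Counter(ngrams(text, n))
    avg = sum(counts.values()) / len(counts)
    print(f"{n}-grams: {len(counts)} distinct types, "
          f"{avg:.2f} occurrences per type on average")
```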
Usually, n-grams with n greater than 1 are better, as they carry more information about the context in general. However, unigrams are sometimes computed alongside bigrams and trigrams and used as a fallback for them. This is also useful if you want higher recall rather than precision, for instance when searching for all possible uses of the verb "make".
Let's use statistical machine translation as an example:
Intuitively, the best scenario is that your model has seen the full sentence (let's say a 6-gram) before and knows its translation as a whole. If that is not the case, you try to divide it into smaller n-grams, keeping in mind that the more you know about a word's surroundings, the better the translation. For example, if you want to translate "Tom Green" into German and you have seen the bi-gram, you will know it is a person's name and should remain as it is; but if your model has never seen it, you would fall back to unigrams and translate "Tom" and "Green" separately, so "Green" would be translated as the colour "Grün", and so on.
Also, in search, knowing more about the surrounding context makes the results more accurate.
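A toy sketch of the back-off idea from the translation example: look up the longest known n-gram first and fall back to unigrams. The phrase table here is invented for illustration and is in no way a real SMT system.

```python
# Toy phrase table: longer n-grams carry more context when they are available.
phrase_table = {
    ("tom", "green"): "Tom Green",   # seen as a name -> kept as-is
    ("green",): "grün",              # unigram fallback: translated as the colour
    ("tom",): "Tom",
}

def translate(tokens, max_n=2):
    out, i = [], 0
    while i < len(tokens):
        # Try the longest n-gram first, then back off to shorter ones.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in phrase_table:
                out.append(phrase_table[key])
                i += n
                break
        else:
            out.append(tokens[i])    # unknown token: copy it through unchanged
            i += 1
    return " ".join(out)

print(translate(["tom", "green"]))   # -> 'Tom Green' (bigram match, kept as a name)
print(translate(["green"]))          # -> 'grün'      (unigram fallback, the colour)
```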

Association Rule Mining on Categorical Data with a High Number of Values for Each Attribute

I am struggling with association rule mining for a data set that has a lot of binary attributes but also a lot of categorical attributes. Converting the categorical attributes to binary is theoretically possible but not practical. I am searching for a technique to overcome this issue.
As an example, take car specification data: to run association rule mining, the car colour attribute would have to be binary, and there are a lot of colours to convert into binary columns (my data set is insurance claims, and it is much worse than this example).
Association rule mining doesn't use "attributes". It processes market-basket-type data.
It does not make sense to preprocess it into binary attributes, because you would then need to convert the binary attributes back into items again (in the worst case, you would translate your "color=blue" item into "color_red=0, color_black=0, ... color_blue=1" if you are also looking for negative rules).
Different algorithms - and different implementations of the theoretically same algorithm, unfortunately - will scale very differently.
APRIORI is designed to scale well with the number of transactions, but not very well with the number of distinct items that have minimum support, in particular if you expect only short itemsets to be frequent. Other algorithms such as Eclat and FP-Growth may be much better there, but YMMV.
First, try to convert the data set into a market basket format, keeping every item that you consider relevant and discarding everything else. Then start with a high minimum support and lower it until you start getting results; running with too low a minimum support may simply run out of memory or take a very long time.
Also, make sure to get a good implementation. A lot of things that claim to be APRIORI are only half of it, and are incredibly slow.
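A minimal sketch of the market-basket conversion: each row becomes a transaction of "attribute=value" items, which can then be fed to an Apriori implementation. The records are invented, and mlxtend is assumed here only as an example library for the encoding and mining steps.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Hypothetical categorical records (one row per car / claim).
records = pd.DataFrame({
    "color":  ["blue", "red", "blue", "black"],
    "fuel":   ["petrol", "diesel", "petrol", "petrol"],
    "region": ["north", "north", "south", "north"],
})

# Turn each row into a transaction of "attribute=value" items instead of
# exploding every category into its own 0/1 column by hand.
transactions = [
    [f"{col}={val}" for col, val in row.items() if pd.notna(val)]
    for _, row in records.iterrows()
]

te = TransactionEncoder()
basket = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Start with a fairly high minimum support and lower it gradually.
frequent = apriori(basket, min_support=0.5, use_colnames=True)
print(frequent)
```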
