Association Rule Mining on Categorical Data with a High Number of Values for Each Attribute - machine-learning

I am struggling with association rule mining for a data set that has a lot of binary attributes but also a lot of categorical attributes. Converting the categorical attributes to binary is theoretically possible but not practical, so I am searching for a technique to overcome this issue.
As an example, take car specifications: to run association rule mining, the car color attribute would have to be binary, and there are a lot of colors to be converted (my data set is insurance claims, and it is much worse than this example).

Association rule mining doesn't use "attributes"; it processes market-basket-style data.
It does not make sense to preprocess the data into binary attributes, because you would then need to convert those binary attributes back into items again (worst case, you would translate your "color=blue" item into "color_red=0, color_black=0, ..., color_blue=1" if you are also looking for negative rules).
Different algorithms - and different implementations of the theoretically same algorithm, unfortunately - will scale very differently.
APRIORI is designed to scale well with the number of transactions, but not very well with the number of distinct items that have minimum support, in particular if you expect only short itemsets to be frequent. Other algorithms such as Eclat and FP-Growth may do much better there. But YMMV.
First, try to convert the data set into a market basket format, keeping only the items you consider relevant and discarding everything else. Then start with a high minimum support and lower it until you start getting results. Running with too low a minimum support may simply run out of memory, or may take a very long time.
Also, make sure to get a good implementation. A lot of things that claim to be APRIORI are only half of it, and are incredibly slow.
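As a rough illustration of the market-basket conversion and a first frequent-itemset run, here is a minimal sketch using the mlxtend library (one possible implementation; nothing above requires mlxtend specifically, and the toy records are made up):

    # Convert categorical records into transactions of "attribute=value" items,
    # then mine frequent itemsets, starting with a high minimum support.
    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori

    records = [
        {"color": "blue", "fuel": "petrol", "sunroof": "yes"},
        {"color": "blue", "fuel": "diesel", "sunroof": "no"},
        {"color": "red", "fuel": "petrol", "sunroof": "yes"},
    ]
    transactions = [[f"{k}={v}" for k, v in r.items()] for r in records]

    te = TransactionEncoder()
    basket = pd.DataFrame(te.fit(transactions).transform(transactions),
                          columns=te.columns_)

    # Lower min_support only if nothing frequent shows up.
    print(apriori(basket, min_support=0.6, use_colnames=True))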

Related

Ignore columns in SMOTE oversampling

I have six feature columns and one target column, which is imbalanced.
Can I apply an oversampling method like ADASYN or SMOTE so that synthetic records are created only from the four columns X1, X2, X3, X4, while the Month and year columns are copied over unchanged (kept constant)?
Current data: (example table omitted). Expected result: the method should create synthetic records by up-sampling target class '1', so the number of records increases, but the added records should keep the Month and year values unchanged (example table omitted).
From a programming perspective, an identical question asked in the relevant GitHub repo back in 2017 was answered negatively:
[Question]
I have a data frame that I want to apply smote to but I wish to only use a subset of the columns. The other columns contain additional data for each sample and I want each new sample to contain the original info as well
[Answer]
There is no way to do that apart from extracting the columns into a new matrix and processing it with SMOTE.
Even if you generate new samples, you have to decide what to put as values there, so I don't see how such a feature could be added.
Answering from a modelling perspective, this is not a good idea and, even if you could find a programming workaround, you should not attempt it; arguably, this is why the imbalanced-learn developer above was dismissive even at the thought of adding such a feature to the SMOTE implementation.
Why is that? Well, synthetic oversampling algorithms, like SMOTE, essentially use some variant of a k-nn approach in order to create artificial samples "between" the existing ones. Given this approach, it goes without saying that, in order for these artificial samples to be indeed "between" the real ones (in a k-nn sense), all the existing (numerical) features must be taken into account.
If, by employing some programming alchemy, you manage to produce new SMOTE samples based only on a subset of your features, putting the unused features back in will destroy any notion of proximity and "betweenness" of these artificial samples to the real ones, thus compromising the whole enterprise by introducing a huge bias into your training set.
In short:
If you think your Month and year are indeed useful features, just include them in SMOTE; you may get some nonsensical artificial samples, but this should not be considered a (big) problem for the purpose here.
If not, then maybe you should consider removing them altogether from your training.
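If you do keep Month and year and they are effectively categorical, one option worth knowing about (not something either answer above prescribes) is imbalanced-learn's SMOTENC, which interpolates only the continuous features and assigns categorical values from the nearest neighbours, so the generated Month/year values at least remain valid categories. A minimal sketch, assuming columns 4 and 5 hold Month and year:

    # Hypothetical layout: X1..X4 continuous, then Month and year as categorical.
    import numpy as np
    from imblearn.over_sampling import SMOTENC

    rng = np.random.default_rng(0)
    X = np.column_stack([
        rng.normal(size=(200, 4)),                 # X1..X4
        rng.integers(1, 13, size=(200, 1)),        # Month
        rng.integers(2018, 2021, size=(200, 1)),   # year
    ])
    y = np.array([1] * 20 + [0] * 180)             # imbalanced target

    smote = SMOTENC(categorical_features=[4, 5], random_state=42)
    X_res, y_res = smote.fit_resample(X, y)
    print(X_res.shape, np.bincount(y_res))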

Are there good ways to reduce the size of a vocabulary in natural language processing?

While working on tasks like text classification and QA, the vocabulary generated from the corpus is usually too large, containing a lot of 'unimportant' words. The most popular ways I've seen to reduce the vocabulary size are discarding stop words and words with low frequencies.
For example, in gensim
gensim.utils.prune_vocab(vocab, min_reduce, trim_rule=None):
Remove all entries from the vocab dictionary with count smaller than min_reduce.
Modifies vocab in place, returns the sum of all counts that were pruned.
But in practice, setting the minimum count is empirical and does not seem very principled. I notice that the term frequencies of the words in the vocabulary often follow a long-tail distribution; is it a good approach to keep only the top-K words that account for X% (95%, 90%, 85%, ...) of the total term frequency? Or are there other sensible ways to reduce the vocabulary without seriously affecting the NLP task?
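To make the idea concrete, the top-K-by-coverage pruning I have in mind would look something like this (plain Python, toy corpus):

    # Keep the most frequent words that together cover e.g. 95% of all tokens.
    from collections import Counter

    def prune_by_coverage(counts, coverage=0.95):
        total = sum(counts.values())
        kept, running = set(), 0
        for word, count in counts.most_common():
            kept.add(word)
            running += count
            if running / total >= coverage:
                break
        return kept

    counts = Counter("the cat sat on the mat while the dog ran".split())
    print(prune_by_coverage(counts, coverage=0.75))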
There are indeed a few recent developments that try to counteract this problem. The most notable ones are probably subword units (also known as Byte Pair Encodings, or BPEs), which you can think of as something similar to syllables in a word (but not the same!); a word like basketball could then be split into variations like bas ##ket ##ball or basket ##ball. Note that this is a constructed example and might not reflect the actually chosen subwords.
The idea itself is relatively old (an article from 1994), but has been recently popularized by Sennrich et al., and is basically used in every state-of-the-art NLP library that has to deal with large vocabularies.
The two biggest implementations of this idea are probably fastBPE and Google's SentencePiece.
With subword units, you basically have the freedom to set a fixed vocabulary size, and the algorithm will then try to optimize for a mix of word diversity and splitting "more complex words" into several pieces, such that the desired vocabulary size can cover any word in the corpus. For the exact algorithm, though, I highly recommend you look into the linked paper or implementations.
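As a quick illustration of fixing the vocabulary size with one of the implementations mentioned above, here is a minimal sketch with the sentencepiece Python package (the corpus file and vocabulary size are placeholders):

    # Train a subword (BPE) model with a fixed vocabulary size on a raw text file.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="corpus.txt",      # hypothetical file, one sentence per line
        model_prefix="subword",  # writes subword.model and subword.vocab
        vocab_size=8000,         # the fixed vocabulary size you choose
        model_type="bpe",
    )

    sp = spm.SentencePieceProcessor(model_file="subword.model")
    print(sp.encode("basketball players", out_type=str))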
In general, the least-frequent words in your training data are also the safest to discard.
This is especially the case for 'word2vec' and similar algorithms. There may not be enough varied examples of the usage of each rare word to learn reliable representations – as opposed to weak/idiosyncratic representations based on the few not-necessarily-representative examples of their use that you do have.
Also, rare words won't recur as often in future texts, which makes their relative value in the model lower.
And, by the typical 'zipfian' distribution of word-frequencies in natural-language material, while each individual rare word only occurs a few times, altogether there are many such words. So just discarding words with one to a few instances will often significantly shrink the vocabulary (and thus overall model) by half or more.
Finally, it's been observed in 'word2vec' that discarding those rare words – which are many in total number, though each individually has only limited-quality examples – often improves the quality of the surviving, more-frequent word-vectors. Those more-important words have fewer intervening lower-value 'noisy' words moving them out of each other's context windows, or pulling the model's weights in other directions via interleaved training examples.
(Similarly, in adequate corpuses, using more-aggressive frequent-word downsampling, as controlled by the sample parameter, can often increase word-vector quality while also speeding training – though with no savings in overall vocabulary size, as no words are totally eliminated by that setting.)
On the other hand, 'stop words' are insufficiently numerous to offer much vocabulary-size savings when discarded. Discard them, or not, based on whether their presence helps or hurts your later steps & final results – not to save a tiny amount of vocabulary-driven model space.
Note that for gensim's Word2Vec model, and related algorithms, in addition to the min_count parameter which discards all words appearing fewer times than that value, there is also the max_final_vocab parameter, which will dynamically choose whatever min_count is sufficient to achieve a final vocabulary size no larger than the max_final_vocab value.
So if you know you have the system memory to support a 1-million-word model, you don't have to use trial-and-error on min_count values to reach something near that: you can just specify max_final_vocab=1000000, min_count=1.
(On the other hand, be careful with the max_vocab_size parameter. It should only be used to prevent the initial word-count survey from outgrowing available RAM, and thus should be set to the largest value your system can manage – far, far larger than whatever you'd like your actual final vocabulary size to be. That's because the max_vocab_size is enforced whenever the survey-in-progress reaches that size – not just at the end – and discards a lot of the smaller word counts, and then enforces a higher floor each time it's enforced. If this limit is hit at all, it means final counts will only be approximate – & the escalating floor means sometimes the running-vocabulary will be pruned to a mere 10% or so of the full max_vocab_size.)
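A short sketch of those gensim parameters in context (gensim 4.x API assumed; the two-sentence corpus is just a placeholder):

    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]  # toy corpus

    model = Word2Vec(
        sentences,
        min_count=1,                # discard words rarer than this ...
        max_final_vocab=1_000_000,  # ... or let gensim raise min_count to stay under this cap
        sample=1e-3,                # aggressiveness of frequent-word downsampling
        vector_size=100,
    )
    print(len(model.wv))            # final vocabulary size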
You can significantly reduce vocabulary size via text pre-processing tailored to your learning task & domain. Some NLP techniques include:
Remove rare & frequent stop words. Not just from pre-defined lists but through learned thresholds, TF-IDF weights or superfluous part-of-speech removals.
Correct spelling/grammar/slang if your text is noisy or from different dialects of the same language.
Lemmatize words to remove tense & plurality variants if these relationships don't matter, e.g. played, playing or plays -> play (a short sketch of this and stop-word removal follows this list).
Parametrize text with named entities whenever specific details aren't needed, e.g. <PERSON> bought <MONEY> tickets to <LOCATION> for <DATE>.
Disambiguate & substitute synonyms with the most frequent usage of their interpretation, e.g. bedrooms are spacious -> rooms are big.
Simplify contractions & negations, e.g. I don't dislike it -> I do not dislike it ~> I like it.
Resolve co-references so that pronouns are made explicit, e.g. John said he will go -> John said John will go.
Reduce dimensionality with SVD to automatically capture equivalent phrases.
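A small sketch of two of the steps above (stop-word removal and lemmatization), assuming NLTK with the 'stopwords' and 'wordnet' data available:

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)

    stops = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    tokens = ["He", "played", "the", "game", "and", "plays", "again"]
    cleaned = [
        lemmatizer.lemmatize(t.lower(), pos="v")  # played / plays -> play
        for t in tokens
        if t.lower() not in stops
    ]
    print(cleaned)  # e.g. ['play', 'game', 'play']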

Identifying specific parts of a document using CRF

My goal: given a set of documents (mostly in the financial domain), identify specific parts of them, such as the company name or the type of the document.
Training is assumed to be done on a couple of hundred documents. Obviously I would have a skewed class distribution (with None dominating around 99.9% of the examples).
I plan to use a CRF (CRFsuite via its scikit-learn wrapper) and have gone through the necessary literature. I need some advice on the following fronts:
Will the dataset be sufficient to train the CRF? Considering each document can be split into around 100 tokens (each token being a training instance), we would get 10,000 instances in total.
Will the data set be too skewed for training a CRF? For example, for 100 documents I would have around 400 instances of a given class and around 8,000 instances of None.
Nobody knows that; you have to try it on your dataset, check the resulting quality, maybe inspect the CRF model (e.g. https://github.com/TeamHG-Memex/eli5 has sklearn-crfsuite support - a shameless plug), try to come up with better features, or decide to annotate more examples, etc. This is just general data science work. The dataset size looks on the low side, but depending on how structured the data is and how good the features are, a few hundred documents may be enough to get started. As the dataset is small, you may have to invest more time in feature engineering.
I don't think class imbalance would be a problem, at least it is unlikely to be your main problem.
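For reference, a minimal sklearn-crfsuite training loop could look like the sketch below; the features, labels and document are hypothetical placeholders, and real financial documents will need richer, domain-specific features:

    import sklearn_crfsuite

    def token_features(tokens, i):
        # Toy per-token features; extend with casing, layout, gazetteers, etc.
        return {
            "word.lower": tokens[i].lower(),
            "word.isupper": tokens[i].isupper(),
            "word.isdigit": tokens[i].isdigit(),
            "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
            "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        }

    docs = [["Acme", "Corp", "filed", "an", "invoice"]]           # tokenised documents
    labels = [["B-COMPANY", "I-COMPANY", "O", "O", "B-DOCTYPE"]]  # per-token tags

    X = [[token_features(doc, i) for i in range(len(doc))] for doc in docs]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X, labels)
    print(crf.predict(X))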

Should 'deceptive' training cases be given to a Naive Bayes Classifier

I am setting up a Naive Bayes Classifier to try to determine sameness between two records of five string properties. I am only comparing each pair of properties exactly (i.e., with a Java .equals() method). I have some training data, both TRUE and FALSE cases, but let's just focus on the TRUE cases for now.
Let's say there are some TRUE training cases where all five properties are different. That means every comparator fails, but the records are actually determined to be the 'same' after some human assessment.
Should this training case be fed to the Naive Bayes Classifier? On the one hand, considering that NBC treats each variable separately, these cases shouldn't totally break it. However, it also seems true that feeding in enough of these cases wouldn't be beneficial to the classifier's performance. I understand that seeing a lot of these cases would mean better comparators are required, but I'm wondering what to do in the meantime. Another consideration is that the flip-side is impossible; that is, there's no way all five properties could be the same between two records and still have them be 'different' records.
Is this a preferential issue, or is there a definitive accepted practice for handling this?
Usually you will want a training data set that is as representative as is feasible of the domain from which you hope to classify observations (though this is often difficult). An unrepresentative set may lead to a poorly functioning classifier, particularly in a production environment where varied data are received. That being said, preprocessing may be used to limit the exposure of a classifier trained on a particular subset of data, so it is quite dependent on the purpose of the classifier.
I'm not sure why you wish to exclude some elements though. Parameter estimation/learning should account for the fact that two different inputs may map to the same output --- that is why you would use machine learning instead of simply using a hashmap. Considering that you usually don't have 'all data' to build your model, you have to rely on this type of inference.
Have you had a look at NLTK? It is in Python, but it seems that OpenNLP may be a suitable substitute in Java. You can employ better feature extraction techniques that lead to a model that accounts for minor variations in the input strings (see here).
Lastly, it seems to me that you want to learn a mapping from input strings to the classes 'same' and 'not same' --- in other words, you seem to want to infer a distance measure (just checking). It would make more sense to invest effort in directly finding a better measure (e.g. for character transposition issues you could use edit distances). I'm not sure that NB is well suited to your problem, as it attempts to determine a class given an observation (or its features). This class will have to be discernible over various different strings (I'm assuming you are going to concatenate string1 & string2 and offer them to the classifier). Will there be enough structure present to derive such a widely applicable property? The classifier is basically going to need to be able to deal with all pair-wise 'comparisons', unless you build NBs for each one-vs-many pairing. This does not seem like a simple approach.
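To make the 'better measure' suggestion concrete, here is a small sketch (using difflib from the Python standard library as a stand-in for a proper edit-distance measure; the records are made up) that turns each of the five properties into a similarity score, which could then serve as features for whatever classifier you choose:

    from difflib import SequenceMatcher

    def property_similarities(record_a, record_b):
        # One similarity in [0, 1] per property; these become the feature vector
        # instead of exact .equals()-style 0/1 comparisons.
        return [SequenceMatcher(None, a, b).ratio()
                for a, b in zip(record_a, record_b)]

    rec1 = ["Acme Ltd", "John Smith", "NYC", "2020-01-01", "blue"]
    rec2 = ["ACME Limited", "Jon Smith", "New York", "2020-01-01", "blue"]
    print(property_similarities(rec1, rec2))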

When are uni-grams more suitable than bi-grams (or higher N-grams)?

I am reading about n-grams and I am wondering whether there are cases in practice when unigrams are preferred over bigrams (or higher N-grams). As I understand it, the bigger N is, the higher the complexity of calculating the probabilities and establishing the vector space. But apart from that, are there other reasons (e.g. related to the type of data)?
This boils down to data sparsity: as your n-gram length increases, the number of times you will see any given n-gram decreases. In the most extreme example, if you have a corpus where the maximum document length is n tokens and you are looking for an m-gram where m = n + 1, you will, of course, have no data points at all, because it's simply not possible to have a sequence of that length in your data set. The sparser your data set, the worse you can model it. For this reason, even though a higher-order n-gram model in theory contains more information about a word's context, it does not generalize easily to other data sets (known as overfitting), because the number of events (i.e. n-grams) it has seen during training becomes progressively smaller as n increases. On the other hand, a lower-order model lacks contextual information and so may underfit your data.
For this reason, if you have a relatively large number of token types (i.e. the vocabulary of your text is very rich) but each of these types has a very low frequency, you may get better results with a lower-order n-gram model. Similarly, if your training data set is very small, you may do better with a lower-order n-gram model. However, assuming that you have enough data to avoid over-fitting, you then get better separability of your data with a higher-order model.
Usually, n-grams with n greater than 1 are better, as they carry more information about the context in general. However, unigrams are sometimes also computed alongside bigrams and trigrams and used as a fallback for them. Unigrams are also useful when you want high recall rather than precision, for instance when searching for all possible uses of the verb "make".
Let's use statistical machine translation as an example:
Intuitively, the best scenario is that your model has seen the full sentence (let's say as a 6-gram) before and knows its translation as a whole. If this is not the case, you try to divide it into smaller n-grams, keeping in mind that the more you know about a word's surroundings, the better the translation. For example, if you want to translate "Tom Green" to German and you have seen the bi-gram, you will know it is a person's name and should remain as it is; but if your model never saw it, you would fall back to unigrams and translate "Tom" and "Green" separately. "Green" would then be translated as a color, "Grün", and so on.
Also, in search knowing more about the surrounding context makes the results more accurate.
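A toy sketch of that fallback idea (the corpus and counts are made up; real systems use smoothed probabilities rather than raw counts):

    # Look up the longest n-gram we have counts for, then back off to shorter ones.
    from collections import Counter

    def ngrams(tokens, n):
        return list(zip(*[tokens[i:] for i in range(n)]))

    corpus = "the cat sat on the mat and the cat slept".split()
    counts = {n: Counter(ngrams(corpus, n)) for n in (1, 2, 3)}

    def lookup_with_backoff(phrase):
        tokens = tuple(phrase.split())
        for n in range(min(len(tokens), 3), 0, -1):   # longest n-gram first
            if counts[n][tokens[:n]]:
                return n, tokens[:n], counts[n][tokens[:n]]
        return 0, (), 0

    print(lookup_with_backoff("the cat slept"))   # found as a trigram
    print(lookup_with_backoff("the dog slept"))   # backs off to the unigram "the"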
