How to use SparseLengthsSum in Caffe2 for categorical data? - machine-learning

I am trying to convert categorical data to numerical data, and Caffe2 has a few operators for this. I am interested in using SparseLengthsSum, but the Caffe2 docs do not have an example use case. How do I use SparseLengthsSum for categorical data?
https://caffe2.ai/docs/operators-catalogue.html#sparselengthssum.
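Based on the operator description linked above, here is a minimal sketch of how SparseLengthsSum might be used for categorical (multi-hot) features; the embedding table, category IDs, and segment lengths below are made up for illustration:

```python
# Sketch only: SparseLengthsSum gathers rows of DATA by INDICES and sums them
# per segment defined by LENGTHS. For categorical features, DATA is an
# embedding table and INDICES are the category IDs of each example.
import numpy as np
from caffe2.python import core, workspace

# Hypothetical embedding table: 5 categories, 3-dim embedding each
data = np.random.randn(5, 3).astype(np.float32)
# Two examples: the first has categories [0, 2], the second has category [4]
indices = np.array([0, 2, 4], dtype=np.int64)
lengths = np.array([2, 1], dtype=np.int32)

workspace.FeedBlob("DATA", data)
workspace.FeedBlob("INDICES", indices)
workspace.FeedBlob("LENGTHS", lengths)

op = core.CreateOperator("SparseLengthsSum", ["DATA", "INDICES", "LENGTHS"], ["OUT"])
workspace.RunOperatorOnce(op)

out = workspace.FetchBlob("OUT")  # shape (2, 3): row 0 = data[0] + data[2], row 1 = data[4]
print(out)
```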

Related

Is it possible to use JSON format input for BERT model?

I am trying to create one knowledge base (a single source of truth) gathered from multiple web sources (e.g. wiki <-> fandom).
So I want to try a Siamese network, or calculate cosine similarity with BERT-embedded documents.
Can I ignore the JSON structure and train on the documents anyway?
Although BERT wasn't specifically trained to find similarity between JSON data, you could always extract and concatenate the values of your JSON into a long sentence and leave it to BERT to capture the context as you expect.
Alternatively, you could generate a cosine similarity score for each key-value dependency between the JSONs and aggregate them to generate a net similarity score for the JSON data pair.
Also, see Sentence-BERT (SBERT), a modification of the pre-trained BERT network that uses Siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.
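A minimal sketch of the "concatenate the values and compare with SBERT" idea, assuming the sentence-transformers package is installed (the model name is just an example):

```python
# Flatten each JSON document into one long string, embed with SBERT,
# and compare the embeddings with cosine similarity.
from sentence_transformers import SentenceTransformer, util

def flatten_json(doc: dict) -> str:
    # Naively concatenate all values into a single "sentence"
    return " ".join(str(v) for v in doc.values())

doc_a = {"title": "Some article", "body": "Text gathered from a wiki page ..."}
doc_b = {"title": "Same topic", "body": "Text gathered from a fandom page ..."}

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model name
emb = model.encode([flatten_json(doc_a), flatten_json(doc_b)], convert_to_tensor=True)

score = util.cos_sim(emb[0], emb[1])  # cosine similarity in [-1, 1]
print(float(score))
```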

Predicting over data that has categorical, numerical and text

I am trying to build a classifier for my dataset. Each observation in the data has categorical and numerical values, as well as a more general description in free text. I understand how to build a boosting algorithm to handle the categorical and numerical values, and I have already trained a neural network that predicts over the text quite successfully. What I can't wrap my head around is how to integrate both approaches.
Embed your free text using a language model (e.g. by averaging fastText word embeddings, or using google-universal-sentence-encoder) into an N-dim vector of floats. One-hot encode the categorical stuff. Concatenate [embedding, one_hot_encoding, numericals] and badabing badaboom, you've got yourself one vector representing your datapoint.
TensorFlow Hub's KerasLayer + https://tfhub.dev/google/universal-sentence-encoder/4 is definitely a good starting point. If you need to train something yourself, you could look into tf.keras.layers.Embedding.
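A rough sketch of the "one vector per datapoint" idea, assuming tensorflow_hub and scikit-learn are installed (the columns and the USE URL from above are just illustrative):

```python
import numpy as np
import tensorflow_hub as hub
from sklearn.preprocessing import OneHotEncoder

# Universal Sentence Encoder from TF Hub (URL linked above)
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

texts = ["free text description of item one", "another description"]
cats = np.array([["red"], ["blue"]])          # categorical column(s)
nums = np.array([[3.5, 10.0], [1.2, 42.0]])   # numerical columns

text_vecs = embed(texts).numpy()                           # (2, 512) float embeddings
onehot = OneHotEncoder().fit_transform(cats).toarray()     # one-hot expansion

# One row per datapoint: [embedding | one-hot | numericals]
X = np.concatenate([text_vecs, onehot, nums], axis=1)
print(X.shape)  # feed X into the boosting model or any other classifier
```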

Is there a way to use decision trees with categorical variables without one-hot encoding?

I have a dataset with 200+ categorical variables (non-ordinal) and just a few continuous variables. I have tried one-hot encoding, but it increases the dimensionality by a lot and results in a poor score.
It seems like the regular scikit-learn tree can only be used with categorical variables that have been transformed via one-hot encoding (for non-ordinal vars), and I was wondering if there's a way to build a tree without one-hot. I did some research and found that there's a library called H2O that might be useful, but I'm trying to find a way to run it on my local machine.
You can install the h2o-3 package for Python, for example, from h2o.ai/downloads or from PyPI.
The h2o package handles categorical values automatically and efficiently; it is recommended not to one-hot encode them first.
You can find lots of documentation at docs.h2o.ai.
As per https://datascience.stackexchange.com/a/32623/51879:
You can use other encoding techniques via this wrapper for scikit-learn: http://contrib.scikit-learn.org/categorical-encoding/
Also check out this great article for more details: https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931.
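A small sketch of the wrapper approach, assuming the category_encoders package (the pip name of the scikit-learn-contrib categorical-encoding project linked above); the columns and encoder choice are just examples:

```python
# Target-encode the non-ordinal categoricals so a plain scikit-learn tree
# ensemble can consume them without one-hot expansion (one column stays one column).
import pandas as pd
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],   # non-ordinal categorical
    "size":  [1.0, 2.5, 0.3, 1.7],               # continuous
})
y = [0, 1, 1, 0]

encoder = ce.TargetEncoder(cols=["color"])   # other encoders: ce.OrdinalEncoder, ce.BinaryEncoder, ...
X = encoder.fit_transform(df, y)

clf = RandomForestClassifier(n_estimators=100).fit(X, y)
```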

H2O Flow: How does H2O flow UI treat data types differently

Specifically, what is the difference in how H2O treats enum and string data types in contrast to 'int' and 'numerical' types?
For example, say I have a binary classifier that takes input samples that have features
x1=(1 of 10 possible favorite ice cream flavors (enum))
x2=(some random phrase (string))
x3=(some number (int))
What would be the difference in how the classifier treats these types during training?
When uploading data into h2o Flow UI, I get the option to convert certain data types (like enum) to 'numerical.' This makes me think that there is more than just string-to-number mapping going on when I just leave the 'enum' as an 'enum' (not converting to 'numerical' type), but I can't find information on what that difference is.
Clarification would be appreciated, thanks.
The "enum" type is the type of encoding you'll want to use for categorical features. If the categorical features are encoded as "enum", then the tree-based algorithms like Random Forest and GBM will be able to handle these features in a smart way. Most other implementations of RFs and GBM force you to do a one-hot expansion of the categorical features (into K dummy columns), but in H2O, the tree-based methods can use these features without any expansion. The exact whay that the variables are handled can be controlled using the categorical_encoding argument.
If you have an ordered categorical variable, then it might be okay to encode that as "int", however, the effect of doing that on model performance will depend on the data.
If you were to convert an "enum" column to "numeric" that would simply encode each category as an integer and you'd lose the notion that those numbers represent categories (so it's not recommended).
You should not use the "string" type in H2O unless you are going to exclude that column from the set of predictors. It would make sense to use a "string" column for text, but you'll probably want to parse (e.g. tokenize) that text to generate new numeric or enum features that will be included in the set of predictors.

How to use different dataset for scikit and NLTK?

I am trying to implement the built-in naive Bayes classifiers of scikit-learn and NLTK for raw data I have. The data is a set of tab-separated rows, each having a label, a paragraph, and some other attributes.
I am interested in classifying the paragraphs.
I need to convert this data into a format suitable for the built-in classifiers of scikit-learn / NLTK.
I want to implement Gaussian, Bernoulli, and Multinomial Naive Bayes for all paragraphs.
Question 1:
For scikit-learn, the example given imports the iris data. I checked the iris data; it has precalculated numeric values for the dataset. How can I convert my data into such a format and directly call the Gaussian function? Is there any standard way of doing so?
Question 2:
For NLTK,
What should be the input for the NaiveBayesClassifier.classify function? Is it a dict with boolean values? How can it be made multinomial or Gaussian?
Regarding question 2:
nltk.NaiveBayesClassifier.classify expects a so-called 'featureset'. A featureset is a dictionary with feature names as keys and feature values as values, e.g. {'word1': True, 'word2': True, 'word3': False}. NLTK's naive Bayes classifier cannot be used as a multinomial approach. However, you can install scikit-learn and use the nltk.classify.scikitlearn wrapper module to deploy scikit-learn's multinomial classifier.
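A short sketch covering both questions, assuming scikit-learn and NLTK are installed (the paragraphs and labels are placeholders): question 1 is handled by vectorizing the paragraphs into the numeric matrix scikit-learn expects, and question 2 by the SklearnClassifier wrapper mentioned above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from nltk.classify.scikitlearn import SklearnClassifier

paragraphs = ["first paragraph of text ...", "second paragraph ..."]
labels = ["label_a", "label_b"]

# Question 1: turn raw paragraphs into a numeric feature matrix
# (the iris example just happens to ship with precomputed numeric features).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(paragraphs)
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["some new paragraph"])))

# Question 2: NLTK featuresets are dicts of feature name -> value; the
# SklearnClassifier wrapper lets NLTK use scikit-learn's multinomial NB.
train_set = [({"word1": True, "word2": True}, "label_a"),
             ({"word3": True}, "label_b")]
nltk_clf = SklearnClassifier(MultinomialNB()).train(train_set)
print(nltk_clf.classify({"word1": True}))
```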