Difference between spacy_sklearn and tensorflow_embedding pipelines - machine-learning

I want to know if there is any basic difference between how the spacy_sklearn and tensorflow_embedding pipelines operate under the hood. I assume tensorflow_embedding must also use the same concepts: word embeddings, reducing the dimensionality of the data using PCA, etc. Is the only difference that spacy_sklearn has pre-trained word vectors to draw upon and the tensorflow_embedding pipeline does not? Is my understanding correct? Also, how is the tensorflow_embedding pipeline related to the TensorFlow framework offered by Google?
I tried looking up the TensorFlow framework on Google but could not find a specific answer. I also searched the Rasa community page, but again found no help.

The spacy_sklearn pipeline uses pre-trained word vectors. This is useful if we don't have very much training data.
The tensorflow_embedding pipeline doesn't use any pre-trained word vectors; it fits them specifically to our dataset. The advantage of the tensorflow_embedding pipeline is that the word vectors will be customised for our domain.
For more information, please refer to the link below:
https://rasa.com/docs/nlu/choosing_pipeline/
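To make the contrast concrete, here is a minimal sketch of the idea behind the pre-trained route: a sentence is represented by averaging fixed word vectors, and a classifier is then trained on those averages. The tiny PRETRAINED table is made up for illustration (real pre-trained vectors have hundreds of dimensions), and sentence_vector is a hypothetical helper, not a Rasa or spaCy function.

```python
# Toy stand-in for pre-trained word vectors (hypothetical values).
PRETRAINED = {
    "book":   [0.9, 0.1, 0.0],
    "a":      [0.1, 0.1, 0.1],
    "flight": [0.8, 0.2, 0.1],
    "hello":  [0.0, 0.9, 0.2],
}
DIM = 3

def sentence_vector(text):
    """Average the vectors of known words; unknown words contribute nothing."""
    known = [PRETRAINED[w] for w in text.lower().split() if w in PRETRAINED]
    if not known:
        return [0.0] * DIM
    return [sum(v[i] for v in known) / len(known) for i in range(DIM)]

print(sentence_vector("book a flight"))
print(sentence_vector("reset my vpn token"))  # every word is out of vocabulary
```

In this sketch, domain words that are missing from the pre-trained table carry no signal at all. A pipeline that learns its embeddings from your own training data is meant to close that gap, at the cost of needing more examples.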

Related

HOG and LBP on weka

I'm new to the subject of ML so I apologize as my questions may seem too basic.
I have an image dataset and my supervisor asked me to do feature extraction using HOG and LBP filters. So far I have been working with Weka, and I couldn't find any useful tutorials on how to implement these filters in Weka. Is it possible? If not, how else can I implement these filters to extract features from my dataset?
Weka
You could use and/or extend the imageFilter Weka package. It uses LIRE under the hood, which has a range of feature extraction methods implemented.
ADAMS
If you don't mind using another framework, then you could make use of the plugins for LIRE in ADAMS. Its adams-imaging module also offers support for a range of LIRE feature generators (download either the adams-annotator or adams-base-all snapshot).
Using its workflow engine, you can run the flow adams-imaging-feature_generation, which generates PHOG and LocalBinaryPatterns features from a range of images and displays them as a spreadsheet. You could use this flow as a basis and turn it into one that lets you select images interactively and then saves the features as a CSV or ARFF file.
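If neither tool fits, the basic LBP descriptor is simple enough to compute by hand. The following is a minimal sketch in plain Python, independent of Weka, LIRE and ADAMS. The function names are hypothetical, and the neighbour ordering is one convention among several, so the codes will not necessarily match those produced by a particular library.

```python
def lbp_codes(img):
    """Return the 8-bit LBP code of every interior pixel of a 2D grayscale grid."""
    # Neighbour offsets, clockwise starting from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = []
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            center = img[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as bright as the center.
                if img[r + dr][c + dc] >= center:
                    code |= 1 << bit
            codes.append(code)
    return codes

def lbp_histogram(img):
    """256-bin histogram of LBP codes: the feature vector for one image."""
    hist = [0] * 256
    for code in lbp_codes(img):
        hist[code] += 1
    return hist

image = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]
print(lbp_codes(image))  # one interior pixel, so one code
```

Writing one histogram per image as a row of a CSV or ARFF file gives a table that Weka can load like any other numeric dataset.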

How to access to FastText classifier pipeline?

As we know, Facebook's FastText is a great open-source, free, lightweight library that can be used for text classification. The problem is that the pipeline seems to be an end-to-end black box. Yes, we can change the hyper-parameters using these options for setting the training configuration, but I couldn't find a way to access the vector embeddings it generates internally.
I want to do some manipulation on the vector embeddings, such as introducing tf-idf weighting on top of the word2vec-style representations. I also want to do oversampling using SMOTE, which requires a numerical representation. For these reasons I need to introduce my custom code in the middle of the overall pipeline, which seems to be inaccessible to me. How can I introduce custom steps in this pipeline?
The full source code is available:
https://github.com/facebookresearch/fastText
So, you can make any changes or extensions you can imagine - if you're comfortable reading & modifying its C++ source code. Nothing is hidden or inaccessible.
Note that both FastText, and its supervised classification mode, are chiefly conventions for training a shallow neural-network. It may not be helpful to think of it as a "pipeline" like in the architecture of other classifier libraries - as none of the internal interfaces use that sort of language or modular layout.
Specifically, if you get the gist of word2vec training, FastText classifier mode really just replaces attempted-predictions of neighboring (in-context-window) vocabulary words, with attempted-predictions of known labels instead.
For the sake of understanding FastText's relationship to other techniques, and potential aspects for further extension, I think it's useful to also review:
this skeptical blog post comparing FastText to the much-earlier 'vowpal wabbit' tool: "Fast & easy baseline text categorization with vw"
Facebook's far-less discussed extension of such vector-training for more generic categorical or numerical tasks, "StarSpace"
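One route that avoids touching the C++ code is to pull the word vectors out of a trained model and do the weighting yourself. The sketch below assumes you already have a word-to-vector mapping (the official fasttext Python bindings expose model.get_word_vector(word) for this). The word_vectors values, the toy documents and the helpers idf_table and tfidf_weighted_vector are all made up for illustration.

```python
import math
from collections import Counter

def idf_table(docs):
    """Smoothed inverse document frequency for every word in the corpus."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc.split()))
    return {w: math.log((1 + n) / (1 + d)) + 1.0 for w, d in df.items()}

def tfidf_weighted_vector(doc, word_vectors, idf, dim):
    """Average word vectors, weighting each word by its tf-idf score."""
    counts = Counter(w for w in doc.split() if w in word_vectors and w in idf)
    total = [0.0] * dim
    weight_sum = 0.0
    for w, tf in counts.items():
        weight = tf * idf[w]
        weight_sum += weight
        for i in range(dim):
            total[i] += weight * word_vectors[w][i]
    return [x / weight_sum for x in total] if weight_sum else total

docs = ["cheap pills cheap", "meeting agenda", "cheap meeting"]
word_vectors = {  # hypothetical 2-d embeddings
    "cheap": [1.0, 0.0], "pills": [0.8, 0.2],
    "meeting": [0.0, 1.0], "agenda": [0.1, 0.9],
}
idf = idf_table(docs)
features = [tfidf_weighted_vector(d, word_vectors, idf, 2) for d in docs]
print(features)
```

Each document becomes a fixed-length numeric row, which is the form that SMOTE implementations (for example the one in imbalanced-learn) expect. Note that this approach trains a separate classifier on the extracted features instead of using FastText's built-in one.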

Image Classification in Azure Machine Learning

I'm preparing for the Azure Machine Learning exam, and here is a question that confuses me.
You are designing an Azure Machine Learning workflow. You have a dataset that contains two million large digital photographs. You plan to detect the presence of trees in the photographs. You need to ensure that your model supports the following:
Solution: You create a Machine Learning experiment that implements the Multiclass Decision Jungle module. Does this meet the goal?
Solution: You create a Machine Learning experiment that implements the Multiclass Neural Network module. Does this meet the goal?
The answer for the first question is No while for second is Yes, but I cannot understand why Multiclass Decision Jungle doesn't meet the goal since it is a classifier. Can someone explain to me the reason?
I suppose that this is part of a series of questions that present the same scenario. And there should be definitely some constraints in the scenario.
Moreover if you have a look on the Azure documentation:
However, recent research has shown that deep neural networks (DNN)
with many layers can be very effective in complex tasks such as image
or speech recognition. The successive layers are used to model
increasing levels of semantic depth.
Thus, Azure recommends using neural networks for image classification. Remember that the goal of the exam is to test your ability to design data science solutions using Azure, so it is better to use their official documentation as a reference.
And comparing to the other solutions:
You create an Azure notebook that supports the Microsoft Cognitive Toolkit.
You create a Machine Learning experiment that implements the Multiclass Decision Jungle module.
You create an endpoint to the Computer vision API.
You create a Machine Learning experiment that implements the Multiclass Neural Network module.
There are only 2 Azure ML Studio modules, and as the question is about constructing a workflow I guess we can only choose between them. (CNTK is actually the best solution as it allows constructing a deep neural network with ReLU whereas AML Studio doesn't, and API call is not about data science at all).
Finally, I do agree with the other contributors that the question is absurd. Hope this helps.
This question is indeed part of a series of questions that present the same scenario with multiple options. Both of the solutions approach the problem as a multi-class classification problem, which is correct. However, the key element here is dimensionality.
Your inputs (images) are high-dimensional, which calls for a deep learning approach in order to be effective. A decision jungle won't be able to learn effectively in such a high-dimensional feature space, whereas a neural network has a better chance of doing so.
I hope it helps.
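As a rough illustration of the dimensionality argument, the sketch below counts the raw input features when every pixel value is treated as one feature. The image sizes are hypothetical, since the scenario only says the photographs are "large", and raw_feature_count is an illustrative helper.

```python
def raw_feature_count(width, height, channels=3):
    """Number of raw pixel values if each one is treated as an input feature."""
    return width * height * channels

# Hypothetical photo sizes; the exam scenario does not give actual dimensions.
for w, h in [(256, 256), (1920, 1080), (4000, 3000)]:
    print(f"{w}x{h}: {raw_feature_count(w, h):,} features")
```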

Simple machine learning for website classification

I am trying to generate a Python program that determines if a website is harmful (porn etc.).
First, I made a Python web scraping program that counts the number of occurrences for each word.
(Image: result for harmful websites)
It's a key value dictionary like
{ word : [ # occurrences in harmful websites, # of websites that contain these words] }.
Now I want my program to analyze the words from any websites to check if the website is safe or not. But I don't know which methods will suit to my data.
The key thing here is your training data. You need some sort of supervised learning technique where your training data consists of website's data itself (text document) and its label (harmful or safe).
You can certainly use an RNN, but there are also other natural language processing techniques, including much faster ones.
Typically, you should apply a proper vectorizer to your training data (think of each page as a text document), for example tf-idf, though there are other possibilities. If you use Python I would strongly suggest scikit-learn, which provides lots of useful machine learning techniques and already includes sklearn.feature_extraction.text.TfidfVectorizer. The point is to vectorize your text documents in a way that reflects how informative each word is. Consider, for example, how often the English word "the" appears in a typical text. You need to account for biases such as these.
Once your training data is vectorized you can use for example stochastic gradient descent classifier and see how it performs on your test data (in machine learning terminology the test data means to simply take some new data example and test what your ML program outputs).
In either case you will need to experiment with the above options. There are many nuances and you need to test your data and see where you achieve the best results (depending on ML algorithm settings, type of vectorizer, the ML technique itself and so on). For example, Support Vector Machines are also a great choice for binary classification. You may want to try them too and see if they perform better than SGD.
In any case, remember that you will need to obtain quality training data with labels (harmful vs. safe) and find the best fitting classifier. On your journey to find the best one you may also wanna use cross validation to determine how well your classifier behaves. Again, already contained in scikit-learn.
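The steps above (vectorize with tf-idf, fit a stochastic gradient descent classifier, check it with cross-validation) can be sketched with scikit-learn roughly as follows. The pages and labels are a tiny made-up stand-in for real scraped text, so the scores printed here say nothing about real-world performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical, tiny stand-in for scraped page text and its labels.
pages = [
    "explicit adult content xxx videos",
    "adult xxx explicit photos free",
    "xxx adult videos explicit chat",
    "free explicit adult xxx streams",
    "daily news weather and sports",
    "recipes cooking healthy food blog",
    "university courses and research news",
    "sports scores weather forecast food",
]
labels = ["harmful"] * 4 + ["safe"] * 4

# tf-idf turns each page into a weighted word vector; SGD fits a linear classifier.
model = make_pipeline(TfidfVectorizer(stop_words="english"),
                      SGDClassifier(random_state=0))

# 2-fold cross-validation only because the toy set is so small.
scores = cross_val_score(model, pages, labels, cv=2)
print("fold accuracies:", scores)

model.fit(pages, labels)
print(model.predict(["weather and sports news"]))
```

To compare against a Support Vector Machine, one option is to swap SGDClassifier for sklearn.svm.LinearSVC in the same pipeline and compare the cross-validation scores.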
N.B. Don't forget about valid cases. For example there may be a completely safe online magazine where it only mentions the harmful topic in some article; it doesn't mean the website itself is harmful though.
Edit: As I think of it, if you don't have any experience with ML at all it could be useful to take any online course because despite the knowledge of API and libraries you will still need to know what it does and the math behind the curtain (at least roughly).
What you are trying to do is called sentiment classification and is usually done with recurrent neural networks (RNNs) or Long short-term memory networks (LSTMs). This is not an easy topic to start with machine learning. If you are new you should have a look into linear/logistic regression, SVMs and basic neural networks (MLPs) first. Otherwise it will be hard to understand what is going on.
That said: there are many libraries out there for constructing neural networks. Probably easiest to use is keras. While this library simplifies a lot of things immensely, it isn't just a magic box that makes gold from trash. You need to understand what happens under the hood to get good results. Here is an example of how you can perform sentiment classification on the IMDB dataset (basically determine whether a movie review is positive or not) with keras.
For people who have no experience in NLP or ML, I recommend using TFIDF vectorizer instead of using deep learning libraries. In short, it converts sentences to vector, taking each word in vocabulary to one dimension (degree is occurrence).
Then, you can calculate the cosine similarity between the resulting vectors.
To improve performance, use the stemming, lemmatizing and stopword support in the NLTK library.
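A minimal sketch of tf-idf followed by cosine similarity, written in plain Python so that each step is visible. The example documents and the helper names tfidf_vectors and cosine are made up for illustration. In practice a library vectorizer would normally replace the hand-written one.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document to a {word: tf-idf weight} dictionary."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "the casino offers free adult chat",
    "free adult chat and casino bonus",
    "the school publishes exam results",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first and third documents share only the word "the", yet their similarity is not zero. That residual overlap is the kind of noise that stopword removal is meant to eliminate.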

Annotated images classification

I've got a bunch of images (~3000) which have been manually classified (approved/rejected) based on some business criteria. I've processed these images with Google Cloud Platform obtaining annotations and SafeSearch results, for example (csv format):
file name; approved/rejected; adult; spoof; medical; violence; annotations
A.jpg;approved;VERY_UNLIKELY;VERY_UNLIKELY;VERY_UNLIKELY;UNLIKELY;boat|0.9,vehicle|0.8
B.jpg;rejected;VERY_UNLIKELY;VERY_UNLIKELY;VERY_UNLIKELY;UNLIKELY;text|0.9,font|0.8
I want to use machine learning to be able to predict if a new image should be approved or rejected (second column in the csv file).
Which algorithm should I use?
How should I format the data, especially the annotations column? Should I obtain first all the available annotation types and use them as a feature with the numerical value (0 if it doesn't apply)? Or would it be better to just process the annotation column as text?
I would suggest you try convolutional neural networks.
Maybe the fastest way to test whether your idea will work (the number of images you have is quite low, which could be a problem) is to use transfer learning with TensorFlow. There are great tutorials made by Magnus Erik Hvass Pedersen, who published them on YouTube.
I suggest you go through all the videos, but the important ones are #7 and #8.
Using transfer learning allows you to use the models they build at google to classify images. But with transfer learning, you are able to use your own data with your own labels.
Using this approach you will be able to see if this is suitable for your problem. Then you can dive into convolutional neural networks and create the pipeline that will work the best for your problem.
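If you also want to try a conventional classifier on the existing annotations, the encoding described in the question (one numeric column per annotation type, 0 where it does not apply) can be sketched as below. The ordinal numbers assigned to the likelihood levels, the "tag:" column prefix and the helpers parse_row and to_matrix are assumptions made for illustration.

```python
# Ordinal encoding of the likelihood levels (the numeric values are an assumption).
LIKELIHOOD = {"UNKNOWN": 0, "VERY_UNLIKELY": 1, "UNLIKELY": 2,
              "POSSIBLE": 3, "LIKELY": 4, "VERY_LIKELY": 5}
SAFESEARCH_FIELDS = ["adult", "spoof", "medical", "violence"]

def parse_row(line):
    """Split one CSV line into (file name, label, feature dict)."""
    name, label, *rest = line.split(";")
    features = {f: LIKELIHOOD[v] for f, v in zip(SAFESEARCH_FIELDS, rest[:4])}
    for item in rest[4].split(","):
        tag, score = item.split("|")
        features["tag:" + tag] = float(score)
    return name, label, features

def to_matrix(rows):
    """One column per feature seen anywhere; 0.0 where it does not apply."""
    columns = sorted({k for _, _, feats in rows for k in feats})
    matrix = [[feats.get(c, 0.0) for c in columns] for _, _, feats in rows]
    return columns, matrix

lines = [
    "A.jpg;approved;VERY_UNLIKELY;VERY_UNLIKELY;VERY_UNLIKELY;UNLIKELY;boat|0.9,vehicle|0.8",
    "B.jpg;rejected;VERY_UNLIKELY;VERY_UNLIKELY;VERY_UNLIKELY;UNLIKELY;text|0.9,font|0.8",
]
rows = [parse_row(line) for line in lines]
columns, X = to_matrix(rows)
y = [label for _, label, _ in rows]
print(columns)
print(X)
```

The resulting X and y can be passed to most tabular classifiers, which makes it cheap to compare this baseline against the image-based approach.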
