I have a large set of scanned documents that I need to index; however, the documents of interest are only a small proportion of the entire package my classifier needs to identify. To get an idea of the optimal number of classes and how best to merge documents into a class, I wanted to run an unsupervised clustering analysis.
Which distance metric would work best to capture the structural information? Also, would agglomerative hierarchical clustering be the best clustering approach for this task? Thanks
An unsupervised clustering technique fails on scanned documents because it cannot grasp the underlying structure and ends up producing nonsensical clusters, so the approach is fundamentally flawed. However, classification using deep convolutional neural networks, with sufficient data and carefully chosen distinct classes, can outperform OCR techniques if the documents have a distinct structure.
Related
I am trying to apply some clustering method to my datasets (with numerical dimensions). But I'm convinced that the features have different weights for different clusters. I read that there is an approach called soft subspace clustering that tries to identify the clusters and the weights of the features for each cluster simultaneously. However, the algorithms that I have found are applied only to categorical data.
I am trying to identify a soft subspace clustering algorithm for numerical data. Do you know if there is any, or how I can adapt methods originally designed for categorical data to numerical data? (I think it would be necessary to propose some way of measuring the relevance of each numerical feature in each cluster.)
Yes, there are dozens of subspace clustering algorithms.
You'll need to do a proper literature review; this is too broad to cover on a Q&A site like Stack Overflow. Look for (surprise) "subspace clustering", but also include "biclustering", for example.
I am trying to generate a Python program that determines if a website is harmful (porn etc.).
First, I made a Python web scraping program that counts the number of occurrences for each word.
Result for harmful websites:
It's a key-value dictionary like
{ word : [ # occurrences in harmful websites, # of websites that contain these words] }.
Now I want my program to analyze the words from any website to check if the website is safe or not. But I don't know which methods will suit my data.
The key thing here is your training data. You need some sort of supervised learning technique where your training data consists of the website's content itself (a text document) and its label (harmful or safe).
You can certainly use an RNN, but there are also other, much faster natural language processing techniques.
Typically, you should use a proper vectorizer on your training data (think of each site page as a text document), for example tf-idf (but there are other possibilities too; if you use Python I would strongly suggest scikit-learn, which provides lots of useful machine learning techniques, and the mentioned TfidfVectorizer is already included). The point is to vectorize your text documents in an informative way. Consider, for example, the English word "the": how many times does it typically occur in a text? You need to account for biases such as these.
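A minimal sketch of what that vectorization step looks like with scikit-learn (the toy documents are invented purely for illustration):

    # TF-IDF downweights frequent, uninformative words like "the" and
    # gives rarer, more discriminative words a higher weight.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "stock markets fell sharply today",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

    print(vectorizer.get_feature_names_out())
    print(X.toarray().round(2))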
Once your training data is vectorized you can use, for example, a stochastic gradient descent classifier and see how it performs on your test data (in machine learning terminology, test data simply means new examples held aside to check what your ML program outputs).
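For illustration, here is a minimal sketch of that workflow; the tiny corpus and the harmful/safe labels are invented, not real data:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import train_test_split

    texts = ["casino jackpot win money now", "adult content explicit",
             "local news and weather report", "recipes for healthy dinners"]
    labels = [1, 1, 0, 0]  # 1 = harmful, 0 = safe (hypothetical labels)

    # Hold out half the sites as test data; stratify keeps both classes
    # present in each half.
    X_train_txt, X_test_txt, y_train, y_test = train_test_split(
        texts, labels, test_size=0.5, stratify=labels, random_state=0)

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(X_train_txt)
    X_test = vectorizer.transform(X_test_txt)  # reuse the fitted vocabulary

    clf = SGDClassifier(random_state=0)  # linear model trained with SGD
    clf.fit(X_train, y_train)
    print(clf.predict(X_test))  # predicted labels for the held-out sites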
In either case you will need to experiment with the options above. There are many nuances, and you need to test your data and see where you achieve the best results (depending on the ML algorithm's settings, the type of vectorizer, the ML technique itself, and so on). For example, Support Vector Machines are a great choice for binary classifiers too. You may want to play with them and see if they perform better than SGD.
In any case, remember that you will need to obtain quality training data with labels (harmful vs. safe) and find the best-fitting classifier. On your journey to find the best one you may also want to use cross-validation to determine how well your classifier behaves. Again, this is already contained in scikit-learn.
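A minimal sketch of cross-validation with a scikit-learn pipeline (the toy corpus and labels are again invented for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    texts = ["buy pills cheap", "explicit adult site", "cooking blog",
             "science news daily", "win cash instantly", "kids education games"]
    labels = [1, 1, 0, 0, 1, 0]  # hypothetical harmful/safe labels

    # The pipeline re-fits the vectorizer inside each fold, so no
    # information leaks from the held-out fold into training.
    pipe = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0))
    scores = cross_val_score(pipe, texts, labels, cv=3)  # 3-fold CV
    print(scores.mean())  # rough estimate of out-of-sample accuracy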
N.B. Don't forget about edge cases. For example, there may be a completely safe online magazine that only mentions a harmful topic in some article; that doesn't mean the website itself is harmful.
Edit: Come to think of it, if you don't have any experience with ML at all, it could be useful to take an online course, because beyond knowing the APIs and libraries you will still need to understand what they do and the math behind the curtain (at least roughly).
What you are trying to do is called sentiment classification and is usually done with recurrent neural networks (RNNs) or long short-term memory networks (LSTMs). This is not an easy topic to start machine learning with. If you are new, you should first have a look at linear/logistic regression, SVMs, and basic neural networks (MLPs). Otherwise it will be hard to understand what is going on.
That said: there are many libraries out there for constructing neural networks. Probably the easiest to use is Keras. While this library simplifies a lot of things immensely, it isn't a magic box that makes gold from trash. You need to understand what happens under the hood to get good results. Here is an example of how you can perform sentiment classification on the IMDB dataset (basically, determining whether a movie review is positive or not) with Keras.
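A minimal sketch of that kind of model (not the linked example itself; the architecture and hyperparameters here are illustrative choices):

    from tensorflow.keras.datasets import imdb
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense

    max_words, max_len = 10000, 200
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_words)
    x_train = pad_sequences(x_train, maxlen=max_len)  # equal-length sequences
    x_test = pad_sequences(x_test, maxlen=max_len)

    model = Sequential([
        Embedding(max_words, 32),        # learn 32-d word embeddings
        LSTM(32),                        # read the review word by word
        Dense(1, activation="sigmoid"),  # positive vs. negative
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=2, batch_size=128,
              validation_data=(x_test, y_test))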
For people who have no experience in NLP or ML, I recommend using a TF-IDF vectorizer instead of deep learning libraries. In short, it converts sentences to vectors, mapping each word in the vocabulary to one dimension (the value reflects its occurrence).
Then, you can calculate the cosine similarity between the resulting vectors.
To improve performance, use the stemming / lemmatizing / stopword tools provided by the NLTK library.
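Putting those pieces together, a minimal sketch (the sentences are invented; NLTK's stopword list could be swapped in for the built-in English one used here):

    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    stemmer = PorterStemmer()

    def stem(text):
        # Reduce each word to its stem so "gambling"/"gamble" etc. match.
        return " ".join(stemmer.stem(w) for w in text.split())

    docs = ["gambling and betting games", "casino gambling site",
            "healthy cooking recipes"]
    stemmed = [stem(d) for d in docs]

    vectorizer = TfidfVectorizer(stop_words="english")
    vectors = vectorizer.fit_transform(stemmed)

    # Pairwise cosine similarity: rows 0 and 1 (both about gambling)
    # should score higher with each other than with row 2.
    print(cosine_similarity(vectors).round(2))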
I'm curious as to whether research has been done into random forests that combine unsupervised with supervised learning in a way that allows a single algorithm to find patterns in, and work with, multiple different data sets. I have googled every possible way to find research on this and have come up empty. Can anyone point me in the right direction?
Note: I have already asked this question in the Data Sciences forum, but it's basically a dead forum so I came here.
(I have also read the comments and will incorporate their content in my answer.)
What I read between the lines is that you want to use deep networks in a transfer learning setting. However, this would not be based on decision trees.
http://jmlr.csail.mit.edu/proceedings/papers/v27/mesnil12a/mesnil12a.pdf
There are many elements in your question:
1.) Machine learning algorithms, in general, don't care about the source of your data set, so you can basically feed the learning algorithm 20 different data sets and it will use all of them. However, the data should share the same underlying concept (except in the transfer learning case, see below). This means: if you combine cats/dogs data with bills data, it will not work, or will at least make things much harder for the algorithm. At a minimum, all input features need to be identical (exceptions exist); e.g., it is hard to combine images with text.
2.) Labeled/unlabeled: Two important terms. A data set is a set of data points with a fixed number of dimensions: data point i might be described as {X_i1, ..., X_in}, where each X_ij might, for example, be a pixel. A label Y_i comes from another domain, e.g., cats and dogs.
3.) Unsupervised learning: data without any labels. (I have a gut feeling that this is not what you want.)
4.) Semi-supervised learning: The idea is basically that you combine data where you have labels with data without labels. You have a set of images labeled as cats and dogs {X_i1, ..., X_in, Y_i} and a second set which contains images of cats/dogs but no labels {X_j1, ..., X_jn}. The algorithm can use this information to build better classifiers, as the unlabeled data provide information on how images look in general.
5.) Transfer learning (I think this comes closest to what you want): The idea is that you provide a data set of cats and dogs and learn a classifier. Afterwards you want to train the classifier with images of cats/dogs/hamsters. The training does not need to start from scratch but can use the cats/dogs classifier to converge much faster (a minimal sketch follows after this list).
6.) Feature generation / feature construction: The idea is that the algorithm learns features like "eyes". These features are used in the next step to learn the classifier. I'm mainly aware of this in the context of deep learning, where the algorithm first learns concepts like edges and constructs increasingly complex features like faces and cats until it can describe things like "the man on the elephant". This, combined with transfer learning, is probably what you want. However, deep learning is based on neural networks, with a few exceptions.
7.) Outlier detection: You provide a data set of cats/dogs as known images. When you then feed the classifier a hamster image, it tells you that it has never seen something like a hamster before.
8.) Active learning: The idea is that you don't provide labels for all examples (data points) beforehand, but the algorithm asks you to label certain data points. This way you need to label much less data.
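To make point 5 concrete, here is a minimal, hypothetical transfer-learning sketch in Keras; the pre-trained base network, the three-class head, and the input size are illustrative assumptions, not something from the question:

    import tensorflow as tf
    from tensorflow.keras import layers, models

    # Pre-trained feature extractor (trained on ImageNet, not from scratch).
    base = tf.keras.applications.MobileNetV2(
        input_shape=(160, 160, 3), include_top=False, weights="imagenet")
    base.trainable = False  # freeze the transferred weights

    # New classification head for cats/dogs/hamsters (3 classes).
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(...) would now train only the new head, converging much
    # faster than training the whole network from scratch.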
In standard cookbook machine learning, we operate on a rectangular matrix; that is, all of our data points have the same number of features. How do we cope with situations in which our data points have different numbers of features? For example, if we want to do visual classification but all of our pictures are of different dimensions, or if we want to do sentiment analysis but all of our sentences have different numbers of words, or if we want to do stellar classification but the stars have been observed a different number of times, etc.
I think the normal way would be to extract features of regular size from these irregularly sized data. But I attended a talk on deep learning recently where the speaker emphasized that instead of hand-crafting features from data, deep learners are able to learn the appropriate features themselves. But how do we use e.g. a neural network if the input layer is not of a fixed size?
Since you are asking about deep learning, I assume you are more interested in end-to-end systems rather than feature design. Neural networks that can handle variable-size inputs are:
1) Convolutional neural networks with pooling layers. They are usually used in the image recognition context, but recently have been applied to modeling sentences as well. (I think they should also be good at classifying stars; see the sketch after this answer.)
2) Recurrent neural networks. (Good for sequential data like time series and sequence labeling tasks; also good for machine translation.)
3) Tree-based autoencoders (also called recursive autoencoders) for data arranged in tree-like structures (can be applied to sentence parse trees).
Lots of papers describing example applications can readily be found by googling.
For uncommon tasks you can select one of these based on the structure of your data, or you can design some variants and combinations of these systems.
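As a minimal sketch of point 1, here is a hypothetical Keras model whose input has no fixed spatial size; the global pooling layer collapses whatever spatial extent the convolutional features have into a fixed-length vector, so the dense head never sees a variable shape:

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(None, None, 3)),   # height/width left unspecified
        layers.Conv2D(16, 3, activation="relu"),
        layers.Conv2D(32, 3, activation="relu"),
        layers.GlobalMaxPooling2D(),           # -> fixed-size 32-d vector
        layers.Dense(10, activation="softmax"),
    ])
    model.summary()  # accepts any input height/width large enough for the kernels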
You can usually make the number of features the same for all instances quite easily:
if we want to do visual classification but all of our pictures are of different dimensions
Resize them all to a certain dimension / number of pixels.
if we want to do sentiment analysis but all of our sentences have different amounts of words
Keep a dictionary of k words drawn from all of your text data. Each instance will be a boolean vector of size k whose i-th entry is true if word i from the dictionary appears in that instance (this is not the best representation, but many are based on it). See the bag-of-words model.
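A minimal sketch of that boolean bag-of-words encoding with scikit-learn (the sentences are made up for illustration):

    from sklearn.feature_extraction.text import CountVectorizer

    sentences = ["this movie was great", "terrible movie, great waste of time"]

    # binary=True records presence/absence rather than counts, matching the
    # boolean vector described above; the fitted vocabulary plays the role
    # of the dictionary of k words.
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(sentences)

    print(vectorizer.get_feature_names_out())
    print(X.toarray())  # one fixed-length boolean row per sentence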
if we want to do stellar classification but all of the stars have been observed a different number of times
Take the features that have been observed for all the stars.
But I attended a talk on deep learning recently where the speaker emphasized that instead of hand-crafting features from data, deep learners are able to learn the appropriate features themselves.
I think the speaker probably referred to higher level features. For example, you shouldn't manually extract the feature "contains a nose" if you want to detect faces in an image. You should feed it the raw pixels, and the deep learner will learn the "contains a nose" feature somewhere in the deeper layers.
I'm trying to find a fast way to match descriptors from a database. My program works the following way:
1) Populate a database with descriptors of images (using proper feature detection algorithms).
2) Load an image.
3) Extract the descriptor for that image and compare it to all descriptors in the DB to find a proper match.
As you can imagine, it's very expensive to compute a comparison of 32 descriptors millions of times. I've used a hashing function, but that only works for two descriptors that are exactly the same, and thus only matches two exactly identical images.
What do you suggest I use to speed up this search?
Cheers
EDIT:
I've decided to start by approaching a Neural Network solution. Here's a pretty good link for anyone who wants to get started on the subject.
What you are trying to accomplish is essentially machine learning/classification. Direct comparison is really hopeless given the size of your database.
Just in case you are not aware of the terminology: the database that you have is called the training dataset. You need to use a machine learning algorithm like an SVM (support vector machine), k-NN (k-nearest neighbours), a neural network, etc., to learn a model. This model is later used to predict the outcome for a new, fresh instance, i.e., the vector you want to compare your database with.
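As a minimal sketch of the k-NN route, here is a hypothetical example with scikit-learn's NearestNeighbors; a tree-based index is built once, so each query avoids scanning every stored descriptor. The descriptor data here are random stand-ins for your database:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    db_descriptors = rng.random((100_000, 32))   # stand-in for the database
    query = rng.random((1, 32))                  # descriptor of the new image

    # Build a ball tree once; each query is then roughly logarithmic in the
    # number of stored descriptors instead of a full linear scan.
    index = NearestNeighbors(n_neighbors=5, algorithm="ball_tree")
    index.fit(db_descriptors)

    distances, indices = index.kneighbors(query)
    print(indices[0], distances[0])  # the five closest matches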
I would strongly suggest the use of an SVM, since it is the state-of-the-art classifier in the machine learning literature. There is an implementation of it called "SVM light".
http://svmlight.joachims.org/. It is also possible to do ranking with SVM-light; please have a look at that as well.
You can also try neural networks. I use the Neural Network Toolbox in MATLAB for this (nprtool), which is pretty neat.