Is employing BPNN for water quality management an overkill? [closed] - machine-learning

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Improve this question
I'm developing a device for Freshwater Quality Management which can be used for freshwater bodies such as lakes and rivers. The project is spread in three parts:
The first part deals with acquiring parameters such as pH, turbidity etc.
The second part deals with taking corrective measures based on the parameters. For instance, if the pH is too low, the device will inject basic solution to maintain a pH of 7-7.5.
Now the third part deals with predicting the health of the lake based on the parameters acquired (pH/Turbidity etc.). The predictive algorithm shall take in account of parameters and develop a correlation between them to explain for how long the lake will sustain. To achieve this, I'm currently biased toward using Back Propagation Neural Network (BPNN) as I have found that multiple other people/institutes prefer NN for water quality management.*
Now my concern is whether using BPNN would be an overkill for this project? If yes, which method/tool should I go for?
*1,2 and 3

Doing something the way "it used to be" is not always the best idea. In general, if you do not have strong, analytical reasons to choose neural network you should not ever start with it. Neural networks are tricky to train, have huge number of hyperparameters, are non-deterministic and computationaly expensive. Always start with the simpliest model, and only if it yields poor results - move to more complex ones. From theoretical perspecitive it is strongly justified by Vapnik theorems, and from practical it is similar to agile approach in programming.
So where to start?
Linear regression (Ridge regression, Lasso)
Polynomial regression
KNN regression
RBF Networks
Random Forest Regresor
If all of them fail - think about "classical" neural network. But chances are rather... small.

A neural network is a function approximator. If what you have is a real-valued vector of inputs, and associated with each of those vectors, you have a target real number or classification such as "good", "bad", "red", etc. then a neural network can be used to solve your problem.
Neural networks are, in their simplest form, functions of the form n(x) := g(Wh(Ax + b)+ c), where A and W are matrices, and b and c are vectors, h is a component-wise nonlinear function, generally a sigmoid function, and g is a function taking the same values as your target space.
In your case, your input vector, denoted x above, would contain pH, turbidity, etc, and your targets would be how long the lake will sustain. If your network is "trained" properly, it will be able to, given an unseen input u (new measurements for pH and turbidity etc), compute a good approximation to how long the lake will sustain.
"Training" a neural network consists of choosing the parameters for A, W, b, c. How many of these parameters there are depends on how many columns you chose for A and W (and therefore also for b and c). One way to choose these parameters is such that the function n(x) is close to your actual, measured targets on all of the historical (training) examples you have. More specifically, A,W,b,c are chosen to minimize E(A,W,b,c) := (n(x) - t(x))^2 where t(x) is your historically measured target (how long the lake sustained when the pH and turbidity were as measured in x). One way to try to minimize E over A,W,b,c is to compute the gradient of E with respect to each of the parameters and then take a step toward the negative of the gradient via an algorithm called back-propagation.
I want to note that the computation of a neural network, when the parameters are fixed, is deterministic, but that there are some algorithms for computing the gradient of E which aren't deterministic. Some other algorithms are deterministic.
So, with all that as background, are neural networks overkill for your project? That depends on the function you're trying to approximate from your observations to the output you're trying to predict. Whether a neural network will give you good prediction accuracy depends on many factors, perhaps the most important of which is how many examples you have to train on. If you don't have very many training examples relative to the number of predictors, a neural network may not be what you're looking for, but for the most part, that's an empirical question more than a theoretical one.
The nice thing is that, if you're willing to use python, there are good libraries to make all of this testing very easy for you. If you try a nerual network and it doesn't give you very good predictions, there are many other methods of regression you could try. You could try linear regression (which is a special case of a neural network), or a random forest for example. All of these are easy to code up in python if you use sklearn for your linear regression and your random forest. There are a few libraries for neural networks which make playing with them pretty easy, as well. I recommend tensorflow for neural networks.
My recommendation would be to spend a little bit of time trying several methods. For a relatively simple prediction problem like this, the time to train your network should be pretty short. The longer times of days or weeks you may have heard about are for massive datasets with millions or billions of training examples and millions of parameters.
Here http://pastebin.com/KrUAX9je is a toy neural network I created to "learn" to approximate a function f(a,b,c) = abc.

Backpropagation (BP) is a method for learning artificial neural network model parameters using gradient descent. It computes the gradients in an efficient manner. There are also other methods to train such models but BP is more commonly used due to many reasons. I do not know anything about the scale of the projects and the amount of data collected, but neural networks are more effective if the number of examples is large. If you have, say, 10 attributes (pH, Turbidity ...) and maybe more than 2-3k examples then neural networks could be helpful.
However, you should not think neural networks are the BEST model ever. You need to try out different models and choose the one giving you the best performance.

Related

Naive Bayes vs. SVM for classifying text data

I'm working on a problem that involves classifying a large database of texts. The texts are very short (think 3-8 words each) and there are 10-12 categories into which I wish to sort them. For the features, I'm simply using the tf–idf frequency of each word. Thus, the number of features is roughly equal to the number of words that appear overall in the texts (I'm removing stop words and some others).
In trying to come up with a model to use, I've had the following two ideas:
Naive Bayes (likely the sklearn multinomial Naive Bayes implementation)
Support vector machine (with stochastic gradient descent used in training, also an sklearn implementation)
I have built both models, and am currently comparing the results.
What are the theoretical pros and cons to each model? Why might one of these be better for this type of problem? I'm new to machine learning, so what I'd like to understand is why one might do better.
Many thanks!
The biggest difference between the models you're building from a "features" point of view is that Naive Bayes treats them as independent, whereas SVM looks at the interactions between them to a certain degree, as long as you're using a non-linear kernel (Gaussian, rbf, poly etc.). So if you have interactions, and, given your problem, you most likely do, an SVM will be better at capturing those, hence better at the classification task you want.
The consensus for ML researchers and practitioners is that in almost all cases, the SVM is better than the Naive Bayes.
From a theoretical point of view, it is a little bit hard to compare the two methods. One is probabilistic in nature, while the second one is geometric. However, it's quite easy to come up with a function where one has dependencies between variables which are not captured by Naive Bayes (y(a,b) = ab), so we know it isn't an universal approximator. SVMs with the proper choice of Kernel are (as are 2/3 layer neural networks) though, so from that point of view, the theory matches the practice.
But in the end it comes down to performance on your problem - you basically want to choose the simplest method which will give good enough results for your problem and have a good enough performance. Spam detection has been famously solvable by just Naive Bayes, for example. Face recognition in images by a similar method enhanced with boosting etc.
Support Vector Machine (SVM) is better at full-length content.
Multinomial Naive Bayes (MNB) is better at snippets.
MNB is stronger for snippets than for longer documents. While (Ng and Jordan,
2002) showed that NB is better than SVM/logistic
regression (LR) with few training cases, MNB is also better with short documents. SVM usually beats NB when it has more than 30–50 training cases, we show that MNB is still better on snippets even with relatively large training sets (9k cases).
Inshort, NBSVM seems to be an appropriate and very strong baseline for sophisticated classification text data.
Source Code: https://github.com/prakhar-agarwal/Naive-Bayes-SVM
Reference: http://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf
Cite: Wang, Sida, and Christopher D. Manning. "Baselines and bigrams:
Simple, good sentiment and topic classification." Proceedings of the
50th Annual Meeting of the Association for Computational Linguistics:
Short Papers-Volume 2. Association for Computational Linguistics,
2012.

How does pre-training improve classification in neural networks?

Many of the papers I have read so far have this mentioned "pre-training network could improve computational efficiency in terms of back-propagating errors", and could be achieved using RBMs or Autoencoders.
If I have understood correctly, AutoEncoders work by learning the
identity function, and if it has hidden units less than the size of
input data, then it also does compression, BUT what does this even have
anything to do with improving computational efficiency in propagating
error signal backwards? Is it because the weights of the pre
trained hidden units does not diverge much from its initial values?
Assuming data scientists who are reading this would by theirselves
know already that AutoEncoders take inputs as target values since
they are learning identity function, which is regarded as
unsupervised learning, but can such method be applied to
Convolutional Neural Networks for which the first hidden layer is
feature map? Each feature map is created by convolving a learned
kernel with a receptive field in the image. This learned kernel, how
could this be obtained by pre-training (unsupervised fashion)?
One thing to note is that autoencoders try to learn the non-trivial identify function, not the identify function itself. Otherwise they wouldn't have been useful at all. Well the pre-training helps moving the weight vectors towards a good starting point on the error surface. Then the backpropagation algorithm, which is basically doing gradient descent, is used improve upon those weights. Note that gradient descent gets stuck in the closes local minima.
[Ignore the term Global Minima in the image posted and think of it as another, better, local minima]
Intuitively speaking, suppose you are looking for an optimal path to get from origin A to destination B. Having a map with no routes shown on it (the errors you obtain at the last layer of the neural network model) kind of tells you where to to go. But you may put yourself in a route which has a lot of obstacles, up hills and down hills. Then suppose someone tells you about a route a a direction he has gone through before (the pre-training) and hands you a new map (the pre=training phase's starting point).
This could be an intuitive reason on why starting with random weights and immediately start to optimize the model with backpropagation may not necessarily help you achieve the performance you obtain with a pre-trained model. However, note that many models achieving state-of-the-art results do not use pre-training necessarily and they may use the backpropagation in combination with other optimization methods (e.g. adagrad, RMSProp, Momentum and ...) to hopefully avoid getting stuck in a bad local minima.
Here's the source for the second image.
I don't know a lot about autoencoder theory, but I've done a bit of work with RBMs. What RBMs do is they predict what the probability is of seeing the specific type of data in order to get the weights initialized to the right ball park- it is considered an (unsupervised) probabilistic model, so you don't correct using the known labels. Basically, the idea here is that having a learning rate that is too big will never lead to convergence but having one that is too small will take forever to train. Thus, by "pretraining" in this way you find out the ball park of the weights and then can set the learning rate to be small in order to get them down to the optimal values.
As for the second question, no, you don't generally prelearn kernels, at least not in an unsupervised fashion. I suspect that what is meant by pretraining here is a bit different than in your first question- this is to say, that what is happening is that they are taking a pretrained model (say from model zoo) and fine tuning it with a new set of data.
Which model you use generally depends on the type of data you have and the task at hand. Convnets I've found to train faster and efficiently, but not all data has meaning when convolved, in which case dbns may be the way to go. Unless say, you have a small amount of data then I'd use something other than neural networks entirely.
Anyways, I hope this helps clear some of your questions.

How can neural networks learn functions with a variable number of inputs? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed last year.
Improve this question
A simple example: Given an input sequence, I want the neural network to output the median of the sequence. The problem is, if a neural network learnt to compute the median of n inputs, how can it compute the median of even more inputs? I know that recurrent neural networks can learn functions like max and parity over a sequence, but computing these functions only requires constant memory. What if the memory requirement grows with the input size like computing the median?
This is a follow up question on How are neural networks used when the number of inputs could be variable?.
One idea I had is the following: treating each weight as a function of the number of inputs instead of a fixed value. So a weight may have many parameters that define a function, and we train these parameters. For example, if we want the neural network to compute the average of n inputs, we would like each weight function behaves like 1/n. Again, average per se can be computed using recurrent neural networks or hidden markov model, but I was hoping this kind of approaches can be generalized to solve certain problems where memory requirement grows.
If a neural network learnt to compute the median of n inputs, how can it compute the median of even more inputs?
First of all, you should understand the use of a neural network. We, generally use the neural network in problems where a mathematical solution is not possible. In this problem, use of NN is not significant/ unadvisable.
There are other problems of such nature, like forecasting, in which continuous data arrives over time.
One solution to such problem can be Hidden Markov Model (HMM). But again, such models depends on the correlation between input over a period of time. So This model is not efficient for problems where the input is completely random.
So, If input is completely random and memory requirement grows
There is nothing much you can do about it, one possible solution could be growing your memory size.
Just remember one thing, NN and similar models of machine learning aims to extract meaningful information from the data. if data is just some random values then all models will generate some random output.
One more idea: some data transformation. Let have N big enough that always bigger than n. We make a net with 2*N inputs. First N inputs are for data. If n less then N, then rest inputs set to 0. Last N inputs are intended for specifying which numbers are useful. Thus 1 is data, 0 is not data. As follows in Matlab notation: if v is an input, and it is a vector of length 2*N, then we put into v(1:n) our original data. After that, we put to v(n+1:N) zeros. Then put to v(N+1:N+n) ones, and then put V(N+n+1:2*N) zeros. It is just an idea, which I have not checked. If you are interested in the application of neural networks, take a look at the example of how we have chosen an appropriate machine learning algorithm to classify EEG signals for BCI.

Machine Learning Algorithm selection

I am new in machine learning. My problem is to make a machine to select a university for the student according to his location and area of interest. i.e it should select the university in the same city as in the address of the student. I am confused in selection of the algorithm can I use Perceptron algorithm for this task.
There are no hard rules as to which machine learning algorithm is the best for which task. Your best bet is to try several and see which one achieves the best results. You can use the Weka toolkit, which implements a lot of different machine learning algorithms. And yes, you can use the perceptron algorithm for your problem -- but that is not to say that you would achieve good results with it.
From your description it sounds like the problem you're trying to solve doesn't really require machine learning. If all you want to do is match a student with the closest university that offers a course in the student's area of interest, you can do this without any learning.
I second the first remark that you probably don't need machine learning if the student has to live in the same area as the university. If you want to use an ML algorithm, maybe it would best to think about what data you would have to start with. The thing that comes to mind is a vector for a university that has certain subjects/areas for each feature. Then compute a distance from a vector which is like an ideal feature vector for the student. Minimize this distance.
The first and formost thing you need is a labeled dataset.
It sounds like the problem could be decomposed into a ML problem however you first need a set of positive and negative examples to train from.
How big is your dataset? What features do you have available? Once you answer these questions you can select an algorithm that bests fits the features of your data.
I would suggest using decision trees for this problem which resembles a set of if else rules. You can just take the location and area of interest of the student as conditions of if and else if statements and then suggest a university for him. Since its a direct mapping of inputs to outputs, rule based solution would work and there is no learning required here.
Maybe you can use a "recommender system"or a clustering approach , you can investigate more deeply the techniques like "collaborative filtering"(recommender system) or k-means(clustering) but again, as some people said, first you need data to learn from, and maybe your problem can be solved without ML.
Well, there is no straightforward and sure-shot answer to this question. The answer depends on many factors like the problem statement and the kind of output you want, type and size of the data, the available computational time, number of features, and observations in the data, to name a few.
Size of the training data
Accuracy and/or Interpretability of the output
Accuracy of a model means that the function predicts a response value for a given observation, which is close to the true response value for that observation. A highly interpretable algorithm (restrictive models like Linear Regression) means that one can easily understand how any individual predictor is associated with the response while the flexible models give higher accuracy at the cost of low interpretability.
Speed or Training time
Higher accuracy typically means higher training time. Also, algorithms require more time to train on large training data. In real-world applications, the choice of algorithm is driven by these two factors predominantly.
Algorithms like Naïve Bayes and Linear and Logistic regression are easy to implement and quick to run. Algorithms like SVM, which involve tuning of parameters, Neural networks with high convergence time, and random forests, need a lot of time to train the data.
Linearity
Many algorithms work on the assumption that classes can be separated by a straight line (or its higher-dimensional analog). Examples include logistic regression and support vector machines. Linear regression algorithms assume that data trends follow a straight line. If the data is linear, then these algorithms perform quite good.
Number of features
The dataset may have a large number of features that may not all be relevant and significant. For a certain type of data, such as genetics or textual, the number of features can be very large compared to the number of data points.

Which machine learning classifier to choose, in general? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
Suppose I'm working on some classification problem. (Fraud detection and comment spam are two problems I'm working on right now, but I'm curious about any classification task in general.)
How do I know which classifier I should use?
Decision tree
SVM
Bayesian
Neural network
K-nearest neighbors
Q-learning
Genetic algorithm
Markov decision processes
Convolutional neural networks
Linear regression or logistic regression
Boosting, bagging, ensambling
Random hill climbing or simulated annealing
...
In which cases is one of these the "natural" first choice, and what are the principles for choosing that one?
Examples of the type of answers I'm looking for (from Manning et al.'s Introduction to Information Retrieval book):
a. If your data is labeled, but you only have a limited amount, you should use a classifier with high bias (for example, Naive Bayes).
I'm guessing this is because a higher-bias classifier will have lower variance, which is good because of the small amount of data.
b. If you have a ton of data, then the classifier doesn't really matter so much, so you should probably just choose a classifier with good scalability.
What are other guidelines? Even answers like "if you'll have to explain your model to some upper management person, then maybe you should use a decision tree, since the decision rules are fairly transparent" are good. I care less about implementation/library issues, though.
Also, for a somewhat separate question, besides standard Bayesian classifiers, are there 'standard state-of-the-art' methods for comment spam detection (as opposed to email spam)?
First of all, you need to identify your problem. It depends upon what kind of data you have and what your desired task is.
If you are Predicting Category :
You have Labeled Data
You need to follow Classification Approach and its algorithms
You don't have Labeled Data
You need to go for Clustering Approach
If you are Predicting Quantity :
You need to go for Regression Approach
Otherwise
You can go for Dimensionality Reduction Approach
There are different algorithms within each approach mentioned above. The choice of a particular algorithm depends upon the size of the dataset.
Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/
Model selection using cross validation may be what you need.
Cross validation
What you do is simply to split your dataset into k non-overlapping subsets (folds), train a model using k-1 folds and predict its performance using the fold you left out. This you do for each possible combination of folds (first leave 1st fold out, then 2nd, ... , then kth, and train with the remaining folds). After finishing, you estimate the mean performance of all folds (maybe also the variance/standard deviation of the performance).
How to choose the parameter k depends on the time you have. Usual values for k are 3, 5, 10 or even N, where N is the size of your data (that's the same as leave-one-out cross validation). I prefer 5 or 10.
Model selection
Let's say you have 5 methods (ANN, SVM, KNN, etc) and 10 parameter combinations for each method (depending on the method). You simply have to run cross validation for each method and parameter combination (5 * 10 = 50) and select the best model, method and parameters. Then you re-train with the best method and parameters on all your data and you have your final model.
There are some more things to say. If, for example, you use a lot of methods and parameter combinations for each, it's very likely you will overfit. In cases like these, you have to use nested cross validation.
Nested cross validation
In nested cross validation, you perform cross validation on the model selection algorithm.
Again, you first split your data into k folds. After each step, you choose k-1 as your training data and the remaining one as your test data. Then you run model selection (the procedure I explained above) for each possible combination of those k folds. After finishing this, you will have k models, one for each combination of folds. After that, you test each model with the remaining test data and choose the best one. Again, after having the last model you train a new one with the same method and parameters on all the data you have. That's your final model.
Of course, there are many variations of these methods and other things I didn't mention. If you need more information about these look for some publications about these topics.
The book "OpenCV" has a great two pages on this on pages 462-463. Searching the Amazon preview for the word "discriminative" (probably google books also) will let you see the pages in question. These two pages are the greatest gem I have found in this book.
In short:
Boosting - often effective when a large amount of training data is available.
Random trees - often very effective and can also perform regression.
K-nearest neighbors - simplest thing you can do, often effective but slow and requires lots of memory.
Neural networks - Slow to train but very fast to run, still optimal performer for letter recognition.
SVM - Among the best with limited data, but losing against boosting or random trees only when large data sets are available.
Things you might consider in choosing which algorithm to use would include:
Do you need to train incrementally (as opposed to batched)?
If you need to update your classifier with new data frequently (or you have tons of data), you'll probably want to use Bayesian. Neural nets and SVM need to work on the training data in one go.
Is your data composed of categorical only, or numeric only, or both?
I think Bayesian works best with categorical/binomial data. Decision trees can't predict numerical values.
Does you or your audience need to understand how the classifier works?
Use Bayesian or decision trees, since these can be easily explained to most people. Neural networks and SVM are "black boxes" in the sense that you can't really see how they are classifying data.
How much classification speed do you need?
SVM's are fast when it comes to classifying since they only need to determine which side of the "line" your data is on. Decision trees can be slow especially when they're complex (e.g. lots of branches).
Complexity.
Neural nets and SVMs can handle complex non-linear classification.
As Prof Andrew Ng often states: always begin by implementing a rough, dirty algorithm, and then iteratively refine it.
For classification, Naive Bayes is a good starter, as it has good performances, is highly scalable and can adapt to almost any kind of classification task. Also 1NN (K-Nearest Neighbours with only 1 neighbour) is a no-hassle best fit algorithm (because the data will be the model, and thus you don't have to care about the dimensionality fit of your decision boundary), the only issue is the computation cost (quadratic because you need to compute the distance matrix, so it may not be a good fit for high dimensional data).
Another good starter algorithm is the Random Forests (composed of decision trees), this is highly scalable to any number of dimensions and has generally quite acceptable performances. Then finally, there are genetic algorithms, which scale admirably well to any dimension and any data with minimal knowledge of the data itself, with the most minimal and simplest implementation being the microbial genetic algorithm (only one line of C code! by Inman Harvey in 1996), and one of the most complex being CMA-ES and MOGA/e-MOEA.
And remember that, often, you can't really know what will work best on your data before you try the algorithms for real.
As a side-note, if you want a theoretical framework to test your hypothesis and algorithms theoretical performances for a given problem, you can use the PAC (Probably approximately correct) learning framework (beware: it's very abstract and complex!), but to summary, the gist of PAC learning says that you should use the less complex, but complex enough (complexity being the maximum dimensionality that the algo can fit) algorithm that can fit your data. In other words, use the Occam's razor.
Sam Roweis used to say that you should try naive Bayes, logistic regression, k-nearest neighbour and Fisher's linear discriminant before anything else.
My take on it is that you always run the basic classifiers first to get some sense of your data. More often than not (in my experience at least) they've been good enough.
So, if you have supervised data, train a Naive Bayes classifier. If you have unsupervised data, you can try k-means clustering.
Another resource is one of the lecture videos of the series of videos Stanford Machine Learning, which I watched a while back. In video 4 or 5, I think, the lecturer discusses some generally accepted conventions when training classifiers, advantages/tradeoffs, etc.
You should always keep into account the inference vs prediction trade-off.
If you want to understand the complex relationship that is occurring in your data then you should go with a rich inference algorithm (e.g. linear regression or lasso). On the other hand, if you are only interested in the result you can go with high dimensional and more complex (but less interpretable) algorithms, like neural networks.
Selection of Algorithm is depending upon the scenario and the type and size of data set.
There are many other factors.
This is a brief cheat sheet for basic machine learning.

Resources