What machine learning method to use for optimal luggage load? - machine-learning

I have an optimization problem and I'm thinking about employing machine learning methods to solve it.
The goal is to find the optimal luggage load for a plane. Each piece of luggage has a size, a weight and some other attributes that restrict where it can be placed.
There are some strict rules (e.g. some kinds of baggage need to be at the front of the luggage compartment, some cannot be placed next to others, heavy pieces should be at the bottom, etc.) and some conditions have to be met at the end: the weight should be evenly distributed and all luggage needs to fit inside.
There are also some undocumented ("common sense") rules that are applied by the people responsible for loading the luggage.
Those could be derived from the available examples.
I was thinking about training a neural network on those examples.
I looked at http://neuroph.sourceforge.net/sample_projects.html but couldn't find a sample that could be adapted to my domain.
Also, I'm not sure if this is the right method for this category of problem.
Here's a simplified example.
Let's say a piece of baggage is described by 3 attributes:
weight
stiffness
type
Sample data:
B1 (20kg, soft, normal)
B2 (20kg, hard, normal)
B3 (40kg, hard, normal)
B4 (50kg, hard, normal)
B5 (20kg, soft, special)
There are 6 slots in the luggage compartment (S1-S6, S1 - front, S6 - back).
Example rules:
the heaviest load should be placed as close to the middle of the compartment as possible
the special ones should be at the back
common sense rule (not documented): soft baggage shouldn't be placed between hard pieces
One of the correct outputs is:
S1->B1 (normal but soft, so it's not between B2 and B4)
S2->B2 (normal)
S3->B4 (heavy - in the middle)
S4->B3 (heavy - in the middle)
S5->B5 (special - at the back)
Representing the output in the normalized form required by a NN is tricky.
I'm wondering if a NN is a good method for this problem at all.
Maybe some other method should be used (e.g. reinforcement learning).
What machine learning method(s) could be used to find the optimal luggage load?
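To make the toy example concrete, here is a minimal sketch (plain brute-force search, no learning involved, and the rule weights are hypothetical) that scores every assignment of bags to slots against the rules above:

```python
from itertools import permutations

# Toy encoding of the example: bag -> (weight in kg, stiffness, type)
bags = {
    "B1": (20, "soft", "normal"),
    "B2": (20, "hard", "normal"),
    "B3": (40, "hard", "normal"),
    "B4": (50, "hard", "normal"),
    "B5": (20, "soft", "special"),
}
SLOTS = 6                       # S1 (front) .. S6 (back)
MIDDLE = (SLOTS - 1) / 2.0      # centre of the compartment

def score(assignment):
    """Higher is better; assignment maps slot index (0-based) -> bag id or None."""
    s = 0.0
    for slot, bag in assignment.items():
        if bag is None:
            continue
        weight, stiffness, kind = bags[bag]
        s -= weight * abs(slot - MIDDLE)          # heavy bags close to the middle
        if kind == "special":
            s -= 100 * (SLOTS - 1 - slot)         # special bags at the back
        left, right = assignment.get(slot - 1), assignment.get(slot + 1)
        if (stiffness == "soft" and left and right
                and bags[left][1] == "hard" and bags[right][1] == "hard"):
            s -= 50                               # soft bag squeezed between hard ones
    return s

# Try all placements of the 5 bags (plus one empty slot) into the 6 slots.
best = max(
    (dict(zip(range(SLOTS), perm)) for perm in permutations(list(bags) + [None], SLOTS)),
    key=score,
)
print({f"S{i + 1}": b for i, b in best.items() if b})
```

Exhaustive search obviously doesn't scale to a real compartment, but writing the rules down as an explicit score like this is also the natural starting point for a constraint solver or for the reinforcement learning route mentioned above (the score becomes the reward).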

Related

Is splitting a long document of a dataset for BERT considered bad practice?

I am fine-tuning a BERT model on a labeled dataset with many documents longer than the 512 token limit set by the tokenizer.
Since truncating would lose a lot of data I would rather use, I started looking for a workaround. However, I noticed that simply splitting the documents after 512 tokens (or by some other heuristic) and creating new entries in the dataset with the same label is never mentioned.
In this answer, someone mentioned that you would need to recombine the predictions; is that necessary when splitting the documents?
Is this generally considered bad practice or does it mess with the integrity of the results?
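For reference, a minimal sketch of the splitting idea described above: chunk the token ids and duplicate the label for every chunk. The checkpoint name and the dictionary format are assumptions, not from the original post.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
MAX_LEN = 512              # BERT's input limit
CHUNK = MAX_LEN - 2        # leave room for [CLS] and [SEP]

def split_example(text, label):
    """Turn one long document into several <=512-token examples sharing the same label."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + CHUNK] for i in range(0, len(ids), CHUNK)]
    return [
        {"input_ids": [tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id],
         "label": label}
        for chunk in chunks
    ]

# A 2100-token document becomes 5 chunks, each stored as its own labelled example.
rows = split_example("some very long document ...", label=0)
```

Whether you then need to recombine the per-chunk predictions at inference time is exactly what the answer below discusses.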
You have not mentioned whether your intention is to classify, but given that you refer to an article on classification, I will describe an approach where you classify the whole text.
The main question is: which part of the text is the most informative for your purpose? In other words, does it make sense to use more than the first/last split of the text?
When considering long passages of text, it is frequently enough to look at the first (or last) 512 tokens to correctly predict the class in a substantial majority of cases (say 90%). Even though you may lose some precision, you gain on speed and performance of the overall solution, and you get rid of the nasty problem of figuring out the correct class from a set of per-piece classifications. Why?
Consider a text 2100 tokens long. You split it every 512 tokens, obtaining pieces of 512, 512, 512, 512 and 52 tokens (notice the small last piece; should you even consider it?). The target class for this text is, say, A, but you get the following predictions on the pieces: A, B, A, B, C. Now you have a headache figuring out the right way to determine the class. You can:
use majority voting, but it is not conclusive here;
weight the predictions by the length of the piece; again inconclusive;
note that the prediction of the last piece is class C, but it is barely above the threshold and class C is quite similar to A, so you lean towards A;
re-classify, starting the split from the end. In the same order as before you get: A, B, C, A, A, so clearly A. You also get A when you majority-vote over all of the classifications combined (forward and backward splits);
consider the confidence of the classifications, e.g. A: 80%, B: 70%, A: 90%, B: 60%, C: 55%, i.e. on average 85% for A vs. 65% for B (see the aggregation sketch after this list);
reconfirm the labelling of the last piece manually: if it turns out to be B, that changes all of the above;
train an additional network to classify from the raw per-piece classifications, getting into trouble again when figuring out what to do with particularly long sequences or inconclusive combinations of predictions, which result in poor confidence for the additional classification layer.
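A minimal sketch of two of the aggregation strategies above (majority vote and confidence averaging); the input format, a list of (label, confidence) pairs per piece, is my assumption:

```python
from collections import Counter, defaultdict

def majority_vote(piece_predictions):
    """piece_predictions: list of (label, confidence) pairs, one per chunk."""
    counts = Counter(label for label, _ in piece_predictions)
    label, count = counts.most_common(1)[0]
    conclusive = count > len(piece_predictions) / 2   # strict majority?
    return label, conclusive

def average_confidence(piece_predictions):
    """Average the confidence per label; return the label with the highest mean."""
    per_label = defaultdict(list)
    for label, conf in piece_predictions:
        per_label[label].append(conf)
    return max(per_label, key=lambda l: sum(per_label[l]) / len(per_label[l]))

# The 2100-token example from the text: pieces predicted A, B, A, B, C.
preds = [("A", 0.80), ("B", 0.70), ("A", 0.90), ("B", 0.60), ("C", 0.55)]
print(majority_vote(preds))       # ('A', False): A wins, but without a strict majority
print(average_confidence(preds))  # 'A' (mean 0.85 vs. 0.65 for B)
```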
It turns out that there is no easy way. And you will notice that text is strange classification material, exhibiting all of the above issues (and more), while the difference in agreement with the annotation between the first-piece prediction and the ultimate, perfect classifier is typically slim at best.
So spare the effort and strive for simplicity, performance and a good heuristic... and clip it!
For details of the best practices you should probably refer to the article from this answer.

How is Growing Neural Gas used for clustering?

I know how the algorithm works, but I'm not sure how it determines the clusters. Based on images, I guess that it treats all neurons that are connected by edges as one cluster, so two connected groups of neurons would give two clusters. But is that really it?
I also wonder: is GNG really a neural network? It doesn't have a propagation function, an activation function or weighted edges; isn't it just a graph? I guess that depends on personal opinion a bit, but I would like to hear them.
UPDATE:
This thesis www.booru.net/download/MasterThesisProj.pdf deals with GNG clustering, and on page 11 you can see an example of what looks like clusters of connected neurons. But then I'm also confused by the number of iterations. Let's say I have 500 data points to cluster. Once I have put them all in, do I remove them and add them again to adapt the existing network? And how often do I do that?
I mean, I have to re-add them at some point: when a new neuron r is added between two old neurons u and v, some data points formerly belonging to u should now belong to r because it is closer. But the algorithm does not include changing the assignment of these data points. And even if I remove them after one iteration and add them all again, the wrong assignment of the points for the rest of that first iteration still affects the processing of the network, doesn't it?
NG and GNG are a form of self-organizing map (SOM), also referred to as "Kohonen neural networks".
These are based on an older, much wider view of neural networks, from when they were still inspired by nature rather than driven by GPU capabilities for matrix operations. Back then, when you did not yet have massive-SIMD architectures, there was nothing wrong with having neurons self-organize rather than being pre-organized in strict layers.
I would not call this clustering, although that term is commonly (ab)used in related work, because I don't see any strong property of these "clusters".
SOMs are literally maps, as in geography. A SOM is a set of nodes ("neurons"), usually arranged in a 2d rectangular or hexagonal grid (= the map). The positions of the nodes in the input space are then optimized iteratively to fit the data. Because they influence their neighbors, they cannot move freely; think of wrapping a net around a tree, where the knots of the net are your neurons. NG and GNG appear to be pretty much the same thing, but with a more flexible structure of nodes. A nice property of SOMs, though, is the 2d map that you get.
The only approach I remember for clustering was to project the input data onto the discrete 2d space of the SOM grid and then run k-means on this projection. It will probably work okayish (as in: it will perform similarly to k-means), but I'm not convinced that it's theoretically well supported.
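A minimal sketch of that project-then-cluster idea, using the MiniSom and scikit-learn packages (my choice of libraries, not something from the answer above):

```python
import numpy as np
from minisom import MiniSom          # pip install minisom
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 4))     # 500 toy points with 4 features

# Fit a 10x10 SOM to the data.
som = MiniSom(10, 10, data.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(data, 1000)

# Project every point to the 2d grid coordinates of its best-matching unit.
projected = np.array([som.winner(x) for x in data], dtype=float)

# Run k-means on the projected (discrete) 2d coordinates.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(projected)
print(labels[:20])
```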

Machine Learning Techniques in Spam Filtering by Konstantin Tretyakov

http://ats.cs.ut.ee/u/kt/hw/spam/spam.pdf
First of all, I am not sure if this is even a good question for Stack Overflow, as it's not directly related to code; I just couldn't think of a different place to ask it.
I have been looking into machine learning for a report I have to write, and I wanted to write something about spam filtering. The link above looks like a pretty good and trustworthy source, but I am probably pretty dumb and just don't understand what they are saying in the neural networks part (page 68 and on). In the part where they adjust w and b, they use c and x to adjust them. c is 1 or -1, as far as I understand it (might be wrong here too though :(), but x is the preprocessed mail itself (words like 'the' removed, words like 'running' stemmed to 'run'), right? How can you use that to rework the weights w? They say w_new = w_old + c*x, but how do you multiply by something that isn't a number?
It is a vector representation of the words. In existing ML libraries there are tools to build such representations automatically (like this or that).
In general the idea is trivial. Imagine you know there are 1000 words out there in your document corpus, no more, no less. The simplest vector representation for a word is a sparse vector of size 1000x1 in which every row is 0 except the single row corresponding to that word, which is 1 (this is often called "one-hot encoding").
This is an ugly representation though; usually something more effective is used, like a TF-IDF representation, although that is a variation of the same idea.
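To make the update rule w_new = w_old + c*x concrete, here is a minimal sketch with a bag-of-words vectorizer from scikit-learn (my choice of library, not the paper's) and a plain perceptron update; once the mail is turned into a numeric vector x, multiplying by c = +1 or -1 is ordinary arithmetic:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

mails = ["cheap pills buy now", "meeting at noon tomorrow",
         "buy cheap watches now", "see you at the meeting"]
labels = np.array([-1, 1, -1, 1])    # c: -1 = spam, +1 = ham

# Turn each (already preprocessed) mail into a word-count vector: this is x.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(mails).toarray()

# Perceptron: apply w_new = w_old + c*x whenever a mail is misclassified.
w = np.zeros(X.shape[1])
b = 0.0
for _ in range(10):                  # a few passes over the data
    for x, c in zip(X, labels):
        if c * (w @ x + b) <= 0:     # wrong (or undecided) prediction
            w += c * x               # the update from the paper, now on numbers
            b += c

print(np.sign(X @ w + b))            # matches `labels` once the data is separable
```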

What is multiobjective clustering?

I don't understand what multiobjective clustering is. Is it clustering using multiple variables, or what?
I know that Stack Overflow might not be the best place for this kind of question, but I've asked it on another website and did not get a response.
Multiobjective optimization in general means that you have multiple criteria you are interested in which cannot simply be converted into something comparable. For example, consider the problem where you want a model that is both very fast and very accurate. Time is measured in seconds, accuracy in %. How do you compare (1s, 90%) and (10 days, 92%)? Which one is better? In general there is no answer. So what people usually do is look for the Pareto front: you test K models and select M <= K of them such that none of them is clearly "beaten" by any other. For example, if we add (1s, 91%) to the previous example, the Pareto front is {(1s, 91%), (10 days, 92%)}, as (1s, 90%) < (1s, 91%) and the remaining ones are impossible to compare.
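A minimal sketch of finding that Pareto front for the (time, accuracy) example; the dominance rule assumed here is "no worse on both criteria and strictly better on at least one":

```python
def dominates(a, b):
    """a, b are (time, accuracy) pairs: lower time and higher accuracy are better."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better

def pareto_front(models):
    """Keep every model that is not dominated by any other model."""
    return [m for m in models
            if not any(dominates(other, m) for other in models if other != m)]

# Times in seconds (10 days = 864000 s), accuracies in %.
models = [(1, 90), (864000, 92), (1, 91)]
print(pareto_front(models))   # [(864000, 92), (1, 91)]
```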
And now you can apply the same idea in a clustering setting. Say, for example, that you want to build a model which is fast at assigning new instances to clusters, minimizes the average distance inside each cluster, and does not put too many special instances labelled X into any single cluster. Then again you get models (clusterings) that are characterized by three incomparable measures, and in multiobjective clustering you try to deal with these problems (for example by finding the Pareto front of such clusterings).

What is the appropriate Machine Learning Algorithm for this scenario?

I am working on a Machine Learning problem which looks like this:
Input variables
Categorical: a, b, c, d
Continuous: e
Output variables
Discrete (integers): v, x, y
Continuous: z
The major issue I am facing is that the output variables are not totally independent of each other, yet no explicit relation can be established between them. That is, there is a dependence, but not a strictly causal one: one value being high doesn't imply that the other will be high too, but it does make it more likely.
An Example would be:
v - Number of Ad Impressions
x - Number of Ad Clicks
y - Number of Conversions
z - Revenue
Now, for an ad to be clicked, it first has to appear in a search, so clicks are somewhat dependent on impressions.
Again, for an ad to convert, it first has to be clicked, so conversions are somewhat dependent on clicks.
So running 4 separate instances of the problem, predicting each output variable on its own, doesn't make sense to me. In fact, there should be some way to predict all 4 together while taking care of their implicit dependencies.
But as you can see, there won't be a direct relation; in fact there is a probabilistic relationship involved which can't be worked out manually.
Plus, the output variables are not categorical; they are in fact discrete and continuous.
Any input on how to go about solving this problem? Also, please point me to existing implementations and to a toolkit I could use to quickly implement a solution.
Just a random guess: I think this problem could be targeted by Bayesian networks. What do you think?
Bayesian networks will do fine in your case. Your network won't be that huge either, so you can live with exact inference algorithms like graph elimination or the junction tree algorithm. If you decide to use BNs, you can use Kevin Murphy's BN toolbox. Here is a link to that. For a more general toolbox that uses Gibbs sampling for approximate Monte Carlo inference, you can use BUGS.
Edit:
As an example, look at the famous sprinkler example here. For totally discrete variables, you define the conditional probability tables as in the link. For instance, you say that given that today is cloudy, there is a 0.8 probability of rain. You define all the probability distributions, and the graph encodes the causality relations (i.e. cloud causes rain, etc.). Then, as a query, you ask your inference algorithm questions like: given that the grass was wet, was it cloudy, was it raining, was the sprinkler on, and so on.
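A minimal sketch of that sprinkler query in Python using the pgmpy package (a different toolbox from the ones mentioned above; the CPT numbers follow the standard version of the example):

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Cloudy -> Sprinkler, Cloudy -> Rain, Sprinkler -> WetGrass, Rain -> WetGrass
model = BayesianNetwork([("Cloudy", "Sprinkler"), ("Cloudy", "Rain"),
                         ("Sprinkler", "WetGrass"), ("Rain", "WetGrass")])

cpd_cloudy = TabularCPD("Cloudy", 2, [[0.5], [0.5]])
cpd_sprinkler = TabularCPD("Sprinkler", 2,
                           [[0.5, 0.9],            # P(Sprinkler=off | Cloudy=no/yes)
                            [0.5, 0.1]],
                           evidence=["Cloudy"], evidence_card=[2])
cpd_rain = TabularCPD("Rain", 2,
                      [[0.8, 0.2],                 # P(Rain=no | Cloudy=no/yes)
                       [0.2, 0.8]],                # "if cloudy, 0.8 probability of rain"
                      evidence=["Cloudy"], evidence_card=[2])
cpd_wet = TabularCPD("WetGrass", 2,
                     [[1.0, 0.1, 0.1, 0.01],       # P(WetGrass=no | Sprinkler, Rain)
                      [0.0, 0.9, 0.9, 0.99]],
                     evidence=["Sprinkler", "Rain"], evidence_card=[2, 2])

model.add_cpds(cpd_cloudy, cpd_sprinkler, cpd_rain, cpd_wet)
assert model.check_model()

# Query: given that the grass is wet, was it raining?
infer = VariableElimination(model)
print(infer.query(["Rain"], evidence={"WetGrass": 1}))
```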
To use BNs you need a system model described in terms of causality relations (a directed acyclic graph) and probability transitions. If you want to learn the system parameters, there are techniques like the EM algorithm. However, learning the graph structure is a really hard task, and supervised machine learning approaches will do better in that case.

Resources