objects classification, mutually related features - machine-learning

Looking for some inspirations on how to address the following problem:
there is a collection of multiple worlds,
each world has a collection of objects,
a single object, or a group of objects, may have a maximum of one category assigned,
some categories are mutually related - i.e., the fact that object1 in group1 belongs to categoryA increases the chance that some other group containing the same object1 belongs to categoryB
Given a dataset with multiple fully described worlds, the goal is to take a completely new world and correctly categorize its objects and groups.
I would appreciate some ideas on how to address it.
My approach was to write classifiers that learn different characteristics of objects and groups from the training data, and then assign scores (a number between 0 and 1) to different combinations of objects in the unknown world. The problem I'm facing, though, is how to produce the final answer. With about 20 classifiers, each assigning scores to multiple groups, it's difficult to decide. For example, sometimes multiple classifiers return very small scores that add up to a big number, and that overshadows the fact that one classifier that rarely fires returned 1.
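To make the summing problem concrete, here is a minimal sketch that compares a plain sum with a noisy-OR combination. It is only one possible option, and it assumes each score can be read as an independent probability that the group belongs to the category, which is not stated in the question. The function names and score values are made up for illustration.

```python
import math

def sum_scores(scores):
    """Plain sum: many weak scores can outweigh one certain score."""
    return sum(scores)

def noisy_or(scores):
    """Noisy-OR: 1 - prod(1 - s). A single score of 1 forces the result to 1."""
    return 1.0 - math.prod(1.0 - s for s in scores)

# Hypothetical scores from 20 classifiers for two candidate groups.
many_weak = [0.1] * 19 + [0.0]    # 19 classifiers are mildly positive
one_strong = [0.0] * 19 + [1.0]   # one rarely firing classifier is certain

for name, scores in [("many_weak", many_weak), ("one_strong", one_strong)]:
    print(name, "sum =", round(sum_scores(scores), 3),
          "noisy_or =", round(noisy_or(scores), 3))
```

With the plain sum the many weak scores win; with noisy-OR the single certain classifier is no longer drowned out. Whether that behaviour is desirable depends on how much the rare classifier can be trusted.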

Related

Is this problem a classification or regression?

In a lecture from Andrew Ng, he asked whether the problem below is a classification or a regression problem. Answer: It is a regression problem.
You have a large inventory of identical items. You want to predict how
many of these items will sell over the next 3 months.
It looks like I am missing something. As I understand it, this should be a classification problem, because we have to classify each item into one of two categories (it will be sold or it will not), and those are discrete values, not continuous ones.
I am not sure where the gap in my understanding is.
Your thinking is that you have a database of items with their respective features and want to predict if each item will be sold. At the end, you would simply count the number of items that can be sold. If you frame the problem this way, then it would be a classification problem indeed.
However, note the following sentence in your question:
You have a large inventory of identical items.
Identical items means that all items will have exactly the same features. If you come up with a binary classifier that tells whether a product can be sold or not, since all feature values are exactly the same, your classifier would put all items in the same category.
I would guess that, to solve this problem, you would probably have access to the time series of items sold per month for the past 5 years, for instance. Then you would have to crunch this data and extrapolate into the future. You won't be classifying each item individually but actually calculating a numerical value that indicates the number of items sold 1, 2, and 3 months in the future.
According to Pattern Recognition and Machine Learning (Christopher M. Bishop, 2006):
Cases such as the digit recognition example, in which the aim is to assign each input vector to one of a finite number of discrete categories, are called classification problems. If the desired output consists of one or more continuous variables, then the task is called regression.
On top of that, it is important to understand the difference between categorical, ordinal, and numerical variables, as defined in statistics:
A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories.
(...)
An ordinal variable is similar to a categorical variable. The difference between the two is that there is a clear ordering of the variables. For example, suppose you have a variable, economic status, with three categories (low, medium and high). In addition to being able to classify people into these three categories, you can order the categories as low, medium and high.
(...)
A numerical variable is similar to an ordinal variable, except that the intervals between the values of the numerical variable are equally spaced. For example, suppose you have a variable such as annual income that is measured in dollars, and we have three people who make $10,000, $15,000 and $20,000.
Although your end result will be an integer (a discrete set of numbers), note that it is still a numerical value, not a category. You can manipulate numerical values mathematically (e.g. calculate the average number of items sold in the next year, find the peak number of items sold in the next 3 months...) but you cannot do that with discrete categories (e.g. what would be the average of a cellphone and a telephone?).
Classification problems are the ones where the output is either categorical or ordinal (discrete categories, as per Bishop). Regression problems output numerical values (continuous variables, as per Bishop).
Your system might be restricted to outputting integers instead of real numbers, but that won't change the nature of the variable, which is still numerical. Therefore, your problem is a regression problem.
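As a rough illustration of the time-series idea in this answer, the sketch below fits a straight line to past monthly sales and extends it three months ahead. The sales figures are made up, and a linear trend is only an assumption chosen for brevity; real sales data would likely need a proper forecasting model.

```python
import numpy as np

# Hypothetical number of items sold in each of the past 12 months.
monthly_sales = np.array(
    [120, 125, 131, 128, 140, 138, 145, 150, 149, 158, 160, 166], dtype=float
)
months = np.arange(len(monthly_sales))

# Least-squares fit of a straight line: sales ~ slope * month + intercept.
slope, intercept = np.polyfit(months, monthly_sales, deg=1)

# Extend the fitted line to the next 3 months.
future_months = np.arange(len(monthly_sales), len(monthly_sales) + 3)
forecast = slope * future_months + intercept

# The output is a number (rounded to an integer), not a category.
total_next_quarter = int(round(forecast.sum()))
print("forecast per month:", np.round(forecast, 1))
print("expected sales over the next 3 months:", total_next_quarter)
```

Note that the model outputs a quantity for the whole inventory, not a sold/unsold label per item.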

What is a good approach to clustering multi-dimensional data?

I created a k-means clustering for data based on one multidimensional feature, i.e. 24-hour power usage per customer for many customers. Now I'd like to figure out a good way to take data that hypothetically comes from a player's matches within a game and try to predict the win probability.
It would be something like:
Player A
Match 1
Match 2
.
.
.
Match N
And each match would have stats of differing dimensions for that player such as the player's X/Y coordinates at a given time, time a score was made by the player, and such. Example, the X/Y would have data points based on the match length, while scores could be anywhere between 0 and X, while other values might only have 1 dimension such as difference in skill ranking for the match.
I want to take all of the matches of the player and cluster them based on the features.
My idea to approach this is to cluster each multi-dimensional feature of the matches to summarize them into a cluster, then represent that entire feature for the match with a cluster number.
I would repeat this process for all of the features which are multi-dimensional until the row for each match is a vector of scalar values and then run one last cluster on this summarized view to try to see if wins and losses end up in distinctive clusters, and based on the similarity of the current game being played with the clustered match data, calculate the similarity to other clusters and assign a probability on whether it is likely going to become a win or a loss.
This seems like a decent approach, but there are a few problems that make me want to see if there is a better way.
One of the key issues I'm seeing is that building the model seems very slow - I'd want to run PCA and calculate the best number of components to use for each feature for each player, and also run a separate calculation to determine the best number of clusters to assign for each feature/player when I am clustering those individual features. I think that scaling this out over thousands to millions of players with trillions of matches would take an extremely long time, both to do the computation and to update the model with new data, features, and/or players.
So my question to all of you ML engineers/data scientists is how is my approach to this problem?
Would you use the same method and just allocate a ton of hardware to build the model quickly, or is there some better/more efficient method which I've missed in order to cluster this type of data?
It is a completely random approach.
Just calling a bunch of functions just because you've used them once and they sound cool never was a good idea.
Instead, you should first formalize your problem. What are you trying to do?
You appear to want to predict wins vs. losses. That is classification, not clustering. Secondly, k-means minimizes the sum of squares. Does it actually make sense to minimize this on your data? I doubt it. Lastly, you are already concerned about scaling something to huge data when it does not even work yet...
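To show what framing this as classification could look like, here is a minimal sketch. It assumes each match can be collapsed into a fixed-length vector of summary statistics and uses logistic regression from scikit-learn. The feature choices, the `summarize_match` helper, and the synthetic data are all hypothetical; they are not taken from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def summarize_match(xy_track, score_times, skill_diff):
    """Collapse variable-length match data into a fixed-length feature vector."""
    return np.array([
        xy_track[:, 0].mean(),   # average X position
        xy_track[:, 1].mean(),   # average Y position
        len(score_times),        # number of scores by the player
        skill_diff,              # skill ranking difference for the match
    ])

# Synthetic matches: variable-length tracks, 0-5 scores, random skill difference.
X, y = [], []
for _ in range(200):
    length = rng.integers(50, 150)
    xy_track = rng.normal(size=(length, 2))
    n_scores = rng.integers(0, 6)
    score_times = rng.uniform(0, length, size=n_scores)
    skill_diff = rng.normal()
    X.append(summarize_match(xy_track, score_times, skill_diff))
    # Made-up rule: more scores and a skill advantage make a win more likely.
    y.append(int(n_scores + skill_diff + rng.normal() > 2.5))
X, y = np.array(X), np.array(y)

# A classifier gives a win probability directly; no clustering step is needed.
model = LogisticRegression(max_iter=1000).fit(X, y)
win_probability = model.predict_proba(X[:1])[0, 1]
print("estimated win probability for the first match:", round(win_probability, 3))
```

The labels (win or loss) are used during training here, which is exactly what a clustering approach leaves out.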

How do I interpret this output from sklearn.tree.export_graphviz?

I am working on analyzing grade data. As a new way to look at the data I am using a decision tree, for the first time. I believe I have the code right and now I am trying to interpret it. The features are grades gotten for a series of quizzes, and the classification is the final grade the student received. I have a few questions:
If my understanding is correct, each node has a test and a left branch representing the test being true, and the other for false. And when the tree seems to have asked enough questions, it says what the "class" is. If that is the case, how come there's a class= on boxes well before the leaves? I would have thought that just leaves have a class=
How do I "tune" the overall tree? It seems to have too many boxes. Is this an example of "overfitting"? How can I tune that better?
For example, the use of FINAL_GRADE_PA01 seems arbitrary to be based on the ordering of the data. Is that true or did the analysis actually conclude that that feature was the best discriminator?
If I'm not mistaken, those class values indicate what the model would have predicted, had it stopped branching on that node. It still stores those values, but it doesn't use them if there's a branching from that node.
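One way to check this is to inspect the fitted tree directly. The sketch below uses made-up grade data and the `tree_` attribute of a fitted `DecisionTreeClassifier`; every node, internal or leaf, carries class statistics, and the majority class of those is what shows up as class= in the drawing.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(150, 3))   # hypothetical quiz grades
y = (X.mean(axis=1) > 50).astype(int)    # hypothetical pass/fail final grade

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
tree = clf.tree_

node_kinds = []
for node in range(tree.node_count):
    # In scikit-learn, a node with no left child (-1) is a leaf.
    is_leaf = tree.children_left[node] == -1
    # tree.value[node][0] holds the per-class statistics at this node.
    majority = clf.classes_[np.argmax(tree.value[node][0])]
    kind = "leaf" if is_leaf else "internal"
    node_kinds.append(kind)
    print(f"node {node} ({kind}): class={majority}")
```

Only the leaf values are used for prediction; the values on internal nodes are informational.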
About the number of nodes, as you see in the docs:
The default values for the parameters controlling the size of the
trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and
unpruned trees which can potentially be very large on some data sets.
To reduce memory consumption, the complexity and size of the trees
should be controlled by setting those parameter values.
There are several parameters which you can use to reduce the complexity of your model. The following two parameters are just an example:
max_leaf_nodes : int or None, optional (default=None)
Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
min_impurity_decrease : float, optional (default=0.)
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
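As a sketch of how these two parameters change the tree, the example below fits one unrestricted tree and one restricted tree on made-up, noisy data and compares their sizes. The data and the parameter values (4 leaves, 0.01) are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(300, 5))             # hypothetical quiz grades
noise = rng.normal(0, 10, size=300)
y = (X[:, 0] + X[:, 1] + noise > 100).astype(int)  # noisy made-up final outcome

# Default parameters: the tree grows until the training data is fit.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Restricted tree: at most 4 leaves, and each split must reduce impurity by >= 0.01.
small = DecisionTreeClassifier(
    max_leaf_nodes=4, min_impurity_decrease=0.01, random_state=0
).fit(X, y)

print("leaves (default):   ", full.get_n_leaves())
print("leaves (restricted):", small.get_n_leaves())

# feature_importances_ shows how much each feature contributed to impurity reduction.
print("feature importances (restricted):", np.round(small.feature_importances_, 3))
```

Comparing both trees on held-out data, for example with cross-validation, is the usual way to judge whether the larger tree is overfitting. The `feature_importances_` output also gives a hint about the feature question: splits are chosen by impurity reduction rather than by column order.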

Using neural networks for classification in Hierarchical data

I am stuck on a problem in which I have hierarchical data, say A->B->C (smallest to biggest), and the smallest unit of data is a block (A consists of multiple blocks, B consists of multiple A's, and C consists of multiple B's), and I want to classify blocks into labels. The block labels for each group of A are independent of the block labels for another group of A; however, the "trends or patterns" followed by the data could be similar, and that is what is to be learnt. The complexity I am facing is variable input sizes. I cannot possibly train a separate neural network for each group of A, since there is a large number of them. So I am thinking in terms of groups at level B, but how could I create a scheme that can handle these variable input sizes? Each block is represented by a one-dimensional array whose length is the total number of labels in the group of A it belongs to. Also, I have the hierarchy information for every block (smallest unit). Any help would be appreciated. Thanks!

Should I keep/remove identical training examples that represent different objects?

I have prepared a dataset to recognise a certain type of object (about 2240 negative examples and only about 90 positive examples). However, after calculating 10 features for each object in the dataset, the number of unique training instances dropped to about 130 and 30, respectively.
Since the identical training instances actually represent different objects, can I say that this duplication holds relevant information (e.g. the distribution of object feature values), which may be useful in one way or another?
If you omit the duplicates, that will skew the base rate of each distinct object. If the training data are a representative sample of the real world, then you don't want that, because you will actually be training for a slightly different world (one with different base rates).
To clarify the point, consider a scenario in which there are just two distinct objects. Your original data contains 99 of object A and 1 of object B. After throwing out duplicates, you have 1 object A and 1 object B. A classifier trained on the de-duplicated data will be substantially different than one trained on the original data.
My advice is to leave the duplicates in the data.
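The two-object scenario above can be checked with a few lines of arithmetic. This is just the 99-to-1 example from the answer; the `base_rates` helper is a name made up for the illustration.

```python
from collections import Counter

def base_rates(samples):
    """Return the fraction of the data taken up by each distinct object."""
    counts = Counter(samples)
    total = len(samples)
    return {label: count / total for label, count in counts.items()}

original = ["A"] * 99 + ["B"]          # 99 copies of object A, 1 of object B
deduplicated = sorted(set(original))   # one A and one B remain

print("original:    ", base_rates(original))
print("deduplicated:", base_rates(deduplicated))
```

After de-duplication both objects appear equally common, so a classifier trained on that data sees a world where B is 50 times more frequent than it really is.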
