Problems with creating a decision tree and splitting on an attribute?

So I'm trying to split on an attribute "Color" that has possible values (Blue,Green,Red,Orange,Pink).
I'm splitting based on entropy, and the best split can be 5-way, 4-way, 3-way, or binary. For example:
5-way: (Blue), (Green), (Red), (Orange), (Pink)
4-way: (Blue, Green), (Red), (Orange), (Pink)
       (Green, Pink), (Blue), (Red), (Orange)
3-way: (Red, Orange), (Blue, Green), (Pink)
       (Red, Blue), (Green, Orange), (Pink)
2-way: (Blue, Green, Red), (Orange, Pink)
       (Pink), (Blue, Green, Red, Orange)
And so on. But how can I generate a comprehensive list of all the possible splits? Is there a specific algorithm I could use? And how would I even know how many possible combinations there are in total?
Any help would be greatly appreciated, thanks!

The best split according to entropy (information gain) will always be the full 5-way split.
Recall that when you split on an attribute, you either gain information about Y, or, if the two are independent, there is no information gain; in other words, the information gain at every split is greater than or equal to zero. So IG(cases 2-4) <= IG(case 1), because cases 2, 3 and 4 can be turned into case 1 by adding further splits, and further splits can only add information, never lose it.
For why IG at a split is >= 0, refer to: Can the value of information gain be negative?
In general, in decision trees/RF you try to find the single split that gives the highest IG for an attribute, then compare across attributes and select one.
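On the enumeration part of the question: the candidate groupings are exactly the set partitions of the attribute's values, so for 5 values there are 52 of them in total (the Bell number B(5)). Below is a minimal sketch that enumerates them and scores each by information gain; the toy colour/label data is made up purely for illustration.

from collections import Counter
from math import log2

def partitions(items):
    # Recursively enumerate all set partitions (candidate multi-way splits)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ...or into a new block of its own
        yield part + [[first]]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(xs, ys, grouping):
    # IG of splitting the labels ys by which block of `grouping` each x falls into
    cond = 0.0
    for block in grouping:
        sub = [y for x, y in zip(xs, ys) if x in block]
        if sub:
            cond += len(sub) / len(ys) * entropy(sub)
    return entropy(ys) - cond

# Made-up toy data: colour value -> class label
colors = ["Blue", "Green", "Red", "Orange", "Pink", "Blue", "Red", "Pink"]
labels = ["yes", "yes", "no", "no", "yes", "yes", "no", "no"]

values = ["Blue", "Green", "Red", "Orange", "Pink"]
all_splits = list(partitions(values))
print(len(all_splits))  # 52 candidate groupings for 5 values

best = max(all_splits, key=lambda g: information_gain(colors, labels, g))
print(best, information_gain(colors, labels, best))

Running it also illustrates the point above: refining a grouping can never decrease IG, so the full 5-way split always ties or beats the coarser ones.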

Related

How to create a generalized dataset to detect all display digits with Roboflow

I want to detect digits on a display. To do that, I am using a custom 19-class dataset. The chosen model is YOLOv5-X, at a resolution of 640x640. Some of the objects are:
0-9 digits
Some text as objects
Total --> 17 classes
I am having problems detecting all the digits when the reading is, for example, 23, 28, or 22: if the digits are very close to each other, the model struggles.
I am using Roboflow to create different folders to which I add some preprocessing steps, so I have full control of what goes into the model. All of them are checked and placed in a new folder called TRAIN_BASE. In total I have 3500 images with digits, and most of the variance is in hue and brightness.
Any advice on making the model catch all the digits even when they are very close to each other?
Here are the steps I follow:
First of all, using a mosaic dataset was not a good choice for detecting digits on a display, because in a real scenario I would never encounter pieces of digits. That made the model fail to recognize some digits when it was not sure.
Image: example of the digit-detection problem
Another big improvement was to change the anchor boxes of the YOLO model to adapt them to small objects. To find out which anchor boxes I needed, adding this argument to train.py (in the script provided by Ultralytics) is enough to print custom anchors, which you can then add to your custom architecture.
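For intuition (this is not the Ultralytics script itself, just the underlying idea): custom anchors are essentially cluster centres over the width/height of the ground-truth boxes, as in the original YOLO papers. A rough sketch with synthetic box sizes:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic (width, height) box sizes in pixels at 640x640: mostly small
# digit boxes plus some larger text boxes (made up for illustration)
digits = rng.normal(loc=(14, 30), scale=(3, 5), size=(300, 2))
text = rng.normal(loc=(70, 40), scale=(10, 8), size=(60, 2))
boxes_wh = np.clip(np.vstack([digits, text]), 2, 640)

# YOLOv5 uses 9 anchors (3 per detection scale), so cluster into 9 groups
km = KMeans(n_clusters=9, n_init=10, random_state=0).fit(boxes_wh)
anchors = km.cluster_centers_[np.argsort(km.cluster_centers_.prod(axis=1))]

# Smallest anchors go to the highest-resolution head (P3), largest to P5
print(np.round(anchors).astype(int).reshape(3, 3, 2))

As far as I know, the built-in autoanchor check does something along these lines (clustering plus a refinement step) before training starts.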
To check which augmentations are helpful and which are not, the following article explains it quite visually.
P.S.: Thanks for the quick responses and the help the community gave me.

Ordinal Encoding or One-Hot-Encoding

If we are not sure about the nature of categorical features, i.e. whether they are nominal or ordinal, which encoding should we use: Ordinal-Encoding or One-Hot-Encoding?
Is there a clearly defined rule on this topic?
I see a lot of people using Ordinal-Encoding on Categorical Data that doesn't have a Direction.
Suppose a frequency table:
some_data[some_col].value_counts()
[OUTPUT]
color_white 11413
color_green 4544
color_black 1419
color_orang 3
Name: shirt_colors, dtype: int64
A lot of people prefer to do Ordinal-Encoding on this column, while I am hell-bent on going with One-Hot-Encoding.
My view is that Ordinal Encoding would allot these colors ordered numbers, which would imply a ranking, and there is no ranking in the first place. In other words, my model should not think of color_white as 4 and color_orang as 0, 1, or 2.
Keep in mind that there is no hint of any ranking or order in the Data Description as well.
I have the following understanding of this topic:
Values that have neither a direction nor a magnitude are nominal variables. For example, fruit_list = ['apple', 'orange', 'banana']. Unless there is a specific context, this set would be considered nominal, and for such variables we should perform either get_dummies or one-hot encoding.
Ordinal variables, on the other hand, have a direction. For example, shirt_sizes_list = ['large', 'medium', 'small']. If the same fruit list had a context behind it, like price or nutritional value, i.e. something that gives the fruits in fruit_list some ranking or order, we'd call it an ordinal variable. For ordinal variables, we perform Ordinal-Encoding.
Is my understanding correct?
Kindly provide your feedback
This topic has turned into a nightmare
Thank you!
You're right. The one thing to consider when choosing OrdinalEncoder or OneHotEncoder is whether the order of the data matters.
Most ML algorithms will assume that two nearby values are more similar than two distant values. This may be fine in some cases, e.g. for ordered categories such as:
quality = ["bad", "average", "good", "excellent"] or
shirt_size = ["large", "medium", "small"]
but it is obviously not the case for the:
color = ["white","orange","black","green"]
column (except for cases where you need to consider a spectrum, say from white to black; note that in that case the white category should be encoded as 0 and black as the highest number in your categories), or if you have cases where, say, categories 0 and 4 are more similar than categories 0 and 1. To fix this issue, a common solution is to create one binary attribute per category (one-hot encoding).
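For reference, a minimal scikit-learn sketch of both encoders (the column names and category order are made up for illustration; recent sklearn versions use the sparse_output argument):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "shirt_color": ["white", "green", "black", "orange"],  # nominal: no order
    "shirt_size": ["large", "medium", "small", "medium"],  # ordinal: has order
})

# One-hot for the nominal column: one binary feature per category
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
color_features = ohe.fit_transform(df[["shirt_color"]])
print(ohe.get_feature_names_out())

# Ordinal for the ordered column, with the order stated explicitly
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_feature = ord_enc.fit_transform(df[["shirt_size"]])
print(size_feature.ravel())  # small=0, medium=1, large=2

pd.get_dummies(df["shirt_color"]) gives the same one-hot result if you prefer to stay in pandas.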

How to scale % change based features so that they are viewed "similarly" by the model

I have some features that are zero-centered and are supposed to represent the change between a current value and a previous value. Generally speaking, I believe there should be some symmetry between these values, i.e. there should be roughly the same number of positive values as negative values, and they should operate on roughly the same scale.
When I try to scale my samples using MaxAbsScaler, I notice that the negative values for this feature get almost completely drowned out by the positive values, and I don't really have any reason to believe my positive values should be that much larger than my negative values.
What I've noticed is that, fundamentally, the magnitudes of percentage-change values are not symmetrical. For example, if a value goes from 50 to 200, that is a +300.0% change; if it goes from 200 to 50, that is a -75.0% change. I understand why that is, but in terms of my feature I don't see a reason why the change from 50 to 200 should be treated as roughly four times more "important" than the same change in the opposite direction.
Given this, I do not believe there is any reason to want my model to treat a change from 200 to 50 as a "lesser" change than a change from 50 to 200. Since I am trying to represent the change of a value over time, I want to express this pattern so that my model can "visualize" the change of a value over time the same way a person would.
Right now I am solving this with the following formula:
def signed_change(prev, curr):
    # Use the larger of the two values as the base, so the magnitude
    # of the change is the same in either direction
    if curr > prev:
        return curr / prev - 1
    else:
        return (prev / curr - 1) * -1
This does seem to treat changes in value similarly regardless of direction; from the example above, 50→200 = +300% and 200→50 = -300%. Is there a reason why I shouldn't be doing this? Does this accomplish my goal? Has anyone run into similar dilemmas?
This is a discussion question, and it's difficult to know the right answer without knowing the physical relevance of your feature. You are calculating a percentage change, and a percent change depends on the original value. I am not a big fan of a custom formula whose only purpose is to make percent change symmetric, since it adds a layer of complexity that, in my opinion, is unnecessary.
If you want change to be symmetric, you can try a direct difference or a factor change. There is nothing to suggest that difference or factor change is less correct than percent change. So, depending on the physical relevance of your feature, each of the following symmetric measures would be a correct way to measure change:
Difference change -> 50 to 200 yields 150, 200 to 50 yields -150
Factor change with logarithm -> 50 to 200 yields log(4), 200 to 50 yields log(1/4) = -log(4)
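To make the comparison concrete, here is a tiny sketch of the two symmetric measures next to the custom formula from the question, applied to the posted pair of values:

from math import log

def diff_change(prev, curr):
    return curr - prev

def log_factor_change(prev, curr):
    return log(curr / prev)

def custom_change(prev, curr):
    # The formula from the question: use the larger value as the base
    return curr / prev - 1 if curr > prev else -(prev / curr - 1)

for prev, curr in [(50, 200), (200, 50)]:
    print(prev, "->", curr,
          "| diff:", diff_change(prev, curr),
          "| log factor:", round(log_factor_change(prev, curr), 3),
          "| custom:", custom_change(prev, curr))
# All three flip only in sign when the direction of the change is reversed.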
You're having trouble because you haven't brought the abstract questions into your paradigm.
"... my model can "visualize" ... same way a person would."
In this paradigm, you need a metric for "same way". There is no such empirical standard. You've dropped both of the simple standards -- relative error and absolute error -- and you posit some inherently "normal" standard that doesn't exist.
Yes, we run into these dilemmas: choosing a success metric. You've chosen a classic example from "How To Lie With Statistics"; depending on the choice of starting and finishing proportions and the error metric, you can "prove" all sorts of things.
This brings us to your central question:
Does this accomplish my goal?
We don't know. First of all, you haven't given us your actual goal. Rather, you've given us an indefinite description and a single example of two data points. Second, you're asking the wrong entity. Make your changes, run the model on your data set, and examine the properties of the resulting predictions. Do those properties satisfy your desired end result?
For instance, given your posted data points, (200, 50) and (50, 200), how would other examples fit in, such as (1, 4), (1000, 10), etc.? If you're simply training on the proportion of change over the full range of values involved in that transaction, your proposal is just what you need: use the higher value as the basis. Since you didn't post any representative data, we have no idea what sort of distribution you have.

Is it possible to split a dataset in Google Dataprep? If so, how?

I've been looking into Google Dataprep as an ETL solution to perform some basic data transformation before feeding it to a machine learning platform. I'm wondering if it's possible to use the Dataprep/Dataflow tools to split a dataset into train, test, and validation sets. Ideally I'm looking to do a stratified split on a target column, but for starters I'd settle for a simple uniform random split by percent of whole (e.g. 50% train, 30% validation, 20% test).
So far I haven't been able to find anything about whether this is even possible with Dataprep, so I'm wondering if anyone knows definitively if this is possible and, if so, how to accomplish it.
EDIT 1
Thanks @jakub-janoštík for getting me going in the right direction! I modified your answer slightly and came up with the following (in wrangle form):
case condition: customConditions cases: [false,0] default: rand() as: 'split_condition'
case condition: customConditions cases: [split_condition < 0.6,'train'],[split_condition >= 0.8,'test'] default: 'validation' as: 'dataset_type'
drop col: split_condition action: Drop
By assigning random values in a separate step, I got the guaranteed percentage split I was looking for. The flow ended up looking like this:
Image: final flow diagram with dataset splitting
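For comparison, the same assignment logic can be sketched outside Dataprep in plain pandas/NumPy (column names and data here are made up; note that the thresholds above give a 60/20/20 split rather than 50/30/20):

import numpy as np
import pandas as pd

df = pd.DataFrame({"target": np.random.default_rng(0).integers(0, 2, size=1000)})

split = np.random.default_rng(42).random(len(df))
df["dataset_type"] = np.select(
    [split < 0.6, split >= 0.8],   # same conditions as the wrangle recipe
    ["train", "test"],
    default="validation",
)
print(df["dataset_type"].value_counts(normalize=True))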
EDIT 2
I just figured out how to do the stratified split too, so I thought I'd add it in case anyone else is trying to do this. Here are the rough steps:
Split your dataset based on whatever subpopulations you're targeting (e.g. target0, target1)
For each subpopulation, do the uniform random split described above (e.g. now you have target0-train, target0-test, target0-validation, target1-train, etc.)
For each set type (i.e. train, test, validation):
Create a new recipe from one of the sets
Edit the recipe, and use the Union transform to merge it with other datasets of the same type (e.g. target0-train union with target1-train). The union button is in the middle of the toolbar on the Edit Recipe page.
I hope that's helpful to someone!
I'm looking at the same problem, and I was able to partially solve it using the "case on custom condition" and "Random" functions. What I do is create a new column named target and apply the following logic:
After applying this you'll have a new column with these 3 labels, and you can generate 3 new datasets by applying row-filtering rules based on those values. One thing to keep in mind is that each time you run the job you'll get a different validation set, so if you want to keep it fixed you need to use the dataset created in the first run as input for future runs (and randomise only the train and test sets).
If you need more control over the distribution of labels in your datasets, there is the ROWNUMBER window function that could potentially be used, but I haven't been able to make it work yet.

How to display the results of multiple comparisons

If you compare two sets of data (such as two files), the differences between these sets can be displayed in two columns, or two panes, such as WinMerge does.
But are there any visual paradigms to display the differences between multiple data sets?
Update
The starting point of my question was the assumption that displaying differences between 2 files is relatively easy, as I mentioned WinMerge, whereas comparing 3 or more text files turns out to be more complicated, as there will be more and more differences between, say, different versions of a document that have been created over time.
How would you highlight parts of the file that are the same in 2 versions, but different from other versions?
The data sets I have in mind are objects (A, B, C, ...) which may or may not exist and have properties (a, b, c, ...) which may be set or not set.
Example:
Set 1: A(a, b, c), B(b, c), C(c)
Set 2: A(a, b, c), B(b), C(c)
Set 3: A(a, b), B(b)
If you compare 2 sets, e.g. 1 and 2, the difference would be in B(c). Comparing sets 2 and 3 results in the difference A(c) and C().
If you compare all 3 sets, you end up with 3 comparisons (n * (n-1) / 2)
I have a different view from some of those who provided answers, i.e. that you need to further specify the problem. The abstraction level is about right: further specification would make the problem easier, but the solution less useful.
A couple of years ago I saw a graphic on ProgrammableWeb that compared the results of a search on Yahoo with the results of the same search on Google. There is a lot of information to convey: some results are in both sets, some in just one, and the common results have different positions in the respective engines' results, which somehow has to be shown.
I liked the graphic and reimplemented it in Matplotlib (a Python scientific plotting library). Below is an example using some random points, as well as the Python code I used to generate it:
import numpy as NP
from matplotlib import pyplot as PLT

# Each row is one item's "score" (position) in result set 1 and in result set 2
xvals = NP.array([(2, 3), (5, 7), (8, 6), (1.5, 1.8), (3.0, 3.8), (5.3, 5.2),
                  (3.7, 4.1), (2.9, 3.7), (8.4, 6.1), (7.1, 6.4)])
yvals = NP.tile(NP.array([5, 3]), [10, 1])  # two horizontal tracks, at y=5 and y=3

fig = PLT.figure()
ax1 = fig.add_subplot(111)
# The two baselines, one per data set
x = NP.array([0, 10])
ax1.plot(x, [5, 5], "-", lw=3, color='b')
ax1.plot(x, [3, 3], "-", lw=3, color='b')
# Connect each item's position on the upper track to its position on the lower one
for a, b in zip(xvals, yvals):
    ax1.plot(a, b, '-o', ms=8, mfc='orange', color='g')
PLT.axis("off")
PLT.show()
This model has some interesting features: (i) it actually deals with 'similarity' on a per-item basis (the vertically-oriented line connecting the dots) rather than aggregate similarity; (ii) the degree of similarity between two data points is proportional to the angle of the line connecting them--90 degrees if they are equal, with a decreasing angle as the difference increases; this is very intuitive; (iii) cases in which a point in one data set is not present in the second data set are easy to show--a point will appear on one of the two lines but without a line connecting it to a point on the other line.
This model works well for comparing search results because each search result has a 'score' (its index, or order in the results list). For other types of data, you might have to assign a score to each data point; a similarity metric might work, I suppose (in a sense, that is actually what the search-result order is: a distance from the top of the list).
Since there has been so much work put into displaying a diff of two files, you might start by expressing your 'multiple data sets' in an appropriate text format, and then use whatever tool you like to show a diff between those text representations.
But you should tell us more about your data sets!
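A minimal sketch of that idea, serialising the example sets from the question to text and diffing them with Python's difflib (the one-line-per-object format is just an illustration):

import difflib

# The example sets from the question: object name -> its properties
set1 = {"A": "abc", "B": "bc", "C": "c"}
set2 = {"A": "abc", "B": "b", "C": "c"}

def to_lines(objects):
    # One line per object, in a stable order, so line-based diffs align
    return [f"{name}({', '.join(props)})" for name, props in sorted(objects.items())]

diff = difflib.unified_diff(to_lines(set1), to_lines(set2),
                            fromfile="set1", tofile="set2", lineterm="")
print("\n".join(diff))

Comparing all n sets then reduces to the n * (n-1) / 2 pairwise diffs of these text forms mentioned in the question.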
I experimented a bit, and implemented two displays:
Matrix
Timeline
I agree with Peter: you should specify what type of data you have and what you wish to bring out in the comparison.
Depending on the nature of the data and the comparison, you can consider different visualisations. Is your data ordered or unordered? How many things are you comparing, i.e. is it a fine-grained or a coarse comparison?
Examples:
Visualizing a comparison of unordered data could just be plotting the two histograms of your sets (i.e. distributions):
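For instance, a quick overlaid-histogram sketch for two hypothetical samples:

import numpy as np
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 1000)  # hypothetical sample from data set 1
b = rng.normal(0.5, 1.2, 1000)  # hypothetical sample from data set 2

plt.hist(a, bins=30, alpha=0.5, label="set 1")
plt.hist(b, bins=30, alpha=0.5, label="set 2")
plt.legend()
plt.show()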
On the other hand, comparing a huge ordered dataset like DNA can be done innovatively.
Also, check out Visual Complexity; it's a great resource for interesting visualizations.
