Visualizing clustering results - machine-learning

After running k-means I have 3 clusters.
I used 10 features (marks) in k-means for this data set.
I understand that we can't draw a 10-D chart, but how can I visualize these clusters?
Should I plot the data using only 2 or 3 features instead of all 10?
Which axes should I use in my case?
For drawing I'm using JavaScript with highcharts.js on the client side.
Example code (just to satisfy the Stack Overflow requirement); in my real data every point has 10 coordinates:
const kmeans = require('ml-kmeans');
let data = [[1, 1, 1, 1, 1], [1, 2, 1, 1, 1], [-1, -1, -1, 1, 1], [-1, -1, -1.5, 1, 1]];
let centers = [[1, 2, 1, 1, 1], [-1, -1, -1, 1, 1]];
let ans = kmeans(data, 2, { initialization: centers });
console.log(ans);
/*
KMeansResult {
  clusters: [ 0, 0, 1, 1 ],
  centroids:
   [ { centroid: [ 1, 1.5, 1, 1, 1 ], error: 0.25, size: 2 },
     { centroid: [ -1, -1, -1.25, 1, 1 ], error: 0.0625, size: 2 } ],
  converged: true,
  iterations: 1
}
*/

Use your favorite generic visualization approach. Clusterings do not have very special requirements.
E.g.
Scatterplot matrix
Dimensionality reduction with PCA
t-SNE embeddings
MDS
UMAP
Boxplots
Violin plots
...
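As a rough illustration of the scatterplot-matrix option, here is a minimal Python sketch (not part of the original answer). It builds one panel per pair of features, with one series per cluster in the `{name, data}` shape that a scatter chart typically consumes. The function name `scatter_panels` and the toy labels are assumptions made for the example.

```python
from itertools import combinations


def scatter_panels(points, labels):
    """Build one scatter panel per feature pair, with one series per cluster."""
    n_features = len(points[0])
    clusters = sorted(set(labels))
    panels = {}
    for i, j in combinations(range(n_features), 2):
        series = []
        for c in clusters:
            # Keep only feature i (x) and feature j (y) for points in cluster c.
            xy = [[p[i], p[j]] for p, lab in zip(points, labels) if lab == c]
            series.append({"name": f"cluster {c}", "data": xy})
        panels[(i, j)] = series
    return panels


# Toy data: the four 5-D points from the question, one cluster label per point.
points = [[1, 1, 1, 1, 1], [1, 2, 1, 1, 1], [-1, -1, -1, 1, 1], [-1, -1, -1.5, 1, 1]]
labels = [0, 0, 1, 1]

panels = scatter_panels(points, labels)
print(len(panels))  # 10 feature pairs for 5 features
```

With 10 features this gives 45 panels, which is usually the point at which a projection such as PCA or t-SNE becomes the more readable choice.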

Related

How to predict Total Hours needed with List as Input?

I am struggling with the following problem:
I have a dataset of different products (Cars) that have certain Work Orders open at a given time. I know from historical data how much time this work took in TOTAL.
Now I want to predict it for another Car (e.g. Car 3).
Which type of algorithm or regression should I use for this?
My idea was to transform this row-based dataset into a column-based one with binary values, e.g. Brake: 0/1, Screen: 0/1. But then I will have lots of inputs, as the number of possible inputs is 100-200.
Here's a quick idea using multi-factor regression for 30 jobs, each of which is some random accumulation of 6 tasks with a "true cost" for each task. We can regress against the task selections in each job to estimate the cost coefficients that best explain the total job costs.
First done w/ no "noise" in the system (tasks are exact), then with some random noise.
A "more thorough" job would include examining the R-squared value and plotting the residuals to ensure linearity.
In [1]: from sklearn import linear_model
In [2]: import numpy as np
In [3]: jobs = np.random.binomial(1, 0.6, (30, 6))
In [4]: true_costs = np.array([10, 20, 5, 53, 31, 42])
In [5]: jobs
Out[5]:
array([[0, 1, 1, 1, 1, 0],
[1, 0, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0],
[1, 0, 1, 1, 1, 1],
[1, 1, 0, 0, 1, 1],
[0, 1, 0, 0, 1, 0],
[1, 0, 0, 1, 1, 0],
[1, 1, 1, 1, 0, 1],
[1, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 0],
[0, 0, 1, 0, 1, 1],
[1, 0, 1, 1, 1, 1],
[0, 1, 1, 1, 1, 1],
[1, 0, 1, 1, 1, 1],
[0, 1, 1, 0, 1, 0],
[1, 0, 1, 0, 1, 0],
[1, 1, 1, 1, 1, 1],
[1, 0, 1, 0, 0, 1],
[0, 1, 0, 1, 1, 0],
[1, 1, 1, 0, 1, 0],
[1, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 0, 1],
[0, 0, 0, 1, 1, 1],
[1, 1, 0, 1, 1, 1],
[1, 0, 1, 1, 0, 1],
[1, 1, 1, 1, 1, 1],
[1, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 0, 0],
[1, 1, 0, 0, 1, 1],
[1, 1, 1, 1, 0, 0]])
In [6]: tot_job_costs = jobs @ true_costs
In [7]: reg = linear_model.LinearRegression()
In [8]: reg.fit(jobs, tot_job_costs)
Out[8]: LinearRegression()
In [9]: reg.coef_
Out[9]: array([10., 20., 5., 53., 31., 42.])
In [10]: np.random.normal?
In [11]: noise = np.random.normal(0, scale=5, size=30)
In [12]: noisy_costs = tot_job_costs + noise
In [13]: noisy_costs
Out[13]:
array([113.94632664, 103.82109478, 78.73776288, 145.12778089,
104.92931235, 48.14676751, 94.1052639 , 134.64827785,
109.58893129, 67.48897806, 75.70934522, 143.46588308,
143.12160502, 147.71249157, 53.93020167, 44.22848841,
159.64772255, 52.49447057, 102.70555991, 69.08774251,
125.10685342, 45.79436364, 129.81354375, 160.92510393,
108.59837665, 149.1673096 , 135.12600871, 60.55375843,
107.7925208 , 88.16833899])
In [14]: reg.fit(jobs, noisy_costs)
Out[14]: LinearRegression()
In [15]: reg.coef_
Out[15]:
array([12.09045186, 19.0013987 , 3.44981506, 55.21114084, 33.82282467,
40.48642199])
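The binary encoding proposed in the question (one 0/1 column per possible work order) can be sketched in plain Python as below. This is only an illustration: `multi_hot` is a hypothetical helper, and "Engine" is a made-up work order added next to the "Brake" and "Screen" examples from the question.

```python
def multi_hot(work_orders_per_car, vocabulary=None):
    """Encode each car's list of open work orders as a row of 0/1 flags."""
    if vocabulary is None:
        # Derive the column order from every work order seen in the data.
        vocabulary = sorted({w for orders in work_orders_per_car for w in orders})
    index = {name: k for k, name in enumerate(vocabulary)}
    rows = []
    for orders in work_orders_per_car:
        row = [0] * len(vocabulary)
        for w in orders:
            row[index[w]] = 1
        rows.append(row)
    return vocabulary, rows


cars = [["Brake", "Screen"], ["Engine"], ["Brake", "Engine", "Screen"]]
columns, X = multi_hot(cars)
print(columns)  # ['Brake', 'Engine', 'Screen']
print(X)        # [[1, 0, 1], [0, 1, 0], [1, 1, 1]]
```

The resulting rows play the same role as `jobs` in the session above, so they could be passed to `reg.fit` together with the historical totals. With 100-200 columns this should still be workable, provided there are clearly more historical records than columns.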

RandomizedSearchCV with xgboost classifier is taking too long

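I am trying to run this code on my dataset. However, it's taking too long (the code has been running since yesterday, and my dataset is not very large).
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from xgboost import XGBClassifier
# A parameter grid for XGBoost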
params = {
'min_child_weight': [1, 5, 10],
'gamma': [0.5, 1, 1.5, 2, 5],
'subsample': [0.6, 0.8, 1.0],
'colsample_bytree': [0.6, 0.8, 1.0],
'max_depth': [3, 4, 5]
}
xgb_classifier = XGBClassifier(learning_rate=0.02, n_estimators=600,
objective='binary:logistic',silent=True, nthread=1)
folds = 3
param_comb = 5
skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)
random_search = RandomizedSearchCV(xgb_classifier, param_distributions=params,
n_iter=param_comb, scoring='roc_auc', n_jobs=4, cv=skf.split(trainx,trainy),
verbose=3, random_state=1001 )
random_search.fit(trainx, trainy)
I am new to the field. Any ideas?

Is there a good method to create a tilemap from an ASCII matrix?

The idea is simple, but the execution is bothering me.
I've created a small random dungeon generator that creates a grid like this:
000001
000111
000111
001101
011101
011111
This is a sample 6x6 dungeon where 0 is a wall and 1 is an open path.
The conversion from this to some sort of tile id map is simple, and trivial, but creating the image itself is the hard part.
I want to know if there's a lib, or method to achieve that. If not, then what would you do?
This is not part of a game, and only a dungeon generator for DND. Any language is OK, but the generator was made in Go.
You can use OpenCV for this task. PIL can probably do the same, but I don't have experience with it.
import cv2
import numpy as np
data_list = [
[0, 0, 0, 0, 0, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 1, 1, 0, 1],
[0, 1, 1, 1, 0, 1],
[0, 1, 1, 1, 1, 1]
]
arr = np.array(data_list, dtype=np.uint8) * 255
arr = cv2.resize(arr, (0, 0), fx=50, fy=50, interpolation=cv2.INTER_NEAREST)
cv2.imshow("img", arr)
cv2.waitKey()
# or you can save on disk
cv2.imwrite("img.png", arr)
Use np.block():
import numpy as np
# a bunch of sprites/images, all the same size
# load them however you like
tiles = [...]
data_list = [
[0, 0, 0, 0, 0, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 1, 1, 0, 1],
[0, 1, 1, 1, 0, 1],
[0, 1, 1, 1, 1, 1]
]
picture = np.block([
[tiles[k] for k in row]
for row in data_list
])
Or, if you use any kind of game engine, or something even more trivial, like SDL/PyGame, simply "blit" each tile.
PIL, as you found out, is perfectly capable of blitting one image (tile) onto another (whole map).
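To make the "blit each tile" idea concrete without any imaging library, here is a minimal standard-library sketch. It uses solid grey levels as stand-ins for real tile images and serializes the result as a binary PGM file, a format most image viewers can open. The helper names `render_tilemap` and `to_pgm` and the tile size of 8 are assumptions made for the example.

```python
def render_tilemap(rows, tile_size=8, wall=0, floor=255):
    """Expand an ASCII grid ('0' = wall, '1' = open path) into a pixel grid."""
    pixels = []
    for row in rows:
        line = []
        for ch in row:
            shade = floor if ch == "1" else wall
            line.extend([shade] * tile_size)  # repeat the tile horizontally
        # Repeat the finished pixel line vertically, once per tile row.
        pixels.extend(list(line) for _ in range(tile_size))
    return pixels


def to_pgm(pixels):
    """Serialize a grid of 0-255 values as a binary PGM (P5) image."""
    height, width = len(pixels), len(pixels[0])
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(v for line in pixels for v in line)


dungeon = ["000001", "000111", "000111", "001101", "011101", "011111"]
pixels = render_tilemap(dungeon, tile_size=8)
image_bytes = to_pgm(pixels)
# To save the image: open("dungeon.pgm", "wb").write(image_bytes)
```

Swapping the solid shades for real sprites amounts to copying each tile's pixel rows instead of a constant, which is what `np.block()` or a PIL `paste` does for you.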
I more or less found a solution, but it is Python-only.
Using PIL I can make a mosaic from tile images and create the map. It's not a solid solution built from scratch, but it does the job.
I'm still open to other approaches.
My solution is the following:
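import glob
import numpy as np
from PIL import Image

# Load the tile-id matrix from a text file
matrix = np.loadtxt(input_file, usecols=range(matrix_square), dtype=int)
# Load the tile images; sorted() keeps the index-to-tile mapping stable
tiles = []
for file in sorted(glob.glob("./tiles/*")):
    im = Image.open(file)
    tiles.append(im)
output = Image.new('RGB', (image_width, image_height))
# Paste the tile for each cell at its pixel position
for i in range(matrix_width):
    for j in range(matrix_height):
        x, y = i * tile_size, j * tile_size
        index = matrix[j][i]
        output.paste(tiles[index], (x, y))
output.save(output_file)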
matrix_square is the side length of the (square) matrix. I'm still working on a better solution, but this is working fine for me.
You need to change tile_size to match the resolution of the tiles you're using.
This is a dungeon generated with this method.
The tiles are bad, but the grid is fine enough.

Show Series and colorAxis both in Legend

Is it possible to have both colorAxis and series in the legend? http://jsfiddle.net/6k17dojn/ I see that I can only show one at a time when I toggle this setting:
colorAxis: {
showInLegend: true,
}
Currently, to show a basic legend together with colorAxis, you need to add some code to Highcharts core. The plugin below adds colorAxis to the legend if the showInLegend property is set to false:
(function(H) {
H.addEvent(H.Legend, 'afterGetAllItems', function(e) {
var colorAxisItems = [],
colorAxis = this.chart.colorAxis[0],
i;
if (colorAxis && colorAxis.options) {
if (colorAxis.options.dataClasses) {
colorAxisItems = colorAxis.getDataClassLegendSymbols();
} else {
colorAxisItems.push(colorAxis);
}
}
i = colorAxisItems.length;
while (i--) {
e.allItems.unshift(colorAxisItems[i]);
}
});
}(Highcharts))
Live demo: http://jsfiddle.net/BlackLabel/hs1zeruy/
API Reference: https://api.highcharts.com/highcharts/colorAxis.showInLegend
Docs: https://www.highcharts.com/docs/extending-highcharts
It's possible, but not with the data you currently work with. A heatmap's data is a set of coordinates, but here, your two series overlap.
Your raw data is :
[
[0,0,0.2, 0.4],
[0,1,0.1, 0.5],
[0,2,0.4, 0.9],
[0,3,0.7, 0.1],
[0,4,0.3, 0.6]
]
From there, you're mapping two series: 2018, and 2019 via the seriesMapping: [{x: 0, y: 1, value: 2}, {x: 0, y: 1, value: 3}] option.
You thus end up with the following two series:
2018              2019              2019 should be
[                 [                 [
  [0, 0, 0.2],      [0, 0, 0.4],      [1, 0, 0.4],
  [0, 1, 0.1],      [0, 1, 0.5],      [1, 1, 0.5],
  [0, 2, 0.4],      [0, 2, 0.9],      [1, 2, 0.9],
  [0, 3, 0.7],      [0, 3, 0.1],      [1, 3, 0.1],
  [0, 4, 0.3]       [0, 4, 0.6]       [1, 4, 0.6]
]                 ]                 ]
Notice that in both cases, the coordinates are the same, but for 2019, the x value should be 1. Since you have 0 as x coordinate for both series, they overlap.
To fix your issue, you need to change your data (or pre-process it, whichever is easier). For example:
var data = '[[0,0,0.2, 0.4],[0,1,0.1, 0.5],[0,2,0.4, 0.9],[0,3,0.7, 0.1],[0,4,0.3, 0.6]]';
var rows = JSON.parse(data);
rows = $.map(rows, function(arr){
return [[
arr[0], arr[1], arr[2], // 2018
arr[0] + 1, arr[1], arr[3], // 2019
]];
});
// and the seriesMapping changes to
data: {
rows: rows,
firstRowAsNames: false,
seriesMapping: [{x: 0, y: 1, value: 2}, {x: 3, y: 4, value: 5}]
},
You can see it in action here: http://jsfiddle.net/Metoule/qgd2ca6p/6/
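If the rows are prepared on the server rather than in the browser, the same pre-processing can be sketched in Python. This is only an illustration of the row transformation described above; `split_years` is a hypothetical helper name.

```python
import json


def split_years(rows):
    """Turn [x, y, v2018, v2019] rows into [x, y, v2018, x + 1, y, v2019]."""
    return [[x, y, v18, x + 1, y, v19] for x, y, v18, v19 in rows]


raw = "[[0,0,0.2, 0.4],[0,1,0.1, 0.5],[0,2,0.4, 0.9],[0,3,0.7, 0.1],[0,4,0.3, 0.6]]"
rows = split_years(json.loads(raw))
print(rows[0])  # [0, 0, 0.2, 1, 0, 0.4]
```

Columns 0-2 then feed the 2018 series and columns 3-5 the 2019 series, matching the seriesMapping shown above.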

Why is the structuring element asymmetric in OpenCV?

Why is the structuring element asymmetric in OpenCV?
cv2.getStructuringElement(cv2.MORPH_ELLIPSE, ksize=(4,4))
returns
array([[0, 0, 1, 0],
[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1]], dtype=uint8)
Why isn't it
array([[0, 1, 1, 0],
[1, 1, 1, 1],
[1, 1, 1, 1],
[0, 1, 1, 0]], dtype=uint8)
instead?
Odd-sized structuring elements are also asymmetric with respect to 90-degree rotations:
array([[0, 0, 1, 0, 0],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[0, 0, 1, 0, 0]], dtype=uint8)
What's the purpose of that?
There's no purpose behind it; it's just one of many possible discretizations of such a shape. In the case of the ellipse with size 5, if it were full it would be identical to MORPH_RECT, and if the same two pixels were removed from the sides as from the top and bottom it would be a diamond. Either way, the implementation in the source code is what you would expect: it creates a circle via the distance function and rounds to the nearest integers to get the binary pixels. Search that file for cv::getStructuringElement and you'll find the implementation; it's nothing too fancy.
If you think this function should be updated, open a PR on GitHub with the implemented version, or an issue to discuss it first. I think a successful contribution would be easy here, and I'd venture that the case for symmetry is strong. One would expect that the result of processing a symmetric image with an elliptical kernel would not depend on the orientation of the image.
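To see where the asymmetry can come from, here is a small pure-Python sketch of a row-by-row rasterization of the kind described above. It is modeled on my reading of the OpenCV source rather than copied from it, so treat the details (integer center at `size // 2`, rounding of the per-row half-width) as assumptions. It does reproduce the two arrays shown in the question.

```python
import math


def ellipse_kernel(width, height):
    """Rasterize an ellipse row by row around the integer center (r, c)."""
    r, c = height // 2, width // 2  # assumed center: integer division of the size
    inv_r2 = 1.0 / (r * r) if r else 0.0
    kernel = []
    for i in range(height):
        dy = i - r
        j1 = j2 = 0
        if abs(dy) <= r:
            # Half-width of the ellipse on this row, rounded to the nearest integer.
            dx = round(c * math.sqrt((r * r - dy * dy) * inv_r2))
            j1 = max(c - dx, 0)
            j2 = min(c + dx + 1, width)
        kernel.append([1 if j1 <= j < j2 else 0 for j in range(width)])
    return kernel


for row in ellipse_kernel(4, 4):
    print(row)
```

For a 4x4 kernel the rows sit at vertical offsets -2, -1, 0 and 1 from the center, so the narrow tip at offset -2 has no counterpart at +2. For a 5x5 kernel the half-width at offsets -1 and +1 is about 1.73, which rounds up to 2 and fills the whole row, while the top and bottom rows keep a single pixel. That would explain why the shape is not invariant under 90-degree rotation.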
