Cluster Analysis - machine-learning

We have done KMeans clustering with k=3 Clusters on dataset having 4 features, and have plotted the Centroid coordinates of each cluster with the input features. We want to focus on a cluster and try to figure out which feature is contributing more to a particular cluster with respect to other features. We do the same for other clusters too.
For e.g. lets assume we have clusters c1 c2 c3 with features f1 f2 f3 and f4. The centroids for the clusters are for e.g.
C1 C2 C3 respectively with corresponding features f1 f2 f3 f4
f1 f2 f3 f4
C1 -2.02746249 2.10961828 -0.57917217 1.687631
C2 0.63967987 -0.55768793 0.76387423 0.265274
C3 -0.38727624 0.18714381 -1.27291271 -1.273105
Can we infer that for cluster 1 since the centroid for f2 (feature2) is highest so can that be considered as the most significant/contributing feature for that particular cluster, similarly for cluster 2 centroid for f3(feature3) is highest so can that be considered as the important feature for cluster 2.

Related

Overfitting in data frame that some rows repeated

I have a machine learning problem in a logistic regression algorithm. That I have a data frame where some rows and features are repeated like the below table:
feature 1
feature 2
feature 3
...
feature n-1
feature n
Target
a1
a2
a3
..
an
1
1
b1
b2
b3
..
bn
1
0
c1
c2
c3
..
cn
1
1
..
..
..
..
..
1
..
a1
a2
a3
..
an
2
..
b1
b2
b3
..
bn
2
..
c1
c2
c3
..
cn
2
..
..
..
..
..
..
2
..
a1
a2
a3
..
an
3
..
b1
b2
b3
..
bn
3
..
c1
c2
c3
..
cn
3
..
..
..
..
..
..
..
..
Is it possible to occur overfitting or underfitting with this data frame or not?
And what about a data frame that has between 6 or 8 features with about 500 rows?
I should add and notice this, rows that are repeated in features from 1 to n-1 vary in feature n.
Whether you overfit or not is due to:
the complexity of the model
the available data.
But what's important is the actual data. If you double the data by repeating it, you don't effectively change the data you have. In fact, many algorithms randomly sample from the dataset. So, having duplicates changes nothing (except if the duplicated data has a different distribution than the non-duplicated data)
As such, removing the duplication in the data will not affect whether your overfit or not.
Edit: Now, if the data is not duplicated, but rather modified, it is a different story:
where some rows and features are repeated
Then, no effect.
But if the data is modified, as the table shows, then you need to explain: Is this actual noisy measurements? Is this some random transcription/data collection error?
However, if it is not errors in the dataset but actual data, then it is important to keep it. This is not about overfitting, this is about training with the actual data.

Drift Detection in categorical variables of high cardinality (10000+)

I am trying to solve a drift detection problem where I have to find out the drift in high cardinality (10000+) categorical variables such as ip_address, zipcode, cities. I have data points in the order of millions. I have tried the following methods -
chi square test from evidently python package https://github.com/evidentlyai/evidently/blob/main/src/evidently/analyzers/stattests/chisquare_stattest.py
maximum mean discrepancy test with GaussianRBF Kernel from alibi-detecthttps://github.com/SeldonIO/alibi-detect/blob/master/alibi_detect/cd/mmd.py
I have faced the below problems while applying these methods on my data
In chi square test, there is a constraint that we should have same set of categories in both training and inference datasets. This is highly unlikely for the features like ip address and zipcode. There are some data points which are available in training data but not in inference data. For such data points, I don't get observed frequency. I can assume their frequency as 0 as a work around.
But there are data points which have been newly introduced in the inference dataset and don't have their presence in the training dataset. So I would not be able to find out their expected frequency from training dataset. For such data points, I would have 0 in the denominator of the chi square formula and test statistic will be NaN. As a workaround, I can assume their minimum expected frequency equal to 1. But I wonder whether this is the correct way to approach the drift detection.
Moreover, the larger problem is the following -
The nature of these categorical feature variables is such that they can take any possible value from a very very large set of values. I don't have any control over these features taking a value. The users of the system can login from any IP address and from any zipcode. This becomes very difficult to find out the real drift in the data. Methods like chi square test can always give the significant result for such features. Is their any method which can handle such features for drift detection which takes into consideration the high cardinality and the aforementioned nature of the data.
In MMD test with GaussianRBF kernel, we use to calculate pairwise distance between two vectors. X is my training dataset which is having 15 millions records and 10 features. and Y is my inference dataset which is having 10 millions records and 10 features. Now when I perform MMD test on these datasets. I get the following -
a) K_XX = within similarity of X
b) K_YY = within similarity of Y
c) K_XY = cross similarity between X and Y
K_XX will try to generate a matrix (15 million X 15 million). This gives me "ResourceExhaustedError".
ResourceExhaustedError Traceback (most recent call last)
<command-574623167633967> in <module>
1 from alibi_detect.cd import MMDDrift
---> 2 detector = MMDDrift(x_ref=X, backend='tensorflow')
3 res = detector.predict(x=Y)
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/warnings.py in wrapper(*args, **kwargs)
15 def wrapper(*args, **kwargs):
16 _rename_kwargs(f.__name__, kwargs, aliases)
---> 17 return f(*args, **kwargs)
18 return wrapper
19 return deco
/databricks/python/lib/python3.8/site-packages/alibi_detect/cd/mmd.py in __init__(self, x_ref, backend, p_val, x_ref_preprocessed, preprocess_at_init, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, device, input_shape, data_type)
101 if backend == 'tensorflow' and has_tensorflow:
102 kwargs.pop('device', None)
--> 103 self._detector = MMDDriftTF(*args, **kwargs) # type: ignore
104 else:
105 self._detector = MMDDriftTorch(*args, **kwargs) # type: ignore
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/warnings.py in wrapper(*args, **kwargs)
15 def wrapper(*args, **kwargs):
16 _rename_kwargs(f.__name__, kwargs, aliases)
---> 17 return f(*args, **kwargs)
18 return wrapper
19 return deco
/databricks/python/lib/python3.8/site-packages/alibi_detect/cd/tensorflow/mmd.py in __init__(self, x_ref, p_val, x_ref_preprocessed, preprocess_at_init, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, input_shape, data_type)
86 # compute kernel matrix for the reference data
87 if self.infer_sigma or isinstance(sigma, tf.Tensor):
---> 88 self.k_xx = self.kernel(self.x_ref, self.x_ref, infer_sigma=self.infer_sigma)
89 self.infer_sigma = False
90 else:
/databricks/python/lib/python3.8/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
65 except Exception as e: # pylint: disable=broad-except
66 filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67 raise e.with_traceback(filtered_tb) from None
68 finally:
69 del filtered_tb
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/tensorflow/kernels.py in call(self, x, y, infer_sigma)
75 y = tf.cast(y, x.dtype)
76 x, y = tf.reshape(x, (x.shape[0], -1)), tf.reshape(y, (y.shape[0], -1)) # flatten
---> 77 dist = distance.squared_pairwise_distance(x, y) # [Nx, Ny]
78
79 if infer_sigma or self.init_required:
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/tensorflow/distance.py in squared_pairwise_distance(x, y, a_min, a_max)
28 x2 = tf.reduce_sum(x ** 2, axis=-1, keepdims=True)
29 y2 = tf.reduce_sum(y ** 2, axis=-1, keepdims=True)
---> 30 dist = x2 + tf.transpose(y2, (1, 0)) - 2. * x # tf.transpose(y, (1, 0))
31 return tf.clip_by_value(dist, a_min, a_max)
32
ResourceExhaustedError: Exception encountered when calling layer "gaussian_rbf_20" (type GaussianRBF).
OOM when allocating tensor with shape[14335347,14335347] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:AddV2]
The error seems pretty obvious given the quadratic complexity of MMD. Is their any way to mitigate this issue given the constraints that number of records in millions and high cardinality of categorical features.

Count mismatches between rows in google sheet

Need some help on this cause I'm getting an issue
I've this three columns (Time, R1 and R2) and I'm trying to count the mismatches between R1 and R2 but for each month (on the time column)
I already used a formula but I'm having an issue to add 1.
https://docs.google.com/spreadsheets/d/1bVP79Gbd14lO6xunu2K9POT7y55yrXegD-cTF70Fb4k/edit#gid=0 (the spreadsheet with the values)
=iferror(if(EOMONTH($A64,0)=$A64,SUMPRODUCT(month(Database!$C$2:$C) = month($A64),--(Database!G$2:G <> Database!H$2:H)),""),"Error")
This part "month(Database!$C$4:$C) = month($A5)" is where I compare the information of the months, ( but I'm having an issue cause cause "month(Database!$C$4:$C)" only retrieves 4 that is the month of april)
This part "(Database!G$4:G <> Database!H$4:H)" is where I compare the columns R1 and R2
The part "EOMONTH($A5,0)=$A5" is where I take the month to based myself
Time R1 R2
2020-04-30 BA BU
2020-04-30 BU BA
2020-04-29 BA BU
2020-04-29 BU BA
2020-04-28 BA BU
2020-04-28 AA BA
2020-04-25 AA BA
2020-04-22 BU BA
2020-04-19 AA BU
2020-04-19 AA BA
2020-03-27 BA AA
2020-03-27 BA AA
2020-03-26 BU AA
2020-03-18 BA AA
2020-03-18 AA BU
Approach
In order to validate the answer I created a test Spreadsheet from a copy of your. In this sheet I created two support columns, one which contains the month number: MONTH($A1) and the other one a flag if the two values R1 and R2 are different: IF($B1=$C1,"",1).
In this way I can use this two support structure to validate the numbers obtained by the formula which didn't use any. I will use a much simpler formula this time to compute the sum =SUMIF(D:D,month(<DATE_VALUE>),E:E).
I will link here the test sheet. As you can see the values are the same as the ones obtained by the formula. So, I can confirm that if you are expecting different results, your database is not consistent.
In conclusion the formula: if(EOMONTH($A64,0)=$A64,SUMPRODUCT(month(Database!$C$2:$C) = month($A64),--(Database!G$2:G <> Database!H$2:H)),"") is working correctly.

How to perform calculation with cumulative sum using ARRAYFORMULA

Is it possible to perform an arbitrary calculation (eg. A2*B2) on a set of rows and obtain the cumulative sum along the way using ARRAYFORMULA? For example, in the following sheet we have numbers (column A), multipliers (column B), the result of multiplying them (column C), and a cumulative tally (column D):
| A B C D E F
-------------------------------------------------------------------------------
1 | number multiplier result cumulative array formula array formula sum?
2 | 3 4 12 12 12
3 | 2 4 8 20 8
4 | 10 1 10 30 10
5 | 7 9 63 93 63
I can use ARRAYFORMULA in cell E2 (specifically, ARRAYFORMULA(A2:A5*B2:B5)) to do the multiplication. Is it possible to use ARRAYFORMULA (or alternative tool) in cell F2 to show the cumulative total?
use:
=ARRAYFORMULA(IF(A2:A="",,MMULT(TRANSPOSE((ROW(A2:A)<=
TRANSPOSE(ROW(A2:A)))*A2:A*B2:B), SIGN(B2:B))))
Calculate the cumulative sum with the SCAN and LAMBDA functions:
=SCAN(0, F5:F, LAMBDA(accumulated_value, cell_value, accumulated_value + cell_value))
This will run faster as it runs with linear complexity (O(N)) compared to the ARRAYFORMULA solution, which runs in quadratic time (O(N**2)).
Where:
0 is the initial value of the cumulative sum
F5:F is the range to sum over
LAMBDA(accumulated_value, cell_value, accumulated_value + cell_value)) is the function that calculates the sum at each cell
Sample File

How do I re-base this in gerrit?

I have the following case in gerrit,
(master)O---C1---C2---C4
\
C3
Now there are two files F1 and F2 that are common files in C1 and C2 commits but different contents. Also there is other file F3 that is common in C1 and C3 commits.
Now these changes have not yet been reviewed. Now this has resulted in merge conflict. How can I resolve this ?

Resources