make_blobs without Sklearn [closed] - machine-learning

I would like to write exactly this code, but without importing sklearn:
from sklearn.datasets import make_blobs
X_Train, Y_Train = make_blobs(n_samples=100, n_features=2, centers=2, random_state=0)

You can check the source code sklearn uses to generate the blobs; in it you can find how the data matrix is built.
In principle, one way is to sample from multivariate normal distributions with different means and similar covariance. For example, if there are only 2 relevant features, you can set the means like this:
import matplotlib.pyplot as plt
import numpy as np

means = [[1, 1], [3, 3]]      # one mean per cluster
covs = [[0.5, 0], [0, 0.5]]   # shared isotropic covariance
n = 100                       # samples per cluster

np.random.seed(1223)
X = np.vstack([np.random.multivariate_normal(m, covs, n) for m in means])
y = np.repeat([0, 1], n)      # cluster labels

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.axis('equal')
plt.show()
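If you want a drop-in replacement for the make_blobs call itself, here is a rough NumPy-only sketch. make_blobs_np is a hypothetical helper; it imitates sklearn's defaults (isotropic Gaussians around centers drawn uniformly from (-10, 10)) but will not reproduce sklearn's exact random sequence:
import numpy as np

def make_blobs_np(n_samples=100, n_features=2, centers=2, cluster_std=1.0, random_state=None):
    rng = np.random.RandomState(random_state)
    # Draw cluster centers uniformly from a box, as sklearn does by default
    center_coords = rng.uniform(-10.0, 10.0, size=(centers, n_features))
    # Split the samples as evenly as possible across the centers
    counts = [n_samples // centers + (1 if i < n_samples % centers else 0) for i in range(centers)]
    X = np.vstack([rng.normal(loc=c, scale=cluster_std, size=(k, n_features))
                   for c, k in zip(center_coords, counts)])
    y = np.concatenate([np.full(k, i) for i, k in enumerate(counts)])
    return X, y

X_Train, Y_Train = make_blobs_np(n_samples=100, n_features=2, centers=2, random_state=0)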

Related

How can I handle high cardinal/sparse features using neural network? [closed]

I'm looking for examples of encoding high-cardinality or sparse features with neural networks, but I cannot find any. I've also searched for examples of embedding numerical (not categorical) variables, without success. Can you share a GitHub link or similar resource on these topics?
Since you are working with neural networks, I assume TensorFlow with the Keras backend is being used? If so, here is a reference snippet; the main API used is tf.feature_column:
import tensorflow as tf
from tensorflow.keras import layers

feature_columns = []

# Numerical columns are passed through as-is
for col in df_train_numerical.columns:
    feature_columns.append(tf.feature_column.numeric_column(col))

# Categorical columns are hashed into buckets (hash_bucket_size caps the
# cardinality) and then embedded into a dense 8-dimensional vector
for col in df_train_categorical.columns:
    hashed = tf.feature_column.categorical_column_with_hash_bucket(col, hash_bucket_size=8000)
    feature_columns.append(tf.feature_column.embedding_column(hashed, dimension=8))

feature_layer = layers.DenseFeatures(feature_columns)
Following that, the feature_layer is basically the first layer of the neural network:
model = tf.keras.models.Sequential()
model.add(feature_layer)
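From there you can stack ordinary Keras layers on top and train on a dict of column tensors. A minimal sketch (the layer sizes, loss, and the df_train / y_train variables are illustrative assumptions, not from the original answer):
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))  # assuming a binary target
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# DenseFeatures expects a dict mapping column names to tensors, so feed
# the training dataframe as a dict (df_train holds all feature columns)
train_ds = tf.data.Dataset.from_tensor_slices((dict(df_train), y_train)).batch(32)
model.fit(train_ds, epochs=5)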

Problems with using sklearn.preprocessing.PowerTransformer() before splitting data using train_test_split [closed]

Since my data are not normally distributed, I decided to use PowerTransformer on X and y before splitting them into X_train, X_test, y_train, y_test. Is it okay to do this, or should I perform the transformation after splitting? Here is my code:
from sklearn.preprocessing import PowerTransformer, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df[['Aces', 'TotalPointsWon', 'ServiceGamesWon', 'TotalServicePointsWon']]
y = df[['Winnings']]

transformer_X = PowerTransformer()
X_log = transformer_X.fit_transform(X)
transformer_y = PowerTransformer()
y_log = transformer_y.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X_log, y_log, train_size=0.8)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # assign the result; fit_transform does not scale in place
X_test = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train, y_train)
[Residuals analysis graph omitted]
Thanks for helping out.
PowerTransformer makes data more Gaussian-like, feature-wise.
Just like any data preprocessing step, the rule of thumb is to fit (i.e. learn the parameters from) the training data only, then transform both the training set and the test set (i.e. apply the learned parameters to the unseen new data).
Hence, the fit method should only be applied to the training data, under the assumption that it represents the statistical distribution of the whole sample (so make sure to use stratified splits if it is a classification problem, make sure you have enough examples, use cross-validation, etc.).
Why?
Because at some point you'll receive new unseen data that you'll only be able to transform. Splitting the data at this stage simulates that event, and lets you validate that the model is neither overfitting nor underfitting and has actually learned to represent the data.
Otherwise, your model would be biased, and a degree of data snooping would be involved.
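Applied to the code in the question, that ordering looks roughly like this (a sketch reusing the question's variable names; note that the transform results must be assigned back):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

# Split first, so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

# Fit on the training data only, then apply the learned parameters to both sets
transformer_X = PowerTransformer()
X_train = transformer_X.fit_transform(X_train)
X_test = transformer_X.transform(X_test)

transformer_y = PowerTransformer()
y_train = transformer_y.fit_transform(y_train)
y_test = transformer_y.transform(y_test)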
Final words
Please note that PowerTransformer accepts a parameter called method that specifies one of the two available power transform methods:
yeo-johnson: which works with positive and negative values.
box-cox: which only works with strictly positive values.
You can read more about each of them in the scikit-learn documentation.
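A quick illustration of the method parameter (the toy data is made up):
import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.array([[1.0], [2.0], [5.0], [10.0]])  # strictly positive toy data
X_bc = PowerTransformer(method='box-cox').fit_transform(X)      # requires positive values
X_yj = PowerTransformer(method='yeo-johnson').fit_transform(X)  # default; also handles zeros and negatives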

Taking a sample of the image dataset [closed]

For example, I want to develop a deep learning model for image classification and I have thousands of images. Since training the model with the whole dataset takes a long time, I would like to take a sample (10%) of the original dataset for initial training. How can I do this?
If the dataset is contained in a folder, I would try the following:
import os
import numpy as np
images = os.listdir('Path to your dataset')  # list of all image filenames
n_sample_images = int(len(images) * 0.1)     # 10% of the total images
subset_images = np.random.choice(images, size=n_sample_images, replace=False)
I used replace=False to avoid picking the same element twice.
After selecting the 10% of images, I load them.
I am not sure this is the most optimal approach, but it could be a good starting point.
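For the loading step, a minimal sketch with PIL (the folder path and the assumption that all images share one shape are mine):
import os
import numpy as np
from PIL import Image

dataset_dir = 'Path to your dataset'  # same folder as above
sample = [np.asarray(Image.open(os.path.join(dataset_dir, fname)))
          for fname in subset_images]
sample = np.stack(sample)  # works only if all images have the same shape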

Can we tune any of the parameters on testing data, including any parameters learned by preprocessing? [closed]

I want to normalize the data using the StandardScaler function, but I have doubts about how this should be done.
One way to do it is as follows:
scaler = StandardScaler().fit(X)
X = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y)
And the other case is like this:
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
I read somewhere that we should never tune any of our parameters on testing data, including any parameters learned by preprocessing (scale, bias).
According to this fact, is only the second case correct?
I'm a little confused.
I would go with the second case, yes.
Assume that you train your model and at a later moment use it on novel data; the accuracy you would expect is similar to the one you achieved on the validation set (actually on the test set, but for the sake of simplicity...).
In the first case, some information from your validation data has leaked into your model (namely the mean and standard deviation learned during preprocessing), and it may perform better on validation thanks to this.
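If you want to make that mistake hard to commit, sklearn's Pipeline refits the scaler on the training portion only, including inside cross-validation. A minimal sketch (LogisticRegression stands in for whatever model you actually use):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

# The scaler is fit on X_train when the pipeline is fit, and the same
# learned mean/std are reused to transform X_test at prediction time
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))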

How to plot two sets of high dimensional data in one visualization plot for comparison? [closed]

I am trying to compare my generated samples (i.e. MNIST digit images) from a GAN (Generative Adversarial Network).
For my 1st experiment, the GAN training is not successful, so the generated samples are not similar to real MNIST images.
For my 2nd experiment, the GAN training is very successful, so the generated samples should be overlapped well with real MNIST samples in a visualized plot.
An example figure of what I hope to achieve would show:
(1) The first figure shows the original real image distribution
(2) The second figure shows that the results of GAN1 don't overlap well with real data
(3) The third figure shows that the results of GAN2 overlap well with the real data.
Could someone provide some guidance on a good way to plot something like this with Python, and provide some example code?
You can try to use dimensionality reduction methods like PCA, t-SNE, LLE or UMAP to reduce the dimension of your images to 2 and plot the images as you already pointed out.
Here is some example code in Python:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
X_real = ... # real images e.g. 1000 images as vectors
X_gan = ... # generated images from GAN with same shape
X = np.vstack([X_real, X_gan]) # stack matrices vertically
X_pca = PCA(n_components=50).fit_transform(X) # for high-dimensional data it's advisable to reduce the dimension first (e.g. to 50) before using t-SNE
X_embedded = TSNE(n_components=2).fit_transform(X_pca)
# plot points with corresponding class and method labels
plt.scatter(...)
Instead of t-SNE you can directly use PCA or one of the other methods mentioned above.
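To make the snippet runnable end to end, here is a sketch with random placeholder data standing in for the real and generated images (the shapes and the two-cluster offset are made up for illustration):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_real = rng.normal(0.0, 1.0, size=(1000, 784))  # stand-in for real MNIST vectors
X_gan = rng.normal(0.5, 1.0, size=(1000, 784))   # stand-in for GAN samples

X = np.vstack([X_real, X_gan])
labels = np.array([0] * len(X_real) + [1] * len(X_gan))  # 0 = real, 1 = generated

X_pca = PCA(n_components=50).fit_transform(X)
X_embedded = TSNE(n_components=2).fit_transform(X_pca)

for lab, name in [(0, 'real'), (1, 'GAN')]:
    mask = labels == lab
    plt.scatter(X_embedded[mask, 0], X_embedded[mask, 1], s=5, alpha=0.5, label=name)
plt.legend()
plt.show()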
