Taking a sample of the image dataset [closed] - machine-learning

For example, I want to develop a deep learning model for image classification and I have thousands of images. Since training the model on the whole dataset takes a long time, I would like to take a sample (10%) of the original dataset for initial training. How can I do this?

If the dataset is contained in a folder, I would try the following:
import os
import numpy as np
images = os.listdir('Path to your dataset') # list of all the images
n_test_images = int(len(images) * 0.1) # 10% of the total images
subset_images = np.random.choice(images, size=n_test_images, replace=False)
I used replace=False to avoid picking the same element twice.
After selecting the 10% of the images, I load them.
I am not sure this is the most efficient way, but it could be a good starting point.
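As a follow-up, here is a minimal sketch of how the sampled file names could then be loaded, assuming the images sit directly in the dataset folder and that Pillow is available (the loading step itself was not shown in the answer above):
from PIL import Image

dataset_dir = 'Path to your dataset'  # same placeholder path as above
subset_arrays = []
for name in subset_images:
    with Image.open(os.path.join(dataset_dir, name)) as img:
        subset_arrays.append(np.array(img))  # one NumPy array per sampled image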

Related

How can I handle high-cardinality/sparse features using a neural network? [closed]

I am looking for examples of encoding high-cardinality or sparse features with neural networks, but I cannot find any. I also searched for examples of embedding numerical (not categorical) variables, but couldn't find any either. Can you point me to a GitHub link or similar resource on these topics?
Since you are working with neural networks, I am assuming TensorFlow with the Keras backend is being used.
If so, here is a reference snippet; the main API used is tf.feature_column:
import tensorflow as tf
from tensorflow.keras import layers

feature_columns = []
# numeric features go in as plain numeric columns
for col in list(df_train_numerical.columns):
    feature_columns.append(tf.feature_column.numeric_column(col))
# high-cardinality categorical features are hashed, then embedded
for col in list(df_train_categorical.columns):
    hashed = tf.feature_column.categorical_column_with_hash_bucket(col, hash_bucket_size=8000)
    # hash_bucket_size covers the cardinality; dimension is the embedding size
    feature_columns.append(tf.feature_column.embedding_column(hashed, dimension=8))
feature_layer = layers.DenseFeatures(feature_columns)
Following that, feature_layer is used as the first layer of the neural network:
model = tf.keras.models.Sequential()
model.add(feature_layer)
reference git code
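To round the snippet off, here is a minimal sketch of how the rest of the model could look; the Dense layer sizes, the binary target, and the names df_train / y_train are assumptions for illustration, not part of the original answer:
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))  # assumes a binary target
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# DenseFeatures expects a dict mapping column names to tensors, so feed the
# training DataFrame as a dict via tf.data (df_train and y_train are assumed names)
train_ds = tf.data.Dataset.from_tensor_slices((dict(df_train), y_train)).batch(32)
model.fit(train_ds, epochs=10)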

Basic question about heavy-tailed distribution [closed]

I have a basic question about heavy-tailed distributions.
Suppose there are 50,000 cities in Spain and the population of each is denoted by p(1), p(2), …, p(n). Based on the mean of the distribution μ and the standard deviation σ, how can we tell if the distribution is heavy-tailed or not? What procedure should we consider?
If you have all 50,000 observations, you can calculate the sample's central moments.
In particular, the fourth central moment divided by the squared variance is the kurtosis. This number tells you whether the distribution is leptokurtic (heavy-tailed) or platykurtic (light-tailed): if it is greater than three, your distribution has heavier tails than a normal distribution.
So if you are working in Python and all 50K observations are stored in x:
from scipy import stats

# Calculate kurtosis: fourth central moment divided by the squared variance
k = stats.moment(x, 4) / x.var()**2

# Evaluate against the kurtosis of a normal distribution (3)
if k > 3:
    print('Distribution has heavy tails')
else:
    print('Distribution does not have heavy tails')
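As a cross-check, SciPy also provides a kurtosis function directly; note that it returns excess kurtosis (kurtosis minus 3) by default, so pass fisher=False to compare against 3:
# Equivalent check with scipy.stats.kurtosis
k2 = stats.kurtosis(x, fisher=False)  # fisher=False gives Pearson kurtosis (normal -> 3)
print('Heavy tails' if k2 > 3 else 'No heavy tails')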

How to plot two sets of high-dimensional data in one visualization plot for comparison? [closed]

I am trying to compare my generated samples (i.e. MNIST digit images) from a GAN (Generative Adversarial Network).
For my 1st experiment, the GAN training is not successful, so the generated samples are not similar to real MNIST images.
For my 2nd experiment, the GAN training is very successful, so the generated samples should overlap well with the real MNIST samples in a visualized plot.
The exemplary figure above shows what I hope to achieve:
(1) The first panel shows the original real image distribution.
(2) The second panel shows that the results of GAN1 don't overlap well with the real data.
(3) The third panel shows that the results of GAN2 overlap well with the real data.
Could someone provide some guidance on a good way to plot something like this in Python, along with some example code?
You can try to use dimensionality reduction methods like PCA, t-SNE, LLE or UMAP to reduce the dimension of your images to 2 and plot the images as you already pointed out.
Here is some example code in python:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
X_real = ... # real images e.g. 1000 images as vectors
X_gan = ... # generated images from GAN with same shape
X = np.vstack([X_real, X_gan]) # stack matrices vertically
X_pca = PCA(n_components=50).fit_transform(X) # for high-dimensional data it's advisable to reduce the dimension first (e.g. to 50) before applying t-SNE
X_embedded = TSNE(n_components=2).fit_transform(X_pca)
# plot points with corresponding class and method labels
plt.scatter(...)
Instead of t-SNE you can directly use PCA or one of the other methods mentioned above.
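For the plt.scatter(...) placeholder above, a minimal sketch could look like this, assuming the first len(X_real) rows of X_embedded correspond to the real images and the remaining rows to the GAN samples (the colors and labels are just illustrative):
n_real = len(X_real)
plt.scatter(X_embedded[:n_real, 0], X_embedded[:n_real, 1], s=5, label='real MNIST')
plt.scatter(X_embedded[n_real:, 0], X_embedded[n_real:, 1], s=5, label='GAN samples')
plt.legend()
plt.title('2D embedding of real vs. generated images')
plt.show()
Producing this plot once with the GAN1 samples and once with the GAN2 samples gives the two comparison panels described in the question.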

I didn't quite get the "there exists a j" part. Can anyone help me understand it better? [closed]

I didn't quite get the "there exists a j" part. Can anyone help me understand it better?
For each dimension, the most extreme 10% are categorized as boundary. The collection of all points which lie in the most extreme 10% of any dimension is classified as the boundary set.
for a 1D line: fraction of points in boundary f = 0.100
for a 2D square: f = 0.1 + 2*(0.05 - 2*0.05**2) = 0.190. To see why, draw a square with cutting lines at the 0.05 and 0.95 fractions in each of the 2 dimensions and add up the area of the strips that fall in the extreme 10% of at least one dimension.
for a 3D cube: f = 0.271 (I'm too lazy to write the full expression down).
for a 50D hypercube (definitely not going to write the direct calculation): f = 0.995.
Now luckily there is an indirect way of calculating these fractions which requires significantly less effort. I'll leave that bit of homework for you to do.
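As a quick empirical check of the fractions listed above, here is a Monte Carlo sketch of my own (not part of the original answer, and deliberately not the closed-form shortcut):
import numpy as np

rng = np.random.default_rng(0)
for d in (1, 2, 3, 50):
    pts = rng.random((100_000, d))  # uniform points in the d-dimensional unit cube
    in_boundary = ((pts < 0.05) | (pts > 0.95)).any(axis=1)  # extreme 10% in at least one dimension
    print(d, round(in_boundary.mean(), 3))  # roughly 0.100, 0.190, 0.271, 0.995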

Estimating probabilities using Bayes rule? [closed]

I am working on a past exam paper. I am given a data set as follows:
Hair {brown, red} = {B,R},
Height {tall, short} = {T,S} and
Country {UK, Italy} = {U,I}
(B,T,U) (B,T,U) (B,T,I)
(R,T,U) (R,T,U) (B,T,I)
(R,T,U) (R,T,U) (B,T,I)
(R,S,U) (R,S,U) (R,S,I)
Question: Estimate the probabilities P(B,T|U), P(B|U), P(T|U), P(U) and P(I)
As the question says "estimate", I am guessing that I don't need to calculate any values. Is it just a case of adding up how many times (B,T,U) occurs over the whole data set, e.g. 2/12 ≈ 16%?
Then would P(U) be 0?
I don't think so. Out of your 12 records, 8 are from the country UK, so P(U) should be 8/12 = 2/3 ≈ 0.67.
Bayes' theorem is P(A|B) = P(B|A)P(A)/P(B), which you will need in order to estimate some of those probabilities.
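For the remaining estimates, here is a small counting sketch over the 12 records from the question (the code itself is my illustration, not part of the original answer):
# the 12 records as (hair, height, country) tuples
data = [
    ('B','T','U'), ('B','T','U'), ('B','T','I'),
    ('R','T','U'), ('R','T','U'), ('B','T','I'),
    ('R','T','U'), ('R','T','U'), ('B','T','I'),
    ('R','S','U'), ('R','S','U'), ('R','S','I'),
]

uk = [r for r in data if r[2] == 'U']  # the 8 UK records

p_u = len(uk) / len(data)                                     # P(U)     = 8/12
p_i = 1 - p_u                                                 # P(I)     = 4/12
p_b_given_u = sum(r[0] == 'B' for r in uk) / len(uk)          # P(B|U)   = 2/8
p_t_given_u = sum(r[1] == 'T' for r in uk) / len(uk)          # P(T|U)   = 6/8
p_bt_given_u = sum(r[:2] == ('B','T') for r in uk) / len(uk)  # P(B,T|U) = 2/8

print(p_u, p_i, p_b_given_u, p_t_given_u, p_bt_given_u)
Note that the conditional probabilities are counted within the 8 UK records, which is why P(B,T|U) comes out as 2/8 rather than 2/12.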
