2.3 ratio between PyTorch BCELoss and my own "log" calculations

I'm scripting a toy model to practice both PyTorch and GAN models, and I'm making sure I understand each step as well as possible.
That led me to check my understanding of the BCELoss function, and apparently I understand it... up to a factor of 2.3.
To check the results, I write the intermediate values out for Excel:
tmp1 = y_pred.tolist()                  # predicted values as a list (to copy/paste into Excel)
tmploss = nn.BCELoss(reduction='none')  # redefine the loss so it returns the whole BCELoss tensor
tmp2 = tmploss(y_pred, y_real).tolist() # BCELoss values as a list (to copy/paste into Excel)
Then I copy tmp1 into Excel and calculate -log(x) for each value, which is the BCELoss formula for y_target = y_real = 1.
Then I compare the results with the values in tmp2: the tmp2 values are 2.3x higher than "mine".
(Sorry, I couldn't figure out how to format tables on this site...)
Can you please tell me what is happening? I feel a PEBCAK coming :-)

This is because Excel's LOG function calculates the logarithm to base 10.
The standard definition of binary cross entropy uses the logarithm to base e (the natural log).
The ratio you're seeing is just ln(10) ≈ 2.302585, since log10(x) = ln(x) / ln(10).
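You can verify this numerically with a quick sketch (the values here are made up for illustration):
import math
import torch
import torch.nn as nn

y_pred = torch.tensor([0.9, 0.5, 0.1])
y_real = torch.ones(3)

bce = nn.BCELoss(reduction='none')(y_pred, y_real)                   # uses the natural log
excel_style = torch.tensor([-math.log10(p.item()) for p in y_pred])  # what Excel's LOG computes

print(bce / excel_style)  # each ratio comes out to ln(10) ≈ 2.302585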

Related

Hello all, I need some help regarding embedding selection for a relational GNN

Currently I am using a CSV file converted from a pcap, and I took the Length column from the CSV file and used it as my embedding. The code runs and I do get accuracy in the high 70s, around 77 percent. I am just not sure if this is an appropriate choice for an embedding. I am also getting weird results on some datasets: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples. I know people have answered this question before, but I tried all of their methods and still have no clue why my model works for some datasets and not all. If someone could confirm whether what I am doing makes sense, that would really help me.
CSV file snapshot for reference
from sklearn.preprocessing import StandardScaler
import torch

df['embeddings'] = df['Length']  # use the packet length column as the node feature
embeddings = df['embeddings'].to_numpy().reshape(-1, 1)
# standardizing the length values, then converting to a tensor
scaler = StandardScaler()
embeddings = torch.from_numpy(scaler.fit_transform(embeddings))
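For what it's worth, the UndefinedMetricWarning just means that some class never appears among the predictions, so its precision (and hence F-score) is undefined and reported as 0.0. A tiny illustration with made-up labels (the zero_division argument assumes scikit-learn 0.22+):
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2]  # class 2 exists in the ground truth...
y_pred = [0, 0, 1, 1, 1]  # ...but is never predicted

# per-class F1: class 2 gets 0.0; zero_division=0 silences the warning
print(f1_score(y_true, y_pred, average=None, zero_division=0))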

Confusion matrix subset of classes not working properly

I have searched for an answer to this question on the internet, including the suggestions shown when writing the title, but still to no avail, so hopefully someone can help!
I am trying to construct a confusion matrix using scikit-learn. This comes after a Keras model.
This is bizarre because I am having the following problem: for the training and test set of the original data, I can construct the confusion matrix as follows (please note this is a multi-label problem, so the data has to be subset for the different labels).
The following works fine:
cm = confusion_matrix(y_train[:,0:6].argmax(axis=1), trainpred[:,0:6].argmax(axis=1))
and similarly for 6:18 etc., until all classes have been subset. The confusion matrix that forms as a result reflects the true outcome of the Keras model.
The problem arises when I deploy the model on completely unseen data.
I deploy the model by calling model.predict() and get results as above. However, now I cannot subset the confusion matrices in the same way.
The code cm = confusion_matrix(...) etc. causes the output of the confusion matrix to have the wrong dimensions, even when specifying 0:6 etc.
I therefore used the code from above, but with the labels argument added:
age = [0, 1, 2, 3, 4]
organ = [5, 6, 7, 8]
cm = confusion_matrix(y_train[:,0:6].argmax(axis=1), trainpred[:,0:6].argmax(axis=1), labels=age)
The FIRST label (1:5) works perfectly... However, the next labels do not! I don't get the right values in the confusion matrices, and the matching is also incorrect for the values that are in there.
To put this in to context: there are over 400 samples in the unseen test data.
model.predict shows very high classification and correct scores for most labels..
calling cm = confusion_matrix(...) with ytest[:,4:8] etc. does indeed produce a 4x4 matrix, however there are only about 5 values in there, not 400, and those values that are in there are not correctly matched.
Also, with the age labels being 0,1,2,3,4,5, subsetting ytest to 0:6 causes the correct confusion matrix to form (I am unsure why the 6 has to be included in the subset... nevertheless I have tried different combinations with the same issue!).
I have searched high and low for this answer, so would really appreciate some assistance as it is incredibly frustrating. Any more code/information I can provide, I will be happy to!
Many thanks!
This is happening because you are trying to subset the generated confusion matrix, but you actually have to generate a new confusion matrix manually with the specified class labels. If you have classes A, B, C you will get a 3x3 matrix. If you want to create a matrix focusing only on class A, the other classes become the negative class, but the false positives and false negatives change, so you cannot just sample the initial matrix.
This is how you should actually do it:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

def generate_matrix(y_true, predict, class_name):
    # one-vs-rest counts for the chosen class
    TP, FP, FN, TN = 0, 0, 0, 0
    for i in range(len(y_true)):
        if y_true[i] == class_name:
            if predict[i] == class_name:
                TP += 1
            else:
                FN += 1
        else:
            if predict[i] == class_name:
                FP += 1
            else:
                TN += 1
    return np.array([[TP, FP],
                     [FN, TN]])

# Build the one-vs-rest matrix for class 'A'
matrix = generate_matrix(actual_labels,
                         predicted_labels,
                         class_name='A')
This will generate a confusion matrix for class A.
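If you're on a recent scikit-learn (0.21+), multilabel_confusion_matrix already does this one-vs-rest computation for every class at once; a minimal sketch with made-up labels (note that its 2x2 matrices are ordered [[TN, FP], [FN, TP]]):
from sklearn.metrics import multilabel_confusion_matrix

y_true = ['A', 'B', 'C', 'A', 'B']
y_pred = ['A', 'C', 'C', 'A', 'B']

per_class = multilabel_confusion_matrix(y_true, y_pred, labels=['A', 'B', 'C'])
print(per_class[0])  # the 2x2 one-vs-rest matrix for class 'A'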

What statistic method to use in multivariate abundance data with random effects?

I am working with multivariate data with random effects.
My hypothesis is this: D has an effect on A1 and A2, where A1 and A2 are binary data, and D is a continuous variable.
I also have a random effect, R, that is a factor variable.
So my model would be something like this: A1 and A2 ~ D, random = ~1 | R.
I tried to use the manyglm function in the mvabund package, but it cannot deal with random effects. Alternatively I could use lme4, but it cannot deal with multivariate responses.
I could convert my multivariate data to a 4-level factor variable, but I haven't found any method that accepts a factor (rather than binary) response. I could also convert the continuous D into a factor variable.
Do you have any advice about what to use in that situation?
First, I know this should be a comment and not a complete answer but I can't comment yet and thought you might still appreciate the pointer.
You should be able to analyze your data with the MCMCglmm R package (see here for an intro), as it can handle mixed models with multivariate response data.

Artificial Neural Network for formula classification/calculation

I am trying to create an ANN for calculating/classifying a/any formula.
I initially tried to replicate the Fibonacci sequence, using the inputs:
[1,2] output [3]
[2,3] output [5]
[3,5] output [8]
etc...
The issue I am trying to overcome is how to normalize data that could potentially be infinite or scale exponentially. I then tried to create an ANN to calculate the slope-intercept formula y = mx + b (here 2x + 2) with inputs
[1] output [4]
[2] output [6]
etc...
Again I do not know how to normalize the data. If I normalize only the training data, how would the network be able to calculate or classify with inputs outside the range that was used for normalization?
So would it be possible to create an ANN to calculate/classify the formula ((a + 2b + c^2 + 3d - 5e) modulo 2), where the formula is unknown but (some of) the inputs a, b, c, d, and e are given, as well as the output? Essentially, classifying whether the calculation's output is odd or even, where the inputs range over ±infinity...
Okay, I think I understand what you're trying to do now. Basically, you are going to have a set of inputs representing the coefficients of a function. You want the ANN to tell you whether the function, with those coefficients, will produce an even or an odd output. Let me know if that's wrong. There are a few potential issues here:
First, while it is possible to use a neural network to do addition, it is not generally very efficient. You also need to set your ANN up in a very specific way, either by using a different node type than is usually used, or by setting up complicated recurrent topologies. This would explain your lack of success with the Fibonacci sequence and the line equation.
But there's a more fundamental problem. You might have heard that ANNs are general function approximators. However, in this case, the function that the ANN is learning won't be your formula. When you have an ANN that is learning to output either 0 or 1 in response to a set of inputs, it's actually trying to learn a function for a line (or set of lines, or hyperplane, depending on the topology) that separates all of the inputs for which the output should be 0 from all of the inputs for which the output should be 1. (see the answers to this question for a more thorough explanation, with pictures). So the question, then, is whether or not there is a hyperplane that separates coefficients that will result in an even output from coefficients that will result in an odd output.
I'm inclined to say that the answer to that question is no. If you consider the a coefficient in your example, for instance, you will see that every time you increment or decrement it by 1, the correct output switches. The same is true for the c, d, and e terms. This means that there aren't big clumps of relatively similar inputs that all return the same output.
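You can see the flip concretely with a quick toy check (the helper below is just for illustration):
def parity(a, b, c, d, e):
    # the formula from the question: is (a + 2b + c^2 + 3d - 5e) even or odd?
    return (a + 2*b + c**2 + 3*d - 5*e) % 2

print(parity(1, 0, 0, 0, 0))  # 1 (odd)
print(parity(2, 0, 0, 0, 0))  # 0 (even)  <- a unit step in a flips the class
print(parity(3, 0, 0, 0, 0))  # 1 (odd)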
Why do you need to know whether the output of an unknown function is even or odd? There might be other, more appropriate techniques.

scikit-learn: clustering text documents using DBSCAN

I'm trying to use scikit-learn to cluster text documents. On the whole, I find my way around, but I have problems with specific issues. Most of the examples I found illustrate clustering using scikit-learn with k-means as the clustering algorithm. Adapting these examples with k-means to my setting works in principle. However, k-means is not suitable since I don't know the number of clusters. From what I have read so far -- please correct me here if needed -- DBSCAN or MeanShift seem to be more appropriate in my case. The scikit-learn website provides examples for each clustering algorithm. The problem is that with both DBSCAN and MeanShift I get errors I cannot comprehend, let alone solve.
My minimal code is as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

docs = []
for item in [database]:
    docs.append(item)

vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(docs)
X = X.todense()  # <-- This line was needed to resolve the issue
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
...
(My documents are already processed, i.e., stopwords have been removed and a Porter stemmer has been applied.)
When I run this code, I get the following error when instantiating DBSCAN and calling fit():
...
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 248, in fit
clust = dbscan(X, **self.get_params())
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 86, in dbscan
n = X.shape[0]
IndexError: tuple index out of range
Clicking on the line in dbscan_.py that throws the error, I noticed the following line
...
X = np.asarray(X)
n = X.shape[0]
...
When I use these two lines directly in my code for testing, I get the same error. I don't really know what np.asarray(X) is doing here, but after that command X.shape is (). Hence X.shape[0] bombs -- before, X.shape[0] correctly referred to the number of documents. Out of curiosity, I removed X = np.asarray(X) from dbscan_.py. When I do this, something computes heavily, but after some seconds I get another error:
...
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 214, in extractor
(min_indx,max_indx) = check_bounds(indices,N)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 198, in check_bounds
max_indx = indices.max()
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 17, in _amax
out=out, keepdims=keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity
In short, I have no clue how to get DBSCAN working, or what I might have missed, in general.
It looks like sparse representations for DBSCAN are supported as of Jan. 2015.
I upgraded sklearn to 0.16.1 and it worked for me on text.
The implementation in sklearn seems to assume you are dealing with a finite vector space and wants to find the dimensionality of your data set. Text data is commonly represented as sparse vectors, but not with a fixed dimensionality.
Your input data probably isn't a data matrix, but the sklearn implementation needs it to be one.
You'll need to find a different implementation. Maybe try the implementation in ELKI, which is very fast, and should not have this limitation.
You'll need to spend some time understanding similarity first. For DBSCAN, you must choose epsilon in a way that makes sense for your data. There is no rule of thumb; this is domain specific. Therefore, you first need to figure out which similarity threshold means that two documents are similar.
Mean Shift may actually need your data to be a vector space of fixed dimensionality.
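For reference, on recent versions of scikit-learn you can keep the TF-IDF matrix sparse and pass a cosine distance to DBSCAN directly; a rough sketch (eps and min_samples are placeholders you would have to tune for your own corpus):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

docs = ["first preprocessed document", "second preprocessed document", "something else entirely"]

X = TfidfVectorizer(min_df=1).fit_transform(docs)  # stays sparse, no .todense() needed
db = DBSCAN(eps=0.5, min_samples=2, metric='cosine').fit(X)
print(db.labels_)  # -1 marks noise points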
