LinearRegressionWithSGD() returns NaN - machine-learning

I am trying to use LinearRegressionWithSGD on the Million Song Dataset and my model returns NaNs as weights and 0.0 as the intercept. What might be causing this? I am using Spark 1.4.0 in standalone mode.
Sample data: http://www.filedropper.com/part-00000
Here is my full code:
// Import Dependencies
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
// Define RDD
val data = sc.textFile("/home/naveen/Projects/millionSong/YearPredictionMSD.txt")

// Convert to Labelled Point
def parsePoint(line: String): LabeledPoint = {
  val x = line.split(",")
  val head = x.head.toDouble
  val tail = Vectors.dense(x.tail.map(x => x.toDouble))
  return LabeledPoint(head, tail)
}

// Find Range
val parsedDataInit = data.map(x => parsePoint(x))
val onlyLabels = parsedDataInit.map(x => x.label)
val minYear = onlyLabels.min()
val maxYear = onlyLabels.max()

// Shift Labels
val parsedData = parsedDataInit.map(x => LabeledPoint(x.label - minYear, x.features))

// Training, validation, and test sets
val splits = parsedData.randomSplit(Array(0.8, 0.1, 0.1), seed = 123)
val parsedTrainData = splits(0).cache()
val parsedValData = splits(1).cache()
val parsedTestData = splits(2).cache()
val nTrain = parsedTrainData.count()
val nVal = parsedValData.count()
val nTest = parsedTestData.count()

// RMSE
def squaredError(label: Double, prediction: Double): Double = {
  return scala.math.pow(label - prediction, 2)
}

def calcRMSE(labelsAndPreds: RDD[List[Double]]): Double = {
  return scala.math.sqrt(labelsAndPreds.map(x => squaredError(x(0), x(1))).mean())
}

val numIterations = 100
val stepSize = 1.0
val regParam = 0.01
val regType = "L2"

val algorithm = new LinearRegressionWithSGD()
algorithm.optimizer
  .setNumIterations(numIterations)
  .setStepSize(stepSize)
  .setRegParam(regParam)

val model = algorithm.run(parsedTrainData)

I am not familiar with this specific implementation of SGD, but generally, if a gradient descent solver goes to NaN it means that the learning rate is too big (in this case, I think it is the stepSize variable).
Try lowering it by an order of magnitude each time until it starts to converge.

I can think of two possibilities:
1. stepSize is too big. You should try something like 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, ...
2. Your training data contain NaN values. If so, the result will likely be NaN.
A quick way to see the effect of the step size on a toy problem is sketched below.
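To make the first point concrete, here is a minimal NumPy sketch (not the Spark API; the toy data and step sizes are made up) showing how an oversized step makes plain gradient descent blow up to NaN while a much smaller one settles:

import numpy as np

# Toy 1-D regression with a large-magnitude feature, loosely mimicking unscaled year data.
rng = np.random.default_rng(0)
x = rng.uniform(1900.0, 2011.0, size=1000)             # unscaled feature values
y = 3.0 * (x - 1922.0) + rng.normal(0.0, 10.0, 1000)   # labels shifted, as in the question

def gd_weight(step_size, iterations=100):
    # Plain gradient descent on 0.5 * mean((w*x - y)^2); returns the final weight.
    w = 0.0
    for _ in range(iterations):
        grad = np.mean((w * x - y) * x)
        w -= step_size * grad
    return w

for step in [1.0, 1e-3, 1e-7]:
    print(step, gd_weight(step))
# The first two step sizes diverge all the way to nan; only the much smaller one converges.

The same tuning applies to the stepSize passed to the MLlib optimizer: keep lowering it until the weights stop diverging.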

Related

Implementing linear regression from scratch in python

I'm trying to implement linear regression in Python using the following gradient descent update formulas (note that these are the formulas after taking the partial derivatives):
slope: w = w - a * (1/m) * sum((y_hat_i - y_i) * x_i)
y_intercept: b = b - a * (1/m) * sum(y_hat_i - y_i)
but the code keeps giving me weird results. I think (I'm not sure) that the error is in the gradient_descent function.
import numpy as np

class LinearRegression:
    def __init__(self, x: np.ndarray, y: np.ndarray):
        self.x = x
        self.m = len(x)
        self.y = y

    def calculate_predictions(self, slope: int, y_intercept: int) -> np.ndarray:  # Calculate y hat.
        predictions = []
        for x in self.x:
            predictions.append(slope * x + y_intercept)
        return predictions

    def calculate_error_cost(self, y_hat: np.ndarray) -> int:
        error_valuse = []
        for i in range(self.m):
            error_valuse.append((y_hat[i] - self.y[i]) ** 2)
        error = (1 / (2 * self.m)) * sum(error_valuse)
        return error

    def gradient_descent(self):
        costs = []
        # initialization values
        temp_w = 0
        temp_b = 0
        a = 0.001  # Learning rate
        while True:
            y_hat = self.calculate_predictions(slope=temp_w, y_intercept=temp_b)
            sum_w = 0
            sum_b = 0
            for i in range(len(self.x)):
                sum_w += (y_hat[i] - self.y[i]) * self.x[i]
                sum_b += (y_hat[i] - self.y[i])
            w = temp_w - a * ((1 / self.m) * sum_w)
            b = temp_b - a * ((1 / self.m) * sum_b)
            temp_w = w
            temp_b = b
            costs.append(self.calculate_error_cost(y_hat))
            try:
                if costs[-1] > costs[-2]:  # If global minimum reached
                    return [w, b]
            except IndexError:
                pass
I used this dataset:
https://www.kaggle.com/datasets/tanuprabhu/linear-regression-dataset?resource=download
after downloading it, I load it like this:
import pandas
p = pandas.read_csv('linear_regression_dataset.csv')
l = LinearRegression(x= p['X'] , y= p['Y'])
print(l.gradient_descent())
But it's giving me [-568.1905905426412, -2.833321633515304], which is clearly not accurate.
I want to implement the algorithm without using external modules like scikit-learn, for learning purposes.
I tested the calculate_error_cost function and it worked as expected, and I don't think there is an error in the calculate_predictions function.
One small problem you have is that you are returning the last values of w and b, when you should be returning the second-to-last parameters (because they yield a lower cost). This should not really matter that much... unless your learning rate is too high and you are immediately getting a higher value for the cost function on the second iteration. This I believe is your real problem, judging from the dataset you shared.
The algorithm does work on the dataset, but you need to change the learning rate. I ran it in the example below and it gave the result shown in the image. One caveat is that I added a limit on the number of iterations to keep the algorithm from taking too long (while only marginally improving the result).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

class LinearRegression:
    def __init__(self, x: np.ndarray, y: np.ndarray):
        self.x = x
        self.m = len(x)
        self.y = y

    def calculate_predictions(self, slope: int, y_intercept: int) -> np.ndarray:  # Calculate y hat.
        predictions = []
        for x in self.x:
            predictions.append(slope * x + y_intercept)
        return predictions

    def calculate_error_cost(self, y_hat: np.ndarray) -> int:
        error_valuse = []
        for i in range(self.m):
            error_valuse.append((y_hat[i] - self.y[i]) ** 2)
        error = (1 / (2 * self.m)) * sum(error_valuse)
        return error

    def gradient_descent(self):
        costs = []
        # initialization values
        temp_w = 0
        temp_b = 0
        iteration = 0
        a = 0.00001  # Learning rate
        while iteration < 1000:
            y_hat = self.calculate_predictions(slope=temp_w, y_intercept=temp_b)
            sum_w = 0
            sum_b = 0
            for i in range(len(self.x)):
                sum_w += (y_hat[i] - self.y[i]) * self.x[i]
                sum_b += (y_hat[i] - self.y[i])
            w = temp_w - a * ((1 / self.m) * sum_w)
            b = temp_b - a * ((1 / self.m) * sum_b)
            costs.append(self.calculate_error_cost(y_hat))
            try:
                if costs[-1] > costs[-2]:  # If global minimum reached
                    print(costs)
                    return [temp_w, temp_b]
            except IndexError:
                pass
            temp_w = w
            temp_b = b
            iteration += 1
        print(iteration)
        return [temp_w, temp_b]

p = pd.read_csv('linear_regression_dataset.csv')
x_data = p['X']
y_data = p['Y']
lin_reg = LinearRegression(x_data, y_data)
y_hat = lin_reg.calculate_predictions(*lin_reg.gradient_descent())

fig = plt.figure()
plt.plot(x_data, y_data, 'r.', label='Data')
plt.plot(x_data, y_hat, 'b-', label='Linear Regression')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
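As an extra sanity check (not part of the original answer), the slope and intercept from gradient_descent can be compared with the closed-form least-squares fit that numpy.polyfit gives on the same CSV:

import numpy as np
import pandas as pd

p = pd.read_csv('linear_regression_dataset.csv')
x_data = p['X'].to_numpy()
y_data = p['Y'].to_numpy()

# A degree-1 polyfit returns [slope, intercept] of the least-squares line.
slope, intercept = np.polyfit(x_data, y_data, deg=1)
print('closed-form slope/intercept:', slope, intercept)

With a suitable learning rate and enough iterations, the gradient-descent result should land close to these values.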

Precision and recall misunderstanding

In pycocotools, in the cocoeval.py script, there is a COCOeval class, and in this class there is an accumulate function for calculating precision and recall. Does anyone know what this npig variable is? Is it something like negative-positive, or what?
Because I saw this formula for recall: Recall = (True Positive)/(True Positive + False Negative)
Can I just use the precision and recall variables inside the dictionary self.eval to get the precision and recall of the model I'm testing, and plot a precision-recall curve?
And is the scores variable the F1 score?
I don't really understand the T, R, K, A, M dimensions and what is happening with them.
How can I print precision and recall in the terminal?
def accumulate(self, p = None):
    '''
    Accumulate per image evaluation results and store the result in self.eval
    :param p: input params for evaluation
    :return: None
    '''
    print('Accumulating evaluation results...')
    tic = time.time()
    if not self.evalImgs:
        print('Please run evaluate() first')
    # allows input customized parameters
    if p is None:
        p = self.params
    p.catIds = p.catIds if p.useCats == 1 else [-1]
    T = len(p.iouThrs)
    R = len(p.recThrs)
    K = len(p.catIds) if p.useCats else 1
    A = len(p.areaRng)
    M = len(p.maxDets)
    precision = -np.ones((T,R,K,A,M)) # -1 for the precision of absent categories
    recall = -np.ones((T,K,A,M))
    scores = -np.ones((T,R,K,A,M))
    # create dictionary for future indexing
    _pe = self._paramsEval
    catIds = _pe.catIds if _pe.useCats else [-1]
    setK = set(catIds)
    setA = set(map(tuple, _pe.areaRng))
    setM = set(_pe.maxDets)
    setI = set(_pe.imgIds)
    # get inds to evaluate
    k_list = [n for n, k in enumerate(p.catIds) if k in setK]
    m_list = [m for n, m in enumerate(p.maxDets) if m in setM]
    a_list = [n for n, a in enumerate(map(lambda x: tuple(x), p.areaRng)) if a in setA]
    i_list = [n for n, i in enumerate(p.imgIds) if i in setI]
    I0 = len(_pe.imgIds)
    A0 = len(_pe.areaRng)
    # retrieve E at each category, area range, and max number of detections
    for k, k0 in enumerate(k_list):
        Nk = k0*A0*I0
        for a, a0 in enumerate(a_list):
            Na = a0*I0
            for m, maxDet in enumerate(m_list):
                E = [self.evalImgs[Nk + Na + i] for i in i_list]
                E = [e for e in E if not e is None]
                if len(E) == 0:
                    continue
                dtScores = np.concatenate([e['dtScores'][0:maxDet] for e in E])
                # different sorting method generates slightly different results.
                # mergesort is used to be consistent as Matlab implementation.
                inds = np.argsort(-dtScores, kind='mergesort')
                dtScoresSorted = dtScores[inds]
                dtm = np.concatenate([e['dtMatches'][:,0:maxDet] for e in E], axis=1)[:,inds]
                dtIg = np.concatenate([e['dtIgnore'][:,0:maxDet] for e in E], axis=1)[:,inds]
                gtIg = np.concatenate([e['gtIgnore'] for e in E])
                npig = np.count_nonzero(gtIg==0 )
                if npig == 0:
                    continue
                tps = np.logical_and( dtm, np.logical_not(dtIg) )
                fps = np.logical_and(np.logical_not(dtm), np.logical_not(dtIg) )
                tp_sum = np.cumsum(tps, axis=1).astype(dtype=np.float)
                fp_sum = np.cumsum(fps, axis=1).astype(dtype=np.float)
                for t, (tp, fp) in enumerate(zip(tp_sum, fp_sum)):
                    tp = np.array(tp)
                    fp = np.array(fp)
                    nd = len(tp)
                    rc = tp / npig
                    pr = tp / (fp+tp+np.spacing(1))
                    q = np.zeros((R,))
                    ss = np.zeros((R,))
                    if nd:
                        recall[t,k,a,m] = rc[-1]
                    else:
                        recall[t,k,a,m] = 0
                    # numpy is slow without cython optimization for accessing elements
                    # use python array gets significant speed improvement
                    pr = pr.tolist(); q = q.tolist()
                    for i in range(nd-1, 0, -1):
                        if pr[i] > pr[i-1]:
                            pr[i-1] = pr[i]
                    inds = np.searchsorted(rc, p.recThrs, side='left')
                    try:
                        for ri, pi in enumerate(inds):
                            q[ri] = pr[pi]
                            ss[ri] = dtScoresSorted[pi]
                    except:
                        pass
                    precision[t,:,k,a,m] = np.array(q)
                    scores[t,:,k,a,m] = np.array(ss)
    self.eval = {
        'params': p,
        'counts': [T, R, K, A, M],
        'date': datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'precision': precision,
        'recall': recall,
        'scores': scores,
    }
    toc = time.time()
    print('DONE (t={:0.2f}s).'.format( toc-tic))
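For reference, the code above fills self.eval['precision'] with shape [T, R, K, A, M] (IoU thresholds, recall thresholds, categories, area ranges, max detections), self.eval['recall'] with shape [T, K, A, M], and scores with the sorted detection confidences (dtScoresSorted) at each recall threshold, so it is not an F1 score. Here is a small sketch of printing precision/recall and plotting a PR curve after accumulate() has run; the index choices are illustrative, not part of pycocotools:

import numpy as np
import matplotlib.pyplot as plt

# Assumes coco_eval is a COCOeval instance on which evaluate() and accumulate() were already run.
p = coco_eval.params
precision = coco_eval.eval['precision']   # shape [T, R, K, A, M]
recall = coco_eval.eval['recall']         # shape [T, K, A, M]

t, k, a, m = 0, 0, 0, -1   # first IoU threshold, first category, first area range, largest maxDets

pr_curve = precision[t, :, k, a, m]       # precision at each recall threshold in p.recThrs
valid = pr_curve > -1                     # -1 marks entries for absent categories
print('AP (mean precision over recall thresholds):', pr_curve[valid].mean())
print('recall:', recall[t, k, a, m])

plt.plot(np.asarray(p.recThrs)[valid], pr_curve[valid])
plt.xlabel('recall')
plt.ylabel('precision')
plt.show()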

Using WeightedRandomSampler in PyTorch

I need to implement a multi-label image classification model in PyTorch. However, my data is not balanced, so I used the WeightedRandomSampler in PyTorch to create a custom dataloader. But when I iterate through the custom dataloader, I get the error: IndexError: list index out of range
I implemented the following code using this link: https://discuss.pytorch.org/t/balanced-sampling-between-classes-with-torchvision-dataloader/2703/3?u=surajsubramanian
def make_weights_for_balanced_classes(images, nclasses):
    count = [0] * nclasses
    for item in images:
        count[item[1]] += 1
    weight_per_class = [0.] * nclasses
    N = float(sum(count))
    for i in range(nclasses):
        weight_per_class[i] = N / float(count[i])
    weight = [0] * len(images)
    for idx, val in enumerate(images):
        weight[idx] = weight_per_class[val[1]]
    return weight
weights = make_weights_for_balanced_classes(train_dataset.imgs, len(full_dataset.classes))
weights = torch.DoubleTensor(weights)
sampler = WeightedRandomSampler(weights, len(weights))
train_loader = DataLoader(train_dataset, batch_size=4,sampler = sampler, pin_memory=True)
Based on the answer in https://stackoverflow.com/a/60813495/10077354, the following is my updated code. But even then, when I create a dataloader: loader = DataLoader(full_dataset, batch_size=4, sampler=sampler), len(loader) returns 1.
class_counts = [1691, 743, 2278, 1271]
num_samples = np.sum(class_counts)
labels = [tag for _,tag in full_dataset.imgs]
class_weights = [num_samples/class_counts[i] for i in range(len(class_counts))]
weights = [class_weights[labels[i]] for i in range(num_samples)]
sampler = WeightedRandomSampler(torch.DoubleTensor(weights), num_samples)
Thanks a lot in advance !
I included a utility function based on the accepted answer below:
def sampler_(dataset):
    dataset_counts = imageCount(dataset)
    num_samples = sum(dataset_counts)
    labels = [tag for _, tag in dataset]
    class_weights = [num_samples / dataset_counts[i] for i in range(n_classes)]
    weights = [class_weights[labels[i]] for i in range(num_samples)]
    sampler = WeightedRandomSampler(torch.DoubleTensor(weights), int(num_samples))
    return sampler
The imageCount function finds the number of images of each class in the dataset. Each row in the dataset contains the image and the class, so we take the second element of the tuple into consideration.
def imageCount(dataset):
    image_count = [0] * (n_classes)
    for img in dataset:
        image_count[img[1]] += 1
    return image_count
That code looks a bit complex... You can try the following:
#Let there be 9 samples and 1 sample in class 0 and 1 respectively
class_counts = [9.0, 1.0]
num_samples = sum(class_counts)
labels = [0, 0,..., 0, 1] #corresponding labels of samples
class_weights = [num_samples/class_counts[i] for i in range(len(class_counts))]
weights = [class_weights[labels[i]] for i in range(int(num_samples))]
sampler = WeightedRandomSampler(torch.DoubleTensor(weights), int(num_samples))
Here is an alternative solution:
import numpy as np
from torch.utils.data.sampler import WeightedRandomSampler
counts = np.bincount(y)
labels_weights = 1. / counts
weights = labels_weights[y]
WeightedRandomSampler(weights, len(weights))
where y is a list of labels corresponding to each sample, with shape (n_samples,), encoded as integers in [0, ..., n_classes - 1].
weights won't add up to 1, which is ok according to the official docs.
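Putting the alternative together, here is a small self-contained sketch (the toy tensors and class counts are made up, not taken from the question's dataset) of the bincount-based weights feeding a DataLoader:

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy imbalanced dataset: 90 samples of class 0 and 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
x = torch.randn(len(y), 3)

counts = np.bincount(y)            # [90, 10]
labels_weights = 1.0 / counts      # per-class weight
weights = labels_weights[y]        # per-sample weight, shape (n_samples,)

sampler = WeightedRandomSampler(torch.DoubleTensor(weights), len(weights))
loader = DataLoader(TensorDataset(x, torch.from_numpy(y)), batch_size=16, sampler=sampler)

# Batches drawn through the sampler should now contain roughly equal numbers of each class.
_, yb = next(iter(loader))
print(yb.bincount(minlength=2))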

Oversampling or SMOTE in Pyspark

I have 7 classes and the total number of records is 115, and I wanted to run a Random Forest model over this data. But the data is not enough to get high accuracy, so I wanted to apply oversampling over all the classes in such a way that the majority class itself gets a higher count and the minority classes are scaled up accordingly. Is this possible in PySpark?
+---------+-----+
| SubTribe|count|
+---------+-----+
| Chill| 10|
| Cool| 18|
|Adventure| 18|
| Quirk| 13|
| Mystery| 25|
| Party| 18|
|Glamorous| 13|
+---------+-----+
Here is another implementation of PySpark and Scala SMOTE that I have used in the past. I have copied the code across and referenced the source because it's quite small:
Pyspark:
import random
import numpy as np
from pyspark.sql import Row
from sklearn import neighbors
from pyspark.ml.feature import VectorAssembler
def vectorizerFunction(dataInput, TargetFieldName):
    if dataInput.select(TargetFieldName).distinct().count() != 2:
        raise ValueError("Target field must have only 2 distinct classes")
    columnNames = list(dataInput.columns)
    columnNames.remove(TargetFieldName)
    dataInput = dataInput.select((','.join(columnNames) + ',' + TargetFieldName).split(','))
    assembler = VectorAssembler(inputCols=columnNames, outputCol='features')
    pos_vectorized = assembler.transform(dataInput)
    vectorized = pos_vectorized.select('features', TargetFieldName).withColumn('label', pos_vectorized[TargetFieldName]).drop(TargetFieldName)
    return vectorized

def SmoteSampling(vectorized, k=5, minorityClass=1, majorityClass=0, percentageOver=200, percentageUnder=100):
    if percentageUnder > 100 or percentageUnder < 10:   # logical 'or' instead of bitwise '|'
        raise ValueError("Percentage Under must be in range 10 - 100")
    if percentageOver < 100:
        raise ValueError("Percentage Over must be at least 100")
    dataInput_min = vectorized[vectorized['label'] == minorityClass]
    dataInput_maj = vectorized[vectorized['label'] == majorityClass]
    feature = dataInput_min.select('features')
    feature = feature.rdd
    feature = feature.map(lambda x: x[0])
    feature = feature.collect()
    feature = np.asarray(feature)
    nbrs = neighbors.NearestNeighbors(n_neighbors=k, algorithm='auto').fit(feature)
    neighbours = nbrs.kneighbors(feature)
    gap = neighbours[0]
    neighbours = neighbours[1]
    min_rdd = dataInput_min.drop('label').rdd
    pos_rddArray = min_rdd.map(lambda x: list(x))
    pos_ListArray = pos_rddArray.collect()
    min_Array = list(pos_ListArray)
    newRows = []
    nt = len(min_Array)
    nexs = int(percentageOver / 100)   # integer number of synthetic examples per minority sample
    for i in range(nt):
        for j in range(nexs):
            neigh = random.randint(1, k)
            difs = min_Array[neigh][0] - min_Array[i][0]
            newRec = (min_Array[i][0] + random.random() * difs)
            newRows.insert(0, (newRec))
    newData_rdd = sc.parallelize(newRows)
    newData_rdd_new = newData_rdd.map(lambda x: Row(features=x, label=1))
    new_data = newData_rdd_new.toDF()
    new_data_minor = dataInput_min.unionAll(new_data)
    new_data_major = dataInput_maj.sample(False, (float(percentageUnder) / float(100)))
    return new_data_major.unionAll(new_data_minor)
dataInput = spark.read.format('csv').options(header='true',inferSchema='true').load("sam.csv").dropna()
SmoteSampling(vectorizerFunction(dataInput, 'Y'), k = 2, minorityClass = 1, majorityClass = 0, percentageOver = 90, percentageUnder = 5)
Scala:
// Import the necessary packages
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.expressions.Window
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.rand
import org.apache.spark.sql.functions._
object smoteClass {
  def KNNCalculation(
      dataFinal: org.apache.spark.sql.DataFrame,
      feature: String,
      reqrows: Int,
      BucketLength: Int,
      NumHashTables: Int): org.apache.spark.sql.DataFrame = {
    val b1 = dataFinal.withColumn("index", row_number().over(Window.partitionBy("label").orderBy("label")))
    val brp = new BucketedRandomProjectionLSH().setBucketLength(BucketLength).setNumHashTables(NumHashTables).setInputCol(feature).setOutputCol("values")
    val model = brp.fit(b1)
    val transformedA = model.transform(b1)
    val transformedB = model.transform(b1)
    val b2 = model.approxSimilarityJoin(transformedA, transformedB, 2000000000.0)
    require(b2.count > reqrows, println("Change bucket lenght or reduce the percentageOver"))
    val b3 = b2.selectExpr("datasetA.index as id1",
      "datasetA.feature as k1",
      "datasetB.index as id2",
      "datasetB.feature as k2",
      "distCol").filter("distCol>0.0").orderBy("id1", "distCol").dropDuplicates().limit(reqrows)
    return b3
  }

  def smoteCalc(key1: org.apache.spark.ml.linalg.Vector, key2: org.apache.spark.ml.linalg.Vector) = {
    val resArray = Array(key1, key2)
    val res = key1.toArray.zip(key2.toArray.zip(key1.toArray).map(x => x._1 - x._2).map(_ * 0.2)).map(x => x._1 + x._2)
    resArray :+ org.apache.spark.ml.linalg.Vectors.dense(res)
  }

  def Smote(
      inputFrame: org.apache.spark.sql.DataFrame,
      feature: String,
      label: String,
      percentOver: Int,
      BucketLength: Int,
      NumHashTables: Int): org.apache.spark.sql.DataFrame = {
    val groupedData = inputFrame.groupBy(label).count
    require(groupedData.count == 2, println("Only 2 labels allowed"))
    val classAll = groupedData.collect()
    val minorityclass = if (classAll(0)(1).toString.toInt > classAll(1)(1).toString.toInt) classAll(1)(0).toString else classAll(0)(0).toString
    val frame = inputFrame.select(feature, label).where(label + " == " + minorityclass)
    val rowCount = frame.count
    val reqrows = (rowCount * (percentOver / 100)).toInt
    val md = udf(smoteCalc _)
    val b1 = KNNCalculation(frame, feature, reqrows, BucketLength, NumHashTables)
    val b2 = b1.withColumn("ndtata", md($"k1", $"k2")).select("ndtata")
    val b3 = b2.withColumn("AllFeatures", explode($"ndtata")).select("AllFeatures").dropDuplicates
    val b4 = b3.withColumn(label, lit(minorityclass).cast(frame.schema(1).dataType))
    return inputFrame.union(b4).dropDuplicates
  }
}
Source
Maybe this project can be useful for your goal:
Spark SMOTE
But I think that 115 records aren't enough for a random forest. You can use a simpler technique like decision trees.
You can check this answer:
Is Random Forest suitable for very small data sets?
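If SMOTE is more than you need, the original seven-class question can also be handled with plain random oversampling with replacement up to the majority-class count. A hedged PySpark sketch (the column name SubTribe comes from the question; the helper name is made up), which gives approximately, not exactly, balanced counts because sample is probabilistic:

from functools import reduce
from pyspark.sql import functions as F

def random_oversample(df, label_col="SubTribe", seed=42):
    # Oversample every class (with replacement) up to roughly the size of the largest class.
    counts = {row[label_col]: row["count"] for row in df.groupBy(label_col).count().collect()}
    max_count = max(counts.values())
    parts = []
    for cls, cnt in counts.items():
        fraction = max_count / float(cnt)   # >= 1.0, allowed when sampling with replacement
        parts.append(df.filter(F.col(label_col) == cls)
                       .sample(withReplacement=True, fraction=fraction, seed=seed))
    return reduce(lambda a, b: a.union(b), parts)

# balanced = random_oversample(df)   # df is the DataFrame behind the counts shown in the question
# balanced.groupBy("SubTribe").count().show()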

keras change the parameters during training

I have a customized layer that does a simple linear transformation, like x*w+b. I want to change w and b during training; is that possible? For example, I want w1 in the first iteration and w2 in the second iteration (w1 and w2 defined by myself).
Of course, you can do it, but you need to do it in a smart way. Here is some code you can play with.
from keras import backend as K
from keras.layers import *
from keras.models import *
import numpy as np

class MyDense(Layer):
    def __init__(self, units=64, use_bias=True, **kwargs):
        super(MyDense, self).__init__(**kwargs)
        self.units = units
        self.use_bias = use_bias
        return

    def build(self, input_shape):
        input_dim = input_shape[-1]
        self.count = 0
        self.w1 = self.add_weight(shape=(input_dim, self.units), initializer='glorot_uniform', name='w1')
        self.w0 = self.add_weight(shape=(input_dim, self.units), initializer='glorot_uniform', name='w0')
        if self.use_bias:
            self.bias = self.add_weight(shape=(self.units,), initializer='glorot_uniform', name='bias')
        else:
            self.bias = None
        self.input_spec = InputSpec(min_ndim=2, axes={-1: input_dim})
        self.built = True
        return

    def call(self, x):
        if self.count % 2 == 1:
            c0, c1 = 0, 1
        else:
            c0, c1 = 1, 0
        w = c0 * self.w0 + c1 * self.w1
        self.count += 1
        output = K.dot(x, w)
        if self.use_bias:
            output = K.bias_add(output, self.bias, data_format='channels_last')
        return output

    def compute_output_shape(self, input_shape):
        assert input_shape and len(input_shape) >= 2
        assert input_shape[-1]
        output_shape = list(input_shape)
        output_shape[-1] = self.units
        return tuple(output_shape)

# define a dummy model
x = Input(shape=(128,))
y = MyDense(10)(x)
y = Dense(1, activation='sigmoid')(y)
model = Model(inputs=x, outputs=y)
print(model.summary())

# get some dummy data
a = np.random.randn(100, 128)
b = (np.random.randn(100,) > 0).astype('int32')

# compile and train
model.compile('adam', 'binary_crossentropy')
model.fit(a, b)
Note: the following code is equivalent to what we did above, but it will NOT work !!!
if self.count % 2 == 1:
    w = self.w0
else:
    w = self.w1
Why? Because having zero gradients for one variable (the former implementation) is NOT equivalent to having None gradients (the latter implementation).
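To see that distinction concretely, here is a small TensorFlow 2 check (not part of the original answer, which uses the older Keras backend) comparing a variable that enters the graph with a zero coefficient against one that is left out entirely:

import tensorflow as tf

w0 = tf.Variable(2.0)
w1 = tf.Variable(3.0)
x = tf.constant(5.0)

with tf.GradientTape() as tape:
    y = (0.0 * w0 + 1.0 * w1) * x        # w0 still participates, just with coefficient 0
print(tape.gradient(y, [w0, w1]))        # [0.0, 5.0] -- a real zero gradient for w0

with tf.GradientTape() as tape:
    y = w1 * x                           # w0 does not appear in the computation at all
print(tape.gradient(y, [w0, w1]))        # [None, 5.0] -- no gradient for w0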
