Oversampling or SMOTE in Pyspark - machine-learning

I have 7 classes and 115 records in total, and I want to run a Random Forest model on this data. However, there is not enough data to get high accuracy, so I want to apply oversampling to all the classes in such a way that the majority class gets a higher count and the minority classes are increased accordingly. Is this possible in PySpark?
+---------+-----+
| SubTribe|count|
+---------+-----+
|    Chill|   10|
|     Cool|   18|
|Adventure|   18|
|    Quirk|   13|
|  Mystery|   25|
|    Party|   18|
|Glamorous|   13|
+---------+-----+

Here is another implementation of SMOTE in PySpark and Scala that I have used in the past. I have copied the code across and referenced the source because it's quite small:
Pyspark:
import random
import numpy as np
from pyspark.sql import Row
from sklearn import neighbors
from pyspark.ml.feature import VectorAssembler

def vectorizerFunction(dataInput, TargetFieldName):
    # Assemble every non-target column into a single 'features' vector column.
    if dataInput.select(TargetFieldName).distinct().count() != 2:
        raise ValueError("Target field must have only 2 distinct classes")
    columnNames = list(dataInput.columns)
    columnNames.remove(TargetFieldName)
    dataInput = dataInput.select(columnNames + [TargetFieldName])
    assembler = VectorAssembler(inputCols=columnNames, outputCol='features')
    pos_vectorized = assembler.transform(dataInput)
    vectorized = pos_vectorized.select('features', TargetFieldName).withColumn('label', pos_vectorized[TargetFieldName]).drop(TargetFieldName)
    return vectorized

def SmoteSampling(vectorized, k=5, minorityClass=1, majorityClass=0, percentageOver=200, percentageUnder=100):
    # 'or' is required here: '|' binds tighter than the comparisons.
    if percentageUnder > 100 or percentageUnder < 10:
        raise ValueError("Percentage Under must be in range 10 - 100")
    if percentageOver < 100:
        raise ValueError("Percentage Over must be at least 100")
    dataInput_min = vectorized[vectorized['label'] == minorityClass]
    dataInput_maj = vectorized[vectorized['label'] == majorityClass]
    # Collect the minority-class feature vectors to the driver as a NumPy array.
    feature = dataInput_min.select('features').rdd.map(lambda x: x[0].toArray()).collect()
    feature = np.asarray(feature)
    # Ask for k + 1 neighbours: each point is normally returned as its own nearest neighbour.
    nbrs = neighbors.NearestNeighbors(n_neighbors=k + 1, algorithm='auto').fit(feature)
    neighbours = nbrs.kneighbors(feature)[1]
    min_Array = dataInput_min.drop('label').rdd.map(lambda x: list(x)).collect()
    newRows = []
    nt = len(min_Array)
    nexs = percentageOver // 100  # integer division so that range() accepts it
    for i in range(nt):
        for j in range(nexs):
            # Pick one of the k neighbours of point i (column 0 is the point itself).
            neigh = neighbours[i][random.randint(1, k)]
            difs = min_Array[neigh][0] - min_Array[i][0]
            newRec = min_Array[i][0] + random.random() * difs
            newRows.insert(0, newRec)
    # Assumes an active SparkSession named 'spark'.
    newData_rdd = spark.sparkContext.parallelize(newRows)
    newData_rdd_new = newData_rdd.map(lambda x: Row(features=x, label=minorityClass))
    new_data = newData_rdd_new.toDF()
    new_data_minor = dataInput_min.union(new_data)
    new_data_major = dataInput_maj.sample(False, float(percentageUnder) / 100.0)
    return new_data_major.union(new_data_minor)

dataInput = spark.read.format('csv').options(header='true', inferSchema='true').load("sam.csv").dropna()
# percentageOver must be at least 100 and percentageUnder within 10 - 100, otherwise the checks above raise.
SmoteSampling(vectorizerFunction(dataInput, 'Y'), k=2, minorityClass=1, majorityClass=0, percentageOver=200, percentageUnder=100)
Scala:
// Import the necessary packages
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.expressions.Window
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.rand
import org.apache.spark.sql.functions._

object smoteClass {
  // Find approximate nearest-neighbour pairs within the minority class using LSH.
  def KNNCalculation(
      dataFinal: org.apache.spark.sql.DataFrame,
      feature: String,
      reqrows: Int,
      BucketLength: Int,
      NumHashTables: Int): org.apache.spark.sql.DataFrame = {
    val b1 = dataFinal.withColumn("index", row_number().over(Window.partitionBy("label").orderBy("label")))
    val brp = new BucketedRandomProjectionLSH().setBucketLength(BucketLength).setNumHashTables(NumHashTables).setInputCol(feature).setOutputCol("values")
    val model = brp.fit(b1)
    val transformedA = model.transform(b1)
    val transformedB = model.transform(b1)
    val b2 = model.approxSimilarityJoin(transformedA, transformedB, 2000000000.0)
    require(b2.count > reqrows, "Change bucket length or reduce percentOver")
    // Use the column named by 'feature' rather than a hard-coded column name.
    val b3 = b2.selectExpr("datasetA.index as id1",
      s"datasetA.$feature as k1",
      "datasetB.index as id2",
      s"datasetB.$feature as k2",
      "distCol").filter("distCol>0.0").orderBy("id1", "distCol").dropDuplicates().limit(reqrows)
    return b3
  }

  // Return both original vectors plus one synthetic point 20% of the way from key1 to key2.
  def smoteCalc(key1: org.apache.spark.ml.linalg.Vector, key2: org.apache.spark.ml.linalg.Vector) = {
    val resArray = Array(key1, key2)
    val res = key1.toArray.zip(key2.toArray.zip(key1.toArray).map(x => x._1 - x._2).map(_ * 0.2)).map(x => x._1 + x._2)
    resArray :+ org.apache.spark.ml.linalg.Vectors.dense(res)
  }

  def Smote(
      inputFrame: org.apache.spark.sql.DataFrame,
      feature: String,
      label: String,
      percentOver: Int,
      BucketLength: Int,
      NumHashTables: Int): org.apache.spark.sql.DataFrame = {
    val groupedData = inputFrame.groupBy(label).count
    require(groupedData.count == 2, "Only 2 labels allowed")
    val classAll = groupedData.collect()
    val minorityclass = if (classAll(0)(1).toString.toInt > classAll(1)(1).toString.toInt) classAll(1)(0).toString else classAll(0)(0).toString
    val frame = inputFrame.select(feature, label).where(label + " == " + minorityclass)
    val rowCount = frame.count
    // Multiply before dividing so that integer division does not truncate percentOver / 100.
    val reqrows = (rowCount * percentOver / 100).toInt
    val md = udf(smoteCalc _)
    val b1 = KNNCalculation(frame, feature, reqrows, BucketLength, NumHashTables)
    // col(...) comes from functions._ and avoids depending on spark.implicits._
    val b2 = b1.withColumn("ndtata", md(col("k1"), col("k2"))).select("ndtata")
    val b3 = b2.withColumn("AllFeatures", explode(col("ndtata"))).select("AllFeatures").dropDuplicates
    val b4 = b3.withColumn(label, lit(minorityclass).cast(frame.schema(1).dataType))
    return inputFrame.union(b4).dropDuplicates
  }
}
Source

Maybe this project can be useful for your goal:
Spark SMOTE
But I think that 115 records aren't enough for a random forest. You could use a simpler technique, such as a decision tree.
You can check this answer:
Is Random Forest suitable for very small data sets?
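With only 115 rows, the interpolation step that SMOTE relies on can be tried on the driver with NumPy before involving Spark at all. The following is a minimal sketch under stated assumptions, not the code from either implementation above: the function name smote_oversample, its factor parameter, and the random placeholder features are all hypothetical. It grows every class by the same factor, so the majority class stays the largest, which is what the question asks for.

```python
import numpy as np

def smote_oversample(X, y, factor=3, k=3, seed=0):
    """Grow every class to `factor` times its size with SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)
    X_out, y_out = [X], [y]
    for label in np.unique(y):
        pts = X[y == label]
        n_new = (factor - 1) * len(pts)
        # Pairwise Euclidean distances between points of the same class.
        dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        # Column 0 of the argsort is normally the point itself, so skip it.
        kk = min(k, len(pts) - 1)
        nn = np.argsort(dists, axis=1)[:, 1:kk + 1]
        # Pick a base point and one of its neighbours for each synthetic row.
        base = rng.integers(0, len(pts), size=n_new)
        neigh = nn[base, rng.integers(0, kk, size=n_new)]
        # Place the synthetic point at a random position on the connecting segment.
        gap = rng.random((n_new, 1))
        synthetic = pts[base] + gap * (pts[neigh] - pts[base])
        X_out.append(synthetic)
        y_out.append(np.full(n_new, label))
    return np.vstack(X_out), np.concatenate(y_out)

# Class counts from the question; the feature values are random placeholders.
counts = {"Chill": 10, "Cool": 18, "Adventure": 18, "Quirk": 13,
          "Mystery": 25, "Party": 18, "Glamorous": 13}
rng = np.random.default_rng(42)
X = rng.normal(size=(sum(counts.values()), 4))
y = np.repeat(list(counts.keys()), list(counts.values()))

X_res, y_res = smote_oversample(X, y, factor=3)
labels, new_counts = np.unique(y_res, return_counts=True)
print(dict(zip(labels.tolist(), new_counts.tolist())))
```

Synthetic rows are interpolations of existing ones, so they add no new information; any accuracy estimate should be computed on held-out original rows only.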

Related

I have been training a decoder based transformer for word generation. But it keeps generating the same words over and over again

I have been trying to create a decoder-based transformer for text generation, and the text it generates is the same no matter what the input sequence is.
The following is my code; some of the preprocessing code was removed.
def process_batch(ds):
    ds = tokenizer(ds)
    ## pad short sentences to max len using the [PAD] id
    ## add special tokens [START] and [END]
    ds_start_end_packer = StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH + 1,
        start_value = tokenizer.token_to_id("[START]"),
        end_value = tokenizer.token_to_id("[END]"),
        pad_value = tokenizer.token_to_id("[PAD]")
    )
    ds = ds_start_end_packer(ds)
    return ({"decoder_inputs":ds[:, :-1]}, ds[:, 1:])
def make_ds(seq):
    dataset = tf.data.Dataset.from_tensor_slices(seq)
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.map(process_batch, num_parallel_calls=tf.data.AUTOTUNE)
    return dataset.shuffle(128).prefetch(32).cache()
train_ds = make_ds(train_seq)
val_ds = make_ds(val_seq)
This is the decoder section; I was using keras_nlp.
It has 2 decoder layers.
decoder_inputs = Input(shape=(None,), dtype="int64",
                       name="decoder_inputs")
x = TokenAndPositionEmbedding(
    vocabulary_size= VOCAB_SIZE,
    sequence_length = MAX_SEQUENCE_LENGTH,
    embedding_dim = EMBED_DIM,
    mask_zero =True
)(decoder_inputs)
x = TransformerDecoder(
    intermediate_dim = INTERMEDIATE_DIM, num_heads= NUM_HEADS
)(x)
x = TransformerDecoder(
    intermediate_dim = INTERMEDIATE_DIM, num_heads= NUM_HEADS
)(x)
x = Dropout(0.5)(x)
decoder_ouput = Dense(VOCAB_SIZE, activation="softmax")(x)
decoder = Model([decoder_inputs],decoder_ouput)
decoder_outputs = decoder([decoder_inputs])
transformer = Model(inputs=decoder_inputs, outputs=decoder_outputs, name="transformer")
#transformer.load_weights("/content/my-drive/MyDrive/projects/Olsen/weights-improvement-07-0.41.hdf5")
transformer.compile("adam",loss="sparse_categorical_crossentropy", metrics=['accuracy'])
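The return statement of process_batch builds the usual next-token training pair: the model sees the sequence without its last position and is asked to predict the same sequence shifted left by one. The sketch below reproduces only that shift with NumPy. The token ids and the pack helper are hypothetical stand-ins for the tokenizer and StartEndPacker, whose exact truncation behaviour may differ.

```python
import numpy as np

# Hypothetical token ids; the real ones come from tokenizer.token_to_id(...).
PAD, START, END = 0, 1, 2
MAX_SEQUENCE_LENGTH = 6

def pack(tokens, length):
    """Add START/END, then truncate or pad with PAD to exactly `length` ids."""
    seq = [START] + list(tokens) + [END]
    seq = seq[:length]
    return seq + [PAD] * (length - len(seq))

# Two short "sentences", packed to MAX_SEQUENCE_LENGTH + 1 ids as in process_batch.
batch = np.array([pack(t, MAX_SEQUENCE_LENGTH + 1) for t in ([5, 6, 7], [8, 9])])

decoder_inputs = batch[:, :-1]  # everything except the last position
targets = batch[:, 1:]          # the same sequence shifted left by one

print(decoder_inputs)
print(targets)
```

Printing a few real batches this way is a quick check that the targets are not dominated by [PAD] ids, which is one situation in which a model can end up predicting the same token everywhere.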

setting up hyper parameters for LSTM

This is our code. Can you please tell us whether an LSTM can be used here, and how we can check whether the predictions are accurate? The code predicts the values already in the CSV correctly, but we are unsure about the forecasting part: it forecasts the future, but unreliably.
This is our data; it has missing dates as well. The data ends on 1-Dec-2021.
import pandas as pd
import flask
import numpy as np
import keras
import matplotlib.pyplot as plt
import tensorflow as tf
import plotly.graph_objects as go
from keras.preprocessing.sequence import TimeseriesGenerator
filename = "china cotton import concatinated.csv"
df = pd.read_csv(filename)
print(df.info())
df['date'] = pd.to_datetime(df['date'])
#df.set_index(df['date'], inplace=True,)
df.set_axis(df['date'], inplace=True)
df.drop(columns=['CottonChina importFC Index MUS Cents/Lb', 'CottonChina importFC Index LUS Cents/Lb', 'CottonChinadomestic3128BUSCents/Lb', 'CottonChina domestic2227BUS Cents/Lb','CottonChina domestic2129BUS Cents/Lb','CottonChina importUSD1 year = 100','CottonChina domesticUSD1 year = 100'], inplace=True)
close_data = df['CottonChina importFC Index SUS Cents/Lb'].values
close_data = close_data.reshape((-1,1))
split_percent = 0.80
split = int(split_percent*len(close_data))
close_train = close_data[:split]
close_test = close_data[split:]
date_train = df['date'][:split]
date_test = df['date'][split:]
print(len(close_train))
print(len(close_test))
look_back = 15
train_generator = TimeseriesGenerator(close_train, close_train, length=look_back, batch_size=20)
test_generator = TimeseriesGenerator(close_test, close_test, length=look_back, batch_size=1)
from keras.models import Sequential
from keras.layers import LSTM, Dense
model = Sequential()
model.add(
    LSTM(10,
        activation='relu', return_sequences=True,
        input_shape=(look_back,1))
)
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
num_epochs = 25
prediction = model.predict_generator(test_generator)
close_train = close_train.reshape((-1))
close_test = close_test.reshape((-1))
prediction = prediction.reshape((-1))
"""trace1 = go.Scatter(
    x = date_train,
    y = close_train,
    mode = 'lines',
    name = 'Data'
)
trace2 = go.Scatter(
    x = date_test,
    y = prediction,
    mode = 'lines',
    name = 'Prediction'
)
trace3 = go.Scatter(
    x = date_test,
    y = close_test,
    mode='lines',
    name = 'Ground Truth'
)
layout = go.Layout(
    title = "Google Stock",
    xaxis = {'title' : "Date"},
    yaxis = {'title' : "Close"}
) """
"""fig = go.Figure(data=[trace1, trace2, trace3], layout=layout)
fig.show()"""
close_data = close_data.reshape((-1))
def predict(num_prediction, model):
    prediction_list = close_data[-look_back:]
    for _ in range(num_prediction):
        x = prediction_list[-look_back:]
        x = x.reshape((1, look_back, 1))
        out = model.predict(x)[0][0]
        prediction_list = np.append(prediction_list, out)
    prediction_list = prediction_list[look_back-1:]
    return prediction_list
def predict_dates(num_prediction):
    last_date = df['date'].values[-1]
    prediction_dates = pd.date_range(last_date, periods=num_prediction+1).tolist()
    return prediction_dates
num_prediction = 30
forecast = predict(num_prediction, model)
forecast_dates = predict_dates(num_prediction)
trace1 = go.Scatter(
    x = date_train,
    y = close_train,
    mode = 'lines',
    name = 'Data'
)
trace2 = go.Scatter(
    x = forecast_dates,
    y = forecast,
    mode = 'lines',
    name = 'Prediction'
)
trace3 = go.Scatter(
    x = date_test,
    y = close_test,
    mode='lines',
    name = 'Ground Truth')
layout = go.Layout(
    title = "Future Prediction",
    xaxis = {'title' : "Date"},
    yaxis = {'title' : "Close"}
)
fig = go.Figure(data=[trace1, trace2,trace3], layout=layout)
fig.write_html('first_figure.html',auto_open=True)
This is the graph plotted after running the code. It shows negative prices, and the predicted prices are small compared to the test and training data.
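The predict function above forecasts recursively: each predicted value is appended to the window and becomes an input for the next step, so any error in one step is carried into all later ones. The loop can be isolated from Keras as in the sketch below. The names recursive_forecast and naive_mean, as well as the sample prices, are hypothetical, and the window mean is only a placeholder for a trained one-step model.

```python
import numpy as np

def recursive_forecast(history, one_step, look_back, steps):
    """Forecast `steps` values by feeding each prediction back into the window."""
    window = list(history[-look_back:])
    forecast = []
    for _ in range(steps):
        # Predict the next value from the most recent `look_back` values.
        nxt = one_step(np.asarray(window[-look_back:]))
        forecast.append(nxt)
        window.append(nxt)  # the prediction becomes an input for the next step
    return np.asarray(forecast)

def naive_mean(window):
    """Placeholder one-step predictor; a trained model would go here."""
    return float(window.mean())

series = np.array([80.0, 82.0, 81.0, 83.0, 85.0, 84.0])  # made-up prices
future = recursive_forecast(series, naive_mean, look_back=3, steps=4)
print(future)
```

Comparing a model against a simple baseline like this on the held-out test range, rather than on the training range, gives a more honest picture of forecasting quality.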

Perceptron algorithm is not working as I desired

I recently tried implementing the perceptron algorithm, but I was not getting the desired output.
Here is the code:
import numpy as np
import pandas as pd
with open("D:/data.txt",'r') as data: #importing the data
    column = data.read()
split = np.array(column.split('\n'))
final =[]
for string in split:
    final.append(string.split(','))
df = pd.DataFrame(final,columns=['x','y','response'])
df['x'] = df['x'].astype(float)
df['y'] = df['y'].astype(float)
df['response'] = df['response'].astype(int)
X = np.array(df[['x','y']])
y = np.array(df['response'])
def perceptron_algorithm(x,y,learning_rate=0.01,num_epoch=25):
    np.random.seed(2)
    x_min, x_max = min(x.T[0]), max(x.T[0])
    y_min, y_max = min(x.T[1]), max(x.T[0])
    w = np.array(np.random.rand(2,1))
    b = np.random.rand(1)[0] + x_max
    print(w,b)
    for i in range(num_epoch):
        w,b = perceptronstep(x,y,w,b,learning_rate)
        print(w,b)
    return w,b
def perceptronstep(x,y,w,b,learning_rate):
    for i in range(len(x)):
        y_hat = prediction(x[i],w,b)
        if y_hat-y[i] == 1:
            for j in range(len(w)):
                w[j] += x[i][j]*learning_rate
            b += learning_rate
        elif y_hat-y[i] == -1:
            for j in range(len(w)):
                w[j] -= x[i][j]*learning_rate
            b -= learning_rate
    return w,b
def prediction(x,w,b):
    return step(np.matmul(x,w)+b)
def step(t):
    if t >=0:
        return 1
    else:
        return 0
w,b = perceptron_algorithm(X,y)
This is the resulting line:
This is how the data looks:
Is there something wrong with my code ?
Here is the link to the data file:
https://drive.google.com/drive/folders/1TSug9tE6bljyBFv-u3mIGWW6F_3ZY2oa?usp=sharing
Edit: I have added the initial part of the code so it will be clear what I am trying to do.
Edit 2: I have added the data file and the "import pandas as pd" line of code
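For comparison, the textbook perceptron rule moves the boundary toward a point that was wrongly predicted as 0 and away from a point that was wrongly predicted as 1, that is, w += learning_rate * (y - y_hat) * x. In perceptronstep above the signs appear to be reversed: when y_hat - y[i] == 1 the weights are increased, which raises the score of a point that should be classified as 0. The sketch below uses the textbook rule. The function name train_perceptron and the toy data are hypothetical, since the linked data file is not reproduced here.

```python
import numpy as np

def train_perceptron(X, y, learning_rate=0.01, num_epoch=25, seed=2):
    """Textbook perceptron: update only on misclassified points."""
    rng = np.random.default_rng(seed)
    w = rng.random(X.shape[1])
    b = rng.random()
    for _ in range(num_epoch):
        for xi, yi in zip(X, y):
            y_hat = 1 if xi @ w + b >= 0 else 0
            error = yi - y_hat  # +1: predicted 0 but label 1; -1: the reverse
            w += learning_rate * error * xi
            b += learning_rate * error
    return w, b

# Small linearly separable toy set (made up for illustration).
X = np.array([[2.0, 2.0], [1.5, 2.5], [3.0, 1.0],
              [-1.0, -2.0], [-2.0, -1.5], [-1.5, -3.0]])
y = np.array([1, 1, 1, 0, 0, 0])

w, b = train_perceptron(X, y, learning_rate=0.1, num_epoch=50)
preds = (X @ w + b >= 0).astype(int)
print(w, b, preds)
```

Note also that y_max in perceptron_algorithm is computed from x.T[0] rather than x.T[1], which looks like a copy-paste slip, although y_max is not used afterwards.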

keras change the parameters during training

I have a customized layer that does a simple linear transformation, like x*w+b. I want to change w and b during training; is that possible? For example, I want w1 in the first iteration and w2 in the second iteration (w1 and w2 are defined by me).
Of course, you can do it, but you need to do it in a smart way. Here is some code you can play with.
from keras import backend as K
from keras.layers import *
from keras.models import *
import numpy as np
class MyDense( Layer ) :
    def __init__( self, units=64, use_bias=True, **kwargs ) :
        super(MyDense, self).__init__( **kwargs )
        self.units = units
        self.use_bias = use_bias
        return
    def build( self, input_shape ) :
        input_dim = input_shape[-1]
        self.count = 0
        # two alternative weight matrices; call() selects one of them
        self.w1 = self.add_weight(shape=(input_dim, self.units), initializer='glorot_uniform', name='w1')
        self.w0 = self.add_weight(shape=(input_dim, self.units), initializer='glorot_uniform', name='w0')
        if self.use_bias:
            self.bias = self.add_weight(shape=(self.units,),initializer='glorot_uniform',name='bias' )
        else:
            self.bias = None
        self.input_spec = InputSpec(min_ndim=2, axes={-1: input_dim})
        self.built = True
        return
    def call( self, x ) :
        # select a weight through 0/1 coefficients so both stay in the graph
        if self.count % 2 == 1 :
            c0, c1 = 0, 1
        else :
            c0, c1 = 1, 0
        w = c0 * self.w0 + c1 * self.w1
        self.count += 1
        output = K.dot( x, w )
        if self.use_bias:
            output = K.bias_add(output, self.bias, data_format='channels_last')
        return output
    def compute_output_shape(self, input_shape):
        assert input_shape and len(input_shape) >= 2
        assert input_shape[-1]
        output_shape = list(input_shape)
        output_shape[-1] = self.units
        return tuple(output_shape)
# define a dummy model
x = Input(shape=(128,))
y = MyDense(10)(x)
y = Dense(1, activation='sigmoid')(y)
model = Model(inputs=x, outputs=y)
model.summary()  # summary() prints the table itself
# get some dummy data
a = np.random.randn(100,128)
b = (np.random.randn(100,) > 0).astype('int32')
# compile and train
model.compile('adam', 'binary_crossentropy')
model.fit( a, b )
Note: the following code looks equivalent to what we did above, but it will NOT work!
if self.count % 2 == 1 :
    w = self.w1
else :
    w = self.w0
Why? Because having zero gradients (the former implementation) for one variable is NOT equivalent to having None gradients (the latter implementation).
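The difference can be seen from the chain rule. With w = c0 * w0 + c1 * w1, both weights are part of the computation, and the gradient with respect to the unselected weight is the upstream gradient multiplied by 0: a tensor of zeros, but still a tensor. With the if/else form the unselected weight never enters the computation, so there is nothing to differentiate through. The NumPy sketch below works out only the first case by hand for a toy loss L = sum(x @ w); it is an illustration of the arithmetic, not of Keras internals, and all names in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
w0 = rng.normal(size=(3, 2))
w1 = rng.normal(size=(3, 2))

def grads(c0, c1):
    """Gradients of L = sum(x @ (c0*w0 + c1*w1)) with respect to w0 and w1."""
    # dL/dw for the combined weight w; it does not depend on w for this linear loss.
    upstream = x.T @ np.ones((4, 2))
    # Chain rule: dL/dw0 = c0 * dL/dw and dL/dw1 = c1 * dL/dw.
    return c0 * upstream, c1 * upstream

g0, g1 = grads(c0=1, c1=0)
print(g0)
print(g1)  # every entry is zero, but the array has the full (3, 2) shape
```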

LinearRegressionWithSGD() returns NaN

I am trying to use LinearRegressionWithSGD on the Million Song Data Set, and my model returns NaNs as weights and 0.0 as the intercept. What might be causing this? I am using Spark 1.40 in standalone mode.
Sample data: http://www.filedropper.com/part-00000
Here is my full code:
// Import Dependencies
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
// Define RDD
val data = sc.textFile("/home/naveen/Projects/millionSong/YearPredictionMSD.txt")
// Convert to Labelled Point
def parsePoint (line: String): LabeledPoint = {
  val x = line.split(",")
  val head = x.head.toDouble
  val tail = Vectors.dense(x.tail.map(x => x.toDouble))
  return LabeledPoint(head,tail)
}
// Find Range
val parsedDataInit = data.map(x => parsePoint(x))
val onlyLabels = parsedDataInit.map(x => x.label)
val minYear = onlyLabels.min()
val maxYear = onlyLabels.max()
// Shift Labels
val parsedData = parsedDataInit.map(x => LabeledPoint(x.label-minYear, x.features))
// Training, validation, and test sets
val splits = parsedData.randomSplit(Array(0.8, 0.1, 0.1), seed = 123)
val parsedTrainData = splits(0).cache()
val parsedValData = splits(1).cache()
val parsedTestData = splits(2).cache()
val nTrain = parsedTrainData.count()
val nVal = parsedValData.count()
val nTest = parsedTestData.count()
// RMSE
def squaredError(label: Double, prediction: Double): Double = {
  return scala.math.pow(label - prediction,2)
}
def calcRMSE(labelsAndPreds: RDD[List[Double]]): Double = {
  return scala.math.sqrt(labelsAndPreds.map(x =>
    squaredError(x(0),x(1))).mean())
}
val numIterations = 100
val stepSize = 1.0
val regParam = 0.01
val regType = "L2"
val algorithm = new LinearRegressionWithSGD()
algorithm.optimizer
  .setNumIterations(numIterations)
  .setStepSize(stepSize)
  .setRegParam(regParam)
val model = algorithm.run(parsedTrainData)
I am not familiar with this specific implementation of SGD, but generally, if a gradient descent solver goes to NaN, it means that the learning rate is too big (in this case I think it is the stepSize variable).
Try lowering it by an order of magnitude each time until it starts to converge.
I can think of two possibilities:
1. stepSize is too big. You should try values such as 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, ...
2. Your training data contains NaN. If so, the result will likely be NaN as well.
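Both possibilities can be reproduced outside Spark. The sketch below is not Spark's LinearRegressionWithSGD: it is plain full-batch gradient descent on mean squared error in NumPy, with made-up unscaled features and a hypothetical function name sgd_linear, meant only to show how a step size of 1.0 can overflow into non-finite weights while a smaller one converges, and how to check the input for NaN.

```python
import numpy as np

def sgd_linear(X, y, step_size, num_iterations=100):
    """Full-batch gradient descent on mean squared error (illustration only)."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iterations):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - step_size * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(scale=50.0, size=(200, 3))  # unscaled features with a large range
true_w = np.array([0.5, -1.0, 2.0])
y = X @ true_w

# Possibility 1: the step size is too large for the scale of the features.
with np.errstate(over="ignore", invalid="ignore"):
    w_big = sgd_linear(X, y, step_size=1.0)
w_small = sgd_linear(X, y, step_size=1e-4)
print("finite with step 1.0: ", np.isfinite(w_big).all())
print("finite with step 1e-4:", np.isfinite(w_small).all())

# Possibility 2: NaN values already present in the training data.
print("NaN in input:", bool(np.isnan(X).any() or np.isnan(y).any()))
```

Standardizing the features before training usually widens the range of step sizes that converge, which is another thing worth trying alongside a smaller stepSize.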
