I'm new to F# and am wondering how I would go about flattening a list.
Essentially, in the database I store a record with a min_age and max_age range (this is a fictitious example for the sake of brevity - I am not ageist!). My fields look something like the following:
id,
cost,
savings,
min_age,
max_age
I essentially have an F# class that acts as a one-to-one mapping with this table - i.e. all properties are mapped exactly to the database fields.
What I would like to do is flatten this range. So, instead of a list containing items like this:
saving_id = 1, cost = 100, savings = 20, min_age = 20, max_age = 26
saving_id = 2, cost = 110, savings = 10, min_age = 27, max_age = 31
I would like a list containing items like this:
saving_id = 1, cost = 100, savings = 20, age = 20
saving_id = 1, cost = 100, savings = 20, age = 21
etc.
saving_id = 2, cost = 110, savings = 10, age = 27
saving_id = 2, cost = 110, savings = 10, age = 28
etc.
Is there any built-in mechanism to flatten a list in this manner, and/or does anyone know how to achieve this?
Thanks in advance,
JP
You might want to use Seq.collect. It concatenates sequences together, so in your case, you can map a function over your input that splits a single age range record to a sequence of age records and use Seq.collect to glue them together.
For example:
type myRecord =
    { saving_id: int;
      cost: int;
      savings: int;
      min_age: int;
      max_age: int }

type resultRecord =
    { saving_id: int;
      cost: int;
      savings: int;
      age: int }

let records =
    [ { saving_id = 1; cost = 100; savings = 20; min_age = 20; max_age = 26 }
      { saving_id = 2; cost = 110; savings = 10; min_age = 27; max_age = 31 } ]

let splitRecord (r:myRecord) =
    seq { for ageCounter in r.min_age .. r.max_age ->
            { saving_id = r.saving_id;
              cost = r.cost;
              savings = r.savings;
              age = ageCounter }
        }

let ageRanges = records |> Seq.collect splitRecord
Edit: you can also use a sequence generator with yield!
let thisAlsoWorks =
    seq { for r in records do yield! splitRecord r }
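For readers who think in Python, the same "map each record to a sequence, then concatenate" idea can be sketched as follows. This is only an analogy to Seq.collect, not part of the F# solution; the class names SavingRange and SavingAtAge are made up for the illustration.

```python
from dataclasses import dataclass
from itertools import chain


@dataclass(frozen=True)
class SavingRange:  # hypothetical stand-in for myRecord
    saving_id: int
    cost: int
    savings: int
    min_age: int
    max_age: int


@dataclass(frozen=True)
class SavingAtAge:  # hypothetical stand-in for resultRecord
    saving_id: int
    cost: int
    savings: int
    age: int


def split_record(r):
    # One output row per age in the inclusive range min_age..max_age.
    return (SavingAtAge(r.saving_id, r.cost, r.savings, age)
            for age in range(r.min_age, r.max_age + 1))


records = [SavingRange(1, 100, 20, 20, 26), SavingRange(2, 110, 10, 27, 31)]

# chain.from_iterable plays the role of Seq.collect: map, then concatenate.
age_ranges = list(chain.from_iterable(split_record(r) for r in records))
```

With the two sample records this yields seven rows for saving_id 1 (ages 20 to 26) followed by five rows for saving_id 2 (ages 27 to 31).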
Agreeing with cfern's answer, but was wondering if this might benefit from seeing another "built-in" function used. Here's an alternative version of the splitRecord function that shows the library call for unfolding a sequence. No gain here other than having an example for Seq.unfold.
let splitRecord (r:myRecord) =
    Seq.unfold (fun curr_age ->
                    if curr_age <= r.max_age then
                        Some({ saving_id = r.saving_id;
                               cost = r.cost;
                               savings = r.savings;
                               age = curr_age } ,
                             curr_age + 1)
                    else None)
               r.min_age
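If the shape of Seq.unfold is unfamiliar, here is a rough Python sketch of what an unfold does: it repeatedly applies a step function to a state, emitting a value each time, until the step function signals the end. The helper name unfold and the use of None as the stop signal are assumptions made for this illustration.

```python
def unfold(step, state):
    """Yield values produced by step(state) until step returns None.

    step must return either None (stop) or a (value, next_state) pair,
    mirroring the None / Some(value, state) convention of Seq.unfold.
    """
    while True:
        result = step(state)
        if result is None:
            return
        value, state = result
        yield value


# Ages for a record with min_age = 20 and max_age = 26.
max_age = 26
ages = list(unfold(lambda age: (age, age + 1) if age <= max_age else None, 20))
```

Here the state is the current age, the emitted value is that same age, and the next state is the age plus one.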
Quick explanation: I recently started using Codewars to improve my programming skills, and my first challenge was to write a Roman numeral decoder. I went through many versions because I wasn't satisfied with what I had, so I am asking whether there is an easier way to handle all the patterns Roman numerals have. For example, I is 1, but when I comes before a larger numeral it is subtracted from it: V = 5 but IV = 4.
Here is my code:
function Roman_Numerals_Decoder (roman)
    local Dict = {I = 1, V = 5, X = 10, L = 50, C = 100, D = 500, M = 1000}
    local number = 0
    local i = 1
    while i < #roman + 1 do
        local letter = roman:sub(i,i) -- Gets the current character in the string roman
        if roman:sub(i,i) == "I" and roman:sub(i + 1,i + 1) ~= "I" and roman:sub(i + 1,i + 1) ~= "" then -- Checks for the I pattern when I exists and next isnt I
            number = number + (Dict[roman:sub(i +1,i + 1)] - Dict[roman:sub(i,i)]) -- Taking one away from the next number
            i = i + 2 -- Increase the counter
        else
            number = number + Dict[letter] -- Adds the numbers together if no pattern is found, currently checking only I
            i = i + 1
        end
    end
    return number
end
print(Roman_Numerals_Decoder("MXLIX")) -- 1049 = MXLIX , 2008 = MMVIII
At the moment I am trying to get 1049 (MXLIX) to work, but I am getting 1069. Obviously I am not following a rule, and I feel like it's more wrong than it should be, because usually when the result is incorrect it's off by only 1 or 2.
The algorithm is slightly different: you need to subtract a character's value whenever it has less weight than the character that follows it.
function Roman_Numerals_Decoder (roman)
    local Dict = {I = 1, V = 5, X = 10, L = 50, C = 100, D = 500, M = 1000}
    local num = 0
    for i = 1, #roman - 1 do
        local letter = roman:sub(i,i) -- Current character
        local letter_p = roman:sub(i+1,i+1) -- Character that follows it
        if (Dict[letter] < Dict[letter_p]) then
            num = num - Dict[letter] -- Smaller value before a larger one: subtract it
            print("-",Dict[letter],num)
        else
            num = num + Dict[letter] -- Otherwise add it
            print("+",Dict[letter],num)
        end
    end
    num = num + Dict[roman:sub(-1)] -- The last character is always added
    print("+",Dict[roman:sub(-1)], num)
    return num
end
print(Roman_Numerals_Decoder("MXLIX")) -- 1049 = MXLIX , 2008 = MMVIII
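The same rule (subtract a symbol when the one after it is larger, otherwise add it) can be sketched in a few lines of Python. This is a minimal version that assumes well-formed input made only of the seven standard symbols; the names decode_roman and ROMAN_VALUES are made up for the example.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def decode_roman(roman):
    """Decode a Roman numeral, assuming it is well formed."""
    total = 0
    for i, letter in enumerate(roman):
        value = ROMAN_VALUES[letter]
        # The last symbol has no successor, so treat its successor as 0.
        next_value = ROMAN_VALUES[roman[i + 1]] if i + 1 < len(roman) else 0
        if value < next_value:
            total -= value  # e.g. the I in IX, the X in XL
        else:
            total += value
    return total


print(decode_roman("MXLIX"))   # 1049
print(decode_roman("MMVIII"))  # 2008
```

For MXLIX the running total goes 1000, 990, 1040, 1039, 1049, which matches the trace the Lua version prints.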
This is my string, and I have problems splitting it into a string array because some of the values contain commas:
{ Yr = 2019, Mth = DECEMBER , SeqN = 0, UComment = tet,tet1, OComment = test,test1, FWkMth = WK, FSafety = Y, FCustConsign = Y, FNCNRPull = 0, FNCNRPush = 0, CreatedTime = 2020-01-03 06:16:53 }
When I use string.Split(',') I get "UComment = tet" and "tet1" as separate array elements.
But I need a string[] in which each key stays together with its comma-separated value:
UComment = tet,tet1
OComment = test,test1
I have tried using the regex ,(?=([^\"]\"[^\"]\")[^\"]$)" but it did not work.
You may try matching on the regex pattern \S+\s*=\s*.*?(?=\s*,\s*\S+\s*=|\s*\}$):
string input = "{ Yr = 2019, Mth = DECEMBER , SeqN = 0, UComment = tet,tet1, OComment = test,test1, FWkMth = WK, FSafety = Y, FCustConsign = Y, FNCNRPull = 0, FNCNRPush = 0, CreatedTime = 2020-01-03 06:16:53 }";
Regex regex = new Regex(@"\S+\s*=\s*.*?(?=\s*,\s*\S+\s*=|\s*\}$)");
var results = regex.Matches(input);
foreach (Match match in results)
{
    Console.WriteLine(match.Groups[0].Value);
}
This prints:
Yr = 2019
Mth = DECEMBER
SeqN = 0
UComment = tet,tet1
OComment = test,test1
FWkMth = WK
FSafety = Y
FCustConsign = Y
FNCNRPull = 0
FNCNRPush = 0
CreatedTime = 2020-01-03 06:16:53
Here is an explanation of the regex pattern used:
\S+                          match a key
\s*                          followed by optional whitespace and
=                            literal '='
\s*                          more optional whitespace
.*?                          match anything until seeing
(?=\s*,\s*\S+\s*=|\s*\}$)    that what follows is either the start of the next key/value OR
                             is the end of the input
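The pattern itself is not specific to .NET. As a rough cross-check, here is the same regex applied with Python's re module, followed by a split on the first '=' to turn each match into a key/value pair. The helper name parse_pairs is made up for this sketch, and it assumes keys never contain '='.

```python
import re

# Same pattern as above: key, '=', then a lazy value that stops before
# the next ", key =" or before the closing brace at the end of the input.
PAIR_PATTERN = re.compile(r"\S+\s*=\s*.*?(?=\s*,\s*\S+\s*=|\s*\}$)")


def parse_pairs(text):
    pairs = {}
    for match in PAIR_PATTERN.findall(text):
        key, value = match.split("=", 1)  # split on the first '=' only
        pairs[key.strip()] = value.strip()
    return pairs


text = ("{ Yr = 2019, Mth = DECEMBER , SeqN = 0, UComment = tet,tet1, "
        "OComment = test,test1, FWkMth = WK, FSafety = Y, FCustConsign = Y, "
        "FNCNRPull = 0, FNCNRPush = 0, CreatedTime = 2020-01-03 06:16:53 }")
pairs = parse_pairs(text)
```

The values with embedded commas survive intact, for example pairs["UComment"] is "tet,tet1".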
I have 7 classes and a total of 115 records, and I want to run a Random Forest model on this data. However, the data is not enough to get high accuracy, so I want to oversample all the classes in such a way that the majority class gets a higher count and the minority classes grow accordingly. Is this possible in PySpark?
+---------+-----+
| SubTribe|count|
+---------+-----+
| Chill| 10|
| Cool| 18|
|Adventure| 18|
| Quirk| 13|
| Mystery| 25|
| Party| 18|
|Glamorous| 13|
+---------+-----+
Here is another implementation of SMOTE in PySpark and Scala that I have used in the past. I have copied the code across and referenced the source because it's quite small:
Pyspark:
import random
import numpy as np
from pyspark.sql import Row
from sklearn import neighbors
from pyspark.ml.feature import VectorAssembler

def vectorizerFunction(dataInput, TargetFieldName):
    if(dataInput.select(TargetFieldName).distinct().count() != 2):
        raise ValueError("Target field must have only 2 distinct classes")
    columnNames = list(dataInput.columns)
    columnNames.remove(TargetFieldName)
    # Feature columns first, target column last
    dataInput = dataInput.select(columnNames + [TargetFieldName])
    assembler=VectorAssembler(inputCols = columnNames, outputCol = 'features')
    pos_vectorized = assembler.transform(dataInput)
    vectorized = pos_vectorized.select('features',TargetFieldName).withColumn('label',pos_vectorized[TargetFieldName]).drop(TargetFieldName)
    return vectorized

def SmoteSampling(vectorized, k = 5, minorityClass = 1, majorityClass = 0, percentageOver = 200, percentageUnder = 100):
    # "or" rather than "|": the bitwise operator binds tighter than the comparisons
    if(percentageUnder > 100 or percentageUnder < 10):
        raise ValueError("Percentage Under must be in range 10 - 100")
    if(percentageOver < 100):
        raise ValueError("Percentage Over must be at least 100")
    dataInput_min = vectorized[vectorized['label'] == minorityClass]
    dataInput_maj = vectorized[vectorized['label'] == majorityClass]
    # Collect the minority feature vectors to the driver for the neighbour search
    feature = dataInput_min.select('features')
    feature = feature.rdd
    feature = feature.map(lambda x: x[0])
    feature = feature.collect()
    feature = np.asarray(feature)
    # Ask for k + 1 neighbours: each point comes back as its own nearest neighbour
    nbrs = neighbors.NearestNeighbors(n_neighbors=k + 1, algorithm='auto').fit(feature)
    neighbours = nbrs.kneighbors(feature)
    gap = neighbours[0]
    neighbours = neighbours[1]
    min_rdd = dataInput_min.drop('label').rdd
    pos_rddArray = min_rdd.map(lambda x : list(x))
    pos_ListArray = pos_rddArray.collect()
    min_Array = list(pos_ListArray)
    newRows = []
    nt = len(min_Array)
    nexs = percentageOver // 100  # integer division so that range() accepts it
    for i in range(nt):
        for j in range(nexs):
            neigh = random.randint(1,k)  # column 0 is the point itself
            # Interpolate towards one of the k nearest neighbours of point i
            difs = min_Array[neighbours[i][neigh]][0] - min_Array[i][0]
            newRec = (min_Array[i][0]+random.random()*difs)
            newRows.insert(0,(newRec))
    newData_rdd = spark.sparkContext.parallelize(newRows)
    newData_rdd_new = newData_rdd.map(lambda x: Row(features = x, label = minorityClass))
    new_data = newData_rdd_new.toDF()
    new_data_minor = dataInput_min.unionAll(new_data)
    new_data_major = dataInput_maj.sample(False, (float(percentageUnder)/float(100)))
    return new_data_major.unionAll(new_data_minor)

dataInput = spark.read.format('csv').options(header='true',inferSchema='true').load("sam.csv").dropna()
# percentageOver and percentageUnder must pass the range checks in SmoteSampling
SmoteSampling(vectorizerFunction(dataInput, 'Y'), k = 2, minorityClass = 1, majorityClass = 0, percentageOver = 200, percentageUnder = 100)
Scala:
// Import the necessary packages
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.expressions.Window
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.rand
import org.apache.spark.sql.functions._

object smoteClass{
  def KNNCalculation(
    dataFinal:org.apache.spark.sql.DataFrame,
    feature:String,
    reqrows:Int,
    BucketLength:Int,
    NumHashTables:Int):org.apache.spark.sql.DataFrame = {
      val b1 = dataFinal.withColumn("index", row_number().over(Window.partitionBy("label").orderBy("label")))
      val brp = new BucketedRandomProjectionLSH().setBucketLength(BucketLength).setNumHashTables(NumHashTables).setInputCol(feature).setOutputCol("values")
      val model = brp.fit(b1)
      val transformedA = model.transform(b1)
      val transformedB = model.transform(b1)
      val b2 = model.approxSimilarityJoin(transformedA, transformedB, 2000000000.0)
      require(b2.count > reqrows, println("Change bucket length or reduce the percentageOver"))
      val b3 = b2.selectExpr("datasetA.index as id1",
        "datasetA.feature as k1",
        "datasetB.index as id2",
        "datasetB.feature as k2",
        "distCol").filter("distCol>0.0").orderBy("id1", "distCol").dropDuplicates().limit(reqrows)
      return b3
  }

  def smoteCalc(key1: org.apache.spark.ml.linalg.Vector, key2: org.apache.spark.ml.linalg.Vector)={
    val resArray = Array(key1, key2)
    val res = key1.toArray.zip(key2.toArray.zip(key1.toArray).map(x => x._1 - x._2).map(_*0.2)).map(x => x._1 + x._2)
    resArray :+ org.apache.spark.ml.linalg.Vectors.dense(res)}

  def Smote(
    inputFrame:org.apache.spark.sql.DataFrame,
    feature:String,
    label:String,
    percentOver:Int,
    BucketLength:Int,
    NumHashTables:Int):org.apache.spark.sql.DataFrame = {
      val groupedData = inputFrame.groupBy(label).count
      require(groupedData.count == 2, println("Only 2 labels allowed"))
      val classAll = groupedData.collect()
      val minorityclass = if (classAll(0)(1).toString.toInt > classAll(1)(1).toString.toInt) classAll(1)(0).toString else classAll(0)(0).toString
      val frame = inputFrame.select(feature,label).where(label + " == " + minorityclass)
      val rowCount = frame.count
      val reqrows = (rowCount * (percentOver/100)).toInt
      val md = udf(smoteCalc _)
      val b1 = KNNCalculation(frame, feature, reqrows, BucketLength, NumHashTables)
      val b2 = b1.withColumn("ndtata", md($"k1", $"k2")).select("ndtata")
      val b3 = b2.withColumn("AllFeatures", explode($"ndtata")).select("AllFeatures").dropDuplicates
      val b4 = b3.withColumn(label, lit(minorityclass).cast(frame.schema(1).dataType))
      return inputFrame.union(b4).dropDuplicates
  }
}
Source
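To make the core idea behind both versions easier to see without a Spark cluster, here is a small pure-Python sketch of the SMOTE interpolation step: for each minority point, pick one of its k nearest minority neighbours and create a new point somewhere on the line segment between the two. This is only an illustration under simplifying assumptions (brute-force neighbour search, plain lists of floats, a single class); the function names smote and nearest_neighbours are made up and are not part of either implementation above.

```python
import random


def nearest_neighbours(points, i, k):
    """Indices of the k points closest to points[i], excluding i (brute force)."""
    def squared_distance(j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))

    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=squared_distance)[:k]


def smote(points, k=2, per_point=2, seed=0):
    """Create per_point synthetic samples for every point in one class."""
    rng = random.Random(seed)
    synthetic = []
    for i, point in enumerate(points):
        neighbours = nearest_neighbours(points, i, k)
        for _ in range(per_point):
            other = points[rng.choice(neighbours)]
            gap = rng.random()  # position along the segment, in [0, 1)
            synthetic.append([p + gap * (o - p) for p, o in zip(point, other)])
    return synthetic


minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote(minority, k=2, per_point=2)
```

With more than two classes, one possible approach is to call such a routine once per class with a different per_point value for each, though whether that helps with only 115 records is a separate question.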
Maybe this project can be useful for your goal:
Spark SMOTE
But I think that 115 records aren't enough for a random forest. You could use a simpler technique, such as a decision tree.
You can check this answer:
Is Random Forest suitable for very small data sets?
I need to check whether the first value is >= 'from' and the second value is <= 'to'; if so, my function returns the number. It works, but I don't know whether this is the best and most optimized way to get the value (the number from the table).
local table = {
    {from = -1, to = 12483, number = 0},
    {from = 12484, to = 31211, number = 1},
    {from = 31212, to = 53057, number = 2},
    {from = 53058, to = 90200, number = 3},
    {from = 90201, to = 153341, number = 4},
    {from = 153342, to = 443162, number = 5},
    {from = 443163, to = 753380, number = 6},
    {from = 753381, to = 1280747, number = 7},
    {from = 1280748, to = 2689570, number = 8},
    {from = 2689571, to = 6723927, number = 9},
    {from = 6723928, to = 6723928, number = 10}
}
local exampleFromValue = 31244
local exampleToValue = 42057
local function getNumber()
    local number = 0
    for k, v in pairs(table) do
        if (v.from and exampleFromValue >= v.from) and (v.to and exampleToValue <= v.to) then
            number = v.number
            break
        end
    end
    return number
end
print(getNumber())
With this small amount of data, such a function doesn't seem like a performance issue. However, you can compress the data a bit:
local t = {
    12484, 31212, 53058, 90201, 153342, 443163, 753381, 1280748, 2689571, 6723928
}
local exampleFromValue = 31244
local exampleToValue = 42057
local function getNumber()
    local last = -1
    for i, v in ipairs(t) do
        if exampleFromValue >= last and exampleToValue < v then
            return i - 1
        end
        last = v
    end
    return 0
end
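Because the ranges are contiguous and sorted, the lookup can also be done with a binary search instead of a linear scan. Here is a minimal Python sketch of that idea, reusing the thresholds from the compressed table. It assumes from <= to, and the names get_number, LOWEST and HIGHEST are made up for the example. It returns the default of 0 when the two values fall outside the table or in different ranges.

```python
from bisect import bisect_right

# Lower bounds of ranges 1..10; range 0 starts at LOWEST.
THRESHOLDS = [12484, 31212, 53058, 90201, 153342,
              443163, 753381, 1280748, 2689571, 6723928]
LOWEST, HIGHEST = -1, 6723928


def get_number(from_value, to_value, default=0):
    if from_value < LOWEST or to_value > HIGHEST:
        return default
    # bisect_right counts how many thresholds are <= the value,
    # which is exactly the index of the range the value falls in.
    bucket = bisect_right(THRESHOLDS, from_value)
    if bisect_right(THRESHOLDS, to_value) != bucket:
        return default  # the two values straddle a range boundary
    return bucket


print(get_number(31244, 42057))  # 2
```

For ten entries the difference is negligible, so this only becomes worthwhile if the table grows much larger.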
T = {
    {Name = "Mark", HP = 54, Breed = "Ghost"},
    {Name = "Stan", HP = 24, Breed = "Zombie"},
    {Name = "Juli", HP = 100, Breed = "Human"}
}
Questions:
How would I print just the names?
and
How can I sort it by HP?
You need to iterate over the table by using either the pairs or ipairs function to print the name. ipairs iterates from 1 to N (numeric indices only), while pairs iterates over every element, in no defined order.
> T = { {Name = "Mark", HP = 54, Breed = "Ghost"}, {Name = "Stan", HP = 24, Breed = "Zombie"}, {Name = "Juli", HP = 100, Breed = "Human"}}
> for _,t in ipairs(T) do print(t.Name) end
Mark
Stan
Juli
Then you can use the table.sort function to sort the table in-place:
> table.sort(T, function(x,y) return x.HP < y.HP end)
> for _,t in ipairs(T) do print(t.Name, t.HP) end
Stan 24
Mark 54
Juli 100
The second argument to table.sort is a comparison function of your choice; in this case, we only wanted to compare the HP values.
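For comparison, the same two steps look like this in Python, where a key function takes the place of the Lua comparison function. This is just a side-by-side sketch; the variable names creatures, names and by_hp are made up.

```python
creatures = [
    {"Name": "Mark", "HP": 54, "Breed": "Ghost"},
    {"Name": "Stan", "HP": 24, "Breed": "Zombie"},
    {"Name": "Juli", "HP": 100, "Breed": "Human"},
]

# Print just the names, in list order.
names = [c["Name"] for c in creatures]
for name in names:
    print(name)

# sorted() returns a new list ordered by HP; creatures.sort(key=...) would
# sort in place, which is closer to what table.sort does in Lua.
by_hp = sorted(creatures, key=lambda c: c["HP"])
for c in by_hp:
    print(c["Name"], c["HP"])
```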