SymbolicExpression - to string - f#

I'm using RProvider in F# to calculate some statistics of my data. When I call an R function, it returns a SymbolicExpression, and it is really difficult to get the data out of this type. In my code I compute quantiles like this:
let seq = R.seq(namedParams ["from", box 0;"to", box 1;"length", box 11;])
let quantiles = R.quantile(namedParams ["x", box dataWithoutNan1; "prob", box seq; "type", box 5;])
Then quantiles is of type SymbolicExpression.
val quantiles : SymbolicExpression =
0% 10% 20% 30% 40% 50%
0.000000e+00 3.203978e-03 5.366154e-03 1.101344e-02 8.259162e-02 4.533620e-01
60% 70% 80% 90% 100%
1.278446e+00 2.706468e+00 4.927400e+00 1.141095e+01 8.944235e+02
The SymbolicExpression type has a member Value:
let quantilesValue = quantiles.Value
and it is of type obj
val quantilesValue : obj =
[|0.0; 0.003203978402; 0.00536615421; 0.01101343569; 0.08259161954;
0.4533619823; 1.278445928; 2.706467538; 4.927399755; 11.41095162;
894.4234507|]
What I am trying to do is print these values like this:
0.0; 0.003203978402; 0.00536615421; 0.01101343569; 0.08259161954;
0.4533619823; 1.278445928; 2.706467538; 4.927399755; 11.41095162;
894.4234507
I tried to cast these objects to Seq or to List but I was not able to do this.
Any idea how to get values from SymbolicExpression in some simple way?

This code works for this issue:
let quantiles = R.quantile(namedParams ["x", box dataWithoutNan1; "prob", box seq; "type", box 5;]).GetValue<double[]>()
or
let quantiles = R.quantile(namedParams ["x", box dataWithoutNan1; "prob", box seq; "type", box 5;]).GetValue<list<double>>()


How to fine tune a masked language model?

I'm trying to follow the huggingface tutorial on fine tuning a masked language model (masking a set of words randomly and predicting them). But they assume that the dataset is in their system (can load it with from datasets import load_dataset; load_dataset("dataset_name")). However, my input dataset is a long string:
text = "This is an attempt of a great example. "
dataset = text * 3000
I followed their approach and tokenized it:
from transformers import AutoTokenizer
from transformers import AutoModelForMaskedLM
import torch
from transformers import DataCollatorForLanguageModeling
model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
def tokenize_long_text(tokenizer, long_text):
    # Split into sentences, tokenize each one, then flatten into one list of token ids
    individual_sentences = long_text.split('.')
    tokenized_sentences_list = tokenizer(individual_sentences)['input_ids']
    tokenized_sequence = [x for xs in tokenized_sentences_list for x in xs]
    return tokenized_sequence
tokenized_sequence = tokenize_long_text(tokenizer, dataset)
Followed by chunking it into equal-length segments:
def chunk_long_tokenized_text(tokenizer_text, chunk_size):
    # Compute length of long tokenized texts
    total_length = len(tokenizer_text)
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    return [tokenizer_text[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
chunked_sequence = chunk_long_tokenized_text(tokenized_sequence, 30)
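To make the chunking step concrete, here is a minimal sketch of the same logic on a toy list of integers standing in for token ids. The name chunk_tokens and the toy data are illustrative only and are not part of the tutorial.

```python
def chunk_tokens(token_ids, chunk_size):
    """Split a flat list of token ids into equal-length chunks, dropping the remainder."""
    usable_length = (len(token_ids) // chunk_size) * chunk_size
    return [token_ids[i:i + chunk_size] for i in range(0, usable_length, chunk_size)]


toy_ids = list(range(10))           # stand-in for real token ids
chunks = chunk_tokens(toy_ids, 4)   # the last two ids (8, 9) do not fill a chunk
print(chunks)                       # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Note that an input shorter than chunk_size yields an empty list, so very short texts produce no training chunks at all.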
Created a data collator for random masking:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15) # expects a list of dicts, where each dict represents a single chunk of contiguous text
Example of how it works:
d = {}
d['input_ids'] = chunked_sequence[0]
d
>>>{'input_ids': [101,
2023,
2003,
1037,
2307,
103,...
for chunk in data_collator([ d ])["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")
>>>'>>> [CLS] this is a great [MASK] [SEP] [CLS] this is a great [MASK] [SEP] [CLS] this is a great [MASK] [SEP] [CLS] this is a great [MASK] [SEP] [CLS] this'
However, the remaining steps (which I believe is just the training component) seem to only work using their trainer method, which can only take their dataset.
How can this work with a dataset in the form of a string?

How to get class labels from TensorFlow prediction

I have a classification model in TF and can get a list of probabilities for the next class (preds). Now I want to select the highest element (argmax) and display its class label.
This may seem silly, but how can I get the class label that matches a position in the predictions tensor?
feed_dict={g['x']: current_char}
preds, state = sess.run([g['preds'],g['final_state']], feed_dict)
prediction = tf.argmax(preds, 1)
preds gives me a vector of predictions for each class. Surely there must be an easy way to just output the most likely class (label)?
Some info about my model:
x = tf.placeholder(tf.int32, [None, num_steps], name='input_placeholder')
y = tf.placeholder(tf.int32, [None, 1], name='labels_placeholder')
batch_size = tf.shape(x)[0]
x_one_hot = tf.one_hot(x, num_classes)
rnn_inputs = [tf.squeeze(i, squeeze_dims=[1]) for i in
              tf.split(x_one_hot, num_steps, 1)]
tmp = tf.stack(rnn_inputs)
print(tmp.get_shape())
tmp2 = tf.transpose(tmp, perm=[1, 0, 2])
print(tmp2.get_shape())
rnn_inputs = tmp2
with tf.variable_scope('softmax'):
    W = tf.get_variable('W', [state_size, num_classes])
    b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))
rnn_outputs = rnn_outputs[:, num_steps - 1, :]
rnn_outputs = tf.reshape(rnn_outputs, [-1, state_size])
y_reshaped = tf.reshape(y, [-1])
logits = tf.matmul(rnn_outputs, W) + b
predictions = tf.nn.softmax(logits)
A prediction is an array with one entry per class (label). Each entry represents the model's "confidence" that the input belongs to that class. You can find the index of the label with the highest confidence using:
import numpy as np
prediction = np.argmax(preds, 1)
This gives one index per row of preds. Look each index up in your list of class labels to get the class name associated with it:
[class_names[i] for i in prediction]
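As a dependency-free illustration of the same idea, the sketch below picks the most likely class for each row of a small batch. The class names and probabilities are made up for the example; in the question they would come from the model's vocabulary and from sess.run.

```python
def argmax(values):
    """Return the index of the largest element in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical class names and a batch of two probability rows
class_names = ["a", "b", "c"]
preds = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]

# One label per row: the class whose probability is highest
predicted_labels = [class_names[argmax(row)] for row in preds]
print(predicted_labels)  # ['b', 'a']
```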
Please refer to this link for more understanding.
You can use tf.reduce_max() to get the highest probability itself; tf.argmax() gives you its index. I would refer you to this answer.
Let me know if it works - will edit if it doesn't.
Mind that there are sometimes several ways to load a dataset. For instance with fashion MNIST the tutorial could lead you to use load_data() and then to create your own structure to interpret a prediction. However you can also load these data by using tensorflow_datasets.load(...) like here after installing tensorflow-datasets which gives you access to some DatasetInfo. So for instance if your prediction is 9 you can tell it's a boot with:
import tensorflow_datasets as tfds
_, ds_info = tfds.load('fashion_mnist', with_info=True)
print(ds_info.features['label'].names[9])
When you use softmax, the labels you train the model on are either numbers 0..n or one-hot encoded values. So if original labels of your data are let's say string names, you must map them to integers first and keep the mapping as a variable (such as 0 -> "apple", 1 -> "orange", 2 -> "pear" ...).
When using integers (with loss='sparse_categorical_crossentropy'), you get predictions as an array of probabilities, so you just find the array index with the max value. You can use this predicted index to reverse-map to your label:
predictedIndex = np.argmax(predictions)  # e.g. 2
predictedLabel = indexToLabelMap[predictedIndex]  # e.g. "pear"
If you use one-hot encoded labels (with loss='categorical_crossentropy'), the predicted index corresponds with the "hot" index of your label.
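A minimal sketch of keeping such a mapping, assuming string labels like the fruit names above. All variable names and the example probabilities are illustrative, not taken from any particular tutorial.

```python
raw_labels = ["pear", "apple", "orange", "apple", "pear"]  # hypothetical training labels

# Sort the distinct names so the mapping is the same on every run
label_names = sorted(set(raw_labels))
label_to_index = {name: i for i, name in enumerate(label_names)}
index_to_label = {i: name for name, i in label_to_index.items()}

# Integer labels, as used with sparse_categorical_crossentropy
encoded = [label_to_index[name] for name in raw_labels]


def one_hot(index, num_classes):
    """One-hot row whose "hot" position equals the integer label."""
    row = [0.0] * num_classes
    row[index] = 1.0
    return row


# One-hot labels, as used with categorical_crossentropy
one_hot_labels = [one_hot(i, len(label_names)) for i in encoded]

# Reverse-map a (made-up) probability vector back to a label
probabilities = [0.1, 0.2, 0.7]
predicted_index = max(range(len(probabilities)), key=probabilities.__getitem__)
predicted_label = index_to_label[predicted_index]
print(predicted_index, predicted_label)  # 2 pear
```

Either way, the mapping has to be saved alongside the model, because the model itself only ever sees the integer indices.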
Just for reference, I needed this info when I was working with MNIST dataset used in Google's Machine learning crash course. There is also a good classification tutorial in the Tensorflow docs.

How to make a value_of_things function f#

Hey guys, let's say I have a function that takes the day's rate for how much something costs per pound and multiplies it by how many pounds a customer wants, i.e.
let convert_func (crab_rate, lobster_rate);
//and then on a certain day it is
let (crab_rate, lobster_rate) = convert_fun(3.8, 6.8); // $3.8 a pound for crab, $6.8 a pound for lobster.
// 10 being how many pounds i want.
crab_rate10 ;
Then my output would be 38, since $3.8 * 10 lbs = $38.
I tried using if statements for the case where the user wants the total cost of just one thing and not both, but I keep getting errors. I can't figure out how to store the rate values in the parameters and then call the function.
This is what I tried:
let crab_rate (pound, rate) = (float pound) * rate;
let lobster_rate (pound, rate) = (float pound) * rate;
let convert_func (crab_rate, lobster_rate)= function (first,second ) ->
if crab_rate then (float pound) * rate;
elif lobster_rate (float pound) * rate;
let (crab_rate, lobster_rate) = convert_fun(3.8, 6.8); // $3.8 a pound for crab, $6.8 a pound for lobster.
// 10 being how many pounds i want.
crab_rate10 ;
I think you should start by making a general function for converting a cost/weight and a weight into a cost. In F#, you can even use units of measure to help you:
[<Measure>] type USD // Unit of measure: US Dollars
[<Measure>] type lb // Unit of measure: lbs
let priceForWeight rate (weight : float<lb>) : float<USD> =
    rate * weight
The nice thing about functional languages with curried arguments is that we can easily use partial function application. That means when we have a function that has two arguments, we can choose to supply just one argument and get back a new function from that one remaining argument to the result.
We can therefore define a further pair of functions that use this priceForWeight function.
let crabPriceForWeight weight = priceForWeight 3.8<USD/lb> weight
let lobsterPriceForWeight weight = priceForWeight 6.8<USD/lb> weight
Notice that we've used our original function to define two new functions with fixed rates.
We can then evaluate it like this:
let crabPrice10 = crabPriceForWeight 10.0<lb> // result 38.0<USD>
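For readers who know Python better than F#, the same partial-application idea can be sketched with functools.partial. This is only an analogy: Python has no units of measure, so the units live in the hypothetical parameter names rather than in the types.

```python
from functools import partial


def price_for_weight(rate_usd_per_lb, weight_lb):
    """General conversion: a rate (USD per lb) times a weight (lb) gives a cost (USD)."""
    return rate_usd_per_lb * weight_lb


# Fix the first argument to get specialised one-argument functions
crab_price_for_weight = partial(price_for_weight, 3.8)
lobster_price_for_weight = partial(price_for_weight, 6.8)

crab_price_10 = crab_price_for_weight(10.0)       # cost of 10 lb of crab
lobster_price_10 = lobster_price_for_weight(10.0)  # cost of 10 lb of lobster
print(round(crab_price_10, 2), round(lobster_price_10, 2))
```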
Of course, you can also define a function that returns both prices together as a tuple for a supplied weight:
let crabAndLobsterPriceForWeight weight =
    crabPriceForWeight weight, lobsterPriceForWeight weight

Functional Programming Exercise

As a functional programming exercise, I thought I'd write a little program to rank crafting recipes in an mmo by profitability.
In an OO language, I'd make strategy objects for each recipe, with Cost(), ExpectedRevenue(), and Volume() as members. I'd then put all the objects in a list and sort them by a profitability/time function.
Trying to accomplish the same result in F#, but I'm not certain how to go about it. I have some disjointed cost functions, for example:
let cPM (ss,marble) = (15.0 * ss + 10.0 * marble + 0.031) / 5.0
let cTRef (tear,clay) = (tear + 10.0 * clay + 0.031) / 5.0
and then revenue and volume definitions like:
let rPM = 1.05
let vPM = 50
but I'm not sure what to do now. Make a list of tuples that look like
(name: string, cost:double, revenue:double, volume:int)
and then sort the list? It feels like I'm missing something- still thinking in OO, not to mention adding new recipes in this fashion will be rather awkward.
Has anyone any tips to use the functional concepts in a better way? It seemed like this type of calculation problem would be a good fit for the functional style.
Much appreciated.
This is a fairly complex question with multiple possible answers. Also, it is quite hard to guess anything about your domain (I don't know what game you're playing :-)), so I'll try to make something up, based on the example.
The basic functional approach would be to model the different recipes using a discriminated union.
type Recipe =
    | FancySword of gold:float * steel:float // Sword can be created from gold & steel
    | MagicalStone of frogLegs:float // Magical stone requires some number of frog legs
Also, we need to know the prices of things in the game:
type Prices = { Gold : float; Steel : float; FrogLegs : float }
Now you can write functions to calculate the cost and expected revenue of the recipes:
let cost prices recipe =
    match recipe with
    | FancySword(g, s) ->
        // To create a sword, we need 2 pieces of gold and 15 pieces of steel
        2.0 * g * prices.Gold + s * 15.0 * prices.Steel
    | MagicalStone(l) -> l * prices.FrogLegs
This takes the record with all the prices and it takes a recipe that you want to evaluate.
The example should give you some idea - starting with a discriminated union to model the problem domain (different recipes) and then writing a function with pattern matching in it is usually a good way to get started - but it's hard to say more with the limited information in your question.
In functional languages you can model this with plain functions. Here you can define a common profitability function and sort your recipes by it with List.sortBy:
// recipe type with constants for Revenue, Volume and (ss,marble)
type recipe = {r: float; v: float; smth: float * float}
// list of recipes
let recipes = [
    {r = 1.08; v = 47.0; smth = (28.0, 97.0)};
    {r = 1.05; v = 50.0; smth = (34.0, 56.0)} ]
// cost function
let cPM (ss,marble) = (15.0 * ss + 10.0 * marble + 0.031) / 5.0
// profitability function with custom coefficients
let profitability recipe = recipe.r * 2.0 + recipe.v * 3.0 + cPM recipe.smth
// sort recipes by profitability
let sortedRecipes =
    List.sortBy profitability recipes
// note: it's reordered now (List.sortBy sorts in ascending order)
printfn "%A" sortedRecipes
The accepted answer is a little lacking in type safety, I think - you already stated that a FancySword is made of gold and steel, so you shouldn't have to remember to correctly pair the gold quantity with the gold price! The type system ought to check that for you, and prevent an accidental g * prices.Steel mistake.
If the set of possible resource types is fixed, then this is a nice use-case for Units of Measure.
[<Measure>] type Gold
[<Measure>] type Steel
[<Measure>] type FrogLegs
[<Measure>] type GameMoney
type Recipe = {
    goldQty : float<Gold>
    steelQty : float<Steel>
    frogLegsQty : int<FrogLegs>
}
type Prices = {
    goldPrice : float<GameMoney/Gold>
    steelPrice : float<GameMoney/Steel>
    frogLegsPrice : float<GameMoney/FrogLegs>
}
let recipeCost prices recipe =
    prices.goldPrice * recipe.goldQty +
    prices.steelPrice * recipe.steelQty +
    // frog legs must be converted to float while preserving UoM
    prices.frogLegsPrice * (recipe.frogLegsQty |> float |> LanguagePrimitives.FloatWithMeasure)
let currentPrices = {goldPrice = 100.0<GameMoney/Gold>; steelPrice = 50.0<GameMoney/Steel>; frogLegsPrice = 2.5<GameMoney/FrogLegs> }
let currentCost = recipeCost currentPrices
let fancySwordRecipe = {goldQty = 25.4<Gold>; steelQty = 76.4<Steel>; frogLegsQty = 0<FrogLegs>}
let fancySwordCost = currentCost fancySwordRecipe
The compiler will now ensure that all calculations check out. In the recipeCost function, for example, it ensures that the total is a float<GameMoney>.
Since you mentioned volume, I think you can see how you can replicate the same pattern to write type-safe functions that will calculate total recipe volumes as a value of type int<InventoryVolume>.

Calculating vector distance for classification with mixed features

I'm doing a project comparing the effectiveness of various classification algorithms, but I'm stuck on a frustrating point. The data may be found here: http://archive.ics.uci.edu/ml/datasets/Adult The classification problem is whether or not a person makes over 50k a year based on their census data.
Two example entries are as follows:
45, Private, 98092, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K
50, Self-emp-not-inc, 386397, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K
I'm familiar with using Euclidean distance to calculate the difference between vectors, but I'm not sure how to work with a mix of continuous and discrete attributes. Are there any effective methods for representing the difference between two vectors in a meaningful way? I'm having a hard time wrapping my head around how large values like the third attribute (a weight calculated by the people who extracted the data set based on factors, so that similar weights should have similar attributes) and differences between it can preserve meaning from discrete features like male or female, which is only a Euclidean distance of 1 if I understand the method correctly. I'm sure some categories could be removed, but I don't want to remove something that factors into classification significantly. I'm tackling k-NN first once I get this figured out, then a Bayesian classifier, and finally a decision tree model like C4.5 or ID3 if I have the time.
Sure, you can extend Euclidean distance in any number of ways. The simplest extension would be the following rule:
distance = 0 in that coordinate if there's a match, 1 otherwise
The challenge will be making the concept of distance "relevant" for the k-NN follow up. In some cases (e.g. education), I think it will be best to map education (discrete variable) into a continuous variable, such as years of education. So you'll need to write a function which maps e.g. "HS-grad" to 12, "Bachelors" to 16, something like that.
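A minimal sketch of that rule, assuming records are plain dictionaries. The EDUCATION_YEARS table, the field names, and the choice of which fields count as numeric or categorical are all illustrative; the year values follow the rough guesses above.

```python
import math

# Hypothetical mapping from education level to approximate years of schooling
EDUCATION_YEARS = {"HS-grad": 12, "Bachelors": 16}


def prepare(record):
    """Add a numeric education_years field derived from the discrete education label."""
    prepared = dict(record)
    prepared["education_years"] = EDUCATION_YEARS[record["education"]]
    return prepared


def mixed_distance(a, b, numeric_keys, categorical_keys):
    """Euclidean distance where a categorical coordinate contributes 0 on a match, 1 otherwise."""
    squared = 0.0
    for key in numeric_keys:
        squared += (a[key] - b[key]) ** 2
    for key in categorical_keys:
        squared += 0.0 if a[key] == b[key] else 1.0
    return math.sqrt(squared)


p1 = prepare({"age": 45, "education": "HS-grad", "workclass": "Private", "sex": "Male"})
p2 = prepare({"age": 50, "education": "Bachelors", "workclass": "Self-emp-not-inc", "sex": "Male"})

# (45-50)^2 + (12-16)^2 + 1 (workclass differs) + 0 (sex matches) = 42
d = mixed_distance(p1, p2, ["age", "education_years"], ["workclass", "sex"])
print(d)
```

Even in this tiny example the numeric fields dominate the categorical mismatch, which is why some scaling or weighting is usually needed before k-NN.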
Beyond that, using k-NN directly isn't going to work well because the idea of "distance" across multiple dissimilar dimensions isn't well defined. I think you'll be better off throwing some of these dimensions away or weighting them differently. I don't know what the third number in your dataset (e.g. 98092) means, but if you use naive Euclidean distance it would be extremely overweighted compared to other dimensions such as age.
I'm not a machine learning expert, but I would personally be tempted to start k-NN on a reduced dimensionality dataset where you just pick some broad demographics (e.g. age, education, marital status) and ignore the trickier/"noisier" categories.
You need to code your categorical variables as 1-of-n binary variables (n choices for the variable, and of those variables one and only one is active). Then standardise your features---for each feature, subtract its mean and divide by standard deviation. Or normalise into the range 0-1. It's not perfect, but this will at least make dimensions comparable.
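A minimal standard-library sketch of this recipe, using three made-up rows with one numeric feature (age) and one categorical feature (work class). The helper names and the data are assumptions for illustration; a real pipeline would apply the same steps to every column of the dataset.

```python
import math
from statistics import mean, pstdev


def one_hot(value, categories):
    """1-of-n encoding: exactly one position is 1.0, the rest are 0.0."""
    return [1.0 if value == c else 0.0 for c in categories]


def standardise(column):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column carries no information
    return [(x - mu) / sigma for x in column]


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Hypothetical rows: (age, work class)
rows = [(45, "Private"), (50, "Self-emp-not-inc"), (38, "Private")]
categories = sorted({workclass for _, workclass in rows})

# Encode each row, then standardise column by column
raw = [[float(age)] + one_hot(workclass, categories) for age, workclass in rows]
columns = [standardise(list(col)) for col in zip(*raw)]
features = [list(row) for row in zip(*columns)]

d_same_class = euclidean(features[0], features[2])  # both "Private"
d_diff_class = euclidean(features[0], features[1])  # different work class
print(d_same_class < d_diff_class)  # True
```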
Create an individual Map for each categorical field and use it to convert each value to a double.
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Assign each distinct category value an arbitrary numeric code (1.0, 2.0, ...)
def createMap(data: RDD[String]): Map[String, Double] = {
  var mapData: Map[String, Double] = Map()
  var counter = 0.0
  data.collect().foreach { item =>
    counter = counter + 1
    mapData += (item -> counter)
  }
  mapData
}
// Binary class label
def getLabelValue(input: String): Int = input match {
  case "<=50K" => 0
  case ">50K" => 1
}
val census = sc.textFile("/user/cloudera/census_data.txt")
// Distinct values of each categorical column
val orgTypeRdd = census.map(line => line.split(", ")(1)).distinct
val gradeTypeRdd = census.map(line => line.split(", ")(3)).distinct
val marStatusRdd = census.map(line => line.split(", ")(5)).distinct
val jobTypeRdd = census.map(line => line.split(", ")(6)).distinct
val familyStatusRdd = census.map(line => line.split(", ")(7)).distinct
val raceTypeRdd = census.map(line => line.split(", ")(8)).distinct
val genderTypeRdd = census.map(line => line.split(", ")(9)).distinct
val countryRdd = census.map(line => line.split(", ")(13)).distinct
val orgTypeMap = createMap(orgTypeRdd)
val gradeTypeMap = createMap(gradeTypeRdd)
val marStatusMap = createMap(marStatusRdd)
val jobTypeMap = createMap(jobTypeRdd)
val familyStatusMap = createMap(familyStatusRdd)
val raceTypeMap = createMap(raceTypeRdd)
val genderTypeMap = createMap(genderTypeRdd)
val countryMap = createMap(countryRdd)
val featureVector = census.map { line =>
  val fields = line.split(", ")
  // The salary (field 14) is used only as the label; it is left out of the features to avoid label leakage
  LabeledPoint(
    getLabelValue(fields(14)),
    Vectors.dense(
      fields(0).toDouble, orgTypeMap(fields(1)), fields(2).toDouble,
      gradeTypeMap(fields(3)), fields(4).toDouble, marStatusMap(fields(5)),
      jobTypeMap(fields(6)), familyStatusMap(fields(7)), raceTypeMap(fields(8)),
      genderTypeMap(fields(9)), fields(10).toDouble, fields(11).toDouble,
      fields(12).toDouble, countryMap(fields(13))))
}
