I have the following code:
let rand = System.Random()
let gold = [ for i in years do yield rand.NextDouble()]
However I cannot collapse it into one line as
let gold = [ for i in years do yield System.Random.NextDouble()]
Why?
Your two code examples are not equivalent. The first one creates an object, and then repeatedly calls NextDouble() on that object. The second one appears to call a static method on the Random class, but I'd be surprised if it even compiles, since NextDouble() is not actually declared as static.
You can combine the creation of the Random instance and its usage in a couple of ways, if desired:
let gold =
    let rand = Random()
    [ for i in 1..10 do yield rand.NextDouble() ]
or
let gold = let rand = Random() in [for i in 1..10 do yield rand.NextDouble()]
Most random numbers generated by computers, such as those in your code, are not random in the true sense of the word. They are generated by an algorithm and, given that algorithm and a seed (a starting point for the algorithm), they are fully deterministic.
Essentially, when you want a series of random numbers, you select a seed and an algorithm; the algorithm then generates the numbers for you, using the seed as the starting point and iterating the algorithm from there.
In the old days, people would produce books of "random numbers". These books used a seed and algorithm to produce the random series of numbers ahead of time. If you wanted a random number, then you would select one from the book.
Computers work similarly. When you call
let rand = System.Random()
You are initializing the random number generator. It is like you are creating a book full of random numbers. Then to iteratively draw random numbers from the series, you do
rand.NextDouble()
That is like picking the first number from the series (book). Call it again and you pick the second number from the series, etc.
What is the point of F#/.NET having you initialize the random number generator? Well, what if you wanted repeatable results, where the random series contains the same numbers every time you run the code? Initializing the generator with an explicit seed guarantees you get the same "book of random numbers" each time:
let rand = System.Random(1)
Or, what if you wanted two different series of random numbers?
let rand1 = System.Random(1)
let rand2 = System.Random(2)
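The same behavior can be illustrated with Python's `random` module (used here purely for illustration; the idea is identical for .NET's System.Random):

```python
import random

# Two generators seeded with the same value produce the same "book" of numbers.
a = random.Random(1)
b = random.Random(1)
series_a = [a.random() for _ in range(3)]
series_b = [b.random() for _ in range(3)]

# A different seed yields a different series.
c = random.Random(2)
series_c = [c.random() for _ in range(3)]
```

Here series_a and series_b are identical while series_c differs, just as two System.Random(1) instances would agree with each other but not with System.Random(2).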
I'm trying to calculate the importance (in percentage) of each variable in my model (using smileRandomForest) in GEE.
var RFmodel = ee.Classifier.smileRandomForest(1000).train(trainingData, 'classID', predictionBands);
var var_imp = ee.Feature(null, ee.Dictionary(RFmodel.explain()).get('importance'));
In the example above, "var_imp" is a feature that has "importance" as a property. To calculate importance as %, I'm assuming I'll need to do something like:
Importance (%) = (variable importance value)/(total sum of all importance variables) * 100
Can someone help me to write a function for this? I'm relatively new to GEE and have no idea where to start. I've tried using aggregate_sum() at least to sum all variables, but "var_imp" isn't a FeatureCollection so it doesn't work.
You can work directly with the dictionary. Extract a list of the values and reduce it with a sum reducer to get the total importance. Then you can map over the importance dictionary and calculate the percentage of each band.
For future questions, please include a link to the code editor (use Get Link) and make sure all used assets are shared. It makes it easier to help you, increasing your chance of getting answers to your questions.
var importance = ee.Dictionary(
  classifier.explain().get('importance')
)

var totalImportance = importance.values().reduce(ee.Reducer.sum())

var importancePercentage = importance.map(function (band, importance) {
  return ee.Number(importance).divide(totalImportance).multiply(100)
})
https://code.earthengine.google.com/bd63aa319a37516d924a6d8c391ab076
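Outside GEE, the same percentage normalization can be sketched in plain Python (the importance values below are hypothetical, standing in for the dictionary returned by explain()):

```python
# Hypothetical raw importance scores, standing in for the GEE dictionary.
importance = {"B2": 12.0, "B3": 8.0, "B4": 20.0}

# Divide each band's importance by the total and scale to percent.
total = sum(importance.values())
importance_pct = {band: value / total * 100 for band, value in importance.items()}
```

The resulting percentages sum to 100, giving each band's share of the total importance.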
Recently I have been implementing an algorithm from a paper that I will be using in my master's work, but I've run into problems with the time some operations take.
Before I get into details, I should add that my data set comprises roughly 4 million data points.
I have two lists of tuples that I get from a framework (Annoy) that calculates the cosine similarity between a vector and every other vector in the dataset. The final format is like this:
[(name1, cosine), (name2, cosine), ...]
Because of the algorithm, I have two such lists with the same names (first value of the tuple) but two different cosine similarities. What I have to do is sum the cosines from both lists, then order the result and take the top-N highest cosine values.
My issue is: it is taking too long. My current implementation is as follows:
from collections import defaultdict

def topN(self, user, session):
    upref = self.m2vTN.get_user_preference(user)
    spref = self.sm2vTN.get_user_preference(session)
    # list of tuples 1
    most_su = self.indexer.most_similar(upref, len(self.m2v.wv.vocab))
    # list of tuples 2
    most_ss = self.indexer.most_similar(spref, len(self.m2v.wv.vocab))
    # concat both lists and sum the cosines per name in a dict
    d = defaultdict(int)
    for l, v in (most_ss + most_su):
        d[l] += v
    # convert the dict into a list, and then sort it
    _list = list(d.items())
    _list.sort(key=lambda x: x[1], reverse=True)
    return [x[0] for x in _list[:self.N]]
How do I make this code faster? I've tried using threads but I'm not sure if it will make it faster. Getting the lists is not the problem here, but the concatenation and sorting is.
Thanks! English is not my native language, so sorry for any misspelling.
What do you mean by "too long"? How large are the two lists? Is there a chance your model, and interim results, are larger than RAM and thus forcing virtual-memory paging (which would create frustrating slowness)?
If you are in fact getting the cosine similarity with all vectors in the model, the Annoy indexer isn't helping any. (Its purpose is to get a small subset of nearest neighbors much faster, at the expense of perfect accuracy. But if you're calculating the similarity to every candidate, there's no speedup or advantage to using Annoy.)
Further, if you're going to combine all of the distances from two such calculations, there's no need for the sorting that most_similar() usually does - it just makes combining the values more complex later. For the gensim vector models, you can supply a False-ish topn value to get the unsorted distances to all model vectors, in order. Then you'd have two large arrays of distances, in the model's same native order, which are easy to add together elementwise. For example:
udists = self.m2v.most_similar(positive=[upref], topn=False)
sdists = self.m2v.most_similar(positive=[spref], topn=False)
combined_dists = udists + sdists
The combined_dists aren't labeled, but will be in the same order as self.m2v.index2entity. You could then sort them, in a manner similar to what the most_similar() method itself does, to find the ranked closest. See for example the gensim source code for that part of most_similar():
https://github.com/RaRe-Technologies/gensim/blob/9819ce828b9ed7952f5d96cbb12fd06bbf5de3a3/gensim/models/keyedvectors.py#L557
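That ranking step can be sketched with NumPy's argsort; the similarity scores and labels below are hypothetical, standing in for combined_dists and self.m2v.index2entity:

```python
import numpy as np

# Hypothetical combined similarity scores (higher = more similar) and the
# parallel list of labels, standing in for self.m2v.index2entity.
combined_dists = np.array([0.9, 1.7, 0.2, 1.1])
index2entity = ["a", "b", "c", "d"]

N = 2
# argsort sorts ascending; reverse to rank the most similar first.
best = np.argsort(combined_dists)[::-1][:N]
top_names = [index2entity[i] for i in best]
```

This avoids the per-name dictionary entirely: a single argsort over the summed array replaces the concatenate-accumulate-sort pipeline.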
Finally, you might not need to be doing this calculation yourself at all. You can provide more-than-one vector to most_similar() as the positive target, and then it will return the vectors closest to the average of both vectors. For example:
sims = self.m2v.most_similar(positive=[upref, spref], topn=len(self.m2v))
This won't be the same value/ranking as your other sum, but may behave very similarly. (If you wanted less-than-all of the similarities, then it might make sense to use the ANNOY indexer this way, as well.)
I have a character feature of weather condition, e.g. rain, snow, etc.
I'd like to feed the feature to a random forest; what kind of transformation can I apply to turn it into a numeric one?
Thanks.
You can convert a categorical variable into a number by turning the single attribute into n attributes where n is the number of digits necessary to represent the total number of options in binary.
For example, if I have an attribute [weather] that can take the values "rain", "sun", and "snow", then you could instead create 2 dummy attributes [weather1] and [weather0]. The reason you can do this with 2 dummy attributes is that 3 can be represented in binary with 2 digits: 11.
Then instead of using "rain" you would represent the category as a binary value across the two dummy attributes:
"rain" is first so it would be 01 in binary so that feature would have a 0 for [weather1] and a 1 for [weather0]. "sun" is second so you would represent it as 10 and "snow" is third so you could represent it as 11. The order isn't important so long as it's consistent across your variables.
If we think of these values as Python dictionaries, the example becomes clearer: feature["weather"] = "rain" becomes new_feature["weather"] = [0, 1], or equivalently new_feature["weather0"] = 1 and new_feature["weather1"] = 0.
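A small Python sketch of this scheme (the category list is hypothetical, and categories are numbered from 1 as in the description above):

```python
import math

# Hypothetical category list; codes run from 1 ("rain") to 3 ("snow").
categories = ["rain", "sun", "snow"]
# Enough binary digits to represent codes 1..3.
n_bits = math.ceil(math.log2(len(categories) + 1))

def binary_encode(value):
    code = categories.index(value) + 1  # "rain" -> 1, "sun" -> 2, "snow" -> 3
    # Most-significant bit first: [weather1, weather0] for 2 bits.
    return [(code >> bit) & 1 for bit in range(n_bits - 1, -1, -1)]
```

With this ordering, "rain" encodes to [0, 1], "sun" to [1, 0], and "snow" to [1, 1]; as noted, any consistent ordering works.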
You shouldn't need to. The weather condition is a categorical variable, which many random forest implementations handle natively. Leave it as it is and let the algorithm work as it should.
I have an sav file with plenty of variables. What I would like to do now is create macros/routines that detect basic properties of a range of item sets, using SPSS syntax.
COMPUTE scale_vars_01 = v_28 TO v_240.
The code above is intended to define a range of items which I would like to observe in further detail. How can I get the number of elements in the "array" scale_vars_01, as an integer?
Thanks for info. (as you see, the SPSS syntax is still kind of strange to me and I am thinking about using Python instead, but that might be too much overhead for my relatively simple purposes).
One way is to use COUNT, such as:
COUNT Total = v_28 TO v_240 (LO THRU HI).
This will count all of the valid values in the vector. This will not work if the vector contains mixed types (e.g. string and numeric) or if the vector has missing values. An inefficient way to get the entire count using DO REPEAT is below:
DO IF $casenum = 1.
COMPUTE Total = 0.
DO REPEAT V = v_28 TO v_240.
COMPUTE Total = Total + 1.
END REPEAT.
ELSE.
COMPUTE Total = LAG(Total).
END IF.
This will work for mixed-type variables, and will count fields with missing values. (The DO IF would work the same with COUNT; it forces a data pass, but for large datasets and long variable lists it only evaluates for the first case.)
Python is probably the most efficient way to do this though - and I see no reason not to use it if you are familiar with it.
BEGIN PROGRAM.
import spss

beg = 'X1'
end = 'X10'
MyVars = []
for i in range(spss.GetVariableCount()):
    x = spss.GetVariableName(i)
    MyVars.append(x)
# named n_vars to avoid shadowing the built-in len()
n_vars = MyVars.index(end) - MyVars.index(beg) + 1
print(n_vars)
END PROGRAM.
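The core logic can be sketched in plain Python, with a hypothetical name list standing in for the spss.GetVariableName() calls:

```python
# Hypothetical variable names, standing in for what
# spss.GetVariableName(i) would return for each index in file order.
all_vars = ["age", "v_28", "v_29", "v_30", "v_240", "income"]

def range_length(names, beg, end):
    # Number of variables from beg to end inclusive, in file order,
    # mirroring what v_28 TO v_240 spans in SPSS syntax.
    return names.index(end) - names.index(beg) + 1
```

For the list above, range_length(all_vars, "v_28", "v_240") counts the four variables in the span.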
Statistics has a built-in macro facility that could be used to define sets of variables, but the Python APIs provide much more powerful ways to access and use the metadata. And there is an extension command, SPSSINC SELECT VARIABLES, that can define macros based on variable metadata such as patterns in names, measurement level, type, and other properties. It generates a macro listing these variables that can then be used in standard syntax.
I want to write a multiplayer game for the iOS platform. The game relies on dynamically generated random numbers to decide what happens next. But since it is a multiplayer game, the "random number" should be the same on every device, for every player, to keep the gameplay consistent.
Therefore I need a reliable pseudorandom number generator: if I seed it with the same number, it should generate the same sequence of random numbers on every device (iPad/iPhone/iPod Touch) and every OS version.
It looks like srand and rand will do the job, but I am not sure whether rand is guaranteed to generate the same numbers on all devices across all OS versions. If not, is there a good pseudorandom number generation algorithm?
From the C standard (and Objective C is a thin layer on top of C so this should still hold):
If srand is then called with the same seed value, the sequence of pseudo-random numbers shall be repeated.
There's no guarantee that different implementations (or even different versions of the same implementation) will give a consistent sequence based on the seed. If you really want to guarantee that, you can code up your own linear congruential generator, such as the example one in the standard itself:
// RAND_MAX assumed to be 32767.
static unsigned long int next = 1;

void srand(unsigned int seed) { next = seed; }

int rand(void) {
    next = next * 1103515245 + 12345;
    return (unsigned int)(next / 65536) % 32768;
}
And, despite the fact that there are better generators around, the simple linear congruential one is generally more than adequate, unless you're a statistician or cryptographer.
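To illustrate the point about reproducibility, here is a sketch of the same generator ported to Python, using the constants from the C example above (the 32-bit mask mimics unsigned overflow and keeps the sequence identical on any platform):

```python
class LCG:
    """Port of the C standard's example rand()/srand() (RAND_MAX = 32767)."""

    def __init__(self, seed=1):
        self.next = seed

    def rand(self):
        # Same constants as the C example; mask to 32 bits to mimic
        # unsigned wrap-around arithmetic.
        self.next = (self.next * 1103515245 + 12345) & 0xFFFFFFFF
        return (self.next // 65536) % 32768

# The same seed always yields the same sequence, regardless of device or OS.
a, b = LCG(42), LCG(42)
assert [a.rand() for _ in range(5)] == [b.rand() for _ in range(5)]
```

Because every operation is defined exactly, two devices running this code with the same seed will always agree, which is precisely the cross-device guarantee the platform rand() does not give you.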
If you provide the same seed value to srand, then rand will consistently produce the same sequence of pseudorandom numbers on a given implementation, but not necessarily across different devices or OS versions. You can also try arc4random(), but note that it cannot be seeded, so it will not give you a reproducible sequence.