F# Set.union argument order performance

Is there any recommended way to call a Set.union if I know one of the two sets will be larger than the other?
Set.union large small
or
Set.union small large
Thanks

Internally, sets are represented as balanced trees (you can check the source online). When calculating the union of two sets, the algorithm splits the smaller set (tree), based on the value at the root of the larger set (tree), into a set of smaller elements and a set of larger elements. The splitting is always performed on the smaller set to do less work. Then it recursively unions the left and right subsets and performs some re-balancing.
In summary, the algorithm does not depend on which set is passed as the first argument and which as the second. It will always choose the better option based on the size of each set (which is stored as part of the data structure).
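For illustration, a quick F# check (the set sizes here are arbitrary) that both argument orders produce the same result, with the splitting decision left entirely to the implementation:

// Build one large and one small set (sizes chosen arbitrarily for the example).
let large = Set.ofList [ 1 .. 1000000 ]
let small = Set.ofList [ 999990 .. 1000010 ]

// Both calls produce the same set; internally the smaller tree is split
// against the root of the larger one, regardless of argument order.
let u1 = Set.union large small
let u2 = Set.union small large

printfn "%b" (u1 = u2)   // true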

The intent behind your question seems to be to improve performance when using Set.union by exploiting undocumented details of the function's implementation. But Set.union abstracts you away from that implementation complexity, leaving just the set-theoretic meaning of the union operation, which is agnostic to the properties of its arguments. Purposely breaking through this abstraction layer adversely affects the complexity and maintainability of your code and should be avoided.
Although sometimes you have no choice but to deal with leaky abstractions, Set.union is definitely not such a case. And it is good to hear from Tomas that the Set.union implementation does not have leaky-abstraction flaws.

Do whatever you want. You can also write small + large for the union, and large - small for the difference (and, of course, small - large).
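These operators are defined on F# sets directly; a quick sketch:

let small = Set.ofList [ 1; 2; 3 ]
let large = Set.ofList [ 2; 3; 4; 5; 6 ]

let union = small + large   // same as Set.union small large
let diff1 = large - small   // elements of large not in small
let diff2 = small - large   // elements of small not in large

printfn "%A %A %A" union diff1 diff2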

Related

How does Word2Vec ensure that antonyms will be far apart in the vector space

Broadly speaking the training of word2vec is a process in which words that are often in the same context are clustered together in the vector space.
We start by randomly initializing the word vectors, and with each iteration more and more clusters form.
I think I understood this, but how can we ensure that words that are antonyms, or that rarely appear in the same context, don't end up in clusters that are close by? Also, how can we know that words that are more irrelevant end up farther away than words that are less irrelevant?
We can't. That is the problem with word2vec. We cannot differentiate between a word and its negation, or between synonyms and antonyms, because those words, as you said, often appear in the same context.
To elaborate somewhat on Novak's response:
You seem to regard word2vec as a tool to evaluate semantic meaning. Although much of the result is correlated with meaning, that is not the functionality of word2vec. Rather, it indicates contextual correlation, which is (somewhat loosely) regarded as "relevance".
When this "relevance" is applied to certain problems, especially when multiple "relevance" hits are required to support a reportable result, then the overall effect is often useful to the problem at hand.
For your case, note that a word and its antonym will often appear near one another, for literary contrast or other emphasis. As such, they are contextually quite relevant to one another. Unless you have some pre-processing that can identify and appropriately alter various forms of negation, you will see such pairs often in your vectorization -- as is appropriate to the tool.
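To make the "contextual correlation" point concrete, here is a toy F# sketch. It uses raw same-sentence co-occurrence counts rather than word2vec itself, and a made-up four-sentence corpus, but it shows how an antonym pair ("hot"/"cold") ends up with near-identical context vectors simply because the two words share their neighbours:

// Made-up toy corpus; "hot" and "cold" occur in near-identical contexts.
let corpus =
    [ "the water is hot today"
      "the water is cold today"
      "the soup is hot again"
      "the soup is cold again" ]
    |> List.map (fun s -> s.Split [| ' ' |])

let vocab = corpus |> Seq.collect id |> Seq.distinct |> Seq.toArray

// Context vector of a word: how often each vocabulary word appears
// in the same sentence as it.
let contextVector word =
    vocab
    |> Array.map (fun v ->
        corpus
        |> List.filter (Array.contains word)
        |> List.sumBy (fun sent ->
            sent |> Array.filter (fun w -> w = v && v <> word) |> Array.length)
        |> float)

let cosine (a: float[]) (b: float[]) =
    let dot = Array.map2 (*) a b |> Array.sum
    let norm xs = sqrt (Array.sumBy (fun x -> x * x) xs)
    dot / (norm a * norm b)

// Prints 1.0 for this toy corpus: identical contexts, despite being antonyms.
printfn "%f" (cosine (contextVector "hot") (contextVector "cold"))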

Incorporating Transition Probabilities in SARSA

I am implementing a SARSA(lambda) model in C++ to overcome some of the limitations of DP models (the sheer amount of time and space they require), which hopefully will reduce the computation time (it currently takes quite a few hours for similar research), and the smaller space requirement will allow adding more complexity to the model.
We do have explicit transition probabilities, and they do make a difference. So how should we incorporate them in a SARSA model?
Simply select the next state according to the probabilities themselves? Apparently SARSA models don't exactly expect you to use probabilities - or perhaps I've been reading the wrong books.
PS- Is there a way of knowing if the algorithm is properly implemented? First time working with SARSA.
The fundamental difference between Dynamic Programming (DP) and Reinforcement Learning (RL) is that the former assumes that the environment's dynamics are known (i.e., a model), while the latter can learn directly from data obtained from the process, in the form of a set of samples, a set of process trajectories, or a single trajectory. Because of this feature, RL methods are useful when a model is difficult or costly to construct. However, it should be noted that both approaches share the same working principles (called Generalized Policy Iteration in Sutton's book).
Given they are similar, both approaches also share some limitations, namely, the curse of dimensionality. From Busoniu's book (chapter 3 is free and probably useful for your purposes):
A central challenge in the DP and RL fields is that, in their original form (i.e., tabular form), DP and RL algorithms cannot be implemented for general problems. They can only be implemented when the state and action spaces consist of a finite number of discrete elements, because (among other reasons) they require the exact representation of value functions or policies, which is generally impossible for state spaces with an infinite number of elements (or too costly when the number of states is very high).
Even when the states and actions take finitely many values, the cost of representing value functions and policies grows exponentially with the number of state variables (and action variables, for Q-functions). This problem is called the curse of dimensionality, and makes the classical DP and RL algorithms impractical when there are many state and action variables. To cope with these problems, versions of the classical algorithms that approximately represent value functions and/or policies must be used. Since most problems of practical interest have large or continuous state and action spaces, approximation is essential in DP and RL.
In your case, it seems quite clear that you should employ some kind of function approximation. However, given that you know the transition probability matrix, you can choose a method based on either DP or RL. In the case of RL, the transitions are simply used to compute the next state given an action (a small sketch of this appears at the end of this answer).
Is it better to use DP or RL? Actually, I don't know the answer, and the optimal method likely depends on your specific problem. Intuitively, sampling a set of states in a planned way (DP) seems safer, but maybe a big part of your state space is irrelevant to finding an optimal policy. In such a case, sampling a set of trajectories (RL) may be computationally more effective. In any case, if both methods are applied correctly, they should achieve a similar solution.
NOTE: when employing function approximation, the convergence properties are more fragile and it is not rare to diverge during the iteration process, especially when a non-linear approximator (such as an artificial neural network) is combined with RL.
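As a concrete and deliberately minimal sketch of "use the transitions to compute the next state": below is one-step tabular SARSA in F# (no eligibility traces, and the state/action counts, rewards and transition matrix are all made up), where the known probabilities are used only to sample the successor state:

// Made-up problem sizes; replace with your own model.
let rng = System.Random 0
let nStates, nActions = 5, 2
let gamma, alpha, epsilon = 0.95, 0.1, 0.1

// P.[s].[a] is the (given) probability distribution over next states.
let P =
    Array.init nStates (fun _ ->
        Array.init nActions (fun _ ->
            let w = Array.init nStates (fun _ -> rng.NextDouble())
            let z = Array.sum w
            w |> Array.map (fun x -> x / z)))

// R.[s].[a] is the expected immediate reward (also made up here).
let R = Array.init nStates (fun _ -> Array.init nActions (fun _ -> rng.NextDouble()))

let Q = Array2D.create nStates nActions 0.0

// Sample the successor state from the known transition probabilities.
let sampleNext s a =
    let u = rng.NextDouble()
    let cdf = Array.scan (+) 0.0 P.[s].[a] |> Array.tail
    match cdf |> Array.tryFindIndex (fun c -> u <= c) with
    | Some i -> i
    | None -> nStates - 1

// Epsilon-greedy action selection.
let chooseAction s =
    if rng.NextDouble() < epsilon then rng.Next nActions
    else [| 0 .. nActions - 1 |] |> Array.maxBy (fun a -> Q.[s, a])

// One episode of the standard SARSA update.
let runEpisode steps =
    let mutable s = rng.Next nStates
    let mutable a = chooseAction s
    for _ in 1 .. steps do
        let r  = R.[s].[a]
        let s' = sampleNext s a                 // the only place the model is used
        let a' = chooseAction s'
        Q.[s, a] <- Q.[s, a] + alpha * (r + gamma * Q.[s', a'] - Q.[s, a])
        s <- s'
        a <- a'

for _ in 1 .. 1000 do runEpisode 50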
If you have access to the transition probabilities, I would suggest not using methods based on Q-values, since they require additional sampling in order to extract information that you already have.
It may not always be the case, but without additional information I would say that modified policy iteration is a more appropriate method for your problem.
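For comparison, here is a compact, self-contained F# sketch of that model-based route (again with made-up sizes and a randomly generated model): a few sweeps of policy evaluation using expected backups over the known transition probabilities, followed by greedy policy improvement, i.e. modified policy iteration with no sampling at all:

// Made-up sizes and a random model, purely for illustration.
let rng = System.Random 1
let nStates, nActions, gamma = 4, 2, 0.95

let P =
    Array.init nStates (fun _ ->
        Array.init nActions (fun _ ->
            let w = Array.init nStates (fun _ -> rng.NextDouble())
            let z = Array.sum w
            w |> Array.map (fun x -> x / z)))
let R = Array.init nStates (fun _ -> Array.init nActions (fun _ -> rng.NextDouble()))

// Expected (Bellman) backup: no sampling, the model is used directly.
let expectedBackup (V: float[]) s a =
    R.[s].[a] + gamma * (Array.map2 (*) P.[s].[a] V |> Array.sum)

let V = Array.create nStates 0.0
let policy = Array.create nStates 0

for _ in 1 .. 100 do
    // Partial policy evaluation: a handful of expected backups.
    for _ in 1 .. 5 do
        let newV = Array.init nStates (fun s -> expectedBackup V s policy.[s])
        Array.blit newV 0 V 0 nStates
    // Greedy improvement using the same known model.
    for s in 0 .. nStates - 1 do
        policy.[s] <- [| 0 .. nActions - 1 |] |> Array.maxBy (expectedBackup V s)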

choose cluster in hierarchical clustering

How can I choose a cluster if a point is at the same distance from two different points?
Here, X1 is at the same distance from X2 and X3. Can I directly make a cluster of X1/X2/X3, or should I go one by one: X1/X2 first and then X1/X2/X3?
In general, you should always follow the rule of merging two at a time if you want to keep all the typical properties of hierarchical clustering (such as a uniform meaning for each "slice through" the tree; if you start merging many points in one step, you will get an "unbalanced" structure, and the height in the clustering tree will mean different things in different places). Furthermore, merging three at once really only makes sense for min linkage: if you use average linkage or other, more complex rules, it is not even true that after merging two points the third one will be the next to be added (it might even end up in a different cluster). That said, clustering of this type (greedy) is just a heuristic with some particular properties, so altering it a bit gives you yet another clustering with some other properties. Saying which one is "correct" is impossible - they are both wrong to some extent; what matters is your exact usage later on.
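To make the min-linkage case concrete, here is a tiny F# sketch using the distances implied by the question (d(X1,X2) = d(X1,X3) = 1, d(X2,X3) = 2, all made up): the tie is broken arbitrarily, X1/X2 are merged first, and the single-linkage distance from {X1, X2} to X3 is then still 1, so X3 joins at the same height anyway:

// Toy distances matching the question: X1 is equally close to X2 and X3.
let dist = dict [ ("X1", "X2"), 1.0; ("X1", "X3"), 1.0; ("X2", "X3"), 2.0 ]
let d a b = if dist.ContainsKey ((a, b)) then dist.[(a, b)] else dist.[(b, a)]

// Single (min) linkage between two clusters.
let linkage c1 c2 =
    [ for a in c1 do for b in c2 do yield d a b ] |> List.min

// Greedy agglomeration: always merge the two closest clusters.
let rec agglomerate (clusters: string list list) =
    if List.length clusters > 1 then
        let (i, j, h) =
            [ for i in 0 .. clusters.Length - 1 do
                for j in i + 1 .. clusters.Length - 1 do
                  yield (i, j, linkage clusters.[i] clusters.[j]) ]
            |> List.minBy (fun (_, _, h) -> h)
        let merged = clusters.[i] @ clusters.[j]
        printfn "merge %A at height %f" merged h
        let rest =
            clusters
            |> List.mapi (fun k c -> (k, c))
            |> List.filter (fun (k, _) -> k <> i && k <> j)
            |> List.map snd
        agglomerate (merged :: rest)

agglomerate [ ["X1"]; ["X2"]; ["X3"] ]
// merge ["X1"; "X2"] at height 1.000000        (tie broken by list order)
// merge ["X1"; "X2"; "X3"] at height 1.000000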

Genetic Algorithm, large population vs small one

I'm wondering if there is a general rule of thumb for population sizing. I've read in a book that 2x the chromosome length is a good starting point. Am I correct in assuming, then, that if I had an equation with 5 variables, I should have a population of 10?
I'm also wondering if the following is correct:
Larger Population Size.
Pros:
Larger diversity so more likely to pick up on traits which return a good fitness.
Cons:
Requires longer to process.
vs
Smaller Population Size.
Pros:
Larger number of generations experienced per unit time.
Cons:
Mutation will have to be more prominent in order to compensate for smaller population??
EDIT
A little additional info: say I have an equation which has 5 unknown parameters. For each parameter I have anywhere between 10-50 values I would like to try assigning. So for example
variable1 = 20 different values
variable2 = 15 different values
...
I thought a GA would be a decent approach to such a problem as the search space is quite large, i.e. the worst case for the above would be 312,500,000 combinations (unless I have screwed up?): n^k where n = 50 and k = 5 => 50 * 50 * 50 * 50 * 50.
Unfortunately the number of parameters/range of values to check can vary a lot, so I was looking for some sort of rule of thumb as to how large I should set the population.
Thanks for your help + if there is any more info you need/prefer to discuss in one of the chatrooms, just give me a shout.
I'm not sure where you read that 2x the chromosome length is a good starting point, but I'm guessing it's a book that concentrated on larger problems.
If you only have five variables, a genetic algorithm is probably not the right choice for converging upon a solution. With a chromosome length of five you're probably going to find that you very quickly reach a non-deterministic (it will change in subsequent runs) local minimum and slowly iterate around that space until you find the true local minimum.
However, if you are insistent on using a GA I would suggest abandoning that rule of thumb for this problem and really think about starting population as a measure of how far from the final solution you expect a random solution to be.
The reason that many rules of thumb depend on chromosome length is that it's a decent proxy for this: if I have a hundred variables, a randomly generated DNA sequence is going to be further from ideal than if I had only one variable.
Additionally, if you're worried about computation intensity I'm going to go ahead and say that it shouldn't be an issue since you're dealing with such a small solution set. I think a better rule of thumb for smaller sets like this would be along the lines of:
(ln(chromosome_length*(solution_space/granularity)/mutation_rate))^2
Probably with a constant thrown in to scale for the particular problem.
It's definitely not a great rule of thumb (no rule is) but here's my logic for it:
Chromosome length is just a proxy for size of solution space, so taking into account the size of the solution space will necessarily increase the accuracy of this proxy
A smaller mutation rate necessitates a larger population size to compensate for the fact that you are more prone to get caught in local minima
Any rule of thumb should scale logarithmically since a genetic algorithm is akin to a tree search of your solution space.
The squared term was mostly the result of trying this out, but it looks like the logarithmic scaling was a little aggressive, though the general shape seemed right.
However I think a better choice would be to start at a reasonable number (100) and try iterating up and down until you find a population size that seems to balance accuracy with execution speed.
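If it helps, that heuristic written out as a function (F# here; every argument value below is a placeholder to tune per problem, and the trailing scale factor is the "constant thrown in"):

// (ln(chromosome_length * (solution_space / granularity) / mutation_rate))^2, scaled.
let populationSize chromosomeLength solutionSpace granularity mutationRate scale =
    let x = chromosomeLength * (solutionSpace / granularity) / mutationRate
    scale * (log x) ** 2.0 |> ceil |> int

// Example with the question's rough numbers: 5 variables, ~50 values each, 1% mutation.
printfn "%d" (populationSize 5.0 (50.0 ** 5.0) 1.0 0.01 1.0)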
As with most genetic algorithm parameters, population size is highly dependent on the problem. There are certain factors that can help point in the direction of whether you should have a large or small population size, but a lot of the time testing different values against a known solution before running it on your problem is a good idea (if this is possible, of course).
A population size of 10 does seem rather small though. You say you have an equation with five variables. Is your problem represented by a chromosome of 5 values? It seems small for a chromosome and if this is the case it's likely that using a genetic algorithm may not be the best way to solve the problem. Perhaps if you give a bit more detail on your problem and how you are representing it people may have a better idea of how to advise you.
I'd also add that your cons for large and small population sizes aren't exactly correct. A larger population size does take longer to process than a small one, but since it often needs fewer generations to solve the problem, the overall processing time isn't necessarily longer. Again, it's highly dependent on the problem. With a smaller population size, mutation shouldn't have to be more prominent. Mutation is generally used to stop the genetic algorithm from becoming stuck in a local maximum and should usually be a very small value. A small population is more likely to become stuck in a local maximum, but if you have a mutation value which is too high you may be nullifying the natural improvement of the genetic algorithm.

Why are the elements of the matrix and vector types in the F# Powerpack mutable?

F# is often promoted as a functional language where data is immutable by default; however, the elements of the matrix and vector types in the F# PowerPack are mutable. Why is this?
Furthermore, why are sparse matrices implemented as immutable, as opposed to normal (dense) matrices?
The standard array type ('T[]) in F# is also mutable. You're mostly correct -- F# is a functional language where data immutability is encouraged, but not required. Basically, F# allows you to write both mutable/imperative code and immutable/functional code; it's up to you to decide the best way to implement the code for your specific application.
Another reason for having mutable arrays and matrices is performance -- it is possible to implement very fast algorithms with immutable types, but users writing scientific computations usually only care about one thing: achieving maximum performance. That being the case, it follows that the arrays and matrices should be mutable.
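A quick sketch of what that mutability buys you, using the standard array and Array2D types (the PowerPack matrix is updated through its indexer in the same spirit), contrasted with the immutable Set:

// Standard F# arrays are mutable: elements can be overwritten in place.
let xs = [| 1.0; 2.0; 3.0 |]
xs.[0] <- 10.0                       // no new array is allocated

// A 2D array: the same in-place update style dense matrices rely on.
let m = Array2D.init 2 2 (fun i j -> float (i + j))
m.[1, 1] <- 42.0

// An F# set, by contrast, is immutable: "adding" returns a new set.
let s  = Set.ofList [ 1; 2; 3 ]
let s' = Set.add 4 s                 // s itself is unchanged

printfn "%A %A %A" xs m s'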
For truly high performance, mutability is required in one specific case: provided that your code is perfectly optimized and that you understand everything it is doing, down to your program's cache (L1, L2) access patterns, nothing beats a low-level, to-the-metal approach.
This mostly happens when you have one well-specified problem that stays constant for 20 years, i.e. mostly in scientific tasks.
As soon as you depart from this specific case, in 99.99% of cases the bottlenecks arise from having too low-level a representation (induced by a low-level language) in which you can't express the final, real-world optimization trade-offs of the problem at hand.
Bottom line: for performance, the following approach is the only way (I think):
High level / algorithmic optimization first
Once every high-level avenue has been explored, low-level optimization
You can see how, as a consequence of that:
You should never optimize anything without FIRST measuring the impact: improvements should only be made if they yield enormous performance gains and/or do not degrade your domain logic.
You will eventually reach, if your problem is stable and well defined, the point where you have no choice but to go low level and play with memory/mutability.
