Longest Huffman codeword

I'm not quite sure how to determine the longest possible codeword length under Huffman encoding for a specific set of frequencies.
Any ideas?

For a specific set of frequencies? Generate your tree, then see how tall it is... If you're talking about the general case, the worst case is N - 1 for N symbols.
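To make "generate your tree and see how tall it is" concrete, here is a minimal Python sketch (the frequency dict is just an example; note that ties can produce different valid Huffman trees with different maximum depths):

    # Build a Huffman tree with a heap and report the deepest leaf,
    # i.e. the length of the longest codeword for these frequencies.
    import heapq
    import itertools

    def longest_codeword_length(freqs):
        """freqs: dict mapping symbol -> frequency (or probability)."""
        tiebreak = itertools.count()  # keeps tuple comparison well-defined on equal weights
        # each heap entry: (weight, tiebreak, depth of the deepest leaf in this subtree)
        heap = [(w, next(tiebreak), 0) for w in freqs.values()]
        heapq.heapify(heap)
        if len(heap) == 1:
            return 1  # a single symbol still needs one bit
        while len(heap) > 1:
            w1, _, d1 = heapq.heappop(heap)
            w2, _, d2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next(tiebreak), max(d1, d2) + 1))
        return heap[0][2]

    # Fibonacci-like frequencies are the classic worst case: 5 symbols -> length 4
    print(longest_codeword_length({'a': 1, 'b': 1, 'c': 2, 'd': 3, 'e': 5}))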

Abu-Mostafa and McEliece provide an answer in terms of the probability of the least frequent symbol. The paper also has references to related work on similar questions.
http://tmo.jpl.nasa.gov/progress_report/42-110/110N.PDF
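As a rough paraphrase of that result (the exact statement and constants are in the paper): the longest codeword length is governed by the probability of the least frequent symbol, because the slowest the merged node weights can grow is along the Fibonacci sequence, so approximately

    % order-of-magnitude statement only, not the exact bound from the paper
    L_{\max} \;\lesssim\; \log_{\varphi}\!\frac{1}{p_{\min}},
    \qquad \varphi = \frac{1+\sqrt{5}}{2}.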

Related

Find translations of a given word in the corpus e.g. by machine learning, word2vec, text mining

I am using this thread to get some ideas and find some possibilities.
I have about 1000 sermons and their translations into another language. The lengths of the sermons differ. These are religious sermon texts. Because of the domain (religious), there are a lot of words that can be used in different ways depending on the context; the same word can take on a different meaning.
Is there a way to programmatically get the translations of a given word in the target language?
x1 -> [y2,z2,a2,b2,c2]
where x1 is a word in language 1
and the returned array contains its translations in language 2.
This would be the best case. Maybe this could be possible by training a translation model on domain data, but I don't have a lot of data.
Could it be possible using word2vec? By creating a vector space for each of the two texts (language 1 and language 2) and using a transformation matrix, would it be possible to map the semantic meanings onto each other?
Do you know other ways or have other ideas? Does such work already exist, and what is this kind of research called? I was not able to find anything like this. I hope you have some ideas on how this could be achieved.
The general purpose is to create a tool for researchers in this specific domain that can be used to analyse the quality of sermon translations. If you have other ideas for how the quality of a translation can be analysed semantically, I would be very thankful.
To get the translation for a specific word in a sentence, you can use what’s called word alignment.
To get the quality of the translation, you can use what’s called quality estimation.
machinetranslate.org/quality-estimation
A solution based on word vectors (FastText vectors are typically better than Word2Vec) is certainly possible. The task you are looking for is called bilingual dictionary induction. The most frequently used tool for that is VecMap, which can align two embedding spaces from two languages. It either uses a small seed dictionary to align all the words, or it can even work in a completely unsupervised fashion.
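To illustrate the idea behind that kind of mapping, here is a bare-bones numpy sketch of an orthogonal Procrustes fit on a seed dictionary (this is not VecMap itself, and all names are placeholders):

    import numpy as np

    def fit_orthogonal_map(X, Y):
        """X, Y: (n, d) arrays of embeddings for seed-dictionary pairs x_i -> y_i."""
        # orthogonal Procrustes: minimise ||XW - Y|| over orthogonal W
        U, _, Vt = np.linalg.svd(X.T @ Y)
        return U @ Vt

    def translate(src_vec, W, trg_words, trg_matrix, k=5):
        """Return the k target words whose embeddings are closest to the mapped vector."""
        mapped = src_vec @ W
        sims = (trg_matrix @ mapped) / (
            np.linalg.norm(trg_matrix, axis=1) * np.linalg.norm(mapped) + 1e-9)
        return [trg_words[i] for i in np.argsort(-sims)[:k]]

Roughly speaking, VecMap adds normalisation, iterative self-learning and an unsupervised initialisation on top of this basic mapping step.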
Another solution is doing word alignment, i.e., statistically aligning words in the translations. Then you can get a dictionary based on the frequencies of how often the words are mapped to each other (note there might be problems when the languages differ morphologically). In this case, you can easily show examples of how the translations are used in sentences. If the languages you are interested in are covered by the XLM-R model, I recommend using SimAlign (a neural solution). If not, you can use Eflomal (a statistical solution).
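As a sketch of the dictionary-from-alignments idea: the input format below is the usual Pharaoh "i-j" output that Eflomal produces (SimAlign gives you the same index pairs directly in Python), and "faith" is just a hypothetical example word.

    from collections import Counter, defaultdict

    def induce_dictionary(src_sentences, trg_sentences, alignment_lines):
        """src/trg_sentences: lists of token lists; alignment_lines: '0-0 1-2 ...' per pair."""
        counts = defaultdict(Counter)
        for src, trg, line in zip(src_sentences, trg_sentences, alignment_lines):
            for link in line.split():
                i, j = map(int, link.split("-"))
                counts[src[i].lower()][trg[j].lower()] += 1
        return counts

    # counts["faith"].most_common(5) would then give the five target-language
    # words most frequently aligned to "faith" across the whole corpus.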

How to calculate which word approximately fits best given a context and possible words?

Unfortunately, I can't find anything that helps me with my problem.
I have a sentence like
if the age of the applicant is higher than 18, then ...
and a list of words like
higher, bigger, greater, wider
which are all quite synonymous, because they all say that something is greater in some sense.
Now I want to find out which of the given words fits best at the predefined position in the sentence.
The best-fitting word in this example would be 'greater', but 'higher', for example, would also be fine. In my specific case, I want to show an error message if someone writes 'wider', because it doesn't make sense in this semantic context.
So I want to look at the keyword, which is unambiguous in this example, and at the given possible words, like the four words I mentioned above. Then I want to calculate which of the possible words fits approximately best in place of the keyword in this semantic context.
I don't think there is a single, simple answer to this. But as a starting point you could check out Continuous Bag-of-Words (CBOW) word embeddings, which aim at predicting a word given its context.
As an example of how to implement it, you could refer to the TensorFlow Word2Vec CBOW model and the original Word2Vec code archive: https://code.google.com/archive/p/word2vec/
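A rough sketch of that idea with gensim's CBOW Word2Vec (sg=0); the tiny corpus below is only a placeholder, so the ranking only becomes meaningful once you train on a large domain corpus:

    import numpy as np
    from gensim.models import Word2Vec

    sentences = [
        "if the age of the applicant is higher than 18 then accept".split(),
        "the value must be greater than the threshold".split(),
        # ... many more domain sentences
    ]
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

    def rank_candidates(context_words, candidates):
        """Score candidates by cosine similarity to the averaged context vector."""
        ctx = [w for w in context_words if w in model.wv]
        ctx_vec = np.mean([model.wv[w] for w in ctx], axis=0)
        scores = {}
        for cand in candidates:
            if cand not in model.wv:      # unseen candidates simply get no score
                continue
            v = model.wv[cand]
            scores[cand] = float(np.dot(ctx_vec, v) /
                                 (np.linalg.norm(ctx_vec) * np.linalg.norm(v) + 1e-9))
        return sorted(scores.items(), key=lambda kv: -kv[1])

    # the sentence with the keyword position left out
    context = "if the age of the applicant is than 18 then".split()
    print(rank_candidates(context, ["higher", "bigger", "greater", "wider"]))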

SPSS two way repeated measures ANOVA

I am fairly new to statistics.
I ran an experiment and used a two-way ANOVA with repeated measures. The calculation was done in SPSS. In most papers I have seen, the F value and the degrees of freedom were reported as well. Is it normal to report those values? If so, which values do I take from the SPSS output?
How do I interpret these values? What do they mean?
When does the F value support a significant result, and when does it not?
What are good values for the F value and the degrees of freedom?
In some articles I also read about critical F values; how do I get this value?
Most articles describe how to calculate those values but do not explain their meaning for the experiment.
Some clarification on these issues would be greatly appreciated.
My English is not very good, but I will try to answer your question.
The main purpose of ANOVA is to get statistical evidence for whether the measured groups have the same mean or not. So we set up a null hypothesis and an alternative hypothesis, and then we use a test statistic on the data. You can use ANOVA if the groups have the same variance (the squared standard deviation).
You need to test this first. This is a hypothesis test too: the null hypothesis is that the groups have the same variance, the alternative hypothesis is that they don't.
You make the decision from the Sig. value: if the value is higher than 0.05, we usually accept the null hypothesis. If the variances are equal, we can use ANOVA. (I assume that the data follows the normal distribution.) For the ANOVA itself, the null hypothesis is that the groups have equal means, and the alternative hypothesis is that at least one group has a different mean. Again, you make your decision from the Sig. value: if the value is higher than 0.05, you accept the null hypothesis. The critical F value is not important if you are calculating on a computer. You can build an acceptance interval from the lower and upper critical F values, and if the F value falls inside the interval you accept the null hypothesis, but I have only used this method in statistics class. You don't necessarily need the F value and the degrees of freedom in the report, because they don't explain anything on their own.
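Not SPSS, but as an illustration of where the F value, the degrees of freedom, the p value ("Sig.") and the critical F come from, here is a small repeated-measures example in Python with made-up data (statsmodels' AnovaRM and scipy; every number here is a placeholder):

    import pandas as pd
    from scipy import stats
    from statsmodels.stats.anova import AnovaRM

    # hypothetical long-format data: 6 subjects x 2 conditions x 2 time points
    data = pd.DataFrame({
        "subject":   [s for s in range(6) for _ in range(4)],
        "condition": ["A", "A", "B", "B"] * 6,
        "time":      ["pre", "post"] * 12,
        "score":     [5, 7, 6, 9, 4, 6, 5, 8, 6, 8, 7, 10,
                      5, 6, 6, 9, 4, 7, 5, 9, 6, 7, 6, 10],
    })

    res = AnovaRM(data, depvar="score", subject="subject",
                  within=["condition", "time"]).fit()
    print(res)  # table with F value, numerator/denominator df and p value

    # critical F at alpha = 0.05 for, e.g., df1 = 1 and df2 = 5
    print(stats.f.ppf(0.95, dfn=1, dfd=5))
    # F > critical F is equivalent to p < 0.05 for that effect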

Binary classification of webpages where data in categories are very similar

I am working on binary classification of webpages related to a topic of my interest. I want to classify whether a webpage belongs to a certain category or not. I have a manually labelled dataset with two categories, positive and negative. However, my concern is that when I look at the bag-of-words from each of the categories, the features are very similar. The positive and negative webpages are indeed very close content-wise.
Some more info - the content is in English, we are also doing stopwords removal.
How can I go about this task? Is there a different approach that can be applied to this problem?
Thanks!
You can use pairs of consecutive words instead of single words (a bag of pairs of words). The hope is that pairs of words may capture the concept you're after better. Triplets of words could come next. The issue is that the dimensionality gets really high (N^2). If you can't afford that, one idea is to use the hashing trick (check the literature on random projections/hashing) on the pairs of words to bound the dimensionality.
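A minimal scikit-learn sketch of that combination: word pairs via ngram_range plus the hashing trick to cap the dimensionality (the texts and labels are placeholders for your labelled webpages):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts  = ["first webpage text ...", "second webpage text ...", "third ..."]
    labels = [1, 0, 1]   # 1 = positive category, 0 = negative

    clf = make_pipeline(
        HashingVectorizer(ngram_range=(1, 2),   # single words and word pairs
                          n_features=2**18,     # hashing trick bounds dimensionality
                          stop_words="english"),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(texts, labels)
    print(clf.predict(["some unseen webpage text"]))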

Genetic Algorithm, large population vs small one

I'm wondering if there is a general rule of thumb for population sizing. I've read in a book that 2x the chromosome length is a good starting point. Am I correct in assuming then that if I had an equation with 5 variables, I should have a population of 10?
I'm also wondering if the following is correct:
Larger Population Size.
Pros:
Larger diversity so more likely to pick up on traits which return a good fitness.
Cons:
Requires longer to process.
vs
Smaller Population Size.
Pros:
Larger number of generations experienced per unit time.
Cons:
Mutation will have to be more prominent in order to compensate for the smaller population?
EDIT
A little additional info: say I have an equation which has 5 unknown parameters. For each parameter I have anywhere between 10 and 50 values I would like to try assigning to it. So for example
variable1 = 20 different values
variable2 = 15 different values
...
I thought a GA would be a decent approach to such a problem, as the search space is quite large: the worst case for the above would be 50^5 = 50 * 50 * 50 * 50 * 50 = 312,500,000 combinations (unless I have screwed up?).
Unfortunately the number of parameters and the range of values to check can vary a lot, so I was looking for some sort of rule of thumb as to how large I should set the population.
Thanks for your help. If there is any more info you need, or you'd prefer to discuss this in one of the chatrooms, just give me a shout.
I'm not sure where you read that 2x the chromosome length is a good starting point, but I'm guessing it's a book that concentrated on larger problems.
If you only have five variables, a genetic algorithm is probably not the right choice for converging on a solution. With a chromosome length of five you're probably going to find that you very quickly reach a non-deterministic local minimum (it will change in subsequent runs) and slowly iterate around that space until you find the true local minimum.
However, if you are insistent on using a GA I would suggest abandoning that rule of thumb for this problem and really think about starting population as a measure of how far from the final solution you expect a random solution to be.
The reason that many rules of thumb depend on chromosome length is that it's a decent proxy for this: a randomly generated chromosome with a hundred variables is going to be further from ideal than one with only a single variable.
Additionally, if you're worried about computation intensity I'm going to go ahead and say that it shouldn't be an issue since you're dealing with such a small solution set. I think a better rule of thumb for smaller sets like this would be along the lines of:
(ln(chromosome_length*(solution_space/granularity)/mutation_rate))^2
Probably with a constant thrown in to scale for the particular problem.
It's definitely not a great rule of thumb (no rule is) but here's my logic for it:
Chromosome length is just a proxy for size of solution space, so taking into account the size of the solution space will necessarily increase the accuracy of this proxy
A smaller mutation rate necessitates a larger population size to compensate for the fact that you are more prone to get caught in local minima
Any rule of thumb should scale logarithmically since a genetic algorithm is akin to a tree search of your solution space.
The squared term was mostly the result of trying this out, but it looks like the logarithmic scaling was a little aggressive, though the general shape seemed right.
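Just to show the kind of number that formula produces, plugging in hypothetical values for the problem in the question (these inputs are assumptions, not recommendations):

    import math

    chromosome_length = 5
    solution_space    = 50 ** 5   # worst case from the question
    granularity       = 1         # the candidate values are already discrete
    mutation_rate     = 0.01      # assumed, not taken from the question

    population = math.log(chromosome_length * (solution_space / granularity)
                          / mutation_rate) ** 2
    print(round(population))      # about 664 individuals with these inputs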
However I think a better choice would be to start at a reasonable number (100) and try iterating up and down until you find a population size that seems to balance accuracy with execution speed.
As with most genetic algorithm parameters, population size is highly dependent on the problem. There are certain factors that can point in the direction of whether you should have a large or small population size, but a lot of the time testing different values against a known solution before running it on your problem is a good idea (if this is possible, of course).
A population size of 10 does seem rather small though. You say you have an equation with five variables. Is your problem represented by a chromosome of 5 values? It seems small for a chromosome and if this is the case it's likely that using a genetic algorithm may not be the best way to solve the problem. Perhaps if you give a bit more detail on your problem and how you are representing it people may have a better idea of how to advise you.
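For reference, one common way to represent such a problem (a bare-bones sketch; the candidate-value sizes and the fitness function are placeholders) is a chromosome of indices, one per variable, into that variable's list of candidate values:

    import random

    # one list of candidate values per unknown parameter (sizes are made up)
    candidate_values = [
        list(range(20)),   # variable1: 20 different values
        list(range(15)),   # variable2: 15 different values
        list(range(30)),
        list(range(50)),
        list(range(10)),
    ]

    def random_chromosome():
        # a chromosome is one index per variable
        return [random.randrange(len(vals)) for vals in candidate_values]

    def decode(chrom):
        # turn indices back into actual parameter values
        return [vals[i] for vals, i in zip(candidate_values, chrom)]

    def fitness(chrom):
        x1, x2, x3, x4, x5 = decode(chrom)
        # placeholder: plug the decoded values into your equation here
        return -abs(x1 + x2 + x3 + x4 + x5 - 42)

    def mutate(chrom, rate=0.05):
        # with probability `rate`, resample a gene from its own candidate list
        return [random.randrange(len(vals)) if random.random() < rate else g
                for g, vals in zip(chrom, candidate_values)]

    population = [random_chromosome() for _ in range(100)]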
I'd also add that your cons for large and small population sizes aren't exactly correct. A larger population size does take longer to process than a small one, but since it can often solve the problem in fewer generations, the overall processing time isn't necessarily longer. Again, it's highly dependent on the problem. With a smaller population size, mutation shouldn't have to be more prominent. Mutation is generally used to stop the genetic algorithm from becoming stuck in a local maximum and should usually be a very small value. A small population is more likely to become stuck in a local maximum, but if you set the mutation value too high you may be nullifying the natural improvement of the genetic algorithm.
