How to normalize word frequencies of a document in Weka

In Weka, the class StringToWordVector defines a method called setNormalizeDocLength, which normalizes the word frequencies of a document. My questions are:
What is meant by "normalizing the word frequencies of a document"?
How does Weka do this?
A practical example would help me best. Thanks in advance.

Looking in the Weka source, this is the method that does the normalising:
private void normalizeInstance(Instance inst, int firstCopy) throws Exception
{
    double docLength = 0;
    if (m_AvgDocLength < 0)
    {
        throw new Exception("Average document length not set.");
    }
    // Compute length of document vector
    for (int j = 0; j < inst.numValues(); j++)
    {
        if (inst.index(j) >= firstCopy)
        {
            docLength += inst.valueSparse(j) * inst.valueSparse(j);
        }
    }
    docLength = Math.sqrt(docLength);
    // Normalize document vector
    for (int j = 0; j < inst.numValues(); j++)
    {
        if (inst.index(j) >= firstCopy)
        {
            double val = inst.valueSparse(j) * m_AvgDocLength / docLength;
            inst.setValueSparse(j, val);
            if (val == 0)
            {
                System.err.println("setting value " + inst.index(j) + " to zero.");
                j--;
            }
        }
    }
}
The most relevant part is
double val = inst.valueSparse(j) * m_AvgDocLength / docLength;
inst.setValueSparse(j, val);
So the normalisation is value = currentValue * averageDocumentLength / actualDocumentLength, where actualDocumentLength is the Euclidean (L2) length of the document's word-frequency vector, computed in the first loop above.

Related

EmguCV equivalent to Java mat.put(i, 0, mv)

I'm trying to convert a Java class to a C# one using EmguCV. It's for a class in Unsupervised Learning. The teacher made a program using OpenCV and Java. I have to convert it to C#.
The goal is to implement a simple Face Recognition algorithm.
The method I'm stuck at:
Mat sample = train.get(0).getData();
mean = Mat.zeros(/*6400*/ sample.rows(), /*1*/ sample.cols(), /*CvType.CV_64FC1*/ sample.type());
// Calculating it by hand
train.forEach(person -> {
    Mat data = person.getData();
    for (int i = 0; i < mean.rows(); i++) {
        double mv = mean.get(i, 0)[0]; // Gets the value of the cell in the first channel
        double pv = data.get(i, 0)[0]; // Gets the value of the cell in the first channel
        mv += pv;
        mean.put(i, 0, mv); // *********** I'm stuck here ***********
    }
});
So far, my C# equivalent is:
var sample = trainSet[0].Data;
mean = Mat.Zeros(sample.Rows, sample.Cols, sample.Depth, sample.NumberOfChannels);
foreach (var person in trainSet)
{
    var data = person.Data;
    for (int i = 0; i < mean.Rows; i++)
    {
        var meanValue = (double)mean.GetData().GetValue(i, 0);
        var personValue = (double)data.GetData().GetValue(i, 0);
        meanValue += personValue;
    }
}
And I am not finding the put equivalent in C#. But, if I'm being honest, I'm not even sure the previous two lines in my C# equivalent are correct.
Can someone help me figure this one out?
You can convert it like this (Marshal.Copy lives in System.Runtime.InteropServices):
using System.Runtime.InteropServices;

Mat sample = trainSet[0].Data;
Mat mean = Mat.Zeros(sample.Rows, sample.Cols, sample.Depth, sample.NumberOfChannels);
foreach (var person in trainSet)
{
    Mat data = person.Data;
    for (int i = 0; i < mean.Rows; i++)
    {
        double meanValue = (double)mean.GetData().GetValue(i, 0);
        double personValue = (double)data.GetData().GetValue(i, 0);
        meanValue += personValue;
        // Write the updated value back into the Mat's native buffer
        double[] mva = new double[] { meanValue };
        Marshal.Copy(mva, 0, mean.DataPointer + i * mean.Cols * mean.ElementSize, 1);
    }
}

CS50 speller: not recognising any incorrect words

I'm currently working on the CS50 Speller problem. I have managed to compile my code and have finished a prototype of the full program; however, it does not work (it doesn't recognise any misspelled words). I am looking through my functions one at a time and printing out their output to see what's going on inside.
// Loads dictionary into memory, returning true if successful else false
bool load(const char *dictionary)
{
    char word[LENGTH + 1];
    int counter = 0;
    FILE *dicptr = fopen(dictionary, "r");
    if (dicptr == NULL)
    {
        printf("Could not open file\n");
        return 1;
    }
    while (fscanf(dicptr, "%s", word) != EOF)
    {
        printf("%s", word);
        node *n = malloc(sizeof(node));
        if (n == NULL)
        {
            unload();
            printf("Memory Error\n");
            return false;
        }
        strcpy(n->word, word);
        int h = hash(n->word);
        n->next = table[h];
        table[h] = n;
        amount++;
    }
    fclose(dicptr);
    return true;
}
From what I can see this works fine. Which makes me wonder if the issue is with my check function as shown here:
bool check(const char *word)
{
    int n = strlen(word);
    char copy[n + 1];
    copy[n] = '\0';
    for (int i = 0; i < n; i++)
    {
        copy[i] = tolower(word[i]);
        printf("%c", copy[i]);
    }
    printf("\n");
    node *cursor = table[hash(copy)];
    while (cursor != NULL)
    {
        if (strcasecmp(cursor->word, word))
        {
            return true;
        }
        cursor = cursor->next;
    }
    return false;
}
If someone with a keener eye can spot the issue, I'd be very grateful, as I'm stumped. The first function loads the words from a dictionary into a hash table of linked lists. The second function is supposed to check the words of a txt file to see if they match any of the terms in those lists; if not, they should be counted as incorrect.
This if (strcasecmp(cursor->word, word)) is a problem. From man strcasecmp:

Return Value
The strcasecmp() and strncasecmp() functions return an integer less than, equal to, or greater than zero if s1 (or the first n bytes thereof) is found, respectively, to be less than, to match, or be greater than s2.

If the words match, it returns 0, which evaluates to false.

How can I find the number of elements of a multidimensional dynamic array

My pointer is declared in the header file:
int (*array)[10];
I pass an argument to a function that initializes the array:
void __fastcall TForm1::Initarray(const int cnt)
{
    try
    {
        Form1->array = new int[cnt][53];
    }
    catch (bad_alloc xa)
    {
        Application->MessageBoxA("Memory allocation error SEL. ", MB_OK);
    }
    Form1->Zeroarray();
}
//---------------------------------------------------------------------------
I set all elements of the array to 0:
void __fastcall TForm1::Zeroarray()
{
    __int16 cnt = SIZEOF_ARRAY(array);
    // Here is where I notice the problem. cnt is not correct for the size of the first level of the array.
    if (cnt)
    {
        for (int n = 0; n < cnt; n++)
        {
            for (int x = 0; x < 53; x++)
            {
                Form1->array[n][x] = 0;
            }
        }
    }
}
//---------------------------------------------------------------------------
This is my defined size of array macro:
#define SIZEOF_ARRAY(a) (sizeof((a))/sizeof((a[0])));
When the array is created with 10 elements of 53 elements each (array[10][53]), SIZEOF_ARRAY returns 0; it should equal 10. I have tried several variations of this macro, and doing the math directly with sizeof(), but I cannot get the correct output.
What am I not doing?

Objective-C "*" syntax and usage

I am rewriting the iOS particle filter library, which is available on Bitbucket, from Objective-C into Swift, and I have a question about a piece of Objective-C syntax that I cannot understand.
The code goes as follows:
- (void)setRssi:(NSInteger)rssi {
    _rssi = rssi;
    // Ignore zeros in average, StdDev -- we clear the value before setting it to
    // prevent old values from hanging around if there's no reading
    if (rssi == 0) {
        self.meters = 0;
        return;
    }
    self.meters = [self metersFromRssi:rssi];
    NSInteger* pidx = self.rssiBuffer;
    *(pidx+self.bufferIndex++) = rssi;
    if (self.bufferIndex >= RSSIBUFFERSIZE) {
        self.bufferIndex %= RSSIBUFFERSIZE;
        self.bufferFull = YES;
    }
    if (self.bufferFull) {
        // Only calculate trailing mean and Std Dev when we have enough data
        double accumulator = 0;
        for (NSInteger i = 0; i < RSSIBUFFERSIZE; i++) {
            accumulator += *(pidx+i);
        }
        self.meanRssi = accumulator / RSSIBUFFERSIZE;
        self.meanMeters = [self metersFromRssi:self.meanRssi];
        accumulator = 0;
        for (NSInteger i = 0; i < RSSIBUFFERSIZE; i++) {
            NSInteger difference = *(pidx+i) - self.meanRssi;
            accumulator += difference*difference;
        }
        self.stdDeviationRssi = sqrt(accumulator / RSSIBUFFERSIZE);
        self.meanMetersVariance = ABS(
            [self metersFromRssi:self.meanRssi]
            - [self metersFromRssi:self.meanRssi+self.stdDeviationRssi]
        );
    }
}
The class continues with more code and functions which are not important. What I do not understand are these two lines:
NSInteger* pidx = self.rssiBuffer;
*(pidx+self.bufferIndex++) = rssi;
As far as I can tell, pidx is initialized from a buffer which was previously defined, and the next line then somehow stores the rssi parameter, which is passed into the function, in that buffer. I assume that * has something to do with dereferencing, but I just can't figure out the purpose of this line. The variable pidx is used only in this function, for calculating the trailing mean and standard deviation.
Let me explain that code:
NSInteger* pidx = self.rssiBuffer; means that you are taking a pointer to the first value of the buffer.
*(pidx+self.bufferIndex++) = rssi; means that you are setting the value of the buffer at index self.bufferIndex to rssi and then increasing bufferIndex by 1. Thanks to Jakub Vano for pointing it out.
In C++, it would look like this:
int rssiBuffer[1000]; // assume we have a buffer like this
rssiBuffer[bufferIndex++] = rssi;

Using Flex/Bison to parse underscore delimited value

I would like to parse some input, mainly numbers, that can be delimited using underscore ( _ ) for user readability.
Ex.
1_0001_000 -> 10001000
000_000_111 -> 000000111
How would I set up my flex/yacc to do so?
Here's a potential flex answer (in C):

DIGIT [0-9]
%%
{DIGIT}+("_"{DIGIT}+)*  {
    int numUnderscores = 0;
    for (int i = 0; i < yyleng; i++)
        if (yytext[i] == '_')
            numUnderscores++;
    int stringLength = yyleng - numUnderscores + 1;  /* +1 for the terminator */
    char *string = (char*) malloc(sizeof(char) * stringLength);
    /* be sure to check and ensure string isn't NULL */
    int pos = 0;
    for (int i = 0; i < yyleng; i++) {
        if (yytext[i] != '_') {
            string[pos] = yytext[i];
            pos++;
        }
    }
    string[pos] = '\0';  /* terminate the string */
    return string;
}
If you know the maximum size of the number, you could use a statically sized array instead of dynamically allocating space for the string.
Flex isn't the most efficient tool for solving this problem. If it's part of a larger problem (such as a language grammar), then keep using flex; otherwise, there are many more efficient ways of handling it.
If you just need the string numerically, try this:
DIGIT [0-9]
%%
{DIGIT}+("_"{DIGIT}+)*  {
    int number = 0;
    for (int i = 0; i < yyleng; i++)
        if (yytext[i] != '_')
            number = (number*10) + (yytext[i]-'0');
    return number;
}
Just be sure to check for overflow!
