In Weka, the class StringToWordVector defines a method called setNormalizeDocLength, which normalizes the word frequencies of a document. My questions are:
What is meant by "normalizing the word frequencies of a document"?
How does Weka do this?
A practical example would help me best. Thanks in advance.
Looking in the Weka source, this is the method that does the normalising:
private void normalizeInstance(Instance inst, int firstCopy) throws Exception
{
    double docLength = 0;

    if (m_AvgDocLength < 0)
    {
        throw new Exception("Average document length not set.");
    }

    // Compute length of document vector
    for (int j = 0; j < inst.numValues(); j++)
    {
        if (inst.index(j) >= firstCopy)
        {
            docLength += inst.valueSparse(j) * inst.valueSparse(j);
        }
    }
    docLength = Math.sqrt(docLength);

    // Normalize document vector
    for (int j = 0; j < inst.numValues(); j++)
    {
        if (inst.index(j) >= firstCopy)
        {
            double val = inst.valueSparse(j) * m_AvgDocLength / docLength;
            inst.setValueSparse(j, val);
            if (val == 0)
            {
                System.err.println("setting value " + inst.index(j) + " to zero.");
                j--;
            }
        }
    }
}
It looks like the most relevant part is
double val = inst.valueSparse(j) * m_AvgDocLength / docLength;
inst.setValueSparse(j, val);
So it looks like the normalisation is value = currentValue * averageDocumentLength / actualDocumentLength.
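As a concrete (made-up) example: suppose a document's word-frequency vector over two dictionary words is (3, 4) and the average document length is 10. The document's length is sqrt(3² + 4²) = 5, so every frequency is scaled by 10 / 5 = 2, giving the normalized vector (6, 8). After normalization every document vector has the same Euclidean length (the average document length), which makes long and short documents comparable.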
I'm trying to convert a Java class to C# using EmguCV. It's for a course on Unsupervised Learning: the teacher wrote a program using OpenCV and Java, and I have to convert it to C#.
The goal is to implement a simple face recognition algorithm.
The method I'm stuck on:
Mat sample = train.get(0).getData();
mean = Mat.zeros(/*6400*/sample.rows(), /*1*/sample.cols(), /*CvType.CV_64FC1*/sample.type());

// Calculating it by hand
train.forEach(person -> {
    Mat data = person.getData();
    for (int i = 0; i < mean.rows(); i++) {
        double mv = mean.get(i, 0)[0]; // Gets the value of the cell in the first channel
        double pv = data.get(i, 0)[0]; // Gets the value of the cell in the first channel
        mv += pv;
        mean.put(i, 0, mv); // *********** I'm stuck here ***********
    }
});
So far, my C# equivalent is:
var sample = trainSet[0].Data;
mean = Mat.Zeros(sample.Rows, sample.Cols, sample.Depth, sample.NumberOfChannels);

foreach (var person in trainSet)
{
    var data = person.Data;
    for (int i = 0; i < mean.Rows; i++)
    {
        var meanValue = (double)mean.GetData().GetValue(i, 0);
        var personValue = (double)data.GetData().GetValue(i, 0);
        meanValue += personValue;
    }
}
I am not finding the equivalent of put in C#. And, to be honest, I'm not even sure the previous two lines of my C# code are correct.
Can someone help me figure this out?
You can convert it like this:
Mat sample = trainSet[0].Data;
Mat mean = Mat.Zeros(sample.Rows, sample.Cols, sample.Depth, sample.NumberOfChannels);

foreach (var person in trainSet)
{
    Mat data = person.Data;
    for (int i = 0; i < mean.Rows; i++)
    {
        double meanValue = (double)mean.GetData().GetValue(i, 0);
        double personValue = (double)data.GetData().GetValue(i, 0);
        meanValue += personValue;

        double[] mva = new double[] { meanValue };
        Marshal.Copy(mva, 0, mean.DataPointer + i * mean.Cols * mean.ElementSize, 1);
    }
}
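Note that Marshal.Copy comes from the System.Runtime.InteropServices namespace, so the file needs the corresponding using directive. The snippet also assumes the Mat stores doubles (depth Cv64F), matching the CvType.CV_64FC1 noted in the Java version; for a different depth, the type of the temporary array would have to change accordingly.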
I'm currently working on the CS50 Speller problem. I have managed to compile my code and have finished a prototype of the full program; however, it does not work (it doesn't recognise any misspelled words). I am going through my functions one at a time and printing out their output to have a look at what's going on inside.
// Loads dictionary into memory, returning true if successful else false
bool load(const char *dictionary)
{
    char word[LENGTH + 1];
    int counter = 0;

    FILE *dicptr = fopen(dictionary, "r");
    if (dicptr == NULL)
    {
        printf("Could not open file\n");
        return 1;
    }

    while (fscanf(dicptr, "%s", word) != EOF)
    {
        printf("%s", word);
        node *n = malloc(sizeof(node));
        if (n == NULL)
        {
            unload();
            printf("Memory Error\n");
            return false;
        }
        strcpy(n->word, word);
        int h = hash(n->word);
        n->next = table[h];
        table[h] = n;
        amount++;
    }
    fclose(dicptr);
    return true;
}
From what I can see, this works fine, which makes me wonder if the issue is with my check function, shown here:
bool check(const char *word)
{
    int n = strlen(word);
    char copy[n + 1];
    copy[n] = '\0';

    for (int i = 0; i < n; i++)
    {
        copy[i] = tolower(word[i]);
        printf("%c", copy[i]);
    }
    printf("\n");

    node *cursor = table[hash(copy)];
    while (cursor != NULL)
    {
        if (strcasecmp(cursor->word, word))
        {
            return true;
        }
        cursor = cursor->next;
    }
    return false;
}
If someone with a keener eye can spot the issue I'd be very grateful, as I'm stumped. The first function loads the words from a dictionary into a hash table of linked lists. The second function is supposed to check the words of a txt file to see whether they match any of the terms in the linked lists; if not, they should be counted as incorrect.
This if(strcasecmp(cursor->word, word)) is a problem. From man strcasecmp:
Return Value
The strcasecmp() and strncasecmp() functions return an integer less than, equal to, or greater than zero if s1 (or the first n bytes thereof) is found, respectively, to be less than, to match, or be greater than s2.
If the words match, it returns 0, which evaluates to false.
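In other words, the condition is true precisely when the strings do not match. Comparing the return value against zero, for example like this, makes the check behave as intended:

    // strcasecmp returns 0 on a (case-insensitive) match
    if (strcasecmp(cursor->word, word) == 0)
    {
        return true;
    }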
My pointer is declared in the header file:
int (*array)[10];
I pass an argument to a function that initializes the array:
void __fastcall TForm1::Initarray(const int cnt)
{
    try
    {
        Form1->array = new int[cnt][53];
    }
    catch (bad_alloc xa)
    {
        Application->MessageBoxA("Memory allocation error SEL. ", MB_OK);
    }
    Form1->Zeroarray();
}
//---------------------------------------------------------------------------
I set all elements of the array to 0:
void __fastcall TForm1::Zeroarray()
{
    __int16 cnt = SIZEOF_ARRAY(array);
    // Here is where I notice the problem. cnt is not correct for the size of the first level of the array.
    if (cnt)
    {
        for (int n = 0; n < cnt; n++)
        {
            for (int x = 0; x < 53; x++)
            {
                Form1->array[n][x] = 0;
            }
        }
    }
}
//---------------------------------------------------------------------------
This is my size-of-array macro:
#define SIZEOF_ARRAY(a) (sizeof((a))/sizeof((a[0])));
When the array is created with 10 elements of 53 elements each (array[10][53]),
I get a return from SIZEOF_ARRAY == 0. It should equal 10.
I have tried several variations of this macro, and doing the math with sizeof() directly, but I cannot get the correct output.
What am I not doing?
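A minimal stand-alone C sketch (not from the original post, just an illustration) of why the macro yields 0 here: sizeof applied to a pointer reports the size of the pointer itself, not of the block allocated at run time, so the division truncates to zero.

    #include <stdio.h>
    #include <stdlib.h>

    #define SIZEOF_ARRAY(a) (sizeof((a)) / sizeof((a)[0]))

    int main(void)
    {
        int real[10][53];                          /* a real array: sizeof sees its full extent */
        int (*ptr)[53] = malloc(10 * sizeof *ptr); /* a pointer: sizeof only sees the pointer */

        printf("%zu\n", SIZEOF_ARRAY(real)); /* prints 10 */
        printf("%zu\n", SIZEOF_ARRAY(ptr));  /* sizeof(pointer) / sizeof(int[53]), e.g. 8 / 212 = 0 */

        free(ptr);
        return 0;
    }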
I am rewriting an iOS particle filter library (available on Bitbucket) from Objective-C into Swift, and I have a question about a piece of Objective-C syntax that I cannot understand.
The code goes as follows:
- (void)setRssi:(NSInteger)rssi {
    _rssi = rssi;

    // Ignore zeros in average, StdDev -- we clear the value before setting it to
    // prevent old values from hanging around if there's no reading
    if (rssi == 0) {
        self.meters = 0;
        return;
    }

    self.meters = [self metersFromRssi:rssi];

    NSInteger* pidx = self.rssiBuffer;
    *(pidx+self.bufferIndex++) = rssi;
    if (self.bufferIndex >= RSSIBUFFERSIZE) {
        self.bufferIndex %= RSSIBUFFERSIZE;
        self.bufferFull = YES;
    }

    if (self.bufferFull) {
        // Only calculate trailing mean and Std Dev when we have enough data
        double accumulator = 0;
        for (NSInteger i = 0; i < RSSIBUFFERSIZE; i++) {
            accumulator += *(pidx+i);
        }
        self.meanRssi = accumulator / RSSIBUFFERSIZE;
        self.meanMeters = [self metersFromRssi:self.meanRssi];

        accumulator = 0;
        for (NSInteger i = 0; i < RSSIBUFFERSIZE; i++) {
            NSInteger difference = *(pidx+i) - self.meanRssi;
            accumulator += difference*difference;
        }
        self.stdDeviationRssi = sqrt(accumulator / RSSIBUFFERSIZE);
        self.meanMetersVariance = ABS(
            [self metersFromRssi:self.meanRssi]
            - [self metersFromRssi:self.meanRssi+self.stdDeviationRssi]
        );
    }
}
The class continues with more code and functions that are not important; what I do not understand are these two lines:
NSInteger* pidx = self.rssiBuffer;
*(pidx+self.bufferIndex++) = rssi;
The variable pidx is initialized to the size of a buffer which was previously defined, and then in the next line that buffer plus the buffer index is set equal to the rssi variable passed as a parameter to the function.
I assume that the * has something to do with references, but I just can't figure out the purpose of this line. The variable pidx is only used in this function, for calculating the trailing mean and standard deviation.
Let me explain that code:
NSInteger* pidx = self.rssiBuffer; means that you are getting a pointer to the first value of the buffer.
*(pidx+self.bufferIndex++) = rssi; means that you are setting the value of the buffer at index 0 + self.bufferIndex to rssi and then increasing bufferIndex by 1. Thanks to @Jakub Vano for pointing it out.
In C++, it would look like this:
int rssiBuffer[1000];              // assume we have a buffer like this
rssiBuffer[bufferIndex++] = rssi;
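A small self-contained C sketch (not from the library; the names are illustrative) showing the same pointer arithmetic: *(p + i) and p[i] refer to the same element.

    #include <stdio.h>

    #define RSSIBUFFERSIZE 5

    int main(void)
    {
        int buffer[RSSIBUFFERSIZE] = {0};
        int *pidx = buffer;   /* pointer to the first element of the buffer */
        int bufferIndex = 0;

        /* store a reading, then advance the index; same as buffer[bufferIndex++] = -70 */
        *(pidx + bufferIndex++) = -70;
        *(pidx + bufferIndex++) = -72;

        printf("%d %d\n", buffer[0], buffer[1]); /* prints: -70 -72 */
        return 0;
    }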
I would like to parse some input, mainly numbers, that can be delimited with underscores ( _ ) for user readability.
Ex.
1_0001_000 -> 10001000
000_000_111 -> 000000111
How would I set up my flex/yacc to do so?
Here's a potential flex answer (in C):
DIGIT [0-9]

%%

{DIGIT}+("_"{DIGIT}+)*  {
                            int numUnderscores = 0;
                            for (int i = 0; i < yyleng; i++)
                                if (yytext[i] == '_')
                                    numUnderscores++;

                            int stringLength = yyleng - numUnderscores + 1;
                            char *string = (char *) malloc(sizeof(char) * stringLength);
                            /* be sure to check and ensure string isn't NULL */

                            int pos = 0;
                            for (int i = 0; i < yyleng; i++) {
                                if (yytext[i] != '_') {
                                    string[pos] = yytext[i];
                                    pos++;
                                }
                            }
                            string[pos] = '\0'; /* terminate the copied digits */
                            return string;
                        }
If you know the maximum size of the number, you could use a statically sized array instead of dynamically allocating space for the string.
As stated before, flex isn't the most efficient tool for solving this problem. If it is part of a larger problem (such as a language grammar), then keep using flex. Otherwise, there are many more efficient ways of handling this (one is sketched after the second example below).
If you just need the string numerically, try this:
DIGIT [0-9]

%%

{DIGIT}+("_"{DIGIT}+)*  {
                            int number = 0;
                            for (int i = 0; i < yyleng; i++)
                                if (yytext[i] != '_')
                                    number = (number * 10) + (yytext[i] - '0');
                            return number;
                        }
Just be sure to check for overflow!
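For completeness, here is a minimal plain-C sketch of what such an alternative could look like (this is an assumption about the approach, not code from the answer above): strip the underscores and hand the cleaned string to strtol.

    #include <stdio.h>
    #include <stdlib.h>

    /* Parse a digit string that may contain underscores, e.g. "1_0001_000". */
    long parse_underscored(const char *text)
    {
        char cleaned[64]; /* assumes the input fits; the size here is arbitrary */
        size_t pos = 0;

        for (size_t i = 0; text[i] != '\0' && pos < sizeof cleaned - 1; i++) {
            if (text[i] != '_') {
                cleaned[pos++] = text[i];
            }
        }
        cleaned[pos] = '\0';

        return strtol(cleaned, NULL, 10); /* overflow still needs checking via errno/ERANGE */
    }

    int main(void)
    {
        printf("%ld\n", parse_underscored("1_0001_000")); /* prints 10001000 */
        return 0;
    }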