Is it possible to parse a string of fixed length in yacc/lex? - parsing

I have a file format something like this
...
{string_length} {binary_string}
...
example:
...
10 abcdefghij
...
Is this possible to parse using lexer/yacc? There is no null terminator for the string, so I'm at a loss of how to tokenize that.
I'm currently using ply's lexer and yacc for this

You can't do it with a regular expression, but you can certainly extract the lexeme. You're not specific about how the length is terminated; here, I'm assuming that it is terminated by a single space character. I'm also assuming that yylval has some appropriate struct type:
[[:digit:]]+" " { unsigned long len = atol(yytext);
yylval.str = malloc(len);
yylval.len = len;
for (char *p = yylval.str; len; --len, ++p) {
int ch = input();
if (ch == EOF) { /* handle the lexical error */ }
*p = ch;
}
return BINARY_STRING;
}
There are other solutions (a start condition and a state variable for the count, for example), but I think the above is the simplest.

Related

Cs50 speller: not recognising any incorrect words

I'm currently working on the CS50 Speller function. I have managed to compile my code and have finished a prototype of the full program, however it does not work (it doesn't recognise any mispelled words). I am looking through my functions one at a time and printing out their output to have a look at what's going on inside.
// Loads dictionary into memory, returning true if successful else false
bool load(const char *dictionary)
{
char word[LENGTH + 1];
int counter = 0;
FILE *dicptr = fopen(dictionary, "r");
if (dicptr == NULL)
{
printf("Could not open file\n");
return 1;
}
while (fscanf(dicptr, "%s", word) != EOF)
{
printf("%s", word);
node *n = malloc(sizeof(node));
if (n == NULL)
{
unload();
printf("Memory Error\n");
return false;
}
strcpy(n->word, word);
int h = hash(n->word);
n->next = table[h];
table[h] = n;
amount++;
}
fclose(dicptr);
return true;
}
From what I can see this works fine. Which makes me wonder if the issue is with my check function as shown here:
bool check(const char *word)
{
int n = strlen(word);
char copy[n + 1];
copy[n] = '\0';
for(int i = 0; i < n; i++)
{
copy[i] = tolower(word[i]);
printf("%c", copy[i]);
}
printf("\n");
node *cursor = table[hash(copy)];
while(cursor != NULL)
{
if(strcasecmp(cursor->word, word))
{
return true;
}
cursor = cursor->next;
}
return false;
}
If someone with a keener eye can spy what is the issue I'd be very grateful as I'm stumped. The first function is used to load a the words from a dictionary into a hash table\linked list. The second function is supposed to check the words of a txt file to see if they match with any of the terms in the linked list. If not then they should be counted as incorrect.
This if(strcasecmp(cursor->word, word)) is a problem. From man strcasecmp:
Return Value
The strcasecmp() and strncasecmp() functions return an
integer less than, equal to, or greater than zero if s1 (or the first
n bytes thereof) is found, respectively, to be less than, to match, or
be greater than s2.
If the words match, it returns 0, which evaluates to false.

C++ - Removing invalid characters when a user paste in a grid

Here's my situation. I have an issue where I need to filter invalid characters that a user may paste from word or excel documents.
Here is what I'm doing.
First I'm trying to convert any unicode characters to ascii
extern "C" COMMON_STRING_FUNCTIONS long ConvertUnicodeToAscii(wchar_t * pwcUnicodeString, char* &pszAsciiString)
{
int nBufLen = WideCharToMultiByte(CP_ACP, 0, pwcUnicodeString, -1, NULL, 0, NULL, NULL)+1;
pszAsciiString = new char[nBufLen];
WideCharToMultiByte(CP_ACP, 0, pwcUnicodeString, -1, pszAsciiString, nBufLen, NULL, NULL);
return nBufLen;
}
Next I'm filtering out any character that does not have a value between 31 and 127
String __fastcall TMainForm::filterInput(String l_sConversion)
{
// Used to store every character that was stripped out.
String filterChars = "";
// Not Used. We never received the whitelist
String l_SWhiteList = "";
// Our String without the invalid characters.
AnsiString l_stempString;
// convert the string into an array of chars
wchar_t* outputChars = l_sConversion.w_str();
char * pszOutputString = NULL;
//convert any unicode characters to ASCII
ConvertUnicodeToAscii(outputChars, pszOutputString);
l_stempString = (AnsiString)pszOutputString;
//We're going backwards since we are removing characters which changes the length and position.
for (int i = l_stempString.Length(); i > 0; i--)
{
char l_sCurrentChar = l_stempString[i];
//If we don't have a valid character, filter it out of the string.
if (((unsigned int)l_sCurrentChar < 31) ||((unsigned int)l_sCurrentChar > 127))
{
String l_sSecondHalf = "";
String l_sFirstHalf = "";
l_sSecondHalf = l_stempString.SubString(i + 1, l_stempString.Length() - i);
l_sFirstHalf = l_stempString.SubString(0, i - 1);
l_stempString = l_sFirstHalf + l_sSecondHalf;
filterChars += "\'" + ((String)(unsigned int)(l_sCurrentChar)) + "\' ";
}
}
if (filterChars.Length() > 0)
{
LogInformation(__LINE__, __FUNC__, Utilities::LOG_CATEGORY_GENERAL, "The Following ASCII Values were filtered from the string: " + filterChars);
}
// Delete the char* to avoid memory leaks.
delete [] pszOutputString;
return l_stempString;
}
Now this seems to work except, when you try to copy and past bullets from a word document.
o Bullet1:
 subbullet1.
You will get something like this
oBullet1?subbullet1.
My filter function is called on an onchange event.
The bullets are replaced with the value o and a question mark.
What am I doing wrong, and is there a better way of trying to do this.
I'm using c++ builder XE5 so please no Visual C++ solutions.
When you perform the conversion to ASCII (which is not actually converting to ASCII, btw), Unicode characters that are not supported by the target codepage are lost - either dropped, replaced with ?, or replaced with a close approximation - so they are not available to your scanning loop. You should not do the conversion at all, scan the source Unicode data as-is instead.
Try something more like this:
#include <System.Character.hpp>
String __fastcall TMainForm::filterInput(String l_sConversion)
{
// Used to store every character sequence that was stripped out.
String filterChars;
// Not Used. We never received the whitelist
String l_SWhiteList;
// Our String without the invalid sequences.
String l_stempString;
int numChars;
for (int i = 1; i <= l_sConversion.Length(); i += numChars)
{
UCS4Char ch = TCharacter::ConvertToUtf32(l_sConversion, i, numChars);
String seq = l_sConversion.SubString(i, numChars);
//If we don't have a valid codepoint, filter it out of the string.
if ((ch <= 31) || (ch >= 127))
filterChars += (_D("\'") + seq + _D("\' "));
else
l_stempString += seq;
}
if (!filterChars.IsEmpty())
{
LogInformation(__LINE__, __FUNC__, Utilities::LOG_CATEGORY_GENERAL, _D("The Following Values were filtered from the string: ") + filterChars);
}
return l_stempString;
}

Process unicode string in C and Objective C

I write a C function to read characters in an user-input string. Because this string is user-input, so it can contains any unicode characters. There's an Objective C method receives the user-input NSString, then convert this string to NSData and pass this data to the C function for processing. The C function searches for these symbol characters: *, [, ], _, it doesn't care any other characters. Everytime it found one of the symbols, it processes and then calls an Objective C method, pass the location of the symbol.
C code:
typedef void (* callback)(void *context, size_t location);
void process(const uint8_t *data, size_t length, callback cb, void *context)
{
size_t i = 0;
while (i < length)
{
if (data[i] == '*' || data[i] == '[' || data[i] == ']' || data[i] == '_')
{
int valid = 0;
//do something, set valid = 1
if (valid)
cb(context, i);
}
i++;
}
}
Objective C code:
//a C function declared in .m file
void mycallback(void *context, size_t location)
{
[(__bridge id)context processSymbolAtLocation:location];
}
- (void)processSymbolAtLocation:(NSInteger)location
{
NSString *result = [self.string substringWithRange:NSMakeRange(location, 1)];
NSLog(#"%#", result);
}
- (void)processUserInput:(NSString*)string
{
self.string = string;
//convert string to data
NSData *data = [string dataUsingEncoding:NSUTF8StringEncoding];
//pass data to C function
process(data.bytes, data.length, mycallback, (__bridge void *)(self));
}
The code works fine if the input string contains only English characters. If it contains composed character sequences, multibyte characters or other unicode characters, the result string in processSymbolAtLocation method is not the expected symbol.
How to convert the NSString object to NSData correctly? How to get the correct location?
Thanks!
Your problem is that you start off with a UTF-16 encoded NSString and produce a sequence of UTF-8 encoded bytes. The number of code units required to represent a string in UTF-16 may not be equal to that number required to represent it in UTF-8, so the offsets in your two forms may not match - as you have found out.
Why are you using C to scan the string for matches in the first place? You might want to look at NSString's rangeOfCharacterFromSet:options:range: method which you can use to find the next occurrence of character from your set.
If you need to use C then convert your string into a sequence of UTF-16 words and use uint16_t on the C side.
HTH

I get a Suspicious pointer conversion in function main. How to get rid of this?

I'm new here at stackoverflow. The title is my question. Can someone please help me on this. Thanks. I've been working on this for like 3 days.
This part of code encodes the file to a huffman code
void encode(const char *s, char *out)
{
while (*s) {
strcpy(out, code[*s]);
out += strlen(code[*s++]);
}
}
This part of code deciphers the file from a huffman code to a human readable code
void decode(const char *s, node t)
{
node n = t;
while (*s) {
if (*s++ == '0') n = n->left;
else n = n->right;
if (n->c) putchar(n->c), n = t;
}
putchar('\n');
if (t != n) printf("garbage input\n");
}
This part is where I get my error.
int main(void)
{
int i;
const char *str = "this is an example for huffman encoding", buf[1024];
init(str);
for (i=0;i<128;i++)
if (code[i]) printf("'%c': %s\n", i, code[i]);
encode(str, buf); /* I get the error here */
printf("encoded: %s\n", buf);
printf("decoded: ");
decode(buf, q[1]);
return 0;
}
Declare 'buf' in a different line, and not as 'const':
char buf[1024];
The const applies to all the declarations on the line, so you're declaring buf as a const char[1024]. That means that calling encode casts away the constness, resulting in the warning.
Avoid having multiple variable declarations on the same line, unless they are all exactly the same type.

Using Flex/Bison to parse underscore delimited value

I would like to parse some input, mainly numbers, that can be delimited using underscore ( _ ) for user readability.
Ex.
1_0001_000 -> 1000100
000_000_111 -> 000000111
How would I set up my flex/yacc to do so?
Here's a potential flex answer (in C):
DIGIT [0-9]
%%
{DIGIT}+("_"{DIGIT}+)* { int numUnderscores = 0;
for(int i = 0; i < yyleng; i++)
if(yytext[i] == '_')
numUnderscores++;
int stringLength = yyleng - numUnderscores + 1;
char *string = (char*) malloc(sizeof(char) * stringLength);
/* be sure to check and ensure string isn't NULL */
int pos = 0;
for(int i = 0; i < yyleng; i++) {
if(yytext[i] != '_') {
string[pos] = yytext[i];
pos++;
}
}
return string;
}
If you know the maximum size of the number, you could use a statically sized array instead of dynamically allocating space for the string.
As stated before flex isn't the most efficient tool for solving this problem. If this problem is part of a larger problem (such as a language grammar), then keep using flex. Otherwise, there are many more efficient ways of handling this.
If you just need the string numerically, try this:
DIGIT [0-9]
%%
{DIGIT}+("_"{DIGIT}+)* { int number = 0;
for(int i = 0; i < yyleng; i++)
if(yytext[i] != '_')
number = (number*10) + (yytext[i]-'0');
return number;
}
Just be sure to check for overflow!

Resources