Going back to old position in lex - flex-lexer

During my lex processing, I need to go back in the lex input file, to process the same input several times with different local settings.
However, just doing fseek(yyin, old_pos, SEEK_SET); does not work, since the input data are buffered by lex. How can I (portably) deal with this?
I tried to add a YY_FLUSH_BUFFER after the fseek(), but it didn't help, since the old file position was incorrect (it was set to the point after filling the buffer, not to the point where I evaluate the token).

The combination of YY_FLUSH_BUFFER() and fseek(yyin, position, SEEK_SET) (in either order, but I would do the YY_FLUSH_BUFFER() first) will certainly cause the next token to be scanned starting at position. The problem is figuring out the correct value of position.
It is relatively simple to track the character offset (but see the disclaimer below if you require a portable scanner which could run on non-Posix platforms such as Windows):
%{
long scan_position = 0;
%}
%%
[[:space:]]* scan_position += yyleng;
"some pattern" { scan_position += yyleng; ... }
Since it's a bit tedious to insert scan_position += yyleng; into every rule, you can use flex's helpful YY_USER_ACTION macro hook: this macro is expanded at the beginning of every action (even empty actions). So you could write the above more simply:
%{
long scan_position = 0;
#define YY_USER_ACTION scan_position += yyleng;
%}
%%
[[:space:]]*
"some pattern" { ... }
One caveat: This will not work if you use any of the flex actions which adjust token length or otherwise alter the normal scanning procedure. That includes at least yyless, yymore, REJECT, unput and input. If you use any of the first three, you need to undo the increment with scan_position -= yyleng; (that needs to go just before the invocation of yyless, yymore or REJECT). After yyless, you also need to add the new, shorter yyleng back, since those characters were still consumed. For input and unput, you need to increment / decrement scan_position to account for the character read outside of the scanning process.
Disclaimer:
Tracking positions like that assumes that there is a one-to-one correspondence between bytes read from an input stream and raw bytes in the underlying file system. For Posix systems, this is guaranteed to be the case: fread(3) and read(2) will read the same bytes and the b open mode flag has no effect.
In general, though, there is no reliable way of tracking file position. You could open the stream in binary mode and deal with the system's idiosyncratic line endings yourself (this will work on Windows, but there is no portable way of establishing what the line ending sequence is, so it is not portable either). But on other non-Posix systems, it is possible that a binary read produces a completely different result (for example, the underlying file might use fixed-length records, so that each line is padded with some system-specific padding character to make it the correct length).
That's why the C standard prohibits the use of computed offset values:
For a text stream, either offset shall be zero, or offset shall be a value returned by an earlier successful call to the ftell function on a stream associated with the same file and whence shall be SEEK_SET. (§7.21.9.2 "The fseek function", paragraph 4).
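As a small illustration of the rule quoted above, the following C sketch only ever seeks to an offset that ftell itself returned, never to one computed by counting characters. The function name reread_from_saved_position is made up for this example, and it assumes a seekable stream whose lines are shorter than 128 bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Skips the first line, remembers where the second line starts (as reported
 * by ftell), reads the second line, seeks back and reads it again.
 * Returns 1 if both reads produced the same text, 0 on any failure. */
int reread_from_saved_position(FILE *fp)
{
    char line[128], again[128];

    assert(fp != NULL);
    if (fgets(line, sizeof line, fp) == NULL)   /* skip the first line */
        return 0;

    long saved = ftell(fp);                     /* offset supplied by the library */
    if (saved == -1L)
        return 0;

    if (fgets(line, sizeof line, fp) == NULL)   /* second line, first pass */
        return 0;
    if (fseek(fp, saved, SEEK_SET) != 0)        /* go back to the saved offset */
        return 0;
    if (fgets(again, sizeof again, fp) == NULL) /* second line, second pass */
        return 0;

    return strcmp(line, again) == 0;
}
```

This does not help directly with a flex scanner, because ftell(yyin) reports the position after the buffer was filled, which is exactly the problem described in the question.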
There is no way to turn buffering off in flex -- or any version of lex that I know of -- because correctly handling fallback depends on being able to buffer. (Fallback happens when the scan has proceeded beyond the end of a token, because the token matches the prefix of a longer token which happens not to be present.)
I think the only portable solution would be to copy the input stream token by token into your own buffer (or temporary file) and then use yypush_buffer_state and yy_scan_buffer (if you're using a buffer) to insert that buffer into the input stream. That solution would look a lot like the tracking code above, except that YY_USER_ACTION would append the tokens read to your own string buffer or temporary file. (You would want to make that conditional on a flag so that it only happens in the segment of the file you want to rescan.) If you have nested repeats, you could track the position in your own buffer/file in order to be able to return to it.
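The capture side of that idea could be sketched in plain C roughly as follows. The type capture_buf and the function capture_append are hypothetical names; in a flex scanner, capture_append would be called from YY_USER_ACTION with yytext and yyleng. The buffer always ends in two NUL bytes, because that is what yy_scan_buffer expects of the memory it is handed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical capture buffer: collects token text while `recording` is set. */
typedef struct {
    char  *text;       /* captured text, followed by two NUL bytes */
    size_t len;        /* number of captured bytes (without the NULs) */
    size_t cap;        /* allocated size of `text` */
    int    recording;  /* only capture inside the segment to be rescanned */
} capture_buf;

/* Appends one token. Returns 0 on success (or when not recording), -1 if
 * memory runs out. The current c->len can be saved beforehand as a mark to
 * return to, which is enough to support nested repeats. */
int capture_append(capture_buf *c, const char *tok, size_t tok_len)
{
    assert(c != NULL && tok != NULL);
    if (!c->recording)
        return 0;

    size_t needed = c->len + tok_len + 2;       /* +2 for the two NULs */
    if (needed > c->cap) {
        size_t new_cap = needed * 2;
        char *p = realloc(c->text, new_cap);
        if (p == NULL)
            return -1;
        c->text = p;
        c->cap = new_cap;
    }
    memcpy(c->text + c->len, tok, tok_len);
    c->len += tok_len;
    c->text[c->len] = '\0';
    c->text[c->len + 1] = '\0';
    return 0;
}
```

To rescan from a saved mark, the scanner would be pointed at c->text + mark with a size of c->len - mark + 2, so that the two terminating NULs are included.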

Related

forth implementation with JIT write protection?

I believe Apple has disabled being able to write and execute memory at the same time on the ARM64 architecture, see:
See mmap() RWX page on MacOS (ARM64 architecture)?
This makes it difficult to port implementations like jonesforth, which keeps generated code and the code to generate it (like the built-in assembler in jonesforth.f) in the same segment.
I thought I could do something like map the user space from start to HERE as 'r-x', and from here to the end as 'rw-'. Then I'd have to constantly remap memory as I compile new words, and I couldn't go and fix up previous words (I believe SCODE would make use of it).
Do you have any advice on how to handle such limitations ?
I guess I should look into other forth implementations that are running on M1 Macs.
A Forth implementation has a problem with write-protected code segments only when it generates machine code that must be executable immediately. There is no such problem if it uses threaded code. So it is assumed below that the Forth system has to generate machine code.
Data space and code space
Obviously, you have to separate code space from data space. Data space (at least its mutable regions, including those for variables and data fields), as well as internal mutable memory regions and probably headers, should be mapped to 'rw-' segments. Code space should be mapped to 'r-x' segments.
The word here ( -- addr ) returns the address of the first cell available for reservation, which a program can write to, so it should always be in an 'rw-' segment. You can have an internal word code::here ( -- addr ) that returns an address in code space, if you need one.
The representation of execution tokens is a compromise between speed and simplicity of implementation (an address in an 'r-x' segment vs. one in an 'rw-' segment). The simplest case is that an execution token is represented by an address in an 'rw-' segment, and then execute does an additional dereference to get the corresponding code address.
Code generation
Under these conditions we have to generate machine code into an 'rw-' segment, but before this code is executed, the segment must be made 'r-x'.
Probably the simplest solution is to allocate a memory block for every new definition, resize (minimize) the block on completion and make it 'r-x'. Possible disadvantages: losses due to page size (e.g. 4 KiB), and maybe memory fragmentation.
Changing protection of the main code segment starting from code::here also implies losses due to page size granularity.
Another variant is to split the creation of a definition into two stages:
generate intermediate representation (IR) in a separate 'rw-' segment during compilation of a definition;
when the definition is completed, generate machine code in the main code segment from IR, and discard IR code.
Actually, the first stage could produce machine code too, which is then just relocated to its final place in the second stage.
Before writing to the main code segment, you change it (or the affected part of it) to 'rw-', and afterwards you revert it to 'r-x'.
The subroutine that translates the IR code must reside in another 'r-x' segment that you don't change, since the main code segment is not executable while it is being written.
Forth is agnostic to the format of generated code, and in a straightforward system only a handful of definitions "know" what format is generated. So only these definitions should be changed to generate IR code. If you relocate machine code, you probably don't need to change even these definitions.

STM32 Current Flash Vector Address

I'm working on a dual-OS system with an STM32F103. I have two separate programs that are programmed at different FLASH locations. If both programs are the same, the only way to know which of them is running is by its start vector address.
But how can I read the current program's start vector address on the STM32?
After reading the comments, it sounds like what you have/want is a bootloader. If your goal here is to have two different applications, one to do your main processing and real time handling and the other to just program new firmware, then you want to make a bootloader in your default boot flash space.
Bootloaders fundamentally do a few things, everything else is extra.
Check itself using some type of data integrity check, like a CRC.
Check the application.
Jump to the application.
Bootloaders will also program applications in the app space and verify they are programmed correctly before jumping as well. Colin gave some good advice about appending a CRC to the hex file before it is programmed in flash space to verify the applications.
There are a few things to look out for. The first would be the linker script and this is extremely important. A linker script will be used to map input objects to output objects and then determine based upon that script, what memory space they go into. For both of your applications, you need to create a memory map of how you want both programs to sit inside of the flash space. From this point, you can then make linker scripts for both programs so that a hex file can be generated within the parameters of what you deem acceptable flash space for the program. Each project you have will have its own linker script. An example would look something like this:
LR_IROM1 0x08000000 0x00010000 { ; load region size_region
  ER_IROM1 0x08000000 0x00010000 { ; load address = execution address
    *.o (RESET, +First)
    *(InRoot$$Sections)
    .ANY (+RO)
  }
  RW_IRAM1 0x20000000 0x00018000 { ; RW data
    .ANY (+RW +ZI)
  }
}
This will give RAM for the application to use as well as a starting point for the application.
After that, you can start the bootloader and give it information about where the application space lies for jumping and programming. Once again this is all determined by you from your memory map and both applications' linker scripts. You are going to need to add a separate entry inside of the linker for your CRC and length for a comparison of the calculated versus stored as well. Whatever tool you use to append the CRC to the hex file and have it programmed to flash space, remember to note the location and make it known to the linker script so you can reference those addresses to check integrity later.
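The integrity check could look roughly like the C sketch below. The answer does not say which CRC is used, so this assumes the common reflected CRC-32 (polynomial 0xEDB88320, as used by zlib and Ethernet) and an image layout where the 4-byte CRC is appended little-endian right after the payload. Both are assumptions, as are the names crc32_ieee and app_image_is_valid; whatever tool appends the CRC and the bootloader simply have to agree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise reflected CRC-32 (assumed variant, see the note above). */
uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return ~crc;
}

/* Assumed image layout: payload bytes, then the CRC of the payload stored
 * as 4 little-endian bytes. Returns 1 if the stored and the calculated
 * CRC match, 0 otherwise. */
int app_image_is_valid(const uint8_t *image, size_t total_len)
{
    assert(image != NULL);
    if (total_len < 4)
        return 0;

    size_t payload_len = total_len - 4;
    const uint8_t *tail = image + payload_len;
    uint32_t stored = (uint32_t)tail[0]
                    | ((uint32_t)tail[1] << 8)
                    | ((uint32_t)tail[2] << 16)
                    | ((uint32_t)tail[3] << 24);

    return crc32_ieee(image, payload_len) == stored;
}
```

In a bootloader, image would be a pointer to the start of the application's flash region and total_len would come from the length entry recorded by the linker script.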
After you check everything and it is determined that it is okay to go to the application, you can use some ARM assembly to jump to the starting application address. Before jumping, make sure to disable all peripherals and interrupts that were enabled in the bootloader. As Colin mentioned, the two programs will share RAM, so it is important that you de-initialize everything you used; otherwise, you'll end up with a hard fault.
At this point, the program used another hex file laid out by a linker script, so it should begin executing as planned, as long as you have the correct vector table offset, which gets into your question fully.
As far as your question on the "Flash vector address", I think what you really mean is your interrupt vector table address. An interrupt vector table is a data structure in memory that maps interrupt requests to the addresses of interrupt handlers. Each entry of this table is a handler's address, and this is where the PC register grabs the next instruction address when a hardware interrupt triggers, for example. You can see this by keeping track of the ARM pipeline in a few lines of assembly code. The table location must match your application; otherwise you will never get into the main function, and the program will sit in the application space with nothing to do, since all handler addresses are unknown. This is what SCB->VTOR is for: it is the vector table offset register.
In this case, there are a few things you can do. Luckily, these are hard-coded in the STM-generated file "system_stm32(xx)xx.c" (xx is your microcontroller variant). There is a define called VECT_TAB_OFFSET, which is the offset of the vector table in the memory map; the chosen value is assigned to the SCB->VTOR register. Your interrupt vector table always lies at the starting address of the program it belongs to, so for the bootloader the offset can be 0x00, but for the application it is the starting address of the application space minus the first addressable flash address of the microcontroller.
/************************* Miscellaneous Configuration ************************/
/*!< Uncomment the following line if you need to relocate your vector Table in
Internal SRAM. */
/* #define VECT_TAB_SRAM */
#define VECT_TAB_OFFSET 0x00 /*!< Vector Table base offset field.
This value must be a multiple of 0x200. */
/******************************************************************************/
Make sure you understand what the microcontroller expects, using the STM documentation, before programming things. On this chip the vector table offset must be a multiple of 0x200. But to answer your question: this address is determined by your memory map, and eventually you will have a hard-coded reference to it as a define. At run time, the vector table location currently in use is the value held in SCB->VTOR. You can figure it out from there.
Hope this helps and good luck to you on your application.

Getting the current index in the input string (flex lexer)

I am using flex lexer. Is there a way to (1) get the current index in the input string (2) jump back to that index in a future time point?
Thanks.
It's fairly easy to maintain the current input position. When any rule is matched, yyleng contains the length of the match, so it is sufficient to add yyleng to the cumulative length processed. Assuming you are using flex, it is not necessary to insert the code directly into every rule action, which would be tedious. Instead, you can use the YY_USER_ACTION macro:
#define YY_USER_ACTION input_pos += yyleng;
(This assumes that you have defined input_pos somewhere, and arranged for it to be initialized to 0 when the lexical scan commences.)
This will lead to incorrect results if you use REJECT, yymore(), yyless() or input(); in all of these cases, you will have to adjust the value of input_pos. For every call to yymore(), you need to subtract yyleng from input_pos; this will also work for REJECT. For a call to yyless(), you can subtract yyleng before the call and add it back after the call. For each call to input(), you need to add one to input_pos.
Within a rule, you can then use input_pos as the position at the end of the match, or input_pos - yyleng as the position at the beginning of the match.
Returning to a saved position is trickier.
(F)lex does not maintain the entire input in memory, so in principle you would need to use fseek() to rewind yyin to the correct place. However, in the common case where yyin has not been opened in binary mode, you cannot reliably use fseek() to return to a computed input offset. So at a minimum, you would have to ensure that yyin was opened (or reopened) in binary mode.
Moreover, it is not in general possible to guarantee that whatever stream yyin is attached to can be rewound at all (it might be console input, a pipe, or some other non-seekable device). So to be fully general, you might have to use a temporary file to store data read from the stream. This will create additional complications when you attempt to reread previous input, because you will have to switch to the temporary file for reading until it is finished, at which point you would have to return to the main file. Creative use of yywrap will simplify this procedure.
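A simpler variant of the temporary-file idea is to copy the whole stream into a temporary file before scanning starts and then read only from the copy, which is always seekable. The C sketch below shows that variant (not the incremental switching with yywrap described above); spool_to_tmpfile is a made-up name. Since tmpfile() opens its file in binary update mode, computed offsets can be used with fseek on the copy.

```c
#include <assert.h>
#include <stdio.h>

/* Copies everything readable from `in` into a temporary file and returns
 * that file, rewound to the start. Returns NULL on any I/O failure. */
FILE *spool_to_tmpfile(FILE *in)
{
    assert(in != NULL);

    FILE *spool = tmpfile();
    if (spool == NULL)
        return NULL;

    char chunk[4096];
    size_t n;
    while ((n = fread(chunk, 1, sizeof chunk, in)) > 0) {
        if (fwrite(chunk, 1, n, spool) != n) {
            fclose(spool);
            return NULL;
        }
    }
    if (ferror(in)) {
        fclose(spool);
        return NULL;
    }
    rewind(spool);  /* also flushes, so the copy can be read back */
    return spool;
}
```

In a flex scanner, the returned stream would be assigned to yyin before the first call to yylex(). The obvious cost is that no token is produced until the whole input has been read, so this does not suit interactive input.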
Note that after you rewind the input stream -- whether or not you switch to reading from a temporary file -- you must call yyrestart() to reset the scanner's input buffer. (This is also a flex-only feature; Posix lex does not specify the mechanism by which you inform the scanner that its buffer needs to be reset, so if you are not using flex you will have to consult the relevant documentation for your scanner generator.)

Tracking address when writing to flash

My system needs to store data in an EEPROM flash. Strings of bytes will be written to the EEPROM one at a time, not continuously at once. The length of strings may vary. I want the strings to be saved in order without wasting any space by continuing from the last write address. For example, if the first string of bytes was written at address 0x00~0x08, then I want the second string of bytes to be written starting at address 0x09.
How can it be achieved? I found that some EEPROMs' write commands do not require the address to be specified and just continue from the last written location. But the EEPROM I am using does not support that (I am using Spansion's S25FL1-K). I thought about allocating part of the memory to track the address and storing the address every time I write, but that might wear out the flash faster. What is the widely used method to handle such a case?
Thanks.
EDIT:
What I am asking is how to track/save the address in a non-volatile way so that when the next write happens, I know what address to start at.
I have never worked with this particular flash, but I've implemented something similar. Unfortunately, without knowing your constraints / priorities (memory or CPU efficiency, how often writes happen etc.) it is impossible to give a definite answer. Here are some techniques that you may want to consider. I don't know if they are widely used though.
Option 1: Write X bytes containing the string length before the string. Then on initialization you could parse your flash: read the length n, jump n bytes forward; read the next byte. If it's empty (all ones for your flash according to the datasheet) then you have found your first empty byte. Otherwise you've just read the length of the next string, so do the same over again.
This method allows you to quickly search for the last used sector, since the first byte of a used sector is guaranteed to have a value. The flip side here is the overhead of the extra X bytes (depending on the max string length) each time you write a string, and having to parse the flash to get the value (although this needs to be done only once, on boot).
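Option 1 can be sketched in C against a RAM array that stands in for the flash, so the scan logic can be tried on a PC. The sketch assumes a one-byte length prefix (X = 1), an erased value of 0xFF, and therefore strings of at most 254 bytes; flash_find_free and flash_append are invented names, and on the real part the writes would go through the chip's program command instead of memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FLASH_ERASED 0xFFu  /* value of an erased byte (all ones) */

/* Walks the length-prefixed records from address 0 and returns the address
 * of the first erased byte, or flash_size if the region is full. */
size_t flash_find_free(const uint8_t *flash, size_t flash_size)
{
    size_t addr = 0;
    while (addr < flash_size && flash[addr] != FLASH_ERASED)
        addr += 1u + flash[addr];       /* skip the prefix and the string */
    return addr < flash_size ? addr : flash_size;
}

/* Appends one string as [length][bytes...] at the first free address.
 * Returns 0 on success, -1 if the record does not fit. */
int flash_append(uint8_t *flash, size_t flash_size,
                 const uint8_t *data, uint8_t len)
{
    assert(len != FLASH_ERASED);        /* 0xFF is reserved to mean "empty" */

    size_t addr = flash_find_free(flash, flash_size);
    if (addr >= flash_size || flash_size - addr < (size_t)len + 1u)
        return -1;

    flash[addr] = len;
    memcpy(flash + addr + 1, data, len);
    return 0;
}
```

On a real device the result of flash_find_free would be computed once at boot and kept in RAM, so later writes do not have to rescan.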
Option 2: Instead of prepending the size, append a unique "end-of-string" sequence, and then parse on boot for the last such sequence before the ones that represent empty flash.
The disadvantage here is a longer parse, but you could possibly get away with just 1 byte of overhead for each string.
Option 3 would be just what you already thought of: allocating a separate sector that would contain the value you need. To reduce flash wear you could also write these values back-to-back and search for the last one each time you boot. Also, you might consider the expected lifetime of the device that you program versus the 100,000 erases that your flash can sustain (again according to the datasheet): is wear even a problem? That of course depends on how often data will be saved.
Hope that helps.

Using PARSE on a PORT! value

I tried using PARSE on a PORT! and it does not work:
>> parse open %test-data.r [to end]
** Script error: parse does not allow port! for its input argument
Of course, it works if you read the data in:
>> parse read open %test-data.r [to end]
== true
...but it seems it would be useful to be able to use PARSE on large files without first loading them into memory.
Is there a reason why PARSE couldn't work on a PORT! ... or is it merely not implemented yet?
The easy answer is no, we can't...
The way parse works, it may need to roll back to a prior part of the input string, which might in fact be the head of the complete input, when it meets the last character of the stream.
Ports copy their data to a string buffer as they get their input, so in fact there is never any "prior" string for parse to roll back to. It's like quantum physics... just looking at it, it's not there anymore.
But as you know in rebol... no isn't an answer. ;-)
This being said, there is a way to parse data from a port as it's being grabbed, but it's a bit more work.
What you do is use a buffer, and
APPEND buffer COPY/part connection amount
Depending on your data, amount could be 1 byte or 1kb, use what makes sense.
Once the new input is added to your buffer, parse it and add logic to know if you matched part of that buffer.
If something positively matched, you remove/part what matched from the buffer, and continue parsing until nothing parses.
You then repeat the above until you reach the end of input.
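The same append, match, remove loop is sketched here in C rather than REBOL, only to make the buffer handling explicit. It assumes newline-terminated messages as the framing and a fixed 256-byte buffer; stream_buf, stream_buf_append and stream_buf_take_message are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STREAM_BUF_CAP 256

/* Bytes received from the connection but not yet matched. */
typedef struct {
    char   data[STREAM_BUF_CAP];
    size_t len;
} stream_buf;

/* Appends a freshly received chunk. Returns 0, or -1 if it would overflow. */
int stream_buf_append(stream_buf *b, const char *chunk, size_t n)
{
    if (n > STREAM_BUF_CAP - b->len)
        return -1;
    memcpy(b->data + b->len, chunk, n);
    b->len += n;
    return 0;
}

/* Tries to match one complete message (terminated by '\n') at the front of
 * the buffer. On a match, copies it without the terminator into `out`,
 * removes it from the buffer and returns 1. Returns 0 if only a partial
 * message is available so far, leaving the buffer untouched. */
int stream_buf_take_message(stream_buf *b, char *out, size_t out_cap)
{
    char *nl = memchr(b->data, '\n', b->len);
    if (nl == NULL)
        return 0;

    size_t msg_len = (size_t)(nl - b->data);
    assert(msg_len < out_cap);

    memcpy(out, b->data, msg_len);
    out[msg_len] = '\0';

    size_t consumed = msg_len + 1;      /* the message plus its terminator */
    memmove(b->data, b->data + consumed, b->len - consumed);
    b->len -= consumed;
    return 1;
}
```

The caller would append each chunk it receives and then call stream_buf_take_message in a loop until it returns 0, which mirrors "continue parsing until nothing parses".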
I've used this in a real-time EDI tcp server which has an "always on" tcp port in order to break up a (potentially) continuous stream of input data, which actually piggy-backs messages end to end.
details
The best way to set up this system is to use /no-wait and loop until the port closes (you receive none instead of "").
Also make sure you have a way of checking for data integrity problems (like a skipped byte, or an erroneous message) when you are parsing; otherwise, you will never reach the end.
In my system, when the buffer grew beyond a specific size, I tried an alternate rule which skipped bytes until a pattern might be found further down the stream. If one was found, an error was logged, the partial message stored and an alert raised for the sysadmin to sort out the message.
HTH !
I think that Maxim's answer is good enough. At the moment, PARSE on a port is not implemented. I don't think it's impossible to implement it later, but we must solve other issues first.
Also, as Maxim says, you can do it even now, but it depends very much on what exactly you want to do.
You can certainly parse large files without needing to read them completely into memory. It's always good to know what you expect to parse. For example, all large files, like music and video files, are divided into chunks, so you can just use copy|seek to get these chunks and parse them.
Or if you want to get just the titles of multiple web pages, you can read, let's say, the first 1024 bytes and look for the title tag there; if that fails, read more bytes and try again...
That's exactly what must be done to allow parse on port natively anyway.
And feel free to add a WISH in the CureCode database: http://curecode.org/rebol3/
