Disassembly of Forth code words with 'see' - forth

I am preparing overall knowledge on building a Forth interpreter and want to disassemble some of the generic Forth code words such as +, -, *, etc.
My Gforth (version 0.7.3, installed on Ubuntu Linux) lets me decompile colon definitions that I make, using the command see, and also the single code word . (dot). But when I try it with other code words, such as see + or see /, I get output that says only Code +, and then I'm no longer able to type in my terminal, even when I press Ctrl-C.
I should be able to decompile/disassemble the code words, as shown by the Gforth manual: https://www.complang.tuwien.ac.at/forth/gforth/Docs-html/Decompilation-Tutorial.html
Has anyone else had this issue, and do you know how to fix it?

Reverting to the old ptrace method did it for me.
First, from the command line as user root run:
echo 0 >/proc/sys/kernel/yama/ptrace_scope
After that, see should disassemble whatever it can't decompile. Command-line example (this need not be run as root):
gforth -e "see + bye"
Output:
Code +
0x000055a9bf6dad66 <gforth_engine+2454>: mov %r14,0x21abf3(%rip) # 0x55a9bf8f5960 <saved_ip>
0x000055a9bf6dad6d <gforth_engine+2461>: lea 0x8(%r13),%rax
0x000055a9bf6dad71 <gforth_engine+2465>: mov 0x0(%r13),%rdx
0x000055a9bf6dad75 <gforth_engine+2469>: add $0x8,%r14
0x000055a9bf6dad79 <gforth_engine+2473>: add %rdx,(%rax)
0x000055a9bf6dad7c <gforth_engine+2476>: mov %rax,%r13
0x000055a9bf6dad7f <gforth_engine+2479>: mov -0x8(%r14),%rcx
0x000055a9bf6dad83 <gforth_engine+2483>: jmpq *%rcx
end-code
Credit: Anton Ertl
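
To make the setting persist across reboots, the standard sysctl configuration mechanism should work; a hedged sketch based on the usual Linux Yama documentation rather than the answer above (the file name 10-ptrace.conf is arbitrary):
echo 'kernel.yama.ptrace_scope = 0' | sudo tee /etc/sysctl.d/10-ptrace.conf
sudo sysctl --system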

Most versions of SEE that I've seen are meant only for decompiling colon definitions. + and / and other arithmetic operations are usually written in assembly code, and SEE doesn't know what to do with them. That's why you were getting the Code message: they're written in machine code, not Forth. There are several Forth implementations I've seen that have built-in assemblers, but I don't think I've ever seen a disassembler. Your best bet for seeing the inner workings of + or / or other such words might be to use DUMP or another such word to get a list of the bytes in the word, and either disassemble the word by hand or feed the data into an external disassembler. Or see if you can find the source code for your implementation or a similar one.
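For example, a rough sketch of the DUMP approach (implementation-dependent: it assumes the execution token returned by ' is a machine address you can dump, which holds in many Forths but not all):
' + 16 dump  \ hex-dump the first 16 bytes at +'s execution token
The resulting bytes can then be disassembled by hand or fed to an external disassembler.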

SEE is a word whose behaviour is not tightly controlled. It makes a best effort to show the code of a word X when invoked as
SEE X
It behaves slightly differently according to how difficult that is. If you defined the word yourself in the session, you're pretty much guaranteed to get your code back. If it is a built-in word, especially a very elementary one like + , it is harder: the result may look nothing like the original definition, because of optimisation or compilation into machine code.
Specifically for Gforth: when this gets hard, Gforth invokes the standard tools present on the system for analysing object files. So it may be necessary to install gdb and/or investigate how Gforth tries to connect to it. For the concrete example of Ubuntu and Gforth 0.7.3, Lutz Mueller gives a recipe above.

I think SEE does its job as designed.
There are words in Forth defined in machine code (often called primitives), and the user can also define machine code via the assembler, e.g.:
CODE MYCODE assembler mnemonics END-CODE
So Code in SEE's output is not an error: it tells you that (for example) + was defined as machine code, and the disassembled mnemonics can be seen to the right of it in the output.

Related

Erlang and Elixir's colorful REPL shells

How does Learn You Some Erlang or IEx colorize the REPL shell? Is kjell a stable drop-in replacement?
The way this is done in LYSE is to use a JavaScript plugin called highlight.js, so LYSE isn't actually doing it; your browser is. There are plugins/modes for most mainstream(ish) languages available for highlight.js. If the web is what you are interested in, this is one way to do it (except for when a user can't use JS or has it turned off).
This isn't actually the shell being highlighted at all, nor is it useful anywhere outside of browsers. I've been messing around with a way to do this more generically, initially by inserting static formatting in HTML and XML documents (feed it a document, and it outputs one with Erlang syntax highlighted a certain way whenever this is detected/tagged). I don't yet have a decent project to publish for this (very low on my priority list atm), but I can point you in the direction of some solid inspiration: the source for wx:demo.
Pay particular attention to the function demo:code_area/1. There you will see how the tokenization routines are used to provide highlight hints for the source code text display area in the wx:demo application. This can provide a solid foundation to build your own source highlighting/display utility. (I think it wouldn't be impossible, considering every terminal in common use today responds correctly to ANSI color codes, to write a plugin to the shell that highlights terminal input directly -- not that there is a big clamor for this feature at the moment.)
EDIT (Prompted by a comment by Fred the Magic Wonder Dog)
On the subject of ANSI color codes, if this is what you are actually after, they are easy to implement: prepend them to any string value you are returning within a terminal. The terminal interprets them, so you won't see the characters; instead it performs whatever action the code represents. There is no termination (it's not like a markup tag that encloses the text) and typically no concept of a "default color to go back to" (though there are a gajillion-jillion extensions to telnet and terminal modes that enable all sorts of nonsense like this).
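For instance, a minimal sketch in Erlang (the escape sequences are standard ANSI SGR codes, not taken from any of the projects mentioned here; \e[0m is the conventional reset):
%% prepend a color code, print the string, then reset the terminal
io:format("\e[31m~s\e[0m~n", ["this prints in red"]).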
An example of basic colorization is the telcon:greet/0 and telcon:sys_help/0 functions in the v0.1 code of ErlMUD (along with a slew of other places -- colorization in games is sort of a thing). What you see there is a pre-built list per color, but this could be represented any way that would get those values at the front of the string. (I just happened to remember the code value sequences, but didn't remember the characters that make them up; the next version of the code represents this somewhat differently.) Here is a link to a list of ANSI color codes and a discussion about colorizing the shell. Play around! It's nerdy fun, 1980's style!
Oh, I almost forgot... if you really want to go down the rabbit hole without silly little child toys like ncurses to help you, take a look at termcap.
I don't know if kjell is a stable drop-in replacement for Erl but it wouldn't be for IEx.
As for how the colors are done: to the best of my knowledge, it's done with ANSI escape sequences.

Is there a list of special characters to be avoided in JCL/MVS Script variables?

I have a program that generates random pin codes. These pins are generated in Java, then stored in the mainframe, via a NATURAL program. From there, they are eventually physically printed by a batch JCL job that calls an MVS SCRIPT to print the form, with the pin code on it.
I've come across issues with special characters before, such as |{}![]^~<>; which for one reason or another do not print properly. I've also removed 0OQ1l for OCR reasons.
Recently, an error came to my attention involving another character that doesn't print properly: the . character, and it fails only when it is the first character of the PIN code.
Since I've run into this issue, I thought I would see whether I could find other special JCL, Natural, or MVS Script characters that might interfere with my program's operation, so that I can test them now and hopefully not run into this issue later, rather than falling back to using only OCR-safe alphanumeric characters.
EDIT
Java - Web Application Runs Under Tomcat 6.x on a Solaris Server.
Natural - The Natural Program is called using webmethods Broker generated classes (POJOs).
My understanding is it uses RPC for actual communication.
The program verifies some data and stores the PIN, in combination with a GUID, on a record in ADABAS.
There is a batch job that runs to print the forms. The Batch job is written in JCL.
My understanding from the maintainer of the batch job and the forms is that the actual language describing the forms themselves and how they get printed is an outdated/unsupported language called MVS SCRIPT.
The bottom section of the script looks like this:
//**********************************************************************
//* PRINT SORTED FORMS TO #### USING MVS SCRIPT
//**********************************************************************
//PRINTALL EXEC PGM=DSMSPEXEC,PARM='LIST'
//* less 'interesting' lines omitted
//SYSPRINT DD SYSOUT=*
//PRINT1 DD SYSOUT=A,OUTPUT=*.C####,
// RECFM=VBM,LRECL=####,BLKSIZE=####
//* less 'interesting' lines omitted
//SYSIN DD *
AUTH /* redacted */
SCRIPT FROM(MYFORMS) (MESSAGE(ID TRACE) CONT -
FILE(PRINT1) PROFILE(redacted) -
NOSEGLIB DEVICE(PG4A) CHARS(X0A055BC))
.C#### is an actual number and is a variable that points to the chosen printer.
NOTE: I'm a Web Programmer, I don't speak mainframe, JCL, MVS, etc.
I think you will find the program (pgm=) is DSMSPEXC and not DSMSPEXEC.
I am guessing (could be wrong) we are talking about Script/DCF (which later became IBM Bookmaster / Bookmanager on other platforms).
Script/DCF is basically a GML based language. It was from GML that SGML was derived (HTML and XML are prominent examples of SGML languages).
In Script, : starts a tag and . ends a tag. There are also macros, which have a . in column 1:
.* ".*" in column 1 starts a line comment
.* .fo off is Format off (like <pre> in HTML)
.fo off
.* Starting an ordered list
:ol.
:li.Item in ordered list
:eol.
That is:
Script   HTML
:        <      starts a tag
.        >      ends a tag (Script/DCF is generally pretty tolerant of .)
&        &      starts a variable
There are variables (&gml. = :) for most of the special characters.
Characters to worry about are
: - always
& - always
. - in column one or after a :.
Other characters should be OK provided there are no translation errors. The charset X0A055BC (mainframe SONORAN SANS SERIF??) might not have all the special characters in it, though.
There are manuals around for Script/DCF tags.
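Given those rules, one pragmatic option is to filter the alphabet on the Java side, before a PIN ever reaches the mainframe. A minimal sketch (a hypothetical helper, not from the original system; it simply omits the characters called out in this thread, the Script/DCF-sensitive : & . and the OCR-confusable 0OQ1l):

import java.security.SecureRandom;

public class PinGenerator {
    // Alphanumerics minus OCR-confusable 0, O, Q, 1, l;
    // : & . and the other punctuation are excluded by construction
    private static final String ALPHABET =
        "ABCDEFGHIJKLMNPRSTUVWXYZabcdefghijkmnopqrstuvwxyz23456789";
    private static final SecureRandom RNG = new SecureRandom();

    public static String generatePin(int length) {
        StringBuilder pin = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            pin.append(ALPHABET.charAt(RNG.nextInt(ALPHABET.length())));
        }
        return pin.toString();
    }
}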
Your data is not going to affect the JCL in any way.
I don't know about ADABAS or NATURAL. If you ask here, http://www.ibmmainframeforum.com/viewforum.php?f=25, specifically about that part, with as much detail as you can provide, there is a very expert guy, RDZbrog, who can probably answer that for you.
For the SCRIPT/VS itself, as Bruce Martin has pointed out, there may be some issues. With .xx and :xx there is not a clash with normal text. But you don't have normal text. With the &, which indicates a SCRIPT variable, it is more likely to be problematic and at any location.
I would fire some test data through: Your PINs with position one being all available punctuation preceding "fo" and "ol", and the same with those sequences "embedded" in your PINs. Also include a double & and a triple &.
Your query should be resolved in your specification. It is not, but I'm sure you'll get all the documentation updated when you get a resolution.

Determine Cobol coding style

I'm developing an application that parses COBOL programs. Some of these programs respect the traditional coding style (program text from column 8 to 72), and some are newer and don't follow this style.
In my application I need to determine the coding style in order to know whether I should parse content after column 72.
I've been able to determine whether a program starts at column 1 or 8, but a program that starts at column 1 can also follow the rule of comments after column 72.
So I'm trying to find rules that will allow me to determine whether text after column 72 is a comment or valid code.
I've found some, but it's hard to tell whether they will work every time:
a dot after column 72 marks the end of a sentence, but I fear a dot can appear in comments too
finding the closing character of a statement after column 72: " ' ) }
looking at the characters in columns 71, 72, and 73; if there is no space, find the whole word and check whether it's a keyword or a variable. Problem: it can be a variable from a COPY or a replacement, etc.
I'd like to know what you think of these rules, and whether you have any other ideas to help me determine the coding style of a COBOL program.
I don't need an API or anything, just solid rules that I can rely on.
I think you need to know the COBOL compiler for each program. Its documentation should tell you what conventions/configurations/switches it uses to decide if the source code ends at column 72 or not.
So.... which compiler(s)?
And if you think the column 72 issue is a pain, wait till you get around to actually parsing the COBOL itself. If you are not well prepared to handle the lexical issues of the language, you are probably very badly prepared to handle the syntactic ones.
There is no absolutely reliable way to determine whether a COBOL program is in fixed or free format based only on the source code. Heck, it is sometimes difficult to identify the programming language itself based only on source code. Check out this classic polyglot - it is valid under 8 different language compilers. That said, you could try a few heuristics that might yield the correct answer more often than not.
Compiler directives embedded in source code
Watch for certain compiler directives that determine code format. Unfortunately, every compiler vendor uses their own flavour of directive. For example, Micro Focus COBOL uses the SOURCEFORMAT directive. This directive will appear near the top of the program, so a short pre-scan could be used to find it. On the other hand, OpenCOBOL uses >>SOURCE FORMAT IS FREE and >>SOURCE FORMAT IS FIXED to toggle between free and fixed format, so different parts of the same program could be formatted differently!
The bottom line here is that you will have to support the conventions of multiple COBOL compilers.
Compiler switches
Source code format can also be specified using a compiler switch. In this case, there are no concrete clues to go on. However, you can be reasonably sure that the entire source program will be either fixed or free. All you can do here is guess. Unless the programmer is out to "mess with your head" (and some will), a program in free format will have the keywords IDENTIFICATION DIVISION or ID DIVISION starting before column 8. Every COBOL program will begin with these keywords, so you can use them as the anchor point for determining code format in the absence of embedded compiler directives.
Warning - this is far from foolproof, but might be a good start.
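Putting both heuristics together, a minimal pre-scan sketch in Java (a hypothetical helper, not from this answer: it only recognizes the OpenCOBOL-style >>SOURCE FORMAT directive and the ID DIVISION anchor described above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FormatSniffer {
    /** Returns "free", "fixed", or "unknown" based on the heuristics above. */
    public static String guessFormat(Path source) throws IOException {
        for (String line : Files.readAllLines(source)) {
            String upper = line.toUpperCase();
            // An embedded directive wins outright
            if (upper.contains(">>SOURCE FORMAT IS FREE")) return "free";
            if (upper.contains(">>SOURCE FORMAT IS FIXED")) return "fixed";
            // Anchor heuristic: where does the program header start?
            int col = upper.indexOf("IDENTIFICATION DIVISION");
            if (col < 0) col = upper.indexOf("ID DIVISION");
            if (col >= 0) return (col + 1) < 8 ? "free" : "fixed"; // col is 0-based
        }
        return "unknown";
    }
}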
There won't be an algorithm to do this with 100% certainty, because if comments can be anything, they can also be compilable COBOL code. So you could theoretically write a program that means one thing if the comments are ignored, and something else entirely if the comments are treated as part of the COBOL.
But that's extremely unlikely. What's most likely to happen is that if you try to compile the code under the wrong convention, it will simply fail. So the only accurate way to do this is to try compiling/parsing the program one way, and if you come to a line that can't make sense, switch to the other style. You could also support passing an argument to the compiler when the style is already known.
You can try using heuristics like what you've described, but that will never be totally accurate. The most they can give you is a probability that the code is one or the other style, which will increase as they examine more and more lines of code. They could be useful for helping you guess the style before you start compiling, or for figuring out when the problem is really just a typo in the code.
EDIT:
Regarding ideas for heuristics, it's hard to say. If there were a standard comment sigil like // or # in other languages, this would be a lot easier (actually, there is, but it sounds like your code doesn't follow this convention). The only thing I can think of would be to check whether every line (or maybe 99% of lines, and not counting empty lines or lines commented with *) has a period somewhere before position 72.
One thing you DON'T want to do is apply any heuristics to the part after position 72. That is, you don't want to be checking the comments to see if they're valid COBOL. You want to check what you know is COBOL first, and see if that works by itself. There are several reasons for this:
Comments written in English are likely to have periods and quotes in them, so your first and second bullet points are out.
Natural languages are WAY harder to parse than something like COBOL.
The comments could easily have COBOL in them (maybe someone commented out the previous version of the line).
An important rule for comments is that they should never affect what the program does. If changing the comments can change how the program is compiled, you violate that.
All that in mind, my opinion is that you shouldn't use heuristics at all. You should always try to compile the program under both conventions unless one is explicitly specified. There's a chance that code will compile successfully under both conventions, and then you'll have two different programs and no way to tell which one is correct.
If that happens, you need to compare the two results (perhaps with a hash or something) to see if they're the same program. If they're the same, great, but if not, you'll need to force the user to explicitly choose a convention.
Most COBOL compilers will allow you to generate and analyze the output of the text-manipulation phase.
The text preprocessor output can be seen with (using OpenCOBOL for the example):
cobc -E program.cob
The text-manipulation processor deals with any COPY ... REPLACING compiler directives, as well as converting SOURCE FORMAT IS FIXED (with line continuations, string-literal concatenations, and comment-line removal, among other things) to the actual free format that the compiler's lexical analyzer needs. A lot of the OpenCOBOL toolkits (the Cross referencer and Animator, to name two) use source code AFTER the preprocessor pass. I don't think you'll lose any street cred if your parser program relies on post-processed source code files.

How to "disable" a file output stream

I'm working on some legacy code in which there are a lot of WriteLn(F, '...') commands scattered pretty much all over the place. There is some useful information in these commands (which variables contain what information, etc.), so I'd prefer not to delete them or comment them out, but I want to stop the program from writing the file.
Is there any way that I can assign the F variable so that anything written to it is ignored? We use the console output, so that's not an option.
Going back a long, long time to the good old days of DOS: if you assign 'f' to the device 'nul', then there should be no output.
assign (f, 'nul')
I don't know whether this still works in Windows.
Edit:
You could also assign 'f' to a file - assignfile (f, 'c:\1.txt') - for example.
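Either way, a minimal sketch of the whole pattern, assuming a Turbo Pascal / Delphi-style runtime (on Unix-like systems '/dev/null' plays the role of 'NUL'):
var
  F: Text;
begin
  Assign(F, 'NUL');  { the null device: anything written here is discarded }
  Rewrite(F);
  WriteLn(F, 'this line goes nowhere');
  Close(F);
end.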
Opening the null device and letting output go there would probably work. Under DOS, the performance of the NUL device was astonishingly bad, IIRC (from what I understand it wasn't buffered, so the system had to look up NUL in the device table while processing each byte), but I would not be at all surprised if it has improved under newer systems. In any case, that's probably the easiest thing you can do unless you really need to maximize performance. If performance is critical, it might in theory be possible to override the WriteLn function so that it does nothing for certain files, but unfortunately I believe WriteLn allows syntax forms that are not permissible for any user-defined function.
Otherwise, I would suggest doing a regular-expression find/replace to comment out the WriteLn statements in a fashion that can be mechanically restored.

Tex command which affects the next complete word

Is it possible to have a TeX command which will take the whole next word (or the next letters up to but not including the next punctuation symbol) as an argument and not only the next letter or {} group?
I’d like to have a \caps command on certain acronyms but don’t want to type curly brackets over and over.
First of all create your command, for example
\def\capsimpl#1{{\sc #1}}% Your main macro
The solution to catch a space or punctuation:
\catcode`\@=11 % make @ usable inside internal control-sequence names
\def\addtopunct#1{\expandafter\let\csname punct@\meaning#1\endcsname\let}% register #1 as punctuation
\addtopunct{ }
\addtopunct{.} \addtopunct{,} \addtopunct{?}
\addtopunct{!} \addtopunct{;} \addtopunct{:}
\newtoks\capsarg
\def\caps{\capsarg{}\futurelet\punctlet\capsx}% peek at the next token
\def\capsx{\expandafter\ifx\csname punct@\meaning\punctlet\endcsname\let
\expandafter\capsend
\else \expandafter\continuecaps\fi}
\def\capsend{\expandafter\capsimpl\expandafter{\the\capsarg}}
\def\continuecaps#1{\capsarg=\expandafter{\the\capsarg#1}\futurelet\punctlet\capsx}
\catcode`\@=12
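With these definitions, typing for example
\caps nasa was founded in 1958.
sets "nasa" in small caps without braces: \caps absorbs tokens up to the next space or registered punctuation mark and passes them to \capsimpl.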
@Debilski - I wrote something similar to your active * code for the acronyms in my thesis. I activated < and then used \def<#1> to print the acronym, as well as the expansion if it's the first time it's encountered. I also went a bit off the deep end by allowing the expansions to be defined in-line and using the .aux files to send the expansions "back in time" if they're used before they're declared, or to report errors if an acronym is never declared.
Overall, it seemed like it would be a good idea at the time - I rarely needed < to be catcode 12 in my actual text (since all my macros were in a separate .sty file), and I made it behave in math mode, so I couldn't foresee any difficulties. But boy was it brittle... I don't know how many times I accidentally broke my build by changing something seemingly unrelated. So all that to say, be very careful activating characters that are even remotely commonly-used.
On the other hand, with XeTeX and higher Unicode characters, it's probably a lot safer, and there are generally easy ways to type these extra characters, such as making a multi (or compose) key (I usually map either Num Lock or one of the Windows keys to this), so that e.g. multi-!-! produces ¡. Or if you're running in Emacs, you can use C-\ to switch into TeX input mode briefly to insert Unicode by typing the TeX command for it (though this is a pain for actually typing TeX documents, since it intercepts your actual \'s, and please please don't try defining your own escape character!)
Regarding whitespace after commands: see package xspace, and TeX FAQ item Commands gobble following space.
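For the gobbled-space problem specifically, a minimal LaTeX sketch using that package (the macro name \nasa is just an example):
\usepackage{xspace}
\newcommand{\nasa}{\textsc{nasa}\xspace}
% "\nasa was founded..." now keeps the space that follows the command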
Now, why this is very difficult: as you noted yourself, things like that can apparently only be done by changing catcodes. Catcodes are assigned to characters when TeX reads them, and TeX reads one line at a time, so you cannot do anything with other spaces on the same line, IMHO. There might be a way around this, but I do not see it.
Dangerous code below!
This code will do what you want only at the end of the line, so if what you want is more "fluent" typing without brackets, but you are willing to hit 'return' after each acronym (and not run any auto-indent later), you can use this:
\def\caps{\begingroup\catcode`^^20 =11\mcaps}
\def\mcaps#1{\def\next##1 {\sc #1##1\catcode`^^20 =10\endgroup\ }\next}
One solution might be setting another character as active and using this one for escaping. This does not remove the need for a closing character but avoids typing the \caps macro, thus making it overall easier to type.
Therefore under very special circumstances, the following works.
\catcode`\*=\active
\def*#1*{\textsc{\MakeTextLowercase{#1}}}
Now follows an *Acronym*.
Unfortunately, this makes use of \section*{} impossible without additional macro definitions.
In XeTeX, it seems to be possible to exploit Unicode characters for this, so one could define
\catcode`\•=\active
\def•#1•{\textsc{\MakeTextLowercase{#1}}}
Now follows an •Acronym•.
This should reduce the effects on other commands, but of course it needs the character ‘•’ mapped to the keyboard somewhere to be of use.
