The company I work for has a program that is no longer supported called QADisplay. Inside this program is a tool for annotating images. It's very similar to Photoshop in that it takes a layer-based approach to the annotations, with each annotation as its own class in Delphi 7. These annotations are stored as the base image plus a text file describing the contents of the annotation.
The issue is that the text that is displayed in the annotations is somehow encoded in the text file. For example, if the annotation displays as "Arial" (without the quotes), the text file will be written as:
TEXT (Type of annotation)
5 (Length of the literal string, in this case: Arial)
07)I86P (The encoded string)
What I need to do is extract all of the text from the annotations in preparation for the installation of our new software system.
I am not familiar with Delphi and do not have access to the source code. I have tried to disassemble the executable but haven't had much luck there. Does anyone have any ideas on how to approach decoding this? I've googled around a bit (Arial "07)I86P") and found some results relating to virus scan error logs and things of that nature but no dice on anything that I found helpful in relation to the issue I'm having.
That is not a standard text encoding. Maybe it is encrypted?
Without documentation or contact with the original developers, you will have to reverse engineer the app. Using a disassembler/debugger like IDA, if you can pause the app after it loads 07)I86P into memory, you can follow the code as it processes the characters, which will help you reconstruct the decode algorithm.
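For what it's worth, the one sample in the question is consistent with uuencode-style packing (three bytes split into four 6-bit values, each offset by 32), with the usual leading length character left out. The sketch below is in Python rather than Delphi, and the function name is made up for illustration. It is based on this single sample, so treat it as a hypothesis to test against more annotation files, not as QADisplay's confirmed format.

```python
def decode_uu_body(encoded: str, length: int) -> str:
    """Decode uuencode-style text that lacks the leading length character.

    Assumption: the number stored above the encoded string (5 for "Arial")
    is the decoded byte count.
    """
    # Pad to a multiple of 4 with spaces, which carry the 6-bit value 0.
    padded = encoded + " " * (-len(encoded) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        # Each character carries 6 bits; "& 0x3F" also maps '`' to 0.
        a, b, c, d = ((ord(ch) - 32) & 0x3F for ch in padded[i:i + 4])
        out += bytes([
            (a << 2 | b >> 4) & 0xFF,
            (b << 4 | c >> 2) & 0xFF,
            (c << 6 | d) & 0xFF,
        ])
    # Assumption: the text is single-byte ANSI, as Delphi 7 strings usually are.
    return bytes(out[:length]).decode("cp1252", errors="replace")


print(decode_uu_body("07)I86P", 5))  # prints Arial
```

If other annotations decode to readable text the same way, the extraction can be scripted without touching the executable at all.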
I am having a hard time loading a text file into a string list in FireMonkey on OS X when the encoding of the text file is not known.
When I just use list.LoadFromFile(filename), most of the time I get an exception regarding encoding.
list.LoadFromFile(filename, TEncoding.Unicode) also fails when the file is in ANSI, and vice versa.
There is no issue on Windows, where list.LoadFromFile(filename) just works, but it does not work on OS X.
I can't specify the encoding, because it is unknown (users provide the text files).
Any clue how I can get around this encoding issue when running the app on a Mac?
In general this is not possible. It is quite possible to create a single file that is valid when interpreted in all common encodings. This has been discussed many times, for instance: The Notepad file encoding problem, redux.
I'm assuming that you are working with files that do not contain byte order marks (BOMs). Obviously, if your input files contained BOMs then you could simply check the BOM and be done.
With that assumption stated, the right solution to the problem, in a perfect world, is to know the encoding. Either pick a specific encoding which your program requires, or arrange for the user to tell you the encoding when they supply the file.
If, for whatever reason, you cannot do that then the next best thing to do is to use heuristics to attempt to guess the encoding used. I'm not aware of any Pascal code to do this. But you should be able to put something together that will work reasonably well. This answer gives an outline of a basic strategy: https://stackoverflow.com/a/20747074
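As a rough illustration of that kind of heuristic, here is a minimal sketch in Python (not Pascal). The function name, the 40% threshold, and the cp1252 fallback are all assumptions chosen for the example. It checks for a BOM first, then looks for the NUL-byte pattern typical of BOM-less UTF-16, then tries strict UTF-8, and finally falls back to a single-byte code page. It can guess wrong, as noted above.

```python
import codecs

# BOMs to look for; the "utf-16" codec reads the BOM itself to pick endianness.
# Note: a UTF-32 LE BOM also starts with the UTF-16 LE BOM; this sketch ignores UTF-32.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Return a best-guess codec name for data; the guess may be wrong."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # BOM-less UTF-16 of mostly Latin text has a NUL in every other byte.
    half = len(data) // 2
    if half:
        even_nuls = data[0::2].count(0)
        odd_nuls = data[1::2].count(0)
        if odd_nuls > 0.4 * half and even_nuls == 0:
            return "utf-16-le"
        if even_nuls > 0.4 * half and odd_nuls == 0:
            return "utf-16-be"
    # Random single-byte text is very unlikely to be valid UTF-8 by accident.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback
```

Typical use would be to read the file as raw bytes and then call data.decode(guess_encoding(data)). The same order of checks carries over to Delphi: read the bytes, decide on a TEncoding, then load with that encoding.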
Context
A friend of mine is having trouble printing source code to a human readable format.
The compiled (I assume) programs of their welding robot have the .rpg extension. They want to collect print-outs in human-readable format, possibly for backup or future reference.
Their supplier can provide software that accomplishes this, albeit at a considerable cost (and possibly an annual license). Because of this, my friend decided to ask me if an easier/cheaper solution exists.
Examples & Pictures
The files can be read on the console of the robot, an example:
I've done some minor research and I'm fairly sure this is the Report Program Generator (RPG) language developed by IBM. The Assembly-like syntax seems to match; it might be one of the later versions of the language.
My friend has sent me an example .rpg file; the contents seem to be binary with some string literals scattered throughout. Screenshot of the contents of an example file in hexadecimal:
The Question
There is not much, if any, clear information to be found online so I suppose I have multiple questions (for anyone that might know more about this):
Is this (first image) Report Program Generator (RPG) code?
Does the .rpg file contain compiled or processed code? Maybe an intermediate format?
Is it possible to convert files as shown in the example, back to source-code or human-readable format, kind of 'disassemble' it?
If anyone knows more, don't hesitate to give me any information or ask more details if necessary. Thanks in advance!
And maybe not an important question but still something that bugs me (and might indicate I'm on the wrong track):
If this is indeed an RPG program, why would the compiled/processed binary have the .rpg extension, shouldn't the source-file have that? This leads me to believe I'm either (a) assuming the wrong things (the language, etc...) or (b) this is an intermediate format, easier for machines to read, that has to be interpreted by some kind of runtime system.
I don't think that's any version of IBM's RPG language. RPG does have a MOVEL opcode, but it doesn't have any of the others.
Also, all the versions of the IBM language have been intended for business programming. I doubt that it would have been used for robotics.
My guess is that's a proprietary language of the company that makes the robot.
There are some similarities but it does not look like IBM RPG language.
RPG sources are in fact source physical file members. They are not stored in the "traditional" file system but in OS/400 libraries. Therefore RPG sources have no extension. They can be converted to Integrated File System stream file though.
I can't answer this question, I'm afraid, as it's an unknown language to me.
I suspect the OP has misidentified the file extension: it may actually be .prg, with the files serving as instructions for a Panasonic industrial welding robot. The following forum [drilled down to Panasonic Robots] bills itself as "the biggest Industrial Robots Supportforum worldwide!"; it may be a good place to ask about the images provided in the OP and about recovering source from what appears to be a binary instruction stream.
FWIW, the first image seems to show that the Ezed utility [on the console] gives that human-readable format, so the question might be how to save that output and then transfer it elsewhere, e.g. what types of comm ports and file transfer utilities are available from whatever platform/OS.
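Since the question describes the example file as binary with string literals scattered throughout, a quick first look is to pull out the printable runs, much like the Unix strings tool does. This is a minimal Python sketch; the function name is made up and the sample bytes in the usage note are invented, not taken from a real robot program. It will not recover the program logic, only the embedded text.

```python
import re


def printable_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]
```

For example, printable_runs(b"\x00\x01MOVEL\x00ab\x02WELD ON\xff") returns the two runs MOVEL and WELD ON and skips the two-byte run. On a real file you would pass in the result of open(path, "rb").read().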
I downloaded the Evernote API Xcode project, but I have a question regarding the OCR feature. With their OCR service, can I take a picture and show the extracted text in a UILabel, or does it not work like that?
Or is the extracted text never shown to me and only used for the photo search function?
Has anyone ever had any experience with this or any ideas?
Thanks!
Yes, but it looks like it's going to be a bit of work.
When you get an EDAMResource that corresponds to an image, it has a property called recognition that returns an EDAMData object that contains the XML that defines the recognition info. For example, I attached this image to a note:
I inspected the recognition info that was attached to the corresponding EDAMResource object, and found this:
The XML I found is on pastie.org, because it's too big to fit in an answer.
As you can see, there's a LOT of information here. The XML is defined in the API documentation, so this would be where you parse the XML and extract the relevant information yourself. Fortunately, the structure of the XML is quite simple (you could write a parser in a few minutes). The hard part will be to figure out what parts you want to use.
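To give a feel for the parsing step, here is a minimal sketch in Python (not Objective-C). It assumes the recognition XML uses a recoIndex root with item elements carrying x, y, w, h rectangle attributes and t children whose w attribute is a confidence weight. The sample document is invented for illustration, so check the element and attribute names against the API documentation before relying on them.

```python
import xml.etree.ElementTree as ET

# Invented sample in the assumed layout; real recognition data is much larger.
SAMPLE = """<recoIndex objWidth="400" objHeight="100">
  <item x="10" y="12" w="120" h="30">
    <t w="87">EVERNOTE</t>
    <t w="32">EVER NOTE</t>
  </item>
  <item x="140" y="12" w="60" h="30">
    <t w="54">OCR</t>
  </item>
</recoIndex>"""


def best_candidates(xml_text: str):
    """Return (rectangle, text, weight) for the top-weighted guess per item."""
    results = []
    for item in ET.fromstring(xml_text).iter("item"):
        candidates = item.findall("t")
        if not candidates:
            continue
        # Keep only the highest-weighted candidate for this rectangle.
        best = max(candidates, key=lambda t: int(t.get("w", "0")))
        box = tuple(int(item.get(k, "0")) for k in ("x", "y", "w", "h"))
        results.append((box, best.text, int(best.get("w", "0"))))
    return results
```

Calling best_candidates(SAMPLE) gives one entry per rectangle. Which candidates to keep, and how to order the rectangles into readable text, is the part that takes real thought.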
It doesn't really work like that. Evernote doesn't really do "OCR" in the pure sense of turning document images into coherent paragraphs of text.
Evernote's recognition XML (which you can retrieve via the technique that @DaveDeLong shows above) is most useful as an index to search against; the service will provide you sets of rectangles and sets of possible words/text fragments with probability scores attached. This makes a great basis for matching search terms, but a terrible one for constructing a single string that represents the document.
(I know this answer is like 4 years late, but Dave's excellent description doesn't really address this philosophical distinction that you'll run up against if you try to actually do what you were suggesting in the question.)
I am trying to deserialize an old file format that was serialized in Delphi; it uses binary serialization. I know nothing about the structure of the file except some very high-level records that are in it.
What steps would you take to solve this problem? Any tools etc?
A good hexeditor, and use the gray matter to identify structures.
If you get a hint what kind of file it is, you can search for more specialized tools.
Running the Unix/Linux "file" command can be good too (*). See Barry's comment below for how it works. It can be a quick check for common file types like DBF, ZIP etc. hidden behind a different extension.
(*) there are 3rd party builds for windows, but they might lag in versions. If you can do it on a recent *nix distro, it is advised to do so.
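If the "file" command is not at hand, the same idea (comparing the first bytes against known signatures) can be sketched in a few lines of Python. The function name is made up and the table below is a tiny hand-picked subset, nowhere near what "file" knows. The TPF0 entry is the signature Delphi writes at the start of a binary component stream, which is worth checking for in a Delphi-serialized file.

```python
# A few well-known leading signatures ("magic numbers").
SIGNATURES = [
    (b"PK\x03\x04", "ZIP archive"),
    (b"\x1f\x8b", "gzip stream"),
    (b"%PDF-", "PDF document"),
    (b"TPF0", "Delphi binary component stream"),
]


def sniff(header: bytes) -> str:
    """Return a label for the first matching signature, or "unknown"."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"
```

Usage would be sniff(open(path, "rb").read(16)). An "unknown" result only means the file starts with none of the listed signatures, so the hex editor is still the next step.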
The serialization process simply loops over all published properties and streams their value to a text file. If you do not know the exact classes that were streamed to the file, you will have a very hard time deserializing it, if it is possible at all.
A good hex editor comes first. If the file is read without buffering (e.g. read directly from a TFileStream), you could gain some information by using ProcMon from Sysinternals: you can see exactly what data is read in what chunks, and thus determine more quickly where the boundaries are between the structures you have already identified.
I have a bunch of .DOC documents. I'm not even positive they are Word documents, but even if they are, I need to open and parse them with e.g. Python to extract information from them.
The problem is that I couldn't figure out how they were encoded: UltraEdit's Conversion function wouldn't correct the text no matter which encoding I tried. OpenOffice 3.2 also failed to display the contents correctly (it guessed Windows-1252).
Here's an example, in the hope that someone knows which code page it is:
"lÕAssemblŽe gŽnŽrale" instead of "l'Assemblée générale"
Thank you for any tip.
The Greenstone digital library (http://www.greenstone.org/) provides pretty good text extraction from Word documents, including encoding detection.
Running MS Word in server mode gives you a range of scripting options; I'm sure detecting the encoding will be possible.
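One more observation on the sample itself: "Õ" for an apostrophe and "Ž" for "é" is what Mac OS Roman bytes look like when they are read as Windows-1252. A minimal Python sketch of the round trip follows. The function name is made up, and the sketch assumes the garbled text came from a Windows-1252 decode, as OpenOffice's guess suggests. It says nothing about whether the files are real Word documents.

```python
def fix_mac_roman_mojibake(text: str) -> str:
    """Undo a wrong Windows-1252 decode of bytes that were Mac OS Roman."""
    # Recover the original bytes, then decode them with the intended codec.
    return text.encode("cp1252").decode("mac_roman")
```

Applied to the sample string from the question, this returns the expected French text, with a typographic apostrophe (U+2019) in place of the plain one. When reading the files directly in Python, the equivalent is to decode the raw bytes with "mac_roman" in the first place.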