Which pagecode was used to encode this DOC document? - character-encoding

I got a bunch of .DOC documents. I'm not even positive they are Word documents, but even if they are, I need to open and parse them with e.g. Python to extract information from them.
Problem is, I couldn't figure out how they were encoded: UltraEdit's Conversion function wouldn't correct the text no matter which encoding I tried. OpenOffice 3.2 also failed to display the contents correctly (it guessed Windows-1252).
Here's an example, hoping that someone knows what pagecode it is:
"lÕAssemblŽe gŽnŽrale" instead of "l'Assemblée générale"
Thank you for any tip.

Greenstone digital library http://www.greenstone.org/ provides pretty good text extraction from Word documents, including encoding detection.

Running MS Word in server mode gives you a range of scripting options; I'm sure detecting the encoding will be possible.
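For the sample in the question, a quick check that needs no extra tools is to turn the garbled text back into bytes and decode those bytes with a few candidate encodings. The sketch below assumes the garbled text is what Windows-1252 decoding produced (as OpenOffice guessed); the helper name `redecode` and the candidate list are illustrative assumptions, not part of any library.

```python
# Garbled sample from the question, as displayed when decoded as Windows-1252.
garbled = "lÕAssemblŽe gŽnŽrale"

# Candidate source encodings to try; this list is an assumption, extend as needed.
CANDIDATES = ["mac_roman", "cp850", "cp437", "latin_1", "utf_8"]


def redecode(text, shown_as="cp1252", candidates=CANDIDATES):
    """Recover the raw bytes behind mis-decoded text and decode them with each candidate."""
    raw = text.encode(shown_as)  # undo the wrong decoding step
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None  # these bytes are not valid in this encoding
    return results


for name, text in redecode(garbled).items():
    print(f"{name:10} {text!r}")
```

With this sample, the `mac_roman` row reads "l’Assemblée générale" (with a typographic apostrophe), which suggests the text was written on a classic Mac and is encoded as Mac OS Roman. One sample is thin evidence, so it is worth checking a few more strings from the files.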

Related

How to print a paper in Lua

I want to make a program in Lua where the user enters several inputs and then presses a button, and all these inputs are then printed on paper in a certain format. How do I physically print to paper from Lua?
Accessing a physical printer and sending it a file to print is fairly complicated and is beyond the scope of what I've done with Lua so far, but I would suggest checking out this forum post.
Your best option might be to save the stuff you want to print to a file (such as a PDF, using a Lua-to-PDF library; there are several available with a Google search, such as this one by cpressey or this one by jung-kurt) and then use C++ or some other language to send that file to a physical printer and print it. Microsoft has a pretty decent guide on how to do that.
Hopefully you find this helpful, have a great weekend!

Loading textfile into stringlist with firemonkey on osx when the encoding is unknown

I am having a hard time loading a text file into a stringlist in FireMonkey on OS X when the encoding of the text file is not known.
When I just use list.loadfromfile(filename), most of the time I get an exception regarding encoding.
list.loadfromfile(filename,TEncoding.unicode) will also fail when the file is in ANSI, and vice versa.
There is no issue on Windows, where list.loadfromfile(filename) just works, but it fails on OS X.
I can't specify the encoding, because it will be unknown (users provide the text files).
Any clue how I can get around this encoding issue when running the app on a Mac?
In general this is not possible. It is quite possible to create a single file that is valid when interpreted in all common encodings. This has been discussed many times, for instance: The Notepad file encoding problem, redux.
I'm assuming that you are working with files that do not contain byte order marks, BOMs. Obviously if your input files contained BOMs then you could simply check the BOM and be done.
With that assumption stated, the right solution to the problem, in a perfect world, is to know the encoding. Either pick a specific encoding which your program requires, or arrange for the user to tell you the encoding when they supply the file.
If, for whatever reason, you cannot do that then the next best thing to do is to use heuristics to attempt to guess the encoding used. I'm not aware of any Pascal code to do this. But you should be able to put something together that will work reasonably well. This answer gives an outline of a basic strategy: https://stackoverflow.com/a/20747074
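The order of checks described above (BOM first, then a heuristic guess) can be sketched as follows. This is only an illustration in Python, not Pascal, and not necessarily the strategy from the linked answer. The function names and the choice of `cp1252` as the "ANSI" fallback are assumptions.

```python
import codecs

# Longer BOMs are tested first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data, fallback="cp1252"):
    """Return a codec name for raw file bytes: BOM if present, else a simple heuristic."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name  # these codecs consume the BOM while decoding
    try:
        # Heuristic: non-ASCII text in a legacy 8-bit encoding is rarely valid UTF-8.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback  # assumed "ANSI" code page; a guess, not a detection


def load_lines(data):
    """Decode raw bytes with the guessed encoding and split them into lines."""
    return data.decode(guess_encoding(data), errors="replace").splitlines()
```

The fallback is a plain guess. A file in some other 8-bit code page will still decode without error and simply show the wrong characters, which is why asking the user for the encoding remains the more reliable route.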

Parse image to SESHAT tool?

I am trying to create a simple tool that uses this website's functionality http://cat.prhlt.upv.es/mer/ which parses some strokes of text to a math formula. I noticed that they mention that it converts the input to InkML or MathML.
Now I noticed that according to this link: Tradeoff between LaTex, MathML, and XHTMLMathML in an iOS app? you can use MathJax to convert certain input to MathML.
What I need clarification/assistance with is how I can take input (say from finger strokes) or a picture and convert it to a format that I can submit to this website from an iOS device, and then read the result at the top of the page. I have done everything regarding taking a picture or drawing an equation on an iPhone, but I am confused about how to take that and feed it to this site in order to get a result.
Is this possible, and if so how?
I think there's a misunderstanding here. http://cat.prhlt.upv.es/mer/ isn't an API for converting images into formulae; it's just a demonstration of the Seshat software tool.
If you're looking to convert hand-drawn math expressions into LaTeX or MathML (which can then be pretty printed on your device), you want to compile Seshat and then feed it, not the website, your input. My answer here explains how to format input for Seshat.

Delphi encoding

The company I work for has a program that is no longer supported called QADisplay. Inside of this program is a tool for annotating images. It's very similar to Photoshop in that it takes a layer-based approach to the annotations, with each annotation as its own class in Delphi 7. These annotations are stored as the base image and a text file with the information describing the contents of the annotation.
The issue is that the text that is displayed in the annotations is somehow encoded in the text file. For example, if the annotation displays as "Arial" (without the quotes), the text file will be written as:
TEXT (Type of annotation)
5 (Length of the literal string, in this case: Arial)
07)I86P (The encoded string)
What I need to do is extract all of the text from the annotations in preparation for the installation of our new software system.
I am not familiar with Delphi and do not have access to the source code. I have tried to disassemble the executable but haven't had much luck there. Does anyone have any ideas on how to approach decoding this? I've googled around a bit (Arial "07)I86P") and found some results relating to virus scan error logs and things of that nature but no dice on anything that I found helpful in relation to the issue I'm having.
That is not a standard text encoding. Maybe it is encrypted?
Without documentation or contact with the original developers, you will have to reverse engineer the app. Using a disassembler/debugger like IDA, if you can pause the app after it loads 07)I86P into memory, you can follow the code as it processes the characters, which will help you reconstruct the decode algorithm.
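One thing worth checking before reaching for a disassembler: the single sample in the question is what you get if you uuencode "Arial" and drop the length character that normally starts each uuencoded line. The sketch below decodes on that assumption, taking the length from the line above the string in the annotation file. The function name is hypothetical, and with only one sample to go on there is no guarantee every annotation follows this scheme.

```python
def uudecode_body(encoded, length):
    """Decode a uuencoded string that lacks its leading length character.

    `length` is the decoded byte count, taken from the annotation file.
    """
    # Trailing spaces (value 0) may have been trimmed; pad back to a multiple of 4.
    padded = encoded + " " * (-len(encoded) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        # Each character carries 6 bits: its code minus 32, masked to 0..63.
        a, b, c, d = ((ord(ch) - 32) & 0x3F for ch in padded[i:i + 4])
        out += bytes([
            (a << 2) | (b >> 4),
            ((b & 0x0F) << 4) | (c >> 2),
            ((c & 0x03) << 6) | d,
        ])
    return bytes(out[:length])


print(uudecode_body("07)I86P", 5).decode("ascii"))  # prints Arial
```

If other annotations decode to readable text the same way, extraction becomes a matter of walking the text files and decoding each string. If they do not, the debugger approach above is still the fallback.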

Matlab Parse Binary File

I am looking to speed up the reading of a data file which has been converted from binary (it is my understanding that "binary" can mean a lot of different things - I do not know what type of binary file I have, just that it's a binary file) to plaintext. I looked into reading files quickly awhile ago, and was informed that reading/parsing a binary file is faster than text. So, I would like to parse/read the binary file (that was converted to plaintext) in an effort to speed up the program.
I'm using Matlab for this project (I have a Matlab "program" that needs the data in the file). I guess I need some information on the different "types" of binary, but I really want information on how to read/parse said binary file (I know what I'm looking for in plaintext, so I imagine I'll need to convert that to binary, search the file, then pull the result out into plaintext). The file is a logfile, if that helps in any way.
Thanks.
There are several issues in what you are asking -- however, you need to know the format of the file you are reading. If you can say "At position xx, I can expect to find data yy", that's what you need to know. In your question/comments you talk about searching for strings. You can also do it (much like a text file) "when I find xxxx in the file, give me the following data up to nth character, or up to the next yyyy".
You want to look at the documentation for fread. In the documentation there are snippets of code that will get you started, but as I (and others) said you need to know the format of your binary files. You can use a hex editor to ascertain some information if you are desperate, but what should be quicker is the documentation for the program that outputs these files.
Regarding different "binary files", well, there is byte order: least significant byte first (little-endian) or last (big-endian). You really don't need to know about that for this work. There are also other platform-dependent issues which I am almost certain you don't need to know about (unless you are moving the binary files between Mac, PC and Unix machines). If you read to almost the bottom of the fread documentation, there is a section entitled "Reading Files Created on Other Systems" which talks about the issues and how to deal with them.
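To make "at position xx, I can expect to find data yy" and the byte-order point concrete, here is a small sketch. It is written in Python purely for illustration (in MATLAB the counterpart is fread with its precision and machinefmt arguments), and the record layout is invented for the example; your logfile's real layout has to come from its documentation.

```python
import struct

# Hypothetical record layout: uint32 timestamp then float64 value,
# little-endian ("<"), no padding. 12 bytes per record.
RECORD = struct.Struct("<Id")


def read_records(data):
    """Unpack consecutive fixed-size records from a bytes object."""
    last_start = len(data) - RECORD.size
    return [RECORD.unpack_from(data, offset)
            for offset in range(0, last_start + 1, RECORD.size)]


# Build two records in memory instead of reading a real file.
sample = RECORD.pack(1, 2.5) + RECORD.pack(2, -1.0)
print(read_records(sample))

# Reading with the wrong byte order silently gives a different number.
print(struct.unpack(">I", struct.pack("<I", 1))[0])
```

The last line shows why the byte order matters once files move between platforms: the same four bytes read as 1 in one order and as 16777216 in the other.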
Another comment that I have to make, you say that "reading/parsing a binary file is faster than text". This is not true (or even if it is, odds are you won't notice the performance gain). In terms of development time, however, reading/parsing a textfile will save you huge amounts of time.
The simple way to store data in a binary file is to use the 'save' command.
If you load from a saved variable it should be significantly faster than if you load from a text file.
