Read/parse binary files with PowerShell

I'm trying to parse a binary file, and I need some help on where to go. I've been looking online for "parsing binary files", "reading binary files", "reading text inside binaries", etc., and I haven't had any luck.
For example, how would I read this text out of this binary file? Any help would be MUCH appreciated. I am using PowerShell.

It seems that you have a binary file with text at a fixed or otherwise deducible position. Get-Content might help you, but it will try to parse the entire file into an array of strings, creating an array of "garbage". Also, you wouldn't know from what file position a particular run of characters came.
You can use the .NET classes System.IO.File to read the bytes and System.Text.Encoding to decode them. Each is a single call:
# Read the entire file to an array of bytes.
$bytes = [System.IO.File]::ReadAllBytes("path_to_the_file")
# Decode the first 12 bytes to text, assuming ASCII encoding.
$text = [System.Text.Encoding]::ASCII.GetString($bytes, 0, 12)
In your real case you'd probably go through the array of bytes in a loop, finding the start and end of a particular string sequence, and use those indices to tell GetString the range of bytes to extract the text from.
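For example, here is a minimal sketch of such a loop that prints every run of printable ASCII of at least 4 bytes together with its file offset (the path and the minimum run length are assumptions for illustration):
# Read the whole file into a byte array.
$bytes = [System.IO.File]::ReadAllBytes("path_to_the_file")
$start = -1
for ($i = 0; $i -le $bytes.Length; $i++) {
    # Treat 0x20..0x7E as printable ASCII; the loop runs one step past the
    # end so a run touching the end of the file is still flushed.
    $printable = ($i -lt $bytes.Length) -and ($bytes[$i] -ge 0x20) -and ($bytes[$i] -le 0x7E)
    if ($printable -and $start -lt 0) {
        $start = $i                                   # a run of text starts here
    } elseif (-not $printable -and $start -ge 0) {
        if (($i - $start) -ge 4) {                    # skip very short runs
            $text = [System.Text.Encoding]::ASCII.GetString($bytes, $start, $i - $start)
            "{0,8}: {1}" -f $start, $text
        }
        $start = -1
    }
}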
The .NET methods I mentioned are available in .NET Framework 2.0 or higher; if you have PowerShell 2.0 installed, you already have them.

If you're just looking for strings, check out the strings.exe utility from Sysinternals.
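A typical invocation just points it at the file and redirects the printable strings it finds (the file name is a placeholder):
strings.exe path_to_the_file > found_strings.txt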

You can read in the file via Get-Content -Encoding Byte. I'm not sure how to parse it though.
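For example (the path is a placeholder; note that PowerShell 7 and later replaced -Encoding Byte with -AsByteStream):
# Windows PowerShell 5.1 and earlier (-ReadCount 0 returns one array instead of streaming byte by byte):
$bytes = Get-Content -Path "path_to_the_file" -Encoding Byte -ReadCount 0
# PowerShell 7+ equivalent:
# $bytes = Get-Content -Path "path_to_the_file" -AsByteStream -ReadCount 0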

Related

Reading files per character with Lua (Löve API)

I am trying to read files per character in Lua with the Löve API, but I just can't figure out how. There must be some way to do this, right? In other posts I found something about reading files per line, but I really need to read them per character. Could someone please tell me how to do this?
Thanks in advance,
Leveljaap
If f is a file handle, then f:read(1) returns the next byte of the file, or nil at the end of the file.
Note that the next byte may not be a whole character if the file contains UTF-8 Unicode, for instance.
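A minimal sketch with plain Lua io (the file name is a placeholder; inside Löve you would obtain a handle via love.filesystem instead):
local f = assert(io.open("data.bin", "rb"))   -- "rb" = read in binary mode
while true do
    local c = f:read(1)                       -- next single byte, or nil at EOF
    if c == nil then break end
    io.write(string.format("%q\n", c))        -- print the byte, quoted
end
f:close()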

Is it possible to change a value inside Lua bytecode? How?

I have a script that is no longer supported, and I'm looking for a way to change the value of a variable in it. The script is compiled (loadstring/bytecode/something like that), e.g.: loadstring('\27\76\117\97\81\0\1\4\4\4\8\0\')
I can find what I want to change (through Notepad, after I compile the script), but if I try to change the value, the script won't work, and if I change it and try to recompile, it still won't work: "luac: Testing09.lua: unexpected end in precompiled chunk" ...
I did something like that with a program long ago using OllyDbg, but I can't use it with Lua scripts. I'm kind of lost here; I've been Googling for quite a while and couldn't find a way. Any ideas?
It is easy to change a string in Lua bytecode; you just have to adjust the stored length of the string after you change it. The length comes right before the string: it usually takes four or eight bytes, depending on whether the bytecode was generated on a 32-bit or 64-bit platform, and it is stored in the endianness of the machine where the bytecode was generated. Note that strings include a trailing '\0', which counts in the length.
Perhaps it is easier to just copy some bytes directly. Write this file:
return "this is the new string you want"
Generate bytecode from it with luac, look at a dump of luac.out, and locate the string and its length. Copy those bytes into the original file.
I don't know whether Notepad handles binary data; if it doesn't, you'll need a hex editor to do this.
Another solution is to write a Lua program that reads the bytecode as a string, generates bytecode for return "this is the new string you want", performs the change in the original bytecode using string operations, and writes it back to a file.
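A rough sketch of that approach, restricted to the easy case where the replacement has exactly the same length as the original (so the length prefix described above does not change; the file names and strings are placeholders):
-- patch one string constant in a precompiled chunk, same-length case
local f = assert(io.open("original.luac", "rb"))
local chunk = f:read("*a"); f:close()

local old = "this is the old string"
local new = "this is the NEW string"            -- must be the same length
assert(#old == #new, "lengths differ: the length prefix would need patching too")

local s, e = chunk:find(old, 1, true)           -- plain find, no Lua patterns
assert(s, "old string not found in the bytecode")
chunk = chunk:sub(1, s - 1) .. new .. chunk:sub(e + 1)

local out = assert(io.open("patched.luac", "wb"))
out:write(chunk); out:close()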
You can also try my bytecode inspector library lbci, which allows you to change constants in functions. You'd load the bytecode (but not execute it), and use setconstant after locating the constant that has the string you want to change.
In all, there is some fun to be had here...

Determine if received data is PostScript or PCL

I have a service that receives printer data via TCP/IP. When the data is received, is there a reliable, efficient way to examine the data stream and determine whether the data is PostScript or PCL? For example, are there characters I could look for at the beginning of the data stream to indicate the format?
I would probably just count the number of escape characters in the file: PCL will have gobs of them, while PostScript will have gobs of % signs. That isn't a perfect solution, but it's dead simple and I'll bet it would actually be quite reliable.
The only "real" way I can see of doing this is to actually parse the data as PCL and as PostScript and see which one works.
I'll add my 2¢.
Like others have mentioned here, your first stab at programmatically identifying the document would be to look at the first two characters. If it starts with %!, it is PostScript; if it starts with an escape character (hex 1B, octal 033, ASCII 27), it is very likely PCL, since PCL jobs typically begin with PCL commands. This will resolve perhaps 99% of the documents you need to process. If the format still isn't known, search the document for the string showpage: a PostScript job must contain a showpage operator to render a page. If you can't find one and there are escape characters in the file, you know it is PCL. You can also err on the side of PCL when there is neither a showpage nor any escape character, because raw text files are valid PCL and printers will print them out as they come.
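A minimal sketch of the byte-signature checks described above (in PowerShell, matching the first thread on this page; the function name is an assumption):
function Get-PrintDataFormat([byte[]]$bytes) {
    $ESC = 0x1B
    # PostScript jobs start with "%!" (0x25 0x21).
    if ($bytes.Length -ge 2 -and $bytes[0] -eq 0x25 -and $bytes[1] -eq 0x21) { return "PostScript" }
    # PCL jobs typically start with ESC E (printer reset), i.e. bytes 1B 45.
    if ($bytes.Length -ge 2 -and $bytes[0] -eq $ESC -and $bytes[1] -eq 0x45) { return "PCL" }
    # Fall back: any escape character strongly suggests PCL.
    if ($bytes -contains $ESC) { return "PCL (probable)" }
    # Plain text is valid PCL, so err on the side of PCL.
    return "PCL (assumed; no signature found)"
}
# Usage with a captured job (the file name is a placeholder):
# Get-PrintDataFormat ([System.IO.File]::ReadAllBytes("job.prn"))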
PostScript data must begin with "%!ps" or "%!PS"; it may be a longer readable string like "%!PS-Adobe-3.0", but that is basically it.
Most likely PCL has a similar signature; I remember seeing it in the past.
According to the PCL 5 General Printing FAQs, PCL files should start with ESC "E". I assume another ESC sequence must follow, so my guess is that files starting with the bytes 1B 45 1B are most likely PCL files.
This leaves unrecognized any PCL files which don't adhere to this rule.
In my use case it's macOS that always produces PCL with the ESC E at the beginning.

Parsing PDF files

I'm finding it difficult to parse a PDF file that's created in a non-English language. I used PDFBox and iText, but couldn't find anything in them that could help parse this file. Here's the PDF file that I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf The PDF says that it was created using LaTeX and the Tikkana font. I have the Tikkana font installed on my machine, but that didn't help. Please help me with this.
Thanks, K
When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers & libraries, and was therefore corrupt in some way.
But that's not the case at all. It opens just fine in Acrobat Reader X. And then I see the text on the page.
And when I copy/paste that text from the first page, I get:
Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á
Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á
That's from Reader.
Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (also known as code page 1252) with a "Differences" array. This differences array is wrong:
47 /BB 61 /BP /BQ 81 /C6...
The first number is the code point being replaced; the second is the name of the character that replaces the original value at that code point.
There are no such character names as BB, BP, BQ, C9... and so on. So when you copy and paste that text, you get the garbage above.
I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).
Eh... Long shot idea:
If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.
Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.
You could either fix the existing PDF[s] (by changing the names in the encoding dictionary and Type 3 charproc entries) such that these text extractors will work correctly, or just grab the bytes out of the stream and translate them yourself.
The workflow would go something like this:
For each character in a font used in the form:
render it to a PDF by itself using the same LaTeX/Ghostscript versions.
Open the PDF and find the CharProc for that particular known character.
Store that stream along with the known character used to build it.
For each text byte in the PDF to be interpreted:
Get the glyph name for the given byte based on the existing encoding array.
Get the "char proc" stream for that glyph name and compare it to your known char procs.
NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope).
All that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. Then again, it might not...

Erlang, reading a file with character offset

I have code to find a specific occurrence of text in a file, which gives me an offset so I know where that occurrence ends. Now I want to read the file from that offset to the end of the file. The file contains binary data as well as text. How do I do this in Erlang?
Use pread (see the Erlang documentation on the file module). You have to take care of any character encoding yourself, as the function deals only with bytes.
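A minimal sketch (the function name and arguments are placeholders; filelib:file_size/1 supplies the remaining length that pread needs):
%% Read everything from Offset to the end of the file as a binary.
read_tail(FileName, Offset) ->
    Size = filelib:file_size(FileName),
    {ok, Fd} = file:open(FileName, [read, binary, raw]),
    {ok, Tail} = file:pread(Fd, Offset, Size - Offset),
    ok = file:close(Fd),
    Tail.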
