Weird text after Unzipping - character-encoding

After unzipping a file and reading some of the .txt files in it, I noticed that the text files were messed up.
There were lots of equal signs like this ==~ scattered among the words, and in some words the letters were jumbled and unreadable.
Does anyone have any idea what this could be?
Also, some files were left empty after unzipping.
To unzip, I used the command-line unzip tool with no parameters.
It showed errors such as bad CRC b2fd8b3e (should be ac92cdc0)
and
file #21: bad zipfile offset (local header sig): 249997

Sounds like the Zip files are corrupt, or the tool you're using doesn't correctly handle the compression algorithm used.
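One way to tell a corrupt archive from a tool problem is to have a second implementation verify the archive. The sketch below uses Python's standard zipfile module, whose testzip() method reads every member and checks its CRC-32. The helper names check_zip and make_demo_archive are made up for this illustration, and the demo archive is a toy built in memory, not the asker's file.

```python
import io
import zipfile
import zlib


def check_zip(source):
    """Check a zip archive given as a path or a binary file-like object.

    Returns a tuple (ok, detail).
    """
    try:
        with zipfile.ZipFile(source) as archive:
            # testzip() reads each member and verifies its CRC-32; it returns
            # the name of the first member that fails, or None if all pass.
            first_bad = archive.testzip()
    except (zipfile.BadZipFile, zlib.error) as exc:
        # Raised when the archive structure or compressed data is unreadable.
        return False, f"archive could not be read: {exc}"
    if first_bad is not None:
        return False, f"member failed verification: {first_bad}"
    return True, "all members passed"


def make_demo_archive():
    """Build a tiny in-memory archive (hypothetical contents) for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        archive.writestr("a.txt", "hello world")
    return buffer.getvalue()


if __name__ == "__main__":
    good = make_demo_archive()
    damaged = bytearray(good)
    damaged[good.find(b"hello")] ^= 0xFF  # flip one byte of the stored payload
    print(check_zip(io.BytesIO(good)))
    print(check_zip(io.BytesIO(bytes(damaged))))
```

If an independent check like this also reports failures, the archive itself is most likely damaged (for example by an incomplete or text-mode transfer) rather than the unzip tool being at fault.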

Related

Unknown exported text artifacts $FL? $G?

I have received a flat file to import into a DB. This flat file has many problems: the French accented characters have been converted to ? (it is not my editor; the file really contains the ASCII character 0x3F = "?" in place of the various accented letters). But the main problem is that the file is studded with strange artifacts: "$FL?", "$B?", "$E?". They appear anywhere, including in the middle of words.
It does not seem to be quoted-printable.
Have you already faced such symptoms, and what was the cause?
These artifacts are created by a buggy upstream program and are not related to encoding changes.

dataset import error for AutoML text classification

I have been trying to import a dataset into AutoML NL Text Classification. However, the UI gave me an error: Invalid row in CSV file, Error details: Error detected: "FILE_TYPE_NOT_SUPPORTED"
I am uploading a CSV file; what should I do?
Please make sure there are no hidden quotes in your dataset. The complete requirements can be found on the “Preparing your training data” page.
Common .csv errors:
Using Unicode characters in labels. For example, Japanese characters are not supported.
Using spaces and non-alphanumeric characters in labels.
Empty lines.
Empty columns (lines with two successive commas).
Missing quotes around embedded text that includes commas.
Incorrect capitalization of Cloud Storage text paths.
Incorrect access control configured for your text files. Your service account should have read or greater access, or files must be publicly-readable.
References to non-text files, such as JPEG files. Likewise, files that are not text files but that have been renamed with a text extension will cause an error.
The URI of a text file points to a different bucket than the current project. Only files in the project bucket can be accessed.
Non-CSV-formatted files.
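Several of the errors in this list can be caught locally before uploading. The following is a rough sketch only: it assumes a two-column layout (text or Cloud Storage URI, then label), and the label rule used here (ASCII letters, digits, underscore) is my reading of the list above, not the documented rule. The function name lint_csv is invented for this example.

```python
import csv
import io
import re

# Assumed label rule: ASCII letters, digits and underscores only.
LABEL_RE = re.compile(r"[A-Za-z0-9_]+")


def lint_csv(csv_text):
    """Return a list of (line_number, problem) tuples for a CSV document."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    for row in reader:
        line_no = reader.line_num
        if not row:
            problems.append((line_no, "empty line"))
            continue
        if any(cell.strip() == "" for cell in row):
            problems.append((line_no, "empty column"))
            continue
        if len(row) != 2:
            # Typical cause: unquoted text that contains a comma.
            problems.append((line_no, f"expected 2 columns, found {len(row)}"))
            continue
        label = row[1]
        if not LABEL_RE.fullmatch(label):
            problems.append((line_no, f"unsupported label: {label!r}"))
    return problems


if __name__ == "__main__":
    sample = (
        "gs://bucket/a.txt,sports\n"
        "\n"
        '"Hello, world",news\n'
        "Hello, world,news\n"
        "some text,\n"
        "more text,bad label\n"
    )
    for line_no, problem in lint_csv(sample):
        print(line_no, problem)
```

This does not check bucket permissions, path capitalization, or whether referenced files are really text; those need access to Cloud Storage itself.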

Contents of executable files cannot be copied?

So, text files can be copied and pasted to another location by copying the contents of the original file into a blank text file. This can be done with a text editor: highlight the contents of the text file, copy, create a new blank text file, and paste into it.
But why can't image, audio, video, executable files, etc., be copied and pasted like this? For example, I open an executable file with a text editor, copy all of its contents, create a new blank text file, change the extension to .exe, and paste into it (through a text editor). But the file cannot be run. Why?
Also, I would like to be able to edit these types of files like I do with text files. Is there a way?
Because executable and media files are "binary" files. Text files are binary as well, but different. All files are created binary, but some are created more binary than others.
You're opening a binary file in a text editor. This immediately changes the semantics of the bytes. The main problem is bytes containing a value that happens to correspond to those of newline characters if it were a text file (0x0A and 0x0D), which will be rendered as a platform-dependent newline (\r\n on Windows, for example). When you copy that, you've changed either 0x0A or 0x0D to 0x0D 0x0A.
Then there are control characters and non-printable characters. Not all bytes between 0x00 and 0xFF can be represented as a character. They'll either be omitted or replaced with a displayable character.
So when you copy a text containing those, they'll be omitted or otherwise mangled.
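To make the damage concrete, here is a small simulation of what a copy through a text editor might do to a handful of bytes. It is only a sketch under stated assumptions: the imaginary editor decodes the file as ASCII, substitutes a replacement character for anything it cannot decode, and saves with Windows line endings. Real editors differ in the details; the function name simulate_text_roundtrip is invented here.

```python
def simulate_text_roundtrip(data: bytes) -> bytes:
    """Mimic load, copy, paste and save in a hypothetical text editor."""
    # Undecodable bytes (>= 0x80 for ASCII) become U+FFFD on load.
    text = data.decode("ascii", errors="replace")
    # The editor treats CR, LF and CRLF all as "a newline" ...
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # ... and writes every newline back as CRLF on save.
    text = text.replace("\n", "\r\n")
    return text.encode("utf-8")


if __name__ == "__main__":
    original = bytes([0x4D, 0x5A, 0x90, 0x00, 0x0A, 0xFF])
    mangled = simulate_text_roundtrip(original)
    print(original.hex(" "))
    print(mangled.hex(" "))
    print(len(original), "bytes in,", len(mangled), "bytes out")
```

Six bytes go in and eleven come out: the single 0x0A has become 0x0D 0x0A, and the two bytes with no ASCII meaning have each been replaced by a three-byte UTF-8 sequence, so the original values are gone for good.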
In conclusion: you cannot reliably use text to display all possible byte values, unless you choose to encode the bytes' values, as is done using for example Base64 encoding.
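As a quick illustration of the Base64 point, the sketch below encodes every possible byte value into plain ASCII text and decodes it back unchanged, using Python's standard base64 module. The function names to_text and from_text are made up for the example.

```python
import base64


def to_text(data: bytes) -> str:
    """Encode arbitrary bytes as ASCII-only Base64 text."""
    return base64.b64encode(data).decode("ascii")


def from_text(text: str) -> bytes:
    """Decode Base64 text back into the original bytes."""
    return base64.b64decode(text)


if __name__ == "__main__":
    payload = bytes(range(256))  # every possible byte value once
    encoded = to_text(payload)
    print(encoded[:40] + "...")
    print("round trip intact:", from_text(encoded) == payload)
```

The encoded form is safe to copy and paste through a text editor, at the cost of being about a third larger than the raw bytes.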
If you want to edit a binary file, use an editor that is aware of those bytes: a "hex editor". Do note that changing random byte values in a binary file does not guarantee the sanity of that file: there may be checksums built into the format, and your edit will invalidate that checksum.

Comparing h5 files

I often have to compare HDF files. I do it either with a binary diff (which tells me the files are different even though the actual numbers inside are the same) or by dumping the content into a text file with h5dump and then comparing the content of the two files (which is also quite annoying).
I was wondering if there is a smarter way to do this, perhaps a feature of h5 or of software like HDFView or Panoply.
Perhaps hdiff is what you require? Some examples here
h5diff can be used to compare HDF5 files, and on Ubuntu it can be installed with
apt-get install hdf5-tools
then it's simply
h5diff file1.hdf5 file2.hdf5

Parsing PDF files

I'm finding it difficult to parse a PDF file that was created in a non-English language. I used pdfbox and itext but couldn't find anything in them that could help parse this file. Here's the PDF file that I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf The PDF says that it was created using LaTeX and the Tikkana font. I have the Tikkana font installed on my machine, but that didn't help. Please help me with this.
Thanks, K
When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers & libraries, and was therefore corrupt in some way.
But that's not the case at all. It opens just fine in Acrobat Reader X. And then I see the text on the page.
And when I copy/paste that text from the first page, I get:
Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á
Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á
That's from Reader.
Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (also known as code page 1252), with a "differences" array. This differences array is wrong:
47 /BB 61 /BP /BQ 81 /C6...
The first number is the code point being replaced, the second is a Name of a character that replaces the original value at that code point.
There are no such character names as BB, BP, BQ, C9... and so on. So when you copy-paste that text, you get the above garbage.
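For anyone wanting to inspect such an array themselves, here is a minimal sketch of how a differences array expands into a code-to-name table. In the PDF format each number sets the current code, and each name that follows is assigned to the current code, which then increases by one. The function name parse_differences is made up, and the input is just the visible start of the array quoted above, with the elided remainder left out.

```python
def parse_differences(array_text):
    """Expand a PDF /Differences array into a {code: glyph_name} dict."""
    mapping = {}
    code = None
    for token in array_text.split():
        if token.startswith("/"):
            if code is None:
                raise ValueError("glyph name appears before any code")
            mapping[code] = token[1:]  # drop the leading slash of the name
            code += 1                  # consecutive names take consecutive codes
        else:
            code = int(token)          # a number restarts the numbering
    return mapping


if __name__ == "__main__":
    table = parse_differences("47 /BB 61 /BP /BQ 81 /C6")
    for code, name in sorted(table.items()):
        print(code, name)
```

A text extractor then has to turn each name into a Unicode character, which is exactly where invented names like these leave it with nothing to look up.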
I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).
Eh... Long shot idea:
If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.
Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.
You could either fix the existing PDF[s] (by changing the names in the encoding dictionary and Type 3 charproc entries) such that these text extractors will work correctly, or just grab the bytes out of the stream and translate them yourself.
The workflow would go something like this:
For each character in a font used in the form:
    Render it to PDF by itself using the same LaTeX/Ghostscript versions.
    Open the PDF and find the CharProc for that particular known character.
    Store that stream along with the known character used to build it.
For each text byte in the PDF to be interpreted:
    Get the glyph name for the given byte based on the existing encoding array.
    Get the "char proc" stream for that glyph name and compare it to your known char procs.
NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope).
All that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. Might not too...
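The workflow above could be sketched roughly as follows. Everything here is hypothetical: the function names, the toy encoding, and the stand-in "streams" are invented for illustration, and the sketch assumes the char proc streams have already been extracted and decompressed, and that re-rendering a known character yields a byte-identical stream. Whether that holds for a real PDF is exactly the uncertain part.

```python
def build_reference_table(known_charprocs):
    """Invert {known_character: charproc_stream} into {charproc_stream: character}.

    Using the stream itself as a dict key replaces the pairwise comparison
    with a single lookup.
    """
    return {stream: char for char, stream in known_charprocs.items()}


def translate(text_bytes, encoding, charprocs, reference):
    """Map each text byte to a character via its glyph name and char proc."""
    out = []
    for byte in text_bytes:
        glyph_name = encoding.get(byte)       # from the existing encoding array
        stream = charprocs.get(glyph_name)    # Type 3 char proc for that name
        out.append(reference.get(stream, "\ufffd"))  # U+FFFD when unmatched
    return "".join(out)


if __name__ == "__main__":
    # Toy stand-ins, not data taken from the PDF in question.
    encoding = {47: "BB", 61: "BP"}
    charprocs = {"BB": b"0 0 m 1 1 l S", "BP": b"0 0 m 2 2 l S"}
    known = {"\u0c05": b"0 0 m 1 1 l S", "\u0c06": b"0 0 m 2 2 l S"}

    reference = build_reference_table(known)
    print(translate(bytes([47, 61, 99]), encoding, charprocs, reference))
```

Bytes with no matching char proc come out as U+FFFD, which at least makes the gaps in the translation table visible.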

Resources