I have received a flat file to import into a DB. This flat file has many problems: the French accented characters have been converted to "?" (it is not my editor; the file really has the "3F" = "?" ASCII character in place of various accented letters). But the main problem is that the file is studded with strange artifacts: "$FL?", "$B?", "$E?". They appear anywhere, including in the middle of words.
It does not seem to be quoted-printable.
Have you already faced such symptoms, and what was the cause?
These artifacts are created by an upstream buggy program and are not related to encoding changes.
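If you cannot get the upstream program fixed, a preprocessing pass can at least strip the artifacts before the import. A minimal C# sketch, assuming every artifact matches "$" plus one or two capital letters plus "?" (the file names are placeholders):

    using System.IO;
    using System.Text.RegularExpressions;

    class ArtifactStripper
    {
        static void Main()
        {
            // Placeholder file names; point these at your flat file.
            string raw = File.ReadAllText("export.txt");

            // Assumption: every artifact is "$" followed by one or two
            // capital letters and a "?", as in "$FL?", "$B?", "$E?".
            // Verify the pattern against your data before trusting it.
            string cleaned = Regex.Replace(raw, @"\$[A-Z]{1,2}\?", "");

            File.WriteAllText("export_clean.txt", cleaned);
        }
    }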
I have been trying to import a dataset into AutoML NL Text Classification. However, the UI gave me an "Invalid row in CSV file" error, with the details: Error detected: "FILE_TYPE_NOT_SUPPORTED".
I am uploading a .csv file; what should I do?
Please make sure there are no hidden quotes in your dataset. Complete requirements can be found on the "Preparing your training data" page.
Common .csv errors:
Using Unicode characters in labels. For example, Japanese characters are not supported.
Using spaces and non-alphanumeric characters in labels.
Empty lines.
Empty columns (lines with two successive commas).
Missing quotes around embedded text that includes commas (see the escaping sketch after this list).
Incorrect capitalization of Cloud Storage text paths.
Incorrect access control configured for your text files. Your service account should have read or greater access, or files must be publicly-readable.
References to non-text files, such as JPEG files. Likewise, files that are not text files but that have been renamed with a text extension will cause an error.
The URI of a text file points to a different bucket than the current project. Only files in the project's bucket can be accessed.
Non-CSV-formatted files.
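To make the "missing quotes around embedded commas" item concrete, here is a minimal C# sketch of RFC 4180-style field escaping (the example values are made up):

    using System;

    class CsvField
    {
        // Minimal RFC 4180-style escaping: quote a field when it contains a
        // comma, a double quote, or a line break, and double embedded quotes.
        static string Escape(string field)
        {
            if (field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) < 0)
                return field;
            return "\"" + field.Replace("\"", "\"\"") + "\"";
        }

        static void Main()
        {
            // A value with an embedded comma must be quoted to survive parsing.
            Console.WriteLine(Escape("some text, with a comma") + "," + Escape("label1"));
            // Prints: "some text, with a comma",label1
        }
    }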
I have a file in Spanish; when viewed on my teacher's PC, a bit of text would display as
regresión cuantílica más
but now that I've opened on mine I see this:
regresión cuantÃÂlica más
I have tried "Save with Encoding" to ISO-8859-1 and UTF-8 but it doesn't seem to change anything. Will I need to run some regex replacements on my file or is there a simpler way to fix this?
If you have already saved it and you've lost the original version of the file, it will be a pain to recover.
What you should have done when you noticed the bad characters was "Reopen with encoding", and chosen the "UTF-8" encoding. If you can still get the original file, do this now.
If you can't, then you're stuck with lots of manual fixing. Accented characters (and Euro signs, and a few other things) will show up as multi-character sequences. When you recognize one, use search and replace to replace that sequence with the correct character.
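If the original file is gone and the damage follows the usual pattern, there is an alternative to sequence-by-sequence search and replace: reverse the mis-decoding itself. A C# sketch, assuming the text is UTF-8 that was at some point decoded as ISO-8859-1 (file names are placeholders):

    using System.IO;
    using System.Text;

    class MojibakeRepair
    {
        static void Main()
        {
            // Placeholder file names. Assumption: the content is UTF-8
            // that was decoded as ISO-8859-1 somewhere along the way
            // (Windows-1252 may fit better, depending on what misread it).
            string garbled = File.ReadAllText("input.txt", Encoding.UTF8);

            Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
            string repaired = Encoding.UTF8.GetString(latin1.GetBytes(garbled));

            // Example: "mÃ¡s" becomes "más". If the file was mangled twice
            // (the doubled "ÃÂ" in the question suggests it was), apply
            // the same step a second time.
            File.WriteAllText("output.txt", repaired, Encoding.UTF8);
        }
    }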
I am facing the below pipe-delimiter issue in SSIS.
CRLF Pipe delimited text file:
-----------------------------
Col1|Col2 |Col3
1 |A/C No|2015
2 |A|C No|2016
Because of the pipe embedded within a field value, SSIS is failing to read the data.
Bad news: once you have a file with this problem, there is NO standard way for ANY software program to correctly parse the file.
Good news: if you can control (or affect) the way the file is generated to begin with, you would usually address this problem by including what is called a "Text Delimiter" (for example, having field values surrounded by double quotes) in addition to the Field Delimiter (pipe). The Text Delimiter will help because a program (like SSIS) can tell the field values apart from the delimiters, even if the values contain the Field Delimiter (e.g. pipes).
If you can't control how the file is generated, the best you can usually do is GUESS, which is problematic for obvious reasons.
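To illustrate what the generating side would do, here is a minimal C# sketch of that quoting rule, assuming double quotes as the Text Delimiter:

    using System;

    class PipeWriter
    {
        // Wrap any value that contains the field delimiter (or the text
        // delimiter itself) in double quotes, doubling embedded quotes.
        static string Quote(string value)
        {
            if (value.Contains("|") || value.Contains("\""))
                return "\"" + value.Replace("\"", "\"\"") + "\"";
            return value;
        }

        static void Main()
        {
            // The problem row from the question: "A|C No" contains the
            // field delimiter.
            string[] fields = { "2", "A|C No", "2016" };
            Console.WriteLine(string.Join("|", Array.ConvertAll(fields, Quote)));
            // Prints: 2|"A|C No"|2016
        }
    }

With the value quoted like this, an SSIS Flat File Connection Manager configured with " as the text qualifier can read the row unambiguously.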
In my case specifically, I have bullet-points (• or #149) in my text file.
If I copy paste "•" into my Unity text field in the editor, it shows up, so I am pretty sure the bullet-point is lost in the reading process. (I checked in debug mode, and indeed the bullet-point is lost at reading).
This is how I read in my text file as a TextAsset:
TextAsset content = Resources.Load(SlideManager.slideLanguage+"\\"+fileName+" ("+SlideManager.slideNumber+")") as TextAsset;
It turns out that the way I read the file is completely fine. It reads the file correctly, but the file's encoding is ASCII, so the resource loader cannot interpret non-ASCII characters and drops them.
Thus, since the bullet point is not a standard ASCII character but an extended ASCII character, you have to specify the encoding of your text files.
For example, set the encoding to UTF-8, and then it will work. I used Notepad++ to set the encoding, but I am sure there are many other ways you can do it.
To set the encoding in Notepad++:
Click on the Encoding menu (fifth from the left in the menu bar by default) and select Convert to UTF-8.
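If converting the files is not an option, you can also decode the bytes yourself at load time. A hedged sketch, assuming the files really are Windows-1252 "extended ASCII" (where the bullet is byte 149) and reusing the SlideManager names from the question:

    using System.Text;
    using UnityEngine;

    public class SlideLoader : MonoBehaviour
    {
        public string fileName;

        void Start()
        {
            // Same resource path as in the question (Resources.Load
            // uses forward slashes).
            TextAsset asset = Resources.Load<TextAsset>(
                SlideManager.slideLanguage + "/" + fileName +
                " (" + SlideManager.slideNumber + ")");

            // Assumption: the file on disk is Windows-1252, where the
            // bullet is byte 149. Decoding the raw bytes with that code
            // page recovers it instead of dropping it.
            string content = Encoding.GetEncoding(1252).GetString(asset.bytes);
            Debug.Log(content);
        }
    }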
I'm finding it difficult to parse a PDF file that was created in a non-English language. I used PDFBox and iText, but couldn't find anything in them that could help parse this file. Here's the PDF file that I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf The PDF says it was created using LaTeX and the Tikkana font. I have the Tikkana font installed on my machine, but that didn't help. Please help me with this.
Thanks, K
When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers & libraries, and was therefore corrupt in some way.
But that's not the case at all. It opens just fine in Acrobat Reader X. And then I see the text on the page.
And when I copy/paste that text from the first page, I get:
Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á
Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á
That's from Reader.
Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (also known as code page 1252) with a "Differences" array. This Differences array is wrong:
47 /BB 61 /BP /BQ 81 /C6...
The first number is the code point being replaced; the second is the name of a character that replaces the original value at that code point.
There are no such character names as BB, BP, BQ, C9, and so on. So when you copy and paste that text, you get the garbage above.
I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).
Eh... Long shot idea:
If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.
Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.
You could either fix the existing PDF[s] (by changing the names in the encoding dictionary and Type 3 charproc entries) such that these text extractors will work correctly, or just grab the bytes out of the stream and translate them yourself.
The workflow would go something like this:
For each character in a font used in the form:
Render it to PDF by itself using the same LaTeX/Ghostscript versions.
Open the PDF and find the CharProc for that particular known character.
Store that stream along with the known character used to build it.
For each text byte in the PDF to be interpreted:
Get the glyph name for the given byte based on the existing encoding array.
Get the "char proc" stream for that glyph name and compare it to your known char procs.
NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope). A rough sketch of the matching step follows below.
All that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. Might not too...
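For what it's worth, here is a rough C# sketch of the matching step only. Everything that actually touches the PDF (the captured known streams, the getCharProcForByte lookup) is hypothetical plumbing you would supply with a PDF library of your choice:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class Type3Translator
    {
        // knownCharProcs: CharProc streams captured from the one-character
        // PDFs you rendered yourself, keyed by the real character.
        // getCharProcForByte: hypothetical lookup of the CharProc stream
        // for a byte via the font's encoding array.
        static Dictionary<byte, char> BuildTable(
            Dictionary<char, byte[]> knownCharProcs,
            Func<byte, byte[]> getCharProcForByte)
        {
            var table = new Dictionary<byte, char>();
            for (int b = 0; b < 256; b++)
            {
                byte[] candidate = getCharProcForByte((byte)b);
                if (candidate == null) continue;

                // A byte maps to whichever known character has an
                // identical CharProc stream.
                foreach (var known in knownCharProcs)
                {
                    if (candidate.SequenceEqual(known.Value))
                    {
                        table[(byte)b] = known.Key;
                        break;
                    }
                }
            }
            return table;
        }

        // Apply the table to the raw text bytes; unknown bytes stay "?".
        static string Translate(byte[] textBytes, Dictionary<byte, char> table)
        {
            return new string(textBytes
                .Select(b => table.TryGetValue(b, out char c) ? c : '?')
                .ToArray());
        }
    }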