Issue displaying XLSXWriter file special characters in Google Spreadsheets (utf8_decode) - google-sheets

I am using the XLSXWriter PHP library to generate xlsx files that have special characters in a few columns.
The encoding is UTF-8, and the special characters seem to display correctly both on my website and when I open the files in Excel.
The strange thing is that when I import the file into Google Spreadsheets, all I get is a bunch of squares and characters that do not match the original.
Example : âðð¼ð¼â
ð³ð¾ð±: ð¸ð¾ ð³ð´ð² ð¸ð¶
What I have tried to fix the issue:
I use utf8_decode to convert my strings before exporting the file
I tried a composer library (forceutf8) in order to fix any errors related to utf8 double encoding
I tried to export directly using UTF-8 encoding but the columns are empty when I do this
I don't really know what else to try. Supposedly Google Spreadsheets should read UTF-8, and even if it didn't, it makes no sense that when I use utf8_decode the characters display correctly in Excel but not in Google Spreadsheets.
Any suggestions? I have been trying to fix this for several hours.
Thanks
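For readers hitting the same symptom, here is a minimal Python sketch of the two failure modes involved. The sample string is an assumption (three mathematical bold letters, which take 4 bytes each in UTF-8), not the asker's actual data. It shows that UTF-8 bytes misread as Latin-1 produce runs starting with "ð", which is consistent with the garbled example above, and that a conversion to ISO-8859-1 (which is what PHP's utf8_decode performs) cannot represent such characters at all.

```python
# Illustration only: the sample text is an assumption, not the asker's data.
original = "\U0001D403\U0001D428\U0001D41B"  # mathematical bold "Dob"

# Each of these characters needs 4 bytes in UTF-8.
utf8_bytes = original.encode("utf-8")

# Misreading the UTF-8 bytes as Latin-1 yields mojibake: every character
# becomes "ð" (byte 0xF0) followed by three more code units, two of which
# are invisible C1 control characters.
misread = utf8_bytes.decode("latin-1")

# Converting to Latin-1 instead loses the characters entirely, because
# ISO-8859-1 only covers U+0000..U+00FF. Each character is replaced by "?".
lossy = original.encode("latin-1", errors="replace")

# The mojibake is reversible as long as no bytes were dropped along the way.
repaired = misread.encode("latin-1").decode("utf-8")

print(len(utf8_bytes), repr(misread), lossy, repaired == original)
```

If the cell text contains anything outside Latin-1, keeping the strings in UTF-8 end to end is the only lossless option. Why the direct UTF-8 export produced empty columns is not something this sketch can explain.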

Related

How do I make Cypher respect character encoding when using LOAD CSV in browser?

My case: a list of Danish-named students (with names including characters such as ü, æ, ø, å).
Minimal working example
CSV file:
Fornavn;Efternavn;Mobil;Adresse
Øjvind;Ørnenæb;87654321;Paradisæblevej 125, 5610 Åkirkeby
Süzette;Ågård;12345678;Ærøvej 123, 2000 Frederiksberg
In-browser neo4j-editor:
$ LOAD CSV WITH HEADERS FROM 'file:///path/to/file.csv' AS line FIELDTERMINATOR ";"
CREATE (:Elev {fornavn: line.Fornavn, efternavn: line.Efternavn, mobil: line.Mobil, adresse: line.Adresse})
Resulting in registrations like:
[Screenshot: Neo4j browser showing ?-characters where Danish/German characters are expected.]
My data come from a Learning Management System into Excel. When exporting as CSV from Excel, I can control the file encoding in the Save As dialogue box. I have tried saving from Excel as "UTF-8" (which the Neo4j manual says it wants), "ISO-Western European", "Windows-Western European", and "Unicode", each in a separately named file, and adjusted the FROM 'file:///path/to/file.csv' clause accordingly.
Intriguingly, exactly the same misrepresentation results, regardless of which (apparent?) file encoding I request from Excel when saving. When I copy-paste the names and addresses directly into the editor, I do not encounter the problem.
Check Michael Hunger's blog post, which contains some tips, namely:
if you use non-ascii characters (umlauts, accents etc.) make sure to use the appropriate locale or provide the System property -Dfile.encoding=UTF8
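If the JVM setting cannot be changed, another option is to make sure the file on disk really is UTF-8 before loading it. Below is a minimal Python sketch of that. It assumes the Excel export is actually Windows-1252 (cp1252), and the function name and paths are hypothetical, so check the real encoding of your file first.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding="cp1252"):
    """Rewrite a text file as UTF-8.

    source_encoding is an assumption about what Excel really wrote;
    decoding raises UnicodeDecodeError if the guess is clearly wrong.
    """
    raw = Path(src).read_bytes()
    text = raw.decode(source_encoding)
    # Work on bytes so line endings are preserved exactly.
    Path(dst).write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    source = Path("students_cp1252.csv")
    if source.exists():
        reencode_to_utf8(source, "students_utf8.csv")
```

The resulting file can then be referenced from LOAD CSV as before.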

Setting spreadsheetgear CSV delimiter

How can I set the CSV delimiter to a semicolon (";") in the SpreadsheetGear component, if that is possible? We want a semicolon because the Dutch date format includes commas, so comma separation does not work well in that case.
I searched SO and Google but couldn't come up with any info.
SpreadsheetGear has no APIs to specify the delimiter used for text-based data files, unfortunately. If you need to read in or write out a file that uses some other delimiter, you would likely need to build your own file reader/writer that parses out the desired delimiter from incoming files or saves outgoing files with the desired delimiter.
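As a rough idea of what such a custom reader/writer involves, here is a minimal sketch in Python rather than .NET. It assumes the cell values have already been pulled out of the workbook as rows of strings by some other means. The function names are hypothetical and nothing here is SpreadsheetGear API.

```python
import csv
import io


def rows_to_delimited_text(rows, delimiter=";"):
    """Serialise rows of strings using the given delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def parse_delimited_text(text, delimiter=";"):
    """Parse delimiter-separated text back into rows of strings."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Values containing commas need no quoting once ";" is the separator.
sample_rows = [["Datum", "Bedrag"], ["1 januari 2020", "12,50"]]
print(rows_to_delimited_text(sample_rows))
```

Values that themselves contain the delimiter, quotes, or newlines still have to be quoted and escaped; the csv module handles that here, and a hand-written .NET equivalent would need to do the same.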

Rails oracle raw16

I'm using Rails 3.2.1 and I have been stuck on a problem for quite a while.
I'm using the Oracle enhanced adapter and I have a RAW(16) (UUID) column. When I try to display the data, one of two things happens:
1) I see weird symbols
2) I get "incompatible character encodings: ASCII-8BIT and UTF-8".
In my application.rb file I added
config.encoding = 'utf-8'
and in my view file I added
'#encoding=utf-8'
But so far nothing has worked.
I also tried adding html_safe, but that failed.
How can I safely display my UUID data?
Thank you very much
Answer:
I used the unpack method to convert the binary with the format H8H4H4H4H12 and then joined the array :-)
The RAW datatype is a string of bytes that can take any value. This includes binary data that doesn't translate to anything meaningful in ASCII or UTF-8 or in any character set.
You should really read Joel Spolsky's note about character sets and unicode before continuing.
Now, since the data can't be translated reliably to a string, how can we display it? Usually we convert or encode it, for instance:
We could use the hexadecimal representation, where each byte is converted to two [0-9A-F] characters (in Oracle, using the RAWTOHEX function). This is fine for displaying small binary fields such as RAW(16).
You can also use other encodings such as base 64 (in Oracle, with the UTL_ENCODE package).
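Both suggestions above come down to encoding the 16 raw bytes as printable text. Here is a minimal sketch of that, written in Python rather than Ruby. The 8-4-4-4-12 grouping mirrors the H8H4H4H4H12 unpack format; joining the groups with hyphens is an assumption, since the answer does not say which separator was used.

```python
import base64


def raw16_to_uuid_string(raw):
    """Format 16 raw bytes as 8-4-4-4-12 hex groups."""
    if len(raw) != 16:
        raise ValueError("expected exactly 16 bytes")
    digits = raw.hex()  # two lowercase hex characters per byte
    groups = [digits[0:8], digits[8:12], digits[12:16],
              digits[16:20], digits[20:32]]
    return "-".join(groups)  # hyphen separator is an assumption


def raw_to_hex(raw):
    """Plain uppercase hex, the same idea as Oracle's RAWTOHEX."""
    return raw.hex().upper()


def raw_to_base64(raw):
    """Base 64 text, shorter than hex for the same bytes."""
    return base64.b64encode(raw).decode("ascii")


sample = bytes(range(16))  # made-up value for illustration
print(raw16_to_uuid_string(sample))
print(raw_to_hex(sample))
print(raw_to_base64(sample))
```

All three results are plain ASCII, so they can be rendered in a UTF-8 view without any encoding conflict.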

Parsing PDF files

I'm finding it difficult to parse a PDF file that was created in a non-English language. I used pdfbox and itext but couldn't find anything in there that could help parse this file. Here's the PDF file that I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf The PDF says that it was created using LaTeX and the Tikkana font. I have the Tikkana font installed on my machine, but that didn't help. Please help me with this.
Thanks, K
When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers & libraries, and was therefore corrupt in some way.
But that's not the case at all. It opens just fine in Acrobat Reader X. And then I see the text on the page.
And when I copy/paste that text from the first page, I get:
Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á
Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á
That's from Reader.
Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (also known as code page 1252), with a "differences" array. This differences array is wrong:
47 /BB 61 /BP /BQ 81 /C6...
The first number is the code point being replaced, the second is a Name of a character that replaces the original value at that code point.
There are no such character names as BB, BP, BQ, C9... and so on. So when you copy-paste that text, you get the above garbage.
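To make the structure of a differences array concrete, here is a minimal Python sketch that expands one into a code-to-name table. A number sets the current code, and each name that follows is assigned to consecutive codes. The function name is hypothetical, and the input is only the fragment quoted above, not the full array from the PDF.

```python
def parse_differences(tokens):
    """Expand a differences array into {character code: glyph name}."""
    mapping = {}
    code = None
    for token in tokens:
        if token.startswith("/"):
            if code is None:
                raise ValueError("glyph name before any character code")
            mapping[code] = token[1:]
            code += 1  # the next name applies to the next code
        else:
            code = int(token)  # a number resets the current code
    return mapping


fragment = "47 /BB 61 /BP /BQ 81 /C6".split()
print(parse_differences(fragment))
# {47: 'BB', 61: 'BP', 62: 'BQ', 81: 'C6'}
```

A text extractor then has to turn each glyph name into a Unicode character, and that is the step that fails here because names like BB and BP are not standard.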
I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).
Eh... Long shot idea:
If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.
Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.
You could either fix the existing PDF[s] (by changing the names in the encoding dictionary and Type 3 charproc entries) such that these text extractors will work correctly, or just grab the bytes out of the stream and translate them yourself.
The workflow would go something like this:
For each character in a font used in the form:
Render it to PDF by itself using the same LaTeX/Ghostscript versions.
Open the PDF and find the CharProc for that particular known character.
Store that stream along with the known character used to build it.
For each text byte in the PDF to be interpreted:
Get the glyph name for the given byte based on the existing encoding array
Get the "char proc" stream for that glyph name and compare it to your known char procs.
NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope).
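The lookup part of that workflow could be sketched roughly as follows, in Python. It assumes the char proc streams have already been extracted and decompressed by a PDF library, and that matching streams are byte-identical once whitespace is normalised, which may well not hold in practice. All function names and the sample data are hypothetical.

```python
import hashlib


def stream_key(stream):
    """Digest of a char proc stream with whitespace runs collapsed."""
    normalised = b" ".join(stream.split())
    return hashlib.sha256(normalised).hexdigest()


def build_reference_table(known_streams):
    """Map stream digest -> character, from {character: stream}."""
    return {stream_key(s): ch for ch, s in known_streams.items()}


def translate(codes, encoding, char_procs, reference):
    """Translate text bytes via glyph name and char proc comparison.

    encoding:   {byte value: glyph name} from the differences array
    char_procs: {glyph name: stream bytes} from the Type 3 font
    Unmatched bytes become U+FFFD.
    """
    result = []
    for code in codes:
        stream = char_procs.get(encoding.get(code))
        if stream is None:
            result.append("\ufffd")
        else:
            result.append(reference.get(stream_key(stream), "\ufffd"))
    return "".join(result)


# Made-up streams, for illustration only.
reference = build_reference_table({"A": b"0 0 m 10 10 l S",
                                   "B": b"1 1 m 5 5 l S"})
encoding = {47: "BB", 61: "BP"}
char_procs = {"BB": b"0 0 m  10 10 l S\n", "BP": b"1 1 m 5 5 l S"}
print(translate([47, 61, 99], encoding, char_procs, reference))
```

Hashing the reference streams once is the caching mentioned in the note; each lookup is then a dictionary access instead of a stream-by-stream comparison.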
All that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. Might not too...

Bad characters from CSV into database

I am trying to figure out why I keep getting bad characters when I import information into my database from a CSV file.
Setup:
Database is UTF-8 encoding
HTML Page = UTF-8 Encoding (Meta Tag)
What I'm receiving when the file is imported is.
But in the CSV file everything looks clean, and the actual number is +1 (250) 862-8350
So I don't know what the issue is, my hunch is something to do with a form of trimming but I haven't been able to figure out what it is... any light would be appreciated!
Well, I found my answer, and it's somewhat embarrassing. When my phone number gets put into the database, I run it through my cleaner and then encode the data. But I didn't notice that my database column was set to a small character count, and my encoded value was longer than what could be inserted into the column. In short, I changed my column length from 32 to 64 and that solved the problem.
Thank you for your time in trying to help me though!
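A minimal Python sketch of that failure mode follows. The hex encoding is only a stand-in, since the post does not say which encoding was used, and it assumes the database silently truncates values that are too long for the column.

```python
def encode_for_storage(value):
    """Stand-in encoding (hex of the UTF-8 bytes); an assumption."""
    return value.encode("utf-8").hex()


def fits_column(encoded, column_length):
    """True if the encoded value fits the column without truncation."""
    return len(encoded) <= column_length


phone = "+1 (250) 862-8350"
encoded = encode_for_storage(phone)

# 17 characters become 34 hex characters: too long for 32, fine for 64.
print(len(encoded), fits_column(encoded, 32), fits_column(encoded, 64))

# What a silently truncating 32-character column would store and decode to.
truncated = encoded[:32]
print(bytes.fromhex(truncated).decode("utf-8"))
```

Checking the encoded length against the column size before inserting turns this kind of silent corruption into an explicit error.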
