detect encoding of xlsx content in ruby - ruby-on-rails

I have an app which allows uploading spreadsheets in xls, xlsx and csv format. The data is later used in various client-facing places. The people managing the data use various tools to create the spreadsheets, including mac/excel, win/excel, win/openoffice, linux/libreoffice...
The real problem is the mac/excel encoding, which creates some nasty-looking strings. Is there any way to make sure the file content's encoding is valid UTF-8?
My approach of simply checking File.read(file.path).valid_encoding? works only for CSV...

I would look into charlock_holmes, a gem which lets you easily detect a file's encoding and even attempt to transcode it.
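A minimal sketch of that approach (untested; `file` is the upload object from the question, and since xls/xlsx are binary/zip containers you would typically run detection on the extracted cell text rather than the raw file bytes):

```ruby
require 'charlock_holmes'

contents  = File.read(file.path)
detection = CharlockHolmes::EncodingDetector.detect(contents)
# => e.g. { type: :text, encoding: "ISO-8859-1", confidence: 70 }

unless detection[:encoding] == 'UTF-8'
  # Attempt to transcode whatever was detected into UTF-8
  contents = CharlockHolmes::Converter.convert(contents,
                                               detection[:encoding],
                                               'UTF-8')
end
```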

Related

Any feasible solution to read/write a simple XLS file on iOS? (I just need plain text)

I found some questions similar to this one, but no one provided a decent solution.
I've found two projects on SourceForge (libxls and xlslib) to read/write XLS, but they are under the GNU license, which I think cannot be used in an iOS app.
I don't want a full-featured XLS read/write library. I just need to write some plain text in several rows and columns, pretty much an XLS version of CSV. (CSV is all I need, but for some reason I need the XLS file format. I can read/write CSV manually because it's really easy to understand, but XLS is way too complicated; the format's documentation runs to hundreds of pages, and I really don't want to deal with file headers and the like.)
Is there any solution? Or is there any way to easily convert a CSV file to XLS on iOS?

Can I create a .xls file programmatically in iOS?

I need to create a .xls file from array data programmatically on the iPhone. How can this be done?
Maybe you're in trouble, maybe not. The "old" XLS format is a binary one, and I am not aware of any free libraries which are able to read or write that format. If this one is required, you're probably out of luck.
If, however, a more recent format will do, you're back in business, because you can use XML (Objective-C wrappers for libxml2 are readily available). Wikipedia features a short overview of the format which you might want to check out: Excel file formats on Wikipedia

How to pre-process CSV data for FasterCSV?

We're having a significant number of problems creating a bulk upload function for our little app. We're using the FasterCSV gem to upload data to a MySQL database, but FasterCSV is so twitchy and precise in its requirements that it constantly breaks with malformed CSV errors and timeout errors.
The CSV files are generally created by users pasting text from their websites or from Microsoft Word docs, so it is not reasonable to expect that there will never be odd characters like smart quotes or accents in the data. Also, users aren't going to be readily able to identify whether their data is perfect enough for FasterCSV or not. We need to find a way to fix it for them automatically.
Is there a good way or a reliable tool for pre-processing CSV data to fix any nits in the data before having the FasterCSV gem process it?
Try the CSV library in the standard lib. It is more forgiving about malformed CSV:
http://ruby-doc.org/stdlib/libdoc/csv/rdoc/index.html
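For instance, newer versions of the stdlib CSV accept a liberal_parsing option that tolerates some non-RFC-conformant input (a sketch, assuming Ruby 2.4+ and a hypothetical upload.csv):

```ruby
require 'csv'

CSV.foreach('upload.csv', liberal_parsing: true) do |row|
  # Tolerates e.g. stray double quotes inside unquoted fields
  puts row.inspect
end
```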
You can pass the file's encoding into the FasterCSV options when creating a new instance of the FasterCSV parser (see docs here: http://fastercsv.rubyforge.org/classes/FasterCSV.html#M000018).
Setting it to UTF-8 or the Microsoft encoding should get it past most dodgy extra characters, allowing it to actually parse into your required strings... then you can clean the strings to your heart's content.
There's also something in the docs about "converters" that you can pass in - though this is aimed more at converting, say, numeric or date types, you might be able to use it to gsub away the dodgy chars.
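Roughly what that looks like with today's stdlib CSV (the direct descendant of FasterCSV); the file name, source encoding, and cleanup rule here are illustrative assumptions:

```ruby
require 'csv'

# Custom converter: swap curly "smart" quotes for plain ASCII ones
cleanup = ->(field) do
  field.is_a?(String) ? field.tr("\u2018\u2019\u201C\u201D", %q{''""}) : field
end

rows = CSV.read('upload.csv',
                encoding:   'windows-1252:utf-8',  # read as cp1252, return UTF-8 strings
                converters: [cleanup])
```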
Try the smarter_csv gem - you can pass a block to its process method and clean up the data before it is used
https://github.com/tilo/smarter_csv
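A rough sketch of that (the file name and the strip-whitespace cleanup are made-up placeholders):

```ruby
require 'smarter_csv'

SmarterCSV.process('upload.csv', chunk_size: 100) do |chunk|
  chunk.each do |row|                 # row is a Hash keyed by the CSV headers
    row.transform_values! { |v| v.is_a?(String) ? v.strip : v }
    # ... hand the cleaned row to the importer here
  end
end
```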

Which code page was used to encode this DOC document?

I've got a bunch of .DOC documents. I'm not even positive they are Word documents, but even if they are, I need to open and parse them with e.g. Python to extract information from them.
The problem is, I couldn't figure out how they were encoded: UltraEdit's Conversion function wouldn't correct the text no matter which encoding I tried, and OpenOffice 3.2 also failed to display the contents correctly (guessing Windows-1252).
Here's an example, hoping that someone knows what code page it is:
"lÕAssemblŽe gŽnŽrale" instead of "l'Assemblée générale"
Thank you for any tip.
The Greenstone digital library http://www.greenstone.org/ provides pretty good text extraction from Word documents, including encoding detection.
Running MS Word in server mode gives you a range of scripting options - I'm sure detecting the encoding will be possible.
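As a side note, the sample above looks like the classic signature of Mac Roman text being read as Windows-1252 (Õ where an apostrophe should be, Ž where é should be). A quick way to test that guess, sketched here in Ruby purely as an illustration:

```ruby
garbled = "lÕAssemblŽe gŽnŽrale"

# Recover the original bytes from the Windows-1252 misreading,
# then reinterpret them as Mac Roman and transcode to UTF-8
fixed = garbled.encode('Windows-1252')
               .force_encoding('macRoman')
               .encode('UTF-8')

puts fixed  # => "l’Assemblée générale" if the Mac Roman guess is right
```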

What are the differences or advantages of using a binary file vs XML with TClientDataSet?

Are there any differences or advantages to using a binary file or an XML file with TClientDataSet?
Binary will be smaller and faster.
XML will be more portable and human readable.
The Binary file will be a little smaller.
The main advantage of the XML format is that you can pass it around via http(s) protocols.
Binary is smaller and faster, but only readable by TClientDataSets.
XML is larger and slower (both are not that bad, i.e. not by orders of magnitude bigger or slower).
XML is readable by people (not recommended in general, but it is doable), and software.
Therefore it is more portable (as Nick wrote).
TClientDataSets can load and save their own style of XML, or you can use the Delphi XML Mapper tool to read and write any kind of XML.
XSLT can for instance be used to transform those XML files into any kind of text, including other XML, HTML, CSV, fixed columns, etc.
In contrast to what Tim indicates, both binary and XML can be transferred over HTTP and HTTPS. However, sending XML is often appreciated because it is easier to trace.
Without having tested it: I guess the binary format would be quite a lot faster when reading and writing. You'd better do your own benchmarks for that, though.
Another advantage of binary might be that it cannot be easily edited, which prevents people from mucking up the data outside the application.
When using Delphi 2009, we have noticed that if the file has an extension of .XML, it will not save in binary format over an existing dfXMLUTF8 format, even with a LoadFromFile, SaveToFile. Changing the file extension to something else (.DAT, for example) allows saving the file in dfBinary. Our experience is that the binary file, in addition to being somewhat more difficult for the end-user to manipulate (a plus!), is approximately 50% smaller than the dfXMLUTF8 format file.
