How to pre-process CSV data for FasterCSV? - ruby-on-rails

We're having a significant number of problems creating a bulk upload function for our little app. We're using the FasterCSV gem to upload data to a MySQL database, but FasterCSV is so twitchy and precise in its requirements that it constantly breaks with malformed CSV errors and timeout errors.
The CSV files are generally created by users pasting text from their web sites or from Microsoft Word docs, so it is not reasonable to expect that there will never be odd characters like smart quotes or accents in the data. Also, users aren't going to be readily able to identify whether their data is perfect enough for FasterCSV or not. We need to find a way to fix it for them automatically.
Is there a good way or a reliable tool for pre-processing CSV data to fix any nits in the data before having the FasterCSV gem process it?

Try the CSV library in the standard lib. It is more forgiving about malformed CSV:
http://ruby-doc.org/stdlib/libdoc/csv/rdoc/index.html
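
For example, a minimal sketch (assuming a pre-1.9 Ruby, where the stdlib CSV is a separate parser from FasterCSV; the file name and the body of the block are placeholders for your own import logic):

    require 'csv'

    # Sketch only: 'upload.csv' stands in for the uploaded file.
    CSV.foreach('upload.csv') do |row|
      # row is an array of field values (nil for empty cells)
      puts row.inspect
    end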

You can pass the file's encoding into the FasterCSV options when creating a new instance of the FasterCSV parser (see the docs here: http://fastercsv.rubyforge.org/classes/FasterCSV.html#M000018).
Setting it to UTF-8 or the Microsoft encoding should get it past most dodgy extra characters, allowing it to actually parse into your required strings... then you can clean the strings to your heart's content.
There's also something in the docs about "converters" that you can pass in - though this is aimed more at converting, say, numeric or date types, you might be able to use it to gsub away the dodgy characters.
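
For example, a hedged sketch combining both ideas; the :encoding and :converters options are from the FasterCSV docs, while the file name and the clean-up lambda are placeholders of my own:

    require 'fastercsv'

    # Placeholder converter: replace Windows-1252 smart quotes with plain ones.
    # A custom converter is just a lambda that receives each raw field value
    # and returns the (possibly modified) value.
    clean_field = lambda do |field|
      field.is_a?(String) ? field.gsub(/[\x91\x92]/, "'").gsub(/[\x93\x94]/, '"') : field
    end

    FasterCSV.foreach('upload.csv',
                      :encoding   => 'u',             # 'u' = UTF-8 in FasterCSV's Ruby 1.8 notation
                      :converters => [clean_field]) do |row|
      # row is an array of already-converted fields
      puts row.inspect
    end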

Try the smarter_csv gem - you can pass a block to its process method and clean up data before it is used.
https://github.com/tilo/smarter_csv
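
For example (a rough sketch; the file name and the clean-up logic are placeholders, and I'm assuming the default behaviour where the block receives a one-element array per record unless :chunk_size is set):

    require 'smarter_csv'

    # Sketch only: 'upload.csv' is assumed to have a header row.
    SmarterCSV.process('upload.csv') do |chunk|
      chunk.each do |row|                              # row is a hash keyed by the headers
        row.each { |key, value| row[key] = value.strip if value.is_a?(String) }
        # MyModel.create(row)                          # hypothetical model
      end
    end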

Related

detect encoding of xlsx content in ruby

I have an app which allows uploading spreadsheets in xls, xlsx and csv format. The data is later used at various client facing places. The people managing the data use various tools to create the spreadsheets, including mac/excel, win/excel, win/openoffice, linux/libreoffice...
The real problem is the mac/excel encoding, which creates some nasty-looking strings. Is there any way to make sure the file content's encoding is valid UTF-8?
My approach of just checking File.read(file.path).valid_encoding? works only for csv...
I would look into charlock_holmes, a gem which lets you easily detect and even attempt to transcode files based on their encoding.
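
Something along these lines (a sketch; the path is a placeholder, and for binary xls/xlsx you would run it on the extracted text rather than on the container file):

    require 'charlock_holmes'

    path      = 'upload.csv'                 # placeholder for file.path from the question
    content   = File.read(path)
    detection = CharlockHolmes::EncodingDetector.detect(content)
    # => e.g. {:type => :text, :encoding => "windows-1252", :confidence => 33}

    if detection && detection[:encoding] && detection[:encoding] != 'UTF-8'
      content = CharlockHolmes::Converter.convert(content, detection[:encoding], 'UTF-8')
    end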

Xls (csv) or xml for rails importing data

I need to import data into my app. Right now I do it via xls spreadsheets, but in my case the file has about 80,000 rows and it is slow, so maybe it is better to choose another format? For example, will XML data be faster to import?
XML is unlikely to be any faster - it still needs to be parsed as strings and converted.
80,000 rows is quite a lot. How long does it take you?
Edit:
You can make what's happening more visible by dropping puts statements into your code, with timestamps. It's crude, but you can then time between various parts of your code to see which part takes the longest.
Or better yet, have a go at using ruby-prof to profile your code and see where it is spending the most time.
Either way, getting a more detailed picture of the slow-points is a Good Idea.
You may find there's just one or two bottlenecks that can be easily fixed.
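
For example, a crude sketch of the timestamp idea (parse_spreadsheet and the empty insert block are hypothetical stand-ins for your actual import steps):

    # Crude timing: print a timestamp before and after each stage.
    def parse_spreadsheet(path)
      []                                    # pretend parse: returns an array of rows
    end

    puts "start parse   #{Time.now}"
    rows = parse_spreadsheet('import.xls')
    puts "start inserts #{Time.now}"
    rows.each { |row| }                     # your per-row insert goes here
    puts "done          #{Time.now}"

    # Or, for a proper profile:
    # require 'ruby-prof'
    # result = RubyProf.profile { parse_spreadsheet('import.xls') }
    # RubyProf::FlatPrinter.new(result).print(STDOUT)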

Binary Serialized File - Delphi

I am trying to deserialize an old file format that was serialized in Delphi; it uses binary serialization. I know nothing about the structure of the file except some very high-level records that are in it.
What steps would you take to solve this problem? Any tools etc?
A good hex editor, and use the gray matter to identify structures.
If you get a hint what kind of file it is, you can search for more specialized tools.
Running the Unix/Linux "file" command can be good too (*). See Barry's comment below for how it works. It can be a quick check for common file types like DBF, ZIP, etc. hidden behind a different extension.
(*) There are 3rd-party builds for Windows, but they might lag behind in version. If you can run it on a recent *nix distro, it is advisable to do so.
The serialization process simply loops over all published properties and streams their values to a file. If you do not know the exact classes that were streamed to the file, you will have a very hard time deserializing it (if it is possible at all).
A good hex editor comes first. If the file is read without buffering (e.g. read directly from a TFileStream), you could gain some information by using ProcMon from SysInternals: you can see exactly what data is read in what chunks and thus determine more quickly where the boundaries are between the structures you have already identified.

Which code page was used to encode this DOC document?

I got a bunch of .DOC documents. I'm not even positive they are Word documents, but even if they are, I need to open and parse them with e.g. Python to extract information from them.
Problem is, I couldn't figure out how they were encoded: UltraEdit's Conversion function wouldn't correct the text no matter which encoding I tried. OpenOffice 3.2 also failed displaying the contents correctly (guessing Windows-1252).
Here's an example, hoping that someone recognizes which code page it is:
"lÕAssemblŽe gŽnŽrale" instead of "l'Assemblée générale"
Thank you for any tip.
The Greenstone digital library (http://www.greenstone.org/) provides pretty good text extraction from Word documents, including encoding detection.
Running MS Word in server mode gives you a range of scripting options; I'm sure detecting the encoding will be possible.
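
For what it's worth, the sample looks like Mac OS Roman text being displayed as Windows-1252 (0xD5 is the curly apostrophe and 0x8E is "é" in Mac Roman). A quick, hedged check of that guess, here in Ruby 1.9+ (the escaped bytes below reproduce the garbled string above):

    # Hedged sketch: assumes the extracted text is really Mac OS Roman.
    sample = "l\xD5Assembl\x8Ee g\x8En\x8Erale"
    puts sample.force_encoding('macRoman').encode('UTF-8')
    # => "l’Assemblée générale" (with a curly apostrophe) if the guess is correct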

What are the differences or advantages of using a binary file vs XML with TClientDataSet?

Are there any differences or advantages to using a binary file or an XML file with TClientDataSet?
Binary will be smaller and faster.
XML will be more portable and human readable.
The binary file will be a little smaller.
The main advantage of the XML format is that you can pass it around via http(s) protocols.
Binary is smaller and faster, but only readable by TClientDataSets.
XML is larger and slower (neither dramatically so, i.e. not by orders of magnitude bigger or slower).
XML is readable by people (not recommended in general, but it is doable) and by software.
Therefore it is more portable (as Nick wrote).
TClientDataSets can load and save their own style of XML, or you can use the Delphi XML Mapper tool to read and write any kind of XML.
XSLT can for instance be used to transform those XML files into any kind of text, including other XML, HTML, CSV, fixed columns, etc.
In contrast to what Tim indicates, both binary and XML can be transferred over HTTP and HTTPS. However, XML is often preferred because it is easier to trace.
Without having tested it: I guess the binary format would be quite a lot faster when reading and writing. You'd better do your own benchmarks for that, though.
Another advantage of binary might be that it cannot be easily edited, which prevents people from mucking up the data outside the application.
When using Delphi 2009, we have noticed that if the file has an extension of .XML, it will not save in binary format over an existing dfXMLUTF8-format file, even with LoadFromFile/SaveToFile. Changing the file extension to something else (.DAT, for example) allows saving the file in dfBinary. Our experience is that the binary file, in addition to being somewhat more difficult for the end user to manipulate (a plus!), is approximately 50% smaller than the dfXMLUTF8 format file.
