I'm dealing with messy CSV files, and really hope some regexes (or other viable substitutions) will help. Thousands of files per day, thousands of lines per file, hundreds of columns. For the most part the CSVs are very well formatted - all fields are contained within "quotes" and are comma-delimited. However, what's inside the fields sometimes causes trouble.
For simplicity, here's a brief demonstration of the issues I'm running into.
Example data:
"option1","option2","product","description","reference","price","currency","foo","bar","address1","address2","city","state","postal","country","option3","option4","option5"
"","","apples","multi-color ""macintosh"" apples","PO#1-3503", 24.00"" x 12.00"" x 12.00"","12.50","USD","","","1234 Main Street ""","","Genericsberg","VA","10324","USA","","",""
Issues:
""" appearing some place or other ... see address1
"" appearing inside of fields ... see description, reference
" appearing inside of fields ... see reference - the purchase order # and product dimensions are part of the same field (don't let the ", fool you)
What I'm pretty sure I want:
1. when "" appears inside of a non-empty field, remove it
2. when ", appears inside of a field, remove the " but keep the comma
3. when """ appears at the beginning / end of a field, change it to "
4. when "" appears at the beginning / end of a non-empty field, change it to "
What I really want is for these messy CSVs to import properly, with the data arriving in the correct fields.
Background:
Our tech stack involves Ruby on Rails, currently using SmarterCSV to import CSV files from an S3 bucket.
What we've tried:
text.gsub(/(?<!^|,)"(?!,|$)/, '')
This has served us well for various issues up to this point, but not so well on the above.
text.gsub('"""', '"')
This seems to accomplish rule 3 well enough - I haven't deployed it to production, as I'd like to get past the other issues as well.
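For reference, here's the kind of chain I'm picturing to cover rules 1-4 above (untested, and I realize rule 1 as written would also strip legitimately escaped quotes):

cleaned = text
  .gsub(/(^|,)"""/, '\1"')          # rule 3: """ at the start of a field -> "
  .gsub(/"""(,|$)/, '"\1')          # rule 3: """ at the end of a field -> "
  .gsub(/(^|,)""([^,"])/, '\1"\2')  # rule 4: "" at the start of a non-empty field -> "
  .gsub(/([^,"])""(,|$)/, '\1"\2')  # rule 4: "" at the end of a non-empty field -> "
  .gsub(/([^,"])""([^,"])/, '\1\2') # rule 1: "" inside a non-empty field -> removed
  .gsub(/([^,"])",([^"])/, '\1,\2') # rule 2: stray ", inside a field -> keep the comma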
I have a PDF with tabular data that runs over 50+ pages, and I want to extract this table into an Excel file using Automation Anywhere. (I am using the community version of AA 11.3.) I watched videos of the PDF integration command but haven't had any success trying this for tabular data.
Requesting assistance.
Thanks.
I am afraid that your case will be quite challenging... and the main reason for that is the values that contain multiple lines. You can still achieve what you need, and with good performance, but the code itself will not be pretty. You will also face challenges with Automation Anywhere, since it does not really provide the right tools for such a thing, and you may need to resort to scripting (VBScript) or Metabots.
Solution 1
This one will try to use purely text extraction and Regular expressions. Mainly standard functionality, nothing too "dirty".
First you need to see what the exported data looks like. You can export to Plain or Structured.
The Plain export is not useful at all, as the data is all over the place without any clear pattern.
The Structured export is much better, as its structure resembles the original document. From looking at the data you can make these observations:
Each row contains 5 columns
All columns are always filled (at least in the visible sample set)
The last two columns can serve as a pattern "anchor" (identifier), because they follow a clear pattern: a number, followed by a minimum of two spaces, followed by a dollar sign and another number
Rows with data are separated by a blank row
The text columns may contain a multiline value, which will duplicate the rows (this one thing makes it especially tricky)
First you need to ensure that the Structured data contains only the table and nothing else. You can probably use the Before-After String command for that.
Then you need to check whether you can reliably identify the character width of every column. You can test this yourself by copying the text into Excel, using Text to Columns with the Fixed Width option, and playing around with the sliders.
Then you need to find a way to reliably identify each row and prepare it for the Split command in AA. For that you need a delimiter, but since each data row can actually consist of multiple text rows, you need to create a delimiter of your own. I used the Replace command with the Regular Expression option to replace a specific pattern with a delimiter of my own (a pipe), as sketched below.
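A minimal Ruby-flavored sketch of that step (in AA itself this is the Replace command; the pattern is my guess based on the observation above that data rows are separated by a blank row, and the file name is made up):

# Replace each blank separator row with a pipe, so every data row
# (even one spanning several text lines) gets a one-character delimiter.
structured = File.read("structured_export.txt") # hypothetical file name
delimited = structured.gsub(/\n[ \t]*\n/, "|")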
Now that you have added a custom delimiter, you can use the Split command to add each row into a list and loop through it.
Because each data row may consist of several text rows, you will need to use Split again, this time with [ENTER] as the delimiter. Then you loop through each text line of a single data row and use the Substring function to extract data based on column width, concatenating the pieces into a single value that you store somewhere else.
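Expressed in Ruby once more (purely illustrative; in AA this is a nested loop with the Substring command, and the column widths are made-up placeholders you would measure in the Text to Columns step):

# Made-up fixed widths for the five columns.
WIDTHS = [20, 30, 25, 12, 12]

rows = delimited.split("|").map do |data_row|
  columns = Array.new(WIDTHS.size) { "" }
  data_row.split("\n").each do |text_line|
    offset = 0
    WIDTHS.each_with_index do |width, i|
      # Slice this line's share of column i and append it to the column value.
      part = text_line[offset, width].to_s.strip
      columns[i] = [columns[i], part].reject(&:empty?).join(" ")
      offset += width
    end
  end
  columns
end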
All in all, a painful process.
Solution 2
This may not be applicable, but it's worth a try: open the PDF in Microsoft Word. It will give you a warning; ignore it. Word will attempt to open the document and, if you're lucky, it will recognise your table as a table. If it works, it will make the data extraction much easier, and you will be able to use Macros/VBA or even simple copy & paste. I tried it on a random PDF of my own and it worked quite well.
I have a directory full of files which have Unicode characters with diacritics in their file names, e.g. ăn.mp3, bất.mp3. (They're Vietnamese words.)
I'm iterating over these files using Dir.glob("path/to/folder/*").each, but the diacritics don't work properly. For example:
Dir.glob("path/to/folder/*").each do |file|
# e.g. file = "path/to/folder/bất.mp3"
word = file.split("/").last.split(".").first # bất
puts word[1] # outputs "a", but should be "ấ"
end
Bizarrely, if I run puts word then the diacritics appear correctly, but if I puts individual letters, they're not there. The file names eventually get saved as an attribute in a table in my Rails app, and all kinds of problems are occurring from the diacritics being inconsistent and disappearing.
Clearly something's wrong with my encoding, but I have no idea how to go about fixing this. This is a problem not just with Rails but with Ruby itself, because the above output is from irb, independent of any Rails app.
(I'm running Ruby 2.0.0p247.)
What the hell is going on?
There are two ways to produce a diacritic. One is to use a single character that already carries the diacritic (the composed form). The other is to use the plain letter immediately followed by a special combining character for the diacritic (the decomposed form). Are you sure you're not in the latter scenario? (If so, puts 'a' + word[2] should produce a letter with a diacritic.)
Also, are you sure your strings are properly encoded as UTF-8 (or UTF-16), rather than being raw byte sequences?
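If it is the decomposed form, normalizing the string is the usual fix. A sketch, assuming Ruby 2.2+ for String#unicode_normalize (on your 2.0 you would need a gem such as unf, or ActiveSupport's mb_chars):

# "bất" in decomposed (NFD) form keeps the diacritics as separate
# combining characters, so indexing by position splits them off.
word = "bất".unicode_normalize(:nfd)
word.length # => 5 (b, a, combining circumflex, combining acute, t)
word[1]     # => "a"

# Composed (NFC) form packs each letter plus its marks into one character.
nfc = word.unicode_normalize(:nfc)
nfc.length  # => 3
nfc[1]      # => "ấ"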
I maintain a client-server DMS written in Delphi/SQL Server.
I would like to allow users to search for a string inside all the documents stored in the DB (files are stored as blobs, zipped to save space).
My idea is to index them on check-in: as I store a new file, I extract all the text in it and put it in a new DB field. So my files table will be something like:
ID_FILE integer
ZIPPED_FILE blob
TEXT_CONTENT text field (nvarchar in sql server)
I would like to support "indexing" of at least the most common text-like files, such as pdf, txt, rtf, doc, docx, maybe adding xls, xlsx, ppt, and pptx.
For MS Office files I can use ActiveX, since I already do that in my application, and for txt files I can simply read the file. But what about pdf and odt?
Could you suggest the best technique, or even a third-party component (commercial is fine), that parses all these file types with "no fear"?
Thanks
Searching documents this way would lead to very slow and inconvenient searches. I'd advise you to produce two additional tables instead of the TEXT_CONTENT field.
When you parse the text, you should extract the valuable words and standardize them (a sketch follows this list), so that you:
- get rid of lower/upper case differences
- get rid of characters that might be used interchangeably; e.g. in Turkish we have the ç character, which might be entered as c
- get rid of words that are too common in the language you are dealing with; e.g. from "Thing I am looking for", only "Thing" and "Looking" might be of interest
- get rid of whatever other problems you face
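A sketch of that standardization, in Ruby purely for brevity (your stack is Delphi/SQL Server, but the idea is language-agnostic; the stop-word list and character mapping are placeholders):

# Placeholder stop words; use a real list for your target language.
STOP_WORDS = %w[i am the for a an is are].freeze

def standardize(text)
  text.downcase
      .tr("çğışöü", "cgisou")                # fold interchangeable characters (the Turkish example above)
      .scan(/[[:alnum:]]+/)                  # split into words
      .reject { |w| STOP_WORDS.include?(w) } # drop the too-common words
      .uniq
end

standardize("Thing I am looking for") # => ["thing", "looking"]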
Each word that already has an entry in the string_search table should re-use the ID already assigned there.
The tables might look like this:
original_file_table
zip_id number
zip_file blob
string_search
str_id number
standardized_word text (or any string type with an appropriate secondary index)
file_string_reference
zip_id number
str_id number
I hope this gives you an idea of what I am thinking of.
Your major problem is zipping the files before putting them as blobs in your database, which makes them unsearchable by the database itself. I would suggest the following.
Don't zip files you put in the database. Disk space is cheap.
You can then write a query like the one below, as long as you save the files in a text field.
Select * from MyFileTable Where MyFileData like '%Thing I am looking for%'
This is slow, but it will work. It works because the text in most of those file types is plain text rather than binary (though some of the newer file types are now binary).
The other alternative is to use an indexing engine such as Apache Lucene or Apache Solr, which will, as you put it:
parses with "no fear" all file types?
I'm using Sunspot Solr search, which works fine for the most part for basic search. It's supposed to be able to handle quotation marks around phrases, so that a search for test case will return documents with both test and case, whereas a search for "test case" should return documents with the phrase test case.
However, I've been pulling my hair out over this, because it seems that Rails strips any outer quotation marks from user input before it even reaches the search engine. Thus "test case" returns exactly the same results as test case. Oddly, ""test case"" or " test case " (with leading and trailing spaces) actually works: in the first case because the outer quotation marks are stripped away, leaving the inner ones, and in the second because the problem only affects leading and trailing quotation marks.
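For reference, the search call itself is just the stock Sunspot pattern (model and parameter names are placeholders):

# By the time params[:q] arrives here, the outer quotes are already gone,
# so Solr never sees the phrase syntax.
Post.search do
  fulltext params[:q]
end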
Apparently, this is a known bug and has been marked as won't-fix by the Rails team. I'm really surprised how little I can find about it online, since this seems like very common functionality.
How are people getting around this? Asking users to double-quote things doesn't seem like a reasonable solution, and I don't particularly want to make my own custom modifications to Rack.
Apparently, from the aforementioned Lighthouse ticket, "If you add a new line character then double quotes [are] preserved."
You might consider using JavaScript to append a newline to your search string. It's a bit of a hack, but it should get you past the Rack bug with no adverse effect on your queries.
A quick example off the top of my head in jQuery. Untested, YMMV, etc.
// Append a newline to a field when submitting a form,
// to work around the Rack quote-parsing bug.
$('#your_form').submit(function() {
  $('#your_input').val(function(i, val) {
    return val + "\n";
  });
});
Hmm, I guess not many people are running Rails apps that need search supporting quotation marks?
I'm getting around this for now using the Rack patch linked from the bug report, until Rails fixes the bug.
Edit: Adding the links
Found on this page:
https://rails.lighthouseapp.com/projects/8994/tickets/4808
Direct download:
https://rails.lighthouseapp.com/projects/8994/tickets/4808/a/662679/fix_rack_110_quote_parsing.rb
However, this is definitely not perfect; I am still finding some cases where it fails, such as a query ending with a trailing quote causing the leading part to be truncated.
Here's my wild and wacky pseudo-code. Anyone know how to make this real?
Background:
This dynamic content comes from a CKEditor field, and a lot of folks paste Microsoft Word content into it. No worries: if I just render the attribute untouched, it loads pretty. But the catch is that I want it abbreviated to just 125 characters. When I add truncation, all of the Microsoft Word markup starts popping up. I then added simple_format, sanitize, and truncate, and even made my controller gsub out specific variables that MS Word produces, but there are too many of them, and it seems like an awfully messy way to accomplish this. Realizing that by itself the content renders clean, I thought: why not just slice it? However, the Microsoft Word text becomes blank yet still holds its numbered position in the string. So I came up with this (probably awful) solution below.
It's in three steps.
When the text renders, it doesn't display any of the MS Word junk, but that junk still occupies positions in the string. So I want to use a regexp to find the first actual character.
Take that character and find its numbered position in the total string.
Use a slice statement to cut from there.
def about_us_truncated
  # Steps 1 & 2: find the index of the first "actual" character.
  # ([[:alnum:]] is a stand-in; swap in whatever "actual" means here.)
  y = about_us.index(/[[:alnum:]]/) || 0
  # Step 3: slice 125 characters starting from there.
  about_us[y, 125]
end
The only other idea I've got is a regex statement that explicitly slices only actual characters, like so:
about_us([a-zA-Z][0..125]), but that is definitely not how it would be written.
Here is some sample text of MS Word junk :
≪! [If Gte Mso 9]>≪Xml>≪Br /> ≪O:Office Document Settings>≪Br /> ≪O:Allow Png/>≪Br /> ≪/O:Off...
You haven't provided much information to go on, but don't be too leery of trying to build this regex on your own before you seek help.
Take your sample text, paste it into Rubular's test string area, and start building your regex. It has a great quick reference at the bottom.
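For example, the junk in your sample is Word's conditional-comment markup, which in the raw HTML usually looks like <!--[if gte mso 9]> ... <![endif]-->. A hypothetical first cut to refine in Rubular (the patterns are guesses, not battle-tested, and truncate here is ActiveSupport's String#truncate):

# Strip Word conditional comments and Office-namespaced tags before truncating.
def about_us_truncated
  about_us
    .gsub(/<!--\[if [^\]]*\]>.*?<!\[endif\]-->/mi, "") # <!--[if ...]> ... <![endif]-->
    .gsub(%r{</?o:[^>]*>}i, "")                        # <o:OfficeDocumentSettings> etc.
    .truncate(125)
end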
Stumbled across this
http://gist.github.com/139987
it looks like it requires the sanitize gem.
This is technically not a straight answer, but it seems like the best option you'll find.
To keep MS Word markup out in the first place, you should use CKEditor's built-in MS Word sanitizer (its paste-from-Word filter). Writing your own regex for this can get very complicated, and you can easily break tags in half and destroy your site's markup with it.
What I did as a workaround was force paste-as-plain-text in CKEditor.