RoR: Handling Blanks AND/OR Special Characters


I'm processing emails for upload and occasionally an embedded image in the email comes through either without a file extension or with an extension containing a random combination of letters, numbers and special characters (for example: image001.gif#01CFA02B.47556390). If either case arrives, I want to ignore it and move on. I think I've got the missing extension covered, but I'm not clear on how best to handle the random characters, or on the cleanest way to write the conditionals. Here is what I have so far:
filename_extension = File.extname(filename)
if filename_extension.blank?
  puts "FILENAME EXT IS BLANK"
elsif filename_extension # NEED REGEX or something to handle Random?
  puts "FILENAME EXT IS Random"
else
  # DO PROCESSING
end
Thanks.

known_extensions = %w[.csv .rb .rbw .html .htm .css]
filenames = %w[1.txt 2.csv 3]
filenames.each do |filename|
  filename_extension = File.extname(filename)
  if filename_extension.empty?
    puts "FILENAME EXT IS BLANK"
  elsif !known_extensions.include?(filename_extension)
    # Anything that is not on the known list is treated as random
    puts "FILENAME EXT IS Random"
  else # DO PROCESSING
    puts "Processing"
  end
end
The question was tagged ruby without any indication of gems that may give you the blank? method.
The idea of an 'invalid' extension is rather varied, and of course tied to what it means to be a valid file name. On most Unix file systems, for example, the only limitations on a file name are the maximum length of 255 bytes and the reserved characters / and null. In fact, there is no specification that I am aware of about 'extensions' in Unix; they are simply part of the file name, and a period in a file name is valid without signifying anything special. (The exception is a file name that starts with a period, which indicates that it should be a 'hidden' file.) On a Windows system, the list of restricted characters is longer, some of which are / : < > ? \ | + , . ; = [ ] as well as the single and double quotes. On my Commodore, I think it was :, , and =, and on my Amiga system I could use anything except for ", :, or /.
So I think an 'invalid' extension might be easier to match than a 'valid' one. If you are indeed using Rails, and hosting on Unix, then you have a smaller set of things to check for to ensure a valid extension (indeed, a valid filename). Base your definition of an invalid extension on your hosting system, plus any restrictions you want to impose because of what a valid extension means to your program.
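As a rough sketch of the allowlist idea for the email case: note that File.extname("image001.gif#01CFA02B.47556390") returns ".47556390", which is all digits, so a character-class regex alone would not necessarily reject it. The list of allowed extensions and the method name processable? below are assumptions made for illustration, not something from the question.

```ruby
# Hypothetical allowlist; adjust to the attachment types you actually accept.
ALLOWED_EXTENSIONS = %w[.gif .jpg .jpeg .png .csv].freeze

# Returns true only when the file has an extension and it is on the allowlist.
# The comparison is case-insensitive, so "photo.GIF" is accepted.
def processable?(filename)
  extension = File.extname(filename).downcase
  !extension.empty? && ALLOWED_EXTENSIONS.include?(extension)
end

%w[photo.GIF image001 image001.gif#01CFA02B.47556390].each do |name|
  puts "#{name}: #{processable?(name) ? 'process' : 'skip'}"
end
```

Both the blank extension and the random one fall into the same "skip" branch, which keeps the conditional down to a single check.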

Related

Problem with attachments' character encoding using gmail gem in ruby/rails

What I am doing:
I am using the gmail gem in a Rails 4 app to get email attachments from a specific account at regular intervals. Here is an extract from the core part (here for simplicity only considering the first email and its first attachment):
require 'gmail'
Gmail.connect(@user_email, @user_password) do |gmail|
  if gmail.logged_in?
    emails = gmail.inbox.emails(:from => @sender_email)
    email = emails[0]
    attachment = email.message.attachments[0]
    # Ruby does not expand "~" in paths, so expand it explicitly
    File.open(File.expand_path("~/temp.csv"), 'w') do |file|
      file.write(
        StringIO.new(attachment.decoded.to_s[2..-2].force_encoding("ISO-8859-15").encode!('UTF-8')).read
      )
    end
  end
end
The encoding of the attached file can vary. The particular one that I am currently having issues with is in Finnish. It contains Finnish characters and a superscripted 3 character.
This is what I expect to get when I run the above code. (This is what I get when I download the attachment manually through gmail user interface):
What the problem is:
However, I am getting the following odd results.
From cat temp.csv (Looks good to me):
With nano temp.csv (Here I have no idea what I am looking at):
This is what temp.csv looks like opened in Sublime Text (directly via winscp). First line and small parts look ok but then Chinese/Japanese characters:
This is what temp.csv looks like in Notepad (after download via winscp). Looks ok except a blank space has been inserted between each character and the new lines seems to be missing:
What I have tried:
I have without success tried:
.force_encoding(...) with all the different "ISO-8859-x" character sets
putting the force_encoding("ISO-8859-15").encode!('UTF-8') outside the .read (works but doesn't solve the problem)
encode to UTF-8 without first forcing another encoding but this leads to Encoding::UndefinedConversionError: "\xC4" from ASCII-8BIT to UTF-8
writing as binary with 'wb' and 'w+b' in the File.open() (which oddly doesn't seem to make a difference to the outcome).
searching stackoverflow and the web for other ideas.
Any ideas would be much appreciated!

Not beautiful, but it will work for me now.
After re-encoding, I convert the string to a char array, then remove the chars I do not want and then join the remaining array elements to form a string.
decoded_att = attachment.decoded
data = decoded_att.encode("UTF-8", "ISO-8859-1", invalid: :replace, undef: :replace).gsub("\r\n", "\n")
# Drop the NUL characters and the stray "ÿ" and "þ" characters
data = data.chars.reject { |c| c == "\u0000" || c == "ÿ" || c == "þ" }.join
# Ruby does not expand "~" in paths, so expand it explicitly
File.write(File.expand_path("~/temp.csv"), data)
This will work for me now. However, I have no idea how these characters have ended up in the attachment ("ÿ" and "þ" in the start of the document and "\u0000" between all remaining characters).
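One possible explanation for those characters, offered as an assumption rather than a confirmed diagnosis: the bytes FF FE are the UTF-16LE byte order mark and show up as "ÿþ" when read as ISO-8859-1, and a NUL after every character is what mostly-Latin text looks like when UTF-16LE is read one byte at a time. If the attachment really is UTF-16LE, decoding it as such avoids deleting characters by hand. The helper name utf16le_to_utf8 is made up for this sketch.

```ruby
# Converts raw UTF-16LE bytes (with or without a leading BOM) to a UTF-8 string.
# Assumes the input really is UTF-16LE; other encodings need different handling.
def utf16le_to_utf8(raw)
  bytes = raw.dup.force_encoding(Encoding::BINARY)
  # Strip the FF FE byte order mark if present
  bytes = bytes.byteslice(2..-1) if bytes.start_with?("\xFF\xFE".b)
  bytes.force_encoding(Encoding::UTF_16LE)
       .encode(Encoding::UTF_8)
       .gsub("\r\n", "\n")
end
```

With the code above this would be called as data = utf16le_to_utf8(attachment.decoded), which keeps characters such as "ÿ" and "þ" intact if they legitimately occur later in the file.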

It seems like you need to do attachment.body.decoded instead of attachment.decoded

MalformedCSVError with rails CSV (FasterCSV)

I'm having serious issues trying to parse some CSV in rails right now.
Basically my app gets a user to upload a CSV file. The app then converts the file to ensure it is in UTF-8 format, then attempts to parse it and process it. Whenever the app attempts to parse it however, I get the MalformedCSVError stating "Illegal quoting on line 1"
Now what I don't get, is if I copy the original file into a new document and save it, then I can parse it on a rails console without a problem.
If I attempt to parse the original file, it complains about an invalid character for UTF-8 encoding (the file isn't in UTF-8 hence the app converts it)
If I attempt to parse the file which the app has converted to UTF-8 and changed the line endings to LF, it fails to parse.
If I do a file diff between the version the app has produced, and the copy/paste version that I have made (which works) there are 0 differences so I really can't figure out why one is parsable, and one is not.
Any suggestions? My app is processing the file as follows :
def create
  @survey = Survey.new(params[:survey])
  # Now we need to try and convert this to UTF-8 if it isn't already
  encoded = File.read(@survey.survey_data.current_path)
  encoding = CharlockHolmes::EncodingDetector.detect(encoded)
  # We've got a guess at the encoding,
  # so we can try and convert it but it
  # may still fail so we need to handle
  # that
  begin
    re_encoded = CharlockHolmes::Converter.convert(encoded, encoding[:encoding], 'UTF-8')
    re_encoded = re_encoded.gsub(/\r\n?/, "\n")
    # Now replace the uploaded file
    File.open(@survey.survey_data.current_path, 'w') { |f|
      f.write(re_encoded)
    }
  rescue ArgumentError
    puts "UH OH!!!!!"
  end
  puts "#{@survey.survey_data.current_path}"
  @parsed = CSV.read(@survey.survey_data.current_path)
end
The file uploading gem is CarrierWave if that makes any difference.
Please can someone help me as this is driving me insane!
Edit
The error says it's on line 1. Line 1 (assuming it doesn't index from 0) is
"Survey","RD","GarrysMDs","NigelsMDs","PaulsMDs","StephensMDs","BrinleyJ","CarolineP","DaveL","GrantR","GregS","Kent","NeilC","NicolaP","AndyC","DarrenS","DeanB","KarenF","PaulR","RichardF","SteveG","BrianG","GordonA","NickD","NickR","NickT","RayL","SimonH","EdmondH","JasonF","MikeS","SamanthaN","TimB","TravisF","AlanS","Q1","Q2","Q3","Q4","Q5","Q6","Q7","Q8PM","Q8N","Q9","Q10","Q11","Q12","Q13","Q14","Q15","Q16PM","Q16N","Q17PM","Q17N","Q18PM","Q18N","Q19","Q20","Q21","Q22","comment","Q23.1","Q23.2","Q23.3","TQ23.1","TQ23.2","VPM","VN","VQ1","VQ2","VQ3","VQ4","VQ5","VQ6","VQ7","VQ8N","VQ8PM","VQ9","VQ10","VQ11","VQ12","VQ13","VQ14","VQ15","VQ16","VQ16N","VQ16PM","VQ17","VQ17N","VQ17PM","VQ18","VQ18N","VQ18PM","VQ19","VQ20","VQ21","VQ22","VQ23.1","VQ23.2","VQ23.3","VRD","XQ16","XQ17","XQ18"

Well that was irritating!
Turns out the file had a BOM which was causing the CSV parser to break. Loading the file with
CSV.open("path/to/file.csv", "rb:bom|encoding")
allowed it to parse it perfectly! So annoyed how long it took to track down but it's now working and with no need to convert to UTF-8 now either!
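For anyone hitting the same thing, here is a minimal sketch of BOM-tolerant reading, assuming the file is UTF-8 with an optional BOM. It uses Ruby's "r:bom|utf-8" open mode, which skips a leading BOM if one is present; the method name read_csv_skipping_bom is made up for the example.

```ruby
require "csv"
require "tempfile"

# Reads a CSV file as UTF-8, silently skipping a leading byte order mark.
def read_csv_skipping_bom(path)
  File.open(path, "r:bom|utf-8") { |file| CSV.parse(file.read) }
end

# Demonstration with a temporary file that starts with the UTF-8 BOM (EF BB BF)
Tempfile.create(["survey", ".csv"]) do |tmp|
  tmp.binmode
  tmp.write("\xEF\xBB\xBF\"Survey\",\"RD\"\n".b)
  tmp.flush
  p read_csv_skipping_bom(tmp.path)
end
```

Without the bom| prefix the invisible BOM sits in front of the first quote character, which would explain an "Illegal quoting on line 1" error on a line that looks perfectly fine.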

Lua: Pattern match after a string?

For example, I have arbitrary lines in this format:
directory C:\Program Files\abc\def\
or something like.
log-enabled On
I want to be able to extract the "C:\Program Files\abc\def\" part out of that first line. Likewise, I want to extract the "On" out of the second line. The number of spaces between the variable and its value is arbitrary. I will know the name of the variable, but I need to extract the value based on that.
So basically, I want to remove the first word and a number of arbitrary spaces that follow the first word, and return what remains until the end of the line.

Assuming that, by "word" you mean "a string of characters without spaces", you can do this:
for line in ioFile:lines() do
  local variable, value = line:match("(%S+)%s+(.+)")
  -- Do stuff with variable and value
end

One alternative with string.match was shown by Nicol Bolas, here is another alternative:
function splitOnFirstSpace(input)
  local space = input:find(' ') or (#input + 1)
  return input:sub(1, space-1), input:sub(space+1)
end
Usage:
local command, param = splitOnFirstSpace(line)
If no argument is given (splitOnFirstSpace('no-param-here')), then param is the empty string.

I do not believe Lua is packaged with a split() function like Ruby or Perl.
I found that this guy built a lua version of Perl's split function:
http://lua-users.org/lists/lua-l/2011-02/msg01145.html
If you can guarantee that the argument will only have 1 word before it, with that word not containing any spaces, you can read in that line, run the split function on it, and use the return array's 1 index value as what you want.
You could error check that too and make sure you get a 'C:\' within your expected directory, or check to make sure the string is == to 'On' or 'Off'. Because of using the hardcoded index value I really advocate you error check your expected value. Nothing is worse than having an assumed value be wrong.
If an error is detected make sure to log or print it to the screen so you know about it.
This could catch bugs where maybe the string that was input is improper.
Some simple code that models what I suggest you do:
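local line = "directory C:\\Program Files\\abc\\def\\"
-- split() is the helper from the post linked above, not a built-in; its signature is assumed here
local contents = split(line, " ") -- Split using a space
local directory = contents[2] -- Here is your directory
if errorCheckDir(directory) then -- errorCheckDir is your own validation function
  -- Use directory
end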
EDIT:
In response to comments below Lua indeed begins indexing at 1, not 0.
Also, in the case that the directory contains spaces (which is probable) instead of simply using contents[2], I would loop through all of contents except index 1, and piece back together the directory making sure to add the required space between each index that you attach.
So in the case above, contents[2] and contents[3] would have to be stitched back together with a space in between to recover the proper directory.
directory = contents[2].." "..contents[3]
This can be easily automated using a function which has a loop in it and returns back the proper directory:
function recoverDir(contents)
  local directory = ""
  -- Recover the directory
  for i = 2, #contents do
    directory = directory .. contents[i] .. " "
  end
  -- Strip the extra space on the end
  directory = string.sub(directory, 1, -2)
  return directory -- proper directory
end

Rails detect changes to files programatically

I would like to write a method that programmatically detects whether any of the files in my Rails app have been changed. Is it possible to do something like an MD5 of the whole app and store that in a session variable?
This is mostly for having some fun with cache manifest. I already have a dynamically generated cache and it works well in production. But in my dev environment, I would like the id of that cache to update whenever I change anything in the app directory (as opposed to every 10 seconds, which is how I have it setup right now).
Update
File.ctime(".") would be perfect, except that "." is not marked as having changed when deeper directory files have changed.
Does it make sense to iterate through all directories in "." and add together the ctimes for each?

Have you considered using Guard?
You can programmatically do anything whenever a file in your project changes.
There is a nice Railscast about it.

There is a simple ruby gem called filewatcher. This is the most advanced example:
require 'filewatcher'
FileWatcher.new(["README.rdoc"]).watch() do |filename, event|
  if(event == :changed)
    puts "File updated: " + filename
  end
  if(event == :delete)
    puts "File deleted: " + filename
  end
  if(event == :new)
    puts "New file: " + filename
  end
end

File.ctime is the key. Iterate through all files and create a unique id based on the sum of all their ctimes:
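cache_id = 0
Dir.glob('./**/*') do |this_file|
  # Skip the log directory; a `next` inside a nested each block would not skip this iteration
  next if this_file == './log' || this_file.start_with?('./log/')
  # Sum the ctimes of regular files so that content changes are picked up
  cache_id += File.ctime(this_file).to_i if File.file?(this_file)
end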
Works like a charm, page only re-caches when it needs to, even in development.
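A sum of timestamps can in principle collide (one file's time goes up while another's goes down by the same amount), so here is a sketch closer to the "MD5 of the whole app" idea from the question. It hashes each file's path, size and modification time rather than file contents, which keeps it cheap. Using mtime instead of ctime is a substitution made for this sketch, and the name tree_fingerprint is made up.

```ruby
require "digest"

# Builds an MD5 fingerprint from the path, size and mtime of every regular
# file under root. The fingerprint changes when a file is added, removed,
# resized or touched. File contents are not read.
def tree_fingerprint(root)
  digest = Digest::MD5.new
  Dir.glob(File.join(root, "**", "*")).sort.each do |path|
    next unless File.file?(path)
    stat = File.stat(path)
    digest << path << stat.size.to_s << stat.mtime.to_f.to_s
  end
  digest.hexdigest
end

puts tree_fingerprint("app") if Dir.exist?("app")
```

The result is a fixed-length hex string, so it can be dropped straight into a cache manifest comment or compared against a previously stored value.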

Text searching PDF

When parsing a PDF, given a string (popped from the Tj or TJ operator callbacks) with the Identity-H encoding how do you map that string to a unicode (say UTF8) representation?
If I need a CMap for this, how do I create (or retrieve) and apply the CMap?

You'll probably have to parse the font data itself. Identity-H just means "use the bytes as raw glyph indexes into the given font". That's why you MUST embed fonts when using Identity-H... different versions of the same font need not have the same glyph order.
There's example code on how to do this sort of thing in several different open source projects. iText, for example (yes, I'm biased).
You'd mentioned a CMap. Identity-H fonts can have a CMap but aren't required to do so. The /ToUnicode entry will be a stream that is a CMap, as defined in some adobe spec somewhere. They aren't all that complex:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (TTX+0)
/Ordering (T42UV)
/Supplement 0
>> def
/CMapName /TTX+0 def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
80 beginbfrange
<0003><0003><0020>
<0024><0024><0041>
<0025><0025><0042>
<0026><0026><0043>
<0027><0027><0044>
<0028><0028><0045>
<0029><0029><0046>
<002a><002a><0047>
<002b><002b><0048>
<002c><002c><0049>
<002d><002d><004a>
<002e><002e><004b>
<002f><002f><004c>
<0030><0030><004d>
<0031><0031><004e>
<0032><0032><004f>
<0033><0033><0050>
<0034><0034><0051>
<0035><0035><0052>
<0036><0036><0053>
<0037><0037><0054>
<0038><0038><0055>
<0039><0039><0056>
<003a><003a><0057>
<003b><003b><0058>
<003c><003c><0059>
<003d><003d><005a>
<0065><0065><00c9>
<00c8><00c8><00c1>
<00cb><00cb><00cd>
<00cf><00cf><00d3>
<00d2><00d2><00da>
<00e2><00e2><0160>
<00e4><00e4><017d>
<00e9><00e9><00dd>
<00fd><00fd><010c>
<0104><0104><0104>
<0106><0106><010e>
<0109><0109><0118>
<010b><010b><011a>
<0115><0115><0147>
<011b><011b><0158>
<0121><0121><0164>
<0123><0123><016e>
<01a0><01a0><0116>
<01b2><01b2><012e>
<01cb><01cb><016a>
<01cf><01cf><0172>
<022c><022c><0401>
<023b><023b><0411>
<023c><023c><0412>
<023d><023d><0413>
<023e><023e><0414>
<023f><023f><0415>
<0240><0240><0416>
<0241><0241><0417>
<0242><0242><0418>
<0243><0243><0419>
<0244><0244><041a>
<0245><0245><041b>
<0246><0246><041c>
<0247><0247><041d>
<0248><0248><041e>
<0249><0249><041f>
<024a><024a><0420>
<024b><024b><0421>
<024c><024c><0422>
<024d><024d><0423>
<024e><024e><0424>
<024f><024f><0425>
<0250><0250><0426>
<0251><0251><0427>
<0252><0252><0428>
<0253><0253><0429>
<0254><0254><042a>
<0255><0255><042b>
<0256><0256><042c>
<0257><0257><042d>
<0258><0258><042e>
<0259><0259><042f>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
Wow. That particular CMap is horribly inefficient. A "bfrange" starts at parameter 1 and runs up to and including parameter 2, mapping to values that start at parameter 3 and increase by one for each code in the range.
For example:
<0003><0003><0020>
<0024><0024><0041>
<0025><0025><0042>
<0026><0026><0043>
<0027><0027><0044>
<0028><0028><0045>
<0029><0029><0046>
<002a><002a><0047>
<002b><002b><0048>
<002c><002c><0049>
<002d><002d><004a>
<002e><002e><004b>
<002f><002f><004c>
<0030><0030><004d>
<0031><0031><004e>
<0032><0032><004f>
could be represented as
<0003><0003><0020>
<0024><0032><0041>
A quick google search turned up the CMap/CID font spec.
There are also beginbfchar/endbfchar, which take just two parameters (source and destination values, no ranges), and CID-based versions, at which point you need access to Adobe's character ID tables. Those tables are part of Acrobat/Reader installations, though Reader will need to be prodded into downloading the various Language Packs (or kits, or whatever they're called). There is also various other stuff you really ought to read that spec to find out about.
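To make the bfrange rule concrete, here is a rough sketch that builds a lookup table from the three-parameter entries and applies it to a two-byte-per-glyph string. It is a simplification under stated assumptions: it ignores bfchar sections and the array form of the third parameter, and it treats the destination as a single code point. The names parse_bfranges and decode_identity_h are made up for the example and do not come from any PDF library.

```ruby
# Matches one "<start><end><dest>" entry with hex operands
BFRANGE_ENTRY = /<(\h+)>\s*<(\h+)>\s*<(\h+)>/

# Returns a Hash mapping glyph codes to Unicode code points, built from the
# beginbfrange ... endbfrange sections of a ToUnicode CMap.
def parse_bfranges(cmap_text)
  mapping = {}
  in_range = false
  cmap_text.each_line do |line|
    in_range = true if line.include?("beginbfrange")
    in_range = false if line.include?("endbfrange")
    next unless in_range
    match = BFRANGE_ENTRY.match(line)
    next unless match
    first, last, dest = match.captures.map { |hex| hex.to_i(16) }
    # Each code in first..last maps to dest, dest + 1, dest + 2, ...
    (first..last).each_with_index { |code, offset| mapping[code] = dest + offset }
  end
  mapping
end

# Decodes a string of big-endian 16-bit glyph codes to UTF-8.
# Unmapped codes become U+FFFD (the replacement character).
def decode_identity_h(bytes, mapping)
  bytes.unpack("n*").map { |code| mapping.fetch(code, 0xFFFD) }.pack("U*")
end

cmap = "2 beginbfrange\n<0003><0003><0020>\n<0024><0032><0041>\nendbfrange\n"
puts decode_identity_h("\x00\x24\x00\x03\x00\x25".b, parse_bfranges(cmap))
```

The two-entry CMap in the example is the compact form shown above, and it produces the same mapping as the sixteen single-code entries it replaces.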

There are multiple ways this data may be encoded (some using CMAPs). You can also have custom encodings (http://www.jpedal.org/PDFblog/2011/04/understanding-the-pdf-file-format-%E2%80%93-custom-font-encodings/). You also need to understand CID fonts (http://www.jpedal.org/PDFblog/2011/03/understanding-the-pdf-file-format-%E2%80%93-what-are-cid-fonts/)
