What I am doing:
I am using the gmail gem in a Rails 4 app to get email attachments from a specific account at regular intervals. Here is an extract from the core part (here for simplicity only considering the first email and its first attachment):
require 'gmail'
Gmail.connect(#user_email,#user_password) do |gmail|
if gmail.logged_in?
emails = gmail.inbox.emails(:from => #sender_email)
email = emails[0]
attachment = email.message.attachments[0]
File.open("~/temp.csv", 'w') do |file|
file.write(
StringIO.new(attachment.decoded.to_s[2..-2].force_encoding("ISO-8859-15").encode!('UTF-8')).read
)
end
end
end
The encoding of the attached file can vary. The particular one that I am currently having issues with is in Finnish. It contains Finnish characters and a superscripted 3 character.
This is what I expect to get when I run the above code. (This is what I get when I download the attachment manually through gmail user interface):
What the problem is:
However, I am getting the following odd results.
From cat temp.csv (Looks good to me):
With nano temp.csv (Here I have no idea what I am looking at):
This is what temp.csv looks like opened in Sublime Text (directly via winscp). First line and small parts look ok but then Chinese/Japanese characters:
This is what temp.csv looks like in Notepad (after download via winscp). Looks ok except a blank space has been inserted between each character and the new lines seems to be missing:
What I have tried:
I have without success tried:
.force_encoding(...) with all the different "ISO-8859-x" character sets
putting the force_encoding("ISO-8859-15").encode!('UTF-8') outside the .read (works but doesn't solve the problem)
encode to UTF-8 without first forcing another encoding but this leads to Encoding::UndefinedConversionError: "\xC4" from ASCII-8BIT to UTF-8
writing as binary with 'wb' and 'w+b' in the File.open() (which oddly doesn't seem to make a difference to the outcome).
searching stackoverflow and the web for other ideas.
Any ideas would be much appreciated!
Not beautiful, but it will work for me now.
After re-encoding, I convert the string to a char array, then remove the chars I do not want and then join the remaining array elements to form a string.
decoded_att = attachment.decoded
data = decoded_att.encode("UTF-8", "ISO-8859-1", invalid: :replace, undef: :replace).gsub("\r\n", "\n")
data_as_array = data.chars
data_as_array = data_as_array.delete_if {|i| i == "\u0000" || i == "ÿ" || i == "þ"}
data = data_as_array.join('').to_s
File.write("~/temp.csv", data.to_s)
This will work for me now. However, I have no idea how these characters have ended up in the attachment ("ÿ" and "þ" in the start of the document and "\u0000" between all remaining characters).
It seems like you need to do attachment.body.decoded instead of attachment.decoded
I'm having serious issues trying to parse some CSV in rails right now.
Basically my app gets a user to upload a CSV file. The app then converts the file to ensure it is in UTF-8 format, then attempts to parse it and process it. Whenever the app attempts to parse it however, I get the MalformedCSVError stating "Illegal quoting on line 1"
Now what I don't get, is if I copy the original file into a new document and save it, then I can parse it on a rails console without a problem.
If I attempt to parse the original file, it complains about an invalid character for UTF-8 encoding (the file isn't in UTF-8 hence the app converts it)
If I attempt to parse the file which the app has converted to UTF-8 and changed the line endings to LF, it fails to parse.
If I do a file diff between the version the app has produced, and the copy/paste version that I have made (which works) there are 0 differences so I really can't figure out why one is parsable, and one is not.
Any suggestions? My app is processing the file as follows :
def create
#survey = Survey.new(params[:survey])
# Now we need to try and convert this to UTF-8 if it isn't already
encoded = File.read(#survey.survey_data.current_path)
encoding = CharlockHolmes::EncodingDetector.detect(encoded)
# We've got a guess at the encoding,
# so we can try and convert it but it
# may still fail so we need to handle
# that
begin
re_encoded = CharlockHolmes::Converter.convert(encoded, encoding[:encoding], 'UTF-8')
re_encoded = re_encoded.gsub(/\r\n?/, "\n")
# Now replace the uploaded file
File.open(#survey.survey_data.current_path, 'w') { |f|
f.write(re_encoded)
}
rescue ArgumentError
puts "UH OH!!!!!"
end
puts "#{#survey.survey_data.current_path}"
#parsed = CSV.read(#survey.survey_data.current_path)
end
The file uploading gem is CarrierWave if that makes any difference.
Please can someone help me as this is driving me insane!
Edit
The error says it's on line 1. Line 1 (assuming it doesn't index from 0) is
"Survey","RD","GarrysMDs","NigelsMDs","PaulsMDs","StephensMDs","BrinleyJ","CarolineP","DaveL","GrantR","GregS","Kent","NeilC","NicolaP","AndyC","DarrenS","DeanB","KarenF","PaulR","RichardF","SteveG","BrianG","GordonA","NickD","NickR","NickT","RayL","SimonH","EdmondH","JasonF","MikeS","SamanthaN","TimB","TravisF","AlanS","Q1","Q2","Q3","Q4","Q5","Q6","Q7","Q8PM","Q8N","Q9","Q10","Q11","Q12","Q13","Q14","Q15","Q16PM","Q16N","Q17PM","Q17N","Q18PM","Q18N","Q19","Q20","Q21","Q22","comment","Q23.1","Q23.2","Q23.3","TQ23.1","TQ23.2","VPM","VN","VQ1","VQ2","VQ3","VQ4","VQ5","VQ6","VQ7","VQ8N","VQ8PM","VQ9","VQ10","VQ11","VQ12","VQ13","VQ14","VQ15","VQ16","VQ16N","VQ16PM","VQ17","VQ17N","VQ17PM","VQ18","VQ18N","VQ18PM","VQ19","VQ20","VQ21","VQ22","VQ23.1","VQ23.2","VQ23.3","VRD","XQ16","XQ17","XQ18"
Well that was irritating!
Turns out the file had a BOM which was causing the CSV parser to break. Loading the file with
CSV.open("path/to/file.csv", "rb:bom|encoding")
allowed it to parse it perfectly! So annoyed how long it took to track down but it's now working and with no need to convert to UTF-8 now either!
For example, I have arbitrary lines in this format:
directory C:\Program Files\abc\def\
or something like.
log-enabled On
I want to be able to extract the "C:\Program Files\ab\def\" part out of that first line. Likewise, I want to extract the "On" out in the second line. The spaces between the variable and its value are arbitrary. I will know the name of the variable, but I need extract the value based on that.
So basically, I want to remove the first word and a number of arbitrary spaces that follow the first word, and return what remains until the end of the line.
Assuming that, by "word" you mean "a string of characters without spaces", you can do this:
for line in ioFile:lines() do
local variable, value = line:match("(%S+)%s+(.+)")
... --Do stuff with variable and value
end
One alternative with string.match was shown by Nicol Bolas, here is another alternative:
function splitOnFirstSpace(input)
local space = input:find(' ') or (#input + 1)
return input:sub(1, space-1), input:sub(space+1)
end
Usage:
local command, param = splitOnFirstSpace(line)
If no argument is given (splitOnFirstSpace('no-param-here')), then param is the empty string.
I do not believe Lua is packaged with a split() function like Ruby or Perl.
I found that this guy built a lua version of Perl's split function:
http://lua-users.org/lists/lua-l/2011-02/msg01145.html
If you can guarantee that the argument will only have 1 word before it, with that word not containing any spaces, you can read in that line, run the split function on it, and use the return array's 1 index value as what you want.
You could error check that too and make sure you get a 'C:\' within your expected directory, or check to make sure the string is == to 'On' or 'Off'. Because of using the hardcoded index value I really advocate you error check your expected value. Nothing is worse than having an assumed value be wrong.
If an error is detected make sure to log or print it to the screen so you know about it.
This could catch bugs where maybe the string that was input is improper.
Some simple code that models what I suggest you do:
line = "directory C:\Program Files\abc\def/";
contents = line.split(" "); --Split using a space
directory = contents[2]; --Here is your directory
if(errorCheckDir(directory))
--Use directory
end
EDIT:
In response to comments below Lua indeed begins indexing at 1, not 0.
Also, in the case that the directory contains spaces (which is probable) instead of simply using contents[2], I would loop through all of contents except index 1, and piece back together the directory making sure to add the required space between each index that you attach.
So in the case above, contents[2] and contents[3] would have to be stitched back together with a space in between to recover the proper directory.
directory = contents[2].." "..contents[3]
This can be easily automated using a function which has a loop in it and returns back the proper directory:
function recoverDir(contents)
directory = "";
--Recover the directory
for i=2, table.getn(contents) do
directory = directory..contents[i].." ";
end
--strip extra space on the end
dirEnd = string.len(directory);
directory = string.sub(directory,1,dirEnd-1);
return directory; --proper directory
end
I would like to write a method that programatically detects whether any of the files in my rails app have been changed. Is it possible do do something like an MD5 of the whole app and store that in a session variable?
This is mostly for having some fun with cache manifest. I already have a dynamically generated cache and it works well in production. But in my dev environment, I would like the id of that cache to update whenever I change anything in the app directory (as opposed to every 10 seconds, which is how I have it setup right now).
Update
File.ctime(".") would be perfect, except that "." is not marked as having changed when deeper directory files have changed.
Does it make sense to iterate through all directories in "." and add together the ctimes for each?
Have you considered using Guard.
You can programatically do anything whenever a file in your project changes.
There is a nice railscast about it
There is a simple ruby gem called filewatcher. This is the most advanced example:
require 'filewatcher'
FileWatcher.new(["README.rdoc"]).watch() do |filename, event|
if(event == :changed)
puts "File updated: " + filename
end
if(event == :delete)
puts "File deleted: " + filename
end
if(event == :new)
puts "New file: " + filename
end
end
File.ctime is the key. Iterate through all files and create a unique id based on the sum of all their ctimes:
cache_id = 0
Dir.glob('./**/*') do |this_file|
ignore_files = ['.', '..', "log"]
ignore_files.each do |ig|
next if this_file == ig
end
cache_id += File.ctime(this_file).to_i if File.directory?(this_file)
end
Works like a charm, page only re-caches when it needs to, even in development.
When parsing a PDF, given a string (popped from the Tj or TJ operator callbacks) with the Identity-H encoding how do you map that string to a unicode (say UTF8) representation?
If I need a CMap for this, how do I create (or retrieve) and apply the CMap?
You'll probably have to parse the font data itself. Identity-H just means "use the bytes as raw glyph indexes into the given font". That's why you MUST embed fonts when using Identity-H... different versions of the same font need not have the same glyph order.
There's example code on how to do this sort of thing in several different open source projects. iText, for example (yes, I'm biased).
You'd mentioned a CMap. Identity-H fonts can have a CMap but aren't required to do so. The /ToUnicode entry will be a stream that is a CMap, as defined in some adobe spec somewhere. They aren't all that complex:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (TTX+0)
/Ordering (T42UV)
/Supplement 0
>> def
/CMapName /TTX+0 def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
80 beginbfrange
<0003><0003><0020>
<0024><0024><0041>
<0025><0025><0042>
<0026><0026><0043>
<0027><0027><0044>
<0028><0028><0045>
<0029><0029><0046>
<002a><002a><0047>
<002b><002b><0048>
<002c><002c><0049>
<002d><002d><004a>
<002e><002e><004b>
<002f><002f><004c>
<0030><0030><004d>
<0031><0031><004e>
<0032><0032><004f>
<0033><0033><0050>
<0034><0034><0051>
<0035><0035><0052>
<0036><0036><0053>
<0037><0037><0054>
<0038><0038><0055>
<0039><0039><0056>
<003a><003a><0057>
<003b><003b><0058>
<003c><003c><0059>
<003d><003d><005a>
<0065><0065><00c9>
<00c8><00c8><00c1>
<00cb><00cb><00cd>
<00cf><00cf><00d3>
<00d2><00d2><00da>
<00e2><00e2><0160>
<00e4><00e4><017d>
<00e9><00e9><00dd>
<00fd><00fd><010c>
<0104><0104><0104>
<0106><0106><010e>
<0109><0109><0118>
<010b><010b><011a>
<0115><0115><0147>
<011b><011b><0158>
<0121><0121><0164>
<0123><0123><016e>
<01a0><01a0><0116>
<01b2><01b2><012e>
<01cb><01cb><016a>
<01cf><01cf><0172>
<022c><022c><0401>
<023b><023b><0411>
<023c><023c><0412>
<023d><023d><0413>
<023e><023e><0414>
<023f><023f><0415>
<0240><0240><0416>
<0241><0241><0417>
<0242><0242><0418>
<0243><0243><0419>
<0244><0244><041a>
<0245><0245><041b>
<0246><0246><041c>
<0247><0247><041d>
<0248><0248><041e>
<0249><0249><041f>
<024a><024a><0420>
<024b><024b><0421>
<024c><024c><0422>
<024d><024d><0423>
<024e><024e><0424>
<024f><024f><0425>
<0250><0250><0426>
<0251><0251><0427>
<0252><0252><0428>
<0253><0253><0429>
<0254><0254><042a>
<0255><0255><042b>
<0256><0256><042c>
<0257><0257><042d>
<0258><0258><042e>
<0259><0259><042f>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
Wow. That particular CMap is horribly inefficient. A "bfrange" starts from parameter 1, and goes to and includes parameter 2, maping values starting at parameter 3 (and continuing on until there are no more things to map.
For example:
<0003><0003><0020>
<0024><0024><0041>
<0025><0025><0042>
<0026><0026><0043>
<0027><0027><0044>
<0028><0028><0045>
<0029><0029><0046>
<002a><002a><0047>
<002b><002b><0048>
<002c><002c><0049>
<002d><002d><004a>
<002e><002e><004b>
<002f><002f><004c>
<0030><0030><004d>
<0031><0031><004e>
<0032><0032><004f>
could be represented as
<0003><0003><0020>
<0024><0032><0041>
A quick google search turned up the CMap/CID font spec.
There are also beginbfchar/endbfchar which just take two parameters (src and dest values, no ranges), CID based versions (at which point you need to have access to Adobe's character ID tables. They're part of Acrobat/Reader installations, though Reader will need to be prodded into downloading the various Language Packs (or kits or whatever they're called)), and various other stuff you really out to read that spec to find out about.
There are multiple ways this data may be encoded (some using CMAPs). You can also have custom encodings (http://www.jpedal.org/PDFblog/2011/04/understanding-the-pdf-file-format-%E2%80%93-custom-font-encodings/). You also need to understand CID fonts (http://www.jpedal.org/PDFblog/2011/03/understanding-the-pdf-file-format-%E2%80%93-what-are-cid-fonts/)