Ok - I have the following in my test/test_helper.rb:
def read_pdf_from_response(response)
file = Tempfile.new
file.write response.body.force_encoding('UTF-8')
begin
reader = PDF::Reader.new(file)
reader.pages.map(&:text).join.squeeze("\n")
ensure
file.close
file.unlink
end
end
I use it like this in an integration test:
get project_path(project, format: 'pdf')
read_response_from_pdf(#response).tap do |pdf|
assert_match(/whatever/, pdf)
end
This works fine as long as I run a test singly or when running all tests with only one worker, e.g. PARALLEL_WORKERS=1. But tests that use this method will fail intermittently when I run my suite with more than 1 parallel worker. My laptop has 8 cores, so that's normally what it's running with.
Here's the error:
PDF::Reader::MalformedPDFError: PDF malformed, expected 5 but found 96 instead
or sometimes: PDF::Reader::MalformedPDFError: PDF file is empty
The PDF reader is https://github.com/yob/pdf-reader which hasn't given any problems.
The controller that sends the PDF returns like so:
send_file out_file,
filename: "#{#project.name}.pdf",
type: 'application/pdf',
disposition: (params[:download] ? 'attachment' : 'inline')
I can't see why this isn't working. No files should ever have the same name at the same time, since I'm using Tempfile, right? How can I make all this run with parallel tests?
While I cannot confirm why this is happening the issue may be that:
You are forcing the encoding to "UTF-8" but PDF documents are binary files so this conversion could be damaging the PDF.
Some of the responses you are receiving are truly empty or malformed.
Maybe try this instead:
def read_pdf_from_response(response)
doc = StringIO.new(response.body.to_s)
begin
PDF::Reader.new(doc)
.pages
.map(&:text)
.join
.squeeze("\n")
rescue PDF::Reader::MalformedPDFError => e
# handle issues with the pdf itself
end
end
This will avoid the file system altogether while still using a compatible IO object and will make sure that the response is read as binary to avoid any conversion conflicts.
In my user model, I have an after_create callback that looks like this:
def set_default_profile_image
file = Tempfile.new([self.initials, ".jpg"])
file.binmode
file.write(Avatarly.generate_avatar(self.full_name, format: "jpg", size: 300))
begin
self.profile_image = File.open(file.path)
ensure
file.close
file.unlink
end
self.save
end
(self.initials is simply a utility method that returns the user's initials, so that e.g. my profile image would be "HB.jpg".)
If I call the method directly on an existing user, it works maybe 80% of the time. The other times, it gives me an error message so long I can't reproduce it here (I can't even scroll back far enough in tmux to see the start of it). The error message (or what I can see of it, anyway) comprises a list of MIME types, followed by this bit:
content type discovered from file command: application/x-empty. See documentation to allow this combination.
If I create a new user, the callback results in the same error message 100% of the time.
My method uses the Avatarly gem to generate placeholder avatars; the gem yields them in blob form, hence the creation of a Tempfile to write to.
I can't understand why the above error would occur.
Make sure that full_name has a valid return value and try moving your save call into the begin section. You may be racing against a save and the tempfile being removed/unlink.
What do you expect happens when you do this?
self.profile_image = File.open(file.path)
Without a block, this is the same as:
self.profile_image = File.new(file.path)
They both return a file object. Is profile_image in the database? I'm pretty sure it is going to be mad that you sent a File object to be persisted. If you want the data from that file in the database, do something like:
self.profile_image = File.open(file.path).read
If you want to save the tempfile's path:
self.profile_image = File.path(file.path)
If you are using the path remember that you are saving a tempfile, and the file will not last very long!
I found the solution in an issue on Paperclip's github. I don't really understand the causes very well, but it seems that this is a filesystem issue, where the Tempfile is not yet persisted to disk by the time it gets read into the model.
The solution is to do absolutely anything to the Tempfile before assigning it; file.read works just fine.
def set_default_profile_image
file = Tempfile.new([self.initials, ".jpg"])
file.binmode
file.write(Avatarly.generate_avatar(self.full_name, format: "jpg", size: 300))
file.read # <-- this fixes the issue
begin
self.profile_image = File.open(file.path)
ensure
file.close
file.unlink
end
self.save
end
I am uploading txt files using carrierwave. The files are not small (80 MB - 500 MB) and I want to remove some of the lines to reduce this size (about 80% of the file size is going to be reduced).
I have created a model method in order to clear these lines:
require 'fileutils'
def clear_unnecessary_lines
old_file_path = Rails.root.join('public').to_s + log_file.to_s
new_file_path = old_file_path.sub! '.txt', '_temp.txt'
File.open(old_file_path, 'r') do |old_file|
File.open(new_file_path, 'w') do |new_file|
old_file.each_line do |line|
new_file.write(line) unless line.grep(/test/).size > 0
end
end
end
FileUtils.mv new_file_path, old_file_path
end
but I am getting error when I am trying to open the new file saying there is no such file. As I have read opening a file with the w option should create an empty file for writing. Then why I am getting such error?
Also, since log_file column is holding the path to the original file, and I am changing it, could you tell how to rename the new file with the old name? As I have checked I should specify only old and new names, not paths.
Errno::ENOENT: No such file or directory - /home/gotqn/Rails/LogFilesAnalyser/LogFilesAnalyser/public/uploads/log_file/log_file/3/log_debug_temp.txt
It is strange that If I execute the following command in rails console, it is not throwing an error and the file is created.
File.open('/home/gotqn/Rails/LogFilesAnalyser/LogFilesAnalyser/public/uploads/log_file/log_file/3/log_debug_temp.txt','w')
Ah, i see your problem now. When you do this
new_file_path = old_file_path.sub! '.txt', '_temp.txt'
you call the "self-altering" version of sub, ie sub!. This will actually change the value of old_file_path as a side effect. Then, in the next line, you try to open this file, which hasn't been created yet. Take out the exclamation mark and you should be fine.
With the docsplit gem I can extract the text from a PDF or any other file type. For example, with the line:
Docsplit.extract_pages('doc.pdf')
I can have the text content of a PDF file.
I'm currently using Rails, and the PDF is sent through a request and lives in memory. Looking in the API and in the source code I couldn't find a way to extract the text from memory, only from a file.
Is there a way to get the text of this PDF avoiding the creation of a temporary file?
I'm using attachment_fu if it matters.
Use a temporary directory:
require 'docsplit'
def pdf_to_text(pdf_filename)
Docsplit.extract_text([pdf_filename], ocr: false, output: Dir.tmpdir)
txt_file = File.basename(pdf_filename, File.extname(pdf_filename)) + '.txt'
txt_filename = Dir.tmpdir + '/' + txt_file
extracted_text = File.read(txt_filename)
File.delete(txt_filename)
extracted_text
end
pdf_to_text('doc.pdf')
If you have the content in a string, use StringIO to create a File-like object that IO can read. In StringIO, it doesn't matter if the content is true text, or binary, it's all the same.
Look at either of:
new(string=""[, mode])
Creates new StringIO instance from with string and mode.
open(string=""[, mode]) {|strio| ...}
Equivalent to ::new except that when it is called with a block, it yields with the new instance and closes it, and returns the result which returned from the block.
I'm having serious issues trying to parse some CSV in rails right now.
Basically my app gets a user to upload a CSV file. The app then converts the file to ensure it is in UTF-8 format, then attempts to parse it and process it. Whenever the app attempts to parse it however, I get the MalformedCSVError stating "Illegal quoting on line 1"
Now what I don't get, is if I copy the original file into a new document and save it, then I can parse it on a rails console without a problem.
If I attempt to parse the original file, it complains about an invalid character for UTF-8 encoding (the file isn't in UTF-8 hence the app converts it)
If I attempt to parse the file which the app has converted to UTF-8 and changed the line endings to LF, it fails to parse.
If I do a file diff between the version the app has produced, and the copy/paste version that I have made (which works) there are 0 differences so I really can't figure out why one is parsable, and one is not.
Any suggestions? My app is processing the file as follows :
def create
#survey = Survey.new(params[:survey])
# Now we need to try and convert this to UTF-8 if it isn't already
encoded = File.read(#survey.survey_data.current_path)
encoding = CharlockHolmes::EncodingDetector.detect(encoded)
# We've got a guess at the encoding,
# so we can try and convert it but it
# may still fail so we need to handle
# that
begin
re_encoded = CharlockHolmes::Converter.convert(encoded, encoding[:encoding], 'UTF-8')
re_encoded = re_encoded.gsub(/\r\n?/, "\n")
# Now replace the uploaded file
File.open(#survey.survey_data.current_path, 'w') { |f|
f.write(re_encoded)
}
rescue ArgumentError
puts "UH OH!!!!!"
end
puts "#{#survey.survey_data.current_path}"
#parsed = CSV.read(#survey.survey_data.current_path)
end
The file uploading gem is CarrierWave if that makes any difference.
Please can someone help me as this is driving me insane!
Edit
The error says it's on line 1. Line 1 (assuming it doesn't index from 0) is
"Survey","RD","GarrysMDs","NigelsMDs","PaulsMDs","StephensMDs","BrinleyJ","CarolineP","DaveL","GrantR","GregS","Kent","NeilC","NicolaP","AndyC","DarrenS","DeanB","KarenF","PaulR","RichardF","SteveG","BrianG","GordonA","NickD","NickR","NickT","RayL","SimonH","EdmondH","JasonF","MikeS","SamanthaN","TimB","TravisF","AlanS","Q1","Q2","Q3","Q4","Q5","Q6","Q7","Q8PM","Q8N","Q9","Q10","Q11","Q12","Q13","Q14","Q15","Q16PM","Q16N","Q17PM","Q17N","Q18PM","Q18N","Q19","Q20","Q21","Q22","comment","Q23.1","Q23.2","Q23.3","TQ23.1","TQ23.2","VPM","VN","VQ1","VQ2","VQ3","VQ4","VQ5","VQ6","VQ7","VQ8N","VQ8PM","VQ9","VQ10","VQ11","VQ12","VQ13","VQ14","VQ15","VQ16","VQ16N","VQ16PM","VQ17","VQ17N","VQ17PM","VQ18","VQ18N","VQ18PM","VQ19","VQ20","VQ21","VQ22","VQ23.1","VQ23.2","VQ23.3","VRD","XQ16","XQ17","XQ18"
Well that was irritating!
Turns out the file had a BOM which was causing the CSV parser to break. Loading the file with
CSV.open("path/to/file.csv", "rb:bom|encoding")
allowed it to parse it perfectly! So annoyed how long it took to track down but it's now working and with no need to convert to UTF-8 now either!