Ruby on Rails - How to convert to images some elements from a word document - ruby-on-rails

Context
In our platform we allow users to upload word documents, those documents are stored in google drive and then dowloaded again to our platform in HTML format to create a section where the users can interact with that content.
Rails 5.0.7
Ruby 2.5.7p206
selenium-webdriver 3.142.7 (latest stable version compatible with our ruby and rails versions)
Problem
Some of the documents have charts or graphics inside that are not processed correctly giving wrong results after all the process.
We have been trying to fix this problem at the moment we get the word document and before to send it to google drive.
I'm looking for a simple way to export the entire chart and/or table as an image, if anyone knows of a way to do this the advice would be much appreciated.
Edit 1: Adding some screenshots:
This screenshot is from the original word doc:
And this is how it looks in our systems:
Here are the approaches I have tried that haven't worked for me so far.
Approach 1
Using nokogiri to read the document and found the nodes that contain the charts (we've found that they are called drawing) and then use Selenium to navigate through the file and take and screenshot of that particular section.
The problem we found with this approach is that the versions our gems are not compatible with the latest versions of selenium and its web drivers (chrome or firefox) and it is not posible to perform this action.
Other problem, and it seems is due to security, is that selenium is not able to browse inside local files and open it.
options = Selenium::WebDriver::Firefox::Options.new(binary: '/usr/bin/firefox', headless: true)
driver = Selenium::WebDriver.for :firefox, options: options
path = "#{Rails.root}/doc_file.docx"
driver.navigate.to("file://#{path}")
# Here occurs the first issue, it is not able to navigate to the file
puts "Title: #{driver.title}"
puts "URL: #{driver.current_url}"
# Below is the code that I am trying to use to replace the images with the modified images
drawing_elements = driver.find_elements(:css, 'w|drawing')
modified_paragraphs = []
drawing_elements.each do |drawing_element|
paragraph_element = drawing_element.find_element(:xpath, '..')
paragraph_element.screenshot.save('paragraph.png')
modified_paragraph = File.read('paragraph.png')
modified_paragraphs << modified_paragraph
end
driver.quit
file = File.open(File.join(Rails.root, 'doc_file.docx'))
doc = Nokogiri::XML(file)
drawing_elements = doc.css('w|drawing')
drawing_elements.each_with_index do |drawing_element, i|
paragraph_element = drawing_element.parent
paragraph_element.replace(modified_paragraphs[i])
end
new_doc_file = File.write('modified_doc.docx', doc.to_xml)
s3_client.put_object(bucket: bucket, key: #document_path, body: new_doc_file)
File.delete('doc_file.docx')
Approach 2
Using nokogiri to get the drawing elements and the try to convert it directly to an image using rmagick or mini_magick.
It is only possible if the drawing element actually contains an image, it can convert that correctly to an image, but the problem is when inside of the drawing element are not images but other elements like graphicData, pic, blipFill, blip. It needs to start looping into the element and rebuilding it, but at that point of time it seems that the element is malformed and it can't rebuild it.
Other issue with this approach is when it founds elements that seem to conform an svg file, it also needs to loop into all the elements and try to rebuild it, but the same as the above issue, it seems that the element is malformed.
response = s3_client.get_object(bucket: bucket, key: #document_path)
docx = response.body.read
Zip::File.open_buffer(docx) do |zip|
doc = zip.find_entry("word/document.xml")
doc_xml = doc.get_input_stream.read
doc = Nokogiri::XML(doc_xml)
drawing_elements = doc.xpath("//w:drawing")
drawing_elements.each do |drawing_element|
node = get_chil_by_name(drawing_element, "graphic")
if node.xpath("//a:graphicData/a:pic/a:blipFill/a:blip").any?
img_data = node.xpath("//a:graphicData/a:pic/a:blipFill/a:blip").first.attributes["r:embed"].value
img = Magick::Image.from_blob(img_data).first
img.write("node.jpeg")
node.replace("<img src='#{img.to_blob}'/>")
elsif node.xpath("//a:graphicData/a:svg").any?
svg_data = node.xpath("//a:graphicData/a:svg").to_s
Prawn::Document.generate("node.pdf") do |pdf|
pdf.svg svg_data, at: [0, pdf.cursor], width: pdf.bounds.width
end
else
puts "unsupported format"
end
end
# update the file in S3
s3.put_object(bucket: bucket, key: #document_path, body: doc)
end
Approach 3
Convert the elements since its parents to a pdf file and then to an image.
Basically the same issue as in the approach 2, it needs to loop inside all the elements and try to rebuild it, we haven't found a way to do that.

Related

Rails: Received string from web service should be converted to a pdf image to be displayed

When using a web service I receive a string that looks like this:
JVBERi0xLjIgDQoxIDAgb2JqDQo8PA0KL1R5cGUgL0NhdGFsb2cNCi9QYWdlcyAzIDAgUg0KL1BhZ2VNb2RlIC9Vc2VOb25lDQovUGFnZUxheW91dCAvU2luZ2xlUGFnZQ0KPj4NCmVuZG9iag0KMiAwIG9iag0KPDwNCi9Qcm9kdWNlciAoUGRmQ3JlYXRvciB2ZXJzaW9uIDEuMCkNCi9BdXRob3IgKCkNCi9DcmVhdGlvbkRhdGUgKEQ6MjAxNzA4MjUxMTM0MDApDQovQ3JlYXRvciAoKQ0KL0tleXdvcmRzICgpDQovU3ViamVjdCAoKQ0KL1RpdGxlICgpDQovTW9kRGF0ZSAoRDoyMDE3MDgyNTExMzQwMCkNCj4+DQplbmRvYmoNCjMgMCBvYmoNCjw8DQovVHlwZSAvUGFnZXMNCi9LaWRzIFs0IDAgUiBdDQovQ291bnQgMQ0KPj4NCmVuZG9iag0KNCAwIG9iag0KPDwNCi9UeXBlIC9QYWdlDQovUGFyZW50IDMgMCBSDQovTWVkaWFCb3ggWzAgMCA0MjUgMjkyIF0NCi9SZXNvdXJjZXMgPDwNCi9Gb250IDw8DQovRjAgNiAwIFINCi9GMSA4IDAgUg0KPj4NCi9YT2JqZWN0IDw8DQovMCA3IDAgUg0KPj4NCi9Qcm9jU2V0IFsvUERGIC9UZXh0IC9JbWFnZUMgXQ0KPj4NCi9Db250ZW50cyA1IDAgUg0KPj4NCmVuZG9iag0KNSAwIG9iag0KPDwNCi9MZW5ndGggNDE4DQovRmlsdGVyIFsvRmxhdGVEZWNvZGUgXQ0KPj4NCnN0cmVhbQ0KeF6lVMtu2zAQvPMr9tjmwHCXD1G9uXkiCBI3VtND0YNj0akL20ppGQX69V1Slp2iRWI4ECCMRHJmd2ek43MFpSwsVFORMKqEFKQrPoqPlUCpFD89u3dvtHe820g+uhDvEE6a+bx5D9UPcVbtCBAl8Y7rl5k8kKYN0+D+omf5+k1BLbQpgbyFhTBoMpp3yDlGeTWjrui7C4HwS6z264EPUtErn8fxcrJtoRMnVQBRmcV5L6Mkzp2jZ5RWO7QVV3uLI4tjP78jPTo9uz+xxnhntTNH/Qx+CnJWKpObs066EpyXSgNZaQ1MFuJYwWkjPmX7/F/ula8MPg+RJGUHB9PfYVmH+OEfD1+jyTNi/zqe4e2ournuSf6TqX0ikSoz5SYRw2bV7hi32dyXRxcbntEsztarOoYwBUMwOCisqTJyG0ZCTV8quGyaaV038ekgRuJvjtC/vdlMpPrahqENEeoAd+sZo4NLw5KDlp29nddh+bCOjyGu2jget4C62BmNgJyAZz8P1Ongi59+Khk9Ss4tC2hb6NEVfG5jmHxve+Y/1ub6WwplbmRzdHJlYW0NCmVuZG9iag0KNiAwIG9iag0KPDwNCi9UeXBlIC9Gb250DQovU3VidHlwZSAvVHlwZTENCi9GaXJzdENoYXIgMzINCi9MYXN0Q2hhciAyNTUNCi9CYXNlRm9udCAvSGVsdmV0aWNhDQovRW5jb2RpbmcgPDwNCi9UeXBlIC9FbmNvZGluZw0KL0RpZmZlcmVuY2VzIFszMiAvc3BhY2UgMTk2IC9BZGllcmVzaXMgMjE0IC9PZGllcmVzaXMgMjIwIC9VZGllcmVzaXMgMjI4IC9hZGllcmVzaXMgMjQ2IC9vZGllcmVzaXMgMjUyIC91ZGllcmVzaXMgMjIzIC9nZXJtYW5kYmxzIDE5MiAvQWdyYXZlIDE5NCAvQWNpcmN1bWZsZXggMTk4IC9BRSAxOTkgL0NjZWRpbGxhIDIwMCAvRWdyYXZlIDIwMSAvRWFjdXRlIDIwMiAvRWNpcmN1bWZsZXggMjAzIC9FZGllcmVzaXMgMjA2IC9JY2lyY3VtZmxleCAyMDcgL0lkaWVyZXNpcyAyMTIgL09jaXJjdW1mbGV4IDE0MCAvT0UgMjE3IC9VZ3JhdmUgMjE5IC9VY2lyY3VtZmxleCAxNTkgL1lkaWVyZXNpcyAyMjQgL2FncmF2ZSAyMjYgL2FjaXJjdW1mbGV4IDIzMCAvYWUgMjMxIC9jY2VkaWxsYSAyMzIgL2VncmF2ZSAyMzMgL2VhY3V0ZSAyMzQgL2VjaXJjdW1mbGV4IDIzNSAvZWRpZXJlc2lzIDIzOCAvaWNpcmN1bWZsZXggMjM5IC9pZGllcmVzaXMgMjQ0IC9vY2lyY3VtZmxleCAxNTYgL29lIDI0OSAvdWdyYXZlIDI1MSAvdWNpcmN1bWZsZXggMjU1IC95ZGllcmVzaXMgMTk3IC9BcmluZyAyMTYgL09zbGFzaCAxOTMgL0FhY3V0ZSAyMDUgL0lhY3V0ZSAyMTggL1VhY3V0ZSAyMjEgL1lhY3V0ZSAyMjkgL2FyaW5nIDI0OCAvb3NsYXNoIDIyNSAvYWFjdXRlIDIzNyAvaWFjdXRlIDI1MCAvdWFjdXRlIDI1MyAveWFjdXRlIDIxMCAvT2dyYXZlIDI0MiAvb2dyYXZlIDIwNCAvSWdyYXZlIDIzNiAvaWdyYXZlIDE5NSAvQXRpbGRlIDIxMyAvT3RpbGRlIDIyNyAvYXRpbGRlIDI0NSAvb3RpbGRlIDIyMiAvVGhvcm4gMjU0IC90aG9ybiAyMDkgL050aWxkZSAyNDEgL250aWxkZSAxMzggL1NjYXJvbiAxNTQgL3NjYXJvbiAyNDMgL29hY3V0ZSAyMTEgL09hY3V0ZSBdDQo+Pg0KL1dpZHRocyBbMjc4IDI3OCAzNTUgNTU2IDU1NiA4ODkgNjY3IDE5MSAzMzMgMzMzIDM4OSA1ODQgMjc4IDMzMyAyNzggMjc4IDU1NiA1NTYgNTU2IDU1NiA1NTYgNTU2IDU1NiA1NTYgNTU2IDU1NiAyNzggMjc4IDU4NCA1ODQgNTg0IDU1NiAxMDE1IDY2NyA2NjcgNzIyIDcyMiA2NjcgNjExIDc3OCA3MjIgMjc4IDUwMCA2NjcgNTU2IDgzMyA3MjIgNzc4IDY2NyA3NzggNzIyIDY2NyA2MTEgNzIyIDY2NyA5NDQgNjY3IDY2NyA2MTEgMjc4IDI3OCAyNzggNDY5IDU1NiAzMzMgNTU2IDU1NiA1MDAgNTU2IDU1NiAyNzggNTU2IDU1NiAyMjIgMjIyIDUwMCAyMjIgODMzIDU1NiA1NTYgNTU2IDU1NiAzMzMgNTAwIDI3OCA1NTYgNTAwIDcyMiA1MDAgNTAwIDUwMCAzMzQgMjYwIDMzNCA1ODQgMCA1NTYgMCAyMjIgNTU2IDMzMyAxMDAwIDU1NiA1NTYgMzMzIDEwMDAgNjY3IDMzMyAxMDAwIDAgNjExIDAgMCAyMjIgMjIyIDMzMyAzMzMgMzUwIDU1NiAxMDAwIDMzMyAxMDAwIDUwMCAzMzMgOTQ0IDAgNTAwIDY2NyAwIDY2NyA1NTYgNTU2IDU1NiA1NTYgNjY3IDU1NiAzMzMgNzM3IDM3MCA1NTYgNjExIDAgNzM3IDYxMSA0MDAgNTU2IDMzMyAyMjIgMzMzIDU1NiA1MDAgMzMzIDMzMyAyNzggMzY1IDU1NiA1MDAgODM0IDgzNCA1MDAgNjY3IDY2NyA2NjcgNjY3IDY2NyA2NjcgMTAwMCA3MjIgNjY3IDY2NyA2NjcgNjY3IDI3OCAyNzggMzMzIDMzMyA3MjIgNzIyIDc3OCA3NzggNzc4IDc3OCA3NzggNTg0IDc3OCA3MjIgNzIyIDcyMiA3MjIgNjY3IDY2NyA2MTEgNTU2IDU1NiA1NTYgNTU2IDU1NiA1NTYgODg5IDUwMCA1NTYgNTU2IDU1NiA1NTYgMjc4IDI3OCAzMzMgMzMzIDU1NiA1NTYgNTU2IDU1NiA1NTYgNTU2IDU1NiA1ODQgNjExIDU1NiA1NTYgNTU2IDU1NiA1MDAgNTU2IDUwMCBdDQovTmFtZSAvRjANCj4+DQplbmRvYmoNCjcgMCBvYmoNCjw8DQovTGVuZ3RoIDE2Mg0KL0ZpbHRlciBbL0ZsYXRlRGVjb2RlIF0NCi9UeXBlIC9YT2JqZWN0DQovU3VidHlwZSAvSW1hZ2UNCi9Db2xvclNwYWNlIFsvSW5kZXhlZCAvRGV2aWNlUkdCIDIyMyA8MDAwMDAwIDAwMDAzMyAwMDAwNjYgMDAwMDk5IDAwMDBjYyAwMDAwZmYgMDAzMzAwIDAwMzMzMyAwMDMzNjYgMDAzMzk5IDAwMzNjYyAwMDMzZmYgMDA2NjAwIDAwNjYzMyAwMDY2NjYgMDA2Njk5IDAwNjZjYyAwMDY2ZmYgMDA5OTAwIDAwOTkzMyAwMDk5NjYgMDA5OTk5IDAwOTljYyAwMDk5ZmYgMDBjYzAwIDAwY2MzMyAwMGNjNjYgMDBjYzk5IDAwY2NjYyAwMGNjZmYgMDBmZjAwIDAwZmYzMyAwMGZmNjYgMDBmZjk5IDAwZmZjYyAwMGZmZmYgMzMwMDAwIDMzMDAzMyAzMzAwNjYgMzMwMDk5IDMzMDBjYyAzMzAwZmYgMzMzMzAwIDMzMzMzMyAzMzMzNjYgMzMzMzk5IDMzMzNjYyAzMzMzZmYgMzM2NjAwIDMzNjYzMyAzMzY2NjYgMzM2Njk5IDMzNjZjYyAzMzY2ZmYgMzM5OTAwIDMzOTkzMyAzMzk5NjYgMzM5OTk5IDMzOTljYyAzMzk5ZmYgMzNjYzAwIDMzY2MzMyAzM2NjNjYgMzNjYzk5IDMzY2NjYyAzM2NjZmYgMzNmZjAwIDMzZmYzMyAzM2ZmNjYgMzNmZjk5IDMzZmZjYyAzM2ZmZmYgNjYwMDAwIDY2MDAzMyA2NjAwNjYgNjYwMDk5IDY2MDBjYyA2NjAwZmYgNjYzMzAwIDY2MzMzMyA2NjMzNjYgNjYzMzk5IDY2MzNjYyA2NjMzZmYgNjY2NjAwIDY2NjYzMyA2NjY2NjYgNjY2Njk5IDY2NjZjYyA2NjY2ZmYgNjY5OTAwIDY2OTkzMyA2Njk5NjYgNjY5OTk5IDY2OTljYyA2Njk5ZmYgNjZjYzAwIDY2Y2MzMyA2NmNjNjYgNjZjYzk5IDY2Y2NjYyA2NmNjZmYgNjZmZjAwIDY2ZmYzMyA2NmZmNjYgNjZmZjk5IDY2ZmZjYyA2NmZmZmYgOTkwMDAwIDk5MDAzMyA5OTAwNjYgOTkwMDk5IDk5MDBjYyA5OTAwZmYgOTkzMzAwIDk5MzMzMyA5OTMzNjYgOTkzMzk5IDk5MzNjYyA5OTMzZmYgOTk2NjAwIDk5NjYzMyA5OTY2NjYgOTk2Njk5IDk5NjZjYyA5OTY2ZmYgOTk5OTAwIDk5OTkzMyA5OTk5NjYgOTk5OTk5IDk5OTljYyA5OTk5ZmYgOTljYzAwIDk5Y2MzMyA5OWNjNjYgOTljYzk5IDk5Y2NjYyA5OWNjZmYgOTlmZjAwIDk5ZmYzMyA5OWZmNjYgOTlmZjk5IDk5ZmZjYyA5OWZmZmYgY2MwMDAwIGNjMDAzMyBjYzAwNjYgY2MwMDk5IGNjMDBjYyBjYzAwZmYgY2MzMzAwIGNjMzMzMyBjYzMzNjYgY2MzMzk5IGNjMzNjYyBjYzMzZmYgY2M2NjAwIGNjNjYzMyBjYzY2NjYgY2M2Njk5IGNjNjZjYyBjYzY2ZmYgY2M5OTAwIGNjOTkzMyBjYzk5NjYgY2M5OTk5IGNjOTljYyBjYzk5ZmYgY2NjYzAwIGNjY2MzMyBjY2NjNjYgY2NjYzk5IGNjY2NjYyBjY2NjZmYgY2NmZjAwIGNjZmYzMyBjY2ZmNjYgY2NmZjk5IGNjZmZjYyBjY2ZmZmYgZmYwMDAwIGZmMDAzMyBmZjAwNjYgZmYwMDk5IGZmMDBjYyBmZjAwZmYgZmYzMzAwIGZmMzMzMyBmZjMzNjYgZmYzMzk5IGZmMzNjYyBmZjMzZmYgZmY2NjAwIGZmNjYzMyBmZjY2NjYgZmY2Njk5IGZmNjZjYyBmZjY2ZmYgZmY5OTAwIGZmOTkzMyBmZjk5NjYgZmY5OTk5IGZmOTljYyBmZjk5ZmYgZmZjYzAwIGZmY2MzMyBmZmNjNjYgZmZjYzk5IGZmY2NjYyBmZmNjZmYgZmZmZjAwIGZmZmYzMyBmZmZmNjYgZmZmZjk5IGZmZmZjYyBmZmZmZmYgYzBjMGMwIDgwODA4MCA4MDAwMDAgMDA4MDAwIDAwMDA4MCA4MDgwMDAgODAwMDgwIDAwODA4MCA+IF0NCi9XaWR0aCAyMjA5DQovSGVpZ2h0IDENCi9CaXRzUGVyQ29tcG9uZW50IDgNCi9OYW1lIC8wDQo+Pg0Kc3RyZWFtDQp4XsWVUQ7AIAxCvf8Fvc7+tImEPuKy+bXUKZTSOtaaaontHSpf+6zfpveNdWEDV0mLVM6QZUBRHVIlnGsqGFBSXgqpI5b5n4RlRlLfi6rmGlAGmv5ZTto8k/aqJyjg0lDJDPuK2hh1cesMVdWYwduF1qOBJcznIB3QrOh3GvgnwjHAvpIQXz4L22qtK9l0D//iqGqAxO81lbYxO/XVA7D3SAYKZW5kc3RyZWFtDQplbmRvYmoNCjggMCBvYmoNCjw8DQovVHlwZSAvRm9udA0KL1N1YnR5cGUgL1R5cGUxDQovRmlyc3RDaGFyIDMyDQovTGFzdENoYXIgMjU1DQovQmFzZUZvbnQgL0hlbHZldGljYS1Cb2xkDQovRW5jb2RpbmcgPDwNCi9UeXBlIC9FbmNvZGluZw0KL0RpZmZlcmVuY2VzIFszMiAvc3BhY2UgMTk2IC9BZGllcmVzaXMgMjE0IC9PZGllcmVzaXMgMjIwIC9VZGllcmVzaXMgMjI4IC9hZGllcmVzaXMgMjQ2IC9vZGllcmVzaXMgMjUyIC91ZGllcmVzaXMgMjIzIC9nZXJtYW5kYmxzIDE5MiAvQWdyYXZlIDE5NCAvQWNpcmN1bWZsZXggMTk4IC9BRSAxOTkgL0NjZWRpbGxhIDIwMCAvRWdyYXZlIDIwMSAvRWFjdXRlIDIwMiAvRWNpcmN1bWZsZXggMjAzIC9FZGllcmVzaXMgMjA2IC9JY2lyY3VtZmxleCAyMDcgL0lkaWVyZXNpcyAyMTIgL09jaXJjdW1mbGV4IDE0MCAvT0UgMjE3IC9VZ3JhdmUgMjE5IC9VY2lyY3VtZmxleCAxNTkgL1lkaWVyZXNpcyAyMjQgL2FncmF2ZSAyMjYgL2FjaXJjdW1mbGV4IDIzMCAvYWUgMjMxIC9jY2VkaWxsYSAyMzIgL2VncmF2ZSAyMzMgL2VhY3V0ZSAyMzQgL2VjaXJjdW1mbGV4IDIzNSAvZWRpZXJlc2lzIDIzOCAvaWNpcmN1bWZsZXggMjM5IC9pZGllcmVzaXMgMjQ0IC9vY2lyY3VtZmxleCAxNTYgL29lIDI0OSAvdWdyYXZlIDI1MSAvdWNpcmN1bWZsZXggMjU1IC95ZGllcmVzaXMgMTk3IC9BcmluZyAyMTYgL09zbGFzaCAxOTMgL0FhY3V0ZSAyMDUgL0lhY3V0ZSAyMTggL1VhY3V0ZSAyMjEgL1lhY3V0ZSAyMjkgL2FyaW5nIDI0OCAvb3NsYXNoIDIyNSAvYWFjdXRlIDIzNyAvaWFjdXRlIDI1MCAvdWFjdXRlIDI1MyAveWFjdXRlIDIxMCAvT2dyYXZlIDI0MiAvb2dyYXZlIDIwNCAvSWdyYXZlIDIzNiAvaWdyYXZlIDE5NSAvQXRpbGRlIDIxMyAvT3RpbGRlIDIyNyAvYXRpbGRlIDI0NSAvb3RpbGRlIDIyMiAvVGhvcm4gMjU0IC90aG9ybiAyMDkgL050aWxkZSAyNDEgL250aWxkZSAxMzggL1NjYXJvbiAxNTQgL3NjYXJvbiAyNDMgL29hY3V0ZSAyMTEgL09hY3V0ZSBdDQo+Pg0KL1dpZHRocyBbMjc4IDMzMyA0NzQgNTU2IDU1NiA4ODkgNzIyIDIzOCAzMzMgMzMzIDM4OSA1ODQgMjc4IDMzMyAyNzggMjc4IDU1NiA1NTYgNTU2IDU1NiA1NTYgNTU2IDU1NiA1NTYgNTU2IDU1NiAzMzMgMzMzIDU4NCA1ODQgNTg0IDYxMSA5NzUgNzIyIDcyMiA3MjIgNzIyIDY2NyA2MTEgNzc4IDcyMiAyNzggNTU2IDcyMiA2MTEgODMzIDcyMiA3NzggNjY3IDc3OCA3MjIgNjY3IDYxMSA3MjIgNjY3IDk0NCA2NjcgNjY3IDYxMSAzMzMgMjc4IDMzMyA1ODQgNTU2IDMzMyA1NTYgNjExIDU1NiA2MTEgNTU2IDMzMyA2MTEgNjExIDI3OCAyNzggNTU2IDI3OCA4ODkgNjExIDYxMSA2MTEgNjExIDM4OSA1NTYgMzMzIDYxMSA1NTYgNzc4IDU1NiA1NTYgNTAwIDM4OSAyODAgMzg5IDU4NCAwIDU1NiAwIDI3OCA1NTYgNTAwIDEwMDAgNTU2IDU1NiAzMzMgMTAwMCA2NjcgMzMzIDEwMDAgMCA2MTEgMCAwIDI3OCAyNzggNTAwIDUwMCAzNTAgNTU2IDEwMDAgMzMzIDEwMDAgNTU2IDMzMyA5NDQgMCA1MDAgNjY3IDAgNzIyIDU1NiA2MTEgNTU2IDU1NiA2NjcgNTU2IDMzMyA3MzcgMzcwIDU1NiA2MTEgMCA3MzcgNjExIDQwMCA1NTYgMzMzIDI3OCAzMzMgNjExIDU1NiAyNzggMzMzIDMzMyAzNjUgNTU2IDUwMCA4MzQgODM0IDUwMCA3MjIgNzIyIDcyMiA3MjIgNzIyIDcyMiAxMDAwIDcyMiA2NjcgNjY3IDY2NyA2NjcgMjc4IDI3OCAyNzggMjc4IDcyMiA3MjIgNzc4IDc3OCA3NzggNzc4IDc3OCA1ODQgNzc4IDcyMiA3MjIgNzIyIDcyMiA2NjcgNjY3IDYxMSA1NTYgNTU2IDU1NiA1NTYgNTU2IDU1NiA4ODkgNTU2IDU1NiA1NTYgNTU2IDU1NiAyNzggMjc4IDI3OCAyNzggNjExIDYxMSA2MTEgNjExIDYxMSA2MTEgNjExIDU4NCA2MTEgNjExIDYxMSA2MTEgNjExIDU1NiA2MTEgNTU2IF0NCi9OYW1lIC9GMQ0KPj4NCmVuZG9iag0KeHJlZg0KMCA5DQowMDAwMDAwMDAwIDY1NTM1IGYNCjAwMDAwMDAwMTEgMDAwMDAgbg0KMDAwMDAwMDExMSAwMDAwMCBuDQowMDAwMDAwMjk4IDAwMDAwIG4NCjAwMDAwMDAzNjMgMDAwMDAgbg0KMDAwMDAwMDU3MyAwMDAwMCBuDQowMDAwMDAxMDc0IDAwMDAwIG4NCjAwMDAwMDMwMDcgMDAwMDAgbg0KMDAwMDAwNDk1MSAwMDAwMCBuDQp0cmFpbGVyDQo8PA0KL1NpemUgOQ0KL1Jvb3QgMSAwIFINCi9JbmZvIDIgMCBSDQo+Pg0Kc3RhcnR4cmVmDQo2ODg4DQolJUVPRg0K
Sorry for that.
Anyway, this should be a pdf because that was one of the request parameters. (I could also have chosen for .gif, or .jpg in different qualities.)
I tried reading the content, like this:
#label_binary = #response_label[:generate_label_response][:response_shipments][:response_shipment][:labels][:label][:content]
binary_data = Base64.decode64(#label_binary)
#label = File.open('file_name.pdf', 'wb') {|f| f.write(Base64.decode64(binary_data))}
But what this prints then as #label is just a number (three of four digits). Same when I use urlsafe_decode64.
Not very helpful.
Preferred end-result would be a pdf image file that I can immediately display (in this case a barcode on top of an invoice for sending a package).
Could I for this purpose save the file temporarily? Meaning: every time I need to display I rebuild this image. I think that would be best.
EDIT
I am making the call now directly to a 600dpi .jpg image and then:
base_64_binary_data = #response_label[:generate_label_response][:response_shipments][:response_shipment][:labels][:label][:content]
#jpg_file = File.open('label.jpg', 'wb') do |file|
content = Base64.decode64 base_64_binary_data
file << content
end
This file shows up at my_apps_name/label.jpg.
In my filesystem I can see the picture crystal clear.
In my view I then put:
<%= image_tag #jpg_file.path %>
When I inspect my html in the browser I see:
<img src="/images/label.jpg" alt="Label">
I suppose this is easily fixed, suggestions are still welcome of course, but...
..more importantly: What if this call is made several times at once. Will rails manage? Should I maybe save these images (to my S3) or build some kind of chronejob to manage it?
(I am not sure of the added value of mini_magic anymore, since I can see the .jpg being already there.)
Then on a complete side note, maybe better left for a separate question: what if the API on the other side fails?
Step one
decode http response
require "base64"
BASE_64_BINARY_DATA = "PASTE YOUR BIG STRING HERE"
pdf_file = File.open('your_pdf.pdf', 'wb') do |file|
content = Base64.decode64 BASE_64_BINARY_DATA
file << content
end
Step two
convert pdf file to an image
add to gemfile 'gem 'mini_magick' and use 'bundle install' command
put somewhere in your code
require 'mini_magick'
pdf = MiniMagick::Image.new(pdf_file.path)
pdf.pages.first.write("preview.png")
image saved as preview.png
To display pdf image put in the view
<%= image_tag ("path_to_image/preview.jpg")%>

How to scrape images from eBay and Amazon using XPath in Nokogiri from JSON

I'm trying to scrape images from websites using Nokogiri and XPath, so far with limited success. For a typical website whose HTML has img and src, I can use:
tmp2 = Nokogiri::HTML(open(site_url))
tmp2.xpath("//img/#src").each do |src|
...do whatever
end
However, some sites like Amazon and eBay only trigger certain images with JavaScript. If I look at the code I can see the data in arrays. For example, from Amazon:
<script type="text/javascript">
P.when('jQuery', 'cf').execute(function($, cf){
P.load.js('http://z-ecx.images-amazon.com/images/G/01/browser-scripts/imageBlock-udp-airy/imageBlock-udp-airy-4060168860._V1_.js');
});
P.when('A', 'jQuery', 'ImageBlockATF', 'cf').register('ImageBlockBTF', function(A, $, imageBlockATF, cf){
var data = {"indexToColor":[],"burjImageBlock":0,"isSwatchHoverConsistent":1,"heroFocalPoint":null,"visualDimensions":["color_name"],"productGroupID":"apparel_display_on_website","newVideoMissing":0,"useIV":0,"useClickZoom":null,"useChildVideos":0,"numColors":7,"logMetrics":0,"defaultColor":"initial","airyConfig":{"enableContinuousPlay":null,"installFlashButtonText":"Install Flash Player","contentTitle":null,"autoplayCutOffTimeSeconds":null,"ageGate":{"monthNames":["January","February","March","April","May","June","July","August","September","October","November","December"],"deniedPrompt":"We're sorry. You are not old enough to watch this video.","submitText":"Submit","prompt":"This video is not intended for all audiences. What date were you born?"},"videoAds":null,"videoUnsupportedPrompt":"Sorry, this video is unsupported on this browser.","desiredMode":null,"swfUrl":"http://g-ecx.images-amazon.com/images/G/01/vap/video/airy2/prod/2.0.1102.0/flash/AiryBasicRenderer._V304902271_.swf","isAutoplayEnabled":null,"installFlashPrompt":"Adobe Flash Player is required to watch this video.","isLiveStream":null,"regionCode":"NA","contentId":null,"playbackErrorPrompt":"Sorry, an error has occurred while attempting video playback. Please try again later.","contentMinAge":null,"isForesterTrackingDisabled":null,"streamingUrls":null,"parentId":null,"foresterMetadataParams":{"client":"Dpx","requestId":"1MX7VHFRVAS6TWY64BXC","marketplaceId":"ATVPDKIKX0DER","session":"182-9511970-7757812","method":"Apparel.ImageBlock"},"jsUrl":"http://z-ecx.images-amazon.com/images/G/01/vap/video/airy2/prod/2.0.1102.0/js/airy.chromeless._V304902265_.js"},"mainImageMaxSizes":null,"staticStrings":{"playVideo":"Click to play video","rollOverToZoom":"Roll over image to zoom in","images":"Images","video":"video","clickToZoom":"Click on image to zoom in","touchToZoom":"Touch the image to zoom in","videos":"Videos","close":"Close","pleaseSelect":"Please select","clickToExpand":"Click to open expanded view","allMedia":"All Media"},"notThumbnailClickImmersiveView":1,"gIsNewTwister":1,"title":"Threads 4 Thought Women's Tabitha Basic Tank Top","ivRepresentativeAsin":{"6":"B00T46V76W","4":"B00WM3O7ES","1":"B00T46YZES","3":"B00WM3NLPE","2":"B00T46VD16","5":"B00T46VGXQ"},"mainImageSizes":[[342,445],[385,500],[425,550],[466,606],[522,679]],"isQuickview":0,"ipadVideoSizes":[[340,444],[384,500]],"colorToAsin":{"Coral Dreams":{"asin":"B00T46V76W"},"Heather Grey":{"asin":"B00WM3NLPE"},"Black":{"asin":"B00T46YZES"},"White":{"asin":"B00T46VGXQ"},"Deep Blue Sea":{"asin":"B00T46VD16"},"Sea Glass":{"asin":"B00WM3O7ES"}},"thumbExperimentEnabledValue":1,"showLITBOnClick":0,"videoSizes":[[342,445],[384,500]],"stretchyGoodnessWidth":[1280,1440,1640,1800],"autoplayVideo":0,"hoverZoomIndicator":"","sitbReftag":"","useHoverZoom":1,"staticImages":{"zoomOut":"http://g-ecx.images-amazon.com/images/G/01/detail-page/cursors/zoom-out._V184888738_.bmp","hoverZoomIcon":"http://g-ecx.images-amazon.com/images/G/01/img11/apparel/UX/DP/icon_zoom._V138923886_.png","zoomIn":"http://g-ecx.images-amazon.com/images/G/01/detail-page/cursors/zoom-in._V184888790_.bmp","zoomLensBackground":"http://g-ecx.images-amazon.com/images/G/01/apparel/rcxgs/tile._V211431200_.gif","videoThumbIcon":"http://g-ecx.images-amazon.com/images/G/01/Quarterdeck/en_US/images/video._V183716339_SX38_SY50_CR,0,0,38,50_.gif","spinner":"http://g-ecx.images-amazon.com/images/G/01/ui/loadIndicators/loading-large_labeled._V192238949_.gif","zoomInCur":"http://g-ecx.images-amazon.com/images/G/01/detail-page/cursors/zoomIn._V323082799_.cur","videoSWFPath":"http://g-ecx.images-amazon.com/images/G/01/Quarterdeck/en_US/video/20110518115040892/Video._V178668404_.swf","arrow":"http://g-ecx.images-amazon.com/images/G/01/javascripts/lib/popover/images/light/sprite-vertical-popover-arrow._V186877868_.png","zoomOutCur":"http://g-ecx.images-amazon.com/images/G/01/detail-page/cursors/zoomOut._V323082798_.cur"},"videos":[],"gPreferChildVideos":0,"altsOnLeft":1,"ivImageSetKeys":{"Coral Dreams":"6","Heather Grey":"3","Black":"1","initial":0,"White":"5","Deep Blue Sea":"2","Sea Glass":"4"},"useHoverZoomIpad":"","isUDP":1,"alwaysIncludeVideo":0,"widths":[1280,1440,1640,1800],"maxAlts":7,"useChromelessVideoPlayer":1,"mainImageHeightPartitions":null};
data["customerImages"] = eval('[]');
data["colorImages"] = {"Coral Dreams":[{"large":"http://ecx.images-amazon.com/images/I/41FGlhksmtL.jpg","variant":"MAIN","hiRes":"http://ecx.images-amazon.com/images/I/81iXQbkcpiL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41FGlhksmtL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81iXQbkcpiL._UX466_.jpg":["466","606"],"http://ecx.images-amazon.com/images/I/81iXQbkcpiL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81iXQbkcpiL._UY550_.jpg":["423","550"],"http://ecx.images-amazon.com/images/I/81iXQbkcpiL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81iXQbkcpiL._UY500_.jpg":["385","500"]}},{"large":"http://ecx.images-amazon.com/images/I/41XR9o0cV-L.jpg","variant":"BACK","hiRes":"http://ecx.images-amazon.com/images/I/81bVmFiRu0L._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41XR9o0cV-L._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81bVmFiRu0L._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81bVmFiRu0L._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81bVmFiRu0L._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81bVmFiRu0L._UX466_.jpg":["466","606"],"http://ecx.images-amazon.com/images/I/81bVmFiRu0L._UY550_.jpg":["423","550"]}}],"Heather Grey":[{"large":"http://ecx.images-amazon.com/images/I/41f-8R8Eu-L.jpg","variant":"MAIN","hiRes":"http://ecx.images-amazon.com/images/I/81dTYkBL%2BxL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41f-8R8Eu-L._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81dTYkBL%2BxL._UX466_.jpg":["466","606"],"http://ecx.images-amazon.com/images/I/81dTYkBL%2BxL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81dTYkBL%2BxL._UY550_.jpg":["423","550"],"http://ecx.images-amazon.com/images/I/81dTYkBL%2BxL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81dTYkBL%2BxL._UX342_.jpg":["342","445"]}},{"large":"http://ecx.images-amazon.com/images/I/41gLiFBbcdL.jpg","variant":"BACK","hiRes":"http://ecx.images-amazon.com/images/I/81ua3AXCpJL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41gLiFBbcdL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81ua3AXCpJL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81ua3AXCpJL._UY550_.jpg":["423","550"],"http://ecx.images-amazon.com/images/I/81ua3AXCpJL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81ua3AXCpJL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81ua3AXCpJL._UX466_.jpg":["466","606"]}}],"Black":[{"large":"http://ecx.images-amazon.com/images/I/41BxSpfEM7L.jpg","variant":"MAIN","hiRes":"http://ecx.images-amazon.com/images/I/81%2BTW8762BL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41BxSpfEM7L._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81%2BTW8762BL._UY550_.jpg":["423","550"],"http://ecx.images-amazon.com/images/I/81%2BTW8762BL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81%2BTW8762BL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81%2BTW8762BL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81%2BTW8762BL._UX466_.jpg":["466","606"]}},{"large":"http://ecx.images-amazon.com/images/I/41Gf%2BW-cPTL.jpg","variant":"BACK","hiRes":"http://ecx.images-amazon.com/images/I/81SJwuaCspL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41Gf%2BW-cPTL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81SJwuaCspL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81SJwuaCspL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81SJwuaCspL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81SJwuaCspL._UX466_.jpg":["466","606"],"http://ecx.images-amazon.com/images/I/81SJwuaCspL._UY550_.jpg":["423","550"]}}],"White":[{"large":"http://ecx.images-amazon.com/images/I/41tElK2wPKL.jpg","variant":"MAIN","hiRes":"http://ecx.images-amazon.com/images/I/81kKgU75rIL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41tElK2wPKL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81kKgU75rIL._UY550_.jpg":["423","550"],"http://ecx.images-amazon.com/images/I/81kKgU75rIL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81kKgU75rIL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81kKgU75rIL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81kKgU75rIL._UX466_.jpg":["466","606"]}},{"large":"http://ecx.images-amazon.com/images/I/31lEDIs4cqL.jpg","variant":"BACK","hiRes":"http://ecx.images-amazon.com/images/I/81OBgvbUR7L._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/31lEDIs4cqL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81OBgvbUR7L._UX466_.jpg":["466","606"],"http://ecx.images-amazon.com/images/I/81OBgvbUR7L._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81OBgvbUR7L._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81OBgvbUR7L._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81OBgvbUR7L._UY550_.jpg":["423","550"]}}],"Deep Blue Sea":[{"large":"http://ecx.images-amazon.com/images/I/41oNq3KmSGL.jpg","variant":"MAIN","hiRes":"http://ecx.images-amazon.com/images/I/81MtZtmxVLL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41oNq3KmSGL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81MtZtmxVLL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81MtZtmxVLL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81MtZtmxVLL._UY550_.jpg":["423","550"],"http://ecx.images-amazon.com/images/I/81MtZtmxVLL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81MtZtmxVLL._UX466_.jpg":["466","606"]}},{"large":"http://ecx.images-amazon.com/images/I/41AJgd1OuYL.jpg","variant":"BACK","hiRes":"http://ecx.images-amazon.com/images/I/81uLEksrYFL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41AJgd1OuYL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81uLEksrYFL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81uLEksrYFL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81uLEksrYFL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81uLEksrYFL._UX466_.jpg":["466","606"],"http://ecx.images-amazon.com/images/I/81uLEksrYFL._UY550_.jpg":["423","550"]}}],"Sea Glass":[{"large":"http://ecx.images-amazon.com/images/I/418vg-re8oL.jpg","variant":"MAIN","hiRes":"http://ecx.images-amazon.com/images/I/81YgtD-bEwL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/418vg-re8oL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81YgtD-bEwL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81YgtD-bEwL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81YgtD-bEwL._UX466_.jpg":["466","606"],"http://ecx.images-amazon.com/images/I/81YgtD-bEwL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81YgtD-bEwL._UY550_.jpg":["423","550"]}},{"large":"http://ecx.images-amazon.com/images/I/41lcpC41VSL.jpg","variant":"BACK","hiRes":"http://ecx.images-amazon.com/images/I/814%2B6ZLwIxL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41lcpC41VSL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/814%2B6ZLwIxL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/814%2B6ZLwIxL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/814%2B6ZLwIxL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/814%2B6ZLwIxL._UX466_.jpg":["466","606"],"http://ecx.images-amazon.com/images/I/814%2B6ZLwIxL._UY550_.jpg":["423","550"]}}]};
data["heroImage"] = {};
data["landingAsinColor"] = 'Coral Dreams';
data["shouldApplyResizeFix"] = false;
return data;
});
</script>
The filenames I want to grab don't have src (i.e. http://ecx.images-amazon.com/images/I/81%2BTW8762BL._UY500_.jpg) In this case, the array is called data["colorImages"]. But I can't hard-code anything because the same thing happens on eBay.
The filenames I need here are in enImgCarousel.
On a side note, when I use the following JavaScript bookmarklet for each URL to get images, I'm able to get the correct images:
a='';
for (b=0;b<document.images.length;b++){
a+='<img src='+document.images[b].src+'><br>'};
ifa=''){
document.writea+'</center>');
void(document.close())
}else{
alert('No images!')
}
Back to Nokogiri and XPath, I've also tried:
tmp2.xpath("//img").each do |src|...
and
tmp2.xpath("html//img").each do |src|
Any ideas how I should do this or which direction to go in?
This is alternative way to solve what you want; you can use Capybara and Poltergeist.
I assume you don't have to dive into JavaScript with this solution.
If you scrape, I recommend that you consider Capybara with Poltergeist, you can find many sources to reference.
This is the code I tried:
require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'
Capybara.register_driver :poltergeist_debug do |app|
Capybara::Poltergeist::Driver.new(app, inspector: true)
end
Capybara.javascript_driver = :poltergeist_debug
Capybara.current_driver = :poltergeist_debug
# Amazon Case
visit_site('https://www.amazon.com/dp/B00T46V758/?tag=stackoverfl08-20')
doc_amazon = Nokogiri::HTML.parse(page.html)
doc_amazon.xpath("//img/#src").each do |src|
p src.value
end
#ebay case
visit_site('https://www.ebay.com/itm/Summer-Women-Casual-Chiffon-Loose-Tops-Batwing-Short-Sleeve-Loose-T-Shirt-Blouse-/351411949784?pt=LH_DefaultDomain_0&var=&hash=item51d1c8d0d8')
doc_ebay = Nokogiri::HTML.parse(page.html)
doc_ebay.xpath("//img/#src").each do |src|
p src.value
end
If you want to dig into it:
doc.xpath("//div[#id='imgTagWrapperId']/img").attribute('src').value
# => "https://images-na.ssl-images-amazon.com/images/I/81%2BTW8762BL._UX453_.jpg"
doc.xpath("//div[#id='mainImgHldr']/img[#id='icImg']").attribute('src').value
# => "https://i.ebayimg.com/images/g/dtAAAOSwpdpVZuU~/s-l300.jpg"
Are you trying to generate a database of competitors items with pricing, etc.?
Are you trying to grab entire categories or individual sellers?
The reason why I ask is you can get an RSS feed of items each seller lists if they have turned that feature on. This way, you do not have to waste time scraping a page when you can get the central data from an RSS feed.
When parsing webpages, depending upon where you are in the webpage (you mentioned carousel) the indices you are encountering are from the stash of thumbnails representing the larger images.
I recommend looking at the eBay API and the Amazon API and finding the RSS feeds for the sellers first.
As far as getting past any Javascript issues, the webpage loads rotating slideshows and carousels dynamically, so you will have to use Mechanize (as RAJ suggested above) or Beautiful Soup or Selenium to get fully rendered web pages in which all images are in a scrapable state.
Feel free to post your source if there is anything else I can help with.
Sorry, as I am posting the answer from mobile phone, I can't write full code right away, however, I can give you a way. You should use Mechanize with selenium-webdriver & watir instead of only Nokogiri.
Using Mechanize, you will be able to handle elements coming from JavaScript. You can mock the actual moves on browser i.e. you can code for clicking on links/buttons, you can wait for image load and then can scrape it. And all this can be done using Mechanize very easily.

Multiple input file tags with IOS 6+ on iPad/iPhone

EDIT :
I've been able to resolve this, after finding that iOS uploads every image as "image.jpg". In freeASPUpload.asp, I changed the following :
For Each fileItem In UploadedFiles.Items
filePath = path & fileItem.FileName
Set streamFile = Server.CreateObject("ADODB.Stream")
streamFile.Type = adTypeBinary
streamFile.Open
StreamRequest.Position=fileItem.Start
StreamRequest.CopyTo streamFile, fileItem.Length
streamFile.SaveToFile path & FileName, adSaveCreateOverWrite
streamFile.close
Set streamFile = Nothing
fileItem.Path = filePath
Next
To :
For Each fileItem In UploadedFiles.Items
FileName = GetFileName(path, fileItem.FileName)
fileItem.FileName = FileName
filePath = path & fileItem.FileName
Set streamFile = Server.CreateObject("ADODB.Stream")
streamFile.Type = adTypeBinary
streamFile.Open
StreamRequest.Position=fileItem.Start
StreamRequest.CopyTo streamFile, fileItem.Length
streamFile.SaveToFile path & FileName, adSaveCreateOverWrite
streamFile.close
Set streamFile = Nothing
fileItem.Path = filePath
Next
This is within the save function of the script. The GetFileName function iterates over what is in the folder already, and adds a number until it is unique. It then updates the file key to its new name. I've left an "image.jpg" file in the temp folder, so that it always finds one. That works for now.
What I've found, however, is for some reason, it flips the second photo out of the five that the form allows for..
I will try to fix that / look for others with the same issue. I've always had problems with iPhones and iPads rotating pictures, and it's never consistent.
I'm trying to grab multiple files from a mobile fleet of iPads and iPhones. Currently I have 5
<input type="file" name="fileX" accept="image/*">
This works fine in a browser (using free asp upload / classic ASP). In testing, it works with IOS, but only if I send one file. If I send more then one via the form, I get the error 'file not found'. It seems like the files aren't being sent at all.
The file not found kicks back when trying to manipulate the file via FSO.
Uploads here :
Dim Upload, fileName, fileSize, ks, i, fileKey
Set Upload = New FreeASPUpload
SaveFiles = ""
uploadsDirVar = "directory"
Upload.Save(uploadsDirVar)
ks = Upload.UploadedFiles.keys
Errors out here :
Set f=fs.GetFile(uploadsDirVar & "\" & Upload.UploadedFiles(fileKey).FileName)
Any ideas on how I can get around this? Everything I've read points towards the HTML5 multiple tag on a single input. However, in my scenario, this won't work. I'm also tying descriptions for the images to the fields based on field name, restricting the upload to 5, re-sizing the images on the action page before passing it to its final destination and attaching them to an email.
Any advice welcome, and thank you for your time!
**Edit
The issue seems to be that IOS names all uploads "image", coming from either the camera or the library. Now the problem is how to rename the image before the script attempts to upload it. From what I'm understanding, it needs to happen within the asp upload component before it places the file.
See the above edit for the answer. As for the flipped images - turns out they are flipped straight from the iPad. I checked the exif data and I'm not sure how to account for this in my system. If I find the answer, I'll post here.

Tracking Upload Progress of File to S3 Using Ruby aws-sdk

Firstly, I am aware that there are quite a few questions that are similar to this one in SO. I have read most, if not all of them, over the past week. But I still can't make this work for me.
I am developing a Ruby on Rails app that allows users to upload mp3 files to Amazon S3. The upload itself works perfectly, but a progress bar would greatly improve user experience on the website.
I am using the aws-sdk gem which is the official one from Amazon. I have looked everywhere in its documentation for callbacks during the upload process, but I couldn't find anything.
The files are uploaded one at a time directly to S3 so it doesn't need to load it into memory. No multiple file upload necessary either.
I figured that I may need to use JQuery to make this work and I am fine with that.
I found this that looked very promising: https://github.com/blueimp/jQuery-File-Upload
And I even tried following the example here: https://github.com/ncri/s3_uploader_example
But I just could not make it work for me.
The documentation for aws-sdk also BRIEFLY describes streaming uploads with a block:
obj.write do |buffer, bytes|
# writing fewer than the requested number of bytes to the buffer
# will cause write to stop yielding to the block
end
But this is barely helpful. How does one "write to the buffer"? I tried a few intuitive options that would always result in timeouts. And how would I even update the browser based on the buffering?
Is there a better or simpler solution to this?
Thank you in advance.
I would appreciate any help on this subject.
The "buffer" object yielded when passing a block to #write is an instance of StringIO. You can write to the buffer using #write or #<<. Here is an example that uses the block form to upload a file.
file = File.open('/path/to/file', 'r')
obj = s3.buckets['my-bucket'].objects['object-key']
obj.write(:content_length => file.size) do |buffer, bytes|
buffer.write(file.read(bytes))
# you could do some interesting things here to track progress
end
file.close
After read the source code of the AWS gem, I've adapted (or mostly copy) the multipart upload method to yield the current progress based on how many chunks have been uploaded
s3 = AWS::S3.new.buckets['your_bucket']
file = File.open(filepath, 'r', encoding: 'BINARY')
file_to_upload = "#{s3_dir}/#{filename}"
upload_progress = 0
opts = {
content_type: mime_type,
cache_control: 'max-age=31536000',
estimated_content_length: file.size,
}
part_size = self.compute_part_size(opts)
parts_number = (file.size.to_f / part_size).ceil.to_i
obj = s3.objects[file_to_upload]
begin
obj.multipart_upload(opts) do |upload|
until file.eof? do
break if (abort_upload = upload.aborted?)
upload.add_part(file.read(part_size))
upload_progress += 1.0/parts_number
# Yields the Float progress and the String filepath from the
# current file that's being uploaded
yield(upload_progress, upload) if block_given?
end
end
end
The compute_part_size method is defined here and I've modified it to this:
def compute_part_size options
max_parts = 10000
min_size = 5242880 #5 MB
estimated_size = options[:estimated_content_length]
[(estimated_size.to_f / max_parts).ceil, min_size].max.to_i
end
This code was tested on Ruby 2.0.0p0

upload an RMagick-generated file from Heroku to Amazon S3

I am creating a Rails app which is hosted on Heroku and that allows the user to generate animated GIFs on the fly based on an original JPG that's hosted somewhere in the web (think of it as a crop-resize app). I tried Paperclip but, AFAIK, it does not handle dynamically-generated files. I am using the aws-sdk gem and this is a code snippet of my controller:
im = Magick::Image.read(#animation.url).first
fr1 = im.crop(#animation.x1,#animation.y1,#animation.width,#animation.height,true)
str1 = fr1.to_blob
fr2 = im.crop(#animation.x2,#animation.y2,#animation.width,#animation.height,true)
str2 = fr2.to_blob
list = Magick::ImageList.new
list.from_blob(str1)
list.from_blob(str2)
list.delay = #animation.delay
list.iterations = 0
That is for the basic creation of a two-frame animation. RMagick can generate a GIF in my development computer with these lines:
list.write("#{Rails.public_path}/images/" + #animation.filename)
I tried uploading the list structure to S3:
# upload to Amazon S3
s3 = AWS::S3.new
bucket = s3.buckets['mybucket']
obj = bucket.objects[#animation.filename]
obj.write(:single_request => true, :content_type => 'image/gif', :data => list)
But I don't have a size method in RMagick::ImageList that I can use to specify that. I tried "precompiling" the GIF into another RMagick::Image:
anim = Magick::Image.new(#animation.width, #animation.height)
anim.format = "GIF"
list.write(anim)
But Rails crashes with a segmentation fault:
/path/to/my_controller.rb:103: [BUG] Segmentation fault ruby 1.8.7 (2010-01-10 patchlevel 249) [universal-darwin11.0]
Abort trap: 6
Line 103 corresponds to list.write(anim).
So right now I have no idea how to do this and would appreciate any help I receive.
As per #mga's request in his answer to his original question...
a non-filesystem based approach is pretty simple
rm_image = Magick::Image.from_blob(params[:image][:datafile].read)[0]
# [0] because from_blob returns an array
# the blob, presumably, can have multiple images data in it
a_thumbnail = rm_image.resize_to_fit(150, 150)
# just as an example of doing *something* with it before writing
s3_bucket.objects['my_thumbnail.jpg'].write(a_thumbnail.to_blob, {:acl=>:public_read})
Voila! reading an uploaded file, manipulating it with RMagick, and writing it to s3 without ever touching the filesystem.
Since this project is hosted in Heroku I cannot use the filesystem so that is why I was trying to do everything via code. I found that Heroku does have a temporary-writable folder: http://devcenter.heroku.com/articles/read-only-filesystem
This works just fine in my case since I don't need the file after this request.
The resulting code:
im = Magick::Image.read(#animation.url).first
fr1 = im.crop(#animation.x1,#animation.y1,#animation.width,#animation.height,true)
fr2 = im.crop(#animation.x2,#animation.y2,#animation.width,#animation.height,true)
list = Magick::ImageList.new
list << fr1
list << fr2
list.delay = #animation.delay
list.iterations = 0
# gotta packet the file
list.write("#{Rails.root}/tmp/#{#animation.filename}.gif")
# upload to Amazon S3
s3 = AWS::S3.new
bucket = s3.buckets['mybucket']
obj = bucket.objects[#animation.filename]
obj.write(:file => "#{Rails.root}/tmp/#{#animation.filename}.gif")
It would be interesting to know if a non-filesystem-writing solution is possible.
I am updating this answer for AWS SDK Version 2 which should be:
rm_image = Magick::Image.from_blob(params[:image][:datafile].read)[0]
# [0] because from_blob returns an array
# the blob, presumably, can have multiple images data in it
a_thumbnail = rm_image.resize_to_fit(150, 150)
# just as an example of doing *something* with it before writing
s3 = Aws::S3::Resource.new
bucket = s3.bucket('mybucket')
obj = bucket.object('filename')
obj.put(body: background.to_blob)
I think there's a few things going on here. First, the documentation for RMagick is sub-par, and its easy to get side-tracked. The code you're using to generate the gif can be a little simpler. I cooked up a very contrived example here:
#!/usr/bin/env ruby
require 'rubygems'
require 'RMagick'
# read in source file
im = Magick::Image.read('foo.jpg').first
# make two slightly different frames
fr1 = im.crop(0, 100, 300, 300, true)
fr2 = im.crop(0, 200, 300, 300, true)
# create an ImageList
list = Magick::ImageList.new
# add our images to it
list << fr1
list << fr2
# set some basic values
list.delay = 100
list.iterations = 0
# write out an animated gif to the filesystem
list.write("foo.gif")
This code works -- it reads in a jpg I have locally, and writes out a 2-frame animation. Obviously I've hardcoded some values here, but there's no reason this shouldn't work for you, although I am running ruby 1.9.2 and probably a different version of RMagick, but this is basic code.
The second issue is totally unrelated -- is it possible to upload an image generated in IM to S3 without actually hitting the filesystem? Basically, will this ever work:
obj.write(:single_request => true, :content_type => 'image/gif', :data => list)
I'm not sure if it is or not. I experimented with calling list.to_blob, but it only outputs the first frame, and it's as a JPG, although I didn't spend much time on it. You might be able to fool list.write into outputting somewhere, but rather than going down that road, I would personally just output the file unless that is impossible for some reason.

Resources