Ruby RegEx to locate image assets in an html/erb file - ruby-on-rails

My end goal is to write a script that loops through all my app/views folders and finds any image assets used in them (jpg, png, svg, gif). I feel I am close, but I can't quite get it and need a little assistance.
This is how I am getting all my assets:
require 'find'
assets_in_assets = []
# I searched for image asset names in this folder
image_asset_path = './app/assets/images'
# I haven't made use of the variables below yet
assets_in_use = []
# I plan to loop through the below folders to see if and where the image
# assets are being used
public_folder = './public'
app_folder = './app'
Find.find(image_asset_path) do |path|
  # returns path and file names of all files extensions recursively
  if !File.directory?(path) && path =~ /.*[\.jpg$ | \.png$ | .svg$ | \.gif$]/ &&
     !(path =~ /\.DS_Store/)
    new_path = File.basename(path) # equiv to path.to_s.split('/').last
    assets_in_assets << new_path
  end
end
# The above seems to work, it gives me all the asset image names in an array.
This is how I am trying to read an html.erb file to find if and where images are being used.
Here is a sample of part of the page:
<div class="wrapper">
  <div class="content-wrapper pull-center center-text">
    <img class="pattern-stars" src="<%= image_path('v3/super/pattern-stars.png') %>" aria-hidden="true">
    <h2 class="pull-center uppercase">Built by the Obsessed People at the Company</h2>
    <p class="top-mini">Our pets needed a challenge.</p>
    <p class="italicize">So we made one.</p>
    <img class="stroke" src="<%= image_path('v3/super/stroke.png') %>" aria-hidden="true">
  </div>
</div>
# The assets I am expecting to find, in this small section, are:
#- pattern-stars.png
#- stroke.png
And my code (I tried two different ways, here is the first):
# My plan is start with one specific file, then expand it once the code works
lines = File.open('./app/views/pages/chewer.html.erb', 'r')
lines.each do |f|
if f =~ / [\w]+\.(jpe?g | png | gif | svg) /xi
puts 'match: ' + f # just wanted to see what's being returned
end
end
# This is what gets returned
# match: <img class="pattern-stars" src="<%= image_path('v3/super
# /pattern-stars.png') %>" aria-hidden="true">
# match: <img class="stroke" src="<%= image_path('v3/super/stroke.png')
# %>" aria-hidden="true">
Not what I was hoping for. I also tried the following:
lines = File.open('./app/views/pages/chewer.html.erb', 'r')
lines.each do |f|
new_f = File.basename(f)
puts 'after split' + new_f # I wanted to see what was being returned
if new_f =~ / [\w]+\.(jpe?g | png | gif | svg) /xi
puts 'match: ' + new_f
end
end
# This is what gets returned
# after split: pattern-stars.png') %>" aria-hidden="true">
# match: pattern-stars.png') %>" aria-hidden="true">
# after split: stroke.png') %>" aria-hidden="true">
# match: stroke.png') %>" aria-hidden="true">
And here I remain blocked. I have searched through S.O. and tried a few things, but nothing I found has helped; it could be that I implemented the solutions incorrectly. I also tried look-behind (using the single ' as an end point) and look-ahead (using a / as a starting point).
If this is a dup or similar to another question, please let me know. I'd appreciate the help (plus a brief explanation; I really want to get a better understanding to improve my skills).

(?:['"])([^'"]+\.(?:png|jpe?g|gif|svg)) seems to work in the one test case you supplied. It relies on image paths always being inside a quoted string, using the opening quote as the "this is the start of the image path" delimiter. It terminates at the extension, so even if the string is unclosed it should stop at an appropriate place.
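To see what the pattern above captures, here is a minimal sketch that runs it over the two img lines from the question with String#scan. The helper name image_names_in and the constant names are hypothetical, and the /i flag is an addition so that upper-case extensions also match.

```ruby
# Pattern from the answer above; the /i flag is an addition for this sketch.
IMAGE_PATH_RE = /['"]([^'"]+\.(?:png|jpe?g|gif|svg))/i

# Hypothetical helper: returns the basenames of image paths found
# inside quoted strings in the given text.
def image_names_in(text)
  text.scan(IMAGE_PATH_RE).flatten.map { |path| File.basename(path) }
end

# The two <img> lines from the question, each on a single line.
SAMPLE_ERB = <<~ERB
  <img class="pattern-stars" src="<%= image_path('v3/super/pattern-stars.png') %>" aria-hidden="true">
  <img class="stroke" src="<%= image_path('v3/super/stroke.png') %>" aria-hidden="true">
ERB

puts image_names_in(SAMPLE_ERB)
# pattern-stars.png
# stroke.png
```

Because scan returns only the capture group, the surrounding markup (the `') %>" aria-hidden=...` tail that the line-by-line attempts printed) never reaches the result.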

Using the above, I eventually got to the following solution:
Find.find(app_folder, public_folder) do |path|
  # A leading && on a continuation line is a syntax error in most Ruby versions,
  # so the operators sit at the end of each line.
  if !File.directory?(path) &&
     !(path =~ /\.\/app\/assets\/images/) &&
     !(path =~ /\.DS_Store/) &&
     !(path =~ /\.\/app\/assets\/fonts/)
    asset_file = File.read(path)
    image_asset = asset_file.scan(/ (?:['"|\s|#])([^'"|\s|#]+\.(?:png | jpe?g |gif | svg)) /xi).flatten
    image_asset.each do |image_name|
      assets_in_use << [path, File.basename(image_name)]
    end
  end
end
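One thing worth noting about the asset-collection pattern in the question (/.*[\.jpg$ | \.png$ | .svg$ | \.gif$]/): square brackets form a character class, so it matches any path containing one of the listed characters rather than one of the listed extensions. That is why it appeared to work. A minimal sketch of an extension check based on File.extname follows; image_file? and IMAGE_EXTENSIONS are hypothetical names, not part of the original script.

```ruby
IMAGE_EXTENSIONS = %w[.jpg .jpeg .png .gif .svg].freeze

# Hypothetical helper: true when the path ends in one of the image
# extensions, compared case-insensitively.
def image_file?(path)
  IMAGE_EXTENSIONS.include?(File.extname(path).downcase)
end

# A bracketed pattern is a character class: it matches any single listed character.
puts 'notes.txt'.match?(/[\.png$]/)                          # true, because of the "n"
puts image_file?('./app/assets/images/notes.txt')            # false
puts image_file?('./app/assets/images/v3/super/stroke.PNG')  # true
```

Inside the Find.find block, the condition could then read `!File.directory?(path) && image_file?(path)`.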

Related

Convert html to text in ROR

HTML
<p>Hello</p>
<p>this is <br></p>
<p>a <br></p>
<p>test message</p><br>
I have already tried 'strip tags' which gives me the following output :
"Hellothis is a test message"
The output I want:
Hello
this is
a
test message
html = "<p>Hello</p>
<p>this is <br></p>
<p>a <br></p>
<p>test message</p><br>"
strip_tags
strip_tags helper seems to work fine :
puts ActionController::Base.helpers.strip_tags(html)
# =>
# Hello
# this is
# a
# test message
Nokogiri
Nokogiri is included by default in Rails, so you could also use :
doc = Nokogiri::HTML(html)
puts doc.xpath("//text()").to_s
It outputs :
Hello
this is
a
test message
Convert newlines to spaces
If you want to remove newlines :
ActionController::Base.helpers.strip_tags(html).gsub(/\s+/,' ')
#=> "Hello this is a test message"
The HTML is rendered by a browser like:
Hello
this is
a
test message
This isn't quite correct though, because the HTML contains trailing <br> tags in the <p> tags, which should be a string like:
this is \n\n\n
which is normally considered a paragraph plus a new-line. But, browsers play games when rendering text in order to make it more readable, and gobble blank lines and spaces. For example, this HTML:
<p>foo</p>
<p></p>
<p></p>
<p></p>
<p>bar</p>
renders as:
foo
bar
and:
<p>foo bar</p>
renders as:
foo bar
So you have to decide: do you want Nokogiri to render the text the way the browser does, for readability, or accurately?
This does it like the browser:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<p>Hello</p>
<p>this is <br></p>
<p>a <br></p>
<p>test message</p><br>
EOT
doc.search('br').remove
text = doc.search('p').map { |p| p.text + "\n\n" }
puts text
# >> Hello
# >>
# >> this is
# >>
# >> a
# >>
# >> test message
# >>
It removes the breaks, then converts the <p> contained text by appending two new-lines.
Doing it accurately, as per how the markup shows, is a little different:
doc.search('br').map { |br| br.replace("\n") }
text = doc.search('p').map { |p| p.text + "\n\n" }
puts text
# >> Hello
# >>
# >> this is
# >>
# >>
# >> a
# >>
# >>
# >> test message
# >>
This is just a simplified way of doing it to get you started. Rails does the opposite of this in ActionView's simple_format method.
Browsers have a lot more rules used to determine when and how to display the text and their rendering can be influenced by CSS and JavaScript which won't necessarily translate to text, especially plain text.

Dashing (Ruby) Nokogiri LoadError

I've been working on a dashboard on the Dashing framework, and I'm currently trying to make a little crawler to collect specific data on Jenkins-CI, and pass it to the Number widget. Here's the crawler (it's just a stub, it counts the number of "p" elements on a stub html page):
require 'nokogiri'
require 'open-uri'
class ActiveBuilds
  def initialize()
    @jenkins_page = nil
    @build_count = nil
  end
  # !STUB! Gets the jenkins page to parse to XML on Nokogiri
  @jenkins_page = Nokogiri::HTML(open("http://localhost:80"))
  # !STUB! Counts the number of 'p' items found on the page
  @build_count = @jenkins_page.css("p").length
  # !STUB! Returns the amount of active builds
  def amountOfActiveBuilds
    return @build_count
  end
end
and for reference, not really necessary, is the HTML page:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Number Stub | Project</title>
</head>
<body>
<h1>Test</h1>
<ul>
<!-- Count these -->
<li> <div> <p>Item 1 </div>
<li> <div> <p>Item 2 </div>
<li> <div> <p>Item 3 </div>
<li> <div> <p>Item 4 </div>
<li> <div> <p>Item 5 </div>
<!-- Stop counting -->
<li> <div> Item 6 </div>
<li> <div> Item 7 </div>
</ul>
</body>
</html>
and now, the jobs/sample.rb file from dashing, modified (the only thing that matters is the builds/valuation stuff):
require './ActiveBuilds.rb'
active_builds = ActiveBuilds.new
current_valuation = active_builds.amountOfActiveBuilds
current_karma = 0
SCHEDULER.every '2s' do
last_valuation = current_valuation
last_karma = current_karma
current_karma = rand(200000)
send_event('valuation', { current: current_valuation, last: last_valuation })
send_event('karma', { current: current_karma, last: last_karma })
send_event('synergy', { value: rand(100) })
end
The thing is, I had an earlier version working: it would get the page on localhost, count the number of "p" items and write it to a file, and the dashing file would then read it and display it correctly. But it wasn't updating the value on the dashboard unless I restarted it, which defeats the purpose of this framework.
Now to the errors:
When attempting to run sample.rb (the dashing file):
$ ruby sample.rb
sample.rb:12:in '<main>': uninitialized constant SCHEDULER (NameError)
When attempting to run the dashing server:
$ dashing start
/home/yadayada/.rvm/gems/ruby-2.2.0/gems/backports-3.6.4/lib/backports/std_lib.rb:9:in 'require': cannot load such file -- nokogiri (LoadError)
from /home/yadayada/.rvm/gems/ruby-2.2.0/gems/backports-3.6.4/lib/backports/std_lib.rb:9:in 'require_with_backports'
from /home/yadayada/Desktop/dashing/project/jobs/ActiveBuilds.rb:2:in '<top (required)>'
(...)
I could also post the HTML/CSS/CoffeeScript components of the Number widget, but I believe the problem lies in sample.rb, and the Number widget is completely default.
In case the code wasn't clear enough, what I'm trying to do is get the localhost page, count the number of "p" items (later it'll be the active builds when I switch to Jenkins; I haven't switched yet because I'm dealing with the certificates), then send it over to sample.rb, which will get the data and update it every 2 seconds on the dashboard display.
Any suggestions are welcome! Thanks in advance!
Found the solution:
- Uninstall and reinstall the nokogiri gem (without sudo).
- Put my crawler into the lib folder and require it inside the jobs.
- In the job itself, place everything inside the SCHEDULER block, like this:
# This job provides the data of the amount of active builds on Jenkins using the Number widget
# Updates every 2 seconds
SCHEDULER.every '2s' do
# Invokes the crawlers from the lib folder
Dir[File.dirname(__FILE__) + '/lib/*rb'].each { |file| require file }
# Create the ActiveBuilds reference
builds = ActiveBuilds.new
# Attributes the amount of active builds to the current valuation
current_valuation = builds.get_amount_of_active_builds
# Pass the current valuation to the last to present the change percentage on the dashboard
last_valuation = current_valuation
# Sends the values to the Number widget (widget id is valuation)
send_event('valuation', { current: current_valuation, last: last_valuation })
end
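One caveat about the job above: last_valuation is assigned from current_valuation after the new value has been fetched, so the two are always equal and the widget has no change to show. A minimal sketch of one way to keep the previous reading between ticks is below. ValuationTracker and record are hypothetical names and not part of Dashing; the sketch assumes the tracker is created once, outside the SCHEDULER.every block.

```ruby
# Hypothetical helper: remembers the previous reading between scheduler ticks.
class ValuationTracker
  def initialize
    @current = nil
  end

  # Records a new reading and returns a hash shaped like the payload
  # passed to send_event in the job above.
  def record(value)
    last = @current.nil? ? value : @current
    @current = value
    { current: @current, last: last }
  end
end

tracker = ValuationTracker.new
p tracker.record(5)  # first tick: no previous value yet, so last equals current
p tracker.record(7)  # second tick: last is the previous reading (5)
```

Inside the job this would become something like `send_event('valuation', tracker.record(builds.get_amount_of_active_builds))`.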

Find within the first 10?

I'm using Nokogiri to screen-scrape contents of a website.
I set fetch_number to specify the number of <divs> that I want to retrieve. For example, I may want the first(10) tweets from the target page.
The code looks like this:
doc.css(".tweet").first(fetch_number).each do |item|
title = item.css("a")[0]['title']
end
However, when fewer than 10 matching div tags are returned, it reports
NoMethodError: undefined method 'css' for nil:NilClass
This is because, when no matching HTML is found, it will return nil.
How can I make it return all the available data within 10? I don't need the nils.
UPDATE:
task :test_fetch => :environment do
require 'nokogiri'
require 'open-uri'
url = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
doc = Nokogiri::HTML(open(url) )
puts doc.css(".main-wrap .item").count
doc.css(".main-wrap .item").first(30).each do |item_info|
if item_info
href = item_info.at(".detail a")['href']
puts href
else
puts 'this is empty'
end
end
end
Returned results (note the end):
24
http://item.taobao.com/item.htm?id=41249522884
http://item.taobao.com/item.htm?id=40369253621
http://item.taobao.com/item.htm?id=40384876796
http://item.taobao.com/item.htm?id=40352486259
http://item.taobao.com/item.htm?id=40384968205
.....
http://item.taobao.com/item.htm?id=38843789106
http://item.taobao.com/item.htm?id=38843517455
http://item.taobao.com/item.htm?id=38854788276
http://item.taobao.com/item.htm?id=38825442050
http://item.taobao.com/item.htm?id=38630599372
http://item.taobao.com/item.htm?id=38346270714
http://item.taobao.com/item.htm?id=38357729988
http://item.taobao.com/item.htm?id=38345374874
this is empty
this is empty
this is empty
this is empty
this is empty
this is empty
count reports only 24 elements, but first(30) returns a 30-element array.
And it actually is not an array, but a Nokogiri::XML::NodeSet? I'm not sure.
title = item.css("a")[0]['title']
is bad practice.
Instead, consider using at or at_css rather than search or css when you only want the first match:
title = item.at('a')['title']
Next, if the <a> tag returned doesn't have a title parameter, Nokogiri and/or Ruby will be upset because the title variable will be nil. Instead, improve your CSS selector to only allow matches like <a title="foo">:
require 'nokogiri'
doc = Nokogiri::HTML('<body><a>foo</a><a title="bar">bar</a></body>')
doc.at('a').to_html # => "<a>foo</a>"
doc.at('a[title]').to_html # => "<a title=\"bar\">bar</a>"
Notice how the first, which is not constrained to look for tags with a title parameter returns the first <a> tag. Using a[title] will only return ones with a title parameter.
That means your loop over the values will never return nil, and you won't have a problem needing to compact them out of the returned array.
As a general programming tip, if you're getting nils like that, look at the code generating the array, because odds are good it's not doing it right. You should ALWAYS know what sort of results your code will generate. Using compact to clean up the array is a knee-jerk reaction to not having written the code correctly most of the time.
Here's your updated code:
require 'nokogiri'
require 'open-uri'
url = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
doc = Nokogiri::HTML(open(url) )
puts doc.css(".main-wrap .item").count
doc.css(".main-wrap .item").first(30).each do |item_info|
if item_info
href = item_info.at(".detail a")['href']
puts href
else
puts 'this is empty'
end
end
And here's what's wrong:
doc.css(".main-wrap .item").first(30)
Here's a simple example demonstrating why that doesn't work:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
In Nokogiri, search, css and xpath are equivalent, except that the first is generic and can take either CSS or XPath, while the last two are specific to that language.
doc.search('p') # => [#<Nokogiri::XML::Element:0x3fcf360ef750 name="p" children=[#<Nokogiri::XML::Text:0x3fcf360ef4f8 "foo">]>]
doc.search('p').size # => 1
doc.search('p').map(&:to_html) # => ["<p>foo</p>"]
That shows that the NodeSet returned by doing a simple search returns only one node, and what the node looks like.
doc.search('p').first(2) # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>, nil]
doc.search('p').first(2).size # => 2
Searching using first(n) returns "n" elements. If that many aren't found, Nokogiri fills them in using nil values.
This runs counter to what we'd assume first(n) would do, since Enumerable#first returns up to n elements and won't pad with nils. This isn't a bug, but it is unexpected behavior, since Enumerable's first sets the expectation for methods with that name. This is NodeSet#first, not Enumerable#first, so it does what it does until the Nokogiri authors change it. (You can see why it happens if you look at the source for that particular method.)
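For comparison, here is the up-to-n behavior described above, shown on a plain Ruby Array. This is a small sketch using only core Ruby, not Nokogiri; ITEMS is a hypothetical stand-in for a result set that holds fewer elements than requested.

```ruby
# A one-element list standing in for a short result set.
ITEMS = ['only one'].freeze

p ITEMS.first(2)       # => ["only one"]
p ITEMS.first(2).size  # => 1
p ITEMS[0, 2]          # => ["only one"]
p ITEMS[0..1]          # => ["only one"]
```

In each case asking for two elements from a one-element list yields one element, with no nil padding.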
Instead, slicing the NodeSet does show the expected behavior:
doc.search('p')[0..1] # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>]
doc.search('p')[0..1].size # => 1
doc.search('p')[0, 2] # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>]
doc.search('p')[0, 2].size # => 1
So, don't use NodeSet#first(n), use the slice form NodeSet#[].
Applying that, I'd write the code something like:
require 'nokogiri'
require 'open-uri'
URL = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
doc = Nokogiri::HTML(open(URL))
hrefs = doc.css(".main-wrap .item .detail a[href]")[0..29].map { |anchors|
anchors['href']
}
puts hrefs.size
puts hrefs
# >> 24
# >> http://item.taobao.com/item.htm?id=41249522884
# >> http://item.taobao.com/item.htm?id=40369253621
# >> http://item.taobao.com/item.htm?id=40384876796
# >> http://item.taobao.com/item.htm?id=40352486259
# >> http://item.taobao.com/item.htm?id=40384968205
# >> http://item.taobao.com/item.htm?id=40384816312
# >> http://item.taobao.com/item.htm?id=40384600507
# >> http://item.taobao.com/item.htm?id=39973451949
# >> http://item.taobao.com/item.htm?id=39861209551
# >> http://item.taobao.com/item.htm?id=39545678869
# >> http://item.taobao.com/item.htm?id=39535371171
# >> http://item.taobao.com/item.htm?id=39509186150
# >> http://item.taobao.com/item.htm?id=38973412667
# >> http://item.taobao.com/item.htm?id=38910499863
# >> http://item.taobao.com/item.htm?id=38942960787
# >> http://item.taobao.com/item.htm?id=38910403350
# >> http://item.taobao.com/item.htm?id=38843789106
# >> http://item.taobao.com/item.htm?id=38843517455
# >> http://item.taobao.com/item.htm?id=38854788276
# >> http://item.taobao.com/item.htm?id=38825442050
# >> http://item.taobao.com/item.htm?id=38630599372
# >> http://item.taobao.com/item.htm?id=38346270714
# >> http://item.taobao.com/item.htm?id=38357729988
# >> http://item.taobao.com/item.htm?id=38345374874
Try this
doc.css(".tweet").first(fetch_number).each do |item|
title = item.css("a")[0]['title'] rescue nil
end
Let me know whether it works. It will not raise an error.
Try compact.
[1, nil, 2, nil, 3].compact # => [1, 2, 3]
http://www.ruby-doc.org/core-2.1.3/Array.html#method-i-compact
(ie: first(fetch_number).compact.each do |item|)

Scrape image "alt" tag and export to CSV

I'm trying to scrape the "alt" tags from several hundred images on a webpage, then output them to a CSV file. This is essentially the entire lump of HTML I'm looking to scrape:
<div class="product-card"
id="product-35492907"
data-element="product-card"
data-owner="some-data-owner"
data-product-slug="some-data-product-slug"
data-product_id="35492907"
data-stock-status="available"
data-icon-enabled="false"
data-retailer-id="2248">
<a class="product-card-image-link"
href="some href"
data-lead-popup
data-lead-popup-url="/track/lead/21716944/?ctx=2383"
>
<img class="product-card-image draggable"
data-pin-no-hover="true"
src="some src"
data-height="250" data-width="200"
height="250" width="200"
alt="SCRAPE ME" # <<<<< here's the guy I'm after
data-product_id="35492907"
/>
</a>
Below is some code I have been using to scrape elements:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'
url = "http://www.example.com/page"
page = Nokogiri::HTML(open(url))
CSV.open("productResults.csv", "wb") do |csv|
page.css('.product-card-image draggable').each do |scrape| #???
alt_name = scrape.at_css('alt').text #???
scrapedProducts = "#{alt_name}"
csv << [scrapedProducts]
end
end
Start simple and get more complex if necessary:
require 'nokogiri'
require 'csv'
page = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<div class="product-card"
id="product-35492907"
data-element="product-card"
data-owner="some-data-owner"
data-product-slug="some-data-product-slug"
data-product_id="35492907"
data-stock-status="available"
data-icon-enabled="false"
data-retailer-id="2248">
<a class="product-card-image-link"
href="some href"
data-lead-popup
data-lead-popup-url="/track/lead/21716944/?ctx=2383"
>
<img class="product-card-image draggable"
data-pin-no-hover="true"
src="some src"
data-height="250" data-width="200"
height="250" width="200"
alt="SCRAPE ME" # <<<<< here's the guy I'm after
data-product_id="35492907"
/>
</a>
EOT
Search for the appropriate <img> tag and output its 'alt' parameter's value:
page.css('img.product-card-image').each do |img|
puts img['alt']
end
# >> SCRAPE ME
Modifying it to output to the CSV file:
CSV.open("productResults.csv", "wb") do |csv|
page.css('img.product-card-image').each do |img|
csv << [img['alt']]
end
end
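If some images have no alt attribute, img['alt'] returns nil for them. The sketch below shows one way to handle that, using CSV.generate so the result can be inspected as a string. The alt values, the "alt" header row and the helper name alts_to_csv are all hypothetical, and the input is a plain array standing in for what the loop above would collect.

```ruby
require 'csv'

# Hypothetical alt values as they might come back from img['alt'];
# nil stands for an <img> without an alt attribute.
ALT_VALUES = ['SCRAPE ME', nil, 'Another product'].freeze

# Hypothetical helper: builds CSV text with a header row,
# skipping images that have no alt text.
def alts_to_csv(alts)
  CSV.generate do |csv|
    csv << ['alt']
    alts.compact.each { |alt| csv << [alt] }
  end
end

puts alts_to_csv(ALT_VALUES)
# alt
# SCRAPE ME
# Another product
```

Swapping CSV.generate for CSV.open("productResults.csv", "wb") writes the same rows to a file instead.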

Determining if image exists in Rails 2

I am trying to check if a file exists so that I can either display the image or a placeholder but the placeholder is always shown. If the conditional statement is removed then the logo is displayed fine.
<% if File.exists?(Rails.root + '/public/images/portal/logos/' + @organisation_id + '.png') %>
<img src="/images/portal/logos/<%= @organisation_id %>.png" alt="<%= @person.organisation.name %>">
<% else %>
<img src="http://placehold.it/300x83&text=Please+upload+your+company+logo">
<% end %>
I've read a few questions but most seem to relate to Rails 3 but seeing as I don't get any errors I thought this would work.
Does Rails.root work in Rails 2? It may be RAILS_ROOT.
Rails.root returns a Pathname. Adding an absolute path to a Pathname removes the existing path in the Pathname.
I.e.:
Rails.root #=> #<Pathname:/foo/bar>
Rails.root + "baz" #=> #<Pathname:/foo/bar/baz>
Rails.root + "/baz" #=> #<Pathname:/baz>
If you do
Rails.root + "public/images/portal/logos/#{@organisation_id}.png"
it should work. Or perhaps even better:
Rails.root.join("public/images/portal/logos", "#{@organisation_id}.png")
Compare this to RAILS_ROOT which returns a String:
RAILS_ROOT #=> "/foo/bar"
RAILS_ROOT + "baz" #=> "/foo/barbaz"
RAILS_ROOT + "/baz" #=> "/foo/bar/baz"
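The Pathname behavior can be checked outside Rails with the standard library. In this minimal sketch ROOT is a stand-in for Rails.root and ORGANISATION_ID is a hypothetical value.

```ruby
require 'pathname'

ROOT = Pathname.new('/foo/bar')  # stand-in for Rails.root
ORGANISATION_ID = 42             # hypothetical id

puts ROOT + 'baz'    # /foo/bar/baz
puts ROOT + '/baz'   # /baz  (an absolute right-hand side replaces the left)
puts ROOT.join('public/images/portal/logos', "#{ORGANISATION_ID}.png")
# /foo/bar/public/images/portal/logos/42.png
```

This is why the check in the question always fell through to the placeholder: the leading slash made the path resolve to /public/images/... at the filesystem root.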

Resources