I'm using Ruby on Rails 2.3.8 and the Hpricot plugin for parsing HTML.
I would like to get the thumbnails of embedded videos. Searching the internet, I found that at least YouTube and Vimeo use the Open Graph (OG) protocol, which provides meta tags that contain the video info (URL, thumbnail, etc.).
For example, if I had this video, I could read the following meta tag using the Hpricot plugin:
<meta property="og:image" content="http://b.vimeocdn.com/ts/101/345/101345354_200.jpg" />
So, using Hpricot I should be able to parse it as follows:
video_url = "http://vimeo.com/16430948"
video_page = Hpricot.parse(open(video_url))
element = video_page.search("//meta[@property='og:image']")
But I get an empty element instead.
Note: if you search for video_page.search("//meta"), the tag I want is in the returned list, but the previous syntax does not find it.
Could anybody tell me how I can solve this?
I came across this question whilst having a similar problem with Hpricot and meta data.
In the end I had to change the xpath from //meta to /html/head to get my scraping working. Trying the same here seems to work.
video_page.at('/html/head/meta[@property="og:image"]')['content']
Returns your image's URL.
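The attribute predicate can be tried in isolation with a minimal sketch. It assumes REXML (shipped with Ruby) in place of Hpricot, and a small hand-written, well-formed page in place of the real Vimeo HTML, so it only demonstrates the XPath syntax and not Hpricot's behaviour:

```ruby
require "rexml/document"

# Stand-in for the downloaded video page (an assumption for illustration;
# real pages are rarely well-formed XML and need a tolerant HTML parser).
html = <<~HTML
  <html>
    <head>
      <title>Sample video</title>
      <meta property="og:title" content="Sample video" />
      <meta property="og:image" content="http://b.vimeocdn.com/ts/101/345/101345354_200.jpg" />
    </head>
    <body></body>
  </html>
HTML

doc = REXML::Document.new(html)

# "@property" selects on the attribute value; the absolute path mirrors
# the /html/head/meta form used in the answer above.
meta = REXML::XPath.first(doc, '/html/head/meta[@property="og:image"]')
thumbnail_url = meta && meta.attributes["content"]

puts thumbnail_url
```

The `meta &&` guard keeps the lookup from raising when a page has no og:image tag.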
I have implemented dynamic JSON-LD generation to boost my SEO. The JSON is built with Jbuilder (the code is in a partial) and rendered in a script tag with a type of "application/ld+json". All of it is wrapped in a content_for so that I can reuse the logic.
Once it was implemented, I started getting this error in my console: "[Facebook Pixel] - Unable to parse JSON-LD tag. Malformed JSON found: ' "
I tested my JSON-LD with Google's structured data tool and everything came back OK.
When I put a hand-written JSON-LD in my script tag instead of the aforementioned logic, everything looked OK. No error was displayed in the console, and the Chrome Facebook Pixel Helper was able to find my JSON-LD.
Bottom line: it appears that using my dynamic logic with the partials creates a stray " ' ", which makes no sense to me.
Has anyone had the same issue, or something similar?
Maybe the templating engine is messing things up. You might consider using the json-ld gem to validate the output as part of continuous integration (you can also semantically validate the content using other gems).
I've had success using JSON-LD in Haml, but I just call to_json on a Hash hierarchy, which has always worked well for me.
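The Hash-based approach can be sketched as follows. This is a minimal illustration using only Ruby's standard json library. The helper name json_ld_script_tag and the sample product data are hypothetical, and in a Rails view the resulting string would still need to be marked as HTML-safe:

```ruby
require "json"

# Hypothetical helper: serialise a plain Hash once, then interpolate the
# result into the script tag, so no template logic runs inside the JSON.
def json_ld_script_tag(data)
  json = JSON.generate(data)
  # Escape "</" so a string value cannot close the script element early.
  # The block form avoids backslash handling in gsub replacement strings.
  json = json.gsub("</") { "<\\/" }
  %(<script type="application/ld+json">#{json}</script>)
end

# Illustrative data only; not taken from the question.
structured_data = {
  "@context" => "https://schema.org",
  "@type" => "Product",
  "name" => "Example widget",
  "description" => "Text containing </script> inside a value"
}

tag = json_ld_script_tag(structured_data)
puts tag
```

Parsing the tag's content back with JSON.parse in a test is a cheap way to catch stray characters before a browser or the Facebook Pixel does.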
How can I insert an existing PDF into a Prawn generated document? I am generating a pdf for a bill (as a view), and that bill can have many attachments (png, jpg, or pdf). How can I insert/embed/include those external pdf attachments in my generated document? I've read the manual, looked over the source code, and searched online, but no luck so far.
The closest hint I've found is to use ImageMagick or something similar to convert the pdf to another format, but since I don't need to resize/manipulate the document, that seems wasteful. The old way to do it seems to be through templates, but my understanding is that the code for templating is unstable.
Does anyone know how to include PDF pages in a Prawn generated PDF? If Prawn won't do this, do you know of any supplementary gems that will? If someone can point me towards something like prawn-templates but more reliable, that would be awesome.
Edit: I am using prawnto and prawn to render PDF views in Rails 4.2.0 with Ruby 2.2.0.
Strategies that I've found but that seem inapplicable/too messy:
Create a jpg preview of a PDF on upload, include that in the generated document (downsides: no text selection/searching, expensive). This is currently my favorite option, but I don't like it.
prawn-templates (downside: unstable, unmaintained codebase; this is a business-critical application)
Merge PDFs through a gem like 'combine-pdf'–I can't figure out how to make this work for rendering a view with the external PDFs inserted at specific places (the generated pdf is a collection of bills, and I need them to follow the bill they're attached to)
You're right about the lack of existing documentation for this - I found only this issue from 2010 which uses the outdated methods you describe. I also found this SO answer which does not work now since Prawn dropped support for templates.
However, the good news is that there is a way to do what you want with Ruby! What you will be doing is merging the PDFs together, not "inserting" PDFs into the original PDF.
I would recommend this library, combine_pdf, to do so. The documentation is good, so doing what you want would be as simple as:
my_prawn_pdf = CombinePDF.new
my_prawn_pdf << CombinePDF.load("my_bill_pdf.pdf") # load reads a PDF from a file path
my_prawn_pdf << CombinePDF.load("attachment.pdf")
my_prawn_pdf.save "combined.pdf"
Edit
In response to your questions:
"I'm using Prawn to render a pdf view in Rails, which means that I don't think I get that kind of post-processing"
You do! If you look at the documentation for combine_pdf, you'll see that loading from memory is the fastest way to use the gem - the documentation even explicitly says that Prawn can be used as input.
"I'm not just tacking the PDFs to the end: a bill attachment must directly follow the generated page(s) for a bill"
The combine_pdf gem isn't just for adding pages on the end. As the documentation shows, you can cycle through a PDF adding pages when you want to, for example:
my_pdf # previously defined
new_pdf = CombinePDF.new
i = 0 # page counter; must be initialised before the loop
my_pdf.pages.each do |page|
  i += 1
  new_pdf << page if i == bill_number # or however you want to handle this logic
end
new_pdf.save "new_pdf.pdf"
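The ordering requirement itself (each attachment directly after the pages of its own bill) can be sketched without any PDF library. In this illustration the Bill struct, the ordered_pages helper and the string "pages" are all hypothetical stand-ins. With combine_pdf, each element of the resulting list would presumably be appended with `new_pdf << ...` in the same order:

```ruby
# Hypothetical stand-in: a bill has generated pages plus zero or more
# attachments, and each attachment is itself a list of pages.
Bill = Struct.new(:pages, :attachments)

# Returns every page in final document order: for each bill, its own
# pages first, then all pages of its attachments, before the next bill.
def ordered_pages(bills)
  bills.flat_map do |bill|
    bill.pages + bill.attachments.flatten
  end
end

bills = [
  Bill.new(["bill-1 p1", "bill-1 p2"], [["receipt-a p1"]]),
  Bill.new(["bill-2 p1"], []),
  Bill.new(["bill-3 p1"], [["scan-b p1", "scan-b p2"], ["scan-c p1"]])
]

merged = ordered_pages(bills)
puts merged
```

Computing the order first and appending second keeps the placement logic testable separately from the PDF handling.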
I am connecting to a webpage using HtmlUnit and I want to read the text in between the tags. I will demonstrate using some code. Let's suppose I have the following link:
<a href="www.anypage.com">Hello!</a>
I would like to read the Hello! that's in between, preferably saved into a String variable. Here is the code essential for the task:
// Simulating a Chrome browser
WebClient webClient = new WebClient(BrowserVersion.CHROME);
HtmlPage loggedIn = webClient.getPage("random-page.com");
HtmlAnchor anchorLink = loggedIn.getAnchorByHref("/private-messages/inbox");
Now if I use anchorLink.toString() I get <a href="www.anypage.com"> from the previous example, but nothing about the characters in between the tags. I have gone through the API and I can't seem to find anything useful. Any workarounds?
Would getTextContent() be what you are looking for?
I have a problem getting statistics from the YouTube Data API. I make a request to http://gdata.youtube.com/feeds/api/videos?q=video_id&alt=json. It works for some video IDs, but for others the response is missing 'entry', 'yt$statistics', or 'gd$rating'. For example:
zLcbznigfs missing 'entry', aVfN6XjACDY missing 'yt$statistics', fjhQ9Kf4iHk missing 'gd$rating'
After digging around, I found a solution: use &alt=atom instead of &alt=json, that is, read the Atom feed rather than the JSON one (feedparser is an excellent module for this). I have checked this with several video IDs and it works fine.
Hope that helps. Thanks.
I'm attempting to parse Media RSS feeds that contain media:* elements, but it seems as though all of the standard RSS parsing libraries for Ruby only support enclosures, not MRSS elements.
I've tried:
SimpleRSS
RSS::Parser
Syndication::RSS::Parser
Ideally, I'd like something that makes it simple to extract elements such as media:thumbnail, similar to how I can extract an entry's enclosure.
http://github.com/cardmagic/simple-rss seems to support Media RSS to some degree.
For example:
pp rss.entries.last
{
...
:media_content_url=>"...",
:media_content_type=>"image/jpeg",
:media_content_height=>"426",
:media_content_width=>"640",
:media_thumbnail_url=>"...",
:media_thumbnail_height=>"133",
:media_thumbnail_width=>"200"
}
(Unfortunately, with the feed I'm testing it with, it seems to be only taking the first media:content tag inside of the media:group, even though the media:group has 2 media:content tags.)
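When a library surfaces only the first media:content, one possible fallback is to read the media:* elements directly. The sketch below assumes REXML (shipped with Ruby) and a small invented feed. The URLs are placeholders, and this is not how simple-rss works internally:

```ruby
require "rexml/document"

# Namespace URI used by Media RSS for the "media" prefix.
MEDIA_NS = { "media" => "http://search.yahoo.com/mrss/" }

# Invented sample feed for illustration; the URLs are placeholders.
feed_xml = <<~XML
  <rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
      <item>
        <title>Sample entry</title>
        <media:group>
          <media:content url="http://example.com/a.jpg" type="image/jpeg" width="640" height="426"/>
          <media:content url="http://example.com/b.jpg" type="image/jpeg" width="320" height="213"/>
        </media:group>
        <media:thumbnail url="http://example.com/t.jpg" width="200" height="133"/>
      </item>
    </channel>
  </rss>
XML

doc = REXML::Document.new(feed_xml)

items = REXML::XPath.match(doc, "//item").map do |item|
  thumbnail = REXML::XPath.first(item, "media:thumbnail", MEDIA_NS)
  contents = REXML::XPath.match(item, "media:group/media:content", MEDIA_NS)
  {
    title: item.elements["title"].text,
    thumbnail_url: thumbnail && thumbnail.attributes["url"],
    # Collect every media:content in the group, not just the first one.
    content_urls: contents.map { |content| content.attributes["url"] }
  }
end

puts items.inspect
```

Passing the namespace mapping explicitly keeps the XPath working even if a feed binds the Media RSS namespace to a different prefix.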