Inconsistent results trying to parse og:image tag from a webpage manually and programmatically - html-parsing

I first manually browse to the below URL:
Mounting injuries won't stop Germany's path to World Cup
Then, if I view the page source and look for og:image meta tags, I find the following:
<meta property="og:image" content="http://l.yimg.com/bt/api/res/1.2/JjwtkhIEdT9nKxLp8p0LFQ--/YXBwaWQ9eW5ld3M7cT04NTt3PTYwMA--/http://media.zenfs.com/en_us/News/Reuters/2013-10-08T122032Z_1_CBRE9970YAZ00_RTROPTP_2_SOCCER-WORLD.JPG"/>
However, if I try to parse the same url programmatically, I get a generic Yahoo stock icon. Here is the code that I am using:
string url = "http://sports.yahoo.com/news/mounting-injuries-wont-stop-germanys-path-world-cup-122032650--sow.html";
WebClient wc = new WebClient();
var doc = new HtmlAgilityPack.HtmlDocument();
string newsPageSource = wc.DownloadString(url);
doc.LoadHtml(newsPageSource);
...
(I have removed the rest for brevity.)
If I debug here and inspect the newsPageSource string that contains the content of the target web page and look for og:image tag, its contents are different:
<meta property="og:image" content="http://l.yimg.com/bt/api/res/1.2/81I5U991YW6EEaB2Cjd58g--/YXBwaWQ9eW5ld3M7cT04NTt3PTYwMA--/http://l.yimg.com/os/mit/media/m/social/images/social_default_logo-1481777.png"/>
I am not sure what is going on here. My guess is that, when browsing manually, the original URL redirects to some other internal URL, whereas the code just grabs the first "snapshot" of the page source without waiting and following any redirects. Can anyone shed light on this? Or better yet, how would I extract the real image (2013-10-08T122032Z_1_CBRE9970YAZ00_RTROPTP_2_SOCCER-WORLD.JPG) in this case instead of getting a Yahoo stock icon (social_default_logo-1481777.png)?
Somehow Facebook and Google+ are smart enough to extract the correct image when I paste the same link.
Thanks,
Archil
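A minimal Python sketch of the extraction step, for comparing what different clients receive. It rests on an assumption the question does not confirm: that the server may vary its response by request headers such as User-Agent. The names OgImageParser, og_images and fetch are hypothetical helpers, not part of any library.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class OgImageParser(HTMLParser):
    """Collects the content of every <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if attr.get("property") == "og:image" and attr.get("content"):
            self.images.append(attr["content"])


def og_images(html):
    """Return all og:image URLs found in an HTML string."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.images


def fetch(url, user_agent):
    """Download a page while sending an explicit User-Agent header.

    Assumption: the header may influence which og:image the server emits.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling og_images(fetch(url, ua)) once with a browser-like User-Agent string and once with a bare one would show whether the two responses really differ in their og:image value.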

Related

Google SDTT appending "#__sid=md3" to URL for mainEntityOfPage

Why is this happening?
HTML shows:
<meta content='http://www.costumingdiary.com/2015/05/freddie-mercury-robe-francaise.html' itemprop='mainEntityOfPage' itemscope='itemscope'/>
Structured Data Testing Tool output shows:
http://www.costumingdiary.com/2015/05/freddie-mercury-robe-francaise.html#__sid=md3
Update: It looks like it has to do with my breadcrumb list. But still, why is it happening, and is it wrong?
If the URL you want to provide is unique you can use the itemid property.
I was confronted with mainEntityOfPage by the tool after the latest update. Following Google's example, I used the following code:
<meta itemscope itemprop="mainEntityOfPage" itemType="https://schema.org/WebPage" itemid="https://blog.hompus.nl/2015/12/04/json-on-a-diet-how-to-shrink-your-dtos-part-2-skip-empty-collections/" />
This shows up correctly in the Structured Data Testing Tool results for my blog.
I don’t know where the fragment #__sid=md3 is coming from, but as the SDTT had some quirks with BreadcrumbList in the past, it might also be a side effect of this.
But note that if you want to provide a URL as value for the mainEntityOfPage property, you must use a link element instead of a meta element:
<link itemprop="mainEntityOfPage" href="http://www.costumingdiary.com/2015/05/freddie-mercury-robe-francaise.html" />
(See examples for Microdata markup that creates an item value, instead of a URL value, for mainEntityOfPage.)

iOS RSSFeed, How to fech feed automatic from website

I am working on a news-based application in which I want to fetch a site's feed dynamically, just by typing the website's name.
For example, if I want to fetch the feed from CNN.com, BBCNEWS.com, etc., I should only have to type the website name, such as "BBC.com", in a text box instead of its RSS URL:
http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml.
I know how to fetch a feed from a static link, but I want to do it dynamically.
I have searched a lot for this but haven't found an answer. I have seen it done in the Feedly application.
So, if anybody knows how, please help me with this issue.
RSS comes with a mechanism called Auto-Discovery, which links RSS feeds to an HTML page.
It relies on the use of a <link> element in the <head> section of any HTML page.
The <link> tag includes 4 important attributes:
rel should include alternate, which tells the application that the linked document contains an alternate view of the current document/page. You can also use the feed value, even though, in our experience, this is much less frequent. Using both is probably a safe bet.
type indicates the MIME type of this alternate representation. RSS uses application/rss+xml while Atom uses application/atom+xml.
title is a human description of the document. It’s good to re-use the page’s title. Do not add RSS as it’s meaningless for people :)
href is the most important attribute: it’s the URL (relative or absolute) of the feed.
Here is, for example, the discovery link for this very page's RSS feed:
<link rel="alternate" type="application/atom+xml" title="Feed for question 'iOS RSSFeed, How to fech feed automatic from website'" href="/feeds/question/32946522">
It's a great example!
In the HTML of the site, you'll find a snippet like this
<link rel='alternate' type='application/rss+xml' title='RSS' href='http://feeds.feedburner.com/martini'>
That's where the RSS URL comes from.
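The auto-discovery rules described above can be sketched in Python as follows. This is a minimal sketch, not how Feedly or any particular app does it; FeedLinkParser and discover_feeds are hypothetical names, and the page HTML is assumed to have been downloaded already.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types named in the answer above for RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised through <link> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").lower().strip()
        href = attr.get("href", "")
        if href and mime in FEED_TYPES and ("alternate" in rel or "feed" in rel):
            # href may be relative, so resolve it against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    """Return absolute feed URLs declared in the given HTML."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

Given the home page HTML of the site the user typed, discover_feeds returns the candidate feed URLs; an empty list means the page does not advertise a feed this way.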

Google Earth KML - href fragment URL getting cut off at the # - won't open in browser

I have a KML file that includes a list of placemarks. In the placemark description I have links that point to a webpage I want users to open in a browser. The href points to a fragment URL, meaning it has a '#' as a delimiter, followed by a parameter related to the placemark. When I view the placemark balloon I see the clickable link, but when I click it sends the URL to the browser cutting off the '#' and the parameter that follows. However if I right-click on the link, copy link location, and paste it into a browser it works fine...I'd like to avoid those few extra steps though.
The link looks like this: mywebsite/directory#12345678
but it opens in the browser like this: mywebsite/directory
which doesn't work.
From some searching around I see the # is used to enable fly to features (see below). Is there a workaround or fix so that I can make google earth send the complete fragment URL to the browser, without cutting off the # and parameter?
--from Google Earth developers group
Other Behavior Specified Through Use of the <a> Element
KML supports the use of two attributes within the <a> element: href and type.
The anchor element contains an href attribute that specifies a URL.
If the href is a KML file and has a .kml or .kmz file extension, Google Earth loads that file directly when the user clicks it. If the URL ends with an extension not known to Google Earth (for example, .html), the URL is sent to the browser.
The href can be a fragment URL (that is, a URL with a # sign followed by a KML identifier). When the user clicks a link that includes a fragment URL, by default the browser flies to the Feature whose ID matches the fragment. If the Feature has a LookAt or Camera element, the Feature is viewed from the specified viewpoint.
The behavior can be further specified by appending one of the following three strings to the fragment URL:
• ;flyto (default) - fly to the Feature
• ;balloon - open the Feature's balloon but do not fly to the Feature
• ;balloonFlyto - open the Feature's balloon and fly to the Feature
I'd greatly appreciate any ideas, suggestions, or workarounds!
If the target URL "mywebsite/directory" returns an HTML document that defines an anchor with the matching id (e.g. 12345678), then it can normally be reached by clicking a link in the KML from Google Earth.
There may be an issue with how the URL is escaped in the KML. Wrapping the HTML in the feature description in a CDATA block sometimes helps.
Here's an example where a '#' in the URL of a KML link works as you'd expect.
KML
<?xml version='1.0'?>
<kml xmlns='http://www.opengis.net/kml/2.2'>
<Placemark>
<description>
<![CDATA[
Visiting a linked resource.
See reference
]]>
</description>
</Placemark>
</kml>
Target HTML links.html
<html>
...
<h3><a name="h-12.1.1">12.1.1</a> Visiting a linked resource</h3>
...
</html>
The behavior might differ depending on whether the web browser is configured as external or internal to Google Earth. In the Tools/Options/General menu, check or uncheck the option "Show web results in external browser" to see if the behavior changes.
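The CDATA suggestion can be sketched in Python as a small generator for a Placemark whose balloon holds a link with a '#' fragment. This is only a sketch: placemark_kml and description_of are hypothetical helpers, the URL in the usage note is a placeholder, and nothing here guarantees how Google Earth hands the link to the browser.

```python
import html
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

KML_NS = "http://www.opengis.net/kml/2.2"


def placemark_kml(name, link_url, link_text):
    """Build a one-Placemark KML document whose balloon contains a link."""
    # The HTML goes inside CDATA so its markup is not parsed as KML.
    anchor = '<a href="{}">{}</a>'.format(
        html.escape(link_url, quote=True), html.escape(link_text)
    )
    # "]]>" would end the CDATA section early, so reject it outright.
    if "]]>" in anchor:
        raise ValueError("description must not contain ']]>'")
    return (
        "<?xml version='1.0' encoding='UTF-8'?>\n"
        f"<kml xmlns='{KML_NS}'>\n"
        "<Placemark>\n"
        f"<name>{escape(name)}</name>\n"
        f"<description><![CDATA[{anchor}]]></description>\n"
        "</Placemark>\n"
        "</kml>\n"
    )


def description_of(kml_text):
    """Parse the KML back and return the Placemark description text."""
    root = ET.fromstring(kml_text.encode("utf-8"))
    return root.find(f"{{{KML_NS}}}Placemark/{{{KML_NS}}}description").text
```

For example, placemark_kml("Demo", "https://example.com/links.html#h-12.1.1", "Visiting a linked resource") yields a document whose description keeps the '#h-12.1.1' fragment intact after an XML round trip, which rules out escaping in the KML itself as the cause.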

Understanding og:url

I am working through the Facebook tutorial for iOS and am having trouble when I get to the final part, Publish Open Graph Story. I have gone through and set everything up as best I understand. When I try to test using the Object Debugger, I get "Missing Required Property: The 'og:url' property is required, but not present." Can someone help me and explain this tag and how it should be set?
Thanks for the help.
Have a look at ogp.me, which defines og:url as:
og:url - The canonical URL of your object that will be used as its
permanent ID in the graph, e.g.,
"http://www.imdb.com/title/tt0117500/".
Basically, as Jeff Sherlock of Facebook explains in this post: https://stackoverflow.com/a/7831012/228741
When you give the URL of your action (the one containing the meta tags), Facebook ignores everything else on that page (it doesn't render it) and instead renders whatever URL you have given in og:url.
What I usually do is have my og:url point to the same page with the same parameters, so Facebook renders that same page for me. If you want some other page to be rendered, give its link in og:url.
This is set as a meta tag in the <head> section.
Example:
<meta property="og:url" content="your url">
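To check a page locally before running it through the Object Debugger, a rough Python sketch like the one below can list which required Open Graph properties are absent. The four names in REQUIRED are the ones ogp.me lists as required; OpenGraphParser and missing_og_properties are hypothetical helper names, and this check is only an approximation of what Facebook's debugger does.

```python
from html.parser import HTMLParser

# The four properties ogp.me lists as required for every page.
REQUIRED = ("og:title", "og:type", "og:image", "og:url")


class OpenGraphParser(HTMLParser):
    """Collects og:* properties declared through <meta> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("property") or ""
        if name.startswith("og:") and attr.get("content"):
            # Keep the first value if a property is repeated.
            self.properties.setdefault(name, attr["content"])


def missing_og_properties(html):
    """Return the required Open Graph properties the HTML does not set."""
    parser = OpenGraphParser()
    parser.feed(html)
    return [name for name in REQUIRED if name not in parser.properties]
```

A page that sets og:title, og:type and og:image but no og:url makes missing_og_properties return ["og:url"], which matches the error message in the question.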

How to retrieve web site favicons?

I am using Ruby on Rails v3.0.9 and I would like to retrieve the favicon.ico image of each web site for which I set a link.
That is, if I set the http://www.facebook.com/ URL in my application, I would like to retrieve Facebook's icon and insert it in my web pages. Of course, I would like to do the same for all other web sites.
How can I retrieve favicons from web sites automatically? By "automatically" I mean searching a web site for its favicon and getting the link to it, since I assume not every web site has a favicon named exactly 'favicon.ico'.
P.S.: What I would like to build is something like what Facebook does when you add a link/URL to your Facebook page: it recognizes the related web site logo and then attaches it to the link/URL.
http://getfavicon.appspot.com/ works great for fetching favicons. Just give it the url for the site and you'll get the favicon back:
http://g.etfv.co/http://www.google.com
I recently wrote a similar solution.
If we want to find a favicon URL that may not be an .ico file and may not be in the site root, we have to parse the target site's HTML.
In Ruby on Rails, I used the nokogiri gem for HTML parsing.
First, we look for meta tags whose itemprop attribute contains the keyword image. This covers sites that use the https://schema.org/WebPage markup, which is a more modern approach than a plain link tag.
If we find one, we can use its content attribute as the favicon URL, but we should check that it is a well-formed URL, just to be sure.
If we can't find such a meta tag, we search for standard link tags whose rel attribute contains the keyword icon. This is the standard W3C approach (https://www.w3.org/2005/10/howto-favicon).
Here is some code from my solution:
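require 'nokogiri'
require 'open-uri'
require 'uri'

# Returns the absolute favicon URL declared by the page at +site+, or nil.
def site_icon_link(site)
  doc = Nokogiri::HTML(URI.open(site))

  # Prefer schema.org microdata: <meta itemprop="image" content="...">
  meta = doc.at_css('meta[itemprop*=image]')
  # Fall back to the standard <link rel="icon" href="..."> declaration
  link = doc.at_css('link[rel*=icon]')
  url = (meta && meta['content']) || (link && link['href'])
  return nil if url.nil? || url.empty?

  # Resolve relative paths against the site URL; absolute URLs pass through
  URI.join(site, url).to_s
rescue URI::InvalidURIError
  nil
end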
Favicons are found in two ways. First, there is the 'hardcoded', traditional name: 'http://example.com/favicon.ico'.
Second, HTML pages may define the favicon in their <head> sections with <link rel="icon"...> and a few other variants. (You may want to read the Wikipedia article about favicons.)
So your code can fetch the main page of the given website, parse it and check whether it contains the proper <link> tags, and then, as a fallback, try the "hardcoded" favicon.ico name.
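The two-step lookup described above could be sketched in Python like this. It is a minimal sketch that assumes the page HTML is already downloaded; IconLinkParser and favicon_candidates are hypothetical names, and variants such as apple-touch-icon are deliberately ignored.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IconLinkParser(HTMLParser):
    """Collects href values of <link> tags whose rel includes "icon"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel_tokens = attr.get("rel", "").lower().split()
        # Matches rel="icon" and rel="shortcut icon".
        if "icon" in rel_tokens and attr.get("href"):
            self.hrefs.append(attr["href"])


def favicon_candidates(html, page_url):
    """Return favicon URLs to try, declared links first, /favicon.ico last."""
    parser = IconLinkParser()
    parser.feed(html)
    candidates = [urljoin(page_url, href) for href in parser.hrefs]
    fallback = urljoin(page_url, "/favicon.ico")
    if fallback not in candidates:
        candidates.append(fallback)
    return candidates
```

The caller would then request each candidate in order and keep the first one that actually returns an image.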
I think I misunderstood your question...
Do you want to grab a favicon from another site and make it yours?
If that's what you want, you can get it directly from the site's root URL and save it in your public folder.
For example, the favicon for www.facebook.com is at www.facebook.com/favicon.ico.
Take that image and save it with the name favicon in your public folder.
That should be sufficient.
If you want it dynamically, you can use jQuery, but if you want it static, you can put in an image tag pointing to [root url of the website]/favicon.ico,
like this: <%= image_tag "#{website.url}/favicon.ico" %>
With JavaScript (jQuery), like this: http://jsfiddle.net/aX8T4/
Can't you just use a regular img tag with the src attribute pointing to the favicon?
<img src="http://www.facebook.com/favicon.ico">
This assumes the browser recognizes a .ico file as an image. Helper methods would probably work with this too.
You can do it easily with pismo gem.
Quick example to get the url of Facebook's favicon:
Pismo::Document.new('http://www.facebook.com/').favicon
Here's my Ruby method, which strips the path off a URL, appends /favicon.ico, and produces an image tag.
require 'uri'

# Replaces the path of +url+ with /favicon.ico and renders an image tag (Rails helper).
def favicon_for(url)
  image_tag URI.join(url, '/favicon.ico').to_s, width: '16px', height: '16px'
end
