Parsing of video node fails - facebook-instant-articles

Facebook's Instant Articles validator complains: "/html/body/article/header/figure/video: Parsing this node failed. Please verify your formatting for the code below." I can't see anything wrong with the markup:
<header>
<h1>Politimester gikk av etter Monika-saken – har rett til millionlønn «på livstid»</h1>
<h3 class="op-kicker">«Grov uforstand i tjenesten», konkluderte gransking</h3>
<p>Geir Gudmundsen (61) er nå rådgiver i Politidirektoratet, men lønnes som politimester. POD bekoster både bolig i Oslo og hjemreisene til Bergen for eks-politimesteren, som er pendler.</p>
<time class="op-published" datetime="2016-08-30T20:30:37"></time>
<time class="op-modified" datetime="2016-08-31T09:47:13"></time>
<figure>
<video>
<source src="http://our.secretdomain.com/video.mp4">
</video>
<figcaption>Geir Gudmundsen (61) er nå rådgiver i Politidirektoratet, men lønnes som politimester. POD bekoster både bolig i Oslo og hjemreisene til Bergen for eks-politimesteren, som er pendler.</figcaption>
</figure>
<address>
John Doe
</address>
</header>
Can anyone spot anything wrong?

I'm getting the same problem on all kinds of videos. I've rechecked my HTML many times. Is this a Facebook bug? I'm also in Norway.

Related

Nokogiri results different from browser inspect

I am trying to scrape a site, but the links Nokogiri returns differ from what I see when I inspect the page in the browser.
In my browser I get normal links, but with Nokogiri every a href becomes javascript:void(0);.
Here is the site:
https://www.ctgoodjobs.hk/jobs/part-time
Here is my code:
url = "https://www.ctgoodjobs.hk/jobs/part-time"
response = open(url) rescue nil
next unless response
doc = Nokogiri::HTML(open(url))
links = doc.search('.job-title > a').text
It's not that easy: the URLs are "obscured" by a JS function, which is why you get javascript:void(0) when asking for the hrefs. Looking at the HTML, there are some hidden inputs for each link, and there is a preview URL you can use to build the job preview URL (if that's what you're looking for). So you have this:
<div class="result-list-job current-view">
<input type="hidden" name="job_id" value="04375145">
<input type="hidden" name="each_job_title_url" value="barista-senior-barista-咖啡調配員">
<h2 class="job-title">Barista/ Senior Barista 咖 啡 調 配 員</h2>
<h3 class="job-company">PACIFIC COFFEE CO. LTD.</h3>
<div class="job-description">
<ul class="job-desc-list clearfix">
<li class="job-desc-loc job-desc-small-icon">-</li>
<li class="job-desc-work-exp">0-1 yr(s)</li>
<li class="job-desc-salary job-desc-small-icon">-</li>
<li class="job-desc-post-date">09/11/16</li>
</ul>
</div>
<a class="job-save-btn" title="save this job" style="display: inline;"> </a>
<div class="job-batch-apply"><span class="checkbox" style="background-position: 0px 0px;"></span><input type="checkbox" class="styled" name="job_checkbox" value="04375145"></div>
<div class="job-cat job-cat-de"></div>
</div>
Then you can retrieve each job_id from those inputs, like:
inputs = doc.search('//input[@name="job_id"]')
and then build the URLs (I found the base URL in joblist_preview.js):
urls = inputs.map do |input|
"https://www.ctgoodjobs.hk/english/jobdetails/details.asp?m_jobid=#{input['value']}&joblistmode=previewlist&ga_channel=ct"
end
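The URL-building step can be tried without hitting the site. This is only a rough sketch under some assumptions: SAMPLE_HTML is made-up markup that mirrors the hidden inputs shown above (the second job id is invented), preview_url is a hypothetical helper name, and a regex stands in for Nokogiri purely to keep the example dependency-free. The base URL and query parameters are the ones quoted in this answer.

```ruby
require 'uri'

# Base URL taken from the answer above (found in joblist_preview.js).
BASE_URL = 'https://www.ctgoodjobs.hk/english/jobdetails/details.asp'

# Hypothetical helper: build the preview URL for a single job id.
def preview_url(job_id)
  query = URI.encode_www_form(
    m_jobid: job_id,
    joblistmode: 'previewlist',
    ga_channel: 'ct'
  )
  "#{BASE_URL}?#{query}"
end

# Made-up sample markup mirroring the hidden inputs on the listing page.
SAMPLE_HTML = <<~HTML
  <input type="hidden" name="job_id" value="04375145">
  <input type="hidden" name="each_job_title_url" value="barista-senior-barista">
  <input type="hidden" name="job_id" value="04375146">
HTML

# The regex is a stand-in for Nokogiri; with a parsed document you would
# read the value attribute of each //input[@name="job_id"] node instead.
job_ids = SAMPLE_HTML.scan(/<input[^>]*name="job_id"[^>]*value="([^"]+)"/).flatten.uniq
urls = job_ids.map { |id| preview_url(id) }
puts urls
```

Going through URI.encode_www_form rather than string interpolation means any unexpected characters in a job id get escaped, though for purely numeric ids the result is the same string as in the answer.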
Compare the output of a browser with that of a tool like wget, curl or Nokogiri and you will find that the HTML the browser presents can differ drastically from the raw HTML.
Browsers these days process DHTML; Nokogiri doesn't. To see the raw HTML, retrieve it with something that shows you the content without the browser, like the tools mentioned above, then look at it in a text editor or at what Nokogiri shows you. Don't trust the browser - they're known to lie because they want to make you happy.
Here's a quick glimpse into what the raw HTML contains, generated using:
$ nokogiri "https://www.ctgoodjobs.hk/jobs/part-time"
Nokogiri dropped me into IRB:
Your document is stored in @doc...
Welcome to NOKOGIRI. You are using ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-darwin15]. Have fun ;)
Counting the hits found by the selector returns:
>> @doc.search('.job-title > a').size
30
Displaying the text found shows:
>> @doc.search('.job-title > a').map(&:text)
[
[ 0] "嬰 兒 奶 粉 沖 調 機 - 兼 職 產 品 推 廣 員 Part Time Promoter (時 薪 高 達 HK$90, 另 設 銷 售 佣 金 )",
...
[29] "Customer Services Representative (Part-time)"
]
Looking at the actual href:
>> @doc.search('.job-title > a').map{ |n| n['href'] }
[
[ 0] "javascript:void(0);",
...
[29] "javascript:void(0);"
]
You can tell the HTML doesn't contain anything but what Nokogiri is telling you, so the browser is post-processing the HTML: it processes the DHTML and modifies the page, and that modified page is what you see when you inspect it in the browser. So the short fix is: don't trust the browser if you want to know what the server sends you.
This is why scraping isn't very reliable and you should use an API if at all possible. If you can't, then you're going to have to roll up your sleeves and dig into the JavaScript and manually interpret what it's doing, then retrieve the data and parse it into something useful.
Your code can be cleaned up and simplified. I'd write it much more simply as:
require 'nokogiri'
require 'open-uri' # lets open fetch a URL; on Ruby 2.5+ prefer URI.open(url)
url = "https://www.ctgoodjobs.hk/jobs/part-time"
doc = Nokogiri::HTML(open(url))
links = doc.search('.job-title > a').map(&:text)
The use of search(...).text is a big mistake. text, when applied to a NodeSet, will concatenate the text of each contained node, making it extremely difficult to retrieve the individual text. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"
doc.search('p').map(&:text) # => ["foo", "bar"]
The first result foobar would require being split apart to be useful, and unless you have special knowledge of the content, trying to figure out how to do it will be a major pain.
Instead, use map to iterate through the elements and apply &:text to each one, returning an array of each element's text.
See "How to avoid joining all text from Nodes when scraping" and "Taking apart a DHTML page" also.

Anchor tag not working

For this website ( http://carbondirect.net/ ) on the homepage, bottom right I want Sewage Odor Control to link to the page Activated Carbon and jump to the Sewage Odor Control section, but it is not working. It links to the Activated Carbon page, ignoring the anchor.
Based on some research, I tried adding this CSS, but it is still not working:
#sewage-odor-control {
position: fixed;
z-index: 10001;
display: inline-block;
}
Did I do anything obviously wrong? If you are able to take a look, thank you.
Here is the relevant code from your site:
<div class="et_pb_toggle et_pb_toggle_close">
<h5 class="et_pb_toggle_title">SEWAGE ODOR CONTROL</h5>
<div class="et_pb_toggle_content clearfix">
<span><a id="sewage-odor-control"></a></span>
Wet wells and sewage-pumping stations need to be vented to atmosphere, the vapor that is exchanged is contaminated with H2S and mercaptan. They are highly odorous, pungent and extremely objectionable. Specialist impregnated carbon are used to remove H2S to the level below the odor threshold value. The patented products (US patent#6,858,192) invented by our factory are unique catalytic activated carbon not only with high capacity on absorption, but also hazard free during the handling and disposal of spent carbon.</p>
<p>IMP-KOH, HiCOR
</div> <!-- .et_pb_toggle_content -->
</div> <!-- .et_pb_toggle -->
A couple weird things going on:
Line #5: has a closing </p>, but no opening <p>
Line #6: has an opening <p>, but no closing </p>
Also, your anchor is not 'visible' when you first go to the site because the section is toggled closed (et_pb_toggle_close), so maybe that is causing a problem. What happens if you add the anchor to the header (which is readily visible)?
<h5 id="sewage-odor-control" class="et_pb_toggle_title">SEWAGE ODOR CONTROL</h5>
The only thing you need to do is add id="sewage-odor-control" to the target element:
<div id="sewage-odor-control" class="et_pb_toggle et_pb_toggle_close">

Parsing Description tag in Rss Feed in iOS

I'm facing a problem handling the description tag of an RSS feed in iOS.
Below is an example of an RSS feed item I have received.
I can't handle this description field without knowing the feed beforehand, so I can't make the parser generic.
My question is: can we make a generic RSS feed parser? If yes, how? I have tried NSScanner, but it didn't seem very efficient. Is there a better alternative?
EDIT:
I have already parsed the feed using NSXMLParser. The description field I get includes the HTML tags; I want to extract the original text values from it.
<item>
<title>End slavery in the U.S., world</title>
<guid isPermaLink="false">http://www.cnn.com/2013/10/23/opinion/myles-slavery/index.html</guid>
<link>http://rss.cnn.com/~r/rss/cnn_topstories/~3/Z13FFqE4z54/index.html</link>
<description>The extraordinary new film "12 Years a Slave" immerses us in the reality of historical slavery at a deep level of complexity and nuance. The film is an opportunity to honor all who were held in chattel slavery, treated like property, and subjected to levels of violence, torture, and control that no human should ever endure.<div class="feedflare">
<a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=Z13FFqE4z54:pYCgKZFqbkU:yIl2AUoC8zA"><img
src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=yIl2AUoC8zA" border="0"></img></a> <a
href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=Z13FFqE4z54:pYCgKZFqbkU:7Q72WNTAKBA"><img
src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=7Q72WNTAKBA" border="0"></img></a> <a
href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=Z13FFqE4z54:pYCgKZFqbkU:V_sGLiPBpWU"><img
src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=Z13FFqE4z54:pYCgKZFqbkU:V_sGLiPBpWU" border="0"></img></a>
<a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=Z13FFqE4z54:pYCgKZFqbkU:qj6IDK7rITs"><img
src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=qj6IDK7rITs" border="0"></img></a> <a
href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=Z13FFqE4z54:pYCgKZFqbkU:gIN9vFwOqvQ"><img
src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=Z13FFqE4z54:pYCgKZFqbkU:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/rss/cnn_topstories/~4/Z13FFqE4z54" height="1" width="1"/>
</description>
<pubDate>Wed, 23 Oct 2013 09:05:27 EDT</pubDate>
<feedburner:origLink>http://www.cnn.com/2013/10/23/opinion/myles-slavery/index.html</feedburner:origLink>
</item>
RSS is just XML and is a well-defined format, so you can use NSXMLParser to parse the feed and extract the information you need.

Escape html elements in blackberry

I was wondering if there is something for BlackBerry to escape HTML values. Basically, I want to show just the plain text that's coming from an RSS feed. However, the feed is returning values like this:
<item><guid isPermaLink="true"><![CDATA[http://www4.elcomercio.com/deportes
/Vettel_F1_China.aspx]]>
</guid>
<title><![CDATA[ Vettel domina primer día de ensayos en China]]></title>
<description><![CDATA[El alemán Sebastian Vettel, de Red Bull, realizó el mejor tiempo en la segunda sesión de entrenamientos libres del Gran Premio de China de Fórmula 1, el viernes en el circuito de Shanghai, tercera prueba del campeonato, tras haber dominado el primer ensayo.<br />
<br />
I can successfully retrieve the content of the title and description tags, but now I would like to remove all the CDATA markers, <br /> and any other HTML tags that might appear.
I tried using JSoup, but it uses Java 1.5+ features like Enum, and as a result I couldn't preverify the jar for use on BlackBerry Java ME. I also haven't found any class in the RIM API that could help with this task; maybe I missed a class or a library that I could use. I'm just trying to avoid writing code that has probably already been written in several libraries.
Thanks a lot.
Have you tried using a SAX parser and just taking the values from the characters(...) method for each endElement?
Here is a brief tutorial on SAX Parser for Blackberry:
http://jsinghfoss.wordpress.com/2009/09/06/sax-parsing-revising/
Well, I couldn't find a prerolled class. However, there is a library called regexp-me that allows us to use regex in BlackBerry projects, and it helped me remove the tags easily. A SAX parser is also a solution, but if you want something simpler, as in this case, I think regexp-me is the best option.
Thanks.
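For anyone wondering what the regex approach boils down to, here is a minimal sketch of the idea. It is written in Ruby only for illustration; it is not the regexp-me API and won't run on Java ME, and plain_text is a hypothetical helper name. The patterns themselves should carry over to any regex library. Regex stripping is a rough tool that works for simple feed text like the sample above but is no substitute for a real HTML parser on arbitrary markup.

```ruby
require 'cgi'

# Hypothetical helper: reduce a raw RSS field to plain text.
def plain_text(raw)
  # Drop the CDATA markers first, keeping their content. This must happen
  # before tag stripping, because "<![CDATA[ ... ]]>" itself looks like one
  # big tag to the generic tag pattern below.
  text = raw.gsub(/<!\[CDATA\[|\]\]>/, '')
  # Turn <br>, <br/> and <br /> into newlines.
  text = text.gsub(/<br\s*\/?>/i, "\n")
  # Strip any remaining tags.
  text = text.gsub(/<[^>]+>/, '')
  # Decode entities such as &amp;, collapse runs of spaces/tabs, trim the ends.
  CGI.unescapeHTML(text).gsub(/[ \t]+/, ' ').strip
end

sample = '<![CDATA[ Vettel domina primer día de ensayos en China]]>'
puts plain_text(sample)
```

The order of the substitutions matters more than the patterns: stripping tags before removing the CDATA markers would delete the whole field.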

Stream video from website, and support modern browsers (incl. IE) *and* iPad

My boss wants the following:
Requirements: Stream m4v videos from our Web-server to clients including standard web browsers (IE7, FF, Chrome, etc) and iPad!
I'm not really sure why he wants m4v...he mentioned efficiency but it may also have to do with iPad compatibility?? Anyway, I'm stuck with m4v.
I've browsed some related questions on SO, and this page is very useful as well:
http://henriksjokvist.net/archive/2009/2/using-the-html5-video-tag-with-a-flash-fallback
So if I understand correctly, HTML5 with <video> tag will take care of all my requirements (browsers & iPad) except IE up to and including IE8.
So in my code:
<div id="demo-video-flash">
<video id="demo-video" poster="snapshot.jpg" controls>
<source src="video.m4v" type="video/mp4" /> <!-- MPEG4 for Safari -->
<source src="video.ogg" type="video/ogg" /> <!-- Ogg Theora for Firefox 3.1b2 -->
</video>
</div>
<script type="text/javascript">
$(document).ready(function() { // ... a dash of jQuery.
var v = document.createElement("video"); // Are we dealing with a browser that supports <video>?
if ( !v.play ) { // If no, use Flash.
var params = {
allowfullscreen: "true",
allowscriptaccess: "always"
};
var flashvars = {
file: "video.f4v",
image: "snapshot.jpg"
};
swfobject.embedSWF("player.swf", "demo-video-flash", "480", "272", "9.0.0", "expressInstall.swf", flashvars, params);
}
});
</script>
As the link above explains, test if the browser supports <video>, and if not, fall back to flash. If the browser supports <video>, I don't need to worry about the player as the browser handles that. If it doesn't support <video>, I need to provide:
(a) A flash player.
(b) A flash-compatible copy of my .m4v video
Questions:
1) Will this solution work for my requirements?
2) Is .m4v a good format to stream to iPad? (I'm guessing yes as it's an Apple proprietary format!)
3) Is .m4v "flash-compatible"? That is, if I send it to my flash player will it work? I've read conflicting reports on this. If it's not, then I guess I need to have a copy of my video converted to a flash-compatible format...any recommendations? (.f4v seems common, but we already have a .mov file. Will that work?)
4) Last but not least, what's a good flash player. I'm leaning toward flowplayer (http://flowplayer.org/), however, we already have a swf player installed (http://code.google.com/p/swfobject/). Seems this latter one would work...any advantages to one or the other??
Apologies if some parts of this question don't make sense...there's a lot of info about video out there and it's hard to piece it all together...hoping some answers here may help. I can refine my question as needed.
Thanks in advance!
Peter
As far as I know, IE versions up to and including IE8 do not support HTML5 video, so the <video> tag would be unrecognized there and those browsers would need the Flash fallback.
