Beginner question about using YQL to get an HTML webpage as JSON

I am trying to scrape a webpage using YQL. I thought outputting it as JSON would give me all the content as one object. However, if the text contains HTML tags such as <strong>, the text inside those tags is not included in "content". Is there any way around this, or should I just get it as XML and regex out the tags?

It should return all elements from the page if your YQL statement is:
select * from html where url="http://www.cnn.com"
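If you end up pulling the markup down as XML/HTML and stripping the tags yourself, a parser is usually safer than a regex. The following is only a minimal sketch in Python using the standard library's html.parser, not anything YQL provides; the names TextCollector and extract_text and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every text node, including text nested in inline tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text, whatever tag it sits in.
        self.parts.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace.
        return " ".join("".join(self.parts).split())


def extract_text(markup):
    """Return the visible text of an HTML fragment as one string."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return parser.text()


# Hypothetical sample: the text inside <strong> is kept.
sample = "<p>Prices are <strong>not</strong> final.</p>"
print(extract_text(sample))
```

Note that this sketch also collects the contents of script and style elements, so a real scraper would probably want to skip those.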

Related

Rendering HTML in a GSP

I have some content in a database table (blog posts) that is trusted and that I want to display on screen. This content is HTML and has some code samples that use Prism.js for syntax highlighting. Because of the HTML encoding on a GSP page, I need to use the raw method to output the content as is:
${raw(post.content)}
This works great except for the code samples, which are wrapped in <code> tags. Instead of being shown as code, the HTML inside them is output raw, which is not what I want. I somehow need to encode the text inside those tags.
I know that I could do the encoding on save, but I already have hundreds of posts where this is not the case. Any ideas?
In my case I had to grab the raw content in the view
${raw(post.getEscapedContent())}
and then in the domain object I escaped anything inside of the code blocks
/**
 * I will return the content of a post with the necessary html escaping. To render html in code blocks we
 * need to escape any html inside of <code></code>
 * @return String
 */
def getEscapedContent() {
    content.replaceAll(/(?ms)(<code.*?>)(.*?)(<\/code>)/) { it, open, code, close ->
        open + code.encodeAsHTML() + close
    }
}
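For anyone who wants to try the same idea outside Grails, here is a rough Python equivalent of the approach above. It is a sketch only: html.escape stands in for encodeAsHTML(), and the function name escape_code_blocks and the sample post are invented for the example.

```python
import html
import re

# Same pattern as the Groovy version: opening tag, body, closing tag.
# re.DOTALL lets the body span several lines.
CODE_BLOCK = re.compile(r"(<code.*?>)(.*?)(</code>)", re.DOTALL)


def escape_code_blocks(content):
    """Escape HTML inside <code>...</code>, leaving other markup untouched."""

    def _escape(match):
        open_tag, body, close_tag = match.groups()
        return open_tag + html.escape(body) + close_tag

    return CODE_BLOCK.sub(_escape, content)


# Hypothetical post content with markup inside a code sample.
post = '<p>Example:</p><code class="language-markup"><b>bold</b></code>'
print(escape_code_blocks(post))
```

Like the original, this assumes the stored content inside the code blocks is not already escaped; running it over already-escaped posts would double-encode them.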

Showing html data from database to front end in mvc

I have upgraded my application from ASP.NET to MVC 4. I am using HTML5 and displaying data in an HTML table. My database column contains HTML tags, but they are rendered as plain text. Please help.
Removable hard drive carrier only (DataPort)<html><br><b><font color="red"> Part 444873-001 is no longer supplied. Please order the replacement, 580620-001</font></b></br></html>
The line above is a sample of how my data is displayed in the column. I want the HTML tags to be rendered as HTML.
When you render your model data, use the Html.Raw() helper to output the HTML unencoded. Razor HTML-encodes output by default to help prevent XSS attacks.
<td>@Html.Raw(Model.MyProperty)</td>

Twittertext bug for auto_link_usernames_or_lists when deal with html

auto_link_usernames_or_lists does not handle HTML well.
I am trying to generate a link from a mention that sits inside HTML tags. When the mention is inside a single HTML element, no link is generated, but when it is wrapped in more than one element, the link is generated correctly:
auto_link_usernames_or_lists('<p>@blankyao</p>') Output: no link generated
auto_link_usernames_or_lists('<span><p>@blankyao</p><span>') Output: link generated properly

Rendering a response from a URL to RSS format

I am creating a controller that receives certain parameters from an application and then accesses a hard-coded URL. Upon receiving a response from the URL, my controller should render that response in RSS format.
To do this, I used XPath to build the XML tags, appended them with a StringBuilder, and rendered the result as text. In the browser this displays just how I want it.
However, when I view the page source, it does not show any tags or headers; it just shows normal text. What do I need to do so that the headers and tags appear in the page source?
I would suggest having your controller send the data as JSON, and then creating a template that renders the JSON as RSS 2.0 XML. For best results, look at how feeds are organized and make your JSON structure easy to loop over when building the feed.
Here's the rss2 spec
Make sure that this line is at the very top of your output, with no leading whitespace:
<?xml version="1.0"?>
Also make sure your content is served with "text/xml" as its content-type. In PHP, one would set it like this:
header('Content-Type: text/xml');
See http://www.electrictoolbox.com/rss-php-content-type/
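To make the suggestion above a bit more concrete, here is a minimal sketch in Python that turns a JSON-like list of items into an RSS 2.0 document with the standard library's xml.etree.ElementTree. The function name build_rss, the field names and the example.com URLs are assumptions for illustration, not part of the original controller.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items):
    """Build an RSS 2.0 document from a list of item dicts."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for entry in items:
        item = ET.SubElement(channel, "item")
        for field in ("title", "link", "description"):
            ET.SubElement(item, field).text = entry[field]
    body = ET.tostring(rss, encoding="unicode")
    # The XML declaration must be the very first thing in the response.
    return '<?xml version="1.0"?>\n' + body


# Hypothetical data, shaped so it is easy to loop over.
feed = build_rss(
    "Example feed",
    "http://example.com/",
    "Hypothetical channel",
    [{"title": "First post", "link": "http://example.com/1", "description": "Hello"}],
)
print(feed)
```

Whatever framework serves this string still has to send it with a "text/xml" content-type, as noted above; building the elements with a library rather than a StringBuilder also takes care of escaping special characters in the text.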

Getting anchor text from a webpage using xpath within YQL

SELECT content FROM html WHERE url="http://en.wikipedia.org/wiki/List_of_United_States_National_Parks_by_state" AND xpath="//a/text()"
does not work, whereas
SELECT * FROM html WHERE url="http://en.wikipedia.org/wiki/List_of_United_States_National_Parks_by_state" AND xpath="//a/text()"
does.
SELECT content FROM html WHERE url="http://en.wikipedia.org/wiki/List_of_United_States_National_Parks_by_state" AND xpath="//a"
also works. Does YQL have a bug, or am I missing something?
Is this what you are looking for?
SELECT content FROM html WHERE url="http://en.wikipedia.org/wiki/List_of_United_States_National_Parks_by_state" AND xpath="//a"
SELECT href
FROM html
WHERE url="http://en.wikipedia.org/wiki/List_of_United_States_National_Parks_by_state"
AND xpath="//a"
Try it on the YQL console.
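If you want to compare against something outside YQL, the same "select every anchor, then read its text or href" idea can be sketched in Python with xml.etree.ElementTree. This is only an illustration under assumptions: the page snippet below is a made-up, well-formed stand-in (real pages often are not valid XML), and ElementTree supports only a subset of XPath with no text() step, so the text is read from each element instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a fetched page.
page = """
<html><body>
  <a href="/wiki/Acadia">Acadia</a>
  <a href="/wiki/Zion"><b>Zion</b> National Park</a>
</body></html>
"""

root = ET.fromstring(page)

# Counterpart of xpath="//a": every anchor element, wherever it sits.
anchors = root.findall(".//a")

# itertext() gathers text from the anchor and from its child elements.
texts = ["".join(a.itertext()) for a in anchors]
hrefs = [a.get("href") for a in anchors]

print(texts)
print(hrefs)
```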
