How to parse HTML?

I have a table
id txt
1 <html> ... a lot of different html tags
2 <html> ... a lot of different html tags
3 <html> ... a lot of different html tags
How can I parse txt so that I get plain text without all these tags?

If you're on TD 14 you can use REGEXP_REPLACE:
REGEXP_REPLACE(txt, '<[^>]*>', ' ', 1, 0, 'i')
This will return wrong results if the text itself contains literal '<' or '>' characters; you'd need to search for a better regular expression in that case.
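To see what that pattern does, here is the same substitution sketched in Python outside the database; the sample string is made up.
import re

txt = "<html><body><p>Hello <b>world</b></p></body></html>"  # made-up row value
plain = re.sub(r"<[^>]*>", " ", txt)  # same pattern as the SQL above
plain = " ".join(plain.split())       # collapse the leftover runs of spaces
assert plain == "Hello world"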

You can use the REPLACE UDF, which can be downloaded from https://downloads.teradata.com/download/extensibility/teradata-udfs-for-popular-oracle-functions
If you are on TD 14, it has a built-in REPLACE function for the same purpose. (www.info.teradata.com/eDownload.cfm?itemid=113480017)

Pass variable from content to layout in Nanoc using Slim

I basically want to know the easiest way to pass a ruby variable from a content page to its layout using Nanoc and Slim. I am thinking of something like this:
content/content.slim:
---
title: Writeups
layout: /layout.slim
---
- age = get_age
layout/layout.slim:
doctype html
html
  head
    == yield
    p I am #{@item[:title]} and am #{@item[:age]} years old
I know how to access values via frontmatter, but frontmatter values are fixed and what I want is a ruby function to find that value for me.
Nanoc provides a capturing helper, which makes it possible to “capture” content in one place and use it somewhere else.
content/content.slim:
---
title: Mister Tree
---
p Hello there!
- content_for :age do
  | hundreds of years
layout/layout.slim:
doctype html
html
  body
    == yield
    p I am #{@item[:title]} and am #{content_for(@item, :age)} years old
lib/default.rb (or any file in lib/ of your choosing):
use_helper Nanoc::Helpers::Capturing
This generates the following output:
<!DOCTYPE html>
<html>
<body>
<p>Hello there!</p>
<p>I am Mister Tree and am hundreds of years years old</p>
</body>
</html>

Spaces being turned into &nbsp; in Angular

I'm using ng_repeat to display text from an object. On the Rails backend I call strip_tags(text) to remove HTML. The output looks fine, and even inspecting the object in 'view source' looks fine.
It only looks wrong in the text actually rendered by the ng_repeat: after a certain point (200 words in the example below) every space is replaced by an &nbsp;
This is causing the text to overflow the div. Any suggestions for dealing with this?
Edit: Some of the code (simplified)
JS:
$scope.init = function(id){
  $scope.episodes = gon.episodes;
};
Haml:
.episode-edit{ng_repeat: "episode in episodes"}
  %p {{episode.sanitized_summary}}
You should try ng-bind-html. Your snippet would look like
<p ng-bind-html="YourObject"></p>
You can use it in ng-repeat as well.
If you want to mark the data as trusted first, inject the $sce service into your controller. Your snippet would be like
var ExampleCtrl = function($scope, $sce) {
  $scope.YourObject = $sce.trustAsHtml($scope.YourObject); // mark the string as trusted HTML
};
Sorry, turns out it had nothing to do with Angular and more to do with Ruby.
Ruby's whitespace class \s doesn't match the Unicode non-breaking space.
Instead of str.gsub(/\s/m, ' ') you have to use str.gsub(/[[:space:]]/m, ' ')
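For comparison, Python 3's re module is Unicode-aware by default, so its \s already covers the non-breaking space; a minimal sketch with a made-up input string:
import re

text = "word\u00a0word  word"        # made-up input containing U+00A0
cleaned = re.sub(r"\s+", " ", text)  # in Python 3, \s matches Unicode whitespace
assert cleaned == "word word word"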

HTML node parsing with classic ASP

I've been stuck for a day trying to find an answer: is it possible with classic ASP, using MSXML2.ServerXMLHTTP.6.0, to parse HTML code and extract the content of an HTML node by a given ID? For example:
remote html file:
<html>
.....
<div id="description">
some important notes here
</div>
.....
</html>
asp code
<%
...
Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP.6.0")
objHTTP.Open "GET", url_of_remote_html, False
objHTTP.Send
...
%>
Now, I've read a lot of docs saying it is possible to access the HTML as source (objHTTP.responseText) and as structure (objHTTP.responseXML). But how in the world can I use that XML response to access the content of that div? I've read and tried so many examples, but can't find anything clear that solves it.
First up, perform the GET request as in your original code snippet:
Set http = CreateObject("MSXML2.ServerXMLHTTP.6.0")
http.Open "GET", url_of_remote_html, False
http.Send
Next, create a regular expression object and set the pattern to match the inner HTML of an element with the desired id (using [\s\S] rather than . so the match can span line breaks):
Set regEx = New RegExp
regEx.Pattern = "<div id=""description"">([\s\S]*?)</div>"
regEx.Global = True
Lastly, pull out the content from the first submatch within the first match:
On Error Resume Next
contents = regEx.Execute(http.responseText)(0).Submatches(0)
On Error Goto 0
If anything goes wrong, for example the matching element isn't found in the document, contents will simply be left empty. If all went to plan, contents should hold the data you're looking for.
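If it helps to see the same idea outside VBScript, here is a rough Python sketch of the fetch-then-regex approach; the URL is a placeholder for url_of_remote_html, and [\s\S] again lets the match span line breaks.
import re
import urllib.request

url = "http://example.com/page.html"  # placeholder for url_of_remote_html
html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
match = re.search(r'<div id="description">([\s\S]*?)</div>', html)
contents = match.group(1) if match else None  # None when the div is missing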

BeautifulSoup: parse only part of the page

I want to parse a part of html page, say
my_string = """
<p>Some text. Some text. Some text. Some text. Some text. Some text.
<a href="...">Link1</a>
<a href="...">Link2</a>
</p>
<img src="image.png" />
<p>One more paragraph</p>
"""
I pass this string to BeautifulSoup:
soup = BeautifulSoup(my_string)
# add rel="nofollow" to <a> tags
# return comment to the template
But during parsing BeautifulSoup adds <html>, <head> and <body> tags (when using the lxml or html5lib parsers), and I don't need those in my code. The only way I've found so far to avoid this is to use html.parser.
I wonder if there is a way to get rid of the redundant tags while still using lxml, the quickest parser.
UPDATE
Originally my question was asked incorrectly. I have now removed the <div> wrapper from my example, since a typical user would not include this tag. For this reason we cannot use the .extract() method to get rid of the <html>, <head> and <body> tags.
Use
soup.body.renderContents()
lxml will always add those tags, but you can use Tag.extract() to remove your <div> tag from inside them:
comment = soup.body.div.extract()
I could solve the problem using the .contents property:
try:
    children = soup.body.contents
    string = ''
    for child in children:
        string += str(child)
    return string
except AttributeError:
    return str(soup)
I thought that ''.join(soup.body.contents) would be a neater way to convert the list to a string, but that does not work and I get
TypeError: sequence item 0: expected string, Tag found
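The TypeError happens because .contents holds Tag objects, not strings, so each child has to be converted first. A minimal sketch of the working one-liner, assuming the lxml parser is installed:
from bs4 import BeautifulSoup

my_string = "<p>Some text</p><p>One more paragraph</p>"
soup = BeautifulSoup(my_string, "lxml")
comment = "".join(str(child) for child in soup.body.contents)
assert comment == my_string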

How to keep Groovy/XMLSlurper from stripping HTML tags from a node?

I'm reading an HTML file from a POST response and parsing it with XMLSlurper. The textarea node on the page has some HTML code put into it (non-urlencoded; not my choice), and when I read that value, Groovy strips all the tags.
Example:
<html>
<body>
<textarea><html><body>This has html code for some reason</body></html></textarea>
</body>
</html>
When I parse the above and then find(...) the "textarea" node, it returns to me:
This has html code for some reason
and none of the tags. How do I keep the tags?
I think you're getting the right data, but printing it out wrong... Can you try using StreamingMarkupBuilder to convert the node back to a piece of xml?
def xml = '''<html>
            |  <body>
            |    <textarea><html><body>This has html code for some reason</body></html></textarea>
            |  </body>
            |</html>'''.stripMargin()
def ta = new XmlSlurper().parseText( xml ).body.textarea
String content = new groovy.xml.StreamingMarkupBuilder().bind {
  mkp.yield ta.children()
}
assert content == '<html><body>This has html code for some reason</body></html>'
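Loosely the same idea in Python, for comparison: parse as XML (mirroring XmlSlurper) and re-serialize the node's children instead of taking only their text. A sketch, assuming BeautifulSoup and lxml are available:
from bs4 import BeautifulSoup

xml = ("<html><body>"
       "<textarea><html><body>This has html code for some reason</body></html></textarea>"
       "</body></html>")
ta = BeautifulSoup(xml, "xml").find("textarea")
inner = "".join(str(child) for child in ta.children)  # keep the inner tags
assert inner == "<html><body>This has html code for some reason</body></html>"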
