Extract text from specific HTML location across multiple pages - html-parsing

I have been experimenting with Jericho HTML Parser and Selenium IDE for the purpose of extracting text from a specific location inside HTML across multiple pages.
I have not found a simple example of how to do this, and I don't know Java.
For every HTML page in a folder, I would like to extract whatever text is in the 1st table, 4th row, 1st div:
<table>
<tr class="abc"><td class="xyz"><div align="center">The Text I don't want</div></td></tr>
<tr class="abc"><td class="xyz"><div align="center">The Text I don't want</div></td></tr>
<tr class="abc"><td class="xyz"><div align="center">The Text I don't want</div></td></tr>
<tr class="abc"><td class="xyz"><div align="center">The Text I want</div></td></tr>
</table>
And print the selected text to a txt file in a list like this:
The Text I want
Another Text I want
All the source files are stored locally and may contain bad HTML, so I figured Jericho might be best for this purpose. However, I'm happy to learn any method that achieves the desired result.

In the end I went with BeautifulSoup and a Python script along these lines:
from bs4 import BeautifulSoup

# html_pathname and result_file are assumed to be defined by the surrounding script
# open source html file
with open(html_pathname, 'r') as html_file:
    # parse the file and search the tag tree with BeautifulSoup
    soup = BeautifulSoup(html_file, 'html.parser')
# find according to your criteria "1st table, 6th tr, 1st td, 1st div"
# (table.tr is the 1st row, so index [4] is the 6th; use [2] for the 4th row)
div = soup.html.body.table.tr.find_next_siblings('tr')[4].td.div
# write found text to result txt
print(' - writing to result txt')
result_file.write(div.get_text() + '\n')
print(' - ok!')
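For anyone without BeautifulSoup installed, the same idea (first table, Nth row, first div, one output line per file) can be sketched with only Python's standard-library html.parser. This is a rough sketch, not the script used above: the class name NthRowDivText, the function names, and the folder and output paths are made up for illustration, and it does not try to handle nested tables or nested divs.

```python
from html.parser import HTMLParser
from pathlib import Path


class NthRowDivText(HTMLParser):
    """Collect the text of the first <div> in the Nth <tr> of the first <table>."""

    def __init__(self, row_number):
        super().__init__(convert_charrefs=True)
        self.row_number = row_number
        self.in_first_table = False
        self.table_done = False   # set once the first table has been closed
        self.rows_seen = 0
        self.capturing = False    # True while inside the wanted <div>
        self.captured = False     # True once the wanted <div> has been read
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and not self.in_first_table and not self.table_done:
            self.in_first_table = True
        elif tag == "tr" and self.in_first_table:
            self.rows_seen += 1
        elif (tag == "div" and self.in_first_table and not self.captured
              and self.rows_seen == self.row_number):
            self.capturing = True

    def handle_endtag(self, tag):
        if tag == "div" and self.capturing:
            self.capturing = False
            self.captured = True
        elif tag == "table" and self.in_first_table:
            self.in_first_table = False
            self.table_done = True

    def handle_data(self, data):
        if self.capturing:
            self.parts.append(data)


def extract_text(html, row_number=4):
    """Return the wanted text from one HTML string ('' if it is not found)."""
    parser = NthRowDivText(row_number)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


def extract_folder(folder, out_path, row_number=4):
    """Write one line per *.html file in folder (both paths are hypothetical)."""
    with open(out_path, "w", encoding="utf-8") as out:
        for page in sorted(Path(folder).glob("*.html")):
            html = page.read_text(encoding="utf-8", errors="replace")
            text = extract_text(html, row_number)
            if text:
                out.write(text + "\n")
```

Calling extract_folder("pages", "result.txt") would then produce the kind of list shown in the question, assuming the pages live in a folder named pages.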

Related

Translation of HTML content in Google Sheets

I am using the following Google Sheets formula to translate some fields containing HTML tags:
=GOOGLETRANSLATE(A2, "en", "de")
However, the translation produces messed-up HTML with extra spaces around opening and closing tags, especially when there are many nested tags.
For example:
<div> <p>paragraph text</p> </div>
will result in:
<div> <p> Absatztext </ P> </ Div>
Sometimes the translator also alters the opening and closing tags, puts extra spaces around attributes, and uppercases the letters in closing tags.
Issues like:
<p> Absatztext <P />
<a href = " # "> Link </ A>
Sometimes text is pulled inside the opening tag:
<h2 Was> ist Pilates? </h2>
when it should be:
<h2> Was ist Pilates? </h2>
Demo here:
https://docs.google.com/spreadsheets/d/11MOZjTknFGdwuAp6g3VUa0o5OQaW44hxN2uEvqnL3jw/edit?usp=sharing
How can I fix those problems?
Try a simple fix like this (note that LOWER also lowercases the translated text, not just the tags):
=LOWER(SUBSTITUTE(GOOGLETRANSLATE(A1, "en", "de"), "/ ", "/"))
UPDATE:
=SUBSTITUTE(A1, TRIM(REGEXREPLACE(A1, "</?\S+[^<>]*>", )),
GOOGLETRANSLATE(TRIM(REGEXREPLACE(A1, "</?\S+[^<>]*>", )), "EN", "DE"))
If you don't mind not doing it in a single formula and just want to solve the problem, you could try splitting the text and translating only the parts that are not HTML tags.
Put all HTML tags and closing tags in a separate sheet, so you can check for them. I'll put mine in 'tags'!A1:B128.
Considering you have your original text in A1, you can split it up by < and >:
=SPLIT(A1,"<>")
then on a line below (or elsewhere, for me it'll be A2) you can check if the first word in each cell is found among the tags with:
NOT(COUNTIF(tags!$A$1:$B$128,INDEX(SPLIT(A2," "),1,1)))
translate it if it's true with
GOOGLETRANSLATE(A2, "en", "de")
or add the brackets back with
"<"&A2&">"
so the whole formula will look like
=IF(NOT(COUNTIF(tags!$A$1:$B$128,INDEX(SPLIT(A2," "),1,1))),GOOGLETRANSLATE(A2, "en", "de"), "<"&A2&">")
then on a line below, just join the whole row back to a single cell with
=JOIN("",A3:L3)
You can hide rows 2 and 3 for convenience, or even put them on a separate sheet along with the tags. You can also add a condition not to add < and > if it's empty, so you can join up the whole row without looking at how long it is.
If you'd like to do this in a single formula, you'd have to write a script for it, as some formulas act strangely with arrayformula and sometimes are barely usable.
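The split, translate, and rejoin steps described above can also be sketched outside of Sheets. The Python below is only an illustration of the idea: it recognizes tags with a regular expression instead of a lookup sheet, and fake_translate is a made-up stand-in that uppercases text so the effect is visible; a real version would call a translation service there.

```python
import re

# Matches one HTML tag, opening or closing, e.g. <p>, </div>, <a href="#">.
TAG_RE = re.compile(r"(<[^<>]+>)")


def translate_outside_tags(html, translate):
    """Apply `translate` to text segments only, leaving tags untouched."""
    pieces = TAG_RE.split(html)  # the capturing group keeps the tags in the result
    out = []
    for piece in pieces:
        if TAG_RE.fullmatch(piece) or not piece.strip():
            out.append(piece)  # a tag or pure whitespace: copy as-is
        else:
            out.append(translate(piece))
    return "".join(out)


# Stand-in translator for demonstration only (hypothetical helper).
def fake_translate(text):
    return text.upper()


print(translate_outside_tags("<div> <p>paragraph text</p> </div>", fake_translate))
```

Because each text segment is translated on its own, a sentence that is split by inline tags is translated in fragments, which may cost some translation quality.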
I think the most convenient solution nowadays is to use a custom Apps Script function (Extensions > Apps Script):
var spanish = LanguageApp.translate('This is a <strong>test</strong>',
'en', 'es', {contentType: 'html'});
// The code will generate "Esta es una <strong>prueba</strong>".
LanguageApp.translate (apidoc) accepts an options object as its fourth argument; its contentType can be text or html.
For huge tables be aware that there are daily limits (quotas)!

Why is this XPath not working?

For example this HTML
<div>
<span></span> I want to find this <b>this works ok</b>.
</div>
I want to find a DIV that contains "I want to find this" and then grab the whole text inside that DIV, including child elements.
My XPath, //*[contains(text(), 'I want to find this')], does not work at all.
If I use //*[contains(text(), 'this works')] it works, but I want to find any DIV based on the "I want to find this" text.
However, if I remove the <span></span> from that HTML, it works. Why is that?
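In XPath 1.0, contains() converts a node-set argument to a string using only its first node, so contains(text(), ...) tests only the div's first text node, which here is the whitespace before <span>. You can replace text() with . to test the string value of the current node instead: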
//div[contains(., 'I want to find this')]
This will search in a string concatenation of all text nodes inside the current node.
To grab all text you can use node.itertext() to iterate all inner texts if you are using lxml:
from lxml import etree
html = """
<div>
<span></span> I want to find this <b>this works ok</b>.
</div>
"""
root = etree.fromstring(html, etree.HTMLParser())
for div in root.xpath('//div[contains(., "I want to find this")]'):
    print(''.join(div.itertext()))
# => I want to find this this works ok.
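To see why the first text node is the problem, it can help to look at how the text of that div is actually split up. The sketch below uses the standard-library xml.etree.ElementTree rather than lxml, so it does not evaluate the XPath itself; it only shows the text nodes that contains(text(), ...) and contains(., ...) would be working from.

```python
import xml.etree.ElementTree as ET

snippet = "<div>\n<span></span> I want to find this <b>this works ok</b>.\n</div>"
div = ET.fromstring(snippet)

# ElementTree stores the text before the first child in .text; later text
# nodes are stored as the .tail of the preceding child element.
first_text_node = div.text            # only the newline before <span>
full_text = "".join(div.itertext())   # every text node, in document order

print(repr(first_text_node))
print("I want to find this" in first_text_node)   # False
print("I want to find this" in full_text)         # True

# The <b> element's first (and only) text node does contain 'this works',
# which is why the second XPath in the question matched (it matched <b>).
print("this works" in div.find("b").text)         # True
```

Removing the span makes the wanted sentence part of the div's first text node, which matches the behaviour described in the question.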
Try using //*[text()=' I want to find this ']. This will select the div tag, and you can then call the getText() method to get its text.
You can try replacing text() with string():
//div[contains(string(), " I want to find this")]
Or, you can check that span's following text sibling contains the text:
//div[contains(span/following-sibling::text(), " I want to find this")]

HtmlPurifier - Codeblock

I was looking through the HTMLPurifier documentation, but I can't find anything about this.
Let's say I have
<div class="codebox">
All html tags here - Even <div class="codebox">another code box</div>
</div>
I want to parse the content of the first <div class="codebox"> so it can be readable as plaintext.
Can HTMLPurifier do that?
Out of the box HTMLPurifier can't do that and there is no config setting, that I know of, that can convert only the first <div> tag to plain text without converting the entire document. And even for converting the entire document to text the HTMLPurifier is neither needed nor recommended.
You can extend functionality of HTMLPurifier but unless you are an expert coder, I wouldn't recommend doing that.
However, if you want to convert only part of the HTML document to text, break the document into parts and run the part you want to convert through strip_tags() (see the PHP manual page on strip_tags).
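The "break it into parts and strip tags from one part" idea can be sketched as follows. This is Python rather than PHP, and everything in it is an assumption made for illustration: strip_tags here is a hand-rolled stand-in for the PHP function of the same name, and flatten_first_codebox is a hypothetical helper that finds the first codebox div by counting nested div tags with regular expressions.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Keep character data, ignore every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Rough Python counterpart of PHP's strip_tags(): keep text, drop tags."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    return "".join(collector.parts)


OPEN_CODEBOX = re.compile(r'<div\b[^>]*\bclass="codebox"[^>]*>', re.IGNORECASE)
ANY_DIV = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)


def flatten_first_codebox(html):
    """Strip tags inside the first codebox div only; leave the rest untouched."""
    start = OPEN_CODEBOX.search(html)
    if start is None:
        return html
    depth = 1
    # Walk the following <div> and </div> tags until the opening one is closed.
    for tag in ANY_DIV.finditer(html, start.end()):
        depth += -1 if tag.group(1) else 1
        if depth == 0:
            inner = html[start.end():tag.start()]
            return html[:start.end()] + strip_tags(inner) + html[tag.start():]
    return html  # unbalanced markup: give up rather than guess
```

Note that this drops the inner tags. If the goal is instead to show them literally inside the code box, escaping the inner part (for example with Python's html.escape) would be the closer fit.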
You could convert all the div tags in your document to plain text with this configuration directive:
$config->set('HTML.ForbiddenElements', 'div'); //This will black list 'div' tag
And if you absolutely insist on converting your entire document to text using HTMLPurifier then here is the config directive that will do that.
$config->set('HTML.Allowed', ''); //This will white list NO tags ''

Searching for contents between two specified tags

I installed Nokogiri into a Rails project and it can currently run "Nokogiri HTML Parser Example" with no issues.
I'm trying to create a Rails project that will parse a movie script from IMDB, conduct a word count, then display the most occurring words from that section. I've identified that the scripts are kept in a "table":
<table width=100% border=0 cellpadding=5 class=scrtext><tr><td class=scrtext><pre><html><head></head><body>
<b>PERSON1</b>
They say some dialogue
<b>PERSON2</b>
They say some more
</pre></table>
I would like to exclude the text within the <b></b> tags as well.
I've been setting this up like the example above in the controller, and have gotten as far as taking in the URL:
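require 'open-uri'
# Save as a temp. file (URI.open comes from open-uri; Kernel#open no longer accepts URLs in Ruby 3)
tmp_file = URI.open('http://www.imsdb.com/scripts/Authors-Anonymous.html')
# Parse the temp. file
doc = Nokogiri::HTML(tmp_file)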
I'm having difficulty understanding how to set the CSS constraints to grab this table. I understand that the script sits between those <pre></pre> tags, and I've followed a number of tutorials for this, but I still don't understand how to set up those constraints.
I feel that the code following this should be something like this, but I'm not awfully sure:
doc.search("//pre")
How do I set up Nokogiri's CSS constraints to pull the content between two tags such as <pre></pre>, and then filter out irrelevant tags such as <b></b> that will occur within the output?
You can use the css method with a selector: doc.css('pre b') returns every <b> tag inside every <pre> tag:
doc.css('pre b').each do |b_tag|
  # b_tag is a Nokogiri::XML::Element such as `<b>this text is bold</b>`;
  # call b_tag.text to get its text as a String
end
It might not be the most elegant solution but it did the trick for me.
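In the controller, I defined the following:
def index
  page = [THE_URL]
  # URI.open requires 'open-uri'
  doc = Nokogiri::HTML(URI.open(page))
  # remove every <b> element from the document first
  doc.css('b').remove
  # then keep what is left inside the <pre> tags
  @content = doc.css('pre')
  puts @content
end
and then in the view:
<%= @content %>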
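The remaining step from the question, counting the most frequent words once the <b> speaker names are dropped, is not covered by the answers above. The sketch below shows the whole pipeline in Python with only the standard library, purely as an illustration of the logic rather than as Nokogiri code; the class and function names are made up, and the word pattern (letters and apostrophes only) is an arbitrary choice.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class ScriptText(HTMLParser):
    """Collect text inside <pre>, skipping anything inside <b>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pre_depth = 0
        self.bold_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1
        elif tag == "b":
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth:
            self.pre_depth -= 1
        elif tag == "b" and self.bold_depth:
            self.bold_depth -= 1

    def handle_data(self, data):
        # Keep text only while inside <pre> and outside <b>.
        if self.pre_depth and not self.bold_depth:
            self.parts.append(data)


def most_common_words(html, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    parser = ScriptText()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z']+", "".join(parser.parts).lower())
    return Counter(words).most_common(n)
```

In the Rails version, the equivalent would be to split the text of the <pre> node into words after removing the <b> elements and tally them in a Hash.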

Render HTML in a rich text box in ActiveReports software

I have a string with basic html markup which I want to put into a rich textbox
string ab = @"<b> a b </b>";
I want it to render as it would appear in a browser, i.e. with the text in bold:
a b
How can I do this in ActiveReports 7? According to http://www.datadynamics.com/forums/77664/ShowPost.aspx, a RichTextBox supports these tags. Do I have to set a property to allow it to render HTML? How should I approach this?
Thanks,
Sam
More information (Update 8/11):
I'm binding the data from a database field - an oracle nclob. The field repeats within the detail section (with different information each time).
If I bind the field directly to a textbox or label, it displays the string as literal text, tags included, instead of rendering the HTML:
<b> a b </b>
Solution Summary
Solution (as suggested by @activescott):
Bind rtx directly to the data field.
"Reformat" the text into HTML in the script:
public void detail_Format()
{
    rtxBox.Html = rtxBox.Text;
}
Result: the field is rendered with some degree of HTML formatting.
Notes:
Binding directly in the script doesn't work, i.e. rtxBox.Html = pt.Fields["CONT_ID"].ToString(); yields some weird metadata string.
Binding via the DataField alone doesn't work (the content is rendered as plain text).
Some extra spacing occurs with p tags. It may be worth regexing them out or otherwise adding some formatting control.
The actual property you are looking for is the Html Property. You can also load a file into that control using the step-by-step walkthrough here.
I am assuming you are using Section Reports and not Page Reports.
To use HTML from the database in a bound report, you should be able to use the DataField property of the RichTextBox control (set it to the name of the corresponding data field at design time). However, I noticed this "Render HTML tags in DB in ActiveReport pdf or HTML" article, which implies that this doesn't work, since it loads the HTML from the database programmatically. One of the two should work.

Resources