parsing html to get data

I am having a problem parsing HTML from which I would like to extract data:
<td id="Company" style="border-bottom-width: 0px; padding-left: 5px">
<strong>ABC</strong>
</td>
The data I need is of course only "ABC". I have tried the following pattern, but it does not work:
/<td id=\"Company\" style=\"border-bottom-width: 0px; padding-left: 5px\">
<strong>(.*)<\/strong>
<\/td>/i
Can anyone familiar with this help?

You really should not use regular expressions to parse HTML. It always ends up in a convoluted, tangled mess.
Use a library that can tidy up and parse real-world HTML, such as Beautiful Soup, JTidy, or NekoHTML, and walk the DOM tree (or handle the SAX events) to get at the contents of the tags.
Regexes are, however, great for picking the nuggets out of the rocks once the HTML/XML parsing is done.
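
To illustrate the event-driven approach described above, here is a minimal sketch in Python using only the standard library's html.parser. The class and function names (CompanyCellParser, extract_company) are made up for this example, and it assumes the cell is identified by id="Company" as in the question. It does not handle nested tables.

```python
from html.parser import HTMLParser


class CompanyCellParser(HTMLParser):
    """Collect the text of every <strong> found inside <td id="Company">."""

    def __init__(self):
        super().__init__()
        self.in_company_cell = False
        self.in_strong = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("id") == "Company":
            self.in_company_cell = True
        elif tag == "strong" and self.in_company_cell:
            self.in_strong = True
            self.values.append("")  # start a new captured value

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_company_cell = False
        elif tag == "strong":
            self.in_strong = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current value.
        if self.in_strong:
            self.values[-1] += data


def extract_company(html_text):
    """Return the stripped text of each matching <strong> element."""
    parser = CompanyCellParser()
    parser.feed(html_text)
    parser.close()
    return [value.strip() for value in parser.values]


sample = """<td id="Company" style="border-bottom-width: 0px; padding-left: 5px">
<strong>ABC</strong>
</td>"""
print(extract_company(sample))
```

Because the parser reacts to tags rather than to exact character sequences, changes in whitespace or attribute order do not break the extraction.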

You can try this regex to get the text of the STRONG tag nested in the cell:
/<td\s+id="Company"[^>]*>\s*<strong>(.*?)<\/strong>\s*<\/td>/ms
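
For comparison, the same idea can be sketched with Python's re module. This is only an illustration of the pattern on the question's sample markup; the names COMPANY_RE and find_company are invented here, and the pattern assumes id is the first attribute of the td tag.

```python
import re

# \s* tolerates the line breaks between the tags, [^>]* skips the remaining
# attributes, and the non-greedy (.*?) stops at the first closing </strong>.
COMPANY_RE = re.compile(
    r'<td\s+id="Company"[^>]*>\s*<strong>(.*?)</strong>\s*</td>',
    re.IGNORECASE | re.DOTALL,
)


def find_company(html_text):
    """Return the captured text, or None when the cell is not found."""
    match = COMPANY_RE.search(html_text)
    return match.group(1).strip() if match else None


sample = """<td id="Company" style="border-bottom-width: 0px; padding-left: 5px">
<strong>ABC</strong>
</td>"""
print(find_company(sample))
```

A pattern like this is tied to the exact markup, so it is best kept for small, stable snippets.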

Simply use HtmlAgilityPack:
// requires: using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(@"<td id=""Company"" style=""border-bottom-width: 0px; padding-left: 5px"">
<strong>ABC</strong>
</td>");
HtmlNode node = doc.DocumentNode.SelectSingleNode("//strong");
if (node != null)
{
string value = node.InnerText; // value is "ABC"
}
If you have to fetch the HTML from the web, use:
// requires: using System.IO; using System.Net; using System.Text;
string output;
var request = (HttpWebRequest)WebRequest.Create("URL");
var response = (HttpWebResponse)request.GetResponse();
using (var stream = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding(1252))) // change the encoding if needed
{
output = stream.ReadToEnd(); // output now holds the HTML as a string
}
The output variable contains the HTML as a string, which you can pass to doc.LoadHtml(output).
For more information, search for 'htmlagilitypack' and 'HtmlDocument' :)

Related

Thymeleaf restrictions on onmouseover handler

I'm translating an ancient Struts/JSP application to use Spring 5 and Thymeleaf. The original application had a logic:iterate tag over the variable tt for rows in a table, and the cell was displaying a timestamp formatted on the back-end into the user's time zone, with a hover-over for UTC, like this:
<td style="cursor: pointer; cursor: hand"
onmouseover="return escape('<bean:write name="tt" property="ts_UTC" />' + ' UTC')">
<bean:write name="tt" property="ts_User" /></td>
It generates output that looks like this:
<td style="cursor: pointer; cursor: hand"
onmouseover="return escape('04/06/2020 11:14:50 AM' + ' UTC')">
04/06/2020 07:14:50 AM</td>
After a few attempts and reading https://github.com/thymeleaf/thymeleaf/issues/705 and https://github.com/thymeleaf/thymeleaf/issues/707, I translated it to thymeleaf as follows:
<td style="cursor: pointer; cursor: hand"
th:onmouseover="return escape( '[[${tt.ts_UTC}]] UTC');"
th:text="${tt.ts_user}"></td>
The problem is the generated output looks like this:
<td style="cursor: pointer; cursor: hand"
onmouseover="return escape( &#39;&quot;05\/04\/2015 08:05:24 PM&quot; UTC&#39;);"
>05/04/2015 04:05:24 PM</td>
I have no idea where the &quot; is coming from, and I really want the &#39;s to turn back into apostrophes. I'm stumped. How do I do this?

I don't know if this is a full solution - because I don't know how the text ends up being displayed by the mouseover event. But...
I suggest moving the event handler to a separate JavaScript function, to keep things a bit cleaner & more flexible.
Start with this:
<div style="cursor: pointer; cursor: hand"
th:onmouseover="showMouseoverText( /*[[${tt.ts_UTC}]]*/ );"
th:text="${tt.ts_user}">
</div>
What is that /*[[${tt.ts_UTC}]]*/ doing? It uses the escaped form of JavaScript inlining, the double-bracket notation. It also wraps the expression in a comment, which makes use of Thymeleaf's JavaScript natural templating. This ensures there are no syntax errors when processing the template.
Then somewhere in your <head>...</head> section, add this:
<script th:inline="javascript">
function showMouseoverText( ts ) {
console.log(ts + ' UTC');
return escape(ts + ' UTC');
}
</script>
The console line is just there to test. For me, I get my static test data printed as follows:
04/06/2020 11:14:50 AM UTC
I don't know if that final line return escape(ts + ' UTC') will work the way you need. I'm not sure what it does.
What you get in your HTML page will be the following:
The div:
<div style="cursor: pointer; cursor: hand"
onmouseover="showMouseoverText( &quot;04\/06\/2020 11:14:50 AM&quot;);">John Doe
</div>
You will see the escaped / characters, and the double quotes represented as &quot;. The browser decodes these entities before it runs the handler, so the JavaScript function should handle them (as shown in the console output above). If not, then at least you can manipulate the data in your function, as needed.
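
To see why the generated attribute looks the way it does, the two escaping steps can be imitated in a few lines of Python. This is only a sketch that mimics the observed output; it is not Thymeleaf's actual implementation, and the function names are hypothetical.

```python
import html
import json


def render_inlined_attribute_value(value):
    """Imitate the two layers of escaping seen in the generated page."""
    # Step 1: turn the value into a JavaScript string literal. The extra
    # replace() escapes "/" as "\/", matching the output shown above.
    js_literal = json.dumps(value).replace("/", "\\/")
    # Step 2: escape the literal for use inside an HTML attribute, which
    # turns the surrounding double quotes into &quot;.
    return html.escape(js_literal, quote=True)


def decode_attribute_value(attribute_text):
    """Imitate the browser decoding entities before running the handler."""
    return html.unescape(attribute_text)


rendered = render_inlined_attribute_value("04/06/2020 11:14:50 AM")
decoded = decode_attribute_value(rendered)
print(rendered)
print(decoded)
print(json.loads(decoded))
```

The last line parses the decoded literal back into the original timestamp, which is why the function receives a clean string even though the page source looks mangled.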

Using th:classappend based on request param

I have the following navigation using Thymeleaf and I want to append a CSS class when the link is selected:
<div class="nav-links">
<a th:href="@{/somepage(someId=${someId},filter='filterA')}" href="/somepage"
class="subnav-item"
th:classappend="${selected}">Filter A</a>
<a th:href="@{/somepage(someId=${someId},filter='filterB')}" href="/somepage"
class="subnav-item"
th:classappend="${selected}">Filter B</a>
<a th:href="@{/somepage(someId=${someId},filter='filterC')}" href="/somepage"
class="subnav-item"
th:classappend="${selected}">Filter C</a>
</div>
Assuming some style like:
.subnav-item.selected, .subnav-item.selected:hover, .subnav-item.selected:focus {
background-color: #FFF;
border-color: #000;
}
Looking at this question, it can easily be done based on the URI of the page (in this case, somepage), but I need it to work selectively based on the request parameter (in this case, filter). Is there an easy way to do this?
I tried adding the selected value to the model on the server side and separately tried using the request itself, but it does not differentiate based on the filter param (just on the somepage page).
Is the only way to do some hacky stuff with request.getQueryString()?

Found the answer. There's a utility for the HttpServletRequest that can easily be accessed as such:
th:classappend="${#request.getParameter('filter') == 'filterA' ? 'selected' : ''}"

Line Breaks not working in Textarea Output

Line breaks and paragraphs are not working in my textarea output. For example, I press Enter to start a new paragraph in the textarea, but it does not show up in the output. How can I fix this?
$("#submit-code").click(function() {
$("div.output").html($(".support-answer-textarea").val());
}).next().click(function () {
$(".support-answer-textarea").val($("div.output").html());
});
.support-answer-textarea{width:100%;min-height:300px;margin:0 0 50px 0;padding:20px 50px;border-top:1px solid #deddd9;border-bottom:1px solid #deddd9;border-left:none;border-right:none;box-sizing:border-box;letter-spacing:-1px;}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js"></script>
<textarea id="support-answer-textarea" class="support-answer-textarea" placeholder="Destek Konusunu Cevapla!"></textarea>
<button type="submit" id="submit-code" class="btn btn-success">Submit Your Code</button>
<div class="output"></div>

The easiest way to keep line breaks in the output is this simple CSS, applied to the output element:
.output {
white-space: pre-wrap;
}

When you hit Enter in a <textarea>, you add a newline character (\n) to the text, which HTML treats as whitespace. HTML generally collapses any run of whitespace into a single space. This means that whether you enter one or a dozen whitespace characters (space, newline, or tab) in a row, the resulting HTML shows just a single space.
Now the solution. You can replace each newline character (\n) with a <br> or <p> tag using the replace() method.
$("#submit-code").click(function() {
$("div.output").html($(".support-answer-textarea").val().replace(/\n/g, "<br>"));
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js"></script>
<textarea id="support-answer-textarea" class="support-answer-textarea"></textarea>
<button type="submit" id="submit-code">Submit Your Code</button>
<div class="output"></div>
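
If the text is converted on the server instead of in the browser, the same substitution can be sketched in Python. This is an illustration only: the function name text_to_html is made up, and the escaping step is an addition that the jQuery snippet above does not perform.

```python
from html import escape


def text_to_html(text):
    """Convert plain textarea input into HTML that keeps its line breaks."""
    # Normalise Windows (\r\n) and bare \r line endings to \n first.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # Escape markup characters so the input is displayed as text,
    # then turn every newline into a <br> tag.
    return escape(normalized).replace("\n", "<br>")


print(text_to_html("first line\r\nsecond line\n<b>third</b>"))
```

Escaping before inserting the <br> tags keeps the tags themselves from being escaped.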

For me, the cause was an e.preventDefault() on the Enter keypress of a parent element, which prevented the new line from being added.

If you are capturing input from a textarea, sending it via AJAX (saving it to a database, e.g. MySQL) and then want to display the result in a textarea (e.g. by echoing it via PHP), use the following three steps in your JS:
// get the value of the textarea
var textarea_value = $('#id_of_your_textarea').val();
// replace each line break with the HTML character reference for a line feed
var textarea_with_break = textarea_value.replace(/(\r\n|\n|\r)/gm, '&#10;');
// URL-encode the value so that you can send it via AJAX
var textarea_encoded = encodeURIComponent(textarea_with_break);
// now send it via AJAX
You can also perform all of the above in one line. I did it in three with separate variables for easier readability.
Hope it helps.
Posting this here as it took me about an hour to figure this out, fumbling together the solutions from the answers below (see for more details):
The .val() of a textarea doesn't take new lines into account
New line in text area
URL Encode a string in jQuery for an AJAX request

How to pass html parameter in Google Closure Template

I want to pass a parameter that contains HTML markup to a Google Closure Template, but all I get is the literal HTML text. How do I do this?
What I have tried so far is this:
{template .modal autoescape="strict" kind="html"}
{$html_content}
{/template}
I have been reading this but it's not very helpful. Thanks

{template .modal}
{$html_content |noAutoescape}
{/template}
This is going to print your HTML. But consider that using |noAutoescape in your templates is discouraged:
Discouraged: It's easy to accidentally introduce XSS attacks when the assertion
that content is safe is far away from where it is created. Instead,
wrap content as sanitized content where it is created and easy to
demonstrate safety.
(Source: Google Closure Templates, Functions and Print Directives)
Or if you are sure $html_content is "safe" HTML, you can ordain it right where you pass parameters to the template:
goog.require('soydata.VERY_UNSAFE');
goog.require('template.namespace');
var container = document.getElementById('modal');
var html = '<strong>HTML you trust!</strong>';
container.innerHTML = template.namespace.modal({
html_content: soydata.VERY_UNSAFE.ordainSanitizedHtml(html)
});
Then your initial template is going to print HTML as it is:
/**
* @param html_content HTML markup
*/
{template .modal autoescape="strict" kind="html"}
{$html_content}
{/template}

BeautifulSoup: parse only part of the page

I want to parse a part of html page, say
my_string = """
<p>Some text. Some text. Some text. Some text. Some text. Some text.
Link1
Link2
</p>
<img src="image.png" />
<p>One more paragraph</p>
"""
I pass this string to BeautifulSoup:
soup = BeautifulSoup(my_string)
# add rel="nofollow" to <a> tags
# return comment to the template
But during parsing BeautifulSoup adds <html>,<head> and <body> tags (if using lxml or html5lib parsers), and I don't need those in my code. The only way I've found up to now to avoid this is to use html.parser.
I wonder if there is a way to get rid of redundant tags using lxml - the quickest parser.
UPDATE
Originally my question was asked incorrectly. I have now removed the <div> wrapper from my example, since a typical user does not use this tag. For this reason we cannot use the .extract() method to get rid of the <html>, <head> and <body> tags.

Use
soup.body.renderContents()

lxml will always add those tags, but you can use Tag.extract() to remove your <div> tag from inside them:
comment = soup.body.div.extract()

I could solve the problem using the .contents property:
# runs inside the function that returns the comment to the template
try:
    children = soup.body.contents
    string = ''
    for child in children:
        string += str(child)
    return string
except AttributeError:
    return str(soup)
I think that ''.join(soup.body.contents) would be a neater way to convert the list to a string, but it does not work because the items are Tag objects rather than strings, and I get
TypeError: sequence item 0: expected string, Tag found
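
As an alternative that never adds wrapper tags in the first place, the fragment can be rewritten with the standard library's html.parser directly. The following is a minimal sketch, not a replacement for BeautifulSoup: the names NofollowRewriter and add_nofollow are invented for this example, the href values in the usage line are placeholders, and the output is re-serialised, so attribute quoting may differ from the input.

```python
from html import escape
from html.parser import HTMLParser


class NofollowRewriter(HTMLParser):
    """Re-emit an HTML fragment, adding rel="nofollow" to <a> tags."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as separate
        # events so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _format_tag(self, tag, attrs, closing=">"):
        attrs = list(attrs)
        if tag == "a" and not any(name == "rel" for name, _ in attrs):
            attrs.append(("rel", "nofollow"))
        rendered = "".join(
            f" {name}" if value is None
            else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        return f"<{tag}{rendered}{closing}"

    def handle_starttag(self, tag, attrs):
        self.parts.append(self._format_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self._format_tag(tag, attrs, closing=" />"))

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")


def add_nofollow(fragment):
    """Return the fragment with rel="nofollow" added to every <a> tag."""
    rewriter = NofollowRewriter()
    rewriter.feed(fragment)
    rewriter.close()
    return "".join(rewriter.parts)


print(add_nofollow('<p>Some text. <a href="/one">Link1</a></p><img src="image.png" />'))
```

Links that already carry a rel attribute are left untouched in this sketch.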
