BeautifulSoup: parse only part of the page - html-parsing

I want to parse part of an HTML page, say:
my_string = """
<p>Some text. Some text. Some text. Some text. Some text. Some text.
Link1
Link2
</p>
<img src="image.png" />
<p>One more paragraph</p>
"""
I pass this string to BeautifulSoup:
soup = BeautifulSoup(my_string)
# add rel="nofollow" to <a> tags
# return comment to the template
But during parsing BeautifulSoup adds <html>,<head> and <body> tags (if using lxml or html5lib parsers), and I don't need those in my code. The only way I've found up to now to avoid this is to use html.parser.
I wonder if there is a way to get rid of redundant tags using lxml - the quickest parser.
UPDATE
My question was originally worded incorrectly. I have now removed the <div> wrapper from my example, since ordinary users do not use this tag. For this reason we cannot use the .extract() method to get rid of the <html>, <head> and <body> tags.

Use
soup.body.renderContents()

lxml will always add those tags, but you can use Tag.extract() to remove your <div> tag from inside them:
comment = soup.body.div.extract()

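I solved the problem using the .contents property:
def render_fragment(soup):  # wrapper name is illustrative; the original showed only the body
    try:
        # lxml and html5lib wrap the fragment in <html><body>; take only the body's children
        children = soup.body.contents
        string = ''
        for child in children:
            string += str(child)  # was str(item), an undefined name
        return string
    except AttributeError:
        # html.parser adds no <body>, so soup.body is None
        return str(soup)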
I think that ''.join(soup.body.contents) would be a neater way to convert the list to a string, but it does not work because the list holds Tag objects rather than strings, and I get
TypeError: sequence item 0: expected string, Tag found
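For what it's worth, the TypeError above should go away once each child is converted with str() before joining. A minimal sketch, assuming BeautifulSoup 4 is installed. The function name render_fragment and its parser argument are made up for illustration, and the href value is a placeholder because the original links were not shown:

```python
from bs4 import BeautifulSoup


def render_fragment(markup, parser="html.parser"):
    """Add rel="nofollow" to links and return the fragment without wrapper tags."""
    soup = BeautifulSoup(markup, parser)

    for link in soup.find_all("a"):
        link["rel"] = "nofollow"

    # lxml and html5lib wrap the fragment in <html><body>; html.parser does not,
    # in which case soup.body is None and the soup itself holds the nodes.
    root = soup.body if soup.body is not None else soup

    # str() turns each Tag or text node into markup, so join() receives strings.
    return "".join(str(child) for child in root.contents)


fragment = '<p>Some text. <a href="#one">Link1</a></p><img src="image.png" />'
print(render_fragment(fragment))
```

Passing parser="lxml" should return the same fragment, since only the children of <body> are joined. Tag.decode_contents() is another way to render the children as a string.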

Related

How to display URLs from an HTML string in Angular Dart?

I'm trying to get Angular Dart to display a link in a tag from an HTML string.
At first, I tried to just set the inner HTML of the container to the HTML string, but that didn't work, so I then tried to use Dart's DomSanitizationService class, but that doesn't seem to work either.
What I have so far is
Dart:
class SomeComponent {
  final DomSanitizationService sanitizer;
  SafeUrl some_url;
  SomeComponent(this.sanitizer) {
    some_url = this.sanitizer.bypassSecurityTrustUrl('https://www.google.com');
  }
  String html_string = '''
<a [href]="some_url">Hi</a>
''';
  String get Text => html_string;
}
HTML:
<div [innerHTML]="Text"></div>
The error I'm getting is Removing disallowed attribute <A [href]="some_url">. The text Hi seems to show, but there is no link anymore.
Just as you bypassed URL sanitization, you have to bypass HTML sanitization as well, using bypassSecurityTrustHtml to return the markup.
https://angular.io/api/platform-browser/DomSanitizer#bypassSecurityTrustHtml

Line Breaks not working in Textarea Output

Line breaks and paragraphs are not working in my textarea output. For example, I press Enter to start a new paragraph in the textarea, but the break does not appear in the output. How can I fix that?
$("#submit-code").click(function() {
    $("div.output").html($(".support-answer-textarea").val());
}).next().click(function () {
    $(".support-answer-textarea").val($("div.output").html());
});
.support-answer-textarea{width:100%;min-height:300px;margin:0 0 50px 0;padding:20px 50px;border-top:1px solid #deddd9;border-bottom:1px solid #deddd9;border-left:none;border-right:none;box-sizing:border-box;letter-spacing:-1px;}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js"></script>
<textarea id="support-answer-textarea" class="support-answer-textarea" placeholder="Destek Konusunu Cevapla!"></textarea>
<button type="submit" id="submit-code" class="btn btn-success">Submit Your Code</button>
<div class="output"></div>
The easiest way to preserve line breaks in the output is this CSS, applied to the element that displays the text (the output div here, since a textarea already preserves them):
.output {
    white-space: pre-wrap;
}
When you hit Enter in a <textarea>, you add a newline character \n to the text, which HTML treats as a whitespace character. HTML generally collapses any sequence of whitespace into a single space. This means that whether you enter one or a dozen whitespace characters (space, newline or tab) in a row, the resulting HTML shows just a single space.
Now the solution. You can replace the newline character (\n) with a <br> or <p> tag using the replace() method.
$("#submit-code").click(function() {
    $("div.output").html($(".support-answer-textarea").val().replace(/\n/g, "<br>"));
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js"></script>
<textarea id="support-answer-textarea" class="support-answer-textarea"></textarea>
<button type="submit" id="submit-code">Submit Your Code</button>
<div class="output"></div>
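One caveat with the replace() approach is that whatever the user typed is inserted as HTML. Here is a rough server-side sketch in Python (not part of the original answers) that escapes the text first and then converts line breaks. The helper name text_to_html is hypothetical:

```python
import html


def text_to_html(text: str) -> str:
    """Escape user text, then turn its line breaks into <br> tags."""
    # Treat Windows (\r\n) and old Mac (\r) line endings like \n.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # Escape first so that only the <br> tags added below are real markup.
    return html.escape(normalized).replace("\n", "<br>")


print(text_to_html("first line\nsecond <b>line</b>"))
# first line<br>second &lt;b&gt;line&lt;/b&gt;
```

Escaping before the replacement matters: doing it afterwards would also escape the inserted <br> tags.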
In my case, I had an e.preventDefault() call for the Enter keypress on a parent element, which prevented the new line from being added.
If you are capturing input from a textarea, sending it via AJAX (e.g. to save it to a database such as MySQL) and then want to display the result in a textarea (e.g. by echoing it via PHP), use the following three steps in your JS:
// get the value of the textarea
var textarea_value = $('#id_of_your_textarea').val();
// normalize every line-break variant (\r\n, \n, \r) to a single line break
var textarea_with_break = textarea_value.replace(/(\r\n|\n|\r)/gm, '\n');
// URL-encode the value so that you can send it via AJAX
var textarea_encoded = encodeURIComponent(textarea_with_break);
// now send via AJAX
You can also perform all of the above in one line. I did it in three with separate variables for easier readability.
Hope it helps.
I'm posting this here because it took me about an hour to figure out, piecing together the solutions from the answers below (see them for more details):
The .val() of a textarea doesn't take new lines into account
New line in text area
URL Encode a string in jQuery for an AJAX request
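The same normalize-then-encode steps can be sketched outside the browser. This Python version is only an illustration: prepare_textarea_value is a made-up name, and urllib.parse.quote is used as a rough stand-in for encodeURIComponent.

```python
import re
from urllib.parse import quote


def prepare_textarea_value(value: str) -> str:
    """Normalize line breaks, then percent-encode the text for a request."""
    # Collapse \r\n, \r and \n into a single \n, as the regex in the JS does.
    normalized = re.sub(r"\r\n|\r|\n", "\n", value)
    # Leave unescaped roughly the same punctuation as encodeURIComponent.
    return quote(normalized, safe="!~*'()")


print(prepare_textarea_value("line one\r\nline two"))
# line%20one%0Aline%20two
```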

HTML tags showing in page title

I am setting page titles dynamically, based on a property in the page's model. Sometimes that property contains HTML tags, such as <i>Article Title</i>. When it does, the tags show up in the page title. I tried wrapping the property in @Html.Raw, but it didn't help. How can I make sure the tags don't show?
Code:
<head>
<title>@Html.Raw(Model.Title)</title>
</head>
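Working code based on @Santiago's answer:
@using System.Text.RegularExpressions
<head>
@{
    string titleRaw = Model.Title;
    string htmlTag = "<[^>]*>";
    Regex rgx = new Regex(htmlTag);
    // replace every tag with an empty string (the original used an undefined name, blank)
    string titleToShow = rgx.Replace(titleRaw, string.Empty);
}
<title>@titleToShow</title>
</head>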
Could you try to do it with regular expressions?
<[^>]*>
Took that from:
Regular expression to remove HTML tags from a string
String target = someString.replaceAll("<[^>]*>", "");
This assumes that your non-HTML text does not contain any < or > characters and that your input string is correctly structured.
A more complete explanation is available in the SO question linked above.
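To see what the pattern does and where the assumption bites, here is a small Python sketch of the same regex. The helper name strip_tags is invented for this example:

```python
import re

# Same pattern as above: "<", then anything that is not ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags(text: str) -> str:
    """Remove anything that looks like an HTML tag."""
    return TAG_PATTERN.sub("", text)


print(strip_tags("<i>Article Title</i>"))  # Article Title
print(strip_tags("1 < 2 and 3 > 2"))       # 1  2
```

The second call shows the caveat: text between a literal < and the next > is treated as a tag and removed.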

XPath Node selection

I am using HtmlAgilityPack to parse data for a Windows Phone 8 app. I have managed to parse four nodes, but I am having difficulty with the final one.
Game newGame = new Game();
newGame.Title = div.SelectSingleNode(".//section//h3").InnerText.Trim();
newGame.Cover = div.SelectSingleNode(".//section//img").Attributes["src"].Value;
newGame.Summary = div.SelectSingleNode(".//section//p").InnerText.Trim();
newGame.StoreLink = div.SelectSingleNode(".//img[@class='Store']").Attributes["src"].Value;
newGame.Logo = div.SelectSingleNode(".//div[@class='text-col']").FirstChild.Attributes["src"].Value;
That last piece of code is the one I am having problems with. The HTML on the website looks like this (simplified with the data I need)
<div id= "ContentBlockList" class="tier ">
<section>
<div class="left-side"><img src="newGame.Cover"></div>
<div class="text-col">
<img src="newGame.Logo http://url.png" />
<h3>newGame.Title</h3>
<p>new.Game.Summary</p>
<img src="newGame.StoreLink" class="Store" />
</div>
</div>
</section>
As you can see, I need to parse two images from this block of HTML. This code seems to take the first img src and uses it correctly for the game cover...
newGame.Cover = div.SelectSingleNode(".//section//img").Attributes["src"].Value;
However, I'm not sure how to get the second img src to retrieve the store Logo. Any ideas?
newGame.Cover = div.SelectSingleNode(".//img[2]").Attributes["src"].Value;
You didn't post the entire thing, but this should do the trick.
You can try it this way:
newGame.Cover = div.SelectSingleNode("(.//img)[2]")
.GetAttributeValue("src", "");
GetAttributeValue() is preferable to Attributes["..."].Value because the latter throws an exception when the attribute is not found, while the former returns its second parameter (an empty string in the example above).
Side note: your HTML markup is invalid as posted (some elements are not closed properly, <section> for example). That may cause confusion.
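The two expressions in the answers above are not equivalent: in XPath, .//img[2] matches an <img> that is the second <img> child of its own parent, while (.//img)[2] takes the second <img> in document order. Here is a small Python sketch of the difference using the standard library's ElementTree on a cleaned-up, well-formed version of the markup. The file names are placeholders, and ElementTree does not accept the parenthesised form, so a list index stands in for it:

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed version of the markup from the question.
MARKUP = """
<section>
  <div class="left-side"><img src="cover.png" /></div>
  <div class="text-col">
    <img src="logo.png" />
    <h3>Title</h3>
    <p>Summary</p>
    <img src="store.png" class="Store" />
  </div>
</section>
"""

root = ET.fromstring(MARKUP)

# .//img[2]: every <img> that is the second <img> child of its parent.
second_among_siblings = [img.get("src") for img in root.findall(".//img[2]")]

# (.//img)[2]: the second <img> in document order, emulated by indexing
# the full list of matches (index 1 is the second item).
second_in_document = root.findall(".//img")[1].get("src")

print(second_among_siblings)  # ['store.png']
print(second_in_document)     # logo.png
```

On markup shaped like this, the logo is the second image in document order, so the parenthesised form is the one that reaches it.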

How to display a string with new lines as a string with <br />'s in AngularDart?

I have a string being displayed on the page in AngularDart:
... <strong>Notes: </strong> {{cmp.selectedStudent.notes}} ...
How can I make it display as multiple lines? The string contains newline characters, and I want them rendered as <br /> elements in the HTML output.
You can replace the '\n' in your string with <br/> and use something like the proposed my-bind-html directive shown in my answer here How to add a component programatically in Angular.Dart? (the code might be a bit outdated due to a lot of recent changes in Angular)
You could use ng-repeat and repeat over your notes lines but first you need to split them by '\n' so you get an array of lines.
List<String> _notesList = null;
List<String> get notesList {
  if (_notesList == null) _notesList = notes.split("\n").toList();
  return _notesList;
}
In the template:
<span ng-repeat="note in cmp.selectedStudent.notesList">{{note}}<br /></span>
By default, Angular does not interpret HTML tags in bound strings, to avoid unpredictable behavior and other problems, but you can disable this check with
ng-bind-html
Link to the official doc: NgHtmlBind
So you can directly replace the '\n' character with a <br /> HTML element.
You can do:
// ...
String getHtmlBrNote() {
  return this.notes.replaceAll("\n", "<br />");
}
// ...
and then in the Angular template:
... <strong>Notes: </strong> <span ng-bind-html="cmp.selectedStudent.getHtmlBrNote()"></span> ...
and it will work.
