Get only text from html with html agility

I'm trying to strip all the markup from an HTML document with Html Agility Pack, but I need to keep the text. For example, from this fragment:
<TR><TD>
<B>Survival</B><BR>
<I>Be Suspicious, Be Worried, Be Prepared</I><BR>
<TD>
I want to keep only "Be suspicious..."
I have this method, but it doesn't work very well:
private static HtmlDocument RemoveHTML(HtmlDocument document)
{
    HtmlDocument textOfDoc = new HtmlDocument();
    foreach (var node in document.DocumentNode.SelectNodes(".//p|.//title|.//body"))
    {
        var newNode = HtmlNode.CreateNode(node.InnerText + " ");
        textOfDoc.DocumentNode.AppendChild(newNode);
    }
    return textOfDoc;
}
THANKS!

It looks like you're only extracting P, TITLE and BODY tags. If you want I tags as well, you need to do this:
document.DocumentNode.SelectNodes(".//p|.//title|.//body|.//i")

Related

If not closed then Auto close or auto format html tag in dynamic content in razor

The content below is pulled from a database:
<div class="main"><div class="col1">content</div>
In the example above, the main div is not closed in my database, so I want to close it.
I simply added this to my Razor view:
@Html.Raw(a.shortDesc)
but it breaks my page layout. Please suggest a fix.

I would suggest using the HtmlAgilityPack (https://html-agility-pack.net/) to fix the HTML before you render it out using @Html.Raw in your view.
In the ShortDesc property of your ViewModel, you could do something like this:
private string _shortDesc;
public string ShortDesc
{
    get
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(_shortDesc);
        return doc.DocumentNode.OuterHtml;
    }
    set
    {
        this._shortDesc = value;
    }
}

@{
    var data = a.shortDesc + "</div>";
}
@Html.Raw(data)

changing Html.Raw into string without html markup in a razor view

Quite a simple question. I have the following code:
@Html.Raw(following.Description).ToString()
When this comes from the database it has some markup in it (it's a forum post), but I want to show a snippet in the list without the markup.
Is there any way to remove the markup by replacing this line, or should I just strip it with a regex in the controller?

Here is a utility extension method that can strip tags from fragments without using Regex:
// Requires: using System.IO; using System.Xml; using System.Xml.XPath;
// Must be declared inside a static class.
public static string StripTags(this string markup)
{
    try
    {
        StringReader sr = new StringReader(markup);
        XPathDocument doc;
        using (XmlReader xr = XmlReader.Create(sr,
            new XmlReaderSettings()
            {
                // Fragment allows multiple root elements
                ConformanceLevel = ConformanceLevel.Fragment
            }))
        {
            doc = new XPathDocument(xr);
        }
        // .Value is similar to .InnerText of XmlDocument or JavaScript's innerText
        return doc.CreateNavigator().Value;
    }
    catch
    {
        // Reached when the markup cannot be parsed, e.g. it is not well-formed XML
        return string.Empty;
    }
}

Sanitizing Input with JsonConvert.SerializeObject in MVC4?

Long story short, I'm trying to get the output from JsonConvert.SerializeObject to be sanitized without having to modify the contents of the saved data.
I'm working on an app that has the following markup in the view:
<textarea data-bind="value: aboutMe"></textarea>
If I save the following text, I run into problems:
<script type="text/javascript">alert("hey")</script>
The error I get in FF:
The relevant part of the offending rendered text:
$(document).ready(ko.applyBindings(new
MyProfileVm({"profileUsername":"admin","username":"Admin","aboutMe":"alert(\"hey\")","title":"Here's a
short self-bio!
:)","thumbnail":"https://i.imgur.com/H1HYxU9.jpg","locationZip":"22182","locationName":"Vienna,
VA"
And finally - at the bottom of my view:
<script type="text/javascript">
    $(document).ready(ko.applyBindings(new MyProfileVm(@Html.Raw(JsonConvert.SerializeObject(Model, new JsonSerializerSettings() { ContractResolver = new CamelCasePropertyNamesContractResolver() })))));
</script>
Here, I'm passing the model that I get from the MVC controller into the js ViewModel for knockout to map into observable data. The Raw encoding seems to be the problem, but I'm not sure how to go about handling it.
To be clear, I'm getting data from the server, and outputting it to the client, which is mucking up the JSON/KO combo.

The problem is that you cannot have a closing </script> tag inside a JavaScript string literal, because the browser interprets it as the end of the script block. See also: Script tag in JavaScript string
There is no built-in function in ASP.NET that handles this on the server side, so before outputting your generated script you need to replace </script> with something else:
<script type="text/javascript">
    $(document).ready(ko.applyBindings(new MyProfileVm(@Html.Raw(
        JsonConvert.SerializeObject(Model,
            new JsonSerializerSettings() {
                ContractResolver = new CamelCasePropertyNamesContractResolver()
            }).Replace("</script>", "</scripttag>")
    ))));
</script>
Of course, if you need this in multiple places, you can move this logic into a helper/extension method, like:
public static class JavaScriptExtensions
{
    public static string SerializeAndEscapeScriptTags(this object model)
    {
        return JsonConvert.SerializeObject(model,
            new JsonSerializerSettings()
            {
                ContractResolver = new CamelCasePropertyNamesContractResolver()
            }).Replace("</script>", "</scripttag>");
    }
}
And use it with:
@using YourExtensionMethodsNamespace
<script type="text/javascript">
    $(document).ready(ko.applyBindings(new MyProfileVm(@Html.Raw(
        Model.SerializeAndEscapeScriptTags()))));
</script>
On the JavaScript side, in your Knockout view model, you need to restore the </script> tag before using the value:
var MyProfileVm = function(data) {
    //...
    this.aboutMe = ko.observable(
        // you need `"</scr" + "ipt>"` because of the problem mentioned above.
        data.aboutMe.replace(/<\/scripttag>/g, "</scr" + "ipt>"));
};
Of course, you can also create a helper function for this, like:
function fixScriptTags(data) {
    for (var prop in data) {
        if (typeof data[prop] === "string") {
            data[prop] = data[prop].replace(/<\/scripttag>/g, "</scr" + "ipt>");
        }
        // todo: check for complex property values and call fixScriptTags recursively
    }
    return data;
}
And use it with:
ko.applyBindings(new ViewModel(fixScriptTags(data)));
Demo JSFiddle.
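The helper in this answer leaves nested values as a todo. A minimal sketch of one way to fill that in is below; the name fixScriptTagsDeep is hypothetical, and unlike the original helper it returns a new structure rather than modifying the input in place.

```javascript
// Hypothetical recursive variant of fixScriptTags: restores "</script>"
// in every string found anywhere inside the data (objects and arrays).
function fixScriptTagsDeep(data) {
    if (typeof data === "string") {
        // "</scr" + "ipt>" keeps this source safe inside an inline script block.
        return data.replace(/<\/scripttag>/g, "</scr" + "ipt>");
    }
    if (Array.isArray(data)) {
        // Process each array element individually.
        return data.map(function (item) {
            return fixScriptTagsDeep(item);
        });
    }
    if (data !== null && typeof data === "object") {
        // Rebuild the object with every own property processed.
        var result = {};
        Object.keys(data).forEach(function (key) {
            result[key] = fixScriptTagsDeep(data[key]);
        });
        return result;
    }
    // Numbers, booleans, null and undefined pass through unchanged.
    return data;
}
```

It would be used the same way as the original helper, for example ko.applyBindings(new ViewModel(fixScriptTagsDeep(data))).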

I've had a similar problem. It came from using knockout.js to get input from a <textarea>, just like you did. Everything was fine on the "create" part, but once I put the data back into an action via @Html.Raw(...), it turned out to contain linefeed and carriage-return characters that broke the JSON string.
So I added something like this:
// Regex to replace all unescaped (single) backslashes in a string
private static Regex _regex = new Regex(@"(?<!\\)\\(?!\\)", RegexOptions.Compiled);
(I know it doesn't handle "\\\", but that doesn't appear from knockout)
Then I build my anonymous classes and do this:
var coJson = JsonHelper.Serialize(co);
var coJsonEsc = _regex.Replace(coJson, @"\\");
Maybe this can help you. I found it by breaking in the razor view and looking at the strings.
This problem also appears with unescaped tabs (\t) and possibly other escape sequences.
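Both answers above work around characters that are awkward inside an inline script block. As a sketch of an alternative, the "<" character can be written as the JSON escape \u003c, which the JavaScript parser turns back into "<" by itself, so no reverse replacement is needed on the client. The example is written in JavaScript for illustration only, and toScriptSafeJson is a hypothetical name; the assumption is that the same replacement could be applied on the server to the string produced by the serializer.

```javascript
// Hypothetical helper: serialize a value so the result can be embedded
// directly inside an inline <script> block.
function toScriptSafeJson(value) {
    return JSON.stringify(value)
        // "<" only occurs inside JSON strings, so \u003c is always a valid escape.
        .replace(/</g, "\\u003c")
        // U+2028 and U+2029 are legal in JSON but were line terminators
        // in JavaScript string literals before ES2019.
        .replace(/\u2028/g, "\\u2028")
        .replace(/\u2029/g, "\\u2029");
}

// Sample data containing a script tag, a line feed and a tab.
var profile = { aboutMe: "<script>alert(\"hey\")</script>\nline two\ttabbed" };
var json = toScriptSafeJson(profile);

// Parsing the escaped text gives back the original value.
var roundTripped = JSON.parse(json);
```

The serializer already escapes line feeds, carriage returns and tabs inside strings, so the only extra work here is the handful of characters that matter to the HTML parser or to older JavaScript engines.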

How to display textarea's data in a table

I am using ASP.NET MVC 3. I have a view with a textarea on it. I capture data in it, and when I want a new paragraph I press Enter twice. After all my data is entered I save the text to the database.
In my details view I display the data in a table like this:
<tr>
    <td valign="top"><label>Body:</label></td>
    <td>@Model.Body</td>
</tr>
Now the text displays as one paragraph, even though in my textarea (when I captured the data) it looked like separate paragraphs.
How can I get the data to display as paragraphs in my table, the way I entered it in my textarea? I'm assuming that I have to search for carriage returns and replace them with break tags?

I'm assuming that I have to search for carriage returns and replace them with break tags?
Yes, your assumption is correct. You could use a custom HTML helper:
public static IHtmlString FormatBody(this HtmlHelper htmlHelper, string value)
{
    if (string.IsNullOrEmpty(value))
    {
        return MvcHtmlString.Empty;
    }
    var lines = value.Split('\n'); // Might need to adapt
    return htmlHelper.Raw(
        string.Join("<br/>", lines.Select(line => htmlHelper.Encode(line)))
    );
}
and then:
@Html.FormatBody(Model.Body)
UPDATE:
And here's an example of how this method could be unit tested:
[TestMethod]
public void FormatBody_should_split_lines_with_br_and_html_encode_them()
{
    // arrange
    var viewContext = new ViewContext();
    var helper = new HtmlHelper(viewContext, MockRepository.GenerateStub<IViewDataContainer>());
    var body = "line1\nline2\nline3<>\nline4";
    // act
    var actual = helper.FormatBody(body);
    // assert: "<>" on line 3 must come back HTML-encoded
    var expected = "line1<br/>line2<br/>line3&lt;&gt;<br/>line4";
    Assert.AreEqual(expected, actual.ToHtmlString());
}

Replace line break characters with <br /> in ASP.NET MVC Razor view

I have a textarea control that accepts input. I am trying to later render that text to a view by simply using:
@Model.CommentText
This is properly encoding any values. However, I want to replace the line break characters with <br /> and I can't find a way to make sure that the new br tags don't get encoded. I have tried using HtmlString but haven't had any luck yet.

Use the CSS white-space property instead of opening yourself up to XSS vulnerabilities!
<span style="white-space: pre-line">@Model.CommentText</span>

Try the following:
@MvcHtmlString.Create(Model.CommentText.Replace(Environment.NewLine, "<br />"))
Update:
According to marcind's comment on this related question, the ASP.NET MVC team is looking to implement something similar to the <%: and <%= for the Razor view engine.
Update 2:
We can turn any question about HTML encoding into a discussion on harmful user inputs, but enough of that already exists.
Anyway, take care of potential harmful user input.
@MvcHtmlString.Create(Html.Encode(Model.CommentText).Replace(Environment.NewLine, "<br />"))
Update 3 (Asp.Net MVC 3):
@Html.Raw(Html.Encode(Model.CommentText).Replace("\n", "<br />"))
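The order in Update 2 and Update 3 matters: the text is encoded first, and only then are the break tags inserted, so the tags themselves are not encoded while any markup in the user's text is. The sketch below shows the same ordering in JavaScript purely as an illustration; escapeHtml and formatNewLines are hypothetical helpers, not part of ASP.NET MVC.

```javascript
// Hypothetical helper: HTML-encode the five characters that matter in markup.
// "&" is replaced first so the entities added afterwards are not re-encoded.
function escapeHtml(text) {
    return text
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&#39;");
}

// Encode first, then turn line breaks (CRLF, CR or LF) into <br /> tags.
function formatNewLines(text) {
    return escapeHtml(text).replace(/\r\n|\r|\n/g, "<br />");
}
```

Doing the two steps in the opposite order would encode the inserted tags as well, which is the problem described in the question.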

Split on newlines (environment agnostic) and print regularly -- no need to worry about encoding or xss:
@if (!string.IsNullOrWhiteSpace(text))
{
    var lines = text.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    foreach (var line in lines)
    {
        <p>@line</p>
    }
}
(remove empty entries is optional)

Omar's third solution as an HTML Helper would be:
public static IHtmlString FormatNewLines(this HtmlHelper helper, string input)
{
    return helper.Raw(helper.Encode(input).Replace("\n", "<br />"));
}

Applying the DRY principle to Omar's solution, here's an HTML Helper extension:
using System.Web.Mvc;
using System.Text.RegularExpressions;
namespace System.Web.Mvc.Html {
    public static class MyHtmlHelpers {
        public static MvcHtmlString EncodedReplace(this HtmlHelper helper, string input, string pattern, string replacement) {
            return new MvcHtmlString(Regex.Replace(helper.Encode(input), pattern, replacement));
        }
    }
}
Usage (with improved regex):
@Html.EncodedReplace(Model.CommentText, "[\n\r]+", "<br />")
This also has the added benefit of putting less onus on the Razor View developer to ensure security from XSS vulnerabilities.
My concern with Jacob's solution is that rendering the line breaks with CSS breaks the HTML semantics.

I needed to break some text into paragraphs ("p" tags), so I created a simple helper using some of the recommendations in previous answers (thank you guys).
public static MvcHtmlString ToParagraphs(this HtmlHelper html, string value)
{
    value = html.Encode(value).Replace("\r", String.Empty);
    var arr = value.Split('\n').Where(a => a.Trim() != string.Empty);
    var htmlStr = "<p>" + String.Join("</p><p>", arr) + "</p>";
    return MvcHtmlString.Create(htmlStr);
}
Usage:
@Html.ToParagraphs(Model.Comments)

I prefer this method as it doesn't require manually emitting markup. I use this because I'm rendering Razor Pages to strings and sending them out via email, which is an environment where the white-space styling won't always work.
public static IHtmlContent RenderNewlines<TModel>(this IHtmlHelper<TModel> html, string content)
{
    if (string.IsNullOrEmpty(content) || html is null)
    {
        return null;
    }
    TagBuilder brTag = new TagBuilder("br");
    IHtmlContent br = brTag.RenderSelfClosingTag();
    HtmlContentBuilder htmlContent = new HtmlContentBuilder();
    // JAS: On the off chance a browser is using LF instead of CRLF we strip out CR before splitting on LF.
    string lfContent = content.Replace("\r", string.Empty, StringComparison.InvariantCulture);
    string[] lines = lfContent.Split('\n', StringSplitOptions.None);
    foreach (string line in lines)
    {
        _ = htmlContent.Append(line);
        _ = htmlContent.AppendHtml(br);
    }
    return htmlContent;
}
