Getting cleaned HTML in text from HtmlCleaner

I want to see the cleaned HTML that we get from HtmlCleaner.
I see there is a method called serialize on TagNode, but I don't know how to use it.
Does anybody have any sample code for it?
Thanks
Nayn

Here's the sample code:
HtmlCleaner htmlCleaner = new HtmlCleaner();
TagNode root = htmlCleaner.clean(url);
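// getInnerHtml is an instance method and returns only the markup of the children,
// so wrap the result in the root tag to get the whole document
String html = "<" + root.getName() + ">" + htmlCleaner.getInnerHtml(root) + "</" + root.getName() + ">";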

Use a subclass of org.htmlcleaner.XmlSerializer, for example:
// get the element you want to serialize
HtmlCleaner cleaner = new HtmlCleaner();
TagNode rootTagNode = cleaner.clean(url);
// set up properties for the serializer (optional, see online docs)
CleanerProperties cleanerProperties = cleaner.getProperties();
cleanerProperties.setOmitXmlDeclaration(true);
// use the getAsString method on an XmlSerializer class
XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);
String html = xmlSerializer.getAsString(rootTagNode);

XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);
String html = xmlSerializer.getAsString(rootTagNode);
The method above has a problem: it trims the content inside an HTML tag. For example,
<p> this is paragraph1. </p>
will become
<p>this is paragraph1.</p>
The trimming is done by the getSingleLineOfChildren function. So if we fetch data from a website and want to keep the format like tuckunder, this serializer will not preserve it.
PS: if an HTML tag has child tags, the parent tag's content will not be trimmed;
for example, <p> this is paragraph1. <a>www.xxxxx.com</a> </p> will keep the whitespace before "this is paragraph1".

Related

Saxonica - .NET API - XQuery - XPDY0002: The context item for axis step root/descendant::xxx is absent

I'm getting same error as this question, but with XQuery:
SaxonApiException: The context item for axis step ./CLIENT is absent
When running from the command line, all is good. So I don't think there is a syntax problem with the XQuery itself. I won't post the input file unless needed.
The XQuery is displayed with a Console.WriteLine before the error appears:
----- Start: XQUERY:
(: FLWOR = For Let Where Order-by Return :)
<MyFlightLegs>
{
for $flightLeg in //FlightLeg
where $flightLeg/DepartureAirport = 'OKC' or $flightLeg/ArrivalAirport = 'OKC'
order by $flightLeg/ArrivalDate[1] descending
return $flightLeg
}
</MyFlightLegs>
----- End : XQUERY:
Error evaluating (<MyFlightLegs {for $flightLeg in root/descendant::FlightLeg[DepartureAirport = "OKC" or ArrivalAirport = "OKC"] ... return $flightLeg}/>) on line 4 column 20
XPDY0002: The context item for axis step root/descendant::FlightLeg is absent
I think that, like the other question, maybe my input XML file is not properly specified.
I started from the run method of the XQueryToStream class in samples/cs/ExamplesHE.cs.
The code there, for easy reference, is:
public class XQueryToStream : Example
{
public override string testName
{
get { return "XQueryToStream"; }
}
public override void run(Uri samplesDir)
{
Processor processor = new Processor();
XQueryCompiler compiler = processor.NewXQueryCompiler();
compiler.BaseUri = samplesDir.ToString();
compiler.DeclareNamespace("saxon", "http://saxon.sf.net/");
XQueryExecutable exp = compiler.Compile("<saxon:example>{static-base-uri()}</saxon:example>");
XQueryEvaluator eval = exp.Load();
Serializer qout = processor.NewSerializer();
qout.SetOutputProperty(Serializer.METHOD, "xml");
qout.SetOutputProperty(Serializer.INDENT, "yes");
qout.SetOutputStream(new FileStream("testoutput.xml", FileMode.Create, FileAccess.Write));
Console.WriteLine("Output written to testoutput.xml");
eval.Run(qout);
}
}
I changed it to take the XQuery file name, the XML file name, and the output file name, and tried to make a static method out of it. (I had success doing the same with the XSLT processor.)
static void DemoXQuery(string xmlInputFilename, string xqueryInputFilename, string outFilename)
{
// Create a Processor instance.
Processor processor = new Processor();
// Load the source document
DocumentBuilder loader = processor.NewDocumentBuilder();
loader.BaseUri = new Uri(xmlInputFilename);
XdmNode indoc = loader.Build(loader.BaseUri);
XQueryCompiler compiler = processor.NewXQueryCompiler();
//BaseUri is inconsistent with Transform= Processor?
//compiler.BaseUri = new Uri(xqueryInputFilename);
//compiler.DeclareNamespace("saxon", "http://saxon.sf.net/");
string xqueryFileContents = File.ReadAllText(xqueryInputFilename);
Console.WriteLine("----- Start: XQUERY:");
Console.WriteLine(xqueryFileContents);
Console.WriteLine("----- End : XQUERY:");
XQueryExecutable exp = compiler.Compile(xqueryFileContents);
XQueryEvaluator eval = exp.Load();
Serializer qout = processor.NewSerializer();
qout.SetOutputProperty(Serializer.METHOD, "xml");
qout.SetOutputProperty(Serializer.INDENT, "yes");
qout.SetOutputStream(new FileStream(outFilename,
FileMode.Create, FileAccess.Write));
eval.Run(qout);
}
Also two questions regarding "BaseURI".
1. Should it be a directory name, or can it be the same as the XQuery file name?
2. I get this compile error: "Cannot implicitly convert type 'System.Uri' to 'string'" on this line:
compiler.BaseUri = new Uri(xqueryInputFilename);
It's exactly the same thing I did for XSLT which worked. But it looks like BaseUri is a string for XQuery, but a real Uri object for XSLT? Any reason for the difference?

You seem to be asking a whole series of separate questions, which are hard to disentangle.
Your C# code appears to be compiling the query
<saxon:example>{static-base-uri()}</saxon:example>
which bears no relationship to the XQuery code you supplied that involves MyFlightLegs.
The MyFlightLegs query uses //FlightLeg and is clearly designed to run against a source document containing a FlightLeg element, but your C# code makes no attempt to supply such a document. You need to add an eval.ContextItem = value statement.
Your second C# fragment creates an input document in the line
XdmNode indoc = loader.Build(loader.BaseUri);
but it doesn't supply it to the query evaluator.
A base URI can be either a directory or a file; resolving relative.xml against file:///my/dir/ gives exactly the same result as resolving it against file:///my/dir/query.xq. By convention, though, the static base URI of the query is the URI of the resource (eg file) containing the source query text.
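The point about directories versus files can be checked with any RFC 3986 style resolver. The following is a small Java sketch using java.net.URI rather than the Saxon .NET API; the class name BaseUriDemo is made up for illustration, and the two base URIs are the ones from the paragraph above.

```java
import java.net.URI;

final class BaseUriDemo {
    // Resolve a relative reference against a base URI using java.net.URI.
    static URI resolve(String base, String relative) {
        return URI.create(base).resolve(relative);
    }

    public static void main(String[] args) {
        // Base URI that names a directory (note the trailing slash).
        URI fromDir = resolve("file:///my/dir/", "relative.xml");
        // Base URI that names the query file inside that directory.
        URI fromFile = resolve("file:///my/dir/query.xq", "relative.xml");

        System.out.println(fromDir.getPath());
        System.out.println(fromFile.getPath());
        // Both bases lead to the same resolved location.
        System.out.println(fromDir.equals(fromFile));
    }
}
```

Only the part of the base URI up to its last slash takes part in resolution, which is why the file name query.xq makes no difference here.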
Yes, there's a lot of inconsistency in the use of strings versus URI objects in the API design. (There's also inconsistency about the spelling of BaseURI versus BaseUri.) Sorry about that; you're just going to have to live with it.

Bottom line solution based on Michael Kay's response; I added this line of code after doing the exp.Load():
eval.ContextItem = indoc;
The indoc object created earlier is what relates to the XML input file to be processed by the XQuery.

Aspose: Text after Ampersand(&) not seen while setting the page header

I encountered a problem with setting the page header text containing ampersand like ‘a&b’. The text after ‘&’ disappears in the pdf maybe because it is the reserved key in Aspose. My code looks like this:
PageSetup pageSetup = workbook.getWorksheets().get(worksheetName).getPageSetup();
//calling the function
setHeaderFooter(pageSetup, parameters, criteria)
//function for setting header and footer
def setHeaderFooter(PageSetup pageSetup, parameters, criteria = [:])
{
def selectedLoa=getSelectedLoa(parameters)
if(selectedLoa.length()>110){
String firstLine = selectedLoa.substring(0,110);
String secondLine = selectedLoa.substring(110);
if(secondLine.length()>120){
secondLine = secondLine.substring(0,122)+"...."
}
selectedLoa = firstLine+"\n"+secondLine.trim();
}
def periodInfo=getPeriodInfo(parameters, criteria)
def reportingInfo=periodInfo[0]
def comparisonInfo=periodInfo[1]
def benchmarkName=getBenchmark(parameters)
def isNonComparison = criteria.isNonComparison?
criteria.isNonComparison:false
def footerInfo="&BReporting Period:&B " + reportingInfo+"\n"
if (comparisonInfo && !isNonComparison){
footerInfo=footerInfo+"&BComparison Period:&B " +comparisonInfo+"\n"
}
if (benchmarkName){
footerInfo+="&BBenchmark:&B "+benchmarkName
}
//where I encounterd the issue,selectedLoa contains string with ampersand
pageSetup.setHeader(0, pageSetup.getHeader(0) + "\n&\"Lucida Sans,Regular\"&8&K02-074&BPopulation:&B "+selectedLoa)
//Insertion of footer
pageSetup.setFooter(0,"&\"Lucida Sans,Regular\"&8&K02-074"+footerInfo)
def downloadDate = new Date().format("MMMM dd, yyyy")
pageSetup.setFooter(2,"&\"Lucida Sans,Regular\"&8&K02-074" + downloadDate)
//Insertion of logo
try{
def bucketName = parameters.containsKey('printedRLBucketName')?parameters.get('printedRLBucketName'):null
def filePath = parameters.containsKey('printedReportLogo')?parameters.get('printedReportLogo'): null
// Declaring a byte array
byte[] binaryData
if(!filePath || filePath.contains("null") || filePath.endsWith("null")){
filePath = root+"/images/defaultExportLogo.png"
InputStream is = new FileInputStream(new File(filePath))
binaryData = is.getBytes()
}else {
AmazonS3Client s3client = amazonClientService.getAmazonS3Client()
S3Object object = s3client.getObject(bucketName, filePath)
// Getting the bytes out of input stream of S3 object
binaryData = object.getObjectContent().getBytes()
}
// Setting the logo/picture in the right section (2) of the page header
pageSetup.setHeaderPicture(2, binaryData);
// Setting the script for the logo/picture
pageSetup.setHeader(2, "&G");
// Scaling the picture to correct size
Picture pic = pageSetup.getPicture(true, 2);
pic.setLockAspectRatio(true)
pic.setRelativeToOriginalPictureSize(true)
pic.setHeight(35)
pic.setWidth(Math.abs(pic.getWidth() * (pic.getHeightScale() / 100)).intValue());
}catch (Exception e){
e.printStackTrace()
}
}
In this case, I get only ‘a’ in the PDF header; all text after the ampersand disappears. Please suggest a solution. I am using Aspose 18.2.

We added a header to a PDF page with the code snippet below, and we did not notice any problem when an ampersand is included in the header text.
// open document
Document document = new Document(dataDir + "input.pdf");
// create text stamp
TextStamp textStamp = new TextStamp("a&bcdefg");
// set properties of the stamp
textStamp.setTopMargin(10);
textStamp.setHorizontalAlignment(HorizontalAlignment.Center);
textStamp.setVerticalAlignment(VerticalAlignment.Top);
// set text properties
textStamp.getTextState().setFont(new FontRepository().findFont("Arial"));
textStamp.getTextState().setFontSize(14.0F);
textStamp.getTextState().setFontStyle(FontStyles.Bold);
textStamp.getTextState().setFontStyle(FontStyles.Italic);
textStamp.getTextState().setForegroundColor(Color.getGreen());
// iterate through all pages of PDF file
for (int Page_counter = 1; Page_counter <= document.getPages().size(); Page_counter++) {
// add stamp to all pages of PDF file
document.getPages().get_Item(Page_counter).addStamp(textStamp);
}
// save output document
document.save(dataDir + "TextStamp_18.8.pdf");
Please ensure using Aspose.PDF for Java 18.8 in your environment. For further information on adding page header, you may visit Add Text Stamp in the Header or Footer section.
In case you face any problem while adding header, then please share your code snippet and generated PDF document with us via Google Drive, Dropbox etc. so that we may investigate it to help you out.
PS: I work with Aspose as Developer Evangelist.

Yes, "&" is a reserved character when inserting headers/footers into an MS Excel spreadsheet via the Aspose.Cells APIs. To get a literal "&" into the header string, place a second ampersand next to it (i.e. write "&&"). See the sample code for reference:
Workbook wb = new Workbook();
Worksheet ws = wb.getWorksheets().get(0);
ws.getCells().get("A1").putValue("testin..");
String headerText="a&&bcdefg";
PageSetup pageSetup = ws.getPageSetup();
pageSetup.setHeader(0, headerText);
wb.save("f:\\files\\out1.xlsx");
wb.save("f:\\files\\out2.pdf");
Hope this helps a bit.
I am working as Support developer/ Evangelist at Aspose.
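As a sketch of how this might be applied to the code in the question: assuming only the user-supplied part of the header (selectedLoa in the question) needs escaping, while formatting codes such as &B must stay as they are, a small helper can double every ampersand before the text is concatenated. HeaderText and escapeAmpersands are hypothetical names in plain Java, not part of Aspose.Cells.

```java
final class HeaderText {
    // Double every '&' so it is read as a literal character rather than
    // as the start of a header/footer formatting code.
    static String escapeAmpersands(String text) {
        return text == null ? "" : text.replace("&", "&&");
    }

    public static void main(String[] args) {
        String selectedLoa = "a&b"; // user-supplied text containing an ampersand
        // Formatting codes (&B) are left alone; only the data is escaped.
        String header = "&BPopulation:&B " + escapeAmpersands(selectedLoa);
        System.out.println(header);
    }
}
```

With such a helper, the setHeader call in the question would pass escapeAmpersands(selectedLoa) instead of selectedLoa.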

asp.net mvc razor string with double quotes

I am trying to use a string with double quotes but am unable to get it to work
@{
string disableMessage = "";
var disableAttr = "";
if (ViewBag.IsApplicable)
{
disableMessage = "You dont have permission to add new Item.";
disableAttr = "class=" + "disableItem" +" title="+"\""+ disableMessage +"\"";
}
}
expected: disableAttr as
class=disableItem title="You dont have permission to add new demand."
I got stuck at getting double quotes around the title attribute value.

Why not deal with the two attributes separately:
@{
string disableTitle = null;
string disableClass = null;
if (ViewBag.IsApplicable)
{
disableTitle = "You dont have permission to add new Item.";
disableClass = "disableItem";
}
}
<div class="@disableClass" title="@disableTitle">Content</div>
Note that Razor V2 (in MVC4+) has a "conditional attribute" feature. When an attribute value is null, then Razor won't output anything at all for the attribute. So in the example above, if ViewBag.IsApplicable is false, the output will be:
<div>Content</div>

Ross's answer is much more elegant. However, keeping the line of your original code, you could do the following:
disableAttr = "class='disableItem'" +" title='"+ disableMessage +"'";
This will render the following text inside disableAttr:
class='disableItem' title='your_message_here';
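One caveat with building attributes by concatenation is that the value itself may contain a quote character (for example an apostrophe in "don't"), which would end the attribute early. The following is a minimal sketch of the idea in plain Java, not Razor; AttrBuilder, escapeAttr and attrs are hypothetical names, and the escaping covers only the characters that matter inside a double-quoted attribute.

```java
final class AttrBuilder {
    // Escape the characters that are unsafe inside a double-quoted HTML attribute.
    static String escapeAttr(String value) {
        return value
                .replace("&", "&amp;")   // must run first so later entities are not re-escaped
                .replace("\"", "&quot;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Build the class and title attributes as one string.
    static String attrs(String cssClass, String title) {
        return "class=\"" + escapeAttr(cssClass) + "\" title=\"" + escapeAttr(title) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(attrs("disableItem", "You don't have permission to add new Item."));
    }
}
```

Keeping the two attributes separate, as in the earlier answer, avoids most of this because the view engine then handles the encoding.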

Convert part of string to URL when displayed

I browsed around for a solution, and I am sure it's a simple question, but I am still not sure how to do it. I have a string that contains many words, and sometimes it has links in it. For example:
I like the website http://somesitehere.com/somepage.html and I suggest you try it too.
I want to display the string in my view and have all URLs automatically converted to clickable links.
@Model.MyText
Even StackOverflow gets it.

@Hunter is right.
In addition, I found a complete implementation in C#: http://weblogs.asp.net/farazshahkhan/archive/2008/08/09/regex-to-find-url-within-text-and-make-them-as-link.aspx.
In case the original link goes down:
VB.Net implementation
Protected Function MakeLink(ByVal txt As String) As String
Dim regx As New Regex("http://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?", RegexOptions.IgnoreCase)
Dim matches As MatchCollection = regx.Matches(txt)
For Each match As Match In matches
txt = txt.Replace(match.Value, "<a href='" & match.Value & "'>" & match.Value & "</a>")
Next
Return txt
End Function
C#.Net implementation
protected string MakeLink(string txt)
{
Regex regx = new Regex("http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
MatchCollection matches = regx.Matches(txt);
foreach (Match match in matches) {
txt = txt.Replace(match.Value, "<a href='" + match.Value + "'>" + match.Value + "</a>");
}
return txt;
}

One way to do that would be to do a Regular Expression match on a chunk of text and replace that url string with an anchor tag.
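A minimal sketch of that idea, written in Java rather than .NET and using a deliberately simple pattern (http:// or https:// followed by non-whitespace characters) instead of the longer expressions elsewhere in this thread. Linkifier and makeLinks are hypothetical names. The replacement is done in a single pass over the text, so a URL that appears twice is wrapped once per occurrence; .NET offers Regex.Replace for the same purpose.

```java
import java.util.regex.Pattern;

final class Linkifier {
    // Simplified URL pattern (an assumption for this sketch): scheme plus any run of non-whitespace.
    private static final Pattern URL =
            Pattern.compile("https?://\\S+", Pattern.CASE_INSENSITIVE);

    // Wrap every URL in the text in an anchor tag; $0 refers to the whole match.
    static String makeLinks(String text) {
        return URL.matcher(text).replaceAll("<a href='$0'>$0</a>");
    }

    public static void main(String[] args) {
        String text = "I like the website http://somesitehere.com/somepage.html "
                + "and I suggest you try it too.";
        System.out.println(makeLinks(text));
    }
}
```

The sketch does not HTML-encode the surrounding text, and the simple pattern will also swallow punctuation that directly follows a URL, so real input would need more care.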

Another regex that can be used with KvanTTT's answer, with the added benefit of accepting https URLs:
https?://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?
.NET string representation:
"https?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?"

Naming a text file on CSHTML (Razor)

I have a page with a text area for title input and body input.
Saving a text file with those things is easy; the question is, how can I name the file after whatever was entered in the title input?
I tried this:
@{
var result = "";
if (IsPost)
{
var title = Request["title"];
var body = Request["body"];
var filedata = title + "," + body + Environment.NewLine;
var dataFile = Server.MapPath("/App_Data/Request["title"]");
File.WriteAllText(@dataFile, filedata);
result = "Information saved.";
}
}
(Note that var title = Request["title"]; means that it's reading the value of a text input named "title".) What I want is for that input to also be the name of the file being saved.
But it seems that this area:
var dataFile = Server.MapPath("/App_Data/Request["title"]");
is not the correct way.
What is the correct way to do it?

Couple of pointers; firstly this sort of logic should be in a Controller, not in a View. Your Views are supposed to display information about your model, your Controllers carry out operations.
Secondly, the following should do the trick (in a Controller!):
[HttpPost]
public ActionResult SaveFile(string title, string body)
{
var fileData = title + "," + body + Environment.NewLine;
var fileSavePath = Path.Combine(
Server.MapPath("~/TextFiles"),
title.Replace(" ", "_") + ".txt");
File.WriteAllText(fileSavePath, fileData);
return this.RedirectToAction("SaveSuccessful");
}
Of note:
Server.MapPath("~/TextFiles") gives you the path to a TextFiles directory in the root of your web application where the files will be stored.
I've replaced spaces in the title which has been input with underscores.
This method redirects the user to an Action named SaveSuccessful on the same Controller
Of course you need error handling and all sorts of other things in there, but hopefully that helps.
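One of those other things is that the title comes straight from the user, so it may contain characters that are invalid in a file name or that point outside the target directory. A minimal Java sketch of one way to sanitize it follows; TitleFileName, toFileName and the "untitled" fallback are hypothetical choices for illustration, not part of ASP.NET.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class TitleFileName {
    // Keep letters, digits, '-' and '_'; replace every other character with '_'.
    static String toFileName(String title) {
        String cleaned = title == null ? "" : title.trim().replaceAll("[^A-Za-z0-9_-]", "_");
        if (cleaned.isEmpty()) {
            cleaned = "untitled"; // hypothetical fallback for an empty title
        }
        return cleaned + ".txt";
    }

    // Combine the sanitized name with the directory the files are stored in.
    static Path resolveIn(Path directory, String title) {
        return directory.resolve(toFileName(title));
    }

    public static void main(String[] args) {
        System.out.println(resolveIn(Paths.get("TextFiles"), "My first note"));
    }
}
```

Because slashes and dots in the title are replaced, a title such as "../../etc/passwd" cannot move the file out of the chosen directory.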
