How to downcase an entire HTML document parsed with Nokogiri - ruby-on-rails

I need to downcase all text in an HTML document that has been parsed with Nokogiri. Here is my code:
agent = Mechanize.new
page = agent.get('http://www.example.com').parser.search('//*[translate(text(),"ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz") = *]').to_html
The code executes without raising an error. However, if I check a random tag in the resulting document, the case is still the same as before. Is there another/better way to downcase all text in a document?

You could use traverse to downcase all text nodes:
require 'open-uri'
require 'nokogiri'
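# With open-uri, use URI.open; Kernel#open no longer accepts URLs as of Ruby 3.0.
doc = Nokogiri::HTML(URI.open("http://www.example.com/"))
doc.traverse do |node|
  # Rewrite only text nodes; element names and attributes are left alone.
  node.content = node.content.downcase if node.text?
end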
puts doc.to_html
Output:
<!DOCTYPE html>
<html>
<head>
<title>example domain</title>
<meta charset="utf-8">
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<style type="text/css">
body { ... }
</style>
</head>
<body>
<div>
<h1>example domain</h1>
<p>this domain is established to be used for illustrative examples in documents. you may use this
domain in examples without prior coordination or asking for permission.</p>
<p>more information...</p>
</div>
</body>
</html>

Related

style tag is ignored by uiwebview

I'm trying to give my HTML an inline CSS style, but this style gets ignored.
This is the HTML string that I'm using for webView.loadHTMLString(htmlBelow, baseURL: Bundle.main.bundleURL)
let fontsize = 16 // This is a dynamic variable
<!DOCTYPE html>
<html lang=\"en\">
<head>
<meta charset=\"UTF-8\">
<meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">
<meta http-equiv=\"X-UA-Compatible\"content=\"ie=edge\">
<title>Document</title>
<style>
​html {
font-size:\(fontsize)px; */THIS DOESN'T WORK/*
}
</style>
<link rel=\"stylesheet\" href=\"\style.css\">
<script src=\"jquery-3.4.1.min.js\"></script>
<script src=\"script.js\"></script>
</head>
<body>
<div class=\ "sqr-tree-level\"> \(restOfHtml) </div>
</body>
</html>
I'm trying to set the fontsize dynamically.
When I change the font-size in my css file it works, but I want to set it dynamically. That's why I wanna do it like this.
For future reference, the fix here was to put the styling on the <body> tag instead of in the <style> tag in the head.
<body style=\"font-size:\(fontsize)px;\">

How to force a page break in HtmlRenderer.PdfSharp?

I'm using "HTML Renderer for PDF using PDFsharp" HtmlRenderer.PdfSharp (version 1.5.1-beta1). I'm trying to force a page break. But I can't get this to work. What I have now in my html is this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>Test</title>
<style>
div { page-break-inside: auto; }
</style>
</head>
<body style="margin:0; padding:0;" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0">
<div>Page1</div>
<div>Page2</div>
</body>
</html>
Both divs stay on the same page when I convert this HTML to PDF.
string contents = File.ReadAllText(@"C:\temp\test.html");
PdfDocument pdf = PdfGenerator.GeneratePdf(contents, PageSize.A4);
pdf.Save(@"C:\temp\pdfsharp.pdf");
How can I force the second div to a new page?
The only known solution to force page breaks is to split the HTML into parts and generate a page for each part. Solution from Grasher134 on GitHub: https://github.com/ArthurHub/HTML-Renderer/issues/49#issuecomment-251351431

How to use XPath in Ruby to get tag with a specific attribute?

For instance, I want to get the value inside this tag
<meta charset="euc-kr">
in this document:
<!DOCTYPE html>
<html lang="ko">
<head>
<meta charset="euc-kr">
</Head>
</HTML>
How do I use XPath in Ruby to get any meta tag that has the attribute "charset"?
Ruby's REXML library lets you use XPath. The example below shows how to use XPath in your case.
require 'rexml/document'
xml = %{<!DOCTYPE html>
<html lang="ko">
<head>
<meta charset="euc-kr"></meta>
<link charset="en-us"></link>
</head>
</html>}
doc = REXML::Document.new xml
REXML::XPath.match(doc, '//*[@charset]')
Result:
=> [<meta charset='euc-kr'/>, <link charset='en-us'/>]
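The expression above matches any element carrying a charset attribute, which is why the <link> element shows up too. If only <meta> elements are wanted, along with the attribute value itself, the predicate can be attached to meta instead of *. A minimal sketch, reusing the same sample markup (the variable names are only illustrative):

```ruby
require 'rexml/document'

xml = <<~XML
  <html lang="ko">
    <head>
      <meta charset="euc-kr"></meta>
      <link charset="en-us"></link>
    </head>
  </html>
XML

doc = REXML::Document.new(xml)

# Match only <meta> elements that have a charset attribute.
metas = REXML::XPath.match(doc, '//meta[@charset]')

# Read the attribute value from each matched element.
charsets = metas.map { |el| el.attributes['charset'] }

puts charsets.inspect  # ["euc-kr"]
```

Note that REXML is an XML parser, so the input has to be well-formed XML; real-world HTML with unclosed tags such as a bare <meta> would need an HTML-aware parser instead.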

grails header title not working with layout - sitemesh and layout

I have a base header layout (base-header-footer.gsp)
<!DOCTYPE html>
<html>
<head>
<title><g:layoutTitle default="${g.message(code: 'title.index.page')}"/></title>
</head>
... some common resources loading....
<body id="launch">
<g:layoutBody/>
...........................
<r:layoutResources />
</body>
</html>
Then there are two more headers, one for logged-in users and another for guest users. Both of these header layouts extend the base layout.
Guest users (anonymouys-header-footer.gsp) -
<g:applyLayout name="base-header-footer">
<!DOCTYPE html>
<html>
<head>
<g:layoutHead/>
</head>
<body>
... render guest user header
<g:layoutBody/>
</body>
</html>
Logged-in users (loggedin-header-footer.gsp) -
<g:applyLayout name="base-header-footer">
<!DOCTYPE html>
<html>
<head>
... some css
<g:layoutHead/>
</head>
<body>
... Render header for logged-in user
</body>
... load some JS file...
</html>
On specific pages I apply the guest or logged-in layout based on the user's login state. I want to show the page-specific title for the page the user is on, but it doesn't work.
This is how I am using those layouts:
OrderStatus.gsp -
<!DOCTYPE html>
<html>
<head>
<title>Order status | Some title</title>
<meta name="layout" content="logged-in-header-footer" />
<script type="text/javascript" src="${resource(dir:'js',file:'some.js')}"></script>
</head>
<body>
</body>
</html>
But I still see the title that is defined in base-header-footer.gsp, not the one in OrderStatus.gsp.
I have also tried using g:layoutTitle in OrderStatus.gsp, but it doesn't help.
Any help is highly appreciated.
Use
<meta name="layout" content="base-header-footer">
in your pages to load the layout, then add your title there,
<title>${whatever.something()}</title>
in your layout add this:
<title><g:layoutTitle/></title>
enjoy.
Try to use
<title><g:layoutTitle/></title>
in your layouts (base-header-footer and loggedin-header-footer.gsp). More info in the official documentation.

How do I omit script elements from HTML using XPath in Nokogiri in Ruby on Rails?

Say I start with everything inside the body element:
Nokogiri::HTML( doc ).xpath( "/html/body/node()" ).to_html
which contains some <script> and <noscript> elements. How do I get rid of these?
You might want to change your XPath expression to:
Nokogiri::HTML( doc ).xpath( "/html/body/node()[not(self::script or self::noscript)]" ).to_html
#!/usr/bin/env ruby
require 'nokogiri'
html = <<EOT
<html>
<head>
<script>
<!-- dummy script !>
</script>
</head>
<body>
<script><!-- dummy script !></script>
<noscript>dummy script</noscript>
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
Here's the gist of it:
doc.at('body').search('script,noscript').remove
puts doc.to_xml
>> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
>> <html>
>> <head>
>> <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
>> <script>
>> <!-- dummy script !>
>> </script>
>> </head>
>> <body>
>>
>> </body>
>> </html>
For simplicity, I'm using Nokogiri's ability to use CSS accessors, rather than XPath.
doc.at('body').search('script,noscript').remove
looks for the first occurrence of the <body> tag, then looks inside for all <script> and <noscript> tags, removing them.
The gap between the resulting <body> tags is the result of the newlines in the text nodes that trailed the removed tags.
