I have the following HTML in a variable named html_data. I want to replace the <img> tags with <a> tags, with the src attribute of each "img" tag becoming the href of the corresponding "a" tag.
Existing HTML:
<!DOCTYPE html>
<html>
<head>
<title>Learning Nokogiri</title>
</head>
<body marginwidth="6">
<div valign="top">
<div class="some_class">
<div class="test">
<img src="apple.png" alt="Apple" height="42" width="42">
<div style="white-space: pre-wrap;"></div>
</div>
</div>
</div>
</body>
</html>
This is my solution A:
nokogiri_html = Nokogiri::HTML(html_data)
nokogiri_html("img").each { |tag|
a_tag = Nokogiri::XML::Node.new("a", nokogiri_html)
a_tag["href"] = tag["src"]
tag.add_next_sibling(a_tag)
tag.remove()
}
puts 'nokogiri_html is', nokogiri_html
This is my solution B:
nokogiri_html = Nokogiri::HTML(html_data)
nokogiri_html("img").each { |tag|
tag.name= "a";
tag.set_attribute("href" , tag["src"])
}
puts 'nokogiri_html is', nokogiri_html
While solution A works fine, I am wondering whether there is a quicker, more direct way to replace the tags using Nokogiri. With solution B, the "img" tag does get replaced with an "a" tag, but the attributes of the "img" tag still remain on the "a" tag. Below is the result of solution B:
<!DOCTYPE html>
<html>
<body>
<p>["\n", "\n", " </p>
\n", "
<title>Learning Nokogiri</title>
\n", " \n", " \n", "
<div valign='\"top\"'>
\n", "
<div class='\"some_class\"'>
\n", "
<div class='\"test\"'>
\n", " <a src="%5C%22apple.png%5C%22" alt='\"Apple\"' height='\"42\"' width='\"42\"' href="%5C%22apple.png%5C%22"></a>\n", "
<div style='\"white-space:' pre-wrap></div>
\n", "
</div>
\n", "
</div>
\n", "
</div>
\n", " \n", ""]
</body>
</html>
Is there a faster way to replace tags in HTML using Nokogiri? Also, how can I remove the "\n"s I am getting in the result?
First, please strip your sample data (HTML) to the barest amount necessary to demonstrate the problem.
Here's the basics of doing what you want:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<!DOCTYPE html>
<html>
<body>
<img src="apple.png" alt="Apple" height="42" width="42">
</body>
</html>
EOT
doc.search('img').each do |img|
src, alt = %w[src alt].map{ |p| img[p] }
img.replace("<a href='#{ src }'>#{ alt }</a>")
end
doc.to_html
# => "<!DOCTYPE html>\n<html>\n <body>\n <a href=\"apple.png\">Apple</a>\n </body>\n</html>\n"
puts doc.to_html
# >> <!DOCTYPE html>
# >> <html>
# >> <body>
# >> <a href="apple.png">Apple</a>
# >> </body>
# >> </html>
Doing it this way allows Nokogiri to replace nodes cleanly.
It's not necessary to do all this rigamarole:
a_tag = Nokogiri::XML::Node.new("a", nokogiri_html)
a_tag["href"] = tag["src"]
tag.add_next_sibling(a_tag)
tag.remove()
Instead, create a string that is the tag you want to use and let Nokogiri convert the string to a node and replace the old node:
src, alt = %w[src alt].map{ |p| img[p] }
img.replace("<a href='#{ src }'>#{ alt }</a>")
It's not necessary to strip extraneous whitespace between nodes. It can affect the look of the HTML but browsers will gobble that extra whitespace and not display it.
Nokogiri can be told to not output the inter-node whitespace, resulting in a compressed/fugly output, but how to do that is a separate question.
Users with different roles need to see different navbars at the top of the webpage. My topnavs are th:fragments written in a separate file, and that part of the code works fine.
But when I combine th:switch (or th:if) with th:replace, all topnavs are shown on the page instead of just one.
I looked at these questions for solutions, but they did not help:
Thymleaf switch statement with multiple case
How to use thymeleaf conditions - if - elseif - else
Things I tried:
#1
<span th:if = "${role.value} == 'role1' " th:replace = "fragments/topnav :: navbar_role1"></span>
<span th:if = "${role.value} == 'role2' " th:replace = "fragments/topnav :: navbar_role2"></span>
#2
<span th:if = "${role} == 'role1' " th:replace = "fragments/topnav :: navbar_role1"></span>
<span th:if = "${role} == 'role2' " th:replace = "fragments/topnav :: navbar_role2"></span>
#3
<span th:if = "${role} eq 'role1' " th:replace = "fragments/topnav :: navbar_role1"></span>
<span th:if = "${role} eq 'role2' " th:replace = "fragments/topnav :: navbar_role2"></span>
#4
<th:block th:switch = "${role}">
<div th:case = "${role} eq 'role1' " th:replace = "fragments/topnav :: navbar_role1"></div>
<div th:case = "${role} eq 'role2' " th:replace = "fragments/topnav :: navbar_role2"></div>
</th:block>
In the debugger I can see that the Java field role, of type String, has the value role2. Yet instead of showing only the second navbar, the page shows both.
Have I missed something in the Thymeleaf syntax?
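In this case the th:replace needs to sit on a child element of the one carrying the condition for the replace to work as intended. Thymeleaf processes th:replace (and th:insert) before th:if, th:switch and th:case, so when both are on the same tag the fragment is inserted before the condition is ever evaluated, and every navbar is rendered.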
Assume we have the following fragments in fragments/topnav.html:
<div th:fragment="navbar_role1">
<div>This is navbar role 1</div>
</div>
<div th:fragment="navbar_role2">
<div>This is navbar role 2</div>
</div>
Then your main page can be structured as follows:
<!DOCTYPE html>
<html xmlns:th="http://www.thymeleaf.org">
<head>
<title>Switch Demo</title>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
<th:block th:switch="${role}">
<th:block th:case="'role1'">
<div th:replace="fragments/topnav :: navbar_role1"></div>
</th:block>
<th:block th:case="'role2'">
<div th:replace="fragments/topnav :: navbar_role2"></div>
</th:block>
</th:block>
</body>
</html>
You can obviously use th:block elements or something else such as <div> elements, depending on what you want the final structure to look like.
Note also that you do not need the following when using a th:switch:
th:case="${role} eq 'role1'" // NOT needed
You can simply use the value itself:
th:case="'role1'"
That is one of the benefits of using switch statements here.
I ran into this error while updating OneNote page content. Let me first explain the input HTML I send to OneNote before showing the issue.
Here is my input HTML template:
<!DOCTYPE html>
<html>
<head>
<title>Title</title>
</head>
<body>
<object data-id="markdown-file" data-attachment="markdown.md" data="name:markdown" type="text/markdown" />
<div data-id="content">{{ content goes here }}</div>
</body>
</html>
And I sent the following patch command to update content if div[data-id="content"] exists:
{
'target': generated id of div[data-id="content"],
'action': 'replace',
'content': '<div data-id="content">{{ actual content }}</div>'
}
otherwise, I use another command:
{
'target': 'body',
'action': 'append',
'content': '<div data-id="content">{{ actual content }}</div>'
}
Most of the time this works fine, but not always. Suppose the page has the following output HTML:
<html lang="en-US">
<head>
<title>2</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body data-absolute-enabled="true" style="font-family:Calibri;font-size:11pt">
<div id="div:{bbed3bc8-6ec5-4900-b1ee-a11259b4d796}{2}" data-id="_default" style="position:absolute;left:48px;top:120px;width:624px">
<object data-attachment="markdown.md" type="text/markdown" data="https://graph.microsoft.com/v1.0/users('195d63c8-4d1e-4073-b535-5d8a32b6f6ce')/onenote/resources/1-5ae390556d1e4351b358b6e1a667a226!1-051437c2-f608-445d-b537-e68aea2dfcd9/$value" data-id="markdown-file" />
<div data-id="content" id="div:{ce16905b-a76b-4e35-86de-1a46b5f8a62f}{69}:{ce16905b-a76b-4e35-86de-1a46b5f8a62f}{81}">
<table id="table:{ce16905b-a76b-4e35-86de-1a46b5f8a62f}{69}" style="border:1px solid;border-collapse:collapse">
<tr id="tr:{ce16905b-a76b-4e35-86de-1a46b5f8a62f}{70}">
<td id="td:{ce16905b-a76b-4e35-86de-1a46b5f8a62f}{72}" style="background-color:white;border:1px solid;text-align:center"><span style="font-family:BlinkMacSystemFont;color:#363636;font-weight:bold">Head</span></td>
</tr>
<tr id="tr:{ce16905b-a76b-4e35-86de-1a46b5f8a62f}{71}">
<td id="td:{ce16905b-a76b-4e35-86de-1a46b5f8a62f}{75}" style="background-color:white;border:1px solid"><span style="font-family:BlinkMacSystemFont;color:#363636">Column</span></td>
</tr>
</table>
</div>
</div>
</body>
</html>
Notice that div[data-id="content"] contains only a table. If I try to replace div[data-id="content"], I get the error The PATCH target $value specified and page content related to the specified PATCH target cannot be located. The error message is not very clear, so I cannot tell which target is missing.
But if the output HTML contains other elements besides tables, the div can be replaced successfully. My code works with the following output HTML:
<html lang="en-US">
<head>
<title>3</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body data-absolute-enabled="true" style="font-family:Calibri;font-size:11pt">
<div id="div:{a61d7d65-215f-4936-a8c3-4dda9a805827}{249}" data-id="_default" style="position:absolute;left:48px;top:120px;width:624px">
<object data-attachment="markdown.md" type="text/markdown" data="https://graph.microsoft.com/v1.0/users('195d63c8-4d1e-4073-b535-5d8a32b6f6ce')/onenote/resources/1-c52b2dd5a8d74a89a0e038373b52b3f1!1-051437c2-f608-445d-b537-e68aea2dfcd9/$value" data-id="markdown-file" />
<div data-id="content" id="div:{25489f27-57fa-4798-b4ef-d229a5c5841f}{171}:{25489f27-57fa-4798-b4ef-d229a5c5841f}{187}">
<table id="table:{25489f27-57fa-4798-b4ef-d229a5c5841f}{171}" style="border:1px solid;border-collapse:collapse">
<tr id="tr:{25489f27-57fa-4798-b4ef-d229a5c5841f}{172}">
<td id="td:{25489f27-57fa-4798-b4ef-d229a5c5841f}{174}" style="background-color:white;border:1px solid;text-align:center"><span style="font-family:BlinkMacSystemFont;color:#363636;font-weight:bold">Head</span></td>
</tr>
<tr id="tr:{25489f27-57fa-4798-b4ef-d229a5c5841f}{173}">
<td id="td:{25489f27-57fa-4798-b4ef-d229a5c5841f}{177}" style="background-color:white;border:1px solid"><span style="font-family:BlinkMacSystemFont;color:#363636">Column</span></td>
</tr>
</table>
<p id="p:{25489f27-57fa-4798-b4ef-d229a5c5841f}{187}" style="margin-top:5.5pt;margin-bottom:5.5pt"><span style="font-family:BlinkMacSystemFont;color:#4a4a4a">hello</span></p>
</div>
</div>
</body>
</html>
The only difference between the two output HTML documents is that the second one has a p tag, which makes this issue look odd.
Here is my code to update the page content:
import io
import json

from pyquery import PyQuery

# request, oauth, _get_page_content and MARKDOWN_FILE_OBJECT_HTML are defined elsewhere in the app.


def update_page(id):
    # Fetch the current page and look for the existing content div.
    original_content = _get_page_content(id)
    original_document = PyQuery(original_content)
    content_div = original_document('div[data-id="content"]')
    page = request.json
    new_document = PyQuery(page['content'])
    commands = [
        {
            'target': 'title',
            'action': 'replace',
            'content': page['title']
        },
        {
            'target': '#markdown-file',
            'action': 'replace',
            'content': MARKDOWN_FILE_OBJECT_HTML
        }
    ]
    # OneNoteHtmlMapper is not implemented yet; it simply calls new_document.outer_html()
    content = '<div data-id="content">{0}</div>'.format(
        OneNoteHtmlMapper(new_document).get_html())
    if content_div:
        # Replace the existing content div, addressed by its generated id.
        commands.append({
            'target': content_div.attr('id'),
            'action': 'replace',
            'content': content
        })
    else:
        # No content div yet: append one to the body.
        commands.append({
            'target': 'body',
            'action': 'append',
            'content': content
        })
    files = {
        'Commands': ('', io.StringIO(json.dumps(commands)),
                     'application/json'),
        'markdown': ('markdown.md', io.StringIO(page['markdown']),
                     'text/markdown')
    }
    oauth_client = oauth.microsoft_graph
    response = oauth_client.request(
        'PATCH', 'me/onenote/pages/{0}/content'.format(id), files=files)
    return response.content, response.status_code
Thanks in advance!
A temporary fix: append a 1px by 1px white image to the div if it contains only tables.
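The workaround could be sketched roughly as follows. This is only an illustration using the standard library's html.parser: the helper names (only_tables, pad_table_only_content) and the placeholder image URL are hypothetical, and the sketch assumes the content HTML is well formed, with every non-void tag closed.

```python
from html.parser import HTMLParser

# Hypothetical placeholder: URL of a 1x1 white image you host yourself.
PLACEHOLDER_IMG = '<img src="https://example.com/1x1-white.png" width="1" height="1" />'

# Tags that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {'img', 'br', 'hr', 'input', 'meta', 'link'}


class _TopLevelTags(HTMLParser):
    """Collects the names of top-level elements and notes any top-level text."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []
        self.has_text = False

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            self.tags.append(tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag such as <img />: record it, depth is unchanged.
        if self.depth == 0:
            self.tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.has_text = True


def only_tables(inner_html):
    """Return True when the markup consists solely of top-level <table> elements."""
    parser = _TopLevelTags()
    parser.feed(inner_html)
    parser.close()
    return (bool(parser.tags) and not parser.has_text
            and all(tag == 'table' for tag in parser.tags))


def pad_table_only_content(inner_html):
    """Append the placeholder image when the content contains only tables."""
    if only_tables(inner_html):
        return inner_html + PLACEHOLDER_IMG
    return inner_html
```

In update_page, the inner HTML could be passed through pad_table_only_content before it is wrapped in the div with data-id="content".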
<!DOCTYPE html>
<html>
<head>
<title>Mathquill</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<link rel="stylesheet" type="text/css" href="mathquill-0.9.4/mathquill.css">
<script src="jquery-1.9.1.min.js"></script>
<script src="mathquill-0.9.4/mathquill.min.js"></script>
<script>
function clickMe()
{
$('#taOne').mathquill('latex', 'x^2');
$('#taTwo').mathquill('latex', '\int x');
$('#taThree').mathquill('latex', '\left(x^2 + y^2 \right)');
}
</script>
</head>
<body style="height: auto">
<div id="MathOutput" style="display: none">$$ {} $$</div>
<div id="MathList" style="font-size:30px;background-color:LightSeaGreen;height: auto;line-height: 1.4;font-family: 'Museo Sans',sans-serif; margin-bottom: 3px;" />
<div id="Ans1" class="mathquill-embedded-latex" style="background-color:yellow;text-align:left;font-size:30px;height: auto"></div>
<input type="button" value="ClickMe" onclick="clickMe();"/>
<textarea id="taOne" class="mathquill-editable" name="taOne" style="width:80%;vertical-align:top"></textarea>
<textarea id="taTwo" class="mathquill-editable" name="taTwo" style="width:80%;vertical-align:top"></textarea>
<textarea id="taThree" class="mathquill-editable" name="taThree" style="width:80%;vertical-align:top"></textarea>
</body>
</html>
In the code above I am trying to show LaTeX equations in the textareas, but each call renders as follows:
$('#taOne').mathquill('latex', 'x^2');
renders as: x2
$('#taTwo').mathquill('latex', '\int x');
renders as: intx
$('#taThree').mathquill('latex', '\left(x^2 + y^2 \right)');
renders as: left(x^2+y^2ight)
How can I fix this?
Looks like you need to use double \\ instead of a single \.
Change this:
function clickMe()
{
$('#taOne').mathquill('latex', 'x^2');
$('#taTwo').mathquill('latex', '\int x');
$('#taThree').mathquill('latex', '\left(x^2 + y^2 \right)');
}
to:
function clickMe()
{
$('#taOne').mathquill('latex', 'x^2');
$('#taTwo').mathquill('latex', '\\int x');
$('#taThree').mathquill('latex', '\\left(x^2 + y^2 \\right)');
}
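\ is used in JavaScript strings to escape special characters, so to get a literal backslash in the string you have to escape it with another backslash. That is why the output above reads "intx" and "left(...ight)": \i and \l are not recognized escapes, so the backslash is simply dropped, while the \r in \right is turned into a carriage return.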
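The same escaping rule exists in most languages with C-style string literals. As a rough illustration in Python rather than JavaScript (the details differ slightly: Python keeps the backslash of an unknown escape such as \i, while JavaScript drops it), this snippet shows \r inside \right being read as a carriage return unless the backslash is doubled:

```python
# Illustration only: Python string literals, not the JavaScript from the question.
single = '\right)'     # "\r" is parsed as one carriage-return character
doubled = '\\right)'   # "\\" is parsed as one literal backslash

print(len(single))             # 6: carriage return + "ight)"
print(len(doubled))            # 7: backslash + "right)"
print(doubled == r'\right)')   # True: a raw string also keeps the backslash
```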
<div>
text1
</div>
<div>
text2
</div>
<div>
text3
</div>
Hello,
I am trying to parse the links in this HTML code. The "thread_title" id has a different number on each link, and I could not work out a selector that matches them all.
Thanks
$html = new simple_html_dom();
$html->load($input);
foreach($html->find('a[id=thread_title_([^\"]*)]') as $link)
echo $link->outertext . '<br>';
Use the attribute prefix selector (^=), which matches every id that starts with thread_title:
$html = new simple_html_dom();
$html->load($input);
foreach($html->find('a[id^=thread_title]') as $link)
echo $link->outertext . '<br>';
I have this html
<div class="postrow first">
<h2 class="title icon">
This is the title
</h2>
<div class="content">
<div id="post_message_1668079">
<blockquote class="postcontent restore ">
<div>Category</div>
<div>Authour: Kim</div>
line 1<br /> line2
</blockquote>
</div>
</div>
</div> <div class="postrow">
<h2 class="title icon">
This is the title
</h2>
<div class="content">
<div id="post_message_1668079">
<blockquote class="postcontent restore ">
<div>Category</div>
line 1<br /> line2
</blockquote>
</div>
</div>
</div>
I want to extract the following from each div that has the class "postrow". Such a div may also have other classes, as in <div class="postrow first">. The class "first" is not my concern; the class attribute just needs to begin with "postrow".
The content inside the tag with the class "title".
The HTML from the "blockquote" tag, but without any div inside that tag.
Code I tried:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml("http://localhost/vanilla/");
List<string> facts = new List<string>();
foreach (HtmlNode li in doc.DocumentNode.SelectNodes("//div[#class='postrow']"))
{
facts.Add(li.InnerHtml);
foreach (String s in facts)
{
textBox1.Text += s + "/n";
}
}
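Your code has an issue: LoadHtml expects the HTML itself as a string, not a URL or path. So instead of
doc.LoadHtml("http://localhost/vanilla/");
download the page first and pass its content:
var request = (HttpWebRequest)WebRequest.Create("http://localhost/vanilla/");
using (var response = request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream())) // needs using System.IO;
    doc.LoadHtml(reader.ReadToEnd()); // LoadHtml expects the markup itself
Now iterate over the parsed HTML. Note that XPath attribute tests use @ rather than #, so to also match class="postrow first" the selector should read //div[starts-with(@class,'postrow')].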