Extracting text with XPath in Python - parsing

I have a page. Its HTML structure is like this:
<p>
<strong style="mso-bidi-font-weight: normal;">
<span>
Text
</span>
</strong>
<span>
Text
</span>
<span>
Text
<em>
Text
</em>
Text
</span>
<span>
Text
<strong>
Text
</strong>
</span>
</p>
I want to extract the text from each p tag, with the paragraphs separated by newlines:
first p tag: TextTextText
next p tag: TextTextText
Here is what I have done:
url="http://www.toponymic-dictionary.in.ua/index.php?option=com_content&view=section&layout=blog&id="+str(i)+"&Itemid="+str(i+1)
page = urllib.urlopen(url)
pageWritten = page.read()
pageReady = pageWritten.decode('utf-8')
xmldata = lxml.html.document_fromstring(pageReady)
for element in xmldata.xpath('//p[@class="MsoNormal"]'):
joined_text=u''.join(element.xpath('descendant::text()'))
print joined_text
But it prints only the last p, and I don't understand why. I'd be very glad for any help.
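One possible cause, assuming the print statement sits outside the loop body (the indentation is not visible in the post), is that only the value from the final iteration is printed. Below is a minimal sketch of collecting the text of every paragraph inside the loop. It uses Python 3 and the standard library's xml.etree.ElementTree on a small well-formed sample rather than lxml and the real page; the sample markup and the name paragraph_texts are made up for illustration. lxml elements also provide itertext(), so the loop body should carry over.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the real page.
SAMPLE = """
<div>
  <p class="MsoNormal"><strong><span>Alpha</span></strong><span> beta <em>gamma</em> delta</span></p>
  <p class="MsoNormal"><span>Second <strong>paragraph</strong></span></p>
</div>
"""


def paragraph_texts(markup):
    """Return one normalized text string per <p class="MsoNormal">."""
    root = ET.fromstring(markup)
    texts = []
    for p in root.findall(".//p[@class='MsoNormal']"):
        # itertext() walks every descendant text node in document order.
        joined = "".join(p.itertext())
        # Collapse runs of whitespace left over from the markup layout.
        texts.append(" ".join(joined.split()))
    return texts


lines = paragraph_texts(SAMPLE)
# One paragraph per output line.
print("\n".join(lines))
```

The key point is that the join and the append both happen inside the for loop, so each paragraph contributes its own line.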

Related

How to make the result card/list responsive?

I am trying to build a filter-based search using ReactiveSearch, but I am stuck on the result card implementation. Currently, the result card only displays specific fields from my Elasticsearch database. I need the card to be interactive: when a result card is clicked, a popup should be rendered on the screen showing the additional details of that particular search result.
I have tried a few CSS and JavaScript popups, but I am unable to render the contents of each search item in them.
<ResultCard
componentId="results"
dataField="original_title"
react={{
and: [
"mainSearch",
"RangeSlider",
"Brand-list",
"Segment-list",
"fuel-list"
]
}}
pagination={true}
className="Result_card"
paginationAt="bottom"
pages={5}
size={12}
Loader="Loading..."
noResults="No results were found..."
sortOptions={[
{
dataField: "Price__in_Lakhs_",
sortBy: "asc",
label: "Sort by Price (Low to High) \u00A0"
},
{
dataField: "Price__in_Lakhs_",
sortBy: "desc",
label: "Sort by Price (High to Low) \u00A0 \u00A0"
},
{
dataField: "Variants.keyword",
sortBy: "asc",
label: "Sort by Variant (A-Z) \u00A0"
}
]}
innerClass={{
title: "result-title",
listItem: "result-item",
list: "list-container",
sortOptions: "sort-options",
resultStats: "result-stats",
resultsInfo: "result-list-info"
}}
onData={function(res) {
return {
description: (
<div className="main-description">
<div className="ih-item square effect6 top_to_bottom">
<a target="#" href={"" + res.Index}>
<div className="img">
<img
src={"" + res.Index}
alt={""}
className="result-image"
/>
</div>
<div className="info colored">
<h3 className="overlay-title">
{res.Variants}
<div className="overlay-description">
{res.Model}
</div>
<div className="overlay-info">
<div className="rating-time-score-container">
<div className="sub-title Rating-data">
<b>
Price:
<span className="details">
{" "}
{res.Price__in_Lakhs_}
{" Lakhs"}
</span>
</b>
</div>
<div className="sub-title Score-data">
<b>
Segment:
<span className="details">
{" "}
{res.Segment}
</span>
</b>
</div>
</div>
<div className="revenue-lang-container">
<div className="revenue-data">
<b>
<span>Brand: </span>{" "}
<span className="details">
{" "}
{res.Brand}
</span>{" "}
</b>
</div>
<div className="sub-title language-data">
<b>
Mileage:
<span className="details">
{" "}
{res.Mileage__ARAI_} Kmpl
</span>
</b>
</div>
</div>
</div>
</div>
</a>
</div>
</div>
),
};
}}
/>
Maybe this thread will help you: https://stackoverflow.com/a/56685332/9119053 .
If this is not what you are looking for, can you please give more context? There also seems to be a styling issue with the code.

Ruby on Rails clipboard-rails dynamic url copy issue

So I wanted to copy the url of each resource to the clipboard so I tried:
<% @posts.each do |post|%>
<script>
$(document).ready(function(){
var clipboard = new Clipboard('.clipboard-btn');
console.log(clipboard);
});
</script>
<textarea id="bar"><%= post_path(post)%></textarea>
<button class="clipboard-btn" data-clipboard-action="copy" data-clipboard-target="#bar">
Copy to clipboard
</button>
<% end %>
But the problem with that was it only copied the url of the first resource. So I tried this:
<% @posts.each do |post|%>
<script>
$(document).ready(function(){
var clipboard = new Clipboard('.clipboard-btn<%=post.id%>');
console.log(clipboard);
});
</script>
<textarea id="bar<%=post.id%>"><%= post_path(post)%></textarea>
<button class="clipboard-btn<%=post.id%>" data-clipboard-action="copy" data-clipboard-target="#bar<%=post.id%>">
Copy to clipboard
</button>
<% end %>
without any luck.
You can move the script outside the loop so that a single Clipboard instance is created instead of one per post in @posts. Then use an attribute selector that matches every element whose class starts with clipboard-btn, so the selector does not need the id:
<% @posts.each do |post|%>
<textarea id="bar<%= post.id %>">
<%= post_path(post) %>
</textarea>
<button
class="clipboard-btn<%= post.id %>"
data-clipboard-action="copy"
data-clipboard-target="#bar<%= post.id %>">
Copy to clipboard
</button>
<% end %>
<script>
$(document).ready(function(){
var clipboard = new Clipboard('[class^="clipboard-btn"]');
});
</script>
For example:
$(document).ready(function() {
var clipboard = new Clipboard('[class^="clipboard-btn"]');
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/1.7.1/clipboard.min.js"></script>
<textarea id="bar1">Some content 1</textarea>
<button class="clipboard-btn1" data-clipboard-action="copy" data-clipboard-target="#bar1">
Copy to clipboard
</button>
<br>
<textarea id="bar2">Some content 2</textarea>
<button class="clipboard-btn2" data-clipboard-action="copy" data-clipboard-target="#bar2">
Copy to clipboard
</button>
<br>
<textarea id="bar3">Some content 3</textarea>
<button class="clipboard-btn3" data-clipboard-action="copy" data-clipboard-target="#bar3">
Copy to clipboard
</button>
<br>
You cannot copy multiple target elements at once. However, you can use some of the more advanced options in the imperative API for that:
new Clipboard('.clipboard-btn', {
text: function(trigger) {
var a = document.querySelector('#a').value;
var b = document.querySelector('#b').value;
return a + b;
}
});

In Thymeleaf, how can I write th:each to combine rows and columns?

I want to write 4 columns in a row like this
<div class="row">
<div class="span3">Something</div>
<div class="span3">Something</div>
<div class="span3">Something</div>
<div class="span3">Something</div>
</div>
<div class="row">
<div class="span3">Something</div>
<div class="span3">Something</div>
<div class="span3">Something</div>
<div class="span3">Something</div>
</div>
The data size is dynamic, so there can be 4, 8 or more items.
In another template engine this is achieved like so:
{{#each list}}
{{#if @index % 4 == 0}}
<div class="row">
{{/if}}
<div class="span3">{{this.name}}</div>
{{#if @index % 4 == 0}}
</div>
{{/if}}
{{/each}}
But how can I achieve this in Thymeleaf?
I can't find a way, because th:each is an attribute on the tag itself (<div class="row"> or <div class="span3">).
Model code
List<String> data = new ArrayList<String>();
data.add("1");
data.add("2");
data.add("3");
data.add("4");
data.add("5");
data.add("6");
data.add("7");
data.add("8");
model.addAttribute("datas", data);
Thymeleaf view code
<div th:each="data, row: ${datas}" th:with="numList=${#strings.listSplit('3,2,1,0', ',')}" th:if="${row.current} % 4 == 0" class="span3">
<span th:each="num : ${numList}" th:with="dataIndex=(${row.index} - ${num})" th:text="${datas[dataIndex]}">data</span>
</div>
Result
<div class="span3">
<span>1</span><span>2</span><span>3</span><span>4</span>
</div>
<div class="span3">
<span>5</span><span>6</span><span>7</span><span>8</span>
</div>
I used an array to solve this problem.
I think you will find a better way.
This can be done using numbers.sequence too. Set colCount to whatever number of columns you'd like:
<th:block th:with="colCount=${4}">
<div th:each="r : ${#numbers.sequence(0, datas.size(), colCount)}" class="row">
<div th:each="c : ${#numbers.sequence(0, colCount - 1)}" th:if="${r + c < datas.size()}" th:text="${datas.get(r + c)}" class="span3"></div>
</div>
</th:block>
I just created an account here to correct the accepted answer. The accepted answer works great so long as the "datas" being passed in is an array of consecutive integers. However, to make it work with any kind of data structure, "row.current" needs to change to "row.count", as follows:
<div th:each="data, row: ${datas}" th:with="numList=${#strings.listSplit('3,2,1,0', ',')}" th:if="${row.count} % 4 == 0" class="span3">
<span th:each="num : ${numList}" th:with="dataIndex=(${row.index} - ${num})" th:text="${datas[dataIndex]}">data</span>
</div>
If you use row.current, then it uses the actual item in the list, which is great in the example shown, but not so great for any other kind of data structure. Hope this helps.
EDIT:
I have to further refine this because the accepted answer also does not work if the number of items in the list is not evenly divisible by 4. Here is a better (though probably not perfect) solution:
<div th:each="data, row: ${datas}" th:with="numList=${ {3,2,1,0} }" th:if="${row.count % 4 == 0 or row.last}" class="span3">
<!-- Show all rows except the leftovers -->
<span th:each="num : ${numList}" th:with="dataIndex=(${row.index} - ${num})" th:if="${row.count % 4 == 0}" th:text="${datas[dataIndex]}">data</span>
<!-- Show the remainders (eg, if there are 9 items, the last row will have one item in it) -->
<span th:each="num : ${numList}" th:with="dataIndex=(${row.index} - ${num})" th:if="${row.last} and ${row.count % 4 != 0} and ${num < row.count % 4}" th:text="${datas[dataIndex]}">data</span>
</div>
This may be able to be refactored to eliminate one of the spans, but I have to move on now.
th:each can be used on any element basically. So something like this:
<div class="row" th:each="row : ${rows}">
<div class="span3" th:each="name : ${row.names}" th:text="${name}">Something</div>
</div>
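The rows/names structure above presupposes that the flat list has already been split into groups before it reaches the template. Here is a minimal sketch of that grouping step, written in Python purely for illustration (the question's model code is Java); chunk_rows and col_count are hypothetical names, not part of Thymeleaf or the original answer.

```python
def chunk_rows(items, col_count=4):
    """Split a flat list into consecutive rows of at most col_count items."""
    if col_count < 1:
        raise ValueError("col_count must be at least 1")
    # Row starts are 0, col_count, 2 * col_count, ...; the last row may be short.
    return [items[start:start + col_count]
            for start in range(0, len(items), col_count)]


# Nine items with four columns: two full rows and one leftover item.
rows = chunk_rows([str(n) for n in range(1, 10)])
for row in rows:
    print(row)
```

With the data grouped this way, the template only needs two nested loops and no index arithmetic, and a list whose length is not a multiple of four is handled without special cases.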
<div class="row" th:each="museum,step : ${museums}">
<span th:if="${step.index % 2 == 0}">
<div class="column" style="background-color:#aaa;" >
<h2 th:text="'Name: ' + ${museum.name}"></h2>
<p th:text="'Address: ' + ${museum.address}"></p>
<p th:text="'Capacity: ' + ${museum.capacity}"></p>
</div>
</span>
<span th:if="${step.index < 3 and step.index %2 == 0} ">
<div class="column" style="background-color:#bbb;">
<h2 th:text="'Name: ' + ${museums[step.index+1].name}"></h2>
<p th:text="'Address: ' + ${museums[step.index+1].address}"></p>
<p th:text="'Capacity: ' + ${museums[step.index+1].capacity}"></p>
</div>
</span>
</div>
This is how I would approach the problem, using a simple odd/even trick.
I had the same problem. I saw the accepted answer, but it was not easy enough for me, so I tried a new solution that does the job with much less code and is much easier to understand.
This is what I came up with:
// rowNum your situation, but be aware that you have to change all its instances to
<div th:each="i: ${#numbers.sequence(1, rowNum)}" class="row">
<div th:each="j: ${#numbers.sequence(((i-1) * rowNum), ((i-1) * rowNum) + (rowNum - 1)) }"
th:if="${j < #lists.size(list)}"
class="col col-md-5">
<span th:text="${list.get(j)}">item</span>
</div>
</div>
These are all so complicated when the answer is as simple as the one given here.
<div colCount=${4} class="row">
<div class="span3" th:each="data : ${data}">...</div>
</div>
It limits the for each to 4 element blocks. I'm guessing that's a relatively new feature. Pretty cool.

C# HtmlAgilityPack HTML parsing issue

I have this HTML:
<div class="postrow first">
<h2 class="title icon">
This is the title
</h2>
<div class="content">
<div id="post_message_1668079">
<blockquote class="postcontent restore ">
<div>Category</div>
<div>Authour: Kim</div>
line 1<br /> line2
</blockquote>
</div>
</div>
</div> <div class="postrow">
<h2 class="title icon">
This is the title
</h2>
<div class="content">
<div id="post_message_1668079">
<blockquote class="postcontent restore ">
<div>Category</div>
line 1<br /> line2
</blockquote>
</div>
</div>
</div>
I want to extract the following from each div that has the class "postrow". The div may also carry other classes, as in <div class="postrow first">; the class "first" is not my concern, I only need "postrow" to be present at the beginning.
The content inside the tag with class "title".
The HTML from the "blockquote" tag, but not any div within this tag.
Code I tried:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml("http://localhost/vanilla/");
List<string> facts = new List<string>();
foreach (HtmlNode li in doc.DocumentNode.SelectNodes("//div[@class='postrow']"))
{
facts.Add(li.InnerHtml);
foreach (String s in facts)
{
textBox1.Text += s + "/n";
}
}
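Your code has an issue: LoadHtml expects the HTML markup as a string, not a URL. Instead of
doc.LoadHtml("http://localhost/vanilla/");
download the page first and pass its content:
var request = (HttpWebRequest)WebRequest.Create("http://localhost/vanilla/");
using (var response = request.GetResponse())
using (var reader = new System.IO.StreamReader(response.GetResponseStream()))
    doc.LoadHtml(reader.ReadToEnd()); // LoadHtml parses markup; it does not fetch URLs
Now iterate over the parsed HTML.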
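The question also asks how to match a div whose class list contains "postrow" alongside other classes, and how to take the blockquote content without its nested divs. The sketch below shows only that selection logic, in Python with the standard library rather than C# and HtmlAgilityPack. The helper names (has_class, own_text, extract_posts) and the trimmed sample markup are made up for illustration, and the sketch returns plain text, so the <br /> tag itself is not preserved.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the markup in the question.
SAMPLE = """
<root>
  <div class="postrow first">
    <h2 class="title icon">This is the title</h2>
    <div class="content">
      <blockquote class="postcontent restore">
        <div>Category</div>
        <div>Authour: Kim</div>
        line 1<br/> line2
      </blockquote>
    </div>
  </div>
  <div class="postrow">
    <h2 class="title icon">Another title</h2>
    <div class="content">
      <blockquote class="postcontent restore">
        <div>Category</div>
        line 1<br/> line2
      </blockquote>
    </div>
  </div>
</root>
"""


def has_class(element, name):
    """True if name is one of the space-separated tokens in the class attribute."""
    return name in element.get("class", "").split()


def own_text(element):
    """Text that sits directly inside element, skipping text inside child tags."""
    parts = [element.text or ""]
    for child in element:
        # A child's tail is the text that follows its closing tag.
        parts.append(child.tail or "")
    return " ".join("".join(parts).split())


def extract_posts(markup):
    root = ET.fromstring(markup)
    posts = []
    for div in root.iter("div"):
        if not has_class(div, "postrow"):
            continue
        title = div.find("h2")
        quote = div.find(".//blockquote")
        posts.append((title.text.strip(), own_text(quote)))
    return posts


posts = extract_posts(SAMPLE)
for title, body in posts:
    print(title, "|", body)
```

Comparing against the whole attribute value, as @class='postrow' does, misses "postrow first"; testing the individual class tokens avoids that.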

HTML parsing using HtmlCleaner

I want to parse this type of HTML using HtmlCleaner:
<div class="result-item yt-uix-tile yt-tile-default *sr">
<div class="thumb-container">
<a href="/watch?v=NZiEqhrIL_k" class="ux-thumb-wrap contains-addto result-item-thumb">
<span class="video-thumb ux-thumb yt-thumb-default-138 ">
<span class="yt-thumb-clip">
<span class="yt-thumb-clip-inner">
<img onload="tn_load(2)" alt="Thumbnail" src="//i3.ytimg.com/vi/NZiEqhrIL_k/default.jpg" width="138" >
<span class="vertical-align"></span>
</span>
</span>
</span>
<span class="video-time">2:40</span>
From it I only want to get the href value (href="/watch?v=NZiEqhrIL_k"). How can I achieve this? Thanks in advance.
A quick and dirty approach, in JavaScript:
for each line of the response, set thisLine:
var thisLine = "<a href=\"/watch?v=NZiEqhrIL_k\" class=\"ux-thumb-wrap contains-addto result-item-thumb\">";
Then find the start and the end of the bit you want:
var startPos = thisLine.indexOf("<a href=\"/watch?");
thisLine = thisLine.substring(startPos+2);
var endPos = thisLine.indexOf("class=");
thisLine = thisLine.substring(0,endPos-1);
There are probably a thousand ways to do this. Look at the Related questions on the right side, or search for "parse html response".
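String slicing like the above depends on the exact attribute order in the markup. As an alternative, here is a minimal sketch that lets an HTML parser find the attribute. It is written in Python with the standard library's html.parser rather than Java and HtmlCleaner, and HrefCollector is a hypothetical helper name.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


# Fragment taken from the question; it does not need to be well formed.
SAMPLE = '''<div class="thumb-container">
<a href="/watch?v=NZiEqhrIL_k" class="ux-thumb-wrap contains-addto result-item-thumb">
<span class="video-time">2:40</span>'''

collector = HrefCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.hrefs)
```

Because the parser reports attributes by name, this keeps working if class comes before href or if extra attributes are added to the link.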

Resources