JSoup to extract particular block from multiple block - html-parsing

I'm new to JSoup. How do I extract particular text from multiple blocks that share the same class and attributes?
For example, here I want to extract the information in the third row of the HTML. How do I specify that in my JSoup code?
<tr>
<td align="center" colspan="2" class="maintitle">Active Stats</td>
</tr>
<tr>
<td class="row2" valign="top"><b>User's local time</b></td>
<td class="row1">Oct 22 2013, 07:23 PM</td>
</tr>
<tr>
<td class="row2" width="30%" valign="top"><b>Total Cumulative Posts</b></td>
<td width="70%" class="row1"><b>4</b>
<br />( 0 posts per day / 0.00% of total forum posts )
</td>
</tr>

Use the CSS selector syntax to specify which row to select. The :eq(n) index is zero-based, so tr:eq(2) matches the third row.
Element e = doc.select("tr:eq(2) td.row2").first();
System.out.println(e.text());
will result in
Total Cumulative Posts
A tip: at least look through the Jsoup documentation before asking questions.
All of this can easily be found in the API.
Jsoup - Use selector syntax

Related

Import table from html into google sheets

I have the following table element from a website.
With this formula, only the first td of each row (class=TTRow_left) is extracted.
I want to extract both class=TTRow_left and class=TTRow_right into a Google Sheet.
Formula:
IMPORTHTML("https://www.bsesme.com/","table",6)
Html:
<table width="305" border="0" cellspacing="0" cellpadding="0">
<tbody><tr>
<td class="TTRow_left" style="height:22px;" width="230px">No. of Companies Listed on SME till Date</td>
<td class="TTRow_right" style="height:22px;" id="AL">386</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;" width="230px">Mkt Cap of Cos. Listed on SME till Date (Rs.Cr.)</td>
<td class="TTRow_right" style="height:22px;" id="MCL">58,225.56</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;" width="230px">Total Amount of Money Raised till Date (Rs. Cr.)</td>
<td class="TTRow_right" style="height:22px;" id="Td13">4,132.16</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;" width="230px">No. of Companies Migrated to Main Board</td>
<td class="TTRow_right" style="height:22px;" id="MB">150</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;" width="230px">No. of Companies Listed as of Date </td>
<td class="TTRow_right" style="height:22px;" id="CL"> 236</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;">No. of Companies Suspended</td>
<td class="TTRow_right" style="height:22px;" id="CS">32</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;">No. of Companies Eligible for Trading</td>
<td class="TTRow_right" style="height:22px;" id="CET">201</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;">No. of Companies Traded</td>
<td class="TTRow_right" style="height:22px;" id="CT">110</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;">Advances/ Declines/ Unchanged</td>
<td class="TTRow_right" style="height:22px;" id="Adv">73/ 32/ 5</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;">Mkt Cap of BSE SME Listed Cos. (Rs.Cr.)</td>
<td class="TTRow_right" style="height:22px;" id="Dec">15,095.93</td>
</tr>
<!--<tr>
<td class="TTRow_left" style="height:22px;" width="230px">No. of SME companies migrated to main board</td>
<td class="TTRow_right" style="height:22px;" >3</td>
</tr>-->
</tbody></table>
</td>
</tr>
</tbody></table>
There is a way: you could extract that data with Google Apps Script, i.e. by writing a function that reads the values (they are returned by a separate request).
You need to make a request to this URL, which is the one that loads the data:
https://www.bsesme.com/markets/MarketStat.aspx?&292022849
Values are:
bse$#$237|32|202|104|58|37|9|15,110.69|12|3,364.25|150|387|58,387.68|4,144.97
And then extract the data from that response.
I checked the page's source code: the page uses JavaScript to read the data and rearrange it on the main page (i.e. https://www.bsesme.com/).
Tip: look in the main page's source code for a function called GetNotices(str); it appears to contain the logic that rearranges the data.
You will have to dig deeper to figure out how to get this data into your spreadsheet.
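As a rough sketch of the "extract the data" step, the sample response above can be split into individual values. The logic is shown in Java to match the other code on this page; it would have to be ported to Apps Script for use in a spreadsheet, and the fetch itself is not shown. The class and method names are hypothetical. The sketch assumes that everything up to the `$#$` marker is a tag and that the rest is pipe-delimited. What each position means is not established here; that would have to be read from GetNotices(str).

```java
import java.util.Arrays;
import java.util.List;

public class MarketStatPayload {
    // Splits a raw response such as "bse$#$237|32|..." into its value fields.
    // Assumption: text before "$#$" is a tag, the remainder is pipe-delimited.
    public static List<String> parseValues(String payload) {
        String marker = "$#$";
        int at = payload.indexOf(marker);
        String body = at >= 0 ? payload.substring(at + marker.length()) : payload;
        // The pipe is a regex metacharacter, so it is escaped; -1 keeps empty trailing fields.
        return Arrays.asList(body.trim().split("\\|", -1));
    }

    public static void main(String[] args) {
        // Sample value copied from the response quoted above.
        String sample = "bse$#$237|32|202|104|58|37|9|15,110.69|12|3,364.25|150|387|58,387.68|4,144.97";
        List<String> values = parseValues(sample);
        for (int i = 0; i < values.size(); i++) {
            System.out.println(i + ": " + values.get(i));
        }
    }
}
```

The values stay as strings because several of them contain thousands separators (for example 15,110.69); converting them to numbers would need a locale-aware step.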
IMPORTHTML can return table data only when that data is not generated by JavaScript. I checked the web page you are trying to scrape, and the content that is missing appears to be generated through JavaScript: it is not shown when JavaScript is disabled in the browser.
With JavaScript disabled, the values are not displayed. The TTRow_left cells, however, are hard-coded in the page, which is why the function is able to get them:
<td class="TTRow_left" style="height:22px;" width="230px">No. of Companies Listed on SME till Date</td>
The TTRow_right values are not present in the static page, so the function cannot scrape them.

Parallel loops in Thymeleaf

This is my code; it does not work. I know I cannot combine the loops like that, but how should they be written to get this logic?
<tr th:each="max:${top3max}", th:each="min:${top3min}">
<td th:text="${max.getName()}"></td>
<td th:text="${min.getName()}"></td>
</tr>
As long as both Lists are the same size, you can loop through one and use the status variable to access the other. Like this:
<tr th:each="max, i: ${top3max}">
<td th:text="${max.getName()}"></td>
<td th:text="${top3min[i.index].getName()}"></td>
</tr>
If you want something more like a traditional for loop, this will work, as long as top3max is a List (you'll have to use .length instead of .size() if you're dealing with an array):
<tr th:each="i: ${#numbers.sequence(0, top3max.size() - 1)}">
<td th:text="${top3max[i].getName()}"></td>
<td th:text="${top3min[i].getName()}"></td>
</tr>
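If the two lists might not be the same size, one alternative (not part of the answer above, only a sketch) is to pair the values in Java before they reach the template, so the template needs a single loop. The class name PairRows and the method zip are hypothetical, and the sketch assumes the names have already been pulled out of the max and min objects as strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PairRows {
    // Pairs elements of two lists by index, stopping at the end of the shorter list.
    public static List<String[]> zip(List<String> maxNames, List<String> minNames) {
        int n = Math.min(maxNames.size(), minNames.size());
        List<String[]> rows = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            // Each row holds { max name, min name } for one table line.
            rows.add(new String[] { maxNames.get(i), minNames.get(i) });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        List<String[]> rows = zip(Arrays.asList("a", "b", "c"), Arrays.asList("x", "y", "z"));
        for (String[] row : rows) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```

The resulting list could then be added to the model and iterated with one th:each, reading the two columns by index from each row.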

Google Spreadsheet ImportHTML and ImportXML: imported content is empty

I am trying to import a table from a webpage into a google spreadsheet.
I have tried using the following two functions and both are giving me the error that the "imported content is empty".
=importhtml("http://financials.morningstar.com/ratios/r.html?t=AAPL","table",1)
And
=importxml("http://financials.morningstar.com/ratios/r.html?t=AAPL", "//*[@id='tab-profitability']/table[2]")
P.S. The imported data is for personal use only and will not be used against the website's policies.
It's not possible with your URL (http://financials.morningstar.com/ratios/r.html?t=AAPL).
=importhtml() works only if the web page contains an HTML table.
Here is an example:
URL: http://fr.wikipedia.org/wiki/Démographie_de_l'Inde
On this page you can see a table, and it is an HTML table.
Code in the page:
<table class="wikitable centre" style="text-align: center;">
<tr>
<th colspan="3" scope="col" width="60%">Évolution de la population</th>
</tr>
<tr>
<th>Année</th>
<th>Population</th>
<th><abbr title="Croissance démographique">%±</abbr></th>
</tr>
<tr>
<td>1951</td>
<td>361 088 000</td>
<td>—</td>
</tr>
<tr>
<td>1961</td>
<td>439 235 000</td>
<td>+ 21,6 %</td>
</tr>
<!-- Other values -->
<tr>
<td colspan="3" align="center"><small>Source : <a rel="nofollow" class="external autonumber" href="http://indiabudget.nic.in/es2006-07/chapt2007/tab97.pdf">[1]</a></small></td>
</tr>
</table>
In your Google Spreadsheet you can show the data with:
=IMPORTHTML("http://fr.wikipedia.org/wiki/Démographie_de_l'Inde"; "table";
On your web page (http://financials.morningstar.com/ratios/r.html?t=AAPL) there is no HTML table, so you can't extract values this way.

Is it possible to modify ASP.NET MVC WebGrid so that it doesn't look like a regular grid table?

I need to render a table that doesn't look like a regular grid table, where for one entry, which would be a row in a typical table, columns 1 and 2 values can be put in row 1, columns 3, 4 and 5 values will be in row 2, and so on like so:
<table id="display_searchresults_student">
<tr>
<th rowspan="4"><img src="../../Images/picture_temp.jpg" id="display_searchresults_studentpicture"/></th>
<td id="display_searchresults_studentname">**Name Column**
<img src="../../Images/picture_temp.jpg" height="25px" width="25px"/>
</td>
<td align="right" width="100px">Rating Column</td>
</tr>
<tr>
<td id="display_searchresults_studentlocation">City Column, State Column | University Column</td>
<td></td>
</tr>
<tr>
<td id="display_searchresults_studentinfo">Department Column | School Standing Column Graduation Date Column | GPA Column</td>
<td></td>
</tr>
<tr>
<td id="display_searchresults_studentdesc" colspan="2">Comments Column...</td>
</tr>
</table>
This table will have multiple entries. I thought of using a WebGrid because I need sorting, filtering and paging. I know that Telerik allows constructing tables like that, but I don't see how to do it with the WebGrid. Is there a way to change the WebGrid's HTML before it is rendered without breaking sorting, filtering and paging? Do jqGrid or other open-source grids allow something like that?
Thanks a lot in advance.
It's hard to tell what you are talking about without an example, but what about Masonry?

Does HTML5 ban th cells from tbody?

I have the following markup as a part of a Razor view:
<table>
<caption>Presidents</caption>
<thead>
<tr>
<th scope="col">Name</th>
<th scope="col">Born</th>
<th scope="col">Died</th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row">Washington</th>
<td>1732</td>
<td>1799</td>
</tr>
<!-- etc -->
</tbody>
</table>
With the "target schema for validation" set to HTML5, Visual Studio complains thusly:
Warning 1 Validation (HTML5): Element 'th' must not be nested within element 'tbody tfoot'.
Is this really true? If so, could someone link to the spec?
My understanding was that using <th> for row headers was not just legal but encouraged. It certainly seems fairly common; I could link dozens of tutorials explaining (seemingly sensibly) that it helps with accessibility.
Is this a VS bug? A real change coming with HTML5 (a good one? a bad one?)? What's the story?
"My understanding was that using <th> for row headers was not just legal but encouraged"
As far as I know, this was always legal in HTML 4 (and possibly its predecessors), and hasn't changed in HTML5.
W3C's HTML5 validator, while still experimental, reports no warnings or errors. Then again, I'm sure the HTML5 validation Visual Studio is using is experimental as well since HTML5 itself hasn't yet been finalized.
The HTML5 spec on marking up tabular data, specifically section 4.9.13, shows the use of <th> within <tbody> and <tfoot> to scope row data:
<table>
<thead>
<tr>
<th>
<th>2008
<th>2007
<th>2006
<tbody>
<tr>
<th>Net sales
<td>$ 32,479
<td>$ 24,006
<td>$ 19,315
<tr>
<th>Cost of sales
<td> 21,334
<td> 15,852
<td> 13,717
<tbody>
<tr>
<th>Gross margin
<td>$ 11,145
<td>$ 8,154
<td>$ 5,598
<tfoot>
<tr>
<th>Gross margin percentage
<td>34.3%
<td>34.0%
<td>29.0%
</table>
So it's perfectly legitimate to have <th> elements inside <tr> elements inside either a <tbody> or <tfoot>. As it should be anyway, since table headings aren't just found on table headers.
The HTML5 spec only requires that it be inside a tr, and the spec actually includes an example with a th nested inside a tbody.
Generally a TH in a THEAD will have a scope value of "col" while a TH in a TBODY will have a scope value of "row".

Resources