I am trying to scrape a table from a website with Mechanize.
I want to scrape the second row.
When I run:
agent.page.search('table.ea').search('tr')[-2].search('td').map{ |n| n.text }
I would expect it to scrape the whole row. But instead it only scrapes: ["2011-02-17", "0,00"]
Why is it scraping only the first and the last column instead of all of the columns in the row?
XPath:
/html/body/center/table/tbody/tr[2]/td[2]/table/tbody/tr[3]/td/table/tbody/tr[2]/td/table/tbody/tr[2]
CSS path:
html body center table tbody tr td table tbody tr td table tbody tr td table.ea tbody tr td.total
The page is similar to this:
<table><table><table>
<table width="100%" border="0" cellpadding="0" cellspacing="1" class="ea">
<tr>
<th>Date</th>
<th>One</th>
<th>Two</th>
<th>Three</th>
<th>Four</th>
<th>Five</th>
<th>Six</th>
<th>Seven</th>
<th>Eight</th>
</tr>
<tr>
<td>2011-02-17</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0,00</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">387</td>
<td align="right">0,00</td> <!-- FOV -->
<td align="right">0,00</td>
</tr>
<tr>
<td class="total">Ialt</td>
<td class="total" align="right">0</td>
<td class="total" align="right">40</td>
<td class="total" align="right">0,46</td>
<td class="total" align="right">2</td>
<td class="total" align="right">0</td>
<td class="total" align="right">0</td>
<td class="total" align="right">0</td>
<td class="total" align="right">3.060</td>
<td class="total" align="right">0,00</td>
<td class="total" align="right">18,58</td>
</tr>
</table>
</table></table></table>
Using the following Ruby code (https://gist.github.com/835603):
require 'mechanize'
require 'pp'
a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}
a.get('http://binarymuse.net/table.html') do |page|
  pp page.search('table.ea').search('tr')[-2].search('td').map{ |n| n.text }
end
I get the following output:
["2011-02-17", "0", "0", "0,00", "0", "0", "0", "0", "387", "0,00", "0,00"]
I would recommend saving Mechanize for harder tasks than scraping a single page.
Using Nokogiri directly is much simpler than using Mechanize (although of course you can do it with Mechanize too), since you can just query the page.
Try it out!
Here is a link to an answer regarding Nokogiri
Personally, I use Mechanize when I need to submit forms and things like that, although there are plenty of other uses for it!
I have the following table element from a website.
Using this formula, it only extracts the first td, i.e. class=TTRow_left.
I want to extract both class=TTRow_left and class=TTRow_right into a Google Sheet.
Formula:
=IMPORTHTML("https://www.bsesme.com/","table",6)
Html:
<table width="305" border="0" cellspacing="0" cellpadding="0">
<tbody><tr>
<td class="TTRow_left" style="height:22px;" width="230px">No. of Companies Listed on SME till Date</td>
<td class="TTRow_right" style="height:22px;" id="AL">386</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;" width="230px">Mkt Cap of Cos. Listed on SME till Date (Rs.Cr.)</td>
<td class="TTRow_right" style="height:22px;" id="MCL">58,225.56</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;" width="230px">Total Amount of Money Raised till Date (Rs. Cr.)</td>
<td class="TTRow_right" style="height:22px;" id="Td13">4,132.16</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;" width="230px">No. of Companies Migrated to Main Board</td>
<td class="TTRow_right" style="height:22px;" id="MB">150</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;" width="230px">No. of Companies Listed as of Date </td>
<td class="TTRow_right" style="height:22px;" id="CL"> 236</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;">No. of Companies Suspended</td>
<td class="TTRow_right" style="height:22px;" id="CS">32</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;">No. of Companies Eligible for Trading</td>
<td class="TTRow_right" style="height:22px;" id="CET">201</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;">No. of Companies Traded</td>
<td class="TTRow_right" style="height:22px;" id="CT">110</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;">Advances/ Declines/ Unchanged</td>
<td class="TTRow_right" style="height:22px;" id="Adv">73/ 32/ 5</td>
</tr>
<tr>
<td class="TTRow_left" style="height:22px;">Mkt Cap of BSE SME Listed Cos. (Rs.Cr.)</td>
<td class="TTRow_right" style="height:22px;" id="Dec">15,095.93</td>
</tr>
<!--<tr>
<td class="TTRow_left" style="height:22px;" width="230px">No. of SME companies migrated to main board</td>
<td class="TTRow_right" style="height:22px;" >3</td>
</tr>-->
</tbody></table>
</td>
</tr>
</tbody></table>
There is a way: you could extract that data with Google Apps Script, i.e. by writing a function that reads the values (they are returned by a separate request).
You need to make a request to this URL, which is the one that loads the data:
https://www.bsesme.com/markets/MarketStat.aspx?&292022849
Values are:
bse$#$237|32|202|104|58|37|9|15,110.69|12|3,364.25|150|387|58,387.68|4,144.97
Then extract the data from that response.
I checked the page's source code, and the page uses JavaScript to read the data and rearrange it on the main page (i.e. https://www.bsesme.com/).
Tip: check the main page's source code and look for a function called GetNotices(str). That function appears to hold the logic for rearranging the data.
You will have to dig deeper to figure out how to get this data into your spreadsheet.
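As a rough sketch of the "extract the data" step, the response shown above can be split on its separators. This assumes the format stays as in the sample (a prefix, then `$#$`, then `|`-delimited fields); `parseMarketStat` is a hypothetical helper name, and which position belongs to which label on the page is not confirmed here, so check it against GetNotices before relying on it.

```javascript
// Hypothetical helper: split the raw MarketStat response into its fields.
// Assumes the format of the sample above: "<prefix>$#$<field>|<field>|...".
function parseMarketStat(raw) {
  const separator = '$#$';
  const start = raw.indexOf(separator);
  if (start === -1) {
    throw new Error('Unexpected response format: ' + raw);
  }
  return raw
    .slice(start + separator.length) // drop the "bse$#$" prefix
    .split('|')                      // one entry per value
    .map(function (field) { return field.trim(); });
}

// Sample response copied from the answer above.
const sample = 'bse$#$237|32|202|104|58|37|9|15,110.69|12|3,364.25|150|387|58,387.68|4,144.97';
const values = parseMarketStat(sample);
console.log(values.length); // 14
console.log(values[0]);     // "237"
```

In Apps Script the raw text would presumably come from UrlFetchApp.fetch(url).getContentText(), and a custom function could return the values as rows so that they spill into a column of the sheet.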
IMPORTHTML can only return table data that is not generated by JavaScript. I checked the web page you are trying to scrape, and the content that is missing appears to be generated through JavaScript, since it is not shown when JavaScript is disabled in the browser.
With JavaScript disabled, that content is not displayed. The TTRow_left cells, however, are hard-coded in the HTML, which is why the function is able to get this information from the web page:
<td class="TTRow_left" style="height:22px;" width="230px">No. of Companies Listed on SME till Date</td>
You will notice that the TTRow_right content is not displayed, so the function won't be able to scrape data from it.
I've been struggling with this issue for a bit now and it's really bugging me. I have some email templates that I've been working on; they work fine on all clients (Litmus tests) except for Gmail on iOS specifically (Android works fine). I want all my tables to be 100% width so they're all the same size, but Gmail resizes the tables, seemingly based on the content inside.
Here's a section of my code:
<tr class="module bg-white" style="background-color:#fff;color:#23282b">
<td>
<table class="container" cellpadding="0" cellspacing="0" border="0" role="presentation" width="100%"
style="margin:0 auto;width:100%!important;max-width:600px!important">
<tr>
<td class="card-wrapper" align="center" valign="top" style="padding:0 15px 10px">
<table cellpadding="0" cellspacing="0" border="0" role="presentation" width="100%">
<tr>
<td class="card-content bg-white border-lightgray"
style="padding:30px 20px 20px;background-color:#fff;color:#23282b;border:solid 1px #eee">
<h2
style="font-family:GTAmerica-Regular,Helvetica,Arial,sans-serif;margin:0 0 20px;font-size:18px;font-weight:700;line-height:22px">
YOUR DELIVERY DETAILS</h2>
<table class="delivery-details" cellpadding="0" cellspacing="0" border="0"
role="presentation" width="100%"
style="width:100%!important;max-width:600px!important">
<tr>
<td style="vertical-align:top;padding-right:8.5px;padding-left:0">
<h3
style="font-family:GTAmerica-Regular,Helvetica,Arial,sans-serif;margin:0 0 15px;font-size:16px;font-weight:700;line-height:22px">
Delivery Service</h3>
</td>
<td style="vertical-align:top;padding-right:0">
<p
style="font-family:GTAmerica-Regular,Helvetica,Arial,sans-serif;margin:0 0 10px;font-size:16px;font-weight:400;margin-bottom:15px;line-height:22px">
Next Day</p>
</td>
</tr>
<tr>
<td style="vertical-align:top;padding-right:8.5px;padding-left:0">
<h3
style="font-family:GTAmerica-Regular,Helvetica,Arial,sans-serif;margin:0 0 15px;font-size:16px;font-weight:700;line-height:22px">
Delivery Address</h3>
</td>
<td style="vertical-align:top;padding-right:0">
<p
style="font-family:GTAmerica-Regular,Helvetica,Arial,sans-serif;margin:0 0 10px;font-size:16px;font-weight:400;margin-bottom:15px;line-height:22px">
Fake Name <br>Fake House <br>Fake Street
<br>Fake Town <br>UK <br>Fake Postcode</p>
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
On my phone it looks like this:
[Screenshot: email result in Gmail on iOS 15]
Is there any way to fix this? On every other client it expands to 100% no issue, thank you!
This sounds like it might be due to a bug where Gmail adds a .munged class to <table>s and <td>s, with width:auto!important.
A solution would be to add a min-width:100% to each <table> and <td> potentially impacted.
I am using Geb with Spock. For the HTML structure below, I am not able to get the text from the specified location:
Below is the html structure:
<div class="tab-pane ng-scope active" uib-tab-content-transclude="tab" ng-class="{active: tab.active}" ng-repeat="tab in tabs">
<div id="algemeen-tab-header" class="ng-tab-hdr ng-scope"></div>
<div id="algemeen-tab-body" class="ng-tab-bdy table-view ng-scope">
<table class="ng-tbl valign-top">
<tbody class="esuite-table-body">
<tr>
<td class="tp-label">
<label class="ng-binding">
Reference:
</label>
</td>
<td class="tp-field ng-binding">
I-5006-2015
</td>
<td class="tp-label ng-hide" ng-show="!zaak.anoniem"></td>
<td class="tp-field ng-binding ng-hide" ng-show="!zaak.anoniem"></td>
<td class="tp-label" ng-show="zaak.anoniem"></td>
<td class="tp-field ng-binding" ng-show="zaak.anoniem"></td>
<td class="tp-label"></td>
<td class="tp-field ng-binding"></td>
</tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr></tr>
</tbody>
</table>
</div>
</div>
I want to verify the text "I-5006-2015", but I am not able to do it. As a second scenario, I just want to assert that the text at that location starts with "I". How can I do that?
I have tried the definition below to get the location, but it failed:
referenceNumberText(wait:true){$("td", class: contains("tp-field ng-binding"))}
Please help me on this. Thanks!
Since you use HTML ids rather rarely, the selector for your text would be:
$("#algemeen-tab-body > table > tbody > tr:nth-child(1) > td:nth-child(2)").text()
Have you considered giving the <td>s ids? Example:
<td id="referenceNumber" class="tp-field ng-binding">
I-5006-2015
</td>
So your selector would be as easy as $("#referenceNumber").text()...
I want only those children that are published in the content folder.
This is my code:
<umbraco:Macro runat="server" language="cshtml">
@foreach (var item in Model.Children)
{
    <h3 class="vacancyH">@item.jobTitle</h3>
    <table class="vaccTbl">
        <tr>
            <td class="vaccDetailTitle">Salary & Benefits:</td>
            <td class="vaccDetailDesc">@item.salaryBenefits</td>
        </tr>
        <tr>
            <td class="vaccDetailTitle">Employment Type:</td>
            <td>@item.employmentType</td>
        </tr>
        <tr>
            <td class="vaccDetailTitle">Department:</td>
            <td>@item.department</td>
        </tr>
        <tr>
            <td class="vaccDetailTitle">Report to Position:</td>
            <td>@item.reportToPosition</td>
        </tr>
        <tr>
            <td class="vaccDetailTitle">Location:</td>
            <td>@item.location</td>
        </tr>
        <tr>
            <td class="vaccDetailTitle">Date of Description:</td>
            <td>@item.businessArea</td>
        </tr>
        <tr>
            <td class="vaccDetailTitle" valign="top">Summary:</td>
            <td class="tablep">@item.vacancySummary</td>
        </tr>
        <tr>
            <td colspan="2" valign="middle"><img src="/images/wordicon.jpg" alt="" class="docIcon" />Download the Full Job Description</td>
        </tr>
    </table>
    <div class="vaccCloseDate">Application Deadline: @item.applicationDeadline.ToString("dd MMMM yyyy")</div>
    <div class="vaccApplyForPosition">Click here to apply</div>
}
</umbraco:Macro>
With this I get all the children, including those which are not published.
Now I want only the published children.
What do you mean by published? What you are doing will only display published items; this is how Umbraco works. Using where("visible") relies on you having created a property called umbracoNaviHide on one of your doc types and setting it to true in order to hide items. If what you have is not working, then there is another reason for it.
Are your unpublished items greyed out in the content tree?
Try right-clicking the top-level content node and republishing the entire site.
Make sure your browser isn't caching something: clear the cache.
Failing all this, simply delete umbraco.config in your app_data folder.
Umbraco does not render unpublished items.
I am working with RSpec and Capybara and have encountered a problem while trying to select a specific row based on the :textContent or :text attributes. Regardless of the string entered in the test, the first row is always selected.
The HTML code is as follows:
<table class="LearningAssetList admin" data-id="1">
<tbody>
<tr class="CategoryHeader">
<td class="expandCell" colspan="9">
<span>Admin Pro / Scheduling</span>
</td>
</tr>
<tr class="headerRow ui-droppable">
<td class="blank"></td>
<td></td>
<td>Name</td>
<td>Description</td>
<td class="center">Length</td>
<td class="center">User Rating</td>
<td style="width:20px;padding:0px;"></td>
<td style="width:20px;padding:0px;"></td>
</tr>
<tr class="assetRow ui-draggable ui-droppable" data-id="49">
<td class="blank"> </td>
<td class="assetPlay icon">
<td class="assetName">
<a onclick="openModal('http://www.youtube.com/v/C0DPdy98e4c','Learning Asset
Test Upload')" href="#">Learning Asset Test Upload</a>
</td>
<td class="assetDescription">
<td class="assetDuration">
<td class="assetRating icon">
<td class="assetFunctions center">
<td class="assetDrag center">
<td class="blank"> </td>
</tr>
</tbody>
</table>
My RSpec code is as follows:
it "should allow asset to be deleted by Admins" do
  visit 'http://localhost:3000/'
  click_link 'Admin'
  within(:xpath, '//*[@class="LearningAssetList admin"]') do
    #row = find('tr>td.assetName>a', :textContent => "Learning Asset Test Upload")
    row = find('tr>td.assetName>a', :textContent => "Learning Asset Test Upload".to_s)
    within(row) do
      find(:xpath, '//*[@class="popupMenu"]').click
    end
    sleep 5
    find(:xpath, '//*[@class="delete"]').click
    popup = page.driver.browser.switch_to.alert
    popup.text.should eq('Are you sure you would like to delete this asset?')
    popup.accept
    assetList = find(:xpath, '//*[@class="LearningAssetList admin"]')
    assetList.should have_content('Learning Asset Test Upload')
    sleep 5
  end
end
I have another row in the table, above this entry, where the assetName is simply "Test". Regardless of whether I use text or textContent, or indeed change the string, this row is always selected and the more options button is pressed in this row, which ends in the deletion of the wrong asset.
Can anyone see any problem with the RSpec code or the logic behind selecting the row? I had thought that the text in the assetName td would have to match for the row to be found, but this does not seem to be happening.
Your HTML is completely invalid. You can't nest multiple <tr>s inside each other and you haven't closed any of the tags.