How to extract a value from a nested table using jsoup? - html-parsing

I have an element nested inside several tables.
It contains a number, and I want to know whether that number is greater than 0.
It is located here:
//*[@id="mainContent"]/div/table/tbody/tr/td[2]/div/table/tbody/tr[3]/td/table/tbody/tr[1]/td/table/tbody/tr/td/table/tbody/tr[2]/td[7]/table/tbody/tr/td[1]
I used Chrome's copy xpath feature to get the path.
Any help would be appreciated. Thank you.

Related

Google Docs API and Dreaded Table Row Inserts

I've been playing around with Google Docs API and am stuck on being able to add a row to an existing table in a doc and fill that row (3 columns) with data.
Below is a Pastebin file of the Google Docs GET response, which returns a huge JSON of pretty much everything in the doc (formatting, content, etc.).
(Stack Overflow has an issue with me including the Pastebin file, so be ready for a huge file underneath here which probably won't fit.)
This is a sample doc. If you open it in a tool like https://jsoneditoronline.org/ (which I just used) to see the document structure, you'll note that it has 3 tables in total.
I've written some code that puts the start indexes of all the tables in the document into an array but I can't for the life of me figure out a clear explanation of how I can:
a) Insert a row (at the bottom of the first table for example)
b) Insert data into the first, second and 3rd column of that new row
I have read the guides but it is all very confusing - because after I insert a row the document changes and the startIndexes and all that adjust - is that correct?
If anyone has any input on code that would insert a new row AND populate the columns in that row in one easy-to-use solution, I would really appreciate the help (hopefully without having to query the whole JSON again after inserting the row).
Thank you
P.S. Tried to insert pastebin link but it wouldn't let me... tried to paste JSON directly and it was too big so... I'll have to leave the question with the most info I can for now - I will ask Google direct and include the JSON.
Just updating to say that I've solved this by using the FPDF PHP library instead: I copy the Google Docs text into this Google converter (which converts to HTML) and then pass all the HTML to the FPDF library.
So... question is no longer relevant.
For interested parties:
use DocumentService.BatchUpdateDocumentRequest()
request should be InsertTableRequest
for more information see:
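For what it's worth, here is a minimal sketch of the single-batch idea in Python. It only builds the JSON body for documents.batchUpdate and does not call the API. It assumes the row-level insertTableRow request rather than the InsertTableRequest named above, which, as far as I know, creates a whole new table. The function name, its parameters, and the sample index values are hypothetical. The index arithmetic also rests on an assumption you should verify against your own documents.get output: the new row starts where the old last row ended, the row marker takes one index, and each empty cell takes two.

```python
import json


def build_append_row_requests(table_start_index, row_count, last_row_end_index, cell_texts):
    """Build batchUpdate requests that append one row and fill its cells.

    Hypothetical helper: all four arguments are read from a prior
    documents.get response (table startIndex, number of rows, endIndex of
    the last row) plus the texts for the new cells.
    """
    requests = [{
        "insertTableRow": {
            "tableCellLocation": {
                "tableStartLocation": {"index": table_start_index},
                "rowIndex": row_count - 1,  # the last existing row
                "columnIndex": 0,
            },
            "insertBelow": True,
        }
    }]
    # Assumption: the new row begins at last_row_end_index; the first cell's
    # paragraph is 2 indexes further on, and each following cell adds 2 more.
    # Writing from the last cell back to the first means an earlier insert
    # never shifts the index of a later one.
    for column in reversed(range(len(cell_texts))):
        index = last_row_end_index + 2 + 2 * column
        requests.append({
            "insertText": {
                "location": {"index": index},
                "text": cell_texts[column],
            }
        })
    return requests


# Hypothetical values for an empty 2-row, 3-column table starting at index 2.
requests = build_append_row_requests(
    table_start_index=2,
    row_count=2,
    last_row_end_index=17,
    cell_texts=["A", "B", "C"],
)
print(json.dumps({"requests": requests}, indent=2))
```

Because the requests in one batch are applied in order, the row insert and the three text inserts can go out together, so the document does not have to be fetched again in between.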

How do i trace multiple XML elements with same name & without any id?

I am trying to scrape a website for financials of Indian companies as a side project & put it in Google Sheets using XPATH
Link: https://ticker.finology.in/company/AFFLE
I am able to extract data from elements that have a specific id like cash, net debt, etc. however I am stuck with extracting data for labels like Sales Growth.
What I tried:
Copying the full XPath from the console, //*[@id="mainContent_updAddRatios"]/div[13]/p/span. This works, but I am reliant on the index of the div (13), which may change for different companies, so I am unable to automate it.
Please assist with a scalable solution.
PS: I am a Product Manager with basic coding expertise, as I was a developer a few years ago.
At some point you need to "hardcode" something unless you have some other means of mapping the content of the page to your spreadsheet. In your example you appear to be targeting "Sales Growth" percentage. If you are not comfortable hardcoding the index of the div (13), you could identify it by the id of the "Sales Growth" label which is mainContent_lblSalesGrowthorCasa.
For example, change your
//*[@id="mainContent_updAddRatios"]/div[13]/p/span
to:
//*[@id = "mainContent_updAddRatios"]/div[.//span/@id = "mainContent_lblSalesGrowthorCasa"]/p/span
which selects the div based on it containing a span with id="mainContent_lblSalesGrowthorCasa". Ultimately, whether you "hardcode" the exact index of the div or "hardcode" the ids of the nodes, you are still embedding assumptions about the structure of the page.
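To make the idea concrete outside of Sheets, here is a rough Python sketch of the same lookup using only the standard library. ElementTree does not support the div[.//span/@id = ...] predicate directly, so the loop below does the equivalent by hand. The sample markup is hypothetical: only the two ids mainContent_updAddRatios and mainContent_lblSalesGrowthorCasa come from the question, while the surrounding structure, the other id, and the values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. The real page will differ.
SAMPLE = """
<body>
  <div id="mainContent_updAddRatios">
    <div>
      <small><span id="mainContent_lblRoe">ROE</span></small>
      <p><span>22.5 %</span></p>
    </div>
    <div>
      <small><span id="mainContent_lblSalesGrowthorCasa">Sales Growth</span></small>
      <p><span>54.9 %</span></p>
    </div>
  </div>
</body>
"""


def value_for_label(root, container_id, label_id):
    """Return the text of p/span inside the div that contains the labelled span."""
    container = root.find(f".//*[@id='{container_id}']")
    if container is None:
        return None
    for div in container.findall("div"):
        # Hand-written equivalent of the predicate div[.//span/@id = label_id]
        if div.find(f".//span[@id='{label_id}']") is not None:
            value = div.find("p/span")
            if value is not None and value.text:
                return value.text.strip()
            return None
    return None


root = ET.fromstring(SAMPLE)
sales_growth = value_for_label(
    root, "mainContent_updAddRatios", "mainContent_lblSalesGrowthorCasa"
)
print(sales_growth)
```

The position of the div no longer matters here, but the lookup still depends on the label id and on the value sitting in p/span, which is the trade-off described above.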
Thanks @david, that helped.
Two questions
What if the structure of the page would change? Example: If the website decided to remove the p tag then would my sheet fail? How do we avoid failure in such cases?
Also, since every id is unique, the probability of an id changing is lower than that of the index changing. Correct me if I am wrong.
What do we do when the elements don't have an id, like Profit Growth, RoE, RoCE, etc.?

Google Sheets import multiple HTML table images

Summary
I'm looking to import a data table from a website that does not appear to have an API. The table is broken down to various images and text. The goal is to have all of the content available in a table to then reference for other sheets.
Issue
When I pull in the data, I get some of the text, none of the other images, and a reference to another table. I looked up some options, but none of them yielded anything but blank cells.
I also tried to use the =IMAGE() formula with a direct link to the images URLs, but there is a portion of the URL that is specific to the unit's release date, and as such, too dynamic to account for.
Google Sheets Formula
=IMPORTHTML("https://gamepress.gg/pokemonmasters/database/sync-pair-list","table",3)
Unfortunately, without an API it is going to be difficult to achieve what you are aiming for here. These are the main reasons why:
PROBLEMS AND WORKAROUNDS
This table has nested tables, which therefore need to be accessed separately. If you take a look at: =IMPORTHTML("https://gamepress.gg/pokemonmasters/database/sync-pair-list","table",4)
you will see that table 4 of this HTML page holds the stats of a seemingly random character from the main table. If you try 5 or 6, you will realise that the nested tables are not even numerically ordered and that you cannot reach them through the main table (i.e. mainTable[0].nestedTable). A laborious way to do this is to go through them one by one, finding each character's stat table and placing it next to that character. For this I recommend extracting only the name field of the main table, so that you can align each stat table with its character. You can do this using: =INDEX(IMPORTHTML("https://gamepress.gg/pokemonmasters/database/sync-pair-list","table",3),0,1). You can find out more about INDEX here
IMPORTHTML cannot access images or links, so it will be very difficult to get the images in the last columns. A way to solve this is, as you mentioned, to use the image with its URL, like this: =IMAGE("https://gamepress.gg/pokemonmasters/sites/pokemonmasters/files/styles/30x30/public/2019-07/Electric.png?itok=fkRfkrFX"). You can find more info about inserting images here
CONCLUSION
To sum up, there is no easy way to solve this problem. The closest you can get is by:
Importing the name column.
Figuring out which tables belong to which character and placing them next to their name.
Getting the image URL of each weakness and type and adding it to each character.
I am sorry this site does not have an API to make things smooth, good luck with your project and let me know if you need anything else or if you did not understand anything.
Here you can find more information about IMPORTHTML

How to show all values of an oledb data source in a table?

I'm new to LiveCycle Designer, and since I can't find an answer to this simple question, and the answers I did find were unsuccessful in my case, I hope I can get some help here.
All I want to do is show all the values of an OLEDB data source in a table. My problem is that only one row is shown in the table. I tried wrapping it in a subform, setting the content to 'flowed', and setting the rows to repeat for every item, but it doesn't change anything. And yes, I'm sure the query provides 2 rows and 3 columns, or at least that's what is returned in MSSQL.
Maybe I missed some fundamental understanding how this is supposed to work. Any advice greatly appreciated.
Thanks Alex

Calculating and graphing data from a .pdf with ruby (+ rails)

I'm really stuck on this and don't even know where to start. I've got this .pdf, which has 2 columns: the first is, let's say, the member ID, and the second is the number of purchases they have made. Is it possible to match each ID to the correct number, graph this data, and afterwards make calculations with the matched data (calculate the top 5% of buyers, etc.)? Some numbers are not filled in, so that might be a problem. However, the PDFs are selectable and, if copied and pasted, have the following structure: userid number userid number userid number userid number userid number.
EDIT: Making calculations with the data (calculating the top x%, ranks, etc.) will be the most important part.
Any help, tips or links to tutorials that even might help me are appreciated!
Use prawn.
Here are some links to get you started:
Prawn github page
Using-Prawn-in-Rails
and, look for Prawn Templates.
EDIT:
Check out these links:
pdfescape
pdfedit
Also look out for a templating solution, if there is one.
Also look here, you might find something useful:
whats-the-best-way-to-programmatically-edit-a-pdf-in-ruby
As I have not dealt with such a problem myself, I can only help you this much. You will have to do the hard work yourself.
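For the calculation part, once the text has been copied out of the PDF in the "userid number userid number ..." form described in the question, the matching and the top-x% cut are fairly mechanical. Below is a minimal sketch in Python rather than Ruby; the function names and the sample data are hypothetical. It assumes the tokens alternate strictly between ID and count, so it does not handle the missing numbers mentioned in the question (those would need to be filled in or filtered out first). Graphing is left out.

```python
import math


def parse_pairs(text):
    """Turn 'userid number userid number ...' into a {userid: purchases} dict."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        # A missing count breaks the strict alternation this sketch relies on.
        raise ValueError("expected alternating 'userid number' tokens")
    return {tokens[i]: int(tokens[i + 1]) for i in range(0, len(tokens), 2)}


def top_percent(purchases, percent):
    """Return the (userid, purchases) pairs in the top `percent` of buyers."""
    ranked = sorted(purchases.items(), key=lambda item: item[1], reverse=True)
    # Round up so that a small group still yields at least one buyer.
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:keep]


# Hypothetical pasted text; real IDs and counts will differ.
pasted = "1001 12 1002 3 1003 45 1004 7 1005 0"
purchases = parse_pairs(pasted)
print(top_percent(purchases, 40))
```

The ranked list produced inside top_percent is also a reasonable starting point for ranks or for feeding a charting library.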
