finding an accurate Xpath for Google Sheets ImportXML [duplicate] - google-sheets

This question already has answers here:
Scraping data to Google Sheets from a website that uses JavaScript
(2 answers)
Closed 4 months ago.
So, I'm trying to use the ImportXML function in Google Sheets to scrape some data from a website (https://www.cargurus.com/Cars/m-Bob-Johnson-Certified-Collection-sp402449), and I'm having trouble finding a path that works. This is the section I'm looking to pull.
I've tried using Chrome's Inspect Element and Copy XPath, which gives me
//*[@id="ratingFilter_ContainerId"]/div
and returns #N/A.
I've used a Chrome plug-in called Scraper, which gives me //div[13]/div/div[2]/div[2]/div/label and returns #N/A.
I've even tried going through the code and making as direct a path as I could from scratch, and came up with //body/div[1]/div[1]/main/div[1]/div[1]/div[11]/div[1]/div[1]/div[2]/div[2]/div[1]/div[1]/div[3]/div[1]/div[4]/div[2]/div[13]/div[1]/div[2]/div[2]/div
which also returns #N/A.
So any tips for finding an accurate XPath would really be appreciated.

The expression
//*[@id="ratingFilter_ContainerId"]
executed on a fetched document selects a div element two levels above the one you show.
When extended by another step subexpression:
//*[@id="ratingFilter_ContainerId"]/div
it selects the div which contains the 'Deal Ratings' caption with the '(clear)' link at the right side, and the options list you need.
What you are interested in is rather
$fetched-document/descendant::div[@id="ratingFilter_OptionListContainer"]
EDIT
By the way, are you sure you are fetching the page properly? When I load it in my browser, the page loads some additional data, signalled by a 'Loading listings...' splash. Maybe you're running your query against an incomplete page?

Related

ImportXML not returning entire table

I cannot get an entire table to populate with ImportXML. At best I get the first column and I cannot figure this out.
The website I am trying to scrape is: https://classic.warcraftlogs.com/character/us/kromcrush/chills
Do I have any options to retrieve the table, whether column by column or as a whole?
I have tried all the following plus several others.
=IMPORTXML("https://classic.warcraftlogs.com/character/us/kromcrush/chills","//table[@id='boss-table-1010']/tbody/tr")
=IMPORTXML("https://classic.warcraftlogs.com/character/us/kromcrush/chills","//tbody/tr")
=IMPORTXML("https://classic.warcraftlogs.com/character/us/kromcrush/chills","//tbody/tr/td")
=IMPORTXML("https://classic.warcraftlogs.com/character/us/kromcrush/chills","//tr")
=IMPORTXML("https://classic.warcraftlogs.com/character/us/kromcrush/chills","//tr/td")
=IMPORTXML("https://classic.warcraftlogs.com/character/us/kromcrush/chills","//tr/td[1]")
=IMPORTXML("https://classic.warcraftlogs.com/character/us/kromcrush/chills","//tr/td[2]")
Anything outside of column one says Imported content is empty. Please help!
P.S. I have scoured this website and Google for answers, and every case I find seems to be a syntax error. Starting at the table itself doesn't return the entire table, which tells me I need a clever method.
It seems that's an issue with the website: when you click Inspect you can see the table with id "boss-table-1010", but if you click View Source that id is not there. The table is rendered dynamically, so Sheets never finds that id.
I've checked it and I can get the data by doing
=IMPORTXML("https://classic.warcraftlogs.com/character/us/kromcrush/chills", "//table/tbody//td")
But if you want a more robust solution, it would be better to do it programmatically, for example by using Python for web scraping.
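As a rough idea of what the Python route could look like, here is a minimal sketch that uses only the standard library to pull every table row out of an HTML string. The names TableCellParser, extract_rows and fetch_html are made up for this example, and the User-Agent header is an assumption. Note that a plain HTTP fetch does not run JavaScript either, so this only helps when the cells are present in the raw page source.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by sloppy markup


def extract_rows(html):
    """Return all table rows found in an HTML string as lists of strings."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_html(url):
    """Download the raw (static) source of a page; no JavaScript is run."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # assumed header
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

Usage would be something like extract_rows(fetch_html(url)). If that comes back empty for a table you can see in the browser, the table is being built by JavaScript and you would need a headless browser or the underlying data request instead.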

How to use ImportXML function to import content from financial stock webpages [duplicate]

So... 90% of the time ImportXML seems to work just fine for me, but now I'm struggling with the below 2 cases... I don't know if they are all the same problem or not, or if they are 2 different problems.
CASE ONE - YAHOO
Go to this page: https://finance.yahoo.com/quote/AAPL/cash-flow?p=AAPL
The number I want to pull to my spreadsheet is "Free Cash Flow"
My first attempt:
=IMPORTXML("https://finance.yahoo.com/quote/AAPL/cash-flow?p=AAPL","//*[@id='Col1-1-Financials-Proxy']/section/div[4]/div[1]/div[1]/div[2]/div[12]/div[1]/div[2]/span")
My second attempt:
=IMPORTXML("https://finance.yahoo.com/quote/AAPL/cash-flow?p=AAPL","/html/body/div[1]/div/div/div[1]/div/div[3]/div[1]/div/div[2]/div/div/section/div[4]/div[1]/div[1]/div[2]/div[12]/div[1]/div[2]/span")
My third attempt:
=INDEX(IMPORTXML("https://finance.yahoo.com/quote/AAPL/cash-flow?p=AAPL","//div[@class='Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(140px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)']"),1,1)
My fourth attempt:
=INDEX(IMPORTXML("https://finance.yahoo.com/quote/AAPL/cash-flow?p=AAPL","//span[@data-reactid='277']"),1,1)
Nothing I do seems to work.
CASE TWO - MSN
Go to this page: https://www.msn.com/en-gb/money/stockdetails/analysis/fi-a1mou2
Click the "Price Ratios" link
The number I want to pull to my spreadsheet is "P/E Ratio 5-Year Low"
My first attempt:
=IMPORTXML("https://www.msn.com/en-us/money/stockdetails/analysis/nas-aapl/fi-a1mou2","//*[@id='main']/div[2]/div[2]/div[2]/div/div[3]/div/div/div[5]/div[1]/div[2]/div[4]/div[1]/div/div/div/ul[3]/li[2]/span[1]/p")
I only tried once with this case because I suspect that the number sitting on an internal page tab might be causing the issue?
ANY solutions that automatically will pull the above two numbers into my spreadsheet are welcome, I'm open to workarounds with scripts/macros if ImportXML just isn't able to do it.
The reason you can't get the data from MSN is that the specific element you mention is inserted dynamically on the page. IMPORTXML can only retrieve the static content of a website, so it cannot retrieve this dynamic content.
To check which content is static and which is dynamic, disable JavaScript in your browser (JavaScript is what inserts the dynamic content) and reload the page: whatever remains is the content you can access with IMPORTXML. On the page you provided, if you do this and then click Price Ratios, nothing changes, because that content is not static. This is a simple guide on how to disable JavaScript in Chrome.
Therefore, you will need to find an alternative method to scrape dynamic data.
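The same static-versus-dynamic check can be scripted instead of toggling JavaScript in the browser. The sketch below is only an illustration: it lists the id attributes that exist as real elements in a raw HTML string, which is roughly what IMPORTXML gets to see. The helper names IdCollector and static_ids and the sample page are made up for this example.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Record the id attribute of every element in the raw markup."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def static_ids(html):
    """Return the set of element ids present before any script has run."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


# Hypothetical page: the table is created by a script, so its id is
# mentioned in the source text but never exists as a static element.
SAMPLE_PAGE = """
<html><body>
  <div id="main"></div>
  <script>
    var t = document.createElement('table');
    t.id = 'boss-table-1010';
    document.body.appendChild(t);
  </script>
</body></html>
"""

print(sorted(static_ids(SAMPLE_PAGE)))
```

If the id used in your XPath is missing from the set returned for the downloaded page source, the element is being added by JavaScript and IMPORTXML will not find it.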

Attempting to import from a XPath, seems to always yield blank information

Currently in my Google doc, I'm working on a database of what my cards are worth, and it doesn't seem to want to grab the information no matter which XPath I try.
The website I'm trying to take information from is available here. *This is the hyperlink I'm feeding
In the top right corner I'm attempting to grab the worth box information. Here are the XPaths I've attempted so far:
"//a[@id='worthBox']/h4"
"/html/body/div[4]/div[1]/div[2]/form/div[1]/div[2]/div/a/h4"
"/h4"
"/h4[0-20]"
"//a[@id='worthBox'][1]/h4"
"//div[@id='estimate-box']/a/h4"
"//div[@id='estimate-box']/a[1]/h4"
Can someone explain to me why it doesn't seem to want to fetch? Is it even possible?
Thank you so much for your time and help!
At that URL, the value is inserted by JavaScript. IMPORTXML retrieves the HTML without running JavaScript, so it cannot see the result of the script. I think your XPaths describe the page as it looks after JavaScript has run, which is why they cannot be used. However, it seems that the value you expect can be retrieved with a different XPath.
Modified xpath:
//input[@id='medianHiddenField']/@value
Sample formula:
=IMPORTXML(A1,"//input[@id='medianHiddenField']/@value")
In this case, the URL https://mavin.io/search?q=Lugia%20NM%209%2F111%20-PSA&bt=sold# is put in cell "A1".
Result:
Reference:
IMPORTXML
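To illustrate why the hidden-field XPath works while the worth-box ones do not, here is a small sketch using Python's xml.etree.ElementTree on a hypothetical, well-formed stand-in for the static source. The markup and the value "12.34" are made up, and real pages are rarely well-formed XML, so an HTML-tolerant parser would be needed in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page source before JavaScript runs.
# The value "12.34" is invented for illustration only.
STATIC_SOURCE = """
<html>
  <body>
    <form>
      <input type="hidden" id="medianHiddenField" value="12.34" />
      <div id="estimate-box"></div>
    </form>
  </body>
</html>
"""


def hidden_value(source, field_id):
    """Rough equivalent of //input[@id='...']/@value on static markup."""
    root = ET.fromstring(source)
    node = root.find(f".//input[@id='{field_id}']")
    return None if node is None else node.get("value")


def worth_box_heading(source):
    """Rough equivalent of //div[@id='estimate-box']/a/h4 on static markup."""
    root = ET.fromstring(source)
    return root.find(".//div[@id='estimate-box']/a/h4")


print(hidden_value(STATIC_SOURCE, "medianHiddenField"))
print(worth_box_heading(STATIC_SOURCE))
```

In this stand-in the hidden input is already in the markup, so the first lookup succeeds, while the heading inside the estimate box does not exist until a script adds it, so the second lookup finds nothing.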

Google Spreadsheet getting text with importxml

I've tried this and other versions to no avail? Can anyone help please?
=IMPORTXML("http://performance.morningstar.com/fund/ratings-risk.action?t=MWTRX", "//*[@id='div_ratings_risk']/table/tbody/tr[4]/td[3]/text()")
As explained in the comments to your original question, the div element with the id div_ratings_risk is initially empty and does not contain a table.
So Google Sheets cannot parse content that is not there yet and still has to be loaded.
The content (table) you are trying to fetch into your spreadsheet is loaded dynamically using jQuery from another URL. You can find that URL using, e.g., the Chrome developer tools, filtering for XHR requests.
If you parse the content directly from that URL, it will work. So you need to point your formula at that URL and adapt your XPath like so:
=IMPORTXML("http://performance.morningstar.com/ratrisk/RatingRisk/fund/rating-risk.action?&t=XNAS:MWTRX&region=usa&culture=en-US&cur=&ops=clear&s=0P00001G5L&ep=true&comparisonRemove=null&benchmarkSecId=&benchmarktype=", "//table/tbody/tr[4]/td[3]/text()")
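To see what the XHR URL above actually asks for, it can help to split its query string into named parameters. This is a small sketch using Python's urllib.parse; the helper name query_parameters is made up, and which of these parameters the server really requires is not something the answer states.

```python
from urllib.parse import parse_qsl, urlsplit

# The XHR URL quoted in the formula above.
XHR_URL = (
    "http://performance.morningstar.com/ratrisk/RatingRisk/fund/"
    "rating-risk.action?&t=XNAS:MWTRX&region=usa&culture=en-US&cur="
    "&ops=clear&s=0P00001G5L&ep=true&comparisonRemove=null"
    "&benchmarkSecId=&benchmarktype="
)


def query_parameters(url):
    """Return the query string of a URL as a dict, keeping empty values."""
    return dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))


for name, value in query_parameters(XHR_URL).items():
    print(f"{name} = {value!r}")
```

The t parameter carries the exchange-qualified ticker, so a formula for another fund would presumably need at least that value changed. Whether other parameters, such as s, also differ per fund is best checked in the developer tools for that fund's page.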

Google XPATH importxml can find "show" but not "showcount" or "count" [duplicate]

Using this webpage as an example http://forums.macrumors.com/showthread.php?t=1688317
On a google spreadsheet, the following DO NOT work with importxml():
//a[contains(@href,"showpost")]/@href
//a[contains(@href,"showcount")]/@href
//*[@id="postcount18545482"]
The last one (//*[@id="postcount18545482"]) was copied directly from Chrome's element viewer.
The following DO work but exclude any results with the word "showcount", "postcount", or "showpost":
//div[contains(@id,"post_message")]/@id
//a[contains(@href,"show")]/@href
//a[contains(@href,"post")]/@href
Is there something special about the word "count" when working with importxml() or XPATH? How can I get the missing entries?
ImportXML function in Google Docs spreadsheet can not process data that is created in a two-step process. For example, when an authentication token must be retrieved first before making the url request, or when the URL tells the server to dynamically create an xml output after which the user is redirected to the output, even when the URL stays the same. You might want to look into Google Apps Scripts (http://code.google.com/googleapps/appsscript/index.html) to handle this case.
Taken from here
In your particular case the anchor parameters get set in the vbulletin_post_loader.js script called after the page container is loaded.
...
pc_obj=fetch_object("postcount"+this.postid);
openWindow("showpost.php?"+(SESSIONURL?"s="+SESSIONURL:"")
+(pc_obj!=null?"&postcount="+PHP.urlencode(pc_obj.name):"")+"&p="+A)
...
In other words, when importXML() scans the page, the nodes containing 'showpost' or 'postcount' in href are not yet on the page.
It looks like importXML() works with static pages only and is not able to handle dynamically loaded content.
Try to find another way of obtaining the number of posts in a thread.
