Use IMPORTXML to scrape text and img url

https://www.autowp.ru/cadillac/elr/90868/pictures/e2xpg1
I can't figure out how to get the model of this vehicle and its image URL through IMPORTXML in Google Sheets. The idea is that you fill in the URL cell in the sheet and get the model of the vehicle and the image URL in two separate cells, but I can't write the XPath correctly for this.

For writing the xpath of this image URL, you could try the following steps:
If you are using Google Chrome, you can open the developer tools (Control+Shift+I for Windows, Command+Option+I on MacOS).
Use the "Select Elements" function ①, click on the image ②, and then move the mouse to ③.
Right click, Copy > Copy XPath, and add /@src to the end of the XPath.
This gives the following Google Sheets formula:
=IMPORTXML("https://www.autowp.ru/cadillac/elr/90868/pictures/e2xpg1","/html/body/app-root/div[2]/div[1]/div/app-catalogue-vehicles-pictures-picture/app-picture/div[2]/div[1]/div[1]/span/img/@src")
However, as suggested by player0 and in "Find the Xpath to get data with importxml function", IMPORTXML does not work on content rendered by JavaScript. The formula therefore returns an #N/A error, and unfortunately you cannot get the model name or the image URL with an IMPORTXML formula.
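One way to confirm this outside Sheets is to look at the HTML the server sends before any script runs. The sketch below uses only Python's standard library. SAMPLE_SHELL and SAMPLE_RENDERED are made-up markup for illustration (not the real autowp.ru source); they show that a script-rendered page can contain no img element for an XPath to match.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> found in static HTML."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(html):
    """Return the list of image URLs present in the given HTML text."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# Hypothetical markup: what a server might send before JavaScript runs.
SAMPLE_SHELL = """<html><body><app-root></app-root>
<script src="main.js"></script></body></html>"""

# Hypothetical markup: what the browser shows after the script has run.
SAMPLE_RENDERED = """<html><body><app-root><span>
<img src="https://example.com/picture.jpg"></span></app-root></body></html>"""

print(image_sources(SAMPLE_SHELL))     # no <img> in the static shell
print(image_sources(SAMPLE_RENDERED))  # the <img> exists only after rendering
```

IMPORTXML only ever sees something like the first sample, which is why an XPath copied from the browser's rendered DOM can come back empty.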

Related

X-Path to a library search engine

I am writing a short scraper in Google Spreadsheets using XPath and IMPORTXML.
On that page, I am trying to get all the titles of articles (class 'library-document-summary') in B3 and following, and all the URLs of said articles in C3 and following.
However, I am getting nowhere, as the results of my XPath are always empty. Could someone with knowledge in this area help?
B2= https://resources.norrag.org/categories/591,595
=IMPORTXML(B2,"//div//a[@class='library-document-summary']/text()")

I don't think the IMPORTXML function supports XPaths that select text nodes. But I think if your XPath selects the a elements themselves, then their text content will be imported. e.g.
//div[@id='article_search_results']//a
... and for the links:
//div[@id='article_search_results']//a/@href
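To see what these two XPaths select, here is a minimal sketch using Python's xml.etree.ElementTree on a made-up, well-formed snippet (SAMPLE is an assumption; the real page markup may differ). ElementTree supports only a subset of XPath and cannot select attributes with /@href, so the attribute is read with .get() instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, for illustration only.
SAMPLE = """
<html><body>
  <div id="article_search_results">
    <div><a class="library-document-summary" href="/library/1">First title</a></div>
    <div><a class="library-document-summary" href="/library/2">Second title</a></div>
  </div>
  <div id="footer"><a href="/about">About</a></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Select the <a> elements themselves rather than their text nodes.
links = root.findall(".//div[@id='article_search_results']//a")

# Text content of each selected element (what would land in column B).
titles = ["".join(a.itertext()).strip() for a in links]

# href attribute of each selected element (what would land in column C).
urls = [a.get("href") for a in links]

print(titles)
print(urls)
```

The link in the footer div is not selected, because the predicate on id restricts the search to the results container.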

IMPORTXML on google sheets error- Imported content is empty

I want to get the price from mercari, a japanese online shop.
For example, in this link, I like to get 1,488.
https://jp.mercari.com/item/m78226870756
when I copy the xpath of
<span class="number">
I get
//*[@id="item-info"]/section[1]/section[1]/div[1]/mer-price//span[2]
Now, using google sheet importxml
=IMPORTXML("https://jp.mercari.com/item/m78226870756","//*[@id='item-info']/section[1]/section[1]/div[1]/mer-price//span[2]")
I receive a
#N/A Imported content is empty.
I would really like to know how to get the price.
I am not familiar with this at all.
Any other way other than google sheet is also welcome.

You are getting the #N/A error because IMPORTXML (like any other IMPORT formula) does not support scraping JavaScript-rendered elements. You can always test this by disabling JS for a given site: what's left can usually be imported into Google Sheets.
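The "disable JS" test can also be approximated in code, which fits the request for ways other than Google Sheets. The sketch below is an illustration under assumptions: both sample strings are made-up markup, not Mercari's real source. It checks whether a value appears in the text outside script and style blocks, which is roughly what remains with JavaScript off.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Gather text that sits outside <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_without_js(html, expected):
    """Return True if `expected` appears in the non-script text of `html`."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return expected in " ".join(parser.chunks)


# Hypothetical page where the price exists only inside a script payload.
JS_ONLY = ('<html><body><div id="item-info"></div>'
           '<script>window.__DATA__ = {"price": "1,488"};</script>'
           '</body></html>')

# Hypothetical page where the price is part of the static markup.
STATIC = '<html><body><span class="number">1,488</span></body></html>'

print(visible_without_js(JS_ONLY, "1,488"))
print(visible_without_js(STATIC, "1,488"))
```

For a real page, the html argument could come from urllib.request.urlopen(url).read().decode("utf-8", "replace"), although some sites reject requests that do not come from a browser.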

Google Sheets importXML Returns Empty Value

I'm trying to scrape this website (https://kamadan.gwtoolbox.com/) with Google Sheets for material costs for a game that I play. There are two tables, "Common Materials" and "Rare Materials", in a drop-down in the top right corner. I am trying to pull the values for both as the prices update. I copied the full XPath and used the function below in an empty cell on a sheet.
=importxml("https://kamadan.gwtoolbox.com/","/html/body/div[2]/div[1]/div/div[2]/table/tbody")
This returns a #N/A error saying it is returning an empty value.
I also tried it with the regular xpath...
=importxml("https://kamadan.gwtoolbox.com/","//*[@id='trader-overlay-items']")
Which just returns a blank cell. I have also tried both methods, using the inspect function in Chrome, on the ancestors and children; they return one of the two errors above.
Sorry if this is a really easy one. I am not familiar at all with XPath or HTML. I mostly dabble in VBA in Excel.

Answer:
IMPORTXML cannot retrieve data which is populated by a script, so using this formula to retrieve data from this table is not possible.
More Information:
As you've already mentioned, you can attempt to get the data directly from the table using:
=IMPORTXML("https://kamadan.gwtoolbox.com/","//table[@id='trader-overlay-items']")
Which just gets a blank cell.
I went a step further and tried to reverse-engineer this by calling IMPORTXML on the HTML elements on the page in steps:
=IMPORTXML("https://kamadan.gwtoolbox.com/","html")
=IMPORTXML("https://kamadan.gwtoolbox.com/","html/body")
=IMPORTXML("https://kamadan.gwtoolbox.com/","html/body/div[1]")
=IMPORTXML("https://kamadan.gwtoolbox.com/","html/body/div[1]/div[0]")
...
html/body/div[1]/div[0] is the first path which gives no imported content, and importing html/body shows that the full body does not contain the information, only a template of it: cell B1 holds references to 'Common materials' and 'Rare materials'.
In D1 we start to see JavaScript and JSON objects, which IMPORTXML does not execute, so their results cannot be retrieved.
If you disable JavaScript on the site, almost nothing is actually rendered, and so nothing can be obtained using IMPORTXML.
References:
IMPORTXML - Docs Editors Help
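The step-by-step probing above can be sketched in Python as a check on one element: does it exist in the static HTML, and does it hold any text? One plausible reading of the two results in the question is that an existing but empty element gives a blank cell, while an XPath that matches nothing gives #N/A. SAMPLE below is a made-up stand-in for the page (the item name and price are invented), and TextById and probe are hypothetical helper names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collect the text inside the first element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # > 0 while inside the target element
        self.found = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag not in VOID_TAGS:
                self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.target_id:
            self.found = True
            if tag not in VOID_TAGS:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def probe(html, element_id):
    """Return (element exists, stripped text inside it)."""
    parser = TextById(element_id)
    parser.feed(html)
    parser.close()
    return parser.found, "".join(parser.text).strip()


# Hypothetical static HTML: the table is an empty template filled by a script.
SAMPLE = """<html><body>
<div class="dropdown">Common materials Rare materials</div>
<table id="trader-overlay-items"></table>
<script>var prices = {"Iron Ingot": 250};</script>
</body></html>"""

print(probe(SAMPLE, "trader-overlay-items"))  # element present, no text
print(probe(SAMPLE, "no-such-id"))            # element absent
```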

Correct path for =IMPORTXML on Google Sheets

This URL: https://www.screwfix.com/p/makita-jr3050t-2-1010w-reciprocating-saw-240v/27338
Trying to use IMPORTXML on google Sheets to pull in the price (119.99 as of today)
Using the following formula:
(via Google Developer Tab, right-click Copy XPath)
=IMPORTXML("https://www.screwfix.com/p/makita-jr3050t-2-1010w-reciprocating-saw-240v/27338", "//*[@id='product_price']/text()")
Or
=IMPORTXML("https://www.screwfix.com/p/makita-jr3050t-2-1010w-reciprocating-saw-240v/27338","//meta[@itemprop='price']/@content")
Or
=importxml("https://www.screwfix.com/p/makita-jr3050t-2-1010w-reciprocating-saw-240v/27338", "//div[@class='pr__price']")
Plus a few other variations. Unfortunately, they all come out as #N/A.
Can anyone help me find the correct path?

It seems that in this case, when the URL is retrieved by IMPORTXML(), most values are included in head. When I tried this URL, body retrieved by IMPORTXML() was empty. So how about this workaround?
=REGEXEXTRACT(IMPORTXML(A1,"//head/*"),"(\d.+)INC")
Please put the URL https://www.screwfix.com/p/makita-jr3050t-2-1010w-reciprocating-saw-240v/27338 in cell "A1" and put the formula in another cell.
In this workaround, the value you want is extracted from the values retrieved from head.
Note:
I'm not sure whether this formula can be used for other URLs. If you want to use it for another URL, please check the retrieved values and adjust the XPath and the regex.
If you use Google Apps Script, I think that the value can be retrieved from the body of the URL.
If this was not what you want, I'm sorry.
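To show how the pattern in this workaround behaves, here is a small Python sketch. The two input strings are invented stand-ins for the text returned from head, and the stricter pattern is an illustrative alternative, not part of the original answer.

```python
import re

LOOSE = r"(\d.+)INC"                 # pattern used in the workaround
STRICT = r"(\d+(?:\.\d+)?)\s*INC"    # illustrative variant: a number just before INC


def extract(pattern, text):
    """Return the first captured group, stripped, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None


# Hypothetical head text where the price is the first number.
simple = "Price 119.99 INC VAT"

# Hypothetical head text with a model number before the price.
with_model = "Makita JR3050T/2 1010W Saw 119.99 INC VAT"

print(extract(LOOSE, simple))
print(extract(LOOSE, with_model))
print(extract(STRICT, with_model))
```

Because the loose pattern starts at the first digit in the text and runs up to the last INC, a digit that appears earlier (such as one in a model number) widens the capture. That is one reason the regex needs checking for each new URL.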

Problems with Google Spreadsheets ImportXML function

I'm having some problems with ImportXML in my Google Spreadsheet. I currently have two sheets, each with its own ImportXML, retrieving (basically) the same data. The server providing the data has updated its feed service to require a user-specific "key" in the URL to track who is retrieving what. Prior to this change, my ImportXML worked just fine. They are about to turn off the non-key feeds, and my spreadsheets are about to break.
In the first (working) sheet, this is the feed.
I can import the data successfully by using the following syntax in cell A1:
=importXML("http://atilla.hinttech.nl/fseconomy/xml?id=18649&key=M3LRG43T&query=GroupLogByMonth&month=10","//GroupLogByMonth")
In the new (non-working) sheet, the URL to the feed (including my user-specific "keys") is here.
I am unable to create a working importXML on this sheet. None of my attempted Xpath queries worked, except "*"; but that resulted in all elements being lumped into a single cell.
I have shared my spreadsheet file (link is in the comments below, as I am unable to post more than 2 links) with each of these sheets so that the above examples can be seen and played with. Any advice on the non-working sheet would be wonderful.

In the new XML feed there is no tag "GroupLogByMonth". This might explain why your Xpath query won't return anything when you look for that.
Did the format of the XML change too, along with the new URL?
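A quick way to answer that question is to list the element names the new feed actually contains and build the XPath from those. The sketch below uses a made-up feed (SAMPLE_FEED and all of its element names are assumptions, since the real feed is not shown in the thread).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical feed content, for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<FlightLogsByMonthItems>
  <FlightLog><Id>1</Id><Pilot>alice</Pilot></FlightLog>
  <FlightLog><Id>2</Id><Pilot>bob</Pilot></FlightLog>
</FlightLogsByMonthItems>"""


def tag_counts(xml_text):
    """Count how often each element name occurs in the document."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


counts = tag_counts(SAMPLE_FEED)
for tag, count in counts.items():
    print(tag, count)

# The old query name does not exist in this sample, so //GroupLogByMonth
# would match nothing here.
print("GroupLogByMonth" in counts)
```

If the names come back prefixed with a {namespace} part, the feed declares a default namespace, and a name test such as //*[local-name()='FlightLog'] may be needed in IMPORTXML instead of the plain tag name.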
