How to find the right xpath for Google sheets?

I would like to scrape data from a page, but I cannot figure out the right XPath for Google Sheets. I would like to extract the number 202 from https://www.belvilla.nl/zoeken/?land=nl&rgo=frie (at the top of the page: "202 vakantiehuizen gevonden in Friesland").
If I copy the XPath, I get: //*[@id="result-container-items"]/div[1]/div/div/div[1]/div[1]/div[1]/strong
In Google Sheets I have tried =IMPORTXML(A1;"//*[@id="result-container-items"]/div[1]/div/div/div[1]/div[1]/div[1]/strong)") and some others like =IMPORTXML(A1;"//div[@class='search-numbers']"), but none of them work. For the last one I get the error 'Resource with URL content has exceeded the size limit.' but I'm guessing my XPath is wrong.
Can anyone help me out? Thanks!

IMPORTXML has limitations, especially with elements rendered by JavaScript. However, if scripting is an option, try UrlFetchApp.fetch() in Google Apps Script.
Code:
function fetchUrl(url) {
  var html = UrlFetchApp.fetch(url).getContentText();
  // startString and endString must be unique, or at least be the first match,
  // and must enclose the result we want
  var startString = 'search-result-div" ';
  var endString = 'alternate-dates-filter-bar';
  // indexOf matches the markers literally (search() would treat them as regex)
  var startIndex = html.indexOf(startString);
  var endIndex = html.indexOf(endString);
  if (startIndex === -1 || endIndex === -1) return 'Markers not found';
  // regex for numbers and text content
  var numbers = /strong>([^<]+)<\/strong/;
  var text = /span>([^<]+)<\/span/;
  // clean content then combine matches of numbers and text
  var content = html.substring(startIndex, endIndex).replace(/\s\s+/g, ' ');
  var numberMatch = numbers.exec(content);
  var textMatch = text.exec(content);
  if (!numberMatch || !textMatch) return 'Value not found';
  return (numberMatch[1] + ' ' + textMatch[1]).trim();
}
Note:
The code above is specific to the value you are fetching. You will need to update how the script processes the response if you want anything else.
You can reuse it on other URLs with the same page structure to fetch the value at the same location as the XPath in your post.
It doesn't make use of the XPath.
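The marker-based extraction described above can be tried outside Apps Script with a small helper. This is only a rough sketch: extractBetween is a hypothetical name, and the sample markup is made up for illustration, not copied from the real page.

```javascript
// Hypothetical helper: returns the text between the first occurrence of
// startMarker and the next occurrence of endMarker, or null if either is missing.
function extractBetween(html, startMarker, endMarker) {
  var startIndex = html.indexOf(startMarker);
  if (startIndex === -1) return null;
  var from = startIndex + startMarker.length;
  var endIndex = html.indexOf(endMarker, from);
  if (endIndex === -1) return null;
  // Collapse runs of whitespace and trim the edges
  return html.substring(from, endIndex).replace(/\s+/g, ' ').trim();
}

// Made-up sample markup (an assumption, not the real page source)
var sample = '<div class="search-numbers"><strong> 202 </strong>\n' +
  ' <span>vakantiehuizen gevonden</span></div>';
var count = extractBetween(sample, '<strong>', '</strong>');
var missing = extractBetween(sample, '<em>', '</em>');
```

Returning null when a marker is missing makes it easier to notice that the page layout has changed, instead of getting an exception from a failed match.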

Google Sheets does not support scraping JavaScript-rendered elements. You can check this by disabling JavaScript for a given URL: whatever content is left is what you could import. In your case, this can't be achieved with IMPORTXML.

Related

Importxml() returned "empty cells" or "formula parse error"

I tried Importhtml("https://nepsealpha.com/investment-calandar/dividend","table",) and then Importxml("https://nepsealpha.com/investment-calandar/dividend",xpath). I got the XPath from the "SelectorGadget" extension for Google Chrome, but still couldn't import it. It shows either "empty content" or "formula parse error".

You can retrieve nearly all of the information this way
=importxml(url,"//div/@data-page")
and then parse the JSON.
By script: =getData("https://nepsealpha.com/investment-calandar/dividend")
function getData(url) {
  var from = 'data-page="';
  var to = '"></div></body>';
  // the attribute value is HTML-escaped JSON, so decode &quot; back to "
  var jsonString = UrlFetchApp.fetch(url).getContentText().split(from)[1].split(to)[0].replace(/&quot;/g, '"');
  var json = JSON.parse(jsonString).props.today_prices_summary.top_volume;
  var headers = Object.keys(json[0]);
  return [headers, ...json.map(obj => headers.map(header => obj[header]))];
}
Edit:
To update periodically, add this script
function update() {
  var chk = SpreadsheetApp.getActiveSpreadsheet().getSheets()[0].getRange('A1');
  chk.setValue(!chk.getValue());
}
Put a trigger of your choice on the update function and change the formula as follows (getData ignores the extra $A$1 argument; it only forces a recalculation whenever the cell is toggled)
=getData("https://nepsealpha.com/investment-calandar/dividend",$A$1)
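To see what the decoding and reshaping steps do without calling the site, here is a rough offline sketch. The helper names (decodeDataPage, toRows) and the sample HTML are assumptions made up for illustration; the real page's JSON has a different shape.

```javascript
// Hypothetical helper: pulls the data-page attribute out of the HTML and
// turns the HTML-escaped JSON back into an object.
function decodeDataPage(html) {
  var match = /data-page="([^"]*)"/.exec(html);
  if (!match) return null;
  var jsonString = match[1].replace(/&quot;/g, '"').replace(/&amp;/g, '&');
  return JSON.parse(jsonString);
}

// Hypothetical helper: converts an array of objects into a header row
// followed by one row per object, the shape a custom function returns.
function toRows(records) {
  if (!records.length) return [];
  var headers = Object.keys(records[0]);
  return [headers].concat(records.map(function (record) {
    return headers.map(function (header) { return record[header]; });
  }));
}

// Made-up sample, not the real page
var sampleHtml = '<div id="app" data-page="{&quot;props&quot;:{&quot;items&quot;:' +
  '[{&quot;symbol&quot;:&quot;ABC&quot;,&quot;volume&quot;:10},' +
  '{&quot;symbol&quot;:&quot;XYZ&quot;,&quot;volume&quot;:7}]}}"></div>';
var rows = toRows(decodeDataPage(sampleHtml).props.items);
```

The regex stops at the first unescaped double quote, which works here because every quote inside the attribute value is escaped as &quot;.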

I know that's not the answer you want to see.
It's impossible to get any content from this website using IMPORTXML or other tools included in Google Sheets.
It's generated using JavaScript. Once JavaScript is disabled, no content is displayed.
It's done on purpose. Financial companies pay for live stock data and they don't want to share it with us for free.
So the site is protected against tools like importxml.

App Script - Exporting Sheets Hyperlinks to Docs

I have a google sheet - and when a new row appears I am writing the output into a Google Document using a predefined template via a merge.
All is working but as I could only work out how to use the .replaceText() function to achieve the merge, the hyperlinks in some of the sheet columns get exported as plain text.
After much fiddling and cribbing of code (thanks all) I managed to cobble together the following function:
function makeLinksClickable(document) {
  const URL_PATTERN = "https://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]"
  const URL_PATTERN_LENGTH_CORECTION = "".length
  const body = document.getBody()
  var foundElement = body.findText(URL_PATTERN);
  while (foundElement != null) {
    var foundText = foundElement.getElement().asText();
    const start = foundElement.getStartOffset();
    const end = foundElement.getEndOffsetInclusive() - URL_PATTERN_LENGTH_CORECTION;
    const url = foundText.getText().substring(start,end+1)
    foundText.setLinkUrl(url)
    foundElement = body.findText(URL_PATTERN, foundElement);
  }
}
After writing out all the columns to the document I call this function on the created document to look for a hyperlink and make it hyper :)
As long as each cell only contains one hyperlink my function works.
It also works where there are multiple hyperlinks in the document.
However, some cells can have multiple hyperlinks and writes them out to the document with a new line for each one.
Although the function finds the multiple URLs correctly and makes them clickable in the document there is a problem.
For example, if there are 2 hyperlinks in the cell they get exported to 2 lines in the document, but after running them through the function - both hyperlinks will now link to the same image (the first) even though each hyperlink itself is the unique link from the original cell.
(Note - If I don't run my function and leave the exported hyperlinks as text. Then go into the created document and manually add a space to the ends of the exported hyperlinks then they turn blue and become clickable and link to the correct image, I did try to add a space programmatically before this but couldn't work that out either)
I have exhausted my limited coding ability and can't see why my function which "seems" to work its way through each hyperlink correctly doesn't make it then link to the right image in the document.
Any help would be most appreciated.
Thanks
// ----------------------------------------------------------------------
Thank you for taking the time to look at this, I will try to explain the issues further. It is hard to show here as the links actually work properly when copied here they only misbehave in the google document.
A cell in the exported row has multiple hyperlinks separated by a comma.
they get exported from the cell to the document as text strings like this:
Links in single Sheets Cell for exporting:
"hyperlink-1-as-a-string", (links to image 1)
"hyperlink-2-as-a-string", (links to image 2)
"hyperlink-3-as-a-string", (links to image 3)
"hyperlink-4-as-a-string", (links to image 4)
"hyperlink-5-as-a-string" (links to image 5)
I then run my function to make them clickable again.
If there are two or more hyperlinks in the same cell when exported, then I get the following issue after running the function.
Exported text links converted to clickable hyperlinks:
"hyperlink-1-as-a-string", (links to image 5)
"hyperlink-2-as-a-string", (links to image 5)
"hyperlink-3-as-a-string", (links to image 5)
"hyperlink-4-as-a-string", (links to image 5)
"hyperlink-5-as-a-string" (links to image 5)
I "think" what happens is that my function makes all 5 hyperlinks one big hyperlink that happens to use the last hyperlinks image.
If I copy and paste the URLs into a separate document like an email then they appear as one large hyperlink, not 5 separate ones.
// ---------------------------------------------------------------
The function searches for text patterns that are in fact google hyperlinks.
(starting https:// etc)
When it finds one it works out the length to the end of the text string and then uses setLinkUrl() to make the hyperlink - into a clickable hyperlink.
If there is only one text hyperlink then it works.
If there is more than one text hyperlink, separated by commas then it does not.

I worked something out. This is what I ended up with, it is basically put together from a few other questions & answers - It's not very clever but it works.
Thanks to the various posters who enabled me to figure this out.
function sortLinks(colId, mapPoint, myBody) {
  var urls = [];
  if (colId.includes(",")) { // i.e. there is more than one URL
    var tmp = colId.split(",");
    urls = urls.concat(tmp);
  }
  else {
    urls[0] = colId; // 1 URL, no "," so add it to array[0]
  }
  if (urls.length > 0) {
    var tag = mapPoint;
    var newLine = "\n";
    var element = myBody.findText(tag);
    if (element) {
      var start = element.getStartOffset();
      var text = element.getElement().asText();
      text.deleteText(start, start + tag.length - 1);
      urls.forEach((url, index) => {
        url = url.trim();
        var name = "Image-Video" + (index + 1);
        // link only the characters of this name, using the trimmed url
        text.appendText(name).setLinkUrl(start, start + name.length - 1, url);
        text.appendText(newLine);
        start = start + name.length + newLine.length;
      });
    }
  }
}
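The key difference from the function in the question seems to be the ranged form of setLinkUrl, which limits each link to its own characters instead of the whole text element. As a rough sketch of how the offsets could be worked out when the URLs themselves stay visible, the helper below (findUrlRanges is a hypothetical name, and the sample text is made up) returns a start and inclusive end offset per URL. It assumes the URLs contain no commas.

```javascript
// Hypothetical helper: finds every URL in the text of one Docs text element
// and returns its start offset, inclusive end offset, and the URL itself.
function findUrlRanges(text) {
  // Stops at whitespace, commas and quotes so adjacent URLs are not merged
  var pattern = /https?:\/\/[^\s,"]+/g;
  var ranges = [];
  var match;
  while ((match = pattern.exec(text)) !== null) {
    ranges.push({
      start: match.index,
      end: match.index + match[0].length - 1,
      url: match[0]
    });
  }
  return ranges;
}

// Made-up sample: two URLs exported from one cell onto separate lines
var sampleText = 'https://example.com/a.png,\nhttps://example.com/b.png';
var ranges = findUrlRanges(sampleText);
// In Apps Script, each range could then be applied with
// textElement.setLinkUrl(range.start, range.end, range.url);
```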

Using API in GoogleSheet

I want to import in google sheet data from https://www.coinspeaker.com/ieo/feed/
function callCoinSpeaker() {
var response = UrlFetchApp.fetch("https://www.coinspeaker.com/ieo/feed/");
Logger.log(response.getContentText());
var fact = response.getContentText();
var sheet = SpreadsheetApp.getActiveSheet();
sheet.getRange(1,1).setValue([fact]);
}
The script works fine, but I don't know how to format the output, which all ends up in a single cell (A1).
I would like to write code that automatically formats the output by splitting it into columns and rows. Is there any example of formatting the output of an API request? Thanks all!

I think the issue is that when you make a GET request to your link, the response comes back as a string.
To be able to use the data, you should parse the response with JSON.parse(fact).
Use Logger.log(JSON.parse(fact)) to see what is happening.
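Once the response is parsed, it still has to be turned into rows and columns. Here is a minimal sketch, assuming the response is a JSON array of flat objects (I have not checked what this particular feed returns). responseToRows is a hypothetical name; if the text is not JSON it simply falls back to a single cell.

```javascript
// Hypothetical helper: converts a response body into a 2D array of
// header row + data rows, or [[text]] when the body cannot be used as a table.
function responseToRows(text) {
  var data;
  try {
    data = JSON.parse(text);
  } catch (e) {
    return [[text]]; // not JSON: keep the raw text in one cell
  }
  var records = Array.isArray(data) ? data : [data];
  var first = records[0];
  if (first === null || typeof first !== 'object') return [[text]];
  var headers = Object.keys(first);
  return [headers].concat(records.map(function (record) {
    return headers.map(function (header) { return record[header]; });
  }));
}

// Made-up sample response body
var rows = responseToRows('[{"name":"A","price":1},{"name":"B","price":2}]');
var fallback = responseToRows('<rss></rss>');
```

In Apps Script the result could then be written in one call with sheet.getRange(1, 1, rows.length, rows[0].length).setValues(rows) instead of setValue on a single cell.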

AdWords PLACEMENT_PERFORMANCE_REPORT not pulling URLs

This should be extremely simple but for some reason it doesn't seem to work. I'm trying to pull the URLs of display placements using the PLACEMENT_PERFORMANCE_REPORT but instead of URLs it's just returning "--".
The code I'm using is:
var report = AdWordsApp.report(
    "SELECT CampaignName, Clicks, FinalAppUrls, FinalUrls " +
    "FROM PLACEMENT_PERFORMANCE_REPORT " +
    "WHERE Clicks > 0 " +
    "DURING LAST_30_DAYS");
var rows = report.rows();
while (rows.hasNext()) {
  var row = rows.next();
  var url = row["FinalUrls"];
  Logger.log(url);
}
I've tried logging the CampaignName and clicks and they're working as expected, so can't understand what the issue is here. The only thing I can think of is that in the reference guide it says:
List of final URLs of the main object of this row. UrlList elements
are returned in JSON list format
I'm not entirely sure what JSON list format is, but when I log the typeof url it says it's a string, so thought it shouldn't be an issue.

The FinalAppUrls and FinalUrls list the target URLs that you set on the individual managed placements.
If you're interested in the URL (or rather the domain) of the placement itself, you'll have to request either the Criteria or the DisplayName field in your report; both contain the domain of the placement.
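On the "JSON list format" mentioned in the question: when final URLs are set, the field arrives as a string holding a JSON array, so it can be parsed into a real array. A rough sketch follows; parseUrlList is a hypothetical name and the sample value is made up. The "--" check covers the empty value the report returned in the question.

```javascript
// Hypothetical helper: turns a FinalUrls-style field into an array of URLs.
function parseUrlList(value) {
  // "--" is what the report shows when no final URL is set
  if (!value || value === '--') return [];
  try {
    var parsed = JSON.parse(value);
    return Array.isArray(parsed) ? parsed : [String(parsed)];
  } catch (e) {
    return [value]; // not JSON after all: treat it as a single URL
  }
}

// Made-up sample values
var none = parseUrlList('--');
var urls = parseUrlList('["https://example.com/a","https://example.com/b"]');
```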

Google App script: Stumped on command to extract 'title' from forum HTML page & paste into a spreadsheet (my code inside)

I'm extremely new to this and I've been trying to get the title of each unique forum page (or topic). Here is the code I have so far:
function GraalGet() {
  //parses forums for ALL posts one by one, extract <title> from HTML webpage
  var sheet = SpreadsheetApp.getActiveSheet();
  var i = 31;
  var url = "http://www.graalians.com/forums/showthread.php?p=" + i;
  //var params = {method : "post"}; can this be used at all?
  //The aim: loop this once you can get 1 result.
  var geturl = UrlFetchApp.fetch(url).getContentText(); //maybe .getContentText should be elsewhere?
  var parseurl = Xml.parse(geturl, true); //confirmed - this is true because it wont parse HTML if false
  var titleinfo = parseurl.getElement().getElement("html"); //.getElement('body');//.getElements("title");
  sheet.appendRow([titleinfo, i]);
}
In addition the script would write down the topic number in the adjoining cell.
There are a lot of answered questions about extracting XML data, and this example is about parsing HTML, but I couldn't pull up any results. I'm honestly stumped, and any help with finding and extracting the <title> tag will be appreciated. (If you have the time, please feel free to explain as well, but I'll be thankful for any help really.)
For reference I have used these:
Google's Kevin Bacon Script
The authors comments on bugs with the script & some explanation
I'm sorry if I'm being pedantic, this is my first post & I don't want to anger anyone, please do tell me if I've broken any rules, I'll do my best to fix them. I've left the comments I made for myself for your perusal too.

You can use Logger.log to print out debugging information. I did this with your function and figured out that the <title> tag is nested inside the <head> tag, so you should use something like the following. Also, getElement returns an XmlElement object, which you should convert to a String using getText().
var titleinfo = parseurl.getElement().getElement('head').getElement('title');
sheet.appendRow([titleinfo.getText(), i]);
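If walking the parsed tree turns out to be fragile, a different technique is to pull the title straight from the fetched string with a regular expression. This is not the Xml service approach above, just a rough alternative sketch; extractTitle is a hypothetical name and the sample HTML is made up.

```javascript
// Hypothetical helper: returns the contents of the first <title> element,
// with whitespace collapsed, or null when the page has no title.
function extractTitle(html) {
  var match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].replace(/\s+/g, ' ').trim() : null;
}

// Made-up sample page
var samplePage = '<html><head><TITLE>\n  Some forum thread \n</TITLE></head><body></body></html>';
var title = extractTitle(samplePage);
```

In the loop from the question, the result would go into sheet.appendRow([extractTitle(geturl), i]).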
