YQL two requests for paging/limit

I'm playing around with an XML API where the search doesn't support paging/limit. The recommended way is to request all the IDs first and then, in a second request, fetch the data and handle paging on your own.
First request:
http://example.com?search=foobar&columns=ID
<results>
<item><id>1</id></item>
<item><id>2</id></item>
<item><id>3</id></item>
<item><id>4</id></item>
<item><id>5</id></item>
<item><id>6</id></item>
<item><id>7</id></item>
<item><id>8</id></item>
<item><id>9</id></item>
<item><id>10</id></item>
</results>
Second request:
http://example.com?search=1,2,3,4,5&columns=ID,title,description
<results>
<item><id>1</id><title>foobar</title><description /></item>
<item><id>2</id><title>foobar</title><description /></item>
<item><id>3</id><title>foobar</title><description /></item>
<item><id>4</id><title>foobar</title><description /></item>
<item><id>5</id><title>foobar</title><description /></item>
</results>
Is it possible with YQL to combine this into a single request with a search result count and paging support?

There isn't a straightforward way in the documentation, but you could do this:
1) Create a YQL table A which queries http://example.com?search=foobar&columns=ID
2) Create a YQL table B which queries http://example.com?search=1,2,3,4,5&columns=ID,title,description
3) Now, create a YQL table C which does a y.query on a join of A and B, like so:
select * from B where search in (select ids from A where search="foobar")
Of course, the query syntax will change based on the table names and the keys defined in them. For more information, refer to the YQL documentation on joins.
Hope this is clear, and if you find something better in this case, do let me know :)
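For reference, here is a minimal sketch of what table A could look like against the YQL open table schema (the binding details and names are illustrative, not from the original question):

<?xml version="1.0" encoding="UTF-8"?>
<table xmlns="http://query.yahooapis.com/v1/schema/table.xsd">
  <meta>
    <description>Table A: returns only the IDs for a search term</description>
  </meta>
  <bindings>
    <select itemPath="results.item" produces="XML">
      <urls>
        <url>http://example.com?columns=ID</url>
      </urls>
      <inputs>
        <key id="search" type="xs:string" paramType="query" required="true" />
      </inputs>
    </select>
  </bindings>
</table>

Table B would look the same with columns=ID,title,description and a comma-separated list of IDs as its key.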

Create a YQL table with paging:
<paging model="offset" matrix="true">
<start id="internalIndex" default="0" />
<pagesize id="internalPerPage" max="250" />
</paging>
Use JavaScript in the table's execute block to handle the two fetches and the paging:
var internalIndex = parseInt(request.matrixParams['internalIndex']);
var internalPerPage = parseInt(request.matrixParams['internalPerPage']);
// First request: fetch every ID for the search term.
var interimURL = 'http://example.com?columns=ID';
interimURL += '&search=' + request.queryParams['search'];
var interimQueryParameter = {url: interimURL};
var interimQuery = y.query("SELECT * FROM xml WHERE url=#url", interimQueryParameter);
var rows = interimQuery.results.*;
// Slice out the requested page of IDs.
var xml = rows;
var from = internalIndex;
var to = ((from + internalPerPage) < xml.length()) ? (from + internalPerPage) : xml.length();
var sliced = [];
for (; from < to; from++) {
  sliced.push(xml[from].id); // E4X child access to each item's <id>
}
// Second request: fetch the full columns for just this page of IDs.
var finalURL = 'http://example.com?';
finalURL += 'search=' + sliced.join(",");
finalURL += '&columns=ID,title,description';
var finalQueryParameter = {url: finalURL};
var finalQuery = y.query("SELECT * FROM xml WHERE url=#url", finalQueryParameter);
var finalResults = finalQuery.results.response;
// Expose the cursor position and the total hit count alongside the page.
finalResults.node += <internalCursorPosition>{internalIndex}</internalCursorPosition>;
finalResults.node += <internalCount>{rows.length()}</internalCount>;
response.object = finalResults;
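If I read the open table docs correctly, consumers would then page through the table with YQL's remote limit syntax, for example:

select * from mytable(0,5) where search="foobar"
select * from mytable(5,5) where search="foobar"

The first statement should return rows 0-4 and the second rows 5-9; treat the exact mapping of remote limits onto the matrix parameters above as an assumption. Either way, the execute block appends internalCount and internalCursorPosition to the results so the caller can build paging controls.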

Related

Parse XML Feed via Google Apps Script ("Cannot read property 'getChildren' of undefined")

I need to parse a Google Alert RSS Feed with Google Apps Script.
Google Alerts RSS-Feed
I found a script which should do the job, but I can't get it working with Google's RSS feed.
The feed looks like this:
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:idx="urn:atom-extension:indexing">
<id>tag:google.com,2005:reader/user/06807031914929345698/state/com.google/alerts/10604166159629661594</id>
<title>Google Alert – garbe industrial real estate</title>
<link href="https://www.google.com/alerts/feeds/06807031914929345698/10604166159629661594" rel="self"/>
<updated>2022-03-17T19:34:28Z</updated>
<entry>
<id>tag:google.com,2013:googlealerts/feed:10523743457612307958</id>
<title type="html"><b>Garbe Industrial</b> plant Multi-User-Immobilie in Ludwigsfelde - <b>Property</b> Magazine</title>
<link href="https://www.google.com/url?rct=j&sa=t&url=https://www.property-magazine.de/garbe-industrial-plant-multi-user-immobilie-in-ludwigsfelde-117551.html&ct=ga&cd=CAIyGWRmNjU0ZGNkMzJiZTRkOWY6ZGU6ZGU6REU&usg=AFQjCNENveXYlfrPc7pZTltgXY8lEAPe4A"/>
<published>2022-03-17T19:34:28Z</published>
<updated>2022-03-17T19:34:28Z</updated>
<content type="html">Die <b>Garbe Industrial Real Estate</b> GmbH startet ihr drittes Neubauprojekt in der Metropolregion Berlin/Brandenburg. Der Projektentwickler hat sich ...</content>
<author>
...
</feed>
I want to extract entry -> id, title, link, updated, content.
I used this script:
function ImportFeed(url, n) {
  var res = UrlFetchApp.fetch(url).getContentText();
  var xml = XmlService.parse(res);
  // var item = xml.getRootElement().getChild("channel").getChildren("item")[n - 1].getChildren();
  var item = xml.getRootElement().getChildren("entry")[n - 1].getChildren();
  var values = item.reduce(function(obj, e) {
    obj[e.getName()] = e.getValue();
    return obj;
  }, {});
  return [[values.id, values.title, values.link, values.updated, values.content]];
}
I modified this part, but all I got was "TypeError: Cannot read property 'getChildren' of undefined":
//var item = xml.getRootElement().getChild("channel").getChildren("item")[n - 1].getChildren();
var item = xml.getRootElement().getChildren("entry")[n - 1].getChildren();
Any idea is welcome!
In your situation, how about the following modified script?
Modified script:
function SAMPLE(url, n = 1) {
  var res = UrlFetchApp.fetch(url).getContentText();
  // Escape raw ampersands so XmlService can parse the feed.
  var root = XmlService.parse(res.replace(/&/g, "&amp;")).getRootElement();
  var ns = root.getNamespace();
  var entries = root.getChildren("entry", ns);
  if (!entries || entries.length == 0) return "No values";
  var header = ["id", "title", "link", "updated", "content"];
  var values = header.map(f => f == "link" ? entries[n - 1].getChild(f, ns).getAttribute("href").getValue().trim() : entries[n - 1].getChild(f, ns).getValue().trim());
  return [values];
}
In this case, when you use getChild and getChildren, please use the namespace. I thought that this might be the reason for your issue.
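For example, the difference is easy to see with a quick sketch (assuming res already holds the fetched feed text):

var root = XmlService.parse(res.replace(/&/g, "&amp;")).getRootElement();
var ns = root.getNamespace(); // the Atom namespace, http://www.w3.org/2005/Atom
Logger.log(root.getChild("title"));     // null: "title" lives in the Atom namespace
Logger.log(root.getChild("title", ns)); // the <title> element

Without the namespace, getChild returns null, and the next chained call is what throws "Cannot read property ... of undefined".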
From your script, I guessed that you might be using it as a custom function. In that case, please change the function name from ImportFeed to something else, because IMPORTFEED is a built-in function of Google Sheets. In this sample, SAMPLE is used.
If you want to change the columns, please modify header.
In this sample, the default value of n is 1. In this case, the 1st entry is retrieved.
In this script, for example, you can put =SAMPLE("URL", 1) into a cell as a custom function, and the result value is returned.
Note:
If the above modified script is not the direct solution to your issue, can you provide a sample value of res? With that, I would like to modify the script.
As additional information, when you want to put all the values into the sheet by running the script from the script editor, you can also use the following script.
function myFunction() {
  var url = "###"; // Please set URL.
  var res = UrlFetchApp.fetch(url).getContentText();
  // Escape raw ampersands so XmlService can parse the feed.
  var root = XmlService.parse(res.replace(/&/g, "&amp;")).getRootElement();
  var ns = root.getNamespace();
  var entries = root.getChildren("entry", ns);
  if (!entries || entries.length == 0) return "No values";
  var header = ["id", "title", "link", "updated", "content"];
  var values = entries.map(e => header.map(f => f == "link" ? e.getChild(f, ns).getAttribute("href").getValue().trim() : e.getChild(f, ns).getValue().trim()));
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Sheet1"); // Please set the sheet name.
  sheet.getRange(sheet.getLastRow() + 1, 1, values.length, values[0].length).setValues(values);
}
References:
XML Service
map()

Importing API data via importJSON

Having a bit of trouble using importJSON for the first time in Google Sheets. My data imports in a truncated form, and I can't find any way to filter things the way I'd like.
API source: https://prices.runescape.wiki/api/v1/osrs/1h
I'm using the following command: =IMPORTJSON(B1;B2)
where B1 is the source link, and B2 references any filters I've applied. So far I have no filters.
My result is a truncated list that displays as such:
data/2/avgHighPrice 166
data/2/highPriceVolume 798801
data/2/avgLowPrice 162
data/2/lowPriceVolume 561908
data/6/avgHighPrice 182132
data/6/highPriceVolume 7
data/6/avgLowPrice 180261
data/6/lowPriceVolume 37
data/8/avgHighPrice 195209
data/8/highPriceVolume 4
data/8/avgLowPrice 192880
data/8/lowPriceVolume 40
In the examples I've seen and worked with (primarily the example provided by the add-on), the data naturally pivots into a table. I can't even achieve that, which would be workable, although I'm really only looking to pull the avgHighPrice and avgLowPrice fields.
EDIT:
I'm looking for results along the lines of this:
#   avgLowPrice   avgHighPrice
2   162           166
6   180261        182132
8   192880        195209
EDIT2:
So I have one more thing I was hoping to figure out. Using your script, I created another script to pull the names and item IDs:
function naming(url){
  // var url = 'https://prices.runescape.wiki/api/v1/osrs/mapping'
  var data = JSON.parse(UrlFetchApp.fetch(url).getContentText())
  var result = []
  result.push(['#', 'id', 'name'])
  for (let p in eval('data.data')) {
    try { result.push([p, data.item(p).ID, data.item(p).Name]) } catch(e) {}
  }
  return result
}
Object.prototype.item = function(i){ return this[i] };
I'm wondering if it is possible to correlate the item Name with the item ID from the initial pricing script. To start, the 1st script only lists items that are tradeable, while the 2nd lists ALL item IDs in the game. I'd essentially like to correlate the 1st and 2nd scripts to show as such:
ID   Name          avgHighPrice   avgLowPrice
2    Cannonball    180261         192880
6    Cannon Base   182132         195209
Try this script (without any add-on):
function prices(url){
  // var url = 'https://prices.runescape.wiki/api/v1/osrs/1h'
  var data = JSON.parse(UrlFetchApp.fetch(url).getContentText());
  var result = [];
  result.push(['#', 'avgHighPrice', 'avgLowPrice']);
  // data.data is an object keyed by item id.
  for (let p in data.data) {
    try { result.push([p, data.data[p].avgHighPrice, data.data[p].avgLowPrice]); } catch(e) {}
  }
  return result;
}
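If you use it as a custom function, putting something like =prices("https://prices.runescape.wiki/api/v1/osrs/1h") in a cell should fill in the table, in the same spirit as the =IMPORTJSON(B1;B2) call from the question.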
You can retrieve the information for naming from the mapping endpoint as follows:
function naming(url){
  // var url = 'https://prices.runescape.wiki/api/v1/osrs/mapping'
  var data = JSON.parse(UrlFetchApp.fetch(url).getContentText());
  var result = [];
  result.push(["id", "name", "examine", "members", "lowalch", "limit", "value", "highalch"]);
  // The mapping endpoint returns an array of item objects.
  data.forEach(function(elem){
    result.push([elem.id.toString(), elem.name, elem.examine, elem.members, elem.lowalch, elem.limit, elem.value, elem.highalch]);
  });
  return result;
}
https://docs.google.com/spreadsheets/d/1HddcbLchYqwnsxKFT2tI4GFytL-LINA-3o9J3fvEPpE/copy
Integrated function
=pricesV2()
function pricesV2(){
  // Build an id -> name map from the mapping endpoint.
  var url = 'https://prices.runescape.wiki/api/v1/osrs/mapping';
  var data = JSON.parse(UrlFetchApp.fetch(url).getContentText());
  let myItems = new Map();
  data.forEach(function(elem){ myItems.set(elem.id.toString(), elem.name); });
  // Then walk the 1h prices and join on the item id.
  url = 'https://prices.runescape.wiki/api/v1/osrs/1h';
  data = JSON.parse(UrlFetchApp.fetch(url).getContentText());
  var result = [];
  result.push(['#', 'name', 'avgHighPrice', 'avgLowPrice']);
  for (let p in data.data) {
    try { result.push([p, myItems.get(p), data.data[p].avgHighPrice, data.data[p].avgLowPrice]); } catch(e) {}
  }
  return result;
}
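The Map is what ties the two endpoints together: the mapping response supplies id/name pairs, and the 1h response is keyed by those same ids, so building the id -> name lookup once lets the price loop resolve each name in constant time instead of rescanning the mapping array for every item.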

Extract visual text from Google Classic Site page using Apps Script in Google Sheets

I have about 5,000 Classic Google Sites pages that I need to have a Google Apps script under Google Sheets examine one by one, extract the data, and enter that data into the Google Sheet row by row.
I wrote an Apps Script that uses one of the sheets, called "Pages", which contains the exact URL of each page row by row, and runs down that list while doing the extraction.
That in turn would get the HTML contents, and I would then use regex to extract the data I want, which is the values to the right of each of the following...
Job name
Domain owner
Urgency/Impact
ISOC instructions
It would then write that data under the proper columns in the Google Sheet.
This worked except for one big problem: the HTML is not consistent. Also, IDs and tags were not used, so doing this through SitesApp.getPageByUrl really isn't workable.
Here is the code I came up with for that attempt.
function startCollection() {
  var masterList = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Pages");
  var startRow = 1;
  var lastRow = masterList.getLastRow();
  for (var i = startRow; i <= lastRow; i++) {
    var target = masterList.getRange("A" + i).getValue();
    sniff(target);
  }
}
function sniff(target) {
  var pageURL = target;
  var pageContent = SitesApp.getPageByUrl(pageURL).getHtmlContent();
  Logger.log("Scraping: ", target);
  // Extract the job name
  var JobNameRegExp = new RegExp(/(Job name:<\/b><\/td><td style='text-align:left;width:738px'>)(.*?)(<\/td>)/m);
  var JobNameValue = JobNameRegExp.exec(pageContent);
  var JobMatch = JobNameValue[2];
  if (JobMatch == null) {
    JobMatch = "NOT FOUND: " + pageURL;
  }
  // Extract the domain owner
  var DomainRegExp = new RegExp(/(Domain owner:<\/b><\/td><td style='text-align:left;width:738px'><span style='font-family:arial,sans,sans-serif;font-size:13px'>)(.*?)(<\/span>)/m);
  var DomainValue = DomainRegExp.exec(pageContent);
  var DomainMatch = DomainValue[2];
  if (DomainMatch == null) {
    DomainMatch = "N/A";
  }
  // Extract urgency & impact
  var UrgRegExp = new RegExp(/(Urgency\/Impact:<\/b><\/td><td style='text-align:left;width:738px'>)(.*?)(<\/td>)/m);
  var UrgValue = UrgRegExp.exec(pageContent);
  var UrgMatch = UrgValue[2];
  if (UrgMatch == null) {
    UrgMatch = "N/A";
  }
  // Extract ISOC instructions
  var ISOCRegExp = new RegExp(/(ISOC instructions:<\/b><\/td><td style='text-align:left;width:738px'>)(.*?)(<\/td>)/m);
  var ISOCValue = ISOCRegExp.exec(pageContent);
  var ISOCMatch = ISOCValue[2];
  if (ISOCMatch == null) {
    ISOCMatch = "N/A";
  }
  // Add the record to the sheet
  var row_data = {
    Job_Name: JobMatch,
    Domain_Owner: DomainMatch,
    Urgency_Impact: UrgMatch,
    ISOC_Instructions: ISOCMatch,
  };
  insertRowInTracker(row_data);
}
function insertRowInTracker(rowData) {
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Jobs");
  var rowValues = [];
  var columnHeaders = sheet.getDataRange().offset(0, 0, 1).getValues()[0];
  Logger.log("Writing to the sheet: ", sheet.getName());
  Logger.log("Writing Row Data: ", rowData);
  columnHeaders.forEach((header) => {
    rowValues.push(rowData[header]);
  });
  sheet.appendRow(rowValues);
}
So for my next idea, I have thought about using UrlFetchApp.fetch. The one problem, though, is that these pages on that classic Google Site sit behind a domain that is not shared with the public. While SitesApp.getPageByUrl has the script ask for authorization and works, UrlFetchApp.fetch does not, meaning that when it tries to call the page directly, it just gets the Google login page.
I might be able to work around this and turn them public, but I am still working on that.
I am running out of ideas fast on this one and am hoping there is another way I have not thought of or seen. What I would really like to do is not even mess with the HTML content. I would like to use Apps Script under the Google Sheet to just look at the actual data presented on the page, then match a label and capture the value to the right of it.
For example have it go down the list of URLS on sheet called "Pages" and do the following for each page:
Find the following values:
Find the text "Job name:", capture the text to the right of it.
Find the text "Domain owner:", capture the text to the right of it.
Find the text "Urgency/Impact:", capture the text to the right of it.
Find the text "ISOC instructions:", capture the text to the right of it.
Write those values to a new row in the sheet called "Jobs", as seen below.
Then move on to the next URL in the sheet called "Pages" and repeat until all rows in the sheet "Pages" have been completed.
Example of the data I want to capture
I have created an exact copy of one of the pages for testing and is public.
https://sites.google.com/site/2020dump/test
An inspect example
The raw HTML of the table which contains all the data I am after.
<tr>
<td style="width:190px"><b>Domain owner:</b></td>
<td style="text-align:left;width:738px">IT.FinanceHRCore </td>
</tr>
<tr>
<td style="width:190px"> <b>Urgency/Impact:</b></td>
<td style="text-align:left;width:738px">Medium (3 - Urgency, 3 - Impact) </td>
</tr>
<tr>
<td style="width:190px"><b>ISOC instructions:</b></td>
<td style="text-align:left;width:738px">None </td>
</tr>
<tr>
<td style="width:190px"></td>
<td style="text-align:left;width:738px"> </td>
</tr>
</tbody>
</table>
Any examples of how I can accomplish this? I am not sure, from an Apps Script perspective, how to go about ignoring the HTML and looking only at the actual data displayed on the page, for example finding the text "Job name:" and grabbing the text to the right of it.
The goal at the end of the day is to transfer the data from each page into one big Google Sheet so we can kill off the Google Classic Site.
I have been scraping data with apps script using regular expressions for a while, but I will say that the formatting of this page does make it difficult.
A lot of the pages that I scrape have tables in them, so I made a helper script that will go through and clean them up and turn them into arrays. Copy and paste the script below into a new Google Apps Script file:
function scrapetables(html, startingtable, extractlinksTF) {
  var tableregex = /<table[\s\S]*?<\/table>/g;
  var tables = html.match(tableregex);
  var arrays = [];
  var i = startingtable || 0;
  while (tables[i]) {
    var thistable = [];
    var rows = tables[i].match(/<tr[\s\S]*?<\/tr>/g);
    if (rows) {
      var j = 0;
      while (rows[j]) {
        thistable.push(tablerow(rows[j], extractlinksTF));
        j++;
      }
      arrays.push(thistable);
    }
    i++;
  }
  return arrays;
}
function removespaces(string) {
  // Strip newlines/tabs and collapse non-breaking spaces.
  return string.trim().replace(/[\r\n\t]/g, '').replace(/&nbsp;/g, ' ');
}
function tablerow(row, extractlinksTF) {
  var cells = row.match(/<t[dh][\s\S]*?<\/t[dh]>/g);
  var i = 0;
  var thisrow = [];
  while (cells[i]) {
    thisrow.push(removehtmlmarkup(cells[i], extractlinksTF));
    i++;
  }
  return thisrow;
}
function removehtmlmarkup(string, extractlinksTF) {
  var string2 = removespaces(string.replace(/<\/?[A-Za-z].*?>/g, ''));
  var obj = {string: string2};
  // Check for a link inside the cell.
  if (/<a href=.*?<\/a>/.test(string)) {
    obj['link'] = /<a href="(.*?)"/.exec(string)[1];
  }
  if (extractlinksTF) {
    return obj;
  } else {
    return string2;
  }
}
Running this got close, but at the moment it doesn't handle nested tables well, so I cleaned up the input by sending only the table we want, isolating it with a regular expression:
var tablehtml = /(<table[\s\S]{200,1000}Job Name[\s\S]*?<\/table>)/im.exec(html)[1]
Your parent function will then look like this:
function sniff(pageURL) {
  var html = SitesApp.getPageByUrl(pageURL).getHtmlContent();
  var tablehtml = /(<table[\s\S]{200,1000}Job Name[\s\S]*?<\/table>)/im.exec(html)[1];
  var table = scrapetables(tablehtml);
  var row_data = {
    Job_Name: na(table[0][3][1]),     // 1st table in the html, row 4, cell 2
    Domain_Owner: na(table[0][4][1]), // 1st table in the html, row 5, cell 2, etc.
    Urgency_Impact: na(table[0][5][1]),
    ISOC_Instructions: na(table[0][6][1])
  };
  insertRowInTracker(row_data);
}
function na(string) {
  if (string) {
    return string;
  } else {
    return 'N/A';
  }
}

How to find the SearchImpressionShare for a particular keyword?

One could easily find the average position for a keyword using the getAveragePosition() method, but the same is not available for SearchImpressionShare.
EDIT
I tried to get the SearchImpressionShare by querying a report, but that gives me inconsistent results.
function main() {
  var keywordId = 297285633818;
  var last14dayStatsQuery = "SELECT Id, SearchTopImpressionShare FROM KEYWORDS_PERFORMANCE_REPORT WHERE Id = " + keywordId + " DURING LAST_14_DAYS";
  var last14dayReport = AdWordsApp.report(last14dayStatsQuery);
  var last14dayRows = last14dayReport.rows();
  var last14dayRow = last14dayRows.next();
  Logger.log('Keyword: ' + last14dayRow['Id'] + ' SearchTopIS: ' + last14dayRow['SearchTopImpressionShare']);
}
For example, below are the two outputs I received after running the same code twice.
Output 1:
10/16/2019 10:47:29 AM Keyword: 297285633818 SearchTopIS: 0.0
Output 2:
10/16/2019 10:47:45 AM Keyword: 297285633818 SearchTopIS: 0.17
The Keywords Performance Report provides this data: https://developers.google.com/adwords/api/docs/appendix/reports/keywords-performance-report#searchimpressionshare
Sample use:
function main() {
  var query = "SELECT SearchImpressionShare, Criteria FROM KEYWORDS_PERFORMANCE_REPORT WHERE Clicks > 15 DURING YESTERDAY";
  var report = AdWordsApp.report(query);
  var rows = report.rows();
  while (rows.hasNext()) {
    var row = rows.next();
    Logger.log('Keyword %s, Impression Share %s', row['Criteria'], row['SearchImpressionShare']);
  }
}
Update:
Please note that if you have the same keyword in several ad groups, you'll also get several rows in the report, one row per ad group. To see the whole list of rows for a keyword, use the following approach:
function main() {
  var keywordId = 350608245287;
  var last14dayStatsQuery = "SELECT Id, SearchTopImpressionShare FROM KEYWORDS_PERFORMANCE_REPORT WHERE Id = " + keywordId + " DURING LAST_14_DAYS";
  var last14dayReport = AdWordsApp.report(last14dayStatsQuery);
  var last14dayRows = last14dayReport.rows();
  while (last14dayRows.hasNext()) {
    var last14dayRow = last14dayRows.next();
    Logger.log('Keyword: ' + last14dayRow['Id'] + ' SearchTopIS: ' + last14dayRow['SearchTopImpressionShare']);
  }
}
You might find it useful to add ad group parameters to your query such as AdGroupName, AdGroupId.
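For example, a sketch of the query from the update with those ad group columns added (AdGroupId and AdGroupName are standard columns of the keywords performance report linked above):

var last14dayStatsQuery = "SELECT Id, AdGroupId, AdGroupName, SearchTopImpressionShare FROM KEYWORDS_PERFORMANCE_REPORT WHERE Id = " + keywordId + " DURING LAST_14_DAYS";

That makes it clear which ad group each of the returned rows belongs to.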

Reading XML with LINQ

I cannot figure out how to get all the ItemDetails nodes in the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<AssessmentMetadata xmlns="http://tempuri.org/AssessmentMetadata.xsd">
<ItemDetails>
<ItemName>I1200</ItemName>
<ISC_Inactive_Codes>NS,NSD,NO,NOD,ND,NT,SP,SS,SSD,SO,SOD,SD,ST,XX</ISC_Inactive_Codes>
<ISC_StateOptional_Codes>NQ,NP</ISC_StateOptional_Codes>
</ItemDetails>
<ItemDetails>
<ItemName>I1300</ItemName>
<ISC_Inactive_Codes>NS,NSD,NO,NOD,ND,NT,SP,SS,SSD,SO,SOD,SD,ST,XX</ISC_Inactive_Codes>
<ISC_StateOptional_Codes>NQ,NP</ISC_StateOptional_Codes>
</ItemDetails>
<ItemDetails>
<ItemName>I1400</ItemName>
<ISC_Active_Codes>NC</ISC_Active_Codes>
<ISC_Inactive_Codes>NS,NSD,NO,NOD,ND,NT,SP,SS,SSD,SO,SOD,SD,ST,XX</ISC_Inactive_Codes>
<ISC_StateOptional_Codes>NQ,NP</ISC_StateOptional_Codes>
</ItemDetails>
</AssessmentMetadata>
I have tried a number of things, and I am thinking it might be a namespace issue, so this is my last try:
var xdoc = XDocument.Load(asmtMetadata.Filepath);
var assessmentMetadata = xdoc.XPathSelectElement("/AssessmentMetadata");
You need to get the default namespace and use it when querying:
var ns = xdoc.Root.GetDefaultNamespace();
var query = xdoc.Root.Elements(ns + "ItemDetails");
You'll need to prefix every element name with the namespace. For example, the following query retrieves all ItemName values:
var itemNames = xdoc.Root.Elements(ns + "ItemDetails")
.Elements(ns + "ItemName")
.Select(n => n.Value);
