We have a huge collection of spreadsheets with statistical data. There is one "master sheet" with links to all the other sheets. Most of these links have been there for a long time. It seems Google has changed link formats over time, including the IDs used to identify the sheets.
Old link format, used often in our master sheet:
http://spreadsheets.google.com/pub?key=rcTO3doih5lvJCjgLSvlajA
Newer link format, used occasionally in our master sheet:
https://docs.google.com/spreadsheet/pub?key=0AkBd6lyS3EmpdDlSTTVWUkU3Z254aEhERmVuQWZaeWc
Newest link format, to which Google redirects when you visit a link in the "newer" format:
https://docs.google.com/spreadsheets/d/1WipPWXQqXSjj9vPTu1LXD8IxeTfIn4RIBrGaOBd0DXc/pub
Recently (within the last week or so) Google seems to have dropped support for the first format, so most of our links are dead and we can't access our spreadsheets. And we have no way to find out what the new, working links are.
Does anyone know how to retrieve the spreadsheets when all you have is the old link? We don't have a Google Drive folder with the spreadsheets, so that solution doesn't work.
Thank you so much for any ideas!
You can take the ID from the old link and put it in place of the ID in the newer link format (not the newest!), and it will work.
e.g. old link:
http://spreadsheets.google.com/pub?key=rcTO3doih5lvJCjgLSvlajA
Take rcTO3doih5lvJCjgLSvlajA and append it to:
https://docs.google.com/spreadsheet/pub?key=
Results in: https://docs.google.com/spreadsheet/pub?key=rcTO3doih5lvJCjgLSvlajA
You can then follow the redirect to get the newest version of the link.
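If you have many links to convert, a small Apps Script sketch along these lines could automate the lookup. It is only an illustration (the function name is made up), and it assumes the /spreadsheet/pub endpoint still accepts the old key and answers with a redirect, as described above:

/**
 * Builds the "newer" pub URL from an old-style key and follows the redirect
 * to discover the newest /spreadsheets/d/{id}/pub link.
 * Hypothetical helper, for illustration only.
 */
function resolveOldSheetLink(oldKey) {
  var newerUrl = 'https://docs.google.com/spreadsheet/pub?key=' + oldKey;
  // Fetch without following redirects so the Location header (the newest URL) is visible.
  var response = UrlFetchApp.fetch(newerUrl, {
    followRedirects: false,
    muteHttpExceptions: true
  });
  var location = response.getHeaders()['Location'];
  return location || newerUrl; // fall back to the newer-format URL if there is no redirect
}

// Example: Logger.log(resolveOldSheetLink('rcTO3doih5lvJCjgLSvlajA'));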
Longtime lurker, first-time poster. I usually solve my issues and upvote without needing to post, but I've been stumped all weekend!
Edit: Erik solved it:
I was looking for an answer to extract the "datePublished" or "dateModified" from a Substack article in a Google Sheet.
Goal: This will tell me the last date/time I updated, for example, my PS5 restock guide, my Walmart PS5 restock guide, etc. If a guide is too stale, I try to add relevant information. Having this in Google Sheets keeps things streamlined, as there are dozens of guides.
Test Google Sheet:
https://docs.google.com/spreadsheets/d/1hLBFMWCTc2hpC-1C8Sxd5OVREdNHTVTtrJsAAU5Jl94/edit#gid=0
I've done this before for other sites I've worked at, but there appears to be no date in the metadata on Substack :/ (I could be wrong, as I'm no expert at reading XPath.)
I do see this in the body for the linked example:
<time datetime="2022-07-29T11:52:00.000Z">Jul 29</time>
I've been trying things like this (where E17 is where I put the article URL in Google Sheets) to no effect.
=REGEXEXTRACT(IMPORTXML(E17, "//time[@datetime='datePublished']/@content"), "(.+)T")
I've been mostly working off of this StackOverflow solution, but I haven't been able to apply the same finding to Substack's formatting.
If you want to grab it directly using a Google Sheets formula, this should work for you:
=ArrayFormula(IFERROR(VLOOKUP("*",FLATTEN(IFERROR(REGEXEXTRACT(IMPORTXML("https://www.theshortcut.com/p/ps5-restock","//div[2]"),"Swider(.?.?.?.?\d\d{1}[hrago\s]*)"))),1,FALSE),"???"))
To set realistic expectations, I usually can't invest this much time into working out such a solution on this forum. But I'm on vacation at the moment and filling time while my guest is otherwise occupied.
One further note: this is specific to the two sites you gave as examples. It will only work for sites where the second <div> holds this information and only where the data exists as strings exactly like those found on these two sites (including the poster's last name as "Swider").
ADDENDUM:
Looking at this further, did you try simply the following?
=IMPORTXML(C2, "//time")
(assuming your URL is in C2, etc.)
This seems to work for me, given that it appears the date/time data you want is contained within the first <time> element on the web page.
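If you want just the date portion rather than the displayed text, you could also try pulling the datetime attribute itself and trimming it. This is a sketch, assuming the first <time> element on the page is the one carrying the timestamp shown in the question:
=REGEXEXTRACT(INDEX(IMPORTXML(C2, "//time/@datetime"), 1), "^[^T]+")
For the example markup above, that should return "2022-07-29".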
This question already has answers here:
Scraping data to Google Sheets from a website that uses JavaScript
I'm attempting to parse the 'PEG Ratio' value of a stock from Yahoo Finance into a Google Sheet, but I'm seeing an error.
URL used: https://finance.yahoo.com/quote/ABBV/key-statistics?p=ABBV
Cell Expression used: =IMPORTXML("http://finance.yahoo.com/quote/ABBV/key-statistics?p=ABBV", "//td[@data-reactid='132']")
Error: '#N/A' value (Error: Imported Content is empty)
Value expected is 1.28 (at the time of posting this query) - from Yahoo Finance > Statistics tab > PEG Ratio table (the td has an attribute data-reactid='132' that I have attempted to filter on in the query).
Can anyone help please? Here is a link to the sheet: Google Sheet
Issue
IMPORTXML can only read the static HTML source of a website. Elements and components that are added to a website dynamically cannot be retrieved by IMPORTXML, so IMPORTXML will see the tag as having empty content.
Possible workaround
Sometimes you can dig through the website's JavaScript files and find the URL of the data source that is being inserted dynamically, but that is a tedious task.
Another option for getting the desired value is to use other web-scraping techniques.
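For instance, if you do manage to locate such a data URL in the site's JavaScript, one option is a small custom Apps Script function that fetches it and returns the value to a cell. The endpoint and the field name below are purely hypothetical placeholders (this is not Yahoo Finance's real API); it is only a sketch of the technique:

/**
 * Fetches a JSON endpoint and returns one field from it, so it can be used
 * in a cell like =FETCHJSONVALUE("https://example.com/api/quote?symbol=ABBV").
 * The URL and the "pegRatio" field are hypothetical.
 */
function FETCHJSONVALUE(url) {
  var response = UrlFetchApp.fetch(url, { muteHttpExceptions: true });
  var data = JSON.parse(response.getContentText());
  return data.pegRatio; // adjust to whatever the real payload actually contains
}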
I hope this has helped you. Let me know if you need anything else or if you did not understand something. :)
This is probably not what you want, but I was searching around, and found a Google Sheets Add-On that does manage to pull the "1.28" value from that page. It is free for doing a very limited number of queries per month. If interested, search for IMPORTFROMWEB in the GSuite Marketplace.
I only plugged in your URL and the same XPath that you used, so I was very surprised when the data showed up. No idea how it works.
I apologise if mentioning an Add-On is not appropriate on SO. But knowing that an add-on can get that data off the web page may encourage some other ideas on how to do it natively with Sheets.
Normally, when I use the Google Sheets API, I get a very predictable URL structure from the "Publish Sheet" menu option, that I use to extract the Spreadsheet ID with a regular expression and use it for other tasks on the Google Sheets API.
This has worked for years and is the way that Google's documentation recommends getting the Spreadsheet ID - from the URL.
e.g.
https://docs.google.com/spreadsheets/d/{MYSPREADSHEETID}/pubhtml
However, as of today, when publishing a spreadsheet, I now get a URL like this:
https://docs.google.com/spreadsheets/d/e/2PACX{BUNCH OF RANDOM CHARACTERS}/pubhtml
This breaks my code, as the bunch of random characters that appears after 2PACX is not the spreadsheet ID and does not work with the API.
Does anyone know if this is an unannounced change to Google's URL structure or some kind of bug?
I have no idea when or why Google decided to change their URL structure. The Google Sheets API documentation states that you should pull the spreadsheet ID from the editing URL. It seems unlikely to me that this is a bug, since it has been going on for a while and, to me, seems permanent.
The solution to this problem is to pull the spreadsheet ID from the editing (or sharing) URL itself instead of from the URL of the published sheet.
I hope Google fixes this issue, as it affects consistency across their URLs, but for now the only way to retrieve the spreadsheet ID is to get it from the editing or sharing URL.
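For example, a regular expression along these lines (just a sketch, written for the standard /spreadsheets/d/{id}/edit and sharing URLs) will pull the ID out:

// Extracts the spreadsheet ID from an editing or sharing URL such as
// https://docs.google.com/spreadsheets/d/1WipPWXQqXSjj9vPTu1LXD8IxeTfIn4RIBrGaOBd0DXc/edit#gid=0
// (Don't feed it a published /d/e/2PACX... URL; that token is not a spreadsheet ID.)
function extractSpreadsheetId(url) {
  var match = url.match(/\/spreadsheets\/d\/([a-zA-Z0-9_-]+)/);
  return match ? match[1] : null;
}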
Hope this helps! :)
I have a problem with my reporting. Every day I create a Google Sheets tracker in which all the stakeholders in my department update their work progress, so I have plenty of spreadsheets to monitor, which is a hassle. Here is what I'm trying to do: I want to create one big tracker where I can access the data entered in the individual spreadsheets. All I need is for the URLs of the spreadsheets that exist in my Google Drive to be retrieved into this big tracker; with those, I'll be able to pull all the needed data from the individual trackers.
PS: I'm not good with Google Apps Script.
You can use the Drive Service to get a list of files with MIME type "application/vnd.google-apps.spreadsheet" using getFilesByType. This returns a FileIterator, which you can use to get each spreadsheet file individually. From there, just use getUrl() to find the URLs. The FileIterator documentation has examples of how to loop through all the matching files.
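A minimal sketch of that approach, writing each spreadsheet's name and URL into the active sheet (the two-column layout starting at A1 is just an assumption):

// Lists every Google Sheets file in your Drive and writes its name and URL
// into the currently active sheet, one file per row.
function listSpreadsheetUrls() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var files = DriveApp.getFilesByType(MimeType.GOOGLE_SHEETS);
  var rows = [];
  while (files.hasNext()) {
    var file = files.next();
    rows.push([file.getName(), file.getUrl()]);
  }
  if (rows.length > 0) {
    sheet.getRange(1, 1, rows.length, 2).setValues(rows);
  }
}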
So, a quick background. I make productivity apps (specifically CRM and Project Management). And I love the docs, spreadsheet and presentation products made by Google. Not surprisingly, my products have done a lot of "things" with Google Docs for a long time:
Create "native" (ie. Docs/Spreadsheets/Presentations) documents
Use native documents as templates
Link and modify permissions of any file in Docs/Drive
Upload any arbitrary file
etc.
What I'm confused about is what Google wants me to put on the labels of the buttons in my app. Right now, they all say "Google Docs". If you're linking an arbitrary file to a presentation, you're linking it from "Google Docs". If you're exporting a spreadsheet of timesheet entries, you're exporting it to "Google Docs". If you upload a PDF, you uploaded it to "Google Docs". Etc.
Correct me if I'm wrong, but I don't think this is a complete switch-over to "Drive": I still see "Google Docs" labels on Google's own site. So this is what I think the breakdown is:
If it is a Google "native" file, then it is Docs; otherwise it is Drive. Thus, if you're uploading an arbitrary file, that button should refer to Drive. But if you are exporting a spreadsheet of data to the Google Spreadsheets format, then that is Docs.
Is this right at all? Does Google have some information somewhere?
Disclaimer: personal opinion
I would use Drive everywhere, except when specifically talking about the collaborative word processor provided in Google Drive, which is a Google Doc.
I would also make sure that all my integrations use the new Google Drive API.
There is reasonably good guidance here: https://developers.google.com/drive/branding
Google Docs and Google Drive are two separate products from Google. They can work together, but they are still their own individual products and should be called by their respective names when being used.