I know it's possible to scrape websites, but is it possible to have Google Sheets scrape a Google Doc for data? For example, suppose I have a bunch of Google Docs and they all have a line that says last updated: mm/dd/yyyy. Is it possible to have a Google Sheet with URLs to the docs and have it scrape each one for the date?
Solution
With a script in your spreadsheet you can retrieve information from any document in your Drive. The DocumentApp class in Apps Script lets you open a document and retrieve its body with getBody(). You can then search the body with the findText() method of the Body class.
Therefore yes, with a script you could easily search a document for a specific text pattern like your mm/dd/yyyy.
For more information about what DocumentApp can do, see the Apps Script documentation for DocumentApp.
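As a rough illustration of the approach described above, here is a minimal Apps Script sketch. It makes several assumptions that are not in the question: the doc URLs sit in column A of the active sheet, the dates are written to column B, and the function names `extractLastUpdated` and `fillLastUpdatedDates` are made up for this example. To keep it short it uses getText() with a JavaScript regular expression rather than findText().

```javascript
// Pure helper: pulls the date out of a line like "last updated: 03/14/2021".
// Returns the date string, or null if no such line is found.
function extractLastUpdated(text) {
  const match = text.match(/last updated:\s*(\d{1,2}\/\d{1,2}\/\d{4})/i);
  return match ? match[1] : null;
}

// Apps Script only (run it from the script editor, not as a cell formula).
// Assumption: doc URLs are in column A; dates are written to column B.
function fillLastUpdatedDates() {
  const sheet = SpreadsheetApp.getActiveSheet();
  const lastRow = sheet.getLastRow();
  if (lastRow < 1) return; // nothing to process

  const urls = sheet.getRange(1, 1, lastRow, 1).getValues();
  const dates = urls.map(function (row) {
    if (!row[0]) return ['']; // skip empty cells
    const text = DocumentApp.openByUrl(row[0]).getBody().getText();
    return [extractLastUpdated(text) || ''];
  });
  sheet.getRange(1, 2, dates.length, 1).setValues(dates);
}
```

The script needs permission to read your documents, so expect an authorization prompt the first time you run it.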
I hope this has helped you. Let me know if you need anything else or if you did not understand something. :)
Related
I'm attempting to parse the 'PEG Ratio' value of a stock from Yahoo Finance into a Google Sheet, but I'm seeing an error.
URL used: https://finance.yahoo.com/quote/ABBV/key-statistics?p=ABBV
Cell Expression used: =IMPORTXML("http://finance.yahoo.com/quote/ABBV/key-statistics?p=ABBV", "//td[@data-reactid='132']")
Error: '#N/A' value (Error: Imported Content is empty)
Value expected is 1.28 (at the time of posting this query), from Yahoo Finance > Statistics tab > PEG Ratio table (the td has an attribute data-reactid='132' that I have attempted to filter on in the query)
Can anyone help please? Here is a link to the sheet: Google Sheet
Issue
IMPORTXML can only read the static HTML source of a page. Elements that the site adds dynamically (for example, with JavaScript after the page loads) are not in that source, so IMPORTXML cannot retrieve them and reports the imported content as empty.
Possible workaround
Sometimes you can find, in the website's JavaScript files, the URL of the data source that is inserted dynamically, but tracking it down is tedious.
Another option for getting the desired value is to use other web scraping techniques.
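Before trying a workaround, it can help to confirm the diagnosis by fetching the raw HTML and checking whether the target attribute appears in it at all. The sketch below is only illustrative: the function names `sourceHasAttribute` and `checkStaticSource` are made up for this example, and the check is a plain text match, not an XPath evaluation.

```javascript
// Escapes characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Returns true if the raw HTML contains name="value" (single or double quotes).
function sourceHasAttribute(html, name, value) {
  const pattern = new RegExp(
    escapeRegExp(name) + '\\s*=\\s*["\']' + escapeRegExp(value) + '["\']'
  );
  return pattern.test(html);
}

// Apps Script only: fetches the page source without running the page's scripts.
function checkStaticSource(url) {
  const html = UrlFetchApp.fetch(url).getContentText();
  return sourceHasAttribute(html, 'data-reactid', '132');
}
```

If `checkStaticSource` returns false for the page, the element is most likely being added after load, which matches the "Imported Content is empty" error.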
I hope this has helped you. Let me know if you need anything else or if you did not understand something. :)
This is probably not what you want, but I was searching around, and found a Google Sheets Add-On that does manage to pull the "1.28" value from that page. It is free for doing a very limited number of queries per month. If interested, search for IMPORTFROMWEB in the GSuite Marketplace.
I only plugged in your URL and the same XPath that you used, so I was very surprised when the data showed up. No idea how it works.
I apologise if mentioning an Add-On is not appropriate on SO. But knowing that an add-on can get that data off the web page may encourage some other ideas on how to do it natively with Sheets.
I want to create a download link for my spreadsheet on Google Drive, and I read about something like this:
https://docs.google.com/spreadsheets/d/MY_SPREADSHEET/export?format=csv
But it only downloads the first sheet. I've read about the GID parameter, but I don't want to spend time developing something that will get all the GIDs from the API and then download every sheet. Is there any way to have one link that downloads the whole spreadsheet?
You might want to try the suggestion in Labnol's guide:
Open your Google Spreadsheet in the browser, make the sheet Public (or Anyone with a link) and make a note of the shared URL. It should be something like this:
https://docs.google.com/spreadsheets/d/FILE_ID/edit?usp=sharing
The direct download links use a similar format as Google Documents and will read like:
https://docs.google.com/spreadsheets/d/FILE_ID/export?format=xlsx
https://docs.google.com/spreadsheets/d/FILE_ID/export?format=pdf
In addition to that, you may want to also try using the suggested URL in this SO post and see if it will help.
https://docs.google.com/spreadsheets/u/1/d/${id}/export?format=csv&id=${id}&gid=${gid}
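The URL patterns quoted above can be assembled with a small helper. This is a sketch under assumptions: `buildExportUrl` and `listCsvExportUrls` are names invented for this example, and the URL shape is taken from the links in this answer rather than from any documented contract.

```javascript
// Builds an export URL of the form quoted above.
// gid is optional; when given, it selects a single sheet (as the CSV export requires).
function buildExportUrl(fileId, format, gid) {
  let url = 'https://docs.google.com/spreadsheets/d/' +
    encodeURIComponent(fileId) +
    '/export?format=' + encodeURIComponent(format);
  if (gid !== undefined && gid !== null) {
    url += '&gid=' + encodeURIComponent(gid);
  }
  return url;
}

// Apps Script only: one CSV export URL per sheet in the spreadsheet.
function listCsvExportUrls(fileId) {
  return SpreadsheetApp.openById(fileId).getSheets().map(function (sheet) {
    return buildExportUrl(fileId, 'csv', sheet.getSheetId());
  });
}
```

For example, `buildExportUrl('FILE_ID', 'xlsx')` gives the xlsx link from the guide, which is the single-link option since one xlsx file can hold all the sheets, while CSV stays one link per sheet.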
Normally, when I use the Google Sheets API, I get a very predictable URL structure from the "Publish Sheet" menu option, that I use to extract the Spreadsheet ID with a regular expression and use it for other tasks on the Google Sheets API.
This has worked for years and is the way that Google's documentation recommends getting the Spreadsheet ID - from the URL.
e.g.
https://docs.google.com/spreadsheets/d/{MYSPREADSHEETID}/pubhtml
However, as of today, when publishing a spreadsheet, I now get a URL like this:
https://docs.google.com/spreadsheets/d/e/2PACX{BUNCH OF RANDOM CHARACTERS}/pubhtml
This breaks my code, as the bunch of random characters that appears after 2PACX is not the spreadsheet ID and does not work with the API.
Does anyone know if this is an unannounced change to Google's URL structure or some kind of bug?
I have no idea when or why Google decided to change their URL structure. The Google Sheets API documentation says to pull the spreadsheet ID from the editing URL. It seems unlikely to me that this is a bug, since it has been going on for a while and seems permanent.
The solution to this problem is to pull the spreadsheet ID from the editing URL (or the sharing URL) instead of using the URL of the published sheet.
I hope Google fixes this issue as this affects consistency across their URLs but for now, the only way to retrieve the spreadsheet ID is to get it from the editing or sharing URLs.
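As a minimal sketch of that approach, the function below extracts the ID from an editing or sharing URL and deliberately returns null for the new published form, so the caller can tell the two apart. The function name `extractSpreadsheetId` and the exact character class in the pattern are assumptions made for this example.

```javascript
// Returns the spreadsheet ID from an editing or sharing URL, or null if the
// URL is a published /d/e/2PACX... link or is not a Sheets URL at all.
function extractSpreadsheetId(url) {
  // Published URLs carry a publication token after /d/e/, not the ID.
  if (/\/spreadsheets\/d\/e\//.test(url)) return null;
  const match = url.match(/\/spreadsheets\/d\/([a-zA-Z0-9_-]+)/);
  return match ? match[1] : null;
}
```

Returning null instead of the token keeps a published URL from being passed to the API by mistake.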
Hope this helps! :)
I have a problem with my reporting. Every day I create a Google Sheets tracker in which all the stakeholders in my department update their work progress, so I have plenty of spreadsheets to monitor, which is a hassle. Here is what I'm trying to do: I want to create one big tracker that gives me access to the data entered in the individual spreadsheets. All I need is for the URLs of the spreadsheets that exist in my Google Drive to be retrieved into this big tracker. With those I'll be able to pull all the needed data from the individual trackers.
PS: I'm not good with google scripts.
You can use the Drive Service to get a list of files with MIME type "application/vnd.google-apps.spreadsheet" using getFilesByType. This returns a FileIterator, which you can use to get each spreadsheet file in turn. From there, just call getUrl() on each file to get its URL. The FileIterator documentation has examples of how to loop through all the matching files.
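Since the question mentions being new to scripts, here is a minimal sketch of those steps. It assumes the results should go into columns A and B of the active sheet starting at row 1; the function names `collectFileRows` and `listSpreadsheetUrls` are made up for this example.

```javascript
// Walks a FileIterator (anything with hasNext()/next()) and returns
// one [name, url] row per file.
function collectFileRows(files) {
  const rows = [];
  while (files.hasNext()) {
    const file = files.next();
    rows.push([file.getName(), file.getUrl()]);
  }
  return rows;
}

// Apps Script only. Assumption: output goes to columns A:B of the active sheet.
function listSpreadsheetUrls() {
  const files = DriveApp.getFilesByType(MimeType.GOOGLE_SHEETS);
  const rows = collectFileRows(files);
  if (rows.length === 0) return; // setValues needs at least one row

  const sheet = SpreadsheetApp.getActiveSheet();
  sheet.getRange(1, 1, rows.length, 2).setValues(rows);
}
```

Run `listSpreadsheetUrls` from the script editor of the big tracker. It overwrites whatever is already in those cells, so point it at an empty sheet.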
I'm just beginning with programming, but I wanted to know if it's possible to use the Google Docs API to make documents on another site using the Google Docs text editor.
Is there some way I can put the Google Docs text editor onto a website so that we can use that for document creation instead of TinyMCE?
Basically the functionality needed would be documents created, openly shared, a postable version of it (take html code) -- so it can go on the document display page, and
Of course there would be Google login and everything, but I just wanted to see if this would work.
No, that is not possible, sorry.