Extract link details from cell in Google Sheet

I was wondering whether it is possible to extract the link details that appear when hovering the mouse over a cell.
For example, the Instagram followers of Kim Kardashian's account, as follows:

I do not think you can easily obtain values from the link's social media preview directly via the popover.
The content shown in the social media preview is embedded in the meta tags of the relevant webpage's HTML, and it can be scraped by parsing the relevant HTTP response.
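As a rough illustration of parsing meta tags out of an HTTP response body, here is a minimal sketch. The helper name extractMetaContent is hypothetical, and the assumption that the preview text lives in a tag such as og:description may not hold for every site.

```javascript
// Minimal sketch: read the content attribute of a named <meta> tag from raw HTML.
// extractMetaContent is a hypothetical helper, not part of any Google API.
function extractMetaContent(html, property) {
  // Escape regex metacharacters in the property name (e.g. "og:description").
  const escaped = property.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Find the whole <meta ...> tag whose property or name attribute matches.
  const tagPattern = new RegExp(
    '<meta\\b[^>]*\\b(?:property|name)\\s*=\\s*["\']' + escaped + '["\'][^>]*>',
    'i'
  );
  const tag = html.match(tagPattern);
  if (!tag) return null;
  // Read the content attribute from that tag, wherever it sits in the tag.
  const content = tag[0].match(/\bcontent\s*=\s*"([^"]*)"|\bcontent\s*=\s*'([^']*)'/i);
  if (!content) return null;
  return content[1] !== undefined ? content[1] : content[2];
}
```

In Apps Script, the html argument could come from UrlFetchApp.fetch(url).getContentText(), provided the site allows the request.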
Solution 1
You could use Google Sheets' IMPORTXML() with a suitable XPath that points to the data you want. The formula below is to be used as a cell value.
=IMPORTXML([Relevant cell id; Example: A1],"/html/head/meta[14]/@content")
However, this produces an error, "Could not fetch url", on my end, despite using the valid example URL that you gave.
Solution 2
You could use a Google Apps Script bound to the Google Sheet, with the code below and suitable constraints:
function fetchContent(url) {
  // Fetch the raw HTML of the page as text.
  const htmlResponse = UrlFetchApp.fetch(url).getContentText();
  // Parser is a third-party Apps Script library that must be added to the project.
  return Parser.data(htmlResponse).from("<meta content=\"").to("\"").build();
}
The relevant cell value would be:
=fetchContent([Relevant cell id])
However, when using the example URL, this solution also causes an error: "Exception: Request failed for https://www.instagram.com... returned code 429 ..", indicating that too many HTTP requests have been made to Instagram from the relevant IP address. See this SO Q&A regarding IP addresses associated with Apps Script's UrlFetchApp and this SO Q&A regarding Instagram blocking IP addresses.
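If the exception itself is the problem in the sheet, one option is to inspect the status code instead of letting the fetch throw. The sketch below uses the muteHttpExceptions option of UrlFetchApp.fetch; the function names fetchContentSafe and describeStatus are hypothetical, and this does not get around the rate limit, it only reports it.

```javascript
// Sketch only: fetchContentSafe and describeStatus are hypothetical names.
function describeStatus(code) {
  if (code === 200) return 'OK';
  if (code === 429) return 'Rate limited (429): too many requests from this IP';
  return 'Request failed with HTTP ' + code;
}

function fetchContentSafe(url) {
  // muteHttpExceptions makes fetch() return the response instead of throwing
  // on 4xx/5xx status codes, so the status can be inspected.
  const response = UrlFetchApp.fetch(url, { muteHttpExceptions: true });
  const code = response.getResponseCode();
  // Return the page text on success, otherwise a readable message for the cell.
  return code === 200 ? response.getContentText() : describeStatus(code);
}
```

Used as =fetchContentSafe(A1), the cell would show the message rather than an exception when the site answers with 429.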
In theory, the above solutions should work for other websites which do not block IP addresses.
Obtaining specific parts of the content via Google Apps Script should be straightforward; for doing the same within Google Sheets, you could check out this SO Q&A and the related ones.
If you need to obtain data specifically from Instagram, you could check out Instagram's Graph API.

Related

How to Scrape the "span.VlHyHc" values from google image refinement bubbles with Google sheet? [duplicate]

I am trying to import data from the following website to Google Sheets. I want to import all the matches for the day.
https://www.tournamentsoftware.com/tournament/b731fdcd-a0c8-4558-9344-2a14c267ee8b/Matches
I have tried importxml and importhtml, but it seems this does not work as the website uses JavaScript. I have also tried to use Apipheny without any success.
When using Apipheny, the error message is
'Failed to fetch data - please verify your API Request: {DNS error'

Tl;Dr
Adapted from my answer to How to know if Google Sheets IMPORTDATA, IMPORTFEED, IMPORTHTML or IMPORTXML functions are able to get data from a resource hosted on a website? (also posted by me)
Please spend some time learning how to use the browser's developer tools so you will be able to identify:
if the data is already included in the source code of the webpage as JSON / a literal JavaScript object or in another form
if the webpage is making GET or POST requests to retrieve the data, and when those requests are made (i.e. at some point of the page parsing, or on an event)
if the requests require data from cookies
Brief guide about how to use the web browser to find useful details about the webpage / data to import
Open the source code and check whether the required data is included. Sometimes the data is included as JSON and added to the DOM using JavaScript. In this case it might be possible to retrieve the data by using the Google Sheets functions or the URL Fetch Service from Google Apps Script.
Let's say that you use Chrome. Open the Dev Tools, then look at the Elements tab. There you will see the DOM. It might help you identify whether the data that you want to import, besides being in visible elements, is also included in hidden / non-visible elements like <script> tags.
Look at the Sources tab; there you might be able to see the JavaScript code. It might include the data that you want to import as a JavaScript object (commonly referred to as JSON).
There are a lot of questions about google-sheets + web-scraping that mention problems using importhtml and/or importxml and that already have answers; many even include code (JavaScript snippets, Google Apps Script functions, etc.) that might save you from having to use a specialized web-scraping tool with a steeper learning curve. At the bottom of this answer there is a list of questions about using Google Sheets built-in functions, including annotations of the workaround proposed.
The question "Is there a way to get a single response from a text/event-stream without using event listeners?" asks about using EventSource. While this can't be used in server-side code, the answer shows how to use the HtmlService to run it in client-side code and send the result to Google Sheets.
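When the data turns out to be embedded as JSON inside a <script> tag, it can be pulled out of the raw page source with a few lines of code. This is a minimal sketch under the assumption that the page uses a script tag of type application/json or application/ld+json; extractEmbeddedJson is a hypothetical helper name, and many sites embed their data differently.

```javascript
// Sketch: collect JSON embedded in <script type="application/json"> or
// <script type="application/ld+json"> tags. Hypothetical helper name.
function extractEmbeddedJson(html) {
  const pattern = /<script\b[^>]*type\s*=\s*["']application\/(?:ld\+)?json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const results = [];
  let match;
  while ((match = pattern.exec(html)) !== null) {
    try {
      // match[1] is the text between the opening and closing script tags.
      results.push(JSON.parse(match[1]));
    } catch (e) {
      // Skip blocks that are not valid JSON.
    }
  }
  return results;
}
```

In Apps Script, the html argument could be the result of UrlFetchApp.fetch(url).getContentText(), and the parsed objects could then be written to the sheet.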
As you already realized, the Google Sheets built-in functions importhtml(), importxml(), importdata() and importfeed() only work with static pages that do not require signing in or other forms of authentication.
When the content of a public page is created dynamically by using JavaScript, it cannot be accessed with those functions. On the other hand, the website's webmaster may also have purposefully prevented web scraping.
How to identify if content is added dynamically
To check if the content is added dynamically, using Chrome:
Open the URL of the source data.
Press F12 to open Chrome Developer Tools.
Press Control+Shift+P to open the Command Menu.
Start typing javascript, select Disable JavaScript, and then press Enter to run the command. JavaScript is now disabled.
JavaScript will remain disabled in this tab so long as you have DevTools open.
Reload the page to see if the content that you want to import is shown. If it is shown, it can be imported by using Google Sheets built-in functions; otherwise that is not possible, although it might be possible by using other means of web scraping.
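The manual check above can be approximated in code: a server-side fetch never runs the page's JavaScript, so if the expected text is missing from the markup of the raw response, it is probably added dynamically. The sketch below is an assumption-laden approximation, with isInStaticSource as a hypothetical helper name; it ignores text that only appears inside <script> or <style> bodies.

```javascript
// Sketch: report whether some expected text is present in the static markup.
// isInStaticSource is a hypothetical helper, not a Google Sheets function.
function isInStaticSource(html, expectedText) {
  const visible = html
    .replace(/<script\b[\s\S]*?<\/script>/gi, ' ') // drop script bodies
    .replace(/<style\b[\s\S]*?<\/style>/gi, ' ')   // drop style bodies
    .replace(/<[^>]+>/g, ' ')                      // strip remaining tags
    .replace(/\s+/g, ' ');                         // collapse whitespace
  return visible.toLowerCase().includes(expectedText.toLowerCase());
}
```

A false result suggests that the built-in import functions will not see the data either, though it does not prove it.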
According to Wikipedia,
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.
Use of robots.txt to block web crawlers
Webmasters can use a robots.txt file to block access to the website. In such a case the result will be #N/A Could not fetch URL.
Use of User agent
The webpage could be designed to return a custom message instead of the data.
Below are more details about how the Google Sheets built-in "web-scraping" functions work
IMPORTDATA, IMPORTFEED, IMPORTHTML and IMPORTXML are able to get content from resources hosted on websites that are:
Publicly available. This means that the resource doesn't require authorization / being logged in to any service to access it.
The content is "static". This means that if you open the resource using the view source code option of modern web browsers, it will be displayed as plain text.
NOTE: Chrome's Inspect tool shows the parsed DOM; in other words, the actual structure/content of the web page, which could be dynamically modified by JavaScript code or browser extensions/plugins.
The content has the appropriate structure.
IMPORTDATA works with content structured as csv or tsv, regardless of the file extension of the resource.
IMPORTFEED works with marked up content as ATOM/RSS
IMPORTHTML works with content marked up as HTML that includes properly marked-up lists or tables.
IMPORTXML works with marked up content as XML or any of its variants like XHTML.
The content doesn't exceed the maximum size. Google hasn't disclosed this limit, but the error below will be shown when the content exceeds the maximum size:
Resource at url contents exceeded maximum size.
Google servers are not blocked by means of robots.txt or the user agent.
The W3C Markup Validator offers several tools to check whether the resources have been properly marked up.
Regarding CSV check out Are there known services to validate CSV files
It's worth noting that the spreadsheet
should have enough room for the imported content; Google Sheets has a limit of 10 million cells per spreadsheet, according to this post a limit of 18278 columns, and a limit of 50 thousand characters of cell content, whether as a value or a formula.
doesn't handle large in-cell content well; the "limit" depends on the user's screen size and resolution, as it's now possible to zoom in/out.
References
https://developers.google.com/web/tools/chrome-devtools/javascript/disable
https://en.wikipedia.org/wiki/Web_scraping
Related
Using Google Apps Script to scrape Dynamic Web Pages
Scraping data from website using vba
Block Website Scraping by Google Docs
Is there a way to get a single response from a text/event-stream without using event listeners?
Software Recommendations
Web scraping tool/software available for free?
Recommendations for web scraping tools that require minimal installation
Web Applications
The following question is about a different result, #N/A Could not fetch URL
Inability to use IMPORTHTML in Google sheets
Similar questions
Some of these questions might be closed as duplicates of this one
Importing javascript table into Google Docs spreadsheet
Importxml Imported Content Empty
scrape table using google app scripts
One answer includes Google Apps Script code using the URL Fetch Service
Capture element using ImportXML with XPath
How to import Javascript tables into Google spreadsheet?
Scrape the current share price data from the ASX
One of the answers includes Google Apps Script code to get data from a JSON source
Guidance on webscraping using Google Sheets
How to Scrape data from Indiegogo.com in google sheets via IMPORTXML formula
Why importxml and importhtml not working here?
Google Sheet use Importxml error could not fetch url
One answer includes Google Apps Script code using the URL Fetch Service
Google Sheets - Pull Data for investment portfolio
Extracting value from API/Webpage
IMPORTXML shows an error while scraping data from website
One answer shows the xhr request found using browser developer tools
Replacing =ImportHTML with URLFetchApp
One answer includes Google Apps Script code using the URL Fetch Service
How to use IMPORTXML to import hidden div tag?
Google Sheet Web-scraping ImportXml Xpath on Yahoo Finance doesn't works with french stock
One of the answers includes Google Apps Script code to get data from a JSON source. As of January 4th 2023, it's no longer working, very likely because Yahoo! Finance is now encrypting the JSON. See Tainake's answer to How to pull Yahoo Finance Historical Price Data from its Object with Google Apps Script? for a script using Crypto.js to handle this.
How to fetch data which is loaded by the ajax (asynchronous) method after the web page has already been loaded using apps script?
One answer suggests reading the data from the server instead of scraping it from a webpage.
Using ImportXML to pull data
Extracting data from web page using Cheerio Library
One answer suggests the use of an API and Google Apps Script

ImportXML is good for basic tasks, but it won't get you too far if you are serious about scraping:
The approach only works with the most basic websites: no SPAs rendered in browsers can be scraped this way, any basic web scraping protection or connectivity issue breaks the process, and there isn't any control over HTTP request geolocation or number of retries. Yahoo Finance is not a simple website.
If the target website data requires some cleanup or post-processing, it gets very complicated, since you are now "programming with Excel formulas", a rather painful process compared to writing regular code in a conventional programming language.
There isn't any proper launch and cache control, so the function can be triggered occasionally, and if the HTTP request fails, cells will be populated with ERR! values.
I recommend using proper tools (an automation framework and a scraping engine that can render JavaScript-powered websites) and using Google Sheets just for basic storage purposes:
https://youtu.be/uBC752CWTew (Pipedream for automation and ScrapeNinja engine for scraping)

Can you use Google Sheets Importxml to get information from google search results [duplicate]

Google sheets shows could not fetch URL [duplicate]

Bing Image Search API Authorise within Google Sheets

I've been putting together a Google Sheet (https://docs.google.com/spreadsheets/d/1Pp9QyrF-aaM6cT9mOs9BWODwbSc2ZHXAovCU2fxQ6Uc/edit?usp=sharing)
that automatically uses the Google Custom Search API (I set up a free account to test) to pull images into the table from a feed provided by the Google Custom Search API.
The problem is that Google's Custom Search API service seems to be totally unreliable and inconsistent: the generated search URLs (results in column S of the linked spreadsheet) sometimes return a usable result (and picture), sometimes return an empty feed (for no obvious reason), and sometimes give me a "This site can’t be reached" "ERR_INVALID_RESPONSE" error, which is equally inexplicable since the URLs are all formed by the same code.
So I've had a look at the Microsoft Bing Image Search API to see if its results were more reliable, and found that I can also use a URL to return an XML feed of results (which I could then parse within Google Sheets to get the image links, I hope!).
The problem is that with the awful Google Custom Search API, I could at least put the authentication information in the URL itself, so Google Sheets just followed the link and got the data from the search result. Bing, which produces much more reliable results, asks for authentication via a browser dialog and states that "Only Basic and OAuth are supported".
I've looked into how I could get Google Sheets to authenticate the URL queries but haven't had much luck figuring it out.
I also saw that you can apparently authenticate using the Basic method, but the account key needs to be converted to Base64 with a colon added in front of it, and this then needs to be set in the (HTTP?) headers: How do i return JSON results from BING Search Engine API.
Can headers be set via a script in Google Sheets, and/or can the Bing Search API URLs be authorised from within Google Sheets in some way?
Otherwise, is there anything obviously wrong with what I'm doing with the Google Search URLs that return such broken and inconsistent results?
Ideally, I'd prefer to get the Bing Search URLs authorised within Google Sheets.
Thanks
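For what it is worth, request headers can be set from Google Apps Script through the headers option of UrlFetchApp.fetch. The sketch below builds a Basic Authorization header in the way the question describes (a colon in front of the account key, then Base64). The function names are hypothetical, and whether the Bing endpoint accepts this exact header is an assumption taken from the question, not something verified here.

```javascript
// Sketch: hypothetical helper names. The encoder is passed in so the same
// function works with Utilities.base64Encode in Apps Script.
function buildBasicAuthHeader(accountKey, base64Encode) {
  // Empty user name, then a colon, then the account key.
  return 'Basic ' + base64Encode(':' + accountKey);
}

function fetchWithBasicAuth(url, accountKey) {
  const encode = function (text) { return Utilities.base64Encode(text); };
  const options = {
    headers: { Authorization: buildBasicAuthHeader(accountKey, encode) },
    muteHttpExceptions: true // return error responses instead of throwing
  };
  return UrlFetchApp.fetch(url, options).getContentText();
}
```

A custom function such as fetchWithBasicAuth could then be called from a cell, with the returned feed parsed in the script rather than by IMPORTXML.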

How to add Twitter Expanded Tweets? (was: twitter media preview card for my site?)

I googled a lot about how to make a Twitter media preview for my website's entities when they are linked in a tweet, like the images below:
Any idea where I can find some documentation about it? Or a tutorial? Is this possible, or are these media/site previews hardcoded in Twitter?
EDIT:
So, what I need:
If someone links my site on Twitter, my widget appears under the tweet, like below:
UPDATE 2012-06-13
It appears this is an Expanded Tweet. The requirements for integrating these expansions into Twitter do not appear to be published, but this sure is interesting.

Nope, you're in luck. They're not hardcoded into Twitter; they're available in the JSON response. Your post actually contains the word you need to google for: entities.
You can add include_entities=1 to the end of most REST API calls and it will give you expanded information about the URLs contained within the JSON. It will split out all the URLs, from which you can parse out the YouTube links, for example. The JSON also includes a special media_url entity, but it only works for pictures. In any case, you can still easily parse out media such as YouTube links with a regex match, because the include_entities=1 parameter gives you the URLs nicely split out.
example call :
http://api.twitter.com/1/statuses/user_timeline.json?screen_name=twitterapi&include_entities=1
more documentation : https://dev.twitter.com/docs/tweet-entities
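To illustrate the regex approach mentioned above, here is a small sketch that walks a parsed timeline and keeps the YouTube links. It assumes each tweet object carries an entities.urls array whose items have an expanded_url field, as described in the tweet entities documentation; extractYoutubeLinks and the sample data are hypothetical.

```javascript
// Sketch: collect YouTube links from the URL entities of parsed tweets.
// extractYoutubeLinks is a hypothetical helper name.
function extractYoutubeLinks(tweets) {
  const youtubePattern = /^https?:\/\/(?:www\.)?(?:youtube\.com\/watch\?|youtu\.be\/)/i;
  const links = [];
  tweets.forEach(function (tweet) {
    // Tweets without entities or without URLs contribute nothing.
    const urls = (tweet.entities && tweet.entities.urls) || [];
    urls.forEach(function (entry) {
      // Prefer the expanded URL over the shortened one when present.
      const target = entry.expanded_url || entry.url;
      if (target && youtubePattern.test(target)) links.push(target);
    });
  });
  return links;
}
```

The tweets argument would be the array obtained by applying JSON.parse to the response of a timeline call made with include_entities=1.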
answer edited below based on clarification:
Adding previews to Twitter itself is impossible, and it would also be ineffective: 75% of traffic to Twitter happens outside of Twitter.com. However, the most probable way to achieve this would be to install a browser extension.
This extension, for example, enables previews of webpages directly in the content preview pane of the user's stream on Twitter.com:
https://chrome.google.com/webstore/detail/oijgblonhcagdhfbgjilnpjipmijimmn
