I'm having trouble pulling just the price for these sites into a Google Sheet. Instead, I'm pulling multiple rows, currencies, etc., and I don't know how to fix it.
1---->
https://www.discountfilters.com/refrigerator-water-filters/models/ukf8001/
//main/div/div/div/div/div/div/div/div/div[1]/span/span/span
2---->
https://www.discountfilters.com/refrigerator-water-filters/models/ukf8001/
//div[1]/form/div/div/div[1]/div/div/div[2]/div[1]
3---->
https://filterbuy.com/air-filters/8x16x1/
//div[2]/div[1]/div[3]/span
I tried the XPaths above, and they're giving me all the data instead of just the discounted price (row 1) that I'm looking for.
Try:
=INDEX(IMPORTXML(A1, "//div[@class='price mt-2 mt-md-0 mb-0 mb-md-3']"),,2)
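INDEX(...,,2) keeps just the second column of what IMPORTXML returns, assuming the discounted price lands there as it does on the first site. If the fetch occasionally fails, you could also wrap the call so the cell shows a placeholder instead of an error; this is just a sketch:
=IFERROR(INDEX(IMPORTXML(A1, "//div[@class='price mt-2 mt-md-0 mb-0 mb-md-3']"),,2), "n/a")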
Regarding the issues across the multiple websites you are trying to scrape: IMPORTXML is good for basic tasks, but it won't get you far if you are serious about scraping:
If the target website's data requires some post-processing cleanup, things get complicated quickly, since you are now "programming with Excel formulas", a rather painful process compared to writing code in a conventional programming language.
There is no proper launch and cache control, so the function can be re-triggered at arbitrary times, and if the HTTP request fails, cells will be populated with ERR! values.
The approach only works with the most basic websites (SPAs rendered in the browser cannot be scraped this way; any basic web-scraping protection or connectivity issue breaks the process; and there is no control over the HTTP request's geo-location or the number of retries).
When IMPORTXML() fails, the second approach to web scraping in Google Sheets is usually to write some custom Google Apps Script. This approach is much more flexible: you just write JavaScript code and deploy it as a Google Sheets add-on. But it takes a lot of time, and it is not easy to debug and iterate over; it is definitely not low-code. A minimal sketch of this route is shown below.
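For instance, a custom function callable from a cell as =SCRAPEPRICE(A1) could look like this; the regex targeting the price div is only an assumption about the page markup and would need adjusting per site:

// Custom function: fetch a page and pull the first price-like value out of
// a div whose class starts with "price". The regex is a guess at the markup;
// adapt it (or parse the HTML properly) for each target site.
function SCRAPEPRICE(url) {
  var response = UrlFetchApp.fetch(url, {muteHttpExceptions: true});
  if (response.getResponseCode() !== 200) {
    return 'HTTP ' + response.getResponseCode();
  }
  var html = response.getContentText();
  var match = html.match(/class="price[^"]*"[^>]*>\s*\$?([\d.,]+)/);
  return match ? match[1] : 'Price not found';
}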
And the third approach is to use proper tools (automation framework + scraping engine) and use Google Sheets just for storage purposes:
https://youtu.be/uBC752CWTew
I have a Google Spreadsheet with data connected to a Data Studio panel. I'm using the following data flow to get the data:
Google Sheets --> BigQuery external table --> view on the external table --> Data Studio (updated every 10 minutes)
But for some reason that I don't know, sometimes, when executing a select on the BigQuery External Table I get the following error:
Resources exceeded during query execution: Google Sheets service overloaded for spreadsheet id:XXX
The Google Spreadsheet is only about 1500 rows by 10 columns, which I think is pretty small. Also, there are about 6 users.
What can cause that error? Any idea about how to solve this?
Thanks
The Google documentation has information about this error:
A BigQuery query can overload Sheets, resulting in an error like Resources exceeded during query execution: Google Sheets service overloaded. Consider simplifying your spreadsheet; for example, by minimizing the use of formulas.
It seems that along with the size of the Sheet, its "complexity" also matters. We cannot know how complex your spreadsheet is without seeing it, but consider reducing your formula usage. This article also mentions a maximum result size of 10 MB and other pivot-table limits. You could also try dividing the data, or, if the error rate is manageable, use some kind of retry strategy to query again until you get the results; a minimal sketch is shown below.
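For example, if you query the external table from Apps Script with the BigQuery advanced service enabled, a retry wrapper along these lines is one option; the project, dataset and table names are placeholders:

// Minimal sketch: retry a query against the Sheets-backed external table with
// exponential backoff. Assumes the BigQuery advanced service is enabled;
// 'my-project' and the table path are hypothetical placeholders.
function queryWithRetry(maxAttempts) {
  var request = {
    query: 'SELECT * FROM `my-project.my_dataset.sheets_external_table`',
    useLegacySql: false
  };
  for (var attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      var response = BigQuery.Jobs.query(request, 'my-project');
      if (response.jobComplete) {
        return response.rows || [];
      }
    } catch (e) {
      // Errors such as "Resources exceeded ... Sheets service overloaded" land here.
      Logger.log('Attempt ' + attempt + ' failed: ' + e.message);
    }
    Utilities.sleep(Math.pow(2, attempt) * 1000); // back off before retrying
  }
  throw new Error('Query still failing after ' + maxAttempts + ' attempts');
}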
If this is not enough, you may have reached the limits of what you can do with Sheets. Digging deeper, I found this Google issue tracker post, which has a quote from their engineering team:
The BigQuery Engineering Team has stated that the current suggested approach is to simplify the spreadsheet. Sheets is designed for Web/Mobile use cases and not as a DB backend. Even a couple of thousand rows is large in this context, especially if there are formulas involved.
The post is a feature request to the Google engineering team to allow for more complexity, but these requests can take time, and since they don't intend Sheets to be used that way, it's possible that they won't implement it. If you cannot reduce the spreadsheet's complexity enough to stop getting the error, you may want to consider querying the data from a different source.
I recently decided to update my spreadsheet of games I need to complete. In order to ensure my data was constantly up to date I made use of the IMPORTXML function, but with the number of URLs involved I have begun to encounter 'Loading...' issues.
This is the spreadsheet:
https://docs.google.com/spreadsheets/d/1ZdcsIf9Upn_0zqTFyLAm1TMMFu_MpyTEm23EU0nVaTA/edit?usp=sharing
(Columns B, E, G and I are usually hidden.)
Column A is the URL.
Column B scrapes the image URL and column C displays it.
Columns D, E, G and I scrape the data I want and display it in columns D, F, H and J.
If my aim is to have upwards of 500 URLs, is this something that can only be accomplished with a script?
In this scenario you are hitting the quota limits of Google services. That quota is reached by aggregating the usage of all your documents and projects. Also, please be aware that there can be more than one import inside the same document, such as one per cell, as in your example.
To reduce that usage you could modify old documents so they no longer refresh (by commenting out the relevant pieces and deactivating triggers), or you could simply delete them. If you plan to run large numbers of imports, you could use Apps Script instead. Although this option is limited by the same quota discussed above, it lets you programmatically control when and how much to import in order to optimise your utilisation of Google services; a minimal sketch is shown below.
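For illustration, a sketch of that idea might look like this; the sheet name, column layout, regex and cache window are hypothetical and would need adapting to your spreadsheet:

// Minimal sketch: fetch pages from Apps Script instead of hundreds of
// IMPORTXML cells, spacing out requests and caching results so quota is only
// spent when data is actually stale.
function refreshGameData() {
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Games');
  var cache = CacheService.getScriptCache();
  var urls = sheet.getRange('A2:A' + sheet.getLastRow()).getValues();

  for (var i = 0; i < urls.length; i++) {
    var url = urls[i][0];
    if (!url) continue;

    var value = cache.get(url);
    if (!value) {
      var html = UrlFetchApp.fetch(url, {muteHttpExceptions: true}).getContentText();
      var match = html.match(/<title>([^<]*)<\/title>/); // placeholder extraction
      value = match ? match[1] : 'N/A';
      cache.put(url, value, 21600); // cache for up to 6 hours
      Utilities.sleep(500);         // space out the requests
    }
    sheet.getRange(i + 2, 4).setValue(value); // write into column D
  }
}

You can then attach this function to a time-driven trigger so it runs, say, once a day, instead of recalculating on every sheet edit.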
I have an iOS app which can be used offline. I need to do anonymous page view tracking, so our customers can tell which pages people are most interested in (to drive future investments). So when the user is offline, we save a timestamped page view list, and if the user happens to be online when they use the app, we send these historic records up, and also do real-time tracking.
I'm keeping some summary statistics in my GAE app, so I can report the page views with historic accuracy. However, I'm also feeding these views into google analytics, using some python code I ported from google's server-side samples.
That all works great (except for language tracking, which I may have solved thanks to a separate question here on SO). However, I'd love for google analytics to be able to understand the historical hits in context. Right now, if I connect up after looking at several pages offline, GA thinks I just popped through a bunch of pages over the course of a couple seconds.
There is no documented utm variable for timestamping. The google analytics SDK for iOS (which I'm not using) has this ominous note:
Known Issues
Possible inaccurate timestamps: timestamps are recorded at the time the application dispatches to Google Analytics, so if a user experiences long periods of offline use, the timestamps may not be 100% accurate.
That seems like a bit of an understatement. Wouldn't offline timestamps be 100% inaccurate?
Anyway, the fact that the SDK doesn't handle this right makes me think I'm not going to be able to solve this. But I figured some SO wizard might have an idea...
In fact, the timestamp is "relative" (client-side) information that Analytics uses to compute things like "time on page".
The "absolute" date and time at which a page view is recorded is always the time you send the request.
I'm using Google's Custom Search API to dynamically provide web search results. I searched the API's docs very thoroughly and could not find anything stating that it grants you access to Google's site image previews, which happen to be stored as base64-encoded data.
I want to be able to provide image previews for sites for each of the urls that the Google web search API returns. Keep in mind that I do not want these images to be thumbnails, but rather large images. My question is what is the best way to go about doing this, in terms of both efficiency and cost, in both the short and long term.
One option would be to crawl the web and generate and store the images myself. However this is way beyond my technical ability, and plus storing all of these images would be too expensive.
The other option would be to dynamically fetch the images right after Google's API returns the search results. However where/how I fetch the images is another question.
Would there be a low cost way of me generating the images myself? Or would the best solution be to use some sort of site thumbnailing service that does this for me? Would this be fast enough? Would it be too expensive? Would the service provide the image in the correct size for me? If not, how could I change the size of the image?
I'd really appreciate answers that are comprehensive, and for any code examples to be in Ruby using Rails.
So as you pointed out in your question, there are two approaches that I can see to your issue:
1. Use an external service to render and host the images.
2. Render and host the images yourself.
I'm no expert in the field, but my Googling has so far only returned services that allow you to generate thumbnails and not full-size screenshots (like the few mentioned here). If there are hosted services out there that will do this for you, I wasn't able to find them easily.
So, that leaves #2. For this, my first instinct was to look for a Ruby library that could generate an image from a webpage, which quickly led me to IMGKit (there may be others, but this one looked clean and simple). With this library, you can easily pass in a URL and it will use the WebKit engine to generate a screenshot of the page for you.
From there, I would save it to wherever your assets are stored (like Amazon S3) using a file attachment gem like Paperclip or CarrierWave (railscast). Store your attachment with a field recording the original URL you passed to IMGKit from WSAPI (Web Search API) so that you can compare against it on subsequent searches and use the cached version instead of re-rendering the preview. You can also use the created_at field on your attachment model to throw in some "if older than x days, refresh the image" logic.
Lastly, I'd put this all in a background job using something like Resque (railscast) so that the user isn't blocked while waiting for screenshots to render: pass the array of URLs returned from WSAPI to background workers in Resque, which generate the images via IMGKit and save them to S3 via Paperclip/CarrierWave. All of these projects are well documented, and the Railscasts will walk you through the basics of the Resque and CarrierWave gems.
I haven't crunched the numbers, but you can weigh the cost of hosting the images yourself on S3 against any other external provider of web thumbnail generation. Of course, doing it yourself gives you full control over how the image looks (quality, format, etc.), whereas most of the services I've come across only offer a small thumbnail, so there's something to be said for that. If you don't cache the images from previous searches, your costs reduce even further, since you'll always be rendering the images on the fly. However, I suspect that this won't scale very well, as you may end up paying a lot more for server power (for IMGKit and image processing) and bandwidth (for external requests to fetch the source HTML for IMGKit). I'd be sure to include some metrics in your project to attach exact numbers to the kind of requests you're dealing with, to help determine what the ongoing costs would be.
Anywho, that would be my high-level approach. I hope it helps some.
Screenshotting web pages reliably is extremely hard to pull off. The main problem is that all the current solutions (khtml2png, CutyCapt, PhantomJS, etc.) are based around Qt, which provides access to an embedded WebKit library. However, that WebKit build is quite old, and with HTML5 and CSS3 most of the effects either don't show or render incorrectly.
One of my colleagues has used most, if not all, of the current technologies for generating screenshots of web pages for one of his personal projects. He has written an informative post here about how he now uses a SaaS solution instead of trying to maintain a solution himself.
The TL;DR version: he now uses URL2PNG for all his thumbnail and full-size screenshots. It isn't free, but he says it does the job for him. If you don't want to use them, they have a list of their competitors here.
I'm trying to write a script that would download currency rates from Yahoo Finance. The problem is that I can't find any information on the limitations of this service. In particular, I'm interested in how often I can query Yahoo Finance to access the quotes.csv file. Would Yahoo kill my script if I executed it periodically, every minute or so? Does anyone know where I could find some official Yahoo information on things like that? I've been searching for hours, but it's either well hidden or it's hiding in plain sight and I don't see it...
Usually it's against the TOS of the website. However, if you want to collect data that way on a small scale, it is fairly trivial. I have mined Yahoo Finance in the past and have never been cut off. Don't hammer the site; space out your requests (a rough sketch is shown below). If you want to be even more clever about it, script a web browser to do it for you; the page requests will then look identical.
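As an illustration of spacing out requests, something like the following Node.js loop polls once a minute; the CSV URL and its query parameters are placeholders, since I can't vouch for the exact endpoint or its terms:

// Minimal sketch: poll a quotes CSV politely, one request per minute, instead
// of hammering the endpoint. The URL and parameters are placeholders.
const QUOTES_URL = 'https://example.com/d/quotes.csv?s=EURUSD=X&f=sl1d1t1';
const INTERVAL_MS = 60 * 1000; // one request per minute

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function pollQuotes() {
  while (true) {
    try {
      const res = await fetch(QUOTES_URL); // global fetch in Node 18+
      if (!res.ok) throw new Error('HTTP ' + res.status);
      const csv = await res.text();
      console.log(new Date().toISOString(), csv.trim());
    } catch (err) {
      console.error('Request failed, will retry next cycle:', err.message);
    }
    await sleep(INTERVAL_MS); // space out requests
  }
}

pollQuotes();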