We have developed an import solution for one of our clients. It parses the data contained in many OneNote notebooks and converts it to the proprietary data structures the client requires, so that the client can store and use it within another information system.
There is a substantial amount of data across many notebooks, so retrieving all of it requires a considerable number of Graph API queries.
In essence, we built a bulk-import solution (essentially a batch process) that goes through all OneNote notebooks under a client's account, parses the section and page data of each, and downloads and stores all page content, including linked documents and images. The linked documents and images require the most Graph API queries.
When performing these imports, we run into Graph API throttling. After a certain time, even though we are sending queries at a relatively low rate, we start getting 429 errors.
Regarding data volume, the average section in a client notebook has 50-70 pages. Each page contains links to about 5 documents for download, on average. Thus, it takes up to 70+350 requests to retrieve all the page content and files of a single notebook section. Our client has many such sections in a notebook, and in turn there are many notebooks.
In total, there are approximately 150 such sections across several notebooks that we need to import for our client. Given the figures above, our import needs to make an estimated total of 60000-65000 Graph API queries.
To avoid flooding the Graph API service and to stay within the throttling limits, we experimented a lot and gradually decreased our request rate to just 1 query every 4 seconds. That is, at most 900 Graph API requests are made per hour.
This already makes each section import noticeably slow, but it is bearable, even though it means that our full import would take up to 72 continuous hours to complete.
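The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted in the post (150 sections, up to 70 pages per section, about 5 files per page, one request every 4 seconds); the function names are made up for illustration, and the per-section page-list requests are ignored.

```python
def estimate_requests(sections, pages_per_section, files_per_page):
    """Upper-bound request count: one request per page plus one per linked file."""
    page_requests = sections * pages_per_section
    file_requests = page_requests * files_per_page
    return page_requests + file_requests


def hours_at_interval(total_requests, seconds_per_request):
    """Wall-clock hours needed when requests are spaced at a fixed interval."""
    return total_requests * seconds_per_request / 3600


upper_bound = estimate_requests(sections=150, pages_per_section=70, files_per_page=5)
print(upper_bound)                           # 150 * (70 + 350) requests
print(hours_at_interval(upper_bound, 4))     # duration at 1 request per 4 seconds
print(hours_at_interval(65000, 4))           # duration for the top of the quoted range
```

With 70 pages per section this lands inside the quoted 60000-65000 range, and the top of that range at one request per 4 seconds gives the roughly 72 hours mentioned.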
However, even with our throttling logic implemented at this rate and proven to work, we still get 429 "too many requests" errors from the Graph API after about 1 hr 10 min, or about 1100 consecutive queries. As a result, we are unable to proceed with our import of the remaining, unfinished notebook sections. This allows us to import only a few sections consecutively, after which we have to wait for an arbitrary period before we can manually attempt to continue the import.
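For readers who want to picture the kind of client-side throttling described here, the following is a minimal sketch, not the poster's actual code. It spaces calls by a fixed interval and, when a 429 arrives with a Retry-After value, waits at least that long before retrying. The `fetch` callable is a hypothetical stand-in that must return `(status, retry_after_seconds_or_None, body)`; `max_attempts` and the injectable `sleep`/`clock` are assumptions added so the logic can be exercised without real network calls.

```python
import time


def fetch_paced(urls, fetch, min_interval=4.0, max_attempts=5,
                sleep=time.sleep, clock=time.monotonic):
    """Fetch each URL in order, never sooner than min_interval after the last call."""
    results = {}
    next_allowed = clock()
    for url in urls:
        for _attempt in range(max_attempts):
            wait = next_allowed - clock()
            if wait > 0:
                sleep(wait)
            status, retry_after, body = fetch(url)
            pause = min_interval
            if status == 429 and retry_after is not None:
                # The server asked for a longer pause; honor it.
                pause = max(min_interval, retry_after)
            next_allowed = clock() + pause
            if status != 429:
                results[url] = (status, body)
                break
        else:
            raise RuntimeError(f"still throttled after {max_attempts} attempts: {url}")
    return results
```

In a real importer, `fetch` would wrap the HTTP call and read the Retry-After response header; that wiring is left out here on purpose.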
So this is the problem we seek help with, especially from Microsoft representatives. Can Microsoft provide a way for us to import these 60-65K pages and documents at a reasonably fast query rate without getting throttled, so that we could get the job done for our client in one continuous batch process? For example, a separate access point (a dedicated service endpoint), perhaps time-constrained, e.g. configured for our use within a certain period, so that we could perform all the necessary imports within that period?
For additional information: we currently load the data using the following Graph API URLs (placeholders for the actual values are written in uppercase letters between curly braces):
Pages under the notebook section:
https://graph.microsoft.com/v1.0/users/{USER}/onenote/sections/{SECTION_ID}/pages?...
Content of a page:
https://graph.microsoft.com/v1.0/users/{USER}/onenote/pages/{PAGE_ID}/content
A file (document or image), e.g. one linked from the page content:
https://graph.microsoft.com/v1.0/users/{USER}/onenote/resources/{RESOURCE_ID}/$value
Which call is most likely to cause the throttling?
What can you retrieve before throttling: just page IDs (150 calls total) or page IDs plus content (10000 calls)? If the latter, can you store the results (e.g. in an SQL database) so that you don't have to make these calls again?
If you can get page IDs plus content, can you then access the resources using preAuthenticated=true? Maybe this is less likely to be throttled. I don't actually download images myself, as I usually deal with ink or print.
I find the OneNote API is very sensitive to multiple calls made without waiting for them to complete; I find more than 12 simultaneous calls via a curl multi technique problematic. Once you get throttled, if you don't back off immediately, you can be throttled for a long, long time. I usually have my scripts bail if I get too many 429s in a row (I have it set to 10, and it bails for 10 minutes).
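The "bail after too many 429s in a row" rule can be sketched as a small counter. This is an illustration of the idea only, not the script mentioned above; the class name `ThrottleGuard` and its method are hypothetical, and the defaults (10 responses, 600 seconds) simply mirror the numbers in the post.

```python
class ThrottleGuard:
    """Tracks consecutive 429 responses and says when to pause for a cooldown."""

    def __init__(self, max_consecutive=10, cooldown_seconds=600):
        self.max_consecutive = max_consecutive
        self.cooldown_seconds = cooldown_seconds
        self.consecutive = 0

    def record(self, status):
        """Record one response status; return how many seconds to pause (0 = none)."""
        if status != 429:
            # Any non-throttled response resets the streak.
            self.consecutive = 0
            return 0
        self.consecutive += 1
        if self.consecutive >= self.max_consecutive:
            self.consecutive = 0
            return self.cooldown_seconds
        return 0
```

A caller would pass every response status to `record()` and sleep for the returned number of seconds before sending the next request.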
We now have the solution released and working in production. It turns out that adding ?preAuthenticated=true to the page requests does indeed return the page content with the resource links (for contained documents and images) in a different format. Querying these resource links then does not seem to affect the API throttling counters, as we have had no 429 errors since.
We even managed to reduce the interval between calls from 4 seconds to 2 without any problems. So I have marked codeeye's answer as the accepted one.
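As a rough illustration, the page-content URL listed in the question can be extended with the preAuthenticated=true query parameter as follows. The helper name and the example identifiers are hypothetical; only the URL shape and the parameter name come from the thread, and the actual HTTP call and authentication are left out.

```python
from urllib.parse import quote

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def page_content_url(user, page_id, pre_authenticated=True):
    """Build the page-content URL, optionally asking for pre-authenticated resource links."""
    url = (
        f"{GRAPH_BASE}/users/{quote(user, safe='@')}"
        f"/onenote/pages/{quote(page_id, safe='')}/content"
    )
    if pre_authenticated:
        url += "?preAuthenticated=true"
    return url


# Hypothetical identifiers, for illustration only.
print(page_content_url("me@example.com", "1-abc"))
```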
Related
I receive this error while trying to export from my datagrid to Google Sheets. How can I solve it?
Don't make many requests too quickly.
You are either exceeding your quota or you are making too many requests too quickly.
Also, look into batch requests:
https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/batchUpdate
You may be trying to make a call to the API for every single cell updated, which is an easy way to run into the above error.
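To make the batching idea concrete, here is a sketch of how many cell updates could be collected into a single request body for spreadsheets.values.batchUpdate instead of being sent one by one. The helper name and the sample ranges are assumptions for illustration; sending the body (and authentication) is not shown.

```python
def build_batch_update_body(cell_updates, value_input_option="RAW"):
    """Turn {A1-range: 2D list of values} into one values.batchUpdate request body."""
    return {
        "valueInputOption": value_input_option,
        "data": [
            {"range": cell_range, "values": values}
            for cell_range, values in cell_updates.items()
        ],
    }


# Three cell writes that would otherwise be three separate API calls.
body = build_batch_update_body({
    "Sheet1!A1": [["name"]],
    "Sheet1!B1": [["price"]],
    "Sheet1!A2:B2": [["widget", 9.5]],
})
print(len(body["data"]))
```

One request carrying the whole datagrid counts once against the per-100-seconds limit, rather than once per cell.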
If you must do it on a cell by cell basis, you would have to insert a small delay between requests. Bear in mind that although the usage page says:
This version of the Google Sheets API has a limit of 500 requests per 100 seconds per project, and 100 requests per 100 seconds per user. Limits for reads and writes are tracked separately. There is no daily usage limit.
This does not mean that you can make 100 requests in 1 second and then wait 99 seconds. This will give you a quota error like what you are running into. You would have to put in a one second delay between requests, for example.
I'm currently developing a chat bot for one specific YouTube channel, which can already fetch messages from the currently active livechat. However I noticed my quota usage shooting up, so I took the "liberty" to calculate my quota cost.
My API call currently looks like this: https://www.googleapis.com/youtube/v3/liveChat/messages?liveChatId=some_livechat_id&part=snippet,authorDetails&pageToken=pageTokenIfProvided, which uses up 5 units. I checked this by running one API call and comparing the quota usage before and after (so apologies if this is inaccurate). The response contains pollingIntervalMillis set to 5086 milliseconds. Currently, my bot adds that interval to the current datetime and schedules the next fetch at that time (using Celery), so it fetches messages at intervals of 4-6 seconds. I'm going to take the liberty of always waiting for 6 seconds.
Calculating my API quota would result in a usage of 72.000 units per day:
10 requests per minute * 60 minutes * 24 hours = 14.400 requests per day
14.400 requests * 5 units per request = 72.000 units per day
This means that if I used pollingIntervalMillis as a guideline for how often to request, I'd easily reach the maximum quota of 10.000 units by running the bot for 3 hours and 20 minutes. In order not to use up the quota just by fetching chat messages, I would need to limit myself to about 1 API call per minute (approximately 1,3889). This is not feasible for a chatbot, since this covers only fetching messages and not even sending any messages to the chat.
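The quota figures above can be checked with a short calculation. This is a sketch using the numbers from the question (5 units per request, a 6-second interval, a 10.000-unit daily quota); the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_units(poll_interval_s, units_per_request=5):
    """Quota units consumed by polling around the clock at a fixed interval."""
    return SECONDS_PER_DAY / poll_interval_s * units_per_request


def hours_until_exhausted(poll_interval_s, units_per_request=5, daily_quota=10_000):
    """How long the daily quota lasts at a fixed polling interval."""
    affordable_requests = daily_quota / units_per_request
    return affordable_requests * poll_interval_s / 3600


def max_requests_per_minute(units_per_request=5, daily_quota=10_000):
    """Highest sustained request rate that fits into one day's quota."""
    return daily_quota / units_per_request / (24 * 60)


print(daily_units(6))              # units per day at one request every 6 seconds
print(hours_until_exhausted(6))    # hours until the 10.000-unit quota is gone
print(max_requests_per_minute())   # sustainable requests per minute
```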
So my question is: Is there maybe a more efficient way to fetch chat messages which won't use up the quota so much? Or will I only get this resolved by applying for a quota extension? And if this is only resolved by a quota extension, how much would I need to ask for reliably? Around 100k units? Even more?
I am also asking myself how something like Streamlabs Chatbot (previously known as AnkhBot) accomplishes this without hitting the quota limit despite thousands of users using their API client. Their quota must be in the millions or billions.
And another question would be how I'd actually fill out the form, if the bot is still in this "early" state of development?
You pretty much hit the nail on the head. Services like Streamlabs are owned by larger companies, in their case Logitech. They not only have the money to throw around for things like increasing their API quota, but they also have professional relationships with companies like Google to decrease their per unit cost.
As for efficiency, the API costs are easily found in the documentation, but for live chat as you've found, you're going to be hitting the API for 5 units per hit. The only way to improve your overall daily cost with your calls is to perform them less frequently. While once per minute is clearly excessively long, once every 15-18 seconds could reduce the overall cost of your API quota increase, while making the chat bot adequately responsive.
Of course, that all depends on how you want to use the data, but it is still my recommendation if your bot remains in the realm of hobbyist usage.
I recently began using the YouTube Data v3 API for a program that I'm writing which is purely for personal use. To give a brief summary of what it does, it checks the live chat from my most recent (usually ongoing) livestream and performs actions based on certain keywords entered in chat (essentially commands for people to use from live chat). In order to do that, however, I have to constantly send requests to get a refreshed livechat. As it is now, it sends requests at 1 second intervals. I recently did a livestream to test out my program and it only took about 25 minutes for me to reach the daily quota limit of 10,000 units/day.
The request is: youtube.liveChatMessages().list(liveChatId=liveChatId,part="snippet")
It seems like every request I make costs 6 units, according to the math. I want to be able to host livestreams at lengths of up to 3 hours, which would require a significant quota increase. I'm aware that there is an option to fill out a form to request additional quota. However, it asks for business information such as a business name, business website, business mailing address, etc. Like I said before, I'm doing this for my own use only. I'm in no way part of a business, and just made my program as a personal project. Does anyone know if there's any way to apply for additional quota as an individual/hobbyist? If not, do you think just putting n/a in those fields would be acceptable? I did find another post where someone else had the exact same problem, but no one was able to give a helpful answer. Any advice would be greatly appreciated.
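The trade-off between stream length and polling rate can be worked out directly. The sketch below first derives the implied per-request cost from the observation in the question (quota gone after about 25 minutes at one request per second), then computes the shortest polling interval that would stretch 10,000 units over a 3-hour stream at the 6 units per request the poster assumes. The function names are illustrative only.

```python
def implied_units_per_request(minutes_until_exhausted, poll_interval_s, daily_quota=10_000):
    """Per-request cost implied by how quickly the quota ran out."""
    requests_made = minutes_until_exhausted * 60 / poll_interval_s
    return daily_quota / requests_made


def min_poll_interval(stream_hours, units_per_request, daily_quota=10_000):
    """Shortest polling interval (seconds) that keeps a stream within the quota."""
    affordable_requests = daily_quota / units_per_request
    return stream_hours * 3600 / affordable_requests


print(implied_units_per_request(25, 1))   # implied units per request
print(min_poll_interval(3, 6))            # seconds between polls for a 3-hour stream
```

Under these assumptions a 3-hour stream fits into the default quota only if the chat is polled about every 6.5 seconds rather than every second.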
Unfortunately, and although this is only loosely related, it seems that Google is in it for the money here. I tried something similar myself (a very basic chat bot that only reads the chat messages). Other users on the net got somewhat different results, but what they all have in common is that, following the documentation, they poll about once a second (that is the timeout one gets as part of the response to a poll for new messages). I, along with a few others, got at most about 5 minutes when polling once a second; others, like you, got a few more minutes out of it. I then changed the interval by hand in steps of 5 seconds: 5, 10, 15, and so on. I can't remember which value I finally settled on, but I was only able to get about 2 1/2 hours' worth with a rather long polling interval of once every 10 seconds or so, which is still plenty for a simple chat bot that only reads the chat. Replying as well would have at least doubled the usage and hence halved the time.
It's already a pain to get it working as an individual, as just setting up the required OAuth authentication requires one to provide at least basic information, such as a fixed callback and some legal and policy information. I always ended up having it rejected with the standard reply "Your project seem to be for internal use only.". I even managed to get G Suite working (before it required payment) to set up an "internal" project (only possible if the account belongs to a G Suite organization account), but after I set up the OAuth login I got an error saying that the private account I wanted to use the bot on was not part of the organization and hence couldn't be used. TLDR: just a useless waste of time.
Having been at this for several months now, I see no way to get it done as a private individual for personal use. Yes, one can just set it up and have the required check rejected (as it uses the YouTube Data API scopes), but one is still stuck with the 10.000 units / day quota. Building your own powerful tool capable of doing more than just polling once every 10 to 30 seconds with a minimum of interaction doesn't get you further than a few minutes, maybe one or two hours if you're lucky. If you want more, you have to set up a business and pay for it. In short: Google wants you to pay for that service.
As Mixer has officially been announced to shut down on July 22nd, you have exactly these two options:
Use one of the publicly available services like Streamlabs, Nightbot, etc. They are backed by their respective "businesses" and therefore don't seem to have those quota limits (although I found some complaints about Streamlabs from April, about one month before you posted this question, where they admitted to having reached their limits; I don't know whether they have solved it yet).
Don't use YouTube for streaming but rather Twitch, as Twitch doesn't have these limits and anybody is free to set up an API token either on the main account or on a second bot account (which is also explicitly explained in their docs). The downside, of course, is the objective sacrifices one has to make: a) viewers only get the quality the streamer sends until one reaches at least affiliate, b) capped at max 1080p60 with only 6.000 kBit/s, c) only a short time of VOD storage.
I myself wanted to use YouTube as my main platform (and currently do, but without my own stuff at the moment) together with my own bot and such, as streaming on YouTube has some advantages over Twitch. But as YouTube wants me to pay for what others (namely Twitch) offer for free (although overall not in as good quality), it's an easy decision to make. Mixer looked promising, as it also offered some neat features (overall better quality than Twitch, lower latency), but the requirements for partner status were so high (2.000 followers along with another insanely high number to reach) and Mixer itself was so small a platform (for fun I counted all the streamers and viewers: with only a few hundred streamers and a few 10.000s of viewers, the whole platform had fewer than some big Twitch channels have on their own), and now it has been announced to be dead soon anyway.
I hope this gives you some input on what a small streamer has to consider and put up with when choosing a platform. After all I have experienced, my conclusion is this: either do it like everyone else, i.e. stream on Twitch and use YouTube as an archive to export to from Twitch (although Twitch STILL hasn't implemented an auto-export of the latest VOD, but I guess that could be done by a small script), or, if you want to stay on YouTube, use an existing bot like Nightbot or one of the other services like Streamlabs.
If you get any other information on how to convince Google to increase the limit as an individual please let us know.
I have been getting intermittent 500 errors while batch-uploading simple row data to Google Fusion Tables via the v2 API, using the importRows method.
We have tried throttling and backing off, but the patterns seem to indicate that we are going over quota even with small numbers of requests and fairly slow rates.
I can see in the API console it's limited to 200 requests / 100s (as confirmed in other posts it's a 0.5/s rate limit).
Sadly, we are about to abandon the Fusion Tables API and rebuild the entire project using something else, due to the unpredictable nature of the 500 errors. (Sometimes the insert happens even though an error is returned and sometimes it does not, so retrying runs the risk of duplicate inserts.)
It occurred to me that as we are uploading 1,000 rows per request, does this count as 1,000 requests?
Are you uploading media files when you make an importRows request? It could be that you're exceeding table storage limits (250MB per table). You may want to check your code and data payloads against this and other Fusion Table limitations.
Here's a good reference on the limits of Fusion Tables:
What are the technical limitations when using Fusion Tables?
In my app I want to retrieve a large amount of data from Parse to build a view of statistics. However, in future, as data builds up, this may be a huge amount.
For example, 10,000 results. Even if I fetched in batches of 1000 at a time, this would result in 10 fetches. This could rapidly send me over Parse's limit of 30 requests per second, especially when several other chunks of data may need to be collected at the same time for other stats.
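The batch count in this example follows from simple paging arithmetic. The sketch below lists the (skip, limit) pairs such a paged fetch would use; the helper name is hypothetical and no Parse SDK call is shown.

```python
def batch_offsets(total_results, page_size=1000):
    """Return the (skip, limit) pair for each fetch needed to cover total_results."""
    return [
        (skip, min(page_size, total_results - skip))
        for skip in range(0, total_results, page_size)
    ]


offsets = batch_offsets(10_000, 1000)
print(len(offsets))   # number of fetches for one statistic
print(offsets[-1])    # skip/limit of the final batch
```

If three such statistics were loaded concurrently, their batches alone would account for 30 requests, which is why firing them all within the same second would hit the limit mentioned above.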
Any recommendations/tips/advice for this scenario?
You will also run into limits with the skip and limit query variables. Heavy lifting on a mobile device could also present issues for you.
If you can, you should pre-aggregate these statistics, perhaps once per day, so that you can simply request the details directly.
Alternatively, create a cloud code function to do some processing for you and return the results. Again, you may well run into limits here, so a cloud job may meet your needs better. In that case you may need to create a request object which is processed by the job, and then either poll for completion or send out push notifications on completion.