Will Yahoo Finance or Google Finance block me if I subscribe to all stocks? - yahoo-finance

I want to retrieve all stocks from a few exchanges, by first retrieving the list of stocks on those exchanges (taken from http://www.nasdaq.com/screening/company-list.aspx).
Then I will request quotes for all of those stocks from Google or Yahoo.
My question is: if I request quotes for all of them every 5 or 10 seconds, will they block me?
What is the correct way to get all stocks and their updated data?
Thanks!

David,
tl;dr - Yahoo Finance is OK (scraping 2,000 stocks) if you insert pauses in your code
I have some clumsy but working code (my first attempt at scraping) that pulls some data from Yahoo Finance. While I don't like the code and will rewrite it for nasdaq.com in the coming weeks, I can tell you that I'm not getting blocked.
I have a few-years-old list of Russell 2000 stocks, so there are around 2,000 tickers I'm slowly going through, pulling some data from the balance sheet. I'm using Selenium (see my question history; there is only one question, with working code to look at): the code loads the Chromium browser (on Linux), clicks on the balance sheet, scrapes some data, clicks the quarterly link, scrapes more data, and then closes the browser. For every ticker (stock).
Just to be on the safe side, I put several pauses into my code: for every scrape or navigation action on the site I added between 5 and 10 seconds. That way I'm scraping data slowly, and Yahoo seems to be OK with it :-) It takes about one minute per ticker. I'm running this scrape job (for the first time!) now for over 30 hours, lol, and I'm currently at a ticker that starts with T, so I have a few more hours to go.
I have read somewhere that some sites can spot this kind of slow scraping too. So as an idea, instead of a hard-coded pause of, say, 7 seconds, you could use a random number generator to pause somewhere between, say, 7 and 15 seconds; that way the pauses are more random and less likely to be spotted. Just a thought. Hope this helps a little, even if with delay.
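A minimal sketch of that idea in Python (assuming Selenium with a Chrome/Chromium driver; the ticker list, URL, and the "Quarterly" link text are placeholders to adapt to the actual page):

    import random
    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    def polite_pause(low=7, high=15):
        # Random delay so the pause pattern is harder to fingerprint.
        time.sleep(random.uniform(low, high))

    tickers = ["INTC", "MSFT"]  # placeholder for your Russell 2000 list

    driver = webdriver.Chrome()
    for ticker in tickers:
        driver.get("https://finance.yahoo.com/quote/%s/balance-sheet" % ticker)
        polite_pause()
        # ... scrape the annual balance-sheet figures here ...
        driver.find_element(By.LINK_TEXT, "Quarterly").click()
        polite_pause()
        # ... scrape the quarterly figures here ...
    driver.quit()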
Ah, and if this answer does help you, please be so kind as to mark it as solved and upvote it. Maybe I can get a point or two for it; my reputation is so low I can't even upvote other posts that I like and that helped me.

Related

Applying for Additional Quota for YouTube API as an Individual (without business info)

I recently began using the YouTube Data v3 API for a program that I'm writing, purely for personal use. To give a brief summary of what it does: it checks the live chat from my most recent (usually ongoing) livestream and performs actions based on certain keywords entered in chat (essentially commands for people to use from live chat). In order to do that, however, I have to constantly send requests to get a refreshed live chat. As it stands, it sends requests at 1-second intervals. I recently did a livestream to test out my program, and it took only about 25 minutes for me to reach the daily quota limit of 10,000 units/day.
The request is: youtube.liveChatMessages().list(liveChatId=liveChatId, part="snippet")
It seems like every request I make costs 6 units, according to the math. I want to be able to host livestreams at lengths of up to 3 hours, which would require a significant quota increase. I'm aware that there is an option to fill out a form to request additional quota. However, it asks for business information such as a business name, business website, business mailing address, etc. Like I said before, I'm doing this for my own use only. I'm in no way part of a business, and just made my program as a personal project. Does anyone know if there's any way to apply for additional quota as an individual/hobbyist? If not, do you think just putting n/a in those fields would be acceptable? I did find another post where someone else had the exact same problem, but no one was able to give a helpful answer. Any advice would be greatly appreciated.
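For reference, a minimal sketch of such a polling loop in Python with google-api-python-client (assuming youtube is an already-authorized client object and live_chat_id has been fetched from the active broadcast elsewhere); note the response carries pollingIntervalMillis, the wait the API asks for between polls:

    import time

    def poll_chat(youtube, live_chat_id):
        page_token = None
        while True:
            response = youtube.liveChatMessages().list(
                liveChatId=live_chat_id,
                part="snippet",
                pageToken=page_token,
            ).execute()
            for item in response.get("items", []):
                text = item["snippet"].get("displayMessage", "")
                # ... match keywords and run the corresponding commands ...
            page_token = response.get("nextPageToken")
            # Wait at least as long as the API requests before the next poll.
            time.sleep(response["pollingIntervalMillis"] / 1000.0)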
Unfortunately, and although this is only related, it seems Google is in it for the money here. I tried to do something similar myself (a very basic chat bot that just reads the chat messages), and although some other users on the net got somewhat different results, they all have one thing in common: following the documentation on how it should be done, they all poll at an interval of about once a second (that's the timeout you get as part of the answer to a poll for new messages). I, along with a few others, got at most about 5 minutes out of polling once a second; some others, like you, got a few more minutes out of it. I increased the interval by hand in steps of 5 seconds: 5, 10, 15, etc. - you get the picture. I can't remember which value I finally settled on, but even with a rather long polling interval of just once every 10 seconds or so, I was only able to get about 2.5 hours' worth - still easily enough for a simple chat bot that just reads the chat. Replying as well, however, would have at least doubled the usage and hence halved the time.
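The back-of-the-envelope math behind that trade-off, taking the ~6 units per call from the question (the true per-call cost isn't precisely documented, so treat it as an estimate):

    QUOTA_PER_DAY = 10_000
    UNITS_PER_POLL = 6  # estimated from the question's observations

    def max_stream_hours(poll_interval_seconds):
        polls = QUOTA_PER_DAY / UNITS_PER_POLL  # ~1,666 polls per day
        return polls * poll_interval_seconds / 3600.0

    print(max_stream_hours(1))   # ~0.46 h, matching the ~25 minutes observed
    print(max_stream_hours(10))  # ~4.6 h in theory; other calls eat into this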
It's already a pain to get this working as an individual, since just setting up the required OAuth authentication requires you to provide at least basic information, like a fixed callback URL and some legal and policy details. I always ended up having it rejected with the standard reply "Your project seems to be for internal use only." I even managed to get G Suite working (back before it required payment) to set up an "internal" project (only possible if the account belongs to a G Suite organization account), but after I set up the OAuth login I got an error saying that the private account I wanted to use the bot on was not part of the organization and hence couldn't be used. TL;DR: just a useless waste of time.
Having been at this for several months now, I'd say there's just no way to get it done as a private individual for personal use. Yes, you can set it all up and have the required verification rejected (as it uses the YouTube Data API scopes), but you're still stuck with that 10,000 units/day quota. Building your own powerful tool capable of doing more than just polling once every 10 to 30 seconds, with just a minimum of interaction, won't get you further than a few minutes, maybe one or two hours if you're lucky. If you want more, you have to set up a business and pay for it. Simple and short: Google wants you to pay for that service.
As Mixer has officially been announced to shut down on July 22nd, you have exactly these two options:
Use one of the publicly available services like Streamlabs, Nightbot, etc. They're backed by their respective "businesses" and therefore don't seem to have those quota limits (although I just found some complaints about Streamlabs from April - about one month before you posted this question - where they admitted to having reached their limits; I don't know whether they have solved it yet).
Don't use YouTube for streaming; use Twitch instead, as Twitch doesn't have these limits and anybody is free to set up an API token either on the main account or on a second bot account (which is also explicitly explained in their docs). The downside is, of course, the objective sacrifices one has to make: a) viewers only get the streamer's source quality until the channel reaches at least affiliate status, b) streams are capped at 1080p60 with only 6,000 kbit/s, and c) VODs are stored for only a short time.
I myself wanted to use YouTube as my main platform (and currently do, though without my own tooling at the moment), with my own bot and such, since streaming on YouTube has some advantages over Twitch. But as YouTube wants me to pay for what others (namely Twitch) offer for free (although overall not at the same quality), it's an easy decision to make. Mixer looked promising, as it offered quite a few neat features (overall better quality than Twitch, lower latency), but the requirements for partner status were very high (2,000 followers along with another insanely high number to reach), and Mixer itself was just a small platform (I took the trouble of counting all the streamers and viewers: only a few hundred streamers with a few tens of thousands of viewers, so the whole platform had less than some big Twitch channels on their own) - and now it's announced to be dead soon anyway.
Hope this gives you some insight into what a small streamer has to consider and suffer through when choosing a platform. After everything I've experienced, my advice is: either do it like all the others - stream on Twitch and use YouTube as an archive to export to from Twitch (although Twitch STILL hasn't implemented an auto-export of the latest VOD, I guess that could be done by some small script) - or, if you want to stay on YouTube, use an existing bot like Nightbot or one of the other services like Streamlabs.
If you get any other information on how to convince Google to increase the limit as an individual please let us know.

iPhone app that needs to scrape a website once every day

So I'm making an iPhone application that needs to scrape a website once every day.
What I'm going to scrape is a table of that day's upcoming games for a soccer division. That's why I need the app to scrape the same table on the same page once every day, to keep the upcoming games updated.
I was referred to import.io, but they don't offer anything like a scheduled re-crawl.
I would love to get some ideas and tips on how I should do this, since I'm stuck now.
You might take a look at https://www.kimonolabs.com/
I played around with the service a while back and was impressed with how easy it was to set up. They have a "free" option as long as the APIs you create are not private.
Oh, and I agree with Paul: screen scraping is not something the iOS client should be doing. It's too fragile, and when (not if) something breaks, you will need to go through an Apple review process to fix it.
This doesn't seem like something an app should do; your server should do it (so that the scraping is only performed once), and your clients can retrieve the result from your server. That also means you could send out push notifications for important fixtures, etc. Maybe that's what you meant anyway.
If it's on the server, you can just set up a scheduler (in Java, for example) to run once every x hours (probably something smaller than 24, assuming you don't know when the website is updated). Then your app can just get the latest list of fixtures from your server on startup, on pull-to-refresh, etc. Presumably someone will open your app, look at the fixtures, and then leave, so it doesn't seem like you need to cover the case where someone is in your app all day; but if you did, you could use an NSTimer to run every x minutes after the initial on-startup server call.
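A rough sketch of that server-side piece, in Python rather than Java purely for illustration (the URL, interval, and parsing/storage are placeholders):

    import time

    import requests

    FIXTURES_URL = "https://example.com/todays-games"  # placeholder
    POLL_EVERY_SECONDS = 6 * 60 * 60  # every 6 hours, since the update time is unknown

    def scrape_fixtures():
        html = requests.get(FIXTURES_URL, timeout=30).text
        # ... parse the fixtures table (e.g. with BeautifulSoup) and save it to your DB ...

    while True:
        scrape_fixtures()
        time.sleep(POLL_EVERY_SECONDS)

The iOS app then just calls your own API on startup or pull-to-refresh instead of scraping the site itself.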

Google Finance API Not Consistent

I'm writing some software to do charting and analysis of intraday stock data, and so far the only free (or even affordable) feed I've found that gives 15-minute data for the past week or so is Google Finance. But something I've noticed, which I don't understand and which has caused many headaches, is that the API's responses for 15-minute intervals seem to be very inconsistent.
So far I haven't seen this problem with the 30-minute interval; in that case the response is always correct. But if I specify an interval of 15 minutes (900 seconds), I get anywhere from 70 to 200 or more quotes back. The data is correct, but the responses seem to pretty much ignore the number of days I specify. This also happens for individual stocks, so it isn't a case of some stocks having missing data. Here's an example of an API request I'm sending:
https://www.google.com/finance/getprices?i=900&p=8d&f=d,o,h,l,c&q=INTC
If anyone could help I'd appreciate it, this API doesn't seem to be documented so it's been difficult to find any help with it.
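One way to see the inconsistency is to count the data rows that come back (a Python sketch; the endpoint is undocumented, so the response format assumed here - header lines of KEY=VALUE pairs followed by comma-separated rows whose first field is a number or an "a"-prefixed timestamp - is an observation, not a guarantee):

    import requests

    url = "https://www.google.com/finance/getprices"
    params = {"i": "900", "p": "8d", "f": "d,o,h,l,c", "q": "INTC"}
    body = requests.get(url, params=params).text

    rows = [line for line in body.splitlines()
            if "," in line and (line[0] == "a" or line.split(",")[0].isdigit())]
    print(len(rows), "quotes returned")  # varies between runs for i=900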
Yes, Google is not consistent in providing stock data. For the same reason I switched over to the Yahoo API; their data is pretty consistent compared to Google's.

View counter in ASP.NET MVC

I'm going to create a view counter for articles. I have some questions:
1. Should I ignore the article's author when he opens his own article?
2. I don't want to update the database each time. I can store in a Dictionary<int, int> (articleId, viewCount) how many times each article was viewed, and update the database after 100 hits.
3. I should only count the hit once per hour for each user and article (if the user opens one article many times during one hour, the view count should be incremented only once).
For each question I want to know your suggestions on how to do it right.
I'm especially interested in how to do #3. Should I store the time when the user opened the article in a cookie? Does that mean I should create a new cookie for each page?
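To make #1-#3 concrete, here is a sketch of the buffering and once-per-hour logic; it's in Python purely to illustrate the idea (the question is ASP.NET MVC, where the dictionaries would live in application scope), and all names are illustrative:

    import time

    FLUSH_THRESHOLD = 100
    view_counts = {}  # articleId -> buffered view count
    last_seen = {}    # (userId, articleId) -> time of last counted hit

    def record_view(user_id, article_id, author_id):
        if user_id == author_id:
            return  # question 1: ignore the article's author
        now = time.time()
        key = (user_id, article_id)
        if now - last_seen.get(key, 0) < 3600:
            return  # question 3: at most one counted hit per user/article/hour
        last_seen[key] = now
        view_counts[article_id] = view_counts.get(article_id, 0) + 1
        if view_counts[article_id] >= FLUSH_THRESHOLD:
            flush_to_database(article_id, view_counts.pop(article_id))  # question 2

    def flush_to_database(article_id, count):
        # placeholder: UPDATE Articles SET Views = Views + @count WHERE Id = @article_id
        pass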
I think I know the answer - they are analyzing the IIS log as Ope suggested.
Hidden image src is set to
http://stackoverflow.com/posts/3590653/ivc/[Random code]
[Random code] is needed because many people may share the same IP (in a network, for example) and the code is used to distinguish users.
Sure - I think that is a good idea.
2 and 3 are related: the issue is where you would actually store this dictionary and logic.
An ASP.NET application or session scope is of course the easiest choice, but then you really need to understand the logic of application pools. ASP.NET applications are recycled from time to time: when there is no activity on the site for a certain period, or in special situations (e.g. if the process starts to take too much memory), the application is shut down and a new one is started on the next request. There are events for session and application shutdown, but at least some years ago they were not really reliable: in many special cases they did not always fire. Perhaps they are better now, but it is painful to test. And one hour is really a long time: sessions are usually kept alive only about 20 minutes after the last request.
A reliable way would be to have a separate Windows service (a lot of work to program) or to always store to the database with double-view analysis (quite a lot of overhead for such a small feature).
Do you have access to the IIS logs? How about analyzing them, e.g. every 30 minutes, with some kind of timer process and taking the count from there? Or just store all the hits to the database with user information and calculate the unique hits with a similar timed process.
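A sketch of that log-analysis idea in Python (assuming W3C-format IIS logs, which announce their columns in a "#Fields:" comment line; the cs-uri-stem and c-ip field names depend on your IIS logging configuration):

    from collections import defaultdict

    def unique_hits(log_path):
        fields = []
        seen = defaultdict(set)  # uri -> set of client IPs
        with open(log_path) as f:
            for line in f:
                if line.startswith("#Fields:"):
                    fields = line.split()[1:]
                elif not line.startswith("#") and fields:
                    row = dict(zip(fields, line.split()))
                    seen[row["cs-uri-stem"]].add(row["c-ip"])
        return {uri: len(ips) for uri, ips in seen.items()}

Run it from a timer process every 30 minutes or so and write the per-URI counts to the database in one batch.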
One final question: are you really sure that none of the thousands of counter applications/services on the Internet would do the job close enough to your requirements?
Good luck!
This is a screenshot of this page in Firebug. You can see that there is a request which returns a 204 status code (No Content).
This is Stack Overflow's view counter: they use a hidden image whose src points to a controller action.
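The pattern itself is tiny. A sketch in Python/Flask purely to illustrate it (the real thing is ASP.NET MVC; the route and helper here are made up):

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/posts/<int:post_id>/ivc/<code>")
    def count_view(post_id, code):
        # `code` distinguishes users behind a shared IP, as described above.
        register_hit(post_id, request.remote_addr, code)
        return "", 204  # No Content: the "image" is empty, the count is the side effect

    def register_hit(post_id, ip, code):
        pass  # placeholder: dedupe and persist the hit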
I have many articles. How do I track which articles the user has already visited?
P.S. BTW, why is this request made twice?

Tracking impressions/visits per web page

I have a site with several pages for each company, and I want to show how their page is performing in terms of the number of people visiting the profile.
We have already made sure that bots are excluded.
Currently, we record each hit in a DB with either an insert (for the first request of the day to a profile) or an update (for subsequent requests that day). But given that requests have gone from a few thousand per day to tens of thousands per day, these inserts/updates are causing major performance issues.
Assuming no JS solution, what will be the best way to handle this?
I am using Ruby on Rails, MySQL, Memcached, Apache, and HAProxy to run the overall show.
Any help will be much appreciated.
Thx
http://www.scribd.com/doc/49575/Scaling-Rails-Presentation-From-Scribd-Launch
You should start reading from slide 17.
I think performance isn't a problem, given that it was possible to build a solution like this for a website as big as Scribd.
Here are 4 ways to address this, from easy estimates to complex and accurate:
Track only a percentage (10% or 1%) of users, then multiply to get an estimate of the count.
After the first 50 counts for a given page, update the count only 1/13th of the time, by a count of 13. This helps when a few pages generate most of the hits, while keeping small counts exact. (Use 13 because it's hard to notice that the increment isn't 1; see the sketch after this list.)
Save exact counts in a cache layer like memcache or local server memory, and flush them to disk when they reach 10 counts or have been in the cache for a certain amount of time.
Build a separate counting layer that 1) always has the current count available in memory, 2) persists the count to its own tables/database, and 3) has calls that adjust both places.
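A sketch of option 2 (plain Python, with a dict standing in for whatever store you use; since the expected increment per hit stays 1, the long-run count is unbiased):

    import random

    EXACT_UNTIL = 50  # count every hit while the number is small
    STEP = 13         # past that, add 13 with probability 1/13

    def bump(counter, page):
        current = counter.get(page, 0)
        if current < EXACT_UNTIL:
            counter[page] = current + 1
        elif random.random() < 1.0 / STEP:
            counter[page] = current + STEP

    counts = {}
    for _ in range(10_000):
        bump(counts, "page")
    print(counts["page"])  # ~10,000 on average, with ~1/13th of the writes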
