I am simply trying to get n tweets for a given query. The problem is, tweepy keeps returning fewer than n tweets when I do this via the Cursor method.
I think this has something to do with rate limiting, though I initialized the API to wait on the rate limit and to notify me when this happens.
Here's my code.
import tweepy

# Initialize API
ckey = "xxx"
csecret = "xxx"
atoken = "xxx"
asecret = "xxx"
auth = tweepy.OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
max_tweets = 1000
query = 'tweepy'
l = []
for tweet in tweepy.Cursor(api.search, q=query).items(max_tweets):
l.append(tweet.text)
print(len(l))
And it always turns out that l contains fewer tweets than max_tweets.
It's due to the query. Try a different query instead of 'tweepy'; there may not be enough matching tweets in the last seven days. I just tried a different query and got 1000 tweets as requested.
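A quick way to sanity-check that (a rough sketch, reusing the api object initialized above; api.search is the endpoint name in older tweepy releases, renamed api.search_tweets in tweepy 4.x) is to count how many tweets the standard Search API can actually see for a query inside its roughly seven-day window:
import tweepy

def available_tweets(api, query):
    # Iterate the cursor without an item cap and count what actually comes back.
    total = 0
    for _ in tweepy.Cursor(api.search, q=query).items():
        total += 1
    return total

# If this prints less than max_tweets, the query simply doesn't have enough recent tweets.
print(available_tweets(api, 'tweepy'))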
I've been using the classic article function below, which returns the PMIDs of the articles matching a query string.
from Bio import Entrez, __version__
print('Biopython version : ', __version__)

def article_machine(t):
    Entrez.email = 'email'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='100000',
                            retmode='xml',
                            term=t)
    return list(Entrez.read(handle)["IdList"])

print(len(article_machine('T cell')))
I've now noticed that there's a limit on the number of articles I receive (not the one I set in retmax).
The number I get is 9,999 PMIDs, for keywords that used to return 100k PMIDs ('T cell', for example).
I know it's not a bug in the package itself but a limit on NCBI's side.
Has anyone managed to solve this?
From The E-utilities In-Depth: Parameters, Syntax and More:
retmax
Total number of UIDs from the retrieved set to be shown in the XML output (default=20). By default, ESearch only includes the first 20 UIDs retrieved in the XML output. If usehistory is set to 'y', the remainder of the retrieved set will be stored on the History server; otherwise these UIDs are lost. Increasing retmax allows more of the retrieved UIDs to be included in the XML output, up to a maximum of 10,000 records.
To retrieve more than 10,000 UIDs from databases other than PubMed, submit multiple esearch requests while incrementing the value of retstart (see Application 3). For PubMed, ESearch can only retrieve the first 10,000 records matching the query. To obtain more than 10,000 PubMed records, consider using EDirect, which contains additional logic to batch PubMed search results automatically.
Unfortunately, my code devised from the above-mentioned info doesn't work:
from Bio import Entrez, __version__
print('Biopython version : ', __version__)

def article_machine(t):
    all_res = []
    Entrez.email = 'email'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            rettype='count',
                            term=t)
    number = int(Entrez.read(handle)['Count'])
    print(number)
    retstart = 0
    while retstart < number:
        retstart += 1000
        print('\n retstart now : ', retstart, ' out of : ', number)
        Entrez.email = 'email'
        handle = Entrez.esearch(db='pubmed',
                                sort='relevance',
                                rettype='xml',
                                retstart=retstart,
                                retmax=str(retstart),
                                term=t)
        all_res.extend(list(Entrez.read(handle)["IdList"]))
    return all_res

print(article_machine('T cell'))
Changing while retstart < number: to while retstart < 5000:, the code works; but as soon as retstart exceeds 9998, that is, using the original while loop needed to access all the results, I get the following error:
RuntimeError: Search Backend failed: Exception:
'retstart' cannot be larger than 9998. For PubMed,
ESearch can only retrieve the first 9,999 records
matching the query.
To obtain more than 9,999 PubMed records, consider
using EDirect that contains additional logic
to batch PubMed search results automatically
so that an arbitrary number can be retrieved.
For details see https://www.ncbi.nlm.nih.gov/books/NBK25499/
See https://www.ncbi.nlm.nih.gov/books/NBK25499/, which should actually be https://www.ncbi.nlm.nih.gov/books/NBK179288/?report=reader.
Try having a look at the NCBI APIs to see if there is something that could solve your problem using a Python interface; I am not an expert on this, sorry: https://www.ncbi.nlm.nih.gov/pmc/tools/developers/
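If you want to stay within Biopython rather than switch to EDirect, one common workaround (a sketch, not the EDirect route the error message recommends; the year range and the 'email' placeholder are assumptions to adapt) is to slice the query by publication date so that no single ESearch result set exceeds the 10,000-record cap:
from Bio import Entrez

Entrez.email = 'email'  # placeholder, use a real address

def pmids_by_year(term, start_year, end_year):
    # Collect PMIDs one year at a time so each individual ESearch stays under the cap.
    all_ids = set()
    for year in range(start_year, end_year + 1):
        dated_term = f'({term}) AND ("{year}/01/01"[PDAT] : "{year}/12/31"[PDAT])'
        handle = Entrez.esearch(db='pubmed', retmax='9999', term=dated_term)
        record = Entrez.read(handle)
        handle.close()
        all_ids.update(record["IdList"])
        if int(record["Count"]) > 9999:
            # This slice itself hit the cap and would need splitting further (e.g. by month).
            print(year, 'returned', record["Count"], 'hits; split this slice further')
    return sorted(all_ids)

print(len(pmids_by_year('T cell', 2000, 2005)))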
I'm trying to get information about tweets (number of likes, comments, etc.) from a specific account between two dates. I have access to the Twitter API and tweepy installed, but I have not been able to figure out how to do this. My authentication method is OAuth 2.0 Bearer Token. Any help is appreciated!
You can check when a tweet was created by looking at the created_at attribute on the tweet object. To fetch all the tweets from a specific user during a specific timeframe, however, you first have to fetch all the tweets under that account. Also, it's worth mentioning that Twitter's API only supports retrieving up to the user's latest 3,200 tweets.
To fetch all the tweets, you could do
# twitterAPI below is an authenticated tweepy.API instance.
# Get 200 tweets per request and add them to the list (200 is the max per request).
# Keep looping until there are no more to fetch.
username = ""
tweets = []
fetchedTweets = twitterAPI.user_timeline(screen_name=username, count=200)
tweets.extend(fetchedTweets)
lastTweetInList = tweets[-1].id - 1
while len(fetchedTweets) > 0:
    fetchedTweets = twitterAPI.user_timeline(screen_name=username, count=200, max_id=lastTweetInList)
    tweets.extend(fetchedTweets)
    lastTweetInList = tweets[-1].id - 1
    print(f"Fetched {len(tweets)} tweets so far.")
Then, you would have to filter for the tweets that fall within your specific timeframe (you have to import datetime):
import datetime

start = datetime.datetime(2020, 1, 1, 0, 0, 0)
end = datetime.datetime(2021, 1, 1, 0, 0, 0)
specificTweets = []
for tweet in tweets:
    if (tweet.created_at > start) and (tweet.created_at < end):
        specificTweets.append(tweet)
You can now see all the tweets that fall within your timeframe in specificTweets.
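If you prefer not to manage max_id yourself, tweepy's Cursor can do the paging for you. A minimal sketch, assuming the same authenticated twitterAPI object, the v1.1 user_timeline endpoint used above, and a hypothetical account name; the 3,200-tweet ceiling still applies:
import datetime
import tweepy

start = datetime.datetime(2020, 1, 1)
end = datetime.datetime(2021, 1, 1)

specificTweets = []
# Cursor pages through the timeline (200 tweets per request) until the API stops returning results.
for tweet in tweepy.Cursor(twitterAPI.user_timeline,
                           screen_name="some_username",  # hypothetical account
                           count=200).items():
    if start < tweet.created_at < end:
        specificTweets.append(tweet)

print(len(specificTweets))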
I am querying a fusion table using a request like this:
https://www.googleapis.com/fusiontables/v2/query?alt=media&sql=SELECT ROWID,State,County,Name,Location,Geometry,[... many more ...] FROM <table ID>
The results from this query exceed 10MB, so I must include the alt=media option (for smaller queries, where I can remove this option, the problem does not exist). The response is in csv, as promised by the documentation. The first line of the response appears to be a header line which exactly matches my query string (except that it shows rowid instead of ROWID):
rowid,State,County,Name,Location,Geometry,[... many more ...]
The following rows, however, do not include the row id. Each line begins with the second item requested in the query; it seems as though the row id was ignored:
WV,Calhoun County,"Calhoun County, WV, USA",38.858 -81.1196,"<Polygon><outerBoundaryIs>...
Is there any way around this? How can I retrieve row ids from a table when the table is large?
Missing ROWIDs are also present for "media" requests made via Google's Python API Client library, e.g.
def doGetQuery(query):
    request = FusionTables.query().sqlGet_media(sql=query)
    response = request.execute()
    ft = {'kind': "fusiontables#sqlresponse"}
    s = response.decode()  # bytestring to string
    data = []
    for i in s.splitlines():
        data.append(i.split(','))
    ft['columns'] = data.pop(0)  # ['Rowid', 'A', 'B', 'C']
    ft['rows'] = data            # [['a1', 'b1', 'c1'], ...]
    return ft
You may at least have one fewer issue than I do: this sqlGet_media method can only be made as a "pure" GET request; a query long enough (2-8k characters) to be sent as an overridden POST generates a 502 Bad Gateway error, even for tiny response sizes such as the result of SELECT COUNT(). The same query as a non-media request works flawlessly (provided the response size is not over 10 MB, of course).
The solution to both your and my issue is to batch the request using OFFSET and LIMIT, such that the 10 MB response limit isn't hit. Estimate the size of an average response row at the call site, pass it into a wrapper function, and let the wrapper handle adding OFFSET and LIMIT to your input SQL, and then collate the multi-query result into a single output of the desired format:
from math import floor

def doGetQuery(query, rowSize=1.):
    limitValue = floor(9.5 * 1024 / rowSize)  # rowSize in kB
    offsetValue = 0
    ft = {'kind': "fusiontables#sqlresponse"}
    data = []
    done = False
    while not done:
        tail = ' '.join(['OFFSET', str(offsetValue), 'LIMIT', str(limitValue)])
        request = FusionTables.query().sqlGet(sql=query + ' ' + tail)
        response = request.execute()
        offsetValue += limitValue
        if 'rows' in response.keys():
            data.extend(response['rows'])
        # Check the exit condition.
        if 'rows' not in response.keys() or len(response['rows']) < limitValue:
            done = True
        if 'columns' not in ft.keys() and 'columns' in response.keys():
            ft['columns'] = response['columns']
    ft['rows'] = data
    return ft
This wrapper can be extended to handle actual desired uses of OFFSET and LIMIT. Ideally, FusionTables (or other REST API methods) would provide list() and list_next() methods for native pagination, but no such features are present for FusionTable::Query. Given the horrendously slow rate of FusionTables API / functionality updates, I wouldn't expect ROWIDs to magically appear in alt=media downloads, or for GET-turned-POST media-format queries to ever work, so writing your own wrapper is going to save you a lot of headache.
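For example, a hypothetical call site, where the 0.5 kB per-row estimate and the column list are placeholders for your own table:
# Estimate ~0.5 kB per row, so each batch asks for floor(9.5 * 1024 / 0.5) = 19456 rows,
# keeping every individual response safely under the 10 MB cap.
sql = 'SELECT ROWID, State, County, Name, Location, Geometry FROM <table ID>'
result = doGetQuery(sql, rowSize=0.5)
print(len(result['rows']), 'rows,', len(result['columns']), 'columns')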
I'm trying to pull data from Twitter over a month or so for a project. There are <10,000 tweets over this time period with this hashtag, but I only seem to be getting tweets from the current day. I got 68 yesterday and 80 today; both sets were timestamped with the current day.
api = tweepy.API(auth)
igsjc_tweets = api.search(q="#igsjc", since='2014-12-31', count=100000)
ipdb> len(igsjc_tweets)
80
I know for certain there should be more than 80 tweets. I've heard that Twitter rate-limits to 1,500 tweets at a time, but does it also limit results to a certain day? Note that I've also tried the Cursor approach with
igsjc_tweets = tweepy.Cursor(api.search, q="#igsjc", since='2015-12-31', count=10000)
This also only gets me 80 tweets. Any tips or suggestions on how to get the full data would be appreciated.
Here's the official tweepy tutorial on Cursor. Note: you need to iterate through the Cursor, as shown below. Also, there is a maximum count that you can pass to .items(), so it's probably a good idea to pull month-by-month or something similar, and a good idea to sleep between calls. HTH!
igsjc_tweets_jan = [tweet for tweet in tweepy.Cursor(
api.search, q="#igsjc", since='2016-01-01', until='2016-01-31').items(1000)]
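Building on that, a month-by-month sketch might look like the following (the month list, the 1,000-item cap per month, and the 60-second pause are arbitrary placeholders to adapt):
import time
import tweepy

months = [('2016-01-01', '2016-01-31'),
          ('2016-02-01', '2016-02-29'),
          ('2016-03-01', '2016-03-31')]

igsjc_tweets = []
for since, until in months:
    # Pull up to 1,000 tweets for each month, then pause to stay clear of rate limits.
    batch = [tweet for tweet in tweepy.Cursor(
        api.search, q="#igsjc", since=since, until=until).items(1000)]
    igsjc_tweets.extend(batch)
    time.sleep(60)

print(len(igsjc_tweets))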
First, tweepy cannot retrieve data that is too old using its search API; I don't know the exact limitation, but it is maybe only a month or two back.
Anyway, you can use this piece of code to get tweets. I ran it to get tweets from the last few days and it works for me.
Notice that you can refine it and add geocode information; I left an example commented out for you.
flag = True
last_id = None
while flag:
    flag = False
    for status in tweepy.Cursor(api.search,
                                # q='geocode:"37.781157,-122.398720,1mi" since:'+since+' until:'+until+' include:retweets',
                                q="#igsjc",
                                since='2015-12-31',
                                max_id=last_id,
                                result_type='recent',
                                include_entities=True,
                                monitor_rate_limit=False,
                                wait_on_rate_limit=False).items(300):
        tweet = status._json
        print(tweet)
        flag = True  # there is still more data to collect
        last_id = status.id - 1  # for next time (max_id is inclusive)
Good luck
Using twython, I am trying to retrieve the list of all the followers of a particular ID which has more than 40k followers, but I am running into the error below:
Twitter API returned a 429 (Too Many Requests), rate limit exceeded. How can I overcome this issue?
Below is the snippet; I am printing the user name and time zone information.
next_cursor = -1
while next_cursor:
    search = twitter.get_followers_list(screen_name='ndtvgadgets', cursor=next_cursor)
    for result in search['users']:
        time_zone = result['time_zone'] if result['time_zone'] is not None else "N/A"
        print(result["name"].encode('utf-8') + ' ' + time_zone.encode('utf-8'))
    next_cursor = search["next_cursor"]
Change the search line to:
search = twitter.get_followers_list(screen_name='ndtvgadgets',count=200,cursor=next_cursor)
Then import the time module and insert time.sleep(60) between each API call.
It'll take ages for a user with 41K followers (around three and a half hours for the ndtvgadgets account), but it should work. With the count increased to 200 (the maximum) you're effectively requesting 200 results every minute. If there are other API calls in your script in addition to twitter.get_followers_list you might want to pad the sleep time a little or insert a sleep call after each one.
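Putting those two changes together, the loop might look like this (a sketch; the 60-second pause assumes get_followers_list is the only rate-limited call in the script):
import time

next_cursor = -1
while next_cursor:
    search = twitter.get_followers_list(screen_name='ndtvgadgets',
                                        count=200,  # maximum page size
                                        cursor=next_cursor)
    for result in search['users']:
        time_zone = result['time_zone'] if result['time_zone'] is not None else "N/A"
        print(result["name"].encode('utf-8') + ' ' + time_zone.encode('utf-8'))
    next_cursor = search["next_cursor"]
    time.sleep(60)  # one request per minute, as suggested above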