I'm trying to pull data from Twitter over a month or so for a project. There are fewer than 10,000 tweets with this hashtag over that period, but I only seem to be getting tweets from the current day. I got 68 yesterday and 80 today; both batches were timestamped with the current day.
api = tweepy.API(auth)
igsjc_tweets = api.search(q="#igsjc", since='2014-12-31', count=100000)
ipdb> len(igsjc_tweets)
80
I know for certain there should be more than 80 tweets. I've heard that Twitter rate-limits searches to 1,500 tweets at a time, but does it also limit them to a certain day? Note that I've also tried the Cursor approach with
igsjc_tweets = tweepy.Cursor(api.search, q="#igsjc", since='2015-12-31', count=10000)
This also only gets me 80 tweets. Any tips or suggestions on how to get the full data would be appreciated.
Here's the official tweepy tutorial on Cursor. Note: you need to iterate through the Cursor, as shown below. Also, there is a maximum count that you can pass to .items(), so it's probably a good idea to pull month-by-month or something similar (see the sketch after the example), and to sleep between calls. HTH!
igsjc_tweets_jan = [tweet for tweet in tweepy.Cursor(
    api.search, q="#igsjc", since='2016-01-01', until='2016-01-31').items(1000)]
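To make the month-by-month idea concrete, here's a rough sketch (the month list and the 1,000-item cap per call are illustrative assumptions, and the sleep is just a crude way to stay friendly with the rate limit):

import time

months = [('2016-01-01', '2016-01-31'),
          ('2016-02-01', '2016-02-29')]  # extend as needed

igsjc_tweets = []
for since, until in months:
    igsjc_tweets.extend(tweet for tweet in tweepy.Cursor(
        api.search, q="#igsjc", since=since, until=until).items(1000))
    time.sleep(60)  # crude pause between months to avoid hammering the API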
First, tweepy cannot bring back very old data through its search API; the standard search index only goes back about a week or so.
Anyway, you can use this piece of code to get tweets.
I run it to collect tweets from the last few days and it works for me.
Notice that you can refine it and add geocode information; I left an example commented out for you.
flag = True
last_id = None
while flag:
    flag = False
    for status in tweepy.Cursor(api.search,
                                # q='geocode:"37.781157,-122.398720,1mi" since:' + since + ' until:' + until + ' include:retweets',
                                q="#igsjc",
                                since='2015-12-31',
                                max_id=last_id,
                                result_type='recent',
                                include_entities=True).items(300):
        tweet = status._json
        print(tweet)
        flag = True  # there is still more data to collect
        last_id = status.id - 1  # resume below the oldest tweet seen, so the same tweet isn't fetched forever
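One more note: in tweepy, rate-limit handling belongs on the API object, not on the Cursor call — a one-line sketch using tweepy 3.x parameter names:

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)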
Good luck
I would like to find public users on Twitter that have 0 followers. I was thinking of using https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-search, but this doesn't have a way to filter by number of followers. Are there any simple alternatives? (Otherwise, I might have to resort to using a graph/search based approach starting from a random point)
Well, you didn't specify which library you are using to interact with the Twitter API, but regardless of the technology, the underlying concept is the same. I will use the tweepy library in Python for my example.
Start by getting the public users using this. The return type is a list of user objects. The user object has several attributes, which you can learn about here. For now, we are interested in the followers_count attribute: simply loop through the returned objects and check whether the value of this attribute is 0.
Here's how the implementation would look in Python using the tweepy library:
search_query = 'your search query here'
users = api.search_users(q=search_query)  # returns a list of User objects

for user in users:
    if user.followers_count == 0:
        # do stuff if the user has 0 followers
        print(user.screen_name)
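A single search_users call only returns one page of results; if you want to scan more candidates per query, tweepy's Cursor can paginate this endpoint as well. A minimal sketch (keeping in mind that the users/search endpoint only exposes roughly the first 1,000 matches per query):

zero_follower_users = []
for user in tweepy.Cursor(api.search_users, q=search_query).items(1000):
    if user.followers_count == 0:
        zero_follower_users.append(user.screen_name)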
Bird SQL by Perplexity AI allows you to do this simply: https://www.perplexity.ai/sql
Query: Users with 0 followers, and 0 following, with at least 5 tweets
SELECT user_url, full_name, followers_count, following_count, tweet_count
FROM users
WHERE (followers_count = 0)
AND (following_count = 0)
AND (tweet_count >= 5)
ORDER BY tweet_count DESC
LIMIT 10
I think I've found a bug with the date filtering on the delta API.
On one of the email accounts I'm working with through the Office 365 Graph API, I'm finding that the "messages" delta request returns a different number of items than are actually in the folder for the expected time range. There are 150,000 items covering 10 years in the folder, but delta only returns the last 5,000 or so items, covering the last 60 or so days.
Paging Works Fine
When querying the graph API for the folder "Inbox", it reports 154,045 total items and 57,456 unread items.
IUserMailFoldersCollectionPage foldersPage =
    await client.Users[mailboxid].MailFolders.Request().GetAsync();
I can skip over 10,000, 50,000 or more messages using paging.
model.messages = await client.Users[mailboxid].MailFolders[folderid].Messages.Request().Top(top)
    .Skip(skip).GetAsync();
Delta with Date Filter doesn't work
But when looping with the nextLink and deltaLink tokens, the deltaToken appears after 5,000 or so email messages. Basically, it seems to only return results for the last couple of months, even though the filter asks for messages from the last 20 years.
Here is an example of how we generate the delta request; the time is hardcoded here, but in reality it is a variable.
var sFilter = $"receivedDateTime ge {DateTimeOffset.UtcNow.AddYears(-20).ToString("yyyy-MM-dd")}";

model.messages = await client.Users[mailboxid].MailFolders[folderid].Messages.Delta().Request()
    .Header("Prefer", "odata.maxpagesize=" + maxpagesize)
    .Filter(sFilter)
    .OrderBy("receivedDateTime desc")
    .GetAsync();
And then on each paging operation I do the following. "nexttoken" is either the next or delta link depending on what came back from the first request.
model.messages = new MessageDeltaCollectionPage();
model.messages.InitializeNextPageRequest(client, nexttoken);
model.messages = await model.messages.NextPageRequest
    .Header("Prefer", "odata.maxpagesize=" + maxpagesize)
    .GetAsync();
Delta without Filter works
If I run the exact same delta code as above but remove the "Filter" operation on the date, then I get all the messages in the folder.
This isn't a great solution, though, since I normally only need messages from the last year or two; if there are 15 years of messages, it is a huge waste to query everything.
Update on 12/3/2019
I'm still getting this issue. I recently switched back to trying to use Delta again whereas before I was querying everything from the server even though I might only need the last month of data. But that's super wasteful.
This code works fine for most mailboxes but sometimes I encounter a mailbox with this issue.
My code looks like this.
string sStartingTime = startingTime.ToString("yyyy'-'MM'-'dd'T'HH':'mm':'ss") + "Z";

var messageCollectionPage = await client.Users[mailboxsource.GetMailboxIdFromAccountID()].MailFolders[folder.Id].Messages.Delta().Request()
    .Filter("receivedDateTime+ge+" + Uri.EscapeDataString(sStartingTime))
    .Select(select)
    .Header("Prefer", "odata.maxpagesize=" + preferredPageSize)
    .OrderBy("receivedDateTime desc")
    .GetAsync(cancellationToken);
At around 5,000 results, the delta request just stops returning results, even though there are 66K items in the folder.
Paul, my peers confirmed there is indeed a 5000-item limit if you apply $filter to a delta query of the message resource.
Within the next day, the docs will also be updated with this information. Thank you for your patience and support!
I would like to get the list of worklogs from my Jira instance without specifying an issue.
So far I have:
jira = JIRA(basic_auth = ('username', 'password'), server='server_name')
issue = jira.search_issues('key=ISSUE-XXX')[0]
wklog = jira.worklogs(issue)
and I would like to have all the worklogs, i.e. something like:
jira = JIRA(basic_auth = ('username', 'password'), server='server_name')
wklog = jira.worklogs('')
Is it possible? Any suggestions? Thanks in advance!
In order to get all the issues' worklogs without specifying one particular issue or project, you have to loop through all the issues.
To do this, execute a search over the whole system, requesting only the key field:
all_project_keys = jira.search_issues('project is not empty', fields='key')
This returns the first 50 issue keys; the total number of issues in the system is available in the result's total attribute.
Given that number, you can perform further searches by adding the startAt parameter to the call:
jira.search_issues('project is not empty', fields='key', startAt=50)
With this parameter you fetch issues 51 to 100 (or 50 to 99 if you count the first issue as 0). The next call uses startAt=100, and so on, until you have fetched every issue in the system.
If you wish to fetch more than 50 issues per call, add the maxResults parameter:
jira.search_issues('project is not empty', fields='key', maxResults=200)
Once you have finished looping through the system, you will have the list of all the issues, which you can then loop over to retrieve each issue's worklog (see the sketch below).
Unfortunately, I don't think you can fetch all the worklogs of all the issues at once.
EDIT:
Added query in JIRA python.
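Putting the pieces together, a minimal sketch of the full loop using the jira library (the credentials, server name, and page size are placeholders; note that each worklogs call is a separate request, so this is slow on a large instance):

from jira import JIRA

jira = JIRA(basic_auth=('username', 'password'), server='server_name')

all_worklogs = []
start_at = 0
page_size = 50

while True:
    issues = jira.search_issues('project is not empty', fields='key',
                                startAt=start_at, maxResults=page_size)
    if not issues:
        break
    for issue in issues:
        all_worklogs.extend(jira.worklogs(issue))  # one request per issue
    start_at += page_size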
I want to get the users who retweeted my tweets.
$tweets2 = $connection->get("https://api.twitter.com/1.1/statuses/retweets_of_me.json");
This gives me the list of my tweets that have been retweeted by others, but it does not provide any details about who retweeted them. Is there any way to do this?
Can I get these details using the tweet ID?
In version 1.0,
https://api.twitter.com/1/statuses/21947795900469248/retweeted_by.json
was available, but it is not present in version 1.1. I tried this:
https://api.twitter.com/1.1/statuses/retweet/241259202004267009.json
but it does not return any response.
Any idea or help is appreciated.
The scenario is like this:
user1 retweets me 5 times and user2 retweets me 7 times, which means I had 12 retweets.
user1 has 500 followers and user2 has 100 followers, which means my retweet reach was 5x500 + 7x100 = 3200. So, on my webpage, I would like to see 12 retweets and a retweet reach of 3200.
Use this API, https://dev.twitter.com/docs/api/1.1/get/statuses/retweets_of_me, to get the IDs of your tweets that were retweeted.
Then pass comma-separated user_id or screen_name values to https://dev.twitter.com/docs/api/1.1/get/users/lookup to get the info about those users.
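To compute the retweet reach from the scenario above, you also need the retweeters of each tweet; in v1.1 those come from statuses/retweets/:id. Here's a rough sketch in Python with tweepy (the question uses PHP, but the endpoints are the same; api is assumed to be an authenticated tweepy.API, and statuses/retweets/:id only returns up to the 100 most recent retweets):

total_retweets = 0
reach = 0

# statuses/retweets_of_me: your own tweets that others have retweeted
for tweet in api.retweets_of_me(count=100):
    # statuses/retweets/:id: up to 100 most recent retweets of one tweet
    retweets = api.retweets(tweet.id, count=100)
    total_retweets += len(retweets)
    for rt in retweets:
        reach += rt.user.followers_count  # each retweeter's follower count

print(total_retweets, reach)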
Using twython, I am trying to retrieve the list of all the followers of a particular ID that has more than 40K followers, but I am running into the error below:
"Twitter API returned a 429 (Too Many Requests), rate limit exceeded."
How do I overcome this issue? Below is the snippet; I am printing the user name and time zone information.
next_cursor = -1
while next_cursor:
    search = twitter.get_followers_list(screen_name='ndtvgadgets', cursor=next_cursor)
    for result in search['users']:
        time_zone = result['time_zone'] if result['time_zone'] is not None else "N/A"
        print result["name"].encode('utf-8') + ' ' + time_zone.encode('utf-8')
    next_cursor = search["next_cursor"]
Change the search line to:
search = twitter.get_followers_list(screen_name='ndtvgadgets',count=200,cursor=next_cursor)
Then import the time module and insert time.sleep(60) between each API call.
It'll take ages for a user with 41K followers (around three and a half hours for the ndtvgadgets account), but it should work. With the count increased to 200 (the maximum) you're effectively requesting 200 results every minute. If there are other API calls in your script in addition to twitter.get_followers_list you might want to pad the sleep time a little or insert a sleep call after each one.
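Putting the answer together, a minimal sketch of the amended loop (assuming twitter is an authenticated Twython instance; the 60-second sleep keeps you under the roughly 15-requests-per-15-minute window for followers/list):

import time

next_cursor = -1
while next_cursor:
    search = twitter.get_followers_list(screen_name='ndtvgadgets',
                                        count=200, cursor=next_cursor)
    for result in search['users']:
        time_zone = result['time_zone'] or "N/A"
        print(result["name"], time_zone)
    next_cursor = search["next_cursor"]
    time.sleep(60)  # one followers/list call per minute stays under the limit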