tweepy Streaming API: full text - twitter

I am using the tweepy streaming API to get tweets containing a particular hashtag. The problem I am facing is that I am unable to extract the full text of a tweet from the Streaming API: only 140 characters are available, and after that the text gets truncated.
Here is the code:
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

def analyze_status(text):
    # True for retweets (text starts with "RT")
    if 'RT' in text[0:3]:
        return True
    else:
        return False

class MyStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        if not analyze_status(status.text):
            with open('fetched_tweets.txt', 'a') as tf:
                tf.write(status.text.encode('utf-8') + '\n\n')
            print(status.text)

    def on_error(self, status):
        print("Error Code : " + str(status))
def test_rate_limit(api, wait=True, buffer=.1):
    """
    Tests whether the rate limit of the last request has been reached.
    :param api: The `tweepy` api instance.
    :param wait: A flag indicating whether to wait for the rate limit reset
                 if the rate limit has been reached.
    :param buffer: A buffer time in seconds that is added on to the waiting
                   time as an extra safety margin.
    :return: True if it is ok to proceed with the next request. False otherwise.
    """
    # Get the number of remaining requests
    remaining = int(api.last_response.getheader('x-rate-limit-remaining'))
    # Check if we have reached the limit
    if remaining == 0:
        limit = int(api.last_response.getheader('x-rate-limit-limit'))
        reset = int(api.last_response.getheader('x-rate-limit-reset'))
        # Parse the UTC time
        reset = datetime.fromtimestamp(reset)
        # Let the user know we have reached the rate limit
        print("0 of {} requests remaining until {}.".format(limit, reset))
        if wait:
            # Determine the delay and sleep
            delay = (reset - datetime.now()).total_seconds() + buffer
            print("Sleeping for {}s...".format(delay))
            sleep(delay)
            # We have waited for the rate limit reset. OK to proceed.
            return True
        else:
            # We have reached the rate limit. The user needs to handle the rate limit manually.
            return False
    # We have not reached the rate limit
    return True
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth=api.auth, listener=myStreamListener,
                         tweet_mode='extended')
myStream.filter(track=['#bitcoin'], async=True)
Does anyone have a solution?

tweet_mode=extended will have no effect in this code, since the Streaming API does not support that parameter. If a Tweet contains longer text, it will contain an additional object in the JSON response called extended_tweet, which will in turn contain a field called full_text.
In that case, you'll want something like print(status.extended_tweet.full_text) to extract the longer text.
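A minimal sketch of reading that in a listener (the hasattr guard and the listener name are assumptions here, since extended_tweet is only present in the payload for longer tweets; as the answers below note, it is exposed as a dictionary):
class ExtendedTextListener(tweepy.StreamListener):
    def on_status(self, status):
        # extended_tweet only appears in the payload when the text exceeds 140 chars
        if hasattr(status, 'extended_tweet'):
            print(status.extended_tweet['full_text'])
        else:
            print(status.text)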

There is a Boolean available in the Twitter stream: status.truncated is True when the message contains more than 140 characters, and only then is the extended_tweet object available:
if not status.truncated:
    text = status.text
else:
    text = status.extended_tweet['full_text']
This works only when you are streaming tweets. When you are collecting older tweets using the API method you can use something like this:
tweets = api.user_timeline(screen_name='whoever', count=5, tweet_mode='extended')
for tweet in tweets:
    print(tweet.full_text)
The full_text field contains the full text of the tweet, whether it was truncated or not.

You have to enable extended tweet mode like so:
s = tweepy.Stream(auth, l, tweet_mode='extended')
Then you can print the extended tweet. But remember, due to the Twitter API you have to make sure the extended tweet exists, otherwise it will throw an error:
class listener(StreamListener):
    def on_status(self, status):
        # not every status has extended_tweet, so guard the access
        try:
            print(status.extended_tweet['full_text'])
        except AttributeError:
            print(status.text)
        return True

    def on_error(self, status_code):
        if status_code == 420:
            return False

l = listener()
Worked for me.

Building upon @AndyPiper's answer, you can check whether the extended tweet is there with either a try/except:
def get_tweet_text(tweet):
    try:
        return tweet.extended_tweet['full_text']
    except AttributeError:
        return tweet.text
or check against the inner JSON:
def get_tweet_text(tweet):
    if 'extended_tweet' in tweet._json:
        return tweet.extended_tweet['full_text']
    else:
        return tweet.text
Note that extended_tweet is a dictionary object, so "tweet.extended_tweet.full_text" doesn't actually work and will throw an error.
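For instance (illustrative; the second line is exactly the failing pattern):
full_text = tweet.extended_tweet['full_text']  # works: dictionary access
full_text = tweet.extended_tweet.full_text     # AttributeError: 'dict' object has no attribute 'full_text'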

In addition to the previous answer: in my case it worked only as status.extended_tweet['full_text'], because status.extended_tweet is nothing but a dictionary.

This is what worked for me:
status = tweet
if 'extended_tweet' in status._json:
    status_json = status._json['extended_tweet']['full_text']
elif 'retweeted_status' in status._json and 'extended_tweet' in status._json['retweeted_status']:
    status_json = status._json['retweeted_status']['extended_tweet']['full_text']
elif 'retweeted_status' in status._json:
    status_json = status._json['retweeted_status']['full_text']
else:
    status_json = status._json['full_text']
print(status_json)
https://github.com/tweepy/tweepy/issues/935 - implemented from here; I needed to change what they suggest, but the idea stays the same.

I use the following function:
def full_text_tweeet(id_):
    status = api.get_status(id_, tweet_mode="extended")
    try:
        return status.retweeted_status.full_text
    except AttributeError:
        return status.full_text
and then call it in my list:
tweets_list = []

# foreach through all tweets pulled
for tweet in tweets:
    # printing the text stored inside the tweet object
    tweet_list = [str(tweet.id), str(full_text_tweeet(tweet.id))]
    tweets_list.append(tweet_list)

Try this; it is the simplest and fastest way:
def on_status(self, status):
    if hasattr(status, "retweeted_status"):  # Check if Retweet
        try:
            print(status.retweeted_status.extended_tweet["full_text"])
        except AttributeError:
            print(status.retweeted_status.text)
    else:
        try:
            print(status.extended_tweet["full_text"])
        except AttributeError:
            print(status.text)
Visit the link; it explains how extended tweets can be retrieved.

Related

Adding rules to a running stream (tweepy, StreamingClient)

I am trying to collect all Tweets of certain users and the replies to those Tweets. Collecting the original Tweets works out fine, but collecting the replies does not. As soon as on_tweet is called (when the stream receives a Tweet), I try to add a rule in_reply_to_tweet_id:<id of incoming tweet> to my stream so that it also streams those replies. Yet the code below doesn't work. I checked with get_rules after the stream was closed, and no rule had been added. I also tried adding a simple OR: keyword rule, which was also not added, so the ID is not the problem.
Thanks, any help is appreciated!:)
class stream(tweepy.StreamingClient):
    def __init__(self, token):
        tweepy.StreamingClient.__init__(self, token)
        self.raw_tweets = []
        self.raw_replies = []

    # this method is called whenever the stream receives a tweet
    def on_tweet(self, tweet):
        # checking whether the new tweet in the stream is an original tweet or a reply
        if tweet.conversation_id == tweet.id:
            # add the id to the rules so that replies to the tweet are also streamed
            # this is where the problem is
            id_as_str = str(tweet.id)
            new_rule = 'OR in_reply_to_tweet_id:' + id_as_str
            self.add_rules(add=tweepy.StreamRule(new_rule), dry_run=True)
            self.raw_tweets.append(tweet)
            print('this is an original: ' + tweet.text)
        else:
            self.raw_replies.append(tweet)
            print('this is a reply: ' + tweet.text)
        return self.raw_tweets
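One thing worth noting when reading the snippet: add_rules(..., dry_run=True) only tests the syntax of a rule without submitting it, which alone would explain why get_rules shows nothing afterwards. A minimal sketch of actually submitting a rule to a StreamingClient (the bearer token and rule value are illustrative; conversation_id: is the documented filtered-stream operator for matching Tweets, including replies, in a conversation):
import tweepy

# Hypothetical bearer token; substitute your own credentials.
client = tweepy.StreamingClient("BEARER_TOKEN")

# dry_run defaults to False, so this call actually submits the rule.
# conversation_id: matches Tweets (including replies) in the given conversation.
client.add_rules(tweepy.StreamRule("conversation_id:1234567890"))

# Verify the rule is now active on the stream.
print(client.get_rules())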

Extract geo location of a Tweet

Is there any way we could extract the geo location of a tweet, even when the user did not enable location?
I am looking to collect tweets from specific states for sentiment analysis. Please help.
class StdOutListener(StreamListener):  # class allows us to print tweets
    def on_data(self, data):
        full_tweet = json.loads(data)
        ## this makes sure that you won't get clipped tweets
        if 'extended_tweet' in full_tweet:
            tweet_text = full_tweet['extended_tweet'].get('full_text')
        else:
            tweet_text = full_tweet.get('text')
        tweet_time = full_tweet.get('created_at')
        tweet_lang = full_tweet.get('lang')
        if tweet_lang == 'en' and tweet_text is not None and 'RT @' not in tweet_text:
            ## this is only taking the text and the time stamp, which makes the DB very space efficient
            tweetObject = {
                "text": tweet_text,
                "time": tweet_time,
            }
            scPrimary.insert_one(tweetObject)
        return True
No, it is not possible unless the user has enabled the feature (it is off by default); there is more in the Twitter documentation.
If the feature is enabled, then when retrieving a tweet (with any of the available methods) you can get this information from the Status object (the model returned by tweepy API methods):
status.coordinates      # e.g. [-49.319543, -16.679431]
status.place.full_name  # e.g. The Netherlands
status.user.location    # free-form text set by the user
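Since the goal is tweets from specific states, note that the streaming API can also filter by bounding box directly. A rough sketch, assuming the StdOutListener from the question and an authenticated api object (the coordinates are illustrative, roughly a box around Texas):
import tweepy

# locations takes one or more bounding boxes as
# [SW longitude, SW latitude, NE longitude, NE latitude]
stream = tweepy.Stream(auth=api.auth, listener=StdOutListener())
stream.filter(locations=[-106.6, 25.8, -93.5, 36.5])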

Storing streamed tweets in a list for further analysis

I am building a data mining app to collect tweets using the Twitter streaming API (via tweepy) and run a suite of NLP algorithms on them. So far, all I have been able to do is get the tweets written into an external file. Since the volume of tweets I am going to collect is about 100 at a time (pretty small), and because of deployment concerns, I wish to collect these tweets into a dictionary or list for further analysis. However, I have failed in doing this. The code I have so far is given below:
import tweepy

class MyStreamListener(tweepy.StreamListener):
    def __init__(self, api=None):
        super(MyStreamListener, self).__init__()
        self.num_tweets = 0
        self.tweets = []

    def on_status(self, status):
        #print(status.text)
        self.num_tweets += 1
        self.tweets.append(status.text)
        if self.num_tweets > 100:
            return False

def getstreams(keyword):
    CONSUMER_KEY = ''
    CONSUMER_SECRET = ''
    ACCESS_TOKEN = ''
    ACCESS_SECRET = ''
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    api = tweepy.API(auth, wait_on_rate_limit=True)

    myStreamListener = MyStreamListener()
    myStream = tweepy.Stream(auth=api.auth, listener=myStreamListener)
    tweet_list = myStream.filter(track=[keyword])
    return tweet_list.tweets

getstreams('Starbucks')
However, when I run this, all I get is:
AttributeError: 'NoneType' object has no attribute 'tweets'
pointing to the line:
return tweet_list.tweets
I'd be grateful if anyone could explain how to overcome this issue and shed light on how to collect n tweets into a list.
You can use the on_data function in your class.
def on_data(self, data):
    # Parse data, which is a JSON string, into a Python object
    tweet = json.loads(data)
    # my_tweet is our list declared globally
    my_tweet.append(tweet)
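Alternatively, building on the question's own code: Stream.filter() blocks until on_status returns False and then returns None, which is exactly why tweet_list.tweets fails. A sketch of the fix inside getstreams (names taken from the question):
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth=api.auth, listener=myStreamListener)

# filter() returns None; it blocks until on_status returns False
# (after 100 tweets in the listener above), so read the collected
# texts from the listener instance afterwards:
myStream.filter(track=[keyword])
return myStreamListener.tweets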

requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read, 512 more expected)', IncompleteRead

I wanted to write a program to fetch tweets from Twitter and then do sentiment analysis. I wrote the following code and got the error even after importing all the necessary libraries. I'm relatively new to data science, so please help me.
I could not understand the reason for this error:
class TwitterClient(object):
    def __init__(self):
        # keys and tokens from the Twitter Dev Console
        consumer_key = 'XXXXXXXXX'
        consumer_secret = 'XXXXXXXXX'
        access_token = 'XXXXXXXXX'
        access_token_secret = 'XXXXXXXXX'
        api = Api(consumer_key, consumer_secret, access_token, access_token_secret)

        # attempt authentication
        try:
            # create OAuthHandler object
            self.auth = OAuthHandler(consumer_key, consumer_secret)
            # set access token and secret
            self.auth.set_access_token(access_token, access_token_secret)
            # create tweepy API object to fetch tweets
            self.api = tweepy.API(self.auth)
        except:
            print("Error: Authentication Failed")

def preprocess(tweet, ascii=True, ignore_rt_char=True, ignore_url=True,
               ignore_mention=True, ignore_hashtag=True,
               letter_only=True, remove_stopwords=True, min_tweet_len=3):
    sword = stopwords.words('english')
    if ascii:  # maybe remove lines with ANY non-ascii character
        for c in tweet:
            if not (0 < ord(c) < 127):
                return ''
    tokens = tweet.lower().split()  # to lower, split
    res = []
    for token in tokens:
        if remove_stopwords and token in sword:  # ignore stopword
            continue
        if ignore_rt_char and token == 'rt':  # ignore 'retweet' symbol
            continue
        if ignore_url and token.startswith('https:'):  # ignore url
            continue
        if ignore_mention and token.startswith('@'):  # ignore mentions
            continue
        if ignore_hashtag and token.startswith('#'):  # ignore hashtags
            continue
        if letter_only:  # ignore digits
            if not token.isalpha():
                continue
        elif token.isdigit():  # otherwise unify digits
            token = '<num>'
        res += token,  # append token
    if min_tweet_len and len(res) < min_tweet_len:  # ignore tweets with fewer than n tokens
        return ''
    else:
        return ' '.join(res)

for line in api.GetStreamSample():
    if 'text' in line and line['lang'] == u'en':  # step 1
        text = line['text'].encode('utf-8').replace('\n', ' ')  # step 2
        p_t = preprocess(text)
Assume all the necessary libraries are imported. The error is on line 69.
for line in api.GetStreamSample():
    if 'text' in line and line['lang'] == u'en':  # step 1
        text = line['text'].encode('utf-8').replace('\n', ' ')  # step 2
        p_t = preprocess(text)
I tried checking on the internet the reason for the error but could not get any solution.
Error was:
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read, 512 more expected)', IncompleteRead(0 bytes read, 512 more expected))
I'm using Python 2.7 and requests version 2.14, the latest one.
If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close. This can lead to inefficiency with connections. If you find yourself partially reading request bodies (or not reading them at all) while using stream=True, you should make the request within a with statement to ensure it’s always closed:
with requests.get('http://httpbin.org/get', stream=True) as r:
    # Do things with the response here.
    ...
I had the same problem but without stream; as stone mini said, just apply the with clause to make sure your request is closed before making a new one.
with requests.request("POST", url_base, json=task, headers=headers) as report:
    print('report: ', report)
Actually, the problem is with your Django 2.7 or earlier based application: those Django versions by default allow only 2.5 MB of in-memory upload data for the request body.
I was facing the same issue with a Django 2.7 based application, and I just updated the settings.py file of my Django application, where my urls (endpoints) were working:
DATA_UPLOAD_MAX_MEMORY_SIZE = None
I just added the above variable in my application's settings.py file.
You can also read about it in the Django settings documentation.
I'm pretty sure this will work for you.

YouTube Data API stopping after 1000 results

I have been trying to retrieve a list of all the subscribers to my YouTube channel. I am using a query of the format:
https://content.googleapis.com/youtube/v3/subscriptions?mySubscribers=true&maxResults=50&part=subscriberSnippet&access_token=xxxx
However, after 1000 results (for example, 20 pages @ 50 per page, or 50 pages @ 20 per page) it stops. Within the documentation it says that myRecentSubscribers should only retrieve 1000, but for mySubscribers there is no limit. Is there some unwritten limit?
I just ran into the same issue with the following code:
import os
import google_auth_oauthlib.flow
import googleapiclient.discovery
import googleapiclient.errors

scopes = ["https://www.googleapis.com/auth/youtube.readonly"]

def main():
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

    api_service_name = "youtube"
    api_version = "v3"
    client_secrets_file = "google_oauth_client_secret.json"

    # Get credentials and create an API client
    flow = google_auth_oauthlib.flow.InstalledAppFlow.from_client_secrets_file(
        client_secrets_file, scopes)
    credentials = flow.run_console()
    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, credentials=credentials)

    next_page_token = ''
    count = 0
    while next_page_token != None:
        print("Getting next page:", next_page_token, count)
        request = youtube.subscriptions().list(
            part="subscriberSnippet",
            maxResults=50,
            mySubscribers=True,
            pageToken=next_page_token
        )
        response = request.execute()
        next_page_token = response.get('nextPageToken')
        subscribers = response.get('items')
        for subscriber in subscribers:
            count += 1
        print('Total subscribers:', count)

if __name__ == "__main__":
    main()
This is the expected behavior if myRecentSubscribers=True but the list should continue with mySubscribers=True...
EDIT: Apparently this is a WONTFIX issue, and the documentation will likely eventually be updated to reflect that (https://issuetracker.google.com/issues/172325507).
