I created a tweepy listener to collect tweets into a local MongoDB during the first presidential debate, but I have realized that the tweets I have been collecting are limited to 140 characters, and many are being cut off at that limit. In my stream I had defined tweet_mode='extended', which I thought would resolve this issue; however, I am still not able to retrieve the full text of tweets longer than 140 characters. Below is my code:
auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Create a listener MyListener that streams and stores tweets to a local MongoDB
class MyListener(StreamListener):
    def __init__(self):
        super().__init__()
        self.list_of_tweets = deque([], maxlen=5)

    def on_data(self, data):
        try:
            tweet_text = json.loads(data)
            self.list_of_tweets.append(tweet_text)
            self.print_list_of_tweets()
            db['09292020'].insert_one(tweet_text)
        except:
            None

    def on_error(self, status):
        print(status)

    def print_list_of_tweets(self):
        display.clear_output(wait=True)
        for index, tweet_text in enumerate(self.list_of_tweets):
            m = '{}. {}\n\n'.format(index, tweet_text)
            print(m)

debate_stream = Stream(auth, MyListener(), tweet_mode='extended')
debate_stream = debate_stream.filter(track=['insert', 'debate', 'keywords', 'here'])
Any input on how I can obtain the full extended tweet via this listener would be greatly appreciated!
tweet_mode=extended has no effect on the legacy standard streaming API, as Tweets are delivered in both truncated (140) and extended (280) form by default.
So you'll want your Stream Listener set up like this:
debate_stream = Stream(auth, MyListener())
What you should be seeing is that the JSON object for longer Tweets has a text field of 140 characters, but contains an additional dictionary called extended_tweet which in turn contains a full_text field with the full Tweet text.
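For example, inside on_data you could read the full text directly from the parsed JSON. A minimal sketch along the lines of the listener in the question (extended_tweet and full_text are the keys described above; the fallback to text covers Tweets that were never truncated):

def on_data(self, data):
    tweet = json.loads(data)
    # Longer Tweets nest the untruncated text under "extended_tweet"
    if 'extended_tweet' in tweet:
        full_text = tweet['extended_tweet']['full_text']
    else:
        full_text = tweet.get('text', '')
    print(full_text)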
You can try changing your second to last line to
debate_stream = Stream(auth, MyListener()).extended_tweet["full_text"]
Not sure if this will work, but try it out.
Related
I'm currently trying to use Tweepy to get a bunch of recent tweets from one user, without including retweets. Originally I was using:
tweets = []
for i in tweepy.Cursor(api.user_timeline,
                       screen_name='user',
                       tweet_mode='extended').items():
    tweets.append(i.full_text)
Using api.user_timeline gave me about 3400 results, but this included retweets.
I then tried using api.search_tweets as follows:
tweets = []
for i in tweepy.Cursor(api.search_tweets,
                       q='from:user -filter:retweets',
                       tweet_mode='extended').items():
    tweets.append(i.full_text)
This only gives me 148 results, even though the user has tweeted a lot more than that. Is there any way I can use api.search_tweets and get more tweets? I tried adding in since:2021-06-01, but that still didn't work; I also tried adding a count parameter into the mix, but that didn't work either.
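One workaround worth trying (not from the thread, just a sketch under the assumption that user_timeline keeps returning retweets): stay on api.user_timeline and skip retweets on the client side. Statuses that are retweets carry a retweeted_status attribute:

tweets = []
for status in tweepy.Cursor(api.user_timeline,
                            screen_name='user',  # placeholder handle, as in the question
                            tweet_mode='extended').items():
    # Retweets expose a "retweeted_status" attribute; keep only original tweets
    if not hasattr(status, 'retweeted_status'):
        tweets.append(status.full_text)

Also note that the standard search endpoint only indexes roughly the last week of Tweets, which would explain why search_tweets returns far fewer results than user_timeline.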
I am unable to get the entire text of a tweet - every tweet is limited to only 140 characters and ends in "..."
I am already using full_text and tweet_mode='extended', but it still does not work.
tweets = tw.Cursor(api.search, q=search_words, lang="en", since=date_since, until=date_until, tweet_mode='extended').items(10)
users_locs = [[tweet.user.screen_name, tweet.user.location, tweet.full_text, tweet.created_at] for tweet in tweets]
tweet_text = pd.DataFrame(data=users_locs, columns=['user', 'location', 'text', 'date'])
Could you help me with this?
Check the version of Tweepy you are using and update to the latest.
I used three different methods (including yours) and I was able to get the "full_text" with "tweet_mode='extended'" on Tweepy 3.9.0.
The methods I used are:
status_list = api.statuses_lookup(list_of_ids, trim_user=False, tweet_mode="extended")
tweet_status = api.get_status(single_id, tweet_mode="extended")
tweets = tw.Cursor(api.search,q=search_words,lang="en",since=date_since,until=date_until, tweet_mode='extended').items(10)
And got the full_text in every one of them.
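If you want to confirm which version you have before upgrading, a quick check (assuming a pip-managed environment) is:

import tweepy
print(tweepy.__version__)  # e.g. expect 3.9.0 or newer

and then something like pip install --upgrade tweepy from the command line to update.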
I am trying to bulk-download the text visible to the "end user" from 10-K SEC EDGAR reports (I don't care about tables) and save it in a text file. I found the code below on YouTube; however, I am facing two challenges:
I am not sure if I am capturing all of the text, and when I print the text fetched from the URL below, I get very odd output (special characters, e.g. at the very end of the print-out).
I can't seem to save the text to a txt file; I'm not sure if this is due to encoding (I am entirely new to programming).
import re
import requests
import unicodedata
from bs4 import BeautifulSoup

def restore_windows_1252_characters(restore_string):
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    return re.sub(r'[\u0080-\u0099]', to_windows_1252, restore_string)

# define the url to specific html_text file
new_html_text = r"https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt"

# grab the response
response = requests.get(new_html_text)
page_soup = BeautifulSoup(response.content, 'html5lib')
page_text = page_soup.html.body.get_text(' ', strip=True)

# normalize the text, remove characters. Additionally, restore missing windows characters.
page_text_norm = restore_windows_1252_characters(unicodedata.normalize('NFKD', page_text))

# print: this works however gives me weird special characters in the print (e.g., at the very end)
print(page_text_norm)

# save to file: this only gives me an empty text file
with open('testfile.txt', 'w') as file:
    file.write(page_text_norm)
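One possible cause of the empty file, purely my assumption since the thread does not confirm it: open('testfile.txt', 'w') without an encoding uses the platform default (often cp1252 on Windows), and writing a character it cannot encode raises UnicodeEncodeError, leaving the file empty or truncated. Passing an explicit encoding avoids that:

# page_text_norm is the string built by the code above
with open('testfile.txt', 'w', encoding='utf-8') as file:
    file.write(page_text_norm)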
Try this. If you give an example of the data you expect, it will be easier for people to understand your needs.
from simplified_scrapy import SimplifiedDoc,req,utils
url = 'https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt'
html = req.get(url)
doc = SimplifiedDoc(html)
# text = doc.body.text
text = doc.body.unescape() # Converting HTML entities
utils.saveFile("testfile.txt",text)
$http.get("https://www.googleapis.com/youtube/v3/playlistItems?part=snippet&playlistId=PLFgquLnL59alCl_2TQvOiD5Vgm1hCaGSI&key={mykey}&maxResults=10")
I used playlistItems but couldn't get the statistics part, which contains the duration of the video. Do I need to call twice? Get the video ID and make another call? Or am I missing something in this case?
For whatever reason, playlistItems does not include some things like statistics or category. You'll need to make a separate call using the video ID and https://developers.google.com/youtube/v3/docs/videos/list in order to get those fields.
This is how I do it (using Python but you can adapt it for whatever language you are using with http requests and JSON parsing)
url = "https://www.googleapis.com/youtube/v3/videos?id=" + videoId
+ "&key=" + DEVELOPER_KEY + "&part=snippet,contentDetails"
r = requests.get(url)
metadata = r.json()["items"][0]
channelName = metadata["snippet"]["channelTitle"]
publishedTime = metadata["snippet"]["publishedAt"]
duration = metadata["contentDetails"]["duration"]
duration is in a strange format that looks like
PT4M11S
meaning 4 minutes 11 seconds. You will have to "parse" this.
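That string is an ISO 8601 duration. A minimal way to parse it with just the standard library (a sketch that assumes only hour/minute/second components, which is what YouTube durations typically contain):

import re

def parse_iso8601_duration(duration):
    # Handles values like "PT4M11S", "PT1H2M3S" or "PT45S"
    match = re.match(r'PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?', duration)
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds

print(parse_iso8601_duration("PT4M11S"))  # 251 seconds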
I'm trying to get all connections (interactions) on a Facebook page within a certain time period. I'm using the koala gem and filtering the request with "since: 1.month.ago.to_i", which seems to work fine. However, this gives me 25 results at a time. If I increase the limit to 446 (the maximum, it seems) that works better. But... if I use .next_page to give me the next set of results within the given time range, it instead just gives me the next set of results without obeying the time range.
For example, let's say I don't increase the limit and I get 25 results per request. I do something like:
@api.get_connections(@fan_page_id, "feed", {since: 1.month.ago.to_i})
Let's assume there are 30 results for this and the first request gets me 25 (the default limit). Then, if I do this:
@api.get_connections(@fan_page_id, "feed", {since: 1.month.ago.to_i}).next_page
instead of returning the last 5 results, it returns 25 more, 20 of which fall outside the "since: 1.month.ago.to_i" range. I have a while loop cycling through the pages, but I don't know where to stop, since it just keeps returning results no matter what as long as I keep calling .next_page.
Is there a better way of doing this?
If not, what's the best way to check that the post I'm looking at in the loop is still within the time range I want, and to break out if not?
here's my code:
def perform(fan_page_id, pagination_options = {})
  @since_date = pagination_options[:since_date] if pagination_options[:since_date]
  @limit = pagination_options[:limit] if pagination_options[:limit]
  @oauth = Koala::Facebook::OAuth.new
  @api = Koala::Facebook::API.new @oauth.get_app_access_token
  fb_page = @api.get_object(fan_page_id)
  @fan_page_id = fb_page["id"]
  # Collect all the users who liked, commented, or liked *and* commented on a post
  process_posts(@api.get_connections(@fan_page_id, "feed", {since: @since_date})) do |post|
    ## do stuff based on each post
  end
end

private

# Take each post from the specified feed and perform the provided
# code on each post in that feed.
#
# @param [Koala::Facebook::API::GraphCollection] feed An API response containing a page's feed
def process_posts(feed, options = {})
  raise ArgumentError unless block_given?
  current_feed = feed
  begin
    current_feed.each { |post| yield(post) }
    current_feed = current_feed.next_page
  end while current_feed.any?
end
current = @api.get_connections(@fan_page_id, "feed", {since: 1.month.ago.to_i})
next_feed = current.next_page
next_feed = next_feed.next_page
.....
Please try these; I think they will work.
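The check itself is language-agnostic: compare each post's created_time against your cutoff and stop paging once a post falls outside the window. A self-contained Python sketch of that stopping logic (the pages and timestamps here are made-up placeholders, not Koala objects):

from datetime import datetime, timezone

# Placeholder "pages" standing in for successive next_page results
pages = [
    [{"id": 1, "created_time": "2014-05-20T12:00:00+00:00"},
     {"id": 2, "created_time": "2014-05-10T12:00:00+00:00"}],
    [{"id": 3, "created_time": "2014-04-01T12:00:00+00:00"}],  # older than the cutoff
]

cutoff = datetime(2014, 5, 1, tzinfo=timezone.utc)

for page in pages:
    for post in page:
        if datetime.fromisoformat(post["created_time"]) < cutoff:
            # This post is outside the window, so everything after it will be too: stop.
            break
        print("processing", post["id"])
    else:
        continue  # inner loop finished without breaking: move on to the next page
    break         # inner loop broke: stop paging entirely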