I'm currently trying to use Tweepy to get a bunch of recent tweets from one user, without including retweets. Originally I was using:
tweets = []
for i in tweepy.Cursor(api.user_timeline,
                       screen_name='user',
                       tweet_mode='extended').items():
    tweets.append(i.full_text)
Using api.user_timeline gave me about 3400 results, but this included retweets.
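For what it's worth, one workaround would be to stay with api.user_timeline and skip retweets client-side; a rough sketch (assuming retweets expose a retweeted_status attribute, as the standard v1.1 payload does):
# Sketch: keep only the user's own tweets from user_timeline.
# A retweet in the v1.1 payload carries a retweeted_status attribute,
# so skipping those statuses should leave the original tweets only.
tweets = []
for status in tweepy.Cursor(api.user_timeline,
                            screen_name='user',
                            tweet_mode='extended').items():
    if hasattr(status, 'retweeted_status'):
        continue  # skip retweets
    tweets.append(status.full_text)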
I then tried using api.search_tweets as follows:
tweets = []
for i in tweepy.Cursor(api.search_tweets,
                       q='from:user -filter:retweets',
                       tweet_mode='extended').items():
    tweets.append(i.full_text)
This only gives me 148 results, even though the user has tweeted far more than that. Is there any way I can use api.search_tweets and get more tweets? I tried adding since:2021-06-01 to the query, but that didn't work; I also tried adding a count parameter into the mix, but that didn't work either.
Related
Goal: Get all (non-live) videos uploaded in October 2020, ordered by viewCount
My first try was like this:
request = youtube.search().list(
    part="snippet",
    order="viewCount",
    publishedAfter="2020-10-01T00:00:00Z",
    publishedBefore="2020-10-31T23:59:59Z",
    type="video",
    maxResults=50
)
It returns the video YpUR1rHXbqs, which is live (and still running). IMHO it doesn't fit publishedBefore="2020-10-31T23:59:59Z" since it is still running, but never mind.
I try to filter out live videos with:
response = request.execute()
for item in response["items"]:
    if item["snippet"]["liveBroadcastContent"] != 'live':
        print(item["id"]["videoId"])  # keep only non-live results
It returns o6zDG9jYpC0 with 7 624 725 views today and 'liveBroadcastContent' = 'none'. That seems too low to me for the most-viewed video of the month.
So I try adding q="a" to check whether o6zDG9jYpC0 is really the most viewed one:
request = youtube.search().list(
    q='a',
    part="snippet",
    order="viewCount",
    publishedAfter="2020-10-01T00:00:00Z",
    publishedBefore="2020-10-31T23:59:59Z",
    type="video",
    maxResults=50
)
Returns uy30PB5BpV0 with 192 861 990 views :-( and 'liveBroadcastContent'='none'
I try to experiment with q="", q="*", q="%2A", q="+", q=" ", q=None ... , but no luck.
Adding eventType="completed" seems to work well when q is missing/not set, but then it omits videos with 'liveBroadcastContent'='none'.
How can I query all videos please?
Side note: I'm aware of Videos: list (most popular videos) with chart='mostPopular', but it doesn't support publishedAfter/publishedBefore, which is required.
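For completeness, here is a sketch of a two-step fallback (my own rough idea, so treat it as untested): page through search.list, then confirm each candidate with videos.list and drop anything that carries liveStreamingDetails, which is present for live streams and premieres. The helper name top_non_live_videos and the max_pages cap are placeholders:
# Sketch: gather a few pages of search results, then confirm each video with
# videos.list. Videos that are (or were) live streams or premieres carry a
# liveStreamingDetails object, so those can be dropped client-side.
# `youtube` is assumed to be the same authenticated client as above.
def top_non_live_videos(youtube, max_pages=5):
    video_ids = []
    request = youtube.search().list(
        part="id",
        order="viewCount",
        publishedAfter="2020-10-01T00:00:00Z",
        publishedBefore="2020-10-31T23:59:59Z",
        type="video",
        maxResults=50,
    )
    pages = 0
    while request is not None and pages < max_pages:
        response = request.execute()
        video_ids += [item["id"]["videoId"] for item in response["items"]]
        request = youtube.search().list_next(request, response)
        pages += 1

    ranked = []
    for start in range(0, len(video_ids), 50):  # videos.list accepts up to 50 ids per call
        details = youtube.videos().list(
            part="statistics,liveStreamingDetails",
            id=",".join(video_ids[start:start + 50]),
        ).execute()
        for item in details["items"]:
            if "liveStreamingDetails" in item:
                continue  # skip live streams and premieres
            ranked.append((item["id"], int(item["statistics"].get("viewCount", 0))))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)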
I am unable to get the entire text of a tweet: every tweet is limited to only 140 characters and ends in "...".
I am already using full_text and tweet_mode='extended', but it still does not work.
tweets = tw.Cursor(api.search,q=search_words,lang="en",since=date_since,until=date_until, tweet_mode='extended').items(10)
users_locs = [[tweet.user.screen_name, tweet.user.location,tweet.full_text, tweet.created_at] for tweet in tweets]
tweet_text = pd.DataFrame(data=users_locs,columns=['user', 'location','text','date'])
Could you help me with this?
Check the version of Tweepy you are using and update to the latest.
I used three different ways (including yours) and was able to get the "full_text" with "tweet_mode='extended'" on Tweepy 3.9.0.
The methods I used are:
status_list = api.statuses_lookup(list_of_ids, trim_user=False, tweet_mode="extended")
tweet_status = api.get_status(single_id, tweet_mode="extended")
tweets = tw.Cursor(api.search,q=search_words,lang="en",since=date_since,until=date_until, tweet_mode='extended').items(10)
And got the full_text in every one of them.
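One extra thing worth checking (a side note beyond the three methods above): even with tweet_mode='extended', the outer full_text of a retweet is still truncated, and the untruncated text sits on retweeted_status. A small helper along these lines should cover both cases:
# Sketch: prefer the original tweet's text when the status is a retweet.
def get_full_text(status):
    if hasattr(status, "retweeted_status"):
        return status.retweeted_status.full_text
    return status.full_text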
I created a tweepy listener to collect tweets into a local MongoDB during the first presidential debate, but I have realized that the tweets I have been collecting are limited to 140 characters, and many are being cut off at that limit. In my stream I had defined tweet_mode='extended', which I thought would have resolved this issue; however, I am still not able to retrieve the full length of tweets longer than 140 characters. Below is my code:
auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Create a listener MyListener that streams and stores tweets to a local MongoDB
class MyListener(StreamListener):

    def __init__(self):
        super().__init__()
        self.list_of_tweets = deque([], maxlen=5)

    def on_data(self, data):
        try:
            tweet_text = json.loads(data)
            self.list_of_tweets.append(tweet_text)
            self.print_list_of_tweets()
            db['09292020'].insert_one(tweet_text)
        except:
            None

    def on_error(self, status):
        print(status)

    def print_list_of_tweets(self):
        display.clear_output(wait=True)
        for index, tweet_text in enumerate(self.list_of_tweets):
            m = '{}. {}\n\n'.format(index, tweet_text)
            print(m)

debate_stream = Stream(auth, MyListener(), tweet_mode='extended')
debate_stream = debate_stream.filter(track=['insert', 'debate', 'keywords', 'here'])
Any input into how I can obtain the full extended tweet via this listener would be greatly appreciated!
tweet_mode=extended has no effect on the legacy standard streaming API, as Tweets are delivered in both truncated (140) and extended (280) form by default.
So you'll want your Stream Listener set up like this:
debate_stream = Stream(auth, MyListener())
What you should be seeing is that the JSON object for longer Tweets has a text field of 140 characters, but contains an additional dictionary called extended_tweet which in turn contains a full_text field with the full Tweet text.
Rather than changing that second-to-last line (a Stream object itself has no extended_tweet attribute), pull the text out inside your on_data handler: parse the incoming JSON and, when an extended_tweet dictionary is present, use its full_text value; otherwise fall back to the regular text field.
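A minimal sketch of that on_data change, assuming the same MyListener class, db handle, and collection name as in the question (resolved_full_text is just a made-up key for convenience):
    # Sketch of a replacement MyListener.on_data that prefers extended_tweet.
    def on_data(self, data):
        try:
            tweet = json.loads(data)
            full_text = tweet.get("extended_tweet", {}).get("full_text", tweet.get("text"))
            tweet["resolved_full_text"] = full_text  # made-up key, for convenience
            self.list_of_tweets.append(tweet)
            self.print_list_of_tweets()
            db['09292020'].insert_one(tweet)
        except Exception as error:
            print(error)
        return True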
Is it possible to pull the auto (non-user) generated video transcripts from any of the YouTube APIs?
As of Aug 2019, the following method allows you to download transcripts:
Open in Browser
https://www.youtube.com/watch?v=[Video ID]
From Console type:
JSON.parse(ytplayer.config.args.player_response).captions.playerCaptionsTracklistRenderer.captionTracks[0].baseUrl
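As a best-effort follow-up (this leans on the undocumented timedtext format behind that baseUrl, so no guarantees): the track comes back as simple XML, which a short script can flatten into plain text:
# Sketch: download the baseUrl printed by the console snippet above and
# strip the timedtext XML down to plain caption text.
import html
import xml.etree.ElementTree as ET

import requests

def caption_text(base_url):
    xml_payload = requests.get(base_url).text
    root = ET.fromstring(xml_payload)
    # each <text> element holds one caption line, HTML-escaped
    lines = [html.unescape(node.text or "") for node in root.iter("text")]
    return " ".join(lines)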
You may refer to this thread: How to get "transcript" in youtube-api v3
If you're authenticating with oAuth2, you could do a quick call to
this feed:
http://gdata.youtube.com/feeds/api/videos/[VIDEOID]/captiondata/[CAPTIONTRACKID]
to get the data you want. To retrieve a list of possible caption track
IDs with v2 of the API, you access this feed:
https://gdata.youtube.com/feeds/api/videos/[VIDEOID]/captions
That feed request also accepts some optional parameters, including
language, max-results, etc. For more details, along with a sample that
shows the returned format of the caption track list, see the
documentation at
https://developers.google.com/youtube/2.0/developers_guide_protocol_captions#Retrieve_Caption_Set
Also, here are some references which might help:
https://www.quora.com/Is-there-any-way-to-download-the-YouTube-transcripts-that-are-generated-automatically
http://ccm.net/faq/40644-how-to-get-the-transcript-of-a-youtube-video
1 Install youtube-transcript-api (https://github.com/jdepoix/youtube-transcript-api), e.g.:
pip3 install youtube_transcript_api
2 Create youtube_transcript_api-wrapper.py with the following code (based partially on https://stackoverflow.com/a/65325576/2585501):
from youtube_transcript_api import YouTubeTranscriptApi

#srt = YouTubeTranscriptApi.get_transcript(video_id)
videoListName = "youtubeVideoIDlist.txt"
with open(videoListName) as f:
    video_ids = f.read().splitlines()
transcript_list, unretrievable_videos = YouTubeTranscriptApi.get_transcripts(video_ids, continue_after_error=True)
for video_id in video_ids:
    if video_id in transcript_list.keys():
        print("\nvideo_id = ", video_id)
        #print(transcript)
        srt = transcript_list.get(video_id)
        text_list = []
        for i in srt:
            text_list.append(i['text'])
        text = ' '.join(text_list)
        print(text)
3 Create youtubeVideoIDlist.txt containing a list of video_ids
4 python3 youtube_transcript_api-wrapper.py
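For a quick single-video test (mirroring the commented-out get_transcript line in the script above; VIDEO_ID is a placeholder):
# Sketch: fetch one transcript and join it into a single string.
from youtube_transcript_api import YouTubeTranscriptApi

srt = YouTubeTranscriptApi.get_transcript("VIDEO_ID")  # placeholder video id
print(" ".join(entry["text"] for entry in srt))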
I'm trying to get all connections (interactions) on a Facebook page within a certain time period. I'm using the koala gem and filtering the request with "since: 1.month.ago.to_i", which seems to work fine. However, this gives me 25 results at a time. If I change the limit to 446 (the maximum, it seems), that works better. But... if I use .next_page to give me the next set of results within the given time range, it instead just gives me the next set of results without obeying the time range.
For example, let's say I don't increase the limit and I have 25 results per request. I do something like:
@api.get_connections(@fan_page_id, "feed", {since: 1.month.ago.to_i})
let's assume there are 30 results for this and the first request gets me 25 (the default limit). then, if I do this:
@api.get_connections(@fan_page_id, "feed", {since: 1.month.ago.to_i}).next_page
instead of returning the last 5 results, it returns 25 more, 20 of which are not "since: 1.month.ago.to_i". I have a while loop cycling through the pages, but I don't know where to stop since it just keeps returning results no matter what as long as I keep calling .next_page.
is there a better way of doing this?
if not, what's the best way to check that the post I'm looking at in the loop is still within the time range I want, and to break out if not?
here's my code:
def perform(fan_page_id, pagination_options = {})
  @since_date = pagination_options[:since_date] if pagination_options[:since_date]
  @limit = pagination_options[:limit] if pagination_options[:limit]
  @oauth = Koala::Facebook::OAuth.new
  @api = Koala::Facebook::API.new @oauth.get_app_access_token
  fb_page = @api.get_object(fan_page_id)
  @fan_page_id = fb_page["id"]
  # Collect all the users who liked, commented, or liked *and* commented on a post
  process_posts(@api.get_connections(@fan_page_id, "feed", {since: @since_date})) do |post|
    ## do stuff based on each post
  end
end

private

# Take each post from the specified feed and perform the provided
# code on each post in that feed.
#
# @param [Koala::Facebook::API::GraphCollection] feed An API response containing a page's feed
def process_posts(feed, options = {})
  raise ArgumentError unless block_given?
  current_feed = feed
  begin
    current_feed.each { |post| yield(post) }
    current_feed = current_feed.next_page
  end while current_feed.any?
end
current = @api.get_connections(@fan_page_id, "feed", {since: 1.month.ago.to_i})
next_feed = current.next_page
next_feed = next_feed.next_page
.....
Please try these, I think they work.