I am unable to get the entire text of a tweet - every tweet is truncated to 140 characters and ends in "...".
I am already using full_text and tweet_mode='extended', but it still does not work:
tweets = tw.Cursor(api.search, q=search_words, lang="en", since=date_since, until=date_until, tweet_mode='extended').items(10)
users_locs = [[tweet.user.screen_name, tweet.user.location, tweet.full_text, tweet.created_at] for tweet in tweets]
tweet_text = pd.DataFrame(data=users_locs, columns=['user', 'location', 'text', 'date'])
Could you help me with this?
Check which version of Tweepy you are using and update to the latest release.
I tried three different methods (including yours) and was able to get full_text with tweet_mode='extended' on Tweepy 3.9.0.
The methods I used are:
status_list = api.statuses_lookup(list_of_ids, trim_user=False, tweet_mode="extended")
tweet_status = api.get_status(single_id, tweet_mode="extended")
tweets = tw.Cursor(api.search, q=search_words, lang="en", since=date_since, until=date_until, tweet_mode='extended').items(10)
And got the full_text in every one of them.
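If you are on the latest version and some texts still end in "...", check whether those statuses are retweets: for a retweet, the top-level full_text is truncated by the API, and the complete text sits on the nested retweeted_status. A minimal sketch of a helper for that case (the get_full_text name is mine, not part of Tweepy):
def get_full_text(status):
    # Retweets keep their complete text on the nested retweeted_status;
    # the top-level full_text of a retweet is truncated by the API.
    if hasattr(status, 'retweeted_status'):
        return status.retweeted_status.full_text
    return status.full_text

tweets = tw.Cursor(api.search, q=search_words, lang="en", since=date_since, until=date_until, tweet_mode='extended').items(10)
users_locs = [[t.user.screen_name, t.user.location, get_full_text(t), t.created_at] for t in tweets]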
I am trying to create several network analyses from Twitter. To get the data, I used the academictwitteR package and their get_all_tweets command.
get_all_tweets(
  users = c("LegaSalvini"),
  start_tweets = "2007-01-01T00:00:00Z",
  end_tweets = "2022-07-01T00:00:00Z",
  file = "tweets_lega",
  data_path = "tweetslega/",
  bind_tweets = FALSE
)
## Binding JSON files into data.frame objects
tweets_bind_lega <- bind_tweets(data_path = "tweetslega/")
## Tidying
tweets_bind_lega_tidy <- bind_tweets(data_path = "tweetslega/", output_format = "tidy")
With this, I can easily access the IDs needed to build a retweet and reply network. However, the tidy format does not provide a tidy column for the mentions; instead, it drops them.
They are still in my untidy df tweets_bind_lega, but stored as a list in tweets_bind_lega$entities$mentions. I would now like to unnest this list and create a tidy df with a column that contains the mentioned Twitter user IDs.
Has anyone created a mention network with academictwitteR before and can help me out?
Thanks!
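In R, tidyr::unnest() on the entities$mentions list-column is the usual route. As an alternative, here is a minimal Python sketch of the same unnesting, assuming the raw data_*.json files that academictwitteR writes into data_path each hold a list of Twitter API v2 tweet objects (verify the file layout against your own data_path before relying on this):
import glob, json
import pandas as pd

rows = []
for path in glob.glob("tweetslega/data_*.json"):
    with open(path) as f:
        tweets = json.load(f)  # assumed: one list of tweet objects per file
    for tweet in tweets:
        # entities.mentions is only present when the tweet mentions someone
        for mention in tweet.get("entities", {}).get("mentions", []):
            rows.append({
                "tweet_id": tweet["id"],
                "author_id": tweet["author_id"],
                "mentioned_username": mention["username"],
                "mentioned_user_id": mention.get("id"),
            })
mentions_df = pd.DataFrame(rows)  # tidy: one row per (tweet, mention) pair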
I'm currently trying to use Tweepy to get a bunch of recent tweets from one user, without including retweets. Originally I was using:
tweets = []
for i in tweepy.Cursor(api.user_timeline,
                       screen_name='user',
                       tweet_mode='extended').items():
    tweets.append(i.full_text)
Using api.user_timeline gave me about 3400 results, but this included retweets.
I then tried using api.search_tweets as follows:
tweets = []
for i in tweepy.Cursor(api.search_tweets,
                       q='from:user -filter:retweets',
                       tweet_mode='extended').items():
    tweets.append(i.full_text)
This only gives me 148 results, even though the user has tweeted far more than that. Is there any way I can use api.search_tweets and get more tweets? I tried adding since:2021-06-01, but that still didn't work; I also tried adding a count parameter into the mix, and that didn't work either.
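Worth noting: the standard search endpoint only covers roughly the last 7 days of tweets, which would explain the small result count. An alternative is to stay with api.user_timeline and pass include_rts=False so retweets are excluded server-side; a minimal sketch (retweets still count toward the timeline's ~3,200-tweet cap even when excluded):
tweets = []
for status in tweepy.Cursor(api.user_timeline,
                            screen_name='user',
                            tweet_mode='extended',
                            include_rts=False).items():  # exclude retweets server-side
    tweets.append(status.full_text)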
I created a Tweepy listener to collect tweets into a local MongoDB during the first presidential debate, but I have realized that the tweets I have been collecting are limited to 140 characters, and many are being cut off at that limit. In my stream I had defined tweet_mode='extended', which I thought would resolve this issue; however, I am still not able to retrieve the full length of tweets longer than 140 characters. Below is my code:
import json
from collections import deque

import tweepy
from tweepy import Stream
from tweepy.streaming import StreamListener
from IPython import display  # for display.clear_output in a notebook

# `auth` (a tweepy.OAuthHandler) and `db` (a pymongo database) are set up earlier
auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Create a listener MyListener that streams and stores tweets to a local MongoDB
class MyListener(StreamListener):
    def __init__(self):
        super().__init__()
        self.list_of_tweets = deque([], maxlen=5)

    def on_data(self, data):
        try:
            tweet_text = json.loads(data)
            self.list_of_tweets.append(tweet_text)
            self.print_list_of_tweets()
            db['09292020'].insert_one(tweet_text)
        except:
            pass

    def on_error(self, status):
        print(status)

    def print_list_of_tweets(self):
        display.clear_output(wait=True)
        for index, tweet_text in enumerate(self.list_of_tweets):
            m = '{}. {}\n\n'.format(index, tweet_text)
            print(m)

debate_stream = Stream(auth, MyListener(), tweet_mode='extended')
debate_stream = debate_stream.filter(track=['insert', 'debate', 'keywords', 'here'])
Any input on how I can obtain the full extended tweet via this listener would be greatly appreciated!
tweet_mode=extended has no effect on the legacy standard streaming API, as Tweets are delivered in both truncated (140) and extended (280) form by default.
So you'll want your Stream Listener set up like this:
debate_stream = Stream(auth, MyListener())
What you should be seeing is that the JSON object for longer Tweets has a text field of 140 characters, but contains an additional dictionary called extended_tweet which in turn contains a full_text field with the full Tweet text.
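Inside on_data, that looks something like this (a minimal sketch based on the structure above; note that retweets nest their own extended_tweet under retweeted_status):
def on_data(self, data):
    tweet = json.loads(data)
    if 'extended_tweet' in tweet:
        # tweets longer than 140 characters carry their full text here
        full_text = tweet['extended_tweet']['full_text']
    else:
        full_text = tweet.get('text')
    self.list_of_tweets.append(full_text)
    self.print_list_of_tweets()
    db['09292020'].insert_one(tweet)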
You can try reading the field inside your on_data method instead, e.g.
json.loads(data)['extended_tweet']['full_text']
Not sure if this will work for every tweet (short tweets don't have an extended_tweet key), but try it out.
How can I remove hashtags, user mentions, and URLs from a tweet? The Twitter4J library (sentiment analysis) does not work properly with these noise tokens.
Example:
Tweet: Hello great morning today #summermorning #evilpriest #holysinner https://goo.le/asxmo/dataload.......
Should look like -
Hello great morning today summermorning
Is there any method or utility available in Twitter4J itself, or do we need to write our own? Please guide.
Use regular expressions to filter out the # signs before passing a sentence through the sentiment analysis pipeline!
Use this:
String withoutHashTweet = originalTweet.replaceAll("[#]", "");
So "Hello great morning today #summermorning #evilpriest #holysinner " should return : "Hello great morning today summermorning #evilpriest #holysinner"
Similarly replace the hash in the code with # to remove the respective sign
Something like this:
let tweet = "@arthurlacoste check this link : http://lit.ly/hugeLink ! so #nsfw";
tweet = tweet.replace(/(?:https?|ftp):\/\/[\n\S]+/g, '') // remove links
    //.replace(/#\w\w+\s?/g, '') // remove hashtag words entirely
    .replace('#', '') // remove the '#' sign only
    .replace(/@\w\w+\s?/g, ''); // remove mentions
console.log(tweet);
// output: "check this link : ! so nsfw"
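If it helps to prototype the cleanup outside Twitter4J, the same chain translates to Python's re module like this (a minimal sketch; the clean_tweet name is mine):
import re

def clean_tweet(tweet):
    tweet = re.sub(r'(?:https?|ftp)://\S+', '', tweet)  # remove links
    tweet = re.sub(r'@\w+\s?', '', tweet)               # remove mentions
    return tweet.replace('#', '').strip()               # strip '#' signs, keep the words

print(clean_tweet("Hello great morning today #summermorning #evilpriest #holysinner"))
# -> "Hello great morning today summermorning evilpriest holysinner"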
Is it possible to pull the auto (non-user) generated video transcripts from any of the YouTube APIs?
As of Aug 2019, the following method allows you to download transcripts:
Open in a browser:
https://www.youtube.com/watch?v=[Video ID]
From the console, type:
JSON.parse(ytplayer.config.args.player_response).captions.playerCaptionsTracklistRenderer.captionTracks[0].baseUrl
You may refer to this thread: How to get "transcript" in youtube-api v3
If you're authenticating with OAuth2, you could do a quick call to this feed:
http://gdata.youtube.com/feeds/api/videos/[VIDEOID]/captiondata/[CAPTIONTRACKID]
to get the data you want. To retrieve a list of possible caption track IDs with v2 of the API, you access this feed:
https://gdata.youtube.com/feeds/api/videos/[VIDEOID]/captions
That feed request also accepts some optional parameters, including language, max-results, etc. For more details, along with a sample that shows the returned format of the caption track list, see the documentation at
https://developers.google.com/youtube/2.0/developers_guide_protocol_captions#Retrieve_Caption_Set
Also, here are some references which might help:
https://www.quora.com/Is-there-any-way-to-download-the-YouTube-transcripts-that-are-generated-automatically
http://ccm.net/faq/40644-how-to-get-the-transcript-of-a-youtube-video
1. Install youtube-transcript-api (https://github.com/jdepoix/youtube-transcript-api), e.g.:
pip3 install youtube_transcript_api
2. Create youtube_transcript_api-wrapper.py with the following code (based partially on https://stackoverflow.com/a/65325576/2585501):
from youtube_transcript_api import YouTubeTranscriptApi

# srt = YouTubeTranscriptApi.get_transcript(video_id)
videoListName = "youtubeVideoIDlist.txt"
with open(videoListName) as f:
    video_ids = f.read().splitlines()
transcript_list, unretrievable_videos = YouTubeTranscriptApi.get_transcripts(video_ids, continue_after_error=True)
for video_id in video_ids:
    if video_id in transcript_list.keys():
        print("\nvideo_id = ", video_id)
        # print(transcript)
        srt = transcript_list.get(video_id)
        text_list = []
        for i in srt:
            text_list.append(i['text'])
        text = ' '.join(text_list)
        print(text)
3. Create youtubeVideoIDlist.txt containing a list of video IDs (one per line)
4. Run: python3 youtube_transcript_api-wrapper.py