How to keep order when parsing reddit with praw?

I'm trying praw for parsing this reddit page, and I found that this code doesn't preserve the comment order:
sm = reddit.submission(url="https://www.reddit.com/r/AskReddit/comments/1irtkq/taxi_drivers_whats_the_deepest_secret_youve/")
sm.comment_sort = 'top'
sm.comments.replace_more(limit=0)
allComments = sm.comments.list()
for i in allComments[1].replies:
    print(i.body[:10])
Is it possible to fix this and get the same order for all comment trees?
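If the API keeps returning replies in an inconsistent order, one workaround is to impose the order client-side after fetching. A minimal sketch on plain dicts (the score and replies keys mirror PRAW comment attributes; the recursive sort is my own workaround, not part of the PRAW API):

```python
def sort_tree(comment):
    """Recursively sort a comment's replies by score, highest first,
    so every subtree uses the same 'top' ordering."""
    comment["replies"] = sorted(
        (sort_tree(c) for c in comment.get("replies", [])),
        key=lambda c: c["score"],
        reverse=True,
    )
    return comment

# A toy thread: top-level comment with two replies, one of which has children.
thread = {
    "score": 100,
    "replies": [
        {"score": 5, "replies": []},
        {"score": 42, "replies": [{"score": 1, "replies": []},
                                  {"score": 7, "replies": []}]},
    ],
}
sorted_thread = sort_tree(thread)
print([r["score"] for r in sorted_thread["replies"]])  # highest score first
```

The same walk works on PRAW comment objects by sorting `comment.replies` on `comment.score` at every level.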

Related

How to get Twitter mentions id using academictwitteR package?

I am trying to create several network analyses from Twitter data. To get the data, I used the academictwitteR package and its get_all_tweets command.
get_all_tweets(
  users = c("LegaSalvini"),
  start_tweets = "2007-01-01T00:00:00Z",
  end_tweets = "2022-07-01T00:00:00Z",
  file = "tweets_lega",
  data_path = "tweetslega/",
  bind_tweets = FALSE
)
## Binding JSON files into data.frame objects
tweets_bind_lega <- bind_tweets(data_path = "tweetslega/")
## Tidying
tweets_bind_lega_tidy <- bind_tweets(data_path = "tweetslega/", output_format = "tidy")
With this, I can easily access the ids for the creation of a retweet and reply network. However, the tidy format does not provide a tidy column for the mentions; instead, it drops them.
They are still in my untidy df tweets_bind_lega, but stored as a list in tweets_bind_afd$entities$mentions. Now I would like to somehow unnest this list and create a tidy df with a column that contains the mentioned Twitter user ids.
Has anyone created a mention network with academictwitteR before and can help me out?
Thanks!
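Independent of academictwitteR, the unnesting step itself is just flattening a list column into one row per mention. A plain-Python sketch (the nested shape below mimics the entities$mentions list; the username/id field names follow the Twitter v2 payload):

```python
# Each tweet carries a list of mention dicts, as in the Twitter v2 payload.
tweets = [
    {"tweet_id": "1", "entities": {"mentions": [{"username": "alice", "id": "11"},
                                                {"username": "bob", "id": "22"}]}},
    {"tweet_id": "2", "entities": {"mentions": []}},
    {"tweet_id": "3", "entities": {"mentions": [{"username": "carol", "id": "33"}]}},
]

# Unnest: one row per (tweet, mentioned user) pair --
# exactly the edge list a mention network needs.
edges = [
    (t["tweet_id"], m["id"], m["username"])
    for t in tweets
    for m in t["entities"].get("mentions", [])
]
for edge in edges:
    print(edge)
```

In R the equivalent move is tidyr's unnest on the mentions list column; the result is the same tweet-to-user edge list.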

Text string using Biopython

I'm using Biopython in my code and I need to extract the abstract from articles. To search for an article I'm using this function:
from Bio import Entrez

def search(query):
    Entrez.email = 'your.email@example.com'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='20',
                            retmode='xml',
                            term=query)
    results = Entrez.read(handle)
    return results
I'm looking for the simplest way to get the text as a string after searching for the article (I'm aiming for just one result per search, using the PMID).
Cheers
Try using metapub:
from metapub import PubMedFetcher
fetch = PubMedFetcher()
article = fetch.article_by_pmid('31326596')
article.abstract
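If staying within Biopython is preferred, EFetch can also return the abstract as plain text directly. A sketch (the function name is mine; the import sits inside the function so the sketch loads without Biopython installed, and actually calling it needs network access plus a registered email):

```python
def fetch_abstract(pmid, email):
    """Fetch a PubMed record's abstract as plain text via Entrez.efetch."""
    from Bio import Entrez  # lazy import: sketch loads without Biopython
    Entrez.email = email
    handle = Entrez.efetch(db="pubmed", id=pmid,
                           rettype="abstract", retmode="text")
    return handle.read()

# Usage (requires network access):
# print(fetch_abstract("31326596", "your.email@example.com"))
```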

Tweepy's api.search_tweets is only giving me 148 results

I'm currently trying to use Tweepy to get a bunch of recent tweets from one user, without including retweets. Originally I was using:
tweets = []
for i in tweepy.Cursor(api.user_timeline,
                       screen_name='user',
                       tweet_mode='extended').items():
    tweets.append(i.full_text)
Using api.user_timeline gave me about 3400 results, but this included retweets.
I then tried using api.search_tweets as follows:
tweets = []
for i in tweepy.Cursor(api.search_tweets,
                       q='from:user -filter:retweets',
                       tweet_mode='extended').items():
    tweets.append(i.full_text)
This only gives me 148 results, but the user has tweeted a lot more than that. Is there any way I can use api.search_tweets and get more tweets? I tried adding in since:2021-06-01, but that didn't work; I also tried adding a count parameter into the mix, and that didn't work either.
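The small count is likely the endpoint, not the query: the standard search API only covers roughly the last 7 days of tweets, while api.user_timeline reaches much further back and accepts include_rts=False to exclude retweets at the source. Failing that, retweets can be dropped client-side; in the v1.1 API a retweet's text begins with "RT @" (checking for a retweeted_status attribute is the more robust test on tweet objects). A sketch on plain strings:

```python
# Timeline texts as user_timeline might return them; retweets start with "RT @".
timeline = [
    "RT @someone: interesting thread",
    "my own tweet about tweepy",
    "another original tweet",
    "RT @other: more retweeted content",
]

# Keep only original tweets by filtering out the retweet prefix.
originals = [t for t in timeline if not t.startswith("RT @")]
print(originals)
```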

TypeError when attempting to parse pubmed EFetch

I'm new to this Python/Biopython stuff, so am struggling to work out why the following code, pretty much lifted straight out of the Biopython Cookbook, isn't doing what I'd expect.
I'd have thought it would end up with the interpreter displaying two lists containing the same numbers, but all I get is one list and then a message saying TypeError: 'generator' object is not subscriptable.
I'm guessing something is going wrong with the Medline.parse step and the result of the efetch isn't being processed in a way that allows subsequent iteration to extract the PMID values. Or the efetch isn't returning anything.
Any pointers as to what I'm doing wrong?
Thanks
from Bio import Medline
from Bio import Entrez

Entrez.email = 'A.N.Other@example.com'
handle = Entrez.esearch(db="pubmed", term="biopython")
record = Entrez.read(handle)
print(record['IdList'])
items = record['IdList']
handle2 = Entrez.efetch(db="pubmed", id=items, rettype="medline", retmode="text")
records = Medline.parse(handle2)
for r in records:
    print(records['PMID'])
You're trying to print records['PMID'] which is a generator. I think you meant to do print(r['PMID']) which will print the 'PMID' entry in the current record dictionary object for each iteration. This is confirmed by the example given in the Bio.Medline.parse() documentation.
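The error itself is easy to reproduce without Entrez: a generator supports iteration but not indexing, and it can only be consumed once. A minimal illustration:

```python
records = (n * n for n in range(3))  # a generator, like Medline.parse() returns

# Indexing a generator fails with the exact error from the question:
try:
    records[0]
except TypeError as exc:
    print(exc)  # 'generator' object is not subscriptable

# Iterating works -- but a generator is exhausted after one pass,
# so rebuild it before looping again.
records = (n * n for n in range(3))
squares = [r for r in records]
print(squares)
```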

Reddit api return the content of a comment or a self.text

After looking at the documentation I still can't understand how it all ties together. What I'm trying to accomplish is simple: given a URL, return the text contents of that URL.
For example:
import praw
r = praw.Reddit(user_agent='my_cool_app')
post = "http://www.reddit.com/r/askscience/comments/10kp2h\
/lots_of_people_dont_feel_identified_or_find/"
comment = "http://www.reddit.com/r/askscience/comments/10kp2h\
/lots_of_people_dont_feel_identified_or_find/c6ec6hf"
Establishing which is a comment and which is a post can be done with a regex, but if there's a better way I will use that.
So my question is: what is the best way to determine the nature of a reddit URL, and how do I get its contents?
What I tried so far:
post = praw.objects.Submission.get_info(r, url).selftext
# returns the selftext of a post, regardless of whether that url is a permalink to a comment
comment_text = praw.objects.?????() # how to do this ?
Thanks in advance.
import praw

r = praw.Reddit('<USERAGENT>')
comment_url = ('http://www.reddit.com/r/askscience/comments/10kp2h'
               '/lots_of_people_dont_feel_identified_or_find/c6ec6hf')
comment = r.get_submission(comment_url).comments[0]
print(comment.body)
My responses here and here should provide additional useful information related to your question.
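The get_submission call above is from the old PRAW 3 API; current PRAW can resolve a comment straight from its permalink with Reddit.comment(url=...). A sketch (the helper name is mine, the import is lazy so the snippet loads without praw installed, and real credentials are needed to run it):

```python
def comment_body(url):
    """Return the body of a Reddit comment given its permalink (current PRAW)."""
    import praw  # lazy import: sketch loads without praw installed
    reddit = praw.Reddit(client_id="YOUR_ID",        # placeholder credentials
                         client_secret="YOUR_SECRET",
                         user_agent="my_cool_app")
    return reddit.comment(url=url).body

# Usage (requires Reddit API credentials):
# print(comment_body("http://www.reddit.com/r/askscience/comments/10kp2h"
#                    "/lots_of_people_dont_feel_identified_or_find/c6ec6hf"))
```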
