Extracting titles of articles deposited in PubMed from PubMed IDs using Bio.Entrez (Biopython)

I am trying to extract titles for some articles deposited in PubMed using PubMed IDs (that I have in a list called 'ids'). There are around 650K PubMed IDs. The code seems to work fine and doesn't throw any errors, but it extracts titles for only a fraction of the articles, not all of them.
Following is the code:
from Bio import Entrez
import pandas as pd

Entrez.email = "xyz@enginebio.com"

for i in range(0, len(ids), 10000):
    if i % 10000 == 0:  # for me to track the progress of the script
        print(i)
    idlist = ids[i:i+10000]
    handle = Entrez.efetch(db="pubmed", id=idlist, retmode="xml")
    try:
        record = Entrez.read(handle)
    except:
        continue
    title = {}
    for j in range(len(record["PubmedArticle"])):
        pmid = record["PubmedArticle"][j]['MedlineCitation']['PMID'][:]
        if "Abstract" in record["PubmedArticle"][j]['MedlineCitation']['Article'].keys():
            title[pmid] = record["PubmedArticle"][j]['MedlineCitation']['Article']['ArticleTitle'].encode('ascii', 'ignore').decode('ascii')
    # save article titles
    subfile = 'article_titles_' + str(i) + '.txt'
    ar = pd.DataFrame.from_dict(title, orient="index")
    ar.to_csv(subfile, sep="\t", header=None)
Any suggestions will be useful. Thanks

I cannot reproduce your example because I do not have your PubMed ID list. It would also be interesting to know how many of the ~650K IDs you recover; it is not the same if you are recovering 639K titles (probably some of your IDs are simply missing) or only 10K. I tried a mini example myself and it does retrieve the titles, so maybe some of the IDs are not valid. You could try smaller batches and also:
The bare except: continue will hide any problems that may have arisen from an empty handle (for example, if the query result was empty). I would check and log the exceptions.
Throw a warning if len(record["PubmedArticle"]) is smaller than your batch size. This way you can narrow down which IDs you may be missing.
You are only adding the title to your title dict if the record has an Abstract field. Are you sure this is the case for all the records? In the cases I tried it did apply, but maybe not every entry has an abstract.
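A minimal sketch of the fetch loop with those three changes folded in (per-batch file saving omitted; ids and the batch size of 10000 are taken from the question):
import warnings
from Bio import Entrez

Entrez.email = "xyz@enginebio.com"
batch = 10000
title = {}

for i in range(0, len(ids), batch):
    idlist = ids[i:i+batch]
    handle = Entrez.efetch(db="pubmed", id=idlist, retmode="xml")
    try:
        record = Entrez.read(handle)
    except Exception as exc:  # keep the error visible instead of silently skipping the batch
        print("batch starting at", i, "failed:", exc)
        continue
    articles = record["PubmedArticle"]
    if len(articles) < len(idlist):  # warn when PubMed returns fewer records than requested
        warnings.warn("batch %d: requested %d IDs, got %d records" % (i, len(idlist), len(articles)))
    for art in articles:
        pmid = str(art['MedlineCitation']['PMID'])
        # store the title unconditionally instead of requiring an Abstract field
        title[pmid] = str(art['MedlineCitation']['Article']['ArticleTitle'])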

Lua - search for key inside of multiple tables

For my program, I have multiple tables inside of each other for organization and readability. These tables can look something like this
local buttons = {
    loadingScreen = {
        bLoadingEnter = function()
        end
    }, ...
}
What I want to do is find the first element named bLoadingEnter in the table. I don't know that the element named bLoadingEnter will be in loadingScreen. I've thought of getting all the keys in the table and then checking them, but I couldn't get that to work. Any help would be appreciated!
If execution speed is in any way relevant, the answer is:
you don't
Looking up a key in a nested table takes a lot longer than looking it up in just a normal table. The next problem is that of uniqueness. Two or more nested tables can have the same key with different values, which may lead to odd bugs down the road. You'd either have to check for that when inserting (making your code even slower) or just hope things miraculously go well and nothing explodes later on.
I'd say just use a flat table. If your keys are well named (like bLoadingEnter) you'll be able to infer meaning from the name either way, no nesting required.
That being said, nested tables can be a good option, if most of the time you know what path to take, or when you have some sort of ordered structure like a binary search tree. Or if speed really isn't an important factor to consider.
Okay, try:
for _, item in pairs(buttons) do  -- pairs, not ipairs: buttons has string keys, so ipairs would iterate nothing
    if item.bLoadingEnter ~= nil then
        return item.bLoadingEnter
    end
end
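If bLoadingEnter can be nested more than one level deep, a recursive lookup is one way to handle it. A minimal sketch (findKey is a hypothetical helper name, not part of the question's code, and it assumes the table contains no cycles):
local function findKey(tbl, key)
    if tbl[key] ~= nil then
        return tbl[key]                        -- found at this level
    end
    for _, value in pairs(tbl) do
        if type(value) == "table" then
            local found = findKey(value, key)  -- descend into subtables
            if found ~= nil then
                return found
            end
        end
    end
end

local enter = findKey(buttons, "bLoadingEnter")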

Implement an efficient slack-like subdomain name suggestion

I have a Rails app with PostgreSQL.
I'm trying to implement a method to suggest alternative names for a certain resource, if the user input has been already chosen.
My reference is Slack:
Is there any solution that could do this efficiently?
By 'efficiently' I mean using only one query, or at most a small set of queries. A pure SQL solution would be great, though.
My initial implementation looked like this:
def generate_alternative_names(model, column_name, count)
  words = model[column_name].split(/[,\s\-_]+/).reject(&:blank?)
  candidates = 100.times.map { |i| generate_candidates_using_a_certain_strategy(i, words) }
  already_used = model.class.where(column_name => candidates).pluck(column_name)
  (candidates - already_used).first(count)
end
# Usage example:
model = Domain.new
model.name = 'hello-world'
generate_alternative_names(model, :name, 5)
# => ["hello_world", "hello-world2", "world_hello", ...]
It generates 100 candidates, then checks the database for matches and removes them from the candidates list. Finally, it returns the first count values that remain.
This method is a best-effort implementation: it works for small sets of suggestions that have few conflicts (at most 100 in my case).
Even if I increase this magic number (100), it does not scale indefinitely.
Do you know a way to improve this so that it scales to a large number of conflicts, without relying on magic numbers?
I would go with the reversed approach: query the database for existing records using LIKE, then generate suggestions while skipping the names that are already taken:
def alternatives(model, column, word, count)
  taken = model.class.where("#{column} LIKE ?", "%#{word}%").pluck(column)
  count.times.map do |i|
    generate_candidates_using_a_certain_strategy(i, taken)
  end
end
Change generate_candidates_using_a_certain_strategy so that it receives the array of already-taken words to skip. There could be a glitch if two concurrent requests race for the same name, but I don't think that will cause real problems, since you are always free to apologize when the actual creation fails.
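The question leaves generate_candidates_using_a_certain_strategy unspecified; here is a purely illustrative sketch of one possible strategy (the counter-suffix idea is an assumption, not something from the original post):
# Illustrative only; any real strategy will be application-specific.
# `taken` is the array of names the LIKE query found in the database.
def generate_candidates_using_a_certain_strategy(i, taken)
  base = taken.first || 'name'
  candidate = "#{base}#{i + 2}"                     # base2, base3, base4, ...
  candidate += '_' while taken.include?(candidate)  # keep mutating until the name is free
  candidate
end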

Solr join and non-Join queries give different results

I am trying to link two types of documents in my Solr index. The parent is named "house" and the child is named "available". So, I want to return a list of houses that have available documents with some filtering. However, the following query gives me around 18 documents, which is wrong. It should return 0 documents.
q=*:*
&fq={!join from=house_id_fk to=house_id}doctype:available AND discount:[1 TO *] AND start_date:[NOW/DAY TO NOW/DAY%2B21DAYS]
&fq={!join from=house_id_fk to=house_id}doctype:available AND sd_year:2014 AND sd_month:11
To debug it, I tried first to check whether there is any available documents with the given filter queries. So, I tried the following query:
q=*:*
&fq=doctype:available AND discount:[1 TO *] AND start_date:[NOW/DAY TO NOW/DAY%2B21DAYS]
&fq=doctype:available AND sd_year:2014 AND sd_month:11
The query gives 0 results, which is correct. As you can see, both queries are the same; the only difference is the use of the join query parser. I am a bit confused about why the first query returns results. My understanding is that this should not happen, because the second query shows that there are no available documents that satisfy the given filter queries.
I have figured it out.
The reason is simply the type of join in Solr: it behaves like an outer join. Since both filter queries are executed separately, a house that has available documents matching discount:[1 TO *] in one filter and (sd_year:2014 AND sd_month:11) in the other will be returned, even though my intention was to apply both conditions to the same document at the same time.
However, in the second case both conditions are applied at the same time to find available documents, and houses based on the matching available documents are then returned. Since no available document satisfies both conditions, there is no matching house, which gives zero results.
It really took some time to figure this out; I hope this will help someone else.
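Assuming the intent is that a single available document must satisfy all the conditions at once, one way to express that (a sketch based on the queries above, not a tested configuration) is to put everything into a single join filter query:
q=*:*
&fq={!join from=house_id_fk to=house_id}doctype:available AND discount:[1 TO *] AND start_date:[NOW/DAY TO NOW/DAY%2B21DAYS] AND sd_year:2014 AND sd_month:11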

CoreData getting objects based on a distinct property

I've had a problem for a while and I have hacked together a solution but I am revisiting it in the hopes of finding a real solution. Unfortunately that is not happening. In Core Data I've got a bunch of RSS articles. The user can subscribe to individual channels within a single feed. The problem is that some feed providers post the exact same article in multiple channels of the same feed. So the user ends up getting 2+ versions of the same article. I want to keep all articles in case the user unsubscribes from a channel that contains one copy but stays subscribed to another channel with a duplicate, but I only want to show a single article in the list of available articles.
To identify duplicates, I create a hash value of the article text content and store it as a property on the Article entity in Core Data (text_hash). My original thinking was that I would be able to craft a fetch request that could get the articles based on a unique match on this property, something like an SQL query. That turns out not to be the case (I was just learning Core Data at the time).
So, to hack up a solution, I fetch all the articles, make an empty set, and enumerate the fetch results, checking whether each hash is in the set. If it is, I ignore the article; if it isn't, I add the hash to the set and add the article ID to an array. When I'm finished, I create a predicate based on those article IDs and do another fetch.
This seems really wasteful and clumsy: not only am I fetching twice and enumerating the results, but since the final predicate is based on individual article IDs, I have to re-run the whole thing every time I add a new article.
It works for now but I am going to work on a new version of this app and I would like to make this better if at all possible. Any help is appreciated, thanks!
You could use propertiesToGroupBy like so:
NSFetchRequest *fr = [NSFetchRequest fetchRequestWithEntityName:@"Article"];
fr.propertiesToFetch = @[@"text_hash"];   // fetch only the grouped column
fr.propertiesToGroupBy = @[@"text_hash"];
fr.resultType = NSDictionaryResultType;
NSArray *articles = [ctx executeFetchRequest:fr error:nil];

How do database indices make search faster

I was reading through the Rails tutorial (http://ruby.railstutorial.org/book/ruby-on-rails-tutorial#sidebar-database_indices) but got confused by the explanation of database indices. Basically, the author proposes that rather than searching in O(n) time through a list of emails (for login), it is much faster to create an index, giving the following example:
To understand a database index, it’s helpful to consider the analogy
of a book index. In a book, to find all the occurrences of a given
string, say “foobar”, you would have to scan each page for “foobar”.
With a book index, on the other hand, you can just look up “foobar” in
the index to see all the pages containing “foobar”.
source: http://ruby.railstutorial.org/chapters/modeling-users#sidebar:database_indices
So what I understand from that example is that words can be repeated in text, so the "index page" consists of unique entries. However, on the railstutorial site, login is set up such that each email address is unique to an account, so how does having an index make it faster when there can be at most one occurrence of each email?
Thanks
Indexing isn't (much) about duplicates. It's about order.
When you do a search, you want to have some kind of order that lets you (for example) do a binary search to find the data in logarithmic time instead of searching through every record to find the one(s) you care about (that's not the only type of index, but it's probably the most common).
Unfortunately, you can only arrange the records themselves in a single order.
An index contains just the data (or a subset of it) that you're going to search on, plus pointers (of some sort) to the records containing the actual data. This allows you to (for example) do searches based on as many different fields as you care about, and still be able to do binary searching on all of them, because each index is arranged in order by that field.
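To make the O(n) versus O(log n) point concrete, here is a small Python sketch (illustrative only; real databases use B-tree indexes rather than plain sorted lists, but the principle is the same):
import bisect

# Unsorted "table": finding an email means scanning every row -> O(n)
emails = ["carol@example.com", "alice@example.com", "bob@example.com"]
row = next((i for i, e in enumerate(emails) if e == "bob@example.com"), None)

# "Index": the same values kept sorted, each paired with a pointer (the row number)
index = sorted((e, i) for i, e in enumerate(emails))
keys = [e for e, _ in index]

# Binary search over the sorted keys -> O(log n)
pos = bisect.bisect_left(keys, "bob@example.com")
if pos < len(keys) and keys[pos] == "bob@example.com":
    row = index[pos][1]  # pointer back into the table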
Because the index in the DB (and in the given example) is sorted alphabetically, while the raw table / book is not. Then think: how do you search an index knowing it is sorted? You don't start reading at "A" and continue until you reach the point of interest; instead you skip roughly to the point of interest and start searching from there. Basically, a DB can do the same thing with an index.
It is faster because the index contains only values from the column in question, so it is spread across a smaller number of pages than the full table. Also, indexes usually include additional optimizations such as hash tables to limit the number of reads required.
