BioPython Entrez article limit - biopython

I've been using the following function, which returns the article PMIDs for a search string:
from Bio import Entrez, __version__

print('Biopython version : ', __version__)

def article_machine(t):
    Entrez.email = 'email'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='100000',
                            retmode='xml',
                            term=t)
    return list(Entrez.read(handle)["IdList"])

print(len(article_machine('T cell')))
I've now noticed that there's a limit on the number of articles I receive (and it is not the retmax value I set).
I now get 9,999 PMIDs for keywords that used to return around 100k PMIDs ('T cell', for example).
I know it's not a bug in the package itself but a limit on NCBI's side.
Has anyone managed to solve it?

From The E-utilities In-Depth: Parameters, Syntax and More:
retmax
Total number of UIDs from the retrieved set to be shown in the XML output (default=20). By default, ESearch only includes the first 20 UIDs retrieved in the XML output. If usehistory is set to 'y', the remainder of the retrieved set will be stored on the History server; otherwise these UIDs are lost. Increasing retmax allows more of the retrieved UIDs to be included in the XML output, up to a maximum of 10,000 records.
To retrieve more than 10,000 UIDs from databases other than PubMed, submit multiple esearch requests while incrementing the value of retstart (see Application 3). For PubMed, ESearch can only retrieve the first 10,000 records matching the query. To obtain more than 10,000 PubMed records, consider using EDirect, which contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved.
Unfortunately, my code based on the information above doesn't work:
from Bio import Entrez, __version__

print('Biopython version : ', __version__)

def article_machine(t):
    all_res = []
    Entrez.email = 'email'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            rettype='count',
                            term=t)
    number = int(Entrez.read(handle)['Count'])
    print(number)
    retstart = 0
    while retstart < number:
        retstart += 1000
        print('\n retstart now : ', retstart, ' out of : ', number)
        Entrez.email = 'email'
        handle = Entrez.esearch(db='pubmed',
                                sort='relevance',
                                rettype='xml',
                                retstart=retstart,
                                retmax=str(retstart),
                                term=t)
        all_res.extend(list(Entrez.read(handle)["IdList"]))
    return all_res

print(article_machine('T cell'))
If I change while retstart < number: to while retstart < 5000:, the code works. But as soon as retstart exceeds 9998, which is unavoidable with the original while loop needed to access all the results, I get the following error:
RuntimeError: Search Backend failed: Exception:
'retstart' cannot be larger than 9998. For PubMed,
ESearch can only retrieve the first 9,999 records
matching the query.
To obtain more than 9,999 PubMed records, consider
using EDirect that contains additional logic
to batch PubMed search results automatically
so that an arbitrary number can be retrieved.
For details see https://www.ncbi.nlm.nih.gov/books/NBK25499/
The error message points to https://www.ncbi.nlm.nih.gov/books/NBK25499/, but the link that actually should be followed is https://www.ncbi.nlm.nih.gov/books/NBK179288/?report=reader (the EDirect documentation).
Have a look at the NCBI APIs to see whether something there could solve your problem through a Python interface; I am not an expert on this, sorry: https://www.ncbi.nlm.nih.gov/pmc/tools/developers/
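One workaround that stays within Biopython is to split the query into slices that each return fewer than 10,000 records, for example by publication year, and merge the PMIDs afterwards. The sketch below is only an illustration of that idea: the helper name, the year range and the per-year splitting are assumptions, and it relies on ESearch's standard datetype/mindate/maxdate parameters, which Bio.Entrez.esearch passes through to the E-utilities. EDirect remains the officially recommended route, as the error message says.

from Bio import Entrez

Entrez.email = 'email'  # use your real e-mail address here

def article_machine_by_year(term, start_year, end_year):
    # Collect PMIDs one publication year at a time so that each ESearch
    # result set stays below PubMed's 9,999-record retrieval ceiling.
    all_ids = set()
    for year in range(start_year, end_year + 1):
        handle = Entrez.esearch(db='pubmed',
                                term=term,
                                datetype='pdat',   # slice on publication date
                                mindate=str(year),
                                maxdate=str(year),
                                retmax=9999)
        record = Entrez.read(handle)
        handle.close()
        if int(record['Count']) > 9999:
            # A single year can still exceed the ceiling; split further
            # (e.g. by month) if this warning appears.
            print('Warning:', year, 'has', record['Count'], 'hits; some PMIDs are missing')
        all_ids.update(record['IdList'])
    return list(all_ids)

print(len(article_machine_by_year('T cell', 1990, 2023)))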

Related

Table API to join two Kafka streams simultaneously

I have a Kafka producer that reads data from two large files and sends them in the JSON format with the same structure:
def create_sample_json(row_id, data_file): return {'row_id':int(row_id), 'row_data': data_file}
The producer breaks every file into small chunks, builds a JSON message from each chunk, and finally sends them in a for-loop.
The process of sending those two files happens simultaneously through multithreading.
I want to do join from those streams (s1.row_id == s2.row_id) and eventually some stream processing while my producer is sending data on Kafka. Because the producer generates a huge amount of data from multiple sources, I can't wait to consume them all, and it must be done simultaneously.
I am not sure if Table API is a good approach but this is my pyflink code so far:
from pyflink.datastream.stream_execution_environment import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings
from pyflink.table.expressions import col
from pyflink.table.table_environment import StreamTableEnvironment

KAFKA_SERVERS = 'localhost:9092'

def log_processing():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.add_jars("file:///flink_jar/kafka-clients-3.3.2.jar")
    env.add_jars("file:///flink_jar/flink-connector-kafka-1.16.1.jar")
    env.add_jars("file:///flink_jar/flink-sql-connector-kafka-1.16.1.jar")

    settings = EnvironmentSettings.new_instance() \
        .in_streaming_mode() \
        .build()

    t_env = StreamTableEnvironment.create(stream_execution_environment=env, environment_settings=settings)

    t1 = f"""
        CREATE TEMPORARY TABLE table1(
            row_id INT,
            row_data STRING
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'datatopic',
            'properties.bootstrap.servers' = '{KAFKA_SERVERS}',
            'properties.group.id' = 'MY_GRP',
            'scan.startup.mode' = 'latest-offset',
            'format' = 'json'
        )
    """

    t2 = f"""
        CREATE TEMPORARY TABLE table2(
            row_id INT,
            row_data STRING
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'datatopic',
            'properties.bootstrap.servers' = '{KAFKA_SERVERS}',
            'properties.group.id' = 'MY_GRP',
            'scan.startup.mode' = 'latest-offset',
            'format' = 'json'
        )
    """

    p1 = t_env.execute_sql(t1)
    p2 = t_env.execute_sql(t2)
Please tell me what I should do next. Questions:
1) Do I need to consume data in my consumer class separately and then insert it into those tables, or will data be consumed through what is implemented here (since I passed the connector name, topic, bootstrap.servers, etc.)?
2) If so:
2.1) How can I join those streams in Python?
2.2) How can I avoid reprocessing previous data as my producer sends thousands of messages? I want to make sure not to run duplicate queries.
3) If not, what should I do?
Thank you very much.
1) Do I need to consume data in my consumer class separately and then insert it into those tables, or will data be consumed through what is implemented here (since I passed the connector name, topic, bootstrap.servers, etc.)?
The latter: the data will be consumed by the 'kafka' table connector defined here. You also need to define a sink table as the target of the insert; the sink table could be another Kafka connector table with the topic you want to output to.
2.1) How can I join those streams in Python?
You can write SQL that joins table1 and table2 and then inserts the result into your sink table, all from Python.
2.2) How can I avoid reprocessing previous data as my producer sends thousands of messages? I want to make sure not to run duplicate queries.
You can filter these messages before the join or before the insert; a WHERE clause is enough in your case.
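To make that concrete, here is a minimal sketch of what is described above, continuing from the t_env, table1 and table2 defined in the question. The sink table name, the output topic and the particular WHERE condition are illustrative assumptions, not something prescribed by the answer:

# Hypothetical sink table; adjust the schema, topic and connector options to your setup.
sink_ddl = f"""
    CREATE TEMPORARY TABLE sink_table(
        row_id INT,
        row_data_1 STRING,
        row_data_2 STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'joined_topic',
        'properties.bootstrap.servers' = '{KAFKA_SERVERS}',
        'format' = 'json'
    )
"""
t_env.execute_sql(sink_ddl)

# Join the two streams on row_id and write the result to the sink.
# The WHERE clause is the place to filter out messages you do not want to process again.
t_env.execute_sql("""
    INSERT INTO sink_table
    SELECT t1.row_id, t1.row_data, t2.row_data
    FROM table1 AS t1
    JOIN table2 AS t2 ON t1.row_id = t2.row_id
    WHERE t1.row_data IS NOT NULL
""").wait()

A regular inner join like this only appends rows as matches arrive, so it can be written to a plain append-only Kafka/JSON sink.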

How to aggregate data using apache beam api with multiple keys

I am new to the Google Cloud data platform as well as to the Apache Beam API. I would like to aggregate data based on multiple keys. In my requirement I will get a transaction feed with fields like customer id, customer name, transaction amount and transaction type. I would like to aggregate the data based on customer id & transaction type. Here is an example.
customer id,customer name,transaction amount,transaction type
cust123,ravi,100,D
cust123,ravi,200,D
cust234,Srini,200,C
cust444,shaker,500,D
cust123,ravi,100,C
cust123,ravi,300,C
The output should be:
cust123,ravi,300,D
cust123,ravi,400,C
cust234,Srini,200,C
cust444,shaker,500,D
In Google most of the examples are based on a single key, like grouping by a single key. Can anyone please help me with what my PTransform should look like for this requirement, and how to produce the aggregated data along with the rest of the fields?
Regards,
Ravi.
Here is an easy way: I concatenated all the keys together to form a single composite key, did the sum, and then split the key afterwards to organize the output the way you wanted. Please let me know if you have any questions.
The code does not expect a header row in the CSV file. I kept it short to show the main point you are asking about.
import apache_beam as beam
import sys

class Split(beam.DoFn):
    def process(self, element):
        """
        Splits each row on commas and returns a (composite key, amount) tuple for the row.
        """
        customer_id, customer_name, transaction_amount, transaction_type = element.split(",")
        return [
            (customer_id + "," + customer_name + "," + transaction_type, float(transaction_amount))
        ]

if __name__ == '__main__':
    p = beam.Pipeline(argv=sys.argv)
    input = 'aggregate.csv'
    output_prefix = 'C:\\pythonVirtual\\Mycodes\\output'
    (p
     | 'ReadFile' >> beam.io.ReadFromText(input)
     | 'parse' >> beam.ParDo(Split())
     | 'sum' >> beam.CombinePerKey(sum)
     # Rebuild the CSV row as: id, name, summed amount, type.
     | 'convertToString' >> beam.Map(
           lambda kv: '%s,%s,%s,%s' % (kv[0].split(",")[0], kv[0].split(",")[1], kv[1], kv[0].split(",")[2]))
     | 'write' >> beam.io.WriteToText(output_prefix)
    )
    p.run().wait_until_finish()
it will produce output as below:
cust234,Srini,200.0,C
cust444,shaker,500.0,D
cust123,ravi,300.0,D
cust123,ravi,400.0,C

Google Fusion Tables - How to get ROWID in media download

I am querying a fusion table using a request like this:
https://www.googleapis.com/fusiontables/v2/query?alt=media&sql=SELECT ROWID,State,County,Name,Location,Geometry,[... many more ...] FROM <table ID>
The results from this query exceed 10MB, so I must include the alt=media option (for smaller queries, where I can remove this option, the problem does not exist). The response is in csv, as promised by the documentation. The first line of the response appears to be a header line which exactly matches my query string (except that it shows rowid instead of ROWID):
rowid,State,County,Name,Location,Geometry,[... many more ...]
The following rows however do not include the row id. Each line begins with the second item requested in the query - it seems as though the row id was ignored:
WV,Calhoun County,"Calhoun County, WV, USA",38.858 -81.1196,"<Polygon><outerBoundaryIs>...
Is there any way around this? How can I retrieve row ids from a table when the table is large?
Missing ROWIDs are also present for "media" requests made via Google's Python API Client library, e.g.:
def doGetQuery(query):
    request = FusionTables.query().sqlGet_media(sql=query)
    response = request.execute()
    ft = {'kind': "fusiontables#sqlresponse"}
    s = response.decode()  # bytestring to string
    data = []
    for i in s.splitlines():
        data.append(i.split(','))
    ft['columns'] = data.pop(0)  # ['Rowid', 'A', 'B', 'C']
    ft['rows'] = data            # [['a1', 'b1', 'c1'], ...]
    return ft
You may at least have one fewer issue than I - this sqlGet_media method can only be made with a "pure" GET request - a query long enough (2 - 8k characters) to get sent as overridden POST generates a 502 Bad Gateway error, even for tiny response sizes such as the result from SELECT COUNT(). The same query as a non-media request works flawlessly (provided the response size is not over 10 MB, of course).
The solution to both your and my issue is to batch the request using OFFSET and LIMIT, such that the 10 MB response limit isn't hit. Estimate the size of an average response row at the call site, pass it into a wrapper function, and let the wrapper handle adding OFFSET and LIMIT to your input SQL, and then collate the multi-query result into a single output of the desired format:
from math import floor

def doGetQuery(query, rowSize = 1.):
    limitValue = floor(9.5 * 1024 / rowSize)  # rowSize in kB; keep each response safely under 10 MB
    offsetValue = 0
    ft = {'kind': "fusiontables#sqlresponse"}
    data = []
    done = False
    while not done:
        tail = ' '.join(['OFFSET', str(offsetValue), 'LIMIT', str(limitValue)])
        request = FusionTables.query().sqlGet(sql=query + ' ' + tail)
        response = request.execute()
        offsetValue += limitValue
        if 'rows' in response.keys():
            data.extend(response['rows'])
        # Check the exit condition.
        if 'rows' not in response.keys() or len(response['rows']) < limitValue:
            done = True
        if 'columns' not in ft.keys() and 'columns' in response.keys():
            ft['columns'] = response['columns']
    ft['rows'] = data
    return ft
This wrapper can be extended to handle actual desired uses of offset and limit. Ideally for FusionTable or other REST API methods they provide a list() and list_next() method for native pagination, but no such features are present for FusionTable::Query. Given the horrendously slow rate of FusionTable API / functionality updates, I wouldn't expect ROWIDs to magically appear in alt=media downloads, or for GET-turned-POST media-format queries to ever work, so writing your own wrapper is going to save you a lot of headache.
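For example, a call site might look like the following; the 0.5 kB-per-row estimate and the column selection are purely illustrative:

# Rough per-row size estimate (in kB) for this particular table and column selection.
result = doGetQuery("SELECT ROWID,State,County,Name,Location,Geometry FROM <table ID>", rowSize=0.5)
print(len(result['rows']), "rows; columns:", result['columns'])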

In Factual how to get results with unique field values (similar to GROUP BY in SQL)?

I've just set out on the path of discovering the Factual API and I cannot see how to retrieve a selection of entries, each with a unique value in a specified field.
For example, give me 10 results from various cities:
q.limit(10);
q.field("locality").unique(); // no such filter exists
factual.fetch("places", q);
This would be an equivalent query in MySQL:
SELECT * FROM places GROUP BY locality LIMIT 10;
What I want is a little bit similar to facets:
FacetQuery fq = new FacetQuery("locality").maxValuesPerFacet(10);
fq.field("country").isEqual("gb");
FacetResponse resp = factual.fetch("places", fq);
but instead of the total for each result I would like to see a random object with all the information.
Is anything like this possible?

twython : get followers list

Using twython I am trying to retrieve the list of all followers of a particular id which has more than 40k followers, but I am running into the error below:
"Twitter API returned a 429 (Too Many Requests): rate limit exceeded"
How do I overcome this issue? Below is the snippet; I am printing the user name and time zone information.
next_cursor = -1
while(next_cursor):
    search = twitter.get_followers_list(screen_name='ndtvgadgets', cursor=next_cursor)
    for result in search['users']:
        time_zone = result['time_zone'] if result['time_zone'] != None else "N/A"
        print result["name"].encode('utf-8') + ' ' + time_zone.encode('utf-8')
    next_cursor = search["next_cursor"]
Change the search line to:
search = twitter.get_followers_list(screen_name='ndtvgadgets',count=200,cursor=next_cursor)
Then import the time module and insert time.sleep(60) between each API call.
It'll take ages for a user with 41K followers (around three and a half hours for the ndtvgadgets account), but it should work. With the count increased to 200 (the maximum) you're effectively requesting 200 results every minute. If there are other API calls in your script in addition to twitter.get_followers_list you might want to pad the sleep time a little or insert a sleep call after each one.
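For reference, here is a sketch of the question's loop with both changes applied; it is only an illustration (it reuses the same twitter client object and screen name, simplifies the print line, and assumes Python 3):

import time

next_cursor = -1
while next_cursor:
    # count=200 is the maximum page size for the followers/list endpoint.
    search = twitter.get_followers_list(screen_name='ndtvgadgets',
                                        count=200,
                                        cursor=next_cursor)
    for result in search['users']:
        time_zone = result['time_zone'] if result['time_zone'] is not None else "N/A"
        print(result["name"], time_zone)
    next_cursor = search["next_cursor"]
    # followers/list allows roughly 15 requests per 15-minute window,
    # so sleeping a minute between calls keeps the script under the limit.
    time.sleep(60)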
