Entrez.esummary ('gene' db): how to retrieve uid from DictElement? - biopython

I'm trying to retrieve and save gene summaries from NCBI Entrez Gene database, and would like to keep the uid too, but, though it's there, I can't find the right way to retrieve it from the results. See below (NB: obviously not my valid email address used here):
from Bio import Entrez
Entrez.email = "bogus#bogus.com"
handle = Entrez.esummary(db="gene", id="79001")
record = Entrez.read(handle)
handle.close()
for k in record["DocumentSummarySet"]['DocumentSummary'][0].keys():
    print(k)
These are the keys:
Status, NomenclatureSymbol, OtherDesignations, Mim, Name, NomenclatureName, CurrentID, GenomicInfo, OtherAliases, Summary, GeneWeight, GeneticSource, MapLocation, ChrSort, ChrStart, LocationHist, Organism, NomenclatureStatus, Chromosome, Description
But if you look at the element itself (record["DocumentSummarySet"]['DocumentSummary'][0]), you will notice attributes={u'uid': u'79001'} at the end:
DictElement(
{u'Status': '0',
u'NomenclatureSymbol': 'VKORC1',
u'OtherDesignations': 'phylloquinone epoxide reductase',
u'Mim': ['608547'],
u'Name': 'VKORC1',
u'NomenclatureName': 'vitamin K epoxide reductase complex subunit 1',
u'CurrentID': '0',
u'GenomicInfo': [
{u'ChrAccVer': 'NC_000016.10',
u'ChrLoc': '16',
u'ExonCount': '4',
u'ChrStop': '31090841',
u'ChrStart': '31094998'}],
u'OtherAliases': 'EDTP308, MST134, MST576, VKCFD2, VKOR',
u'Summary': 'This gene [...] variants. [provided by RefSeq, Aug 2015]',
u'GeneWeight': '46017',
u'GeneticSource': 'genomic',
u'MapLocation': '16p11.2',
u'ChrSort': '16',
u'ChrStart': '31090841',
u'LocationHist': [
{u'AssemblyAccVer': 'GCF_000001405.33',
u'ChrAccVer': 'NC_000016.10',
u'AnnotationRelease': '108',
u'ChrStop': '31090841',
u'ChrStart': '31094998'}],
u'Organism': {
u'CommonName': 'human',
u'ScientificName': 'Homo sapiens',
u'TaxID': '9606'},
u'NomenclatureStatus': 'Official',
u'Chromosome': '16',
u'Description': 'vitamin K epoxide reductase complex subunit 1'},
attributes={u'uid': u'79001'})
but 'attributes' is not one of the keys. I have yet to find a way to access the uid kept in attributes. Would anyone have an idea?

attributes is just an attribute of the DictElement object, and you can access it with standard dot notation:
record["DocumentSummarySet"]['DocumentSummary'][0].attributes

Flink - Join same stream in order to filter some events

I have a stream of data that looks like this:
impressionId | id | name | eventType | timestamp
I need to filter out (ignore) events of type "click" that don't have a matching 'impressionId' of type 'impression' (so basically ignore click events that don't have an impression),
and then count how many impressions and how many clicks I have in total (for an id/name pair) for a particular time window.
This is how I approached the solution:
[...]
Table eventsTable = tEnv.fromDataStream(eventStreamWithTimeStamp, "impressionId, id, name, eventType, eventTime.rowtime");
tEnv.registerTable("Events", eventsTable);
Table clicksTable = eventsTable
.where("eventType = 'click'")
.window(Slide.over("24.hour").every("1.minute").on("eventTime").as("minuteWindow"))
.groupBy("impressionId, id, name, eventType, minuteWindow")
.select("impressionId as clickImpressionId, eventType as clickEventType, concat(concat(id,'_'), name) as concatClickId, id as clickId, name as clickName, minuteWindow.rowtime as clickMinute");
Table impressionsTable = eventsTable
.where("eventType = 'impression'")
.window(Slide.over("24.hour").every("1.minute").on("eventTime").as("minuteWindow"))
.groupBy("impressionId, id, name, eventType, minuteWindow")
.select("impressionId as impressionImpressionId, eventType as impressionEventType, concat(concat(id,'_'), name) as concatImpId, id as impId, name as impName, minuteWindow.rowtime as impMinute");
Table filteredClickCount = clicksTable
.join(impressionsTable, "clickImpressionId = impressionImpressionId && concatClickId = concatImpId && clickMinute = impMinute")
.window(Slide.over("24.hour").every("1.minute").on("clickMinute").as("minuteWindow"))
.groupBy("concatClickId, clickMinute")
.select("concatClickId, concatClickId.count as clickCount, clickMinute as eventTime");
DataStream<Test3> result = tEnv.toAppendStream(filteredClickCount, Test3.class);
result.print();
What I'm trying to do is simply create two tables, one with clicks and one with impressions, 'inner' join clicks to impressions, and the rows that survive the join are the clicks that have a matching impression.
Now this doesn't work and I don't know why!?
The counts produced by the last joined table are not correct. It works for the first minute, but after that the counts are off by almost double.
I have then tried to modify the last table like this:
Table clickWithMatchingImpression2 = clicksTable
.join(impressionsTable, "clickImpressionId = impressionImpressionId && concatClickId = concatImpId && clickMinute = impMinute")
.groupBy("concatClickId, clickMinute")
.select("concatClickId, concatClickId.count as clickCount, clickMinute as eventTime");
DataStream<Tuple2<Boolean, Test3>> result2 = tEnv.toRetractStream(clickWithMatchingImpression2, Test3.class);
result2.print();
And.... this works!? However, I don't know why, and I don't know what to do with this DataStream<Tuple2<Boolean, Test3>> format... Flink refuses to use toAppendStream when the table doesn't have a window.
I would like a simple structure with only the final numbers.
1) Is my approach correct? Is there an easier way of filtering clicks that don't have impressions?
2) Why are the counts not correct in my solution?
I am not entirely sure if I understood your use case correctly; an example with some data points would definitely help here.
Let me explain what your code is doing. First the two tables calculate how many clicks/impressions there were in the last 24 hours.
For an input
new Event("1", "1", "ABC", "...", 1),
new Event("1", "2", "ABC", "...", 2),
new Event("1", "3", "ABC", "...", 3),
new Event("1", "4", "ABC", "...", 4)
You will get windows (array<eventId>, window_start, window_end, rowtime):
[1], 1969-12-31T00:01:00.000, 1970-01-01T00:01:00.000, 1970-01-01T00:00:59.999
[1, 2], 1969-12-31T00:02:00.000, 1970-01-01T00:02:00.000, 1970-01-01T00:01:59.999
[1, 2, 3], 1969-12-31T00:03:00.000, 1970-01-01T00:03:00.000, 1970-01-01T00:02:59.999
...
Therefore, when you group both on id and name, you get something like:
1, '...', '1_ABC', 1, 'ABC', 1970-01-01T00:00:59.999
1, '...', '1_ABC', 1, 'ABC', 1970-01-01T00:01:59.999
1, '...', '1_ABC', 1, 'ABC', 1970-01-01T00:02:59.999
...
which, if you group again into 24-hour windows, will count each event with the same id multiple times.
If I understand your use case correctly and you are looking for how many impressions happened in a 1-minute period around an occurrence of a click, an interval join might be what you are looking for. You could implement your case with the following query:
Table clicks = eventsTable
.where($("eventType").isEqual("click"))
.select(
$("impressionId").as("clickImpressionId"),
concat($("id"), "_", $("name")).as("concatClickId"),
$("id").as("clickId"),
$("name").as("clickName"),
$("eventTime").as("clickEventTime")
);
Table impressions = eventsTable
.where($("eventType").isEqual("impression"))
.select(
$("impressionId").as("impressionImpressionId"),
concat($("id"), "_", $("name")).as("concatImpressionId"),
$("id").as("impressionId"),
$("name").as("impressionName"),
$("eventTime").as("impressionEventTime")
);
Table table = impressions.join(
clicks,
$("clickImpressionId").isEqual($("impressionImpressionId"))
.and(
$("clickEventTime").between(
$("impressionEventTime").minus(lit(1).minutes()),
$("impressionEventTime"))
))
.select($("concatClickId"), $("impressionEventTime"));
table
.window(Slide.over("24.hour").every("1.minute").on("impressionEventTime").as("minuteWindow"))
.groupBy($("concatClickId"), $("minuteWindow"))
.select($("concatClickId"), $("concatClickId").count())
.execute()
.print();
As for why Flink sometimes cannot produce an append stream but only a retract stream: very briefly, if an operation does not work based on a time attribute, there is no single point in time at which the result is "valid". Therefore it must emit a stream of changes instead of a single appended value. The first field in the tuple tells you whether the record is an insertion (true) or a retraction/deletion (false).
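If it helps, here is a minimal sketch (assuming the Test3 POJO from the question) of consuming such a retract stream by keeping only the insert records; note that every update arrives as a retraction followed by a fresh insertion, so downstream consumers usually want the latest value per key rather than a plain filter:
DataStream<Tuple2<Boolean, Test3>> retractStream =
    tEnv.toRetractStream(clickWithMatchingImpression2, Test3.class);
retractStream
    .filter(change -> change.f0)   // true = insertion, false = retraction
    .map(change -> change.f1)      // unwrap the Test3 payload
    .returns(Test3.class)          // type hint needed because of the lambda
    .print();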

Googlesheet APIv4 getting empty cells

I have a Google Sheet where a column may contain no information. While iterating through the rows and looking at that column, if the column is blank, nothing is returned for it. Even worse, if I get a full row that includes that column, say 5 columns, I get back only 4 columns when any of the columns are empty. How do I get back either NULL or an empty string when I'm fetching a row and one of the cells in a column is empty?
// Build a new authorized API client service.
Sheets service = GoogleSheets.getSheetsService();
range = "Functional Users!A3:E3";
response = service.spreadsheets().values().get(spreadsheetId, range).execute();
values = response.getValues();
cells = values.get(0);
I am getting 5 cells in the row, so cells.size() should ALWAYS return five. However, if any of the 5 cells are blank, it returns fewer cells. Say only the cell at B3 is empty: cells.size() will be 4. On the next iteration I get A4:E4, and cell D4 is empty; again, cells.size() will be 4, with no way to know which cell is missing. If A4 AND D4 AND E4 are empty, cells.size() will be 2.
How do I get it to return 5 cells regardless of empty cells?
The way I solved this issue was by converting the values into a Pandas dataframe. I fetched the particular columns that I wanted from my Google Sheet, converted those values into a Pandas dataframe, did some data formatting, and then converted the dataframe back into a list. By converting the list to a Pandas dataframe, each column is preserved. Pandas already creates null values for empty trailing rows and columns, but I also needed to convert the non-trailing blank values to nulls to keep things consistent.
# Authenticate and create the service for the Google Sheets API
from googleapiclient import discovery
from httplib2 import Http
from oauth2client.service_account import ServiceAccountCredentials
import pandas as pd

credentials = ServiceAccountCredentials.from_json_keyfile_name(KEY_FILE_LOCATION, SCOPES)
http = credentials.authorize(Http())
discoveryUrl = ('https://sheets.googleapis.com/$discovery/rest?version=v4')
service = discovery.build('sheets', 'v4',
                          http=http, discoveryServiceUrl=discoveryUrl)
spreadsheetId = 'id of your sheet'
rangeName = 'range of your dataset'
result = service.spreadsheets().values().get(
spreadsheetId=spreadsheetId, range=rangeName).execute()
values = result.get('values', [])
#convert values into dataframe
df = pd.DataFrame(values)
#replace all non trailing blank values created by Google Sheets API
#with null values
df_replace = df.replace([''], [None])
#convert back to list to insert into Redshift
processed_dataset = df_replace.values.tolist()
I've dabbled in Sheets v4 and this is indeed the behavior when you're reading a range of cells with empty data; it seems this is the way it has been designed. As stated in the Reading data docs:
Empty trailing rows and columns are omitted.
So if you can find a way to write a character that represents 'empty values', like zero, then that will be one way to do it.
I experienced the same issue using v4 of the Sheets API, but was able to work around it by adding an extra column at the end of my range and using the valueRenderOption argument for the values.get API.
Given three columns A, B and C, any of which might contain a null value, add an additional column D and put an arbitrary value there, such as 'blank'.
Ensure you capture the new column in your range and add the additional parameter
valueRenderOption: 'FORMATTED_VALUE'.
You should end up with a call similar to this:
sheets.spreadsheets.values.get({
    spreadsheetId: SOME_SHEET_ID,
    range: "AUTOMATION!A:D",
    valueRenderOption: 'FORMATTED_VALUE'
}, (err, res) => {})
This should then give you a consistent length array for each value, returning a blank string "" in the place of the empty cell value.
If you pull a range from the Google Sheets API v4, then empty row data IS included if it's at the beginning or middle of the selected range. Only cells with no data at the end of the range are omitted. Using this assumption, you can 'fill' the no-data cells in your app code.
For instance, if you selected A1:A5 and A1 has no value, it will still be returned in the row data as {}.
If A5 is missing, you'll have an array of length 4 and so know to fill the empty A5.
If A4 and A5 are empty, you'll have an array of length 3, and so on.
If none of the range contains data then you'll receive an empty object.
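Building on that, here is a minimal Java sketch (using java.util.List/ArrayList, assuming the values list returned by the code in the question, and a hypothetical expectedColumns for the width of the requested range) that pads each returned row back out with empty strings:
int expectedColumns = 5;  // e.g. the range A:E is 5 columns wide
List<List<Object>> padded = new ArrayList<>();
for (List<Object> row : values) {
    List<Object> filled = new ArrayList<>(row);
    while (filled.size() < expectedColumns) {
        filled.add("");  // stand in for the omitted trailing cells
    }
    padded.add(filled);
}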
I know that this is super late, but just in case someone else who has this problem in the future would like a fix for it, I'll share what I did to work past this.
What I did was increase the length of the range of cells I was looking for by one. Then, within the Google spreadsheet I was reading from, I added a line of "."s in the extra column (the column added now that the desired range of cells has increased), and I protected that line of periods so that it can't be changed from ".".
This gives you an array with everything you are looking for, including null results, but it does increase your array size by 1. If that bothers you, you can just make a new array without the last index.
The only solution I could find is writing your own function:
def _safe_get(data, r, c):
    try:
        return data[r][c]
    except IndexError:
        return ''

def read(range_name, service):
    result = service[0].spreadsheets().values().get(spreadsheetId=service[1],
                                                    range=range_name).execute()
    return result.get('values', [])

def safe_read(sheet, row, col, to_row='', to_col='', service=None):
    range_name = '%s!%s%i:%s%s' % (sheet, col, row, to_col, to_row)
    data = read(range_name, service)
    if to_col == '':
        cols = max(len(line) for line in data)
    else:
        cols = ord(to_col.lower()) - ord(col.lower()) + 1
    if to_row == '':
        rows = len(data)
    else:
        rows = to_row - row + 1
    return [[_safe_get(data, r, c)
             for c in range(cols)]
            for r in range(rows)]
If the last cell in a row has a value, then the row will be returned in full.
For example:
Rows:
|Nick|29 years|Minsk|
|Mike| |Pinsk|
|Boby| | |
Return:
[
["Nick", "29 years", "Minsk"],
["Mike", "", "Pinsk"]
["Boby"]
]
So when you add a new line, instead of leaving cells empty ("" or null), just use a space " ".
Then, when you read the values, simply map all items from a space " " back to an empty string "".
Rows:
|Nick|29 years|Minsk|
|Mike| |Pinsk|
|Boby| |" " |
Return:
[
["Nick", "29 years", "Minsk"],
["Mike", "", "Pinsk"]
["Boby", "", " "]
]
Another option is iterating through the returned rows, checking the length of the row and appending whatever data you were expecting to be returned. I found this preferable to adding junk data to my dataset.
I am super late to the party, but here goes another alternative:
def read_sheet(service, SPREADSHEET_ID, range) -> pd.DataFrame:
    result = service.spreadsheets().values().get(spreadsheetId=SPREADSHEET_ID, range=range).execute()
    rows = result.get('values', [])
    df = pd.DataFrame(rows[0:])
    df.columns = df.iloc[0]
    df = df.drop(axis=0, index=0)
    return df
For this solution to work you will need headers (column names) in all columns of the spreadsheet you want to read. It loads a pandas DataFrame without a header specification, replaces the column names with the first row, and then drops that row.
Sheets API v4 should return all blanks up to the last filled column.
This will fill out the blanks:
values = result.get('values', [])
print(values[1:5]) # [['Spinach Lasagna', '10', '5', '', 'x'], ['Hot Dish', '10', '5', '', '', '', 'x'], ['Tuna-Noodle Casserole', '10', '5', '', 'x', '', '', 'x'], ['Sausage and Peppers', '10', '3', '', '', '', '', '', 'x']]
n_col = 14 # hard code
n_col = max([len(i) for i in values]) # if last column is occupied at least once
n_col = len(values[0]) # if you have header
values = [lst + ([''] * (n_col - len(lst))) for lst in values]
print(values[1:4]) # [['Spinach Lasagna', '10', '5', '', 'x', '', '', '', '', '', '', '', '', ''], ['Hot Dish', '10', '5', '', '', '', 'x', '', '', '', '', '', '', ''], ['Tuna-Noodle Casserole', '10', '5', '', 'x', '', '', 'x', '', '', '', '', '', '']]
Just add:
values.add("");
before:
cells = values.get(0);
This will ensure that you do not query an empty list because of a blank cell or row.

How to read column value using grails excel import plugin?

I am using Grails excel import plugin to import an excel file.
static Map propertyConfigurationMap = [
name:([expectedType: ExcelImportService.PROPERTY_TYPE_STRING, defaultValue:null]),
age:([expectedType: ExcelImportService.PROPERTY_TYPE_INT, defaultValue:0])]
static Map CONFIG_USER_COLUMN_MAP = [
sheet:'Sheet1',
startRow: 1,
columnMap: [
//Col, Map-Key
'A':'name',
'B':'age',
]
]
I am able to retrieve the array list by using the code snippet:
def usersList = excelImportService.columns(workbook, CONFIG_USER_COLUMN_MAP)
which results in
[[name: Mark, age: 25], [name: Jhon, age: 46], [name: Anil, age: 62], [name: Steve, age: 32]]
I'm also able to read each record, say [name: Mark, age: 25], by using usersList.get(0).
How do I read each column value?
I know I can read something like this
String[] row = usersList.get(0)
for (String s : row)
println s
I wonder whether there is anything the plugin supports so that I can read a column value directly, rather than manipulating the result to get what I want.
Your usersList is basically a List<Map<String, Object>> (list of maps). You can read a column using the name you gave it in the config. In your example, you named column A name and column B age. So using your iteration example as a basis, you can read each column like this:
Map row = usersList.get(0)
for (Map.Entry entry : row) {
    println entry.value
}
Groovy makes this easier to do with Object.each(Closure):
row.each { key, value ->
    println value
}
If you want to read a specific column value, here are a few ways to do it:
println row.name // One
println row['name'] // Two
println row.getAt('name') // Three
Hint: These all end up calling row.getAt('name')
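If "read the column value directly" means pulling a whole column out across all rows, here is a small sketch (assuming the usersList from the question) using Groovy's spread operator and collect:
def names = usersList*.name               // ['Mark', 'Jhon', 'Anil', 'Steve']
def ages  = usersList.collect { it.age }  // [25, 46, 62, 32]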

neo4jrestclient - query get id

I am using the neo4jrestclient library.
from neo4jrestclient.client import GraphDatabase
from neo4jrestclient import client
from neo4jrestclient import query
gdb = GraphDatabase("http://localhost:7474/db/data/")
q = """MATCH n RETURN n;"""
result = gdb.query(q=q)
print(result[0])
When I execute the query "MATCH n RETURN n", the output is:
[{
'all_relationships': 'http://localhost:7474/db/data/node/1131/relationships/all',
'all_typed_relationships': 'http://localhost:7474/db/data/node/1131/relationships/all/{-list|&|types}',
'self': 'http://localhost:7474/db/data/node/1131',
'labels': 'http://localhost:7474/db/data/node/1131/labels',
'properties': 'http://localhost:7474/db/data/node/1131/properties',
'create_relationship': 'http://localhost:7474/db/data/node/1131/relationships',
'outgoing_relationships': 'http://localhost:7474/db/data/node/1131/relationships/out',
'data': {
'title': 'title',
'name': 'Poludnie'
},
'incoming_typed_relationships': 'http://localhost:7474/db/data/node/1131/relationships/in/{-list|&|types}',
'property': 'http://localhost:7474/db/data/node/1131/properties/{key}',
'paged_traverse': 'http://localhost:7474/db/data/node/1131/paged/traverse/{returnType}{?pageSize,leaseTime}',
'incoming_relationships': 'http://localhost:7474/db/data/node/1131/relationships/in',
'outgoing_typed_relationships': 'http://localhost:7474/db/data/node/1131/relationships/out/{-list|&|types}',
'traverse': 'http://localhost:7474/db/data/node/1131/traverse/{returnType}'}]
I see that the node’s id = 1131. The question is: can I obtain this id in raw forms without those links? I would like to have only the id together with the value of the ‘data’ field.
In Cypher, that could be expressed like this:
MATCH (n) RETURN {id: ID(n), name: n.name, title: n.title} as city
In the response, the data hash will contain an array, and each element's row key will contain this data, accessible by the keys given in the query.
To get "just the id and data, change your query to:
MATCH (n) RETURN id(n), n.data
See if that is satisfactory.
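As a rough sketch of the first suggestion with neo4jrestclient (its query method accepts a returns parameter for casting each column; the property names assume the node shown above):
from neo4jrestclient.client import GraphDatabase

gdb = GraphDatabase("http://localhost:7474/db/data/")
q = "MATCH (n) RETURN id(n), n.name, n.title"
result = gdb.query(q=q, returns=(int, str, str))
for node_id, name, title in result:
    print(node_id, name, title)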

MongoDB - Mongoid map reduce basic operation

I have just started with MongoDB and mongoid.
The biggest problem I'm having is understanding the map/reduce functionality to be able to do some very basic grouping and such.
Let's say I have a model like this:
class Person
  include Mongoid::Document
  field :age, type: Integer
  field :name
  field :sdate
end
That model would produce objects like these:
#<Person _id: 9xzy0, age: 22, name: "Lucas", sdate: "2013-10-07">
#<Person _id: 9xzy2, age: 32, name: "Paul", sdate: "2013-10-07">
#<Person _id: 9xzy3, age: 23, name: "Tom", sdate: "2013-10-08">
#<Person _id: 9xzy4, age: 11, name: "Joe", sdate: "2013-10-08">
Could someone show how to use mongoid map reduce to get a collection of those objects grouped by the sdate field? And to get the sum of ages of those that share the same sdate field?
I'm aware of this: http://mongoid.org/en/mongoid/docs/querying.html#map_reduce
But somehow it would help to see that applied to a real example. Where does that code go (in the model, I guess)? Is a scope needed? etc.
I can make a simple search with mongoid, get the array, and manually construct anything I need, but I guess map/reduce is the way to go here. I imagine the JS functions mentioned on the mongoid page are fed to the DB, which performs those operations internally. Coming from ActiveRecord, these new concepts are a bit strange.
I'm on Rails 4.0, Ruby 1.9.3, Mongoid 4.0.0, MongoDB 2.4.6 on Heroku (mongolab) though I have locally 2.0 that I should update.
Thanks.
Taking the examples from http://mongoid.org/en/mongoid/docs/querying.html#map_reduce, adapting them to your situation, and adding comments to explain:
map = %Q{
  function() {
    // here "this" is the record that map
    // is going to be executed on
    emit(this.sdate, { age: this.age, name: this.name });
  }
}
reduce = %Q{
  function(key, values) {
    // this will be executed for every group that
    // has the same sdate value
    var result = { avg_of_ages: 0 };
    var sum = 0;      // sum of all ages
    var totalnum = 0; // total number of people
    values.forEach(function(value) {
      sum += value.age;
      totalnum += 1;
    });
    result.avg_of_ages = sum / totalnum; // finding the average
    return result;
  }
}
results = Person.map_reduce(map, reduce).out(inline: 1) # execute with inline output; iterate the results as an array of hashes
# each result looks like {"_id" => sdate, "value" => {"avg_of_ages" => ...}}
first_average = results.first["value"]["avg_of_ages"]
results.each do |result|
  # do whatever you want with result
end
Though I would suggest you use aggregation rather than map/reduce for such a simple operation. The way to do this is as follows:
results = Person.collection.aggregate([{"$group" => {"_id" => {"sdate" => "$sdate"},
                                                     "avg_of_ages" => {"$avg" => "$age"}}}])
and the result will be almost identical to the map/reduce output, with a lot less code written.
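Since the question also asked for the sum of ages per sdate, here is a minimal sketch of that variant (assuming the Person model above):
sums = Person.collection.aggregate([
  { "$group" => { "_id" => "$sdate", "total_age" => { "$sum" => "$age" } } }
])
sums.each { |doc| puts "#{doc['_id']}: #{doc['total_age']}" }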
