How to get the PDB id of a mystery sequence? - biopython

I have a bunch of proteins from something called ProteinNet.
The sequences there have some sort of ID, but it is clearly not a PDB ID, so I need to find that some other way. For each protein I have the amino acid sequence. I'm using Biopython, but I'm not very experienced with it yet and couldn't find this in the guide.
So my question is: how do I find a protein's PDB ID given its amino acid sequence, so that I can download the PDB file for the protein?

Hi, a while ago I was playing with the RCSB PDB Search API and ended up with this piece of code (I can't find the examples on the RCSB PDB website anymore):
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Dec 27 16:20:43 2020
@author: Pietro
"""
import PDB_searchAPI_5
from PDB_searchAPI_5.rest import ApiException

# Defining the host is optional and defaults to https://search.rcsb.org/rcsbsearch/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = PDB_searchAPI_5.Configuration(
    host="http://search.rcsb.org/rcsbsearch/v1"
)

# Sequence search query: paste your own sequence into the "value" field.
data_entry_1 = '''{
  "query": {
    "type": "terminal",
    "service": "sequence",
    "parameters": {
      "evalue_cutoff": 1,
      "identity_cutoff": 0.9,
      "target": "pdb_protein_sequence",
      "value": "STEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMSCKCVLS"
    }
  },
  "request_options": {
    "scoring_strategy": "sequence"
  },
  "return_type": "entry"
}'''

# Enter a context with an instance of the API client
with PDB_searchAPI_5.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = PDB_searchAPI_5.SearchServiceApi(api_client)
    try:
        # Run the JSON query against the search service
        pippo = api_instance.run_json_queries_get(data_entry_1)
    except ApiException as e:
        print("Exception when calling SearchServiceApi->run_json_queries_get: %s\n" % e)
        exit()

# Each hit in result_set is a dict with a PDB identifier and a match score
for hit in pippo.result_set:
    print(hit['identifier'], ' score : ', hit['score'])
The search module (import PDB_searchAPI_5) was generated with openapi-generator-cli-4.3.1.jar.
The OpenAPI spec was at version 1.7.3 at the time; it is now 1.7.15, see https://search.rcsb.org/openapi.json
The data_entry_1 query was copied from the RCSB PDB website, but I can't find the page anymore; it said something about MMseqs2 being the software doing the search. I played with the
"evalue_cutoff": 1,
"identity_cutoff": 0.9,
parameters but didn't find a way to select only 100% identity matches.
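If you prefer not to generate a client at all, here is a minimal sketch of the same sequence search done with plain requests; the /query path on the v1 endpoint and the idea of pushing identity_cutoff up to 1.0 for near-exact matches are assumptions on my part, so treat it as a starting point rather than a reference:
import json
import requests

query = {
    "query": {
        "type": "terminal",
        "service": "sequence",
        "parameters": {
            "evalue_cutoff": 1,
            "identity_cutoff": 1.0,   # assumption: 1.0 restricts hits to (near-)exact matches
            "target": "pdb_protein_sequence",
            "value": "STEYKLVVVGAGGVGKSALTIQLIQNHFVDEY"  # replace with your full sequence
        }
    },
    "request_options": {"scoring_strategy": "sequence"},
    "return_type": "entry"
}

# Same service as the generated client above, just called directly over HTTP.
response = requests.post("https://search.rcsb.org/rcsbsearch/v1/query", json=query)
response.raise_for_status()
for hit in response.json()["result_set"]:
    print(hit["identifier"], hit["score"])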
Here is the PDB_searchAPI_5 package; install it in a virtual environment with:
pip install PDB-searchAPI-5-1.0.0.tar.gz
It was generated by openapi-generator-cli-4.3.1.jar with:
java -jar openapi-generator-cli-4.3.1.jar generate -g python -i pdb-search-api-openapi.json --additionalproperties=generateSourceCodeOnly=True,packageName=PDB_searchAPI_5
Don't put spaces in the --additionalproperties part (it took me a week to figure that out).
The generated README.md file is the most important part, as it explains how to use the OpenAPI client.
You need to put your FASTA sequence here:
"value": "STEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMSCKCVLS"
A score of 1 should be an exact match.
The Biopython BLAST module is probably easier, but it BLASTs against the NIH databases instead of RCSB PDB. Sorry I can't elaborate more on this; I still need to figure out exactly what a JSON file is, and I wasn't able to find a better free tool that automatically generates a nicer OpenAPI Python client (I believe that's not such an easy task... but we always want more...).
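If you do want to try the Biopython route, a minimal sketch along these lines might work; it assumes that NCBI's web BLAST accepts "pdb" as the protein database name, so the returned hit titles carry PDB identifiers:
from Bio.Blast import NCBIWWW, NCBIXML

sequence = "STEYKLVVVGAGGVGKSALTIQLIQNHFVDEY"  # replace with your full sequence

# Assumption: "pdb" is accepted as a database name by NCBI's qblast,
# which limits hits to sequences derived from PDB structures.
result_handle = NCBIWWW.qblast("blastp", "pdb", sequence)
blast_record = NCBIXML.read(result_handle)

# Titles of the top alignments should contain PDB-style identifiers.
for alignment in blast_record.alignments[:5]:
    hsp = alignment.hsps[0]
    print(alignment.title, "identities:", hsp.identities, "/", hsp.align_length)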
To get API documentation, try:
java -jar openapi-generator-cli-4.3.1.jar generate -g html -i https://search.rcsb.org/openapi.json --skip-validate-spec
That gives you an HTML document; for a PDF there is https://mrin9.github.io/RapiPdf/
Note that http://search.rcsb.org/openapi.json works as well as https://search.rcsb.org/openapi.json, so you can watch the exchanges between client and server with Wireshark.

Related

Read into Dask from Minio raises issue with reading / converting binary string JSON into utf8

I'm trying to read JSON-LD into Dask from Minio. The pipeline works, but the strings come from Minio as binary strings.
So
with oss.open('gleaner/summoned/repo/file.jsonld', 'rb') as f:
    print(f.read())
results in
b'\n{\n "#context": "http://schema.org/",\n "#type": "Dataset",\n ...
I can simply convert this with
with oss.open('gleaner/summoned/repo/file.jsonld', 'rb') as f:
    print(f.read().decode("utf-8"))
and now everything is as I expect it.
However, I am working with Dask, and when reading into a bag with
dgraphs = db.read_text('s3://bucket/prefa/prefb/*.jsonld',
                       storage_options={
                           "key": key,
                           "secret": secret,
                           "client_kwargs": {"endpoint_url": "https://example.org"}
                       }).map(json.loads)
I cannot get the content coming from Minio to arrive as strings rather than binary strings. I suspect they need to be converted before they hit the json.loads map.
I assume I can inject the decode in here somehow as well, but I can't work out how.
Thanks
As the name implies, read_text opens the remote file in text mode, equivalent to open(..., 'rt'). The signature of read_text includes the various decoding arguments, such as UTF8 as the default encoding. You should not need to do anything else, but please post a specific error if you are having trouble, ideally with example file contents.
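For reference, a minimal sketch of the question's call with the encoding passed explicitly (utf-8 is already the default, and the bucket path and credentials are placeholders taken from the question):
import json
import dask.bag as db

key = "..."     # your Minio access key
secret = "..."  # your Minio secret key

# read_text opens the objects in text mode and decodes them for you,
# so json.loads receives str, not bytes.
dgraphs = db.read_text('s3://bucket/prefa/prefb/*.jsonld',
                       encoding='utf-8',
                       storage_options={
                           "key": key,
                           "secret": secret,
                           "client_kwargs": {"endpoint_url": "https://example.org"}
                       }).map(json.loads)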
If your data isn't delimited by lines, read_text might not be right for you, and you can do something like
@dask.delayed()
def read_a_file(fn):
    # or preferably open in text mode and json.load from the file
    with oss.open(fn, 'rb') as f:
        return json.loads(f.read().decode("utf-8"))

output = [read_a_file(f) for f in filenames]
and then you can create a bag or dataframe from this, as required.
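For example, a short sketch of turning those delayed values into a bag (assuming the filenames list from the snippet above):
import dask.bag as db

# Each delayed object becomes one partition of the bag.
bag = db.from_delayed(output)
print(bag.take(1))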

How to have a custom connector config file be parsed as java properties format?

I have a custom connector that writes Neo4j commands from a file to Kafka, and I would like to debug it. So I downloaded Confluent v3.3.0 and took some time to familiarize myself with it; however, I find myself stuck trying to load the connector. When I try to load the connector with its .properties file I get the following error:
parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 1, column 124
parse error: Invalid numeric literal at line 2, column 0
I have an inkling that it is trying to parse the file as JSON, because before this error I got the following warning when trying to load the connector:
Warning: Install 'jq' to add support for parsing JSON
So I brew-installed jq, and have since been getting the former error.
I would like this file to be parsed in Java properties format, which I thought would be implicit given the .properties extension, but do I need to be explicit in a setting somewhere?
Update:
I converted the .properties to JSON as suggested by @Konstantine Karantasis, but I get the same error as before, just without the first line:
parse error: Invalid numeric literal at line 2, column 0
I triple-checked my formatting and did some searching on the error, but have come up short. Please let me know if I made an error in my formatting or if there is a nuance to using JSON files with Kafka Connect that I don't know about.
Java properties:
name=neo4k-file-source
connector.class=neo4k.filestream.source.Neo4jFileStreamSourceConnector
tasks.max=1
file=Neo4jCommands.txt
topic=neo4j-commands
Converted to JSON:
[{
  "name": "neo4k-file-source",
  "connector": {
    "class": "neo4k.filestream.source.Neo4jFileStreamSourceConnector"
  },
  "tasks": {
    "max": 1
  },
  "file": "Neo4jCommands.txt",
  "topic": "neo4j-commands"
}]
Check out https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/ for an example of a valid JSON file being loaded using the Confluent CLI.
In your example, try this:
{
  "name": "neo4k-file-source",
  "config": {
    "connector.class": "neo4k.filestream.source.Neo4jFileStreamSourceConnector",
    "tasks.max": 1,
    "file": "Neo4jCommands.txt",
    "topic": "neo4j-commands"
  }
}
The Confluent CLI, which you are using to start your connector, tries to be smart about figuring out the type of your properties file. It doesn't depend on the extension (.properties); instead it calls file on the input file and matches the result against the string "ASCII".
This complies with the current definition of a Java properties file (https://en.wikipedia.org/wiki/.properties), but it should be extended to match files encoded in UTF-8 or files that contain escaped Unicode characters.
You have two options:
1. Transform your properties to JSON format instead.
2. Edit the CLI to match the file type returned when running file <yourconf.properties>
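If you go the JSON route, the same config can also be submitted straight to the Kafka Connect REST API instead of through the CLI; a minimal sketch (localhost:8083 is the default Connect REST port, adjust for your setup):
import json
import requests

connector = {
    "name": "neo4k-file-source",
    "config": {
        "connector.class": "neo4k.filestream.source.Neo4jFileStreamSourceConnector",
        "tasks.max": 1,
        "file": "Neo4jCommands.txt",
        "topic": "neo4j-commands"
    }
}

# POST the connector definition to the Connect worker's REST API.
resp = requests.post("http://localhost:8083/connectors",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(connector))
print(resp.status_code, resp.json())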

Using Jena to convert an owl file to N-Triples from terminal returns an empty file

I have generated OWL files using this generator: http://swat.cse.lehigh.edu/projects/lubm/
I want to transform the files into N-Triples, which I have done before using
$ riot -out N-TRIPLE ~/lubm20/*.owl > lubm20.nt
For some reason I now get an empty file (lubm20.nt), and when I use
$ rdfcat -out N-TRIPLE ~/lubm20/*.owl > lubm20.nt
I get this error
Exception in thread "main" org.apache.jena.riot.RiotException: <file:///root/lubm20/classes\University0_0.owl> Code: 4/UNWISE_CHARACTER in PATH: The character matches no grammar rules of URIs/IRIs. These characters are permitted in RDF URI References, XML system identifiers, and XML Schema anyURIs.
at org.apache.jena.riot.system.IRIResolver.exceptions(IRIResolver.java:371)
at org.apache.jena.riot.system.IRIResolver.resolve(IRIResolver.java:328)
at org.apache.jena.riot.system.IRIResolver$IRIResolverSync.resolve(IRIResolver.java:489)
at org.apache.jena.riot.system.IRIResolver.resolveIRI(IRIResolver.java:254)
at org.apache.jena.riot.system.IRIResolver.resolveString(IRIResolver.java:233)
at org.apache.jena.riot.SysRIOT.chooseBaseIRI(SysRIOT.java:109)
at org.apache.jena.riot.adapters.AdapterFileManager.readModelWorker(AdapterFileManager.java:286)
at org.apache.jena.util.FileManager.readModel(FileManager.java:341)
at jena.rdfcat.readInput(rdfcat.java:328)
at jena.rdfcat$ReadAction.run(rdfcat.java:473)
at jena.rdfcat.go(rdfcat.java:231)
at jena.rdfcat.main(rdfcat.java:206)
The generator produces a well-known Semantic Web benchmark dataset, so how can it contain UNWISE_CHARACTERs?
Edit:
To answer the question asked, I used this line to generate the *.owl files:
java edu.lehigh.swat.bench.uba.Generator -onto http://swat.cse.lehigh.edu/onto/univ-bench.owl univ 20
then moved the *.owl files to the lubm20 folder.
I used rdf2rdf instead of Jena:
java -jar rdf2rdf-1.0.1-2.3.1.jar /lubmData/lubm100/*.owl lubm100.nt
It worked like a charm.
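As another pure-Python alternative (not what the answer above used), rdflib can do the same OWL-to-N-Triples conversion; a minimal sketch, assuming the files are plain RDF/XML and the paths are adjusted to your layout:
import glob
from rdflib import Graph

# Parse every OWL (RDF/XML) file into one graph and write it out as N-Triples.
g = Graph()
for path in glob.glob("lubm20/*.owl"):
    g.parse(path, format="xml")

g.serialize(destination="lubm20.nt", format="nt")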

Parsing NYC Transit/MTA historical GTFS data (not realtime)

I've been puzzling over this on and off for months and can't find a solution.
The MTA claims to provide historical data in the form of daily dumps in GTFS format here:
http://web.mta.info/developers/MTA-Subway-Time-historical-data.html
See for yourself by downloading the example they provide, in this case Sep 17th, 2014:
https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-31
My problem? The file is gobbledygook. It does not follow GTFS specifications, has no extension, and when I open it in a text editor it looks like 7800 lines of this:
n
^C1.0^X �枪�^Eʞ>`
^C1.0^R^K
^A1^R^F^P����^E^R^K
^A2^R^F^P����^E^R^K
^A3^R^F^P����^E^R^K
^A4^R^F^P����^E^R^K
^A5^R^F^P����^E^R^K
^A6^R^F^P����^E^R^K
^AS^R^F^P����^E^R[
^F000001^ZQ
6
^N050400_1..S02R^Z^H20140917*^A1�>^V
^P01 0824 242/SFY^P^A^X^C^R^W^R^F^Pɚ��^E"^D140Sʚ>^F
^AA^R^AA^RR
^F000002"H
6
Per the MTA site (which appears untrue):
All data is formatted in GTFS-realtime
Any idea of the steps necessary to transform this mystery file into usable GTFS data? Is there some encoding I am missing? I have been looking for a long time and been unable to come up with a solution.
Also, not to be a stickler, but I am NOT referring to the MTA's realtime data feed, which is correctly formatted and usable. I am specifically referring to the historical data dumps I reference above (I have received many "solutions" referring only to the realtime feed).
The file you link to is in GTFS-realtime format, not GTFS, and the page you linked to does a very bad job of explaining which format their data is actually in (though it is mentioned in your quote).
GTFS is used to store schedule data, like routes and scheduled arrival times.
GTFS-realtime is generally used to transfer actual transit performance data in real time, like vehicle locations and expected or actual arrival times. It is a protobuf, a specification for compiled binary data published by Google, which means you can't usefully read it in a text editor; instead you have to load it programmatically using the Google protobuf tools. It can be used as a historical data format in the way the MTA does here, by making daily dumps of the GTFS-rt feed publicly available. It's called GTFS-realtime because various fields in the realtime feed, like route_id, trip_id, and stop_id, are designed to link to the published GTFS schedules.
I confirmed the validity of the data you linked to by decompiling it using the gtfs-realtime.proto specification and the Google protobuf tools for Python. It begins:
header {
  gtfs_realtime_version: "1.0"
  timestamp: 1410960621
}
entity {
  id: "000001"
  trip_update {
    trip {
      trip_id: "050400_1..S02R"
      start_date: "20140917"
      route_id: "1"
    }
    stop_time_update {
      arrival {
        time: 1410960713
      }
      stop_id: "140S"
    }
  }
}
...
and continues in that vein for a total of 55833 lines (in the default string output format).
EDIT: the Python script used to convert the protobuf into its string representation is very simple:
import gtfs_realtime_pb2 as gtfs_rt

# Read the raw binary dump downloaded from the MTA and parse it as a FeedMessage
f = open('gtfs-rt.pb', 'rb')
raw_str = f.read()
msg = gtfs_rt.FeedMessage()
msg.ParseFromString(raw_str)
print(msg)
This requires gtfs-realtime.proto to have been compiled into gtfs_realtime_pb2.py using protoc (following the instructions in the Python protobuf documentation under "Compiling Your Protocol Buffers") and placed in the same directory as the Python script. Furthermore, the binary protobuf downloaded from the MTA needs to be named gtfs-rt.pb and located in the same directory as the Python script.
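Once parsed, the message can also be walked programmatically instead of just printed; a small sketch pulling out the fields shown above (field names follow the gtfs-realtime.proto spec):
from datetime import datetime

# msg is the FeedMessage parsed in the script above
for entity in msg.entity:
    if entity.HasField('trip_update'):
        trip = entity.trip_update.trip
        for update in entity.trip_update.stop_time_update:
            arrival = datetime.fromtimestamp(update.arrival.time)
            print(trip.route_id, trip.trip_id, update.stop_id, arrival)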

Is there a "correct" folder where to place test resources in dart?

I have a simple dart class I am trying to test.
To test it I need to open a txt file, feed the content to an instance of the class and check that the output is correct.
Where do I place this txt file? The txt file is useless outside of testing.
Also, related: how do I access its directory consistently? I tried placing it in the test folder, but the problem is that
System.currentDirectory
returns a different directory depending on whether I run the test on its own or via the script that calls all the other test Dart files at once.
I check whether System.currentDirectory is the directory containing the pubspec.yaml file; if not, I move the current directory upwards until I find the directory containing the pubspec.yaml file and then continue with the test code.
Looks like package https://pub.dev/packages/resource is also suitable for this now.
I have still not found a definitive answer to this question. I've been looking for something similar to the testdata directory in Go and the src/test/resources directory in Java.
I'm using Android Studio and have settled on using a test_data.dart file at the top of my test directory. In there I define my test data (mostly JSON) and then import it into my individual tests. This doesn't help if you need to deal with binary files, but it has been useful for my JSON data. I also inject the JSON language with //language=json so I can open the fragment in a separate window to format it.
//language=json
const consolidatedWeatherJson = '''{
  "consolidated_weather": [
    {
      "id": 4907479830888448,
      "weather_state_name": "Showers",
      "weather_state_abbr": "s",
      "wind_direction_compass": "SW",
      "created": "2020-10-26T00:20:01.840132Z",
      "applicable_date": "2020-10-26",
      "min_temp": 7.9399999999999995,
      "max_temp": 13.239999999999998,
      "the_temp": 12.825,
      "wind_speed": 7.876886316914553,
      "wind_direction": 246.17046093256732,
      "air_pressure": 997.0,
      "humidity": 73,
      "visibility": 11.037727173307882,
      "predictability": 73
    }
  ]
}
''';
Using the Alt + Enter key combination brings up the Edit JSON Fragment option. Selecting that opens the fragment in a new editor, and any changes made there (formatting, for example) are reflected back in the fragment.
Not perfect but it solves my issues.
