Making a function to turn quality strings into a list of Phred scores - biopython

I'm new to Python coding, and I am having trouble making a function that turns a quality string into a list of PHRED-scaled quality scores. Hoping to get some assistance.
Here is a FASTQ read:
@SEQ_ID
AAGCGTCTGATCGGCAGAGGATACACATGCCGCACGTCGAGTATCTCGGC
+
=3:AAF>FGD1FCGGGGGFBGGGGCGGG1FE>>>E<:>/<9:CDGFG@GG
This is the function definition:
def quality_to_list(quality_string):

Biopython has good examples and documentation on Phred scores.
from Bio import SeqIO

# Write the example read to a temporary FASTQ file (header lines start with '@')
with open('tmp.fastq', 'w') as f:
    f.write("""@SEQ_ID
AAGCGTCTGATCGGCAGAGGATACACATGCCGCACGTCGAGTATCTCGGC
+
=3:AAF>FGD1FCGGGGGFBGGGGCGGG1FE>>>E<:>/<9:CDGFG@GG""")

# Each parsed record carries its Phred scores as a list of integers
for record in SeqIO.parse("tmp.fastq", "fastq"):
    print("ID: {0}\nPhred scores: {1}".format(record.id, record.letter_annotations['phred_quality']))
Output:
ID: SEQ_ID
Phred scores: [28, 18, 25, 32, ..., 34, 35, 38, 37, 38, 31, 38, 38]
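If you would rather write quality_to_list by hand than go through Biopython, a minimal sketch follows. It assumes the read uses the Sanger / Illumina 1.8+ encoding, where each quality character is the Phred score plus an ASCII offset of 33. The offset parameter is my own addition (not part of the question) for files that use the older Phred+64 encoding.

```python
def quality_to_list(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    Assumption: Phred+33 encoding by default; pass offset=64 for
    old Illumina files that use Phred+64.
    """
    # ord() gives the ASCII code of each character; subtracting the
    # offset recovers the Phred-scaled quality score.
    return [ord(char) - offset for char in quality_string]


print(quality_to_list("=3:A"))  # [28, 18, 25, 32]
```

The first four scores match the start of the Biopython output above.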


Copy tensor elements of certain indices in PyTorch

The desired operation is similar in spirit to torch.Tensor.index_copy, but a little different.
It's best explained with an example.
Tensor A has original values that we will copy:
[10, 20, 30]
Tensor B has indices of A:
[0, 1, 0, 1, 2, 1]
Tensor C has same length as B, containing the indexed values of A:
[10, 20, 10, 20, 30, 20]
What's a good way to make C from A and B in PyTorch, without using loops?
Have you tried just indexing A by B?
In [1]: import torch
In [2]: a = torch.tensor([20,30,40])
In [3]: b = torch.tensor([0,1,2,1,1,2,0,0,1,2])
In [4]: a[b]
Out[4]: tensor([20, 30, 40, 30, 30, 40, 20, 20, 30, 40])

Creating a combined plot based on a ts object

R beginner here.
I have created a dataframe (called combined_ts2) based on a dataset I was given (dput output below).
Based on the dataframe I made this ts object (code at the bottom of the post).
I am supposed to make a plot based on the TS object. I'm thinking this can be achieved with a ts.plot function, but I can't figure out how to use this function.
The result I want is to get a bar graph for capacity and a line graph for fixtures. Does anyone know how to achieve this, and if I can actually achieve this from a ts object?
dput for dataframe
structure(list(week = c(26, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48), capacity = c(45000L,
39000L, 495500L, 855300L, 1318300L, 1301885L, 8211550L, 18515400L,
32282950L, 31568400L, 35410200L, 29867500L, 34809050L, 36420050L,
33960520L, 33987550L, 33465500L, 24599000L, 11597000L, 4553000L,
1375000L, 545000L), fixtures = c(2L, 4L, 12L, 13L, 18L, 29L,
161L, 338L, 393L, 405L, 439L, 386L, 442L, 406L, 413L, 421L, 326L,
180L, 84L, 23L, 6L, 3L)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -22L))
Time series code
weekly_timeseries <- ts(combined_ts2, start=c(2019,26), frequency = 52)

Fitting a Support Vector Classifier in scikit-learn with image data produces error

I'm trying to train an SVC classifier for image data. Yet, when I run this code:
classifier = svm.SVC(gamma=0.001)
classifier.fit(train_set, train_set_labels)
I get this error:
ValueError: setting an array element with a sequence.
I loaded each image into an array with Matplotlib: plt.imread(image).
The error suggests the data is not in an array, yet when I check the types of the data and the labels, they are both lists (I add the labels to a list manually):
print(type(train_set))
print(type(train_set_labels))
<class 'list'>
<class 'list'>
If I do a plt.imshow(items[0]) then the image shows correctly in the output.
I also called train_test_split from scikit-learn:
train_set, test_set = train_test_split(items, test_size=0.2, random_state=42)
Example input:
train_set[0]
array([[[212, 134, 34],
[221, 140, 48],
[240, 154, 71],
...,
[245, 182, 51],
[235, 175, 43],
[242, 182, 50]],
[[230, 152, 51],
[222, 139, 47],
[236, 147, 65],
...,
[246, 184, 49],
[238, 179, 43],
[245, 186, 50]],
[[229, 150, 47],
[205, 122, 28],
[220, 129, 46],
...,
[232, 171, 28],
[237, 179, 35],
[244, 188, 43]],
...,
[[115, 112, 103],
[112, 109, 102],
[ 80, 77, 72],
...,
[ 34, 25, 28],
[ 55, 46, 49],
[ 80, 71, 74]],
[[ 59, 56, 47],
[ 66, 63, 56],
[ 48, 45, 40],
...,
[ 32, 23, 26],
[ 56, 47, 50],
[ 82, 73, 76]],
[[ 29, 26, 17],
[ 41, 38, 31],
[ 32, 29, 24],
...,
[ 56, 47, 50],
[ 59, 50, 53],
[ 84, 75, 78]]], dtype=uint8)
Example label:
train_set_labels[0]
'Picasso'
I'm not sure what step I'm missing to get the data in the form that the classifier needs in order to train it. Can anyone see what may be needed?
The error message you are receiving,
ValueError: setting an array element with a sequence,
normally results when you try to put a list somewhere a single value is required. This suggests that your train_set is a list of multidimensional elements, even though you state that your inputs are lists. Could you post an example of your inputs and labels?
UPDATE
Yes, it's as I thought. The first element of your training data, train_set[0], is a long list of rows (I can't tell how long), each row is a list of pixels, and each pixel is a list of 3 values. You are therefore calling the classifier on a list of nested lists, when it requires a list of lists: m rows, one per training example, with each row a flat list of n features. What else is in your train_set array? Is the full data set in train_set[0]? If so, you would need to create a new array with one element for each of the subelements of train_set[0], and then I believe your code should run, although I am not too familiar with that classifier. Alternatively, you could try running the classifier with train_set[0].
UPDATE 2
I don't have experience with scikit-learn's SVC, so I can't tell you the best way to preprocess the data for it. One method, as I said previously, is to go through each element of train_set, which is composed of lists of lists, and place all the elements of its sublists into a single flat list. For example:
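new_train_set = []
for i in range(len(train_set)):
    flat = []  # all values of image i, in one flat list
    for j in range(len(train_set[i])):
        for k in range(len(train_set[i][j])):
            flat.extend(train_set[i][j][k])
    new_train_set.append(flat)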
I would then train with new_train_set and the training labels.
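One thing worth checking while flattening is that every image ends up with the same number of features, since images of different sizes are one common cause of the "setting an array element with a sequence" error. The sketch below is a hedged illustration, not the asker's code: the name flatten_images is hypothetical, and it assumes every image is a rows x columns x channels structure like the example input above.

```python
def flatten_images(images):
    """Flatten each image (rows x columns x channels) into one feature list.

    Hypothetical helper: raises ValueError if the images do not all
    produce the same number of features.
    """
    flat_rows = []
    for image in images:
        # Walk rows, then pixels, then channel values, in order.
        row = [int(value)
               for pixel_row in image
               for pixel in pixel_row
               for value in pixel]
        flat_rows.append(row)

    lengths = {len(row) for row in flat_rows}
    if len(lengths) > 1:
        raise ValueError("images differ in size: %s" % sorted(lengths))
    return flat_rows


# Two tiny 1x2 "images" with 3 channels each
tiny = [[[[1, 2, 3], [4, 5, 6]]],
        [[[7, 8, 9], [10, 11, 12]]]]
print(flatten_images(tiny))  # [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]
```

With NumPy arrays, image.reshape(-1) produces the same flat vector for a single image. If the check fails, the images would need to be resized to a common shape before training.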

Haskell: How to use attoparsec in order to read a nested list from a ByteString

I have a text file (~ 300 MB large) with a nested list, similar to this one:
[[4, 9, 11, 28, 30, 45, 55, 58, 61, 62, 63, 69, 74, 76, 77, 82, 87, 92, 93, 94, 95], [4, 9, 11, 28, 30, 45, 55, 58, 61, 62, 63, 69, 74, 76, 77, 82, 87, 92, 93, 94],[4, 9, 11, 28, 30, 45, 55, 58, 61, 62, 63, 69, 74, 76, 77, 82, 85, 87, 92, 93, 94, 95]]
Here is my program to read the file into a Haskell Integer list:
import qualified Data.ByteString as ByteStr

main :: IO ()
-- HOW to do the same thing but using ByteStr.readFile for file access?
main = do fContents <- readFile filePath
          let numList = readNums fContents
          putStrLn (show numList)
This works for small text files, but I want to use ByteString to read the file quickly. I found out that there is no read function for ByteString; instead, you should write your own parser in attoparsec, since it supports parsing ByteStrings.
How can I use attoparsec to parse the nested list?
The data seems to be in JSON format, so you can use the decode function from Data.Aeson, which works on a lazy ByteString:
import qualified Data.ByteString.Lazy as BL
import Data.Aeson
import Data.Maybe
main = do fContents <- BL.readFile filePath
          let numList = decode fContents :: Maybe [[Int]]
          putStrLn (show $ fromJust numList)

How to convert Google spreadsheet's worksheet string id to integer index (GID)?

To export a single worksheet of a Google spreadsheet to CSV, the integer worksheet index (GID) has to be passed:
https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=%s&gid=%d&exportFormat=csv
But where can I find that information? With gdata.spreadsheets.client, I could only find string IDs for the worksheets, such as "oc6", "ocv" and "odf".
client = gdata.spreadsheets.client.SpreadsheetsClient()
feed = client.GetWorksheets(spreadsheet, auth_token=auth_token)
It returns the Atom XML below (part of it):
<entry gd:etag=""URJFCB1NQSt7ImBoXhU."">
<id>https://spreadsheets.google.com/feeds/worksheets/0AvhN_YU3r5e9dGpTWGx3UVU3MTczaXJuNEFKQjMwN2c/ocw</id>
<updated>2012-06-21T08:19:46.587Z</updated>
<app:edited xmlns:app="http://www.w3.org/2007/app">2012-06-21T08:19:46.587Z</app:edited>
<category scheme="http://schemas.google.com/spreadsheets/2006" term="http://schemas.google.com/spreadsheets/2006#worksheet"/>
<title>AchievementType</title>
<content type="application/atom+xml;type=feed" src="https://spreadsheets.google.com/feeds/list/0AvhN_YU3r5e9dGpTWGx3UVU3MTczaXJuNEFKQjMwN2c/ocw/private/full"/>
<link rel="http://schemas.google.com/spreadsheets/2006#cellsfeed" type="application/atom+xml" href="https://spreadsheets.google.com/feeds/cells/0AvhN_YU3r5e9dGpTWGx3UVU3MTczaXJuNEFKQjMwN2c/ocw/private/full"/>
<link rel="http://schemas.google.com/visualization/2008#visualizationApi" type="application/atom+xml" href="https://spreadsheets.google.com/tq?key=0AvhN_YU3r5e9dGpTWGx3UVU3MTczaXJuNEFKQjMwN2c&sheet=ocw"/>
<link rel="self" type="application/atom+xml" href="https://spreadsheets.google.com/feeds/worksheets/0AvhN_YU3r5e9dGpTWGx3UVU3MTczaXJuNEFKQjMwN2c/private/full/ocw"/>
<link rel="edit" type="application/atom+xml" href="https://spreadsheets.google.com/feeds/worksheets/0AvhN_YU3r5e9dGpTWGx3UVU3MTczaXJuNEFKQjMwN2c/private/full/ocw"/>
<gs:rowCount>280</gs:rowCount>
<gs:colCount>28</gs:colCount>
</entry>
I also tried the sheet parameter, but it failed with an "Invalid Sheet" error.
https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=%s&sheet=XXX&exportFormat=csv
I guess there should be some function for this, but I could not find it. How can I convert the string IDs to integer IDs? Or can I export a worksheet using its string ID?
EDIT: I just made a conversion table in Python. DIRTY but working :-(
GID_TABLE = {
'od6': 0,
'od7': 1,
'od4': 2,
'od5': 3,
'oda': 4,
'odb': 5,
'od8': 6,
'od9': 7,
'ocy': 8,
'ocz': 9,
'ocw': 10,
'ocx': 11,
'od2': 12,
'od3': 13,
'od0': 14,
'od1': 15,
'ocq': 16,
'ocr': 17,
'oco': 18,
'ocp': 19,
'ocu': 20,
'ocv': 21,
'ocs': 22,
'oct': 23,
'oci': 24,
'ocj': 25,
'ocg': 26,
'och': 27,
'ocm': 28,
'ocn': 29,
'ock': 30,
'ocl': 31,
'oe2': 32,
'oe3': 33,
'oe0': 34,
'oe1': 35,
'oe6': 36,
'oe7': 37,
'oe4': 38,
'oe5': 39,
'odu': 40,
'odv': 41,
'ods': 42,
'odt': 43,
'ody': 44,
'odz': 45,
'odw': 46,
'odx': 47,
'odm': 48,
'odn': 49,
'odk': 50,
'odl': 51,
'odq': 52,
'odr': 53,
'odo': 54,
'odp': 55,
'ode': 56,
'odf': 57,
'odc': 58,
'odd': 59,
'odi': 60,
'odj': 61,
'odg': 62,
'odh': 63,
'obe': 64,
'obf': 65,
'obc': 66,
'obd': 67,
'obi': 68,
'obj': 69,
'obg': 70,
'obh': 71,
'ob6': 72,
'ob7': 73,
'ob4': 74,
'ob5': 75,
'oba': 76,
'obb': 77,
'ob8': 78,
'ob9': 79,
'oay': 80,
'oaz': 81,
'oaw': 82,
'oax': 83,
'ob2': 84,
'ob3': 85,
'ob0': 86,
'ob1': 87,
'oaq': 88,
'oar': 89,
'oao': 90,
'oap': 91,
'oau': 92,
'oav': 93,
'oas': 94,
'oat': 95,
'oca': 96,
'ocb': 97,
'oc8': 98,
'oc9': 99
}
I found your question looking for a solution to the same problem, and was surprised that those worksheet IDs actually correspond 1:1 to gids - I originally assumed they were assigned independently, instead of being an exercise in obfuscation.
I was able to find a slightly cleaner solution by reverse-engineering the formula they use to generate worksheet IDs from your table:
worksheetID = (gid xor 31578) encoded in base 36
So, some Python to go from a worksheet ID to gid:
def to_gid(worksheet_id):
    return int(worksheet_id, 36) ^ 31578
This is still dirty, but will work for GIDs higher than 99 without requiring giant tables. At least as long as they don't change the generation logic (which they probably won't, as it would break existing IDs that people already use).
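For completeness, here is a sketch of the opposite direction, from gid back to a legacy worksheet ID, using the same formula. Python has no built-in base-36 encoder, so the to_base36 helper and the to_worksheet_id name are my own additions. The XOR constant 31578 comes from the formula above and only covers the legacy three-character IDs.

```python
BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(number):
    """Encode a non-negative integer in lowercase base 36."""
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(BASE36_DIGITS[remainder])
    return "".join(reversed(digits))


def to_gid(worksheet_id):
    # Same conversion as in the answer above.
    return int(worksheet_id, 36) ^ 31578


def to_worksheet_id(gid):
    # Inverse: XOR with the same constant, then encode in base 36.
    return to_base36(gid ^ 31578)


print(to_worksheet_id(0))   # od6
print(to_worksheet_id(99))  # oc9
```

Both results agree with the first and last entries of GID_TABLE in the question.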
This code works with the new Google Sheets.
// Conversion of Worksheet Ids to GIDs and vice versa
// od4 > 2
function wid_to_gid(wid) {
    var widval = wid.length > 3 ? wid.substring(1) : wid;
    var xorval = wid.length > 3 ? 474 : 31578;
    return parseInt(String(widval), 36) ^ xorval;
}
// 2 > od4
function gid_to_wid(gid) {
    var xorval = gid > 31578 ? 474 : 31578;
    var letter = gid > 31578 ? 'o' : '';
    return letter + parseInt((gid ^ xorval)).toString(36);
}
I cannot add a comment to Wasilewski's post because apparently I lack reputation so here are the two conversion functions in Javascript based on Wasilewski's answer:
// Conversion of Worksheet Ids to GIDs and vice versa
// od4 > 2
function wid_to_gid(wid) {
    return parseInt(String(wid), 36) ^ 31578;
}
// 2 > od4
function gid_to_wid(gid) {
    // (gid xor 31578) encoded in base 36
    return parseInt((gid ^ 31578)).toString(36);
}
This is a Java adaptation of Buho's code which works with both the new Google Sheets and with the legacy Google Spreadsheets.
// "od4" to 2 (legacy style)
// "ogtw0h0" to 1017661118 (new style)
public static int widToGid(String worksheetId) {
    boolean idIsNewStyle = worksheetId.length() > 3;
    // if the id is in the new style, first strip the first character before converting
    worksheetId = idIsNewStyle ? worksheetId.substring(1) : worksheetId;
    // determine the integer to use for bitwise XOR
    int xorValue = idIsNewStyle ? 474 : 31578;
    // convert to gid
    return Integer.parseInt(worksheetId, 36) ^ xorValue;
}

// Convert 2 to "od4" (legacy style)
// Convert 1017661118 to "ogtw0h0" (new style)
public static String gidToWid(int gid) {
    boolean idIsNewStyle = gid > 31578;
    // determine the integer to use for bitwise XOR
    int xorValue = idIsNewStyle ? 474 : 31578;
    // convert to worksheet id, prepending 'o' if it is the new style
    String encoded = Integer.toString(gid ^ xorValue, 36);
    return idIsNewStyle ? "o" + encoded : encoded;
}
This is a Clojure adaptation of Buho's and Julie's code which should work with both the new Google Sheets and with the legacy Google Spreadsheets.
(defn wid->gid [wid]
  (let [new-wid? (> (.length wid) 3)
        wid (if new-wid? (.substring wid 1) wid)
        xor-val (if new-wid? 474 31578)]
    (bit-xor (Integer/parseInt wid 36) xor-val)))

(defn gid->wid [gid]
  (let [new-gid? (> gid 31578)
        xor-val (if new-gid? 474 31578)
        letter (if new-gid? "o" "")]
    (str letter (Integer/toString (bit-xor gid xor-val) 36))))
If you're using Python with gspread, here's what you do:
wid = worksheet.id
widval = wid[1:] if len(wid) > 3 else wid
xorval = 474 if len(wid) > 3 else 31578
gid = int(str(widval), 36) ^ xorval
I'll probably open a PR for this.
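The same logic can be wrapped into two standalone Python functions that cover both ID styles. This is only a sketch based on the answers above. The XOR constants (474 and 31578), the "longer than 3 characters" test and the gid > 31578 test are taken from those answers, not from any Google documentation, and the function names are my own.

```python
BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"
LEGACY_XOR = 31578  # legacy three-character worksheet IDs
NEW_XOR = 474       # new-style IDs (prefixed with 'o')


def _to_base36(number):
    """Encode a non-negative integer in lowercase base 36."""
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(BASE36_DIGITS[remainder])
    return "".join(reversed(digits))


def wid_to_gid(wid):
    """Worksheet ID string -> integer gid (both styles)."""
    is_new_style = len(wid) > 3
    # New-style IDs carry a leading 'o' that is not part of the number.
    digits = wid[1:] if is_new_style else wid
    return int(digits, 36) ^ (NEW_XOR if is_new_style else LEGACY_XOR)


def gid_to_wid(gid):
    """Integer gid -> worksheet ID string (both styles)."""
    is_new_style = gid > LEGACY_XOR
    encoded = _to_base36(gid ^ (NEW_XOR if is_new_style else LEGACY_XOR))
    return "o" + encoded if is_new_style else encoded


print(wid_to_gid("od4"))       # 2
print(wid_to_gid("ogtw0h0"))   # 1017661118
print(gid_to_wid(1017661118))  # ogtw0h0
```

The example values are the ones quoted in the Java answer's comments.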
