Print 50 sequences from each line using Clustal - parsing

I have a multiple sequence alignment (Clustal) file, and I want to read it and print the sequences in a clearer, more orderly layout.
I am doing this from Biopython using an AlignIO object:
from Bio import AlignIO

alignment = AlignIO.read("opuntia.aln", "clustal")
print "Number of rows: %i" % len(alignment)
for record in alignment:
    print "%s - %s" % (record.id, record.seq)
My output is messy and requires a lot of scrolling. What I want is to print only 50 sequence characters per line and continue until the end of the alignment.
I wish to have output like this, from http://www.ebi.ac.uk/Tools/clustalw2/.

I don't have biopython on this computer, so this isn't tested, but it should work:
chunk_size = 50
for i in range(0, alignment.get_alignment_length(), chunk_size):
    print ""
    for record in alignment:
        print "%s\t%s %i" % (record.name, record.seq[i:i + chunk_size], i + chunk_size)
Does the same trick as Eli's one - using range to set up an index to slice from then iterating over the record in the alignment for each slice.
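The chunking idea can also be tried without Biopython installed. The following is a minimal Python 3 sketch, not the answerer's code: format_alignment_blocks is a hypothetical helper that takes plain (name, sequence) pairs, pads the names to a common width, and caps the trailing column count at the alignment length. Note that it counts alignment columns, whereas Clustal itself counts non-gap residues per sequence.

```python
def format_alignment_blocks(records, chunk_size=50):
    """Return the alignment as text, chunk_size columns per line.

    records is a list of (name, sequence) pairs; all sequences are
    assumed to have the same length, as in an alignment.
    """
    if not records:
        return ""
    length = len(records[0][1])
    name_width = max(len(name) for name, _ in records)
    blocks = []
    for start in range(0, length, chunk_size):
        # Cap the end so the last block reports the true column count.
        end = min(start + chunk_size, length)
        lines = [
            f"{name:<{name_width}}  {seq[start:end]} {end}"
            for name, seq in records
        ]
        blocks.append("\n".join(lines))
    # One blank line between blocks, as in Clustal output.
    return "\n\n".join(blocks)


# Made-up example data, 120 columns long.
records = [("seq1", "ACGT" * 30), ("seq2_long", "TGCA" * 30)]
print(format_alignment_blocks(records))
```

With a Biopython alignment, the pairs could presumably be built as [(record.id, str(record.seq)) for record in alignment].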

Do you require something more complex than simply breaking record.seq into chunks of 50 characters, or am I missing something?
You can use Python sequence slicing to achieve that very easily. seq[N:N+50] accesses the 50 sequence elements starting with N:
In [24]: seq = ''.join(str(random.randint(1, 4)) for i in range(200))
In [25]: seq
Out[25]: '13313211211434211213343311221443122234343421132111223234141322124442112343143112411321431412322123214232414331224144142222323421121312441313314342434231131212124312344112144434314122312143242221323123'
In [26]: for n in range(0, len(seq), 50):
   ....:     print seq[n:n+50]
....:
....:
13313211211434211213343311221443122234343421132111
22323414132212444211234314311241132143141232212321
42324143312241441422223234211213124413133143424342
31131212124312344112144434314122312143242221323123


Get the range of string in lua in row and column

I'm trying to calculate the range of a given text in terms of row and column.
For following string,
'hello\nworld'
The range should be
{
row_start = 0,
col_start = 0,
row_end = 1,
col_end = 4
}
Here, row_start and col_start are NOT important for the question. world is on the second line, hence row_end is 1. world has 5 characters, hence col_end is 4.
So, I need a function to calculate the number of line breaks and length of the string at the last line to calculate the range.
I couldn't find any other way than calculating the number of line breaks to get row_end. Then reverse the text and find the index of the first newline character to get the col_end. Any other efficient way to do this in Lua?
Given str = "hello\nworld":
"I couldn't find any other way than calculating the number of line breaks to get row_end"
There is no more efficient way: You have to count the line breaks. Assuming UNIX LFs as in your example, you can simply use gmatch for this (which is presumably more efficient than abusing gsub to do the counting for you):
local row_end = 0
for _ in str:gmatch"\n" do row_end = row_end + 1 end
"Then reverse the text and find the index of the first newline character to get the col_end. Any other efficient way to do this in Lua?"
Yes, this is indeed needlessly inefficient. The shortest way to do this in Lua would be using pattern matching:
local col_end = #str - str:find"[^\n]*$"
Explanation: Find the starting index of the longest "run" of non-newline characters. For str, this would be the index of w. Then subtract this index from the length of the string to find the 0-based index of the last character.
A probably more efficient solution would just remember the index after the last newline (and thus have no issue with possibly poor pattern matching performance):
local after_last_newline_idx = 1
for idx in str:gmatch"\n()" do -- () captures the position after the newline
  after_last_newline_idx = idx
end
local col_end = #str - after_last_newline_idx
This could be merged with the first loop to only loop once:
local row_end = 0
local after_last_newline_idx = 1
for idx in str:gmatch"\n()" do -- () captures the position after the newline
  row_end = row_end + 1
  after_last_newline_idx = idx
end
local col_end = #str - after_last_newline_idx
... taking linear time, which is required. However this avoids creating a garbage string by reversing str. It only loops over the string once to find newlines. If gmatch is too slow for your purposes, you can easily use string:byte or string:sub and a numeric for loop to do the looping over newlines yourself.
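As a cross-check of the arithmetic, here is the same calculation sketched in Python rather than Lua. text_range_end is a hypothetical name, and unlike the merged Lua loop this version scans the string twice (once to count newlines, once to find the last one). It assumes LF line endings, as in the question.

```python
def text_range_end(text):
    """Return (row_end, col_end), both 0-based, for the last character of text."""
    row_end = text.count("\n")
    # 0-based index of the first character after the last newline;
    # rfind returns -1 when there is no newline, which gives 0 here.
    after_last_newline = text.rfind("\n") + 1
    col_end = len(text) - after_last_newline - 1
    return row_end, col_end


print(text_range_end("hello\nworld"))  # (1, 4)
```

If the text ends with a newline, the last line is empty and col_end comes out as -1, so a caller may want to handle that case separately.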

Calculating ISIN checksum

Hi, I know there have been many questions about this here, but I wasn't able to find a detailed enough answer. Wikipedia has two examples of ISINs and how their checksums are calculated.
The part of calculation that I'm struggling with is
Multiply the group containing the rightmost character
The way I understand this statement is:
Iterate through each character from right to left
once you stumble upon a letter rather than a digit, record its position
if the position is an even number double all numeric values in even position
if the position is an odd number double all numeric values in odd position
My understanding has to be wrong because there are at least two problems:
Every ISIN starts with two character country code so position of rightmost character is always the first character
If you omit the first two characters then there is no explanation as to what to do with ISINs that are made up of all numbers (except for first two characters)
Note
isin.org contains even less information on verifying ISINs, they even use the same example as Wikipedia.
I agree with you; the definition on Wikipedia is not the clearest I have seen.
There's a piece of text just before the two examples that explains when one or the other algorithm should be used:
Since the NSIN element can be any alpha numeric sequence (9 characters), an odd number of letters will result in an even number of digits and an even number of letters will result in an odd number of digits. For an odd number of digits, the approach in the first example is used. For an even number of digits, the approach in the second example is used
The NSIN is identical to the ISIN, excluding the first two letters and the last digit; so if the ISIN is US0378331005 the NSIN is 037833100.
So, if you want to verify the checksum digit of US0378331005, you'll have to use the "first algorithm" because there are 9 digits in the NSIN. Conversely, if you want to check AU0000XVGZA3 you're going to use the "second algorithm" because the NSIN contains 4 digits.
As to the "first" and "second" algorithms, they're identical, with the only exception that in the former you'll multiply by 2 the group of odd digits, whereas in the latter you'll multiply by 2 the group of even digits.
Now, the good news is, you can get away without this overcomplicated algorithm.
You can, instead:
1. Take the ISIN except the last digit (which you'll want to verify).
2. Convert all letters to numbers (A=10, B=11, ..., Z=35) to obtain a list of digits.
3. Reverse the list of digits.
4. Double every digit in an odd position (counting from 1); if the result is >= 10, sum its digits.
5. Take the digits in even positions as they are.
6. Sum all the resulting digits, negate the sum, and take it modulo 10 with a non-negative result, i.e. (10 - sum mod 10) mod 10. That is the check digit.
The only tricky step is #4. Let's clarify it with a mini-example.
Suppose the digits in an odd position are 4, 0, 7.
You'll double them and get: 8, 0, 14.
8 is not >= 10, so we take it as it is. Ditto for 0. 14 is >= 10, so we sum its digits again: 1+4=5.
The result of step #4 in this mini-example is, therefore: 8, 0, 5.
A minimal, working implementation in Python could look like this:
import string

isin = 'US4581401001'

def digit_sum(n):
    # Sum of the decimal digits of n, for 0 <= n <= 99.
    return (n // 10) + (n % 10)

# Map '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
alphabet = {letter: value for (value, letter) in
            enumerate(''.join(str(n) for n in range(10)) + string.ascii_uppercase)}
isin_to_digits = ''.join(str(d) for d in (alphabet[v] for v in isin[:-1]))

isin_sum = 0
for (i, c) in enumerate(reversed(isin_to_digits), 1):
    if i % 2 == 1:
        isin_sum += digit_sum(2*int(c))
    else:
        isin_sum += int(c)

# (-isin_sum) % 10 is already non-negative in Python; abs() is only a safeguard.
checksum_digit = abs(- isin_sum % 10)
assert int(isin[-1]) == checksum_digit
Or, more crammed, just for functional fun:
checksum_digit = abs( - sum(digit_sum(2*int(c)) if i % 2 == 1 else int(c)
                            for (i, c) in enumerate(
                                reversed(''.join(str(d) for d in (alphabet[v] for v in isin[:-1]))), 1)) % 10)
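For reuse, the same steps can be wrapped in functions. This is a sketch in Python 3, not part of the original answer: isin_check_digit and is_valid_isin are hypothetical names, and the letter conversion relies on int(ch, 36), which maps 'A' to 10 through 'Z' to 35. The ISINs below are the two Wikipedia examples discussed above plus the one used in the answer's code.

```python
def isin_check_digit(isin_body):
    """Compute the check digit for the first 11 characters of an ISIN."""
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in isin_body.upper())
    total = 0
    for position, ch in enumerate(reversed(digits), start=1):
        value = int(ch)
        if position % 2 == 1:
            # Odd positions from the right are doubled, then digit-summed.
            value *= 2
            value = value // 10 + value % 10
        total += value
    # Python's % returns a non-negative result for a positive modulus.
    return -total % 10


def is_valid_isin(isin):
    """Check length, character set, and the trailing check digit."""
    return (
        len(isin) == 12
        and isin.isascii()
        and isin.isalnum()
        and isin[-1].isdigit()
        and isin_check_digit(isin[:-1]) == int(isin[-1])
    )


print(isin_check_digit("US037833100"))  # 5
print(isin_check_digit("AU0000XVGZA"))  # 3
print(is_valid_isin("US4581401001"))    # True
```

This only verifies the check digit; it does not check that the first two characters are a real country code.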

How to refactor string containing variable names into booleans?

I have an SPSS variable containing lines like:
|2|3|4|5|6|7|8|10|11|12|13|14|15|16|18|20|21|22|23|24|25|26|27|28|29|
Every line starts with pipe, and ends with one. I need to refactor it into boolean variables as the following:
var var1 var2 var3 var4 var5
|2|4|5| 0 1 0 1 1
I have tried to do it with a loop like:
loop # = 1 to 72.
compute var# = SUBSTR(var,2#,1).
end loop.
exe.
My code won't work with 2 or more digits long numbers and also it won't place the values into their respective variables, so I've tried nest the char.substr(var,char.rindex(var,'|') + 1) into another loop with no luck because it still won't allow me to recognize the variable number.
How can I do it?
This looks like a nice job for the DO REPEAT command. However the type conversion is somewhat tricky:
DO REPEAT var#i=var1 TO var72
 /i=1 TO 72.
COMPUTE var#i = (CHAR.INDEX(var,CONCAT("|",LTRIM(STRING(i,F2.0)),"|"))>0).
END REPEAT.
Explanation: Let's go from the inside to the outside:
STRING(value,F2.0) converts the numeric value into a string of two characters (with a leading white space where the number consists of just one digit), e.g. 2 -> " 2".
LTRIM() removes the leading whitespaces, e.g. " 2" -> "2".
CONCAT() concatenates strings. In the above code it adds the "|" before and after the number, e.g. "2" -> "|2|"
CHAR.INDEX(stringvar,searchstring) returns the position at which the searchstring was found. It returns 0 if the searchstring wasn't found.
CHAR.INDEX(stringvar,searchstring)>0 returns a boolean value indicating if the searchstring was found or not.
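The logic of the explanation above (search for the number with a pipe on both sides) can be illustrated outside SPSS. This is a small Python 3 sketch, with pipe_flags as a hypothetical helper name; it is not SPSS syntax and only mirrors the substring test.

```python
def pipe_flags(value, count):
    """Return a list of 0/1 flags; flag i is 1 when "|i|" occurs in value."""
    return [1 if f"|{i}|" in value else 0 for i in range(1, count + 1)]


print(pipe_flags("|2|4|5|", 5))  # [0, 1, 0, 1, 1]
```

Including the delimiter on both sides is what keeps "|12|" from being counted as a 1 or a 2.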
It's easier to do the manipulations in Python than native SPSS syntax.
You can use SPSSINC TRANS extension for this purpose.
/* Example data*/.
data list free / TextStr (a99).
begin data.
"|2|3|4|5|6|7|8|10|11|12|13|14|15|16|18|20|21|22|23|24|25|26|27|28|29|"
end data.
/* defining function to achieve task */.
begin program.
def runTask(x):
    numbers=map(int,filter(None,[i.strip() for i in x.lstrip('|').split("|")]))
    answer=[1 if i in numbers else 0 for i in xrange(1,max(numbers)+1)]
    return answer
end program.
/* Run job*/.
spssinc trans result = V1 to V30 type=0 /formula "runTask(TextStr)".
exe.

Need advice on how to print a matrix in lisp

I have a matrix defined so if I do this
(format t "~a" (get-real-2d a 0 0))
it prints out the element in the first row first column
and if I do this
(format t "~a" (get-real-2d a 0 1))
it prints out the element in first row second column
and if I do this
(format t "~a" (get-real-2d a 1 0))
it prints out the element in second row first column.
The matrix a looks like this
a =
((0 1 2)
(3 4 5)
(6 7 8))
and I was hoping you can show me exactly how to write a dotimes loop or other loop
that would in as few lines as possible would print out the matrix using the get-real-2d function so the output looks like this:
0 1 2
3 4 5
6 7 8
I'm just hoping you can show me a slick loop that would be real small that I can use to print matrices that I can use in my lisp library something real professional looking, like one that would use only variables. Something like:
(format t "~a" (get-real-2d a i j))
instead of a bunch of:
(format t "~a" (get-real-2d a 0 0))
(format t "~a" (get-real-2d a 0 1))
(format t "~a" (get-real-2d a 0 2))
;;;;LATEST EDIT;;;
to make this simple I call
(defparameter a (create-mat 3 3 +32fc1+))
to create a 3x3 matrix - create-mat is a wrapper for opencv's cvCreateMat
the output from that command at repl is
(defparameter a (create-mat 3 3 +32fc1+))
A
CL-OPENCV> a
#.(SB-SYS:INT-SAP #X7FFFD8000E00)
i.e. the variable a is a pointer to the 3x3 matrix
then I run
(defparameter data (cffi:foreign-alloc :float :initial-contents
'(0.0f0 1.0f0 2.0f0 3.0f0 4.0f0 5.0f0 6.0f0 7.0f0 8.0f0)))
to create the data for the matrix - which I next will allocate to the matrix
the output from that command at repl is
CL-OPENCV> (defparameter data (cffi:foreign-alloc :float :initial-contents
'(0.0f0 1.0f0 2.0f0 3.0f0 4.0f0 5.0f0 6.0f0 7.0f0 8.0f0)))
DATA
CL-OPENCV> data
#.(SB-SYS:INT-SAP #X7FFFD8000E40)
i.e. the variable data is a pointer to the data I'll add to the matrix
then I call..
(set-data a data 12) to add the data to the matrix - set-data is a wrapper for opencv's cvSetData
so now when I run - (get-real-2d is a wrapper for opencv's cvGetReal2d)
(get-real-2d a 0 0) it gets the element of matrix a at row 0 col 0 which is 0.0d0
the output from that command at repl is
CL-OPENCV> (get-real-2d a 0 0)
0.0d0
and now when I run
(get-real-2d a 0 1) it gets the element of matrix a at row 0 col 1 which is 1.0d0
the output from that command at repl is
CL-OPENCV> (get-real-2d a 0 1)
1.0d0
and when I run this loop
(dotimes (i 3)
  (dotimes (j 3)
    (format t "~a~%" (get-real-2d a i j))))
the output from that command at repl is
CL-OPENCV> (dotimes (i 3)
             (dotimes (j 3)
               (format t "~a~%" (get-real-2d a i j))))
0.0d0
1.0d0
2.0d0
3.0d0
4.0d0
5.0d0
6.0d0
7.0d0
8.0d0
NIL
but when I try your method, @Svante
(dotimes (i 3)
  (dotimes (j 3)
    (format t "~{~{~a~^ ~}~%~}" (get-real-2d a i j))))
I get error:
The value 0.0d0 is not of type LIST.
[Condition of type TYPE-ERROR]
because the output of 1 run of get-real-2d is just a 1 number float i/e
CL-OPENCV> (get-real-2d a 0 0)
0.0d0
with that info can you help me print the matrix so it looks like this
0.0d0 1.0d0 2.0d0
3.0d0 4.0d0 5.0d0
6.0d0 7.0d0 8.0d0
You can do that directly in the format directive. The format instructions ~{ and ~} descend into a list structure.
(format t "~{~{~a~^ ~}~%~}" matrix)
The outer pair of ~{ ~} loops over the first level of the matrix, so that the directives inside get to see one row at a time. The inner pair of ~{ ~} loops over each such row, so that the directives inside get to see one element at a time. ~A prints that element. The part between ~^ and ~} gets printed only between executions of the loop body, not at the end. ~% emits a #\Newline.
EDIT as requested
Note that the ~{ ~} replace the looping, and that I named the variable matrix, not element. You need to put the entire matrix there, and it is supposed to be in the form of a nested list. I deduced this from your statement that a is ((0 1 2) (3 4 5) (6 7 8)). So, (format t "~{~{~a~^ ~}~%~}" a).
If the matrix happens not to be in the form of a nested list but rather some kind of array, you really need to loop over the indices. Nested dotimes forms should be sufficient at first:
(fresh-line)
(dotimes (i (array-dimension array 0))
  (dotimes (j (array-dimension array 1))
    (format t "~a " (aref array i j)))
  (terpri))
I don't know how your matrices map to arrays, so you will have to replace array-dimension and aref with your versions.
Your question can be understood in two ways, and that is why it has two solutions:
Define method for printing object of type matrix (in this case it may use the knowledge about the internal structure of matrix):
(defmethod print-object ((matrix matrix) stream)
  (format stream "~{~{~a~^ ~}~%~}" matrix))
Using format as is shown in the answers.
Define client function that can use the only method of your object - get-real-2d:
(defun print-matrix (matrix dimension-x dimension-y)
  (dotimes (x dimension-x)
    (dotimes (y dimension-y)
      (princ (get-real-2d matrix x y))
      (princ #\Space))
    (princ #\Newline)))
Just using dotimes.
Here are just the two dotimes loops that you were asking for. The only thing that you need to pay attention for is when to print spaces and when to print newlines.
(dotimes (i 3)
  (dotimes (j 3)
    (princ (get-real-2d a i j))
    (if (< j 2)
        (princ #\Space)
        (terpri))))
Alternatively, you might want to use the format directives for floating point printing to have the numbers always aligned in nice columns. You can choose between ~F that will never print an exponent, ~E that will always print one, and ~G that behaves according to the magnitude. Look for details here in the HyperSpec: http://www.lispworks.com/documentation/HyperSpec/Body/22_cc.htm.
Here's an example that uses ~F with a maximum field width of 5 and 1 fractional digit:
(dotimes (i 3)
  (dotimes (j 3)
    (format t "~5,1F" (get-real-2d a i j)))
  (terpri))
This isn't hard, so I'd rather leave it to you to figure out, but here are some tips to make a "slick loop" Lisp-style. I would suggest one or more instances of mapc (or mapcar), rather than dotimes. This may feel odd if you're not used to functional programming, but once you're used to it, it's easier to read than dotimes, and you don't have to keep track of the indexes, so it can avoid errors. You really should learn to use mapcar/mapc if you aren't already familiar with them. They are cool. Or if you want to be really cool :-) you could use recursion to iterate over the matrix, but I think that for this purpose iterating using mapc will be easier to read. (But you should learn the recursive way for other jobs. If you find recursion confusing--I have no reason to think you do, but some people have trouble with it--my favorite tutorial is The Little Schemer.)
You may also want to use other format directives that allow you to pad numbers with spaces if they don't have enough digits. The ~% directive may be useful as well. Peter Seibel has a very nice introduction to format.

Lua base converter

I need a base converter function for Lua. I need to convert from base 10 to base 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ... 36. How can I do this?
In the string to number direction, the function tonumber() takes an optional second argument that specifies the base to use, which may range from 2 to 36 with the obvious meaning for digits in bases greater than 10.
In the number to string direction, this can be done slightly more efficiently than Nikolaus's answer by something like this:
local floor,insert = math.floor, table.insert
function basen(n,b)
    n = floor(n)
    if not b or b == 10 then return tostring(n) end
    local digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    local t = {}
    local sign = ""
    if n < 0 then
        sign = "-"
        n = -n
    end
    repeat
        local d = (n % b) + 1
        n = floor(n / b)
        insert(t, 1, digits:sub(d,d))
    until n == 0
    return sign .. table.concat(t,"")
end
This creates fewer garbage strings to collect by using table.concat() instead of repeated calls to the string concatenation operator (..). Although it makes little practical difference for strings this small, this idiom is worth learning, because building a buffer in a loop with the concatenation operator tends toward O(n^2) performance, while table.concat() has been designed to do substantially better.
There is an unanswered question as to whether it is more efficient to push the digits on a stack in the table t with calls to table.insert(t,1,digit), or to append them to the end with t[#t+1]=digit, followed by a call to string.reverse() to put the digits in the right order. I'll leave the benchmarking to the student. Note that although the code I pasted here does run and appears to get correct answers, there may be other opportunities to tune it further.
For example, the common case of base 10 is culled off and handled with the built in tostring() function. But similar culls can be done for bases 8 and 16 which have conversion specifiers for string.format() ("%o" and "%x", respectively).
Also, neither Nikolaus's solution nor mine handle non-integers particularly well. I emphasize that here by forcing the value n to an integer with math.floor() at the beginning.
Correctly converting a general floating point value to any base (even base 10) is fraught with subtleties, which I leave as an exercise to the reader.
You can use a loop to convert an integer into a string in the required base. For bases below 10, use the following code; for a larger base, add a line that maps the result of x % base to a character (using an array, for example).
x = 1234
r = ""
base = 8
while x > 0 do
    r = "" .. (x % base ) .. r
    x = math.floor(x / base)
end
print( r );
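For comparison, the append-then-reverse variant mentioned in the first answer can be sketched as follows. This is Python rather than Lua, to_base is a hypothetical name, and it handles integers only, like both Lua versions. The result can be checked against Python's built-in int(text, base), which plays the same role as Lua's tonumber() with a base argument.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base(n, base):
    """Convert the integer n to a string in the given base (2 to 36)."""
    if not 2 <= base <= 36:
        raise ValueError("base must be between 2 and 36")
    sign = "-" if n < 0 else ""
    n = abs(n)
    out = []
    while True:
        # Peel off the least significant digit first.
        n, d = divmod(n, base)
        out.append(DIGITS[d])
        if n == 0:
            break
    # Digits were collected in reverse order, so flip them once at the end.
    return sign + "".join(reversed(out))


print(to_base(1234, 8))   # 2322
print(to_base(255, 16))   # FF
print(to_base(-10, 2))    # -1010
```

Because the loop body runs at least once, an input of 0 yields "0" rather than an empty string.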
