Why does the 'join' method for Seq object in Biopython not work on the last element of a list? - biopython

The code below is from the Biopython tutorial. I intend to add an N10 spacer after every contig. Why is the trailing N10 not present after the third contig "TTGCA"?
from Bio.Seq import Seq
contigs = [Seq("ATG"), Seq("ATCCCG"), Seq("TTGCA")]
spacer = Seq("N"*10)
spacer.join(contigs)
Output:
Seq('ATGNNNNNNNNNNATCCCGNNNNNNNNNNTTGCA')
Expected output:
Seq('ATGNNNNNNNNNNATCCCGNNNNNNNNNNTTGCANNNNNNNNNN')
Don't indices in both Python and Biopython begin at 0?
Thank you

This has nothing to do with biopython.
This is just how string.join works:
contigs = ["ATG", "ATCCCG", "TTGCA"]
spacer = "N"*10
spacer.join(contigs)
Result:
ATGNNNNNNNNNNATCCCGNNNNNNNNNNTTGCA
As it should - according to help(str.join):
join(self, iterable, /)
Concatenate any number of strings.
The string whose method is called is inserted in between each given string.
The result is returned as a new string.
Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'
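If you do want a spacer after the last contig as well, you have to add it yourself, because join only places the separator between elements. A minimal sketch with plain Python strings (Seq objects support + and join too, so the same expressions should carry over, but treat that as an assumption to check against your Biopython version):

```python
contigs = ["ATG", "ATCCCG", "TTGCA"]
spacer = "N" * 10

# join() only puts the spacer between elements, so append one more at the end.
with_trailing = spacer.join(contigs) + spacer

# Equivalent: attach the spacer to every contig, then concatenate everything.
also_with_trailing = "".join(contig + spacer for contig in contigs)

print(with_trailing)
```

Both expressions give the same string; the first stays closest to the tutorial code.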

Related

Find sequence IDs of DNA subsequences in DNA-sequences from FASTA-file

I want to make a function that reads a FASTA file with DNA sequences (possibly ambiguous), takes a subsequence as input, and returns the sequence IDs of all sequences that contain the given subsequence.
To make the script more efficient, I tried to use nt_search to give all possibilities of the ambiguous sequence from the FASTA file. This seemed more efficient than producing all unambiguous possibilities, especially for larger sequences and FASTA files.
Right now, I'm struggling to see how I can check whether the subsequence is part of the output given by nt_search.
I want to see if e.g. 'CGC' (input subsequence) is part of the possibilities given by nt_search: ['TA[GATC][AT][GT]GCGGT'] and return all sequence IDs of sequences for which this is true.
What I have so far:
def bonus_subsequence(file, unambiguous_sequence):
    seq_records = SeqIO.parse(file,'fasta', alphabet =ambiguous_dna)
    resultListOfSeqIds = []
    print(f'Unambiguous sequence {unambiguous_sequence} could be a subsequence of:')
    for record in seq_records:
        d = Seq.IUPAC.IUPACData.ambiguous_dna_values
        couldBeSubSequence = False;
        if unambiguous_sequence in nt_search(unambiguous_sequence,record):
            couldBeSubSequence = True;
        if couldBeSubSequence == True:
            print(f'{record.id}')
            resultListOfSeqIds.append({record.id})
In a second phase, I want to be able to also use this for ambiguous subsequences, but I'd be more than happy with help on this first question, thanks in advance!
I'm not sure I understood you correctly, but you can try this:
Example fasta file:
>seq1
ATGTACGTACGTACNNNNACTG
>seq2
NNNATCGTAGTCANNA
>seq3
NNNNATGNNN
Code:
from Bio import SeqIO
# Bio.Alphabet was removed in Biopython 1.78; on newer versions drop this
# import and the alphabet argument below.
from Bio.Alphabet.IUPAC import ambiguous_dna

if __name__ == '__main__':
    sub_seq = input('Enter a subsequence: ')
    results = []
    with open('test.fasta', 'r') as fh:
        for seq in SeqIO.parse(fh, 'fasta', alphabet=ambiguous_dna):
            # Literal containment test on the record's sequence
            if sub_seq in seq:
                results.append(seq.name)
    print('Results:')
    print(*results, sep='\n')
Results (console):
Enter a subsequence: ATG
Results:
seq1
seq3
Enter a subsequence: NNNA
Results:
seq1
seq2
seq3
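Note that the answer above matches the typed subsequence literally, so an N in the record only matches an N in the query. For the original question (could an unambiguous subsequence occur in an ambiguous record?), one option is to slide a window over the record and check each base against the set of bases its IUPAC code allows. The sketch below is plain Python without Biopython; the IUPAC table is the standard nucleotide code, while the names IUPAC and could_contain are made up for this illustration:

```python
# IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def could_contain(ambiguous_seq, sub):
    """Return True if the unambiguous `sub` could occur somewhere in `ambiguous_seq`."""
    ambiguous_seq = ambiguous_seq.upper()
    sub = sub.upper()
    # Try every window of len(sub) in the ambiguous sequence.
    for start in range(len(ambiguous_seq) - len(sub) + 1):
        window = ambiguous_seq[start:start + len(sub)]
        # Each base of `sub` must be allowed by the code at that position;
        # unknown codes allow nothing.
        if all(base in IUPAC.get(code, "") for code, base in zip(window, sub)):
            return True
    return False


# Hypothetical records standing in for the parsed FASTA file.
records = {"rec1": "TARWKGC", "rec2": "ACGT"}
hits = [rec_id for rec_id, seq in records.items() if could_contain(seq, "TAGAT")]
print(hits)
```

With records read through SeqIO.parse you would pass str(record.seq) and collect record.id instead of the dictionary keys used here.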

How to refactor string containing variable names into booleans?

I have an SPSS variable containing lines like:
|2|3|4|5|6|7|8|10|11|12|13|14|15|16|18|20|21|22|23|24|25|26|27|28|29|
Every line starts with pipe, and ends with one. I need to refactor it into boolean variables as the following:
var       var1 var2 var3 var4 var5
|2|4|5|   0    1    0    1    1
I have tried to do it with a loop like:
loop # = 1 to 72.
compute var# = SUBSTR(var,2#,1).
end loop.
exe.
My code won't work with numbers that are two or more digits long, and it also won't place the values into their respective variables. I've tried nesting char.substr(var,char.rindex(var,'|') + 1) in another loop, with no luck, because it still doesn't let me recognize the variable number.
How can I do it?
This looks like a nice job for the DO REPEAT command. However the type conversion is somewhat tricky:
DO REPEAT var#i=var1 TO var72
/i=1 TO 72.
COMPUTE var#i = (CHAR.INDEX(var,CONCAT("|",LTRIM(STRING(i,F2.0)),"|"))>0).
END REPEAT.
Explanation: Let's go from the inside to the outside:
STRING(value,F2.0) converts the numeric value into a string of two characters (with a leading white space where the number consists of just one digit), e.g. 2 -> " 2".
LTRIM() removes the leading whitespaces, e.g. " 2" -> "2".
CONCAT() concatenates strings. In the above code it adds the "|" before and after the number, e.g. "2" -> "|2|"
CHAR.INDEX(stringvar,searchstring) returns the position at which the searchstring was found. It returns 0 if the searchstring wasn't found.
CHAR.INDEX(stringvar,searchstring)>0 returns a boolean value indicating if the searchstring was found or not.
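The same idea, wrapping each number in pipes and testing whether that token occurs in the string, can be sketched in plain Python to check the logic outside SPSS. The function name flags_from_pipes and its n_vars parameter are hypothetical, chosen only for this illustration:

```python
def flags_from_pipes(text, n_vars=72):
    """Return a list of 0/1 flags; flag i is 1 when the token "|i|" occurs in text."""
    # Searching for "|i|" (with both pipes) keeps "|1|" from matching inside "|12|".
    return [1 if "|{}|".format(i) in text else 0 for i in range(1, n_vars + 1)]


flags = flags_from_pipes("|2|4|5|", n_vars=5)
print(flags)  # [0, 1, 0, 1, 1]
```

Including both delimiters in the search token is what makes multi-digit numbers safe, which is why the pipes are concatenated around the number in the SPSS syntax as well.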
It's easier to do the manipulations in Python than native SPSS syntax.
You can use SPSSINC TRANS extension for this purpose.
/* Example data*/.
data list free / TextStr (a99).
begin data.
"|2|3|4|5|6|7|8|10|11|12|13|14|15|16|18|20|21|22|23|24|25|26|27|28|29|"
end data.
/* defining function to achieve task */.
begin program.
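def runTask(x):
    # Collect the numbers between the pipes, skipping empty pieces.
    numbers = set(int(i) for i in x.split("|") if i.strip())
    # One 0/1 flag per position from 1 up to the largest number present.
    answer = [1 if i in numbers else 0 for i in range(1, max(numbers) + 1)]
    return answer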
end program.
/* Run job*/.
spssinc trans result = V1 to V30 type=0 /formula "runTask(TextStr)".
exe.

How to format strings to print in a file with F#

This code prints float numbers to the file in the format f,ffffff (with a comma), all on one row. I need them printed as f.ffffff (with a dot), with a line break after each number so that each number is on its own line. Any ideas on how to do it?
CODE EDITED
module writeFiles =
    let (w:float[]) = [|-1.3231725; 1.052134922; 1.23082055; 1.457748868; -0.3481141253; -0.06886428466; -1.473392229; 0.1103078722; -1.047231857; -2.641890652; -1.335060286; -0.9839854216; 0.1844535984; 3.087001584; -0.008467130841; 1.175365466; 1.637297522; 5.557832631; -0.2906445452; -0.4052301538; 1.766454088; -2.604325471; -1.807107036; -2.471407376; -2.204730614;|]
    let write secfilePath=
        for j in 0 .. 24 do
            let z = w.[j].ToString()
            File.AppendAllText(secfilePath, z)
            //File.AppendAllLines(secfilePath, z)
        done
There are a couple of things that could be done better in your code:
You're opening the file over and over again every time you add a number
z does not need to be mutable
You can pass format pattern and/or culture to ToString call
You can iterate over filterMod.y instead of for loop and array indexer access
I would probably go with something more like
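open System.Globalization
open System.IO

module writeFiles =
    let write secfilePath =
        // Format every number with the invariant culture ('.' as decimal separator)
        let data =
            filterMod.y
            |> Array.map (fun x -> x.ToString(CultureInfo.InvariantCulture))
        // Write all lines in one call, one number per line
        File.AppendAllLines(secfilePath, data)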
It prepares an array of strings, where every number of filterMod.y gets formatted using CultureInfo.InvariantCulture, which will make it use . as decimal separator. And later on it uses AppendAllLines to write the whole array to the file at once, where every element will be written in a separate line.

string comparison against factors in Stata

Suppose I have a factor variable with labels "a" "b" and "c" and want to see which observations have a label of "b". Stata refuses to parse
gen isb = myfactor == "b"
Sure, there is literally a "type mismatch", since my factor is encoded as an integer and so cannot be compared to the string "b". However, it wouldn't kill Stata to (i) perform the obvious parse or (ii) provide a translator function so I can write the comparison as label(myfactor) == "b". Using decode to (re)create a string variable defeats the purpose of encoding, which is to save space and make computations more efficient, right?
I hadn't really expected the comparison above to work, but I at least figured there would be a one- or two-line approach. Here is what I have found so far. There is a nice macro ("extended") function that maps the other way (from an integer to a label, seen below as local labi: label ...). Here's the solution using it:
// sample data
clear
input str5 mystr int mynum
a 5
b 5
b 6
c 4
end
encode mystr, gen(myfactor)
// first, how many groups are there?
by myfactor, sort: gen ng = _n == 1
replace ng = sum(ng)
scalar ng = ng[_N]
drop ng
// now, which code corresponds to "b"?
forvalues i = 1/`=ng' {
    local labi: label myfactor `i'
    if "b" == "`labi'" {
        scalar bcode = `i'
        continue, break
    }
}
di bcode
The second step is what irks me, but I'm sure there's also a faster, more idiomatic way of performing the first step. Can I grab the length of the label vector, for example?
An example:
clear all
set more off
sysuse auto
gen isdom = 1 if foreign == "Domestic":`:value label foreign'
list foreign isdom in 1/60
This creates a variable called isdom that equals 1 if foreign's value label is equal to "Domestic" (and is missing otherwise). It uses an extended macro function.
From [U] 18.3.8 Macro expressions:
Also, typing
command that makes reference to `:extended macro function'
is equivalent to
local macroname : extended macro function
command that makes reference to `macroname'
This explains one of the two : in the offered syntax. The other can be explained by
... to specify value labels directly in an expression, rather than through
the underlying numeric value ... You specify the label in double quotes
(""), followed by a colon (:), followed by the name of the value
label.
The quote is from Stata tip 14: Using value labels in expressions, by Kenneth Higbee, The Stata Journal (2004). Freely available at http://www.stata-journal.com/sjpdf.html?articlenum=dm0009
Edit
On computing the number of distinct observations, another way is:
by myfactor, sort: gen ng = _n == 1
count if ng
scalar sc_ng = r(N)
display sc_ng
But yours is fine. In fact, it is documented here: http://www.stata.com/support/faqs/data-management/number-of-distinct-observations/, along with more methods and comments.

How to extract data from F# list

Following up my previous question, I'm slowly getting the hang of FParsec (though I do find it particularly hard to grok).
My next newbie F# question is, how do I extract data from the list the parser creates?
For example, I loaded the sample code from the previous question into a module called Parser.fs, and added a very simple unit test in a separate module (with the appropriate references). I'm using XUnit:
open Xunit

[<Fact>]
let Parse_1_ShouldReturnListContaining1 () =
    let interim = Parser.parse("1")
    Assert.False(List.isEmpty(interim))
    let head = interim.Head // I realise that I have only one item in the list this time
    Assert.Equal("1", ???)
Interactively, when I execute parse "1" the response is:
val it : Element list = [Number "1"]
and by tweaking the list of valid operators, I can run parse "1+1" to get:
val it : Element list = [Number "1"; Operator "+"; Number "1"]
What do I need to put in place of my ??? in the snippet above? And how do I check that it is a Number, rather than an Operator, etc.?
F# types (including lists) implement structural equality. This means that if you compare two lists that contain some F# types using =, it will return true when the lists have the same length and contain elements with the same properties.
Assuming that the Element type is a discriminated union defined in F# (and is not an object type), you should be able to write just:
Assert.Equal(interim, [Number "1"; Operator "+"; Number "1"])
If you wanted to implement the equality yourself, then you could use pattern matching:
let expected = [Number "1"]
match interim, expected with
| [Number a], [Number b] when a = b -> true
| _ -> false

Resources