Search in a consensus sequence with ambiguous bases - biopython

I have a consensus sequence with ambiguous bases, in which I want to search for a pattern. Start position should be 3.
Seq="ATGARTTTTTT" -- R is A or G
Pattern="AGT"
I found a tool called nt_search in SeqUtils that can do that but it doesn't give me the coordinates of the match as illustrated in scenario #2. Below are a few test runs to demonstrate the issue.
Scenario #1: No Ambiguous Bases
from Bio import SeqUtils
pattern="TAA"
Seq="ATGTAAAGGAGG"
m=SeqUtils.nt_search(Seq,pattern)
print m
['ACG', 3]
Scenario #2: Ambiguous Bases in sequence
pattern="AGT"
Seq="ATGARTTTTTT"
m=SeqUtils.nt_search(Seq,pattern)
print m
['AGT']
Scenario #3: Ambiguous Bases in pattern
pattern="ART"
Seq="ATGAGTTTTTT"
m=SeqUtils.nt_search(Seq,pattern)
print m
['A[AG]T', 3]
The source code for nt_search is here. I am not sure how to tweak it to get the start position, in this example, that would be 3.

The help text for nt_search could be much clearer, but it returns a list giving the regular expression used and the position of any matches. e.g.
>>> from Bio.SeqUtils import nt_search
>>> print(nt_search("ATGAGTTTTTT", "ART"))
['A[AG]T', 3]
>>> print(nt_search("ATGAGTTTTTTAGT", "ART"))
['A[AG]T', 3, 11]
>>> print(nt_search("ATGAGTTTTTTAGTTTTAAT", "ART"))
['A[AG]T', 3, 11, 17]
So, if you want the first match only as an integer, you need to pull out element one from the list:
>>> from Bio.SeqUtils import nt_search
>>> print(nt_search("ATGAGTTTTTTAGT", "ART")[0])
A[AG]T
>>> print(nt_search("ATGAGTTTTTTAGT", "ART")[1])
3
Element zero would be the regular expression used.
Update: Searching against an ambiguous sequence is not (currently) supported (scenario 2 in the question), in part I would think as it would be difficult to define a sensible implementation. e.g. searching as against "NNN"with any any three letter query like "AAA", "ART", or "NNN" would give you a match.

Related

Prime factorization of integers with Maxima

I want to use Maxima to get the prime factorization of a random positive integer, e.g. 12=2^2*3^1.
What I have tried so far:
a:random(20);
aa:abs(a);
fa:ifactors(aa);
ka:length(fa);
ta:1;
pfza: for i:1 while i<=ka do ta:ta*(fa[i][1])^(fa[i][2]);
ta;
This will be implemented in STACK for Moodle as part of a online exercise for students, so the exact implementation will be a little bit different from this, but I broke it down to these 7 lines.
I generate a random number a, make sure that it is a positive integer by using aa=|a|+1 and want to use the ifactors command to get the prime factors of aa. ka tells me the number of pairwise distinct prime factors which I then use for the while loop in pfza. If I let this piece of code run, it returns everything fine, execpt for simplifying ta, that is I don't get ta as a product of primes with some exponents but rather just ta=aa.
I then tried to turn off the simplifier, manually simplifying everything else that I need:
simp:false$
a:random(20);
aa:ev(abs(a),simp);
fa:ifactors(aa);
ka:ev(length(fa),simp);
ta:1;
pfza: for i:1 while i<=ka do ta:ta*(fa[i][1])^(fa[i][2]);
ta;
This however does not compile; I assume the problem is somewhere in the line for pfza, but I don't know why.
Any input on how to fix this? Or another method of getting the factorizing in a non-simplified form?
(1) The for-loop fails because adding 1 to i requires 1 + 1 to be simplified to 2, but simplification is disabled. Here's a way to make the loop work without requiring arithmetic.
(%i10) for f in fa do ta:ta*(f[1]^f[2]);
(%o10) done
(%i11) ta;
2 2 1
(%o11) ((1 2 ) 2 ) 3
Hmm, that's strange, again because of the lack of simplification. How about this:
(%i12) apply ("*", map (lambda ([f], f[1]^f[2]), fa));
2 1
(%o12) 2 3
In general I think it's better to avoid explicit indexing anyway.
(2) But maybe you don't need that at all. factor returns an unsimplified expression of the kind you are trying to construct.
(%i13) simp:true;
(%o13) true
(%i14) factor(12);
2
(%o14) 2 3
I think it's conceptually inconsistent for factor to return an unsimplified, but anyway it seems to work here.

Is it appropriate for a parser DCG to not be deterministic?

I am writing a parser for a query engine. My parser DCG query is not deterministic.
I will be using the parser in a relational manner, to both check and synthesize queries.
Is it appropriate for a parser DCG to not be deterministic?
In code:
If I want to be able to use query/2 both ways, does it require that
?- phrase(query, [q,u,e,r,y]).
true;
false.
or should I be able to obtain
?- phrase(query, [q,u,e,r,y]).
true.
nevertheless, given that the first snippet would require me to use it as such
?- bagof(X, phrase(query, [q,u,e,r,y]), [true]).
true.
when using it to check a formula?
The first question to ask yourself, is your grammar deterministic, or in the terminology of grammars, unambiguous. This is not asking if your DCG is deterministic, but if the grammar is unambiguous. That can be answered with basic parsing concepts, no use of DCG is needed to answer that question. In other words, is there only one way to parse a valid input. The standard book for this is "Compilers : principles, techniques, & tools" (WorldCat)
Now you are actually asking about three different uses for parsing.
A recognizer.
A parser.
A generator.
If your grammar is unambiguous then
For a recognizer the answer should only be true for valid input that can be parsed and false for invalid input.
For the parser it should be deterministic as there is only one way to parse the input. The difference between a parser and an recognizer is that a recognizer only returns true or false and a parser will return something more, typically an abstract syntax tree.
For the generator, it should be semi-deterministic so that it can generate multiple results.
Can all of this be done with one, DCG, yes. The three different ways are dependent upon how you use the input and output of the DCG.
Here is an example with a very simple grammar.
The grammar is just an infix binary expression with one operator and two possible operands. The operator is (+) and the operands are either (1) or (2).
expr(expr(Operand_1,Operator,Operand_2)) -->
operand(Operand_1),
operator(Operator),
operand(Operand_2).
operand(operand(1)) --> "1".
operand(operand(2)) --> "2".
operator(operator(+)) --> "+".
recognizer(Input) :-
string_codes(Input,Codes),
DCG = expr(_),
phrase(DCG,Codes,[]).
parser(Input,Ast) :-
string_codes(Input,Codes),
DCG = expr(Ast),
phrase(DCG,Codes,[]).
generator(Generated) :-
DCG = expr(_),
phrase(DCG,Codes,[]),
string_codes(Generated,Codes).
:- begin_tests(expr).
recognizer_test_case_success("1+1").
recognizer_test_case_success("1+2").
recognizer_test_case_success("2+1").
recognizer_test_case_success("2+2").
test(recognizer,[ forall(recognizer_test_case_success(Input)) ] ) :-
recognizer(Input).
recognizer_test_case_fail("2+3").
test(recognizer,[ forall(recognizer_test_case_fail(Input)), fail ] ) :-
recognizer(Input).
parser_test_case_success("1+1",expr(operand(1),operator(+),operand(1))).
parser_test_case_success("1+2",expr(operand(1),operator(+),operand(2))).
parser_test_case_success("2+1",expr(operand(2),operator(+),operand(1))).
parser_test_case_success("2+2",expr(operand(2),operator(+),operand(2))).
test(parser,[ forall(parser_test_case_success(Input,Expected_ast)) ] ) :-
parser(Input,Ast),
assertion( Ast == Expected_ast).
parser_test_case_fail("2+3").
test(parser,[ forall(parser_test_case_fail(Input)), fail ] ) :-
parser(Input,_).
test(generator,all(Generated == ["1+1","1+2","2+1","2+2"]) ) :-
generator(Generated).
:- end_tests(expr).
The grammar is unambiguous and has only 4 valid strings which are all unique.
The recognizer is deterministic and only returns true or false.
The parser is deterministic and returns a unique AST.
The generator is semi-deterministic and returns all 4 valid unique strings.
Example run of the test cases.
?- run_tests.
% PL-Unit: expr ........... done
% All 11 tests passed
true.
To expand a little on the comment by Daniel
As Daniel notes
1 + 2 + 3
can be parsed as
(1 + 2) + 3
or
1 + (2 + 3)
So 1+2+3 is an example as you said is specified by a recursive DCG and as I noted a common way out of the problem is to use parenthesizes to start a new context. What is meant by starting a new context is that it is like getting a new clean slate to start over again. If you are creating an AST, you just put the new context, items in between the parenthesizes, as a new subtree at the current node.
With regards to write_canonical/1, this is also helpful but be aware of left and right associativity of operators. See Associative property
e.g.
+ is left associative
?- write_canonical(1+2+3).
+(+(1,2),3)
true.
^ is right associative
?- write_canonical(2^3^4).
^(2,^(3,4))
true.
i.e.
2^3^4 = 2^(3^4) = 2^81 = 2417851639229258349412352
2^3^4 != (2^3)^4 = 8^4 = 4096
The point of this added info is to warn you that grammar design is full of hidden pitfalls and if you have not had a rigorous class in it and done some of it you could easily create a grammar that looks great and works great and then years latter is found to have a serious problem. While Python was not ambiguous AFAIK, it did have grammar issues, it had enough issues that when Python 3 was created, many of the issues were fixed. So Python 3 is not backward compatible with Python 2 (differences). Yes they have made changes and libraries to make it easier to use Python 2 code with Python 3, but the point is that the grammar could have used a bit more analysis when designed.
The only reason why code should be non-deterministic is that your question has multiple answers. In that case, you'd of course want your query to have multiple solutions. Even then, however, you'd like it to not leave a choice point after the last solution, if at all possible.
Here is what I mean:
"What is the smaller of two numbers?"
min_a(A, B, B) :- B < A.
min_a(A, B, A) :- A =< B.
So now you ask, "what is the smaller of 1 and 2" and the answer you expect is "1":
?- min_a(1, 2, Min).
Min = 1.
?- min_a(2, 1, Min).
Min = 1 ; % crap...
false.
?- min_a(2, 1, 2).
false.
?- min_a(2, 1, 1).
true ; % crap...
false.
So that's not bad code but I think it's still crap. This is why, for the smaller of two numbers, you'd use something like the min() function in SWI-Prolog.
Similarly, say you want to ask, "What are the even numbers between 1 and 10"; you write the query:
?- between(1, 10, X), X rem 2 =:= 0.
X = 2 ;
X = 4 ;
X = 6 ;
X = 8 ;
X = 10.
... and that's fine, but if you then ask for the numbers that are multiple of 3, you get:
?- between(1, 10, X), X rem 3 =:= 0.
X = 3 ;
X = 6 ;
X = 9 ;
false. % crap...
The "low-hanging fruit" are the cases where you as a programmer would see that there cannot be non-determinism, but for some reason your Prolog is not able to deduce that from the code you wrote. In most cases, you can do something about it.
On to your actual question. If you can, write your code so that there is non-determinism only if there are multiple answers to the question you'll be asking. When you use a DCG for both parsing and generating, this sometimes means you end up with two code paths. It feels clumsy but it is easier to write, to read, to understand, and probably to make efficient. As a word of caution, take a look at this question. I can't know that for sure, but the problems that OP is running into are almost certainly caused by unnecessary non-determinism. What probably happens with larger inputs is that a lot of choice points are left behind, there is a lot of memory that cannot be reclaimed, a lot of processing time going into book keeping, huge solution trees being traversed only to get (as expected) no solutions.... you get the point.
For examples of what I mean, you can take a look at the implementation of library(dcg/basics) in SWI-Prolog. Pay attention to several things:
The documentation is very explicit about what is deterministic, what isn't, and how non-determinism is supposed to be useful to the client code;
The use of cuts, where necessary, to get rid of choice points that are useless;
The implementation of number//1 (towards the bottom) that can "generate extract a number".
(Hint: use the primitives in this library when you write your own parser!)
I hope you find this unnecessarily long answer useful.

dask equivalent of df.loc[df.index.intesection(mylabels)]

When I run df.loc[mylabels] in dask I get a warning with the link to
Warning Starting in 0.21.0, using .loc or [] with a list with one or more missing labels, is deprecated, in favor of .reindex *
This page also says:
Alternatively, if you want to select only valid keys, the following is idiomatic and efficient; it is guaranteed to preserve the dtype of the selection.
In [106]: labels = [1, 2, 3]
In [107]: s.loc[s.index.intersection(labels)]
Out[107]:
1 2
2 3
dtype: int64
Dask indexes do not have an intersection method.
So hat is the recommended way to achieve the above effect in dask?
The problem with df.loc[mylabels] is that mylabels contains items not in df.index.
For now it looks like you should continue calling df.loc[labels].
It looks like things have changed upstream and probably dask.dataframe needs to follow a bit. I recommend submitting a bug report to https://github.com/dask/dask/issues/new

Prolog solving prefix arithmetic expression with unknown variable

I want to make an arithmetic solver in Prolog that can have +,-,*,^ operations on numbers >= 2. It should also be possible to have a variable x in there. The input should be a prefix expression in a list.
I have made a program that parses an arithmetic expression in prefix format into a syntax tree. So that:
?- parse([+,+,2,9,*,3,x],Tree).
Tree = plus(plus(num(2), num(9)), mul(num(3), var(x))) .
(1) At this stage, I want to extend this program to be able to solve it for a given x value. This should be done by adding another predicate evaluate(Tree, Value, Solution) which given a value for the unknown x, calculates the solution.
Example:
?- parse([*, 2, ^, x, 3],Tree), evaluate(Ast, 2, Solution).
Tree = mul(num(2), pow(var(x), num(3))) ,
Solution = 16.
I'm not sure how to solve this problem due to my lack of Prolog skills, but I need a way of setting the var(x) to num(2) like in this example (because x = 2). Maybe member in Prolog can be used to do this. Then I have to solve it using perhaps is/2
Edit: My attempt to solving it. Getting error: 'Undefined procedure: evaluate/3 However, there are definitions for: evaluate/5'
evaluate(plus(A,B),Value,Sol) --> evaluate(A,AV,Sol), evaluate(B,BV,Sol), Value is AV+BV.
evaluate(mul(A,B),Value,Sol) --> evaluate(A,AV,Sol), evaluate(B,BV,Sol), Value is AV*BV.
evaluate(pow(A,B),Value,Sol) --> evaluate(A,AV,Sol), evaluate(B,BV,Sol), Value is AV^BV.
evaluate(num(Num),Value,Sol) --> number(Num).
evaluate(var(x),Value,Sol) --> number(Value).
(2) I'd also want to be able to express it in postfix form. Having a predicate postfixform(Tree, Postfixlist)
Example:
?- parse([+, *, 2, x, ^, x, 5 ],Tree), postfix(Tree,Postfix).
Tree = plus(mul(num(2), var(x)), pow(var(x), num(5))) ,
Postfix = [2, x, *, x, 5, ^, +].
Any help with (1) and (2) would be highly appreciated!
You don't need to use a grammar for this, as you are doing. You should use normal rules.
This is the pattern you need to follow.
evaluate(plus(A,B),Value,Sol) :-
evaluate(A, Value, A2),
evaluate(B, Value, B2),
Sol is A2+B2.
And
evaluate(num(X),_Value,Sol) :- Sol = X.
evaluate(var(x),Value,Sol) :- Sol = Value.

Find all possible pairs between the subsets of N sets with Erlang

I have a set S. It contains N subsets (which in turn contain some sub-subsets of various lengths):
1. [[a,b],[c,d],[*]]
2. [[c],[d],[e,f],[*]]
3. [[d,e],[f],[f,*]]
N. ...
I also have a list L of 'unique' elements that are contained in the set S:
a, b, c, d, e, f, *
I need to find all possible combinations between each sub-subset from each subset so, that each resulting combination has exactly one element from the list L, but any number of occurrences of the element [*] (it is a wildcard element).
So, the result of the needed function working with the above mentioned set S should be (not 100% accurate):
- [a,b],[c],[d,e],[f];
- [a,b],[c],[*],[d,e],[f];
- [a,b],[c],[d,e],[f],[*];
- [a,b],[c],[d,e],[f,*],[*];
So, basically I need an algorithm that does the following:
take a sub-subset from the subset 1,
add one more sub-subset from the subset 2 maintaining the list of 'unique' elements acquired so far (the check on the 'unique' list is skipped if the sub-subset contains the * element);
Repeat 2 until N is reached.
In other words, I need to generate all possible 'chains' (it is pairs, if N == 2, and triples if N==3), but each 'chain' should contain exactly one element from the list L except the wildcard element * that can occur many times in each generated chain.
I know how to do this with N == 2 (it is a simple pair generation), but I do not know how to enhance the algorithm to work with arbitrary values for N.
Maybe Stirling numbers of the second kind could help here, but I do not know how to apply them to get the desired result.
Note: The type of data structure to be used here is not important for me.
Note: This question has grown out from my previous similar question.
These are some pointers (not a complete code) that can take you to right direction probably:
I don't think you will need some advanced data structures here (make use of erlang list comprehensions). You must also explore erlang sets and lists module. Since you are dealing with sets and list of sub-sets, they seems like an ideal fit.
Here is how things with list comprehensions will get solved easily for you: [{X,Y} || X <- [[c],[d],[e,f]], Y <- [[a,b],[c,d]]]. Here i am simply generating a list of {X,Y} 2-tuples but for your use case you will have to put real logic here (including your star case)
Further note that with list comprehensions, you can use output of one generator as input of a later generator e.g. [{X,Y} || X1 <- [[c],[d],[e,f]], X <- X1, Y1 <- [[a,b],[c,d]], Y <- Y1].
Also for removing duplicates from a list of things L = ["a", "b", "a"]., you can anytime simply do sets:to_list(sets:from_list(L)).
With above tools you can easily generate all possible chains and also enforce your logic as these chains get generated.

Resources