How to align lists of words using Biopython pairwise2

When I run the script below, the output is split into single characters. Any idea why? It looks like the second argument gets split into single characters.
I am trying to align sequences of words.
I will have many distinct words, so I cannot simply map each word to a single letter.
from Bio.Seq import Seq
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
fruits = ["orange", "pear", "apple", "pear", "orange"]
fruits1 = ["pear", "apple"]
alignments = pairwise2.align.localms(fruits, fruits1, 2, -1, -0.5, -0.1, gap_char=["-"])
for a in alignments:
    print(format_alignment(*a))
Output:
['orange', 'r', 'a', 'e', 'p', 'e', 'l', 'p', 'p', 'a', 'pear', 'orange']
|||||||||
['-', 'r', 'a', 'e', 'p', 'e', 'l', 'p', 'p', 'a', '-', '-']
Score=4

You are passing a list to localms, which expects a string or a Seq object. Also, gap_char should be a string, not a list.
Try the following snippet, which aligns each pair of words character by character:
import Bio.pairwise2 as pairwise2
fruits = ["orange", "pear", "apple", "pear", "orange"]
fruits1 = ["pear", "apple"]
for f0 in fruits:
    for f1 in fruits1:
        print('Aligning {0} and {1}'.format(f0, f1))
        alignments = pairwise2.align.localms(f0, f1, 2, -1, -0.5, -0.1, gap_char="-")
        for a in alignments:
            print(pairwise2.format_alignment(*a))
Output
Aligning orange and pear
orange
|
pear--
Score=2
Aligning orange and apple
orange
|
-apple
Score=2
orange-
|
--apple
Score=2
Aligning pear and pear
pear
||||
pear
Score=8
[...]
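If the goal is to line up whole words rather than the characters inside them, one option outside Biopython is the standard library's difflib.SequenceMatcher, which compares lists of any hashable items. This is a swapped-in technique, not pairwise2 and not a scored local alignment; it only reports the runs of words the two lists share. A minimal sketch, assuming exact word matches are enough:

```python
from difflib import SequenceMatcher

fruits = ["orange", "pear", "apple", "pear", "orange"]
fruits1 = ["pear", "apple"]

# autojunk=False disables the popularity heuristic, which only matters for long inputs.
matcher = SequenceMatcher(a=fruits, b=fruits1, autojunk=False)

# Each block is (start in fruits, start in fruits1, length); skip the zero-length sentinel.
blocks = [(m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size]

for a_start, b_start, size in blocks:
    print(fruits[a_start:a_start + size], "matches", fruits1[b_start:b_start + size])
```

Because the elements are compared as whole objects, nothing is split into characters, and no mapping from words to letters is needed.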

Related

CountVectorizer skips letters but returns count of words

I have a list of words like below.
words = ['john', 'i', 'romeo', 'i', 'john', 'steve', 'k']
I apply CountVectorizer to get the count of words as below.
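from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer().fit(words)
word_library = vec.transform(words)
sum_words = word_library.sum(axis=0)
word_counts = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]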
It returns
[('john', 2), ('romeo', 1), ('steve', 1)]
I would like to get the counts of single letters too; they should not vanish in the process:
[('john', 2), ('i', 2), ('romeo', 1), ('steve', 1), ('k', 1)]
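The single letters most likely disappear because CountVectorizer's default token_pattern is (?u)\b\w\w+\b, which only keeps tokens of two or more word characters. The sketch below reproduces that behaviour with the standard re module alone, so it runs without scikit-learn; the relaxed pattern and the count_tokens helper are illustrative assumptions, not part of the library.

```python
import re
from collections import Counter

words = ['john', 'i', 'romeo', 'i', 'john', 'steve', 'k']

# Default token_pattern documented for CountVectorizer: two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern (assumed to be what the asker wants): one or more word characters.
RELAXED_PATTERN = r"(?u)\b\w+\b"

def count_tokens(docs, pattern):
    """Hypothetical helper: tokenize each document with `pattern` and count tokens."""
    regex = re.compile(pattern)
    counts = Counter()
    for doc in docs:
        counts.update(regex.findall(doc.lower()))
    return counts

default_counts = count_tokens(words, DEFAULT_PATTERN)
relaxed_counts = count_tokens(words, RELAXED_PATTERN)
print(sorted(default_counts.items()))
print(sorted(relaxed_counts.items()))
```

With scikit-learn itself, the corresponding change would be passing the relaxed pattern as CountVectorizer(token_pattern=r"(?u)\b\w+\b").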

String Indexer, CountVectorizer Pyspark on single row

Hi, I'm facing a problem where each row has two columns, each holding an array of words.
column1, column2
["a", "b" ,"b", "c"], ["a","b", "x", "y"]
Basically I want to count the occurrence of each word between columns to end up with two arrays:
[1, 2, 1, 0, 0],
[1, 1, 0, 1, 1]
So "a" appears once in each array, "b" appears twice in column1 and once in column2, "c" only appears in column1, "x" and "y" only in column2. So on and so forth.
I've looked at CountVectorizer from the ml library, but I'm not sure whether it works row-wise, given that the arrays in each column can be very large. Also, 0 values (where a word appears in one column but not the other) don't seem to get carried through.
Any help appreciated.
For Spark 2.4+, you can do that using the DataFrame API and built-in array functions.
First, get all the words for each row using the array_union function. Then use the transform function on the words array: for each element, calculate the number of occurrences in each column using the size and array_remove functions:
from pyspark.sql.functions import array_union, expr
df = spark.createDataFrame([(["a", "b", "b", "c"], ["a", "b", "x", "y"])], ["column1", "column2"])
df.withColumn("words", array_union("column1", "column2")) \
  .withColumn("occ_column1",
              expr("transform(words, x -> size(column1) - size(array_remove(column1, x)))")) \
  .withColumn("occ_column2",
              expr("transform(words, x -> size(column2) - size(array_remove(column2, x)))")) \
  .drop("words") \
  .show(truncate=False)
Output:
+------------+------------+---------------+---------------+
|column1     |column2     |occ_column1    |occ_column2    |
+------------+------------+---------------+---------------+
|[a, b, b, c]|[a, b, x, y]|[1, 2, 1, 0, 0]|[1, 1, 0, 1, 1]|
+------------+------------+---------------+---------------+
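To see what the expressions compute for a single row, here is a plain-Python sketch of the same logic without Spark. The function name is hypothetical, and the ordered union is an assumption chosen to mirror the order seen in the output above (elements of column1 first, then the new elements of column2).

```python
def occurrences_per_column(column1, column2):
    """Hypothetical helper mirroring the array_union / transform steps for one row."""
    # Ordered union without duplicates (dicts preserve insertion order).
    words = list(dict.fromkeys(column1 + column2))
    # size(col) - size(array_remove(col, x)) is simply the count of x in col.
    occ_column1 = [column1.count(word) for word in words]
    occ_column2 = [column2.count(word) for word in words]
    return words, occ_column1, occ_column2

words, occ1, occ2 = occurrences_per_column(["a", "b", "b", "c"], ["a", "b", "x", "y"])
print(words, occ1, occ2)
```

Words missing from one column show up as explicit zeros because the counts are taken over the union, not over each column's own vocabulary.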

Google Spreadsheet: Use column + n in formula

In my spreadsheet, column X contains the following formula:
=ImportRange('_keys'!$B$2;"2015!A200:A203")
Now I'd like to copy this formula to column X+n (in this case X+2) so that it looks like this:
=ImportRange('_keys'!$B$2;"2015!C200:C203")
But the column reference doesn't change, and I have to edit it by hand.
Is it possible to change this formula so that it always uses the column the formula is in?
You can use the COLUMN() function to get the column of the current cell as a number. Using ADDRESS() you can turn it into a cell reference string. See the docs for COLUMN and ADDRESS.
Your code becomes
=ImportRange('_keys'!$B$2;
CONCATENATE("2015!", ADDRESS(200, COLUMN()-Y, 4),
":", ADDRESS(203, COLUMN()-Y, 4))
)
where Y is the offset between column A and column X (where this formula is located). The third argument of ADDRESS makes both the row and column relative (without the $). Note that the order of arguments to ADDRESS is row then column, annoyingly.
My solution:
I wrote a simple custom function that converts numbers into letters.
/**
 * Converts a column number into its column letter.
 *
 * @param {Number} aNumber Number of column
 * @return {String} Letter of column
 * @customfunction
 */
function COL_NR2LETTER(aNumber) {
  var letterArray = ['-', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'AA', 'AB', 'AC', 'AD', 'AE', 'AF', 'AG', 'AH', 'AI', 'AJ', 'AK', 'AL', 'AM', 'AN', 'AO', 'AP', 'AQ', 'AR', 'AS', 'AT', 'AU', 'AV', 'AW', 'AX', 'AY', 'AZ'];
  // Index 0 is a placeholder, so valid column numbers are 1 .. letterArray.length - 1.
  if (aNumber < 1 || aNumber >= letterArray.length)
    throw "column index out of bound error";
  return letterArray[aNumber];
}
Now it's possible to copy
=ImportRange('_keys'!$B$2;
"2015!" & COL_NR2LETTER(Column(A1)) &"200:"& COL_NR2LETTER(Column(A1)) &"203")
from Column X into a column X+n.
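The lookup table above stops at column AZ. As a sketch of the same number-to-letter conversion without a fixed table, the following computes the letters arithmetically. It is written in Python for illustration only, not as Apps Script, and the function name is hypothetical.

```python
def col_number_to_letter(number):
    """Convert a 1-based column number to spreadsheet letters (1 -> A, 27 -> AA)."""
    if number < 1:
        raise ValueError("column index out of bounds")
    letters = []
    while number > 0:
        # Column letters have no zero digit, so shift by one before each division.
        number, remainder = divmod(number - 1, 26)
        letters.append(chr(ord("A") + remainder))
    return "".join(reversed(letters))

print(col_number_to_letter(1), col_number_to_letter(27), col_number_to_letter(52))
```

The same loop could be ported to a custom function if columns beyond AZ are ever needed.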

Why does char array insert trailing characters when converting to string in Objective C?

I'm trying to write a quick category on NSString to base64 encode the string's contents. Everything seems okay, except for extra characters showing up on the trailing end of the generated string. Can anybody explain why the following code produces the output below?
Source:
const char base64CharSet[64] = {
    'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H',
    'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P',
    'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
    'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f',
    'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
    'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
    'w', 'x', 'y', 'z', '0', '1', '2', '3',
    '4', '5', '6', '7', '8', '9', '+', '/'
};
const char *input = "Hello, World!";
int length = strlen(input);
int outlen = (length / 3) * 4;
int modlen = length % 3;
int rawlen = length - modlen;
if (modlen != 0)
    outlen += 4;
char output[outlen];
char inbuf[3], outbuf[4];
int inpos = 0, outpos = 0;
for (outpos = 0, inpos = 0; inpos < rawlen; inpos += 3) {
    for (int i = 0; i < 3; i++) {
        int j = inpos + i;
        inbuf[i] = j < length ? input[j] : 0;
    }
    outbuf[0] = (inbuf[0] & 0xFC) >> 2;
    outbuf[1] = ((inbuf[0] & 0x03) << 4) | ((inbuf[1] & 0xF0) >> 4);
    outbuf[2] = ((inbuf[1] & 0x0F) << 2) | ((inbuf[2] & 0xC0) >> 6);
    outbuf[3] = (inbuf[2] & 0x3F);
    output[outpos++] = base64CharSet[outbuf[0]];
    output[outpos++] = base64CharSet[outbuf[1]];
    output[outpos++] = base64CharSet[outbuf[2]];
    output[outpos++] = base64CharSet[outbuf[3]];
}
if (modlen > 0) {
    char modbuf[3] = {0, 0, 0};
    for (int i = 0; i < modlen; i++) {
        int j = rawlen + i;
        modbuf[i] = input[j];
    }
    outbuf[0] = (modbuf[0] & 0xFC) >> 2;
    outbuf[1] = ((modbuf[0] & 0x03) << 4) | ((modbuf[1] & 0xF0) >> 4);
    outbuf[2] = ((modbuf[1] & 0x0F) << 2) | ((modbuf[2] & 0xC0) >> 6);
    outbuf[3] = (modbuf[2] & 0x3F);
    output[outpos++] = base64CharSet[outbuf[0]];
    output[outpos++] = base64CharSet[outbuf[1]];
    output[outpos++] = modlen == 2 ? base64CharSet[outbuf[2]] : '=';
    output[outpos++] = '=';
}
NSLog(@"Input: '%s', Length: %zd", input, strlen(input));
NSLog(@"Output: '%s', Length: %zd, Expected Length: %d", output, strlen(output), outlen);
Output:
2013-03-19 14:46:51.568 Sandbox[19195:c07] Input: 'Hello, World!', Length: 13
2013-03-19 14:46:51.569 Sandbox[19195:c07] Output: 'SGVsbG8sIFdvcmxkIQ==wä]', Length: 23, Expected Length: 20
The goober on the end is because you didn't NULL terminate the output buffer. C strings require the character after the last character in the string to be 0 (all 0 bits, not ASCII "0" :).
... appending to a full array would raise an exception ...
Welcome to C! The language is akin to running with scissors. Even when you fall down, you might not get hurt. Might not.
In this case, you aren't actually writing the NULL byte and, thus, the printing of the C string is just reading whatever happens to be on the stack after your string array. I didn't audit the code to determine if the buffer is even of the right size.
Assuming all your math is correct, you could allocate the buffer to be one byte longer than needed for your encoding and drop the terminator there.
char output[outlen + 1];
output[outlen] = 0; // last valid index of the (outlen + 1)-byte buffer
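The answer leaves the length arithmetic unaudited. As a quick cross-check, the sketch below redoes the question's outlen calculation in Python and compares it with the standard library's base64 encoder; the encoded_length helper is a hypothetical name, and Python is used only because it needs no terminator handling of its own.

```python
import base64

def encoded_length(length):
    """Mirror the question's outlen arithmetic: 4 output chars per 3 input bytes, padded."""
    outlen = (length // 3) * 4
    if length % 3 != 0:
        outlen += 4
    return outlen

data = b"Hello, World!"
outlen = encoded_length(len(data))
encoded = base64.b64encode(data)

# The C buffer needs one extra byte beyond outlen for the terminating 0.
buffer_size = outlen + 1
print(encoded.decode("ascii"), outlen, buffer_size)
```

For the 13-byte input the formula gives 20, which matches the "Expected Length" in the log, so only the missing terminator explains the extra characters.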

Intersection of two strings/ sets

Coming from Python, I'm looking for a Delphi 5 equivalent of this Python code (sets):
>>> x = set("Hello")
>>> x
set(['H', 'e', 'l', 'o'])
>>> y = set("Hallo")
>>> y
set(['a', 'H', 'l', 'o'])
>>> x.intersection(y)
set(['H', 'l', 'o'])
var
  a, b, c: set of byte;
begin
  a := [1, 2, 3, 4];
  b := [3, 4, 5, 6];
  c := a*b; // c is the intersection of a and b, i.e., c = [3, 4]
But beware:
var
  a, b, c: set of integer;
will not even compile; instead, you get the 'Sets may have at most 256 elements' error. Please see the documentation for more information on Delphi sets.
Update
Sorry, forgot to mention the 'obvious' (from the point of view of a Delphi programmer):
var
  a, b, c: set of char;
begin
  a := ['A', 'B', 'C', 'D'];
  b := ['C', 'D', 'E', 'F'];
  c := a*b; // c is the intersection of a and b, i.e., c = ['C', 'D']
But your chars will all be byte chars -- that is, forget about Unicode (Delphi 5 doesn't support Unicode, so in this case this isn't really a restriction)!
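For comparison with the Python code in the question, the Delphi * operator on sets plays the role of Python's & (or intersection). The sketch below uses a hypothetical char_set helper that also rejects characters outside the byte range, as a rough stand-in for the 256-element limit described above; Python sets themselves have no such limit.

```python
def char_set(text):
    """Hypothetical helper: build a set of byte-range characters from a string."""
    for ch in text:
        if ord(ch) > 255:
            # Stand-in for Delphi's limit of at most 256 elements per set type.
            raise ValueError("character %r does not fit in a byte" % ch)
    return set(text)

x = char_set("Hello")
y = char_set("Hallo")
common = x & y  # same result as x.intersection(y)
print(sorted(common))
```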
