Script for fasta sequence replacement based on header name - alignment

I have two FASTA files (one has about 50,000 sequences and the other about 150,000) that use two different header formats. I want to replace the sequences of interest in one file based on header name (I have a list of headers for each FASTA file, in txt format). Could you please advise me on how to do this?
For example, the header formats for files 1 and 2 are >contig10002|m.12543 and >c26528_g1_i1|m.14066, respectively, and I want to replace the sequence of >c26528_g1_i1|m.14066 in file 2 with the sequence of >contig10002|m.12543.
Thanks in advance

One suggestion is to use Biopython. It can parse and write FASTA files, and it should be able to handle headers in different formats.
For example, here is how you can read a FASTA file and loop over the IDs:
from Bio import SeqIO

fasta_sequences = SeqIO.parse('file1.fasta', 'fasta')
for fasta in fasta_sequences:
    # do something with fasta.id, e.g. c26528_g1_i1|m.14066 (the id has no leading '>')
    print(fasta.id)
Here is how you would write a fasta record:
# fasta_sequences is an iterator: call SeqIO.parse again if it was already looped over
with open(output_file, 'w') as output_handle:
    for fasta in fasta_sequences:
        SeqIO.write([fasta], output_handle, "fasta")
You might want to start by reading the Biopython Tutorial and Cookbook.
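To make the replacement step itself concrete, here is a minimal sketch. It deliberately uses a small hand-written FASTA reader rather than Biopython so that it has no dependencies; the function names `read_fasta` and `replace_sequences` and the `pairs` mapping are hypothetical, and the whole header line after `>` is assumed to be the key. It also assumes you can build a mapping from each file 2 header to the file 1 header whose sequence should replace it.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def replace_sequences(
    source_lines: Iterable[str],
    target_lines: Iterable[str],
    pairs: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Return the target records, with mapped sequences swapped in.

    pairs maps a target header (file 2) to the source header (file 1)
    whose sequence should replace it. Unmapped records are kept as-is.
    """
    wanted = set(pairs.values())
    # Only keep the source sequences that are actually needed.
    source = {h: s for h, s in read_fasta(source_lines) if h in wanted}
    result = []
    for header, seq in read_fasta(target_lines):
        src = pairs.get(header)
        if src in source:
            seq = source[src]
        result.append((header, seq))
    return result
```

Open file objects can be passed directly as `source_lines` and `target_lines`. If the two header lists are in matching order (an assumption, since the question does not say), `pairs` could be built with `dict(zip(file2_headers, file1_headers))`.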

Related

Reading multiline files in Apache beam separated with custom delimiters

I have a text file whose fields are separated by a two-character delimiter (#*), and one of the fields contains multiline text. For example:
test#*123#*"contain
multiline"
test#*321#*"contain
multiline"
Those are actually 2 rows, but in the text file they take up 4 lines. My approach was to match the files with FileIO and then use a ParDo to open each file, check the last character of each line, and, if the line does not end with ", append the next line to it. My concern is that Beam processes the file in bundles, so if the 2 lines are not in the same bundle this will fail.
Is my understanding correct? Please let me know the best way to handle this.
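The line-joining step described in the question can be sketched in plain Python, independent of Beam. The function name `logical_records` is hypothetical. The sketch assumes that a record is complete once its double quotes are balanced and that the delimiter never appears inside a quoted field. If each file is read sequentially inside a single function call, the lines of one record are seen together by that call.

```python
from typing import Iterable, Iterator, List


def logical_records(lines: Iterable[str], delimiter: str = "#*") -> Iterator[List[str]]:
    """Join physical lines until quotes are balanced, then split into fields."""
    buffer: List[str] = []
    for line in lines:
        buffer.append(line.rstrip("\n"))
        joined = "\n".join(buffer)
        # An even number of double quotes means no quoted field is still open.
        if joined.count('"') % 2 == 0:
            yield joined.split(delimiter)
            buffer = []
    if buffer:
        # Unterminated trailing record: emit what was collected.
        yield "\n".join(buffer).split(delimiter)
```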

How to remove an invalid sequence from a Genbank file containing multiple genome sequences based on ID

I have a ~3 GB Genbank file containing complete Genbank annotations for ~20,000 bacterial genome sequences. My goal is to use BioPython to parse these sequences, and write individual fasta files for non-duplicate sequences with something like the following:
from Bio import SeqIO
records = SeqIO.parse(r'C:\Users\aaa\aaa\file.gb', 'genbank')
for record in records:
    # seq_name, organism_dict and output_folder are defined elsewhere (not shown)
    if seq_name not in organism_dict:
        with open(output_folder + seq_name, 'w') as handle:
            SeqIO.write(record, handle, 'fasta')
This works perfectly fine for the first ~2,000 sequences, but then reaches an entry with an invalid footer and produces the error message ValueError: Sequence line mal-formed 'title>NCBI/ffsrv11 - WWW Error 500 Diagnostic'.
I managed to find the sequence causing the error, so what I'd like to do is delete it from my Genbank file and manually download it as a fasta file later. However, I can't open the file in a text editor (due to its size), and I can't parse the file (due to the error), so I'm wondering if anyone has an idea of how to remove a sequence based on Genbank ID. I'm open to non-python options.
Thank you in advance,
Daniel
Try adding a try/except in which the except branch writes record.id to a separate file. If the try fails, the record won't be written, and the except branch will collect the record IDs for later download.
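Since the question is also open to approaches that avoid parsing, another option is to filter the file as plain text, one line at a time, so that its size does not matter. The sketch below is a hedged illustration, not a Biopython feature: the function names are hypothetical, and it assumes that every record starts with a LOCUS line at the beginning of a line and that the ID to drop appears on the record's LOCUS, ACCESSION or VERSION line. Splitting at LOCUS rather than at the `//` terminator means a record with a broken footer is still isolated.

```python
from typing import Iterable, Iterator, List, Set


def split_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines into records, starting a new record at each LOCUS line."""
    record: List[str] = []
    for line in lines:
        if line.startswith("LOCUS ") and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record


def record_ids(record: List[str]) -> Set[str]:
    """Collect the locus name plus the ACCESSION and VERSION tokens."""
    ids: Set[str] = set()
    for line in record:
        if line.startswith("LOCUS "):
            ids.update(line.split()[1:2])
        elif line.startswith(("ACCESSION", "VERSION")):
            ids.update(line.split()[1:])
        elif line.startswith("FEATURES"):
            break  # the identifiers come before the feature table
    return ids


def drop_record(lines: Iterable[str], bad_id: str) -> Iterator[str]:
    """Yield every line except those belonging to the record with bad_id."""
    for record in split_records(lines):
        if bad_id not in record_ids(record):
            yield from record
```

With two open files, `dst.writelines(drop_record(src, bad_id))` would write the cleaned copy; only one record is held in memory at a time.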

Delimiter for CSV file in IIB

I am developing an integration in IIB, and one of the requirements for the output (multiple CSV files) is a comma delimiter instead of a semicolon. The input uses semicolons. I'm using two mapping nodes to produce separate files from one input, but I am struggling to find the option for the delimiter.
The two mapping nodes use XSD schemas and .maps to produce the output.
The first mapping creates a canonical DFDL format that is ready to be parsed into multiple files in the second mapping node.
There is not much code, just setup in IIB.
I would like to produce comma-separated CSV instead of semicolon-separated.
Thanks in advance
I found a solution. You can simply open the XSD in a text editor and change the delimiter there.

Getting FASTA sequences (proteome) from a file by referencing another FASTA file (TF) of the same organism

Basically, I have 2 large FASTA files. The first one contains the proteome (all the protein sequences), and the second one contains the transcription factor sequences of the same organism. I am wondering if there is any way to extract the non-transcription-factor sequences as a FASTA file using these two files. Many thanks
The answer is yes, you can. Essentially, the algorithm is as follows:
1. Read in the transcription factor sequences and store them in a hash or dict.
2. Scan the proteome FASTA sequences, and if the sequence/position is not in the hash/dict, append it to an array/list.
3. After scanning, take the array/list and output it in the desired format.
I say hash/dict and array/list because the name depends on whether you are doing this in Python or some other language.
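The three steps above can be sketched in Python as follows. This is a minimal illustration with hypothetical function names and a hand-written FASTA reader; it assumes that records are matched by exact sequence (matching by header would work the same way) and uses a set as the hash.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def non_tf_records(
    proteome_lines: Iterable[str], tf_lines: Iterable[str]
) -> List[Tuple[str, str]]:
    # Step 1: store the transcription factor sequences in a hash (a set).
    tf_sequences = {seq for _, seq in read_fasta(tf_lines)}
    # Step 2: keep the proteome records whose sequence is not in the set.
    return [(h, s) for h, s in read_fasta(proteome_lines) if s not in tf_sequences]


def to_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    # Step 3: format the kept records as FASTA text, wrapping long sequences.
    out: List[str] = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)
```

Set membership tests take roughly constant time, so the proteome is scanned only once regardless of how many transcription factors there are.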

iOS: Read in XLS

I'm trying to figure out how to read in the contents of an XLS document and I'm able to get the bytes just fine, but I don't have any clue where to go from here. Trying [[NSString alloc] initWithBytes:data.bytes length:data.length encoding:NSUTF8StringEncoding] and [NSString stringWithUTF8String:data.bytes] both don't get me anywhere (null). What are you supposed to do to read in the contents of an XLS file?
Combining two answers:
"There is no innate ability to read Excel data into a Foundation container, like an NSArray or NSDictionary. You could, however, convert the file (with Excel) to a comma-separated-value (CSV) file and then parse each line's cells on the iPhone using the NSString instance method -componentsSeparatedByString:."
"A comma-separated values (CSV) file stores tabular data (numbers and text) in plain-text form. Plain text means that the file is a sequence of characters, with no data that has to be interpreted instead, as binary numbers. A CSV file consists of any number of records, separated by line breaks of some kind; each record consists of fields, separated by some other character or string, most commonly a literal TAB or comma. Usually, all records have an identical sequence of fields"
--
How to read cell data from an Excel document with objective-c
objective-c loading data from excel
Even though saving your Excel file to CSV is the easier answer, sometimes that's not really what you're looking for, so I created QZXLSReader. It's a drag-and-drop solution so it's a lot easier to use. I don't think it's as feature complete, but it worked for me.
It's basically a library that can open XLS files and parse them into Obj-C classes. Once you have the classes, it's very easy to send them to Core Data or a dictionary or what have you.
I hope it helps!
