Getting fasta sequences (proteome) from a file referencing another fasta file (TF) of the same organism - fasta

Basically I have two large fasta files: the first one contains the proteome sequences (all the protein sequences), the second one contains the transcription factor sequences of the same organism. I am just wondering whether there is any way I can extract the non-transcription-factor sequences as a fasta file using these two files? Many thanks

The answer is yes, you can; essentially the algorithm is as follows.
Read in the transcription factor sequences and store them as a hash or dict.
Scan the proteome fasta sequences and, if a sequence is not in the hash/dict, append it to an array/list.
After scanning, take the array/list and output it in the desired format.
The reason I say hash/dict (and array/list) is that it depends on whether you are doing this in Python or some other language.
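For what it's worth, here is a minimal sketch of that algorithm in Python with Biopython; the file names tf.fasta, proteome.fasta and non_tf.fasta are placeholders, and it assumes the two files share sequence identifiers (if they don't, you could key the set on str(record.seq) instead).

from Bio import SeqIO

# store the transcription factor identifiers in a set (the hash/dict step)
tf_ids = {record.id for record in SeqIO.parse("tf.fasta", "fasta")}

# scan the proteome and keep every record whose ID is not a TF (the list step)
non_tf = [record for record in SeqIO.parse("proteome.fasta", "fasta")
          if record.id not in tf_ids]

# write the remaining sequences back out as fasta
SeqIO.write(non_tf, "non_tf.fasta", "fasta")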

Related

How to remove an invalid sequence from a Genbank file containing multiple genome sequences based on ID

I have a ~3 GB Genbank file containing complete Genbank annotations for ~20,000 bacterial genome sequences. My goal is to use BioPython to parse these sequences, and write individual fasta files for non-duplicate sequences with something like the following:
from Bio import SeqIO
records = SeqIO.parse(r'C:\Users\aaa\aaa\file.gb', 'genbank')
for record in records:
    # seq_name, organism_dict and output_folder are set up earlier in the script
    if seq_name not in organism_dict:
        with open(output_folder + seq_name, 'w') as handle:
            SeqIO.write(record, handle, 'fasta')
This works perfectly fine for the first ~2,000 sequences, but then reaches an entry with an invalid footer and produces the error message ValueError: Sequence line mal-formed 'title>NCBI/ffsrv11 - WWW Error 500 Diagnostic'.
I managed to find the sequence causing the error, so what I'd like to do is delete it from my Genbank file and manually download it as a fasta file later. However, I can't open the file in a text editor (due to its size), and I can't parse the file (due to the error), so I'm wondering if anyone has an idea of how to remove a sequence based on Genbank ID. I'm open to non-python options.
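Since you're open to options outside Biopython's parser, one possibility is to treat the file as plain text and split it on GenBank's record terminator (a line containing only //), dropping the record whose header mentions the offending ID. This is only a sketch: bad_id, file.gb and file_clean.gb are placeholders, and it assumes every record, including the broken one, still ends with a // line.

def mentions_id(record_lines, seq_id):
    # look for the ID on the LOCUS/ACCESSION/VERSION header lines only
    return any(line.startswith(('LOCUS', 'ACCESSION', 'VERSION')) and seq_id in line
               for line in record_lines)

bad_id = 'XX_000000'                      # hypothetical ID of the record to drop
buffer = []
with open('file.gb') as src, open('file_clean.gb', 'w') as dst:
    for line in src:
        buffer.append(line)
        if line.strip() == '//':          # end of one GenBank record
            if not mentions_id(buffer, bad_id):
                dst.writelines(buffer)
            buffer = []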
Thank you in advance,
Daniel
Try adding a try/except where the except writes record.id to a separate file. If the try fails, the record won't be written and the except will collect record IDs for later download.
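A minimal sketch of that idea; note that SeqIO.parse is a generator, so once it raises it cannot reliably continue, which means in practice this records the error and the last successfully parsed ID (failed_records.txt is a placeholder name) so the bad entry can be located, removed and downloaded manually.

from Bio import SeqIO

records = SeqIO.parse(r'C:\Users\aaa\aaa\file.gb', 'genbank')
last_good = None
with open('failed_records.txt', 'w') as log:
    while True:
        try:
            record = next(records)
        except StopIteration:
            break                         # reached the end of the file cleanly
        except ValueError as err:
            # malformed entry: log where parsing stopped for manual follow-up
            log.write('parse failed after %s: %s\n' % (last_good, err))
            break
        last_good = record.id
        # ... write the record out as fasta, as in the loop above ...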

Script for fasta sequence replacement based on header name

I have two fasta files (one has about 50,000 sequences and the other has 150,000) with two different header formats. I want to replace sequences of interest in one file based on the header name (I have two lists of headers, one for each fasta file, in txt format). Could you please advise me on what I should do?
For example, the header formats for files 1 and 2 are >contig10002|m.12543 and >c26528_g1_i1|m.14066, respectively, and I want to replace the sequence of >c26528_g1_i1|m.14066 in file 2 with the sequence of >contig10002|m.12543.
Thanks in advance
One suggestion is to use Biopython. It can parse fasta files and format them, and it can handle headers in different formats.
For example, here is how you can read a fasta file and loop over the IDs:
from Bio import SeqIO

fasta_sequences = SeqIO.parse(open('file1.fasta'), 'fasta')
for fasta in fasta_sequences:
    # do something with fasta.id, e.g. c26528_g1_i1|m.14066
    pass
Here is how you would write a fasta record:
with open(output_file, 'w') as output_handle:
    for fasta in fasta_sequences:
        SeqIO.write([fasta], output_handle, "fasta")
You might want to start by reading the Biopython Tutorial and Cookbook.
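Putting it together for the replacement itself, here is a rough sketch; it assumes the two header lists line up one-to-one (one header per line, without the leading >), and file1.fasta, file2.fasta, headers1.txt, headers2.txt and file2_replaced.fasta are placeholder names.

from Bio import SeqIO

# map each file-2 header to the file-1 header whose sequence should replace it,
# assuming the two txt lists are paired line by line
with open('headers1.txt') as h1, open('headers2.txt') as h2:
    replace_with = dict(zip((line.strip() for line in h2),
                            (line.strip() for line in h1)))

# index file 1 so sequences can be looked up by ID without loading everything
file1 = SeqIO.index('file1.fasta', 'fasta')

records = []
for record in SeqIO.parse('file2.fasta', 'fasta'):
    if record.id in replace_with:
        # swap in the sequence from file 1 but keep the file-2 header
        record.seq = file1[replace_with[record.id]].seq
    records.append(record)

SeqIO.write(records, 'file2_replaced.fasta', 'fasta')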

Which Alphabet type should I use with FASTA files in Biopython?

If I'm using the FASTA files from the link below, what Alphabet type should I use in Biopython? Would it be IUPAC.unambiguous_dna?
link to FASTA files: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/?C=S;O=A
Did you read 3.1 Sequences and Alphabets? It explains the different alphabets available and the cases they cover.
There are a lot of sequences at the link you provided (too many for us to pore through). My recommendation would be to just go with IUPAC.unambiguous_dna. If the four basic nucleotides aren't enough, the parser will complain, and you should pick a more extensive alphabet.
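As a minimal illustration, assuming an older Biopython release that still ships Bio.Alphabet (the whole alphabet system was removed in Biopython 1.78) and a placeholder file name chr21.fa:

from Bio import SeqIO
from Bio.Alphabet import IUPAC

# attach the unambiguous DNA alphabet to each record as it is parsed
for record in SeqIO.parse('chr21.fa', 'fasta', alphabet=IUPAC.unambiguous_dna):
    print(record.id, len(record.seq), record.seq.alphabet)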

Mixing ASCII and Binary for record delimiters

My requirement is to write binary records inside a file. The binary records can be thought of as raw bytes in memory. I need a way to delimit each record so that I can do something similar to a binary search on the file: for example, start in the middle of the file, find the next record delimiter, and start the search from there.
My question is: can an ASCII marker such as "START-RECORD" be used to delimit the binary records?
START-RECORD, data-length, .......binary data...........START-RECORD, data-length, .......binary data...........
When starting from an arbitrary position within the file, I can simply search for the ASCII string "START-RECORD". Is this approach feasible?
Not quite that simply: you're reading the file in binary mode either way, so if you insert a string or some other pattern as a "delimiter", you need to search for its binary (byte) representation while reading the file.
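A minimal sketch of the idea in Python, with a placeholder file name; it assumes a 4-byte big-endian length field after the marker, reads the remainder of the file into memory for simplicity, and does not handle the case where the marker bytes happen to occur inside a payload (in practice you would verify a match against the length field or escape the payload).

import struct

MARKER = b'START-RECORD'                   # the ASCII delimiter, stored as raw bytes

def write_record(fh, payload):
    # marker, then a 4-byte big-endian length, then the raw payload
    fh.write(MARKER)
    fh.write(struct.pack('>I', len(payload)))
    fh.write(payload)

def read_record_from(fh, offset):
    # jump to an arbitrary offset and resynchronise on the next marker
    fh.seek(offset)
    data = fh.read()
    pos = data.find(MARKER)
    if pos < 0:
        return None
    start = pos + len(MARKER)
    (length,) = struct.unpack('>I', data[start:start + 4])
    return data[start + 4:start + 4 + length]

with open('records.bin', 'wb') as fh:
    write_record(fh, b'\x00\x01\x02hello')
    write_record(fh, b'\xffworld\xfe')

with open('records.bin', 'rb') as fh:
    print(read_record_from(fh, 5))         # start mid-record, recover the next one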

Mahout: How to convert custom documents to SparseVector format for use with LDA

I have a set of documents in which each line has a certain number of strings separated by "\t|\t". Each string (which may contain spaces) is an indivisible dictionary item. Now I have to use LDA to find the correlation between these documents with respect to each dictionary word (string in my vocab).
Please guide me on how I can convert these documents to sparse vector format and then how to apply LDA to them.
This is one of the best links that I have found that might answer your queries:
http://www.theglassicon.com/computing/machine-learning/running-lda-algorithm-mahout
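For the conversion step, Mahout itself normally does this with its seqdirectory and seq2sparse jobs, which turn raw text into SequenceFiles of sparse vectors for its topic-modelling jobs. Just to illustrate the vectorisation idea itself, here is a rough Python sketch (docs.txt is a placeholder name, and this is not Mahout's on-disk format).

from collections import Counter

# build a vocabulary of dictionary items and one sparse count vector per document
vocab = {}
sparse_docs = []
with open('docs.txt') as fh:
    for line in fh:
        terms = [t.strip() for t in line.rstrip('\n').split('\t|\t') if t.strip()]
        counts = Counter(terms)
        # sparse representation: {term_id: count}, new terms get the next free id
        vector = {vocab.setdefault(term, len(vocab)): n for term, n in counts.items()}
        sparse_docs.append(vector)

print(len(vocab), 'terms,', len(sparse_docs), 'documents')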

Resources