How to remove an invalid sequence from a Genbank file containing multiple genome sequences based on ID - biopython

I have a ~3 GB Genbank file containing complete Genbank annotations for ~20,000 bacterial genome sequences. My goal is to use BioPython to parse these sequences, and write individual fasta files for non-duplicate sequences with something like the following:
from Bio import SeqIO

records = SeqIO.parse(r'C:\Users\aaa\aaa\file.gb', 'genbank')
for record in records:
    seq_name = record.id                      # use the record ID as the output file name
    if seq_name not in organism_dict:
        with open(output_folder + seq_name, 'w') as handle:
            SeqIO.write(record, handle, 'fasta')
This works perfectly fine for the first ~2,000 sequences, but then reaches an entry with an invalid footer and produces the error message ValueError: Sequence line mal-formed 'title>NCBI/ffsrv11 - WWW Error 500 Diagnostic'.
I managed to find the sequence causing the error, so what I'd like to do is delete it from my Genbank file and manually download it as a fasta file later. However, I can't open the file in a text editor (due to its size), and I can't parse the file (due to the error), so I'm wondering if anyone has an idea of how to remove a sequence based on Genbank ID. I'm open to non-python options.
Thank you in advance,
Daniel

Try adding a try/except where the except block writes the record ID to a separate file. If the try fails, the record won't be written, and the except block will collect the IDs of failed records for later download.
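A minimal sketch of that idea, assuming each GenBank record (including the broken one) ends with the usual // terminator; the output folder, the de-duplication set, and the failed-records file name are placeholders. Because the ValueError is raised while a record is being parsed, the ID of a broken entry may not be available, so the whole raw chunk is saved instead:
from io import StringIO
from Bio import SeqIO

output_folder = 'fasta_out/'   # placeholder path
seen = set()                   # stands in for organism_dict in the question
bad_count = 0

with open(r'C:\Users\aaa\aaa\file.gb') as gb, open('failed_records.gb', 'w') as failed:
    chunk = []
    for line in gb:
        chunk.append(line)
        if line.startswith('//'):                      # '//' terminates one GenBank record
            text = ''.join(chunk)
            chunk = []
            try:
                record = SeqIO.read(StringIO(text), 'genbank')
            except ValueError:
                failed.write(text)                     # collect the broken entry instead of crashing
                bad_count += 1
                continue
            if record.id not in seen:
                seen.add(record.id)
                with open(output_folder + record.id + '.fasta', 'w') as handle:
                    SeqIO.write(record, handle, 'fasta')

print(bad_count, 'record(s) could not be parsed; see failed_records.gb')
Parsing one record-sized chunk at a time keeps memory use low and means a single corrupted entry only costs you that one record.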

Related

dataset import error for AutoML text classification

I have been trying to import a dataset into AutoML NL Text Classification. However, the UI gives me the error Invalid row in CSV file, with error details: Error detected: "FILE_TYPE_NOT_SUPPORTED".
I am uploading a CSV file; what should I do?
Please make sure there are no hidden quotes in your dataset. The complete requirements can be found on the “Preparing your training data” page.
Common .csv errors (a rough pre-upload check is sketched after this list):
Using Unicode characters in labels. For example, Japanese characters are not supported.
Using spaces and non-alphanumeric characters in labels.
Empty lines.
Empty columns (lines with two successive commas).
Missing quotes around embedded text that includes commas.
Incorrect capitalization of Cloud Storage text paths.
Incorrect access control configured for your text files. Your service account should have read or greater access, or files must be publicly-readable.
References to non-text files, such as JPEG files. Likewise, files that are not text files but that have been renamed with a text extension will cause an error.
The URI of a text file points to a bucket outside the current project. Only files in the project's bucket can be accessed.
Non-CSV-formatted files.
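If none of these jumps out, a rough pre-upload check can flag the purely mechanical problems; the file name, and the assumption that the label sits in the last column, are placeholders for illustration:
import csv
import string

ALLOWED = set(string.ascii_letters + string.digits + '_')

with open('training_data.csv', newline='', encoding='utf-8') as f:
    for lineno, row in enumerate(csv.reader(f), start=1):
        if not row:
            print(f'line {lineno}: empty line')
            continue
        if any(not cell.strip() for cell in row):
            print(f'line {lineno}: empty column (two successive commas)')
        label = row[-1].strip()
        if label and not set(label) <= ALLOWED:
            print(f'line {lineno}: label {label!r} contains spaces or non-alphanumeric characters')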

How to parse a binary PDF stream of unknown length?

From the PDF docs: "The keyword stream that follows the stream dictionary shall be followed by an end-of-line marker consisting of either a CARRIAGE RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE RETURN alone. The sequence of bytes that make up a stream lie between the end-of-line marker following the stream keyword and the endstream keyword; the stream dictionary specifies the exact number of bytes."
As the contents may be binary, an occurrence of endstream does not necessarily indicate the end of the stream. Now when considering this stream:
%PDF-1.4
%307쏢
5 0 obj
<</Length 6 0 R/Filter /FlateDecode>>
stream
x234+T03203T0^#A(235234˥^_d256220^314^U310^E^#[364^F!endstream
endobj
6 0 obj
30
endobj
The Length is an indirect object that follows the stream. Obviously that length can only be read after the stream has been parsed.
I think allowing Length to be an indirect object that can only be resolved after the stream is a design defect. While it may help PDF writers output PDFs sequentially, it makes parsing quite difficult for PDF readers. Considering that a PDF file is read more often than it is written, I don't understand this.
So how can such a stream be parsed correctly?
The Length is an indirect object that follows the stream. Obviously that length can only be read after the stream has been parsed.
This is an understandable conclusion if one assumes that the file is to be read sequentially beginning to end.
This assumption is incorrect, though, because parsing a PDF from the front and determining the PDF objects on the fly is not the recommended way of parsing a PDF.
While ISO 32000-1 is a bit vague here and merely says
Conforming readers should read a PDF file from its end.
(ISO 32000-1, section 7.5.5 File Trailer)
ISO 32000-2 clearly specifies:
With the exception of linearized PDF files, all PDF files should be read using the trailer and cross-reference table as described in the following subclauses. Reading a non-linearized file in a serial manner is not reliable because of the way objects are to be processed after an incremental update. (See 6.3.2, "Conformance of PDF processors".)
(ISO 32000-2, section 7.5 File structure)
Thus, in the case of your PDF excerpt, a PDF processor trying to read object 5 0 (a small code sketch of this lookup follows the steps below)
looks up object 5 0 in the cross references and gets its offset in the file,
goes to that offset and starts reading the object, first parsing the stream dictionary,
at the stream keyword recognizes that the object is a stream and retrieves its Length value which happens to be an indirect reference to 6 0,
looks up object 6 0 in the cross references and gets its offset in the file,
goes to that offset and reads the object, the number 30,
reads the stream content of the stream object 5 0 knowing its length is 30.
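As a rough illustration only (not a real PDF parser; the xref mapping and the regular expressions are simplifications), the lookup order above could look like this:
import re
from typing import NamedTuple

class IndirectRef(NamedTuple):
    obj_num: int
    gen_num: int

def stream_length(pdf: bytes, xref: dict, obj_num: int) -> int:
    """Resolve the /Length of a stream object, following an indirect reference if necessary."""
    start = xref[obj_num]                                   # offset of e.g. "5 0 obj" from the xref table
    dict_end = pdf.index(b'stream', start)                  # the stream keyword ends the dictionary
    m = re.search(rb'/Length\s+(\d+)(?:\s+(\d+)\s+R)?', pdf[start:dict_end])
    if m.group(2) is None:
        return int(m.group(1))                              # direct value, e.g. /Length 30
    ref = IndirectRef(int(m.group(1)), int(m.group(2)))     # indirect, e.g. /Length 6 0 R
    length_obj = xref[ref.obj_num]                          # second xref lookup, for object 6 0
    value = re.search(rb'obj\s+(\d+)', pdf[length_obj:])
    return int(value.group(1))
For the excerpt above, xref would map 5 and 6 to the byte offsets of 5 0 obj and 6 0 obj, exactly as recorded in the file's cross-reference table, so the reader never has to scan the stream data for an endstream keyword.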
An approach like yours is explicitly considered "not reliable".
I think allowing Length to be an indirect object that can only be resolved after the stream is a design defect.
If there were no cross references, you'd be correct. That also is why the FDF format (which does not have mandatory cross references) specifies:
FDF is based on PDF; it uses the same syntax and has essentially the same file structure (7.5, "File structure"). However, it differs from PDF in the following ways:
[...]
The length of a stream shall not be specified by an indirect object.
(ISO 32000-2, section 12.7.8 Forms data format)
Concerning the comments:
So I'm correct that PDF cannot be parsed sequentially,
While the very original design of PDF probably was meant for sequential parsing, it has been further developed with only access via cross references in mind. PDF simply is not meant to be parsed sequentially anymore. And that was already the case when I started dealing with PDFs in the late 90s.
and the only reason is that the required length of binary streams may be defined after the stream.
That is far from the only reason; there are more situations that require a cross-reference lookup to parse correctly.
As @mkl indicated, a parser has to read somewhere before the end of the PDF file to get startxref, hoping that it does not start parsing in the middle of a binary stream.
That's not correct. The PDF must end with "%%EOF" plus optionally an end-of-line. Before that there must be an end-of-line, before that a number, before that an end-of-line, before that startxref.
This is already expressed clearly in ISO 32000-1:
The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section.
(ISO 32000-1, section 7.5.5 File Trailer)
Thus, no danger of being "in the middle of a binary stream" if the PDF is valid.
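A small sketch of that tail parsing, assuming a valid file as described above (reading the last kilobyte is a common but arbitrary window):
import re

def read_startxref(path: str, tail_size: int = 1024) -> int:
    """Return the byte offset of the last cross-reference section."""
    with open(path, 'rb') as f:
        f.seek(0, 2)                                   # jump to end of file
        f.seek(max(0, f.tell() - tail_size))
        tail = f.read()
    # A valid trailer ends with: startxref <eol> <offset> <eol> %%EOF
    matches = re.findall(rb'startxref\s+(\d+)\s+%%EOF', tail)
    if not matches:
        raise ValueError('startxref/%%EOF trailer not found')
    return int(matches[-1])                            # the last one wins after incremental updates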
The other thing I dislike about the format of PDF is this: when developing a parser, you usually create test files with some elements you are working on. This approach seems to work with everything but streams. The absolute file positions of syntax elements and the requirement for multiple random accesses make this task harder.
You seem to be subject to the misconception that the PDF format is a tagged text format like HTML. This is not the case. Even though numerous syntactical elements are defined using some ASCII keyword and there are "lines", PDF is a binary format, the cross reference tables are not a gimmick but the central access hub to the objects, and optimization for random access is done by design.

Script for fasta sequence replacement based on header name

I have two fasta files (one has about 50,000 sequences and the other about 150,000) with two kinds of header formats. I want to replace sequences of interest in one file based on header name (I have two lists of headers for the two fasta files in txt format). Could you please advise me on what I should do?
For example, the header formats for files 1 and 2 are >contig10002|m.12543 and >c26528_g1_i1|m.14066, respectively, and I want to replace the sequence of >c26528_g1_i1|m.14066 in file 2 with the sequence of >contig10002|m.12543 from file 1.
Thanks in advance
One suggestion is to use BioPython. It can parse fasta files and format them, and it may be able to handle headers in different formats.
For example here is how you can read a fasta file and loop over the IDs:
from Bio import SeqIO

fasta_sequences = SeqIO.parse(open('file1.fasta'), 'fasta')
for fasta in fasta_sequences:
    # do something with fasta.id, e.g. c26528_g1_i1|m.14066
    print(fasta.id)
Here is how you would write a fasta record:
with open(output_file, 'w') as output_handle:
    for fasta in fasta_sequences:
        SeqIO.write([fasta], output_handle, "fasta")
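Putting those two pieces together, a rough sketch of the replacement itself might look like the following; the mapping-file format (one tab-separated pair of headers per line, without the leading >), the file names, and the choice to keep file 2's headers are assumptions to adapt to your data:
from Bio import SeqIO

mapping = {}                                   # file-2 header -> file-1 header
with open('header_mapping.txt') as fh:
    for line in fh:
        file1_id, file2_id = line.split()
        mapping[file2_id] = file1_id

# index file 1 by record id (everything up to the first whitespace in the header)
file1_records = SeqIO.to_dict(SeqIO.parse('file1.fasta', 'fasta'))

with open('file2_replaced.fasta', 'w') as out:
    for record in SeqIO.parse('file2.fasta', 'fasta'):
        if record.id in mapping:
            # swap in the sequence from file 1, keeping file 2's header
            record.seq = file1_records[mapping[record.id]].seq
        SeqIO.write(record, out, 'fasta')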
You might want to start by reading the BioPython Tutorial and Cookbook.

SSIS: Can't handle line-feeds in CSV (Column delimiter not found)

I have some CSV files that appear OK in Notepad and Excel; however, they seem to have extra line-feeds in them when I view them in VS2010 or Notepad++. When I attempt to process them in SSIS, the files fail with errors like this:
Error: 0xC0202055 at Merge Files, Interface [225]: The column delimiter for column "Column 48" was not found.
Here's a truncated example (there are about 50 columns, and the lines appear to wrap at the same, seemingly random, position):
The questions are: how do Notepad and Excel open these files OK (and seemingly ignore the line-feeds)? Is there a way to get SSIS to process these files? Could it be an SSIS setting, such as the code page?
For me, opening the file in Excel, saving it as an Excel file (xlsx, though I am sure the old xls format would work fine too), and then using the Excel Source in SSIS enabled me to load a file with this kind of problem into a SQL table.
Obviously this would not work if you need to load this kind of file regularly or if there were many of these files. In that case the first answer would be better.
The easiest solution for us was to stage the input into a SQL table, and then, in a subsequent data flow, query it back out without line-feeds in the output, e.g.
SELECT COLUMN1
,REPLACE(REPLACE([COLUMN2],CHAR(10),''),CHAR(13),'') AS [COLUMN2]
FROM TABLE

identifying problematic row of data giving mass import error

I am using activerecord-import to bulk insert a bunch of data in a .csv file into my rails app. Unfortunately, I am getting an error when I call import on my model.
ArgumentError (invalid byte sequence in UTF-8)
I know the problem is that I have a string with weird characters somewhere in the 1000+ rows of data that I am importing, but I can't figure out which row is the problem.
Does activerecord-import have any error handling built in that I could use to figure out which row or rows were problematic (e.g. some option I could set when calling the import function on my model)? As far as I can tell, the answer is no.
Alternatively, can I write some code that would check the array that I am passing into activerecord-import to determine which rows have strings that are invalid in UTF-8?
Without being able to see the data, it is only possible to guess. Most likely, you have a character combination that is not valid UTF-8.
You should be able to check your file with
iconv -f utf8 <filename>
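The row-level check asked about in the question can also be scripted; the logic is language-agnostic, so here is a small sketch (file name assumed) that reports which lines contain bytes that are not valid UTF-8:
def invalid_utf8_lines(path):
    """Yield (line number, byte offset, offending byte) for lines that are not valid UTF-8."""
    with open(path, 'rb') as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode('utf-8')
            except UnicodeDecodeError as err:
                yield lineno, err.start, raw[err.start:err.start + 1]

for lineno, offset, byte in invalid_utf8_lines('import.csv'):
    print(f'line {lineno}: invalid byte {byte!r} at offset {offset}')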
