Working with Amino Acids - fasta

I am working with a file that contains thousands of proteins in an organism. I have code that will allow me to go through each individual protein one by one and determine the frequency of amino acids in each. Would there be a way to alter my current code to allow me to determine all of the frequencies of amino acids at once?

IIUC, you're reinventing the wheel a bit: BioPython contains utilities for handling files in various formats (FASTA in your case), and simple analysis. For your example, I'd use something like this:
from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis

for seq_record in SeqIO.parse("protein_x.txt", "fasta"):
    analysis = ProteinAnalysis(str(seq_record.seq))
    print(seq_record.id, analysis.get_amino_acids_percent())

The answer is yes, but without showing us your code we can't give much feedback. Essentially you want your counts of the amino acids to persist between reading FASTA records; if you want probabilities, total them up outside the loop and divide through only at the end. This is trivially accomplished with something like a "counting dictionary" in Python, or by incrementing a value in a hash/dict. There are also very likely plenty of command-line tools that do this for you, since all you want is character-level counts for any line not starting with '>' in the file.
For example for a file that small:
grep -v '^>' yourdata.fa | perl -pe 's/(.)/$1\n/g' | sort | uniq -c
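If you'd rather stay in Python, here's a minimal sketch of the counting-dictionary approach (the filename proteins.fasta is just a placeholder; the counts are only divided through after the loop):
from collections import Counter
from Bio import SeqIO

totals = Counter()  # amino-acid counts aggregated across every record
for record in SeqIO.parse("proteins.fasta", "fasta"):
    totals.update(str(record.seq))

grand_total = sum(totals.values())
for amino_acid, count in sorted(totals.items()):
    print(amino_acid, count, count / grand_total)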

Related

Open and extract information from large text file (Geonames)

I want to make a list of all major towns and cities in the UK.
Geonames seems like a good place to start, although I need to use it locally (as opposed to the API) as I will be working offline while using the information.
Due to the large size of the geonames "allcountries.txt" file it won't open in Notepad, Notepad++ or Sublime. I've tried opening it in Excel (including the data modelling function), but the file has more than a million rows, so this won't work either.
Is it possible to open this file, extract the UK-only cities, and manipulate them in Excel and/or some other software? I am only after place name, lat, long, country name, and continent.
#dedek's suggestion (in the comments) to use GB.txt is definitely the best answer for your particular case.
I've added another answer because this technique is much more flexible and will allow you to filter by country or any other column, i.e. you can adapt this solution to filter by language, region in the UK, population, etc., or apply it to the cities5000.txt file, for example.
Solution:
Use grep to find data that matches a particular pattern. In essence, the command below is saying: find all rows where the 9th column (the GeoNames country code) is exactly "GB".
grep -P "[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\tGB\t" allCountries.txt > UK.txt
(grep comes standard with most Unix systems but there are definitely tools out there that can do it on Windows too.)
Details:
grep: The command being executed.
\t: Shorthand for the TAB character.
-P: Tells grep to use a Perl-style regular expression (grep might not recognize \t as a TAB character otherwise). (This might be a bit different if you are using another version of grep.)
[^\t]*: zero or more non-tab characters i.e. an optional column value.
> UK.txt: writes the output of the command to a file called "UK.txt".
Again, you could adapt this example to filter on any column in any file.
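If you'd rather do the same filter in Python, here's a minimal sketch; it assumes the standard GeoNames tab-separated layout (country code in the 9th column) and the same input/output filenames as above:
with open("allCountries.txt", encoding="utf-8") as src, \
        open("UK.txt", "w", encoding="utf-8") as dst:
    for line in src:
        fields = line.rstrip("\n").split("\t")
        # Field index 8 (the 9th column) holds the ISO country code in GeoNames dumps.
        if len(fields) > 8 and fields[8] == "GB":
            dst.write(line)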

How to identify text file format by its structure?

I have a few text file types with data such as product info, stock, supplier info etc. and they are all structured differently. There is no other identifier for the type except the structure itself (there are no headers, no filename convention etc.)
Some examples of these files:
(products and stocks)
2326 | 542212 | Bananas | 00023 | 1 | pack
2326 | 297875 | Apples | 00085 | 1 | bag
2326 | 028371 | Pineapple | 00007 | 1 | can
...
(products and prices)
12556 Meat, pork 0098.57
58521 Potatoes, mashed 0005.20
43663 Chicken wings 0009.99
...
(products and suppliers - here N is the separator)
03038N92388N9883929
28338N82367N2837912
23002N23829N9339211
...
(product information - multiple types of rows)
VIN|Mom & Pops|78 Haley str.
PIN|BLT Bagel|5.79|FRESH
LID|0239382|283746
... (repeats this type of info for different products)
And several others.
I want to make a function that identifies which of these types a given file is, using nothing but the content. Google has been no help, in part because I don't know what search term to use. Needless to say, searching for "identify file type by content/structure" is of no help; it just gives me results on how to detect jpgs, pdfs, etc. It would be helpful to see some code that others have written to deal with a similar problem.
What I have thought so far is to make a FileIdentifier class for each type, then when given a file try to parse it and if it doesn't work move on to the next type. But that seems error prone to me, and I would have to hardcode a lot of information. Also, what happens if another format comes along and is very similar to any of the existing ones, but has different information in the columns?
There really is no one-size-fits-all answer unless you can limit the set of file formats that can occur. You will only ever be able to find a heuristic for identifying formats, unless you can get whoever designs these formats to include a unique identifier, or you ask the user what format the file is.
That said, there are things you can do to improve your results, like making sure you try all instances of similar formats and then picking the best fit instead of the first match.
The general approach will always be the same: make each decode attempt as strict as possible, and use as much knowledge as you have about not just the syntax but also the semantics. I.e., if you know an item can only contain one of 5 values, or numbers in a certain range, use that knowledge for detection. Also, don't just call strtol() on a component and accept the result; check that it parsed the entire string. If it didn't, either fail right there, or maintain a "confidence" value and lower it whenever a file has any possibly invalid parts.
Then, at the end, go through all the parse results and pick the one with the highest confidence. Or, if you can't decide, ask the user to pick between the most likely formats.
PS - The file command line tool on Unixes does something similar: It looks at the start of a file and identifies common sequences that indicate certain file formats.
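To make the confidence idea concrete, here's a rough Python sketch (mine, not from the answer; the format names, regular expressions, and threshold are hypothetical, loosely based on the samples in the question):
import re

def score_pipe_rows(lines):
    # "products and stocks" style: six pipe-separated fields per row
    pattern = re.compile(r"^\d+ \| \d+ \| [^|]+ \| \d{5} \| \d+ \| \w+$")
    return sum(bool(pattern.match(l)) for l in lines) / len(lines)

def score_n_separated(lines):
    # "products and suppliers" style: three numeric fields joined by 'N'
    pattern = re.compile(r"^\d{5}N\d{5}N\d{7}$")
    return sum(bool(pattern.match(l)) for l in lines) / len(lines)

MATCHERS = {"stock": score_pipe_rows, "suppliers": score_n_separated}

def identify_format(path, sample_size=50, threshold=0.8):
    with open(path, encoding="utf-8") as fh:
        lines = [line.strip() for line in fh][:sample_size]
    lines = [l for l in lines if l]
    if not lines:
        return None
    scores = {name: fn(lines) for name, fn in MATCHERS.items()}
    best = max(scores, key=scores.get)
    # Below the threshold, don't guess; ask the user instead.
    return best if scores[best] >= threshold else None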

BioPython consensus sequence with gaps coded as 'N' and polymorphisms as ambiguities

I am trying to write code to get a consensus sequence for each of the 100+ files of individual fasta alignments in a folder. To start I just wanted to get the consensus for one sequence (then I will use a for loop to process all), but I am having trouble with the alphabet of the consensus. My test fasta alignment is:
>seq1
ACGTACGATCGTTACTCCTA
>seq2
ACGTACGA---TTACTCGTA
and what I want the consensus to look like is:
ACGTACGANNNTTACTCSTA
I would like any column that contains a gap to be represented by 'N' and any column without 100% identical nucleotides to be represented by ambiguity codes.
My code that doesn't work is:
from Bio import AlignIO
from Bio.Align import AlignInfo
from Bio.Alphabet import IUPAC, Gapped
alphabet = Gapped(IUPAC.ambiguous_dna)
alignment = AlignIO.read(open("fasta_align_for_consensus.fa"), "fasta")
summary_align = AlignInfo.SummaryInfo(alignment)
consensus = summary_align.gap_consensus(threshold=1.0, ambiguous='N',
                                        consensus_alpha=alphabet, require_multiple=2)
The 'ambiguous' argument only takes a single string, and it places an 'N' anywhere in the consensus where the alignment has a polymorphism, which I can't seem to work around. Any suggestion on how to correct this would be greatly appreciated.
Thanks!
The current simple consensus methods don't do what you want. It sounds like you're asking for IUPAC ambiguity codes (perhaps with some threshold?) and special treatment of gaps. You'd have to write some code yourself, perhaps based on the existing methods.
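A minimal sketch of such a column-wise consensus (my own code, not a Biopython API; it assumes uppercase DNA with '-' as the only gap character and no threshold):
from Bio import AlignIO

# Map each set of observed bases to its IUPAC ambiguity code.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("ACG"): "V", frozenset("ACT"): "H", frozenset("AGT"): "D",
    frozenset("CGT"): "B", frozenset("ACGT"): "N",
}

def gap_aware_consensus(alignment):
    consensus = []
    for col in range(alignment.get_alignment_length()):
        bases = {str(record.seq[col]).upper() for record in alignment}
        if "-" in bases:
            consensus.append("N")  # any gap in the column -> N
        else:
            consensus.append(IUPAC_CODES[frozenset(bases)])
    return "".join(consensus)

alignment = AlignIO.read("fasta_align_for_consensus.fa", "fasta")
print(gap_aware_consensus(alignment))  # ACGTACGANNNTTACTCSTA for the example above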

Most efficient grep method

Currently I am grepping data from a file containing any of the following:
342163477\|405760044\|149007683\|322391022\|77409125\|195978682\|358463993\|397650460\|171780277\|336063797\|397650502\|357636118\|168490006...............
This list is longer and contains ~700 different values.
What is the most efficient way of extracting it? I could chop it into parts of 10/20/50/100... or are there other Unix methods? This grep is piped to Python for further analysis, which goes fast enough.
Splitting it up would only make it worse. Except in degenerate cases (and this isn't one), how long or how complex a regular expression is really doesn't matter: the execution time is the same.
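If the pattern ever does become a bottleneck, one alternative (not from the answer above; the filenames are placeholders and the sketch assumes the IDs appear as whitespace-delimited fields in the data) is to drop the alternation and do a set-membership test in the same Python step that already does the analysis:
import re

# Load the ~700 values once; they are separated by "\|" (or plain "|") in the pattern file.
with open("values.txt", encoding="utf-8") as fh:
    targets = set(filter(None, re.split(r"[\\|\s]+", fh.read())))

with open("data.txt", encoding="utf-8") as fh:
    for line in fh:
        if targets.intersection(line.split()):
            print(line, end="")  # or hand the line straight to the analysis code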

How do I "diff" multiple files against a single base file?

I have a configuration file that I consider to be my "base" configuration. I'd like to compare up to 10 other configuration files against that single base file. I'm looking for a report where each file is compared against the base file.
I've been looking at diff and sdiff, but they don't completely offer what I am looking for.
I've considered diff'ing the base against each file individually, but my problem then becomes merging those results into a report. Ideally, if the same line is missing in all 10 config files (when compared to the base config), I'd like that reported in an easy-to-visualize manner.
Notice that some rows are missing in several of the config files (when compared individually to the base). I'd like to be able to put those on the same line (as above).
Note, the screenshot above is simply a mockup, and not an actual application.
I've looked at using some Delphi controls for this and writing my own (I have Delphi 2007), but if there is a program that already does this, I'd prefer it.
The Delphi controls I've looked at are TDiff, and the TrmDiff* components included in rmcontrols.
For people who are still wondering how to do this, diffuse is the closest answer: it does an N-way merge by displaying all the files and doing three-way merges among neighbours.
None of the existing diff/merge tools will do what you want. Based on your sample screenshot you're looking for an algorithm that performs alignments over multiple files and gives appropriate weights based on line similarity.
The first issue is weighting the alignment based on line similarity. Most popular alignment algorithms, including the one used by GNU diff, TDiff, and TrmDiff, do an alignment based on line hashes, and just check whether the lines match exactly or not. You can pre-process the lines to remove whitespace or change everything to lower-case, but that's it. Add, remove, or change a letter and the alignment thinks the entire line is different. Any alignment of different lines at that point is purely accidental.
Beyond Compare does take line similarity into account, but it really only works for 2-way comparisons. Compare It! also has some sort of similarity algorithm, but it is also limited to 2-way comparisons, and it can slow down the comparison dramatically. I'm not aware of any other component or program, commercial or open source, that even tries.
The other issue is that you also want a multi-file comparison. That means either running the 2-way diff algorithm a bunch of times and stitching the results together or finding an algorithm that does multiple alignments at once.
Stitching will be difficult: your sample shows that the original file can have missing lines, so you'd need to compare every file to every other file to get a bunch of alignments, and then you'd need to work out the best way to match those alignments up. A naive stitching algorithm is pretty easy to do, but it will get messed up by trivial matches (blank lines, for example).
There are research papers that cover aligning multiple sequences at once, but they're usually focused on DNA comparisons, so you'd definitely have to code it up yourself. Wikipedia covers a lot of the basics; then you'd probably need to switch to Google Scholar.
Sequence alignment
Multiple sequence alignment
Gap penalty
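To make the line-similarity point concrete, here's a rough sketch (mine, not taken from any of the tools named above) of a naive greedy one-pass alignment that scores lines with difflib instead of comparing hashes; a real tool would need a proper dynamic-programming alignment on top of this:
import difflib

def line_similarity(a, b):
    # 0.0 = completely different, 1.0 = identical
    return difflib.SequenceMatcher(None, a, b).ratio()

def align_to_base(base_lines, other_lines, threshold=0.6):
    # Greedy one-pass alignment: pair each base line with the next line of the
    # other file if they are similar enough, otherwise mark the base line unmatched.
    pairs = []
    j = 0
    for i, base in enumerate(base_lines):
        if j < len(other_lines) and line_similarity(base, other_lines[j]) >= threshold:
            pairs.append((i, j))     # matched (possibly only nearly identical) lines
            j += 1
        else:
            pairs.append((i, None))  # missing or heavily changed in the other file
    return pairs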
Try Scooter Software's Beyond Compare. It supports 3-way merge and is written in Delphi / Kylix for multi-platform support. I've used it pretty extensively (even over a VPN) and it's performed well.
for f in file1 file2 file3 file4 file5; do printf '%s\n\n' "$f" >> outF; diff "$f" baseFile >> outF; printf '\n\n' >> outF; done
Diff3 should help. If you're on Windows, you can use it from Cygwin or from diffutils.
I made my own diff tool, DirDiff, because I didn't want matching parts shown twice on screen, and I wanted differing parts shown above each other for easy comparison. You could use it in directory mode on a directory with an equal number of copies of the base file.
It doesn't render exports of diffs, but I'll list that as a feature request.
You might want to look at some Merge components, as what you describe is exactly what merge tools do between the common base, the version-control file, and the local file. Except that you want more than 2 files (+ base)...
Just my $0.02
SourceGear Diffmerge is nice (and free) for windows based file diffing.
I know this is an old thread but vimdiff does (almost) exactly what you're looking for with the added advantage of being able to edit the files right from the diff perspective.
But none of those solutions handles more than 3 files, still.
What I did is messier, but serves the same purpose (comparing the contents of multiple config files, with no limit except memory and Bash variables).
While loop to read a file into an array:
loadsauce () {
index=0
while read SRCCNT[$index]
do let index=index+1
done < $SRC
}
Again for the target file
loadtarget () {
index=0
while read TRGCNT[$index]
do let index=index+1
done < $TRG
}
string comparison
brutediff () {
# Brute force string compare, probably duplicates diff
# This is very ugly but it will compare every line in SRC against every line in TRG
# Grep might do better; this version is included for completeness
for selement in $(seq 0 $((${#SRCCNT[@]} - 1)))
do for telement in $(seq 0 $((${#TRGCNT[@]} - 1)))
do [[ "${SRCCNT[$selement]}" == "${TRGCNT[$telement]}" ]] && echo "${SRCCNT[$selement]} is in ${SRC} and ${TRG}" >> $OUTMATCH
done
done
}
and finally a loop to do it against a list of files
for SRC in $(cat $SRCLIST)
do echo "Checking ${SRC}..."
loadsauce
loadtarget
brutediff
echo -n "Done, "
done
It's still untested/buggy and incomplete (things like sorting out duplicates, or compiling a list of common files for each line, are missing), but it's definitely a move in the direction the OP was asking for.
I do think Perl would be better for this though.
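For what it's worth, here's a hedged Python sketch of the same brute-force idea (the filenames are just examples), using set lookups instead of the nested loops:
def common_lines(base_path, other_paths):
    # For each config file, report which lines of the base file it also contains.
    with open(base_path, encoding="utf-8") as fh:
        base = [line.rstrip("\n") for line in fh]
    report = {}
    for path in other_paths:
        with open(path, encoding="utf-8") as fh:
            others = {line.rstrip("\n") for line in fh}
        report[path] = [line for line in base if line in others]
    return report

for path, shared in common_lines("base.cfg", ["a.cfg", "b.cfg", "c.cfg"]).items():
    print(path, "shares", len(shared), "of the base file's lines")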
