Using GREP to produce table from Database File (.pdb) - grep

What I have is a folder with PDB files which contain information in the following pattern:
*HEADER 'protein date ID'
TITLE 'title of document here
AUTHOR ' the authors listed here'
AUTHOR ' continued..'
SOURCE 'source organism (s)'
SOURCE 'continued'
SOURCE 'continued'
COMPND 'compound or complex studied'
COMPND 'continued'
As you can see the source and other information that is in this file expands to multiple lines. I want to create a single table with this information in these PDB files using the GREP command. I have not been able to group the multiple lines into one and and produce a table with columns such as TITLE, AUTHOR, SOURCE...etc
My reason for this is to be able to present in a table information from PDB files and filter through new research by author or source, which will save LOTS of time on the actual website.
Thank You

I don't think grep is the right tool, I'd suggest sed or awk. Here's a sed solution (or maybe not a complete solution, depending on your desired output):
sed ':r;$!{N;br};:s;s/\nSOURCE//2;ts' file.pdb
It only processes the lines with SOURCE though.
Here's a more generic version:
sed ':r;$!{N;br};:s;s/\(\n[A-Z]\+\)\(.*\)\1/\1\2/;ts' file.pdb

Related

Count Lines, grep, head, and tail inside Feather Files

Setup: I am contemplating switching from writing large (~20GB) data files with csv to feather format, since I have plenty of storage space and the extra speed is more important. One thing I like about csv files is that at the command line, I can do a quick
wc -l filename
to get a row count, even for large data files. Also, I can quickly search for a simple string with
grep search_string filename
The head and tail commands are also very useful at times. These are straight-forward and work well with csv files, but not with feather. If I try any of them on a feather file, I do not get results that make sense or are helpful.
While I certainly can read a feather file into, say, Python or R, and analyze it then, the hassle of writing out the path and importing the necessary libraries is something I'd rather dispense with.
My Question: Does there exist either a cross-platform (at least Mac and Linux) feather file reader I can use to quickly read in and view feather data (this would be in tabular format) with features corresponding to row count, grep, head, and tail? Or are there simple CLI utilities I could install that would enable me to do the equivalent of line count, grep, head, and tail?
I've seen this question, but it is very incomplete relative to my question.
Using feather files you must use Python or R programs.
To use csv you can use any of the common text manipulation utilities available to Linxu/Unix users.
Linux text manipulation tools
reader less
search grep
converters awk sed
extractor split
editor vim
Each of the above tools requires some learning and practice.
Suggestion
If you have programming skill, create a program to manipulate your feather file.

How to grep for files using 'and' operator, words might not be on the same line

I have a directory /dir
which has several text files in it, These files may or may not contain the words 'rock' and 'stone', so basically some files might just contain the word 'rock', some may just contain the word 'stone', some may contain both, and some may contain neither.
How can I list all files in this directory that contain both 'rock' and 'stone'? These words might not be on the same line so I don't think piping through grep twice would work.
Appreciate any help, I was not able to find a stackoverflow post with this problem so I figured I'd ask.
To search files that match the given two (or more) words at any line anywhere in the file, you may want to try ugrep:
ugrep -F --files -e 'rock' --and -e 'stone' dir
This only matches files that have both rock and stone in them. Lines are output that have rock or stone, or you can use option -l to just list files. The -F option searches strings (like grep -F and fgrep), --files applies the --and file-wide, which you want instead of applying the --and per line. Note that we have more than one pattern in this case, so option -e should be used (like grep also requires this).
A shorter form with --bool:
ugrep -F --files --bool 'rock stone' dir
where --bool formulates a Boolean query with space as AND (or use AND).
If you want to search directory dir recursively in subdirectories, use option -r.

Find the name of the file or files with a certain text. This command can search through the directory and all sub directories to find all matches

Trying to find the name of a file by searching for a word that's in the file.
I have tried to look at many examples online but unfortunately I couldn't find a code that outputs the names of the files with that certain word in it.
grep -r 'Facebook' *
I expected the output to be many names of the files which has the word facebook in it but instead I got output of lines with the word facebook in it which is not what I wanted.
There are many possible solutions in a Unix/Linux environment:
find /path/to/base -name Facebook
ls -R /path/to/base | grep -i "[Ff]acebook"
etc.
Or there are programmatic approaches written in a language of your choice. Add more details about what you are trying to do for a better answer.
it should be something like this
grep -l "Facebook" *
this is going to search for all the files and not sub directories in the current folder and produce the name of the file where a match was found
from man page of grep:
-L, --files-without-match
Suppress normal output; instead print the name of each input file from which no output would normally have been printed. The scanning will stop on the first match.
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would normally have been printed. The scanning will stop on the first match. (-l is
specified by POSIX.)

Open and extract information from large text file (Geonames)

I want to make a list of all major towns and cities in the UK.
Geonames seems like a good place to start, although I need to use it locally (as opposed to the API) as I will be working offline while using the information.
Due to the large size of the geonames "allcountries.txt" file it won't open on Notepad, Notepad++ and Sublime. I've tried opening in Excel (including the Data modelling function) but the file has more than a million rows so this won't work either.
Is it possible to open this file, extract the UK-only cities, and manipulate in Excel and/or some other software? I am only after place name, lat, long, country name, continent
#dedek's suggestion (in the comments) to use GB.txt is definitely the best answer for your particular case.
I've added another answer because this technique is much more flexible and will allow you to filter by country or any other column. i.e. You can adapt this solution to filter by language, region in the UK, population, etc or apply it the cities5000.txt file, for example.
Solution:
Use grep to find data that matches a particular pattern. In essence, the command below is saying, find all rows where the 8th column is exactly "GB".
grep -P "[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\tGB\t" allCountries.txt > UK.txt
(grep comes standard with most Unix systems but there are definitely tools out there that can do it on Windows too.)
Details:
grep: The command being executed.
\t: Shorthand for the TAB character.
-P: Tells grep to use a Perl-style regular expression (grep might not recognize \t as a TAB character otherwise). (This might be a bit different if you are using another version of grep.)
[^\t]*: zero or more non-tab characters i.e. an optional column value.
> UK.txt: writes the output of the command to a file called "UK.txt".
Again, you could adapt this example to filter on any column in any file.

How to handle citations in Ipython Notebook?

What is the best way to take care of citations in Ipython Notebook? Ideally, I would like to have a bibtex file, and then, as in latex, have a list of shorthands in Ipython markdown cells, with the full references at the end of the notebook.
The relevant material I found is this: http://nbviewer.ipython.org/github/ipython/nbconvert-examples/blob/master/citations/Tutorial.ipynb
But I couldn't follow the documentation very well. Can anyone explain it? Thanks so much!!
Summary
This solution is largely based on Sylvain Deville's excellent blog post. It allows you to simply write [#citation_key] in markdown cells. The references will be formatted after document conversion. The only requirements are LaTeX and pandoc, which are both widely supported. While there is never a guarantee, this approach should therefore still work in many years time.
Step-by-Step Guide
In addition to a working installation of jupyter you need:
LaTeX (installation guide).
Pandoc (installation guide).
A citation style language. Download a citation style, e.g., APA. Save the .csl file (e.g., apa.csl) into the same folder as your jupyter notebook (or specify the path to the .csl file later).
A .bib file with your references. I am using a sample bib file list.bib. Save to the same folder as your jupyter notebook (or specify the path to the .bib file later).
Once you completed these steps, the rest is easy:
Use markdown syntax for references in markdown cells in your jupyter notebook. E.g., [#Sh:1] where the syntax works like this: ([#citationkey_in_bib_file]). I much prefer this syntax over other solutions because it is so fast to type [#something].
At the end of your ipython notebook, create a code cell with the following syntax to automatically convert your document (note that this is R code, use an equivalent command to system() for python):
#automatic document conversion to markdown and then to word
#first convert the ipython notebook paper.ipynb to markdown
system("jupyter nbconvert --to markdown paper.ipynb")
#next convert markdown to ms word
conversion <- paste0("pandoc -s paper.md -t docx -o paper.docx",
" --filter pandoc-citeproc",
" --bibliography="listb.bib",
" --csl="apa.csl")
system(conversion)
Run this cell (or simply run all cells). Note that the 2nd system call is simply pandoc -s paper.md -t docx -o paper.docx --filter pandoc-citeproc --bibliography=listb.bib --csl=apa.csl. I merely used paste0() to be able to spread this over multiple lines and make it nicer to read.
The output is a word document. If you prefer another document, check out this guide for alternative syntax.
#Extras
If you do not like that your converted document includes the syntax for the document conversion, insert a markdown cell above and below the code cell with the syntax for the conversion. In the cell above, enter <!-- and in the cell below enter -->. This is a regular HTML command for a comment, so the syntax will in between these two cells will be evaluated but not printed.
You can also include a yaml header in your first cell. E.g.,
---
title: This is a great title.
author: Author Name
abstract: This is a great abstract
---
You can use the Document Tools of the Calico suite, which can be installed separately with:
sudo ipython install-nbextension https://bitbucket.org/ipre/calico/downloads/calico-document-tools-1.0.zip
Read the tutorial and watch the YouTube video for more details.
Warning: only the cited references are processed. Therefore, if you fail to cite an article, it won't appear in the References section. As a little working example, copy the following in a Markdown cell and press the "book" icon.
<!--bibtex
#Article{PER-GRA:2007,
Author = {P\'erez, Fernando and Granger, Brian E.},
Title = {{IP}ython: a System for Interactive Scientific Computing},
Journal = {Computing in Science and Engineering},
Volume = {9},
Number = {3},
Pages = {21--29},
month = may,
year = 2007,
url = "http://ipython.org",
ISSN = "1521-9615",
doi = {10.1109/MCSE.2007.53},
publisher = {IEEE Computer Society},
}
#article{Papa2007,
author = {Papa, David A. and Markov, Igor L.},
journal = {Approximation algorithms and metaheuristics},
pages = {1--38},
title = {{Hypergraph partitioning and clustering}},
url = {http://www.podload.org/pubs/book/part\_survey.pdf},
year = {2007}
}
-->
Examples of citations: [CITE](#cite-PER-GRA:2007) or [CITE](#cite-Papa2007).
This should result in the following added Markdown cell:
References
^ PĂ©rez, Fernando and Granger, Brian E.. 2007. IPython: a System for Interactive Scientific Computing. URL
^ Papa, David A. and Markov, Igor L.. 2007. Hypergraph partitioning and clustering. URL
I was able to run it with the following approach:
Insert the html citation as in the tutorial you mentioned.
Create ipython.bib in the "standard" bibtex format. It goes into the same file as your *.ipynb notebook file.
Create the template file as in the tutorial, also in the same directory or else in the (distribution dependent) directory with the other templates. On my system, that's /usr/local/lib/python2.7/dist-packages/IPython/nbconvert/templates/latex.
The tutorial has the template extend latex_article.tplx. On my distribution, it's article.tplx (without latex_).
Run nbconvert with --to latex; that generates an .aux file among other things. Latex will complain about missing references.
Run bibtex yournotebook.aux; this generates yournotebook.bbl. You only need to re-run this if you change references.
Re-run nbconvert either with --to latex or with --to pdf. This generates a .tex file, or else runs all the way to a .pdf.
If you want html output, you can use pandoc to assemble the references into a tidy citation page. This may require some hand-editing to make an html page you can reference from your main document.
If you know that you will be converting your notebook to latex anyway, consider simply adding a "Raw" cell (Ctrl+M R) to the end of the document, containing the bibliography just as you would put it in pure LaTeX.
For example, when I need to reference a couple of external links, I would not even care to do a proper BibTeX thing and simply have a "Raw" cell at the end of the notebook like that:
\begin{thebibliography}{1}
\bibitem{post1}
Holography in Simple Terms. K.Tretyakov (blog post), 2015.\\
\url{http://fouryears.eu/2015/07/24/holography-in-simple-terms/}
\bibtem{book1}
The Importance of Citations. J. Smith. 2010.
\end{thebibliography}
The items can be cited in other Markdown cells using the usual <cite data-cite="post1">(KT, 2015)</cite>
Of course, you can also use proper BibTeX as well. Just add the corresponding Raw cell, e.g:
\bibliographystyle{unsrt}
\bibliography{papers}
This way you do not have to bother editing a separate template file (at the price of cluttering the notebook's HTML export with raw Latex, though).
You should have a look at the latex_envs extension in https://github.com/ipython-contrib/IPython-notebook-extensions (install from this repo, it is the most recent version). This extension contains a way to integrate bibliography using bibtex files and standard latex notation, and generates a bibliography section at the end of the notebook. Style of citations can be (to some extent) customized. Some documentation here https://rawgit.com/jfbercher/latex_envs/master/doc/latex_env_doc.html

Resources