Generating keywords from an article

How can we generate important keywords for an arbitrary article? Is there an existing algorithm or tool to extract the important keywords from a given text?

If you are using Linux, you can use the grep command to get the lines that contain a given keyword.
Eg: $ cat file_name.txt | grep key_word
The above command will display only the lines that contain the specified key_word.
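If you actually need to discover which words are important, rather than filter by a keyword you already know, a rough first pass is a simple word-frequency count. The following is only a sketch using standard Unix tools (file_name.txt is the same placeholder as above); filtering common stop words out of the top of the list usually leaves reasonable candidate keywords:
# split the text into lowercase words, count them, and show the 20 most frequent
tr -cs '[:alpha:]' '\n' < file_name.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn | head -20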
Please specify more details, such as the type of file (e.g. txt or doc files) and which programming language and operating system you use, so that you can get a proper answer.


Bash - grep command inconsistent with man page

I am trying to understand and read the man page. Yet every day I find more inconsistent syntax, and I would like some clarification as to whether I am misunderstanding something.
Within the man page, it specifies the syntax for grep is grep [OPTIONS] [-e PATTERN]... [-f FILE]... [FILE...]
I got a working example that recursively searches all files within a directory for a keyword.
grep -rnw . -e 'memes'
Now this example works, but I find it very inconsistent with the man page. The directory (which the man page writes as [FILE...], and for which it also describes the case where the file is a directory) is listed last in the synopsis, yet in this example it comes after [OPTIONS] and before [-e PATTERN].... Why is this allowed? It does not follow the specified rule for using this command.
The lines in the SYNOPSIS section of a manpage are not to be understood as strict regular expressions, but as a brief description of the syntax of a utility's arguments.
Depending on the particular application, the parser might be more or less flexible in how it accepts its options. After all, each program can implement whatever grammar it likes for its arguments. Therefore, some might allow options at the beginning, at the end, or even in between files (typically with ways to handle ambiguity that may arise, e.g. reading from the standard input with -, filenames starting with -, ...).
Now, of course, there are some ways to do it that are common. For instance, POSIX.1-2017 12.1 Utility Argument Syntax says:
This section describes the argument syntax of the standard utilities and introduces terminology used throughout POSIX.1-2017 for describing the arguments processed by the utilities.
In your particular case, your implementation of grep (probably GNU grep) allows passing options in between the file operands, as you have discovered.
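As a concrete illustration of that flexibility, the following two invocations are equivalent with GNU grep, even though only the second matches the SYNOPSIS ordering literally (GNU grep permutes the arguments; with POSIXLY_CORRECT set, option parsing would stop at the first non-option argument):
grep -rnw . -e 'memes'
grep -rnw -e 'memes' .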
For more information, see:
https://unix.stackexchange.com/questions/17833/understand-synopsis-in-manpage
Are there standards for Linux command line switches and arguments?
https://www.gnu.org/software/libc/manual/html_node/Getopt-Long-Options.html
You can also leverage:
grep 'string' * -lR

Open and extract information from large text file (Geonames)

I want to make a list of all major towns and cities in the UK.
Geonames seems like a good place to start, although I need to use it locally (as opposed to the API) as I will be working offline while using the information.
Due to the large size of the geonames "allcountries.txt" file, it won't open in Notepad, Notepad++ or Sublime. I've tried opening it in Excel (including the Data modelling function), but the file has more than a million rows, so this won't work either.
Is it possible to open this file, extract the UK-only cities, and manipulate them in Excel and/or some other software? I am only after place name, lat, long, country name, and continent.
@dedek's suggestion (in the comments) to use GB.txt is definitely the best answer for your particular case.
I've added another answer because this technique is much more flexible and will allow you to filter by country or any other column, i.e. you can adapt this solution to filter by language, region in the UK, population, etc., or apply it to the cities5000.txt file, for example.
Solution:
Use grep to find data that matches a particular pattern. In essence, the command below is saying: find all rows where the 9th tab-separated column is exactly "GB".
grep -P "[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\tGB\t" allCountries.txt > UK.txt
(grep comes standard with most Unix systems but there are definitely tools out there that can do it on Windows too.)
Details:
grep: The command being executed.
\t: Shorthand for the TAB character.
-P: Tells grep to use a Perl-style regular expression (grep might not recognize \t as a TAB character otherwise). (This might be a bit different if you are using another version of grep.)
[^\t]*: zero or more non-tab characters i.e. an optional column value.
> UK.txt: writes the output of the command to a file called "UK.txt".
Again, you could adapt this example to filter on any column in any file.
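If you then only need the columns mentioned in the question, a follow-up step along these lines can trim the filtered file. This is a sketch that assumes the standard geonames tab-separated column layout (2 = name, 5 = latitude, 6 = longitude, 9 = country code); note that the continent is not a column of allCountries.txt, so it would have to be joined in separately from geonames' countryInfo.txt:
# keep only name, latitude, longitude and country code (cut splits on tabs by default)
cut -f2,5,6,9 UK.txt > UK_trimmed.txt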

Why are some command options with one dash and others are with two dashes

Some command options are with one dash e.g. ruby -c (check syntax) and ruby --copyright (print copyright). Is there any pattern to this?
These are known as short and long options. Which name/format a developer uses for their program's options is entirely up to them.
However, there are some widespread conventions. Like -v/--version for printing version number, -h/--help for printing usage instructions, etc.
Sadly, most commandline tools on OSX seem not to conform to -v/-h.
Good CLI (command-line interface) design dictates that options of a program that are most useful should have two formats, short and long. You use short format in your everyday life (because it's faster to type).
ps aux | grep ruby
Long ones are for scripts that you write and rarely touch (they're easier to read and understand).
mongod --logpath /path/to/logs --dbpath /path/to/db --fork --smallfiles
Many less-used options may have only the long version (because, you know, there are only 26 letters in the Latin alphabet).
Many Rails commands follow a pattern: a one-dash option is an abbreviation for a two-dash option, e.g. rspec -o FILE is a synonym for rspec --out FILE.
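As a sketch of how a script or small tool might accept both the short and the long form of the same option (the option names and messages here are only illustrative):
# minimal POSIX sh argument loop accepting -v/--version and -h/--help
while [ $# -gt 0 ]; do
  case "$1" in
    -v|--version) echo "1.0"; exit 0 ;;
    -h|--help)    echo "usage: mytool [-v|--version] [-h|--help]"; exit 0 ;;
    *)            break ;;
  esac
  shift
done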

How to handle citations in IPython Notebook?

What is the best way to take care of citations in IPython Notebook? Ideally, I would like to have a BibTeX file and then, as in LaTeX, have a list of shorthands in IPython markdown cells, with the full references at the end of the notebook.
The relevant material I found is this: http://nbviewer.ipython.org/github/ipython/nbconvert-examples/blob/master/citations/Tutorial.ipynb
But I couldn't follow the documentation very well. Can anyone explain it? Thanks so much!!
Summary
This solution is largely based on Sylvain Deville's excellent blog post. It allows you to simply write [#citation_key] in markdown cells. The references will be formatted after document conversion. The only requirements are LaTeX and pandoc, which are both widely supported. While there is never a guarantee, this approach should therefore still work in many years' time.
Step-by-Step Guide
In addition to a working installation of jupyter you need:
LaTeX (installation guide).
Pandoc (installation guide).
A citation style language. Download a citation style, e.g., APA. Save the .csl file (e.g., apa.csl) into the same folder as your jupyter notebook (or specify the path to the .csl file later).
A .bib file with your references. I am using a sample bib file list.bib. Save to the same folder as your jupyter notebook (or specify the path to the .bib file later).
Once you completed these steps, the rest is easy:
Use markdown syntax for references in markdown cells in your jupyter notebook. E.g., [#Sh:1] where the syntax works like this: ([#citationkey_in_bib_file]). I much prefer this syntax over other solutions because it is so fast to type [#something].
At the end of your IPython notebook, create a code cell with the following syntax to automatically convert your document (note that this is R code; use an equivalent of system() for Python):
#automatic document conversion to markdown and then to word
#first convert the ipython notebook paper.ipynb to markdown
system("jupyter nbconvert --to markdown paper.ipynb")
#next convert markdown to ms word
conversion <- paste0("pandoc -s paper.md -t docx -o paper.docx",
                     " --filter pandoc-citeproc",
                     " --bibliography=listb.bib",
                     " --csl=apa.csl")
system(conversion)
Run this cell (or simply run all cells). Note that the 2nd system call is simply pandoc -s paper.md -t docx -o paper.docx --filter pandoc-citeproc --bibliography=listb.bib --csl=apa.csl. I merely used paste0() to be able to spread this over multiple lines and make it nicer to read.
The output is a word document. If you prefer another document, check out this guide for alternative syntax.
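If you prefer to run the conversion from a terminal rather than from a code cell, the same two steps from above are simply:
jupyter nbconvert --to markdown paper.ipynb
pandoc -s paper.md -t docx -o paper.docx --filter pandoc-citeproc --bibliography=listb.bib --csl=apa.csl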
Extras
If you do not want your converted document to include the syntax for the document conversion, insert a markdown cell above and below the code cell containing the conversion commands. In the cell above, enter <!-- and in the cell below enter -->. This is a regular HTML comment, so the cell in between will still be evaluated but will not be printed in the converted document.
You can also include a yaml header in your first cell. E.g.,
---
title: This is a great title.
author: Author Name
abstract: This is a great abstract
---
You can use the Document Tools of the Calico suite, which can be installed separately with:
sudo ipython install-nbextension https://bitbucket.org/ipre/calico/downloads/calico-document-tools-1.0.zip
Read the tutorial and watch the YouTube video for more details.
Warning: only the cited references are processed. Therefore, if you fail to cite an article, it won't appear in the References section. As a little working example, copy the following into a Markdown cell and press the "book" icon.
<!--bibtex
@Article{PER-GRA:2007,
Author = {P\'erez, Fernando and Granger, Brian E.},
Title = {{IP}ython: a System for Interactive Scientific Computing},
Journal = {Computing in Science and Engineering},
Volume = {9},
Number = {3},
Pages = {21--29},
month = may,
year = 2007,
url = "http://ipython.org",
ISSN = "1521-9615",
doi = {10.1109/MCSE.2007.53},
publisher = {IEEE Computer Society},
}
@article{Papa2007,
author = {Papa, David A. and Markov, Igor L.},
journal = {Approximation algorithms and metaheuristics},
pages = {1--38},
title = {{Hypergraph partitioning and clustering}},
url = {http://www.podload.org/pubs/book/part\_survey.pdf},
year = {2007}
}
-->
Examples of citations: [CITE](#cite-PER-GRA:2007) or [CITE](#cite-Papa2007).
This should result in the following added Markdown cell:
References
^ Pérez, Fernando and Granger, Brian E.. 2007. IPython: a System for Interactive Scientific Computing. URL
^ Papa, David A. and Markov, Igor L.. 2007. Hypergraph partitioning and clustering. URL
I was able to run it with the following approach:
Insert the html citation as in the tutorial you mentioned.
Create ipython.bib in the "standard" BibTeX format. It goes into the same directory as your *.ipynb notebook file.
Create the template file as in the tutorial, also in the same directory or else in the (distribution dependent) directory with the other templates. On my system, that's /usr/local/lib/python2.7/dist-packages/IPython/nbconvert/templates/latex.
The tutorial has the template extend latex_article.tplx. On my distribution, it's article.tplx (without latex_).
Run nbconvert with --to latex; that generates an .aux file among other things. Latex will complain about missing references.
Run bibtex yournotebook.aux; this generates yournotebook.bbl. You only need to re-run this if you change references.
Re-run nbconvert either with --to latex or with --to pdf. This generates a .tex file, or else runs all the way to a .pdf.
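One way to realize these steps from a shell, as a minimal sketch assuming a standard TeX toolchain (the LaTeX runs are what produce the .aux file and pick up the .bbl; yournotebook is a placeholder name):
jupyter nbconvert --to latex yournotebook.ipynb
pdflatex yournotebook.tex
bibtex yournotebook.aux
pdflatex yournotebook.tex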
If you want html output, you can use pandoc to assemble the references into a tidy citation page. This may require some hand-editing to make an html page you can reference from your main document.
If you know that you will be converting your notebook to latex anyway, consider simply adding a "Raw" cell (Ctrl+M R) to the end of the document, containing the bibliography just as you would put it in pure LaTeX.
For example, when I need to reference a couple of external links, I would not even care to do a proper BibTeX thing and simply have a "Raw" cell at the end of the notebook like that:
\begin{thebibliography}{1}
\bibitem{post1}
Holography in Simple Terms. K.Tretyakov (blog post), 2015.\\
\url{http://fouryears.eu/2015/07/24/holography-in-simple-terms/}
\bibitem{book1}
The Importance of Citations. J. Smith. 2010.
\end{thebibliography}
The items can be cited in other Markdown cells using the usual <cite data-cite="post1">(KT, 2015)</cite> markup.
Of course, you can also use proper BibTeX. Just add the corresponding Raw cell, e.g.:
\bibliographystyle{unsrt}
\bibliography{papers}
This way you do not have to bother editing a separate template file (at the price of cluttering the notebook's HTML export with raw LaTeX, though).
You should have a look at the latex_envs extension at https://github.com/ipython-contrib/IPython-notebook-extensions (install from this repo; it is the most recent version). This extension provides a way to integrate a bibliography using BibTeX files and standard LaTeX notation, and it generates a bibliography section at the end of the notebook. The citation style can be customized to some extent. Some documentation is available at https://rawgit.com/jfbercher/latex_envs/master/doc/latex_env_doc.html

GREP - finding all occurrences of a string

I am tasked with white labeling an application so that it contains no references to our company, website, etc. The problem I am running into is that I have many different patterns to look for and would like to guarantee that all patterns are removed. Since the application was not (entirely) developed in-house, we cannot simply look for occurrences in messages.properties and be done. We must go through JSPs, Java code, and XML.
I am using grep to filter results like this:
grep SOME_PATTERN . -ir | grep -v import | grep -v // | grep -v /* ...
The patterns are escaped when I'm using them on the command line; however, I don't feel this pattern matching is very robust. There could possibly be occurrences that have import in them (unlikely) or even /* (the beginning of a javadoc comment).
All of the text output to the screen must come from a string declaration somewhere or a constants file. So, I can assume I will find something like:
public static final String SOME_CONSTANT = "SOME_PATTERN is currently unavailable";
I would like to find that occurrence as well as:
public static final String SOME_CONSTANT = "
SOME_PATTERN blah blah blah";
Alternatively, if we had an internal crawler / automated tests, I could simply pull back the xhtml from each page and check the source to ensure it was clean.
To address your concern about missing some occurrences, why not filter progressively:
1. Create a text file with all possible matches as a starting point.
2. Use filter X (grep for '^import', for example) to dump probable false positives into a tmp file.
3. Use filter X again to remove those matches from your working file (a copy of [1]).
4. Do a quick visual pass of the tmp file and add any real matches back in.
5. Repeat [2]-[4] with other filters.
This might take some time, of course, but it doesn't sound like this is something you want to get wrong...
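A sketch of that workflow with hypothetical file names (each line of grep -rn output looks like path:line:text, so the filters simply match anywhere in the line):
grep -irn "SOME_PATTERN" . > working.txt                                   # [1] all candidate matches
grep import working.txt > tmp.txt                                          # [2] probable false positives
grep -v import working.txt > working.next && mv working.next working.txt   # [3] drop them from the working copy
# [4] skim tmp.txt by hand and paste any real matches back into working.txt
# [5] repeat [2]-[4] with the next filter (e.g. //, /*)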
I would use sed, not grep!
Sed is used to perform basic text transformations on an input stream.
Try the s/regexp/replacement/ command with sed.
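For example, a sketch of an in-place substitution over the relevant source files (the pattern, replacement and backup suffix are placeholders; GNU sed syntax shown):
# rewrite every occurrence of SOME_PATTERN in JSP, Java and XML files, keeping .bak backups
find . -type f \( -name '*.jsp' -o -name '*.java' -o -name '*.xml' \) \
  -exec sed -i.bak 's/SOME_PATTERN/REPLACEMENT/g' {} +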
You can also try the awk command. It has an -F option for field separation; you can use it with ';' to split the lines of your files on ';'.
The best solution, however, would be a simple script in Perl or Python.
