Search Jupyter notebook markdown cells from command line

I use ag to search through my notes. My notes are written down in Markdown files and Markdown cells contained within Jupyter notebooks.
I can search the Markdown files conveniently with ag --markdown .... It would be very handy if something similar could be done with the Jupyter notebook files. But this would require that ag understands the format of these notebooks.
My question: is there a way to search only the Markdown cells for a given string in a Jupyter notebook file? Any pattern matcher used in the solution is acceptable for me (ag, grep, ack, ...).
p.s. The notebooks are composed in JSON. Here's a sample:
$ head notebook.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"THIS IS A MARKDOWN STRING"
]
},
{

I'd use jq to pick out the Markdown cells of the notebook. For instance, if you just want to print all the Markdown source, you could use the following:
$ jq '.cells[] | select(.cell_type == "markdown") | .source[]' notebook.ipynb
jq is fast, and it is used for far more elaborate tasks when saving IPython notebooks to git, for example: Using IPython notebooks under version control

I don't know if ag can be interfaced with a filter, but to get the Markdown out of a notebook file the following Python code will suffice:
import nbformat
from sys import argv
# NO_CONVERT reads the notebook in whatever format version it was saved in
nb = nbformat.read(argv[1], nbformat.NO_CONVERT)
for cell in nb.cells:
    if cell.cell_type == 'markdown':
        print(cell.source)
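To get from "print the Markdown" to an actual search, the same idea can be sketched with only the standard library. This is a minimal sketch, not part of the answer above: it assumes the notebook uses the top-level "cells" layout shown in the question, and the names search_markdown_cells and nbgrep.py are made up for illustration.

```python
import json
import re
import sys


def search_markdown_cells(path, pattern, flags=0):
    """Yield (cell_index, line_number, line) for each matching Markdown-cell line."""
    with open(path, encoding="utf-8") as fh:
        notebook = json.load(fh)
    regex = re.compile(pattern, flags)
    # Assumption: cells live in a top-level "cells" list, as in the sample above.
    for cell_index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", [])
        # "source" may be stored as one string or as a list of strings.
        text = source if isinstance(source, str) else "".join(source)
        for line_number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield cell_index, line_number, line


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Hypothetical usage: python nbgrep.py PATTERN notebook.ipynb [more.ipynb ...]
    for notebook_path in sys.argv[2:]:
        for cell_index, line_number, line in search_markdown_cells(notebook_path, sys.argv[1]):
            print(f"{notebook_path}:cell {cell_index}:{line_number}:{line}")
```

The output format loosely imitates grep's file:line:text style so that results stay easy to scan; code cells are skipped entirely.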

Related

How to typeset argmin and argmax in Markdown?

There are posts on TeX.SE that show how argmin and argmax with limits can be typeset using the \DeclareMathOperator* command. But how can this be done in Markdown?
I am especially interested in doing this in Jupyter Notebook when I'm documenting in Markdown.
Is this possible?

The way to do this is by using the \underset command.
Syntax:
\underset{<constraints>}{\operatorname{<argmax or argmin>}}
Example:
$\underset{c\in C}{\operatorname{argmax}}$
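As a slightly fuller illustration (my own example, not taken from the answer), the same construction inside a complete display equation might look like this; the symbols x, X and f are placeholders:

```latex
$$
\hat{x} = \underset{x \in X}{\operatorname{argmin}} \; f(x)
$$
```

In display mode the constraint sits directly under the operator name, just as in the inline example.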

Maybe this is what you are looking for:
$ \hat{\theta}^{MLE}=\underset{a}{\operatorname{argmax}} P(D|\theta) = \frac{a_1}{a_1+a_0} $
The trick is done by \underset, which puts its first argument under an expression.
That said, I often see the simple subscript form used instead, along with \arg and \max.
If you want to know more about the many options you can choose from, take a look here: Command for argmin or argmax?
Hope this helps.

This worked for me when using pandoc for markdown-to-PDF conversion. (The subscript 'a' is correctly placed underneath the 'argmin'):
---
header-includes:
- \newcommand{\argmin}{\mathop{\mathrm{argmin}}\limits}
---
Some text.
$A = \argmin_a f(a)$

ksh - search for multiple strings and write lines to file

Any help would be greatly appreciated. I can read code and figure it out, but I have trouble writing from scratch.
I need help starting a ksh script that would search a file for multiple strings and write each line containing one of those strings to an output file.
If I use the following command:
$ grep "search pattern" file >> output_file
...that does what I want it to. But I need to search multiple strings, and write the output in the order listed in the file.
Again... any help would be great! Thank you in advance!

Have a look at the regular expression manuals. You can specify multiple strings in the search expression using alternation, which needs extended regular expressions, such as grep -E "John|Bill"
man grep will teach you a lot about regular expressions, and there are several online sites where you can try them out, such as regex101 and (more colorful) regexr.

Sometimes you need egrep.
egrep "first substring|second substring" file
When you have a lot of substrings, you can put them in a variable first:
findalot="first substring|second substring"
findalot="${findalot}|third substring"
findalot="${findalot}|find me too"
skipsome="notme"
skipsome="${skipsome}|dirty words"
egrep "${findalot}" file | egrep -v "${skipsome}"

Use the "-f" option of grep.
Write all the strings you want to match in a file (let's say pattern_file; the list of strings should be one per line)
and use grep like below:
grep -f pattern_file file > output_file
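To make explicit what the pattern-file approach does, here is a rough Python sketch of the same filtering. It is only an illustration under stated assumptions: it treats every pattern as a plain substring (closer to grep -F -f than to a regular-expression match), and the function names filter_lines and grep_fixed are made up.

```python
def filter_lines(lines, patterns):
    """Return the lines that contain at least one of the given substrings."""
    # Empty patterns are skipped here because "" is a substring of every line.
    wanted = [p for p in patterns if p]
    return [line for line in lines if any(p in line for p in wanted)]


def grep_fixed(pattern_path, input_path, output_path):
    """Write matching lines from input_path to output_path, in input order."""
    with open(pattern_path, encoding="utf-8") as fh:
        patterns = fh.read().splitlines()
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        dst.writelines(filter_lines(src, patterns))
```

Because the input is scanned once from top to bottom, matches come out in the order they appear in the searched file, regardless of the order of the patterns.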

Addressing a specific occurrence of a character in sed

How do I remove or address a specific occurrence of a character in sed?
I'm editing a CSV file and I want to remove all text between the third and the fifth occurrence of the comma (that is, dropping fields four and five). Is there any way to achieve this using sed?
E.g:
% cat myfile
one,two,three,dropthis,dropthat,six,...
% sed -i 's/someregex//' myfile
% cat myfile
one,two,three,,six,...

If it is okay to consider cut command then:
$ cut -d, -f1-3,6- file

awk, or any other tool that is able to split strings on delimiters, is better suited for the job than sed.
$ cat file
1,2,3,4,5,6,7,8,9,10
Using Ruby (1.9+):
$ ruby -ne 's=$_.split(","); s[3,2]=nil ;puts s.compact.join(",") ' file
1,2,3,6,7,8,9,10
Using awk (note that the gsub also collapses any fields that were already empty):
$ awk 'BEGIN{FS=OFS=","}{$4=$5="";}{gsub(/,,*/,",")}1' file
1,2,3,6,7,8,9,10

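A real parser in action:
#!/usr/bin/env python3
import csv
# newline='' lets the csv module handle line endings itself
with open('my-data.csv', newline='') as src, open('stripped-data.csv', 'w', newline='') as dst:
    cw = csv.writer(dst)
    for row in csv.reader(src):
        # keep fields 1-3 and everything from field 6 onward
        cw.writerow(row[0:3] + row[5:])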
But do note the preface to the csv module:
"The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. There is no “CSV standard”, so the format is operationally defined by the many applications which read and write it. The lack of a standard means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources. Still, while the delimiters and quoting characters vary, the overall format is similar enough that it is possible to write a single module which can efficiently manipulate such data, hiding the details of reading and writing the data from the programmer."
$ cat my-data.csv
1
1,2
1,2,3
1,2,3,4,
1,2,3,4,5
1,2,3,4,5,6
1,2,3,4,5,6,
1,2,,4,5,6
1,2,"3,3",4,5,6
1,"2,2",3,4,5,6
,,3,4,5
,,,4,5
,,,,5
$ python csvdrop.py
$ cat stripped-data.csv
1
1,2
1,2,3
1,2,3
1,2,3
1,2,3,6
1,2,3,6,
1,2,,6
1,2,"3,3",6
1,"2,2",3,6
,,3
,,
,,
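The quoted rows are where a real parser pays off. As a small self-contained sketch (the helper names drop_fields_naive and drop_fields_csv are mine, not from the answer), compare splitting on every comma with letting the csv module do the parsing, using the row 1,2,"3,3",4,5,6 from my-data.csv:

```python
import csv
import io


def drop_fields_naive(line, start=3, stop=5):
    """Drop fields start..stop-1 (0-based) by splitting on every comma."""
    fields = line.split(",")
    return ",".join(fields[:start] + fields[stop:])


def drop_fields_csv(line, start=3, stop=5):
    """Same operation, but with the csv module handling the quoting."""
    row = next(csv.reader([line]))
    out = io.StringIO()
    csv.writer(out, lineterminator="").writerow(row[:start] + row[stop:])
    return out.getvalue()


sample = '1,2,"3,3",4,5,6'
print(drop_fields_naive(sample))  # 1,2,"3,5,6   (quoted field torn apart)
print(drop_fields_csv(sample))    # 1,2,"3,3",6  (matches stripped-data.csv)
```

The naive version counts the comma inside the quotes as a delimiter, so it removes the wrong fields and leaves an unbalanced quote behind.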

Correct word-count of a LaTeX document

I'm currently searching for an application or a script that does a correct word count for a LaTeX document.
Up till now, I have only encountered scripts that work on a single file, but what I want is a script that can safely ignore LaTeX keywords and also traverse linked files, i.e. follow \include and \input links to produce a correct word count for the whole document.
With vim, I currently use ggVGg CTRL+G but obviously that shows the count for the current file and does not ignore LaTeX keywords.
Does anyone know of any script (or application) that can do this job?

I use texcount. The webpage has a Perl script to download (and a manual).
It will include tex files that are included (\input or \include) in the document (see -inc), supports macros, and has many other nice features.
When following included files you will get detail about each separate file as well as a total. For example here is the total output for a 12 page document of mine:
TOTAL COUNT
Files: 20
Words in text: 4188
Words in headers: 26
Words in float captions: 404
Number of headers: 12
Number of floats: 7
Number of math inlines: 85
Number of math displayed: 19
If you're only interested in the total, use the -total argument.

I went with icio's comment and did a word-count on the pdf itself by piping the output of pdftotext to wc:
pdftotext file.pdf - | wc -w

latex file.tex
dvips -o - file.dvi | ps2ascii | wc -w
should give you a fairly accurate word count.

To add to @aioobe's answer:
If you use pdflatex, just do
pdftops file.pdf
ps2ascii file.ps|wc -w
I compared this count with the count in Microsoft Word for a document of 1599 words (according to Word). pdftotext produced a text with 1700+ words. texcount did not include the references and produced 1088 words. ps2ascii returned 1603 words, 4 more than Word.
I'd say that's a pretty good count. I am not sure where the 4-word difference comes from, though. :)

In the Texmaker interface you can get the word count by right-clicking in the PDF preview.

Overleaf has a word count feature (in both Overleaf v1 and Overleaf v2).

I use the following Vim script:
function! WC()
    let filename = expand("%")
    let cmd = "detex " . filename . " | wc -w | perl -pe 'chomp; s/ +//;'"
    let result = system(cmd)
    echo result . " words"
endfunction
… but it doesn’t follow links. This would basically entail parsing the TeX file to get all linked files, wouldn’t it?
The advantage over the other answers is that it doesn’t have to produce an output file (PDF or PS) to compute the word count so it’s potentially (depending on usage) much more efficient.
Although icio’s comment is theoretically correct, I found that the above method gives quite accurate estimates for the number of words. For most texts, it’s well within the 5% margin that is used in many assignments.
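On the question of following links: collecting the linked files does not need a full TeX parser if only plain \input{...} and \include{...} calls are used. The following is a rough sketch under that assumption, not a replacement for texcount. It also assumes all paths are relative to the main file's directory, and the name linked_tex_files is made up for illustration.

```python
import re
from pathlib import Path

# Matches \input{name} and \include{name}; other inclusion mechanisms are ignored.
INCLUDE_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")
# Matches an unescaped % and the rest of that line (a LaTeX comment).
COMMENT_RE = re.compile(r"(?<!\\)%.*")


def linked_tex_files(main_file):
    """Return main_file plus every file reachable through input/include, in order."""
    root = Path(main_file)
    base_dir = root.parent  # assumption: includes are relative to the main file
    found = []

    def visit(path):
        # Skip files already visited (avoids loops) and files that do not exist.
        if path in found or not path.is_file():
            return
        found.append(path)
        text = COMMENT_RE.sub("", path.read_text(encoding="utf-8"))
        for name in INCLUDE_RE.findall(text):
            child = base_dir / name.strip()
            if child.suffix != ".tex":
                # \input{intro} refers to intro.tex
                child = child.with_name(child.name + ".tex")
            visit(child)

    visit(root)
    return found
```

Each returned path could then be passed through the same detex and wc -w pipeline, and the per-file counts summed.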

If the use of a vim plugin suits you, the vimtex plugin has integrated the texcount tool quite nicely.
Here is an excerpt from their documentation:
:VimtexCountLetters     Shows the number of letters/characters or words in
:VimtexCountWords       the current project or in the selected region. The
                        count is created with `texcount` through a call on
                        the main project file similar to: >
                          texcount -nosub -sum [-letter] -merge -q -1 FILE
<
                        Note: Default arguments may be controlled with
                              |g:vimtex_texcount_custom_arg|.
                        Note: One may access the information through the
                              function `vimtex#misc#wordcount(opts)`, where
                              `opts` is a dictionary with the following
                              keys (defaults indicated): >
                                'range' : [1, line('$')]
                                'count_letters' : 0/1
                                'detailed' : 0
<
                              If `detailed` is 0, then it only returns the
                              total count. This makes it possible to use for
                              e.g. statusline functions. If the `opts` dict
                              is not passed, then the defaults are assumed.
                                                        *VimtexCountLetters!*
                                                        *VimtexCountWords!*
:VimtexCountLetters!    Similar to |VimtexCountLetters|/|VimtexCountWords|, but
:VimtexCountWords!      show separate reports for included files. I.e.
                        presents the result of: >
                          texcount -nosub -sum [-letter] -inc FILE
<
The nice part about this is how extensible it is. On top of counting the number of words in your current file, you can make a visual selection (say two or three paragraphs) and then only apply the command to your selection.

For a very basic article class document I just look at the number of matches for a regex to find words. I use Sublime Text, so this method may not work for you in a different editor, but I just hit Ctrl+F (Command+F on Mac) and then, with regex enabled, search for
(^|\s+|"|((h|f|te){)|\()\w+
which should ignore text declaring a floating environment or captions on figures as well as most kinds of basic equations and \usepackage declarations, while including quotations and parentheticals. It also counts footnotes and \emphasized text and will count \hyperref links as one word. It's not perfect, but it's typically accurate to within a few dozen words or so. You could refine it to work for you, but a script is probably a better solution, since LaTeX source code isn't a regular language. Just thought I'd throw this up here.

Convert LaTeX to MediaWiki syntax

I need to convert LaTeX into MediaWiki syntax. The formulas should stay the same, but I need to transform, for example \chapter{something} into = something =.
Although this can be obtained with a bit of sed, things get a little dirty with the itemize environment, so I was wondering if a better solution can be produced.
Anything that can be useful for this task?

Pandoc should be able to do it:
$ pandoc -f latex -t mediawiki << END
> \documentclass{paper}
> \begin{document}
> \section{Heading}
>
> Hello
>
> \subsection{Sub-heading}
>
> \textbf{World}!
> \end{document}
> END
== Heading ==
Hello
=== Sub-heading ===
'''World'''!

Pandoc can convert your file between several different markup languages pretty easily, including MediaWiki.

I found this: plasTeX. With a bit of hacking probably I can produce a renderer for the mediawiki syntax

Yes, Pandoc would be the easiest way to do that.
pandoc -f latex -t mediawiki --metadata link-citations --bibliography=bibl.bib --csl=cslstyle.csl test.tex -o test.wiki
--metadata link-citations creates hyperlinks from your in-text citations to the bibliography. You can delete this part if it is not needed.
bibl.bib is the bibliography file you used.
cslstyle.csl is the citation style you want. There are lots of choices that can be downloaded from editor.citationstyles.org.
test.tex is the file you want to convert.
test.wiki is the output file you want.
All the files should be in the same folder; otherwise their locations should be specified.
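If several documents need converting, the command above can be assembled programmatically. The sketch below is a hypothetical wrapper of my own, not part of pandoc: it only uses the options already shown in the answer, and it assumes pandoc is installed and on the PATH when convert is actually called.

```python
import subprocess


def pandoc_command(source, target, bibliography=None, csl=None):
    """Build the argument list for a LaTeX-to-MediaWiki pandoc run."""
    cmd = ["pandoc", "-f", "latex", "-t", "mediawiki"]
    if bibliography:
        # Same citation options as in the command line above.
        cmd += ["--metadata", "link-citations", f"--bibliography={bibliography}"]
    if csl:
        cmd.append(f"--csl={csl}")
    cmd += [str(source), "-o", str(target)]
    return cmd


def convert(source, target, **options):
    """Run pandoc; raises CalledProcessError if pandoc reports an error."""
    subprocess.run(pandoc_command(source, target, **options), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.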
