Convert Chinese docx files to markdown with Pandoc - latex

I am batch converting documents that are primarily written in Chinese from docx to markdown using Pandoc within a PowerShell script. However, all Chinese characters are being converted to question marks ("?").
The command I am using is
pandoc.exe -f docx -t $converter-simple_tables-multiline_tables-grid_tables+pipe_tables -i $fullexportpath -o "$($fullfilepathwithoutextension).md" --wrap=none --atx-headers --extract-media="$($mediaPath)"
It works perfectly for English documents, so I presume I just need to add some sort of modifier to handle Chinese characters.
I have seen various posts about needing to use --latex-engine=xelatex and/or -V CJKmainfont="Font Name" (I've been using it with "Microsoft YaHei", the font in Word), but no combination of these seems to make a difference (in fact, using --latex-engine at all breaks the conversion). It seems that they are for converting to PDF.
Any suggestions?
Edit: the issue was not with Pandoc but with the script I was using that did some find and replace conversion. Works fine with English but not Chinese characters.
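The edit above does not say what exactly went wrong in the script, but a common cause of "?" output is a find-and-replace step that reads or writes the file with a non-Unicode default encoding. A minimal sketch in Python, assuming the post-processing step is a plain string replacement on the generated .md file (the function name replace_in_file is hypothetical, not part of the original script):

```python
from pathlib import Path


def replace_in_file(path, old, new, encoding="utf-8"):
    """Replace `old` with `new` in a text file, reading and writing UTF-8.

    Passing the encoding explicitly avoids falling back to the platform's
    default code page, which may be unable to represent Chinese text.
    """
    p = Path(path)
    text = p.read_text(encoding=encoding)
    p.write_text(text.replace(old, new), encoding=encoding)
```

In a PowerShell script, the analogous precaution would presumably be to pass -Encoding UTF8 to both Get-Content and Set-Content.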

Related

HTML homework created with pandoc from LaTeX includes the answers. Oops

I have a fairly large number (dozens) of homeworks written in LaTeX that I compile to PDF. We have recently adopted an LMS (Canvas) that needs HTML files. Conversion to HTML is easy peasy with pandoc, using the following command.
pandoc -s core3-hw08.tex --mathjax -c test.css -o core3hw08.html
Unfortunately, I included the homework answers following \end{document} in the tex files. This works great with PDF because my command (pdflatex) ignores everything after the end of the document. Pandoc doesn't, and the result is that the homework answers are converted to HTML along with the homework questions. Oops.
I could create an entire additional set of homework tex files without the answers, but that seems a bad solution --- having two identical sets of files with the only difference that one set has the answers and the other doesn't. I also want to keep the answers and questions in the same source document.
Is there any way to tell pandoc to ignore everything after \end{document}? I don't mind revising the source documents, which can be done with a script.
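Since revising the sources with a script is acceptable, one option is to cut each file at \end{document} before it reaches pandoc. A minimal sketch in Python, assuming the marker appears literally exactly where the document ends (a \end{document} inside a comment or verbatim block would also match); the function and script names are hypothetical:

```python
import sys

END_DOCUMENT = r"\end{document}"


def strip_after_end_document(tex):
    """Return `tex` truncated just after the first end-of-document marker.

    If the marker is absent, the input is returned unchanged.
    """
    idx = tex.find(END_DOCUMENT)
    if idx == -1:
        return tex
    return tex[: idx + len(END_DOCUMENT)] + "\n"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print the truncated source so it can be piped into pandoc.
    with open(sys.argv[1], encoding="utf-8") as handle:
        sys.stdout.write(strip_after_end_document(handle.read()))
```

Saved as, say, strip_answers.py, the output could be piped to pandoc, which reads standard input when no input file is given: python strip_answers.py core3-hw08.tex | pandoc -f latex -s --mathjax -c test.css -o core3hw08.html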

Pandoc not generating new lines in markdown with latex

I am working on a .md file which includes latex. The file looks like this:
$$
1+1 = 2
\\
2+2 = 4
$$
The File:
When viewing it as markdown the file looks perfectly fine with the new line properly added.
Although when I use pandoc to write the file to a pdf the following happens:
PDF File (from pandoc)
As you can see the new line has been completely removed and makes the latex hard to read.
I am using the following pandoc command:
pandoc --wrap=preserve in.md -o out.pdf
The --wrap=preserve option does not seem to help, as the line break is still ignored. I have also tried \newline and \linebreak instead of \\, and neither works.
How can I specify a line break so pandoc will make sure to keep the breaks rather than keeping everything inline?
Whatever file preview tool you are using: it is lying to you. Double-backslash is not the correct way to insert newlines into math.
Pandoc either parses the math and converts it into the target format, or just passes the code through, depending on the output format. For PDF output via LaTeX, the equation is just passed through. (You can check by running pandoc with --verbose, which, among other things, will print the generated raw LaTeX code.) So it is clear that the problem lies with the input.
There are multiple ways to add linebreaks into math in LaTeX. One of them is the align* environment:
\begin{align*}
1+1 = 2
\\
2+2 = 4
\end{align*}
This will give you the expected PDF output, but has the downside of not showing up in other output formats like HTML. I'm not aware of any method which would produce linebreaks in math equations across all possible pandoc output formats. You'll have to use multiple single-line equations if you need that.
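The last suggestion, using multiple single-line equations, could be automated before calling pandoc. A rough sketch in Python, assuming every \\ inside a $$ ... $$ block is meant as a line break (an assumption: it would wrongly split matrices, cases, or aligned environments that use \\ internally); the function name is hypothetical:

```python
import re

# Matches a display-math block delimited by $$ ... $$ (non-greedy, across lines).
DISPLAY_MATH = re.compile(r"\$\$(.*?)\$\$", re.DOTALL)


def split_display_math(markdown):
    """Split each $$...$$ block at LaTeX line breaks into one block per line."""

    def _split(match):
        # "\\\\" is the Python literal for the two-character LaTeX line break.
        parts = [p.strip() for p in match.group(1).split("\\\\")]
        return "\n\n".join(f"$$\n{p}\n$$" for p in parts if p)

    return DISPLAY_MATH.sub(_split, markdown)
```

Applied to the example from the question, this yields two separate display equations, one for 1+1 = 2 and one for 2+2 = 4.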

Converting Asciidoc to LaTeX

I want to convert Asciidoc to LaTeX, then use an existing toolchain that includes LaTeX modules to convert the resulting document further to the final format. Asciidoc's native LaTeX conversion is "experimental" according to their documentation, and it also doesn't work for me. There is another toolchain supported by Asciidoc, which is converting to Docbook first, then use dblatex to convert it further. However, it includes a lot of formatting in its LaTeX output, which clashes with the formatting of my toolchain.
Is there any way to convert Asciidoc to LaTeX in a way that the content is included in the resulting document, but without any exact formatting rules (except those explicitly specified in the document)? I don't want the LaTeX result to contain any information about fonts, page layout and so on, because for those I already have a toolchain.
I get acceptable, almost good results with this toolchain, which uses pandoc as the converter:
edit your document in asciidoc or asciidoctor
convert your document to docbook: asciidoctor -b docbook5 your asciidoc document.
convert your docbook document to (xe)latex using pandoc: pandoc -f docbook your docbook document --pdf-engine=xelatex
You can customize your latex layout and modules in a pandoc configuration file or convert your docbook file into a latex file with pandoc. The converted latex file is quite clean (because its source is docbook).
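The two conversion steps can be scripted. A minimal sketch in Python, assuming asciidoctor and pandoc are on the PATH; the function names and the output file naming are illustrative choices, not part of the answer above. It writes a .tex file rather than a PDF, and omits pandoc's -s/--standalone so that only a LaTeX fragment without preamble is produced:

```python
import subprocess
from pathlib import Path


def build_commands(adoc_path):
    """Return the two command lines for AsciiDoc -> DocBook -> LaTeX."""
    adoc = Path(adoc_path)
    docbook = adoc.with_suffix(".xml")
    tex = adoc.with_suffix(".tex")
    return [
        # Step 1: AsciiDoc to DocBook 5 with Asciidoctor.
        ["asciidoctor", "-b", "docbook5", "-o", str(docbook), str(adoc)],
        # Step 2: DocBook to a LaTeX fragment. Without -s/--standalone,
        # pandoc emits only the body, with no preamble or document class.
        ["pandoc", "-f", "docbook", "-t", "latex", "-o", str(tex), str(docbook)],
    ]


def convert(adoc_path):
    """Run both steps, stopping at the first failure."""
    for cmd in build_commands(adoc_path):
        subprocess.run(cmd, check=True)
```

The resulting fragment can then be pulled into an existing LaTeX toolchain with \input, leaving fonts and page layout to that toolchain.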

Tilde over n when converting from markdown to latex with pandoc

I have a markdown document that I convert to PDF via pandoc's latex engine. I'm trying to render an n with a tilde over it, as in "niño", with markdown like the following:
ni\~{n}o
...but this just gets rendered in the PDF as "ni~no" -- i.e. the tilde gets interpreted literally. I've also tried escaping the backslash (ni\\~{n}o), surrounding everything in brackets (ni{\~{n}}o), and basically what I think is every possible combination of escaping characters in this sequence, but nothing works. It also fails even when the sequence is on its own (i.e. \~{n}).
But, other similar sequences that are based on letters rather than symbols work just fine (e.g. Otter\r{a} gets rendered correctly to "Otterå"). Pandoc is specifically failing to handle the tilde (or maybe more generally non-letter-based latex character sequences -- I haven't tested others).
The command I'm using to build the pdf is pandoc file.md -o file.pdf. I've also tried specifying -f markdown+raw_tex, but it still fails (nor should I need to, since the \r{a} works without it, and I think raw_tex is enabled by default anyway).
Any thoughts? I know I can use xetex to just enter these characters directly, but that's not really a satisfying solution...
Besides using the ñ character directly (which apparently works in native Pandoc because it's magic!), an alternative is to create a simple LaTeX \newcommand for forcing native TeX interpretation.
\newcommand{\tex}[1]{#1}
ni\tex{\~n}o
Thanks to John MacFarlane for introducing me to this clever workaround!

How to handle citations in IPython Notebook?

What is the best way to take care of citations in IPython Notebook? Ideally, I would like to have a bibtex file, and then, as in latex, have a list of shorthands in IPython markdown cells, with the full references at the end of the notebook.
The relevant material I found is this: http://nbviewer.ipython.org/github/ipython/nbconvert-examples/blob/master/citations/Tutorial.ipynb
But I couldn't follow the documentation very well. Can anyone explain it? Thanks so much!!
Summary
This solution is largely based on Sylvain Deville's excellent blog post. It allows you to simply write [@citation_key] in markdown cells. The references will be formatted after document conversion. The only requirements are LaTeX and pandoc, which are both widely supported. While there is never a guarantee, this approach should therefore still work in many years' time.
Step-by-Step Guide
In addition to a working installation of jupyter you need:
LaTeX (installation guide).
Pandoc (installation guide).
A citation style language. Download a citation style, e.g., APA. Save the .csl file (e.g., apa.csl) into the same folder as your jupyter notebook (or specify the path to the .csl file later).
A .bib file with your references. I am using a sample bib file listb.bib. Save to the same folder as your jupyter notebook (or specify the path to the .bib file later).
Once you completed these steps, the rest is easy:
Use markdown syntax for references in markdown cells in your jupyter notebook. E.g., [@Sh:1], where the syntax works like this: [@citationkey_in_bib_file]. I much prefer this syntax over other solutions because it is so fast to type [@something].
At the end of your ipython notebook, create a code cell with the following syntax to automatically convert your document (note that this is R code, use an equivalent command to system() for python):
# automatic document conversion to markdown and then to word
# first convert the ipython notebook paper.ipynb to markdown
system("jupyter nbconvert --to markdown paper.ipynb")
# next convert markdown to ms word
# (file names are left unquoted so that the R strings stay well-formed)
conversion <- paste0("pandoc -s paper.md -t docx -o paper.docx",
                     " --filter pandoc-citeproc",
                     " --bibliography=listb.bib",
                     " --csl=apa.csl")
system(conversion)
Run this cell (or simply run all cells). Note that the 2nd system call is simply pandoc -s paper.md -t docx -o paper.docx --filter pandoc-citeproc --bibliography=listb.bib --csl=apa.csl. I merely used paste0() to be able to spread this over multiple lines and make it nicer to read.
The output is a word document. If you prefer another document, check out this guide for alternative syntax.
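For a Python kernel, the equivalent of the two system() calls can be written with subprocess. A minimal sketch, assuming jupyter, pandoc, and pandoc-citeproc are installed and using the file names from the R snippet above; the function names are hypothetical:

```python
import subprocess


def citation_commands(notebook="paper.ipynb", bib="listb.bib", csl="apa.csl"):
    """Return the nbconvert and pandoc command lines as argument lists."""
    stem = notebook.rsplit(".", 1)[0]
    return [
        # First convert the notebook to markdown.
        ["jupyter", "nbconvert", "--to", "markdown", notebook],
        # Then convert the markdown to Word, resolving the citations.
        [
            "pandoc", "-s", f"{stem}.md", "-t", "docx", "-o", f"{stem}.docx",
            "--filter", "pandoc-citeproc",
            f"--bibliography={bib}",
            f"--csl={csl}",
        ],
    ]


def convert(**kwargs):
    """Run both conversions, stopping at the first failure."""
    for cmd in citation_commands(**kwargs):
        subprocess.run(cmd, check=True)
```

Passing the arguments as lists avoids shell quoting problems with file names. Newer pandoc releases (2.11 and later) build citation processing in as the --citeproc option, so on those versions the --filter pandoc-citeproc pair would likely need to be replaced by --citeproc.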
Extras
If you do not like that your converted document includes the code for the document conversion, insert a markdown cell above and below the code cell that performs the conversion. In the cell above, enter <!-- and in the cell below enter -->. This is the regular HTML syntax for a comment, so the code in between these two cells will be evaluated but not printed.
You can also include a yaml header in your first cell. E.g.,
---
title: This is a great title.
author: Author Name
abstract: This is a great abstract
---
You can use the Document Tools of the Calico suite, which can be installed separately with:
sudo ipython install-nbextension https://bitbucket.org/ipre/calico/downloads/calico-document-tools-1.0.zip
Read the tutorial and watch the YouTube video for more details.
Warning: only the cited references are processed. Therefore, if you fail to cite an article, it won't appear in the References section. As a little working example, copy the following in a Markdown cell and press the "book" icon.
<!--bibtex
@Article{PER-GRA:2007,
Author = {P\'erez, Fernando and Granger, Brian E.},
Title = {{IP}ython: a System for Interactive Scientific Computing},
Journal = {Computing in Science and Engineering},
Volume = {9},
Number = {3},
Pages = {21--29},
month = may,
year = 2007,
url = "http://ipython.org",
ISSN = "1521-9615",
doi = {10.1109/MCSE.2007.53},
publisher = {IEEE Computer Society},
}
@article{Papa2007,
author = {Papa, David A. and Markov, Igor L.},
journal = {Approximation algorithms and metaheuristics},
pages = {1--38},
title = {{Hypergraph partitioning and clustering}},
url = {http://www.podload.org/pubs/book/part\_survey.pdf},
year = {2007}
}
-->
Examples of citations: [CITE](#cite-PER-GRA:2007) or [CITE](#cite-Papa2007).
This should result in the following added Markdown cell:
References
^ Pérez, Fernando and Granger, Brian E.. 2007. IPython: a System for Interactive Scientific Computing. URL
^ Papa, David A. and Markov, Igor L.. 2007. Hypergraph partitioning and clustering. URL
I was able to run it with the following approach:
Insert the html citation as in the tutorial you mentioned.
Create ipython.bib in the "standard" bibtex format. It goes into the same directory as your *.ipynb notebook file.
Create the template file as in the tutorial, also in the same directory or else in the (distribution dependent) directory with the other templates. On my system, that's /usr/local/lib/python2.7/dist-packages/IPython/nbconvert/templates/latex.
The tutorial has the template extend latex_article.tplx. On my distribution, it's article.tplx (without latex_).
Run nbconvert with --to latex; that generates an .aux file among other things. Latex will complain about missing references.
Run bibtex yournotebook.aux; this generates yournotebook.bbl. You only need to re-run this if you change references.
Re-run nbconvert either with --to latex or with --to pdf. This generates a .tex file, or else runs all the way to a .pdf.
If you want html output, you can use pandoc to assemble the references into a tidy citation page. This may require some hand-editing to make an html page you can reference from your main document.
If you know that you will be converting your notebook to latex anyway, consider simply adding a "Raw" cell (Ctrl+M R) to the end of the document, containing the bibliography just as you would put it in pure LaTeX.
For example, when I need to reference a couple of external links, I would not even care to do a proper BibTeX thing and simply have a "Raw" cell at the end of the notebook like that:
\begin{thebibliography}{1}
\bibitem{post1}
Holography in Simple Terms. K.Tretyakov (blog post), 2015.\\
\url{http://fouryears.eu/2015/07/24/holography-in-simple-terms/}
\bibitem{book1}
The Importance of Citations. J. Smith. 2010.
\end{thebibliography}
The items can be cited in other Markdown cells using the usual <cite data-cite="post1">(KT, 2015)</cite> syntax.
Of course, you can also use proper BibTeX as well. Just add the corresponding Raw cell, e.g:
\bibliographystyle{unsrt}
\bibliography{papers}
This way you do not have to bother editing a separate template file (at the price of cluttering the notebook's HTML export with raw LaTeX, though).
You should have a look at the latex_envs extension in https://github.com/ipython-contrib/IPython-notebook-extensions (install from this repo, it is the most recent version). This extension contains a way to integrate bibliography using bibtex files and standard latex notation, and generates a bibliography section at the end of the notebook. Style of citations can be (to some extent) customized. Some documentation here https://rawgit.com/jfbercher/latex_envs/master/doc/latex_env_doc.html

Resources