I've recently taken on a project of document conversion to HTML. That is, a client gives me a .DOC file, and I need to convert the contents to one long HTML file - no styling, no CSS, just clean HTML with paragraph tags, header tags, etc.
I found an application that does a pretty good job of automating the first part of it. The problem is that I need to do some advanced find and replace based on strings using variables.
For instance, I have footnotes that were converted properly. They're currently displayed as superscript numbers with the
I'd like to change how the footnote is displayed. Instead of a superscript number 6 for the 6th footnote, I'd like it to show (Note 6).
To do that on the entire document (hundreds of footnotes), I'm wondering if I can do something like:
FIND:
<sup><a name="FN[0-9]" href="FNR[0-9]">[0-9]</a></sup>
REPLACE:
<a name="FN%1" href="FNR%2">(Note %3)</a>
The problem is, I can't find a Find and Replace tool that lets me maintain the variables in the replace area. All I get is the superscript 6 appearing as (Note %3), as well as every other footnote doing the same thing.
Anyone have any ideas on how I can accomplish my task efficiently?
In Perl it would look roughly like this on the command line (I have NOT tested this):
perl -i -p -e's{<sup><a name="(FN\d)" href="(FNR\d)">(\d)</a></sup>}{<a name="$1" href="$2">(Note $3)</a>}' filenames....
-i says "Edit this file in place", -p means "print each line after we do whatever is in the -e switch".
That's assuming you're only looking for a single digit where you have [0-9]. If you want to match FN427, then you change (FN\d) to (FN\d+), for example.
This also assumes that the HTML that you are parsing looks EXACTLY LIKE THAT. If you get some HTML that is <a href=... name=... (with the attributes in the opposite order from what you have) then it will break. In that case, you'll want to use an HTML parser.
I hope that gives you enough to start with.
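For comparison, here is a minimal sketch of the same capture-and-substitute idea in Python, using the standard re module. It assumes the markup looks exactly like the FIND pattern in the question and uses \d+ so multi-digit footnotes such as FN427 also match. The function name rewrite_footnotes is made up for illustration.

```python
import re

# Same pattern as the Perl one-liner, but with \d+ so that
# multi-digit footnote numbers (FN427) are matched too.
FOOTNOTE = re.compile(
    r'<sup><a name="(FN\d+)" href="(FNR\d+)">(\d+)</a></sup>'
)


def rewrite_footnotes(html: str) -> str:
    """Replace superscript footnote links with '(Note N)' links.

    \\1, \\2 and \\3 in the replacement refer to the three
    parenthesised groups in the pattern, like $1, $2, $3 in Perl.
    """
    return FOOTNOTE.sub(r'<a name="\1" href="\2">(Note \3)</a>', html)


if __name__ == "__main__":
    sample = 'Some text<sup><a name="FN6" href="FNR6">6</a></sup> and more.'
    print(rewrite_footnotes(sample))
```

As with the Perl version, this breaks if the attributes appear in a different order, so treat it as a starting point rather than a general HTML rewriter.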
I'm trying to convert latex code embedded in an HTML document (Intended to be used with a Javascript shim) into MathML. Pandoc seems like a great tool. Following this example: http://pandoc.org/demos.html,
pandoc input.html -s --latexmathml -o output.html
Produces no change in the file. I even made a barebones blank HTML file with various text expressions to test; no change in the output. What am I missing?
The site that Pandoc links to (http://math.etsu.edu/LaTeXMathML/) appears to show documentation for a standalone case, but it uses a JS shim instead of outputting the MathML directly. (I think it has the browser render dynamically generated MathML, but doesn't actually write it to the file.) It's also missing some basic functionality, like own-line equations with \begin{equation}.
I've spent several hours googling ways to accomplish this. Any ideas? The only fully working solution I've found is this website: https://www.mathtowebonline.com/. There's also a Python module called latex2mathml, but it's also missing large chunks of the spec.
You'll need the --mathml flag (not the --latexmathml flag) to generate MathML and the tex_math_dollars extension enabled for reading the math between dollar signs:
$ echo '<p>$$x = 4$$</p>' | pandoc -f html+tex_math_dollars -t html --mathml
<p>
<math display="block" xmlns="http://www.w3.org/1998/Math/MathML">
<semantics>
<mrow><mi>x</mi><mo>=</mo><mn>4</mn></mrow><annotation encoding="application/x-tex">x = 4</annotation>
</semantics>
</math>
</p>
Or maybe you're better off using something like snuggleTeX or LaTeXMathML.js...
I have a code block in an org document
#+NAME: result_whatever
#+BEGIN_SRC python :session data :results value :exports none
return(8.1 - 5)
#+END_SRC
which I evaluate inline:
Now, does this work? Let's see: call_result_whatever(). I'd be surprised ...
When exporting to LaTeX, this generates the following:
Now, does this work? Let's see: \texttt{3.1}. I'd be surprised \ldots{}
However, I don't want the results to be displayed in monospace. I want it to be formatted in "normal" upright font, without any special markup.
How can I achieve this?
You should be able to get it to work using the optional header arguments, which can be added to call_function().
I don't have LaTeX installed on this system, so I can't fully test the outputs to ensure they come out exactly as desired; I'm using the plain text output to compare instead. However, you can use the following syntax as part of your call to modify the results.
Now, does this work? Let's see call_result_whatever()[:results raw].
I'd be surprised ...
Without the [:results raw] the output to Plain Text (Ascii buffer) is Let's see `3.0999999999999996'.. With the added results it becomes Let's see 3.0999999999999996.
For full details of the available results keywords as well as other optional header arguments for the inline blocks please see Evaluation Code Blocks and Results arguments.
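The 3.0999999999999996 in the plain text output is ordinary binary floating-point behaviour rather than anything org-mode does. A small Python sketch of one way to tidy it up inside the source block is below; the helper name format_result and the choice of one decimal place are assumptions for illustration, not part of the original block.

```python
def format_result(value: float, digits: int = 1) -> str:
    """Format a float with a fixed number of decimal places.

    Returning a string avoids the long binary-float representation
    leaking into the exported document.
    """
    return f"{value:.{digits}f}"


if __name__ == "__main__":
    raw = 8.1 - 5
    print(repr(raw))           # the unrounded float
    print(format_result(raw))  # rounded to one decimal place
```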
This is 5 years later. Apparently, in org-mode 8.2 or so, a new variable was introduced (documented in "Evaluating Code Blocks" in the org-mode manual, but this is from etc/ORG-NEWS in the source tree):
*** New option: org-babel-inline-result-wrap
If you set this to the following
: (setq org-babel-inline-result-wrap "$%s$")
then inline code snippets will be wrapped into the formatting string.
so, to eliminate \texttt{}
(setq org-babel-inline-result-wrap "%s")
A problem of this type can be solved in two ways:
1: Easy does it:
A plain query-replace on the exported buffer.
Once you're in the LaTeX buffer,
beginning-of-buffer or M-<
query-replace or M-%
enter \texttt as the string that you want to replace
enter nothing as the replacement
continue to replace each match interactively
with y/n or just replace everything with !
2: But I wanna!
The second way is to nag the org-mode mailing list into
implementing a switch or an option for your specific case.
While it's necessary sometimes, it also produces a system
with thousands of switches, which can become unwieldy.
You can try, but I don't recommend it.
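If the first approach has to be repeated on every export, the same replacement can be scripted. The following is a rough Python sketch that removes \texttt{...} wrappers from exported LaTeX text and keeps their contents. It assumes the wrapped text contains no nested braces, and the name strip_texttt is invented for this example.

```python
import re

# Matches \texttt{...} as long as the braces hold no nested braces.
TEXTTT = re.compile(r'\\texttt\{([^{}]*)\}')


def strip_texttt(latex: str) -> str:
    """Drop every \\texttt{...} wrapper but keep the wrapped text."""
    return TEXTTT.sub(r'\1', latex)


if __name__ == "__main__":
    line = r"Now, does this work? Let's see: \texttt{3.1}. I'd be surprised \ldots{}"
    print(strip_texttt(line))
```

Unlike the interactive query-replace, this replaces every match without asking, so it will also strip any \texttt you wanted to keep.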
I'm currently using BlueCloth to process Markdown in Ruby and show it as HTML, but in one location I need it as plain text (without some of the Markdown). Is there a way to achieve that?
Is there a markdown-to-plain-text method? Is there an html-to-plain-text method that I could feed the result of BlueCloth to?
RedCarpet gem has a Redcarpet::Render::StripDown renderer which "turns Markdown into plaintext".
Copy and modify it to suit your needs.
Or use it like this:
Redcarpet::Markdown.new(Redcarpet::Render::StripDown).render(markdown)
Converting HTML to plain text with Ruby is not a problem, but of course you'll lose all markup. If you only want to get rid of some of the Markdown syntax, it probably won't yield the result you're looking for.
The bottom line is that unrendered Markdown is intended to be used as plain text, therefore converting it to plain text doesn't really make sense. All Ruby implementations that I have seen follow the same interface, which does not offer a way to strip syntax (only including to_html, and text, which returns the original Markdown text).
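To illustrate the HTML-to-plain-text route mentioned above, here is a minimal sketch. It is written in Python rather than Ruby and uses only the standard library's html.parser; the class name TextExtractor, the function html_to_text, and the set of tags treated as block-level are assumptions for this example. As noted, all markup is lost.

```python
from html.parser import HTMLParser

# Tags after which a line break is emitted (an illustrative choice).
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3",
              "h4", "h5", "h6", "blockquote", "pre"}


class TextExtractor(HTMLParser):
    """Collects text nodes and discards every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_endtag(self, tag):
        # Keep block-level elements on separate lines.
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


if __name__ == "__main__":
    print(html_to_text("<h1>Title</h1><p>Some <em>emphasised</em> text.</p>"))
```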
It's not ruby, but one of the formats Pandoc now writes is 'plain'. Here's some arbitrary markdown:
# My Great Work
## First Section
Here we discuss my difficulties with [Markdown](http://wikipedia.org/Markdown)
## Second Section
We begin with a quote:
> We hold these truths to be self-evident ...
then some code:
#! /usr/bin/bash
That's *all*.
(Not sure how to turn off the syntax highlighting!) Here's the associated 'plain':
My Great Work
=============
First Section
-------------
Here we discuss my difficulties with Markdown
Second Section
--------------
We begin with a quote:
We hold these truths to be self-evident ...
then some code:
#! /usr/bin/bash
That's all.
You can get an idea what it does with the different elements it parses out of documents from the definition of plainify in pandoc/blob/master/src/Text/Pandoc/Writers/Markdown.hs in the Github repository; there is also a tutorial that shows how easy it is to modify the behavior.
I'm using an ODT file as a kind of template and LibreOffice as the tool to create this template. It usually works fine except for one thing.
Let's assume our ODT file has a paragraph of text.
There is my text.
The XML file may or may not look like this (it seems random). It's messy, which is not good for parsing or for use as a template:
<text:p text:style-name="P7">There is</text:p><text:p text:style-name="P7"> my text<text:p text:style-name="P7">.</text:p></text:p>
Sometimes it's (again seems random) like this (expected result, makes sense after all):
<text:p text:style-name="P7">There is my text.</text:p>
Is there any way to get rid of superfluous XML tags? Or at least, can the user see the raw document in LibreOffice/OpenOffice to manually remove the redundancy?
The key is to provide an easy tool for the user to detect and fix artefacts like this.
Have you tried Ctrl-M? If all formatting is defined in styles and style formatting is not manually overridden, it should not disturb the formatting but should remove redundant tags.
A tedious user process would be to cut and paste-special as text and apply style again.
Finally, a macro would definitely do the trick.
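On the question of seeing the raw document: an .odt file is a ZIP archive, and the body text lives in a member called content.xml. A small Python sketch for pulling that XML out for inspection follows. It is written in Python rather than as a LibreOffice macro, and the function name read_content_xml is made up for this example.

```python
import zipfile


def read_content_xml(odt_file) -> str:
    """Return the raw content.xml of an ODT document as text.

    odt_file may be a path or a binary file-like object, as accepted
    by zipfile.ZipFile.
    """
    with zipfile.ZipFile(odt_file) as archive:
        return archive.read("content.xml").decode("utf-8")


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python show_odt.py template.odt
    print(read_content_xml(sys.argv[1]))
```

This only reads the XML; it does not detect or merge the redundant tags, which would still need a macro or a manual edit.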
I'd like to read the Rails 3 source code on printed paper (and preferably in color).
For example, xv6 did a nice job printing their code. It even has line numbers and an index. The only thing I would like to add is syntax highlighting.
Anyone know how any of this is possible?
Here are two possibilities I found:
1. The Listings Package (could this also generate other formats besides PDF, like HTML?)
ftp://ftp.tex.ac.uk/tex-archive/macros/latex/contrib/listings/listings.pdf
2. Highlight (does it do indexing?)
http://www.andre-simon.de/
We could even add a rake task, print, that generates an up-to-date PDF.
I recommend you use highlight to turn the code into LaTeX and then use Listings to make it into a PDF, then print!
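As a rough idea of what a "print" task could generate, here is a Python sketch that builds a LaTeX document using only the Listings package, with line numbers and simple colours. It covers the Listings half of the suggestion only, does not call highlight, and does not produce an index. The function name build_listing_document, the styling choices, and the example paths are all assumptions.

```python
def build_listing_document(paths, language="Ruby"):
    """Return LaTeX source that typesets each file with listings."""
    lines = [
        r"\documentclass{article}",
        r"\usepackage{xcolor}",
        r"\usepackage{listings}",
        # Line numbers on the left, basic colours for keywords/comments.
        r"\lstset{language=%s, numbers=left, breaklines=true," % language,
        r"  basicstyle=\ttfamily\small,",
        r"  keywordstyle=\color{blue}, commentstyle=\color{gray}}",
        r"\begin{document}",
    ]
    for path in paths:
        # One source file per listing, each starting on a new page.
        lines.append(r"\lstinputlisting{%s}" % path)
        lines.append(r"\clearpage")
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical paths; replace with the files you want to print.
    print(build_listing_document(["lib/example_one.rb", "lib/example_two.rb"]))
```

The output can be saved to a .tex file and compiled with pdflatex from the directory that the paths are relative to.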