I am trying to extract data from this Japanese PDF using tabula-py (and tabula-java), but the output is gibberish. In both tabula-py and tabula-java, the output isn't human readable (definitely not Japanese characters), and there are no error or warning messages. It does seem that the content of the PDF is processed, though.
When using the standalone Tabula tool, the characters are encoded properly:
I searched the tabula-py and tabula-java documentation and found the suggestions below, but neither changes the output:
Setting -Dfile.encoding=utf8 (in the java call for tabula-py or tabula-java)
Setting chcp 65001 (in Windows command prompt)
I understand Tabula and tabula-java (and tabula-py) use the same library, but is there something different between the two that would explain the difference in encoding output?
Background info
There is nothing unusual in this PDF compared to any other.
As in any PDF, the text is stored in whatever order the author's tools wrote it, so for example the first line of the body (港区内認可保育園等一覧) is the 1262nd block of text, added long after the table was started. To hear the written order you can use Read Aloud, which also verifies character and language recognition, but unless the PDF was correctly tagged it will jump from text block to text block.
So internally the text is rarely tabular. The first 8 lines are:
1 認可保育園
0歳 1歳 2歳3歳4歳5歳 計
短時間 標準時間
001010 区立
3か月
3455-
4669
芝5-18-1-101
Thus you need a text extractor that works in a grid-like manner or converts the text layout into row-by-row output.
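To illustrate what "row by row" means, here is a minimal Python sketch. It is not how Tabula or pdftotext work internally; it only assumes that an extractor has handed you text blocks with page coordinates. The Block type, the coordinates, and the y_tolerance value are all hypothetical, and y is assumed to grow downward from the top of the page.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """A text block with its top-left position (hypothetical units)."""
    x: float
    y: float
    text: str


def blocks_to_rows(blocks: List[Block], y_tolerance: float = 2.0) -> List[List[str]]:
    """Group blocks whose y positions are close into rows, left to right."""
    rows: List[List[Block]] = []
    for block in sorted(blocks, key=lambda b: (b.y, b.x)):
        # Same row if the block is within y_tolerance of the row's first block.
        if rows and abs(rows[-1][0].y - block.y) <= y_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    return [[b.text for b in sorted(row, key=lambda b: b.x)] for row in rows]


# Blocks listed in "written" order, which is not the visual order.
# The coordinates are invented for illustration.
blocks = [
    Block(300, 50, "区立"),
    Block(10, 50, "001010"),
    Block(10, 10, "港区内認可保育園等一覧"),
    Block(150, 51, "芝5-18-1-101"),
]
rows = blocks_to_rows(blocks)
```

On a page as dense as this one, the tolerance has to be tuned, and cells that wrap over several lines (such as the split phone number above) will still need cleaning afterwards.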
This is where extractors get confounded: it is unclear how to output such a jumbled, dense layout, and generally all of them will struggle with this page.
Hence it's best to use a good generic solution. The output will still need data cleaning, but at least you will have something to work on.
If you only need a zone from the page, it is best to set the boundary of interest to avoid extraneous parsing.
Your "standalone Tabula tool" output is very good, but it could possibly be improved by using pdftotext -layout and adjusting some options to produce a more regular order.
Your Question
the difference in encoding output?
The Answer
The text inside a PDF is not stored in the encoding you want to see. The desired output is UTF-8, but a PDF does not store text as UTF-8 or Unicode; it simply stores numbers that index a font character map. If the map is poor, everything comes out as gibberish. In this case the map is good, so where does the gibberish arise? It arises at the output stage: that part is not using UTF-8, and console output is rarely Unicode.
You correctly show that the console needs to be set to Unicode mode; then the output should match (except for the density problem).
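As a small illustration of where the gibberish comes from, here is a Python sketch using only the standard library. The sample string is the heading from the PDF; the choice of cp1252 as the "wrong" console code page is an assumption made for the example. The same bytes are readable or garbled depending only on how the output stage decodes them, and writing to a file with an explicit encoding sidesteps the console altogether.

```python
import os
import tempfile

text = "港区内認可保育園等一覧"

# The bytes a UTF-8 aware extractor would emit.
utf8_bytes = text.encode("utf-8")

# A console that assumes a legacy code page (cp1252 here, as an assumption)
# interprets the very same bytes as unrelated characters.
garbled = utf8_bytes.decode("cp1252", errors="replace")

# Decoding with the matching encoding recovers the original text.
recovered = utf8_bytes.decode("utf-8")

# Writing to a file with an explicit encoding avoids the console code page.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "w", encoding="utf-8") as fh:
    fh.write(text)
with open(path, encoding="utf-8") as fh:
    round_tripped = fh.read()

# Only booleans are printed, so this runs on any console.
print(recovered == text, round_tripped == text, garbled == text)
```

If the extracted text looks right in a UTF-8 file opened in an editor but wrong in the terminal, the extractor is fine and the console is the part to fix.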
The density issue would be easier to handle if preprocessed in a flowing format such as HTML
or using a different language
So I have to write my lab report in Italian for my lab class. In class they taught us how to use gnuplot to create graphs, so I'm using it to produce our graphs, which I then need to put in my LaTeX document. The problem is that I have to set the label on the y axis to "velocità", and when I save the file as PostScript and convert it to PDF, the 'à' disappears or is replaced by something else. What I've tried is variations of the commands
set encoding iso_8859_1
set ylabel "velocit\340"
Then I saved the plot using set term postscript color, set output "graf.ps", and replot, and converted it to PDF from the WSL terminal using ps2pdf. When I open the PDF, the letter 'à' no longer appears, even though it did show in the graph previously displayed by gnuplot. What should I do? Alternatively, is there another way to include the original graph in my LaTeX document?
Gnuplot provides several LaTeX-friendly terminal types. Postscript is not one of them. Postscript's character encodings are idiosyncratic at best. If your goal is to include gnuplot output in latex, then choose a terminal type that is designed for it. Some terminal types (e.g. cairolatex) work only with latex because they depend on latex to do all the text processing. Others (e.g. pdf, png, tikz) produce output that is fully compatible with latex but already has the text embedded in it. It is best to use UTF-8 encoding for everything, including your accented characters. For example:
set term pdf size 7cm,5cm
set output 'myfigure.pdf'
set encoding utf8
set ylabel "velocità"
set xlabel "tempo"
plot [0:10] x**2 title "velocità"
Then in your latex document, something like:
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
...
My TeX document.
\begin{figure}[h]
\includegraphics{myfigure}
\end{figure}
...
I'm playing around with ANSI escape sequences, e.g.
echo -e "\e[91mHello\e[m"
on a Linux console to display colored text.
Now I try to use superscript and subscript output like a=b².
I read here and here about Partial Line Down (subscript) and Partial Line Up (superscript), but I'm not sure about the exact syntax or which terminal clients might support this.
Any suggestions about this?
Possibly some commercial product supports it, but it's not supported by any terminal emulator you'll encounter (unless someone modifies one just to prove a point).
The standard describes possible escape sequences, but there is no requirement that any given sequence is supported by any terminal. There are commonly supported (and assumed) sequences such as clearing the screen, but even for that, not all terminals have supported the feature.
The reason is that terminal emulators are generally used with applications (such as text editors) which assume a regular set of rows/columns, and that the text is shown compactly (no extra space such as would be needed to allow for partial line movement). Back in the day when people used typewriters, it was common to have 1.5 or 2.0 line-spacing, and get no more than 33 lines on a page. That changed, long ago.
The need for subscripts/superscripts didn't go away: Unicode provides a usable set of characters with that representation (see Superscripts and Subscripts, Range: 2070–209F).
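As a minimal sketch of the Unicode approach (in Python rather than shell; the helper names superscript and subscript are made up for this example), digits can be mapped to their superscript or subscript code points and printed like any other text. Note that superscript 1, 2 and 3 sit in the Latin-1 Supplement block rather than in the 2070–209F range. Whether the glyphs render depends on the terminal font.

```python
# Superscript 1, 2, 3 live in Latin-1 Supplement (U+00B9, U+00B2, U+00B3);
# superscript 0 and 4..9 are in the Superscripts and Subscripts block.
SUP_DIGITS = "\u2070\u00b9\u00b2\u00b3" + "".join(chr(0x2074 + i) for i in range(6))
# Subscript digits are contiguous: U+2080..U+2089.
SUB_DIGITS = "".join(chr(0x2080 + i) for i in range(10))

TO_SUP = str.maketrans("0123456789", SUP_DIGITS)
TO_SUB = str.maketrans("0123456789", SUB_DIGITS)


def superscript(digits: str) -> str:
    """Replace ASCII digits with their Unicode superscript forms."""
    return digits.translate(TO_SUP)


def subscript(digits: str) -> str:
    """Replace ASCII digits with their Unicode subscript forms."""
    return digits.translate(TO_SUB)


formula = "a=b" + superscript("2")
water = "H" + subscript("2") + "O"

# ANSI color still works alongside these characters.
print("\x1b[91m" + formula + "\x1b[m", water)
```

Only the characters Unicode defines are available this way (digits, a few operators, and a limited set of letters), so arbitrary text cannot be raised or lowered.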
Further reading:
Your New Royal Portable (1953).
Line Spacing - Butterick's Practical Typography
console_codes - Linux console escape and control sequences
I'm currently trying to save a stress vs. strain curve using Octave. On this plot, I want to include text showing the equation for calculating engineering stress and engineering strain. Both of these require greek letters (\sigma and \epsilon respectively) as well as subscripts for the formulae.
Currently, using print with -deps, -dpng, or any other device, it creates a file, however the greek letters appear as the words "sigma" and "epsilon", and wherever I have a subscript, such as 0, it just appears as "_0". This looks very unprofessional.
Since I'm generating some 25 graphs, I don't want to have to go through and do a screenshot for each one. Does octave support saving the generated figure as displayed? I intend to use the generated files in a LaTeX document later (preferably as png so I can email them separately too).
I've also tried changing the "graphics_toolkit" option between fltk and gnuplot however it doesn't seem to help.
Attached to this post is a screenshot of the desired results and the actual results.
I am currently "not allowed" to post images, so I'll link them:
http://i.imgur.com/Tjt5Ecn.png (screenshot, desired result) and http://i.imgur.com/SP3hekd.png (directly saved, actual result)
Does anyone know a good way to print a figure from Octave which includes greek characters and subscripts in the titles?
Since you plan to use your graph in a LaTeX document, generating the graphs with -depslatex and converting them to PDF is a good idea. (The results look slightly better than with direct -dpdflatex.)
With -depslatex, you can include Latex code in your figures that will be written to a separate tex file.
Note that you need to use double backslashes \\ to export a single backslash.
graphics_toolkit("gnuplot");
...
legend("$\\varepsilon$");
print(sprintf("graph%s_%d.eps", name, type), '-depslatex', '-S200,270', '-F:9');
system(sprintf("epstopdf graph%s_%d.eps", name, type));
On the LaTeX side, you then \input the tex file generated by Octave. On the plus side, since you need 25 graphs, you can automate this process on both the Octave and LaTeX sides.
\newcommand{\mygraph}[1]{%
\graphicspath{{./figures/}}
\resizebox{0.495\linewidth}{!}{\relscale{1.0}\small%
\input{./figures/#1.tex}
}%
}
\mygraph{graph1_1}
Here, a Latex command \mygraph is defined to scale and include a figure located in a subfolder.
(I am using Octave 4.0.0 with gnuplot 4.4 on Ubuntu 12)
When I write math in LaTeX I often need to perform simple arithmetic on numbers in my LaTeX source, like 515.1544 + 454 = ???.
I usually copy-paste the LaTeX code into Google to get the result, but I still have to manually change the syntax, e.g.
\frac{154,7}{25} - (289 - \frac{1337}{42})
must be changed to
154,7/25 - (289 - 1337/42)
It seems trivial to write a program to do this for the most commonly used operations.
Is there a calculator which understands this syntax?
EDIT:
I know that doing this perfectly is impossible (because of the halting problem). Doing it for the simple cases I need is trivial. \frac, \cdot, \sqrt and a few other tags would do the trick. The program could just return an error for cases it does not understand.
WolframAlpha can take input in TeX form.
http://blog.wolframalpha.com/2010/09/30/talk-to-wolframalpha-in-tex/
The LaTeXCalc project is designed to do just that. It will read a TeX file and do the computations. For more information check out the home page at http://latexcalc.sourceforge.net/
The calc package allows you to do some calculations in source, but only within commands like \setcounter and \addtolength. As far as I can tell, this is not what you want.
If you already use sage, then the sagetex package is pretty awesome (if not, it's overkill). It allows you to get nicely formatted output from input like this:
The square of
$\begin{pmatrix}
1 & 2 \\
3 & 4
\end{pmatrix}$
is \sage{matrix([[1, 2], [3,4]])^2}.
The prime factorization of the current page number is \sage{factor(\thepage)}
As Andy says, the answer is yes there is a calculator that can understand most latex formulas: Emacs.
Try the following steps (assuming vanilla emacs):
Open emacs
Open your .tex file (or activate latex-mode)
position the point somewhere between the two $$ or e.g. inside the begin/end environment of the formula (or even matrix).
use calc embedded mode for maximum awesomeness
So with point in the formula you gave above:
$\frac{154,7}{25} - (289 - \frac{1337}{42})$
press C-x * d to duplicate the formula in the line below and enter calc-embedded mode which should already have activated a latex variant of calc for you. Your buffer now looks like this:
$\frac{154,7}{25} - (289 - \frac{1337}{42})$
$\frac{-37651}{150}$
Note that the fraction has already been transformed as far as possible. Doing the same again (C-x * d) and pressing c f to convert the fraction into a floating point number yields the following buffer:
$\frac{154,7}{25} - (289 - \frac{1337}{42})$
$\frac{-37651}{150}$
$-251.006666667$
I used C-x * d to duplicate the formula and then enter embedded mode in order to have the intermediate values, however there is also C-x * e which avoids the duplication and simply enters embedded mode for the current formula.
If you are interested, you should really have a look at the info page for Emacs Calc - Embedded Mode, and in general at the help for the GNU Emacs Calculator together with its awesome interactive tutorial.
You can run an R function called Sweave on a (mostly TeX with some R) file that can replace R expressions with their results in Tex.
A tutorial can be found here: http://www.scribd.com/doc/6451985/Learning-to-Sweave-in-APA-Style
My calculator can do that. To get the formatted output, double-click the result formula and press ctrl+c to copy it.
It can do fairly advanced stuff too (differentiation, easy integrals (and not that easy ones)...).
https://calculator-algebra.org/
A sample computation:
https://calculator-algebra.org:8166/#%7B%22currentPage%22%3A%22calculator%22%2C%22calculatorInput%22%3A%22%5C%5Cfrac%7B1%2B2%7D%7B3%7D%3B%20d%2Fdx(arctan%20(2x%2B3))%22%2C%22monitoring%22%3A%22true%22%7D
There is a way to do what you want just not quite how you describe.
You can use the fp package (\usepackage[options]{fp}). This floating-point package can do most of what you want: solving equations, adding, dividing and much more. Unfortunately it will not read LaTeX math; you have to do something a little different instead. The documentation is very poor, so I'll give an example here.
For instance, if you want to compute (2x3)/5 you would type:
\FPmul\p{2}{3} % \p is the assignment of the operation 2x3
\FPupn\p{\p{} 7 round} % upn evaluates the assignment \p and rounds to 7dp
\FPdiv\q{\p}{5} % divides the assigned value p by 5 names result q
\FPupn\q{\q{} 4 round} % rounds the result to 4 decimal places and evaluates
$\frac{2\times3}{5}=\FPprint\q$ % This will print the result of the calculations in the math.
The FP commands are always invisible; only \FPprint prints the result associated with it, so your documents will not be messy. FP commands can be placed wherever you wish (not verb), as long as they come before the associated \FPprint.
You could just paste it into symbolab which as a bonus has free step by step solutions. Also since symbolab uses mathquill it instantly formats your latex.
Considering that LaTeX itself is a Turing-complete markup language, I strongly doubt you can build something like this that isn't built directly into LaTeX. Furthermore, LaTeX math markup itself has next to no semantic meaning; it merely describes the visual appearance.
That being said, you can probably hack together something which recognizes a non-programmable subset of LaTeX math markup and spits out the result in the same way. If all you're interested in is simple arithmetics with fractions and integers (careful with decimal fractions, though, as they may appear as 3{,}141... in German texts :)) this shouldn't be too hard. But once you start with integrals, matrices, etc. I fear that LaTeX lacks expressiveness to accurately describe your intentions. It is a document preparation system, after all and thus not very suitable as input for computer algebra systems.
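As a rough sketch of such a hack, the Python below rewrites a small subset of LaTeX math into infix arithmetic and evaluates it exactly. The function names latex_to_infix and calc are invented for this example, only \frac, \cdot, \times, parentheses and the four basic operators are handled, and a comma between two digits is assumed to be a decimal comma (which would be wrong for thousands separators). Anything else is rejected with an error.

```python
import ast
import operator
import re
from fractions import Fraction

# Matches an innermost \frac{...}{...} (no nested braces in either argument).
FRAC = re.compile(r"\\frac\{([^{}]*)\}\{([^{}]*)\}")
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}


def latex_to_infix(tex: str) -> str:
    """Rewrite a tiny LaTeX math subset as a plain infix expression."""
    tex = tex.replace("{,}", ".")              # German-style 3{,}141
    tex = re.sub(r"(?<=\d),(?=\d)", ".", tex)  # assumed decimal comma: 154,7
    previous = None
    while previous != tex:                     # unwrap nested \frac inside-out
        previous = tex
        tex = FRAC.sub(r"((\1)/(\2))", tex)
    return tex.replace(r"\cdot", "*").replace(r"\times", "*").strip()


def evaluate(node: ast.AST) -> Fraction:
    """Evaluate a parsed arithmetic expression; reject everything else."""
    if isinstance(node, ast.Expression):
        return evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return Fraction(str(node.value))
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -evaluate(node.operand)
    raise ValueError("unsupported construct")


def calc(tex: str) -> Fraction:
    """Convert the LaTeX subset to infix and evaluate it as an exact fraction."""
    return evaluate(ast.parse(latex_to_infix(tex), mode="eval"))


result = calc(r"\frac{154,7}{25} - (289 - \frac{1337}{42})")
print(result, float(result))
```

Input containing macros the sketch does not know (for example \sqrt or \int) fails in ast.parse with a SyntaxError, which matches the "return an error for cases it does not understand" behaviour asked for in the question.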
Side note: You can switch to Word which has—in its current version—a math markup language which is sufficiently LaTeX-like (by now it even supports LaTeX markup) and yet still Google-friendly for simpler terms:
With the free Microsoft Math add-in you can even let Word calculate expressions in-place:
There is none, because it is generally not possible.
LaTeX math mode markup is presentational markup and there are cases in which it does not provide enough information to calculate the expression.
That was one of the reasons MathML content markup was created and also why MathML is used in Mathematica. MathML actually is sort of two languages in one:
presentation markup
content markup
To accomplish what you are after, you'll have to have MathML with combined presentation and content markup (see the MathML spec).
In my opinion your best bet is to use MathML (even if it is verbose) and convert to LaTeX when necessary. That said, I also like LaTeX syntax best and maybe what we need is a compact syntax for MathML (something similar in spirit to RelaxNG compact syntax).
For calculations with LaTeX you can use a CalcTeX package.
This package understands elements of the LaTeX language and performs calculations; for example, your problem is available at
http://sg.bzip.pl/CalcTeX/examples/frac.tgz
or just please write
\noindent
For calculation please use the following environments
$515.1544 + 454$
or
\[ \frac{154.7}{25}-(289-\frac{1337}{42.})
\]
or
\begin{equation}
154.7/25-(289-1337/42.)
\end{equation}
For more info, please visit the project web site or contact the author of this project.
For performing the math within your LaTeX itself, you might also look into the pgfmath package, which is more powerful and convenient than the calc package. You can find out how to use it from Part VI of The TikZ and PGF Packages Manual, which you can find here (version 2.10 currently): http://mirror.unl.edu/ctan/graphics/pgf/base/doc/generic/pgf/pgfmanual.pdf
Emacs calc-mode accepts latex-input. I use it daily. Press "d", followed by "L" to enter latex input mode. Press "'" to open a prompt where you can paste your tex.
Anyone saying it is not possible is wrong.
IIRC Mathematica can do it.
There is none, because it is generally not possible. LaTeX math mode
markup is presentational markup and there are cases in which it does
not provide enough information to calculate the expression.
You are right. LaTeX as it is does not provide enough information to perform any calculations. Moreover, it is not designed to carry that information. But nothing prevents you from writing text in LaTeX format that does contain such information.
It is a difficult path, because you need to build a system of rules, layered on top of LaTeX, for what the text must contain so that your interpreter can recognize it. And then you have to convince users that it is worth learning, and so on.
The easiest way is to create a logical and intuitive calculator for mathematical expressions; the expression can then be converted to LaTeX. It's almost like what you said. This is implemented in the program I pointed to. AnEasyCalc lets you type an expression as plain text, as you would in any text editor. It then checks the expression, calculates it, and generates the LaTeX string on its own. It's very easy and quick to work with. Just try it and you will see.
This is not exactly what you are asking for but it is a nice package
that you can include in a LaTeX document to do all kind of operations including arithmetic, calculus and even vectors and matrices:
The package name is "calculator"
http://mirror.unl.edu/ctan/macros/latex/contrib/calculator/calculator.pdf
The latex2sympy2 Python library can parse LaTeX math expressions.
from latex2sympy2 import latex2sympy
tex_str = r"""YOUR TEX MATH HERE"""
tex_str = r"\frac{9\pi}{\ln(12)}+22" # example TeX math
sympy_object = latex2sympy(tex_str)
evaluated_tex = float(sympy_object.evalf())
print(evaluated_tex)
This Python 3 code evaluates 9𝜋/ln(12)+22 (in its LaTeX form above) to 33.37842899841745.
The snippet above only handles basic algebraic simplification (math expressions without variables). Since the library converts LaTeX math to SymPy objects, the above code can easily be tweaked and extended to handle much more complicated LaTeX math (including solving derivatives, integrals, etc...).
The latex2sympy2 library can be installed via the pip command: pip install --user latex2sympy2
Try the AnEasyCalc program. It makes it very easy to get the LaTeX formula:
http://steamandwater.od.ua/AnEasyCalc/
:)