Special mark on wrapping a content using XSL-FO - xslt-2.0

I am using XSLT to transform my XML to FO output. For one particular element I set the attribute wrap-option to wrap, so that the text wraps in the output when it exceeds the line limit. It wraps properly in the output.
But I would like an additional feature: if the text is wrapped in the output, there should be some indication of the wrapping for the user, i.e., if a line is wrapped to the next line, a "+" symbol should appear at the end of the line wherever it is wrapped.
Sample input:
Testing the wrapped input specification for understanding the wrapping has happened.
Normal line without wrapping.
Again a lengthy line which exceeds the line limit.
Current output:
Testing the wrapped input specification
for understanding the wrapping has happened.
Normal line without wrapping.
Again a lengthy line which exceeds
the line limit.
Required output:
Testing the wrapped input specification+
for understanding the wrapping has happened.
Normal line without wrapping.
Again a lengthy line which exceeds+
the line limit.
How can I achieve this result?

If you are using AH Formatter, you can use the axf:line-continued-mark extension (https://www.antennahouse.com/product/ahf64/ahf-ext.html#line-continued-mark).
There's sample FO and PDF demonstrating how to use axf:line-continued-mark available in the 'Comprehensive XSL-FO Tutorials and Samples Collection' at https://www.antennahouse.com/antenna1/comprehensive-xsl-fo-tutorials-and-samples-collection/.
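For formatters without such an extension, one fallback is to pre-wrap the text before it reaches the formatter. The sketch below is in Python rather than XSLT and is only an illustration: it assumes a monospaced font and a fixed number of characters per line (the width of 45 is an assumption, not a value from the question), and the function name wrap_with_mark is hypothetical. It does not reproduce the formatter's own line breaking.

```python
import textwrap


def wrap_with_mark(line, width, mark="+"):
    """Wrap one line at `width` characters and append `mark` to every
    piece except the last, so each wrapped line ends in the mark."""
    pieces = textwrap.wrap(line, width=width) or [""]
    return [piece + mark for piece in pieces[:-1]] + [pieces[-1]]


if __name__ == "__main__":
    sample = [
        "Testing the wrapped input specification for understanding the wrapping has happened.",
        "Normal line without wrapping.",
    ]
    for line in sample:
        # width=45 is an assumed line limit, chosen only for illustration
        print("\n".join(wrap_with_mark(line, width=45)))
```

Lines pre-wrapped this way would presumably have to be emitted so that the formatter does not wrap them a second time.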

Related

Gibberish table output in tabula-java for Japanese PDF but works in standalone Tabula

I am trying to extract data from this Japanese PDF using tabula-py (and tabula-java), but the output is gibberish. In both tabula-py and tabula-java, the output isn't human readable (definitely not Japanese characters), and there are no error/warning messages. It does seem that the content of the PDF is processed though.
When using the standalone Tabula tool, the characters are encoded properly:
Searching the tabula-py and tabula-java documentation online, I found the suggestions below, but they don't change the output.
Setting the -Dfile.encoding=utf8 (in java call to tabula-py or tabula-java)
Setting chcp 65001 (in Windows command prompt)
I understand Tabula and tabula-java (and tabula-py) use the same library, but is there something different between the two that would explain the difference in encoding output?
Background info
There is nothing unusual in this PDF compared to any other.
The text, as in any PDF, is written in whatever order the author's tool emitted it, so for example the first line of the PDF body (港区内認可保育園等一覧) is the 1262nd block of text, added long after the table was started. To hear the written order we can use Read Aloud, which also verifies character and language recognition, but unless the PDF was correctly tagged it will jump from text block to text block.
So internally the text is rarely tabular. The first 8 lines are:
1 認可保育園
0歳 1歳 2歳3歳4歳5歳 計
短時間 標準時間
001010 区立
3か月
3455-
4669
芝5-18-1-101
Thus you need text extractors that work in a grid-like manner or convert the text layout into row-by-row output.
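As a rough illustration of converting a text layout into row-by-row output, the Python sketch below groups positioned text fragments into rows. It is not how Tabula or pdftotext work internally; the function name rows_from_fragments, the (x, y, text) tuple shape, the tolerance value, and the assumption that y grows downward are all hypothetical choices for this example.

```python
def rows_from_fragments(fragments, tolerance=2.0):
    """Group (x, y, text) fragments into rows.

    Fragments whose y differs from the row's first fragment by at most
    `tolerance` share a row; each row is then ordered left to right.
    Assumes y grows downward, so smaller y means higher on the page.
    """
    rows = []  # list of (row_y, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if rows and abs(rows[-1][0] - y) <= tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]
```

A dense page such as the one discussed here would still need a tolerance tuned by hand, and cleaning afterwards.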
This is where all extractors will be confounded as to how to output such a jumbled, dense layout, and generally ALL will struggle with this page.
Hence it's best to use a good generic solution. It will still need data cleaning, but at least you will have something to work on.
If you only need a zone from the page, it is best to set the boundary of interest to avoid extraneous parsing.
Your "standalone Tabula tool" output is very good but could possibly be better by using pdftotext -layout and adjusting some options to produce a more regular order.
Your Question
the difference in encoding output?
The Answer
The output from a PDF is not its internal coding. The desired text output is UTF-8, but PDF does not store the text as UTF-8 or Unicode; it simply uses numbers from a font character map. If the map is poor, everything would be gibberish; in this case, however, the map is good, so where does the gibberish arise? It arises because the output side is not using UTF-8, and console output is rarely Unicode.
You correctly show that the console needs to be set to Unicode mode; then the output should match (except for the density problem).
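The effect of a mismatched output encoding can be reproduced in a few lines of Python. This is only a sketch of the general mechanism, not of what tabula-java does: the helper names are hypothetical, and latin-1 is an assumed stand-in for whatever code page the console actually uses.

```python
def misdecode(text, wrong_encoding="latin-1"):
    """Encode text as UTF-8, then decode the bytes with another codec,
    imitating a consumer that does not expect UTF-8."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled, wrong_encoding="latin-1"):
    """Undo misdecode(): recover the original bytes and decode as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


title = "港区内認可保育園等一覧"
garbled = misdecode(title)  # unreadable, but no error is raised
restored = repair(garbled)  # the underlying bytes were intact all along
```

No exception is raised at any point, which fits the observation that the gibberish comes with no error or warning messages.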
The density issue would be easier to handle if preprocessed in a flowing format such as HTML
or using a different language

In Lua, what are the different forms to print?

I'm learning Lua, and I want to know the difference between print() and =, or between print() and io.write().
print is used for outputting text messages. It joins its arguments with a tab character, and automatically inserts a newline.
io.write is simpler. While it also accepts any number of arguments, it simply concatenates them (without inserting any characters) and doesn't add any newline. Think of it as file:write applied to the standard output.
These lines are equivalent:
io.write("abc")
io.write("a", "b", "c")
io.write("a") io.write("b") io.write("c")
I'd recommend using print for outputting normal text messages or for debugging, and io.write when you want to print a number of strings without concatenating them explicitly (using io.write saves more memory), want to write parts of a text separately, or need to output binary data via strings.
This short paragraph from "Programming in Lua" explains some differences:
21.1 The Simple I/O Model
Unlike print, write adds no extra characters to the output, such as
tabs or newlines. Moreover, write uses the current output file,
whereas print always uses the standard output. Finally, print
automatically applies tostring to its arguments, so it can also show
tables, functions, and nil.
There is also following recommendation:
As a rule, you should use print for quick-and-dirty programs, or for
debugging, and write when you need full control over your output
Essentially, io.write calls a write method using current output file, making io.write(x) equivalent to io.output():write(x).
And since print can only write data to the standard output, its usage is obviously limited. At the same time this guarantees that the message always goes to the standard output, so you don't accidentally mess up some file content, making it a better choice for debug output.
Another difference is in the return value: print returns nil, while io.write returns the file handle (in Lua 5.2 and later). This allows you to chain writes like this:
io.write('Hello '):write('world\n')

How to disable hyphenation for typewriter text in doxygen?

I found out that doxygen adds hyphenation hints for LaTeX when outputting the text of a "\c" command, like:
{\ttfamily on\-Ready\-State\-Change\-Listener}
I want to disable this behavior (so onReadyStateChangeListener won't be hyphenated). Is that possible and how?
No, this is not possible. Without hyphenation hints LaTeX will often run long identifiers off the page and into the margin, which is the reason why they were introduced.
If you really want to get rid of it have a look at the function filterLatexString() in src/utils.cpp and remove the if in the default case at the end of the function.
I found this is possible in Doxygen 1.8.9.1, using a small workaround job.
Create a custom header.tex file for use with Doxygen. (Instructions)
Find the line in the header.tex file that starts \newcommand{\+}. If you don't find that text, insert an empty line towards the top of the document.
Replace that line with the following text:
\newcommand{\+}{}
Use the header.tex file with your Doxygen output (Instructions)
This effectively disables all of the hyphenation marks that Doxygen adds to the words.
NOTES: This is for words with \+ added (e.g. D\+O\+X\+Y\+G\+E\+N). It may work for \- if you just substitute the minus sign into the steps above, but I have not verified that.
I found some identifiers to still be hyphenated after applying this, but in more reasonable places.
Also, do watch out for text running into the margins, as noted by #doxygen.
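A different route, not mentioned in the answers above, would be to post-process the generated .tex files and strip the hints from typewriter groups only. The Python sketch below is an assumption-laden illustration: the function name strip_tt_hyphenation is hypothetical, it only recognizes the {\ttfamily ...} form shown in the question, and it skips groups that contain nested braces.

```python
import re

# Matches {\ttfamily ...} groups that contain no nested braces.
_TT_GROUP = re.compile(r"\{\\ttfamily ([^{}]*)\}")
# Matches the hyphenation hints \- and \+ mentioned in this thread.
_HINT = re.compile(r"\\[-+]")


def strip_tt_hyphenation(tex):
    """Remove \\- and \\+ hints inside {\\ttfamily ...} groups only."""
    def clean(match):
        return "{\\ttfamily " + _HINT.sub("", match.group(1)) + "}"
    return _TT_GROUP.sub(clean, tex)
```

Text outside typewriter groups keeps its hints, so ordinary prose can still be hyphenated; the warning about text running into the margins applies here as well.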

LaTeX: Prevent line break in a span of text

How can I prevent LaTeX from inserting linebreaks in my \texttt{...} or \url{...} text regions? There's no spaces inside I can replace with ~, it's just breaking on symbols.
Update: I don't want to cause line overflows, I'd just rather LaTeX insert linebreaks before these regions rather than inside them.
\mbox is the simplest answer. Regarding the update:
TeX prefers overlong lines to adding too much space between words on a line; I think the idea is that you will notice the lines that extend into the margin (and the black boxes it inserts after such lines), and will have a chance to revise the contents, whereas if there was too much space, you might not notice it.
Use \sloppy or \begin{sloppypar}...\end{sloppypar} to adjust this behavior, at least a little. Another possibility is \raggedright (or \begin{raggedright}...\end{raggedright}).
Surround it with an \mbox{}
Also, if you have two subsequent words in regular text and you want to avoid a line break between them, you can use the ~ character.
For example:
As we can see in Fig.~\ref{BlaBla}, there is nothing interesting to see. A~better place..
This can ensure that you don't have a line starting with a figure number (without the Fig. part) or with an uppercase A.
Use \nolinebreak
\nolinebreak[number]
The \nolinebreak command prevents LaTeX from
breaking the current line at the point of the command. With the
optional argument, number, you can convert the \nolinebreak command
from a demand to a request. The number must be a number from 0 to 4.
The higher the number, the more insistent the request is.
Source: http://www.personal.ceu.hu/tex/breaking.htm#nolinebreak
Define myurl command:
\def\myurl{\hfil\penalty 100 \hfilneg \hbox}
I don't want to cause line overflows,
I'd just rather LaTeX insert linebreaks before
\myurl{\tt http://stackoverflow.com/questions/1012799/}
regions rather than inside them.

Maintaining page breaks

In my Rails app, I have a lot of data that is declared as text in the migration file. But when I print these attributes/fields out in the view, all the line breaks are lost and I get one large chunk of text. How do I maintain the line breaks?
This is an HTML problem. Its rules state that consecutive whitespace is converted to a single space.
Rails has a simple_format helper that wraps blocks of text in <p> tags (and turns single line breaks into <br /> tags), so you get the separation you want.
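To make the transformation concrete, here is a rough Python approximation of what such a helper does. It is not Rails' implementation: the function name simple_format_sketch is hypothetical, and it escapes HTML outright, which is a simplification of the sanitizing that Rails applies.

```python
import html
import re


def simple_format_sketch(text):
    """Wrap blank-line-separated blocks in <p> tags and turn single
    newlines into <br /> tags, after escaping any HTML in the text."""
    text = html.escape(text.replace("\r\n", "\n").strip())
    paragraphs = re.split(r"\n{2,}", text)
    return "\n\n".join(
        "<p>" + p.replace("\n", "<br />\n") + "</p>" for p in paragraphs
    )
```

In a Rails view the real helper would be called on the attribute directly, so a sketch like this is only useful for seeing which markup to expect.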
