R/exams: unicode chars in *.Rnw question files are not properly displayed: é displayed as <U+00E9> in final PDF - character-encoding

I am struggling to produce an exam sheet in French using exams2nops. There are accents in the text provided to the intro and title arguments of this function, and also in the Rnw files containing the questions. The former are correctly displayed in the resulting PDF, but not the latter: for example, é from an Rnw file is displayed as <U+00E9>.
The call to exams2nops looks like this:
exams2nops(file = myexam, n = N.students, dir = '.',
           name = paste0('exam-', exam.date),
           title = "Examen écrit",
           course = course.id,
           institution = "",
           logo = paste(exams.dir, 'input/logo.jpg', sep = '/'),
           date = exam.date,
           replacement = TRUE,
           intro = intro,
           blank = round(length(myexam)/4),
           duplex = TRUE, pages = NULL,
           usepackage = NULL,
           language = "fr",
           encoding = "UTF-8",
           startid = 1,
           points = c(1), showpoints = TRUE,
           samepage = TRUE,
           twocolumn = FALSE,
           reglength = 9,
           header = NULL)
Note that "Examen écrit" is correctly displayed in the final PDF; the problem is with the accents in the Rnw files. The function call yields no error.
The *.tex files generated by exams2nops already have the problem. For example, the sentence 'Quarante patients ont été inscrits' in the original Rnw file becomes 'Quarante patients ont <U+00E9>t<U+00E9> inscrits' in the .tex file.
I use exams 2.4-0 with R 4.2.2 and TeXShop 4.70 on macOS 11.6.
I checked that the Rnw files are UTF-8 encoded, for example:
$ file -I question1.Rnw
question1.Rnw: text/x-tex; charset=utf-8
They do seem to be UTF-8 encoded, indeed. These files were translated with DeepL or Google Translate, then edited in Emacs.
I tried setting the encoding parameter of exams2nops to latin-1. It did not help.
Any idea?

The problem disappeared after setting the R locales properly, a recurring problem with macOS installs of R. The symptom is:
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C"
2: Setting LC_COLLATE failed, using "C"
3: Setting LC_TIME failed, using "C"
4: Setting LC_MESSAGES failed, using "C"
5: Setting LC_MONETARY failed, using "C"
at startup. This thread explains how to fix it: Installing R on Mac - Warning messages: Setting LC_CTYPE failed, using "C".
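For reference, a minimal sketch of that fix (assuming a US English system; substitute your own UTF-8 locale). Either tell the R GUI to force a UTF-8 locale, once, from Terminal:
defaults write org.R-project.R force.LANG en_US.UTF-8
Or export the locale in your shell profile (e.g. ~/.zshrc) so that R started from a terminal inherits it:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8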

I'm collecting a few further comments here in addition to the existing answer:
The only encoding (beyond ASCII) supported by R/exams, starting from version 2.4-0, is UTF-8. Support for other encodings like latin1 etc. has been discontinued.
As only UTF-8 is supported, the encoding does not have to be specified in R/exams function calls anymore (as might still be advised in older tutorials).
To leverage this support of UTF-8, R has to be configured with a suitable locale. A "C" locale (see the answer by @vdet) is not sufficient.
When using R/LaTeX (Rnw) exercises, all issues with encodings can also be avoided entirely by using LaTeX commands for special characters, e.g., {\'e}t{\'e} instead of été. The latter is of course more convenient, but the former can be more robust, especially when working with teams of instructors on different operating systems with different locale settings.
When using LaTeX commands instead of special characters in R strings (as opposed to the exercise files), then remember that the backslash has to be escaped. For example, the argument title = "Examen écrit" becomes title = "Examen {\\'e}crit".
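For instance, with a properly configured UTF-8 locale, both of these hypothetical calls should produce the same title in the PDF; the second is pure ASCII and therefore locale-independent:
exams2nops(file = myexam, title = "Examen écrit")
exams2nops(file = myexam, title = "Examen {\\'e}crit")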

Related

Reading text file in Lua using Luacom and ADODB: error

I am constructing a general-purpose function to read a text file, which may be ASCII, UTF-8 or UTF-16. (The encoding is known when the function is invoked.) The file name may contain UTF-8 characters, so the standard Lua io functions are not a solution. I have no control over the Lua implementation (5.3) or the binary modules available in the environment.
My current code is:
require "luacom"
local function readTextFile(sPath, bUnicode, iBits)
local fso = luacom.CreateObject("Scripting.FileSystemObject")
if not fso:FileExists(sPath) then return false, "" end --check the file exists
local so = luacom.CreateObject("ADODB.Stream")
--so.CharSet defaults to Unicode aka utf-16
--so.Type defaults to text
so.Mode = 1 --adModeRead
if not bUnicode then
so.CharSet = "ascii"
elseif iBits == 8 then
so.CharSet = "utf-8"
end
so:Open()
so:LoadFromFile(sPath)
local contents = so:ReadText()
so:Close()
return true, contents
end
--test Unicode(utf-16) files
local file = "D:\\OneDrive\\Desktop\\utf16.txt" --this exists
local booOK, factsetcontents = readTextFile(file, true, 16)
When executed I get the error: COM exception:(d:\my\lua\luacom-master\src\library\tluacom.cpp,382):Operation is not allowed in this context on line 19 [local stream = so:LoadFromFile(sPath)]
I've pored over the ADO documentation and am obviously missing something that is staring me in the face! Is what I'm trying to do impossible?
ETA: If I comment out the line so.Mode = 1, this works. Which is great, but I don't understand why, which means I may end up making the same mistake unwittingly, whatever that mistake is!
I don't know about ADODB Stream.Mode and why the function failed. But I think it's rather tricky to use an ADODB COM object on Windows to read ASCII/UTF-8/Unicode encoded files.
You can instead:
use the standard Lua io.open function in binary mode and decode the byte content manually (see the sketch after this list)
use a binary module to do all the work
use a specific Lua implementation for Windows that can read/write those kinds of encoded files natively, like LuaRT
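For the first option, here is a minimal sketch (Lua 5.3) that reads the file in binary mode and decodes UTF-16LE by hand. BOM and surrogate-pair handling are kept deliberately simple, and note that io.open itself cannot take arbitrary UTF-8 file names on Windows, which was the original motivation for ADODB:
local function readUtf16leFile(sPath)
  local f = io.open(sPath, "rb")
  if not f then return false, "" end
  local bytes = f:read("a")
  f:close()
  local out, i = {}, 1
  if bytes:sub(1, 2) == "\255\254" then i = 3 end --skip the UTF-16LE BOM
  while i + 1 <= #bytes do
    local unit = string.unpack("<I2", bytes, i)
    i = i + 2
    if unit >= 0xD800 and unit <= 0xDBFF and i + 1 <= #bytes then
      --combine a surrogate pair into a single code point
      local low = string.unpack("<I2", bytes, i)
      i = i + 2
      unit = 0x10000 + (unit - 0xD800) * 0x400 + (low - 0xDC00)
    end
    out[#out + 1] = utf8.char(unit) --re-encode the code point as UTF-8
  end
  return true, table.concat(out)
end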

How to encode a STRING variable into a given code page

I've got a string variable containing text that I need to encode and write to a file in the UTF-16LE code page.
Currently the following code generates a UTF-8 file, and I don't see any option in the OPEN DATASET statement to generate the file in UTF-16LE.
REPORT zmyprogram.
DATA(filename) = `/tmp/myfile`.
OPEN DATASET filename IN TEXT MODE ENCODING DEFAULT FOR OUTPUT.
TRANSFER 'HELLO WORLD' TO filename.
CLOSE DATASET filename.
I guess one solution is to first encode the string in memory, then write the encoded bytes to the file.
Generally speaking, how to encode a string of characters into a given code page, in memory?
In the first part, I explain how to encode a string of characters into a given code page (all is done in memory), and in the second part, I explain specifically how to write files to the application server in a given code page.
General way (all in memory)
If a string of characters (type STRING) has to be encoded, the result has to be stored in a string of bytes, which corresponds to the built-in data type XSTRING.
There are several possibilities which depend on the ABAP version:
Since 7.53, use the class CL_ABAP_CONV_CODEPAGE:
DATA(xstring) = cl_abap_conv_codepage=>create_out( codepage = `UTF-16LE` )->convert( source = `ABCDE` ).
Since 7.02, use the class CL_ABAP_CODEPAGE:
DATA xstring TYPE xstring.
xstring = cl_abap_codepage=>convert_to( source = `ABCDE` codepage = `UTF-16LE` ).
Before 7.02, use the class CL_ABAP_CONV_OUT_CE (documentation provided with the class):
First, instantiate the conversion object, using an SAP code page number instead of the ISO name (list of values shown hereafter):
DATA: conv TYPE REF TO CL_ABAP_CONV_OUT_CE, xstring TYPE xstring.
conv = CL_ABAP_CONV_OUT_CE=>CREATE( encoding = '4103' ). "4103 = utf-16le
Then encode the string and retrieve the bytes encoded:
conv->RESET( ).
conv->WRITE( data = `ABCDE` ).
xstring = conv->GET_BUFFER( ).
Alternatively, instead of using RESET, WRITE and GET_BUFFER, you can use the method CONVERT, which was added in 6.40 and retroported:
conv->CONVERT( EXPORTING data = `ABCDE` IMPORTING buffer = xstring ).
With the class CL_ABAP_CONV_OUT_CE, you need to use the number of the SAP Code Page, not the ISO name. Here are the most common SAP code pages and their equivalent ISO names:
1100: ISO-8859-1
1101: US-ASCII
1160: Windows-1252 ("ANSI")
1401: ISO-8859-2
4102: UTF-16BE
4103: UTF-16LE
4104: UTF-32BE
4105: UTF-32LE
4110: UTF-8
Etc. (the possible values are defined in the table TCP00A, in lines with column CPATTRKIND = 'H').
 
Writing a file on the application server in a given code page
In ABAP, OPEN DATASET can directly specify the target code page. Most code pages are supported, including UTF-8, but not the other UTF code pages (41xx); those can be handled only via the solution explained in 2.3 below (first encoding in memory).
2.1) IN TEXT MODE ENCODING ...
Possible ENCODING values:
UTF-8: in this mode, it's possible to add the Byte Order Mark if needed, via the option WITH BYTE-ORDER MARK.
DEFAULT: will be UTF-8 in a SAP "Unicode" system (that you can check via the menu System > Status > Unicode System Yes/No), NON-UNICODE otherwise.
NON-UNICODE: will depend on the current ABAP linguistic environment; for language English, it's the character encoding iso-8859-1, for language Polish, it's the character encoding iso-8859-2, etc. (the equivalences are shown in table TCP0C.)
Example in ABAP version 7.52 to write to UTF-8 with the byte order mark:
REPORT zmyprogram.
DATA(filename) = `/tmp/dataset_utf_8`.
OPEN DATASET filename IN TEXT MODE ENCODING UTF-8 WITH BYTE-ORDER MARK FOR OUTPUT.
TRY.
    TRANSFER `Witaj świecie` TO filename.
  CATCH cx_sy_conversion_codepage INTO DATA(lx).
    " Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
Example in ABAP version 7.52 to write to iso-8859-2 (Polish language here):
REPORT zmyprogram.
SET LOCALE LANGUAGE 'L'. " Polish
DATA(filename) = `/tmp/dataset_nonunicode_pl`.
OPEN DATASET filename IN TEXT MODE ENCODING NON-UNICODE FOR OUTPUT.
TRY.
    TRANSFER `Witaj świecie` TO filename.
  CATCH cx_sy_conversion_codepage INTO DATA(lx).
    " Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
2.2) IN LEGACY TEXT MODE CODE PAGE ...
Use any code page number except code pages 41xx (i.e. UTF-8 and other UTF; see workaround in 2.3 below).
Example in ABAP version 7.52 to write to iso-8859-2 (code page 1401):
REPORT zmyprogram.
DATA(filename) = `/tmp/dataset_iso_8859_2`.
OPEN DATASET filename IN LEGACY TEXT MODE CODE PAGE '1401' FOR OUTPUT. " iso-8859-2
TRY.
    TRANSFER `Witaj świecie` TO filename.
  CATCH cx_sy_conversion_codepage INTO DATA(lx).
    " Character not supported in language code page
ENDTRY.
CLOSE DATASET filename.
2.3) UTF = general way + IN BINARY MODE
Example in ABAP version 7.52:
REPORT zmyprogram.
TRY.
    DATA(xstring) = cl_abap_codepage=>convert_to( source = `Witaj świecie` codepage = `UTF-16LE` ).
  CATCH cx_sy_conversion_codepage INTO DATA(lx).
    " Character not supported in language code page
    BREAK-POINT.
ENDTRY.
DATA(filename) = `/tmp/dataset_utf_16le`.
OPEN DATASET filename IN BINARY MODE FOR OUTPUT.
TRANSFER xstring TO filename.
CLOSE DATASET filename.

Broken utf8 conversion?

I am trying to force UTF-8 like this:
to_utf8(X) when is_list(X) ->
    unicode:characters_to_binary(X, utf8);
to_utf8(X) when is_binary(X) ->
    to_utf8(binary_to_list(X));
to_utf8(X) -> X.
And testing it like this:
<<"é"/utf8>> = to_utf8(<<"é">>),
<<"Ø"/utf8>> = to_utf8(<<"Ø">>),
<<"œ"/utf8>> = to_utf8(<<"œ">>),
When using R16B03 everything works fine. However, after upgrading to Erlang 17.5, the function stopped working for characters like "œ" or "Ā", even though they are available in UTF-8.
Since Erlang 17 uses utf-8 as the default encoding instead of latin-1 as in R16, this should work the same as before.
Did I overlook something?
Thanks :)
I'm going to use œ as the example unicode character in the examples below:
<<197,147>> = <<"œ"/utf8>>.
[197,147] = binary_to_list(<<"œ"/utf8>>).
<<195,133,194,147>> = unicode:characters_to_binary(binary_to_list(<<"œ"/utf8>>), utf8).
Prior to R17, the default encoding of latin1 is what allowed this to work in conjunction with binary_to_list/1. The new default is unicode.
The list [197,147] is not in the format expected when the input encoding is utf8 in unicode:characters_to_binary/2: for lists, the elements are then interpreted as Unicode code points rather than raw bytes. If we want to use binary_to_list/1, we have to specify latin1 as both the input and output encoding, matching the effective behaviour on R16 and below:
<<197,147>> = unicode:characters_to_binary(binary_to_list(<<"œ"/utf8>>), latin1, latin1).
Another solution would be to make use of unicode:characters_to_list/1 instead of binary_to_list/1:
[339] = unicode:characters_to_list(<<"œ"/utf8>>).
<<197,147>> = unicode:characters_to_binary(unicode:characters_to_list(<<"œ"/utf8>>), utf8).
A better solution would be to just use unicode:characters_to_binary/1,2,3 directly as there is no need to convert binaries to lists:
<<"œ"/utf8>> = unicode:characters_to_binary(<<197,147>>).
<<"œ"/utf8>> = unicode:characters_to_binary("œ").

Adding symbols to a Word document using docx4j

I am trying to add a Wingdings symbol to the docx using the following code.
P symp = factory.createP();
R symr = factory.createR();
Sym sym = factory.createRSym();
sym.setFont("Wingdings");
sym.setChar("F0FC");
symr.getContent().add(sym);
symp.getContent().add(symr);
mainPart.getContent().add(symp);
I get invalid content errors on opening the document. When I tried to add the symbols directly to a Word docx, unzipped the docx and looked at the document.xml, I saw that the paragraph has rsidR and rsidRDefault attributes. When I read about these attributes in this link, How to generate RSID attributes correctly in Word .docx files using Apache POI?, I saw that they are random and only necessary to track changes in the document. So then, why does Microsoft Word keep expecting them and give me the errors?
Any ideas/suggestions?
I wonder whether Sym support is in docx4j in the way you expect.
I tried your code and got the same issue, but I must confess to not having investigated symbols before. As an experiment, I added a symbol using the relevant "Insert" menu command in Word 2010, and then checked the resulting OpenXML: it's really quite different from the mark-up produced when inserting a Sym element.
Rather than manipulating symbols, have you tried inserting the text directly instead? For example, this will insert a tick character (not sure if that’s what you’re after):
P p = factory.createP();
R r = factory.createR();
RPr rpr = factory.createRPr();
Text text = factory.createText();
RFonts rfonts = factory.createRFonts();
rfonts.setAscii("Wingdings");
rfonts.setCs("Wingdings");
rfonts.setHAnsi("Wingdings");
rpr.setRFonts(rfonts);
r.setRPr(rpr);
text.setValue("\uF0FC");
r.getContent().add(text);
p.getContent().add(r);
mainPart.getContent().add(p);

How to encode special characters in .vdproj registry string?

We have a .vdproj file that produces an .msi file. Upon installing, strings in various languages are added to the registry, but the special characters come out all wrong.
I cannot open the .vdproj as it requires VS 2005, but as text it looks like this (note the value):
"Values"
{
"{ADCFDA98-8FDD-45E4-90BC-E3D20B029870}:_58F50CEB3EC74D5E9E6301A39929D9FE"
{
"Name" = "8:Description"
"Condition" = "8:"
"Transitive" = "11:FALSE"
"ValueTypes" = "3:1"
"Value" = "8:Låter dig söka efter information."
}
When built, the Swedish letters are misrepresented in the generated .msi file (viewed in InstallShield), and they look the same in the registry after installation.
How do I get around this? Is there a setting I could set, or an encoding I could use, directly in the vdproj value?
I solved this for now by rephrasing without using special characters. The issue remains though, as I cannot rephrase in all languages.
The alternatives I looked at included installing Visual Studio 2005, to be able to open and edit the .vdproj file, or converting it all to WiX.
