Detecting encoding before opening a file

I have a file with an unknown character encoding. Running file -bi test.trace reports text/plain; charset=us-ascii, but using
(with-open-file (stream "/home/*/test.trace" :external-format :us-ascii)
  (code-to-work-with-file))
signals an error:
:ASCII stream decoding error on
#<SB-SYS:FD-STREAM for "file /home/*/test.trace" {10208D2723}>:
the octet sequence #(194) cannot be decoded. [Condition of type SB-INT:STREAM-DECODING-ERROR]
How can I detect the encoding of a file before opening it?
I can open the file with emacs, less, and nano just fine, so it seems to be either a misdetection of the encoding or a difference between what file and SBCL think an encoding should look like.
I currently avoid this problem by forcing every file to UTF-8 with vim "+set nobomb | set fenc=utf8 | x" file-path. But even after this, file still claims it is us-ascii. Additionally, this is not a valid permanent solution, rather a dirty hack to make it work.

As pointed out on Programmers Stack Exchange here,
Files generally indicate their encoding with a file header. There are
many examples here. However, even reading the header you can never be
sure what encoding a file is really using.
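To make the header idea concrete, here is a minimal sketch in Common Lisp (my own addition, not from the quoted answer; guess-encoding-from-bom is a hypothetical helper): open the file in binary mode and check the first octets for a byte-order mark. As the quote warns, absence of a BOM proves nothing; plain UTF-8 and Latin-1 files usually carry no header at all, and UTF-32 BOMs are ignored here for brevity.
(defun guess-encoding-from-bom (path)
  ;; Open in binary mode and inspect the first bytes for a BOM.
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((bytes (make-array 3 :element-type '(unsigned-byte 8)
                               :initial-element 0)))
      (read-sequence bytes in)
      (cond ((and (= (aref bytes 0) #xEF)   ; UTF-8 BOM: EF BB BF
                  (= (aref bytes 1) #xBB)
                  (= (aref bytes 2) #xBF)) :utf-8)
            ((and (= (aref bytes 0) #xFF)   ; UTF-16LE BOM: FF FE
                  (= (aref bytes 1) #xFE)) :utf-16le)
            ((and (= (aref bytes 0) #xFE)   ; UTF-16BE BOM: FE FF
                  (= (aref bytes 1) #xFF)) :utf-16be)
            (t nil)))))                     ; no BOM: encoding unknown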
I looked for trace files on my system and found this one, but it does not contain anything unusual:
2016-06-22 13:10:07 ☆ |ruby-2.2.3#laguna| Antonios-MacBook-Pro in ~/learn/lisp/stackoverflow/scripts
○ → file -I resources/hello.trace
resources/hello.trace: text/plain; charset=us-ascii
2016-06-22 13:11:50 ☆ |ruby-2.2.3#laguna| Antonios-MacBook-Pro in ~/learn/lisp/stackoverflow/scripts
○ → cat resources/hello.trace
println! { "Hello, World!" }
print! { concat ! ( "Hello, World!" , "\n" ) }
So with this code I can read it:
CL-USER> (with-open-file (in "/Users/toni/learn/lisp/stackoverflow/scripts/resources/hello.trace" :external-format :us-ascii)
           (when in
             (loop for line = (read-line in nil)
                   while line do (format t "~a~%" line))))
println! { "Hello, World!" }
print! { concat ! ( "Hello, World!" , "\n" ) }
NIL
or even in Chinese or whatever it was. We can see which character the offending octet 194 corresponds to like this:
CL-USER> (format nil "~{~C~}" (mapcar #'code-char '(194)))
"Â"
Or any other strange character, so it seems the file can contain accented characters. I added this to the file:
println! { "Hello, World!" }
print! { concat ! ( "Hello, World!" , "\n" ) }
Â
patatopita
and I get the same error:
:ASCII stream decoding error on
#<SB-SYS:FD-STREAM for "file /Users/toni/learn/lisp/stackoverflow/scripts/resources/hello.trace" {1003994043}>:
the octet sequence #(195) cannot be decoded.
[Condition of type SB-INT:STREAM-DECODING-ERROR]
So at this point you can work with conditions and restarts to change the offending character. I am not a specialist in this kind of code, but the debugger offers these restarts:
Restarts:
0: [ATTEMPT-RESYNC] Attempt to resync the stream at a character boundary and continue.
1: [FORCE-END-OF-FILE] Force an end of file.
2: [INPUT-REPLACEMENT] Use string as replacement input, attempt to resync at a character boundary and continue.
3: [*ABORT] Return to SLIME's top level.
4: [ABORT] abort thread (#<THREAD "repl-thread" RUNNING {10050E0003}>)
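For example, the ATTEMPT-RESYNC restart can be invoked programmatically. A minimal sketch, assuming SBCL (read-resyncing is a hypothetical helper; the restart is looked up by name rather than hard-coding its internal home package):
(defun read-resyncing (path)
  ;; Decode as ASCII; on each decoding error, invoke whatever restart
  ;; is named ATTEMPT-RESYNC so reading continues past the bad octets.
  (handler-bind
      ((sb-int:stream-decoding-error
         (lambda (condition)
           (let ((restart (find "ATTEMPT-RESYNC" (compute-restarts condition)
                                :key (lambda (r) (symbol-name (restart-name r)))
                                :test #'string=)))
             (when restart (invoke-restart restart))))))
    (with-open-file (in path :external-format :us-ascii)
      (loop for line = (read-line in nil)
            while line do (format t "~a~%" line)))))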
INPUT-REPLACEMENT would let you substitute a replacement string instead. If restarts are more than you need, try a European encoding such as Latin-1 or one of the ISO-8859 family; so let's read with a European charset:
CL-USER> (with-open-file (in "/Users/toni/learn/lisp/stackoverflow/scripts/resources/hello.trace" :external-format :latin-1)
           (when in
             (loop for line = (read-line in nil)
                   while line do (format t "~a~%" line))))
println! { "Hello, World!" }
print! { concat ! ( "Hello, World!" , "\n" ) }
¬
patatopita
NIL
And it should work. Good luck!

Related

lua table string concat not correct

I have a simple function to read lines from .txt file:
function loadData(file_name, root_path)
  -- here, file_name is './list.txt', root_path is '../data/my/'
  for line in io.lines(file_name) do
    local data = {}
    base_path = root_path .. line
    -- so base_path is something like ../data/my/001
    data.file = base_path .. '_color.png'
    print(data)
  end
end
I expect the data should be {file: "../data/my/001_color.png"}, but I got {_color.png" ../data/my/001}
Can anyone help me? Thanks!
Check your ./list.txt file content for EOL (end of line): it may have been produced on Windows (EOL = CR LF) and interpreted on Linux (EOL = LF). On Linux, io.lines keeps the CR character in the line string!
Your program does everything correctly, but your data is not what you think it is.
Let's assume the first line in ./list.txt is ../data/my/001\r\n
The line variable is then ../data/my/001\r (print(#line) gives 15 instead of 14).
A carriage return (CR) in print moves the cursor to the start of the line without changing lines.
Your print output in this case is something similar to {file: "../data/my/001\r_color.png"} (it depends on the print implementation), and you get this output:
{file: "../data/my/001
_color.png"} <-- on the same line
Let's combine it:
_color.png"}ata/my/001
To correct this:
provide the file without CR (works correctly on all systems), or
add as the first row of the loop: line = line:gsub('[\r\n]','') to remove CR and LF, as in the sketch below.
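Here is loadData with that fix applied (my own adaptation, not code from the answer):
function loadData(file_name, root_path)
  for line in io.lines(file_name) do
    line = line:gsub('[\r\n]', '')        -- strip stray CR/LF from Windows-made files
    local data = {}
    data.file = root_path .. line .. '_color.png'
    print(data.file)                      -- ../data/my/001_color.png
  end
end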

LUA Write contents of file to another (Garry's Mod)

local myvariable = print( file.Read( "dir1/file1.txt" ) )
file.Write( "dir2/file2.txt", "myvariable" )
This code will only write "myvariable" to file2.txt, but I want the contents of file1.txt to be written to file2.txt. Can I make a "string" be read as a variable? Any other ideas to make this work?
Note: This is for Garry's Mod so its LUA can be referenced here: http://wiki.garrysmod.com/page/Main_Page
Try
local myvariable = file.Read( "dir1/file1.txt" )
file.Write( "dir2/file2.txt", myvariable )
I found out that Lua can handle really long strings, so I put all the text in the script rather than in a file of its own.
The cl.lua file has the client read the file and display the text as desired (in my case it opened in a box, i.e. vgui.Create("DFrame")):
...
text:SetText(file.Read("dir/to/large_text_file.txt", "DATA"))
...
In the .lua file that actually writes the giant string to the file read above:
...
file.Write( "dir/to/large_text_file.txt", "really long string that is written to said file... this string ended up over 10,000 characters for me" )

How to handle iOS string characters parsing in node (Japanese characters)?

I'm running into issues uploading an iOS strings file (English -> Japanese) to a Node server for parsing.
The file is UTF-16LE, but when it is parsed as a string, characters get lost. This may have something to do with Express using UTF-8 to read in the request file data, which malforms it.
When the file is loaded in Atom/Sublime with UTF-16 encoding, it works great.
When the file is loaded as UTF-8, things break down.
Any help would be awesome.
After doing some research and digging: utilizing the npm module iconv-lite to parse the file buffer, one should:
1) parse the buffer as utf16le
2) down-convert to utf8
3) convert to a string
const iconv = require('iconv-lite');

let str, body;
if (encoding === 'utf-16le') {
  str = iconv.decode(buffer, 'utf16le');
  body = iconv.encode(str, 'utf8').toString();
} else if (encoding === 'utf-16be') {
  str = iconv.decode(buffer, 'utf16be');
  body = iconv.encode(str, 'utf8').toString();
} else {
  body = Buffer.concat(file.data).toString();
}
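The snippet above assumes encoding has already been determined somehow. A minimal sketch of deriving it from the buffer's byte-order mark (my own addition; sniffEncoding is a hypothetical helper, and BOM-less UTF-16 will slip through as UTF-8):
function sniffEncoding(buffer) {
  // FF FE = UTF-16LE BOM, FE FF = UTF-16BE BOM; otherwise assume UTF-8.
  if (buffer.length >= 2 && buffer[0] === 0xff && buffer[1] === 0xfe) return 'utf-16le';
  if (buffer.length >= 2 && buffer[0] === 0xfe && buffer[1] === 0xff) return 'utf-16be';
  return 'utf-8';
}

const encoding = sniffEncoding(buffer);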

How to detect and convert DOS/Windows end of line to UNIX end of line in Ruby

I have implemented a CSV upload in Ruby (on Rails) that works fine when the file is uploaded from a browser running on a UNIX-like system.
However, a file uploaded by a real customer contains the famous ^M as end of lines (I guess it was uploaded from Windows).
I need to detect this situation and replace the character before the file is processed.
Here is the code that creates the file
# create the file on the server
path = File.join(directory, name)
# write the file
File.open(path, 'wb') { |f| f.write(uploadData.read) }
Do I need to change the "wb" to "w", and would this solve the problem?
The CR (^M, as you write it) character is "\r" in Ruby (and many other languages), so if you're sure your line endings also have the LF character (Windows uses CRLF as the line ending), then you can just remove all the CRs at the ends of the lines ($ matches at the end of a line, before the last "\n"):
uploadData.read.gsub /\r$/, ''
If you're not sure you're going to have the LF (e.g. Mac OS 9 used a plain CR at the end of the line), then replace any CR, optionally followed by an LF, with an LF:
uploadData.read.gsub /\r\n?/, "\n"
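Tying that back to the upload code from the question (my own combination of the two snippets, not something the answer spells out): normalize the line endings once, as the file is written to the server.
# Create the file on the server with normalized UNIX line endings.
path = File.join(directory, name)
File.open(path, 'wb') do |f|
  f.write(uploadData.read.gsub(/\r\n?/, "\n"))
end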

Creating a POST with url elisp package in emacs: utf-8 problem

I'm currently creating a REST client for making blog posts, much in the spirit of pastie.el. The main objective is for me to write Textile in Emacs and make a post to a Rails application that will create it. It works fine until I type anything in either Spanish or Japanese; then I get a 500 error. pastie.el has this same problem too, by the way.
Here is the code:
(require 'url)
(defun create-post ()
  (interactive)
  (let ((url-request-method "POST")
        (url-request-extra-headers '(("Content-Type" . "application/xml")))
        (url-request-data (concat "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                                  "<post>"
                                  "<title>"
                                  "Not working with spanish nor japanese"
                                  "</title>"
                                  "<content>"
                                  ;; "日本語" ;; not working
                                  ;; "ñ" ;; not working either
                                  "h1. Textile title\n\n"
                                  "*Textile bold*"
                                  "</content>"
                                  "</post>")))
    (url-retrieve "http://127.0.0.1:3000/posts.xml"
                  ;; CALLBACK
                  (lambda (status)
                    (switch-to-buffer (current-buffer))))))
The only way I can imagine right now that the problem could be fixed is by making Emacs encode the characters, so that a 'ñ' becomes '&#241;' (which works, by the way).
What could be a work around for this problem?
EDIT: '*' is not equivalent to '&ast;'. What I meant was that if I encoded the characters with Emacs using, for example, 'sgml-char', the whole post would become entity-encoded, like &ast;Textile bold&ast;, making RedCloth unable to convert it into HTML. Sorry, it was very badly explained.
A guess: does it work if you set url-request-data to
(encode-coding-string (concat "<?xml etc...") 'utf-8)
instead?
There's nothing really to tell url what coding system you use, so I guess you have to encode your data yourself. This should also give a correct Content-Length header, as that just comes from (length url-request-data), which would give the wrong result for most unencoded multibyte strings, since it counts characters rather than octets.
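A quick illustration of that last point (my own, not from the answer): for a multibyte string, length counts characters, while only the encoded form counts octets.
(length "ñ")                               ;; => 1 character
(length (encode-coding-string "ñ" 'utf-8)) ;; => 2 octets, the correct Content-Length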
Thanks to legoscia I now know that I have to encode the data myself. I'll post the function here for future reference:
(require 'url)
(defun create-post ()
  (interactive)
  (let ((url-request-method "POST")
        (url-request-extra-headers '(("Content-Type" . "application/xml; charset=utf-8")))
        (url-request-data
         (encode-coding-string (concat "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                                       "<post>"
                                       "<title>"
                                       "Not working with spanish nor japanese"
                                       "</title>"
                                       "<content>"
                                       "日本語\n\n" ;; working!!!
                                       "ñ\n\n" ;; working !!!
                                       "h1. Textile title\n\n"
                                       "*Textile bold*"
                                       "</content>"
                                       "</post>")
                               'utf-8)))
    (url-retrieve "http://127.0.0.1:3000/posts.xml"
                  ;; CALLBACK
                  (lambda (status)
                    (switch-to-buffer (current-buffer))))))
