TL;DR:
25> re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).
** exception error: bad argument
in function re:run/3
called as re:run([1081,1094,1091,46,97,115,100],
"^(.*\\..*)$",
[{capture,none}])
How to make this work? 'йцу' are characters that don't belong in a latin charset, obviously; is there a way to tell the re module or entire system to run with a different charset for "strings"?
ORIGINAL QUESTION (for the record):
Another "Programming Erlang" question )
in Chapter 16 there's an example about reading tags from the mp3 files. It works, great. But, there seems to be some bug in a provided module, lib_find, which has a function for searching a path for matching files. This is the call that works:
61> lib_find:files("../..", "*.mp3", true).
["../../early/files/Veronique.mp3"]
and this call fails:
62> lib_find:files("../../..", "*.mp3", true).
** exception error: bad argument
in function re:run/3
called as re:run([46,46,47,46,46,47,46,46,47,46,107,101,114,108,47,98,117,
105,108,100,115,47,50,48,46,49,47,111|...],
"^(.*\\.mp3)$",
[{capture,none}])
in call from lib_find:find_files/6 (lib_find.erl, line 29)
in call from lib_find:find_files/6 (lib_find.erl, line 39)
in call from lib_find:files/3 (lib_find.erl, line 17)
Ironically, the investigation led to finding the culprit in Erlang's own installation:
.kerl/builds/20.1/otp_src_20.1/lib/ssh/test/ssh_sftp_SUITE_data/sftp_tar_test_data_高兴
OK, this seems to mean Erlang is using a more restrictive default charset, which doesn't include hànzì. What are the options? Obviously, I can just ignore this and move on with my study, but I feel I can learn more from this one =) Such as - where/how can I fix the default charset? I'm a little surprised it's something other than UTF8 by default - so maybe I'm on a wrong track?
Thanks!
TL;DR:
UTF-8 regexs are accessible by putting the regex pattern into unicode mode with the option unicode. (Note below that the string "^(.*\\..*)$" is the result of your call to xmerl_regexp:sh_to_awk/1.)
1> re:run("なにこれ.txt", "^(.*\\..*)$").
** exception error: bad argument
in function re:run/2
called as re:run([12394,12395,12371,12428,46,116,120,116],"^(.*\\..*)$")
2> re:run("なにこれ.txt", "^(.*\\..*)$", [unicode]).
{match,[{0,16},{0,16}]}
And from your exact example:
11> re:run("йцу.asd", "^(.*\\..*)$", [unicode, {capture, none}]).
match
Or
12> {ok, Pattern} = re:compile("^(.*\\..*)$", [unicode]).
{ok,{re_pattern,1,1,0,
<<69,82,67,80,87,0,0,0,16,8,0,0,65,0,0,0,255,255,255,
255,255,255,...>>}}
13> re:run("йцу.asd", Pattern, [{capture, none}]).
match
The docs for re are pretty long and extensive, but that's because regexs are an inherently complex subject. You can find options for compiled regexs in the docs for re:compile/2 and the options for run in the docs for re:run/3.
Discussion
Erlang has settled on the idea that strings, though still a list of codepoints, are all UTF-8 everywhere. As I work in Japan and deal with this all the time, this has come as a big relief to me because I can stop using about half of the conversion libraries I had needed in the past (yay!), but has complicated matters a bit for users of the string module because many operations there now perform under slightly different assumptions (a string is still considered "flat" even if it is a deep list of grapheme clusters, so long as those clusters exist on the first level of the list).
Unfortunately, encodings are just not very easy things to deal with and UTF-8 is anything but simple once you step out of the most common representations -- so much of this is a work in progress. I can tell you with confidence, though, that dealing with UTF-8 data in binary, string, deep list, and io_data() forms, whether file names, file data, network data, or user input from WX or web forms works as expected once you read the unicode, regex and string docs.
But that is, of course, a lot of stuff to get familiar with. 99% of the time things will work as expected if you decode everything incoming from outside as UTF-8 using unicode:characters_to_list/1 and unicode:characters_to_binary/1, and specify binary strings as utf8 binary types everywhere:
3> UnicodeBin = <<"この文書はUTF-8です。"/utf8>>.
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
4> UnicodeString = unicode:characters_to_list(UnicodeBin).
[12371,12398,25991,26360,12399,85,84,70,45,56,12391,12377,
12290]
5> io:format("~ts~n", [UnicodeString]).
この文書はUTF-8です。
ok
6> re:run(UnicodeString, "UTF-8", [unicode]).
{match,[{15,5}]}
7> re:run(UnicodeBin, "UTF-8", [unicode]).
{match,[{15,5}]}
8> unicode:characters_to_binary(UnicodeString).
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
9> unicode:characters_to_binary(UnicodeBin).
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
I would like to use FORTRAN streaming I/O to make a program that tells me how many lines a text-file has. The idea is to make something like this:
OPEN(UNIT=10,ACCESS='STREAM',FILE='testfile.txt')
nLines=0
bContinue=.TRUE.
DO WHILE (bContinue)
READ(UNIT=10) cCharacter
IF (cCharacter.EQ.{EOL-char}) nLines=nLines+1
IF (cCharacter.EQ.{EOF-char}) bContinue=.FALSE.
ENDDO
(I didn't include variable declaration but I think you get the idea of what they are; the only important clarification would be that that cCharacter has LEN=1)
My problem is that I don't know how to check if the character I just read from the file is an end-of-line or end-of-file (the "ifs" in the code). When you read and print characters this way, you eventually get newlines in the same place you had them in the original text, so I think it does read and recognize them as "characters", somehow. Perhaps turning the characters into integers and comparing to the appropriate number? Or is there a more direct way?
(I know that you can use the register reading (EDIT: I meant record reading) to do a program that reads lines more easily and add an IOstatus to check for eof, but the "line counter" is just a useful example, the idea is to learn how to move in a more controlled way through a textfile)
Checking for a specific character as line terminator makes you program OS dependent. It would be better to use the facilities of the language so that your program is compiler and OS dependent. Since lines are basically records, why do this with steam I/O? That request seems to make an easy job into a hard one. If are can use regular IO, here is an example program to count the lines in a text file.
EDIT: the code fragment was changed into a program to answer questions in the comments. With "line" as a character variable, when I test the program with gfortran and ifort I don't see a problem when the input file has empty or blank lines.
program test_lc
use, intrinsic :: iso_fortran_env
integer :: LineCount, Read_Code
character (len=200) :: line
open (unit=51, file="temp.txt", status="old", access='sequential', form='formatted', action='read' )
LineCount = 0
ReadLoop: do
read (51, '(A)', iostat=Read_Code) line
if ( Read_Code /= 0 ) then
if ( Read_Code == iostat_end ) then
exit ReadLoop ! end of file --> line count found
else
write ( *, '( / "read error: ", I0 )' ) Read_Code
stop
end if
end if
LineCount = LineCount + 1
write (*, '( I0, ": ''", A, "''" )' ) LineCount, trim (line)
if ( len_trim (line) == 0 ) write (*, '("The above is an empty or all blank line.")' )
end do ReadLoop
write (*, *) "found", LineCount, " lines"
end program test_lc
If you want to do further processing of the file, you can rewind it.
P.S.
The main reason that I have used Fortran Stream IO is to read files produced by other languages, e.g., C
Portable methods are provided to write new-line boundaries; I'm not aware of a portable method to test for such.
I'm trying to understand what identifiers represent and what they don't represent.
As I understand it, an identifier is a name for a method, a constant, a variable, a class, a package/module. It covers a lot. But what can you not use it for?
Every language differs in terms of what entities/abstractions can or cannot be named and reused in that language.
In most languages, you can't use an identifier for infix arithmetic operations.
For example, plus is an identifier and you can make a function named plus. But write you can write a = b + c;, there's no way to define an operator named plus to make a = b plus c; work because the language grammar simply does not allow an identifier there.
An identifier allows you to assign a name to some data, so that you can reference it later. That is the limit of what identifiers do; you cannot "use" it for anything other than a reference to some data.
That said, there are a lot of implications that come from this, some subtle. For example, in most languages functions are, to some degree or another, considered to be data, and so a function name is an identifier. In languages where functions are values, but not "first-class" values, you can't use an identifier for a function in an place you could use an identifier for something else. In some languages, there will even be separate namespaces for functions and other data, and so what is textually the same identifier might refer to two different things, and they would be distinguished by the context in which they are used.
An example of what you usually (i.e., in most languages) cannot use an identifier for is as a reference to a language keyword. For example, this sort of thing generally can't be done:
let during = while;
during (true) { print("Hello, world."); }
You could say it's used for everything that you'll want to refer to multiple times, or maybe even once (but use it to clarify the referent's purpose).
What can/can't be named differs per language, it's often quite intuitive, IMHO.
An "Anonymous" entity is something which is not named, although referred to somehow.
#!/usr/bin/perl
$subroutine = sub { return "Anonymous subroutine returning this text"; }
In Perl-speak, this is anonymous - the subroutine is not named, but it is referred to by the reference variable $subroutine.
PS: In Perl, the subroutine would be named like this:
sub NAME_HERE {
# some code...
}
Say, in Java your cannot write something like:
Object myIf = if;
myIf (a == b) {
System.out.println("True!");
}
So, you cannot name some code statement, giving it an alias. While in REBOL it is perfectly possible:
myIf: if
myIf a = b [print "True!"]
What can and what can't be named depends on language, as you see.
as its name implifies, an identifier is used to identify something. so for everything that can be identified uniquely, you can use an identifier. But for example a literal (e.g. string literal) is not unique so you can't use an identifier for it. However you can create a variable and assign a string literal to it.
Making soup out them is rather foul.
In languages such as Lisp, an identifier exists in its own right as an symbol, whereas in languages which are not introspective identifiers don't exist in the runtime.
You write a literal identifier/symbol by putting a single quote in front of it:
[1]> 'a
A
You can create a variable and assign a symbol literal to it:
[2]> (setf a 'Hello)
HELLO
[3]> a
HELLO
[4]> (print a)
HELLO
HELLO
You can set two variables to the same symbol
[10]> (setf b a)
HELLO
[11]> b
HELLO
[12]> a
HELLO
[13]> (eq b a)
T
[14]> (eq b 'Hello)
T
Note that the values bound to b and a are the same, and the value is the literal symbol 'Hello
You can bind a function to the symbol
[15]> (defun hello () (print 'hello))
HELLO
and call it:
[16]> (hello)
HELLO
HELLO
In common lisp, the variable binding and the function binding are distinct
[19]> (setf hello 'goodbye)
GOODBYE
[20]> hello
GOODBYE
[21]> (hello)
HELLO
HELLO
but in Scheme or JavaScript the bindings are in the same namespace.
There are many other things you can do with identifiers, if they are reified as symbols. I suspect that someone more knowledgable than me in Lisp will be able to demonstrate any of the things that you 'can't do with identifiers' exist.
But even Lisp can not make identifier soup.
Sort of a left-field thought, but JSON has all those quotations in it to eliminate the danger of a JavaScript keyword messing up the parsing.