How to check whether input is a string in Erlang? - erlang

I would like to write a function to check if the input is a string or not like this:
is_string(Input) ->
    case check_if_string(Input) of
        true -> {ok, Input};
        false -> error
    end.
But I found it is tricky to check whether the input is a string in Erlang.
The string definition in Erlang is here:
Any suggestions?
Thanks in advance.

In Erlang a string can actually be quite a few things, so there are a few ways to do this depending on exactly what you mean by "a string". It is worth bearing in mind that every sort of string in Erlang is a list of character or lexeme values of some sort.
Encodings are not simple things, particularly when Unicode is involved. Characters can be almost arbitrarily high values, lexemes are globbed together in deep lists of integers, and Erlang iolist()s (which are super useful) are deep lists of mixed integer and binary values that get automatically flattened and converted during certain operations. If you are dealing with anything other than flat lists of printable ASCII values then I strongly recommend you read these:
Unicode module docs
String module docs
IO Library module docs
So... this is not a very simple question.
What to do about all the confusion?
Quick answer that always works: Consider the origin of the data.
You should know what kind of data you are dealing with, whether it is coming over a socket or from a file, or especially if you are generating it yourself. On the edges of your system you may need some help purifying data, though, because network clients send all sorts of random trash from time to time.
Some helper functions for the most common cases live in the io_lib module:
io_lib:char_list/1: Returns true if the input is a list of characters in the unicode range.
io_lib:deep_char_list/1: Returns true if the input is a deep list of legal chars.
io_lib:deep_latin1_char_list/1: Returns true if the input is a deep list of Latin-1 characters (values 0 to 255, which include your basic printable ASCII values from 32 to 126).
io_lib:latin1_char_list/1: Returns true if the input is a flat list of Latin-1 characters (90% of the time this is what you're looking for)
io_lib:printable_latin1_list/1: Returns true if the input is a list of printable Latin-1 (If the above isn't what you wanted, 9% of the time this is the one you want)
io_lib:printable_list/1: Returns true if the input is a flat list of printable chars.
io_lib:printable_unicode_list/1: Returns true if the input is a flat list of printable unicode chars (for that 1% of the time that this is your problem -- except that for some of us, myself included here in Japan, this covers 99% of my input checking cases).
For more particular cases you can either use a regex from the re module or write your own recursive function that zips through a string for those special cases where a regex either doesn't fit, is impossible, or could make you vulnerable to regex attacks.
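To make the flat-versus-deep distinction concrete, here is a rough model written in Python rather than Erlang. It is only a sketch of the idea: the function names are made up for illustration, it covers just the Latin-1 range (0 to 255), and it is not how io_lib is implemented.

```python
def _is_latin1_code(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and 0 <= value <= 255


def is_flat_latin1(term) -> bool:
    """Rough analogue of a flat check: a list whose items are all codes in 0..255."""
    return isinstance(term, list) and all(_is_latin1_code(c) for c in term)


def is_deep_latin1(term) -> bool:
    """Same check, but nested lists are accepted at any depth."""
    if not isinstance(term, list):
        return False
    return all(
        is_deep_latin1(c) if isinstance(c, list) else _is_latin1_code(c)
        for c in term
    )
```

A nested list such as [104, [105]] passes the deep check but fails the flat one, which is the same difference as between the deep and flat io_lib helpers listed above.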

In Erlang, a string can be represented as a list or as a binary.
If the string is a list, you can use the following function to check it:
is_string([C|T]) when is_integer(C), C >= 0, C =< 255 ->
    is_string(T);
is_string([]) ->
    true;
is_string(_) ->
    false.
If the string is a binary, the built-in function is_binary(Term) can be used.


Why is the following piece of Lua code, completely valid?

From my Lua knowledge (and according to what I have read in Lua manuals), I've always been under impression that an identifier in Lua is only limited to A-Z & a-z & _ & digits (and can not start using a digit nor be a reserved keyword i.e. local local = 123).
And now I have run into some (obfuscated) Lua program which uses all kind of weird characters for an identifier:
-- Most likely, copy+paste won't work. Download the file from
print(_VERSION .. " " .. (jit and "JIT" or "non-JIT"))
local T = {}
T.math = T.math or {}
T.math.​â®â€‹âŞâ®â€‹­ď»żâ€Śâ€­âŽ­ = math.sin
T.math.â¬â€‹â­â¬â­â«â®â€­â€¬ = math.cos
for k, v in pairs(T.math) do print(k, v) end
Lua 5.1 JIT
â¬â€‹â­â¬â­â«â®â€­â€¬ function: builtin#45
​â®â€‹âŞâ®â€‹­ď»żâ€Śâ€­âŽ­ function: builtin#44
It is unclear to me, why is this set of characters allowed for an identifier?
In other words, why is it a completely valid Lua program?
Unlike some languages, Lua is not really defined by a formal specification, one which covers every contingency and entirely explains all of Lua's behavior. Something as simple as "what character set is a Lua file encoded in" isn't really explained in Lua's documentation.
All the docs say about identifiers is:
Names (also called identifiers) in Lua can be any string of letters, digits, and underscores, not beginning with a digit and not being a reserved word.
But nothing ever really says what a "letter" is. There isn't even a definition for what character set Lua uses. As such, it's essentially implementation-dependent. A "letter" is... whatever the implementation wants it to be.
So, let's say you're writing a Lua implementation. And you want users to be able to provide Unicode-encoded strings (that is, strings within the Lua text). Lua 5.3 requires this. But you also don't want them to have to use UTF-16 encoding for their files (also because lua_load gets sequences of bytes, not shorts). So your Lua implementation assumes the byte sequence it gets in lua_load is encoded in UTF-8, so that users can write strings that use Unicode characters.
When it comes to writing the lexer/parser part of this implementation, how do you handle this? The simplest, easiest way to handle UTF-8 is to... not handle UTF-8. Indeed, that's the whole point of that encoding. Since everything that Lua defines with specific symbols are encoded in ASCII, and ASCII text is also UTF-8 text with the same meaning, you can basically treat a UTF-8 string like an ASCII string. For in-Lua strings, you just copy the sequence of bytes between the start and end characters of the string.
So how do you go about lexing identifiers? Well, you could ask the question above. Or you could ask a much simpler question: is the character a space, control character, digit, or symbol? A "letter" is merely something that isn't one of those.
Lua defines what things it considers to be "symbols". ASCII can tell you what is a control character, a space, and a digit. In such an implementation, any UTF-8 code unit with a value outside of ASCII is a letter. Even if, technically, those code units decode into something Unicode thinks of as a "symbol", your lexer just treats it as a letter.
This simple form of UTF-8 lexing gives you fast performance and low memory overhead. You don't have to decode UTF-8 into Unicode codepoints, and you don't need a giant Unicode table to tell you whether a codepoint is a "symbol" or "space" or whatever. And of course, it's also something that would naturally fall out of many ASCII-based Lua implementations.
So most Lua implementations will do it this way, if only by accident. Doing something more would require deliberate effort.
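The "anything outside ASCII is a letter" rule can be sketched in a few lines. This is a Python illustration of the idea only, not the code of any actual Lua implementation; the function names are hypothetical.

```python
ASCII_LETTERS = set(b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_")
ASCII_DIGITS = set(b"0123456789")


def is_name_start(byte: int) -> bool:
    # Bytes >= 0x80 only occur inside multi-byte UTF-8 sequences.
    # They are never decoded here; they simply count as letters.
    return byte in ASCII_LETTERS or byte >= 0x80


def is_name_continue(byte: int) -> bool:
    return is_name_start(byte) or byte in ASCII_DIGITS


def lex_name(source: bytes, pos: int = 0):
    """Return (name_bytes, next_pos), or None if no name starts at pos."""
    if pos >= len(source) or not is_name_start(source[pos]):
        return None
    end = pos + 1
    while end < len(source) and is_name_continue(source[end]):
        end += 1
    return source[pos:end], end
```

With a lexer like this, a field name made entirely of invisible characters such as U+202C and U+200B is accepted as a name, because every byte of their UTF-8 encoding is 0x80 or above.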
It also allows a user to use Unicode character sequences as identifiers. That means that someone can easily write code in their native language (outside of keywords).
But it also means that obfuscators have lots of ways to create "identifiers" that are just strings of nonsensical bytes. Indeed, because there are multiple ways in Unicode to "spell" the same apparent Unicode string (unless you examine the bytes directly), obfuscators can rig up identifiers that appear when rendered in a text editor to all be the same text, while actually being different strings.
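The point about multiple spellings can be checked with Python's standard unicodedata module. This is just a small demonstration, and the example word is arbitrary: the two strings usually render identically, yet their UTF-8 bytes differ, so a byte-oriented lexer would see two different names.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Different code points, hence different UTF-8 byte sequences.
different_strings = composed != decomposed
different_bytes = composed.encode("utf-8") != decomposed.encode("utf-8")

# Only after normalization (here NFC) do the two compare equal.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
```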
To clarify, there is only one identifier: T.
T.math is syntactic sugar for T["math"], and this also extends to the obfuscated strings. It is perfectly valid for a key to contain any characters, or even to start with a number.
However, the . form (rather than [ ]) only works with strings that conform to the identifier's limitations. See Nicol Bolas' answer for a great breakdown of those limitations.

Parsing text that requires lookahead using nom

tl;dr: I'm struggling to find documentation or examples of text parsers that require lookahead using nom.
Long version
I'm using nom to parse 6502 assembly. I'm struggling with creating a parser that can parse the various addressing modes. Any given opcode will have the following format:
XXX AM
Where XXX is a three-character mnemonic and AM is the operand. The operand can take many forms and is referred to as the "addressing mode." I've defined an enum for the operands, an enum for the addressing modes, and an OpCode tuple struct containing these values, which is ultimately the result returned when parsing.
The addressing mode can be omitted completely, in which case the addressing mode is Implied, or it can have a literal value of A, which is the Accumulator addressing mode.
Many of the addressing modes refer to memory locations, and it's these addressing modes I'm struggling to parse. In particular, if an addressing mode specifies a single byte in the form of $00, it is a ZeroPage addressing mode, whereas an operand specifying two bytes in the form of $0000 is an Absolute addressing mode. To complicate the matter, there are indexed variants of these addressing modes in the form of $00,X, $00,Y, $0000,X, etc.
Are there any good examples of existing text parsers that would illustrate the correct way to parse values that all start similarly ($00...) but are differentiated by how they end? The nom documentation is not very comprehensive, and the best example I've found is the INI parser, which isn't doing anything as complex as I'm trying to accomplish. I've also looked at the syn source code, but it's using a lot of custom macros and is a pretty complex beast, making it hard to learn from.
One way of doing this is with the alt!() macro.
The idea is have a parser which tries each alternative in sequence. So if you already have parsers for each of the addressing modes separately, you can combine them into a parser for any of them:
// The sub-parsers all return Operand too.
named!(parse_operand<&str, Operand>,
    alt!(parse_absolute_indexed |
         parse_absolute |
         parse_zeropage_indexed |
         parse_zeropage
         // parsers for any further addressing modes would be listed here
    )
);
Some notes:
The order may be important; I've put parse_absolute after parse_absolute_indexed since the former would match the initial part of the operand and return too early.
A variant would be to include the end of line (including comments if applicable) matching into each sub parser. Then it couldn't match early.
If you're parsing to the end of the input without a byte/character which terminates the pattern (such as a newline) then you may need to use alt_complete!() instead of alt!(). The reason for this is that if you try matching ADD $00, the parser which might match ADD $0000 has to assume that it might still match if more input arrives, and alt!() won't then skip to the next case. Using alt_complete!(), or alternatively wrapping the inner matchers in complete!(), is saying that an incomplete match is a non-match.
If the parsers were very complicated it might mean doing extra work (trying each parse in sequence) compared to a parser generated by, e.g., the venerable yacc, but I don't think it's an issue in this case.
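The effect of the ordering can be seen in a small model of ordered choice. The sketch below is in Python and uses regular expressions instead of nom, so it only illustrates the "try each alternative in sequence" idea; the patterns and names are assumptions made up for the example.

```python
import re

# Hypothetical operand patterns. Each one matches only a prefix of the input,
# the way a combinator parser consumes what it recognises and returns the rest.
HEX2 = r"[0-9A-Fa-f]{2}"
HEX4 = r"[0-9A-Fa-f]{4}"
ALTERNATIVES = [
    ("AbsoluteIndexed", re.compile(rf"\$({HEX4}),([XY])")),
    ("Absolute", re.compile(rf"\$({HEX4})")),
    ("ZeroPageIndexed", re.compile(rf"\$({HEX2}),([XY])")),
    ("ZeroPage", re.compile(rf"\$({HEX2})")),
]


def parse_operand(text, alternatives=ALTERNATIVES):
    """Return (mode, captured groups, remaining input) for the first match."""
    for mode, pattern in alternatives:
        match = pattern.match(text)
        if match:
            return mode, match.groups(), text[match.end():]
    return None
```

With the order shown, "$1234,X" is recognised as AbsoluteIndexed with nothing left over. If the list is reversed, the ZeroPage pattern matches "$12" first and leaves "34,X" unconsumed, which is the "return too early" problem described in the notes.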

ASCII Representation of Hexadecimal

I have a string that, by using string.format("%02X", char), I've received the following:
In the end, I'd like that string to look like the following:
t e x t NUL NUL NUL í Ó p SOH NUL ETX NUL (spaces are there just for clarification of characters desired in example).
I've tried to use \x..(hex#), string.char(0x..(hex#)) (where (hex#) is alphanumeric representation of my desired character) and I am still having issues with getting the result I'm looking for. After reading another thread about this topic: what is the way to represent a unichar in lua and the links provided in the answers, I am not fully understanding what I need to do in my final code that is acceptable for this to work.
I'm looking for some help in better understanding an approach that would help me to achieve my desired result provided below.
Well I thought that I had fixed it with the following code:
function hexToAscii(input)
    local convString = ""
    for char in input:gmatch("(..)") do
        convString = convString..(string.char("0x"..char))
    end
    return convString
end
It appeared to work, but I didn't think about characters above 127. Rookie mistake. Now I'm unsure how I can get the remaining characters (128 to 255) to display.
I did the following to check since I couldn't truly "see" them in the file.
function asciiSub(input)
    input = input:gsub(string.char(0x00), "<NUL>") -- suggested by a coworker
    return input
end
I added a few more gsub calls to substitute in other characters, and my file comes back with the replacement strings. But when I ran into characters in the extended ASCII table, it all fell apart.
Can anyone assist me in understanding a fix or new approach to this problem? As I've stated before, I read other topics on this and am still confused as to the best approach towards this issue.
The simple way to transform a base16-encoded string is just to
function unhex( input )
    return (input:gsub( "..", function(c)
        return string.char( tonumber( c, 16 ) )
    end ))
end
This is basically what you have, just a bit cleaner. (There's no need to say "(..)", ".." is enough – if you specify no captures, you'll automatically get the whole match. And while it might work if you write string.char( "0x"..c ), it's just evil – you concatenate lots of strings and then trigger the automatic conversion to numbers. Much better to just specify the base when explicitly converting.)
The resulting string should be exactly what went into the hex-dumper, no matter the encoding.
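For comparison, the same round trip can be written in Python, where the standard bytes type already provides both directions. This is only a cross-check of the claim that un-hexing gives back the original bytes; the sample input is made up, not the asker's actual data.

```python
def unhex(hex_text: str) -> bytes:
    """Turn a base16 string such as '74657874' back into raw bytes."""
    return bytes.fromhex(hex_text)


def hexdump(data: bytes) -> str:
    """Produce the same two-uppercase-digits-per-byte form as "%02X"."""
    return data.hex().upper()


sample = "74657874000000ED"
raw = unhex(sample)            # 8 raw bytes; no text encoding is assumed yet
round_trip = hexdump(raw)      # identical to sample
```

Note that the result is a sequence of bytes, not of characters. Which character a byte such as 0xED stands for depends entirely on the encoding used to view it.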
If you cannot correctly display the result, your viewer will also be unable to display the original input. If you used different viewers for the original input and the resulting output (e.g. a text editor and a terminal), try writing the output to a file instead and looking at it with the same viewer you used for the original input, then the two should be exactly the same.
Getting viewers that assume different encodings (e.g. one of the "old" 8-bit code pages or one of the many versions of Unicode) to display the same thing will require conversion between different formats, which tends to be quite complicated or even impossible. As you did not mention what encodings are involved (nor any other information like OS or programs used that might hint at the likely encodings), this could be just about anything, so it's impossible to say anything more specific on that.
You actually have a couple of problems:
First, make sure you know the meaning of the term character encoding, and that you know the difference between characters and bytes. A popular post on the topic is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, what encoding was used for the bytes you just received? You need to know this, otherwise you don't know what byte 234 means. For example it could be ISO-8859-1, in which case it is U+00EA, the character ê.
The characters 0 to 31 are control characters (eg. 0 is NUL). Use a lookup table for these.
Then, displaying the characters on the terminal is the hard part. There is no platform-independent way to display ê on the terminal. It may well be impossible with the standard print function. If you can't figure this step out you can search for a question dealing specifically with how to print Unicode text from Lua.
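The decoding and lookup-table steps above can be sketched as follows, in Python for brevity. The function name is hypothetical, and the sketch assumes a single-byte encoding such as ISO-8859-1, because it decodes one byte at a time; the encoding must be supplied by whoever knows where the bytes came from.

```python
# Conventional abbreviations for the 32 ASCII control characters (0..31).
CONTROL_NAMES = (
    "NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI "
    "DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US"
).split()


def render(data: bytes, encoding: str) -> str:
    """Show control bytes by name and decode every other byte with `encoding`."""
    parts = []
    for byte in data:
        if byte < 32:
            parts.append(f"<{CONTROL_NAMES[byte]}>")
        else:
            # Only valid for single-byte encodings; undecodable bytes become U+FFFD.
            parts.append(bytes([byte]).decode(encoding, errors="replace"))
    return "".join(parts)
```

Under ISO-8859-1, byte 234 comes out as ê. Under a different single-byte code page the same byte may come out as a different character, which is why the encoding has to be known before anything can be displayed correctly.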

Checking input grammar and deciding a result

Say I have a string "abacabacabadcdcdcd" and I want to apply a simple set of rules:
From left to right s.t. the string ends up being "abad". This output will be used to make a decision. After the rules are applied, if the output string does not match preset strings such as "abad", the original string would be discarded. ex. Every string should distill down to "abad", kick if it doesn't.
I have this hard-coded right now as regex, but there are many instances of these small rule sets. I am looking for something that will take a set of simple rules and compile (or just a function?) into something I can feed the string to and retrieve a result. The rule sets are independent of each other.
The input is tightly controlled, and the rules in use will be simple. Speed is the most important aspect.
I've looked at Bison and ANTLR, but I don't think I need anything nearly that powerful...
What am I looking for?
Edit: Should mention that the strings are made up of a couple letters. Usually 5, i.e. "abcde". There are no spaces, etc. Just letters.
If it is going to go fast, you can start out with a map, that contains your rules as key value pairs of strings. You can then compile this map to a sort of state machine, a tree with char keys, where the associated value is either a replacement string, or another tree.
You then go char by char through your string. Look up the current char in the tree. If you find another tree, look up the next character in that tree, etc.
At some point, either:
the lookup will fail, and then you know that the string you've seen so far is not the prefix of any rule. You can skip the current character and continue with the next.
or you get a replacement string. In that case, you can replace the characters between the current char and the last one you looked up inclusive by the replacement string.
The only difficulty is if the replacement can itself be part of a pattern to replace. Example:
ab -> e
cd -> b
The input:
acd -> ab (by rule 2)
ab -> e (by rule 1) ????
Now the question is if you want to reconsider ab to give e?
If this is so, you must start over from the beginning after each replacement. In addition, it will be hard to tell whether the replacement ever ends, except if all the rules you have are such that the right hand side is shorter than the left hand side. For, in that case, a finite string will get reduced in a finite amount of time.
But if we don't need to reconsider, the algorithm above will go straight through the string.
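A minimal sketch of this approach, in Python for illustration. It assumes that rules are non-empty strings and that no rule's left-hand side is a prefix of another's, so every tree value is either a replacement string or a subtree. The function names are made up. The repeat-until-stable helper is a simplification of "start over after each replacement": it repeats whole passes instead, and it is only guaranteed to terminate when every replacement is shorter than its pattern.

```python
def compile_rules(rules):
    """Turn {"ab": "e", ...} into nested dicts keyed by single characters."""
    root = {}
    for pattern, replacement in rules.items():
        node = root
        for ch in pattern[:-1]:
            node = node.setdefault(ch, {})
        node[pattern[-1]] = replacement  # leaf: the replacement string
    return root


def rewrite_once(text, tree):
    """Single left-to-right pass; replaced text is not reconsidered."""
    out = []
    i = 0
    while i < len(text):
        node, j = tree, i
        # Walk down the tree while the next character is a known key.
        while j < len(text) and isinstance(node, dict) and text[j] in node:
            node = node[text[j]]
            j += 1
        if isinstance(node, str):   # reached a replacement
            out.append(node)
            i = j
        else:                       # lookup failed: keep this character
            out.append(text[i])
            i += 1
    return "".join(out)


def rewrite_until_stable(text, tree):
    """Repeat whole passes until nothing changes any more."""
    while True:
        new_text = rewrite_once(text, tree)
        if new_text == text:
            return text
        text = new_text
```

With the two example rules, a single pass turns "acd" into "ab", and repeating until stable reduces it further to "e".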

What strategies are there for escaping character entities?

We are doing Natural Language Processing on a range of English language documents (mainly scientific) and run into problems in carrying non-ANSI characters through the various components. The documents may be "ASCII", UNICODE, PDF, or HTML. We cannot predict at this stage what tools will be in our chain or whether they will allow character encodings other than ANSI. Even ISO-Latin characters expressed in UNICODE will give problems (e.g. displaying incorrectly in browsers). We are likely to encounter a range of symbols including mathematical and Greek. We would like to "flatten" these into a text string which will survive multistep processing (including XML and regex tools) and then possibly reconstitute it in the last step (although it is the semantics rather than the typography we are concerned with so this is a minor concern).
I appreciate that there is no absolute answer - any escaping can clash in some cases - but I am looking for something along the lines of XML's <![CDATA[ ...]]> which will survive most non-recursive XML operations. Characters such as [ are bad as they are common in regexes. So I'm wondering if there is a generally adopted approach rather than inventing our own.
A typical example is the "degrees" symbol:
HTML Entity (decimal) &#176;
HTML Entity (hex) &#xb0;
HTML Entity (named) &deg;
How to type in Microsoft Windows Alt +00B0
Alt 0176
Alt 248
UTF-8 (hex) 0xC2 0xB0 (c2b0)
UTF-8 (binary) 11000010:10110000
UTF-16 (hex) 0x00B0 (00b0)
UTF-16 (decimal) 176
UTF-32 (hex) 0x000000B0 (00b0)
UTF-32 (decimal) 176
C/C++/Java source code "\u00B0"
Python source code u"\u00B0"
We are also likely to encounter TeX
$10\,^{\circ}{\rm C}$
so backslashes, curlies and dollars are a poor idea.
We could for example use markup like:
and this will probably work but I'd appreciate advice from those who have similar problems.
Update: I accept @MichaelB's insistence that we use UTF-8 throughout. I am worried that some of our tools may not conform and if so I'll revisit this. Note that my original question is not well worded - read his answer and the link in it.
Get someone to do this who really understands character encodings. It looks like you don't, because you're not using the terminology correctly. Alternatively, read this.
Do not brew up your own escape scheme - it will cause you more problems than it will solve. Instead, normalize the various source encodings to UTF-8 (which is really just one such escape scheme, except efficient and standardized) and handle character encodings correctly. Perhaps use UTF-7 if you're really that scared of high bits.
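As a rough illustration of what "normalize to UTF-8" means in practice, here is a Python sketch. The function names are hypothetical, and the source encoding of each document is assumed to be known (or detected by some other means); guessing it is a separate problem.

```python
import html


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def html_text_to_utf8(markup_text: str) -> bytes:
    """Resolve HTML character references such as &deg; and encode as UTF-8."""
    return html.unescape(markup_text).encode("utf-8")
```

Whether the degree sign arrives as the Latin-1 byte 0xB0, as UTF-16, or as an HTML entity in decimal, hex, or named form, it ends up as the same two bytes, 0xC2 0xB0, so every later stage only has to deal with one representation.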
In this day and age, not handling character encodings correctly is not acceptable. If a tool doesn't, abandon it - it is most likely very bad quality code in many other ways as well and not worth the hassle using.
Maybe I don't get the problem correctly, but I would create a very unique escape marker which is unlikely to be touched, and then use it to enclose the entity encoded as a base32 string.
Eventually, you can transmit the unique markers and their number along the chain through a separate channel, and check their presence and number at the end.
Example, something like
the value of the temperature was 18 cd48d8c50d7f40aeb6a164181b17feee EZSGKZY= cd48d8c50d7f40aeb6a164181b17feee
your marker is a uuid, and the entity is &deg encoded in base32. You then pass along the marker cd48d8c50d7f40aeb6a164181b17feee. It cannot be corrupted (if it gets corrupted, your filters will probably corrupt anything made of letters and numbers anyway, but at least you can exclude them because they are fixed length) and you can always recover the content by looking inside the two markers.
Of course, if you have uuids in your documents, this could represent a problem, but since you are not transmitting them as authorized markers along the lateral channel, they won't be recognized as such (and in any case, what's in between won't validate as a base32 string anyway).
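The wrapping scheme described above can be sketched in Python with the standard base64 and re modules. The function names are made up for the example, and the marker is simply the one from the sample sentence; in practice it would be generated and passed along the separate channel.

```python
import base64
import re


def wrap(entity: str, marker: str) -> str:
    """Enclose the base32-encoded entity between two copies of the marker."""
    encoded = base64.b32encode(entity.encode("utf-8")).decode("ascii")
    return f"{marker} {encoded} {marker}"


def unwrap_all(text: str, marker: str) -> str:
    """Replace every marker-enclosed base32 payload with its decoded content."""
    pattern = re.compile(
        re.escape(marker) + r" ([A-Z2-7=]+) " + re.escape(marker)
    )
    return pattern.sub(
        lambda m: base64.b32decode(m.group(1)).decode("utf-8"), text
    )


marker = "cd48d8c50d7f40aeb6a164181b17feee"
sentence = "the value of the temperature was 18 " + wrap("&deg", marker)
restored = unwrap_all(sentence, marker)
```

Because the payload uses only uppercase letters, the digits 2 to 7 and =, it contains none of the characters (brackets, backslashes, dollars) that the question worries about.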
If you need to search for them, then you can keep the uuid subdivision, and then use a proper regexp to spot these occurrences. Example:
>>> re.search("(\w{8}-\w{4}-\w{4}-\w{4}-\w{12})(.*?)(\\1)", s)
<_sre.SRE_Match object at 0x1003d31f8>
>>> _.groups()
('6d378205-1265-44e4-80b8-a47d1ceaad51', ' EZSGKZY= ', '6d378205-1265-44e4-80b8-a47d1ceaad51')
If you really need a specific "token" to test, you can use a uuid1, with a very defined specification of a node:
>>> uuid.uuid1(node=0x1234567890)
>>> uuid.uuid1(node=0x1234567890)
You can use anything you prefer as a node, the uuid will be unique, but you can still test for presence (although you can get false positives).
