Lua pattern match around comma - lua

I have several small place marks such as 'א,א' 'א,ב'. Using the comma as the center point, I need at most 2 characters before the comma and everything up to the next space after it.
I have (.-,.-)%s but it's not doing what I need. Any ideas?
Also, as you can see, these are not Latin letters, so using %l will not work.

There are a couple of issues here. First, a minor one: .-, will match as little as possible before the comma, that is, zero characters. You should anchor the beginning of the matched string.
The more complicated issue is that you use Hebrew letters. The problem is that Lua has no concept of multi-byte characters.
If you use an 8-bit encoding such as Windows-1255 or ISO-8859-8, then you can probably just match against a character class spanning aleph through tav, [א-ת]. If you have a properly set Hebrew locale, %l should work fine for you.
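For instance, here is a minimal sketch assuming a single-byte encoding in which aleph through tav occupy 0xE0-0xFA (true of both Windows-1255 and ISO-8859-8); the sample string is invented:
local text = "\224,\225 "   -- hypothetical sample: aleph, comma, bet, space (0xE0 0x2C 0xE1 0x20)
-- one or two Hebrew letters, a comma, then Hebrew letters up to the next space
local mark = text:match("([\224-\250][\224-\250]?,[\224-\250]-)%s")
-- mark now holds the bytes of "א,ב"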
If you use UTF-8 or any other encoding that uses multi-byte characters, then you must construct a regex that spells out the Hebrew alphabet as a sequence of octets. Aleph is U+05D0, which UTF-8 represents as 0xD7 0x90; tav is U+05EA, which is encoded as 0xD7 0xAA.
In Lua you can escape any 8-bit character with a backslash plus its decimal code. All Hebrew characters encoded in UTF-8 share the same first byte, 0xD7, that is "\215". The second byte can be anything from "\144" to "\170". Thus, the regex that matches a single Hebrew letter is "\215[\144-\170]". Put that in your original regex where you had single dots that match any character.
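Here is a minimal sketch of that idea for UTF-8 input (the sample string is invented, and the source file is assumed to be saved as UTF-8). Because a Lua quantifier cannot span the two-byte sequence, the "one or two letters before the comma" requirement is written as two alternative patterns, and the part after the comma uses a byte class covering both the lead byte and the continuation range:
local letter = "\215[\144-\170]"   -- one Hebrew letter in UTF-8 (0xD7 0x90 .. 0xD7 0xAA)
local tail   = "[\215\144-\170]-"  -- a run of bytes that occur in Hebrew letters
local text   = "ליד א,ב ועוד"      -- hypothetical sample text
-- try two letters before the comma first, then fall back to one
local mark = text:match("(" .. letter .. letter .. "," .. tail .. ")%s")
          or text:match("(" .. letter .. "," .. tail .. ")%s")
print(mark)                        --> א,ב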
Of course, the above reasoning must be adapted for encodings other than UTF-8. Hebrew's right-to-left writing direction is another thing to keep in mind.

Related

iPhone - Localizable.strings - string with single quotes inside

I need to add some French translation to an iOS application, but I don't know how to use a single quote character in the Localizable.strings file.
For example, the text:
"Invalid username or password."="Nom d'utilisateur ou mot de passe incorrect.";
causes an error. I've tried adding backslashes, but that hasn't worked either.
Using Special Characters in String Resources
Just as in C, some characters must be prefixed with a backslash before
you can include them in the string. These characters include double
quotation marks, the backslash character itself, and special control
characters such as linefeed (\n) and carriage returns (\r).
From:
https://developer.apple.com/library/mac/documentation/Cocoa/Conceptual/LoadingResources/Strings/Strings.html
Did you try escaping it with a backslash?
"\'"
As a last resort you could use direct \U codes:
You can include arbitrary Unicode characters in a value string by
specifying \U followed immediately by up to four hexadecimal digits.
The four digits denote the entry for the desired Unicode character;
for example, the space character is represented by hexadecimal 20 and
thus would be \U0020 when specified as a Unicode character. This
option is useful if a string must include Unicode characters that for
some reason cannot be typed. If you use this option, you must also
pass the -u option to genstrings in order for the hexadecimal digits
to be interpreted correctly in the resulting strings file. The
genstrings tool assumes your strings are low-ASCII by default and only
interprets backslash sequences if the -u option is specified.
The apostrophe should be \U0027
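Putting the two suggestions together, either of these entries reflects what the answers above describe; treat them as sketches to try rather than guaranteed syntax:
"Invalid username or password." = "Nom d\'utilisateur ou mot de passe incorrect.";
"Invalid username or password." = "Nom d\U0027utilisateur ou mot de passe incorrect.";
(Per the documentation quoted above, the \U form requires passing -u to genstrings for the hexadecimal digits to be interpreted.)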

The origin of why '%20' is used as a space in URLs

I am interested in knowing why '%20' is used as a space in URLs, particularly why %20 was used and why we even need it in the first place.
It's called percent encoding. Some characters can't appear literally in a URI (for example #, as it denotes the URL fragment), so they are represented with characters that can (# becomes %23).
Here's an excerpt from that same article:
When a character from the reserved set (a "reserved character") has
special meaning (a "reserved purpose") in a certain context, and a URI
scheme says that it is necessary to use that character for some other
purpose, then the character must be percent-encoded.
Percent-encoding a reserved character involves converting the
character to its corresponding byte value in ASCII and then
representing that value as a pair of hexadecimal digits. The digits,
preceded by a percent sign ("%") which is used as an escape character,
are then used in the URI in place of the reserved character. (For a
non-ASCII character, it is typically converted to its byte sequence in
UTF-8, and then each byte value is represented as above.)
The space character's character code is 32:
> ' '.charCodeAt(0)
32
Which is 20 in base-16:
> ' '.charCodeAt(0).toString(16)
"20"
Tack a percent sign in front of it and you get %20.
Because URLs have strict syntactic rules: / is a special path separator character, spaces are not allowed in a URL, and all characters have to come from a certain subset of ASCII. To embed arbitrary characters in URLs regardless of these restrictions, bytes can be percent-encoded. The byte 0x20 represents a space in ASCII (and most other encodings), hence %20 is the URL-encoded version of it.
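For illustration, here is a minimal sketch of that rule in Lua (the language of the question at the top of this page; the function name is mine): every byte outside RFC 3986's unreserved set is replaced by "%" followed by its value as two hex digits.
local function percent_encode(s)
  -- replace every byte outside the unreserved set (A-Z a-z 0-9 - . _ ~)
  -- with "%" followed by the byte value as two hex digits
  return (s:gsub("[^%w%-._~]", function(c)
    return string.format("%%%02X", c:byte())
  end))
end
print(percent_encode("hello world"))   --> hello%20world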
It uses percent encoding. You can see the Percent Encoding part of the RFC for Uniform Resource Identifier (URI): Generic Syntax
A percent-encoding mechanism is used to represent a data octet in a
component when that octet's corresponding character is outside the
allowed set or is being used as a delimiter of, or within, the
component. A percent-encoded octet is encoded as a character
triplet, consisting of the percent character "%" followed by the two
hexadecimal digits representing that octet's numeric value. For
example, "%20" is the percent-encoding for the binary octet
"00100000" (ABNF: %x20), which in US-ASCII corresponds to the space
character (SP).

In what charset is 0xE1 an "a" with an umlaut?

I am trying to identify an extended ASCII charset where 0xE1 is an "a" with an umlaut (8859-1 character 0xE4)
and 0xF5 is a "u" with an umlaut (8859-1 character 0xFC).
Has anyone seen this charset before? It quite possibly dates back to the '80s.
To my knowledge, there are no standard ASCII character sets which use those symbols. Here is a list of the standard character sets: http://www.columbia.edu/kermit/csettables.html
However, the character set you referenced is in fact used for interfacing with some LEDs, for example HT1632-compliant LEDs use the same character set: http://blog.thiseldo.co.uk/wp-filez/USB_HT1632_Matrix.pde
I hope this helps.
These characters do exist in the commonly used ISO-8859-1 character set, also known as Latin-1, but there they sit at 0xE4 and 0xFC (the values the question already mentions), not at 0xE1 and 0xF5.

Load testing tool URL encoding system

I have a load testing tool (Borland's SilkPerformer) that is encoding the / character as \x252f. Can anyone please tell me what encoding system the tool might be using?
Two different escape sequences have been combined:
a C string hexadecimal escape sequence.
a URL-encoding scheme (percent encoding).
See this little diagram:
+----------> C escape in hexadecimal notation
:  +-------> hexadecimal number, ASCII for '%'
:  :  +----> hexadecimal number, ASCII for '/'
\x 25 2f
Explained step by step:
\ starts a C string escape sequence.
x is an escape in hexadecimal notation. Two hex digits are expected to follow.
25 is % in ASCII, see ASCII Table.
% starts a URL escape, also called percent-encoding. Two hex digits are expected to follow.
2f is the slash character (/) in ASCII.
The slash is the result.
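As a quick illustration of those two layers in Lua (the helper is mine, not anything from SilkPerformer): the C escape \x25 is the byte 0x25, i.e. "%", so "\x252f" read as a C string is the three-character text "%2f", and percent-decoding that text yields "/".
-- "\x252f", interpreted as a C string literal, is the text "%2f"
local url_escaped = "%2f"
-- percent-decode: replace each %XX with the byte whose value is XX
local decoded = url_escaped:gsub("%%(%x%x)", function(hex)
  return string.char(tonumber(hex, 16))
end)
print(decoded)   --> /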
Now I don't know why your software chooses to encode the slash character in such a roundabout way. Slash characters in URLs need to be URL-encoded when they don't denote directory separators (the same role the backslash plays on Windows), so you will often find the slash encoded as %2f; that's normal. But I find it weird and a bit suspicious that the percent character is additionally encoded as a hexadecimal escape sequence for C strings.

Parsing \"–\" with Erlang re

I've parsed an HTML page with mochiweb_html and want to parse the following text fragment
0 – 1
Basically I want to split the string on the spaces and the dash character and extract the numbers on either side.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs].
Could someone possibly explain:
what I'm doing wrong here?
why the '–' character seemingly requires three integers for representation [226, 128, 147]?
Thanks.
226,128,147 is E2,80,93 in hex.
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]
As to your second question, about why a dash takes 3 bytes to encode, it's because the dash in your input isn't an ASCII hyphen (hex 2D), but is a Unicode en-dash (hex 2013). Your code is receiving this in UTF-8 encoding, rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E28093 in UTF-8 encoding.
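As a quick sanity check of that encoding, here is a purely illustrative snippet in Lua 5.3+ (the language this page started with; it is not part of the Erlang solution):
-- utf8.char builds the UTF-8 byte sequence for a code point
local dash = utf8.char(0x2013)
print(string.format("%X %X %X", dash:byte(1, -1)))   --> E2 80 93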
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.
