I have created a textarea where I've set the maxlength to 350 characters and row="5". It works perfectly if I enter characters normally.
Then when I copy-pasted the data with 2-3 Enters, it stopped taking characters less than 350.
I just want to understand how it works and what all use-cases that I need to cover to implement it correctly. Thanks in advance!
Usually the line break in a text is encoded with invisible characters. In Windows operating systems, it's a combination of two characters : CR (carriage return) and LF (line feed). In unix-like operating systems, there is only one character LF (line feed).
CR and LF characters are counted in the maxlength, and that's the reason why your textarea seems to take less than 350 characters.
You can do some tests with this JSFiddle.
<body>
1<br>2<br>3<br>4<br>5<br>6<br>7<br>8<br>9<br>
<textarea rows="5" maxlength="10"></textarea>
</body>
If you copy-paste the full columns of digits into the textarea, only 5 of them will be pasted in an unix-like operating system, and even less in a windows operating system.
Related
I've been thinking about ASCII and memory lately and couldn't find a solid answer to this question.
When a script compiled, do ASCII characters use up different amounts of memory? And if so: what ASCII character uses up the most memory?
ASCII characters are a fixed width character encoding with each character represented by 7 bits. So to answer your question the different ASCII characters will all take the same amount of memory regardless of the implementation.
Because of the way in which our processor architectures are designed we typically store ASCII character in a single byte (the reason for doing so is because aligned memory access is a lot faster than having to do bitwise operations, see tripleee's comment). This means that typically any ASCII character will take up one byte of space on common computing platforms.
In contrast to this are the variable width encodings such as UTF8. For future readers who come across this page it might be worth noting that the ASCII characters 0 through to 127 are represented with the same binary as they are in UTF8. This was done to help maintain backwards compatibility. Therefore in the context of UTF8 encoding, the ASCII characters 0 through 127 will take up less space than other UTF8 characters.
Further I haven't heard of a mainstream compiler/interpreter that compresses strings stored with ASCII characters. This would impose a runtime performance hit that many would find unacceptable. Such a space optimization would therefore be left to the user to perform.
The ASCII wikipedia page has a good summary of the ASCII character set.
﷽ is probably the most space-consuming character. Im not sure about the coding, but it is a huge single-character. It is called "Basmala" and it means "In the name of Allah, the Most Gracious, the Most Merciful."
According to a Reddit user who has now deleted their account: “It's an Arabic ligature commonly used in Urdu. It was added so someone using an Urdu keyboard can type it easier.”
I love to use this in Discord raids, because imagine 2000 Basmala characters, vs 2000 regular characters. It fills their server up a LOT. Glad I could help.
I am using Clozure Cl on Mac os x 10.9 and Portable allegro serve
I have a file with text has characters like ı ç ş ö (these are some characters Turkish also have) and some Arabic characters. I cannot serve them. when i visit from the browser this kind of characters are not displayed at all, only part of text showed is the ones until the first ı in the text.
In Lisp i use a function composed with a do and read-lines and format (or i have tried print princ prin1 also) reads entire document and when i set the :external-format :utf-8 it shows the read characters properly in Lisp. Problem is in serving them, if i can serve them as i read on Lisp it will be done.
Also If do not set :external-formatat all, in Lisp it is read improperly, as expected, however, this time the browser can show all the text but with wrong characters in place of above described characters.
How to fix that and use external-formats character encodings properly?
See http://www.xach.com/lisp/allegro-cl/2001-3/964.html for an example on how to use :external-format in AllegroServe.
Cheers
Frank
P.S. I also posted an answer to the same question newsgroup comp.lang.lisp .
In our Visual Basic 6.0 program we used function chr(11) appended with some string and is displayed in text box.
In Windows 2003 Server, the value in text box is displayed as "a box(for chr(11)) followed by string"
In windows 7, the value in text box is displayed as "♂(for chr(11) followed by string"
Can anyone advice why it is behaving like this?
Thanks in advance.
It is probably a difference in fonts.
Even when the same "face name" is used the actual installed font can differ in terms of things like which glyphs are suported.
Note that your program isn't using ASCII in any sense of the word, but ANSI. The mapping from Unicode in your program to ANSI for display varies with Locale and Charset settings as well. Charset might also be a factor here.
Chr(11) says "take 11 and treat it as an ANSI character in the current codepage, convert that to Unicode, then return it as a Variant String."
Chr$(11) removes some of that overhead by returning a String, and ChrW$(11) is even cleaner, skipping the laundering through ANSI-to-Unicode conversion as well.
Faster yet is to just used the named constant for this character vbVerticalTab instead.
But none of that impacts display. It's more a question of avoiding unnecessary overhead.
You're relying on something that isn't reliable, i.e. that non-printable characters will always have a glyph. That "box" symbol you see means there is no glyph available for the character.
Even the Character Map applet doesn't display the glyph mapping for values below 33 (&H21).
What is the technically correct way of referring to "high ascii" or "extended ascii" characters? I don't just mean the range of 128-255, but any character beyond the 0-127 scope.
Often they're called diacritics, accented letters, sometimes casually referred to as "national" or non-English characters, but these names are either imprecise or they cover only a subset of the possible characters.
What correct, precise term that will programmers immediately recognize? And what would be the best English term to use when speaking to a non-technical audience?
"Non-ASCII characters"
ASCII character codes above 127 are not defined. many differ equipment and software suppliers developed their own character set for the value 128-255. Some chose drawing symbols, sone choose accent characters, other choose other characters.
Unicode is an attempt to make a universal set of character codes which includes the characters used in most languages. This includes not only the traditional western alphabets, but Cyrillic, Arabic, Greek, and even a large set of characters from Chinese, Japanese and Korean, as well as many other language both modern and ancient.
There are several implementations of Unicode. One of the most popular if UTF-8. A major reason for that popularity is that it is backwards compatible with ASCII, character codes 0 to 127 are the same for both ASCII and UTF-8.
That means it is better to say that ASCII is a subset of UTF-8. Characters code 128 and above are not ASCII. They can be UTF-8 (or other Unicode) or they can be a custom implementation by a hardware or software supplier.
You could coin a term like “trans-ASCII,” “supra-ASCII,” “ultra-ASCII” etc. Actually, “meta-ASCII” would be even nicer since it alludes to the meta bit.
A bit sequence that doesn't represent an ASCII character is not definitively a Unicode character.
Depending on the character encoding you're using, it could be either:
an invalid bit sequence
a Unicode character
an ISO-8859-x character
a Microsoft 1252 character
a character in some other character encoding
a bug, binary data, etc
The one definition that would fit all of these situations is:
Not an ASCII character
To be highly pedantic, even "a non-ASCII character" wouldn't precisely fit all of these situations, because sometimes a bit sequence outside this range may be simply an invalid bit sequence, and not a character at all.
"Extended ASCII" is the term I'd use, meaning "characters beyond the original 0-127".
Unicode is one possible set of Extended ASCII characters, and is quite, quite large.
UTF-8 is the way to represent Unicode characters that is backwards-compatible with the original ASCII.
Taken words from an online resource (Cool website though) because I found it useful and appropriate to write and answer.
At first only included capital letters and numbers , but in 1967 was added the lowercase letters and some control characters, forming what is known as US-ASCII, ie the characters 0 through 127.
So with this set of only 128 characters was published in 1967 as standard, containing all you need to write in English language.
In 1981, IBM developed an extension of 8-bit ASCII code, called "code page 437", in this version were replaced some obsolete control characters for graphic characters. Also 128 characters were added , with new symbols, signs, graphics and latin letters, all punctuation signs and characters needed to write texts in other languages, such as Spanish.
In this way was added the ASCII characters ranging from 128 to 255.
IBM includes support for this code page in the hardware of its model 5150, known as "IBM-PC", considered the first personal computer.
The operating system of this model, the "MS-DOS" also used this extended ASCII code.
Non-ASCII Unicode characters.
If you say "High ASCII", you are by definition in the range 128-255 decimal. ASCII itself is defined as a one-byte (actually 7-bit) character representation; the use of the high bit to allow for non-English characters happened later and gave rise to the Code Pages that defined particular characters represented by particular values. Any multibyte (> 255 decimal value) is not ASCII.
I am working on an application in Delphi 2009 which makes heavy use of RTF, edited using TRichEdit and TLMDRichEdit. Users who entered Japanese text in these RTF controls have been submitting intermittent reports about the Japanese text being displayed as gibberish when reloading the content, both on Win XP and Vista, with Eastern Language Support installed.
Typically, English and Japanese is mixed and is mostly displayed without a problem, for example:
Inventory turns partnerships. 在庫回転率の
(my apologies if the Japanese text is broken incorrectly - I do not speak or read the language).
Quite frequently however, only the Japanese portion of the text will be gibberish, for example:
ŒÉñ?“]-¦Œüã‚Ì·•Ê‰?-vˆö‚ðŽû‰v‚ÉŒø‰?“I‚ÉŒ‹‚т‚¯‚é’mŽ¯‚ª‘÷Ý‚·‚é?(マーケットセクター、
見込み客の優 先順位と彼らに販売する知識)
From extensive online searching, it appears that the problem is as a result of the fonts saved as part of the RTF. Fonts present on Japanese language version of Windows is not necessarily the same as a US English version. It is possible to programmatically replace the fonts in the RTF file which yields an almost acceptable result, i.e.
-D‚‚スƒIƒyƒŒ[ƒVƒ・“‚ニƒƒWƒXƒeƒBƒbƒN‚フƒpƒtƒH[ƒ}ƒ“ƒX‚-˜‰v‚ノŒ‹‚ム‚ツ‚ッ‚ネ‚「‚±ニ‚ヘ?A‘‚「‚ノ-ウ‘ハ‚ナ‚ ‚驕B‚サ‚‚ヘAl“セ‚オ‚ス・‘P‚フˆロ‚ƒƒXƒN‚ノ‚ウ‚‚キB
However, there are still quite a few "junk" characters in there which are not correctly recognized as Japanese characters. Looking at the raw RTF you'll see the following:
-D\'82\'82\u65405?\'83I\'83y\'83\'8c[\'83V\'83\u12539?\ldblquote\'82\u65414?
Clearly, the Unicode characters are rendered correctly, but for example the \'82\'82 pair of characters should be something else? My guess is that it actually represents a double byte character of some sort, which was for some mysterious reason encoded as two separate characters rather than a single Unicode character.
Is there a generic, (relatively) foolproof way to take RTF containing Eastern Languages and reliably displaying it again?
For completeness sake, I updated the RTF font table in the following way:
Replaced the font name "?l?r ?o?S?V?b?N;" with "\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"
Updated font names by replacing "\froman\fprq1\fcharset0 " with "\fnil\fprq1\fcharset128 "
Updated font names by replacing "\froman\fprq1\fcharset238 " with "\fnil\fprq1\fcharset128 "
Updated font names by replacing "\froman\fprq1 " with "\fnil\fprq1\fcharset128 "
Replacing font name "?? ?????;" with "\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"
Update: Updating font names alone wont make a difference. The locale seems to be the big problem. I have seen a few site discussing ways around converting the display of Japanese RTF to something most reader would handle, but I haven't found a solution yet, see for example:
here and here.
My guess is that changing font names in the RTF has probably made things worse. If a font specified in the RTF is not a Unicode font, then surely the characters due to be rendered in that font will be encoded as Shift-JIS, not as Unicode. And then so will the other characters in the text. So treating the whole thing as Unicode, or appending Unicode text, will cause the corruption you see. You need to establish whether RTF you import is encoded Shift-JIS or Unicode, and also whether the machine you are running on (and therefore D2009 default input format) is Japanese or not. In Japan, if a text file has no Unicode BOM it would usually be Shift-JIS (but not always).
I was seeing something similar, but not with Japanese fonts. Just special characters like micro (as in microliters) and superscripts. The problem was that even though the RTF string I was sending to the user from an ASP.NET webpage was correct (I could see the encoded RTF stream using Fiddler2), when MS Word actually opened the RTF, it added a bunch of garbage escape codes like what I see in your sample.
What I did was to run the entire RTF text through a conversion routine that swapped all characters over ascii 127 to their special unicode point equivalent. So I would get something like \uc1\u181? (micro) for the special characters. When I did that, Word was able to open the file no problem. Ironically, it re-encoded the \uc1\uxxx? back to their RTF escaped equivalents.
Private Function ConvertRtfToUnicode(ByVal value As String) As String
Dim ch As Char() = value.ToCharArray()
Dim c As Char
Dim sb As New System.Text.StringBuilder()
Dim code As Integer
For i As Integer = 0 To ch.Length - 1
c = ch(i)
code = Microsoft.VisualBasic.AscW(c)
If code <= 127 Then
'Don't need to replace if one of your typical ASCII codes
sb.Append(c)
Else
'MR: Basic idea came from here http://www.eggheadcafe.com/conversation.aspx?messageid=33935981&threadid=33935972
' swaps the character for it's Unicode decimal code point equivalent
sb.Append(String.Format("\uc1\u{0:d}?", code))
End If
Next
Return sb.ToString()
End Function
Not sure if that will help your problem, but it's working for me.