I'm currently parsing what looks to be a proprietary file format from a third-party commercial application. They seem to use a funny character encoding system and I need some help determining what it is, assuming it's not a proprietary encoding system as well.
I don't have a whole lot of different characters to analyze from but here is what I have so far:
__b -> blank space
__f -> forward slash
So for example, "Hello World" become "Hello__bWorld".
Does anybody have any idea what this is?
If not do you know of a resource on the web that can help me? Maybe there is a tool out there than can help in identifying character encoding?
It seems to be a proprietary encoding used by Numara FootPrints. This list of mappings comes from the FootPrints User Group forum. There is also a Perl script for decoding it.
Code Character
__b (space)
__a ' (single quote)
__q " (double quote)
__t ` (backquote)
__m # (at-sign)
__d . (period)
__u - (hyphen-minus)
__s ;
__c :
__p )
__P (
__3 #
__4 $
__5 %
__6 ^
__7 &
__8 *
__0 ~ (tilde)
__f / (slash)
__F \ (backslash)
__Q ?
__e ]
__E [
__g >
__G <
__B !
__W {
__w }
__C =
__A +
__I | (vertical line)
__M , (comma)
__Ux_ Unicode character with value 'x'
Related
I am writing a simple scanner in flex. I want my scanner to print out "integer type seen" when it sees the keyword "int". Is there any difference between the following two ways?
1st way:
%%
int printf("integer type seen");
%%
2nd way:
%%
"int" printf("integer type seen");
%%
So, is there a difference between writing if or "if"? Also, for example when we see a == operator, we print something. Is there a difference between writing == or "==" in the flex file?
There's no difference in these specific cases -- the quotes(") just tell lex to NOT interpret any special characters (eg, for regular expressions) in the quoted string, but if there are no special characters involved, they don't matter:
[a-z] printf("matched a single letter\n");
"[a-z]" printf("matched the 5-character string '[a-z]'\n");
0* printf("matched zero or more zero characters\n");
"0*" printf("matched a zero followed by an asterisk\n");
Characters that are special and mean something different outside of quotes include . * + ? | ^ $ < > [ ] ( ) { } /. Some of those only have special meaning if they appear at certain places, but its generally clearer to quote them regardless of where they appear if you want to match the literal characters.
I'm very new to all of this so please excuse any mistakes.
I'm working on on a mac.
I'm trying to follow this tutorial here
When I type in tr "[ -%,;\(\):=\.\\\*[]\"\']" "_" < hug_tol.fasta > hug_tol.clean.fasta
I get the error message tr:misplaced sequence asterisk
I'm guessing that something in the file must be wrong, but since I'm trying to remove those characters the error message doesn't make sense.
I haven't found anything on Google so maybe someone can help me.
The author of the tutorial appears to be using quasi-regex character class syntax for tr. tr is much more limited in it's scope than that. It only accepts a few escape characters and special characters. Simplify your command to
tr "%,;():=.*[]\"\' \\\\\-" "_" < hug_tol.fasta > hug_tol.clean.fasta
The - character does have special meaning, so put it at the end: in the beginning it will be interpreted as a command-line argument, while in the middle it specifies a character range. In bash, * won't be expanded in double quotes. For tr, to specify a plain \, you need a double \ (since it's the escape character). To get that through bash, you need \\\\.
You may also want to consider using the -c option to specify the complement set (the characters you want to keep), since it is probably much smaller:
tr -c "A-Za-z0-9_" "_" < hug_tol.fasta > hug_tol.clean.fasta
or more tersely
tr -c "[:alnum:]" "_" < hug_tol.fasta > hug_tol.clean.fasta
I have to display "#" in the UITextview content and after to put some information.
I looked on the internet via google but I didn't find an explanation which make
me understand the good approach.
Can you help me with some extra advice ?
Thanks !
You can simply do it like this:
_textView.text=#"#Hi Hello"; result will be #Hi Hello
However if you want to use " in the text you need to append it to backslash \
_textView.text=#"#Hi \"Hello"; result will be #Hi "Hello
You can enter almost all special characters without any problem, but you need to take care for the double quotes:
_textView.text=#"#Hi \"Hello * ! # # $ % ^ & ( ) _ + - [ ] ; ' {} <> ,. / ? : \" ";
I had the regular expression for email validating following rules
The local-part of the e-mail address may use any of these ASCII characters:
Uppercase and lowercase English letters (a-z, A-Z)
Digits 0 to 9
Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
Character . (dot, period, full stop) provided that it is not the first or last character, and provided also that it does not appear two or more times consecutively.
/^(([^<>()[\]\\.,;:\s#\"]+(\.[^<>()[\]\\.,;:\s#\"]+)*)|(\".+\"))#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/i
It is working in Javascript but in Ruby http://rubular.com/ it gives error "Premature end of char-class".
How can i resolve this?
Brackets are part of regex syntax. If you want to match a literal bracket (or any other special symbol, for that matter), escape with a backslash.
this should work :
/^(([^<>()\[\]\\.,;:\s#\"]+(\.[^<>()\[\]\\.,;:\s#\"]+)*)|(\".+\"))#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/i
You should escape opening square brackets as well as closings inside the symbol range:
# ⇓ ⇓
/^(([^<>()[\]\\.,;:\s#\"]+(\.[^<>()[\]\\.,;:\s#\"]+)*)…/
This should be:
/^(([^<>()\[\]\\.,;:\s#\"]+(\.[^<>()\[\]\\.,;:\s#\"]+)*)…/
Hope it helps.
irb(main):016:0> /[[e]/
SyntaxError: (irb):16: premature end of char-class: /[[e]/
from /ms/dist/ruby/PROJ/core/2.0.0-p195/bin/irb:12:in `<main>'
In JavaScript regular expression engine, you don't need to escape [ inside a character group []. However, you have to use \[ in Ruby regular expression.
/^(([^<>()\[\]\\.,;:\s#\"]+(\.[^<>()\[\]\\.,;:\s#\"]+)*)|(\".+\"))#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/i
I am using librets to retrieve data form my RETS Server. Somehow librets Encoding method is not working and I am receiving some weird characters in my output. I noticed characters like '’' is replaced with ’. I am unable to find a fix for librets so i decided to replace such garbage characeters with actual values after downloading data. What I need is a list of such garbage string and their equivalent characters. I googled for this but not found any resource. Can anyone point me to the list of such garbage letters and their actual values or a piece of code which can generate such letter.
thanx
Search for the term "UTF-8", because that's what you're seeing.
UTF-8 is a way of representing Unicode characters as a sequence of bytes. ("Unicode characters" are the full range of letters and symbols used all in human languages.) Typically, one Unicode character becomes 1, 2, or 3 bytes in UTF-8. When those bytes (numbers from 0 to 255) are displayed using the character set normally used by Windows, they appear as "garbage" -- in this case, 3 "garbage letters" which are really the 3 bytes of a UTF-8 encoding.
In your example, you started with the smart quote character ’. Its representation in Unicode is the number 8217, or U+2019 (2019 is the hexadecimal for 8217). (Search for "Unicode" for a complete list of Unicode characters and their numbers.) The UTF-8 representation of the number 8217 is the three byte sequence 226, 128, 153. And when you display those three bytes as characters, using the Windows "CP-1252" character encoding (the ordinary way of displaying text on Windows in the USA), they appear as ’. (Search for "CP-1252" to see a table of bytes and characters.)
I don't have any list for you. But you could make one if you wrote a program in a language that has built-in support for Unicode and UTF-8. All I can do is explain what you are seeing.
If there is a way to tell librets to use UTF-8 when downloading, that might automatically solve your problem. I don't know anything about librets, but now that you know the term "UTF-8" you might be able to make progress.
Question reminder:
"...I noticed characters like '’' is replaced with ’... i decided to
replace such garbage characeters with actual values after downloading
data. What I need is a list of such garbage string and their
equivalent characters."
Strictly dealing with this part:
"What I need is a list of such garbage string and their equivalent
characters."
Using php, you can generate these characters and their equivalence. Working with all 1,111,998 Unicode points or 109,449 Utf8 symbols is impractical. You may use the ASCII range in the following loop between € and Ă or another range that is more relevant to your context.
<?php
for ($i=128; $i<258; $i++)
$tmp1 .= "<tr><td>".htmlentities("&#$i;")."</td><td>".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."</td><td>&#".$i.";</td></tr>";
echo "<table border=1>
<tr><td>&#</td><td>"Garbage"</td><td>symbol</td></tr>";
echo $tmp1;
echo "</table>";
?>
From experience, in an ASCII context, most "garbage" symbols originate in the range € to ā + (seldom) ῁ to ‶.
In order for the "garbage" symbols to display, the html page charset must be set to iso-1 or whichever other charset that caused the problem in the first place. They will not show if the charset is set to utf-8.
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
.
"i decided to replace such garbage characeters with actual values
after downloading data"
You CANNOT undo the "garbage" with php utf8_decode(), which would actually create more "garbage" on already "garbage". But, you may use the simple and fast search and replace php str_replace() function.
First, generate 2 arrays for each set of "garbage" symbols you wish to replace. The first array is the Search term:
<?php
//ISO 8859-1 (Latin-1) special chars are found in the range 128 to 257
$tmp1 = "\$SearchArr = array(";
for ($i=128; $i<258; $i++)
$tmp1 .= "\"".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."\", ";
$tmp1 = substr($tmp1,0,strlen($tmp1)-2);//erases last comma
$tmp1 .= ");";
$tmp1 = htmlentities($tmp1,ENT_NOQUOTES,"utf-8");
?>
The second array is the replace term:
<?php
//Adapt for your relevant range.
$tmp2 = "\$ReplaceArr = array(\n";
for ($i=128; $i<258; $i++)
$tmp2 .= "\"&#".$i.";\", ";
$tmp2 = substr($tmp2,0,strlen($tmp2)-2);//erases last comma
$tmp2 .= ");";
echo $tmp1."\n<br><br>\n";
echo $tmp2."\n";
?>
Now, you've got 2 arrays that you can copy and paste to use and reuse to clean any of your infected strings like this:
$InfectedString = str_replace($SearchArr,$ReplaceArr,$InfectedString);
Note: utf8_decode() is of no help for cleaning up "garbage" symbols. But, it can be used to prevent further contamination. Alternatively a mb_ function can be useful.