I have heard of UTF-8 and UTF-16, but where do the following characters come from?
⊙●○①⊕◎Θ⊙¤㊣★☆♀◆◇◣◢◥▲▼△▽⊿◤ ◥
▆ ▇ █ █ ■ ▓ 回 □ 〓≡ ╝╚╔ ╗╬ ═ ╓ ╩ ┠ ┨┯ ┷┏ .
┓┗ ┛┳⊥﹃﹄┌ ┐└ ┘∟「」↑↓→←↘↙♀♂┇┅ ﹉﹊﹍﹎╭ .
╮╰ ╯ *^_^* ^*^ ^-^ ^_^ ^︵^ ∵∴‖︱ ︳︴﹏﹋﹌︵︶︹︺ .
【】〖〗@﹕﹗/ " _ `,·。≈{}~ ~() _ -『』√ $ # * & # ※ .
卐 々∞Ψ ∪∩∈∏ の ℡ ぁ §∮〝〞ミ灬ξ№∑⌒ξζω*
¡ Þ ↘ ㊣ ◎ ○ ● ⊕ ⊙ ○△ ▲ ☆ ★ ◇ ◆ ■ ▄ █ ▌ ♀ ♥ ⊙ ◎ ↔ ◊ の ★☆⊕◎Θ\﹏﹋﹌【】〖〗※-『』√∴卐 ≈ ∵∴§∮•.•♠♣♂ ◊ ♠ ♣ の ☆→ ぃ £ ..
Some of the characters are from the IBM extended ASCII character set:
http://ascii-table.com/ascii-extended-pc-table.php
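If you want to check for yourself where any of these characters live, Python's unicodedata module will report each one's official Unicode name, which usually hints at the block or legacy character set it came from (a small sketch; the sample characters are just a few picked from the list above):

import unicodedata

# Print the code point and the official Unicode name for a few of the characters above.
for ch in "⊙①★♀◆▲█の㊣":
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")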
I'm trying to figure out how to decode some corrupt characters I have in a spreadsheet. There is a list of website titles: some in English, some in Greek, some in other languages. For example, the Greek phrase ΕΛΛΗΝΙΚΑ ΝΕΑ ΤΩΡΑ shows up as ŒïŒõŒõŒóŒùŒôŒöŒë ŒùŒïŒë Œ§Œ©Œ°Œë. So the whitespace is OK, but the actual letters have gone all wrong.
I have noticed that letters got converted to pairs of symbols:
Ε - Œï
Λ - Œõ
And so on. So it's almost always Œ and then some other symbol after it.
I went further: I removed the repeated letter and checked the difference in code points between the actual phrase and what was left of the corrupted phrase, ord('Ε') - ord('ï') and so on (see the short sketch after this list). The difference is almost the same all the time:
678
678
678
676
676
677
676
678
0 (this is a whitespace)
676
678
678
0 (this is a whitespace)
765
768
753
678
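Roughly, the check above looked like this (a reconstruction in Python; the stripped string is just the corrupted phrase with every Œ removed):

original = "ΕΛΛΗΝΙΚΑ ΝΕΑ ΤΩΡΑ"
stripped = "ïõõóùôöë ùïë §©°ë"  # corrupted phrase with the repeated 'Œ' removed

for good, bad in zip(original, stripped):
    # prints 0 for the spaces and roughly 676-768 for the letters, as listed above
    print(good, bad, ord(good) - ord(bad))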
I have manually decoded some of the other letters from other titles:
Greek
Œë Α
Œî Δ
Œï Ε
Œõ Λ
Œó Η
Œô Ι
Œö Κ
Œù Ν
Œ° Ρ
Œ§ Τ
Œ© Ω
Œµ ε
Œª λ
œÑ τ
ŒØ ί
Œø ο
œÑ τ
œâ ω
ŒΩ ν
Symbols
‚Äò ‘
‚Äô ’
‚Ķ …
‚Ć †
‚Äú “
Other
√© é
It's good that I have a translation for this phrase, but there are a couple of others I don't have a translation for. I would be glad of any advice, because searching around Stack Overflow didn't show me anything related.
It's a character encoding issue. The string appears to be in the Mac OS Roman encoding (figured out by educated guesses on this site). The IANA name for this encoding is macintosh, and its Windows code page number is 10000.
Here's a Python function that will decode such macintosh-mangled strings back to UTF-8:
def macToUtf8(s):
    # re-encode the garbled text as Mac OS Roman bytes, then decode those bytes as UTF-8
    return bytes(s, 'macintosh').decode('utf-8')

print(macToUtf8('ŒïŒõŒõŒóŒùŒôŒöŒë ŒùŒïŒë Œ§Œ©Œ°Œë'))
# outputs: ΕΛΛΗΝΙΚΑ ΝΕΑ ΤΩΡΑ
My best guess is that your spreadsheet was saved on a Mac computer, or perhaps saved using some Macintosh-based setting.
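As a sanity check, you can reproduce the corruption in the other direction: take the proper Greek text, encode it as UTF-8 bytes, and decode those bytes as Mac OS Roman (a small sketch, assuming that is indeed the chain your spreadsheet went through):

original = "ΕΛΛΗΝΙΚΑ ΝΕΑ ΤΩΡΑ"

# the UTF-8 bytes of the Greek text, misinterpreted as Mac OS Roman
corrupted = original.encode("utf-8").decode("macintosh")
print(corrupted)  # ŒïŒõŒõŒóŒùŒôŒöŒë ŒùŒïŒë Œ§Œ©Œ°Œë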
See also this issue: What encoding does MAC Excel use?
I'm triggering a parameterized Jenkins build remotely via a URL. This works so far:
http://JENKINS_SERVER/job/YOUR_JOB_NAME/buildWithParameters?myparam=Hello
But when the value of myparam contains whitespace, as in Hello world, it doesn't work:
myparam=Hello world
Full line:
http://JENKINS_SERVER/job/YOUR_JOB_NAME/buildWithParameters?myparam=Hello world
How can I pass this parameter value?
You just need to replace the blank space with %20:
http://JENKINS_SERVER/job/YOUR_JOB_NAME/buildWithParameters?myparam=Hello%20world
This is known as URL encoding, used for unsafe or special characters.
Here is a summary table:
character    encoded equivalent
backspace %08
tab %09
space %20
! %21
" %22
# %23
$ %24
% %25
& %26
' %27
( %28
) %29
* %2A
+ %2B
, %2C
- %2D
. %2E
/ %2F
: %3A
; %3B
< %3C
= %3D
> %3E
? %3F
@ %40
[ %5B
\ %5C
] %5D
^ %5E
_ %5F
` %60
{ %7B
| %7C
} %7D
¿ %BF
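If you are building the trigger URL from a script rather than by hand, you can let a library do the escaping for you; here is a minimal Python sketch (the server and job name are the placeholders from the question):

from urllib.parse import quote, urlencode

base = "http://JENKINS_SERVER/job/YOUR_JOB_NAME/buildWithParameters"
params = {"myparam": "Hello world"}  # the value contains a space

# quote_via=quote encodes the space as %20 (the default, quote_plus, would use '+')
url = base + "?" + urlencode(params, quote_via=quote)
print(url)
# http://JENKINS_SERVER/job/YOUR_JOB_NAME/buildWithParameters?myparam=Hello%20world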
References:
https://blogs.msdn.microsoft.com/oldnewthing/20100331-00/?p=14443
https://perishablepress.com/stop-using-unsafe-characters-in-urls/
Complete list of URL-encoded values: https://www.degraeve.com/reference/urlencoding.php
I am facing an issue while compiling the WRF-DA code (code is here).
The compilation line that fails:
ftn -c -ip -O3 -w -ftz -fno-alias -align all -FR -convert big_endian -r8 -real-size `expr 8 \* 8` -i4 -I../external/crtm_2.2.3/libsrc -I/opt/cray/pe/hdf5/1.10.0.3/INTEL/16.0/include -L/opt/cray/pe/hdf5/1.10.0.3/INTEL/16.0/lib/ -lhdf5hl_fortran -lhdf5_fortran -lhdf5 da_radiance.f
da_radiance.f(5884): error #6285: There is no matching specific subroutine for this generic subroutine call. [H5DREAD_F]
call H5Dread_f(dhnd1, &
-----------^
I tried searching for the relevant symbol in the library, and as expected the symbol was not present (h5dread_f_c is present instead):
nm /opt/cray/pe/hdf5/1.10.0.3/INTEL/16.0/lib/libhdf5*|grep -i h5dread_f
nm: /opt/cray/pe/hdf5/1.10.0.3/INTEL/16.0/lib/libhdf5.settings: File format not recognized
nm: /opt/cray/pe/hdf5/1.10.0.3/INTEL/16.0/lib/libhdf5_cpp_intel_160.la: File format not recognized
U h5dread_f_c
U h5dread_f_c
0000000000001290 T h5dread_f_c
0000000000035320 T h5dread_f_c
U h5dread_f_c
U h5dread_f_c
0000000000001290 T h5dread_f_c
0000000000035320 T h5dread_f_c
U h5dread_f_c
U h5dread_f_c
0000000000001290 T h5dread_f_c
0000000000035320 T h5dread_f_c
0000000000035320 T h5dread_f_c
0000000000035320 T h5dread_f_c
I tried compiling hdf5-1.10.2. With a quick peek at the code, I saw that the function seems to be declared (and commented) in fortran/src/H5Dff.F90 as:
! M. Scot Breitenfeld
! September 17, 2011
!
! Fortran2003 Interface:
!! SUBROUTINE h5dread_f(dset_id, mem_type_id, buf, hdferr, &
!! mem_space_id, file_space_id, xfer_prp)
!! INTEGER(HID_T), INTENT(IN) :: dset_id
!! INTEGER(HID_T), INTENT(IN) :: mem_type_id
!! TYPE(C_PTR) , INTENT(INOUT) :: buf
!! INTEGER , INTENT(OUT) :: hdferr
!! INTEGER(HID_T), INTENT(IN) , OPTIONAL :: mem_space_id
!! INTEGER(HID_T), INTENT(IN) , OPTIONAL :: file_space_id
!! INTEGER(HID_T), INTENT(IN) , OPTIONAL :: xfer_prp
!*****
SUBROUTINE h5dread_ptr(dset_id, mem_type_id, buf, hdferr, &
mem_space_id, file_space_id, xfer_prp)
Has this function been phased out in the latest versions of HDF5?
If yes, please share an appropriate (older) version of the HDF5 library (and the relevant compilation flags) in which I can find this symbol.
Please let me know if I can provide any further information.
h5dread_f is a generic interface, which maps to one of the following specific procedures:
INTERFACE h5dread_f
MODULE PROCEDURE h5dread_reference_obj
MODULE PROCEDURE h5dread_reference_dsetreg
MODULE PROCEDURE h5dread_char_scalar
MODULE PROCEDURE h5dread_ptr
END INTERFACE
It seems that invalid argument types are being passed into the call, so none of these specific procedures match.
(Thanks to Dave Allured from the HDF5 group.)
I want to send binary data (ESC/POS commands) via the EPSON Send Data Tool (senddat.exe) from the command prompt, according to their website/manual.
If the printer is set as a USB printer class:
senddat.exe scriptfile USBPRN
(C:\senddat.exe sample.txt ESDPRT001)
File: sample.txt
' Sample script of senddat
' Version 0.01
'Comment line is starting ' character
!Display line is starting ! character
.Pause line is starting . character
'Decimal data
48 49 50 51 CR LF
'Hexadecimal data
30h 31h 32h CR LF
0x33 0x34 0x35 CR LF
$36 $37 $38 CR LF
'String data 1
string1 CR LF
'String data 2
"string2" CR LF
'Special characters
"\"" CR LF
"\'" CR LF
"\\" CR LF
"\0" CR LF
which should print:
0123
012
345
678
String1
String2
“
‘
BUT it does not print anything; it only creates an output file (with the same name as the port name, in the same directory). For example, my command above creates the file c:\ESDPRT001.
Can anybody help me with this?
To output the data to a USB printer class printer, you need to specify the port like below:
senddat.exe sample.txt USBPRN0
"USBPRN0" is just example, you need to set correct # with your test PC environment
The Stanford Parser (http://nlp.stanford.edu/software/lex-parser.html), version 3.6.0, comes with trained grammars for English, German and other languages. To parse German text, the Stanford Parser provides the tool lexparser-lang.sh:
./lexparser-lang.sh
Usage: lexparser-lang.sh lang len grammar out_file FILE...
lang : Language to parse (Arabic, English, Chinese, German, French)
len : Maximum length of the sentences to parse
grammar : Serialized grammar file (look in the models jar)
out_file : Prefix for the output filename
FILE : List of files to parse
So I call it with these options:
sadik#sadix:stanford-parser-full-2015-12-09$ ./lexparser-lang.sh German 500 edu/stanford/nlp/models/lexparser/germanFactored.ser.gz factored german_test.txt
The input file german_test.txt contains a single German sentence:
Fußball findet um 8 Uhr in der Halle statt.
But the "ß" results in a warning and a wrong result. Same with "ä", "ö" and "ü". Now, lexparser-lang.sh is supposed to be designed to deal with German text as input. Is there any option I am missing?
How it is:
[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/germanFactored.ser.gz ...
done [3.8 sec].
Parsing file: german_test.txt
Apr 01, 2016 12:48:45 AM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: (U+9F, decimal: 159)
Parsing [sent. 1 len. 11]: FuÃ ball findet um 8 Uhr in der Halle statt .
Parsed file: german_test.txt [1 sentences].
Parsed 11 words in 1 sentences (32.07 wds/sec; 2.92 sents/sec).
With a parse tree that looks like crap:
(S (ADV FuÃ) (ADV ball) (VVFIN findet)
(PP (APPR um) (CARD 8) (NN Uhr))
(PP (APPR in) (ART der) (NN Halle))
(PTKVZ statt) ($. .))
How it should be:
When written as "Fussball", there is no problem (apart from the incorrect orthography):
[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/germanFactored.ser.gz ...
done [3.5 sec].
Parsing file: german_test.txt
Parsing [sent. 1 len. 10]: Fussball findet um 8 Uhr in der Halle statt .
Parsed file: german_test.txt [1 sentences].
Parsed 10 words in 1 sentences (40.98 wds/sec; 4.10 sents/sec).
The correct tree:
(S (NN Fussball) (VVFIN findet)
(PP (APPR um) (CARD 8) (NN Uhr))
(PP (APPR in) (ART der) (NN Halle))
(PTKVZ statt) ($. .))
The demo script is not running the tokenizer with the correct character set. So if your text is pre-tokenized, you can add the option "-tokenized" and it will just use space as the token delimiter.
Also you want to tell the parser to use "-encoding ISO-8859-1" for German.
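For what it's worth, the U+9F warning above is exactly what you get when the two UTF-8 bytes of "ß" (0xC3 0x9F) are read one byte at a time; here is a small Python sketch of that effect (assuming the input file was saved as UTF-8):

s = "Fußball findet um 8 Uhr in der Halle statt."

# The UTF-8 bytes of 'ß' are 0xC3 0x9F; decoding them as a single-byte charset
# yields 'Ã' plus the control character U+009F, which the tokenizer reports as
# "Untokenizable: (U+9F, decimal: 159)".
misread = s.encode("utf-8").decode("iso-8859-1")
print(repr(misread))  # 'FuÃ\x9fball findet um 8 Uhr in der Halle statt.'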
Here is the full java command (alter the one found in the .sh script):
java -Xmx2g -cp "./*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -maxLength 500 -tLPP edu.stanford.nlp.parser.lexparser.NegraPennTreebankParserParams -hMarkov 1 -vMarkov 2 -vSelSplitCutOff 300 -uwm 1 -unknownSuffixSize 2 -nodeCleanup 2 -writeOutputFiles -outputFilesExtension output.500.stp -outputFormat "penn" -outputFormatOptions "removeTopBracket,includePunctuationDependencies" -encoding ISO_8859-1 -tokenized -loadFromSerializedFile edu/stanford/nlp/models/lexparser/germanFactored.ser.gz german_example.txt
I get this output:
(NUR
(S (NN Fußball) (VVFIN findet)
(PP (APPR um) (CARD 8) (NN Uhr))
(PP (APPR in) (ART der) (NN Halle) (ADJA statt.))))
UPDATED AGAIN:
Make sure to separate "statt." into "statt ." since we are now saying the tokens are whitespace-separated. If we apply this fix, we get this parse:
(S (NN Fußball) (VVFIN findet)
(PP (APPR um) (CARD 8) (NN Uhr))
(PP (APPR in) (ART der) (NN Halle))
(PTKVZ statt) ($. .))
So, just to summarize: the issue is that we need to tell both the PTBTokenizer and the LexicalizedParser to use ISO_8859-1.
I would recommend just using the full pipeline to accomplish this.
Download Stanford CoreNLP 3.6.0 from here:
http://stanfordnlp.github.io/CoreNLP/
Download the German model jar from here:
http://stanfordnlp.github.io/CoreNLP/download.html
Run this command:
java -Xmx3g -cp "stanford-corenlp-full-2015-12-09/*:stanford-corenlp-3.6.0-models-german.jar" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,parse -props StanfordCoreNLP-german.properties -file german_example_file.txt -outputFormat text
This will tokenize and parse the text and use the correct character encoding.