I'm querying the MediaWiki API to get Wikipedia data into my FileMaker database. When I load the data into a browser, the characters show up properly, but when it comes into FileMaker, characters with diacriticals get converted to these odd characters: á is converted to √° (square root symbol + degree symbol), é is converted to √© (square root symbol + copyright symbol), í is converted to √≠ (square root symbol + not equals symbol) and more. What character encoding is that? Thank you!!
As @Joni suggests in his comment, this is UTF-8 misinterpreted as MacRoman. The letter á is C3 A1 (hex) in UTF-8; in MacRoman, C3 is “√” and A1 is “°”. So you should set the program to interpret the data as UTF-8.
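The byte-level explanation can be checked with a few lines of Python. This is only a sketch: it assumes Python's built-in `mac_roman` codec matches the decoding that was applied on the FileMaker side, and the helper names `utf8_hex` and `misread_as_macroman` are made up for illustration.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


def misread_as_macroman(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode those bytes as MacRoman."""
    # Assumption: the receiving program used MacRoman, as the answer suggests.
    return text.encode("utf-8").decode("mac_roman")


if __name__ == "__main__":
    for ch in "áéí":
        # Prints the character, its UTF-8 bytes, and the garbled result.
        print(ch, utf8_hex(ch), misread_as_macroman(ch))
```

Each accented letter becomes two bytes in UTF-8, and each of those bytes is shown as its own MacRoman character, which is why every garbled pair starts with √ (byte C3).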
I'm sure this isn't the full list, but it did what I needed. Here is a lookup for the codes (garbled sequence, intended character, plain ASCII fallback):
√© é e
√° á a
√≠ í i
√≥ ó o
√∂ ö o
√º ü u
√¥ ô o
√® è e
√ß ç c
√± ñ n
√∏ ø o
√´ ë e
√§ ä a
√• å a
√Å Á A
√∫ ú u
√ª û u
√Ø ï i
√â É E
√† à a
√¶ æ ae
√Æ î i
√¢ â a
√£ ã a
√î Ô O
√ü ß ss
√ì Ó O
√≤ ò o
√Ω ý y
√ñ Ö O
√™ ê e
√Ä À A
√ò Ø O
√Ö Å A
√∞ ð eth
√á Ç C
√Ç Â A
√π ù u
√í Ò O
√¨ ì i
√ú Ü U
√à È E
√û Þ Th
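If the data really is UTF-8 that was decoded as MacRoman, the lookup can in principle be replaced by reversing the wrong decoding step. A minimal Python sketch, assuming that cause and using a hypothetical helper name `repair_mojibake`:

```python
def repair_mojibake(garbled: str) -> str:
    """Undo UTF-8 text that was decoded as MacRoman (assumed cause).

    If the text does not round-trip (for example, it was never garbled),
    it is returned unchanged rather than raising.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return garbled.encode("mac_roman").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled


if __name__ == "__main__":
    print(repair_mojibake("Jos√© Mar√≠a"))
```

Unlike a fixed table, this handles any character that survived the round trip, but fixing the encoding at the point of import is still the cleaner solution.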
You're all correct about the misinterpreted characters. The Troi URL FMP plugin I was using to set FMP's user agent (as the MediaWiki API requires) was responsible for pulling in the garbled characters. The solution was to bypass the plugin: an FMP script performs the AppleScript "do shell script curl -A" to set the user agent, query the API, and pull the response back into FMP, and all characters come through properly!
I am trying to decode concatenated strings like the ones below (input, followed by the desired segmentation):
SQCB7A750BATWE SQ CB 7 A 750 B A T WE
PT05A1219PY023 PT 05 A 12 19 P Y 023
PT55A1019PX02 PT 55 A 10 19 P X 02
PT33SE2215SW023 PT 33 SE 22 15 S W 023
PT05A2216PW023(LC) PT 05 A 22 16 P W 023 (LC)
I am looking for a smarter way than hard-coded rules, as the input will have variations (number of characters and digits). I came across the seq2seq model and want to know whether it can be used for this kind of problem.
I already followed some tutorials to get a taste of it, but the results weren't even close.
It also seems there are two approaches, character level and word level, as per this tutorial.
Character level:
Input sentence: SQCACA333BA71A
Decoded sentence: P 9(PDD366AZ2IDD4K )F)F(L)L)1)1)1) 6A
-
Input sentence: SQCAAC152DA71A
Decoded sentence: P 9(PDD366AZ2IDD4K )F)F(L)L)1)1)1) 6A
I am still trying to implement the word level, but I'd like to know if the problem can be solved using this approach (seq2seq).
I recently downloaded gigabytes of data (text) in multiple files that I want to automatically process. However, the charset or actual encoding of text is wrong. The problem is that text editors such as Notepad++, SublimeText 3 or Word detect it simply as ANSI. I've tried all charsets there were available, but there are still parts that are amiss across files.
Default ANSI encoding (wrong special characters):
OBJEVUJE SE ZELENÁ KNÍ®KA
Frantík Severýn sedí na prázdných bednách od cukru, pohupuje bosýma
nohama a naslouchá kázání páně Bočanovu. Kázání nepatří jemu, nýbrľ
paní Bílkové, která stojí před pultem. Frantík se tváří, jako by se
nezajímal o nic jiného neľ o své zablácené klátící se nohy. Zatím vąak
napíná uąi, aby mu neuąlo ani slovíčko.
»Tak to dál nepůjde, milá paní,« křičí hokynář a jeho tlustý zátylek
je rudý zlostí. »Jedno zboľí nezaplatíte a uľ zas chcete nové na dluh.
Copak si myslíte, ľe kradu?«
ISO 8859-2 encoding (wrong quotation marks):
OBJEVUJE SE ZELENÁ KNÍŽKA
Frantík Severýn sedí na prázdných bednách od cukru, pohupuje bosýma
nohama a naslouchá kázání páně Bočanovu. Kázání nepatří jemu, nýbrž
paní Bílkové, která stojí před pultem. Frantík se tváří, jako by se
nezajímal o nic jiného než o své zablácené klátící se nohy. Zatím však
napíná uši, aby mu neušlo ani slovíčko.
ťTak to dál nepůjde, milá paní,Ť křičí hokynář a jeho tlustý zátylek
je rudý zlostí. ťJedno zboží nezaplatíte a už zas chcete nové na dluh.
Copak si myslíte, že kradu?Ť
DESIRED OUTPUT:
OBJEVUJE SE ZELENÁ KNÍŽKA
Frantík Severýn sedí na prázdných bednách od cukru, pohupuje bosýma
nohama a naslouchá kázání páně Bočanovu. Kázání nepatří jemu, nýbrž
paní Bílkové, která stojí před pultem. Frantík se tváří, jako by se
nezajímal o nic jiného než o své zablácené klátící se nohy. Zatím však
napíná uši, aby mu neušlo ani slovíčko.
»Tak to dál nepůjde, milá paní,« křičí hokynář a jeho tlustý zátylek
je rudý zlostí. »Jedno zboží nezaplatíte a už zas chcete nové na dluh.
Copak si myslíte, že kradu?«
What character encoding is this?
After reading this I suspect that it might be an older/legacy one, but I am not sure how to fix it as I don't know any software that supports it. Another option is that it might be just corrupt, because all quotation marks seem to be encoded as ť/Ť. How can I verify this?
EDIT: hex information:
KNÍ®KA = 4B 4E CD AE 4B 41
»Tak to dál nepůjde = BB 54 61 6B 20 74 6F 20 64 E1 6C 20 6E 65 70 F9 6A 64 65
co má chu» vstát = 63 6F 20 6D E1 20 63 68 75 BB 20 76 73 74 E1 74
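The hex dumps above can be checked by decoding the same bytes under each candidate encoding. A small Python sketch; the choice of candidates (ISO 8859-2, cp1250, cp1252) is an assumption based on the encodings discussed in this thread, and `decode_hex` is a hypothetical helper name.

```python
CANDIDATES = ["iso8859_2", "cp1250", "cp1252"]

HEX_SAMPLES = [
    "4B 4E CD AE 4B 41",
    "BB 54 61 6B 20 74 6F 20 64 E1 6C 20 6E 65 70 F9 6A 64 65",
    "63 6F 20 6D E1 20 63 68 75 BB 20 76 73 74 E1 74",
]


def decode_hex(hex_string: str, encoding: str) -> str:
    """Decode space-separated hex bytes with the given encoding."""
    # Undefined bytes are replaced rather than raising an error.
    return bytes.fromhex(hex_string).decode(encoding, errors="replace")


if __name__ == "__main__":
    for sample in HEX_SAMPLES:
        for encoding in CANDIDATES:
            print(f"{encoding:10} {decode_hex(sample, encoding)}")
        print()
```

Byte AE only gives Ž under ISO 8859-2, while byte BB gives ť under ISO 8859-2 and » under the two Windows code pages, so the same byte value is being used both for ť (in "chuť") and for the opening quotation mark.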
Use UTF-8, not ascii, not iso-..., not latin....
latin1 comes close, but misses the ř.
You say it was "downloaded". Can you show us the hex for the characters in question?
»Žřč converts to hex:
C2BB C5BD C599 C48D in UTF-8 (covers every character)
BB 8E 3F 3F in latin1 (cp1252)
BB 8E F8 E8 in cp1250
3F AE F8 E8 in latin2
Note: 3F is "?", meaning the character could not be converted.
Hex BB is ť in latin2, which is why the » quotation marks show up as ť when the file is read as latin2.
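A comparison like the one above can be reproduced in Python. This is a sketch under assumptions: "latin1" is taken to mean Windows-1252 (as MySQL does), "latin2" is taken to mean ISO 8859-2, and `hex_in` is a hypothetical helper name.

```python
ENCODINGS = ["utf-8", "cp1252", "cp1250", "iso8859_2"]


def hex_in(text: str, encoding: str) -> str:
    """Encode text and return the bytes as space-separated uppercase hex.

    Characters the encoding cannot represent become '?' (hex 3F).
    """
    encoded = text.encode(encoding, errors="replace")
    return " ".join(f"{b:02X}" for b in encoded)


if __name__ == "__main__":
    for encoding in ENCODINGS:
        print(f"{encoding:10} {hex_in('»Žřč', encoding)}")
```

Any 3F in the output marks a character that the encoding has no slot for, which is a quick way to rule an encoding out for a given text.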
Characters from inside a view are misencoded by the Razor renderer. For instance:
MyPartial.cshtml: á á á
MyView.cshtml: á á á @Html.Partial("_MyPartial")
The final output is rendered as á á á � � �
Is this an expected behaviour?
This PNG file cannot be uploaded from my app to a third-party server. It always reports this error:
does multipart has image?
I'm sure the multipart encoding is correct. Tens of thousands of images have been uploaded from my app without this issue. This is the first time it has happened.
I guess there is something special about this PNG file and I proved it:
Dropbox iOS app can not display the image.
Tweetbot can not upload it. The error message is "media type unrecognized".
So this PNG file is indeed special, and quite a few apps and servers don't handle it properly. But I don't know what's so special about it and hope someone who knows PNG better than I do can help. Thanks.
It is a CgBI file, not a PNG, most likely made with Apple's rogue modified pngcrush.
Such files always contain "CgBI" in bytes 12-15, where "IHDR" belongs.
CgBI files can be converted to valid PNG files (except that the transparent areas are irreparably damaged) by several applications, including
Jongware's pngdefry
Apple's "pngcrush" (but not the real pngcrush)
others listed on the above-referenced CgBI wiki page
Here are the first few bytes in your file:
$ od -c test.png | head -4
0000000 211 P N G \r \n 032 \n \0 \0 \0 004 C g B I
0000020 P \0 002 + 325 263 177 \0 \0 \0 \r I H D R
0000040 \0 \0 \0 ` \0 \0 \0 ` \b 006 \0 \0 \0 342 230 w
0000060 8 \0 \0 \0 c H R M \0 \0 z % \0 \0 200
Those bytes represent the following:
PNG signature      0-7
CgBI length        8-11
"CgBI"             12-15
CgBI data          16-19
CgBI CRC           20-23
IHDR length        24-27  (should be in 8-11)
"IHDR"             28-31  (should be in 12-15)
width              32-35  (should be in 16-19)
height             36-39  (should be in 20-23)
bit depth          40     (should be in 24)
color type         41     (should be in 25)
compression        42     (should be in 26)
filter method      43     (should be in 27)
interlace method   44     (should be in 28)
IHDR CRC           45-48  (should be in 29-32)
...
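The check on bytes 12-15 described above is easy to automate before uploading. A minimal Python sketch; the function names `first_chunk_type` and `is_cgbi` are hypothetical, and the sample bytes are taken from the start of the dump shown above.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def first_chunk_type(data: bytes) -> bytes:
    """Return the 4-byte type of the first chunk after the PNG signature."""
    if len(data) < 16 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG-style file (bad signature or too short)")
    # Bytes 8-11 hold the chunk length, bytes 12-15 the chunk type.
    return data[12:16]


def is_cgbi(data: bytes) -> bool:
    """True if the first chunk is CgBI instead of the required IHDR."""
    return first_chunk_type(data) == b"CgBI"


if __name__ == "__main__":
    # First 20 bytes of the file from the dump: signature, length 4, "CgBI", data.
    header = PNG_SIGNATURE + b"\x00\x00\x00\x04CgBI" + b"P\x00\x02+"
    print(is_cgbi(header))
```

In practice the first 16 bytes of the file are enough, so there is no need to read the whole image to run this test.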
I have a number of text files I'm looking to send to different destinations depending on whether or not the file contains Cyrillic characters using a batch script. For example:
All files are located in C:\mydocs. The script will be monitoring this folder.
File one: contains all English characters > copy to C:\mydocs\English\
File two: contains some Cyrillic characters > copy to C:\mydocs\Contains_Cyrillic\
Is this possible?
It depends on how your text file is encoded. If the file is Unicode, then I'm not sure how to test.
But if the file is extended ASCII (1 byte per character), then the meaning of bytes > decimal 127 depends on the code page. You can't really tell whether the file contains Cyrillic, but you can tell whether it contains a byte > 127, which is likely to be a non-English character.
The following script should work on Windows XP and later - no need to download anything.
It first creates a file that is >= the length of your file, consisting only of the character "A". Then it uses FC to do a binary comparison and pipes the result to FINDSTR which looks for a value >= 0x80. If one is found, then it returns ERRORLEVEL 1, else it returns ERRORLEVEL 0.
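@echo off
rem Usage: pass the file to test as the first argument.
call :HasExtendedASCII %1 && (echo English) || echo Not English
exit /b
:HasExtendedASCII
rem Returns ERRORLEVEL 1 if the file contains a byte >= 0x80, else ERRORLEVEL 0.
setlocal enableDelayedExpansion
set "tempFile=%temp%\dummyFile%random%.txt"
rem Start with a single "A" and keep doubling the dummy file until it is at least as long as the input.
<nul set /p "=A" >"!tempFile!"
set /a dummySize=1
for /l %%N in (1 1 32) do if !dummySize! lss %~z1 (set /a dummySize*=2 & type "!tempFile!" >>"!tempFile!")
rem FC /B lists each differing byte in hex; FINDSTR matches lines that end in a value from 80 to FF.
fc /b "!tempFile!" %1|findstr /re " [89ABCDEF][0123456789ABCDEF]" >nul&& set rtn=1 || set rtn=0
del "!tempFile!"
exit /b %rtn%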
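For comparison, the same "any byte >= 0x80" test can be sketched in Python, where no dummy file is needed. This is not the batch author's method, just the same idea expressed differently; `has_extended_byte` and `file_has_extended_byte` are hypothetical names.

```python
def has_extended_byte(data: bytes) -> bool:
    """True if any byte is >= 0x80, i.e. outside 7-bit ASCII."""
    return any(b >= 0x80 for b in data)


def file_has_extended_byte(path: str) -> bool:
    """Read the file in binary mode and apply the byte test."""
    with open(path, "rb") as handle:
        return has_extended_byte(handle.read())


if __name__ == "__main__":
    print(has_extended_byte(b"plain English text"))
    print(has_extended_byte("привет".encode("cp1251")))
```

As with the batch version, this only says that a non-ASCII byte is present, not that the byte is Cyrillic.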
This is not so easy, as cmd works only with the extended ASCII table.
Here is a file that contains the Cyrillic alphabet, printed with the type command:
тхЁЄ·єшюярёфЇуїщъыч№Ўцсэьў∙°■╫▐┘╪▀┬┼╨╥┌╙╚╬╧└╤─╘├╒╔╩╦╟▄╓╞┴═╠ (Bulgarian Cyrillic; it may differ for Russian, Mongolian, etc.)
Unfortunately, the FINDSTR command does not work well with these.
BUT if the only special characters these files contain are Cyrillic, there may be a chance :-). You can check for the Cyrillic characters by their hex codes. There's a certutil command that you can use to encode a file to hex, or dump it as hex. It is not native to Windows XP, but it can be downloaded from microsoft.com. Here are the hex codes:
ff e2 e5 f0 f2 fa f3 e8 ee ef e0 f1 e4 f4 e3 f5
e9 ea eb e7 fc f6 e6 e1 ed ec f7 f9 f8 fe d7 de
d9 d8 df c2 c5 d0 d2 da d3 c8 ce cf c0 d1 c4 d4
c3 d5 c9 ca cb c7 dc d6 c6 c1 cd cc
and here's the code:
@echo off
rem findstr treats the space-separated hex codes as alternatives, so any one match succeeds.
certutil -dump my.cirillyc.file | findstr /r "ff e2 e5 f0 f2 fa f3 e8 ee ef e0 f1 e4 f4 e3 f5 e9 ea eb e7 fc f6 e6 e1 ed ec f7 f9 f8 fe d7 de d9 d8 df c2 c5 d0 d2 da d3 c8 ce cf c0 d1 c4 d4 c3 d5 c9 ca cb c7 dc d6 c6 c1 cd cc"
if %errorlevel% EQU 0 (
copy my.cirillyc.file C:\mydocs\Contains_Cyrillic\
)
This may not work properly if your file also contains symbols such as ╓╞┴═╠..., but it should be fine in most cases. To traverse all files in a directory, you can wrap this in a for /f loop.
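The hex-code check can also be sketched in Python by decoding the bytes and looking for characters in the Unicode Cyrillic block. This assumes the files are single-byte text in Windows-1251 (the hex codes listed above match that code page); the encoding default and the name `contains_cyrillic` are assumptions for illustration.

```python
def contains_cyrillic(data: bytes, encoding: str = "cp1251") -> bool:
    """True if the decoded text has a character in U+0400..U+04FF."""
    # Assumption: the file uses the given single-byte code page.
    text = data.decode(encoding, errors="replace")
    return any("\u0400" <= ch <= "\u04ff" for ch in text)


if __name__ == "__main__":
    # The first three hex codes from the list above.
    print(contains_cyrillic(bytes.fromhex("ff e2 e5")))
    print(contains_cyrillic(b"only English here"))
```

Because the test works on decoded characters rather than on a hex dump, it avoids false matches against offsets or other text in the dump output, but it is only as reliable as the guess about the file's code page.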