Check if bytes result in valid ISO 8859-15 (Latin) in Python - character-encoding

I want to test if a string of bytes that I'm extracting from a file results in valid ISO-8859-15 encoded text.
The first thing I came across is this similar case about UTF-8 validation:
https://stackoverflow.com/a/5259160/1209004
So based on that, I thought I was being clever by doing something similar for ISO-8859-15. See the following demo code:
#! /usr/bin/env python
#
def isValidISO885915(bytes):
    # Test if bytes result in valid ISO-8859-15
    try:
        bytes.decode('iso-8859-15', 'strict')
        return(True)
    except UnicodeDecodeError:
        return(False)
def main():
    # Test bytes (byte x95 is not defined in ISO-8859-15!)
    bytes = b'\x4A\x70\x79\x6C\x79\x7A\x65\x72\x20\x64\x95\x6D\x6F\xFF'
    isValidLatin = isValidISO885915(bytes)
    print(isValidLatin)
main()
However, running this returns True, even though x95 is not a valid code point in ISO-8859-15! Am I overlooking something really obvious here? (BTW I tried this with Python 2.7.4 and 3.3, results are identical in both cases).

I think I've found a workable solution myself, so I might as well share it.
Looking at the code page layout of ISO 8859-15, I really only need to check for the presence of code points 00-1F and 7F-9F. These correspond to the C0 and C1 control codes. Python's iso-8859-15 codec maps these bytes to the corresponding control characters instead of rejecting them, which is why the strict decode never fails.
In my project I was already using code for removing control characters (C0 + C1) from a string, based on https://stackoverflow.com/a/19016117/1209004. Using that as a basis, I came up with this:
#! /usr/bin/env python
#
import unicodedata
def removeControlCharacters(string):
    # Remove control characters from string
    # Based on: https://stackoverflow.com/a/19016117/1209004
    # Tab, newline and return are part of C0, but are allowed in XML
    allowedChars = [u'\t', u'\n', u'\r']
    return "".join(ch for ch in string if
                   unicodedata.category(ch)[0] != "C" or ch in allowedChars)
def isValidISO885915(bytes):
    # Test if bytes result in valid ISO-8859-15
    # Decode bytes to string
    try:
        string = bytes.decode("iso-8859-15", "strict")
    except UnicodeDecodeError:
        # A decode error means the input is not valid
        return False
    # Remove control characters, and compare result against
    # input string
    return removeControlCharacters(string) == string
def main():
    # Test bytes (byte x95 is not defined in ISO-8859-15!)
    bytes = b'\x4A\x70\x79\x6C\x79\x7A\x65\x72\x20\x64\x95\x6D\x6F\xFF'
    print(isValidISO885915(bytes))
main()
There may be more elegant / Pythonic ways to do this, but it seems to do the trick, and works with both Python 2.7 and 3.3.
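The same check can also be sketched directly on the raw bytes, without decoding. This is a minimal sketch, assuming that "valid" means "contains no C0 or C1 control bytes other than tab, line feed and carriage return"; the names `FORBIDDEN` and `is_printable_latin9` are hypothetical and not part of the original code.

```python
# Python's codec maps every byte value to a character, so strict decoding
# as ISO-8859-15 does not raise, even for bytes 0x80-0x9F.
all_bytes = bytes(bytearray(range(256)))
decoded = all_bytes.decode('iso-8859-15', 'strict')
print(len(decoded))   # 256

# Bytes treated as invalid here (an assumption): C0 controls (0x00-0x1F),
# DEL (0x7F) and C1 controls (0x80-0x9F), except tab, LF and CR.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}
FORBIDDEN = frozenset(
    b for b in list(range(0x00, 0x20)) + list(range(0x7F, 0xA0))
    if b not in ALLOWED_CONTROLS
)

def is_printable_latin9(data):
    """Return True if data contains none of the forbidden control bytes."""
    # bytearray() makes iteration yield integers.
    return not any(b in FORBIDDEN for b in bytearray(data))

sample = b'\x4A\x70\x79\x6C\x79\x7A\x65\x72\x20\x64\x95\x6D\x6F\xFF'
print(is_printable_latin9(sample))       # False: contains byte 0x95
print(is_printable_latin9(b'Jpylyzer'))  # True
```

Working on byte values avoids the Unicode category lookup, at the cost of hard-coding the control ranges for this one encoding.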

Related

Lua length of Frame for Parsing

I have a binary file that shows gibberish if I open it in Notepad.
I am working on a plugin to use with Wireshark.
I am reading in a file and need to find 'V' '0' '0' '1' (0x56 0x30 0x30 0x31) in it, because that sequence is the start of a header, which means a packet follows. I need to do this for the whole file, like parsing. Each frame should start with V 0 0 1 and not end with it.
I currently have code that searches for 0x7E and parses on that. What I need is the length of each frame: when V 0 0 1 is found, the length from that V to the position just before the next V 0 0 1 in the file. I can then add that length to a captured length to get positions that Wireshark can work with.
For example, here is my imperfect code that works with 0x7E:
local line = file:read()
local len = 0
for c in (line or ''):gmatch ('.') do
    len = len + 1
    if c:byte() == 0x7E then
        break
    end
end
if not line then
    return false
end
frame.captured_length = len
Another problem here is that the frame ends with 7E, which is wrong. I need something that works reliably for 'V' '0' '0' '1'. Maybe I need to use string.find?
Please help me!
That is an example of what my file looks like in the hex editor in Visual Studio Code.
Lua has some neat pattern tools. Here's a summary:
(...) captures the text matched inside the parentheses and returns it to us.
-, +, *, ? are repetition operators: - matches zero or more times, as few as possible; + matches one or more times, as many as possible; * matches zero or more times, as many as possible; ? matches zero times or once.
^ and $ anchor the match to the start or end of the string, respectively.
We'll be using this universal input and output to test with:
local output = {}
local input = "V001Packet1V001Packet2oooV001aaandweredonehere"
The easiest way to do this is probably to recursively split the string, with one ending at the character before "V", and the other starting at the character after "1". We'll use a pattern which exports the part before and after V001:
local this, next = string.match(input, "(.-)V001(.*)")
print(this,next) --> "", "Packet1V001Packet2..."
Simple enough. Now we need to do it again, and we also need to eliminate the first empty packet, because it's a quirk of the pattern. We can probably just say that any empty this string should not be added:
if this ~= "" then
    table.insert(output, this)
end
Now, the last packet will return nil for both this and next, because there will not be another V001 at the end. We can prepare for that by simply adding the last part of the string when the pattern does not match.
All put together:
local function doStep(str)
    local this, next = string.match(str, "(.-)V001(.*)")
    if this then
        -- Another V001 was found, so there are still packets left
        if this ~= "" then
            -- Only keep non-empty packets
            table.insert(output, this)
        end
        if next ~= "" then
            -- There is more out there!
            doStep(next)
        end
    else
        -- No V001 left: the rest of the string is the last packet.
        table.insert(output, str)
    end
end
Of course, this can be improved, but it should be a good starting point. To prove it works, this script:
doStep(input)
print(table.concat(output, "; "))
prints this:
Packet1; Packet2ooo; aaandweredonehere
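Since the question asks for frame lengths rather than frame contents, the same idea can be sketched by recording the offset of each marker. The sketch below is in Python rather than Lua, purely for illustration; `MARKER` and `frame_spans` are hypothetical names, and each frame is assumed to run from one V001 up to the byte before the next V001 (or to the end of the data).

```python
MARKER = b"V001"

def frame_spans(data, marker=MARKER):
    """Return (offset, length) for each frame that starts with marker."""
    starts = []
    pos = data.find(marker)
    while pos != -1:
        starts.append(pos)
        # Continue searching after the marker that was just found.
        pos = data.find(marker, pos + len(marker))
    # Each frame ends where the next one starts, or at the end of the data.
    ends = starts[1:] + [len(data)]
    return [(start, end - start) for start, end in zip(starts, ends)]

data = b"V001Packet1V001Packet2oooV001aaandweredonehere"
for offset, length in frame_spans(data):
    print(offset, length, data[offset:offset + length])
```

In Lua, string.find(data, "V001", pos, true) performs the equivalent plain-text search from a given position, so the loop translates fairly directly.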

What is wrong with this CMAC computation?

I have an example of a CMAC computation, which I want to reproduce in Python, however I am failing. The example looks like this:
key = 3ED0920E5E6A0320D823D5987FEAFBB1
msg = CEE9A53E3E463EF1F459635736738962&cmac=
The expected (truncated) CMAC looks like this (note: truncated means that every second byte is dropped)
ECC1E7F6C6C73BF6
So I tried to reenact this example with the following code:
from Crypto.Hash import CMAC
from Crypto.Cipher import AES
from binascii import hexlify, unhexlify
def generate_cmac(key, msg):
    """generate a truncated cmac message.
    Inputs:
    key: 1-dimensional bytearray of arbitrary length
    msg: 1-dimensional bytearray of arbitrary length
    Outputs:
    CMAC: The cmac number
    CMAC_t: Truncated CMAC"""
    # Generate CMAC via the CMAC algorithm
    cobj = CMAC.new(key=key, ciphermod=AES)
    cobj.update(msg)
    mac_raw = cobj.digest()
    # Truncate by initializing an empty array and assigning every second byte
    mac_truncated = bytearray(8 * b'\x00')
    it2 = 0
    for it in range(len(mac_raw)):
        if it % 2:
            mac_truncated[it2:it2+1] = mac_raw[it:it+1]
            it2 += 1
    return mac_raw, mac_truncated
key = unhexlify('3ED0920E5E6A0320D823D5987FEAFBB1') # The key as in the example
msg = 'CEE9A53E3E463EF1F459635736738962&cmac=' # The msg as in the example
msg_utf = msg.encode('utf-8')
msg_input = hexlify(msg_utf) # Trying to get the bytearray
mac, mact_calc = generate_cmac(key, msg_input) # Calculate the CMAC and truncated CMAC
# However the calculated CMAC does not match the cmac of the example
My function generate_cmac() works perfectly for other cases, why not for this example?
(If anybody is curious, the example stems from this document Page 18/Table 6)
Edit: An example for a successful cmac computation is the following:
mact_expected = unhexlify('94EED9EE65337086') # as stated in the application note
key = unhexlify('3FB5F6E3A807A03D5E3570ACE393776F') # called K_SesSDMFileReadMAC
msg = [] # zero length input
mac, mact_calc = generate_cmac(key, msg) # mact_expected and mact_calc are the same
assert mact_expected == mact_calc, "Example 1 failed" # This assertion passes
TL;DR: over-hexlification.
Much to my stupefaction, the linked example really does mean the ASCII text CEE9A53E3E463EF1F459635736738962&cmac= when it writes that: the box below it contains 76 hex characters, which encode those 38 bytes of ASCII, namely 434545394135334533453436334546314634353936333537333637333839363226636d61633d.
However, I'm positive that the message does not need to be hexlified again into 76 bytes, as the code does. In other words, my bet is on:
key = unhexlify('3ED0920E5E6A0320D823D5987FEAFBB1')
msg = 'CEE9A53E3E463EF1F459635736738962&cmac='.encode()
mac, mact_calc = generate_cmac(key, msg)
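The length mismatch can be checked without any cryptography. The following is a minimal sketch using only the standard library; `truncate_mac` is a hypothetical helper that mirrors the "keep every second byte" loop from the question with a slice, and the 16-byte value it is applied to is a dummy, not a real CMAC.

```python
from binascii import hexlify

def truncate_mac(mac_raw):
    """Keep the bytes at odd indices (1, 3, 5, ...), as the question's loop does."""
    return bytes(mac_raw[1::2])

msg = 'CEE9A53E3E463EF1F459635736738962&cmac='.encode('ascii')
print(len(msg))            # 38: the message as ASCII bytes
print(len(hexlify(msg)))   # 76: hexlify doubles the length

# Truncation demonstrated on a dummy 16-byte value, not a real CMAC.
dummy = bytes(bytearray(range(16)))
print(hexlify(truncate_mac(dummy)))
```

Feeding the 76-byte hexlified form into the CMAC authenticates a different message than the 38-byte ASCII form, so the two results cannot match.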

Wireshark Lua dissector utf16 string

I am writing a custom Wireshark Lua dissector. One field in the dissector is a UTF16 string. I tried to specify this field with
msg_f = ProtoField.string("mydissector.msg", "msg", base.UNICODE)
local getMsg = buffer(13) -- starting on byte 13
subtree:add_le(m.msg_f, getMsg)
However, this only adds the first character rather than the whole string. It also raises an Expert Info warning undecoded trailing/stray characters.
What is the correct way to parse a UTF16 string?
You haven't specified the range of bytes that comprises the string. This is typically determined by either an explicit length field or by a NULL-terminator. The exact method of determining the range is dependent upon the particular protocol and field in question.
An example of each type:
If there's a length field, say of 1 byte in length that precedes the string, then you can use something like:
local str_len = buffer(13, 1):le_uint()
subtree:add_le(m.msg_len_f, buffer(13, 1)) -- the length field itself is 1 byte
if str_len > 0 then
    subtree:add_le(m.msg_f, buffer(14, str_len))
end
And if the string is NULL-terminated, you can use something like:
local str = buffer(13):stringz()
local str_len = str:len()
subtree:add_le(m.msg_f, buffer(13, str_len + 1))
These are just pseudo-examples, so you'll need to apply whatever method, possibly none of these, to fit your data.
Refer to Wireshark's Lua API Reference Manual for more details, or to the Wireshark Lua API wiki pages.
The solution I came up with is simply:
msg_f = ProtoField.string("mydissector.msg", "msg")
local getMsg = buffer(13) -- starting on byte 13
local msg = getMsg:le_ustring()
subtree:add(msg_f, getMsg, msg)
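To illustrate what "determining the range of bytes" means for a NUL-terminated UTF-16 string, here is a minimal sketch in Python rather than Lua. It does not use the Wireshark API; `read_utf16le_z` and the sample payload are hypothetical, and the string is assumed to be little-endian and to start at byte 13, as in the question.

```python
def read_utf16le_z(buf, offset):
    """Decode a NUL-terminated UTF-16LE string starting at offset.

    Returns (text, byte_length), where byte_length includes the two-byte
    terminator if one was found.
    """
    end = offset
    # Step two bytes at a time so the terminator is only matched on a
    # code-unit boundary.
    while end + 1 < len(buf) and buf[end:end + 2] != b"\x00\x00":
        end += 2
    text = buf[offset:end].decode("utf-16-le")
    terminated = buf[end:end + 2] == b"\x00\x00"
    return text, (end - offset) + (2 if terminated else 0)

# 13 header bytes, then "hello" in UTF-16LE, a terminator and a trailing byte.
payload = b"\x00" * 13 + "hello".encode("utf-16-le") + b"\x00\x00" + b"\xff"
print(read_utf16le_z(payload, 13))
```

The returned byte length is the kind of value a dissector needs in order to highlight exactly the bytes that belong to the field.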

Figure out bytes content

I was working on a compound file which contains several streams, and I can't figure out the content of each stream. I don't know whether the bytes are text, MP3, or video.
For example, is there a way to tell what type of data these bytes are?
b'\x00\x00\x00\x00\x00\x00\x00\x00\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x0bz\xcc\xc9\xc8\xc0\xc0\x00\xc2?\x82\x1e<\x0ec\xbc*8\x19\xc8i\xb3W_\x0b\x14bH\x00\xb2-\x99\x18\x18\xfe\x03\x01\x88\xcf\xc0\x01\xc4\xe1\x0c\xf9\x0cE\x0c\xd9\x0c\xc5\x0c\xa9\x0c%\x0c\x86`\xcd \x0c\x020\x1a\x00\x00\x00\xff\xff\x02\x080\x00\x96L~\x89W\x00\x00\x00\x00\x80(\\B\xefI;\x9e}p\xfe\x1a\xb2\x9b>(\x81\x86/=\xc9xH0:Pwb\xb7\xdck-\xd2F\x04\xd7co'
Yes, there is a way to figure out the content of each stream. Most file formats have a signature, in addition to the file extension, which is not reliable because it may be removed or falsely added.
So what is the signature?
In computing, a file signature is data used to identify or verify the
contents of a file. In particular, it may refer to:
File magic number: bytes within a file used to identify the
format of the file; generally a short sequence of bytes (most are
2-4 bytes long) placed at the beginning of the file; see list of file
signatures
File checksum or more generally the result of a hash function over the file contents: data used to verify the integrity of the file
contents, generally against transmission errors or malicious attacks.
The signature can be included at the end of the file or in a separate
file.
The definition above uses the term magic number. To define it, I'm copying this from Wikipedia:
In computer programming, the term magic number has multiple
meanings. It could refer to one or more of the following:
Unique values with unexplained meaning or multiple occurrences which could (preferably) be replaced with named constants
A constant numerical or text value used to identify a file format or protocol; for files, see List of file
signatures
Distinctive unique values that are unlikely to be mistaken for other meanings(e.g., Globally Unique Identifiers)
The second meaning is the relevant one here: a fixed sequence of bytes, such as
PNG (89 50 4E 47 0D 0A 1A 0A)
or
BMP (42 4D)
So how do you find the magic number of a file?
In the article "Investigating File Signatures Using PowerShell", the author wrote a PowerShell function to get the magic number. He also mentions a tool; quoting from his article:
PowerShell V5 brings in Format-Hex, which can provide an alternative
approach to reading the file and displaying the hex and ASCII value to
determine the magic number.
From the Format-Hex help, I'm copying this description:
The Format-Hex cmdlet displays a file or other input as hexadecimal
values. To determine the offset of a character from the output, add
the number at the leftmost of the row to the number at the top of the
column for that character.
This cmdlet can help you determine the file type of a corrupted file
or a file which may not have a file name extension. Run this cmdlet,
and then inspect the results for file information.
This tool is also a good way to get the magic number of a file.
Another tool is an online hex editor, but to be honest I didn't understand how to use it.
Now we have the magic number, but how do we know what type of data the file or stream contains?
That is the key question.
Luckily there are many databases of these magic numbers. Let me list some:
File Signatures
FILE SIGNATURES TABLE
List of file signatures
For example, the first database has a search capability: just enter the magic number with no spaces and search.
You may then find the file type. Yes, may: there is a real possibility that you won't directly find the file type in question.
I faced this and solved it by testing the streams against specific signatures. For example, I searched a stream for the PNG signature:
def GetPngStartingOffset(arr):
    # Targeted magic number for PNG (89 50 4E 47 0D 0A 1A 0A)
    # Note: returns 0 both for a match at offset 0 and when nothing is found
    markerFound = False
    startingOffset = 0
    previousValue = 0
    for i in range(len(arr)):
        currentValue = arr[i]
        if currentValue == 137: # 0x89
            markerFound = True
            startingOffset = i
            previousValue = currentValue
            continue
        if currentValue == 80: # 0x50
            if markerFound and previousValue == 137:
                previousValue = currentValue
                continue
            markerFound = False
        elif currentValue == 78: # 0x4E
            if markerFound and previousValue == 80:
                previousValue = currentValue
                continue
            markerFound = False
        elif currentValue == 71: # 0x47
            if markerFound and previousValue == 78:
                previousValue = currentValue
                continue
            markerFound = False
        elif currentValue == 13: # 0x0D
            if markerFound and previousValue == 71:
                previousValue = currentValue
                continue
            markerFound = False
        elif currentValue == 10: # 0x0A
            if markerFound and previousValue == 26:
                return startingOffset
            if markerFound and previousValue == 13:
                previousValue = currentValue
                continue
            markerFound = False
        elif currentValue == 26: # 0x1A
            if markerFound and previousValue == 10:
                previousValue = currentValue
                continue
            markerFound = False
        else:
            # Any other byte breaks a partial match
            markerFound = False
    return 0
Once this function has found the magic number, I split the stream at that offset and open the PNG image:
import io
from PIL import Image # assuming Image is PIL.Image from Pillow
arr = stream.read()
a = list(arr)
B = a[GetPngStartingOffset(a):len(a)]
bytesString = bytes(B)
image = Image.open(io.BytesIO(bytesString))
image.show()
In the end, this is not an end-to-end solution, but it is a way to figure out the content of streams.
Thanks for reading, and thanks to @Robert Columbia for his patience.
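The search for a signature inside a stream can also be sketched with bytes.find, which returns -1 when nothing is found. This is a minimal sketch, not the code used above: `SIGNATURES` and `find_signatures` are hypothetical names, and the table holds only three well-known signatures for illustration.

```python
SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",  # 89 50 4E 47 0D 0A 1A 0A
    "gzip": b"\x1f\x8b\x08",      # gzip header with the deflate method
    "bmp": b"BM",                 # 42 4D
}

def find_signatures(data):
    """Return {name: first offset} for each signature found in data."""
    hits = {}
    for name, magic in SIGNATURES.items():
        offset = data.find(magic)
        if offset != -1:
            hits[name] = offset
    return hits

# First bytes of the stream shown in the question.
head = (b'\x00\x00\x00\x00\x00\x00\x00\x00'
        b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x0b')
print(find_signatures(head))   # {'gzip': 8}
```

Short signatures such as the two-byte BMP marker can match by coincidence in the middle of unrelated data, so a hit is a hint to verify, for instance by trying to open the data from that offset.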

Lua true Binary I/O

I read this question, and I checked it myself.
I use the following snippet:
f = io.open("file.file", "wb")
f:write(1.34)
f:close()
This creates the file and writes 1.34 into it as text. That is the same as 00110001 00101110 00110011 00110100, i.e. the binary codes for the digit 1, the decimal point, then 3 and finally 4.
However, I would like to have written 00111111 10101100 11001100 11001101, which is a true float representation. How do I do so?
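You need to convert the number to its binary representation yourself before writing it. In Lua 5.3 and later, string.pack does this directly, for example f:write(string.pack(">f", 1.34)) for a big-endian 32-bit float; in earlier versions you have to build the bytes by hand or use a library. Discussions on the serialization of Lua numbers may also be useful.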
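The difference between the text form and the binary form can be illustrated with a short sketch, written in Python rather than Lua and assuming a big-endian 32-bit IEEE 754 float is wanted; `bits` is a hypothetical helper. It also suggests that the bit pattern quoted in the question corresponds to 1.35 rather than 1.34.

```python
import struct

def bits(data):
    """Format bytes as space-separated 8-bit binary groups."""
    return " ".join(format(b, "08b") for b in bytearray(data))

as_text = b"1.34"                       # what f:write(1.34) produces
as_float32 = struct.pack(">f", 1.34)    # big-endian IEEE 754 single precision

print(bits(as_text))                    # 00110001 00101110 00110011 00110100
print(bits(as_float32))                 # 00111111 10101011 10000101 00011111
print(bits(struct.pack(">f", 1.35)))    # 00111111 10101100 11001100 11001101
```

The ">" prefix selects big-endian byte order; "<" would write the same four bytes in reverse order.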

Resources