ssdeep fuzzy hashes and frida server binaries - frida

I am doing some testing on fuzzy hashing and was using frida server binaries as examples. I was assuming that newer versions of frida server build on top of existing code, so there should be a high percentage of similarity between releases. However, when I did an ssdeep comparison between minor versions such as 12.11.14 and 12.11.13, I found that the similarity is 0 (no similarity).
Has anyone else tried this before? Does this have to do with how the frida server code is compiled?
Example of fuzzy hashing frida_server_12.11.14_ios_arm64 and frida_server_12.11.13_ios_arm64:
filename frida-server-12.11.14-ios-arm64 393216:8ljDmG+xQiYfCR1t1rK8JuMwlG/WjeZFO5r+vxLEf8FC/ie:cvY5IMwlUWjeZFLxC
filename2 frida-server-12.11.13-ios-arm64 393216:hWCJZ51p1Y1flpCv5iu1rK42BQAtcqg9EywtUQO82xsT89C/7mekv:hLP6+kBQAtvg9Eywj2Eu
The comparison function in ssdeep returned 0, which means the signatures did not match.
However, when I compared the various architectures (arm, arm64 and arm64e) within the same version (12.11.14), there are some similarities.
filename frida-server-12.11.14-ios-arm64 393216:8ljDmG+xQiYfCR1t1rK8JuMwlG/WjeZFO5r+vxLEf8FC/ie:cvY5IMwlUWjeZFLxC
filename2 frida-server-12.11.14-ios-arm 196608:BpN3Tm15GpZHGG+xQiYMxi/s60k01dX1rKCVasZDpeCVR:XNjDmG+xQiYfCR1t1rK+e
fuzzy hashes comparison 57
filename frida-server-12.11.14-ios-arm64 393216:8ljDmG+xQiYfCR1t1rK8JuMwlG/WjeZFO5r+vxLEf8FC/ie:cvY5IMwlUWjeZFLxC
filename2 frida-server-12.11.14-ios-arm64e 786432:BBMMwlUWjeZFLxC99NMeQEaB9GFLM6j7D:VdyD
fuzzy hashes comparison 49
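One way to read these numbers: as far as I understand ssdeep's comparison, two signatures are only compared when their block sizes are equal or differ by a factor of two, and the score is forced to 0 unless the compared parts share a common substring of at least 7 characters. The sketch below is not ssdeep itself. The helper names are made up for illustration, and it only reports the longest substring shared by the parts that would be compared.

```python
MIN_COMMON = 7  # assumed minimum shared run for a non-zero ssdeep score


def parse_signature(sig):
    """Split 'blocksize:part1:part2' into (int, str, str)."""
    block_size, part1, part2 = sig.split(":", 2)
    return int(block_size), part1, part2


def longest_common_substring(a, b):
    """Return the longest contiguous run of characters present in both strings."""
    best = ""
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > len(best):
                    best = a[i - cur[j]:i]
        prev = cur
    return best


def comparable_parts(sig_a, sig_b):
    """Pair up the signature parts that were computed at the same block size."""
    bs_a, a1, a2 = parse_signature(sig_a)
    bs_b, b1, b2 = parse_signature(sig_b)
    if bs_a == bs_b:
        return [(a1, b1), (a2, b2)]
    if bs_a == 2 * bs_b:
        return [(a1, b2)]  # b's second part uses twice b's block size
    if bs_b == 2 * bs_a:
        return [(a2, b1)]
    return []  # block sizes too far apart to compare


def longest_shared_run(sig_a, sig_b):
    """Longest shared substring across all comparable part pairs."""
    runs = [longest_common_substring(x, y) for x, y in comparable_parts(sig_a, sig_b)]
    return max(runs, key=len, default="")


arm64_14 = "393216:8ljDmG+xQiYfCR1t1rK8JuMwlG/WjeZFO5r+vxLEf8FC/ie:cvY5IMwlUWjeZFLxC"
arm64_13 = "393216:hWCJZ51p1Y1flpCv5iu1rK42BQAtcqg9EywtUQO82xsT89C/7mekv:hLP6+kBQAtvg9Eywj2Eu"
arm_14 = "196608:BpN3Tm15GpZHGG+xQiYMxi/s60k01dX1rKCVasZDpeCVR:XNjDmG+xQiYfCR1t1rK+e"
arm64e_14 = "786432:BBMMwlUWjeZFLxC99NMeQEaB9GFLM6j7D:VdyD"

for name, other in [("12.11.13 arm64", arm64_13), ("12.11.14 arm", arm_14), ("12.11.14 arm64e", arm64e_14)]:
    run = longest_shared_run(arm64_14, other)
    print(name, len(run), len(run) >= MIN_COMMON)
```

Run on the signatures above, this finds a 17-character shared run for arm64 vs arm, a 12-character run for arm64 vs arm64e, and nothing as long as 7 characters for 12.11.14 vs 12.11.13, which would be consistent with the scores of 57, 49 and 0.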

Related

Importing MNIST dataset with Fortran

A Linux/GFortran question.
I know exactly what my problem is but I can't figure out how to solve it...
I want to import the MNIST dataset images and labels into Fortran arrays to play around with Machine Learning algorithms using Fortran. I've done this with Python but I can't replicate reading the data files with Fortran.
The dataset files and file layout descriptions are at:
http://yann.lecun.com/exdb/mnist/
The 2 problems I'm struggling with are...
1) The data in the files is stored in unsigned bytes. I can't find a similar datatype in Fortran. I'm using integer(kind=1) to read the first 4 bytes successfully, which constitutes the file magic number, but I'm worried about incorrectly reading the value of one of these bytes into the signed integer(kind=1) datatype.
2) The data is stored in big-endian format. So when I read the number of images, rows and columns, which are stored as 4-byte integers, on my little-endian machine, I get the obvious gobbledegook. Ideally, I would like to specify the endianness of a variable read from a file in an edit descriptor. Is this possible?
Any assistance would be much appreciated.
Kind regards
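For reference, the two conversions in question can be written out explicitly. The sketch below is in Python rather than Fortran and only illustrates the arithmetic a Fortran reader would have to replicate: masking a signed byte to recover its unsigned value, and assembling a 4-byte big-endian integer by hand. The function names are illustrative, and the header layout (magic number, image count, rows, columns) is the one described on the dataset page for the image files.

```python
import struct


def to_unsigned(signed_byte):
    """Map a signed 8-bit value (-128..127) to its unsigned value (0..255)."""
    return signed_byte & 0xFF


def big_endian_u32(signed_bytes):
    """Assemble four signed bytes, most significant first, into one integer."""
    value = 0
    for b in signed_bytes:
        value = (value << 8) | to_unsigned(b)
    return value


def read_idx_image_header(raw):
    """Parse the 16-byte image-file header: magic, count, rows, columns."""
    # '>' selects big-endian, 'I' is an unsigned 32-bit integer.
    return struct.unpack(">IIII", raw[:16])


# Synthetic header shaped like the training image file (not read from disk).
header = struct.pack(">IIII", 2051, 60000, 28, 28)
print(read_idx_image_header(header))

# The same count field read as signed bytes, as integer(kind=1) would see it.
signed = struct.unpack("4b", header[4:8])
print(signed, big_endian_u32(signed))
```

The manual version mirrors what would be needed with integer(kind=1) reads: widen each byte to a larger integer, mask it with 255, then shift and combine.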

aarch64 MMU: skipping first/second level tables

On the armv7 [1] / aarch32 [2] MMU, when using the Long-descriptor format and the virtual address space described by ttbr0 is small enough (1 GB here), the level 1 translation can be skipped, leaving only two levels of translation.
However, I saw nothing of the kind in the aarch64 translation description. Does anyone know whether it is still possible to reduce the number of translation table levels used by ttbr0 on aarch64?
A reference in the ARM ARM would be great if it exists.
Best,
V.
[1]: ARMARM v7 , B3.6 Long-descriptor translation table format, Fig B3-12 General view of stage 1 address translation using Long-descriptor format
[2]: ARMARM v8, G4.6.1 Overview of VMSAv8-32 address translation using Long-descriptor translation tables, Fig G4-8
As discussed here, it is still possible, and has been extended to several scenarios. It mostly depends on the granule size chosen and the size of the virtual address space. Some tables (such as Table D4-11, "TCR.TnSZ values and IA ranges when there is no concatenation of translation tables") help you understand at which translation level you will start, depending on your configuration.
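The dependency on granule size and address-space size can be approximated with a little arithmetic. The sketch below assumes stage 1 translation with no concatenated tables, 8-byte descriptors, and a 4-level maximum (levels 0 to 3), and it ignores architecture extensions that change these limits. The function name is made up, and the ARM ARM tables remain the authoritative source.

```python
def stage1_start_level(tnsz, granule_bytes):
    """Estimate the initial lookup level for a given TCR.TnSZ and granule size."""
    page_bits = granule_bytes.bit_length() - 1   # 12, 14 or 16 offset bits
    bits_per_level = page_bits - 3               # entries per table = granule / 8
    ia_bits = 64 - tnsz                          # size of the input address range
    bits_to_resolve = ia_bits - page_bits
    levels = -(-bits_to_resolve // bits_per_level)  # ceiling division
    return 4 - levels                            # the walk always ends at level 3


for granule in (4096, 16384, 65536):
    for tnsz in (16, 25, 34):
        print(granule, tnsz, stage1_start_level(tnsz, granule))
```

With a 4 KB granule, for example, a 48-bit range (TnSZ = 16) starts at level 0, a 39-bit range (TnSZ = 25) starts at level 1, and a 30-bit range (TnSZ = 34) starts at level 2, which corresponds to the 1 GB case from the question.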

Text clustering within a log file

I am working on a problem of finding similar content in a log file. Let's say I have a log file which looks like this:
show version
Operating System (OS) Software
Software
BIOS: version 1.0.10
loader: version N/A
kickstart: version 4.2(7b)
system: version 4.2(7b)
BIOS compile time: 01/08/09
kickstart image file is: bootflash:/m9500-sf2ek9-kickstart-mz.4.2.7b.bin
kickstart compile time: 8/16/2010 13:00:00 [09/29/2010 23:10:48]
system image file is: bootflash:/m9500-sf2ek9-mz.4.2.7b.bin
system compile time: 8/16/2010 13:00:00 [09/30/2010 00:46:36]
Hardware
xxxx MDS 9509 (9 Slot) Chassis ("xxxxxxx/xxxxx-2")
xxxxxxx, xxxx with 1033100 kB of memory.
Processor Board ID xxxx
Device name: xxx-xxx-1
bootflash: 1000440 kB
slot0: 0 kB (expansion flash)
To the human eye, it is easy to see that "Software" and the data below it form one section, and "Hardware" and the data below it form another. Is there a way to use machine learning or some other technique to cluster similar sections based on a pattern? I have shown two similar kinds of pattern here, but the patterns between sections might vary, and such sections should then be identified as different. I have tried cosine similarity, but it doesn't help much because the words aren't similar; the pattern is.
I see actually two separate machine learning problems:
1) If I understood you correctly, the first problem you want to solve is splitting each log into distinct sections: one for Hardware, one for Software, etc.
One approach is to extract the headings that mark the beginning of a new section. To do so, you could manually annotate a set of different logs, labelling each row as heading=true or heading=false.
Now you could train a classifier that takes your labelled data as input; the result would be a model that flags heading rows.
2) Now that you have these different sections, you can split each log into those sections and treat each section as a separate document.
I would first try a straightforward document clustering using a standard NLP pipeline:
Tokenize your document to get the tokens
Normalize them (maybe stemming is not the best idea for logs)
Create a tf-idf vector for each document
Start with a simple clustering algorithm like k-means to try to cluster the different sections
After the clustering you should have the sections that are similar to each other in the same cluster
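The first three steps of this pipeline can be sketched with the standard library alone. This is a minimal illustration under assumptions of my own: the tokenizer simply lowercases and keeps alphabetic runs, the tf-idf weighting is one common smoothed variant, and the sample sections are shortened from the log above plus one invented variant. It stops at pairwise cosine similarity; a clustering algorithm such as k-means would take these vectors as input.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs (digits and punctuation dropped)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse tf-idf vector (dict of token -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for tokens in tokenized for tok in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            tok: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


sections = [
    "BIOS: version 1.0.10\nkickstart: version 4.2(7b)\nsystem: version 4.2(7b)",
    "BIOS: version 2.1.0\nkickstart: version 5.0(1a)\nsystem: version 5.0(1a)",  # invented variant
    "bootflash: 1000440 kB\nslot0: 0 kB (expansion flash)",
]
software_a, software_b, hardware = tfidf_vectors(sections)
print(cosine(software_a, software_b), cosine(software_a, hardware))
```

The two software-style sections score well above zero against each other, while the hardware section shares no tokens with them and scores zero.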
I hope this helps. I think the first task in particular is quite hard, and hand-tailored patterns may perform better.

Will random() ever change?

I have been looking into a development issue that requires the use of pseudorandom number generation to allow the same set of random numbers to be generated for a given seed.
I have been looking at using long random(void) and void srandom(unsigned seed) for this (man page), and currently these generate the same set of random numbers in a Mac app, an iOS app and a 64-bit iOS app, which is what I was hoping for. The iOS tests were only run in the simulator, so I don't know whether this affects the result.
My main concern is that this algorithm could change at some point, making the applications we're developing effectively useless with old data. What are the chances of these algorithms changing or being different on a future device?
I'd say it's extremely likely they will change as the sequence is not guaranteed by any standard.
Why not use your own random number sequence? Even a simple linear congruential generator satisfies most statistical properties of randomness. Here is the formula for such a generator:
next_number = (a * current_number + b) % c
with
a = 1103515245
b = 12345
c = 4294967296
These values of a, b, c give you good statistical properties and are quite well known for building quick and dirty generators.
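As a minimal sketch of that formula (in Python for brevity; the class name is made up for illustration), the generator is only a few lines, and because every constant is written out in your own code, the sequence for a given seed does not depend on any platform library:

```python
class LinearCongruentialGenerator:
    """next_number = (a * current_number + b) % c with the constants given above."""

    A = 1103515245
    B = 12345
    C = 4294967296  # 2**32

    def __init__(self, seed):
        self.state = seed % self.C

    def next(self):
        """Advance the state by one step and return the new value."""
        self.state = (self.A * self.state + self.B) % self.C
        return self.state


first = LinearCongruentialGenerator(seed=1)
second = LinearCongruentialGenerator(seed=1)
sequence_one = [first.next() for _ in range(5)]
sequence_two = [second.next() for _ in range(5)]
print(sequence_one == sequence_two)  # same seed, same sequence
```

In C the same update can be done in unsigned 64-bit arithmetic so that the multiplication does not overflow before the modulo is applied.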
I don't have the slightest idea about the answer to the question you ask.
If a related question is "How can I be absolutely sure to have the same pseudo-random sequences generated in 10 years time ?", the answer to this question is : don't rely on an external library, write the code explicitly.
Bathsheba proposed this generator. You can google for "pseudo random generator algorithm". Here is a list of algorithms listed on wikipedia.
In fact, srandom did change since Mac OS X 10.7, according to this blog post. However, this was due to the way srandom was implemented: it tried to access an uninitialized local variable, which is undefined behavior in C. According to the post, the new compiler used since Mac OS X 10.7 optimized out the uninitialized memory access, changing its behavior in subtle ways.

False autovectorization in Intel C compiler (icc)

I need to vectorize some huge loops in a program with SSE. To save time I decided to let ICC deal with it. For that purpose, I prepare the data properly, taking alignment into account, and I use the compiler directives #pragma simd, #pragma aligned and #pragma ivdep. When compiling with the various -vec-report options, the compiler tells me that the loops were vectorized. A quick look at the assembly generated by the compiler seems to confirm that, since it contains plenty of vector instructions that work on packed single-precision operands (all operations in the serial code handle float operands).
The problem is that when I read hardware counters with PAPI, the number of FP operations I get (PAPI_FP_INS and PAPI_FP_OPS) is pretty much the same in the auto-vectorized code and the original one, when one would expect it to be significantly lower in the auto-vectorized code. What's more, I vectorized by hand a simplified version of the problem in question, and in that case I do get roughly 3 times fewer FP operations.
Has anyone experienced something similar with this?
Register spills may destroy the advantage of vectorization, so 64-bit mode may gain significantly over 32-bit mode. Also, icc may version a loop, and you may be hitting a scalar version even though a vector version is present. icc versions issued in the last year or two have fixed some problems in this area.
