rc4 key recovery (2 messages, same partial key) - stream

I have two messages that are encrypted with the same partial key. For example:
C1 = RC4(M1, "(VARIABLE_DATA)XXXXYYYY")
C2 = RC4(M2, "(VARIABLE_DATA)XXXXYYYY")
Is it possible with RC4, if C1 and C2 are known, to at least recover the partial key "XXXXYYYY", since that part never changes?

I think there's some confusion about your question. The way stream ciphers work is by generating a keystream that gets (usually) exclusive-or'ed with the message. You are correct that if you use the same key and IV, and thus the same keystream, this leaks information about the messages.
Here, K is the keystream generated by RC4:
C1 = K ^ M1
C2 = K ^ M2
And by rearranging:
C1 ^ C2 = (K ^ M1) ^ (K ^ M2)
the keystream cancels out here, and you're left with
C1 ^ C2 = M1 ^ M2
Since the attacker knows the two ciphertext values, he can compute the difference of the two messages. If the attacker knows one of the inputs (perhaps a fixed header), he can compute the second message.
M2 = (C1 ^ C2) ^ M1
There are also statistical tests using cribs, if the messages are natural language.
To answer your question, RC4 should generate an entirely different keystream under related keys, so this attack won't work. There are other attacks against the key scheduling algorithm though, and plenty of reasons to prefer an alternative to RC4.
If you're asking about recovering the initial key from the keystream, there are a few published attacks on RC4's key scheduling algorithm, but none of them let you simply read the key back out of the keystream.
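To illustrate the related-key point above, here is a small self-contained sketch (a textbook RC4 written in C++ for this answer, not any particular library; the prefixes are made up) that prints the first 16 keystream bytes for two keys sharing the "XXXXYYYY" suffix but differing in the variable prefix. The two keystreams have nothing obvious in common:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>

// Textbook RC4: key-scheduling algorithm (KSA) followed by the PRGA.
static void rc4_keystream(const std::string &key, uint8_t *out, size_t n)
{
    uint8_t S[256];
    for (int i = 0; i < 256; ++i)
        S[i] = static_cast<uint8_t>(i);

    // KSA: permute S under the key.
    for (int i = 0, j = 0; i < 256; ++i) {
        j = (j + S[i] + static_cast<uint8_t>(key[i % key.size()])) & 0xff;
        std::swap(S[i], S[j]);
    }

    // PRGA: squeeze out n keystream bytes.
    for (size_t k = 0, i = 0, j = 0; k < n; ++k) {
        i = (i + 1) & 0xff;
        j = (j + S[i]) & 0xff;
        std::swap(S[i], S[j]);
        out[k] = S[(S[i] + S[j]) & 0xff];
    }
}

int main()
{
    const std::string keys[2] = { "prefixA-XXXXYYYY", "prefixB-XXXXYYYY" };
    for (const std::string &key : keys) {
        uint8_t ks[16];
        rc4_keystream(key, ks, sizeof ks);
        printf("%-18s:", key.c_str());
        for (uint8_t b : ks)
            printf(" %02x", b);
        printf("\n");
    }
    return 0;
}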

In general keys of uncompromised encryption techniques are only recoverable using brute force, which in turn requires some method of verifying that the decryption was successful.
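For what it's worth, "verifying that the decryption was successful" usually just means a plausibility test on the candidate plaintext. A crude C++ sketch (assuming the plaintext is expected to be mostly printable ASCII; a real verifier would use whatever structure is actually known, e.g. a fixed header or a checksum):

#include <cctype>
#include <string>

// Returns true if the candidate decryption looks like printable text.
static bool looksLikePlaintext(const std::string &candidate)
{
    if (candidate.empty())
        return false;
    size_t printable = 0;
    for (unsigned char c : candidate)
        if (std::isprint(c) || std::isspace(c))
            ++printable;
    return printable * 100 >= candidate.size() * 95;  // at least 95% printable
}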

Trying to solve this exact question I've stumbled across several threads with almost the same question and absolutely the same answer... which didn't work for me. BUT! The answers were absolutely correct, and @mfanto gives a very accurate description of what needs to be done (though the brackets make no sense, as you can see in my code)!
Here is my C code which worked for me:
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    if (argc != 4)
    {
        printf("usage: ./rc4_pa cipher1 cipher2 message1\n");
        return 1;
    }

    char *c1 = argv[1]; /* ciphertext of message 1 */
    char *c2 = argv[2]; /* ciphertext of message 2 */
    char *m1 = argv[3]; /* known plaintext of message 1 */

    /* strlen() stops at the first NUL byte, so the raw ciphertexts must not
       contain embedded zeros -- hence the base64 detour described below */
    int len_c1 = strlen(c1);
    int len_m1 = strlen(m1);
    int len = len_m1 < len_c1 ? len_m1 : len_c1; /* don't read past either buffer */

    char m2[len_m1 + 1];
    m2[len] = '\0';

    /* C1 ^ C2 = M1 ^ M2, therefore M2 = C1 ^ M1 ^ C2 */
    for (int i = 0; i < len; i++)
    {
        m2[i] = c1[i] ^ m1[i] ^ c2[i];
    }

    printf("decrypted: %s\n", m2);
    return 0;
}
Why didn't my code work out of the box? I got my ciphertext from a web server, and usually some characters of a ciphertext are not printable. The only way to pass them along is to encode them once more; in my case that was base64.
Save the code to rc4_pa.c, build rc4_pa, then use it like this:
$ ./rc4_pa $(echo L1Gd8F5g | base64 -d) $(echo MFuD8FVg | base64 -d) hello
Hope someone else might find it helpful.

Related

Clang: How to get the macro name used for size of a constant size array declaration

TL;DR;
How to get the macro name used for size of a constant size array declaration, from a callExpr -> arg_0 -> DeclRefExpr.
Detailed Problem statement:
Recently I started working on a challenge which requires a source-to-source transformation tool for modifying
specific function calls with an additional argument. Researching ways I could achieve this introduced me
to this amazing toolset, Clang. I've been learning how to use the different tools provided in libtooling to
achieve my goal, but now I'm stuck at a problem and seek your help here.
Consider the program below (a dummy of my sources). My goal is to rewrite all calls to the strcpy
function with a safe version, strcpy_s, and add an additional parameter to the new function call,
i.e. the destination pointer's maximum size. So, for the program below my refactored call would look like
strcpy_s(inStr, STR_MAX, argv[1]);
I wrote a RecursiveASTVisitor class and inspect all function calls in the VisitCallExpr method. To get the max size
of the dest argument I get the VarDecl of the first argument and try to get the size (ConstantArrayType). Since
the source file is already preprocessed, I see 2049 as the size, but what I need is the macro STR_MAX in
this case. How can I get that?
(I'm creating Replacements with this info and using RefactoringTool to apply them afterwards.)
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define STR_MAX 2049
int main(int argc, char **argv){
    char inStr[STR_MAX];
    if(argc > 1){
        //Clang tool required to transform the below call into strncpy_s(inStr, STR_MAX, argv[1], strlen(argv[1]));
        strcpy(inStr, argv[1]);
    } else {
        printf("\n not enough args");
        return -1;
    }
    printf("got [%s]", inStr);
    return 0;
}
As you correctly noticed, the source code is already preprocessed and has all the macros expanded. Thus, the AST will simply have an integer expression as the size of the array.
A little bit of information on source locations
NOTE: you can skip it and proceed straight to the solution below
The information about expanded macros is contained in the source locations of AST nodes and can usually be retrieved using the Lexer (Clang's lexer and preprocessor are very tightly connected and can even be considered one entity). It's a bare minimum and not very obvious to work with, but it is what it is.
As you are looking for a way to get the original macro name for a replacement, you only need to get the spelling (i.e. the way it was written in the original source code), and you don't need to care much about macro definitions, function-style macros and their arguments, etc.
Clang has two different kinds of source locations: SourceLocation and CharSourceRange. The first one can be found pretty much everywhere throughout the AST. It refers to a position in terms of tokens. This explains why begin and end positions can be somewhat counterintuitive:
// clang::DeclRefExpr
//
// ┌─ begin location
foo(VeryLongButDescriptiveVariableName);
// └─ end location
// clang::BinaryOperator
//
// ┌─ begin location
int Result = LHS + RHS;
// └─ end location
As you can see, this type of source location points to the beginning of the corresponding token. A CharSourceRange, on the other hand, refers directly to characters.
So, in order to get the original text of the expression, we need to convert the SourceLocations into a CharSourceRange and get the corresponding text from the source.
The solution
I've modified your example to show other cases of macro expansions as well:
#define STR_MAX 2049
#define BAR(X) X
int main() {
char inStrDef[STR_MAX];
char inStrFunc[BAR(2049)];
char inStrFuncNested[BAR(BAR(STR_MAX))];
}
The following code:
// clang::VarDecl *VD;
// clang::ASTContext *Context;
auto &SM = Context->getSourceManager();
auto &LO = Context->getLangOpts();

auto DeclarationType = VD->getTypeSourceInfo()->getTypeLoc();

if (auto ArrayType = DeclarationType.getAs<ConstantArrayTypeLoc>()) {
    auto *Size = ArrayType.getSizeExpr();
    auto CharRange = Lexer::getAsCharRange(Size->getSourceRange(), SM, LO);
    // Lexer gets text for [start, end), and we want it to include the end as well
    CharRange.setEnd(CharRange.getEnd().getLocWithOffset(1));
    auto StringRep = Lexer::getSourceText(CharRange, SM, LO);
    llvm::errs() << StringRep << "\n";
}
produces this output for the snippet:
STR_MAX
BAR(2049)
BAR(BAR(STR_MAX))
I hope this information is helpful. Happy hacking with Clang!

mlpack sparse coding solution not found

I am trying to learn how to use the Sparse Coding algorithm with the mlpack library. When I call Encode() on my instance of mlpack::sparse_coding::SparseCoding, I get the error
[WARN] There are 63 inactive atoms. They will be reinitialized randomly.
error: solve(): solution not found
Is it simply that the algorithm cannot learn a latent representation of the data? Or perhaps it is my usage? The relevant section follows.
EDIT: One line was modified to fix an unrelated error, but the original error remains.
double* Application::GetSparseCodes(arma::mat* trainingExample, int atomCount)
{
    double* latentRep = new double[atomCount];

    // Learn a dictionary and the corresponding sparse codes for the training data.
    mlpack::sparse_coding::SparseCoding<mlpack::sparse_coding::DataDependentRandomInitializer>
        sc(*trainingExample, Utils::ATOM_COUNT, 1.0);
    sc.Encode(Utils::MAX_ITERATIONS);

    // Copy the codes for the first (and only) column into a plain array.
    arma::mat& latentRepMat = sc.Codes();
    for (int i = 0; i < atomCount; i++)
        latentRep[i] = latentRepMat.at(i, 0);

    return latentRep;
}
Some relevant parameters
const static int IMAGE_WIDTH = 20;
const static int IMAGE_HEIGHT = 20;
const static int PIXEL_COUNT = IMAGE_WIDTH * IMAGE_HEIGHT;
const static int ATOM_COUNT = 64;
const static int MAX_ITERATIONS = 100000;
This could be one of a handful of issues but given the description it's a little difficult to tell which of these it is (or if it is something else entirely). However, these three ideas should provide a good place to start:
Matrices in mlpack are column-major. That means each observation should represent a column. If you use mlpack::data::Load() to load, e.g., a CSV file (which are generally one row per observation), it will automatically transpose the dataset. SparseCoding will act oddly if you pass it transposed data. See also http://www.mlpack.org/doxygen.php?doc=matrices.html.
If there are 63 inactive atoms, then only one atom is actually active (given that ATOM_COUNT is 64). This means that the algorithm has found that the best way to represent the dictionary (at a given step) uses only one atom. This could happen if the matrix you are passing consists of all zeros.
mlpack will provide verbose output, which may also be helpful for debugging. Usually this is enabled through mlpack's CLI class when parsing command-line input, but you can turn it on directly with mlpack::Log::Info.ignoreInput = false. You may get a lot of output that way, but it gives a better look at what is going on (see the sketch below).
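For reference, a minimal sketch of the first and third points above (assuming the same mlpack 2.x-era API as in the question; the file name is made up):

#include <mlpack/core.hpp>

int main()
{
    // Turn on verbose logging so SparseCoding reports what it is doing.
    mlpack::Log::Info.ignoreInput = false;

    // data::Load() transposes row-per-observation CSV files by default, so each
    // observation ends up as a column, which is what SparseCoding expects.
    arma::mat dataset;
    mlpack::data::Load("training.csv", dataset, true /* fatal on failure */);

    // Quick sanity checks before encoding: expected shape is
    // PIXEL_COUNT rows x (number of examples) columns, and not all zeros.
    mlpack::Log::Info << dataset.n_rows << " x " << dataset.n_cols
        << ", max |value| = " << arma::abs(dataset).max() << std::endl;
    return 0;
}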
The mlpack project has its own mailing list where you may be likely to get a quicker or more comprehensive response, by the way.

Bitwise operations, wrong result in Dart2Js

I'm doing ZigZag encoding on 32bit integers with Dart. This is the source code that I'm using:
int _encodeZigZag(int instance) => (instance << 1) ^ (instance >> 31);
int _decodeZigZag(int instance) => (instance >> 1) ^ (-(instance & 1));
The code works as expected in the DartVM.
But in dart2js the _decodeZigZag function returns invalid results if I input negative numbers. For example -10: it is encoded to 19 and should be decoded back to -10, but it is decoded to 4294967286. If I run (instance >> 1) ^ (-(instance & 1)) in the JavaScript console of Chrome, I get the expected result of -10. That tells me JavaScript should be able to run this operation properly with its number model.
But Dart2Js generate the following JavaScript, that looks different from the code I tested in the console:
return ($.JSNumber_methods.$shr(instance, 1) ^ -(instance & 1)) >>> 0;
Why does Dart2Js add an unsigned right shift by 0 to the function? Without the shift, the result would be as expected.
Now I'm wondering, is it a bug in the Dart2Js compiler or the expected result? Is there a way to force Dart2Js to output the right javascript code?
Or is my Dart code wrong?
PS: I also tested splitting up the XOR into other operations, but Dart2Js still adds the right shift:
final a = -(instance & 1);
final b = (instance >> 1);
return (a & -b) | (-a & b);
Results in:
a = -(instance & 1);
b = $.JSNumber_methods.$shr(instance, 1);
return (a & -b | -a & b) >>> 0;
For efficiency reasons dart2js compiles Dart numbers to JS numbers. JS, however, only provides one number type: doubles. Furthermore bit-operations in JS are always truncated to 32 bits.
In many cases (like cryptography) it is easier to deal with unsigned 32 bits, so dart2js compiles bit-operations so that their result is an unsigned 32 bit number.
Neither choice (signed or unsigned) is perfect. Initially dart2js compiled to signed 32 bits, and that was only changed when we tripped over it too frequently. As your code demonstrates, this doesn't remove the problem, it just shifts it to different (hopefully less frequent) use-cases.
Non-compliant number semantics have been a long-standing bug in dart2js, but fixing it will take time and potentially slow down the resulting code. In the short-term future Dart developers (compiling to JS) need to know about this restriction and work around it.
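To see where 4294967286 comes from, it helps to run the same bits through genuine 32-bit integers outside of JS number semantics. A small C++ sketch (assuming the usual two's-complement arithmetic right shift; the left shift is done on the unsigned representation to sidestep signed-overflow rules):

#include <cstdint>
#include <cstdio>

// Mirrors the Dart functions on real 32-bit signed integers.
static int32_t encodeZigZag(int32_t n)
{
    return static_cast<int32_t>(static_cast<uint32_t>(n) << 1) ^ (n >> 31);
}

static int32_t decodeZigZag(int32_t n)
{
    return (n >> 1) ^ -(n & 1);
}

int main()
{
    int32_t encoded = encodeZigZag(-10);                      // 19
    int32_t decoded = decodeZigZag(encoded);                  // -10
    uint32_t reinterpreted = static_cast<uint32_t>(decoded);  // what ">>> 0" yields in JS

    printf("encoded       = %d\n", encoded);
    printf("decoded       = %d\n", decoded);
    printf("reinterpreted = %u\n", reinterpreted);            // 4294967286
    return 0;
}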
Looks like I found equivalent code that outputs the right result. The unit tests pass for both the Dart VM and dart2js, and I will use it for now.
int _decodeZigZag(int instance) => ((instance & 1) == 1 ? -(instance >> 1) - 1 : (instance >> 1));
Dart2Js is not adding a shift this time. I would still be interested in the reason for this behavior.

c stream buffer

I am using C and need a stream buffer mechanism that I can write arbitrary bytes to and read bytes from. I would prefer something that is platform independent (or that can at least run on OS X and Linux). Is anyone aware of any permissive, lightweight libraries or code that I can drop in?
I've used the buffers within libevent and I may end up going that route, but it seems overkill to have libevent as a dependency when I don't do any sort of event-based I/O.
If you don't mind depending on C++ and possibly some bits of STL, you can use std::stringstream. It shouldn't be too difficult to write a thin C wrapper around it.
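A minimal sketch of such a wrapper (the sbuf_* names are hypothetical; a production version would want error handling and a header for the C side):

#include <sstream>
#include <cstddef>

// Opaque handle: C callers only ever see "struct sbuf *".
struct sbuf { std::stringstream ss; };

extern "C" {

struct sbuf *sbuf_create(void) { return new sbuf(); }

void sbuf_destroy(struct sbuf *b) { delete b; }

// Append len bytes to the buffer.
void sbuf_write(struct sbuf *b, const void *data, size_t len)
{
    b->ss.write(static_cast<const char *>(data), static_cast<std::streamsize>(len));
}

// Read up to len bytes in FIFO order; returns the number of bytes actually read.
size_t sbuf_read(struct sbuf *b, void *out, size_t len)
{
    b->ss.read(static_cast<char *>(out), static_cast<std::streamsize>(len));
    size_t got = static_cast<size_t>(b->ss.gcount());
    b->ss.clear();  // a short read sets the eof/fail bits; clear them so later writes still work
    return got;
}

} // extern "C"

Compile it with a C++ compiler and link it into the C program; the string inside the stringstream does all the growing for you.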
Is setbuf(3) (and its aliases) the 'mechanism' you are searching for?
Please consider the following example:
#include <stdio.h>

int main()
{
    char buf[256];

    /* route stderr's buffering through our own buffer (setbuffer is a BSD extension) */
    setbuffer(stderr, buf, 256);
    fprintf(stderr, "Error: no more oxygen.\n");

    /* the text now sits in buf, so it can be patched before it is flushed */
    buf[1] = 'R';
    buf[2] = 'R';
    buf[3] = 'O';
    buf[4] = 'R';

    fflush(stderr);
}

Parsing really big log files (>1Gb, <5Gb)

I need to parse very large log files (>1Gb, <5Gb) - actually I need to strip the data into objects so I can store them in a DB. The log file is sequential (no line breaks), like:
TIMESTAMP=20090101000000;PARAM1=Value11;PARAM2=Value21;PARAM3=Value31;TIMESTAMP=20090101000100;PARAM1=Value11;PARAM2=Value21;PARAM3=Value31;TIMESTAMP=20090101000152;PARAM1=Value11;PARAM2=Value21;PARAM3=Value31;...
I need to strip this into the table:
TIMESTAMP | PARAM1 | PARAM2 | PARAM3
The process needs to be as fast as possible. I'm considering using Perl, but any suggestions using C/C++ would be really welcome. Any ideas?
Best regards,
Arthur
Write a prototype in Perl and compare its performance against how fast you can read data off of the storage medium. My guess is that you'll be I/O bound, which means that using C won't offer a performance boost.
This presentation about the use of Python generators blew my mind:
http://www.dabeaz.com/generators-uk/
David M. Beazley shows how to process multi-gigabyte log files by basically defining a generator for each processing step. The generators are then 'plugged' into each other until you have some simple utility functions
lines = lines_from_dir("access-log*","www")
log = apache_log(lines)
for r in log:
    print r
which can then be used for all sorts of querying:
stat404 = set(r['request'] for r in log
              if r['status'] == 404)
large = (r for r in log
         if r['bytes'] > 1000000)
for r in large:
    print r['request'], r['bytes']
He also shows that performance compares well to the performance of standard unix tools like grep, find etc.
Of course this being Python, it's much easier to understand and most importantly easier to customise or adapt to different problem sets than perl or awk scripts.
(The code examples above are copied from the presentation slides.)
Lex handles this sort of things amazingly well.
But really, use AWK. Its performance is not bad, even compared with Perl, etc. Of course Map/Reduce would work quite well, but what about the overhead of splitting the file into appropriate chunks?
Try AWK
The key won't be the language because the problem is I/O bound, so pick the language that you feel most comfortable with.
The key is how it is coded. You'll be fine as long as you don't load the whole file into memory -- load it a chunk at a time, and save the data a chunk at a time; that will be more efficient.
Java has a PushbackInputStream that may make this easier to code. The idea is that you guess how much to read, and if you read too little, then push the data back, and read a larger chunk.
Then when you've read too much, process the data and then push back the remaining bit and continue to the next iteration of the loop.
Something like this should work.
use strict;
use warnings;

my $filename = shift @ARGV;
open my $io, '<', $filename or die "Can't open $filename";

my ($match_buf, $read_buf, $count);
while (($count = sysread($io, $read_buf, 1024, 0)) != 0) {
    $match_buf .= $read_buf;
    # consume every complete record in the buffer; anything incomplete stays for the next read
    while ($match_buf =~ s{TIMESTAMP=(\d{14});PARAM1=([^;]+);PARAM2=([^;]+);PARAM3=([^;]+);}{}) {
        my ($timestamp, @params) = ($1, $2, $3, $4);
        print $timestamp . "\n";
        last unless $timestamp;
    }
}
This is easily handled in Perl, Awk, or C. Here's a start on a version in C for you:
#include <stdio.h>
#include <string.h>
#include <err.h>

int
main(int argc, char **argv)
{
    const char *filename = "noeol.txt";
    FILE *f;
    char buffer[1024], *s, *p;
    char line[1024];
    size_t n;

    if ((f = fopen(filename, "r")) == NULL)
        err(1, "cannot open %s", filename);

    while (!feof(f)) {
        n = fread(buffer, 1, sizeof buffer, f);
        if (n == 0) {
            if (ferror(f))
                err(1, "error reading %s", filename);
            else
                continue;
        }
        /* split the chunk on ';' and print one field per line,
           indenting everything except the TIMESTAMP fields */
        for (s = p = buffer; p - buffer < n; p++) {
            if (*p == ';') {
                *p = '\0';
                strncpy(line, s, p - s + 1);
                s = p + 1;
                if (strncmp("TIMESTAMP", line, 9) != 0)
                    printf("\t");
                printf("%s\n", line);
            }
        }
    }
    fclose(f);
    return 0;
}
Sounds like a job for sed:
sed -e 's/;\?[A-Z0-9]*=/|/g' -e 's/\(^\|\)\|\(;$\)//g' < input > output
You might want to take a look at Hadoop (java) or Hadoop Streaming (runs Map/Reduce jobs with any executable or script).
If you code your own solution, you will probably benefit from reading larger chunks of data from the file and processing them in batches (rather than using, say, readline()) and looking for the newline marking the end of each row. With this approach, you need to be mindful that you may not have retrieved the entirety of the last line, so some logic would be required to handle that.
I don't know what performance benefits you'd realize, since I haven't tested it, but I've leveraged similar techniques with success.
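Since this particular file has no newlines, the same chunk-and-carry-over idea has to key off the record delimiter instead. A rough C++ sketch (handle_record is a placeholder for whatever parsing and DB insertion you do):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

static void handle_record(const std::string &record)
{
    // Placeholder: split the TIMESTAMP/PARAM1/PARAM2/PARAM3 pairs and store them.
    std::cout << record << "\n";
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        std::cerr << "usage: parselog <file>\n";
        return 1;
    }

    std::ifstream in(argv[1], std::ios::binary);
    std::string carry;                 // incomplete record left over from the previous chunk
    std::vector<char> chunk(1 << 20);  // 1 MiB read blocks

    while (in) {
        in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        carry.append(chunk.data(), static_cast<size_t>(in.gcount()));

        // A record is complete once the start of the next one is visible.
        size_t start = 0;
        for (;;) {
            size_t next = carry.find("TIMESTAMP=", start + 1);
            if (next == std::string::npos)
                break;
            handle_record(carry.substr(start, next - start));
            start = next;
        }
        carry.erase(0, start);         // keep only the unfinished tail
    }
    if (!carry.empty())
        handle_record(carry);          // the last record has no successor
    return 0;
}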
I know this is an exotic language and may not be the best solution, but when I have ad hoc data, I consider PADS.

Resources