Z3: express linear algebra properties

I would like to prove properties of expressions involving matrices and vectors (potentially large size, but size is fixed).
For example I want to prove that the outcome of an expression is a diagonal matrix or a triangular matrix, or it is positive definite, ...
To that end I'd like to encode well-known properties and identities from linear algebra, such as:
||x + y|| <= ||x|| + ||y||
(A * B) * C = A * (B * C)
det(A * B) = det(A) * det(B)
Tr(zA) = z * Tr(A)
(I + AB)^(-1) = I - A (I + BA)^(-1) B
...
I have attempted to implement this in Z3, but even for simple properties it returns unknown or times out. I've tried with the array theory and with quantifiers.
I'd like to know whether this problem can be solved with an SMT solver, or whether SMT solvers are not suited to this kind of problem. Could you give a hint with a small example?

You can certainly use Z3 to do this.
I have constructed a small example here, which defines the identity matrix and what it means to be a diagonal matrix, and then proves that the identity matrix is diagonal.
So, it is definitely possible to do this kind of work in Z3. Though you may find you have a better time using a tool built on top of Z3 that has more interactive proving features, such as Dafny or F*.
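For reference, here is a minimal z3py sketch of the same idea (my own reconstruction, not the answerer's linked example): encode a fixed 3x3 matrix as a function from indices to reals, define what "diagonal" means, and ask Z3 for a counterexample.

    # Minimal z3py sketch (my reconstruction, not the linked example).
    from z3 import Solver, Function, IntSort, RealSort, Int, Implies, And, If, ForAll, Not

    n = 3  # fixed size
    M = Function('M', IntSort(), IntSort(), RealSort())
    i, j = Int('i'), Int('j')
    in_range = And(0 <= i, i < n, 0 <= j, j < n)

    # M is the identity matrix: 1 on the diagonal, 0 elsewhere.
    is_identity = ForAll([i, j], Implies(in_range, M(i, j) == If(i == j, 1, 0)))

    # "Diagonal" means all off-diagonal entries are zero.
    is_diagonal = ForAll([i, j], Implies(And(in_range, i != j), M(i, j) == 0))

    s = Solver()
    s.add(is_identity, Not(is_diagonal))  # look for a counterexample
    print(s.check())                      # unsat => the identity is diagonal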

Related

Is activation only used for non-linearity?

Is activation only used for non-linearity, or for both kinds of problems? I am still confused about why we need an activation function and how it helps.
Generally, such a question would be better suited for Stats Stackexchange or the Data Science Stackexchange, since it is a purely theoretical question and not directly related to programming (which is what Stackoverflow is for).
Anyway, I am assuming that you are referring to the classes of linearly separable and not linearly separable problems when you talk about "both problems".
In fact, a non-linearity is always used, no matter which kind of problem you are trying to solve with a neural network. The reason for using non-linearities as activation functions is the following:
Every layer in the network consists of a sequence of linear operations, plus the non-linearity.
Formally - and this is something you might have seen before - you can express the mathematical operation of a single layer F on its input h as:
F(h) = Wh + b
where W is a matrix of weights and b is a bias vector. Layers are applied one after another, so for a simple multi-layer perceptron (with n layers and without non-linearities) we can write the computation as follows:
y = F_n(F_{n-1}(F_{n-2}(...(F_1(x)))))
which is equivalent to
y = W_n W_{n-1} ... W_1 x + W_n ... W_2 b_1 + W_n ... W_3 b_2 + ... + b_n
Every term here consists only of matrix multiplications and additions, which we can pre-compute; in particular, we can aggregate everything into one uber-matrix W_p and one bias b_p, and rewrite it as a single formula:
y = W_p x + b_p
This has the same expressive power as the multi-layer perceptron above, but can inherently be modeled by a single layer (while having far fewer parameters than before)!
Introducing non-linearities to this equation turns the simple "building blocks" F(h) into:
F(h) = g(Wh + b)
Now this collapse of a sequence of layers into one is no longer possible, and the non-linearity additionally allows the network to approximate arbitrary functions.
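To make the collapse concrete, here is a small numerical check (a sketch of my own, using numpy): two stacked linear layers are exactly equivalent to one combined linear layer.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
    W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
    x = rng.normal(size=3)

    # Two layers without a non-linearity...
    y_stacked = W2 @ (W1 @ x + b1) + b2

    # ...collapse into a single layer W_p x + b_p.
    W_p = W2 @ W1
    b_p = W2 @ b1 + b2
    print(np.allclose(y_stacked, W_p @ x + b_p))  # True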
EDIT:
To address another concern of yours ("how does it help?"), I should explicitly mention that not every problem is linearly separable, and such problems cannot be solved by a purely linear network (i.e. one without non-linearities). One classic simple example is the XOR operator.
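As a quick illustration of the XOR point (my own sketch, a random search rather than a proof): no sampled linear decision boundary classifies all four XOR points correctly.

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])  # XOR labels

    rng = np.random.default_rng(0)
    found = False
    for _ in range(100_000):
        w, b = rng.normal(size=2), rng.normal()
        if np.all((X @ w + b > 0) == (y == 1)):
            found = True
            break
    print(found)  # False: no sampled linear boundary solves XOR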

Trying to understand expected value in Linear Regression

I'm having trouble understanding a lecture slide in my school's machine learning course.
Why does the expected value of Y equal f(X)? What does that mean?
My understanding is that X and Y are vectors, and f(X) outputs the vector Y, where each individual value y_i in Y corresponds to f(x_i), with x_i the value in X at index i. But now it's taking the expected value of Y, which is going to be a single value, so how is that equal to f(X)?
X, Y (uppercase) are vectors
x_i,y_i (lowercase with subscript) are scalars at index i in X,Y
There is a lot of confusion here. First let's start with definitions
Definitions
Expectation operator E[.]: takes a random variable as input and gives a scalar/vector as output. Say Y is a normally distributed random variable with mean mu and variance sigma^2 (usually written Y ~ N(mu, sigma^2)); then E[Y] = mu.
Function f(.): takes a scalar/vector (not a random variable) and gives a scalar/vector. In this context it is an affine function, that is, f(X) = a*X + b, where a and b are fixed constants.
What's Going On
Now you can view linear regression from two angles.
Stats View
One angle assumes that your response variable Y is a normally distributed random variable, because
Y = a*X + b + epsilon
where
epsilon ~ N(0, sigma^2)
and X follows some other distribution. We don't really care how X is distributed and treat it as given. In that case the conditional distribution is
Y|X ~ N(a*X + b, sigma^2)
Notice that here a, b, and also the conditioned-on X are numbers; there is no randomness associated with them.
Maths View
The other view is the maths view, where I assume that there is a function f(.) that governs the real-life process: if in real life I observe X, then f(X) should be the output. Of course this is not exactly the case, and the deviations are assumed to be due to various causes such as gauge error. The claim is that this function is linear:
f(X) = a*X + b
Synthesis
Now how do we combine these? By taking the conditional expectation: since E[epsilon] = 0,
E[Y|X] = a*X + b = f(X)
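A quick simulation may make the synthesis concrete (a sketch of my own, with made-up constants): for a fixed X = x, the sample average of Y approaches f(x).

    import numpy as np

    a, b, sigma = 2.0, 1.0, 0.5
    x = 3.0                               # condition on a fixed X = x
    rng = np.random.default_rng(0)
    eps = rng.normal(0.0, sigma, size=1_000_000)
    y = a * x + b + eps                   # draws of Y | X = x

    print(y.mean())     # ~7.0, the sample estimate of E[Y|X=x]
    print(a * x + b)    # 7.0, which is f(x)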
About your question: first, I would like to point out that it should be Y|X, not Y by itself.
Second, there are tons of possible ontological discussions about what each term here represents in real life. X, Y (uppercase) could be vectors; they could also be random variables. A sample of these random variables might be stored in vectors, and both would be represented with uppercase letters (the best way is to use different fonts for each). In this case, your sample becomes your data. Discussions about the general view of the model and its relevance to real life should be held at the random-variable level; discussions about how to infer the parameters and how linear regression algorithms work should be held at the matrix-and-vector level. There could be other discussions where you should care about both.
I hope this somewhat unorganized answer helps you. In general, if you want to learn this material, be sure you know what kinds of mathematical objects and operators you are dealing with, what they take as input, and what their relevance to real life is.

Gradient from non-trainable weights function

I'm trying to implement a self-written loss function. My pipeline is as follows:
x -> {constant computation} = x_feature -> machine learning training -> y_feature -> {constant computation} = y_produced
These "constant computations" are necessary to bring out the differences between the desired output and the produced output.
So if I take the L2 norm of y_produced and y_original, how should I incorporate this loss into the original loss?
Please note that y_produced has a different dimension than y_feature.
As long as you are using differentiable operations, there is no difference between "constant transformations" and "learnable" ones. There is no such distinction; look even at the linear layer of a neural net:
f(x) = sigmoid( W * x + b )
Is it constant or learnable? W and b are trained, but "sigmoid" is not, yet the gradient flows the same way regardless of whether something is a variable or not. In particular, the gradient with respect to x is the same for
g(x) = sigmoid( A * x + c )
where A and c are constants.
The only problem you will encounter is using non-differentiable operations, such as argmax, sorting, indexing, sampling, etc. These operations do not have a well-defined gradient, so you cannot directly use first-order optimisers with them. As long as you stick with differentiable operations, the problem described does not really exist: there is no difference between "constant transformations" and any other transformations, regardless of changes in dimensionality etc.
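Here is a minimal sketch of that point (the framework choice, PyTorch, is my assumption; the question names none): a constant, non-trainable transform passes gradients exactly like a trainable one.

    import torch

    x = torch.randn(3, requires_grad=True)

    A = torch.tensor([[1.0, 2.0, 0.0],
                      [0.0, 1.0, 3.0]])      # constant "feature" transform
    w = torch.randn(2, requires_grad=True)   # learnable parameters

    x_feature = A @ x                          # constant computation
    y_produced = torch.sigmoid(w * x_feature)  # learnable part
    loss = y_produced.pow(2).sum()             # e.g. an L2-style loss

    loss.backward()
    print(x.grad)   # well defined: the constant transform is differentiable
    print(w.grad)   # learnable parameters receive gradients the same way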

Naive Bayes Confusion

I'm working on homework for my machine learning course and am having trouble understanding the question on Naive Bayes. The problem I have is a variation of question number 2 on the following page:
https://www.cs.utexas.edu/~mooney/cs343/hw3-old/hw3.html
The numbers I have are slightly different, so I'll replace the numbers from my assignment with the ones from the example above. I'm currently attempting to figure out the probability that the first text is physics. To do so, I have something that looks a little like this:
P(physics|X) = P(physics) * P(carbon|physics) * P(atom|physics) * P(life|physics) * P(earth|physics) / [SOMETHING]
P(physics|X) = .35 * .005 * .1 * .001 * .005 / [SOMETHING]
I'm basing this off of an example that I've seen in my notes, but I can't seem to figure out what I'm supposed to divide by. I'll provide the example from the notes as well.
Perhaps I'm going about this in the wrong way, but I'm unsure where the P(X) term that we're dividing by is coming from. How does this relate to the probability that the text is physics? I feel that getting this issue resolved will make the remainder of the assignment simple.
The denominator P(X) is just the sum of P(X|Y)*P(Y) for all possible classes.
Now, it's important to note that in Naive Bayes, you do not have to compute this P(X). You only have to compute P(X|Y)*P(Y) for each class, and then select the class that produced the highest probability.
In your case, I assume you must have several classes. You mentioned physics, but there must be others like chemistry or math.
So you can compute:
P(physics|X) = P(X|physics) * P(physics) / P(X)
P(chemistry|X) = P(X|chemistry) * P(chemistry) / P(X)
P(math|X) = P(X|math) * P(math) / P(X)
P(X) is the sum of P(X|Y)*P(Y) for all classes:
P(X) = P(X|physics)*P(physics) + P(X|chemistry)*P(chemistry) + P(X|math)*P(math)
(By the way, the above statement is exactly analogous to the example in the image that you provided. The equations are a bit complicated there, but if you rearrange them, you will find that P(X) = P(X|positive)*P(positive) + P(X|negative)*P(negative) in that example).
To produce the answer (that is, to determine Y among physics, chemistry, or math), you would select the maximum value among P(physics|X), P(chemistry|X), and P(math|X).
As I mentioned, you do not need to compute P(X) because this term exists in the denominator of all of P(physics|X), P(chemistry|X), and P(math|X). Thus, you only need to determine the max among P(X|physics)*P(physics), P(X|chemistry)*P(chemistry), and P(X|math)*P(math).
The point is that you don't really need a value for P(X), because it is the same for all classes. So you can ignore it and just compare the numerators before the division step; the highest number gives the predicted class.
The reason it appears in the equation at all is Bayes' rule:
P(C1|X) = P(X|C1) * P(C1) / P(X)
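For concreteness, a toy numeric sketch (the numbers are made up, not taken from the assignment): the winning class is the same whether or not you divide by P(X).

    # Toy priors P(Y) and likelihoods P(X|Y) for three hypothetical classes.
    priors = {'physics': 0.35, 'chemistry': 0.40, 'math': 0.25}
    likelihoods = {'physics': 0.005 * 0.1, 'chemistry': 0.002 * 0.05, 'math': 0.001 * 0.01}

    unnormalized = {c: likelihoods[c] * priors[c] for c in priors}
    p_x = sum(unnormalized.values())                   # P(X) = sum over classes
    posteriors = {c: v / p_x for c, v in unnormalized.items()}

    print(max(unnormalized, key=unnormalized.get))  # same winner...
    print(max(posteriors, key=posteriors.get))      # ...with or without P(X)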

Z3: Function Expansion and Encoding in QBVF

I am trying to encode formulas with functions in Z3 and I have an encoding problem. Consider the following example:
f(x) = x + 42
g(x1, x2) = f(x1) / f(x2)
h(x1, x2) = g(x1, x2) % g(x2, x1)
k(x1, x2, x3) = h(x1, x2) - h(x2, x3)
sat( k(y1, y2, y3) == 42 && k(y3, y2, y1) == 42 * 2 && ... )
I would like my encoding both to be efficient (no expression duplication) and to let Z3 re-use lemmas about functions across subproblems. Here is what I have tried so far:
(1) Inline the functions for every free-variable instantiation y1, y2, etc. This introduces duplication, and performance is not as good as I hoped for.
(2) Assert the function declarations with universal quantifiers. This works for very specific examples - from the solving times it seems that Z3 can (?) re-use results from previous queries that involve the same functions. However, solving times vary greatly, and in many cases (1) turns out to be faster.
(3) Use function definitions (i.e., quantifiers + the MACRO_FINDER option). If my understanding of the documentation is correct, this should expand the functions and thus should be close to (1). However, in terms of performance the results were a bit surprising (">" means faster):
For problems where (1) > (2) I get: (1) > (3) > (2)
For problems where (2) > (1) I get: (2) > (1) = (3)
I have also tried tweaking the MBQI option (and others) with most of the above, but it is not clear what the best combination is. I am using Z3 4.0.
The question is: what is the "right" way to encode this problem? Note that I only have interpreted functions (I do not really need uninterpreted functions). Could I use this fact for a more efficient encoding and avoid function expansion?
Thanks
I think there's no clear answer to this question. Some techniques work better for one type of benchmark and other techniques work better for others. For the QBVF benchmarks we've looked at so far, we found that macros give us the best combination of small benchmark size and small solving times, but this may not apply in your case.
Your understanding of the documentation is correct, the macro finder will identify quantifiers that look like function definitions and replace all other calls to that function with its definition. It's possible that not all of your macros are picked up or that you are using quasi-macros which aren't detected correctly, either of which could go towards explaining why the performance is sometimes worse than your (1). How much is the difference in the case that (1) > (3)? A little overhead is to be expected, but vast variations in runtime are probably due to some macros being malformed or not being detected.
In general, there is no "right" way to encode these problems. Function expansion cannot always be avoided. The trade-off is essentially between expanding eagerly (1, 3) and doing it lazily (2). It may be that there is a correlation of the type SAT (1, 3 faster) vs UNSAT (2 faster), but this is not guaranteed to be the case.
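For concreteness, here is a minimal z3py sketch of encoding (2) (my own reconstruction, using Int instead of bit-vectors for brevity; with bit-vectors the structure is the same): the functions are declared uninterpreted and their definitions are asserted as universally quantified axioms. As the question observes, Z3 may still be slow or answer unknown on such encodings.

    from z3 import Function, IntSort, Int, ForAll, Solver, And

    I = IntSort()
    f = Function('f', I, I)
    g = Function('g', I, I, I)
    h = Function('h', I, I, I)
    k = Function('k', I, I, I, I)

    x1, x2, x3 = Int('x1'), Int('x2'), Int('x3')
    defs = [
        ForAll([x1], f(x1) == x1 + 42),
        ForAll([x1, x2], g(x1, x2) == f(x1) / f(x2)),
        ForAll([x1, x2], h(x1, x2) == g(x1, x2) % g(x2, x1)),
        ForAll([x1, x2, x3], k(x1, x2, x3) == h(x1, x2) - h(x2, x3)),
    ]

    y1, y2, y3 = Int('y1'), Int('y2'), Int('y3')
    s = Solver()
    s.add(*defs)
    s.add(And(k(y1, y2, y3) == 42, k(y3, y2, y1) == 42 * 2))
    print(s.check())  # may be sat, or unknown if quantifier instantiation struggles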
