Issues with real value calculation in z3

I am working on a problem where I want to learn real parameters based on certain constraints. Following is a snippet of the code:
s = Solver()

def logistic_function(x, y):
    out = y[0]
    for i in range(len(x)):
        out = out + x[i] * y[i + 1]
    return out

self.W = RealVector('w', X.shape[1] + 1)
self.F = RealVector('f', X.shape[0])
for h in hard_instances:
    if y[h] == 1:
        self.model.add(Learner.logistic_function(list(X.iloc[h]), self.W) >= 0)
    else:
        self.model.add(Learner.logistic_function(list(X.iloc[h]), self.W) < 0)
    self.model.add(Learner.logistic_function(list(X.iloc[h]), self.W) == self.F[h])
s.check()
Here {X, y} is a dataset and hard_instances is a set of indices that I am working with. After I get a model from z3, I manually calculate the Learner.logistic_function(list(X.iloc[h]), self.W) value for each index in hard_instances by extracting W, and compare it with the F[h] values. Following are the results for both (with the size of hard_instances being 100):
Confidence values calculated using the z3 output:
[-27.378400928361582, -24.54479132404113, -24.307289651276747, -31.70713297755848, -30.762167609027458, -31.315939646075787, -33.00420507718851, -31.744112911331754, -26.23355531746848, -31.36228488104281, -30.427736819536484, -31.50527359981793, -42.88965873739677, -31.707129367228667, -31.56210015506779, -32.12409972766397, -31.70713297755848, -30.031558658483476, -18.358324137431104, -32.05022247782185, -41.531034659230095, -32.29000919967995, -31.75974435910986, -31.663303581555095, -31.492373296661544, -31.31746775645906, -31.707165983877054, -31.347401145915946, -9013.822052472326, -31.75724273178162, -38.284678733394, -290.3139883637738, -38.55432745005057, 186.6069890256151, -44.1131569461781, -3965.3468548458463, -30.19582424657528, -31.81069063619864, -30.619869067329255, -31.58128167212442, -31.822174319008383, -37.18356870531899, -33.442884835165096, -51.320302912234084, -267.50833857889654, -28.402318357266232, -31.62745176425138, 416.1353823972281, -40.42543492186646, -28.541400567435975, -94.80187721209138, -32.013861248574415, -29.42849153859601, -32.14935341971468, -31.20975052479889, 34.2925702396417, -52.91711293409269, -31.772331866138927, -28.05140296433753, -36.58557486365847, -31.83338866474074, -36.5299223283415, -31.327926505869392, -199.5517369747143, -32.08369384912356, -32.07316618164427, -98.62741827618949, -1003.470954079502, -31.240876251803435, 456.34073138747084, -64.27303567782826, -49.714357622299886, -31.905532688175438, 15.611397869700923, 518.5055614575923, -30.65519656405803, -72.45941570859743, -31.967928531880276, -30.55418177994407, -31.225101988980224, -395.9939788509901, -53.142500004465916, -29.61894900393206, -31.756397212326476, -32.51642103656665, -31.12483808710671, -30.768528286960102, -765.2299421009044, 240.09127856915606, 47.958346445463505, -30.42562757004379, -34.02946877293487, 245.48085838630791, -53.48190520867068, -11.398510468740053, -27.119576978335733, -1472.8001539856432, -7.909727924310422, -31.984175109074794, -1246.733548266756]
Values of the F variable:
['6.63397?', '48.31824?', '9.84546?', '-0.5', '0', '0', '-0.5', '-5.38594?', '-0.5', '0', '2.67479?', '-3.55787?', '-10.59216?', '-0.5', '-2.87871?', '-0.5', '-0.5', '-0.5', '21.51906?', '-0.84313?', '0', '-0.5', '-0.5', '-0.5', '-3.33203?', '0', '-0.5', '0', '-8980.16307?', '-0.5', '-7.06012?', '-248.88109?', '-6.90423?', '-0.5', '-12.90605?', '-3945.30152?', '-0.5', '0', '2.09643?', '0.57093?', '-0.5', '0', '-0.5', '-0.5', '-227.57308?', '0', '0', '490.58834?', '-0.5', '0', '-20.46458?', '-0.5', '-0.5', '-0.5', '0', '75.43883?', '-0.5', '-0.5', '3.39666?', '-0.5', '-0.5', '0', '0', '-91.32681?', '-1.13005?', '0', '-41.40830?', '-972.60267?', '0', '-0.5', '0', '0', '-0.5', '43.98384?', '570.67543?', '-0.5', '-41.06592?', '0', '8.24429?', '-3.43914?', '-377.45813?', '-21.85040?', '0', '-11.83317?', '-4.00421?', '-0.5', '0.53349?', '-691.19144?', '264.52677?', '68.84287?', '9.83243?', '3.52497?', '269.78794?', '-21.63869?', '40.52079?', '6.89110?', '-1451.54107?', '26.21240?', '0', '-1190.68525?']
There is a huge difference between the values calculated within z3 (represented by F) and the values I calculate based on the solution I get from z3. These values should be the same.
Also, hard_instances are random samples from the full dataset. This kind of discrepancy happens only with some samples; for a lot of the samples, the value I calculate and the value I get from z3 are the same. There is also no discrepancy if I use an integer solver and learn integer parameters instead of real ones.

Are you saying z3 is giving you an incorrect model, or are you saying it doesn't match what you calculated using other means? It's hard to understand from your post.
Note that z3 will give you a solution that satisfies the constraints. Are you sure the solution is unique? If the constraints admit multiple solutions, you might be getting a totally valid model that is simply not the one you expected. This can happen if you under-constrain the system, for instance.
Also keep in mind that your calculations outside of z3 are probably done using floating-point numbers, and computational errors might be creeping in. Z3's Reals are algebraic reals, i.e., arithmetic on them is exact. With floating point, you might be getting results that differ. (Though unless there are huge instabilities in the problem, the differences shouldn't be that large, especially for small numbers.)
If you're saying z3's model does not satisfy the constraints, then that would be a bug that should be reported, of course. If you suspect that is the case, then please post an MCVE: https://stackoverflow.com/help/minimal-reproducible-example Sharing code is nice, but if we can't load/run it ourselves, it doesn't really help all that much.
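A side note on checking this yourself: the trailing ? in your printed F values is z3 marking a truncated decimal approximation of an exact value, so comparing those strings against Python floats is already lossy. One way to rule out round-off on your side is to re-evaluate the expression inside z3's exact arithmetic. A minimal sketch, assuming the z3 Python API (the names w0, w1, x1 are hypothetical stand-ins for your W and feature values):

from z3 import Solver, Real, RealVal, sat

s = Solver()
w0, w1 = Real('w0'), Real('w1')
x1 = RealVal('0.1')  # exact rational 1/10, not a binary float

s.add(w0 + x1 * w1 >= 0)
if s.check() == sat:
    m = s.model()
    # evaluate the constraint expression with z3's own exact arithmetic
    exact = m.eval(w0 + x1 * w1, model_completion=True)
    print(exact)                  # exact rational/algebraic value
    print(exact.as_decimal(20))   # decimal approximation; a trailing '?' marks truncation

If this exact evaluation matches F[h] but your external computation does not, the discrepancy is in the float conversion on your side rather than in z3's model.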

Related

How to specify the positive class manually before fitting Sklearn estimators and transformers

I am trying to predict credit card approvals using the relevant dataset from the UCI ML Repo. The problem is that the target encodes the applications for credit cards as '+' for approved and '-' for rejected.
As there are slightly more rejected applications in the target, all scorers and estimators treat the rejected class as positive, while it should be the other way around. Because of this, my confusion matrix is all messed up: I think the True Positives and True Negatives, and the False Positives and False Negatives, get inverted.
How can I specify the positive class manually?
I do not know of scikit-learn estimators or transformers that let you flip positive and negative class identifiers as a parameter. But I can think of two ways to work around this:
Method 1: You transform the array labels yourself before fitting the estimator
That can be easily achieved for numpy arrays:
import numpy as np

y = np.array(['+', '+', '+', '-', '-'])
y_transformed = [1 if i == '+' else 0 for i in y]
and also pandas Series objects:
import pandas as pd

y = pd.Series(['+', '+', '+', '-', '-'])
y_transformed = y.map({'+': 1, '-': 0})
In both cases the result is [1, 1, 1, 0, 0] (a plain list in the first case, a Series in the second).
Method 2: You define the labels parameter in confusion_matrix
scikit-learn's confusion_matrix has a labels parameter that lets you reorder the labels. Use it like this:
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0])
print(confusion_matrix(y_true, y_pred))
# output
[[2 0]
 [1 2]]
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
# output
[[2 1]
 [0 2]]
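One more option worth noting as an addition to the answer above: scikit-learn's metric functions (as opposed to estimators) accept a pos_label argument, so for scoring you may not need to re-encode the labels at all. A small sketch, with made-up predictions:

import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array(['+', '+', '+', '-', '-'])
y_pred = np.array(['+', '-', '+', '-', '-'])

# treat '+' as the positive class without re-encoding the labels
print(precision_score(y_true, y_pred, pos_label='+'))  # 1.0
print(recall_score(y_true, y_pred, pos_label='+'))     # 0.666...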

Dask Equivalent of pd.to_numeric

I am trying to read multiple CSV files, each around 15 GB, using dask's read_csv. While performing this task, dask interprets a particular column as float; however, the column has a few values of string type, and dask later fails when I try to perform some operation, stating it cannot convert string to float. Hence I used the dtype=str argument to read all the columns as strings. Now I want to convert that particular column to numeric with errors='coerce', so that the records containing strings are converted to NaN values and the rest are converted to float correctly. Can you please advise how this can be achieved using dask?
Have already tried: astype conversion
import dask.dataframe as dd

df = dd.read_csv("./*.csv", encoding='utf8',
                 assume_missing=True,
                 usecols=col_names.values.tolist(),
                 dtype=str)
df["mycol"] = df["mycol"].astype(float)
search_df = df.query('mycol > 0').compute()
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+-----------------------------------+--------+----------+
| Column | Found | Expected |
+-----------------------------------+--------+----------+
| mycol | object | float64 |
+-----------------------------------+--------+----------+
The following columns also raised exceptions on conversion:
- mycol
ValueError("could not convert string to float: 'cliqz.com/tracking'")
# Reproducible example
import dask.dataframe as dd

df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True)
df.dtypes  # the count column appears as float, but it has a couple of dirty string values
search_df = df.query('count > 0').compute()  # this line raises the type conversion error
# Edit with one possible solution, but is this optimal while using dask?
import dask.dataframe as dd
import pandas as pd

to_n = lambda x: pd.to_numeric(x, errors="coerce")
df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True,
                 converters={"count": to_n})
df.dtypes
search_df = df.query('count > 0').compute()
I had a similar problem and I solved it using .where.
import numpy as np
import pandas
import dask.dataframe as ddf

p = ddf.from_pandas(pandas.Series(["1", "2", np.nan, "3", "4"]), 1)
p.where(~p.isna(), 999).astype("u4")
Or perhaps replacing the second line with:
p.where(p.str.isnumeric(), 999).astype("u4")
In my case my DataFrame (or Series) was the result of other operations, so I couldn't apply it directly to read_csv.
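For that result-of-other-operations case, another route is to push pandas' coercing conversion through map_partitions. A minimal sketch under the same assumptions (the column name and sample values here are made up):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"mycol": ["1", "2", "cliqz.com/tracking", "4"]})
ddf = dd.from_pandas(pdf, npartitions=2)

# apply pandas' to_numeric partition by partition; dirty strings become NaN
ddf["mycol"] = ddf["mycol"].map_partitions(
    lambda s: pd.to_numeric(s, errors="coerce"),
    meta=("mycol", "f8"),
)
print(ddf.compute())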
As of March 2020, dask.dataframe.to_numeric() has been implemented; see the Dask documentation for details.
Here's a minimal example:
import pandas as pd
import dask.dataframe as dd
# create dask dataframe with dummy data incl. number as string
data = {'A': '1', 'B': 2, 'C': 3}
df = pd.DataFrame([data])
ddf = dd.from_pandas(df, npartitions=3)
# inspect dtypes
ddf.dtypes
> A object
> B int64
> C int64
> dtype: object
# apply to_numeric method
ddf.A = dd.to_numeric(ddf.A)
# verify dtypes
ddf.dtypes
> A int64
> B int64
> C int64
> dtype: object
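Since the original question was specifically about errors='coerce': dd.to_numeric forwards that keyword to pandas as well. A short sketch with made-up values:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"count": ["10", "20", "not-a-number"]})
ddf = dd.from_pandas(pdf, npartitions=2)

# strings that cannot be parsed become NaN instead of raising
ddf["count"] = dd.to_numeric(ddf["count"], errors="coerce")
print(ddf.compute())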

How to "remember" categorical encodings for actual predictions after training?

Suppose I wanted to train a machine learning algorithm on some dataset that includes categorical parameters. (I'm new to machine learning, but my thinking is as follows.) Even if I converted all the categorical data to 1-hot-encoded vectors, how will this encoding map be "remembered" after training?
E.g., converting the initial dataset to use 1-hot encoding before training: say the universe of categories for some column c is {"good", "bad", "ok"}, so rows are converted as
[1, 2, "good"] ---> [1, 2, [1, 0, 0]],
[3, 4, "bad"] ---> [3, 4, [0, 1, 0]],
...
Then, after training the model, all future prediction inputs would need to use the same encoding scheme for column c.
How then, during future predictions, will data inputs remember that mapping (where "good" maps to index 0, etc.), specifically when planning on using a Keras RNN or LSTM model? Do I need to save it somewhere (e.g., with Python pickle), and if so, how do I get the explicit mapping? Or is there a way to have the model handle categorical inputs internally, so that I can just feed in the original label data during training and future use?
If anything in this question shows any serious confusion on my part, please let me know (again, I'm very new to ML).
** I wasn't sure if this belongs on https://stats.stackexchange.com/, but I posted it here since I specifically wanted to know how to deal with the actual code implementation of this problem.
What I've been doing is the following:
After you use StringIndexer.fit(), you can save its metadata (which includes the actual encoder mapping, like "good" being the first column).
Here is the code I use (Java, but it can be adapted to Python):
StringIndexerModel sim = new StringIndexer()
        .setInputCol(field)
        .setOutputCol(field + "_INDEX")
        .setHandleInvalid("skip")
        .fit(dataset);
sim.write().overwrite().save("IndexMappingModels/" + field + "_INDEX");
and later, when trying to make predictions on a new dataset, you can load the stored metadata:
StringIndexerModel sim = StringIndexerModel.load("IndexMappingModels/" + field + "_INDEX");
dataset = sim.transform(dataset);
I imagine you have already solved this issue, since it was posted in 2018, but I've not found this solution anywhere else, so I believe it's worth sharing.
My thought would be to do something like this on the training/testing dataset D (using a mix of Python and plain pseudo-code):
Do something like

# Before: D.schema == {num_col_1: int, cat_col_1: str, cat_col_2: str, ...}
# assign a unique index to each distinct label of a categorical column and store it in a new column
# http://spark.apache.org/docs/latest/ml-features.html#stringindexer
label_indexer = StringIndexer(inputCol="cat_col_i", outputCol="cat_col_i_index").fit(D)
D = label_indexer.transform(D)
# After: D.schema == {num_col_1: int, cat_col_1: str, cat_col_2: str, ..., cat_col_1_index: int, cat_col_2_index: int, ...}
for all the categorical columns
Then, for all of these categorical name and index columns in D, make a map of the form

map = {}
for all categorical column names colname in D:
    map[colname] = []
    # create the mapping list of all categorical values for the column
    # see https://spark.apache.org/docs/latest/sql-programming-guide.html#untyped-dataset-operations-aka-dataframe-operations
    for all rows r in D.select(colname, '%s_index' % colname).drop_duplicates():
        enc_from = r['%s' % colname]
        enc_to = r['%s_index' % colname]
        map[colname].append((enc_from, enc_to))
    # for categories that may appear later but have yet to be seen
    # (IDK if this is best practice; there may be another way, see https://medium.com/@vaibhavshukla182/how-to-solve-mismatch-in-train-and-test-set-after-categorical-encoding-8320ed03552f)
    map[colname].append(('NOVEL_CAT', len(map[colname])))
    # sort by index encoding
    map[colname].sort(key=lambda pair: pair[1])
to end up with something like
{
    'cat_col_1': [('orig_label_11', 0), ('orig_label_12', 1), ...],
    'cat_col_2': [(), (), ...],
    ...
    'cat_col_n': [('orig_label_n1', 0), ...]
}
which can then be used to generate 1-hot-encoded vectors for each categorical column in any later data sample row ds. E.g.:

for all categorical column names colname in ds:
    enc_from = ds[colname]
    # make a zero vector for the category's 1-hot encoding
    col_onehot = np.zeros(len(map[colname]))
    for label, index in map[colname]:
        if label == enc_from:
            col_onehot[index] = 1
            # make a new column in the sample for the 1-hot vector
            ds['%s_onehot' % colname] = col_onehot
            break
You can then save this structure as a pickle, e.g. pickle.dump(map, open("cats_map.pkl", "wb")), to compare against categorical column values when making actual predictions later.
** There may be a better way, but I think I would need to better understand this article first (https://medium.com/@satnalikamayank12/on-learning-embeddings-for-categorical-data-using-keras-165ff2773fc9). I will update this answer if I learn anything.
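Since the question mentions pickling and a Keras workflow rather than Spark, it may help to see the same "fit once, persist the mapping, reuse at prediction time" idea outside Spark. A minimal sketch using scikit-learn's OneHotEncoder (my own addition, not part of the answers above; the file name is made up):

import pickle
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# fit the encoder on the training categories once
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(np.array([['good'], ['bad'], ['ok']]))
print(enc.categories_)  # the learned label -> column mapping

# persist the fitted mapping next to the model
with open('cat_encoder.pkl', 'wb') as f:
    pickle.dump(enc, f)

# at prediction time, load and reuse the exact same mapping
with open('cat_encoder.pkl', 'rb') as f:
    enc2 = pickle.load(f)
print(enc2.transform(np.array([['good'], ['never-seen']])).toarray())
# 'never-seen' encodes as all zeros because of handle_unknown='ignore'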

Out of memory error for convolution using Theano

I am doing a convolution in Theano:
theano.tensor.nnet.conv.conv2d(x,h, border_mode='full')
and it runs out of memory. I get the following message:
RuntimeError: GpuCorrMM failed to allocate working memory of 3591 x 319086
Apply node that caused the error: GpuCorrMM_gradInputs{valid, (1, 1)}(GpuContiguous.0, GpuContiguous.0)
Inputs types: [CudaNdarrayType(float32, (True, False, True, False)), CudaNdarrayType(float32, (False, True, False, False))]
Inputs shapes: [(1, 513, 1, 7), (1, 1, 513, 622)]
Inputs strides: [(0, 7, 0, 1), (0, 0, 622, 1)]
Inputs values: ['not shown', 'not shown']
I have tried setting the Theano flag optimizer_excluding=conv_dnn, but it still didn't work. Is there any way around this?
You are trying to allocate a working buffer of 3591 x 319086 float32 values, which comes to roughly 4.3 GB, more than your GPU can provide. The only fixes I know of for such issues are to decrease the size of the problem (e.g. the number of units) or to buy more memory. Loads of memory :)
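For concreteness, here is the back-of-the-envelope arithmetic behind that estimate, assuming the buffer holds float32 values (4 bytes each):

# size of the working buffer named in the error message
elements = 3591 * 319086      # 1,145,837,826 elements
bytes_needed = elements * 4   # float32 is 4 bytes
print(bytes_needed / 2**30)   # ~4.27 GiB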
For me, I disabled g++ at runtime by simply removing the (MinGW) bin directory from the PATH variable. Processing is slow, but it completes.
My program execution environment: Windows Vista 32-bit, 2.16 GHz Intel CPU, 4.00 GB RAM, and no GPU.

How to check if two Torch tensors or matrices are equal?

I need a Torch command that checks whether two tensors have the same content, and returns TRUE if they do.
For example:
local tens_a = torch.Tensor({9,8,7,6});
local tens_b = torch.Tensor({9,8,7,6});
if (tens_a EQUIVALENCE_COMMAND tens_b) then ... end
What should I use in this script instead of EQUIVALENCE_COMMAND ?
I tried simply using ==, but it does not work.
torch.eq(a, b)
eq() implements the == operator, comparing each element in a with b (if b is a value) or with its corresponding element in b (if b is a tensor).
An alternative, from @deltheil:
torch.all(tens_a.eq(tens_b))
The solution below worked for me:
torch.equal(tensorA, tensorB)
From the documentation:
True if two tensors have the same size and elements, False otherwise.
To compare tensors you can do it element-wise with torch.eq:
torch.eq(torch.tensor([[1., 2.], [3., 4.]]), torch.tensor([[1., 1.], [4., 4.]]))
tensor([[True, False], [False, True]])
Or torch.equal for the whole tensor exactly:
torch.equal(torch.tensor([[1., 2.], [3, 4.]]), torch.tensor([[1., 1.], [4., 4.]]))
# False
torch.equal(torch.tensor([[1., 2.], [3., 4.]]), torch.tensor([[1., 2.], [3., 4.]]))
# True
But then you may be lost, because at some point there are small differences you would like to ignore. For instance, the floats 1.0 and 1.0000000001 are pretty close, and you may want to consider them equal. For that kind of comparison you have torch.allclose.
torch.allclose(torch.tensor([[1., 2.], [3., 4.]]), torch.tensor([[1., 2.000000001], [3., 4.]]))
# True
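If the defaults are too tight or too loose for your data, allclose also takes explicit tolerances (rtol=1e-05 and atol=1e-08 are the documented defaults):

import torch

a = torch.tensor([1.0, 2.0])
b = torch.tensor([1.0, 2.0001])
print(torch.allclose(a, b))             # False with the default tolerances
print(torch.allclose(a, b, atol=1e-3))  # True with a looser absolute tolerance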
It may also be important to check, element-wise, how many elements are equal relative to the total number of elements. If you have two tensors dt1 and dt2, you get the number of elements of dt1 with dt1.nelement().
And with this formula you get the percentage:
print(torch.sum(torch.eq(dt1, dt2)).item()/dt1.nelement())
Try this if you want to ignore small precision differences, which are common for floats:
torch.all(torch.lt(torch.abs(torch.add(tens_a, -tens_b)), 1e-12))
You can convert the two tensors to numpy arrays (PyTorch syntax here; the original snippet mixed Lua's local with Python's .numpy()):

import numpy as np
import torch

tens_a = torch.Tensor([9, 8, 7, 6])
tens_b = torch.Tensor([9, 8, 7, 6])
a = tens_a.numpy()
b = tens_b.numpy()

and then something like

np.sum(a == b)
4

would give you a fairly good idea of how equal they are.
