Recursion error when using the aima package - machine-learning

I was trying to create a FOL system to prove that
"The house owned by Jill has mould, so Jill has breathing problems." I am getting a maximum recursion error for this code.
# Import libraries
import aima.utils
import aima.logic

# The main entry point for this module
def main():
    """All houses have issues; structural damage is dangerous to live in.
    Houses(x) ^ structure_issue(y) ^ owner(z) ==> breathing_problem(x)"""
    # Create a list to hold clauses
    clauses = []
    # Add first-order logic clauses (rules and facts)
    clauses.append(aima.utils.expr("(House(x) & issue(y) & Owns(z,x) & structure_issue(x,y,z)) ==> breathing_problem(z)"))
    clauses.append(aima.utils.expr("breathing_problem(Jill)"))
    clauses.append(aima.utils.expr("Owns(Jill, H1)"))
    clauses.append(aima.utils.expr("Building(H1)"))
    clauses.append(aima.utils.expr("Building(H1) & Owns(Jill, H1) ==> structure_issue(H1,y,Jill)"))
    clauses.append(aima.utils.expr("issue_1(mould)"))
    clauses.append(aima.utils.expr("issue_1(x) ==> issue(x)"))
    # Create a first-order logic knowledge base (KB) with clauses
    KB = aima.logic.FolKB(clauses)
    print(clauses)
    # Add rules and facts with tell
    KB.tell(aima.utils.expr('structure_issue(H1,Floor,Jill)'))
    KB.tell(aima.utils.expr('structure_issue(H1,plumbing,Jill)'))
    KB.tell(aima.utils.expr("structure_issue(H1,x,Jill) ==> breathing_problem(x)"))
    print(KB)
    # Get information from the knowledge base with ask
    breathing_problem = KB.ask(aima.utils.expr('breathing_problem(x)'))
    # Print answers
    print('breathing_problem?')
    print(breathing_problem)

# Tell Python to run the main method
if __name__ == "__main__":
    main()

Downloading a file in a DoFn

It's unclear whether it's safe to download files within a DoFn.
My DoFn will download a ~20MB file (an ML model) to apply to elements in my pipeline. According to the Beam docs, requirements include serializability and thread-compatibility.
An example (1, 2) is very similar to my DoFn. It demonstrates downloading from a GCP storage bucket (as I'm doing w/ DataflowRunner), but I'm not sure this approach is safe.
Should objects be downloaded to an in-memory bytes buffer instead of to disk, or is there another best practice for this use case? I haven't come across a best-practice approach to this pattern yet.
Adding on to this answer.
If your model data is static, then you can use the code example below to pass your model as a side input.
# (Imports added for completeness; p is assumed to be an existing beam.Pipeline.)
import logging
import dill
import apache_beam as beam

# DoFn to open the model from a GCS location
class get_model(beam.DoFn):
    def process(self, element):
        from apache_beam.io.gcp import gcsio
        logging.info('reading model from GCS')
        gcs = gcsio.GcsIO()
        yield gcs.open(element)

# Pipeline branch to load the pickled model file from the GCS bucket
model_step = (p
              | 'start' >> beam.Create(['gs://somebucket/model'])
              | 'load_model' >> beam.ParDo(get_model())
              | 'unpickle_model' >> beam.Map(lambda bin: dill.load(bin)))

# DoFn to predict the results
class predict(beam.DoFn):
    def process(self, element, model):
        (features, clients) = element
        result = model.predict_proba(features)[:, 1]
        return [(clients, result)]

# Main pipeline: get the input and predict the results
_ = (p
     | 'get_input' >>        # get input based on source and preprocess it
     | 'predict_sk_model' >> beam.ParDo(predict(), beam.pvalue.AsSingleton(model_step))
     | 'write' >>            # write output based on target
     )
For a streaming pipeline, if you want to reload the model after a predefined interval, you can check the "Slowly-changing lookup cache" pattern here.
If it is a scikit-learn model, then you can look at hosting it in Cloud ML Engine and exposing it as a REST endpoint. You can then use something like BagState to optimize invocations of the model over the network. More details can be found at https://beam.apache.org/blog/2017/08/28/timely-processing.html
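As a side note (this is not part of the answers above, and the class and path names are purely illustrative), another commonly used pattern is to download and deserialize the model once per DoFn instance in setup(), so every element processed by that instance reuses the same in-memory model. A rough sketch, assuming a dill-pickled scikit-learn model stored in GCS:

import logging
import dill
import apache_beam as beam
from apache_beam.io.gcp import gcsio

class PredictWithLocalModel(beam.DoFn):
    """Hypothetical DoFn that loads the model once per instance in setup()."""
    def __init__(self, model_path):
        self.model_path = model_path   # a plain string, so the DoFn stays serializable
        self.model = None

    def setup(self):
        # Runs once per DoFn instance, so the ~20MB download is not repeated per element.
        logging.info('loading model from %s', self.model_path)
        with gcsio.GcsIO().open(self.model_path) as f:
            self.model = dill.load(f)

    def process(self, element):
        features, clients = element
        yield clients, self.model.predict_proba(features)[:, 1]

# Usage sketch: ... | beam.ParDo(PredictWithLocalModel('gs://somebucket/model')) | ...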

Delayed dask.dataframe.DataFrame.to_hdf computations crashing

I'm using Dask to execute the following logic:
read in a master delayed dd.DataFrame from multiple input files (one pd.DataFrame per file)
perform multiple query calls on the master delayed DataFrame
use DataFrame.to_hdf to save all dataframes from the DataFrame.query calls.
If I use compute=False in my to_hdf calls and feed the list of Delayeds returned by each to_hdf call to dask.compute then I get a crash/seg fault. (If I omit compute=False everything runs fine). Some googling gave me some information about locks; I tried adding a dask.distributed.Client with a dask.distributed.Lock fed to to_hdf, as well as a dask.utils.SerializableLock, but I couldn't solve the crash.
Here's the flow:
import uproot
import dask
import dask.dataframe as dd
from dask.delayed import delayed

def delayed_frame(files, tree_name):
    """create master delayed DataFrame from multiple files"""
    @delayed
    def single_frame(file_name, tree_name):
        """read external file, convert to pandas.DataFrame, return it"""
        tree = uproot.open(file_name).get(tree_name)
        return tree.pandas.df()  ## this is the pd.DataFrame
    return dd.from_delayed([single_frame(f, tree_name) for f in files])

def save_selected_frames(df, selections, prefix):
    """perform queries on a delayed DataFrame and save HDF5 output"""
    queries = {sel_name: df.query(sel_query)
               for sel_name, sel_query in selections.items()}
    computes = []
    for dfname, df in queries.items():
        outname = f"{prefix}_{dfname}.h5"
        computes.append(df.to_hdf(outname, f"/{prefix}", compute=False))
    dask.compute(*computes)

selections = {"s1": "(A == True) & (N > 1)",
              "s2": "(B == True) & (N > 2)",
              "s3": "(C == True) & (N > 3)"}

from glob import glob
df = delayed_frame(glob("/path/to/files/*.root"), "selected")
save_selected_frames(df, selections, "selected")

## expect output files:
## - selected_s1.h5
## - selected_s2.h5
## - selected_s3.h5
Maybe the HDF library that you're using isn't entirely thread-safe? If you don't mind losing parallelism, then you could add scheduler="single-threaded" to the compute call.
You might want to consider using Parquet rather than HDF. It has fewer issues like this.
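For illustration, here is a rough sketch of both suggestions, written against the computes/queries/prefix names from the save_selected_frames function in the question (the Parquet variant assumes a Parquet engine such as pyarrow is installed):

# Suggestion 1: keep HDF5 but run the final compute single-threaded,
# serialising the writes at the cost of parallelism.
dask.compute(*computes, scheduler="single-threaded")

# Suggestion 2: write Parquet instead of HDF5 (one dataset per query).
computes = [q.to_parquet(f"{prefix}_{name}.parquet", compute=False)
            for name, q in queries.items()]
dask.compute(*computes)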

Check if PCollection is empty - Apache Beam

Is there any way to check if a PCollection is empty?
I haven't found anything relevant in the documentation of Dataflow and Apache Beam.
You didn't specify which SDK you're using, so I assumed Python. The code is easily portable to Java.
You can apply a global count of the elements and then map the numeric value to a boolean with a simple comparison. You can then side-input this value using the pvalue.AsSingleton function, like this:
import apache_beam as beam
from apache_beam import pvalue

is_empty_check = (your_pcollection
                  | "Count" >> beam.combiners.Count.Globally()
                  | "Is empty?" >> beam.Map(lambda n: n == 0)
                  )

another_pipeline_branch = (
    p
    | beam.Map(do_something, is_empty=pvalue.AsSingleton(is_empty_check))
)
Usage of the side input looks like this:
def do_something(element, is_empty):
    if is_empty:
        # yes
        ...
    else:
        # no
        ...
There is no way to check the size of a PCollection without applying a PTransform to it (such as Count.globally() or Combine.combineFn()), because a PCollection is not like a typical Collection in the Java SDK.
It is an abstraction of a bounded or unbounded collection of data, where data is fed into the collection for an operation to be applied to it (e.g. a PTransform). It is also parallelized (as the P at the beginning of the class name suggests).
Therefore you need a mechanism to get counts of elements from each worker/node and combine them into a single value. Whether that value is 0 or n cannot be known until the end of that transformation.
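For completeness, here is a self-contained sketch (the pipeline and its elements are made up) that ties the snippets above together and runs on the DirectRunner:

import apache_beam as beam
from apache_beam import pvalue

def do_something(element, is_empty):
    # Tag each element with whether the inspected PCollection was empty.
    return ('empty' if is_empty else 'non-empty', element)

with beam.Pipeline() as p:
    your_pcollection = p | 'Source' >> beam.Create([1, 2, 3])

    is_empty_check = (your_pcollection
                      | 'Count' >> beam.combiners.Count.Globally()
                      | 'Is empty?' >> beam.Map(lambda n: n == 0))

    _ = (your_pcollection
         | 'Tag' >> beam.Map(do_something,
                             is_empty=pvalue.AsSingleton(is_empty_check))
         | 'Print' >> beam.Map(print))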

Z3-solver throws 'model is not available' exception on python 3

For solving a SAT problem I decided to use the Z3 solver from Microsoft with Python 3. The aim is to take a long model (up to 500,000 features) and find all possible solutions. To find them, I want to add the first solution S1 to the initial equation so that S1 is excluded, and so forth; I will do this with a while-loop.
Solving a SAT problem is important for me because I want to analyse feature models.
But I'm facing a problem when adding something to the initial equation. Here is a minimal example:
# Import statements
import sys
sys.path.insert(0, '/.../z3/bin')
from z3 import *  # https://github.com/Z3Prover/z3/wiki

def main():
    '''
    Solves, after transformation, a given boolean equation by using the Z3 solver from Microsoft.
    '''
    fd = dict()
    fd['_r'] = Bool('_r')
    fd['_r_1'] = Bool('_r_1')
    equation = '''And(fd.get("_r"),Or(Not(fd.get("_r")),fd.get("_r_1")))'''
    # Solve the equation
    s = Solver()
    exec('s.add(' + equation + ')')
    s.check()
    print(s.model())
    ###################################
    # Successful until here.
    ###################################
    s.add(Or(fd['_r'] != bool(s.model()[fd.get('_r')])))
    # s.add(Or(fd['_r'] != False))
    s.check()
    print(s.model())

if __name__ == '__main__':
    main()
The first line of code after # Successful... throws a z3types.Z3Exception: model is not available error. So I tried the commented-out line above it, which simply compares against False instead of the model value, and that works just fine.
I'm stuck here. I believe the error is easy to solve, but I don't see the solution. Do any of you? Thanks!
Models become available only after s.check() returns 'sat'. The model maps Boolean propositions to {True, False} and generally maps constants and functions to fixed values. The requirement is that the model provides an interpretation that satisfies the formula that has been added to the solver 's'. We don't know whether the solver state is satisfiable before we have called 's.check()'.
Suppose you want to say:
s.add(Or(fd['_r'] != bool(s.model()[fd.get('_r')])))
meaning that the model satisfying the constraint should have the property that if '_r' is true under the model, then fd['_r'] != True, and if '_r' is false under the model, then fd['_r'] != False. This is equivalent to saying fd['_r'] != '_r'. So it isn't really necessary to access the value of '_r' in whatever model may evaluate '_r' in order to say something about its evaluation.
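To connect this back to the original goal of enumerating all solutions, here is a rough sketch (not from the original post) of the usual all-solutions loop: only read the model after check() has returned sat, and block the current assignment before the next check():

from z3 import Bool, Solver, And, Or, Not, sat

r   = Bool('_r')
r_1 = Bool('_r_1')

s = Solver()
s.add(And(r, Or(Not(r), r_1)))

while s.check() == sat:              # a model exists only after a 'sat' result
    m = s.model()
    print(m)
    # Block exactly this assignment so the next check() must find a new one.
    s.add(Or([d() != m[d] for d in m.decls()]))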

Parallel query of SQLite database in R

I have a large database (~100 GB) from which I need to pull every entry,
perform some comparisons on it, and then store the results of those comparisons. I have attempted to run parallel queries within a single R session without any success. I could just run multiple R sessions all at once, but I am looking for a better approach. Here is what I attempted:
library(RSQLite)
library(data.table)
library(foreach)
library(doMC)

#---------
# SETUP
#---------
# connect to db
db <- dbConnect(SQLite(), dbname = "genes_drug_combos.sqlite")

#---------
# QUERY
#---------
# 856086 combos = 1309 * 109 * 6
registerDoMC(8)
# I would run 6 separate R sessions (one for each i)
res_list <- foreach(i = 1:6) %dopar% {
  a  <- i * 109 - 108
  b  <- i * 109
  pb <- txtProgressBar(min = a, max = b, style = 3)
  res <- list()
  for (j in a:b) {
    # get preds for drug combos
    statement   <- paste("SELECT * from combo_tstats WHERE rowid BETWEEN", (j * 1309) - 1308, "AND", j * 1309)
    combo_preds <- dbGetQuery(db, statement)
    # here I do some stuff to the result returned from the query
    combo_names <- combo_preds$drug_combo
    combo_preds <- as.data.frame(t(combo_preds[, -1]))
    colnames(combo_preds) <- combo_names
    # get top drug combos
    top_combos <- get_top_drugs(query_genes, drug_info = combo_preds, es = T)
    # update progress and store result
    setTxtProgressBar(pb, j)
    res[[length(res) + 1]] <- top_combos
  }
  # bind results together
  res <- rbindlist(res)
}
I don't get any errors, but only one core spins up. In contrast, if I run multiple R sessions, all my cores are busy. What am I doing wrong?
Some things I have learned while concurrently accessing the same SQLite database file with RSQLite:
1. Make sure each worker has its own DB connection.
parallel::clusterEvalQ(cl = cl, {
  db.conn <- RSQLite::dbConnect(RSQLite::SQLite(), "./export/models.sqlite");
  RSQLite::dbClearResult(RSQLite::dbSendQuery(db.conn, "PRAGMA busy_timeout=5000;"));
})
2. Use PRAGMA busy_timeout=5000;
By default this is set to 0, and chances are that you will end up with a "database is locked" error each time a worker tries to write to the DB while it is locked. The previous code sets this PRAGMA in each worker connection. Note that SELECT operations are never locked, only INSERT/DELETE/UPDATE.
3. Use PRAGMA journal_mode=WAL;
This only has to be set once and stays on permanently by default. It will add two (more or less permanent) files alongside the DB file. It improves concurrent read/write performance. Read more here.
With the above settings I have not experienced this issue.
