SHAP package preserve indices and group by - machine-learning

I'm using the shap Python package. I have a dataframe with index columns which I pass to the explain function, but when I get the results I lose the index information, which I need for grouping and aggregating the SHAP values.
indexed_features = features.set_index(['col1', 'col2'])
explain = shap.TreeExplainer(model.predictor, indexed_features)
explanation = explain(indexed_features)
The result I get is an Explanation object with .data, .base_values and .values as NumPy ndarrays. I still want the index information on those results so that I can group or filter by it.
Is there a way to do this?
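For reference, the manual workaround I have in mind is roughly the following (a rough sketch, assuming a single-output model so that explanation.values is 2-D):
import pandas as pd

# Rough sketch: wrap the SHAP values back into a DataFrame that carries
# the original MultiIndex, then group/aggregate on an index level.
shap_df = pd.DataFrame(
    explanation.values,
    index=indexed_features.index,
    columns=indexed_features.columns,
)
mean_shap_by_col1 = shap_df.groupby(level='col1').mean()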

Related

Search vector vs. ALL others in Solr

Say I have a field in Solr which is an array of ints that looks something like this:
vector=array(469,323,324,119,74,58,68,59,49,40,32,26,21,17,14,12,10,9,7,5,-642,-184,-99,-84,-79,-63,-50,-38,-30,-21,-18,-16,-17,-16,14,25,52,21,15,93,53,52,32,15,61,29,346,20,69,72,38,165)
Is there a way to find either the k-nearest neighbors or the cosineSimilarity between this vector and that for all other documents matching a search in Solr?
I tried building a matrix manually but it was crashing Solr.
let(
    a=search(satracks,
             q="vector:*",
             fl="vector",
             qt="/export",
             sort="vector desc"),
    b=col(a, vector),
    mat1=matrix(b),
    mat2=transpose(mat1),
    testvector=array(469,323,324,119,74,58,68,59,49,40,32,26,21,17,14,12,10,9,7,5,-642,-184,-99,-84,-79,-63,-50,-38,-30,-21,-18,-16,-17,-16,14,25,52,21,15,93,53,52,32,15,61,29,346,20,69,72,38,165),
    k=knn(mat2, testvector, 5)
)
The documentation only shows random samples. I want to compare a vector to every other vector that matches a given search.
You can do this using Solr 9. First you have to add a vectorized field to every document in Solr. Then you can use knn in the query:
&q={!knn f=vectorized_field topK=10}[your_vector]
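A hedged sketch of issuing that query from Python is below; the host, the collection name satracks, and the field name vectorized_field are assumptions, and the vector is truncated for brevity:
import requests

# Hedged sketch: ask Solr 9's knn query parser for the topK documents
# nearest to a query vector. Host, collection and field name are assumptions.
query_vector = [469.0, 323.0, 324.0, 119.0, 74.0]
params = {
    "q": "{!knn f=vectorized_field topK=10}" + str(query_vector),
    "fl": "id,score",
}
resp = requests.get("http://localhost:8983/solr/satracks/select", params=params)
print(resp.json()["response"]["docs"])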

Trying to group data and write the groups out to files

I was wondering if anyone knew the proper way to write out a group of files based on the value of a column in Dask. In other words, I want to group the data based on the value in a column and write each group out to its own CSV. I've been trying to use the groupby-apply paradigm with Dask, but the problem is that it does not return a dask.dataframe object, so the function I apply it with has to use the Pandas API.
Is there a better way to approach what I'm trying to do? A scalable solution would be much appreciated because some of the data that I'm dealing with is very large.
Thanks!
If you were saving to parquet, then the partition_on kwarg would be useful (see the short sketch at the end of this answer). If you are saving to csv, then it's possible to do something similar with (rough pseudocode):
def save_partition(df, partition_info=None):
    # Write one CSV per group within this partition; partition_info['number']
    # keeps file names unique across partitions.
    for group_label, group_df in df.groupby('some_col'):
        csv_name = f"{group_label}_partition_{partition_info['number']}.csv"
        group_df.to_csv(csv_name)

delayed_save = ddf.map_partitions(save_partition)
The delayed_save can then be computed when convenient.
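For reference, the parquet route mentioned at the start of this answer is roughly the following (the output directory and column name are placeholders):
# Rough sketch: partition_on writes one sub-directory per distinct
# value of the grouping column, so no manual groupby is needed.
ddf.to_parquet("out_dir/", partition_on=["some_col"])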

dask sort index after load from dataframe with index non-sorted

I created a dataframe with a non-sorted index in pandas and saved it to parquet. Later, when I load it with Dask, how do I sort the index? Do I have to do something like the following?
pdf.reset_index().set_index(idx)
As far as I am aware, the answer is yes, your approach is correct. For example, searching for "sort_index" in Dask issues does not really yield any relevant results.
Keep in mind that sorting out-of-core is quite a difficult operation. It's possible you might get more stable results (or even better performance) in Pandas if your dataset fits in memory.
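A minimal sketch of that approach with Dask (the file name and the index column name "idx" are assumptions):
import dask.dataframe as dd

# Rough sketch: reload the parquet data and rebuild the index.
# set_index shuffles the data so the index ends up sorted across
# partitions, which is the expensive out-of-core step.
ddf = dd.read_parquet("data.parquet")
ddf = ddf.reset_index().set_index("idx")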

Using dask.compute with an array of delayed items

Currently, I can create (nested) lists of objects that are a mix of eagerly computed items and delayed items.
If I pass that list to dask.compute, it can create the graph and computes the result as a new list replacing the delayed items with their computed counterparts.
The list has a very well defined structure that I would like to exploit. As such, before using Dask, I had been using a numpy array with dtype=object.
Can I pass these numpy arrays to dask.compute?
Are there other collections, that support ND slicing à la numpy, that I can use instead?
My current workaround is to either use dictionaries or nested lists, but the ability to slice numpy arrays is really nice and I would not like to lose that.
Thanks,
Mark
code example as notebook
Dask.compute currently only searches through core Python data structures like lists and dictionaries. It does not search through Numpy arrays.
You might consider using Numpy arrays until the very end, then calling .tolist() then calling np.array again.
result = dask.compute(*x.tolist())
result = np.array(result)
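A self-contained toy example of that round trip (the array contents are made up for illustration):
import numpy as np
import dask

# Toy illustration: a 2x2 object array mixing eager values and delayed items.
x = np.empty((2, 2), dtype=object)
x[0, 0] = 1
x[0, 1] = dask.delayed(lambda: 2)()
x[1, 0] = dask.delayed(lambda: 3)()
x[1, 1] = 4

# dask.compute only traverses core containers, so convert to nested lists
# first, then rebuild the array once everything is concrete.
result = np.array(dask.compute(*x.tolist()))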

How to go from DataFrame to multiple Series efficiently in Dask?

I'm trying to find an efficient way to transform a DataFrame into a bunch of persisted Series (columns) in Dask.
Consider a scenario where the data size is much larger than the sum of worker memory and most operations will be wrapped by read-from-disk / spill-to-disk. For algorithms which operate only on individual columns (or pairs of columns), reading-in the entire DataFrame from disk for every column operation is inefficient. In such a case, it would be nice to locally switch from a (possibly persisted) DataFrame to persisted columns instead. Implemented naively:
persisted_columns = {}
for column in subset_of_columns_to_persist:
    persisted_columns[column] = df[column].persist()
This works, but it is very inefficient because df[column] will re-read the entire DataFrame N = len(subset_of_columns_to_persist) times from disk. Is it possible to extract and persist multiple columns individually based on a single read-from-disk deserialization operation?
Note: len(subset_of_columns_to_persist) is >> 1, i.e., simply projecting the DataFrame to df[subset_of_columns_to_persist] is not the solution I'm looking for, because it still has a significant I/O overhead over persisting individual columns.
You can persist many collections at the same time with the dask.persist function. This will share intermediates.
columns = [df[column] for column in df.columns]
persisted_columns = dask.persist(*columns)
d = dict(zip(df.columns, persisted_columns))
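Because all of the column Series come from the same task graph, the read-from-disk and deserialization tasks are executed only once when dask.persist runs and their results are shared across every persisted column, rather than being repeated per column.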
