In pandas, if I want to keep only the target column, I can do something like this:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(3,3), columns=list("abc"))
>>> df[["a"]]
a
0 0.057213
1 0.162161
2 1.351165
>>> type(res)
<class 'pandas.core.frame.DataFrame'>
Notice that the result is still a dataframe.
However, in Deedle, if I want to drop all columns but the "a" column, I have to resort to acrobatics:
let dropCols = ["b"; "c"]
dropCols |> List.iter (fun i -> someFrame.DropColumn i)
Doing someFrame?a gives a series. How do I convert this back to a frame? Is there a straightforward way of keeping only the target column in a Frame without having it convert to a series?
[EDIT]
In addition to s952163's answer, it's worth posting this as well:
fr.Columns.[["a"]]
You can do something like:
Frame.ofColumns ["a" => df?a]
Although in your example above you appear to be mutating the df in place. You can rewrite that to be dropCols |> List.iter df.DropColumn, i.e. without the lambda.
After looking at the API docs it seems there is also this:
Frame.sliceCols ["a"] df
What is the best way to iterate da.linalg.inv over a multi-dimensional dask array?
I have a dask array of shape (4, 4, 8, 8), and need to compute the inverse of the last two dimensions. With numpy, np.linalg.inv loops over all dimensions except the last two, so in the following example, I can just call np.linalg.inv(A).
I have chosen to use a for loop, but I have read about gufuncs in dask (the documentation seems a little outdated). However, I'm not sure how to implement it, particularly the "signature" bit:
import dask.array as da
import numpy as np
A = da.random.random((4,4,8,8))
A2 = A.reshape((-1,) + A.shape[-2:])
B = [da.linalg.inv(a) for a in A2]
B2 = da.asarray(B)
B3 = B2.reshape(A.shape)
np.testing.assert_array_almost_equal(
    np.linalg.inv(A.compute()),
    B3
)
My attempt at a gufunc leads to an error:
def foo(x):
    return da.linalg.inv(x)
gufoo = da.gufunc(foo, signature="()->()", output_dtypes=float, vectorize=True)
gufoo(A2).compute() # IndexError: tuple index out of range
I think that you want to apply the NumPy function np.linalg.inv over your Dask array rather than the Dask array function da.linalg.inv.
If np.linalg.inv is already a gufunc, then it might work as expected today:
np.linalg.inv(A)
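If it does not, one fallback (my own sketch, not part of the original answer) is to wrap the NumPy function with dask.array.apply_gufunc and spell out the core dimensions yourself; the signature string below is an assumption about how the last two axes should be treated:
import numpy as np
import dask.array as da
A = da.random.random((4, 4, 8, 8))
# Declare the last two axes as core dimensions; np.linalg.inv inverts each 8x8 block.
B = da.apply_gufunc(np.linalg.inv, "(i,j)->(i,j)", A, output_dtypes=float)
np.testing.assert_array_almost_equal(np.linalg.inv(A.compute()), B.compute())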
If I understand correctly, cvxpy converts our high-level problem description to the standard canonical form before it is sent to a solver.
By the standard form I mean the form that can be used for the descent algorithms, so, for instance, it would convert all the absolute values in the objective to be a difference of two positive numbers with some new constraints, etc.
I am wondering if it's possible to see what the reduction looked like for a problem I specify in cvxpy.
For instance, let's say I have the following problem:
import numpy as np
import cvxpy as cp
x = cp.Variable(2)
L = np.asarray([[1,2],[2,3]])
P = L.T @ L
constraints = []
constraints.append(x >= [-10, -10])
constraints.append(x <= [10, 10])
obj = cp.Minimize(cp.quad_form(x, P) - [1, 2] * x)
prob = cp.Problem(obj, constraints)
prob.solve(), prob.solver_stats.solver_name
(-0.24999999999999453, 'OSQP')
So, I would like to see the actual arguments (P, q, A, l, u) being sent to the OSQP solver https://github.com/oxfordcontrol/osqp-python/blob/master/module/interface.py#L278
Any help is greatly appreciated!
From looking at the documentation, it seems you can do this using the get_problem_data method, as follows:
data, chain, inverse_data = prob.get_problem_data(prob.solver_stats.solver_name)
I have not tried it, and the documentation says its output depends on the particular solver and the solver chain, but it may help you!
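As a rough follow-up sketch (untested; the exact keys such as P, q, A depend on the solver and the reduction chain, so I am not asserting what they will be), you could simply dump whatever the reduction produced:
data, chain, inverse_data = prob.get_problem_data('OSQP')
# Print every entry handed to the solver; matrices report their shapes.
for key, value in data.items():
    print(key, getattr(value, 'shape', value))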
I am struggling to convert a dask.bag of dictionaries into dask.delayed pandas.DataFrames and then into a final dask.dataframe.
I have one function (make_dict) that reads files into a rather complex nested dictionary structure and another function (make_df) to turn these dictionaries into a pandas.DataFrame (resulting dataframe is around 100 mb for each file). I would like to append all dataframes into a single dask.dataframe for further analysis.
Up to now I was using dask.delayed objects to load, convert and append all data which works fine (see example below). However for future work I would like to store the loaded dictionaries in a dask.bag using dask.persist().
I managed to load the data into dask.bag, resulting in a list of dicts or a list of pandas.DataFrames that I can use locally after calling compute(). However, when I tried turning the dask.bag into a dask.dataframe using to_delayed(), I got stuck with an error (see below).
It feels like I am missing something rather simple here or maybe my approach to dask.bag is wrong?
The below example shows my approach using simplified functions and throws the same error. Any advice on how to tackle this is appreciated.
import numpy as np
import pandas as pd
import dask
import dask.dataframe
import dask.bag
print(dask.__version__) # 1.1.4
print(pd.__version__) # 0.24.2
def make_dict(n=1):
    return {"name": "dictionary", "data": {'A': np.arange(n), 'B': np.arange(n)}}
def make_df(d):
    return pd.DataFrame(d['data'])
k = [1,2,3]
# using dask.delayed
dfs = []
for n in k:
    delayed_1 = dask.delayed(make_dict)(n)
    delayed_2 = dask.delayed(make_df)(delayed_1)
    dfs.append(delayed_2)
ddf1 = dask.dataframe.from_delayed(dfs).compute() # this works as expected
# using dask.bag and turning bag of dicts into bag of DataFrames
b1 = dask.bag.from_sequence(k).map(make_dict)
b2 = b1.map(make_df)
df = pd.DataFrame().append(b2.compute()) # <- I would like to do this using delayed dask.DataFrames like above
ddf2 = dask.dataframe.from_delayed(b2.to_delayed()).compute() # <- this fails
# error:
# ValueError: Expected iterable of tuples of (name, dtype), got [ A B
# 0 0 0]
What I ultimately would like to do, using the distributed scheduler:
b = dask.bag.from_sequence(k).map(make_dict)
b = b.persist()
ddf = dask.dataframe.from_delayed(b.map(make_df).to_delayed())
In the bag case the delayed objects point to lists of elements, so you have a list of lists of pandas dataframes, which is not quite what you want. Two recommendations:
Just stick with dask.delayed; it seems to work well for you.
Use the Bag.to_dataframe method, which expects a bag of dicts and does the dataframe conversion itself.
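For the second suggestion, here is a rough sketch under the assumption that each nested dict is first flattened into per-row dicts; to_rows is a hypothetical helper, not part of the original code:
import numpy as np
import dask.bag
def make_dict(n=1):
    return {"name": "dictionary", "data": {'A': np.arange(n), 'B': np.arange(n)}}
def to_rows(d):
    # Hypothetical helper: turn one nested dict into a list of flat row dicts
    return [{'A': a, 'B': b} for a, b in zip(d['data']['A'], d['data']['B'])]
b = dask.bag.from_sequence([1, 2, 3]).map(make_dict)
ddf = b.map(to_rows).flatten().to_dataframe()
The bag can still be persisted before the conversion if that is the goal.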
I am trying to implement a rolling average which resets whenever a '1' is encountered in a column labeled 'A'.
For example, the following functionality works in Pandas.
import pandas as pd
x = pd.DataFrame([[0,2,3], [0,5,6], [0,8,9], [1,8,9],[0,8,9],[0,8,9], [0,3,5], [1,8,9],[0,8,9],[0,8,9], [0,3,5]])
x.columns = ['A', 'B', 'C']
x['avg'] = x.groupby(x['A'].cumsum())['B'].rolling(2).mean().values
If I try analogous code in Dask, I get the following:
import pandas as pd
import dask.dataframe
x = pd.DataFrame([[0,2,3], [0,5,6], [0,8,9], [1,8,9],[0,8,9],[0,8,9], [0,3,5], [1,8,9],[0,8,9],[0,8,9], [0,3,5]])
x.columns = ['A', 'B', 'C']
x = dask.dataframe.from_pandas(x, npartitions=3)
x['avg'] = x.groupby(x['A'].cumsum())['B'].rolling(2).mean().values
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-189-b6cd808da8b1> in <module>()
7 x = dask.dataframe.from_pandas(x, npartitions=3)
8
----> 9 x['avg'] = x.groupby(x['A'].cumsum())['B'].rolling(2).mean().values
10 x
AttributeError: 'SeriesGroupBy' object has no attribute 'rolling'
After searching through the Dask API documentation I have not been able to find an implementation of what I am looking for.
Can anyone suggest an implementation of this algorithm in a Dask-compatible way?
Thank you :)
Since then, I found the following code snippet in the question "Dask rolling function by group syntax":
df1 = ddf.groupby('cumsum')['x'].apply(lambda x: x.rolling(2).mean(), meta=('x', 'f8')).compute()
Here is an adapted toy example:
import pandas as pd
import dask.dataframe as dd
x = pd.DataFrame([[1,2,3], [2,3,4], [4,5,6], [2,3,4], [4,5,6], [4,5,6], [2,3,4]])
x['bool'] = [0,0,0,1,0,1,0]
x.columns = ['a', 'b', 'x', 'bool']
ddf = dd.from_pandas(x, npartitions=4)
ddf['cumsum'] = ddf['bool'].cumsum()
df1 = ddf.groupby('cumsum')['x'].apply(lambda x: x.rolling(2).mean(), meta=('x', 'f8')).compute()
df1
This has the correct functionality, but the order of the indices is now incorrect. Alternatively, if one knows how to preserve the order of the index, that would be a suitable solution.
You might want to construct your own rolling operation using the map_overlap or _cum_agg methods (_cum_agg is unfortunately not well documented).
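Regarding the index order from the groupby/apply snippet above, one possible fix (a sketch, assuming the computed result carries a (group key, original row index) MultiIndex like its pandas counterpart) is to drop the group level and sort on the original index:
# Drop the 'cumsum' group level, then sort back into the original row order.
restored = df1.reset_index(level=0, drop=True).sort_index()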
Is it possible to create a dask array from a delayed value by specifying its shape with an other delayed value?
My algorithm won't give me the shape of the array until pretty late in the computation.
Eventually, I will be creating some blocks with shapes specified by the intermediate results of my computation, and then calling da.concatenate on all the results (or da.block, if it were more flexible).
I don't think it is too detrimental if I can't, but it would be cool if I could.
Sample code
from dask import delayed
from dask import array as da
import numpy as np
n_shape = (3, 3)
shape = delayed(n_shape, nout=2)
d_shape = (delayed(n_shape[0]), delayed(n_shape[1]))
n = delayed(np.zeros)(n_shape, dtype=np.float)
# this doesn't work
# da.from_delayed(n, shape=shape, dtype=np.float)
# this doesn't work either, but I think goes a little deeper
# into the function call
da.from_delayed(n, shape=d_shape, dtype=np.float)
You cannot provide a delayed shape, but you can state that the shape is unknown by using np.nan as a value wherever you don't know a dimension.
Example
import random
import numpy as np
import dask
import dask.array as da
@dask.delayed
def f():
    return np.ones((5, random.randint(10, 20)))  # a 5 x ? array
values = [f() for _ in range(5)]
arrays = [da.from_delayed(v, shape=(5, np.nan), dtype=float) for v in values]
x = da.concatenate(arrays, axis=1)
>>> x
dask.array<concatenate, shape=(5, nan), dtype=float64, chunksize=(5, nan)>
>>> x.shape
(5, np.nan)
>>> x.compute().shape
(5, 88)
Docs
See http://dask.pydata.org/en/latest/array-chunks.html#unknown-chunks
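As a side note (my own assumption about newer Dask releases, not part of the original answer), recent Dask versions can fill in the unknown dimensions after the fact:
# Assumes a recent Dask version that provides Array.compute_chunk_sizes()
x.compute_chunk_sizes()  # triggers computation and replaces the nan dimensions
x.shape                  # now reports concrete sizes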