Why does dask return a subgraph rather than a computation on read_csv().compute()?

When I do dd.read_csv('AAPL.csv'), it correctly identifies the structure:
In [13]: dd.read_csv('AAPL.csv')
Out[13]:
Dask DataFrame Structure:
              2016-01-04T09:00:00Z  105.45 105.45.1   103.6 103.6.1   17462
npartitions=1
                            object float64  float64 float64 float64 float64
                               ...     ...      ...     ...     ...     ...
Dask Name: read-csv, 1 tasks
But when I try to actually compute it, it always returns this:
In [14]: dd.read_csv('AAPL.csv').compute()
Out[14]:
0 (<Serialize: subgraph_callable-f9057ef2-07c7-4...
dtype: object
Reading the same file with pandas works fine. What am I doing wrong?

Related

Getting byte type object in prediction

I am getting a bytes value returned from the predict function on my data.
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
predictor1 = Predictor(endpoint_name=predictor.endpoint_name, serializer=CSVSerializer(), deserializers = CSVDeserializer())
result = predictor1.predict(data)
print(type(result))
print(result)
<class 'bytes'>
b'{"probabilities": [[0.9999768137931824, 2.3188162231235765e-05]]}'
The way you're calling the predict method, the output is returned as bytes. There is no special SageMaker function you need to use to solve this.
Just use decode() and eval() to retrieve the values correctly:
decoded_string = result.decode('utf-8')
json_from_string = eval(decoded_string)
print(json_from_string['probabilities'][0])
The output will be:
[0.9999768137931824, 2.3188162231235765e-05]
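Since the returned payload is JSON, a safer alternative to eval() is the standard-library json module; a minimal sketch using the same result variable as above:
import json

# Decode the raw bytes, then parse the JSON payload safely
decoded_string = result.decode('utf-8')
parsed = json.loads(decoded_string)
print(parsed['probabilities'][0])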

Dask regex extract comparison failing with NotImplementedError

I have a Dask dataframe that looks like this:
class1                                  statement              class2               value
<geoentity_Pic_de_Font_Blanca_2986043>  <hasLatitude>          42.64991^^<degrees>  42.64991
<geoentity_Pic_de_Font_Blanca_2986043>  <hasLongitude>         1.53335^^<degrees>   1.53335
<geoentity_Pic_de_Font_Blanca_2986043>  <hasGeonamesEntityId>  2986043              NaN
<geoentity_Pic_de_Font_Blanca_2986043>  rdfs:label             Pic de Font Blanca   NaN
I'm trying to check whether the number in class1 matches the one in class2 for all the <hasGeonamesEntityId> rows; so that I can get rid of those rows, since they would then carry unnecessarily duplicated data.
I tried:
df[(df['statement'] == '<hasGeonamesEntityId>') & (df['class1'].str.extract(r'_(\d+)>$') == df['class2'])].head()
but this gives me the following error:
E:\WPy-3710\python-3.7.1.amd64\lib\site-packages\dask\dataframe\core.py in __getitem__(self, key)
3347 graph = HighLevelGraph.from_collections(name, dsk, dependencies=[self, key])
3348 return new_dd_object(graph, name, self, self.divisions)
-> 3349 raise NotImplementedError(key)
3350
3351 def __setitem__(self, key, value):
NotImplementedError: Dask DataFrame Structure:
                    0     1
npartitions=442
                 bool  bool
                  ...   ...
...               ...   ...
                  ...   ...
                  ...   ...
Dask Name: and_, 3978 tasks
My dtypes are:
class1       category
statement    category
class2       object
value        category
I'm not sure why this is failing, since the extract on its own seems to return the correct substring. Does anybody know what I'm doing wrong?
It's hard to say without a reproducible example, but it looks like you're trying to index a Dask DataFrame with another Dask DataFrame, which isn't supported and probably isn't what you want.
Using just pandas:
In [18]: df = pd.DataFrame({"A": ['a1', 'b2', 'c3']})
In [19]: df[df.A.str.extract(r'(\d)') == '1']
Out[19]:
A
0 NaN
1 NaN
2 NaN
That's because .str.extract returns a DataFrame. Set expand=False to get a 1D Series:
In [20]: df[df.A.str.extract(r'(\d)', expand=False) == '1']
Out[20]:
A
0 a1
This works for Dask as well:
In [21]: df = dd.from_pandas(df, 2)
In [22]: df[df.A.str.extract(r'(\d)', expand=False) == '1']
Out[22]:
Dask DataFrame Structure:
                    A
npartitions=1
0              object
2                 ...
Dask Name: getitem, 5 tasks
In [23]: _.compute()
Out[23]:
A
0 a1
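Applied back to the question, the same expand=False fix should make the combined filter work (a sketch, assuming the column names from the question):
mask = ((df['statement'] == '<hasGeonamesEntityId>')
        & (df['class1'].str.extract(r'_(\d+)>$', expand=False) == df['class2']))
df[mask].head()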

How to perform a Diff on a Dask SeriesGroup object

I have a multi-index Dask dataframe on which I need to perform a groupby, followed by a diff. This operation is trivial in pure pandas via the following command:
df.groupby('IndexName')['ValueName'].diff()
Dask, however, doesn't implement the diff function on SeriesGroupBy objects. I've attempted to implement my own with the following command:
df.groupby('IndexName')['ValueName'].apply(lambda x: x.diff(1))
but this yields the following error:
ValueError: Wrong number of items passed 0, placement implies 3987
Any ideas?
Below is the sample dataframe:
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask import delayed

dummy = {
    'Index1': pd.DataFrame({'A': np.arange(10), 'ValueName': np.random.rand(10)}),
    'Index2': pd.DataFrame({'A': np.arange(5), 'ValueName': np.random.rand(5)})
}
pdf = pd.concat(dummy, names=['IndexName'])

def getDummy(f):
    return f

dfs = [delayed(getDummy)(f) for f in [pdf]]
# NOTE: dd.from_pandas doesn't support a MultiIndex, but from_delayed does
df = dd.from_delayed(dfs)
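One thing worth trying (an assumption on my part, not a verified fix): pass an explicit meta to apply so that Dask skips metadata inference. Inference runs the lambda against an empty dummy frame, which is a common source of this kind of ValueError:
# Hedged sketch: supply meta explicitly so Dask doesn't infer the output shape
result = df.groupby('IndexName')['ValueName'].apply(lambda s: s.diff(), meta=('ValueName', 'f8'))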

How to convert a pandas str.split call to Dask

I have a Dask dataframe whose index is a string that looks like this:
12/09/2016 00:00;32.0046;-106.259
12/09/2016 00:00;32.0201;-108.838
12/09/2016 00:00;32.0224;-106.004
(it's basically a string encoding the datetime;latitude;longitude of each row)
I'd like to split that, while still in the Dask context, into individual columns representing each of the fields.
I can do that with a pandas dataframe as:
df['date'], df['Lat'], df['Lon'] = df.index.str.split(';', 2).str
But none of my attempts to do the same in Dask have worked. If I directly substitute a Dask dataframe for the pandas one, I get the error:
'Index' object has no attribute 'str'
If I use the column name instead of index as:
forecastDf['date'], forecastDf['Lat'], forecastDf['Lon'] = forecastDf['dateLocation'].str.split(';', 2).str
I get the error:
TypeError: 'StringAccessor' object is not iterable
Here is a runnable example of this working in pandas:
import pandas as pd
df = pd.DataFrame()
df['dateLocation'] = ['12/09/2016 00:00;32.0046;-106.259','12/09/2016 00:00;32.0201;-108.838','12/09/2016 00:00;32.0224;-106.004']
df = df.set_index('dateLocation')
df['date'], df['Lat'], df['Lon'] = df.index.str.split(';', 2).str
df.head()
Here is the error I get if I directly convert that to Dask:
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=1)
ddf['date'], ddf['Lat'], ddf['Lon'] = ddf.index.str.split(';', 2).str
>>TypeError: 'StringAccessor' object is not iterable
str.partition only splits at the first separator, so partition twice to peel off all three fields:
parts = forecastDf['dateLocation'].str.partition(';')
rest = parts[2].str.partition(';')
forecastDf['date'], forecastDf['Lat'], forecastDf['Lon'] = parts[0], rest[0], rest[2]
Let me know if this works for you!
First make sure the column is string dtype:
forecastDD['dateLocation'] = forecastDD['dateLocation'].astype('str')
Then you can use this to split in Dask:
splitColumns = client.persist(forecastDD['dateLocation'].str.split(';',2))
You can then index into the resulting splitColumns series and assign the pieces back to the original dataframe. Note the field order is date;lat;lon, so x[0] is the date:
forecastDD = forecastDD.assign(
    date=splitColumns.apply(lambda x: x[0], meta=('date', 'object')),
    Lat=splitColumns.apply(lambda x: float(x[1]), meta=('Lat', 'f8')),
    Lon=splitColumns.apply(lambda x: float(x[2]), meta=('Lon', 'f8'))
)
Unfortunately I couldn't figure out how to do it without persisting and creating the temporary splitColumns series.
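For what it's worth, newer Dask versions also support expand=True on str.split directly (a sketch; assumes a Dask version where expand is available, which requires passing n explicitly):
parts = forecastDf['dateLocation'].str.split(';', n=2, expand=True)
forecastDf = forecastDf.assign(date=parts[0], Lat=parts[1], Lon=parts[2])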

Annotations in grails with Dropwizard metrics not working

I'm configuring Dropwizard metrics for a Grails application with annotations, using the metrics-aspectj plugin (https://github.com/astefanutti/metrics-aspectj).
I have the following in BuildConfig.groovy:
compile 'io.astefanutti.metrics.aspectj:metrics-aspectj:1.1.0'
compile 'io.dropwizard.metrics:metrics-core:3.1.0'
compile 'io.dropwizard.metrics:metrics-graphite:3.1.2'
compile 'io.dropwizard.metrics:metrics-annotation:3.1.2'
Here I'm trying to post data to a Graphite server running locally.
I have configured a controller with a Graphite reporter as follows. When I run the app it doesn't report anything, and I'm trying to figure out where I'm going wrong. Please also let me know if there is another approach (e.g. using Spring AOP).
import com.codahale.metrics.ConsoleReporter
import com.codahale.metrics.MetricFilter
import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.SharedMetricRegistries
import com.codahale.metrics.annotation.Metered
import com.codahale.metrics.annotation.Timed
import com.codahale.metrics.graphite.Graphite
import com.codahale.metrics.graphite.GraphiteReporter
import io.astefanutti.metrics.aspectj.Metrics
import java.util.concurrent.TimeUnit
@Metrics(registry = "graphiteregistry2")
class GlassdoorController {

    final MetricRegistry registry = new MetricRegistry()

    GlassdoorController() {
        final Graphite graphite = new Graphite(new InetSocketAddress("127.0.0.1", 2003))
        final GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("grails.example.com")
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .filter(MetricFilter.ALL)
                .build(graphite)
        reporter.start(1, TimeUnit.SECONDS)

        ConsoleReporter reporter1 = ConsoleReporter.forRegistry(registry)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build()
        reporter1.start(5, TimeUnit.SECONDS)

        SharedMetricRegistries.add("graphiteregistry2", registry)
        log.info(SharedMetricRegistries.getOrCreate("graphiteregistry2"))
    }

    @Metered(name = "reviewspage")
    @Timed(name = "reviewspagetimed")
    def reviews() {
        // business logic
    }
}
Looking at your code, I spotted a difference:
reporter.start(1, TimeUnit.SECONDS);
// ...
reporter1.start(5, TimeUnit.SECONDS)
That means your Graphite reporter fires every second, but the console reporter only every 5 seconds. Probably your test program exits before those 5 seconds are over. So you can either add something like Thread.sleep(6000) to your main() method or reduce the reporting interval for the console reporter to match the Graphite reporter. I tried it, and it works. ;-)
06.03.16 12:22:59 ==============================================================
-- Meters ----------------------------------------------------------------------
de.scrum_master.app.GlassdoorController.reviewspage
count = 1
mean rate = 0,20 events/second
1-minute rate = 0,20 events/second
5-minute rate = 0,20 events/second
15-minute rate = 0,20 events/second
-- Timers ----------------------------------------------------------------------
de.scrum_master.app.GlassdoorController.reviewspagetimed
count = 1
mean rate = 0,20 calls/second
1-minute rate = 0,20 calls/second
5-minute rate = 0,20 calls/second
15-minute rate = 0,20 calls/second
min = 0,01 milliseconds
max = 0,01 milliseconds
mean = 0,01 milliseconds
stddev = 0,00 milliseconds
median = 0,01 milliseconds
75% <= 0,01 milliseconds
95% <= 0,01 milliseconds
98% <= 0,01 milliseconds
99% <= 0,01 milliseconds
99.9% <= 0,01 milliseconds
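For reference, the two options from above look like this (a sketch; the variable names follow the controller code in the question):
// Option 1: match the console reporter's interval to the Graphite reporter's
reporter1.start(1, TimeUnit.SECONDS)
// Option 2: keep the test program alive long enough for the 5-second interval to fire
Thread.sleep(6000)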
