Pandas time series index attribute error when using TsTables & PyTables in creating a table class

I am trying to define a table structure through a tb.IsDescription class, then create an .h5 file and populate it with a pandas DataFrame that has a datetime index, using the TsTables package. I have already tested the DataFrame and the datetime indexing and both seem to be fine. I believe the issue is with the TsTables package itself, since PyCharm flags its import as an 'Unused import statement'. The error I get is: "AttributeError: module 'pandas.tseries' has no attribute 'index'". The reason I am using TsTables is that I have heard it is faster than other modules. Any suggestions on how to resolve this issue, or any substitute method?
import numpy as np
import pandas as pd
import tables as tb
import datetime as dt
path = r'C:\Users\--------\PycharmProjects\pythonProject2'
no = 5000000 # number of time steps
co = 3 # number of time series
interval = 1. / (12 * 30 * 24 * 60) # the time interval as a year fraction
vol = 0.2 # volatility
rn = np.random.standard_normal((no, co))
rn[0] = 0.0 # sets the initial random numbers to zero
paths = 100 * np.exp(np.cumsum(-0.5 * vol ** 2 * interval + vol * np.sqrt(interval) * rn, axis=0))
# simulation based on an Euler discretization
paths[0] = 100 # Sets the initial values of the paths to 100
dr = pd.date_range('2019-1-1', periods=no, freq='1s')
print(dr[-6:]) # the date range appears fine
df = pd.DataFrame(paths, index=dr, columns=['ts1', 'ts2', 'ts3'])
print(df.info(verbose=True)) # df is pandas Dataframe and appears fine
print(df.head()) # tested a fraction of the data, it is fine
import tstables as tstab # PyCharm marks this as an 'Unused import statement'
class ts_desc(tb.IsDescription):
    timestamp = tb.Int64Col(pos=0) # the column for the timestamps
    ts1 = tb.Float64Col(pos=1) # the columns to store numerical data
    ts2 = tb.Float64Col(pos=2)
    ts3 = tb.Float64Col(pos=3)
h5 = tb.open_file(path + 'tstab.h5', 'w')
ts = h5.create_ts('/', 'ts', ts_desc)
ts.append(df) # !!!!! the error is raised from this line !!!!!
# the failing check inside tstables is:
#   if rows.index.__class__ != pandas.tseries.index.DatetimeIndex:
# AttributeError: module 'pandas.tseries' has no attribute 'index'

I suspect you have run into a version-compatibility issue between tstables and your pandas version (assuming you are running a recent pandas release). Based on the tstables PyPI page, the last release of tstables was in 2015. A check of the tstables GitHub project page shows there was an issue with pandas 0.20.3 and its use of datetime, with the same error message as yours: module 'pandas.tseries' has no attribute 'index' in tstables. See this: tstables breaks down with Pandas 20.3
The issue has a link to another build that works with pandas 0.20.3. The development notes state "Removed 'convert_datetime64' parameter on line 245". I am not sure whether it will work with more recent versions, but it is worth a try. See this: schwed2 tstables build
If that doesn't solve the problem, I suggest running the simple examples provided with the package or running its setup tests. (Note: I could not find the bpi_2014_01.csv file to test the bitcoin/bpi example.)
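If you only need fast HDF5 storage for the frame and don't strictly need TsTables, one fallback (a minimal sketch, not part of the original answer; the file name and key below are placeholders) is pandas' own HDF5 writer, which also uses PyTables under the hood:
import os
import pandas as pd
# write the DataFrame to an HDF5 table; 'ts' is just an arbitrary key chosen here
df.to_hdf(os.path.join(path, 'ts_fallback.h5'), key='ts', format='table')
# read it back later
df2 = pd.read_hdf(os.path.join(path, 'ts_fallback.h5'), key='ts')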
Good luck.

Related

GluonTS example airpassengers dataset not found

I am trying to run the GluonTS example code. After some struggle installing the libraries, I now get the following error:
FileNotFoundError: C:\Users\abcde\.mxnet\gluon-ts\datasets\airpassengers\test
The folder C:\Users\abcde\.mxnet\gluon-ts\datasets\airpassengers\ does exist, but it contains only the train folder. I have tried reinstalling, but to no avail. Any ideas how to fix this and run the example, even if it means finding the dataset in the correct format elsewhere?
EDIT: To clarify, I was referring to an example on https://ts.gluon.ai/stable/
import matplotlib.pyplot as plt
from gluonts.dataset.util import to_pandas
from gluonts.dataset.pandas import PandasDataset
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.mx import DeepAREstimator, Trainer
dataset = get_dataset("airpassengers")
deepar = DeepAREstimator(prediction_length=12, freq="M", trainer=Trainer(epochs=5))
model = deepar.train(dataset.train)
# Make predictions
true_values = to_pandas(list(dataset.test)[0])
true_values.to_timestamp().plot(color="k")
prediction_input = PandasDataset([true_values[:-36], true_values[:-24], true_values[:-12]])
predictions = model.predict(prediction_input)
for color, prediction in zip(["green", "blue", "purple"], predictions):
    prediction.plot(color=f"tab:{color}")
plt.legend(["True values"], loc="upper left", fontsize="xx-large")
There was an incorrect import in an earlier version of the example, which has since been corrected. I also needed to specify regenerate=True when getting the dataset, so:
dataset = get_dataset("airpassengers", regenerate=True)

How to execute prophet functions using multi-processing?

I have been generating forecasts for a small set of 100 products (e.g. 100 Aliases), but I want to scale to 20k Aliases. I am currently in an Anaconda environment using the default configuration (1 core). The forecast function has been running for 2 days and I have processed about 4k Aliases. How would I implement multiprocessing with Prophet? I think that would help reduce processing time (if I understand correctly?). I currently have a machine with 32 GB of RAM and an Intel Xeon(R) CPU E5-2697 v4 (18 cores). Any help would be amazing...
import warnings;
warnings.simplefilter('ignore')
!pip install pystan
!pip install fbprophet
import pandas as pd
from fbprophet import Prophet
import os
from pandas_visual_analysis import VisualAnalysis
import pyodbc
from datetime import datetime, timedelta, date
import psycopg2
from distributed import Client, performance_report
import multiprocessing as mp
from multiprocessing import Process, current_process
#organized Alias in groups
x = df_group_final.groupby('Alias')
#create empty dataframe
y = pd.DataFrame()
def forecast(Aliases, target):
    # loop through all Aliases, one at a time
    for Alias in Aliases.groups:
        # get all the data for this group (aka Alias)
        group = Aliases.get_group(Alias)
        # initialize the model with a 95% confidence interval
        m = Prophet(interval_width=0.95, seasonality_mode='multiplicative')
        # fit the model on the past data of this group
        m.fit(group)
        # based on the model, predict 18 months into the future
        future = m.make_future_dataframe(periods=18, freq='m')
        forecast = m.predict(future)
        # m.plot(forecast)
        # creates the forecast column for this Alias
        forecast = forecast.rename(columns={'yhat': 'yhat_' + Alias})
        target = pd.merge(target, forecast.set_index('ds'), how='outer', left_index=True, right_index=True)
forecast(x, y)
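No answer is quoted here, but one common pattern (a sketch only, not from the original thread) is to fit one Prophet model per Alias in a process pool, e.g. with multiprocessing.Pool. The helper below assumes each group in df_group_final already has the ds and y columns Prophet expects, and that 16 worker processes is a reasonable choice on an 18-core machine:
from multiprocessing import Pool
import pandas as pd
from fbprophet import Prophet
def fit_one(item):
    # fit a single Prophet model for one Alias and return its forecast column
    alias, group = item
    m = Prophet(interval_width=0.95, seasonality_mode='multiplicative')
    m.fit(group)
    future = m.make_future_dataframe(periods=18, freq='m')
    fc = m.predict(future)[['ds', 'yhat']].rename(columns={'yhat': 'yhat_' + alias})
    return fc.set_index('ds')
if __name__ == '__main__':
    groups = list(df_group_final.groupby('Alias'))  # [(alias, sub-frame), ...]
    with Pool(processes=16) as pool:
        results = pool.map(fit_one, groups)
    # one yhat_<Alias> column per product, outer-joined on the forecast dates
    y = pd.concat(results, axis=1)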

I'm using Dask to apply LabelingFunction using Snorkel on multiple datasets but it seems to take forever. Is this normal?

My problem is as follows:
I have several datasets (900K, 1.7M and 1.7M entries) in CSV format which I load into multiple Dask DataFrames.
Then I concatenate them all into one Dask DataFrame that I can feed to my Snorkel applier, which applies a bunch of labeling functions to each row of my DataFrame and returns a numpy array with as many rows as there are in the DataFrame and as many columns as there are labeling functions.
The call to the Snorkel applier seems to take forever when I do that with the 3 datasets (more than 2 days...). However, if I just run the code with only the first dataset, the call takes around 2 hours. Of course I don't do the concatenation step then.
So I was wondering: how can this be? Should I change the number of partitions in the concatenated DataFrame? Or maybe I'm using Dask badly in the first place?
Here is the code I'm using:
from snorkel.labeling.apply.dask import DaskLFApplier
import dask.dataframe as dd
import numpy as np
import os
import time
from datetime import timedelta
start = time.time()
applier = DaskLFApplier(lfs) # lfs are the labeling functions that are going to be applied; one of them featurizes one of the columns of my DataFrame and applies a sklearn classifier (I set n_jobs to None when loading the model)
# If I have only one CSV to read
if isinstance(PATH_TO_CSV, str):
    training_data = dd.read_csv(PATH_TO_CSV, lineterminator=os.linesep, na_filter=False, dtype={'size': 'int32'})
    slices = None
# If I have several CSVs
elif isinstance(PATH_TO_CSV, list):
    training_data_list = [dd.read_csv(path, lineterminator=os.linesep, na_filter=False, dtype={'size': 'int32'}) for path in PATH_TO_CSV]
    training_data = dd.concat(training_data_list, axis=0)
    # bookkeeping so I know where to slice the final result and can assign each part to each dataset
    df_sizes = [len(df) for df in training_data_list]
    cut_idx = np.insert(np.cumsum(df_sizes), 0, 0)
    slices = list(zip(cut_idx[:-1], cut_idx[1:]))
# The call that lasts forever: I tested all the code above without this line on my 3 datasets and it runs perfectly fine
L_train = applier.apply(training_data)
end = time.time()
print('Time elapsed: {}'.format(timedelta(seconds=end - start)))
If you need more info, I will try to provide as much as I can.
Thanks in advance for your help :)
It seems that by default the applier function uses processes, so it does not benefit from the additional workers you might have available:
# add this to the beginning of your code
from dask.distributed import Client
client = Client()
# you can see the address of the client by typing `client` and opening the dashboard
# skipping your other code
# you need to pass the client explicitly to the applier
# after launching this open the dashboard and watch the workers work :)
L_train = applier.apply(training_data, scheduler=client)
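On the partition question: it may also help to repartition the concatenated frame before applying the labeling functions, since dd.concat simply stacks the partitions of its inputs. A sketch (the partition count is just a placeholder to tune against your cluster):
# spread the rows over more, smaller partitions so the workers stay busy
training_data = training_data.repartition(npartitions=64)
L_train = applier.apply(training_data, scheduler=client)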

Dask cannot read a file that pandas can

I have a CSV file that can be read with pandas but fails with a Dask DataFrame.
I am using exactly the same parameters and still get an error with Dask.
Pandas use case:
import pandas as pd
mycols = ['id', 'tran_id', 'client_id', 'm_text', 'retry', 'tran_date']
df = pd.read_csv('s3://some_bucket/abigd/hed4.csv',
                 sep=',', header=None, names=mycols, skipinitialspace=True, escapechar='\\',
                 engine='python', dtype=str)
Pandas output:
df.retry.value_counts()
1     2792174
2      907081
3      116369
6        6475
4        5598
7        1314
5        1053
8         288
16          3
13          3
Name: retry, dtype: int64
dask code:
import dask.dataframe as dd
from dask.distributed import Client
client = Client('Dask-Scheduler.local-dask:8786')
df = dd.read_csv('s3://some_bucket/abigd/hed4.csv',
                 sep=',', header=None, names=mycols, skipinitialspace=True, escapechar='\\',
                 engine='python', dtype=str,
                 storage_options={'anon': False, 'key': 'xxx', 'secret': 'xxx'})
df_persisted = client.persist(df)
df_persisted.retry.value_counts().compute()
Dask Output:
ParserError: unexpected end of data
I have tried opening smaller (and bigger) files in dask and there was no issue with them. It is possible that this file may have unclosed quotations. I can not see any reason why dask is unable to read the file.
Dask splits your files by looking for the line-separator character b"\n". It looks for this single byte in parts of the file, so that the whole thing does not need to be read beforehand. When it finds one, it is not aware of whether the byte is escaped or inside a quoted scope.
Thus the chunking-up of a large file by Dask can fail, and it appears that this is happening for you: some block is finishing on a newline which is not really a line ending.
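One workaround (a sketch, assuming each file fits comfortably in a worker's memory) is to turn off the byte-based splitting with blocksize=None, so Dask reads each file as a single partition and pandas sees it whole:
df = dd.read_csv('s3://some_bucket/abigd/hed4.csv',
                 sep=',', header=None, names=mycols, skipinitialspace=True, escapechar='\\',
                 engine='python', dtype=str, blocksize=None,
                 storage_options={'anon': False, 'key': 'xxx', 'secret': 'xxx'})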

Biopython: Local alignment between DNA sequences doesn't find optimal alignment

I'm writing code to find local alignments between two sequences. Here is a minimal, working example I've been working on:
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
seq1 = "GTGGTCCTAGGC"
seq2 = "GCCTAGGACCAC"
# scores for the alignment
match = 1
mismatch = -2
gapopen = -2
gapext = 0
# see: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html
# 'localms' takes <seq1, seq2, match, mismatch, open, extend>
for a in pairwise2.align.localms(seq1, seq2, match, mismatch, gapopen, gapext):
    print(format_alignment(*a))
This code runs with the following output:
GTGGTCCTAGGC----
|||||
----GCCTAGGACCAC
Score=5
But a score of 6 should be possible by also matching the 'C-C' pair next to the 5 aligned bases, like so:
GTGGTCCTAGGC----
||||||
----GCCTAGGACCAC
Score=6
Any ideas on what's going on?
This seems to be a bug in the current implementation of local alignments in Biopython's pairwise2 module. There is a recent pull request (#782) on Biopython's GitHub, which should solve your problem:
>>> from Bio import pairwise2 # This is the version from the pull request
>>> seq1 = 'GTGGTCCTAGGC'
>>> seq2 = 'GCCTAGGACCAC'
>>> for a in pairwise2.align.localms(seq1, seq2, 1, -2, -2, 0):
...     print(pairwise2.format_alignment(*a))
GTGGTCCTAGGC----
||||||
----GCCTAGGACCAC
Score=6
If you are working with short sequences only, you can just download the code for pairwise2.py from the pull request mentioned above. In addition, you need to 'inactivate' the respective C module (cpairwise2.pyd or cpairwise2.so), e.g. by renaming it or by removing the import of the C functions at the end of pairwise2.py (from .cpairwise import ...).
If you are working with longer sequences, you will need the speed enhancement of the C module. Thus you also have to download cpairwise2module.c from the pull request and compile it into cpairwise2.pyd (for Windows systems) or cpairwise2.so (Unix, Linux).
EDIT: In Biopython 1.68 the problem is solved.
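For readers on current Biopython releases, where pairwise2 is deprecated in favour of Bio.Align.PairwiseAligner, the same local alignment can be expressed as in the sketch below (using the scoring values from the question; the expected score of 6 is an assumption based on those parameters):
from Bio import Align
# configure a local aligner with the question's scoring scheme
aligner = Align.PairwiseAligner()
aligner.mode = 'local'
aligner.match_score = 1
aligner.mismatch_score = -2
aligner.open_gap_score = -2
aligner.extend_gap_score = 0
alignments = aligner.align("GTGGTCCTAGGC", "GCCTAGGACCAC")
print(alignments.score)  # should report 6 with these parameters
print(alignments[0])     # print one optimal local alignment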
