I have monthly time series data containing 528 rows, and I use the ts() function as follows:
oilfulldata <- ts(data$logOil, start = c(1976,01), end = c(2019,12), frequency = 12 )
This call works properly and the values stored in oilfulldata match what I see in the Excel sheet I import the data from:
head(oilfulldata)
[1] 1.080266 1.082785 1.085291 1.085291 1.085291 1.085291
Second, I try to make multiple time series starting from different dates, as follows:
Oildata1 <- ts(data$logOil, start = c(1976,01), end = c(1999,12), frequency = 12 )
Oildata2 <- ts(data$logOil, start = c(2002,01), end = c(2019,12), frequency = 12 )
The first call also works properly and I get the values as I see them in the Excel sheet:
head(Oildata1)
[1] 1.080266 1.082785 1.085291 1.085291 1.085291 1.085291
The second call is my problem: although I get no error, the stored data is wrong.
head(Oildata2)
[1] 1.080266 1.082785 1.085291 1.085291 1.085291 1.085291
The stored data starts from 1976-01 although I specified a different start date. Can anyone tell me what's going on here?
You need the window() function:
Oildata1 <- window(oilfulldata, start = c(1976,01), end = c(1999,12))
Oildata2 <- window(oilfulldata, start = c(2002,01), end = c(2019,12))
The ts() function is just grabbing enough observations from the start of the data$logOil vector to match the start, end and frequency arguments. It has no way of knowing what time periods the observations correspond to.
I have two dataframes. One is a lookup table consisting of key/value pairs. The other is my main dataframe. The main dataframe has many more records than the lookup table. I need to construct a 'key' from existing columns in my main dataframe and then lookup a value matching that key in my lookup table. Here they are:
import pandas as pd
from datetime import datetime

lk = pd.DataFrame({'key': ['key10', 'key9'], 'value': [100, 90]})
lk.set_index('key', inplace=True)

date_today = datetime.now()
df = pd.DataFrame({'date1': [date_today, date_today, date_today],
                   'year': [1999, 2001, 2003],
                   'month': [10, 9, 10],
                   'code': [10, 4, 5],
                   'date2': [None, date_today, None],
                   'keyed_value': [0, 0, 0]})
This is how I get a value:
df['constructed'] = "key" + df['month'].astype('str')
def getKeyValue(lk, k):
    return lk.loc[k, 'value']

print(getKeyValue(lk, df['constructed']))
Here are my issues:
1) I don't want to use iteration or apply methods. My actual data is over 2 million rows and 200 columns. It was really slow (over 2 minutes) with apply, so I opted for an inner join, hence the need to create a new 'constructed' column. After the join I drop the 'constructed' column. The join has helped by bringing execution down to 48 seconds, but there has to be a faster way (I am hoping).
2) How do I vectorize this? I don't know how to even approach it. Is it even possible? I tried this but just got an error:
df['keyed_values'] = getKeyValue(lk, df['constructed'])
Any help or pointers are much appreciated.
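For what it's worth, this kind of lookup can usually be vectorized with Series.map, which aligns each constructed key against the lookup table's index without apply or an explicit merge; whether it beats the join on the real 2-million-row data would need measuring. A minimal sketch using the sample frames from the question:

import pandas as pd

# lookup table and main frame as in the question, trimmed to the relevant columns
lk = pd.DataFrame({'key': ['key10', 'key9'], 'value': [100, 90]}).set_index('key')
df = pd.DataFrame({'month': [10, 9, 10], 'keyed_value': [0, 0, 0]})

# build the key and map it against the lookup Series in one vectorized pass
df['keyed_value'] = ('key' + df['month'].astype(str)).map(lk['value'])
print(df)
#    month  keyed_value
# 0     10          100
# 1      9           90
# 2     10          100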
I am (sort of a beginner just starting out) experimenting with Keras on a time series application where I created a regression model and then saved it to run in a different Python script.
The time series data I am dealing with is hourly, and I am using the saved Keras model to predict a value for each hour in the data set (data = a CSV file read into pandas). With a year's worth of data there are 8760 predictions (hours in a year), and at the end I am attempting to sum the values of the predictions.
In the code below I am not showing how the model architecture gets recreated (a Keras requirement for a saved model). The code works, it's just extremely slow. This method seems fine for under 200 predictions, but for 8760 it bogs down far too much to ever finish.
I don't have any experience with databases, but would that be a better approach than storing 8760 Keras predictions in a Python list? Thanks for any tips, I am still riding the learning curve.
# set initial loop params & empty list to store modeled data
row_num = 0
total_estKwh = []

for i, row in data.iterrows():
    params = row.values
    if (params.ndim == 1):
        params = np.array([params])
    estimatedKwh = load_trained_model(weights_path).predict(params)
    print('Analyzing row number:', row_num)
    total_estKwh.append(estimatedKwh)
    row_num += 1

df = pd.DataFrame.from_records(total_estKwh)
total = df.sum()
totalStd = np.std(df.values)
totalMean = df.mean()
It seems you are making your life very difficult without obvious reason...
For starters, you don't need to load your model for every row - this is overkill! You should definitely move load_trained_model(weights_path) out of the for loop, with something like
model = load_trained_model(weights_path) # load ONCE
and replace the respective line in the loop with
estimatedKwh = model.predict(params)
Second, it is again not efficient to call the model for prediction row by row; it is preferable to first prepare your params as an array, and then feed this to the model to get batch predictions. Forget the print statement, too.
All in all, try this:
params_array = []
for i, row in data.iterrows():
    params = row.values
    if (params.ndim == 1):
        params = np.array([params])  # is this if really necessary??
    params_array.append(params)

params_array = np.asarray(params_array, dtype=np.float32)
total_estKwh = load_trained_model(weights_path).predict(params_array)

df = pd.DataFrame.from_records(total_estKwh)
total = df.sum()
totalStd = np.std(df.values)
totalMean = df.mean()
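As a side note, if every column of data is numeric (an assumption, since the question does not show the CSV contents), the row loop can probably be dropped altogether and the whole frame fed to the model in one call, along the lines of:

model = load_trained_model(weights_path)       # load once
params_array = data.values.astype(np.float32)  # shape: (8760, n_features)
total_estKwh = model.predict(params_array)     # a single batch prediction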
I am trying to load data from a Cassandra database into a Dask dataframe. I have tried querying the following with no success:
query="""SELECT * FROM document_table"""
df = man.session.execute(query)
df = dd.DataFrame(list(df))
TypeError Traceback (most recent call last)
<ipython-input-135-021507f6f2ab> in <module>()
----> 1 a = dd.DataFrame(list(df))
TypeError: __init__() missing 3 required positional arguments: 'name', 'meta', and 'divisions'
Does anybody know an easy way to load data directly from Cassandra into Dask? It is too much memory to load into pandas first.
Some problems with your code:
the line df = man.session.execute(query) presumably loads the whole data set into memory. Dask is not invoked here; it plays no part in this. Someone with knowledge of the Cassandra driver can confirm this.
list(df) produces a list of the column names of a dataframe and drops all the data
dd.DataFrame, if you read the docs, is not constructed like this.
What you probably want to do is a) make a function that returns one partition of the data, b) delay this function and call it with the various values of the partitions, and c) use dd.from_delayed to make the dask dataframe. E.g., assuming the table has a field partfield which handily has possible values 1..6 and a similar number of rows for each partition:
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def part(x):
    session = ...  # construct a Cassandra session here
    q = "SELECT * FROM document_table WHERE partfield={}".format(x)
    rows = session.execute(q)
    return pd.DataFrame(list(rows))  # each partition is a plain pandas DataFrame

parts = [part(x) for x in range(1, 7)]
df = dd.from_delayed(parts)
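Note that a dataframe built with dd.from_delayed is lazy: nothing is actually pulled from Cassandra until a computation is triggered, for example:

df.head()   # computes only the first partition
len(df)     # forces a full pass over all partitions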
I am trying to collapse rows of a dataframe based on a key. My file is big and pandas throws a memory error, so I am currently trying to use dask. I am attaching a snippet of the code here.
def f(x):
    p = x.groupby(id).agg(''.join).reset_index()
    return p

metadf = pd.DataFrame(columns=['c1','p1','pd1','d1'])
df = df.groupby(idname).apply(f, meta=metadf).reset_index().compute()
p has the same structure as metadf; the shapes of both dataframes are the same.
When I execute this, I get the following error:
"ValueError: Length mismatch: Expected axis has 6 elements, new values have 5 elements"
What am I missing here? Is there any other way to collapse rows based on a key in dask?
The task at hand is to do the following (shown here on a sample) in a dask dataframe.
Input CSV file:
key,c1,c2,c3......,cn
1,car,phone,cat,.....,kite
2,abc,def,hij,.......,pot
1,yes,no,is,.........,hello
2,hello,yes,no,......,help
Output CSV file:
key,c1,c2,c3,.......,cn
1,caryes,phoneno,catis,.....,kitehello
2,abchello,defyes,hijno,....,pothelp
In this case meta= corresponds to the output of df.groupby(...).apply(f) and not just to the output of f. Perhaps these differ in some subtle way?
I would address this by first not providing meta= at all. Dask.dataframe will give you a warning asking you to be explicit but things should hopefully progress anyway if it is able to determine the right dtypes and columns by running some sample data through your function.
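A minimal sketch of that first suggestion, reusing the placeholders from the question (df, f, idname):

# drop meta= and let dask infer the output schema by running f on sample data;
# dask.dataframe will warn that an explicit meta is preferred, but it should run
df = df.groupby(idname).apply(f).reset_index().compute()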
I have a model Sensor and a query from this model:
@sensors = Sensor.where(device_id: 4)
Output is:
Id  seq_num  temp
 5        1    40
 6        2    41
 7        3    45
First: I want to search this query result, for example to locate (find) the record with seq_num = 2
(@sensors.find(seq_num = 2))
Second: after finding a record, change its temp value and save it to the database. It is possible that all of the records get changed.
How can I do the first and the second?
If the seq_num value is unique across all sensors, then you can find a sensor in the @sensors list by its seq_num value using this code:
sensor = @sensors.detect { |s| s.seq_num == 2 }
The detect method returns the first matching element, or nil if nothing was found.
To save all the sensors after updating their temp value you can use this code:
@sensors.each { |s| s.save }
or
@sensors.each(&:save)
Only those sensors whose temp value has changed will actually be saved.
# First
@sensors = Sensor.where(device_id: 4, seq_num: 2)
# Second
Sensor.where(device_id: 4, seq_num: 2).update_all(temp: 1)
Your question is a bit unclear. You want to get a list of all sensors, and then only the one with seq_num 2. Do you still need the rest of the data in @sensors? Then you want to change the value of the one you found and save it? Also, do you need to work with the device? Assuming Sensor belongs to Device, you can query through the Device association.
If you need all sensors in @sensors, you can do this:
@sensors = Sensor.where(device_id: 4)
@sensor = @sensors.find { |sensor| sensor.seq_num == 2 }
If you only need the one sensor, you can do:
@sensor = Sensor.find_by(device_id: 4, seq_num: 2)
As for changing the data and saving it:
@sensor.temp = new_temp
@sensor.save