How to merge two dask dataframes with string indexes? - dask

I am trying to read SQL tables and perform a merge in dask (version 2.8.0). Here is a snippet of my code:
tdf = dd.read_sql_table('comments', conn_url, index_col='author', divisions=list('1234567890'))
adf = dd.read_sql_table('users', conn_url, index_col='id', divisions=list('1234567890'))
dd.merge(tdf, adf, how='left', left_index=True, right_index=True)
The dtype of both indexes is 'O'. However, I get an error:
...
...
~/continual/venv/lib/python3.8/site-packages/dask/dataframe/core.py in repartition(self, divisions, npartitions, partition_size, freq, force)
1120 return repartition_npartitions(self, npartitions)
1121 elif divisions is not None:
-> 1122 return repartition(self, divisions, force=force)
1123 elif freq is not None:
1124 return repartition_freq(self, freq=freq)
~/continual/venv/lib/python3.8/site-packages/dask/dataframe/core.py in repartition(df, divisions, force)
5656 tmp = "repartition-split-" + token
5657 out = "repartition-merge-" + token
-> 5658 dsk = repartition_divisions(
5659 df.divisions, divisions, df._name, tmp, out, force=force
5660 )
~/continual/venv/lib/python3.8/site-packages/dask/dataframe/core.py in repartition_divisions(a, b, name, out1, out2, force)
5314 ('c', 2): ('b', 3)}
5315 """
-> 5316 check_divisions(b)
5317
5318 if len(b) < 2:
~/continual/venv/lib/python3.8/site-packages/dask/dataframe/core.py in check_divisions(divisions)
5276 divisions = list(divisions)
5277 if divisions != sorted(divisions):
-> 5278 raise ValueError("New division must be sorted")
5279 if len(divisions[:-1]) != len(list(unique(divisions[:-1]))):
5280 msg = "New division must be unique, except for the last element"
ValueError: New division must be sorted
How can I achieve this join?

The divisions list is indeed not sorted: recall that your indexes are in string format, and '0' as a string sorts before '1':
# check order
sorted(list('1234567890'))
# ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], which is not the order you passed in

Related

Repartition dask dataframe while maintaining its original sequence

I am trying to repartition multiple .parquet files in order to end up with a specific number of parquet files. I have time-series data that depends on the number of observations (instead of timestamps) for each client, so I need to ensure that the partitioning will not split a series over two files. In addition, I want to preserve the order, since the labels are stored elsewhere. Here is an example of what I am trying to do:
import pandas as pd
import dask.dataframe as dd

ids = [9635, 1536, 8477, 1088, 6411, 2251]
df = pd.DataFrame({
    "partition": [0]*3 + [1]*3 + [2]*3 + [3]*3 + [4]*3 + [5]*3,
    "customer_id": [ids[0]]*3 + [ids[1]]*3 + [ids[2]]*3 + [ids[3]]*3 + [ids[4]]*3 + [ids[5]]*3,
    "x": range(18)})

# indexing on "customer_id" here
df = df.set_index("customer_id")
ddf = dd.from_pandas(df, npartitions=6)
ddf.to_parquet("my_parquets")
read_ddf = dd.read_parquet("my_parquets/*.parquet")

last_idx = [ids[-1]]
my_divisions = ids + last_idx
read_ddf.divisions = my_divisions

# Split into two equal partitions with three customers each
new_divisions = [my_divisions[0], my_divisions[3], my_divisions[5]]
new_ddf = read_ddf.repartition(divisions=new_divisions)
which raises an error:
ValueError: New division must be sorted
I have tried an alternative approach, which involves setting the "partition" column as the index and switching the index to "customer_id" later, but this sorts my entire dataframe, which is undesirable because the new sequence no longer matches the labels stored elsewhere. This is shown here:
import pandas as pd
import dask.dataframe as dd

ids = [9635, 1536, 8477, 1088, 6411, 2251]
df = pd.DataFrame({
    "partition": [0]*3 + [1]*3 + [2]*3 + [3]*3 + [4]*3 + [5]*3,
    "customer_id": [ids[0]]*3 + [ids[1]]*3 + [ids[2]]*3 + [ids[3]]*3 + [ids[4]]*3 + [ids[5]]*3,
    "x": range(18)})

# indexing on the defined "partition" instead
df = df.set_index("partition")
ddf = dd.from_pandas(df, npartitions=6)
ddf.to_parquet("my_parquets")
read_ddf = dd.read_parquet("my_parquets/*.parquet")

# my_range is equivalent to the list of partitions
my_range = [i for i in range(0, 6)]
last_idx = [my_range[-1]]
my_divisions = my_range + last_idx
read_ddf.divisions = my_divisions

new_divisions = [0, 2, 4, 5]
new_ddf = read_ddf.repartition(divisions=new_divisions)

# Need the "customer_id" as index
new_ddf = new_ddf.set_index("customer_id", drop=True)
But this sorts the dataframe by the index and messes up the structure, while I would like to keep the original order.
print("Partition 0")
print(new_ddf.get_partition(0).compute())
print("-------------------")
print("Partition 1")
print(new_ddf.get_partition(1).compute())
print("-------------------")
print("Partition 2")
print(new_ddf.get_partition(2).compute())
Partition 0
Empty DataFrame
Columns: [x]
Index: []
-------------------
Partition 1
x
customer_id
1088 9
1088 10
1088 11
1536 3
1536 4
1536 5
-------------------
Partition 2
x
customer_id
2251 15
2251 16
2251 17
6411 12
6411 13
6411 14
8477 6
8477 7
8477 8
9635 0
9635 1
9635 2
Are there any workarounds for this issue? I am aware that set_index in dask is quite expensive, but none of the approaches are currently working. Also, in my case I already have the .parquet files with the preprocessed data, so I only created the initial dataframe using pandas for demonstration purposes (it would have been much easier to specify the number of partitions in the first step if I had all the data in pandas).
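One possible workaround, given that divisions passed to repartition must always be sorted, is to bypass repartition entirely and rebuild the dask dataframe from the files yourself. This is only a sketch under stated assumptions: it assumes the files written by to_parquet keep dask's default part.*.parquet naming, that each file holds complete customer series, and that three files per output partition is the grouping you want; dd.from_delayed preserves the order of the delayed parts and never sorts the index:
import glob
import pandas as pd
import dask
import dask.dataframe as dd

# assumed layout: the six files written by ddf.to_parquet("my_parquets"), in their original order
files = sorted(glob.glob("my_parquets/part.*.parquet"))

# three files per group -> two output partitions with three complete customers each
groups = [files[i:i + 3] for i in range(0, len(files), 3)]

@dask.delayed
def load_group(paths):
    # concatenate the files of one group into a single pandas partition
    return pd.concat([pd.read_parquet(p) for p in paths])

new_ddf = dd.from_delayed([load_group(g) for g in groups])
# partition boundaries follow the file grouping; nothing gets sorted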

Finding the "GAP" in node values ? or next?

Let's say I have nodes whose values go up in steps of 10. I want to find the first GAP in the values.
Here is how I would do it in numpy :
np.where(np.diff([11, 21, 31, 51, 61, 71, 91]) > 10)[0][0] + 2
# returns 4, i.e. the missing value is 41
How would I do this in Cypher... ?
match (n) where n.val % 10 = 1
with n.val
order by val ....???
I'm using RedisGraph.
PS: if there is no GAP it should return the next value, i.e. the biggest + 10, if possible!
I'm not sure if this is the most performant solution, but you can accomplish this using a combination of collect() and list comprehensions:
MATCH (n) WHERE n.val % 10 = 1 WITH n.val AS val ORDER BY val // collect ordered vals
WITH collect(val) AS vals // combine vals into array
WITH vals, [idx IN range(0, size(vals) + 1) WHERE vals[idx + 1] - vals[idx] > 10] AS gaps // find first index with diff > 10
RETURN vals[gaps[0]] + 10 // return missing value
To additionally return the next-biggest value if no gaps are found, change the RETURN clause to use a CASE statement:
RETURN CASE size(gaps) WHEN 0 THEN vals[-1] + 10 ELSE vals[gaps[0]] + 10 END
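For readers more comfortable with Python than Cypher, the same idea (collect the sorted values, look for the first difference greater than 10, otherwise fall back to max + 10) can be sketched like this; first_gap is just an illustrative helper name, not part of RedisGraph:
def first_gap(vals, step=10):
    """Return the first missing value in a step-spaced sequence,
    or max + step if there is no gap."""
    vals = sorted(vals)
    gaps = [i for i in range(len(vals) - 1) if vals[i + 1] - vals[i] > step]
    return vals[gaps[0]] + step if gaps else vals[-1] + step

print(first_gap([11, 21, 31, 51, 61, 71, 91]))  # 41 (the first gap)
print(first_gap([11, 21, 31]))                  # 41 (no gap, so biggest + 10)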

sklearn OneClassSVM KeyError

My dataset is a set of system calls for both malware and benign samples. I preprocessed it and now it looks like this:
NtQueryPerformanceCounter
NtProtectVirtualMemory
NtProtectVirtualMemory
NtQuerySystemInformation
NtQueryVirtualMemory
NtQueryVirtualMemory
NtProtectVirtualMemory
NtOpenKey
NtOpenKey
NtOpenKey
NtQuerySecurityAttributesToken
NtQuerySecurityAttributesToken
NtQuerySystemInformation
NtQuerySystemInformation
NtAllocateVirtualMemory
NtFreeVirtualMemory
Now I'm using TF-IDF to extract the features, with n-grams to build sequences of them:
from __future__ import print_function
import numpy as np
import pandas as pd
from time import time
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import shuffle
from sklearn.svm import OneClassSVM
nGRAM1 = 8
nGRAM2 = 10
weight = 4
main_corpus_MAL = []
main_corpus_target_MAL = []
main_corpus_BEN = []
main_corpus_target_BEN = []
my_categories = ['benign', 'malware']
# feeding corpus the testing data
print("Loading system call database for categories:")
print(my_categories if my_categories else "all")
import glob
import os
malCOUNT = 0
benCOUNT = 0
for filename in glob.glob(os.path.join('C:\\Users\\alika\\Documents\\testingSVM\\sysMAL', '*.txt')):
    fMAL = open(filename, "r")
    aggregate = ""
    for line in fMAL:
        linea = line[:(len(line)-1)]
        aggregate += " " + linea
    main_corpus_MAL.append(aggregate)
    main_corpus_target_MAL.append(1)
    malCOUNT += 1
for filename in glob.glob(os.path.join('C:\\Users\\alika\\Documents\\testingSVM\\sysBEN', '*.txt')):
    fBEN = open(filename, "r")
    aggregate = ""
    for line in fBEN:
        linea = line[:(len(line) - 1)]
        aggregate += " " + linea
    main_corpus_BEN.append(aggregate)
    main_corpus_target_BEN.append(0)
    benCOUNT += 1
# weight as determined in the top of the code
train_corpus = main_corpus_BEN[:(weight*len(main_corpus_BEN)//(weight+1))]
train_corpus_target = main_corpus_target_BEN[:(weight*len(main_corpus_BEN)//(weight+1))]
test_corpus = main_corpus_MAL[(len(main_corpus_MAL)-(len(main_corpus_MAL)//(weight+1))):]
test_corpus_target = main_corpus_target_MAL[(len(main_corpus_MAL)-len(main_corpus_MAL)//(weight+1)):]
def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6
# size of datasets
train_corpus_size_mb = size_mb(train_corpus)
test_corpus_size_mb = size_mb(test_corpus)
print("%d documents - %0.3fMB (training set)" % (
    len(train_corpus_target), train_corpus_size_mb))
print("%d documents - %0.3fMB (test set)" % (
    len(test_corpus_target), test_corpus_size_mb))
print("%d categories" % len(my_categories))
print()
print("Benign Traces: "+str(benCOUNT)+" traces")
print("Malicious Traces: "+str(malCOUNT)+" traces")
print()
print("Extracting features from the training data using a sparse vectorizer...")
t0 = time()
vectorizer = TfidfVectorizer(ngram_range=(nGRAM1, nGRAM2), min_df=1, use_idf=True, smooth_idf=True) ##############
analyze = vectorizer.build_analyzer()
X_train = vectorizer.fit_transform(train_corpus)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, train_corpus_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_train.shape)
print()
print("Extracting features from the test data using the same vectorizer...")
t0 = time()
X_test = vectorizer.transform(test_corpus)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, test_corpus_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_test.shape)
print()
The output is:
Loading system call database for categories:
['benign', 'malware']
177 documents - 45.926MB (training set)
44 documents - 12.982MB (test set)
2 categories
Benign Traces: 72 traces
Malicious Traces: 150 traces
Extracting features from the training data using a sparse vectorizer...
done in 7.831695s at 5.864MB/s
n_samples: 177, n_features: 603170
Extracting features from the test data using the same vectorizer...
done in 1.624100s at 7.993MB/s
n_samples: 44, n_features: 603170
Now for the learning section I'm trying to use sklearn OneClassSVM:
print("==================\n")
print("Training: ")
classifier = OneClassSVM(kernel='linear', gamma='auto')
classifier.fit(X_test)
fraud_pred = classifier.predict(X_test)
unique, counts = np.unique(fraud_pred, return_counts=True)
print (np.asarray((unique, counts)).T)
fraud_pred = pd.DataFrame(fraud_pred)
fraud_pred= fraud_pred.rename(columns={0: 'prediction'})
main_corpus_target = pd.DataFrame(main_corpus_target)
main_corpus_target= main_corpus_target.rename(columns={0: 'Category'})
This is the output of fraud_pred and main_corpus_target:
prediction
0 1
1 -1
2 1
3 1
4 1
5 -1
6 1
7 -1
...
30 rows * 1 column
====================
Category
0 1
1 1
2 1
3 1
4 1
...
217 0
218 0
219 0
220 0
221 0
222 rows * 1 column
But when I try to calculate TP, TN, FP, FN:
## Performance check of the model
TP = FN = FP = TN = 0
for j in range(len(main_corpus_target)):
    if main_corpus_target['Category'][j] == 0 and fraud_pred['prediction'][j] == 1:
        TP = TP + 1
    elif main_corpus_target['Category'][j] == 0 and fraud_pred['prediction'][j] == -1:
        FN = FN + 1
    elif main_corpus_target['Category'][j] == 1 and fraud_pred['prediction'][j] == 1:
        FP = FP + 1
    else:
        TN = TN + 1
print(TP, FN, FP, TN)
I get this error:
KeyError Traceback (most recent call last)
<ipython-input-32-1046cc75ba83> in <module>
7 elif main_corpus_target['Category'][j]== 0 and fraud_pred['prediction'][j] == -1:
8 FN = FN+1
----> 9 elif main_corpus_target['Category'][j]== 1 and fraud_pred['prediction'][j] == 1:
10 FP = FP+1
11 else:
c:\users\alika\appdata\local\programs\python\python36\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
1069 key = com.apply_if_callable(key, self)
1070 try:
-> 1071 result = self.index.get_value(self, key)
1072
1073 if not is_scalar(result):
c:\users\alika\appdata\local\programs\python\python36\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
4728 k = self._convert_scalar_indexer(k, kind="getitem")
4729 try:
-> 4730 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
4731 except KeyError as e1:
4732 if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 30
1) I know the error occurs because the loop tries to access a key that isn't in the index, but I can't just insert some numbers into fraud_pred to handle this issue. Any suggestions?
2) Am I doing anything wrong that makes them not match?
3) I want to compare the results to other one-class classification algorithms. Given my setup, which ones would be best to use?
Edit: Before calculating the metrics:
You could change your fit and predict functions to:
fraud_pred = classifier.fit_predict(X_test)
Also, your main_corpus_target and X_test should have the same length. Can you post the code where you create main_corpus_target, please?
It's created right after benCOUNT += 1:
main_corpus_target = main_corpus_target_MAL
main_corpus_target.extend(main_corpus_target_BEN)
This means that you are creating a main_corpus_target that includes MAL and BEN, and the error you get is:
ValueError: Found input variables with inconsistent numbers of samples: [30, 222]
The number of samples of fraud_pred is 30, so you should evaluate them with an array of 30. main_corpus_target contains 222.
Looking at your code, I see that you want to evaluate X_test, which comes from test_corpus (X_test = vectorizer.transform(test_corpus)). It would be better to compare your results to test_corpus_target, which is the target variable of that test subset and also has a length of 30.
These two lines that you have should output the same length:
test_corpus = main_corpus_MAL[(len(main_corpus_MAL)-(len(main_corpus_MAL)//(weight+1))):]
test_corpus_target = main_corpus_target_MAL[(len(main_corpus_MAL)-len(main_corpus_MAL)//(weight+1)):]
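In other words, a quick sanity check along these lines (just a sketch) should pass before computing any metrics:
# the prediction and the target of the evaluated subset must line up one-to-one
assert len(test_corpus) == len(test_corpus_target)
assert len(fraud_pred) == len(test_corpus_target)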
May I ask why you are calculating the TP, TN, ... by yourself?
You have a faster option:
Transform the fraud_pred series, replacing -1 with 0.
Use the confusion_matrix function that sklearn offers.
Use ravel to extract the values of the confusion matrix.
An example, after transforming the -1 to 0:
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(fraud_pred, main_corpus_target['Category'].values).ravel()
Also, if you are using the last pandas version:
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(fraud_pred, main_corpus_target['Category'].to_numpy()).ravel()
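Putting those pieces together, a minimal sketch might look like this (it follows the earlier advice of evaluating against test_corpus_target; adapt the variable names to your notebook):
import numpy as np
from sklearn.metrics import confusion_matrix

fraud_pred = classifier.fit_predict(X_test)      # OneClassSVM returns -1 (outlier) or 1 (inlier)
pred_binary = np.where(fraud_pred == -1, 0, 1)   # map -1 -> 0 so it matches the 0/1 labels

# confusion_matrix expects y_true first, then y_pred, with equal lengths
tn, fp, fn, tp = confusion_matrix(test_corpus_target, pred_binary).ravel()
print(tp, fn, fp, tn)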

How does this binary encoder function work?

I'm trying to understand the logic behind this binary encoder.
It automatically takes categorical variables and dummy-codes them (similar to one-hot encoding in sklearn), but reduces the number of output columns to roughly the log2 of the number of unique values.
Basically, when I used this library, I noticed that my dummy variables are limited to only a few of the unique values. Upon further investigation I noticed this @staticmethod, which takes the log2 of the number of unique values in a categorical variable.
My question is WHY? I realize that this reduces the dimensionality of the output data, but what is the logic behind doing this? How does taking the log2 determine how many digits are needed to represent the data?
def calc_required_digits(X, col):
    """
    figure out how many digits we need to represent the classes present
    """
    return int(np.ceil(np.log2(len(X[col].unique()))))
Full source code:
"""Binary encoding"""
import copy
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from category_encoders.ordinal import OrdinalEncoder
from category_encoders.utils import get_obj_cols, convert_input
__author__ = 'willmcginnis'
class BinaryEncoder(BaseEstimator, TransformerMixin):
    """Binary encoding for categorical variables, similar to onehot, but stores categories as binary bitstrings.

    Parameters
    ----------
    verbose: int
        integer indicating verbosity of output. 0 for none.
    cols: list
        a list of columns to encode, if None, all string columns will be encoded
    drop_invariant: bool
        boolean for whether or not to drop columns with 0 variance
    return_df: bool
        boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array)
    impute_missing: bool
        boolean for whether or not to apply the logic for handle_unknown, will be deprecated in the future.
    handle_unknown: str
        options are 'error', 'ignore' and 'impute', defaults to 'impute', which will impute the category -1. Warning: if
        impute is used, an extra column will be added in if the transform matrix has unknown categories. This can cause
        unexpected changes in dimension in some cases.

    Example
    -------
    >>> from category_encoders import *
    >>> import pandas as pd
    >>> from sklearn.datasets import load_boston
    >>> bunch = load_boston()
    >>> y = bunch.target
    >>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    >>> enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X, y)
    >>> numeric_dataset = enc.transform(X)
    >>> print(numeric_dataset.info())
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 506 entries, 0 to 505
    Data columns (total 16 columns):
    CHAS_0     506 non-null int64
    RAD_0      506 non-null int64
    RAD_1      506 non-null int64
    RAD_2      506 non-null int64
    RAD_3      506 non-null int64
    CRIM       506 non-null float64
    ZN         506 non-null float64
    INDUS      506 non-null float64
    NOX        506 non-null float64
    RM         506 non-null float64
    AGE        506 non-null float64
    DIS        506 non-null float64
    TAX        506 non-null float64
    PTRATIO    506 non-null float64
    B          506 non-null float64
    LSTAT      506 non-null float64
    dtypes: float64(11), int64(5)
    memory usage: 63.3 KB
    None
    """

    def __init__(self, verbose=0, cols=None, drop_invariant=False, return_df=True, impute_missing=True, handle_unknown='impute'):
        self.return_df = return_df
        self.drop_invariant = drop_invariant
        self.drop_cols = []
        self.verbose = verbose
        self.impute_missing = impute_missing
        self.handle_unknown = handle_unknown
        self.cols = cols
        self.ordinal_encoder = None
        self._dim = None
        self.digits_per_col = {}

    def fit(self, X, y=None, **kwargs):
        """Fit encoder according to X and y.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : encoder
            Returns self.
        """
        # if the input dataset isn't already a dataframe, convert it to one (using default column names)
        # first check the type
        X = convert_input(X)

        self._dim = X.shape[1]

        # if columns aren't passed, just use every string column
        if self.cols is None:
            self.cols = get_obj_cols(X)

        # train an ordinal pre-encoder
        self.ordinal_encoder = OrdinalEncoder(
            verbose=self.verbose,
            cols=self.cols,
            impute_missing=self.impute_missing,
            handle_unknown=self.handle_unknown
        )
        self.ordinal_encoder = self.ordinal_encoder.fit(X)

        for col in self.cols:
            self.digits_per_col[col] = self.calc_required_digits(X, col)

        # drop all output columns with 0 variance.
        if self.drop_invariant:
            self.drop_cols = []
            X_temp = self.transform(X)
            self.drop_cols = [x for x in X_temp.columns.values if X_temp[x].var() <= 10e-5]

        return self

    def transform(self, X):
        """Perform the transformation to new categorical data.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]

        Returns
        -------
        p : array, shape = [n_samples, n_numeric + N]
            Transformed values with encoding applied.
        """
        if self._dim is None:
            raise ValueError('Must train encoder before it can be used to transform data.')

        # first check the type
        X = convert_input(X)

        # then make sure that it is the right size
        if X.shape[1] != self._dim:
            raise ValueError('Unexpected input dimension %d, expected %d' % (X.shape[1], self._dim, ))

        if not self.cols:
            return X

        X = self.ordinal_encoder.transform(X)

        X = self.binary(X, cols=self.cols)

        if self.drop_invariant:
            for col in self.drop_cols:
                X.drop(col, 1, inplace=True)

        if self.return_df:
            return X
        else:
            return X.values

    def binary(self, X_in, cols=None):
        """
        Binary encoding encodes the integers as binary code with one column per digit.
        """
        X = X_in.copy(deep=True)

        if cols is None:
            cols = X.columns.values
            pass_thru = []
        else:
            pass_thru = [col for col in X.columns.values if col not in cols]

        bin_cols = []
        for col in cols:
            # get how many digits we need to represent the classes present
            digits = self.digits_per_col[col]

            # map the ordinal column into a list of these digits, of length digits
            X[col] = X[col].map(lambda x: self.col_transform(x, digits))

            for dig in range(digits):
                X[str(col) + '_%d' % (dig, )] = X[col].map(lambda r: int(r[dig]) if r is not None else None)
                bin_cols.append(str(col) + '_%d' % (dig, ))

        X = X.reindex(columns=bin_cols + pass_thru)

        return X

    @staticmethod
    def calc_required_digits(X, col):
        """
        figure out how many digits we need to represent the classes present
        """
        return int(np.ceil(np.log2(len(X[col].unique()))))

    @staticmethod
    def col_transform(col, digits):
        """
        The lambda body to transform the column values
        """
        if col is None or float(col) < 0.0:
            return None
        else:
            col = list("{0:b}".format(int(col)))
            if len(col) == digits:
                return col
            else:
                return [0 for _ in range(digits - len(col))] + col
My question is WHY? I realize that this reduces the dimensionality of
the output data, but what is the logic behind doing this?
Basically, the point of categorical encoding is to make your algorithm able to deal with categorical features. Several methods are available for doing it, including binary encoding. Its logic is actually close to the logic of one-hot encoding (OHE), if you understood that one.
For binary encoding, each unique label in your categorical vector is randomly associated with a number between 0 and (the number of unique labels - 1). You then write this number in base 2 and "transcribe" it as 0s and 1s across the newly created columns.
As an example, let's say your dataset has three different labels: 'A', 'B' & 'C'.
The following correspondence is randomly built:
'A' -> 1 -> 01;
'B' -> 2 -> 10;
'C' -> 0 -> 00.
Therefore, an example of encoding of a given dataset is:
index  my_category  enc_category_0  enc_category_1
0      A            1               0
1      B            0               1
2      C            0               0
3      A            1               0
Regarding its utility, as you said, it reduces the dimensionality. Besides, I guess it helps avoid having as many zeros in the encoded columns as with OHE. Here is an interesting post: https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
How does taking the log2 determine how many digits are needed to represent the data?
If you understood the working principle, you understand the use of the log2: ceil(log2(N)) is the number of binary digits needed to distinguish N different values. Example: ceil(log2(10)) = ceil(3.32) = 4, so 4 digits are enough to binary-encode 10 classes.
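A tiny illustration of both points, as a sketch that just reuses the ceil(log2(...)) rule and the "{0:b}" formatting from the source above (the A/B/C mapping is the example's, nothing from the real library is called):
import numpy as np

labels = ['A', 'B', 'C']                        # three unique labels
digits = int(np.ceil(np.log2(len(labels))))     # ceil(log2(3)) = 2 columns are enough

# the random label -> ordinal mapping from the example above: A -> 1, B -> 2, C -> 0
for label, ordinal in [('A', 1), ('B', 2), ('C', 0)]:
    bits = "{0:b}".format(ordinal).zfill(digits)
    print(label, ordinal, bits)                 # prints: A 1 01 / B 2 10 / C 0 00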
For more info about the implementation and code example: http://contrib.scikit-learn.org/categorical-encoding/_modules/category_encoders/binary.html#BinaryEncoder
Hope I was clear,
Tchau

Recurrence relation - equal roots of characteristic equation

I have the following problem:
Solve the following recurrence relation, simplifying your final answer
using 'O' notation.
f(0)=3
f(1)=12
f(n)=6f(n-1)-9f(n-2)
We know this is a homogeneous 2nd-order relation, so we write the characteristic equation a^2 - 6a + 9 = 0, whose roots are a1 = a2 = 3.
The problem is that when I substitute these values I get:
f(n) = c1*3^n + c2*3^n
and using the 2 initial conditions I have:
f(0) = c1 + c2 = 3
f(1) = 3(c1 + c2) = 12
which means there are no values of c1 and c2 for which both relations hold.
Am I doing something wrong? Is the way it should be solved different when it comes to identical roots for the characteristic equation?
You can't solve it this way, because your matrix A is not diagonalizable.
However, here is what you get if you use Jordan's normal form instead:
f(n) = 3^{n-1}(3n + 9)
The Jordan matrix and the basis (with notation from wikipedia + Octave) is:
J := [3,1;0,3]
P := [3,4;1,1]
such that PJP^{-1} = A, where
A := [6,-9;1,0]
is your recurrence matrix. Furthermore, the Jordan matrix is almost as good as a diagonal matrix for computing powers:
J^n = 3^(n-1) * [3,n;0,3].
The recurrence is then:
[f(n+1); f(n)] = A^n [12; 3] = P J^n P^{-1} [12; 3] = (<whatever>, 3^{n-1}(3n+9)).
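A quick matrix-level check of these identities in Python/numpy (only a sketch; A, P, J are exactly the matrices defined above):
import numpy as np

A = np.array([[6, -9], [1, 0]])   # recurrence matrix: [f(n+1), f(n)] = A @ [f(n), f(n-1)]
P = np.array([[3, 4], [1, 1]])    # Jordan basis
J = np.array([[3, 1], [0, 3]])    # Jordan block for the double eigenvalue 3

# P J P^{-1} reproduces A
assert np.allclose(P @ J @ np.linalg.inv(P), A)

# J^n = 3^(n-1) * [[3, n], [0, 3]]
n = 7
assert np.allclose(np.linalg.matrix_power(J, n), 3 ** (n - 1) * np.array([[3, n], [0, 3]]))

# the second component of A^n @ [12, 3] is f(n) = 3^(n-1) * (3n + 9)
assert np.allclose((np.linalg.matrix_power(A, n) @ np.array([12, 3]))[1], 3 ** (n - 1) * (3 * n + 9))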
Here is a quick numerical check (in Scala, but you can use whatever you like, Octave or anything else):
scala> def f(n: Int): Int = { if (n == 0) 3 else if (n == 1) 12 else (6 * f(n-1) - 9 * f(n-2)) }
f: (n: Int)Int
scala> for (i <- 0 until 20) println(f(i))
3
12
45
162
567
1944
6561
21870
72171
236196
767637
2480058
7971615
25509168
81310473
258280326
817887699
^
scala> def explicit(n: Int): Int = (Math.pow(3, n -1) * (3 * n + 9)).toInt
explicit: (n: Int)Int
scala> for (i <- 0 until 20) println(explicit(i))
3
12
45
162
567
1944
6561
21870
72171
236196
767637
2480058
7971615
25509168
81310473
258280326
817887699
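For what it's worth, the standard textbook route for a repeated characteristic root r is to use the ansatz f(n) = (c1 + c2*n)*r^n rather than c1*r^n + c2*r^n. With r = 3, the initial conditions give c1 = 3 from f(0) and 3*(c1 + c2) = 12, i.e. c2 = 1, so f(n) = (n + 3)*3^n = 3^{n-1}*(3n + 9), the same closed form as above, which is O(n*3^n).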
