Is my Statement and Branch coverage correct?

def score_report(scores):
    '''print a report on exam scores
    args: scores - a list of numbers representing exam scores
    returns: nothing
    '''
    sum = 0.0
    for score in scores:
        if score > 0.0:
            sum += score
    if len(scores) > 0:
        mean = sum / float(len(scores))
        print("the mean score is {0}".format(mean))
        if (mean > 50):
            print("on average, people passed. Yay!")
    else:
        print("No scores were found")
If my understanding is correct, in order to achieve 100% statement coverage but not 100% branch coverage for score_report, my test inputs would be [100] and [].
Is this correct? The reason I ask is that I thought 100% statement coverage was supposed to be achieved with one input that covers all statements, and I've used two. I'm not sure if this is my mistake.
Furthermore, the smallest set of test inputs to this method that achieves 100% branch coverage would be [100], [], [-1].
Is this also correct?
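One way to check claims like this empirically (a sketch, assuming the third-party coverage.py package is installed and score_report lives in a measured module) is to run the inputs under branch coverage and read the report:
import coverage

cov = coverage.Coverage(branch=True)
cov.start()
score_report([100])   # enough, together with [], for statement coverage
score_report([])      # takes the len(scores) > 0 false branch
score_report([-1])    # takes the score > 0.0 and mean > 50 false branches
cov.stop()
cov.report(show_missing=True)  # the Missing column lists uncovered lines/branches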


In the stacking model, I want to see the recall and precision results

In the stacking model, I want to see the recall and precision results. I have tried many methods and have not found them. I found recall and precision for another model, but I am stuck with the stacking model. A little help would go a long way.
estimator = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('dec_tree', dec_tree),
    ('knn', knn),
    ('xgb', xgb),
    ('ext', ext),
    ('grad', grad),
    ('hist', hist)]
# build stack model
stack_model = StackingClassifier(
    estimators=estimator, final_estimator=LogisticRegression())
# train stack model
stack_model.fit(x_train, y_train)
# make predictions
y_train_pred = stack_model.predict(x_train)
y_test_pred = stack_model.predict(x_test)
# training set performance
stack_model_train_accuracy = accuracy_score(y_train, y_train_pred)
stack_model_train_f1 = f1_score(y_train, y_train_pred, average='weighted')
# testing set performance
stack_model_test_accuracy = accuracy_score(y_test, y_test_pred)
stack_model_test_f1 = f1_score(y_test, y_test_pred, average='weighted')
# print
print('Model Performance For Training Set')
print('- Accuracy: %s' % stack_model_train_accuracy)
print('- f1: %s' % stack_model_train_f1)
print('______________________________________')
print('Model Performance For Testing Set')
print('- Accuracy: %s' % stack_model_test_accuracy)
print('- f1: %s' % stack_model_test_f1)
Up to here it is working, but I need the recall and precision. If I compute them the same way I computed the accuracy and F1 score, the result is wrong, and if I use classification_report I get an error too.
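A likely cause: for multiclass targets, precision_score and recall_score need an explicit average argument, just like the f1_score calls above; with the default (binary) setting they raise an error or give misleading numbers. A minimal sketch in the same style as the code above, reusing its y_test / y_test_pred arrays:
from sklearn.metrics import precision_score, recall_score, classification_report

# weighted averages, consistent with the f1_score(average='weighted') calls above
stack_model_test_precision = precision_score(y_test, y_test_pred, average='weighted')
stack_model_test_recall = recall_score(y_test, y_test_pred, average='weighted')
print('- Precision: %s' % stack_model_test_precision)
print('- Recall: %s' % stack_model_test_recall)

# or get per-class precision/recall/f1 in one table
print(classification_report(y_test, y_test_pred))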

Python Seasonality Detection

What is the best way to detect seasonality in a signal (time series) in Python? I want to provide the algorithm with the signal, and the output should be 1 if seasonality exists and 0 if it does not.
Here is a simple seasonality-detection function I wrote. I hope it helps for basic usage, though I would not suggest it for complicated problems:
import numpy as np

def check_repetition(arr, limit, index_start, index_end):
    """
    Checks repetition in data so that we can apply de-noising.
    Returns the number of entries per cycle, or 0 if no repetition is found.
    """
    length = index_start
    # length is the period length we are currently checking;
    # compute how many full periods of that length fit in the data
    n_periods = int(len(arr) / length)
    for i in range(0, n_periods - 1):
        # difference between two consecutive periods of this length
        condition = np.array(arr[i * length:(i + 1) * length]) - \
                    np.array(arr[(i + 1) * length:(i + 2) * length])
        condition = np.sum([abs(number) for number in condition])
        # if the difference between seasons is not smaller than the limit
        if condition >= limit:
            # increase the length to check, if it is still within bounds
            if length + 1 <= index_end:
                #print("Checked for length:" + str(length))
                return check_repetition(arr, limit, length + 1, index_end)
            # if not, then no more computations are needed
            else:
                return 0
        # if it passed the whole loop, return the number of entries per cycle
        if i == n_periods - 2:
            return length
    # if nothing worked
    return 0
This returns the seasonality length (the number of entries per cycle). You can play with the search range, starting with a candidate seasonality of half the array length and going down to a small value, or the opposite. The limit parameter adds some noise tolerance: it bounds how much two consecutive periods may differ before the candidate length is rejected.
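A quick usage sketch (my own illustration; the limit=2 noise threshold is arbitrary), including the 1/0 flag the question asks for:
signal = [1, 2, 3, 4] * 5  # a clean signal with period 4

period = check_repetition(signal, limit=2, index_start=2, index_end=len(signal) // 2)
has_seasonality = 1 if period > 0 else 0
print(period, has_seasonality)  # 4 1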

Minimize the maximum continuous subarray in an array of 0/1

An algorithm question: a binary array of 0/1 is given.
In one operation I can flip any array[index], i.e. 0->1 or 1->0.
The aim is to minimize the maximum length of continuous 1's or 0's using at most k flips.
E.g. if the array is 11111 and k=1, the best is to make the array 11011,
and the minimized value of the maximum run of continuous 1's or 0's is 2.
For 111110111111 and k=3 the answer is 2.
I tried brute force (trying various flip positions) but it is not efficient.
I think greedy, but I cannot figure it out exactly.
Can you please help me with an algorithm, O(n) or similar?
A solution can be devised by reading each bit in order and recording the size of each continuous group of 1's into a list A.
Once you are done filling A, you can follow the algorithm given by the pseudocode below:
result = N
for i = 1 to N:
    flips_needed = 0
    for a in A:
        flips_needed += <number of flips needed to make sure the largest group remaining in a is of size at most i>
    if k >= flips_needed:
        result = i
        break
return result
N is the number of bits in the entire initial sequence.
The algorithm above works by dividing the groups of 1's into sizes of at most i. Whenever doing that requires at most k flips, we have the result we are looking for, since i starts from 1 and goes up (i.e. the first time we find flips_needed <= k, the groups of 1's are as small as they can get).
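A direct Python sketch of that pseudocode (my own illustration, not the answerer's code). For the placeholder above: splitting a group of length a into pieces of size at most i takes a // (i + 1) flips, since one flip is needed after every i + 1 consecutive ones. Note that, like the answer, this only bounds the runs of 1's:
def minimize_max_run(bits, k):
    # record the size of each continuous group of 1's into A
    A, run = [], 0
    for b in bits:
        if b == 1:
            run += 1
        elif run:
            A.append(run)
            run = 0
    if run:
        A.append(run)

    n = len(bits)
    for i in range(1, n + 1):
        # flips needed so that no group of 1's is longer than i
        flips_needed = sum(a // (i + 1) for a in A)
        if flips_needed <= k:
            return i
    return n

print(minimize_max_run([1, 1, 1, 1, 1], k=1))  # 2, matching the 11111 example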

What is the meaning of the GridSearchCV best_score_ attribute? (the value is different from the mean of the cross validation array)

I'm confused by the results; probably I'm not getting the concepts of cross-validation and GridSearchCV right. I followed the logic behind this post:
https://randomforests.wordpress.com/2014/02/02/basics-of-k-fold-cross-validation-and-gridsearchcv-in-scikit-learn/
argd = CommandLineParser(argv)
folder, fname = argd['dir'], argd['fname']
df = pd.read_csv('../../'+folder+'/Results/'+fname, sep=";")
explanatory_variable_columns = set(df.columns.values)
response_variable_column = df['A']
explanatory_variable_columns.remove('A')
y = np.array([1 if e else 0 for e in response_variable_column])
X = df[list(explanatory_variable_columns)].as_matrix()
kf_total = KFold(len(X), n_folds=5, indices=True, shuffle=True, random_state=4)
dt = DecisionTreeClassifier(criterion='entropy')
min_samples_split_range = [x for x in range(1, 20)]
dtgs = GridSearchCV(estimator=dt, param_grid=dict(min_samples_split=min_samples_split_range), n_jobs=1)
scores = [dtgs.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf_total]
# SAME AS DOING: cross_validation.cross_val_score(dtgs, X, y, cv=kf_total, n_jobs=1)
print scores
print np.mean(scores)
print dtgs.best_score_
RESULTS OBTAINED:
# score [0.81818181818181823, 0.78181818181818186, 0.7592592592592593, 0.7592592592592593, 0.72222222222222221]
# mean score 0.768
# .best_score_ 0.683486238532
ADDITIONAL NOTE:
I ran it using another combination of the explanatory variables (using only some of them) and I got the inverse problem. Now the .best_score_ is higher than all the values in the cross validation array.
# score [0.74545454545454548, 0.70909090909090911, 0.79629629629629628, 0.7407407407407407, 0.64814814814814814]
# mean score 0.728
# .best_score_ 0.802752293578
The code is confusing several things.
dtgs.fit(X[train], y[train]) runs an internal 3-fold cross-validation (the GridSearchCV default in that scikit-learn version) for every parameter combination from param_grid, producing a grid of 19 results (one per min_samples_split value), which you can inspect via dtgs.grid_scores_.
[dtgs.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf_total] therefore fits the grid search five times and scores each fitted model on its held-out outer fold, so the result is the array of scores from the 5-fold outer validation.
And when you call dtgs.best_score_, you get the best mean score from the grid of internal 3-fold validation results for the last fit (the fifth of the five).
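To see where best_score_ comes from, print the internal grid for the last fit (a sketch against the older scikit-learn API the question uses; grid_scores_ was replaced by cv_results_ in later releases):
# best_score_ is the best MEAN internal-CV score over the parameter grid,
# computed only on the training portion of the LAST outer fold
print dtgs.best_score_
print dtgs.best_params_

# the full grid behind best_score_: one mean score per min_samples_split value
for params, mean_score, fold_scores in dtgs.grid_scores_:
    print params, mean_score
Because each entry is a mean over internal folds of a smaller training subset, there is no reason for best_score_ to match the mean of the five outer scores; it can come out either lower or higher, which is exactly what your two runs show.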

Estimating change of a cyclic boolean variable

We have a boolean variable X which is either true or false and alternates at each time step with probability p. I.e. if p is 0.2, X alternates once every 5 time steps on average. We also have a timeline and observations of the value of this variable at various non-uniformly sampled points in time.
How would one learn, from these observations, the probability that X has alternated/changed value by time t+n, where t is the time X was observed and n is some time in the future, given that p is unknown and we only have observations of the value of X at previous times? Note that I count changing from true to false and back to true again as changing value twice.
I'm going to approach this problem as if it were on a test.
First, let's name the variables.
Bx is the value of the boolean variable after x opportunities to flip (and B0 is the initial state). P is the chance of changing to a different value at each opportunity.
Given that each flip opportunity is independent of the others (there is, for example, no minimum number of opportunities between flips), the math is extremely simple. Since events are not affected by the events of the past, we can consolidate them into a single computation, which works best when considering Bx not as a boolean value but as itself a probability.
Here is the domain of the computations we will use: Bx is a probability (with a value between 0 and 1 inclusive) representing the likelihood of truth. P is a probability (with a value between 0 and 1 inclusive) representing the likelihood of flipping at any given opportunity.
The probability of falseness, 1 - Bx, and the probability of not flipping, 1 - P, are probabilistic identities which should be quite intuitive.
Assuming these simple rules, the general probability of truth of the boolean value is given by the recursive formula Bx+1 = Bx*(1-P) + (1-Bx)*P.
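One step worth spelling out (my own derivation, easy to verify by substitution): subtracting 0.5 from both sides turns the recursion into a geometric sequence, Bx+1 - 0.5 = (1 - 2P)*(Bx - 0.5), which gives the closed form
Bx = 0.5 + (B0 - 0.5)*(1 - 2P)^x
For P = 0.2 and B0 = 1 this gives B8 = 0.5 + 0.5*0.6^8 = 0.508398, matching the code below, and it makes both the convergence to 0.5 and the oscillation for P > 0.5 (where 1 - 2P is negative) immediate.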
Code (in C++, because it's my favorite language and you didn't tag one):
#include <iostream>

int main()
{
    int max_opportunities = 8;     // Total number of chances to flip.
    float flip_chance = 0.2;       // Probability of flipping at each opportunity.
    float probability_true = 1.0;  // Starting probability of truth.
                                   // 1.0 is "definitely true" and 0.0 is
                                   // "definitely false", but you can extend this
                                   // to situations where the initial value is not
                                   // certain (say, 0.8 = 80% probably true) and
                                   // it will work just as well.
    for (int opportunities = 0; opportunities < max_opportunities; ++opportunities)
    {
        probability_true = probability_true * (1 - flip_chance) +
                           (1 - probability_true) * flip_chance;
    }
    std::cout << probability_true << std::endl;  // 0.508398 for these inputs
}
Here is that code on ideone (the answer for P=0.2 and B0=1 and x=8 is B8=0.508398). As you would expect, given that the value becomes less and less predictable as more opportunities pass, the final probability approaches Bx=0.5. You will also observe oscillation between more-likely-true and less-likely-true if the chance of flipping is high; for instance, with P=0.8, the sequence begins B = {1.0, 0.2, 0.68, 0.392, 0.5648, 0.46112, ...}.
For a more complete solution that will work for more complicated scenarios, consider using a stochastic matrix (page 7 has an example).
