Do math on string count (and text parsing with awk) - parsing

I have a 4 column file (input.file) with a header:
something1 something2 A B
followed by many 4-column rows with the same format (e.g.):
ID_00001 1 0 0
ID_00002 0 1 0
ID_00003 1 0 0
ID_00004 0 0 1
ID_00005 0 1 0
ID_00006 0 1 0
ID_00007 0 0 0
ID_00008 1 0 0
Where "1 0 0" is representative of "AA", "0 1 0" means "AB", and "0 0 1" means "BB"
First, I would like to create a 5th column to identify these representations:
ID_00001 1 0 0 AA
ID_00002 0 1 0 AB
ID_00003 1 0 0 AA
ID_00004 0 0 1 BB
ID_00005 0 1 0 AB
ID_00006 0 1 0 AB
ID_00007 0 0 0 no data
ID_00008 1 0 0 AA
Note that the A's and B's need to be parsed from columns 3 and 4 of the header row, as they are not always A and B.
Next, I want to "do math" on the counts for (the new) column 5 as follows:
(2BB + AB) / 2(AA + AB + BB)
Using the example, the math would give:
(2(1) + 3) / 2(3 + 3 + 1) = 5/14 = 0.357
which I would like to append to the end of the desired output file (output.file):
ID_00001 1 0 0 AA
ID_00002 0 1 0 AB
ID_00003 1 0 0 AA
ID_00004 0 0 1 BB
ID_00005 0 1 0 AB
ID_00006 0 1 0 AB
ID_00007 0 0 0 no data
ID_00008 1 0 0 AA
B_freq = 0.357
So far I have this:
awk '{ if ($2 = 1) {print $0, $5="AA"} \
else if($3 = 1) {print $0, $5="AB"} \
else if($4 = 1) {print $0, $5="BB"} \
else {print$0, $5="no data"}}' input.file > output.file
Obviously, I was not able to figure out how to parse the info from row 1 (the header row, edited out "column 1"), much less do the math.
Thanks guys!

a more structured approach...
NR==1 {a["100"]=$3$3; a["010"]=$3$4; a["001"]=$4$4; print; next}
{k=$2$3$4;
print $0, (k in a)?a[k]:"no data";
c[k]++}
END {printf "\nB freq = %.3f\n",
(2*c["001"]+c["010"]) / 2 / (c["100"]+c["010"]+c["001"])}
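If you want to cross-check the awk output, the same counting logic can be sketched in Python. This is only an illustrative port: the function name `annotate` and its `lines` argument are made up here, it assumes whitespace-separated columns, and it assumes at least one row carries data (otherwise the division fails). It prints the label as `B_freq`, following the desired output in the question.

```python
from collections import Counter


def annotate(lines):
    """Append a genotype label to each row and a frequency line at the end.

    `lines` is a list of strings: the header first, then the data rows.
    (Hypothetical helper, not part of the awk answer.)
    """
    header = lines[0].split()
    first, second = header[2], header[3]  # allele names from header columns 3 and 4
    labels = {"100": first * 2, "010": first + second, "001": second * 2}

    counts = Counter()
    out = [lines[0]]
    for line in lines[1:]:
        fields = line.split()
        key = "".join(fields[1:4])  # e.g. "1 0 0" becomes "100"
        counts[key] += 1
        out.append(f"{line} {labels.get(key, 'no data')}")

    # (2*BB + AB) / (2 * (AA + AB + BB)); rows with "no data" are ignored
    total = counts["100"] + counts["010"] + counts["001"]
    freq = (2 * counts["001"] + counts["010"]) / (2 * total)
    out.append(f"{second}_freq = {freq:.3f}")
    return out
```

Calling `annotate(open("input.file").read().splitlines())` and printing the result line by line should reproduce the desired output.file, apart from the header row being kept.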
UPDATE
For non-binary data you can follow the same logic with some pre-processing. Something like this should work in the main block, before the key k is built:
for(i=2;i<5;i++) v[i]=(($i-0.9)^2<=0.1^2)?1:0;
k=v[2] v[3] v[4];
...
Here a value is quantized to one if it lies in the range [0.8,1] and to zero otherwise.
To print the actual name from the header instead of a hard-coded "B", set h=$4 in the first block and use it as printf "\n%s freq...",h,(2*c...
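The quantization step can be sketched in Python as follows. The names `quantize` and `genotype_key`, and the `centre`/`tolerance` parameters, are made up for illustration; the defaults simply mirror the 0.9 and 0.1 used in the awk line above.

```python
def quantize(value, centre=0.9, tolerance=0.1):
    """Return 1 when value lies within tolerance of centre, else 0."""
    return 1 if abs(value - centre) <= tolerance else 0


def genotype_key(fields):
    """Build the three-character lookup key from columns 2-4 of a split row."""
    return "".join(str(quantize(float(x))) for x in fields[1:4])
```

For a row such as `ID_1 0.95 0.03 0.02` this gives the key `100`, which can then be looked up exactly as in the binary case.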


How can I label connected components in APL?

I'm trying to solve the LeetCode puzzle https://leetcode.com/problems/max-area-of-island/, which requires labelling connected components (connected by sides, not corners).
How can I transform something like
0 0 1 0 0
0 0 0 0 0
0 1 1 0 1
0 1 0 0 1
0 1 0 0 1
into
0 0 1 0 0
0 0 0 0 0
0 2 2 0 3
0 2 0 0 3
0 2 0 0 3
I've played with the stencil ⌺ operator and also tried using scan operators but still not quite there. Can somebody help?
We can start off by enumerating the ones. We do this by applying the function ⍸ (where, but since all are 1s, it is equivalent to 1,2,3,…) @ at the subset masked by ⊢ the bits themselves, i.e. ⍸@⊢:
⍸@⊢m
0 0 1 0 0
0 0 0 0 0
0 2 3 0 4
0 5 0 0 6
0 7 0 0 8
Now we need to flood-fill the lowest number in each component. We do this with repeated application until the fix-point ⍣≡ of processing Moore neighbourhoods ⌺3 3. To get the von Neumann neighbours, we reshape the 9 elements in the Moore neighbourhood into a 4-row 2-column matrix with 4 2⍴ and use ⊢/ to select the right column. We remove any 0s with 0~⍨ then prepend , the original value ⍵[2;2] (even if 0) and have ⌊/ select the smallest value:
{⌊/⍵[2;2],0~⍨⊢/4 2⍴⍵}⌺3 3⍣≡⍸@⊢m
0 0 1 0 0
0 0 0 0 0
0 2 2 0 4
0 2 0 0 4
0 2 0 0 4
We map the values to indices by finding their ⊢ indices ⍳⍨ in the unique elements of ∘∪ 0 followed by , the ravelled matrix ,:
(⊢⍳⍨∘∪0,,){⌊/⍵[2;2],0~⍨⊢/4 2⍴⍵}⌺3 3⍣≡⍸@⊢m
1 1 2 1 1
1 1 1 1 1
1 3 3 1 4
1 3 1 1 4
1 3 1 1 4
Finally, we decrement, so that the labels begin at zero again:
¯1+(⊢⍳⍨∘∪0,,){⌊/⍵[2;2],0~⍨⊢/4 2⍴⍵}⌺3 3⍣≡⍸@⊢m
0 0 1 0 0
0 0 0 0 0
0 2 2 0 3
0 2 0 0 3
0 2 0 0 3
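For readers without an APL interpreter, the same three steps (enumerate the ones, spread the minimum over side neighbours until nothing changes, renumber) can be sketched in Python. This is a loose port of the idea rather than a translation of the stencil mechanics; `label_components` is a made-up name and the grid is assumed to be a non-empty rectangular list of lists of 0s and 1s.

```python
def label_components(grid):
    """Label side-connected groups of 1s with 1, 2, 3, ... in row-major order."""
    rows, cols = len(grid), len(grid[0])

    # Step 1: give every 1 its own number (1, 2, 3, ...), keep 0s as 0.
    labels, counter = [], 0
    for row in grid:
        numbered = []
        for cell in row:
            if cell:
                counter += 1
                numbered.append(counter)
            else:
                numbered.append(0)
        labels.append(numbered)

    # Step 2: replace each non-zero cell by the smallest non-zero value among
    # itself and its four side neighbours; repeat until a pass changes nothing.
    changed = True
    while changed:
        changed = False
        updated = [row[:] for row in labels]
        for r in range(rows):
            for c in range(cols):
                if labels[r][c] == 0:
                    continue
                best = labels[r][c]
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and labels[rr][cc]:
                        best = min(best, labels[rr][cc])
                if best != labels[r][c]:
                    updated[r][c] = best
                    changed = True
        labels = updated

    # Step 3: renumber the surviving values consecutively by first appearance.
    order = {}
    for row in labels:
        for value in row:
            if value and value not in order:
                order[value] = len(order) + 1
    return [[order.get(value, 0) for value in row] for row in labels]
```

For the LeetCode problem itself, the largest island is then the most frequent non-zero label in the result.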

Transform string variable into 0-1 columns

As a complete beginner in SPSS, I would like to ask for help with transforming table A into table B. I have to recode the values of the "brand" variable into columns of 0-1 variables.
#table A#
nr brand
1 GREEN CARE PROFESSIONAL
1 GREEN CARE PROFESSIONAL
1 GREEN CARE PROFESSIONAL
2 HENKEL
3 HENKEL
3 HENKEL
3 HENKEL
3 VIZIR
4 BIEDRONKA
4 BOBINI
4 BOBINI
4 BOBINI
4 BOBINI
4 BOBINI
4 HENKEL
5 VIZIR
6 HENKEL
#table B#
nr GREEN HENKEL VIZIR BIEDR BOBINI
1 1 0 0 0 0
1 1 0 0 0 0
1 1 0 0 0 0
2 0 1 0 0 0
3 0 1 0 0 0
3 0 1 0 0 0
3 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 0
4 0 0 0 0 1
4 0 0 0 0 1
4 0 0 0 0 1
4 0 0 0 0 1
4 0 0 0 0 1
4 0 1 0 0 0
5 0 0 1 0 0
6 0 1 0 0 0
I can do it in this particular case in this simple way:
compute HENKEL=0.
...
do if BRAND='GREEN_CARE' .
compute GREEN_CARE=1.
else if ....
but the loop has to be usable with other variables and a different number of values, etc. I was trying to make it work all day and gave up.
Do you have any idea how to do it in an easy way?
Thanks!
The following syntax does the job on the sample data you provided.
First, let's recreate the sample data to demonstrate on:
Data list list/nr (f1) brand (a30).
begin data
1 "GREEN CARE PROFESSIONAL"
1 "GREEN CARE PROFESSIONAL"
1 "GREEN CARE PROFESSIONAL"
2 "HENKEL"
3 "HENKEL"
3 "HENKEL"
3 "HENKEL"
3 "VIZIR"
4 "BIEDRONKA"
4 "BOBINI"
4 "BOBINI"
4 "BOBINI"
4 "BOBINI"
4 "BOBINI"
4 "HENKEL"
5 "VIZIR"
6 "HENKEL"
end data.
dataset name originalDataset.
Now for the restructure.
sort cases by nr brand.
* creating an index to enumerate cases for each combination of `nr` and `brand`.
* This is necessary for the `casestovars` command to work later.
compute ind=1.
if $casenum>1 and lag(nr)=nr and lag(brand)=brand ind=lag(ind)+1.
exe.
* variable names can't have spaces in them, so changing the category names accordingly.
compute brand=replace(rtrim(brand)," ","_").
sort cases by nr ind brand.
compute exist=1.
casestovars /id=nr ind /index= brand/autofix=no.
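To sanity-check the restructured file against table B outside SPSS, the target layout can be sketched in a few lines of Python. This is not a translation of the syntax above, only an illustration of the expected 0-1 columns; `indicator_rows` is a made-up name, and the input is assumed to be a list of `(nr, brand)` pairs. Spaces in brand names are replaced with underscores, as in the SPSS step.

```python
def indicator_rows(records):
    """Turn (nr, brand) pairs into a header plus one 0/1 row per record."""
    def column_name(brand):
        # Same clean-up as the SPSS step: trim, then replace spaces.
        return brand.strip().replace(" ", "_")

    # Collect the distinct brands in order of first appearance.
    names = []
    for _, brand in records:
        name = column_name(brand)
        if name not in names:
            names.append(name)

    header = ["nr"] + names
    rows = [
        [nr] + [1 if column_name(brand) == name else 0 for name in names]
        for nr, brand in records
    ]
    return header, rows
```

Because the column list is built from the data, the same function works for another variable with a different number of values.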

Splitting a string where one item is in parentheses

Please find my code below.
str = "1791 (AR6K Async) S 2 0 0 0 -1 2129984 0 0 0 0 0 113 0 0 20 0 1 0 2370 0 0 4294967295 0 0 0 0 0 0 0 2147483647 0 3221520956 0 0 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
for val in str:gmatch("%S+") do
    print(val)
end
Output:
1791
(AR6K
Async)
S
2
0
0
0
-1
....
But I am expecting the output like,
1791
(AR6K Async)
S
2
0
0
0
-1
...
Can anyone please help me get the value in parentheses as a single value instead of separate values?
str = "1791 (AR6K Async) S 2 0 0 0 -1 2129984 0 0 0 0 0 113 0 0 20 0 1 0 2370 0 0 4294967295 0 0 0 0 0 0 0 2147483647 0 3221520956 0 0 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
for val in str:gsub("%S+","\0%0\0")
              :gsub("%b()", function(s) return s:gsub("%z","") end)
              :gmatch("%z(.-)%z") do
    print(val)
end
Explanation:
1. Surround all spaceless substrings with "zero marks"
(add one binary-zero-character at the beginning and one at the end)
2. Remove "zero marks" from inside parentheses
3. Display all surrounded parts
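The three steps above can also be sketched in Python, which may help if the chained `gsub` calls are hard to follow. This is an illustrative port, not the Lua answer itself: `split_keep_parens` is a made-up name, and unlike Lua's `%b()` the regular expression used here only handles parentheses that are not nested.

```python
import re


def split_keep_parens(text):
    """Split on whitespace but keep each (...) group as a single item."""
    # Step 1: wrap every run of non-space characters in NUL marks.
    marked = re.sub(r"\S+", lambda m: "\x00" + m.group(0) + "\x00", text)
    # Step 2: inside each (...) group, delete the marks so the group stays whole.
    marked = re.sub(
        r"\([^()]*\)",
        lambda m: m.group(0).replace("\x00", ""),
        marked,
    )
    # Step 3: collect everything that is still enclosed by a pair of marks.
    return re.findall(r"\x00(.*?)\x00", marked)
```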
It may not be possible to do this with a single Lua pattern alone.
However, it is easy to roll your own parsing/splitting of the string, or just extend your code a bit to concatenate the parts, from the part that starts with ( to the part that ends with ).
Here is a small example
str = "1791 (AR6K Async) S 2 0 0 0 -1 2129984 0 0 0 0 0 113 0 0 20 0 1 0 2370 0 0 4294967295 0 0 0 0 0 0 0 2147483647 0 3221520956 0 0 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
local temp
for val in str:gmatch("%S+") do
    if temp then
        if val:sub(#val, #val) == ")" then
            print(temp.." "..val)
            temp = nil
        else
            temp = temp.." "..val
        end
    elseif val:sub(1,1) == "(" then
        temp = val
    else
        print(val)
    end
end
This code behaves exactly like your own, except that when it encounters a substring that starts with an opening bracket, it saves it to the temp variable. It then concatenates new values to temp until it encounters a substring that ends with a closing bracket. At that point the whole string saved in temp is printed, temp is set to nil, and the loop continues normally.
So there is just a special case coded for when a string with brackets comes by.
This may not be the most efficient implementation, but it works. It also assumes that the separators are spaces, since the parts are concatenated to temp with an ordinary space. It does not handle nested brackets, and a bracketed value without spaces, such as (abc), is never closed, because its first part already ends with ).
This was just a quick demonstration of the idea, however, so I believe you can fix these shortcomings yourself if you use it.
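One way to address the nested-bracket shortcoming is to track the bracket depth instead of a single flag. A minimal sketch in Python, assuming space-separated parts as above; `split_nested` is a made-up name, and unbalanced input is simply emitted as collected rather than treated as an error.

```python
def split_nested(text):
    """Split on whitespace, keeping balanced (...) groups together."""
    parts, buffer, depth = [], [], 0
    for token in text.split():
        # Each "(" opens a level, each ")" closes one.
        depth += token.count("(") - token.count(")")
        buffer.append(token)
        if depth <= 0:
            # Not inside any bracket: emit what was collected so far.
            parts.append(" ".join(buffer))
            buffer, depth = [], 0
    if buffer:
        # Input ended inside an unclosed bracket; emit the remainder as-is.
        parts.append(" ".join(buffer))
    return parts
```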

Random Forest overfitting?

I'm facing the following problem: I'm training a random forest for binary prediction. The data is structured as follows:
> str(data)
'data.frame': 120269 obs. of 11 variables:
$ SeriousDlqin2yrs : num 1 0 0 0 0 0 0 0 0 0 ...
$ RevolvingUtilizationOfUnsecuredLines: num 0.766 0.957 0.658 0.234 0.907 ...
$ age : num 45 40 38 30 49 74 39 57 30 51 ...
$ NumberOfTime30.59DaysPastDueNotWorse: num 2 0 1 0 1 0 0 0 0 0 ...
$ DebtRatio : num 0.803 0.1219 0.0851 0.036 0.0249 ...
$ MonthlyIncome : num 9120 2600 3042 3300 63588 ...
$ NumberOfOpenCreditLinesAndLoans : num 13 4 2 5 7 3 8 9 5 7 ...
$ NumberOfTimes90DaysLate : num 0 0 1 0 0 0 0 0 0 0 ...
$ NumberRealEstateLoansOrLines : num 6 0 0 0 1 1 0 4 0 2 ...
$ NumberOfTime60.89DaysPastDueNotWorse: num 0 0 0 0 0 0 0 0 0 0 ...
$ NumberOfDependents : num 2 1 0 0 0 1 0 2 0 2 ...
- attr(*, "na.action")=Class 'omit' Named int [1:29731] 7 9 17 33 42 53 59 63 72 87 ...
.. ..- attr(*, "names")= chr [1:29731] "7" "9" "17" "33" ...
I split the data
index <- sample(1:nrow(data),round(0.75*nrow(data)))
train <- data[index,]
test <- data[-index,]
Then I run the model and try to make predictions:
model.rf <- randomForest(as.factor(train[,1]) ~ ., data=train,ntree=1000,mtry=10,importance=TRUE)
pred.rf <- predict(model.rf, test, type = "prob")
rfpred <- c(1:22773)
rfpred[pred.rf[,1]<=0.5] <- "yes"
rfpred[pred.rf[,1]>0.5] <- "no"
rfpred <- factor(rfpred)
test[,1][test[,1]==1] <- "yes"
test[,1][test[,1]==0] <- "no"
test[,1] <- factor(test[,1])
confusionMatrix(as.factor(rfpred), as.factor(test$Y))
what I get is the following output:
> print(model.rf)
Call:
randomForest(formula = as.factor(train[, 1]) ~ ., data = train, ntree = 1000, mtry = 10, importance = TRUE)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 10
OOB estimate of error rate: 0%
Confusion matrix:
0 1 class.error
0 43093 0 0
1 0 25225 0
> head(pred.rf)
0 1
45868.1 1 0
112445 1 0
39001 1 0
133443 1 0
137460 1 0
125835.1 1 0
> confusionMatrix(as.factor(rfpred), as.factor(test$Y))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 14570 0
yes 0 8203
Accuracy : 1
95% CI : (0.9998, 1)
No Information Rate : 0.6398
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.6398
Detection Rate : 0.6398
Detection Prevalence : 0.6398
Balanced Accuracy : 1.0000
'Positive' Class : no
Obviously the model cannot be this accurate! What's wrong with my code?

Delphi 7 boolean equations aren't working

I have a program written in Delphi 7 that appears to be experiencing some logic issues. The following line never gives a true value, even when my watch window says it should.
Seq^.step[1] :=
(PlcStart^ and (not Seq^.Step[2])) or
(RetryDelay^.Done and (not Seq^.Step[2])) or
(Seq^.Step[1] and (not Seq^.Step[reset_]));
My watch window shows that (PlcStart^ and (not Seq^.Step[2])) or (RetryDelay^.Done and (not Seq^.Step[2])) or (Seq^.Step[1] and (not Seq^.Step[reset_])) is true, but the value of Seq^.Step[1] never gets set to true.
The really strange part is that I have a number of programs with the exact same line that all appear to be working correctly.
Seq^.step[1] :=
(PlcStart^ and (not Seq^.Step[2])) or
(RetryDelay^.Done and (not Seq^.Step[2])) or
(Seq^.Step[1] and (not Seq^.Step[reset_]));
I'm not familiar with Delphi, but I am familiar with boolean logic. If I'm reading this right, you're saying:
(A ∧ ¬B) ∨ (C ∧ ¬B) ∨ (D ∧ ¬E)
In JavaScript that's:
(a && !b) || (c && !b) || (d && !e)
Using http://mustpax.github.io/Truth-Table-Generator/ to generate a truth table and converting "false" to "0" and "true" to "1", we get the truth table:
a b c d e (a & !b) | (c & !b) | (d & !e)
1 1 1 1 1 0
0 1 1 1 1 0
1 0 1 1 1 1
0 0 1 1 1 1
1 1 0 1 1 0
0 1 0 1 1 0
1 0 0 1 1 1
0 0 0 1 1 0
1 1 1 0 1 0
0 1 1 0 1 0
1 0 1 0 1 1
0 0 1 0 1 1
1 1 0 0 1 0
0 1 0 0 1 0
1 0 0 0 1 1
0 0 0 0 1 0
1 1 1 1 0 1
0 1 1 1 0 1
1 0 1 1 0 1
0 0 1 1 0 1
1 1 0 1 0 1
0 1 0 1 0 1
1 0 0 1 0 1
0 0 0 1 0 1
1 1 1 0 0 0
0 1 1 0 0 0
1 0 1 0 0 1
0 0 1 0 0 1
1 1 0 0 0 0
0 1 0 0 0 0
1 0 0 0 0 1
0 0 0 0 0 0
This table may or may not be correct, I haven't verified it. You can go through it and decide for yourself. Anyway, assuming it is correct, you could check the expected output for your given input and verify whether your expectations are correct.
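The table can also be regenerated programmatically rather than checked by hand. A minimal Python sketch, assuming the five inputs are plain booleans; the names `step1` and `truth_table` are made up for illustration, and a to e follow the letters used above.

```python
from itertools import product


def step1(a, b, c, d, e):
    """(a and not b) or (c and not b) or (d and not e)"""
    return (a and not b) or (c and not b) or (d and not e)


def truth_table():
    """Return rows (a, b, c, d, e, result) in the same order as the table."""
    rows = []
    # product() varies its last element fastest, so unpacking as
    # (e, d, c, b, a) makes `a` toggle fastest, starting from all-true.
    for e, d, c, b, a in product((True, False), repeat=5):
        rows.append((a, b, c, d, e, step1(a, b, c, d, e)))
    return rows
```

Printing each row with `int()` applied to every value gives output that can be compared line by line with the table above.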
