How to combine similar fields parsed from CSV using Ruby - ruby-on-rails

I am parsing baseball statistics from a CSV file, and I need to account for players who played for multiple teams within a season. Currently my code looks like this:
require 'csv'

CSV.foreach("Batting-07-12-resaved.csv", {:headers => :first_row}) do |row|
  if row[7].to_i != 0 && row[5] != 0 && row[1].to_i == 2009
    avg = row[7].to_f / row[5].to_f
    puts row[0] + ": " + avg.round(3).to_s[1..-1]
  end
end
The CSV headers look like this; a player is identified by a key that looks roughly like their name, and the same key may recur across rows when the player appeared for different teams (here are a few lines copied from the formatted file):
playerID yearID league teamID G AB R H 2B 3B HR RBI SB CS
aardsda01 2012 AL NYA 1
aardsda01 2010 AL SEA 53 0 0 0 0 0 0 0 0 0
aardsda01 2009 AL SEA 73 0 0 0 0 0 0 0 0 0
aardsda01 2008 AL BOS 47 1 0 0 0 0 0 0 0 0
aardsda01 2007 AL CHA 25 0 0 0 0 0 0 0 0 0
abadfe01 2012 NL HOU 37 7 0 1 0 0 0 0 0 0
abadfe01 2011 NL HOU 28 0 0 0 0 0 0 0 0 0
abadfe01 2010 NL HOU 22 1 0 0 0 0 0 0 0 0
abercre01 2008 NL HOU 34 55 10 17 5 0 2 5 5 2
abercre01 2007 NL FLO 35 76 16 15 3 0 2 5 7 1
abreubo01 2012 AL LAA 8 24 1 5 3 0 0 5 0 0
abreubo01 2012 NL LAN 92 195 28 48 8 1 3 19 6 2
So, for example, in the bottom two lines, Bobby Abreu played for two different teams in the 2012 season.
How could I combine the numbers from these two rows under the same playerID for the 2012 season to calculate his 2012 batting average?

You need to keep a data structure that holds the running totals for each playerID as you iterate through the CSV data. A hash is perfect for this (see the Hash page in the ruby-doc.org manual).
require 'csv'

# Hashes are built into Ruby. Using a hash literal
# is more idiomatic than h = Hash.new.
h = {}

CSV.foreach("Batting-07-12-resaved.csv", headers: :first_row) do |row|
  if row[7].to_i != 0 && row[5].to_i != 0 && row[1].to_i == 2009
    player_data = h[row[0]]
    if !player_data
      player_data = [row[0], row[7].to_f, row[5].to_f]
    else
      player_data = [row[0], row[7].to_f + player_data[1], row[5].to_f + player_data[2]]
    end
    h[row[0]] = player_data
  end
end

h.each do |key, value|
  puts "#{value[0]} is #{value[1] / value[2]}"
end
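For a slightly more compact variant, here is a sketch that accumulates running totals in a hash with a default value and formats the average the way the original script does. It assumes the same file name, column layout, and 2009 filter as above:
require 'csv'

# Running totals keyed by playerID: [hits, at-bats].
totals = Hash.new { |hash, key| hash[key] = [0.0, 0.0] }

CSV.foreach("Batting-07-12-resaved.csv", headers: :first_row) do |row|
  next unless row[1].to_i == 2009 && row[5].to_i != 0 && row[7].to_i != 0
  totals[row[0]][0] += row[7].to_f  # hits (H)
  totals[row[0]][1] += row[5].to_f  # at-bats (AB)
end

totals.each do |player_id, (hits, at_bats)|
  avg = (hits / at_bats).round(3)
  puts "#{player_id}: #{avg.to_s[1..-1]}"  # ".321"-style output, as in the question
end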

Related

Calculate the longest continuous running time of a device

I have a table created with the following script:
n=15
ts=now()+1..n * 1000 * 100
status=rand(0 1 ,n)
val=rand(100,n)
t=table(ts,status,val)
select * from t order by ts
where
ts is the time, status indicates the device status (0: down; 1: running), and val indicates the running time.
Suppose I have the following data:
ts status val
2023.01.03T18:17:17.386 1 58
2023.01.03T18:18:57.386 0 93
2023.01.03T18:20:37.386 0 24
2023.01.03T18:22:17.386 1 87
2023.01.03T18:23:57.386 0 85
2023.01.03T18:25:37.386 1 9
2023.01.03T18:27:17.386 1 46
2023.01.03T18:28:57.386 1 3
2023.01.03T18:30:37.386 0 65
2023.01.03T18:32:17.386 1 66
2023.01.03T18:33:57.386 0 56
2023.01.03T18:35:37.386 0 42
2023.01.03T18:37:17.386 1 82
2023.01.03T18:38:57.386 1 95
2023.01.03T18:40:37.386 0 19
So how do I calculate the longest continuous running time? For example, the 7th and 8th records both have status 1, so I want to sum their val values; likewise for the 14th-15th records.
You can use the built-in function segment to group consecutive identical values. The full script is as follows:
select first(ts), sum(iif(status==1, val, 0)) as total_val
from t
group by segment(status)
having sum(iif(status==1, val, 0)) > 0
The result:
segment_status first_ts total_val
0 2023.01.03T18:17:17.386 58
3 2023.01.03T18:22:17.386 87
5 2023.01.03T18:25:37.386 58
9 2023.01.03T18:32:17.386 66
12 2023.01.03T18:37:17.386 177
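The run-grouping idea itself is not DolphinDB-specific. Purely as an illustration (with a hypothetical rows array mirroring the sample table), the same logic sketched in Ruby: group consecutive rows by status, sum val over each run of 1s, and take the maximum:
# Each row: [status, val]; hypothetical sample mirroring the table above.
rows = [[1, 58], [0, 93], [0, 24], [1, 87], [0, 85],
        [1, 9],  [1, 46], [1, 3],  [0, 65], [1, 66],
        [0, 56], [0, 42], [1, 82], [1, 95], [0, 19]]

# Group consecutive rows with the same status (like segment(status)).
runs = rows.chunk_while { |a, b| a[0] == b[0] }

# Keep only the running segments and sum their val values.
running_totals = runs.select { |run| run.first[0] == 1 }
                     .map { |run| run.sum { |_, val| val } }

puts running_totals.inspect   # => [58, 87, 58, 66, 177]
puts running_totals.max       # => 177, the longest continuous running time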

Is it possible for luaj to convert a script to the binary representation of Lua 5.2

I am trying to convert a script to its binary representation using string.dump, but I saw that the luaj 3.0.1 representation is different from Lua 5.2's. Is there a way to change this, or to convert the luaj 3.0.1 representation into the Lua 5.2 format?
Code
Globals _G = JsePlatform.standardGlobals();
LuaValue Function = _G.load("return string.dump(function() return 'Test' end),true);");
String Return = Function.call().toString();
Output in Hex
1B 4C 75 61 52 0 0 4 4 4 8 0 19 93 D A 1A A 0 0 0 1 0 0 0 1 0 0 2 0 0 0 3 0 0 0 1 1 0 0 1F 0 80 0 1F 0 0 0 1 4 0 0 0 5 54 65 73 74 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Transform string variable into 0-1 columns

As a complete beginner in SPSS, I would like to ask for your help with a transformation from table A into table B. I have to recode the values of the "brand" variable into columns and make 0-1 variables.
#table A#
nr brand
1 GREEN CARE PROFESSIONAL
1 GREEN CARE PROFESSIONAL
1 GREEN CARE PROFESSIONAL
2 HENKEL
3 HENKEL
3 HENKEL
3 HENKEL
3 VIZIR
4 BIEDRONKA
4 BOBINI
4 BOBINI
4 BOBINI
4 BOBINI
4 BOBINI
4 HENKEL
5 VIZIR
6 HENKEL
#table B#
nr GREEN HENKEL VIZIR BIEDR BOBINI
1 1 0 0 0 0
1 1 0 0 0 0
1 1 1 0 0 0
2 0 1 0 0 0
3 0 1 0 0 0
3 0 1 0 0 0
3 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 0
4 0 0 0 0 1
4 0 0 0 0 1
4 0 0 0 0 1
4 0 0 0 0 1
4 0 0 0 0 1
4 0 1 0 0 0
5 0 0 1 0 0
6 0 1 0 0 0
I can do it in this particular case in this simple way:
compute HENKEL=0.
...
do if BRAND='GREEN_CARE' .
compute GREEN_CARE=1.
else if ....
but the loop has to be usable with other variables and a different number of values, etc. I was trying to make it work all day and gave up.
Do you have any idea how to do this in an easy way?
Thanks!
The following syntax does the job on the sample data you provided.
First, let's recreate the sample data to demonstrate on:
Data list list/nr (f1) brand (a30).
begin data
1 "GREEN CARE PROFESSIONAL"
1 "GREEN CARE PROFESSIONAL"
1 "GREEN CARE PROFESSIONAL"
2 "HENKEL"
3 "HENKEL"
3 "HENKEL"
3 "HENKEL"
3 "VIZIR"
4 "BIEDRONKA"
4 "BOBINI"
4 "BOBINI"
4 "BOBINI"
4 "BOBINI"
4 "BOBINI"
4 "HENKEL"
5 "VIZIR"
6 "HENKEL"
end data.
dataset name originalDataset.
Now for the restructure.
sort cases by nr brand.
* creating an index to enumerate cases for each combination of `nr` and `brand`.
* This is necessary for the `casestovars` command to work later.
compute ind=1.
if $casenum>1 and lag(nr)=nr and lag(brand)=brand ind=lag(ind)+1.
exe.
* variable names can't have spaces in them, so changing the category names accordingly.
compute brand=replace(rtrim(brand)," ","_").
sort cases by nr ind brand.
compute exist=1.
casestovars /id=nr ind /index=brand /autofix=no.
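Outside of SPSS, what table B shows is just dummy (0-1) coding of a categorical variable. Purely as an illustration of that idea (not SPSS syntax, and with a made-up data literal), here is a sketch in Ruby that turns (nr, brand) pairs into 0-1 columns:
# Hypothetical (nr, brand) pairs, as in table A.
rows = [[1, "GREEN CARE PROFESSIONAL"], [2, "HENKEL"], [3, "VIZIR"],
        [4, "BIEDRONKA"], [4, "BOBINI"], [5, "VIZIR"]]

# One 0-1 column per distinct brand, in order of first appearance.
brands = rows.map { |_, brand| brand }.uniq

puts (["nr"] + brands).join("\t")
rows.each do |nr, brand|
  dummies = brands.map { |b| b == brand ? 1 : 0 }
  puts ([nr] + dummies).join("\t")
end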

Do math on string count (and text parsing with awk)

I have a 4 column file (input.file) with a header:
something1 something2 A B
followed by many 4-column rows with the same format (e.g.):
ID_00001 1 0 0
ID_00002 0 1 0
ID_00003 1 0 0
ID_00004 0 0 1
ID_00005 0 1 0
ID_00006 0 1 0
ID_00007 0 0 0
ID_00008 1 0 0
Where "1 0 0" is representative of "AA", "0 1 0" means "AB", and "0 0 1" means "BB"
First, I would like to create a 5th column to identify these representations:
ID_00001 1 0 0 AA
ID_00002 0 1 0 AB
ID_00003 1 0 0 AA
ID_00004 0 0 1 BB
ID_00005 0 1 0 AB
ID_00006 0 1 0 AB
ID_00007 0 0 0 no data
ID_00008 1 0 0 AA
Note that the A's and B's need to be parsed from columns 3 and 4 of the header row, as they are not always A and B.
Next, I want to "do math" on the counts for (the new) column 5 as follows:
(2BB + AB) / 2(AA + AB + BB)
Using the example, the math would give:
(2(1) + 3) / 2(3 + 3 + 1) = 5/14 = 0.357
which I would like to append to the end of the desired output file (output.file):
ID_00001 1 0 0 AA
ID_00002 0 1 0 AB
ID_00003 1 0 0 AA
ID_00004 0 0 1 BB
ID_00005 0 1 0 AB
ID_00006 0 1 0 AB
ID_00007 0 0 0 no data
ID_00008 1 0 0 AA
B_freq = 0.357
So far I have this:
awk '{ if ($2 = 1) {print $0, $5="AA"} \
else if($3 = 1) {print $0, $5="AB"} \
else if($4 = 1) {print $0, $5="BB"} \
else {print$0, $5="no data"}}' input.file > output.file
Obviously, I was not able to figure out how to parse the info from row 1 (the header row, edited out "column 1"), much less do the math.
Thanks guys!
A more structured approach, written as a complete awk invocation (reading input.file and writing output.file):
awk 'NR==1 {a["100"]=$3$3; a["010"]=$3$4; a["001"]=$4$4; print; next}
     {k=$2$3$4;
      print $0, (k in a)?a[k]:"no data";
      c[k]++}
     END {printf "\nB freq = %.3f\n",
          (2*c["001"]+c["010"]) / 2 / (c["100"]+c["010"]+c["001"])}' input.file > output.file
UPDATE
For non-binary data you can follow the same logic with some pre-processing. Something like this should work in the main block:
for(i=2;i<5;i++) v[i]=(($i-0.9)^2<=0.1^2)?1:0;
k=v[2] v[3] v[4];
...
Here the value is quantized to one for the range [0.8, 1] and to zero otherwise.
To capture "B" (or whatever header is actually in the fourth column), set h=$4 in the NR==1 block and use it in the final printf as printf "\n%s freq...", h, (2*c...

How does Weka evaluate a classifier model

I used the random forest algorithm and got this result:
=== Summary ===
Correctly Classified Instances 10547 97.0464 %
Incorrectly Classified Instances 321 2.9536 %
Kappa statistic 0.9642
Mean absolute error 0.0333
Root mean squared error 0.0952
Relative absolute error 18.1436 %
Root relative squared error 31.4285 %
Total Number of Instances 10868
=== Confusion Matrix ===
a b c d e f g h i <-- classified as
1518 1 3 1 0 14 0 0 4 | a = a
3 2446 0 0 0 1 1 27 0 | b = b
0 0 2942 0 0 0 0 0 0 | c = c
0 0 0 470 0 1 1 2 1 | d = d
9 0 0 9 2 19 0 3 0 | e = e
23 1 2 19 0 677 1 22 6 | f = f
4 0 2 0 0 13 379 0 0 | g = g
63 2 6 17 0 15 0 1122 3 | h = h
9 0 0 0 0 9 0 4 991 | i = i
I wonder how Weka evaluates errors (mean absolute error, root mean squared error, ...) when the class values are non-numerical ('a', 'b', ...).
I mapped each class to a number from 0 to 8 and computed the errors manually, but my results differed from Weka's.
How can I reimplement Weka's evaluation steps?
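As far as I can tell from Weka's Evaluation class (treat this as an assumption rather than something stated above), Weka computes these errors on the predicted class-probability distribution against a 0/1 indicator vector for the true class, averaging over classes and instances, not on numeric class indices; that would explain why mapping 'a'..'i' to 0-8 gives different numbers. A rough sketch of that computation in Ruby:
# Hypothetical predictions: [predicted probability distribution, true class index].
predictions = [
  [[0.7, 0.2, 0.1], 0],
  [[0.1, 0.8, 0.1], 1],
  [[0.3, 0.3, 0.4], 1],   # a misclassified instance
]

num_classes = 3
sum_abs = 0.0
sum_sqr = 0.0

predictions.each do |probs, actual|
  target = Array.new(num_classes, 0.0)
  target[actual] = 1.0
  diffs = probs.zip(target).map { |p, t| p - t }
  # The per-instance error is averaged over the classes.
  sum_abs += diffs.sum { |d| d.abs } / num_classes
  sum_sqr += diffs.sum { |d| d * d } / num_classes
end

mae  = sum_abs / predictions.size
rmse = Math.sqrt(sum_sqr / predictions.size)
puts format("MAE = %.4f, RMSE = %.4f", mae, rmse)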
