Parsing in Ruby (on Rails) - ruby-on-rails

I want to write a Rails app to assist me with my online Poker. I play on PokerStars, and there is text data available for each hand that is played. The format it comes in is this:
PokerStars Game #27457662450: Tournament #157033867, Freeroll Hold'em No Limit - Level IV (50/100) - 2009/04/24 20:39:44 ET
Table '157033867 830' 9-max Seat #1 is the button
Seat 1: DortheaV (7624 in chips)
Seat 2: Currly234 (3016 in chips)
Seat 3: paolilla (3086 in chips)
Seat 4: triumph888 (1571 in chips) is sitting out
Seat 5: Minchausti (1185 in chips) is sitting out
Seat 6: madmike11847 (1195 in chips) is sitting out
Seat 7: alamodey (4038 in chips)
Seat 8: whiskerbob (3365 in chips)
Seat 9: SHpic76 (1115 in chips) is sitting out
DortheaV: posts the ante 10
Currly234: posts the ante 10
paolilla: posts the ante 10
triumph888: posts the ante 10
Minchausti: posts the ante 10
madmike11847: posts the ante 10
alamodey: posts the ante 10
whiskerbob: posts the ante 10
SHpic76: posts the ante 10
Currly234: posts small blind 50
paolilla: posts big blind 100
*** HOLE CARDS ***
Dealt to alamodey [8s Ks]
triumph888: folds
Minchausti: folds
madmike11847: folds
alamodey: calls 100
whiskerbob: folds
SHpic76: folds
DortheaV: folds
Currly234: calls 50
paolilla: checks
*** FLOP *** [5c 4h 6d]
Currly234: checks
paolilla: checks
alamodey: bets 234
Currly234: folds
paolilla: folds
Uncalled bet (234) returned to alamodey
alamodey collected 390 from pot
alamodey: doesn't show hand
*** SUMMARY ***
Total pot 390 | Rake 0
Board [5c 4h 6d]
Seat 1: DortheaV (button) folded before Flop (didn't bet)
Seat 2: Currly234 (small blind) folded on the Flop
Seat 3: paolilla (big blind) folded on the Flop
Seat 4: triumph888 folded before Flop (didn't bet)
Seat 5: Minchausti folded before Flop (didn't bet)
Seat 6: madmike11847 folded before Flop (didn't bet)
Seat 7: alamodey collected (390)
Seat 8: whiskerbob folded before Flop (didn't bet)
Seat 9: SHpic76 folded before Flop (didn't bet)
Are there any parsing libraries for Ruby or do I have to do this manually and hackily?

This sounds like a job for regular expressions! I doubt any library would make this easier to parse: it's a pretty custom format, so you'll just have to hack away at it.
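A minimal sketch of the regex approach, written in Python for illustration (the same patterns should carry over to Ruby's Regexp with little change). The patterns are guesses based only on the single sample hand in the question, and the `parse_hand` function and the shape of the returned dict are hypothetical; a real parser would need more line types (antes, blinds, board cards, summary).

```python
import re

# Patterns are assumptions derived from the one sample hand shown above.
HEADER_RE = re.compile(r"^PokerStars (?:Game|Hand) #(\d+): Tournament #(\d+),")
SEAT_RE = re.compile(r"^Seat (\d+): (.+) \((\d+) in chips\)(.*)$")
ACTION_RE = re.compile(r"^(.+?): (folds|checks|calls|bets|raises)(?: (\d+))?")


def parse_hand(text):
    """Parse one hand history into a dict (hypothetical structure)."""
    hand = {"game_id": None, "tournament_id": None, "seats": [], "actions": []}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("*** SUMMARY ***"):
            break  # this sketch ignores the summary block
        header = HEADER_RE.match(line)
        if header:
            hand["game_id"], hand["tournament_id"] = header.groups()
            continue
        seat = SEAT_RE.match(line)
        if seat:
            number, player, chips, rest = seat.groups()
            hand["seats"].append({
                "seat": int(number),
                "player": player,
                "chips": int(chips),
                "sitting_out": "sitting out" in rest,
            })
            continue
        action = ACTION_RE.match(line)
        if action:
            player, verb, amount = action.groups()
            hand["actions"].append((player, verb, int(amount) if amount else None))
    return hand


SAMPLE = """\
PokerStars Game #27457662450: Tournament #157033867, Freeroll Hold'em No Limit - Level IV (50/100) - 2009/04/24 20:39:44 ET
Table '157033867 830' 9-max Seat #1 is the button
Seat 4: triumph888 (1571 in chips) is sitting out
Seat 7: alamodey (4038 in chips)
triumph888: folds
alamodey: calls 100
*** SUMMARY ***
Seat 7: alamodey collected (390)
"""

hand = parse_hand(SAMPLE)
```

Here `hand["actions"]` holds one `(player, action, amount)` tuple per betting action, which could then be mapped onto Rails models.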

You might want to look at Treetop, a Parsing Expression Grammar based parser generator for Ruby.

Ragel is very good for writing parsers. For example, Mongrel's HTTP parser is generated with Ragel.

Also, if you just want the data, you should check out PokerTracker. PokerTracker stores 100% of hand information and has a well-documented schema and an open PostgreSQL database.

Related

Which feature selection technique for NLP does this represent?

I have a dataset that was produced by NLP on technical documents.
The dataset has 60,000 records and 30,000 features.
Each value is the number of times that word/feature appeared in the record.
Here is a sample of the dataset:
RowID             Microsoft  Internet  PCI  Laptop  Google  AWS  iPhone  Chrome
1                         8         2    0       0       5    1       0       0
2                         0         1    0       1       1    4       1       0
3                         0         0    0       7       1    0       5       0
4                         1         0    0       1       6    7       5       0
5                         5         1    0       0       5    0       3       1
6                         1         5    0       8       0    1       0       0
-------------------------------------------------------------------------------
Total appearance      9,470       821    5     107   4,605  719      25       8
Some words appeared fewer than 10 times in the whole dataset.
The technique is to select only the words/features that appeared in the dataset more than a certain number of times (say 100).
What is this technique called, the one that only uses features whose total count exceeds a certain number?
This technique for feature selection is rather trivial, so I don't believe it has a particular name beyond something intuitive like "low-frequency feature filtering", "k-occurrence feature filtering" or "top k-occurrence feature selection" in the machine learning sense, and "term-frequency filtering" or "rare word removal" in the Natural Language Processing (NLP) sense.
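As a rough illustration, the filter can be sketched in plain Python. The data below are only the six sample rows from the question, and the function name `filter_rare_features` and the threshold of 10 are assumptions chosen so that the tiny sample shows some columns being dropped.

```python
def filter_rare_features(names, rows, min_total):
    """Keep only the columns whose total count is greater than min_total."""
    totals = [sum(column) for column in zip(*rows)]
    keep = [i for i, total in enumerate(totals) if total > min_total]
    kept_names = [names[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_names, kept_rows


names = ["Microsoft", "Internet", "PCI", "Laptop", "Google", "AWS", "iPhone", "Chrome"]
rows = [
    [8, 2, 0, 0, 5, 1, 0, 0],
    [0, 1, 0, 1, 1, 4, 1, 0],
    [0, 0, 0, 7, 1, 0, 5, 0],
    [1, 0, 0, 1, 6, 7, 5, 0],
    [5, 1, 0, 0, 5, 0, 3, 1],
    [1, 5, 0, 8, 0, 1, 0, 0],
]

# Threshold of 10 is illustrative only; the question suggests 100 for the full dataset.
kept_names, kept_rows = filter_rare_features(names, rows, min_total=10)
```

On this sample, the columns "Internet", "PCI" and "Chrome" fall at or below the threshold and are removed.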
If you'd like to use more sophisticated means of feature selection, I'd recommend looking into the various supervised and unsupervised methods available. Cai et al. [1] provide a comprehensive survey; if you can't access the article, this page by JavaTPoint covers some of the supervised methods. A quick web search for supervised/unsupervised feature selection also yields many good blogs, most of which make use of the SciPy and scikit-learn Python libraries.
References
[1] Cai, J., Luo, J., Wang, S. and Yang, S., 2018. Feature selection in machine learning: A new perspective. Neurocomputing, 300, pp.70-79.

missing data in time series

I'm new to this field and am trying to explore time series data: find the missing values, count them, study the distribution of the gap lengths, and fill in the gaps. I have, let's say, 10 .txt files, and each file has 2 columns, as follows:
C1 C2
944 0
920 1
920 2
928 3
912 7
920 8
920 9
880 10
888 11
920 12
944 13
and so on, let's say up to 100. The 10 files do not necessarily have the same number of observations.
In this example the missing values are 4, 5 and 6 in C2, together with the corresponding values in the first column C1 (measured in milliseconds, so the value of 928 ms is not a time neighbor of 912 ms). Such gaps do not necessarily appear in every file. I want to find those gaps (the total missing values across all 10 files) and show a histogram of their lengths.
I wrote a piece of code in R, but the problem is that I don't get the exact total number of missing values that I should.
path = "files path"
out.file <- data.frame(TS = 0, Index = 0, File = '')
file.names <- dir(path, pattern = ".txt")
for (i in 1:length(file.names)) {
  file <- cbind(read.table(file.names[i],
                           header = F,
                           sep = "\t",
                           stringsAsFactors = FALSE),
                file.names[i])
  colnames(file) <- c('TS', 'Index', 'File')
  out.file <- rbind(out.file, file)
}
d = dim(out.file)[1]
misDa = 0
for (i in 2:(d - 1)) {
  if (abs(out.file$Index[i] - out.file$Index[i + 1]) > 1)
    misDa = misDa + 1
}
Hard to give specific hints without having a more extensive example of your data that contains some of the actual NAs.
If you are using R (as it seems), the naniar and imputeTS packages offer nice functions for visualizing missing data.
The naniar package is especially good for multivariate data, and the imputeTS package is especially good for time series data; the documentation of both packages contains plot examples.
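Independently of those packages, the counting itself can be sketched as follows (in Python, for illustration). Note that the loop in the question adds 1 per gap rather than the number of missing values in it, and it also compares the last row of one file with the first row of the next. This sketch measures each gap's length and handles every file separately; the file names and index lists are hypothetical stand-ins for the Index column (C2).

```python
from collections import Counter


def gap_lengths(indices):
    """Return the length of every run of missing indices in a sorted list."""
    return [b - a - 1 for a, b in zip(indices, indices[1:]) if b - a > 1]


# Hypothetical stand-ins for the Index column (C2) of two files.
files = {
    "file1.txt": [0, 1, 2, 3, 7, 8, 9, 10, 11, 12, 13],
    "file2.txt": [0, 2, 3, 6, 7],
}

all_gaps = []
for name, indices in files.items():
    all_gaps.extend(gap_lengths(indices))  # gaps never span two files

total_missing = sum(all_gaps)   # number of missing observations
histogram = Counter(all_gaps)   # gap length -> number of gaps of that length
```

The `histogram` counts are what a plot of gap lengths would display.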

Simulating loops in Google Spreadsheet using built-in formulas

I have columns like these in Google Sheets:
Equipments    Amount  .  Equipment 1  Equipment 2
-----------   ------     -----------  -----------
Equipment 1   2          Process 1    Process 3
Equipment 2   3          Process 2    Process 4
                                      Process 5
I need to produce Equipment 1 twice and Equipment 2 three times.
When the equipment is produced, Process 1 is executed 2 times, Process 2 - 2 times, Process 3 - 3 times, Process 4 - 3 times, Process 5 - 3 times.
So I need to generate a list like this:
Process 1
Process 1
Process 2
Process 2
Process 3
Process 3
Process 3
Process 4
Process 4
Process 4
Process 5
Process 5
Process 5
Of course, I want a formula that is dynamic (e.g. I can add another piece of equipment or change the processes of a particular piece of equipment).
Single list using REPT:
=TRANSPOSE(SPLIT(JOIN(",",FILTER(REPT(C2:C&",",B2),C2:C<>"")),","))
Multi-list using REPT:
=TRANSPOSE(SPLIT(JOIN(",",FILTER(REPT(C2:C&",",VLOOKUP(D2:D,A:B,2,)),C2:C<>"")),","))
There is no easy way to solve your problem with formulas.
I would strongly suggest you write a script. It's easier than you think. You can even record an action, and then see the code you need to reproduce the action.
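The loop such a script would perform is short. Here it is sketched in Python purely to show the logic (an actual Google Sheets script would be written in Apps Script); the function name `expand_processes` and the two dictionaries are hypothetical stand-ins for the sheet ranges.

```python
def expand_processes(amounts, processes):
    """Repeat each process of an equipment as many times as the equipment is produced."""
    result = []
    for equipment, amount in amounts.items():
        for process in processes.get(equipment, []):
            result.extend([process] * amount)
    return result


# Hypothetical in-memory copies of the two tables from the question.
amounts = {"Equipment 1": 2, "Equipment 2": 3}
processes = {
    "Equipment 1": ["Process 1", "Process 2"],
    "Equipment 2": ["Process 3", "Process 4", "Process 5"],
}

process_list = expand_processes(amounts, processes)
```

Adding another equipment or changing its processes only changes the input tables, not the loop.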

how to understand the memory issue of breadth-first-search in branch and bound

I have been confused by the branch-and-bound method recently. There are three search strategies in branch and bound: depth-first search, breadth-first search and best-first search. All the books and papers state that breadth-first and best-first search take more memory. How should I understand this? Take a binary tree as an example: when a node (the father node) is taken from the live node list to be processed, two sub-nodes (or son nodes) are generated and inserted into the live node list, but the father node is deleted, so memory grows by only one node. From this point of view, all three search strategies use the same amount of memory.
Am I right? This has confused me for a long time. Could anyone give me some advice?
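Well,
You could think about the data structures:
Breadth-first search: It's implemented as a queue. When you expand a node (the father node), you add its son nodes to the queue, and the father node is deleted.
Let's take an example: a binary tree with root 45, whose sons are 20 and 70; 20 has sons 10 and 28, 70 has sons 60 and 85, and 10 has sons 1 and 18.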
Expand 45: We add 20 and 70 to the queue and delete 45, so the queue is:
20 | 70
Expand 20: We take the first node from the queue and add its sons:
70 | 10 | 28
Expand 70: We take the first node from the queue and add its sons:
10 | 28 | 60 | 85
And so on...
As you can see, the queue ends up holding a whole level of the tree at once, so the space complexity is exponential: O(b^d) (b = branching factor; d = depth, with the root at depth 0).
Depth-first search: It's implemented as a stack:
Expand 45: We push 20 and 70 onto the stack and delete 45, so the stack is:
20 | 70
Expand 20: We pop the node on top of the stack and push its sons:
10 | 28 | 70
Expand 10: We pop the node on top of the stack and push its sons:
1 | 18 | 28 | 70
And so on...
Now the space complexity is linear in the depth: the stack only holds the unexpanded siblings along the current path, i.e. O(b·d), which is O(d) for a fixed branching factor. Time complexity is O(b^d) in both algorithms.
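Best-first search: It keeps the live nodes in a priority queue ordered by a heuristic evaluation function f(n) and always expands the node with the best f(n). Since any live node may become the best one later, the whole live node list has to stay in memory, so in the worst case the space complexity is exponential, O(b^d), like breadth-first search.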
Hope this helps.
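The difference can be checked with a small experiment. The sketch below (in Python; the function name `max_frontier` is hypothetical) walks a complete binary tree of a given depth and records the largest size the live node list reaches when it is used as a queue versus as a stack. Only node depths are stored, and no bounding or pruning is modelled, so this is an upper-bound illustration rather than a real branch-and-bound run.

```python
from collections import deque


def max_frontier(depth, breadth_first):
    """Return the peak size of the live node list on a complete binary tree."""
    frontier = deque([0])  # each entry is just the depth of a live node
    peak = 1
    while frontier:
        # Queue (FIFO) for breadth-first, stack (LIFO) for depth-first.
        node_depth = frontier.popleft() if breadth_first else frontier.pop()
        if node_depth < depth:
            # The father is deleted and its two sons are added.
            frontier.append(node_depth + 1)
            frontier.append(node_depth + 1)
        peak = max(peak, len(frontier))
    return peak


bfs_peak = max_frontier(10, breadth_first=True)   # 1024 = 2**10
dfs_peak = max_frontier(10, breadth_first=False)  # 11 = depth + 1
```

Each expansion does add only one node net, as the question observes, but depth-first search keeps hitting leaves (which remove a node without adding any), while breadth-first search expands every internal node of a level before reaching any leaf.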

Awk filtering out paragraphs

I have a plain text file containing paragraphs of about 15-40 lines each, and each paragraph is separated from the previous/next one by 3 empty lines. I'd like to print all the paragraphs that contain the string "sasi89".
Example:
PokerStars Hand #61919020230: Tournament #393199063, $0.10+$0.01 USD Hold'em No Limit - Level IV (50/100) - 2011/05/10 12:11:58 ET
Table '393199063 1' 9-max Seat #9 is the button
Seat 1: bebe2829 (1529 in chips)
Seat 3: zng 111 (4374 in chips)
Seat 4: mal4o (11100 in chips)
Seat 6: gysomi (6118 in chips)
Seat 7: DEEAMAYA (2590 in chips)
Seat 9: sasi89 (235 in chips)
bebe2829: posts small blind 50
zng 111: posts big blind 100
*** HOLE CARDS ***
Dealt to sasi89 [Kc Th]
mal4o: folds
gysomi: folds
DEEAMAYA: folds
sasi89: raises 135 to 235 and is all-in
bebe2829: folds
zng 111: calls 135
*** FLOP *** [7h 9s Tc]
*** TURN *** [7h 9s Tc] [Qd]
*** RIVER *** [7h 9s Tc Qd] [9d]
*** SHOW DOWN ***
zng 111: shows [Jh Kh] (a straight, Nine to King)
sasi89: shows [Kc Th] (two pair, Tens and Nines)
zng 111 collected 520 from pot
sasi89 finished the tournament in 11th place
*** SUMMARY ***
Total pot 520 | Rake 0
Board [7h 9s Tc Qd 9d]
Seat 1: bebe2829 (small blind) folded before Flop
Seat 3: zng 111 (big blind) showed [Jh Kh] and won (520) with a straight, Nine to King
Seat 4: mal4o folded before Flop (didn't bet)
Seat 6: gysomi folded before Flop (didn't bet)
Seat 7: DEEAMAYA folded before Flop (didn't bet)
Seat 9: sasi89 (button) showed [Kc Th] and lost with two pair, Tens and Nines
PokerStars Hand #61918994165: Tournament #393199063, $0.10+$0.01 USD Hold'em No Limit - Level IV (50/100) - 2011/05/10 12:11:19 ET
Table '393199063 1' 9-max Seat #7 is the button
Seat 1: bebe2829 (1079 in chips)
Seat 3: zng 111 (4374 in chips)
Seat 4: mal4o (11500 in chips)
Seat 6: gysomi (6118 in chips)
Seat 7: DEEAMAYA (2590 in chips)
Seat 9: sasi89 (285 in chips)
sasi89: posts small blind 50
bebe2829: posts big blind 100
*** HOLE CARDS ***
Dealt to sasi89 [2d 7h]
zng 111: folds
mal4o: calls 100
gysomi: folds
DEEAMAYA: folds
sasi89: folds
bebe2829: checks
*** FLOP *** [8c Js 2h]
bebe2829: checks
mal4o: checks
*** TURN *** [8c Js 2h] [8h]
bebe2829: checks
mal4o: checks
*** RIVER *** [8c Js 2h 8h] [6h]
bebe2829: bets 300
mal4o: calls 300
*** SHOW DOWN ***
bebe2829: shows [Jc 3c] (two pair, Jacks and Eights)
mal4o: mucks hand
bebe2829 collected 850 from pot
*** SUMMARY ***
Total pot 850 | Rake 0
Board [8c Js 2h 8h 6h]
Seat 1: bebe2829 (big blind) showed [Jc 3c] and won (850) with two pair, Jacks and Eights
Seat 3: zng 111 folded before Flop (didn't bet)
Seat 4: mal4o mucked [6d Ac]
Seat 6: gysomi folded before Flop (didn't bet)
Seat 7: DEEAMAYA (button) folded before Flop (didn't bet)
Seat 9: sasi89 (small blind) folded before Flop
You can use awk in "paragraph mode" (see https://www.gnu.org/software/gawk/manual/html_node/Multiple-Line.html) by setting RS to the empty string:
awk -v RS= '/sasi89/' file
The above assumes there are no other blank lines in your file except those between paragraphs.
awk to the rescue:
awk 'BEGIN{ORS=RS="\n\n\n"} /sasi89/' file
This will keep the 3 empty lines between paragraphs. If you want to normalize to a single empty line, remove ORS=, or simply use:
awk -v RS="\n\n\n" '/sasi89/' file
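For comparison, the same paragraph filter can be sketched in Python. This is only an illustration of what awk's paragraph mode does, not a replacement for the one-liners above; the function name `paragraphs_containing` and the sample text are hypothetical.

```python
import re


def paragraphs_containing(text, needle):
    """Split text on runs of blank lines and keep paragraphs containing needle."""
    paragraphs = re.split(r"\n(?:[ \t]*\n)+", text.strip())
    return [p for p in paragraphs if needle in p]


# Hypothetical miniature file: three paragraphs separated by 3 empty lines.
text = "Seat 9: sasi89\nline b\n\n\n\nSeat 1: other\nline d\n\n\n\nDealt to sasi89\n"

matches = paragraphs_containing(text, "sasi89")
```

Printing the matches joined by whatever separator is wanted then plays the role of ORS.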
