Parsing a huge file in Fortran

I am trying to parse an output file of a popular QM program, in order to extract data corresponding to two related properties: 'frequencies' and 'intensities'. An example of how the output file looks can be found below:
Max difference between off-diagonal Polar Derivs IMax= 2 JMax= 3 KMax= 13 EMax= 8.65D-04
Full mass-weighted force constant matrix:
Low frequencies --- -2.0296 -1.7337 -1.3848 -0.0005 -0.0003 0.0007
Low frequencies --- 216.4611 263.3990 368.1703
Diagonal vibrational polarizability:
18.1080784 9.1046025 11.9153848
Diagonal vibrational hyperpolarizability:
127.1032599 2.7794305 -8.7599786
Harmonic frequencies (cm**-1), IR intensities (KM/Mole), Raman scattering
activities (A**4/AMU), depolarization ratios for plane and unpolarized
incident light, reduced masses (AMU), force constants (mDyne/A),
and normal coordinates:
1 2 3
A A A
Frequencies -- 216.4611 263.3989 368.1703
Red. masses -- 3.3756 1.0427 3.0817
Frc consts -- 0.0932 0.0426 0.2461
IR Inten -- 3.6192 21.7801 0.2120
Raman Activ -- 1.0049 0.1635 0.9226
Depolar (P) -- 0.6948 0.6536 0.7460
Depolar (U) -- 0.8199 0.7905 0.8546
Atom AN X Y Z X Y Z X Y Z
1 6 0.00 0.00 0.22 0.00 0.01 0.02 0.06 0.15 -0.01
2 7 0.00 0.00 0.00 0.00 0.00 0.00 0.10 -0.02 0.00
3 6 0.00 0.00 -0.23 0.00 -0.01 0.00 0.01 -0.07 0.00
4 6 0.00 0.00 0.00 0.00 0.00 0.00 -0.08 -0.02 0.00
5 6 0.00 0.00 0.21 0.00 0.01 -0.03 -0.06 0.15 0.00
6 6 0.00 0.00 0.11 0.00 0.01 0.00 -0.01 0.17 0.00
7 7 -0.02 0.00 -0.22 0.00 0.03 0.00 -0.01 -0.26 0.00
8 1 0.10 -0.02 -0.32 0.02 -0.30 0.66 0.34 -0.39 -0.13
9 1 0.07 -0.02 -0.39 -0.05 -0.25 -0.63 -0.37 -0.40 0.12
10 1 0.00 0.00 0.39 0.01 0.01 0.07 0.18 0.22 -0.03
11 1 0.00 0.00 -0.53 0.00 -0.01 0.01 0.02 -0.15 0.01
12 1 0.00 0.00 -0.03 -0.01 0.00 -0.02 -0.18 -0.09 0.00
13 1 0.00 0.00 0.31 0.00 0.00 -0.09 -0.18 0.22 0.03
4 5 6
A A A
Frequencies -- 411.0849 501.4206 548.5728
Red. masses -- 3.4204 2.8766 6.5195
Frc consts -- 0.3406 0.4261 1.1559
IR Inten -- 4.2311 30.8234 6.3698
Raman Activ -- 0.1512 0.8402 4.2329
Depolar (P) -- 0.7404 0.1511 0.4224
Depolar (U) -- 0.8508 0.2625 0.5939
Atom AN X Y Z X Y Z X Y Z
1 6 0.00 0.00 0.20 0.00 -0.01 0.01 0.02 -0.12 -0.01
2 7 0.00 0.00 -0.21 0.00 0.00 -0.16 0.06 -0.18 0.02
3 6 0.00 0.00 -0.03 0.01 0.00 0.15 0.32 -0.01 -0.02
4 6 0.00 0.00 0.27 0.01 0.00 -0.08 0.18 0.10 0.01
5 6 0.00 0.00 -0.23 0.00 0.00 -0.03 0.11 0.19 0.00
6 6 0.00 0.00 -0.02 0.00 0.00 0.32 -0.26 0.01 -0.04
7 7 0.00 -0.01 0.01 -0.04 0.00 -0.04 -0.39 0.02 0.04
8 1 -0.01 0.05 -0.10 0.17 0.03 -0.36 -0.36 0.06 -0.08
9 1 -0.02 0.04 0.16 0.15 -0.01 -0.35 -0.30 0.02 -0.11
10 1 0.01 0.01 0.48 0.01 0.00 -0.35 0.22 -0.01 0.03
11 1 0.00 0.00 -0.12 0.01 0.00 0.23 0.31 0.13 -0.02
12 1 0.00 0.00 0.54 0.00 0.00 -0.39 -0.02 -0.03 0.05
13 1 -0.01 0.00 -0.47 0.01 0.00 -0.45 0.34 0.06 0.04
7 8 9
A A A
Frequencies -- 629.8582 652.6212 716.4846
Red. masses -- 7.0000 1.4491 2.4272
Frc consts -- 1.6362 0.3637 0.7341
IR Inten -- 9.4587 253.3389 18.8342
Raman Activ -- 3.5151 11.7363 0.2311
Depolar (P) -- 0.7397 0.2892 0.7423
Depolar (U) -- 0.8504 0.4486 0.8521
Atom AN X Y Z X Y Z X Y Z
1 6 0.24 -0.18 -0.01 -0.02 0.03 -0.04 0.00 0.00 -0.12
2 7 0.30 0.27 0.02 -0.02 0.00 0.04 0.00 0.00 0.17
3 6 0.06 0.12 -0.02 -0.03 -0.01 -0.04 0.00 0.00 -0.15
4 6 -0.23 0.23 0.01 0.02 -0.04 0.02 0.00 0.00 0.18
5 6 -0.22 -0.20 -0.01 0.02 0.00 -0.04 0.00 0.00 -0.08
6 6 -0.04 -0.15 -0.02 0.04 0.01 -0.04 0.00 0.00 0.13
7 7 -0.13 -0.07 0.06 -0.05 0.00 0.14 0.01 0.00 -0.01
8 1 0.02 -0.03 -0.20 0.30 0.13 -0.57 0.00 -0.02 0.05
9 1 0.00 -0.12 -0.26 0.29 -0.10 -0.63 -0.01 0.02 0.05
The code I'm using is:
program gau_parser
implicit none
integer :: ierr ! Error value for read statement
integer, parameter :: iu = 20 ! input unit
integer, parameter :: ou = 30 ! output unit
character (len=*), parameter :: search_str = " Frequencies --" ! this is the property I'm looking for
! ^===============^ there are 15 characters here. First character is blank.
!
! NOTE: a typical string looks like this: " Frequencies -- 411.0849 501.4206 548.5728"
! ============== ======== ======== ========
! search_str xx(1) xx(2) xx(3)
!
! the string length is 73 but may be variable but very seldomly more than 80
!
real :: xx(3) ! this will be the three values associated to the above property
character (len=80) :: text
character (len=15) :: word
open (unit=iu,file="dummy.log",action="read") ! read the file I wish to parse
open (unit=ou,file='output.log',action="write") ! Open a file where I wish the parse results to be written to!
do ! the search is done line by line, until the end of the file
read (iu,"(a)",iostat=ierr) text ! read line into character variable
if (ierr /= 0) then
cycle ! If a reading error occurs, advance to new line
end if
read (text,*) word ! read first word of line
if (word == search_str) then ! found search string at beginning of line
read (text,*) word,xx ! read the entire line
write(30,*) word,xx ! write the entire line
end if
end do ! finish the search cycle
end program gau_parser
My questions are the following:
a) The present code is compilable, but 'hangs up' upon execution. Can anyone compile their own version and see if the same is happening to them? What (user induced) error may be causing such behavior?
b) How can I make the multiple values of 'xx' be written in a single array in sequence? That is, they should be read like this from the parsed file
word xx(1) xx(2) xx(3)
...
junk
...
word xx(4) xx(5) xx(6)
...
more junk
...
word xx(7) xx(8) xx(9)
I know that I've declared the array with dimension(3) in the program, but that is just for the sake of testing. In reality it must be allocatable, with its size left unspecified until the end of the parsed file is reached, at which point it must INQUIRE:SIZE. My idea is to print the values into a scratch file, evaluate it, and then write them back into memory as an array of dimension xx(INQUIRE:SIZE). Any thoughts on the matter would be most welcome!
EDIT: After trying to debug the program, I realized that it was actually looping! I've inserted a couple of write statements to see what could be going wrong
open (unit=iu,file="dummy.log",action="read") ! read the file I wish to parse
print*,'file opened'
! open (unit=ou,file='output.log',action="write") ! Open a file where I wish the parse results to be written to!
do ! the search is done line by line, until the end of the file
print*,'Do loop has started'
read (iu,"(a)",iostat=ierr) text ! read line into character variable
if (ierr /= 0) then
write(*,*)'Error!'
cycle ! If a reading error occurs, advance to new line
end if
and ... voilà! My screen started to fill up with a flurry of
Error!
Do loop has started
messages! In essence, I'm stuck in a loop! Where have I failed?

There is a subtle error in the code. The statement
read (iu,"(a)",iostat=ierr) text ! read line into character variable
reads a line of text from the file into the variable text. It uses the edit descriptor "(a)", which means that text holds the line as it appears in the file, leading blanks included. On the other hand, the statement
read (text,*) word
uses list-directed input (that's what the * means), and it does not get the string " Frequencies", with its leading blank, from the line. List-directed input skips the leading blank characters, so word gets the string "Frequencies" (no leading space). This will never match the searched-for string, which starts with a blank.
An aside: especially when developing code, do not let loops run indefinitely. Put in a reasonable maximum number of iterations, e.g. do ix = 1,200 for your test case; this will stop you wasting time staring at a computation which ain't ever going to finish.
The reason that the code runs forever is that there is no end condition. Instead, the block of code
if (ierr /= 0) then
cycle ! If a reading error occurs, advance to new line
end if
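sends execution back to the do statement, ad infinitum: once the end of the file has been reached, every further read returns a non-zero ierr, so the loop never exits. I would use a stopping condition like this: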
IF (IS_IOSTAT_END(ierr)) EXIT
The function IS_IOSTAT_END frees you from having to figure out what error code end-of-file causes on your compiler; the values of those codes are not standardised. IS_IOSTAT_EOR is useful to check for end-of-record.
The next error you will find is that the statement
read (text,*) word
won't make word match Frequencies -- either. Again, list-directed input treats blanks in the input as separators, so this line of code will only get Frequencies into word. But that leads to another problem:
read (text,*) word,xx ! read the entire line
will try to read the string -- into the real variable xx, with unhappy results.
One solution (perhaps the solution) to this series of problems is to use an explicit edit descriptor in the read statements. First change
read (text,*) word
to
read (text,'(a15)') word
Next, you have to change the line to read xx to something like
read (text,'(a15,3(f18.4))') word,xx ! read the entire line
You will find that, as it stands, this line does not read all 3 values into xx correctly. That's because the edit descriptor 3(f18.4) does not quite describe the layout of the line; it may need f18.4,2(fNN.4), where of course you replace NN by the proper field width for your file. And it's time you did some of the work.
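For comparison, here is the overall logic (stop at the end of the input, match the label in front of the "--" separator, append the values to a growing sequence) as a minimal sketch in Python rather than Fortran. The function name parse_property and the inline sample lines are illustrative assumptions, not part of the original program, and the sketch assumes every value on a matching line is a plain decimal number.

```python
def parse_property(lines, label):
    """Collect the numbers following `label --` from every matching line."""
    values = []
    for line in lines:                      # the loop ends when the input ends
        # Split once at the first "--": label on the left, numbers on the right.
        head, sep, tail = line.partition("--")
        if sep and head.strip() == label:
            values.extend(float(token) for token in tail.split())
    return values


# Hypothetical sample, copied from the output excerpt in the question.
sample = [
    " Low frequencies --- 216.4611 263.3990 368.1703",
    " Frequencies -- 216.4611 263.3989 368.1703",
    " Red. masses -- 3.3756 1.0427 3.0817",
    " IR Inten -- 3.6192 21.7801 0.2120",
    " Frequencies -- 411.0849 501.4206 548.5728",
    " IR Inten -- 4.2311 30.8234 6.3698",
]

frequencies = parse_property(sample, "Frequencies")
intensities = parse_property(sample, "IR Inten")
print(len(frequencies), frequencies)
print(len(intensities), intensities)
```

Because the list grows as lines are matched, its final length is only known once the whole file has been read, which is the behaviour asked for in question b). In Fortran the closest analogue would be an allocatable array that is extended on each match.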

Related

Table printing a list of lists Common lisp

I wish to print this data in a table with the columns aligned. I tried with format, but the columns were not aligned. Does anyone know how to do it? Thank you.
(("tiscali" 10000 2.31 0.84 -14700.0 "none")
("atlantia" 50 22.65 22.68 1.5 "none")
("bper-banca" 1000 1.59 2.01 423.0 "none")
("alerion-cleanpower" 30 44.14 36.45 -230.7 "none")
("tesmec" 10000 0.12 0.14 150.0 "none")
("cover-50" 120 8.95 9.6 78.0 "none")
("ovs" 1000 1.71 1.93 217.0 "none")
("credito-emiliano" 200 5.7 6.26 112.0 "none"))
I tried to align the columns with the ~T directive, without success. Is there a piece of code that prints table data nicely?
Let's break this down.
First, let's give your data a nice name:
(defparameter *data*
'(("tiscali" 10000 2.31 0.84 -14700.0 "none")
("atlantia" 50 22.65 22.68 1.5 "none")
("bper-banca" 1000 1.59 2.01 423.0 "none")
("alerion-cleanpower" 30 44.14 36.45 -230.7 "none")
("tesmec" 10000 0.12 0.14 150.0 "none")
("cover-50" 120 8.95 9.6 78.0 "none")
("ovs" 1000 1.71 1.93 217.0 "none")
("credito-emiliano" 200 5.7 6.26 112.0 "none")))
Now, come up with a way to print each line using format and destructuring-bind. Widths of various fields are hard-coded in.
(defun print-line (line)
(destructuring-bind (a b c d e f) line
(format T "~20a ~5d ~6,2f ~6,2f ~10,2f ~4a~%" a b c d e f)))
Once you know you can print a line, you just need to do that for each line.
(mapcar 'print-line *data*)
Result:
tiscali              10000   2.31   0.84  -14700.00 none
atlantia                50  22.65  22.68       1.50 none
bper-banca            1000   1.59   2.01     423.00 none
alerion-cleanpower      30  44.14  36.45    -230.70 none
tesmec               10000   0.12   0.14     150.00 none
cover-50               120   8.95   9.60      78.00 none
ovs                   1000   1.71   1.93     217.00 none
credito-emiliano       200   5.70   6.26     112.00 none
I have something like this in my personal code, that I reproduced here in a simplified way:
(defpackage :tabular (:use :cl))
(in-package :tabular)
I have a function that turns any object into a list of values (a row), here the usage is for a list of values, so it is already in the correct shape.
(defgeneric columnize (object)
(:documentation "Representation of object as a list of fields")
(:method ((o list)) o))
I also define a transpose method that works with lists of various sizes:
(defun transpose (lists)
(when (notany #'null lists)
(cons
(mapcar #'first lists)
(transpose (mapcar #'cdr lists)))))
Here is your data, as defined by Chris:
(defparameter *data*
'(("tiscali" 10000 2.31 0.84 -14700.0 "none")
("atlantia" 50 22.65 22.68 1.5 "none")
("bper-banca" 1000 1.59 2.01 423.0 "none")
("alerion-cleanpower" 30 44.14 36.45 -230.7 "none")
("tesmec" 10000 0.12 0.14 150.0 "none")
("cover-50" 120 8.95 9.6 78.0 "none")
("ovs" 1000 1.71 1.93 217.0 "none")
("credito-emiliano" 200 5.7 6.26 112.0 "none")))
And finally, a function that prints a list of objects in a tabular way.
Basically, I convert every object to a list of values, convert those values to strings, and compute the length of each string. This gives a matrix of sizes that I transpose to get, for each column, the list of its cell sizes; the width of each column is then the maximum size of the actual data in it.
In practice, I also allow the generic function to add indicators such as how to justify (left/right), etc.
(defun tabulate (stream objects)
(loop
for n from 0
for o in objects
for row = (mapcar #'princ-to-string (columnize o))
collect row into rows
collect (mapcar #'length row) into row-widths
finally
(flet ((build-format-arguments (max-width row)
(when (> max-width 0)
(list max-width #\space row))))
(loop
with number-width = (ceiling (log n 10))
with col-widths = (transpose row-widths)
with max-col-widths = (mapcar (lambda (s) (reduce #'max s)) col-widths)
for index from 0
for row in rows
for entries = (mapcan #'build-format-arguments max-col-widths row)
do (format stream
"~v,'0d. ~{~v,,,va~^ ~}~%"
number-width index entries)))))
For example:
(fresh-line)
(tabulate *standard-output* *data*)
Gives:
0. tiscali            10000 2.31  0.84  -14700.0 none
1. atlantia           50    22.65 22.68 1.5      none
2. bper-banca         1000  1.59  2.01  423.0    none
3. alerion-cleanpower 30    44.14 36.45 -230.7   none
4. tesmec             10000 0.12  0.14  150.0    none
5. cover-50           120   8.95  9.6   78.0     none
6. ovs                1000  1.71  1.93  217.0    none
7. credito-emiliano   200   5.7   6.26  112.0    none
As you can see, there are some adjustments that could be made to format floating-point values so that they align on the decimal point, but this is already quite useful.
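The width computation described above (stringify every cell, transpose, take the maximum length per column) can also be sketched outside Lisp. The following is a minimal Python illustration, not a port of the code above: the function name tabulate and the three-row sample are assumptions made for the example, and every cell is simply left-justified.

```python
def tabulate(rows):
    """Return one string per row, each column padded to its widest cell."""
    cells = [[str(value) for value in row] for row in rows]
    # zip(*cells) transposes the rows into columns.
    widths = [max(len(cell) for cell in column) for column in zip(*cells)]
    # Width of the row counter: number of digits of the largest index.
    index_width = len(str(max(len(cells) - 1, 0)))
    lines = []
    for index, row in enumerate(cells):
        padded = " ".join(cell.ljust(width) for cell, width in zip(row, widths))
        lines.append(f"{index:0{index_width}d}. {padded}".rstrip())
    return lines


# Hypothetical subset of the data from the question.
data = [
    ("tiscali", 10000, 2.31, 0.84, -14700.0, "none"),
    ("atlantia", 50, 22.65, 22.68, 1.5, "none"),
    ("alerion-cleanpower", 30, 44.14, 36.45, -230.7, "none"),
]

for line in tabulate(data):
    print(line)
```

Deriving the widths from the data avoids the hard-coded field widths of the first answer, at the cost of having to traverse all rows before printing the first one.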

how to read the last value from the column in tableau

I have a scenario here.
Cycle Values
1 0.5
5 1.7
6 0.65
7 2.5
8 0.14
In Tableau, I need a calculation that returns the last value, 0.14.

when using the default 'randomForest' algorithm for classification, why doesn't the number of terminal nodes match the number of cases?

According to https://cran.r-project.org/web/packages/randomForest/randomForest.pdf, classification trees are fully grown, meaning node size = 1.
However, if trees are really grown to a maximum, then shouldn't each terminal node contain a single case (data point, species, etc)?
If I run:
library(randomForest)
data(iris) #150 cases
set.seed(352)
rf <- randomForest(Species ~ ., iris)
hist(treesize(rf),main ="number of nodes")
I can see that most "fully grown" trees only have about 10 nodes, meaning node size can't be equal to 1...Right?
For example, a status of -1 below marks a terminal node in the 134th tree of the forest. Only 8 terminal nodes!?
> getTree(rf,134)
   left daughter right daughter split var split point status prediction
1              2              3         3        2.50      1          0
2              0              0         0        0.00     -1          1
3              4              5         4        1.75      1          0
4              6              7         3        4.95      1          0
5              8              9         3        4.85      1          0
6             10             11         4        1.60      1          0
7             12             13         1        6.50      1          0
8             14             15         1        5.95      1          0
9              0              0         0        0.00     -1          3
10             0              0         0        0.00     -1          2
11             0              0         0        0.00     -1          3
12             0              0         0        0.00     -1          3
13             0              0         0        0.00     -1          2
14             0              0         0        0.00     -1          2
15             0              0         0        0.00     -1          3
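I would be grateful if someone could explain.
"Fully grown" means "nothing left to split". A node of a decision tree is fully grown if all data records assigned to it carry the same class label, so they all yield the same prediction. A node size of 1 is the minimum size a terminal node may have, not the size every terminal node must have.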
In the iris dataset case, once you reach a node with 50 setosa data records in it, it doesn't make sense to split it into two child nodes with 25 and 25 setosas each.
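The stopping rule can be illustrated with a toy tree grown on a single feature. This is a minimal Python sketch, not the algorithm implemented in the randomForest package: the functions gini and count_leaves and the six sample points are assumptions made for the example. A node is split at the threshold with the lowest weighted Gini impurity, and splitting stops as soon as a node is pure.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def count_leaves(points):
    """Grow a one-feature tree on (x, label) pairs; return its leaf count."""
    if len({label for _, label in points}) == 1:
        return 1                          # pure node: nothing left to split
    points = sorted(points)
    best = None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue                      # no threshold between equal x values
        left, right = points[:i], points[i:]
        score = (len(left) * gini([lab for _, lab in left])
                 + len(right) * gini([lab for _, lab in right]))
        if best is None or score < best[0]:
            best = (score, left, right)
    if best is None:
        return 1                          # identical x values cannot be separated
    return count_leaves(best[1]) + count_leaves(best[2])


# Hypothetical data: six cases, three classes, one feature.
points = [(1.0, "setosa"), (1.2, "setosa"), (1.4, "setosa"),
          (4.5, "versicolor"), (4.7, "versicolor"), (6.0, "virginica")]
print(count_leaves(points))  # prints 3
```

Six cases end up in three terminal nodes because each class occupies its own range of the feature, which is the same reason the iris trees stop at around ten nodes instead of 150.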

Grid Search in SVM not at all improving the model

I want my SVM to classify the given data into three classes: 0, 1 and 2. Initially, the model gets nothing right for class 1 (precision and recall are both 0). So I used grid search, and even after the grid search class 1 still gets 0.0 precision. What might be wrong? How can I make my model more precise?
before grid search:
              precision    recall  f1-score   support
           0       0.75      0.44      0.55        41
           1       0.00      0.00      0.00        37
           2       0.50      0.98      0.66        55
    accuracy                           0.54       133
   macro avg       0.42      0.47      0.40       133
weighted avg       0.44      0.54      0.44       133
after grid search: {'C': 100, 'gamma': 0.1, 'kernel': 'rbf'}
              precision    recall  f1-score   support
           0       0.72      0.56      0.63        41
           1       0.00      0.00      0.00        37
           2       0.52      0.96      0.68        55
    accuracy                           0.57       133
   macro avg       0.41      0.51      0.44       133
weighted avg       0.44      0.57      0.48       133
Looking at the plot, it seems that the data is linearly separable but has a high amount of noise. In this case you can use an SVM with a linear kernel.

Profiling a JRUBY rails application outputs <unknown> elements

Environment: Linux Mint 32 bit, JRuby-1.6.5 [ i386 ], Rails 3.1.3.
I am trying to profile my rails application deployed on JRuby 1.6.5 on WEBrick (in development mode).
My JRUBY_OPTS: "-Xlaunch.inproc=false --profile.flat"
In one of my models, I introduced an explicit sleep(5) and ensured that this method is called as part of before_save hook while saving the model. Pseudo code...
class Invoice < ActiveRecord::Base
<some properties here...>
before_save :delay
private
def delay
sleep(5)
end
end
The above code ensures that just before an instance of Invoice gets persisted, the method, delay is invoked automatically.
Now, when I profile the code that creates this model instance (through an rspec unit test), I get the following output:
6.31 0.00 6.31 14 RSpec::Core::ExampleGroup.run
6.30 0.00 6.30 14 RSpec::Core::ExampleGroup.run_examples
6.30 0.00 6.30 1 RSpec::Core::Example#run
6.30 0.00 6.30 1 RSpec::Core::Example#with_around_hooks
5.58 0.00 5.58 1 <unknown>
5.43 0.00 5.43 2 Rails::Application::RoutesReloader#reload!
5.00 0.00 5.00 1 <unknown>
5.00 5.00 0.00 1 Kernel#sleep
4.87 0.00 4.87 40 ActiveSupport.execute_hook
4.39 0.00 4.39 3 ActionDispatch::Routing::RouteSet#eval_block
4.38 0.00 4.38 2 Rails::Application::RoutesReloader#load_paths
In the above output, why do I see those two <unknown> elements instead of Invoice.delay or something similar?
In fact, when I start my Rails server (WEBrick) with the same JRUBY_OPTS (mentioned above), all my application code frames show up as <unknown> elements in the profiler output!
Am I doing anything wrong?
It looks like you have maxed out the profiler's method limit.
Set -Xprofile.max.methods in JRUBY_OPTS to a large number (the default is 100000, which is never enough). E.g.
export JRUBY_OPTS="--profile.flat -Xprofile.max.methods=10000000"
