Parsing Data in C

I am trying to parse some data using C.
The data is of the form:
REMARK 280 100 MM MES PH 6.5, 5 % GLYCEROL
REMARK 290
REMARK 290 CRYSTALLOGRAPHIC SYMMETRY
REMARK 290 SYMMETRY OPERATORS FOR SPACE GROUP: P 1 21 1
REMARK 290
REMARK 290 SYMOP SYMMETRY
REMARK 290 NNNMMM OPERATOR
REMARK 290 1555 X,Y,Z
REMARK 290 2555 -X,Y+1/2,-Z
I want to extract the symmetry operator data (X,Y,Z and -X,Y+1/2,-Z) and turn each operator into a pair of matrices, a 3x3 matrix and a translation column, of the form:
[1 0 0    [0         [-1 0  0    [0
 0 1 0     0    and    0 1  0     1/2
 0 0 1]    0]          0 0 -1]    0]
for X,Y,Z and -X,Y+1/2,-Z respectively.
I have not done much data parsing and would appreciate any help anyone could offer.
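One possible approach, sketched in Python for brevity (the same logic should port to C with strtok and sscanf): pick out the REMARK 290 rows whose third field is numeric, then split the operator on commas and read each signed term as either an axis letter or a fraction. The function names parse_symop and symops_from_remarks are made up for this sketch, and it assumes operators contain only X, Y, Z and simple fractions such as 1/2, with no embedded spaces.

```python
from fractions import Fraction
import re

def parse_symop(op):
    """Turn e.g. '-X,Y+1/2,-Z' into (3x3 matrix rows, 3-element translation)."""
    rotation, translation = [], []
    for component in op.split(","):
        row = [0, 0, 0]
        shift = Fraction(0)
        # Each signed term is either an axis letter or a number such as 1/2.
        terms = re.findall(r"([+-]?)([XYZ]|\d+(?:/\d+)?)", component.strip().upper())
        for sign, term in terms:
            factor = -1 if sign == "-" else 1
            if term in ("X", "Y", "Z"):
                row["XYZ".index(term)] += factor
            else:
                shift += factor * Fraction(term)
        rotation.append(row)
        translation.append(shift)
    return rotation, translation

def symops_from_remarks(lines):
    """Collect operators from rows shaped like: REMARK 290 <NNNMMM> <operator>."""
    ops = []
    for line in lines:
        fields = line.split()
        if len(fields) == 4 and fields[:2] == ["REMARK", "290"] and fields[2].isdigit():
            ops.append(parse_symop(fields[3]))
    return ops
```

Header rows such as "REMARK 290 NNNMMM OPERATOR" are skipped because their third field is not numeric. Using Fraction keeps 1/2 exact instead of rounding it to a float.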

Related

Terrible performance using XGBoost H2O

Very different Model Performance using XGBoost on H2O
I am training an XGBoost model using 5-fold cross-validation on a very imbalanced binary classification problem. The dataset has 1200 columns (multi-document word2vec document embeddings).
The only parameters specified to train the XGBoost model were:
min_split_improvement = 1e-5
seed=1
nfolds = 5
The reported performance on train data was extremely high (probably overfitting!!!):
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.2814398407936096:
       A      D    Error    Rate
-----  -----  ---  -------  -------------
A      16858  2    0.0001   (2.0/16860.0)
D      0      414  0        (0.0/414.0)
Total  16858  416  0.0001   (2.0/17274.0)
AUC: 0.9999991404060721
The performance on cross validation data was terrible:
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.016815993119962513:
       A      D    Error    Rate
-----  -----  ---  -------  ----------------
A      16003  857  0.0508   (857.0/16860.0)
D      357    57   0.8623   (357.0/414.0)
Total  16360  914  0.0703   (1214.0/17274.0)
AUC: 0.6015883863129724
I know that H2O cross-validation also trains an extra model on all of the available data, so some difference in performance is expected.
But what could cause the cross-validated performance to be this bad?
P.S.: XGBoost is running on a multi-node H2O cluster with OMP.
Model Type: classifier
Performance of model < XGBoost_model_python_1575650180928_617 >:
ModelMetricsBinomial: xgboost
** Reported on train data. **
MSE: 0.0008688085383330077
RMSE: 0.029475558320971762
LogLoss: 0.00836528606162877
Mean Per-Class Error: 5.931198102016033e-05
AUC: 0.9999991404060721
pr_auc: 0.9975495622569983
Gini: 0.9999982808121441
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.2814398407936096:
       A      D    Error    Rate
-----  -----  ---  -------  -------------
A      16858  2    0.0001   (2.0/16860.0)
D      0      414  0        (0.0/414.0)
Total  16858  416  0.0001   (2.0/17274.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- -------- -----
max f1 0.28144 0.99759 195
max f2 0.28144 0.999035 195
max f0point5 0.553885 0.998053 191
max accuracy 0.28144 0.999884 195
max precision 0.990297 1 0
max recall 0.28144 1 195
max specificity 0.990297 1 0
max absolute_mcc 0.28144 0.997534 195
max min_per_class_accuracy 0.28144 0.999881 195
max mean_per_class_accuracy 0.28144 0.999941 195
max tns 0.990297 16860 0
max fns 0.990297 413 0
max fps 0.000111383 16860 399
max tps 0.28144 414 195
max tnr 0.990297 1 0
max fnr 0.990297 0.997585 0
max fpr 0.000111383 1 399
max tpr 0.28144 1 195
Gains/Lift Table: Avg response rate: 2.40 %, avg score: 2.42 %
group cumulative_data_fraction lower_threshold lift cumulative_lift response_rate score cumulative_response_rate cumulative_score capture_rate cumulative_capture_rate gain cumulative_gain
-- ------- -------------------------- ----------------- ------- ----------------- --------------- ----------- -------------------------- ------------------ -------------- ------------------------- ------- -----------------
1 0.0100151 0.873526 41.7246 41.7246 1 0.907782 1 0.907782 0.417874 0.417874 4072.46 4072.46
2 0.0200301 0.776618 41.7246 41.7246 1 0.834968 1 0.871375 0.417874 0.835749 4072.46 4072.46
3 0.0300452 0.0326301 16.4004 33.2832 0.393064 0.303206 0.797688 0.681985 0.164251 1 1540.04 3228.32
4 0.0400023 0.0224876 0 24.9986 0 0.0263919 0.599132 0.518799 0 1 -100 2399.86
5 0.0500174 0.0180858 0 19.9931 0 0.0201498 0.479167 0.418953 0 1 -100 1899.31
6 0.100035 0.0107386 0 9.99653 0 0.0136044 0.239583 0.216279 0 1 -100 899.653
7 0.149994 0.00798337 0 6.66692 0 0.00922284 0.159784 0.147313 0 1 -100 566.692
8 0.200012 0.00629476 0 4.99971 0 0.00709438 0.119826 0.112249 0 1 -100 399.971
9 0.299988 0.00436827 0 3.33346 0 0.00522157 0.0798919 0.0765798 0 1 -100 233.346
10 0.400023 0.00311204 0 2.49986 0 0.00370085 0.0599132 0.0583548 0 1 -100 149.986
11 0.5 0.00227535 0 2 0 0.00267196 0.0479333 0.0472208 0 1 -100 100
12 0.599977 0.00170271 0 1.66673 0 0.00197515 0.039946 0.0396813 0 1 -100 66.6731
13 0.700012 0.00121528 0 1.42855 0 0.00145049 0.0342375 0.034218 0 1 -100 42.8548
14 0.799988 0.000837358 0 1.25002 0 0.00102069 0.0299588 0.0300692 0 1 -100 25.0018
15 0.899965 0.000507632 0 1.11115 0 0.000670878 0.0266306 0.0268033 0 1 -100 11.1154
16 1 3.35288e-05 0 1 0 0.00033002 0.0239667 0.0241551 0 1 -100 0
Cross-validation (xval) performance of model < XGBoost_model_python_1575650180928_617 >:
ModelMetricsBinomial: xgboost
** Reported on cross-validation data. **
MSE: 0.023504756648164406
RMSE: 0.15331261085822134
LogLoss: 0.14134815775808462
Mean Per-Class Error: 0.4160864407653825
AUC: 0.6015883863129724
pr_auc: 0.04991836222189148
Gini: 0.2031767726259448
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.016815993119962513:
       A      D    Error    Rate
-----  -----  ---  -------  ----------------
A      16003  857  0.0508   (857.0/16860.0)
D      357    57   0.8623   (357.0/414.0)
Total  16360  914  0.0703   (1214.0/17274.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- --------- -----
max f1 0.016816 0.0858434 209
max f2 0.00409934 0.138433 318
max f0point5 0.0422254 0.0914205 127
max accuracy 0.905155 0.976323 3
max precision 0.99221 1 0
max recall 9.60076e-05 1 399
max specificity 0.99221 1 0
max absolute_mcc 0.825434 0.109684 5
max min_per_class_accuracy 0.00238436 0.572464 345
max mean_per_class_accuracy 0.00262155 0.583914 341
max tns 0.99221 16860 0
max fns 0.99221 412 0
max fps 9.60076e-05 16860 399
max tps 9.60076e-05 414 399
max tnr 0.99221 1 0
max fnr 0.99221 0.995169 0
max fpr 9.60076e-05 1 399
max tpr 9.60076e-05 1 399
Gains/Lift Table: Avg response rate: 2.40 %, avg score: 0.54 %
group cumulative_data_fraction lower_threshold lift cumulative_lift response_rate score cumulative_response_rate cumulative_score capture_rate cumulative_capture_rate gain cumulative_gain
-- ------- -------------------------- ----------------- -------- ----------------- --------------- ----------- -------------------------- ------------------ -------------- ------------------------- --------- -----------------
1 0.0100151 0.0540408 4.34129 4.34129 0.104046 0.146278 0.104046 0.146278 0.0434783 0.0434783 334.129 334.129
2 0.0200301 0.033963 2.41183 3.37656 0.0578035 0.0424722 0.0809249 0.094375 0.0241546 0.0676329 141.183 237.656
3 0.0300452 0.0251807 2.17065 2.97459 0.0520231 0.0292894 0.0712909 0.0726798 0.0217391 0.089372 117.065 197.459
4 0.0400023 0.02038 2.18327 2.77762 0.0523256 0.0225741 0.0665702 0.0602078 0.0217391 0.111111 118.327 177.762
5 0.0500174 0.0174157 1.92946 2.60779 0.0462428 0.0188102 0.0625 0.0519187 0.0193237 0.130435 92.9463 160.779
6 0.100035 0.0103201 1.59365 2.10072 0.0381944 0.0132217 0.0503472 0.0325702 0.0797101 0.210145 59.3649 110.072
7 0.149994 0.00742152 1.06366 1.7553 0.0254925 0.00867473 0.0420687 0.0246112 0.0531401 0.263285 6.3664 75.5301
8 0.200012 0.00560037 1.11073 1.59411 0.0266204 0.00642966 0.0382055 0.0200645 0.0555556 0.318841 11.0725 59.4111
9 0.299988 0.00366149 1.30465 1.49764 0.0312681 0.00452583 0.0358935 0.0148859 0.130435 0.449275 30.465 49.7642
10 0.400023 0.00259159 1.13487 1.40692 0.0271991 0.00306994 0.0337192 0.0119311 0.113527 0.562802 13.4872 40.6923
11 0.5 0.00189 0.579844 1.24155 0.0138969 0.00220612 0.0297557 0.00998654 0.057971 0.620773 -42.0156 24.1546
12 0.599977 0.00136983 0.990568 1.19972 0.0237406 0.00161888 0.0287534 0.0085922 0.0990338 0.719807 -0.943246 19.9724
13 0.700012 0.000980029 0.676094 1.1249 0.0162037 0.00116698 0.02696 0.0075311 0.0676329 0.78744 -32.3906 12.4895
14 0.799988 0.00067366 0.797286 1.08395 0.0191083 0.000820365 0.0259787 0.00669244 0.0797101 0.86715 -20.2714 8.39529
15 0.899965 0.000409521 0.797286 1.05211 0.0191083 0.000540092 0.0252155 0.00600898 0.0797101 0.94686 -20.2714 5.21072
16 1 2.55768e-05 0.531216 1 0.0127315 0.000264023 0.0239667 0.00543429 0.0531401 1 -46.8784 0

For the non-cross-validation case, try splitting your data up front into training and validation frames.
I expect you will get a worse AUC on the validation frame than on the training frame.
For highly imbalanced cases, though, sometimes you just need to go by the error rate for each class.
Because there are so many true negatives, they can dominate the AUC (the vast majority of predictions correctly say “not interesting”). Some people upsample the minority class in this situation, using row weights to make the model more sensitive to it.
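To make the row-weight idea concrete, here is a minimal sketch that computes one weight per row, inversely proportional to the frequency of its class, so that both classes carry the same total weight. The helper name balanced_row_weights is hypothetical, and the scheme (total / (number of classes * class count)) is just one common choice, not something prescribed by H2O.

```python
from collections import Counter

def balanced_row_weights(labels):
    """Return one weight per row so every class has the same total weight."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    # Rare classes get large weights, frequent classes get small ones.
    return [total / (n_classes * counts[label]) for label in labels]

# Class sizes taken from the confusion matrices in the question.
labels = ["A"] * 16860 + ["D"] * 414
weights = balanced_row_weights(labels)
```

With these class sizes, each D row weighs roughly 40 times as much as an A row. The resulting list could be added to the training frame as an extra column and passed to the estimator as its weights column; check the H2O documentation for the exact parameter before relying on this.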

Confused about the Text Matrix and Transformation matrix of a pdf parser

I'm developing a PdfParser and I want to print the text content of the PDF on a coordinate plane. Below are the text objects and matrices that are used to render the text. How can I isolate the scaling, rotation, and translation and use them to print the text content at exact coordinates on a canvas?
//Decoded text stream containing text objects
S
Q
q
0.000 0.750 0.750 -0.000 15.000 301.890 cm
0.000 g
/F10 16.000 Tf
0 Tr
0.000 Tc
BT
1 0 0 -1 20.000 13.600 Tm
[<007a>]TJ
ET
Q
q
0.000 0.750 0.750 -0.000 15.000 301.890 cm
1.000 0.416 0.000 rg
/F10 6.667 Tf
0 Tr
0.000 Tc
BT
1 0 0 -1 136.667 13.600 Tm
[<0024>12<0046><0046><0058><0055>6<0048><0003><0032><0058><0057><0053><0058><0057><0003><0036>-4<0052><004f><0058><0057><004c><0052><0051><0003><0026>3<004f><0052><0058><0047><0003><0048><0051>18<0059><004c><0055>6<0052><0051><0050><0048><0051>3<0057>7<000f><0003><0027><0028><0030><0032><0003><0044><0046><0046><0058><0055>6<0048>]TJ
ET
Q
q
0.000 0.750 0.750 -0.000 15.000 301.890 cm
0.000 g
/F10 16.000 Tf
0 Tr
0.000 Tc
BT
1 0 0 -1 603.333 13.600 Tm
[<007a>]TJ
ET
Q
q

The initial S Q is a leftover of a previous instruction block ending in some path stroking and graphics state restoring. As we don't know anything to the contrary, let's assume that 'Q' restores to the initial graphics state, in particular to an unmodified current transformation matrix (CTM).
As we are interested in coordinates according to the default user space coordinate system, we can assume accordingly that the current CTM is the identity matrix.
Let's take a look at the block
q
0.000 0.750 0.750 -0.000 15.000 301.890 cm
0.000 g
/F10 16.000 Tf
0 Tr
0.000 Tc
BT
1 0 0 -1 20.000 13.600 Tm
[<007a>]TJ
ET
Q
As you implied yourself in a comment, the only relevant instructions for the total transformation matrix at the time the text rendering instruction [<007a>]TJ begins to be executed are
0.000 0.750 0.750 -0.000 15.000 301.890 cm
and
1 0 0 -1 20.000 13.600 Tm
setting the current transformation matrix to
 0       0.75   0       1  0  0        0       0.75   0
 0.75    0      0   *   0  1  0   =    0.75    0      0
15.00  301.89   1       0  0  1       15.00  301.89   1
and the text and text line matrices both to
 1     0    0
 0    -1    0
20.0  13.6  1
Thus, the effects of text matrix and current transformation matrix combine to:
 1     0    0        0       0.75   0        0       0.75   0
 0    -1    0   *    0.75    0      0   =   -0.75    0      0
20.0  13.6  1       15.00  301.89   1       25.2   316.89   1
You can split up that combined matrix in a scaling, rotation, and translation like this:
 0       0.75   0       0.75  0     0        0  1  0       1      0      0
-0.75    0      0   =   0     0.75  0   *   -1  0  0   *   0      1      0
25.2   316.89   1       0     0     1        0  0  1      25.2  316.89   1
We have a scaling by .75, a rotation by 90° counterclockwise, and a translation by (25.2, 316.89).
(Of course this can still be subject to a page rotation...)
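The arithmetic above can be checked with a short Python sketch. The helpers pdf_matrix, matmul, and decompose are names invented here. The decomposition assumes the combined matrix is a uniform scaling followed by a rotation and a translation, with no shear or reflection, which holds for this example but not for arbitrary PDF matrices.

```python
import math

def pdf_matrix(a, b, c, d, e, f):
    """Expand the six PDF operands [a b c d e f] into a 3x3 row-vector matrix."""
    return [[a, b, 0], [c, d, 0], [e, f, 1]]

def matmul(m, n):
    """Plain 3x3 matrix product m * n."""
    return [[sum(m[i][k] * n[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def decompose(m):
    """Split m into (uniform scale, rotation in degrees, translation)."""
    scale = math.hypot(m[0][0], m[0][1])
    angle = math.degrees(math.atan2(m[0][1], m[0][0]))
    return scale, angle, (m[2][0], m[2][1])

ctm = pdf_matrix(0.0, 0.75, 0.75, 0.0, 15.0, 301.89)  # operands of the cm instruction
tm = pdf_matrix(1, 0, 0, -1, 20.0, 13.6)              # operands of the Tm instruction
combined = matmul(tm, ctm)                            # text matrix first, then CTM
scale, angle, translation = decompose(combined)
```

Here scale comes out as 0.75, angle as 90 degrees (counterclockwise), and translation as approximately (25.2, 316.89), matching the hand calculation up to floating-point rounding.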

How does Weka evaluate classifier model

I used random forest algorithm and got this result
=== Summary ===
Correctly Classified Instances 10547 97.0464 %
Incorrectly Classified Instances 321 2.9536 %
Kappa statistic 0.9642
Mean absolute error 0.0333
Root mean squared error 0.0952
Relative absolute error 18.1436 %
Root relative squared error 31.4285 %
Total Number of Instances 10868
=== Confusion Matrix ===
    a    b    c    d    e    f    g    h    i   <-- classified as
 1518    1    3    1    0   14    0    0    4 |    a = a
    3 2446    0    0    0    1    1   27    0 |    b = b
    0    0 2942    0    0    0    0    0    0 |    c = c
    0    0    0  470    0    1    1    2    1 |    d = d
    9    0    0    9    2   19    0    3    0 |    e = e
   23    1    2   19    0  677    1   22    6 |    f = f
    4    0    2    0    0   13  379    0    0 |    g = g
   63    2    6   17    0   15    0 1122    3 |    h = h
    9    0    0    0    0    9    0    4  991 |    i = i
I wonder how Weka evaluates errors (mean absolute error, root mean squared error, ...) for non-numerical values ('a', 'b', ...).
I mapped each class to a number from 0 to 8 and evaluated the errors manually, but my results differed from Weka's.
How can I reimplement Weka's evaluation steps?
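One likely explanation, offered as an assumption to verify against the source of Weka's Evaluation class: for a nominal class, Weka does not map labels to numbers at all. It appears to compare each instance's predicted class probability distribution with a 0/1 indicator vector for the true class, and to average the per-class differences. A minimal Python sketch of that idea follows; weka_style_errors is a made-up name, and the inputs are whatever probability distributions your classifier produces.

```python
import math

def weka_style_errors(distributions, actual_indices):
    """Return (mean absolute error, root mean squared error).

    distributions: per-instance lists of predicted class probabilities.
    actual_indices: index of the true class for each instance.
    """
    abs_sum = 0.0
    sq_sum = 0.0
    for probs, actual in zip(distributions, actual_indices):
        n = len(probs)
        # The true class becomes a 0/1 indicator vector.
        target = [1.0 if i == actual else 0.0 for i in range(n)]
        abs_sum += sum(abs(p - t) for p, t in zip(probs, target)) / n
        sq_sum += sum((p - t) ** 2 for p, t in zip(probs, target)) / n
    count = len(actual_indices)
    return abs_sum / count, math.sqrt(sq_sum / count)
```

If this assumption is right, errors computed from labels mapped to 0..8 will never match, because they measure distance between arbitrary class codes rather than between probability vectors.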

Random Forest overfitting?

I'm facing the following problem: I'm training a random forest for binary prediction. The data is structured as follows:
> str(data)
'data.frame': 120269 obs. of 11 variables:
$ SeriousDlqin2yrs : num 1 0 0 0 0 0 0 0 0 0 ...
$ RevolvingUtilizationOfUnsecuredLines: num 0.766 0.957 0.658 0.234 0.907 ...
$ age : num 45 40 38 30 49 74 39 57 30 51 ...
$ NumberOfTime30.59DaysPastDueNotWorse: num 2 0 1 0 1 0 0 0 0 0 ...
$ DebtRatio : num 0.803 0.1219 0.0851 0.036 0.0249 ...
$ MonthlyIncome : num 9120 2600 3042 3300 63588 ...
$ NumberOfOpenCreditLinesAndLoans : num 13 4 2 5 7 3 8 9 5 7 ...
$ NumberOfTimes90DaysLate : num 0 0 1 0 0 0 0 0 0 0 ...
$ NumberRealEstateLoansOrLines : num 6 0 0 0 1 1 0 4 0 2 ...
$ NumberOfTime60.89DaysPastDueNotWorse: num 0 0 0 0 0 0 0 0 0 0 ...
$ NumberOfDependents : num 2 1 0 0 0 1 0 2 0 2 ...
- attr(*, "na.action")=Class 'omit' Named int [1:29731] 7 9 17 33 42 53 59 63 72 87 ...
.. ..- attr(*, "names")= chr [1:29731] "7" "9" "17" "33" ...
I split the data:
index <- sample(1:nrow(data),round(0.75*nrow(data)))
train <- data[index,]
test <- data[-index,]
Then I run the model and try to make predictions:
model.rf <- randomForest(as.factor(train[,1]) ~ ., data=train,ntree=1000,mtry=10,importance=TRUE)
pred.rf <- predict(model.rf, test, type = "prob")
rfpred <- c(1:22773)
rfpred[pred.rf[,1]<=0.5] <- "yes"
rfpred[pred.rf[,1]>0.5] <- "no"
rfpred <- factor(rfpred)
test[,1][test[,1]==1] <- "yes"
test[,1][test[,1]==0] <- "no"
test[,1] <- factor(test[,1])
confusionMatrix(as.factor(rfpred), as.factor(test$Y))
What I get is the following output:
> print(model.rf)
Call:
randomForest(formula = as.factor(train[, 1]) ~ ., data = train, ntree = 1000, mtry = 10, importance = TRUE)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 10
OOB estimate of error rate: 0%
Confusion matrix:
0 1 class.error
0 43093 0 0
1 0 25225 0
> head(pred.rf)
0 1
45868.1 1 0
112445 1 0
39001 1 0
133443 1 0
137460 1 0
125835.1 1 0
> confusionMatrix(as.factor(rfpred), as.factor(test$Y))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 14570 0
yes 0 8203
Accuracy : 1
95% CI : (0.9998, 1)
No Information Rate : 0.6398
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.6398
Detection Rate : 0.6398
Detection Prevalence : 0.6398
Balanced Accuracy : 1.0000
'Positive' Class : no
Obviously the model cannot be this accurate! What's wrong with my code?

Parsing line-based structure (ray tracer) without using too many vars

I want to parse a file in Scala (probably using JavaTokenParsers?), preferably without using too many vars :-)
The file is the input for a ray tracer.
It has a line-based structure.
Three types of lines exist: empty lines, comment lines, and command lines.
A comment line starts with # (possibly preceded by some whitespace).
A command line starts with an identifier, optionally followed by a number of parameters (floats or a filename).
How would I go about this? I would like the parser to be called like this:
val scene = parseAll(sceneFile, file);
Sample file:
#Cornell Box
size 640 480
camera 0 0 1 0 0 -1 0 1 0 45
output scene6.png
maxdepth 5
maxverts 12
#planar face
vertex -1 +1 0
vertex -1 -1 0
vertex +1 -1 0
vertex +1 +1 0
#cube
vertex -1 +1 +1
vertex +1 +1 +1
vertex -1 -1 +1
vertex +1 -1 +1
vertex -1 +1 -1
vertex +1 +1 -1
vertex -1 -1 -1
vertex +1 -1 -1
ambient 0 0 0
specular 0 0 0
shininess 1
emission 0 0 0
diffuse 0 0 0
attenuation 1 0.1 0.05
point 0 0.44 -1.5 0.8 0.8 0.8
directional 0 1 -1 0.2 0.2 0.2
diffuse 0 0 1
#sphere 0 0.8 -1.5 0.1
pushTransform
#red
pushTransform
translate 0 0 -3
rotate 0 1 0 60
scale 10 10 1
diffuse 1 0 0
tri 0 1 2
tri 0 2 3
popTransform
#green
pushTransform
translate 0 0 -3
rotate 0 1 0 -60
scale 10 10 1
diffuse 0 1 0
tri 0 1 2
tri 0 2 3
popTransform
#back
pushTransform
scale 10 10 1
translate 0 0 -2
diffuse 1 1 1
tri 0 1 2
tri 0 2 3
popTransform
#sphere
diffuse 0.7 0.5 0.2
specular 0.2 0.2 0.2
pushTransform
translate 0 -0.7 -1.5
scale 0.1 0.1 0.1
sphere 0 0 0 1
popTransform
#cube
diffuse 0.5 0.7 0.2
specular 0.2 0.2 0.2
pushTransform
translate -0.25 -0.4 -1.8
rotate 0 1 0 15
scale 0.25 0.4 0.2
diffuse 1 1 1
tri 4 6 5
tri 6 7 5
tri 4 5 8
tri 5 9 8
tri 7 9 5
tri 7 11 9
tri 4 8 10
tri 4 10 6
tri 6 10 11
tri 6 11 7
tri 10 8 9
tri 10 9 11
popTransform
popTransform
popTransform

Maybe I've pushed the one-liner too hard, but here is my take (idiomatic, though perhaps not optimal):
First, CommandParams represents a command along with its arguments as a list. If there are no arguments, params is None:
case class CommandParams(command:String, params:Option[List[String]])
Then here's the file parsing and construction one-liner, with a line-by-line explanation:
import scala.io.Source
val fileToDataStructure = Source.fromFile("file.txt").getLines() //open file and get lines iterator
  .map(_.trim)                     //strip surrounding whitespace so indented comments are caught too
  .filter(_.nonEmpty)              //exclude empty lines
  .filter(!_.startsWith("#"))      //exclude comments
  .foldLeft(List[CommandParams]()) //iterate and store in a list of CommandParams
  {(listCmds:List[CommandParams], line:String) => //the list built so far and the current line
    val arr = line.split("\\s+")   //split line on runs of whitespace
    val command = arr.head         //first element of array is the command
    val args = if(arr.tail.isEmpty) None else Option(arr.tail.toList) //the rest are its params
    CommandParams(command, args)::listCmds //construct the obj and cons it to the list
  }
  .reverse //consing prepends, so reverse to preserve file order
A demo output iterating through it:
fileToDataStructure.foreach(println)
yields:
CommandParams(size,Some(List(640, 480)))
CommandParams(camera,Some(List(0, 0, 1, 0, 0, -1, 0, 1, 0, 45)))
CommandParams(output,Some(List(scene6.png)))
CommandParams(maxdepth,Some(List(5)))
CommandParams(maxverts,Some(List(12)))
CommandParams(vertex,Some(List(-1, +1, 0)))
...
CommandParams(pushTransform,None)
CommandParams(pushTransform,None)
CommandParams(translate,Some(List(0, 0, -3)))
...
A demo of how to iterate through it to do actual work once loaded:
fileToDataStructure.foreach{
cmdParms => cmdParms match {
case CommandParams(cmd, None) => println(s"I'm a ${cmd} with no args")
case CommandParams(cmd, Some(args))=> println(s"I'm a ${cmd} with args: ${args.mkString(",")}")
}
}
yields output:
I'm a size with args: 640,480
I'm a camera with args: 0,0,1,0,0,-1,0,1,0,45
I'm a output with args: scene6.png
I'm a maxdepth with args: 5
I'm a maxverts with args: 12
I'm a vertex with args: -1,+1,0
...
I'm a popTransform with no args
I'm a popTransform with no args
