Confused about what a glyph is in PDF parsing

//Text object of a decoded text stream in a pdf
q
0.750 0.000 0.000 -0.750 0.000 841.890 cm
0.000 g
/F10 10.667 Tf
0 Tr
0.000 Tc
BT
1 0 0 -1 189.306 981.342 Tm
[<0003><0047><0052>-1<0048><0056><0051>33<006e>11<0057><0003><0052><0051><004f><005c><0003><004f><0052>7<005a>9<0048><0055><0003><005c>10<0052><0058><0055><0003><0046>5<0052><0056><0057><0056><0003><0003><0049>12<0052><0055><0003><0046>5<0052><0051><0056><0044><0051>3<0057><0056>8<0011>]TJ
ET
Q
What are the glyphs in the TJ entry above, and how can each glyph be identified separately?
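For orientation: the operand of TJ is an array whose elements are either strings (here hex strings such as <0047>) or numbers. The strings hold character codes that select glyphs from the current font; the numbers are position adjustments in thousandths of a text space unit. The sketch below splits such an array into its parts. It assumes that /F10 uses 2-byte character codes, which the stream suggests but does not prove, and parse_tj_array is a made-up helper name. Mapping a code to an actual glyph or Unicode character needs the font's encoding or ToUnicode CMap, which is not shown in the question.

```python
import re

# A TJ array element is either a hex string <...> or a number.
TOKEN = re.compile(r"<([0-9A-Fa-f\s]*)>|(-?\d+(?:\.\d+)?)")

def parse_tj_array(array_text, code_bytes=2):
    """Split a TJ array into ("code", int) and ("adjust", float) tokens.

    code_bytes is an assumption: the number of bytes per character code
    depends on the font, and 2 is only a guess for this stream.
    """
    tokens = []
    step = 2 * code_bytes  # hex digits per character code
    for hex_string, number in TOKEN.findall(array_text):
        if number:
            tokens.append(("adjust", float(number)))
            continue
        digits = "".join(hex_string.split())
        for i in range(0, len(digits), step):
            tokens.append(("code", int(digits[i:i + step], 16)))
    return tokens

if __name__ == "__main__":
    for token in parse_tj_array("[<0003><0047><0052>-1<0048>]"):
        print(token)
```

Each ("code", n) token then corresponds to one glyph drawn by the TJ operator, while the ("adjust", x) tokens only move the text position.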

Related

Terrible performance using XGBoost H2O

Very different Model Performance using XGBoost on H2O
I am training an XGBoost model using 5-fold cross validation on a very imbalanced binary classification problem. The dataset has 1200 columns (multi-document word2vec document embeddings).
The only parameters specified to train the XGBoost model were:
min_split_improvement = 1e-5
seed=1
nfolds = 5
The reported performance on train data was extremely high (probably overfitting!!!):
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.2814398407936096:
       A      D    Error   Rate
-----  -----  ---  ------  --------------
A      16858  2    0.0001  (2.0/16860.0)
D      0      414  0       (0.0/414.0)
Total  16858  416  0.0001  (2.0/17274.0)
AUC: 0.9999991404060721
The performance on cross validation data was terrible:
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.016815993119962513:
       A      D    Error   Rate
-----  -----  ---  ------  ----------------
A      16003  857  0.0508  (857.0/16860.0)
D      357    57   0.8623  (357.0/414.0)
Total  16360  914  0.0703  (1214.0/17274.0)
AUC: 0.6015883863129724
I know H2O cross validation generates an extra model using all of the available data, so somewhat different performance is expected.
But what could cause such poor cross-validation performance for the resulting model?
Ps: XGBoost on a multi node H2O cluster with OMP
Model Type: classifier
Performance of model < XGBoost_model_python_1575650180928_617 >:
ModelMetricsBinomial: xgboost
** Reported on train data. **
MSE: 0.0008688085383330077
RMSE: 0.029475558320971762
LogLoss: 0.00836528606162877
Mean Per-Class Error: 5.931198102016033e-05
AUC: 0.9999991404060721
pr_auc: 0.9975495622569983
Gini: 0.9999982808121441
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.2814398407936096:
       A      D    Error   Rate
-----  -----  ---  ------  --------------
A      16858  2    0.0001  (2.0/16860.0)
D      0      414  0       (0.0/414.0)
Total  16858  416  0.0001  (2.0/17274.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- -------- -----
max f1 0.28144 0.99759 195
max f2 0.28144 0.999035 195
max f0point5 0.553885 0.998053 191
max accuracy 0.28144 0.999884 195
max precision 0.990297 1 0
max recall 0.28144 1 195
max specificity 0.990297 1 0
max absolute_mcc 0.28144 0.997534 195
max min_per_class_accuracy 0.28144 0.999881 195
max mean_per_class_accuracy 0.28144 0.999941 195
max tns 0.990297 16860 0
max fns 0.990297 413 0
max fps 0.000111383 16860 399
max tps 0.28144 414 195
max tnr 0.990297 1 0
max fnr 0.990297 0.997585 0
max fpr 0.000111383 1 399
max tpr 0.28144 1 195
Gains/Lift Table: Avg response rate: 2.40 %, avg score: 2.42 %
group cumulative_data_fraction lower_threshold lift cumulative_lift response_rate score cumulative_response_rate cumulative_score capture_rate cumulative_capture_rate gain cumulative_gain
-- ------- -------------------------- ----------------- ------- ----------------- --------------- ----------- -------------------------- ------------------ -------------- ------------------------- ------- -----------------
1 0.0100151 0.873526 41.7246 41.7246 1 0.907782 1 0.907782 0.417874 0.417874 4072.46 4072.46
2 0.0200301 0.776618 41.7246 41.7246 1 0.834968 1 0.871375 0.417874 0.835749 4072.46 4072.46
3 0.0300452 0.0326301 16.4004 33.2832 0.393064 0.303206 0.797688 0.681985 0.164251 1 1540.04 3228.32
4 0.0400023 0.0224876 0 24.9986 0 0.0263919 0.599132 0.518799 0 1 -100 2399.86
5 0.0500174 0.0180858 0 19.9931 0 0.0201498 0.479167 0.418953 0 1 -100 1899.31
6 0.100035 0.0107386 0 9.99653 0 0.0136044 0.239583 0.216279 0 1 -100 899.653
7 0.149994 0.00798337 0 6.66692 0 0.00922284 0.159784 0.147313 0 1 -100 566.692
8 0.200012 0.00629476 0 4.99971 0 0.00709438 0.119826 0.112249 0 1 -100 399.971
9 0.299988 0.00436827 0 3.33346 0 0.00522157 0.0798919 0.0765798 0 1 -100 233.346
10 0.400023 0.00311204 0 2.49986 0 0.00370085 0.0599132 0.0583548 0 1 -100 149.986
11 0.5 0.00227535 0 2 0 0.00267196 0.0479333 0.0472208 0 1 -100 100
12 0.599977 0.00170271 0 1.66673 0 0.00197515 0.039946 0.0396813 0 1 -100 66.6731
13 0.700012 0.00121528 0 1.42855 0 0.00145049 0.0342375 0.034218 0 1 -100 42.8548
14 0.799988 0.000837358 0 1.25002 0 0.00102069 0.0299588 0.0300692 0 1 -100 25.0018
15 0.899965 0.000507632 0 1.11115 0 0.000670878 0.0266306 0.0268033 0 1 -100 11.1154
16 1 3.35288e-05 0 1 0 0.00033002 0.0239667 0.0241551 0 1 -100 0
Cross-validation (xval) performance of model < XGBoost_model_python_1575650180928_617 >:
ModelMetricsBinomial: xgboost
** Reported on cross-validation data. **
MSE: 0.023504756648164406
RMSE: 0.15331261085822134
LogLoss: 0.14134815775808462
Mean Per-Class Error: 0.4160864407653825
AUC: 0.6015883863129724
pr_auc: 0.04991836222189148
Gini: 0.2031767726259448
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.016815993119962513:
       A      D    Error   Rate
-----  -----  ---  ------  ----------------
A      16003  857  0.0508  (857.0/16860.0)
D      357    57   0.8623  (357.0/414.0)
Total  16360  914  0.0703  (1214.0/17274.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- --------- -----
max f1 0.016816 0.0858434 209
max f2 0.00409934 0.138433 318
max f0point5 0.0422254 0.0914205 127
max accuracy 0.905155 0.976323 3
max precision 0.99221 1 0
max recall 9.60076e-05 1 399
max specificity 0.99221 1 0
max absolute_mcc 0.825434 0.109684 5
max min_per_class_accuracy 0.00238436 0.572464 345
max mean_per_class_accuracy 0.00262155 0.583914 341
max tns 0.99221 16860 0
max fns 0.99221 412 0
max fps 9.60076e-05 16860 399
max tps 9.60076e-05 414 399
max tnr 0.99221 1 0
max fnr 0.99221 0.995169 0
max fpr 9.60076e-05 1 399
max tpr 9.60076e-05 1 399
Gains/Lift Table: Avg response rate: 2.40 %, avg score: 0.54 %
group cumulative_data_fraction lower_threshold lift cumulative_lift response_rate score cumulative_response_rate cumulative_score capture_rate cumulative_capture_rate gain cumulative_gain
-- ------- -------------------------- ----------------- -------- ----------------- --------------- ----------- -------------------------- ------------------ -------------- ------------------------- --------- -----------------
1 0.0100151 0.0540408 4.34129 4.34129 0.104046 0.146278 0.104046 0.146278 0.0434783 0.0434783 334.129 334.129
2 0.0200301 0.033963 2.41183 3.37656 0.0578035 0.0424722 0.0809249 0.094375 0.0241546 0.0676329 141.183 237.656
3 0.0300452 0.0251807 2.17065 2.97459 0.0520231 0.0292894 0.0712909 0.0726798 0.0217391 0.089372 117.065 197.459
4 0.0400023 0.02038 2.18327 2.77762 0.0523256 0.0225741 0.0665702 0.0602078 0.0217391 0.111111 118.327 177.762
5 0.0500174 0.0174157 1.92946 2.60779 0.0462428 0.0188102 0.0625 0.0519187 0.0193237 0.130435 92.9463 160.779
6 0.100035 0.0103201 1.59365 2.10072 0.0381944 0.0132217 0.0503472 0.0325702 0.0797101 0.210145 59.3649 110.072
7 0.149994 0.00742152 1.06366 1.7553 0.0254925 0.00867473 0.0420687 0.0246112 0.0531401 0.263285 6.3664 75.5301
8 0.200012 0.00560037 1.11073 1.59411 0.0266204 0.00642966 0.0382055 0.0200645 0.0555556 0.318841 11.0725 59.4111
9 0.299988 0.00366149 1.30465 1.49764 0.0312681 0.00452583 0.0358935 0.0148859 0.130435 0.449275 30.465 49.7642
10 0.400023 0.00259159 1.13487 1.40692 0.0271991 0.00306994 0.0337192 0.0119311 0.113527 0.562802 13.4872 40.6923
11 0.5 0.00189 0.579844 1.24155 0.0138969 0.00220612 0.0297557 0.00998654 0.057971 0.620773 -42.0156 24.1546
12 0.599977 0.00136983 0.990568 1.19972 0.0237406 0.00161888 0.0287534 0.0085922 0.0990338 0.719807 -0.943246 19.9724
13 0.700012 0.000980029 0.676094 1.1249 0.0162037 0.00116698 0.02696 0.0075311 0.0676329 0.78744 -32.3906 12.4895
14 0.799988 0.00067366 0.797286 1.08395 0.0191083 0.000820365 0.0259787 0.00669244 0.0797101 0.86715 -20.2714 8.39529
15 0.899965 0.000409521 0.797286 1.05211 0.0191083 0.000540092 0.0252155 0.00600898 0.0797101 0.94686 -20.2714 5.21072
16 1 2.55768e-05 0.531216 1 0.0127315 0.000264023 0.0239667 0.00543429 0.0531401 1 -46.8784 0
For the non-cross-validation case, try splitting your data up front into training and validation frames.
I expect you will get a worse AUC on the validation frame than on the training data.
For highly imbalanced cases, though, sometimes you just need to go by the error rate for each class.
Since there are so many true negatives, they can dominate the AUC (the vast majority of predictions correctly predict “not interesting”). Some people upsample the minority class in this situation, using row weights, to make the model more sensitive to it.
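As a rough illustration of both suggestions, the sketch below computes the per-class error rates from the cross-validation confusion matrix in the question and derives simple inverse-frequency row weights. The function names are made up for this example, and the weighting scheme (largest class count divided by each class count) is just one common choice, not something H2O prescribes. In H2O such weights would go into a per-row column passed as weights_column, assuming that parameter is used for the training frame.

```python
def per_class_error(confusion):
    """confusion[actual][predicted] -> error rate per actual class."""
    return {
        actual: 1 - row[actual] / sum(row.values())
        for actual, row in confusion.items()
    }

def balancing_weights(class_counts):
    """Inverse-frequency weights: the largest class gets weight 1.0."""
    largest = max(class_counts.values())
    return {label: largest / count for label, count in class_counts.items()}

# Cross-validation confusion matrix from the question (rows are actual classes).
xval = {
    "A": {"A": 16003, "D": 857},
    "D": {"A": 357, "D": 57},
}

if __name__ == "__main__":
    print(per_class_error(xval))
    print(balancing_weights({"A": 16860, "D": 414}))
```

With these weights each class contributes the same total weight, so the 414 minority rows are no longer swamped by the 16860 majority rows.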

Confused about the Text Matrix and Transformation matrix of a pdf parser

I'm developing a PdfParser and I want to print the text content of the PDF on a coordinate plane. Below are the text objects and matrices that are used to render the text. How can I isolate the scaling, rotation and translation, and use them to print the text content at the exact coordinates on a canvas?
//Decoded text stream containing text objects
S
Q
q
0.000 0.750 0.750 -0.000 15.000 301.890 cm
0.000 g
/F10 16.000 Tf
0 Tr
0.000 Tc
BT
1 0 0 -1 20.000 13.600 Tm
[<007a>]TJ
ET
Q
q
0.000 0.750 0.750 -0.000 15.000 301.890 cm
1.000 0.416 0.000 rg
/F10 6.667 Tf
0 Tr
0.000 Tc
BT
1 0 0 -1 136.667 13.600 Tm
[<0024>12<0046><0046><0058><0055>6<0048><0003><0032><0058><0057><0053><0058><0057><0003><0036>-4<0052><004f><0058><0057><004c><0052><0051><0003><0026>3<004f><0052><0058><0047><0003><0048><0051>18<0059><004c><0055>6<0052><0051><0050><0048><0051>3<0057>7<000f><0003><0027><0028><0030><0032><0003><0044><0046><0046><0058><0055>6<0048>]TJ
ET
Q
q
0.000 0.750 0.750 -0.000 15.000 301.890 cm
0.000 g
/F10 16.000 Tf
0 Tr
0.000 Tc
BT
1 0 0 -1 603.333 13.600 Tm
[<007a>]TJ
ET
Q
q
The initial S Q is a leftover of a previous instruction block ending in some path stroking and graphics state restoring. As we don't know anything to the contrary, let's assume that Q restores the initial graphics state, in particular an unmodified current transformation matrix (CTM).
As we are interested in coordinates in the default user space coordinate system, we can accordingly assume that the CTM at this point is the identity matrix.
Let's take a look at the block
q
0.000 0.750 0.750 -0.000 15.000 301.890 cm
0.000 g
/F10 16.000 Tf
0 Tr
0.000 Tc
BT
1 0 0 -1 20.000 13.600 Tm
[<007a>]TJ
ET
Q
As you implied yourself in a comment, the only relevant instructions for the total transformation matrix at the time the text rendering instruction [<007a>]TJ begins to be executed are
0.000 0.750 0.750 -0.000 15.000 301.890 cm
and
1 0 0 -1 20.000 13.600 Tm
setting the current transformation matrix to
| 0      0.75    0 |   | 1  0  0 |   | 0      0.75    0 |
| 0.75   0       0 | * | 0  1  0 | = | 0.75   0       0 |
| 15.00  301.89  1 |   | 0  0  1 |   | 15.00  301.89  1 |
and the text and text line matrices both to
| 1     0     0 |
| 0     -1    0 |
| 20.0  13.6  1 |
Thus, the effects of the text matrix and the current transformation matrix combine to:
| 1     0     0 |   | 0      0.75    0 |   | 0      0.75    0 |
| 0     -1    0 | * | 0.75   0       0 | = | -0.75  0       0 |
| 20.0  13.6  1 |   | 15.00  301.89  1 |   | 25.2   316.89  1 |
You can split up that combined matrix into a scaling, a rotation, and a translation like this:
| 0      0.75    0 |   | 0.75  0     0 |   | 0   1  0 |   | 1     0       0 |
| -0.75  0       0 | = | 0     0.75  0 | * | -1  0  0 | * | 0     1       0 |
| 25.2   316.89  1 |   | 0     0     1 |   | 0   0  1 |   | 25.2  316.89  1 |
We have a scaling by .75, a rotation by 90° counterclockwise, and a translation by (25.2, 316.89).
(Of course this can still be subject to a page rotation...)
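The calculation above can be reproduced with a few lines of code. The following is only a sketch: pdf_matrix, matmul and decompose are hypothetical helper names, and the decomposition assumes the combined matrix contains no skew and no reflection, which holds for this example but not for every PDF.

```python
import math

def pdf_matrix(a, b, c, d, e, f):
    """Build the 3x3 matrix for the six PDF operands a b c d e f."""
    return [[a, b, 0.0], [c, d, 0.0], [e, f, 1.0]]

def matmul(m, n):
    """Plain 3x3 matrix product m * n (row vector convention, as in PDF)."""
    return [[sum(m[i][k] * n[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def decompose(m):
    """Split a matrix into scale, rotation and translation.

    Assumes no skew and no reflection in m.
    """
    a, b = m[0][0], m[0][1]
    c, d = m[1][0], m[1][1]
    return {
        "scale_x": math.hypot(a, b),
        "scale_y": math.hypot(c, d),
        "rotation_deg": math.degrees(math.atan2(b, a)),  # counterclockwise
        "translation": (m[2][0], m[2][1]),
    }

ctm = pdf_matrix(0.000, 0.750, 0.750, -0.000, 15.000, 301.890)  # cm operands
tm = pdf_matrix(1, 0, 0, -1, 20.000, 13.600)                    # Tm operands
combined = matmul(tm, ctm)  # text matrix first, then the CTM

if __name__ == "__main__":
    print(decompose(combined))
```

The order of the product matters: the text matrix is applied first and the CTM second, so the text matrix is the left factor.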

How to calculate multiclass overall accuracy, sensitivity and specificity?

Can anyone explain how to calculate the accuracy, sensitivity and specificity of a multi-class dataset?
Sensitivity of each class can be calculated from its
TP/(TP+FN)
and specificity of each class can be calculated from its
TN/(TN+FP)
For more information about the concepts and equations, see
http://en.wikipedia.org/wiki/Sensitivity_and_specificity
For multi-class classification, you may use a one-against-all approach.
Suppose there are three classes: C1, C2, and C3
"TP of C1" is all C1 instances that are classified as C1.
"TN of C1" is all non-C1 instances that are not classified as C1.
"FP of C1" is all non-C1 instances that are classified as C1.
"FN of C1" is all C1 instances that are not classified as C1.
To find these four terms for C2 or C3, replace C1 with C2 or C3.
In simple terms:
In a 2x2, once you have picked one category as positive, the other is automatically negative. With 9 categories, you basically have 9 different sensitivities, depending on which of the nine categories you pick as "positive". You could calculate these by collapsing to a 2x2, i.e. Class1 versus not-Class1, then Class2 versus not-Class2, and so on.
Example:
We get a confusion matrix for the 7 types of glass:
=== Confusion Matrix ===
a b c d e f g <-- classified as
50 15 3 0 0 1 1 | a = build wind float
16 47 6 0 2 3 2 | b = build wind non-float
5 5 6 0 0 1 0 | c = vehic wind float
0 0 0 0 0 0 0 | d = vehic wind non-float
0 2 0 0 10 0 1 | e = containers
1 1 0 0 0 7 0 | f = tableware
3 2 0 0 0 1 23 | g = headlamps
and a true positive rate (sensitivity) calculated for each type of glass, plus an overall weighted average:
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.714 0.174 0.667 0.714 0.690 0.532 0.806 0.667 build wind float
0.618 0.181 0.653 0.618 0.635 0.443 0.768 0.606 build wind non-float
0.353 0.046 0.400 0.353 0.375 0.325 0.766 0.251 vehic wind float
0.000 0.000 0.000 0.000 0.000 0.000 ? ? vehic wind non-float
0.769 0.010 0.833 0.769 0.800 0.788 0.872 0.575 containers
0.778 0.029 0.538 0.778 0.636 0.629 0.930 0.527 tableware
0.793 0.022 0.852 0.793 0.821 0.795 0.869 0.738 headlamps
0.668 0.130 0.670 0.668 0.668 0.539 0.807 0.611 Weighted Avg.
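The per-class figures in that report can be recomputed directly from the confusion matrix with the one-against-all definitions of TP, TN, FP and FN given earlier. The sketch below does this in plain Python; the function names are made up for the example, and a class with no instances (class d here) is reported as 0.0 to mirror the table, although its sensitivity is strictly speaking undefined.

```python
# Confusion matrix from the glass example: rows are actual, columns predicted.
MATRIX = [
    [50, 15, 3, 0, 0, 1, 1],   # a = build wind float
    [16, 47, 6, 0, 2, 3, 2],   # b = build wind non-float
    [5, 5, 6, 0, 0, 1, 0],     # c = vehic wind float
    [0, 0, 0, 0, 0, 0, 0],     # d = vehic wind non-float
    [0, 2, 0, 0, 10, 0, 1],    # e = containers
    [1, 1, 0, 0, 0, 7, 0],     # f = tableware
    [3, 2, 0, 0, 0, 1, 23],    # g = headlamps
]

def one_vs_rest(matrix, k):
    """Collapse the matrix to TP, FN, FP, TN for class k versus the rest."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp
    fp = sum(row[k] for row in matrix) - tp
    tn = total - tp - fn - fp
    return tp, fn, fp, tn

def ratio(numerator, denominator):
    # Assumption: report 0.0 when the denominator is zero, as the table does.
    return numerator / denominator if denominator else 0.0

def sensitivity(matrix, k):
    tp, fn, _, _ = one_vs_rest(matrix, k)
    return ratio(tp, tp + fn)

def specificity(matrix, k):
    _, _, fp, tn = one_vs_rest(matrix, k)
    return ratio(tn, tn + fp)

def overall_accuracy(matrix):
    total = sum(sum(row) for row in matrix)
    return ratio(sum(matrix[i][i] for i in range(len(matrix))), total)

if __name__ == "__main__":
    for k, name in enumerate("abcdefg"):
        print(name, round(sensitivity(MATRIX, k), 3), round(specificity(MATRIX, k), 3))
    print("accuracy", round(overall_accuracy(MATRIX), 3))
```

Note that the FP Rate column of the report is 1 minus the specificity, and that the overall accuracy equals the weighted average of the TP rates.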
You can print a classification report (see the link below), which also gives you the overall accuracy of your model.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report
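Compute sensitivity and specificity for multi-class classification:
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

res = []
for l in [0, 1, 2, 3]:
    # One-vs-rest: class l is the positive class (True), all others are negative (False).
    prec, recall, _, _ = precision_recall_fscore_support(np.array(y_true) == l,
                                                         np.array(y_prediction) == l,
                                                         pos_label=True, average=None)
    # Labels are sorted as [False, True]: recall[1] is the recall of the positive
    # class (sensitivity), recall[0] is the recall of the negative class (specificity).
    res.append([l, recall[1], recall[0]])
pd.DataFrame(res, columns=['class', 'sensitivity', 'specificity'])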

How to parse ASCII PGM files in data structures with Haskell?

I am completely new to functional programming and Haskell, and I need to parse an ASCII PGM image into a data structure, but I can't figure out how to do that.
I have looked at quite a few examples (including the Graphics.Pgm module) but still don't know how to write it in Haskell. Here is what I have so far (this code does not compile):
import System.IO
import Control.Monad
import Control.Applicative
import Data.Attoparsec.Char8
import qualified Data.ByteString as B
data ASCIIGreymap = ASCIIGreymap {
      aGreyType    :: String
    , aGreyComment :: String
    , aGreyWidth   :: Int
    , aGreyHeight  :: Int
    , aGreyMax     :: Int
    , aGreyData    :: [Int]
    } deriving (Eq)

instance Show ASCIIGreymap where
    show ( ASCIIGreymap t c w h m _ ) = "ASCIIGreymap Type: "++show t ++ "Comment: " ++ show c ++ " w: " ++ show w ++ " h: " ++ show h ++ " max: " ++ show m

parseASCIIGreymap :: Parser ASCIIGreymap
parseASCIIGreymap = do
    pgmType <- string
    pgmComment <- string
    pgmWidth <- integer
    char ' '
    pgmHeight <- integer
    pgmMax <- integer
    pgmGreyData <- [integer]
    return $ ASCIIGreymap pgmType pgmComment pgmWidth pgmHeight pgmMax pgmGreyData
pgmFile :: FilePath
pgmFile = "test_ascii.pgm"
main = B.readFile logFile >>= print . parseOnly parseASCIIGreymap
An example file (test_ascii.pgm) looks like this:
P2
# CREATOR: GIMP PNM Filter Version 1.1
10 10
255
0
0
0
0
0
64
255
255
255
179
0
0
0
0
0
159
255
255
255
243
0
0
0
0
96
223
255
255
255
255
0
0
64
96
223
255
255
255
255
255
128
128
191
223
255
255
255
255
255
255
255
255
255
255
255
249
217
179
128
128
255
255
255
255
249
198
89
51
0
0
255
255
255
249
198
77
0
0
0
0
255
255
255
236
128
0
0
0
0
0
191
255
255
218
51
0
0
0
0
0
The first line holds the "magicNumber", where P2 stands for 8bit grey image
The second line is a comment
The third line has the width and height of the image separated with a space
In the fourth line is the max grey value
From here on to the end of the file are the grey values for each pixel
I would like to parse this PGM file into the data structure (ASCIIGreymap) to compare two images later. But like I said, I don't know how to get there. If my approach is wrong or if there are better ways of parsing a PGM image, please let me know.
Any help is much appreciated!
Edit: Since I haven't made any progress with parsing a PGM file, I am not so sure anymore whether my approach is correct.
Can someone please comment on my general idea of taking the content of the file and putting it in a data structure to work with the data further? Or is there a better way?
Thanks again!
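On the general idea: reading the file into one record with the header fields and a flat list of pixel values is a reasonable structure for comparing images later. The sketch below shows that structure in Python rather than Haskell, purely to illustrate the steps a parser has to perform (strip comments, read the magic number, then width, height, max value and the pixels). The field and function names are made up for this example, and it assumes comments start with # and run to the end of the line.

```python
from dataclasses import dataclass

@dataclass
class ASCIIGreymap:
    magic: str
    comments: list
    width: int
    height: int
    max_value: int
    pixels: list

def parse_ascii_pgm(text):
    """Parse the text of an ASCII PGM (P2) file into an ASCIIGreymap."""
    comments, tokens = [], []
    for line in text.splitlines():
        # Assumption: everything after '#' on a line is a comment.
        content, hash_mark, comment = line.partition("#")
        if hash_mark:
            comments.append(comment.strip())
        tokens.extend(content.split())
    if not tokens or tokens[0] != "P2":
        raise ValueError("not an ASCII PGM (P2) file")
    width, height, max_value, *pixels = (int(t) for t in tokens[1:])
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match width * height")
    return ASCIIGreymap("P2", comments, width, height, max_value, pixels)

if __name__ == "__main__":
    sample = "P2\n# tiny example\n2 2\n255\n0\n64\n128\n255\n"
    print(parse_ascii_pgm(sample))
```

Because the values are read as whitespace-separated tokens, the same code handles files that put one grey value per line and files that put a whole row on one line.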

Parsing Data in C

I am trying to parse some data using C.
The data is of the form:
REMARK 280 100 MM MES PH 6.5, 5 % GLYCEROL
REMARK 290
REMARK 290 CRYSTALLOGRAPHIC SYMMETRY
REMARK 290 SYMMETRY OPERATORS FOR SPACE GROUP: P 1 21 1
REMARK 290
REMARK 290 SYMOP SYMMETRY
REMARK 290 NNNMMM OPERATOR
REMARK 290 1555 X,Y,Z
REMARK 290 2555 -X,Y+1/2,-Z
I want to extract the "Symmetry Operator" data, X,Y,Z and -X,Y+1/2,-Z, and turn each set of symmetry operators into two matrices of the form:
[ 1  0  0 ]  [ 0 ]         [ -1  0   0 ]  [ 0   ]
[ 0  1  0 ]  [ 0 ]   and   [  0  1   0 ]  [ 1/2 ]
[ 0  0  1 ]  [ 0 ]         [  0  0  -1 ]  [ 0   ]
for X,Y,Z and -X,Y+1/2,-Z respectively.
I have not done much data parsing and would appreciate any help anyone could offer.
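The two steps involved are picking out the operator strings from the REMARK 290 lines and turning each comma-separated component into one row of the 3x3 matrix plus one entry of the translation vector. The sketch below shows that logic in Python rather than C, only to make the steps concrete; the same structure carries over to C with sscanf or manual scanning. The function names are made up, and it assumes operator lines have the shape "REMARK 290 <number> <operator>" as in the sample.

```python
import re
from fractions import Fraction

AXES = "XYZ"
# Assumed line shape: "REMARK 290", a numeric code such as 1555, then the operator.
SYMOP_LINE = re.compile(r"^REMARK 290\s+\d+\s+(\S+)\s*$")

def extract_operators(text):
    """Return the operator strings, e.g. ['X,Y,Z', '-X,Y+1/2,-Z']."""
    operators = []
    for line in text.splitlines():
        match = SYMOP_LINE.match(line)
        if match and "," in match.group(1):
            operators.append(match.group(1))
    return operators

def parse_operator(operator):
    """Turn 'X,Y+1/2,-Z' into (3x3 matrix, translation vector)."""
    rotation = [[Fraction(0)] * 3 for _ in range(3)]
    translation = [Fraction(0)] * 3
    for row, component in enumerate(operator.upper().split(",")):
        # Split a component such as 'Y+1/2' into signed terms: 'Y', '+1/2'.
        for term in re.findall(r"[+-]?[^+-]+", component.strip()):
            sign = -1 if term.startswith("-") else 1
            body = term.lstrip("+-")
            if body[-1] in AXES:
                coefficient = Fraction(body[:-1] or "1")
                rotation[row][AXES.index(body[-1])] += sign * coefficient
            else:
                translation[row] += sign * Fraction(body)
    return rotation, translation

if __name__ == "__main__":
    sample = "REMARK 290 1555 X,Y,Z\nREMARK 290 2555 -X,Y+1/2,-Z\n"
    for op in extract_operators(sample):
        print(op, parse_operator(op))
```

Fractions are used so that 1/2 stays exact; in C the translation could be stored as a double instead.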
