Evaluation metric of DBNet 2019 and DBNet++ 2022 - machine-learning

What is the evaluation metric of DBNet (2019) and DBNet++ (2022)?
https://paperswithcode.com/paper/real-time-scene-text-detection-with (Real-time Scene Text Detection with Differentiable Binarization)
https://paperswithcode.com/paper/real-time-scene-text-detection-with-1 (Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion)
I just need a name so I can search for more detail.
Its evaluation metric is TedEval, isn't it?
I have tried a lot of searching and reading but cannot find an answer.
I just need the name of the evaluation metric :(

Related

Should I normalize my features before throwing them into RNN?

I am playing with some demos of recurrent neural networks.
I noticed that the scale of my data differs a lot from column to column, so I am considering doing some preprocessing before I feed data batches into my RNN. The close column is the target I want to predict in the future.
open high low volume price_change p_change ma5 ma10 \
0 20.64 20.64 20.37 163623.62 -0.08 -0.39 20.772 20.721
1 20.92 20.92 20.60 218505.95 -0.30 -1.43 20.780 20.718
2 21.00 21.15 20.72 269101.41 -0.08 -0.38 20.812 20.755
3 20.70 21.57 20.70 645855.38 0.32 1.55 20.782 20.788
4 20.60 20.70 20.20 458860.16 0.10 0.48 20.694 20.806
ma20 v_ma5 v_ma10 v_ma20 close
0 20.954 351189.30 388345.91 394078.37 20.56
1 20.990 373384.46 403747.59 411728.38 20.64
2 21.022 392464.55 405000.55 426124.42 20.94
3 21.054 445386.85 403945.59 473166.37 21.02
4 21.038 486615.13 378825.52 461835.35 20.70
My question is, is preprocessing the data with, say StandardScaler in sklearn necessary in my case? And why?
(You are welcome to edit my question)
It will be beneficial to normalize your training data. Feeding features with widely different scales to your model causes the network to weight the features unequally, which can falsely prioritise some features over others in the learned representation.
Although the details of data preprocessing are debated, both regarding when exactly it is necessary and how to normalize correctly for a given model and application domain, there is a general consensus in machine learning that a mean-subtraction step followed by a general normalization step is helpful.
In mean subtraction, the mean of every individual feature is subtracted from the data, which can be interpreted geometrically as centering the data around the origin. This holds in every dimension.
Normalizing the data after the mean-subtraction step brings every data dimension to approximately the same scale. Note that the different features lose any prioritization over each other after this step, as mentioned above. If you have good reason to believe that the different scales in your features carry information the network needs to truly understand the underlying patterns in your dataset, then normalization may be harmful. A standard approach is to scale the inputs to have mean 0 and variance 1.
Further preprocessing operations may be helpful in specific cases, such as performing PCA or whitening on your data. Look into the awesome CS231n notes (Setting up the data and the model) for further reference on these topics, as well as for a more detailed explanation of the points above.
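As a minimal sketch of the zero-mean, unit-variance scaling suggested above (assuming pandas and scikit-learn; the file name and column handling are illustrative, based on the sample printout in the question):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("stock.csv")                    # hypothetical file with the columns shown above
feature_cols = [c for c in df.columns if c != "close"]

scaler = StandardScaler()
X = scaler.fit_transform(df[feature_cols])       # fit the scaler on the training split only
y = df["close"].values

# At inference time, reuse the same fitted scaler so new data lands on the same scale:
# X_new = scaler.transform(new_df[feature_cols])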
Definitely yes. Most neural networks work best with data between 0 and 1 or -1 and 1 (depending on the output function). Also, when some inputs are larger than others, the network will "think" they are more important. This can make learning very slow, because the network must first lower the weights on those inputs.
I found this: https://arxiv.org/abs/1510.01378
If you normalize, it may improve convergence, so you will get shorter training times.

How to compute DIR@FAR1% for face identification?

Recently, in some papers, face recognition approaches are evaluated through a newly proposed protocol, named closed-set and open-set face identification, over the LFW dataset. For the open-set one, the Rank-1 accuracy is reported as the Detection and Identification Rate (DIR) at a fixed False Alarm/Acceptance Rate (FAR). I have a gallery and a probe set and am using KNN for classification; however, I don't know how to compute the DIR@FAR1%.
Update:
Specifically, what is ambiguous to me is fixing the FAR at a given threshold, and how the curves such as ROC, precision-recall, etc. are plotted for face recognition. What does the threshold in the following paragraph mean?
Hence the performance is evaluated based on (i) Rank-1 detection and identification rate (DIR), which is the fraction of genuine probes matched correctly at Rank-1, and not rejected at a given threshold, and (ii) the false alarm rate (FAR) of the rejection step (i.e. the fraction of impostor probe images which are not rejected). We report the DIR vs. FAR curve describing the trade-off between true Rank-1 identifications and false alarms.
The reference paper is downloadable here.
Any help would be welcome.
I guess the DIR metric was established by the biometrics community. This metric includes both detection (exceeding some threshold) and identification (rank). Let the gallery consist of a set of users enrolled in a biometric database, while the probe set may contain users who may or may not be present in the database. Let g and p be two elements of the gallery and probe sets respectively. Moreover, let the probe set include two disjoint subsets: P1, containing the samples of subjects who belong to the gallery, and P0, containing those who do not.
Assume s(p,g) is a similarity score between a probe element and a gallery element, t is a threshold and k is the identification rank. Then DIR is given by:
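As a hedged reconstruction of the standard open-set formulas from the biometrics literature (the notation here is an assumption, not copied from the cited report): let g_p be the enrolled gallery mate of a genuine probe p, and rank(p) the position of g_p when all gallery scores for p are sorted in decreasing order. Then

DIR(t, k) = |{ p ∈ P1 : rank(p) ≤ k and s(p, g_p) ≥ t }| / |P1|
FAR(t) = |{ p ∈ P0 : max_g s(p, g) ≥ t }| / |P0|

DIR@FAR=1% is then DIR(t, 1) evaluated at the threshold t for which FAR(t) = 0.01.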
You can find the complete formula in this reference:
Poh, N., et al. "Description of Metrics For the Evaluation of Biometric Performance." Seventh Framework Programme of Biometrics Evaluation and Testing (2012): 1-22.
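If it helps, here is a rough Python sketch of how DIR@FAR=1% could be computed from nearest-neighbour similarity scores (the function, variable names and quantile-based threshold choice are my own illustration, not taken from the reference):

import numpy as np

def dir_at_far(genuine_scores, genuine_correct, impostor_scores, far_target=0.01):
    # genuine_scores: best (Rank-1) gallery score for each probe in P1
    # genuine_correct: whether that Rank-1 match is the right identity
    # impostor_scores: best gallery score for each probe in P0
    # Pick the threshold so that roughly far_target of the impostor probes exceed it.
    t = np.quantile(impostor_scores, 1.0 - far_target)
    # DIR: genuine probes identified correctly at Rank-1 AND accepted at threshold t.
    accepted = (np.asarray(genuine_scores) >= t) & np.asarray(genuine_correct, dtype=bool)
    return accepted.mean(), t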

Restricted Boltzmann Machine for real-valued data - gaussian linear units (glu) -

I want my Restricted Boltzmann Machine to learn a new representation of real-valued data (see: Hinton - 2010 - A Practical Guide to Training RBMs). I'm struggling with an implementation of Gaussian linear units.
With Gaussian linear units in the visible layer, the energy changes to E(v,h) = ∑_i (v_i − a_i)²/(2σ_i²) − ∑_j b_j h_j − ∑_{i,j} (v_i/σ_i) h_j w_ij. Now I don't know how to change the contrastive divergence learning algorithm. The visible units are no longer sampled through a sigmoid, as they are linear; I use the expectation (mean-field activation) v_i = a_i + ∑_j h_j w_ij + N(0,1) as their state. The associations are left unchanged (pos: data·p(h=1|v)', neg: p(v=1|h)·p(h=1|v)'). But this only leads to random noise when I try to reconstruct the data, and the error rate stops improving around 50%.
Finally, I want to use Gaussian linear units in both layers. How do I get the states of the hidden units then? I would suggest using the mean-field activation h_j = b_j + ∑_i v_i w_ij + N(0,1), but I'm not sure.
You could take a look at the Gaussian RBM that Hinton himself has provided.
Please find it here:
http://www.cs.toronto.edu/~hinton/code/rbmhidlinear.m
I have been working on a similar project, implementing an RBM with C++ and a MATLAB mex function.
I found from Professor Hinton's implementation (in MATLAB) that the binary activation for the visible units uses the sigmoid function,
i.e.: vs = sigmoid(bsxfun(@plus, hs*obj.W2', obj.b));
But when implementing an RBM with Gaussian visible units, you simply have to sample the visible units without using the sigmoid,
i.e.: vs = bsxfun(@plus, h0*obj.W2', obj.b);
or better, look at Professor Hinton's implementation (see: http://www.cs.toronto.edu/~hinton/code/rbmhidlinear.m)
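For illustration, here is a minimal sketch of one CD-1 step for an RBM with Gaussian (linear) visible units and binary hidden units, assuming σ = 1 for every visible unit (Hinton's practical guide recommends normalizing the data so this holds); the function and variable names are illustrative, not from Hinton's code:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, lr=1e-3, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    # Positive phase: binary hidden units driven by the data.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Reconstruction: visible units are linear, so no sigmoid here; add unit
    # Gaussian noise if you want a true sample rather than the mean-field value.
    v1 = h0 @ W.T + a          # + rng.standard_normal(v0.shape)
    # Negative phase.
    ph1 = sigmoid(v1 @ W + b)
    # Parameter updates use the same associations as in the binary-binary case.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b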

Determine skeleton joints with a webcam (not Kinect)

I'm trying to determine skeleton joints (or at the very least to be able to track a single palm) using a regular webcam. I've looked all over the web and can't seem to find a way to do so.
Every example I've found is using Kinect. I want to use a single webcam.
There's no need for me to calculate the depth of the joints - I just need to be able to recognize their X, Y position in the frame. Which is why I'm using a webcam, not a Kinect.
So far I've looked at:
OpenCV (the "skeleton" functionality in it is a process of simplifying graphical models, but it's not a detection and/or skeletonization of a human body).
OpenNI (with NiTE) - the only way to get the joints is to use the Kinect device, so this doesn't work with a webcam.
I'm looking for a C/C++ library (but at this point would look at any other language), preferably open source (but, again, will consider any license) that can do the following:
Given an image (a frame from a webcam) calculate the X, Y positions of the visible joints
[Optional] Given a video capture stream call back into my code with events for joints' positions
Doesn't have to be super accurate, but would prefer it to be very fast (sub-0.1 sec processing time per frame)
Would really appreciate it if someone can help me out with this. I've been stuck on this for a few days now with no clear path to proceed.
UPDATE
2 years later a solution was found: http://dlib.net/imaging.html#shape_predictor
Tracking a hand using a single camera without depth information is a serious task and the topic of ongoing scientific work. I can supply you with a bunch of interesting and/or highly cited scientific papers on the topic:
M. de La Gorce, D. J. Fleet, and N. Paragios, “Model-Based 3D Hand Pose Estimation from Monocular Video.,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, Feb. 2011.
R. Wang and J. Popović, “Real-time hand-tracking with a color glove,” ACM Transactions on Graphics (TOG), 2009.
B. Stenger, A. Thayananthan, P. H. S. Torr, and R. Cipolla, “Model-based hand tracking using a hierarchical Bayesian filter.,” IEEE transactions on pattern analysis and machine intelligence, vol. 28, no. 9, pp. 1372–84, Sep. 2006.
J. M. Rehg and T. Kanade, “Model-based tracking of self-occluding articulated objects,” in Proceedings of IEEE International Conference on Computer Vision, 1995, pp. 612–617.
Hand tracking literature survey in the 2nd chapter:
T. de Campos, “3D Visual Tracking of Articulated Objects and Hands,” 2006.
Unfortunately, I don't know of any freely available hand tracking library.
there is a simple way to detect a hand using skin tone. perhaps this could help... you can see the results in this youtube video. caveat: the background shouldn't contain skin-colored things like wood.
here is the code:
''' Detect human skin tone and draw a boundary around it.
Useful for gesture recognition and motion tracking.
Inspired by: http://stackoverflow.com/a/14756351/1463143
Date: 08 June 2013
'''
# Required modules
import cv2
import numpy
# Constants for finding range of skin color in YCrCb
min_YCrCb = numpy.array([0,133,77],numpy.uint8)
max_YCrCb = numpy.array([255,173,127],numpy.uint8)
# Create a window to display the camera feed
cv2.namedWindow('Camera Output')
# Get pointer to video frames from primary device
videoFrame = cv2.VideoCapture(0)
# Process the video frames
keyPressed = -1 # -1 indicates no key pressed
while(keyPressed < 0): # any key pressed has a value >= 0
    # Grab video frame, decode it and return next video frame
    readSuccess, sourceImage = videoFrame.read()
    # Convert image to YCrCb
    imageYCrCb = cv2.cvtColor(sourceImage,cv2.COLOR_BGR2YCR_CB)
    # Find region with skin tone in YCrCb image
    skinRegion = cv2.inRange(imageYCrCb,min_YCrCb,max_YCrCb)
    # Do contour detection on skin region
    contours, hierarchy = cv2.findContours(skinRegion, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Draw the contour on the source image
    for i, c in enumerate(contours):
        area = cv2.contourArea(c)
        if area > 1000:
            cv2.drawContours(sourceImage, contours, i, (0, 255, 0), 3)
    # Display the source image
    cv2.imshow('Camera Output',sourceImage)
    # Check for user input to close program
    keyPressed = cv2.waitKey(1) # wait 1 millisecond in each iteration of the while loop
# Close window and camera after exiting the while loop
cv2.destroyWindow('Camera Output')
videoFrame.release()
cv2.findContours is quite useful; you can find the centroid of a "blob" by using cv2.moments after you find the contours. have a look at the opencv documentation on shape descriptors.
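as a rough sketch of that centroid idea (reusing the `contours` and `sourceImage` variables from the script above; the 1000-pixel area filter is not repeated here):

if contours:
    # Take the largest skin-colored blob and compute its centroid from image moments.
    largest = max(contours, key=cv2.contourArea)
    m = cv2.moments(largest)
    if m["m00"] != 0:
        cx = int(m["m10"] / m["m00"])
        cy = int(m["m01"] / m["m00"])
        cv2.circle(sourceImage, (cx, cy), 5, (0, 0, 255), -1)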
i haven't yet figured out how to extract the skeleton that lies in the middle of the contour, but i was thinking of "eroding" the contours till they become a single line. in image processing this is called "skeletonization" or finding the "morphological skeleton". here is some basic info on skeletonization.
here is a link that implements skeletonization in opencv and c++
here is a link for skeletonization in opencv and python
hope that helps :)
--- EDIT ----
i would highly recommend that you go through these papers by Deva Ramanan (scroll down after visiting the linked page): http://www.ics.uci.edu/~dramanan/
C. Desai, D. Ramanan. "Detecting Actions, Poses, and Objects with Relational Phraselets." European Conference on Computer Vision (ECCV), Florence, Italy, Oct. 2012.
D. Park, D. Ramanan. "N-Best Maximal Decoders for Part Models." International Conference on Computer Vision (ICCV), Barcelona, Spain, November 2011.
D. Ramanan. "Learning to Parse Images of Articulated Objects." Neural Info. Proc. Systems (NIPS), Vancouver, Canada, Dec 2006.
The most common approach can be seen in the following youtube video. http://www.youtube.com/watch?v=xML2S6bvMwI
This method is not very robust, as it tends to fail if the hand is rotated too much (e.g. if the camera is looking at the side of the hand or at a partially bent hand).
If you do not mind using two cameras, you can look into the work of Robert Wang. His current company (3GearSystems) uses this technology, augmented with a Kinect, to provide tracking. His original paper uses two webcams but has much worse tracking.
Wang, Robert, Sylvain Paris, and Jovan Popović. "6d hands: markerless hand-tracking for computer aided design." Proceedings of the 24th annual ACM symposium on User interface software and technology. ACM, 2011.
Another option (again, if using "more" than a single webcam is possible) is to use an IR emitter. Your hand reflects IR light quite well, whereas the background does not. By adding a filter to the webcam that filters out normal light (and removing the standard filter that does the opposite) you can create quite effective hand tracking. The advantage of this method is that segmenting the hand from the background is much simpler. Depending on the distance and the quality of the camera, you would need more IR LEDs in order to reflect sufficient light back into the webcam. The Leap Motion uses this technology to track fingers & palms (it uses 2 IR cameras and 3 IR LEDs to also get depth information).
All that being said, I think the Kinect is your best option here. Yes, you don't need the depth, but the depth information does make it a lot easier to detect the hand (using the depth information for the segmentation).
My suggestion, given your constraints, would be to use something like this:
http://docs.opencv.org/doc/tutorials/objdetect/cascade_classifier/cascade_classifier.html
Here is a tutorial for using it for face detection:
http://opencv.willowgarage.com/wiki/FaceDetection?highlight=%28facial%29|%28recognition%29
The problem you have described is quite difficult, and I'm not sure that trying to do it using only a webcam is a reasonable plan, but this is probably your best bet. As explained here (http://docs.opencv.org/modules/objdetect/doc/cascade_classification.html?highlight=load#cascadeclassifier-load), you will need to train the classifier with something like this:
http://docs.opencv.org/doc/user_guide/ug_traincascade.html
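As a hedged sketch of how such a cascade would be used once trained (following the pattern of the face-detection tutorial above; "hand_cascade.xml" is a hypothetical model you would have to train yourself, since OpenCV does not ship a hand cascade):

import cv2

cascade = cv2.CascadeClassifier("hand_cascade.xml")   # hypothetical trained cascade
frame = cv2.imread("frame.png")                       # or a frame grabbed from the webcam
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Returns a list of (x, y, w, h) boxes for detected objects.
hands = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in hands:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)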
Remember: Even though you don't require the depth information for your use, having this information makes it easier for the library to identify a hand.
At last I've found a solution. It turns out the dlib open-source project has a "shape predictor" that, once properly trained, does exactly what I need: it guesstimates (with pretty satisfactory accuracy) the "pose". A "pose" is loosely defined as "whatever you train it to recognize as a pose" by training it with a set of images annotated with the shapes to extract from them.
The shape predictor is described here on dlib's website.
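As a minimal sketch of running such a trained predictor on a webcam frame (the model file name is hypothetical; dlib ships a pre-trained predictor only for face landmarks, so a hand/joint model has to be trained from your own annotated images):

import cv2
import dlib

predictor = dlib.shape_predictor("my_joint_predictor.dat")   # hypothetical trained model
frame = cv2.imread("frame.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# The predictor expects a bounding box for the object; here simply the whole frame.
rect = dlib.rectangle(0, 0, gray.shape[1], gray.shape[0])
shape = predictor(gray, rect)
points = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]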
I don't know about possible existing solutions. If supervised (or semi-supervised) learning is an option, training decision trees or neural networks might already be enough (Kinect uses random forests, from what I have heard). Before you go down such a path, do everything you can to find an existing solution. Getting machine learning right takes a lot of time and experimentation.
OpenCV has machine learning components; what you would need is training data.
With the motion tracking features of the open-source Blender project it is possible to create a 3D model based on 2D footage. No Kinect needed. Since Blender is open source, you might be able to use their Python scripts outside the Blender framework for your own purposes.
Have you ever heard of EyesWeb?
I have been using it for one of my projects and I thought it might be useful for what you want to achieve.
Here are some interesting publications: LNAI 3881 - Finger Tracking Methods Using EyesWeb, and Powerpointing-HCI using gestures.
Basically the workflow is:
You create your patch in EyesWeb
Prepare the data you want to send with a network client
Use the processed data on your own server (your app)
However, I don't know if there is a way to embed the real-time image processing part of EyesWeb into a piece of software as a library.

How do I cluster with KL-divergence?

I want to cluster my data with KL-divergence as my metric.
In K-means:
Choose the number of clusters.
Initialize each cluster's mean at random.
Assign each data point to a cluster c with minimal distance value.
Update each cluster's mean to that of the data points assigned to it.
In the Euclidean case it's easy to update the mean, just by averaging each vector.
However, if I'd like to use KL-divergence as my metric, how do I update my mean?
Clustering with KL-divergence may not be the best idea, because KLD is missing an important property of metrics: symmetry. Obtained clusters could then be quite hard to interpret. If you want to go ahead with KLD, you could use as distance the average of KLD's i.e.
d(x,y) = KLD(x,y)/2 + KLD(y,x)/2
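A rough sketch of that symmetrized distance (assuming SciPy; scipy.stats.entropy(p, q) computes KL(p || q) for normalized distributions, and the small epsilon is an assumption added to guard against zero bins):

import numpy as np
from scipy.stats import entropy

def sym_kl(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float) + eps   # smoothing to avoid log(0) / division by zero
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * entropy(p, q) + 0.5 * entropy(q, p)

print(sym_kl([0.1, 0.4, 0.5], [0.3, 0.3, 0.4]))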
It is not a good idea to use KLD, for two reasons:
It is not symmetric: KLD(x,y) ≠ KLD(y,x).
You need to be careful when using KLD in programming: the division may produce Inf values and NaN as a result.
Adding a small number may affect the accuracy.
Well, it might not be a good idea to use KL in the "k-means framework". As was said, it is not symmetric and K-means is intended to work in Euclidean space.
However, you can try using NMF (non-negative matrix factorization). In fact, in the book Data Clustering (edited by Aggarwal and Reddy) you can find a proof that NMF (in a clustering task) works like k-means, only with a non-negativity constraint. The fun part is that NMF may use a bunch of different distances and divergences. If you program in Python: scikit-learn 0.19 implements the beta divergence, where beta is a degree of freedom. Depending on the value of beta, the divergence behaves differently; for beta equal to 1 it corresponds to the (generalized) KL divergence.
This is actually widely used in the topic modeling context, where people try to cluster documents/words over topics (or themes). By using KL, the results can be interpreted probabilistically in terms of how the document-topic and topic-word distributions are related.
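A minimal sketch of that idea in scikit-learn (the data and parameter values here are illustrative; beta_loss="kullback-leibler" requires the multiplicative-update solver):

import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.rand(100, 20))        # toy non-negative data (samples x features)

model = NMF(n_components=5,
            beta_loss="kullback-leibler",  # beta = 1
            solver="mu",                   # multiplicative updates, required for KL
            max_iter=500)
W = model.fit_transform(X)                 # samples x components
H = model.components_                      # components x features

labels = W.argmax(axis=1)                  # hard cluster assignment per sample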
You can find more information:
FÉVOTTE, C., IDIER, J. "Algorithms for Nonnegative Matrix Factorization with the β-Divergence", Neural Computation, v. 23, n. 9, pp. 2421–2456, 2011. ISSN: 0899-7667. doi: 10.1162/NECO_a_00168.
LUO, M., NIE, F., CHANG, X., et al. "Probabilistic Non-Negative Matrix Factorization and Its Robust Extensions for Topic Modeling." In: AAAI, pp. 2308–2314, 2017.
KUANG, D., CHOO, J., PARK, H. "Nonnegative matrix factorization for interactive topic modeling and document clustering". In: Partitional Clustering Algorithms, Springer, pp. 215–243, 2015.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
K-means is intended to work with Euclidean distance: if you want to use non-Euclidean similarities in clustering, you should use a different method. The most principled way to cluster with an arbitrary similarity metric is spectral clustering, and K-means can be derived as a variant of this where the similarities are the Euclidean distances.
And as @mitchus says, KL divergence is not a metric. You may want the Jensen-Shannon divergence, or its square root, the Jensen-Shannon distance, which is symmetric.
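For reference, a tiny sketch showing that the Jensen-Shannon distance mentioned above is symmetric (it is available in SciPy >= 1.2 as scipy.spatial.distance.jensenshannon):

from scipy.spatial.distance import jensenshannon

p, q = [0.1, 0.4, 0.5], [0.3, 0.3, 0.4]
print(jensenshannon(p, q), jensenshannon(q, p))   # the two values are equal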

Resources