In Abaqus, what is meant by "Second-order accuracy" using Linear Explicit Hexahedral elements? - abaqus

When using a Linear Explicit Hexahedral 3D stress element in Abaqus, what is meant by the element controls option: second-order accurate formulation?
I understand that reduced integration changes the number of Gauss points from eight to one for this 3D element, but I'm unclear what the second-order accurate formulation means.
Does it refer specifically to numerical round-off, i.e. the number of decimal places used in the calculation (0.42356789 vs. 0.423)? Or does it refer to the Gauss points themselves, i.e. the precision of the points and weights (x=0.0000, w=2.0000 vs. x=0.0, w=2.0) used in the quadrature integration?


What “information” in document vectors makes sentiment prediction work?

Sentiment prediction based on document vectors works pretty well, as examples show:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
http://linanqiu.github.io/2015/10/07/word2vec-sentiment/
I wonder what pattern in the vectors makes that possible. I thought it must somehow be the similarity of the vectors. Gensim similarity measures rely on cosine similarity. Therefore, I tried the following:
I randomly initialised a fixed “compare” vector, computed its cosine similarity with every vector in the training and test sets, used the similarities and the training labels to fit a logistic regression model, and evaluated the model on the test set.
It looks like this, where train_arrays/test_arrays contain the document vectors and train_labels/test_labels contain the labels, either 0 or 1. (Note that the document vectors are obtained from gensim doc2vec and are well trained: they predict the test set 80% correctly when used directly as input for the logistic regression.)
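import numpy
import scipy.spatial.distance
from sklearn.linear_model import LogisticRegression
# Fixed random "compare" vector; 1-D, because scipy's cosine() expects 1-D inputs
fix_vec = numpy.random.rand(100)
def cos_distance_to_fix(x):
    # Note: this is the cosine distance, i.e. 1 - cosine similarity
    return scipy.spatial.distance.cosine(fix_vec, x)
# One distance per document, reshaped to a single-feature column
train_arrays_cos = numpy.apply_along_axis(cos_distance_to_fix, axis=1, arr=train_arrays).reshape(-1, 1)
test_arrays_cos = numpy.apply_along_axis(cos_distance_to_fix, axis=1, arr=test_arrays).reshape(-1, 1)
classifier = LogisticRegression()
classifier.fit(train_arrays_cos, train_labels)
classifier.score(test_arrays_cos, test_labels)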
It turns out that this approach does not work: it predicts only about 50% of the test set correctly.
So, my question is: what “information” in the vectors makes prediction based on them work, if it is not the similarity of vectors? Or is my approach simply unable to capture the similarity of vectors correctly?
This is less a question about Doc2Vec than about machine-learning principles with high-dimensional data.
Your approach is collapsing 100-dimensions to a single dimension – the distance to your random point. Then, you're hoping that single dimension can still be predictive.
And roughly all LogisticRegression can do with that single-valued input is try to pick a threshold-number that, when your distance is on one side of that threshold, predicts a class – and on the other side, predicts not-that-class.
Recasting that single-threshold-distance back to the original 100-dimensional space, it's essentially trying to find a hypersphere, around your random point, that does a good job collecting all of a single class either inside or outside its volume.
What are the odds your randomly-placed center-point, plus one adjustable radius, can do that well, in a complex high-dimensional space? My hunch is: not a lot. And your results, no better than random guessing, seem to suggest the same.
The LogisticRegression with access to the full 100-dimensions finds a discriminating-frontier for assigning the class that's described by 100 coefficients and one intercept-value – and all of those 101 values (free parameters) can be adjusted to improve its classification performance.
In comparison, your alternative LogisticRegression with access to only the one 'distance-from-a-random-point' dimension can pick just one coefficient (for the distance) and an intercept/bias. It's got 1/100th as much information to work with, and only 2 free parameters to adjust.
As an analogy, consider a much simpler space: the surface of the Earth. Pick a 'random' point, like say the South Pole. If I then tell you that you are in an unknown place 8900 miles from the South Pole, can you answer whether you are more likely in the USA or China? Hardly – both of those 'classes' of location have lots of instances 8900 miles from the South Pole.
Only in the extremes will the distance tell you for sure which class (country) you're in – because there are parts of the USA's Alaska and Hawaii further north and south than parts of China. But even there, you can't manage well with just a single threshold: you'd need a rule which says, "less than X or greater than Y, in USA; otherwise unknown".
The 100-dimensional space of Doc2Vec vectors (or other rich data sources) will often only be sensibly divided by far more complicated rules. And, our intuitions about distances and volumes based on 2- or 3-dimensional spaces will often lead us astray, in high dimensions.
Still, the Earth analogy does suggest a way forward: there are some reference points on the globe that will work way better, when you know the distance to them, at deciding if you're in the USA or China. In particular, a point at the center of the US, or at the center of China, would work really well.
Similarly, you may get somewhat better classification accuracy if rather than a random fix_vec, you pick either (a) any point for which a class is already known; or (b) some average of all known points of one class. In either case, your fix_vec is then likely to be "in a neighborhood" of similar examples, rather than some random spot (that has no more essential relationship to your classes than the South Pole has to northern-Hemisphere temperate-zone countries).
(Also: alternatively picking N multiple random points, and then feeding the N distances to your regression, will preserve more of the information/shape of the original Doc2Vec data, and thus give the classifier a better chance of finding a useful separating-threshold. Two would likely do better than your one distance, and 100 might approach or surpass the 100 original dimensions.)
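As a rough illustration of the reference-point idea, here is a minimal sketch on synthetic data rather than real Doc2Vec vectors. The two shifted Gaussian classes, the names class_shift and threshold_accuracy, and the simple threshold rule (standing in for the one-feature LogisticRegression) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100

# Synthetic stand-in for doc vectors (assumption): two classes with shifted means
class_shift = 0.3 * rng.normal(size=dim)
X = np.vstack([rng.normal(size=(1000, dim)) + class_shift,
               rng.normal(size=(1000, dim)) - class_shift])
y = np.array([1] * 1000 + [0] * 1000)

# Shuffle, then split into train and test halves
order = rng.permutation(len(y))
X, y = X[order], y[order]
X_train, y_train, X_test, y_test = X[:1000], y[:1000], X[1000:], y[1000:]

def cosine_distance(X, ref):
    """Cosine distance (1 - cosine similarity) from each row of X to ref."""
    return 1.0 - (X @ ref) / (np.linalg.norm(X, axis=1) * np.linalg.norm(ref))

def threshold_accuracy(ref):
    """Fit a one-feature threshold rule on train distances, score it on test."""
    d_train = cosine_distance(X_train, ref)
    d_test = cosine_distance(X_test, ref)
    mean_pos = d_train[y_train == 1].mean()
    mean_neg = d_train[y_train == 0].mean()
    threshold = (mean_pos + mean_neg) / 2.0
    # Predict class 1 on whichever side of the threshold class 1 sits in training
    pred = (d_test < threshold) if mean_pos < mean_neg else (d_test > threshold)
    return float((pred.astype(int) == y_test).mean())

acc_random = threshold_accuracy(rng.random(dim))                       # random fix_vec
acc_centroid = threshold_accuracy(X_train[y_train == 1].mean(axis=0))  # option (b)
print(f"random reference: {acc_random:.2f}, class centroid: {acc_centroid:.2f}")
```

On data like this the centroid reference should separate the classes far better than the random one, but real document vectors are rarely this cleanly arranged, so treat the gap as illustrative only.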
Finally, some comment about the Doc2Vec aspect:
Doc2Vec optimizes vectors that are somewhat-good, within their constrained model, at predicting the words of a text. Positive-sentiment words tend to occur together, as do negative-sentiment words, and so the trained doc-vectors tend to arrange themselves in similar positions when they need to predict similar-meaning-words. So there are likely to be 'neighborhoods' of the doc-vector space that correlate well with predominantly positive-sentiment or negative-sentiment words, and thus positive or negative sentiments.
These won't necessarily be two giant neighborhoods, 'positive' and 'negative', separated by a simple boundary – or even a small number of neighborhoods matching our ideas of 3-D solid volumes. And many subtleties of communication – such as sarcasm, referencing a not-held opinion to critique it, spending more time on negative aspects but ultimately concluding positive, etc. – mean incursions of alternate-sentiment words into texts. A fully-language-comprehending human agent could understand these to conclude the 'true' sentiment, while these word-occurrence based methods will still be confused.
But with an adequate model, and the right number of free parameters, a classifier might capture some generalizable insight about the high-dimensional space. In that case, you can achieve reasonably-good predictions, using the Doc2Vec dimensions – as you've seen with the ~80%+ results on the full 100-dimensional vectors.

facial expression classification using k-means

My method for classifying facial expressions using k-means is:
Use opencv to detect the face in the image
Use ASM and stasm to get the facial feature point
Calculate the distances between facial features (as shown in the picture). There'll be 5 distances.
Calculate the centroid of each distance for each facial expression (e.g. for the distance D1 there are 7 centroids, one per expression: 'happy', 'angry', ...).
Run 5 k-means, one per distance; each k-means returns the expression whose centroid (calculated in the previous step) is closest to the measured distance.
The final expression is the one that appears in the most k-means results.
However, using that method my results are wrong.
Is my method correct or is it wrong somewhere?
K-means is not a classification algorithm. Once run, it simply finds K centroids, so it splits the data into K parts, but in most cases these parts won't have anything to do with the desired classes. This algorithm (like all clustering methods) should be used when you want to explore data and find some distinguishable groups of objects. Distinguishable in any sense. If your task is to build a system that recognizes some given classes, then it is a classification problem, not clustering. One of the simplest methods, easy both to implement and to understand, is KNN (K-nearest neighbours), which roughly does what you are trying to accomplish: it checks which classes the labelled objects closest to a new sample belong to.
To better see the difference let us consider your case - you are trying to detect emotional state based on the face features. Running k-means on such data can split your face photos into many groups:
If you use photos of different people, it can cluster photos of particular people together (as their distances differ from others)
it can split the data into, for example, men and women, as there are gender-specific differences in such features
it can even split your data based on the distance from the camera, as the perspective changes your features, creating "clusters".
etc.
As you can see, there are dozens of possible "reasonable" (and even more completely uninterpretable) splits, and K-means (and any other clustering algorithm) will simply find one of them (in most cases, an uninterpretable one). Classification methods are used to overcome this issue, to "explain" to the algorithm what you are expecting.
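The KNN idea mentioned above can be sketched in a few lines. This is only a minimal illustration: the function name knn_predict, the feature values and the two-expression training set are made up, and a real system would use many labelled photos per expression.

```python
import math
from collections import Counter

def knn_predict(train_features, train_labels, query, k=3):
    """Return the majority label among the k training samples nearest to query."""
    # Sort training indices by Euclidean distance to the query
    order = sorted(range(len(train_features)),
                   key=lambda i: math.dist(train_features[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 5-distance feature vectors (D1..D5); the numbers are invented
train_features = [
    [1.0, 0.8, 0.5, 0.9, 0.3],
    [1.1, 0.9, 0.5, 1.0, 0.3],
    [0.9, 0.8, 0.6, 0.9, 0.4],
    [0.5, 0.4, 0.9, 0.3, 0.8],
    [0.6, 0.4, 1.0, 0.3, 0.9],
    [0.5, 0.5, 0.9, 0.4, 0.8],
]
train_labels = ["happy", "happy", "happy", "angry", "angry", "angry"]

prediction = knn_predict(train_features, train_labels, [1.0, 0.85, 0.5, 0.95, 0.3])
print(prediction)
```

Unlike k-means, the labels take part in the decision here, so the "clusters" are the classes you actually care about.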

Unified distance measure to use in different implementations of OpenCV feature matching?

I'm trying to do some key feature matching in OpenCV, and for now I've been using cv::DescriptorMatcher::match and, as expected, I'm getting quite a few false matches.
Before I start to write my own filter and pruning procedures for the extracted matches, I wanted to try out the cv::DescriptorMatcher::radiusMatch function, which should only return the matches closer to each other than the given float maxDistance.
I would like to write a wrapper for the available OpenCV matching algorithms so that I could use them through an interface which allows for additional functionality as well as additional external (my own) matching implementations.
Since in my code, there is only one concrete class acting as a wrapper to OpenCV feature matching (similarly as cv::DescriptorMatcher, it takes the name of the specific matching algorithm and constructs it internally through a factory method), I would also like to write a universal method to implement matching utilizing cv::DescriptorMatcher::radiusMatch that would work for all the different matcher and feature choices (I have a similar wrapper that allows me to change between different OpenCV feature detectors and also implement some of my own).
Unfortunately, after looking through the OpenCV documentation and the cv::DescriptorMatcher interface, I just can't find any information about the distance measure used to calculate the actual distance between the matches. I found a pretty good matching example here using SURF features and descriptors, but I did not manage to understand the actual meaning of a specific value of the maxDistance argument.
Since I would like to compare the results I'd get when using different feature/descriptor combinations, I would like to know what kind of distance measure is used (and if it can easily be changed), so that I can use something that makes sense with all the combinations I try out.
Any ideas/suggestions?
Update
I've just printed out the feature distances I get when using cv::DescriptorMatcher::match with various feature/descriptor combinations, and what I got was:
MSER/SIFT order of magnitude: 100
SURF/SURF order of magnitude: 0.1
SURF/SIFT order of magnitude: 50
MSER/SURF order of magnitude: 0.2
From this I can conclude that whichever distance measure is applied to the features, it is definitely not normalized. Since I am using OpenCV's and my own interfaces to work with different feature extraction, descriptor calculation and matching methods, I would like to have some argument for ::radiusMatch that I could use with all (or most) of the different combinations. (I've tried matching using the BruteForce and FlannBased matchers, and while the matches are slightly different, the distances between the matches are on the same order of magnitude for each of the combinations.)
Some context:
I'm testing this on two pictures acquired from a camera mounted on top of a (slow) moving vehicle. The images should be around 5 frames (1 meter of vehicle motion) apart, so most of the features should be visible, and not much different (especially those that are far away from the camera in both images).
The magnitude of the distance is indeed dependent on the type of feature used. That is because some specialized feature descriptors also come with a specialized feature matcher that makes optimal use of the descriptor. If you want to obtain weights for the match distances of different feature types, your best bet is probably to make a training set of a dozen or more 1:1 matches, unleash each feature detector/matcher on it, and normalize the distances so that each detector has an average distance of 1 over all matches. You can then use the obtained weights on other datasets.
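The normalization suggested above could look roughly like the sketch below. The reference distances are invented numbers in the same range as those reported in the question, and the helper names (mean_distance, normalize, raw_radius) are hypothetical, not OpenCV functions.

```python
def mean_distance(distances):
    """Average raw match distance over a set of known-correct matches."""
    return sum(distances) / len(distances)

# Hypothetical raw distances of hand-verified 1:1 matches, per combination
reference_distances = {
    "MSER/SIFT": [90.0, 110.0, 100.0],
    "SURF/SURF": [0.08, 0.12, 0.10],
}
scale = {name: mean_distance(d) for name, d in reference_distances.items()}

def normalize(name, raw_distance):
    """Map a raw distance onto the shared scale where the reference mean is 1."""
    return raw_distance / scale[name]

def raw_radius(name, normalized_radius):
    """Convert one shared radius into the raw maxDistance for that combination."""
    return normalized_radius * scale[name]

print(raw_radius("MSER/SIFT", 0.5), raw_radius("SURF/SURF", 0.5))
```

A wrapper could then expose a single normalized radius and pass raw_radius(...) as maxDistance to radiusMatch for whichever combination is active.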
You should have a look at the following function in features2d.hpp in the OpenCV library:
template<class Distance> void BruteForceMatcher<Distance>::commonRadiusMatchImpl()
Usually the L2 distance is used to measure the distance between matches, but it depends on the descriptor you use. For example, the Hamming distance is appropriate for the BRIEF descriptor, since it counts the bit differences between two binary strings.
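To make the difference concrete, here is a small sketch of the two measures in plain Python. The function names and the toy descriptors are illustrative only; OpenCV computes these internally.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two float descriptors (e.g. SIFT, SURF)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two binary descriptors (e.g. BRIEF)."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

print(l2_distance([0.0, 3.0], [4.0, 0.0]))                            # 5.0
print(hamming_distance(bytes([0b10110000]), bytes([0b10011000])))     # 2
```

Because one measure depends on the descriptor's value range and length while the other is a bit count, their raw magnitudes are not comparable, which is consistent with the differing orders of magnitude observed in the question.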

What's the difference between local and global basis functions in a regression?

For a regression with some basis functions, I read that Gaussian basis functions are local whereas polynomial basis functions are global. What does that mean?
Thank you
A Gaussian is centered around a certain value and tapers off to 0 as you get far away from it. In contrast, a polynomial extends over the whole range.
This means that a Gaussian will model a local feature of the data (like a bump or valley), whereas a polynomial will model global patterns in the data (say, an overall downward or upward trend).
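A quick numerical check makes this visible. The sketch below evaluates one Gaussian basis function and one monomial at a few points; the center, width and degree are arbitrary choices for illustration.

```python
import math

def gaussian_basis(x, center, width):
    """Gaussian bump: equals 1 at the center and decays towards 0 away from it."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))

def monomial_basis(x, degree):
    """Polynomial term: keeps growing in magnitude as |x| grows."""
    return x ** degree

for x in (0.0, 1.0, 5.0):
    print(x, gaussian_basis(x, center=0.0, width=1.0), monomial_basis(x, degree=2))
```

Five widths away from its center the Gaussian contributes almost nothing to the fit, while the monomial contributes more there than anywhere nearer the origin.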
A local basis function (which you will also often see referred to as a compactly supported basis function) is essentially non-zero only on a particular interval. Examples of such functions used in approximation / regression are B-splines, wavelets etc. Polynomials, on the other hand, are non-zero everywhere apart from at their roots. Consider a least-squares regression curve using a monomial basis: your resulting Vandermonde matrix will not exhibit any kind of structure, since an element can only be zero if x=0. Now assume that you try the same problem with a B-spline curve with fixed knots. Because the basis functions are local, your matrix will be banded: each row will contain zero entries, because the effect of each basis function is only present on a certain interval.
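The structural difference can be seen by building both design matrices for the same sample points. This sketch uses linear B-splines ("hat" functions) as the simplest compactly supported basis; the sample points, knots and helper names are arbitrary illustrative choices.

```python
def monomial_row(x, degree):
    """One row of a Vandermonde matrix: 1, x, x^2, ..., x^degree."""
    return [x ** j for j in range(degree + 1)]

def hat_row(x, knots):
    """Linear B-spline (hat) functions centered on equally spaced knots."""
    h = knots[1] - knots[0]
    return [max(0.0, 1.0 - abs(x - k) / h) for k in knots]

def count_zeros(matrix):
    return sum(value == 0.0 for row in matrix for value in row)

xs = [0.03 + 0.1 * i for i in range(10)]   # sample points in (0, 1), none on a knot
knots = [0.0, 0.25, 0.5, 0.75, 1.0]

vandermonde = [monomial_row(x, 4) for x in xs]
hat_matrix = [hat_row(x, knots) for x in xs]

print(count_zeros(vandermonde), count_zeros(hat_matrix))
```

Every row of the hat-function matrix has only the two entries for the neighbouring knots non-zero, which is what gives the banded structure once the rows are ordered by x.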

What is the relation between the number of Support Vectors and training data and classifiers performance? [closed]

I am using LibSVM to classify some documents. As the final results show, the documents seem to be a bit difficult to classify. However, I noticed something while training my models: if my training set has, for example, 1000 documents, around 800 of them are selected as support vectors.
I have looked everywhere to find out whether this is a good thing or a bad one. Is there a relation between the number of support vectors and the classifier's performance?
I have read this previous post but I am performing a parameter selection and also I am sure that the attributes in the feature vectors are all ordered.
I just need to know the relation.
Thanks.
p.s: I use a linear kernel.
Support Vector Machines are an optimization problem. They are attempting to find a hyperplane that divides the two classes with the largest margin. The support vectors are the points which fall within this margin. It's easiest to understand if you build it up from simple to more complex.
Hard Margin Linear SVM
In a training set where the data is linearly separable, and you are using a hard margin (no slack allowed), the support vectors are the points which lie along the supporting hyperplanes (the hyperplanes parallel to the dividing hyperplane at the edges of the margin).
All of the support vectors lie exactly on the margin. Regardless of the number of dimensions or the size of the data set, the number of support vectors could be as few as 2.
Soft-Margin Linear SVM
But what if our dataset isn't linearly separable? We introduce the soft-margin SVM. We no longer require that our datapoints lie outside the margin; we allow some of them to stray over the line into the margin. We use the slack parameter C (nu in nu-SVM) to control this. This gives us a wider margin and greater error on the training dataset, but improves generalization and/or allows us to find a linear separation of data that is not linearly separable.
Now, the number of support vectors depends on how much slack we allow and on the distribution of the data. If we allow a large amount of slack, we will have a large number of support vectors. If we allow very little slack, we will have very few support vectors. The accuracy depends on finding the right level of slack for the data being analyzed. For some data it will not be possible to get a high level of accuracy; we must simply find the best fit we can.
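For reference, the soft-margin problem is commonly written as below. This is the standard C-SVM primal as used by LIBSVM-style solvers; the symbols w, b and the slack variables \xi_i are notation introduced here, not taken from the text above.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i \ge 0 .
```

In this form C is the price paid per unit of slack: a small C makes slack cheap (wider margin, typically more support vectors), while a large C makes it expensive (narrower margin, typically fewer support vectors).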
Non-Linear SVM
This brings us to non-linear SVM. We are still trying to linearly divide the data, but we are now trying to do it in a higher dimensional space. This is done via a kernel function, which of course has its own set of parameters. When we translate this back to the original feature space, the result is non-linear.
Now, the number of support vectors still depends on how much slack we allow, but it also depends on the complexity of our model. Each twist and turn in the final model in our input space requires one or more support vectors to define. Ultimately, the output of an SVM is the support vectors and an alpha, which in essence is defining how much influence that specific support vector has on the final decision.
Here, accuracy depends on the trade-off between a high-complexity model which may over-fit the data and a large-margin which will incorrectly classify some of the training data in the interest of better generalization. The number of support vectors can range from very few to every single data point if you completely over-fit your data. This tradeoff is controlled via C and through the choice of kernel and kernel parameters.
I assume when you said performance you were referring to accuracy, but I thought I would also speak to performance in terms of computational complexity. In order to test a data point using an SVM model, you need to compute the dot product of each support vector with the test point. Therefore the computational complexity of the model is linear in the number of support vectors. Fewer support vectors means faster classification of test points.
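The cost argument can be read directly off the decision function. Below is a minimal sketch with a linear kernel; dual_coefs stands for the products alpha_i * y_i, and all names and numbers are illustrative rather than LibSVM's actual API.

```python
def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def decision_value(support_vectors, dual_coefs, bias, x, kernel=linear_kernel):
    """Weighted sum of kernel evaluations: one evaluation per support vector."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) + bias

# Toy model with two support vectors (made-up values)
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
dual_coefs = [0.5, -0.5]          # alpha_i * y_i for each support vector
score = decision_value(support_vectors, dual_coefs, 0.0, [2.0, 0.0])
print(score)                      # positive score -> positive class
```

The loop runs once per support vector, so a model with 800 support vectors does roughly 800 kernel evaluations for every test point.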
A good resource:
A Tutorial on Support Vector Machines for Pattern Recognition
800 out of 1000 basically tells you that the SVM needs to use almost every single training sample to encode the training set. That basically tells you that there isn't much regularity in your data.
Sounds like you have major issues with not enough training data. Also, maybe think about some specific features that separate this data better.
Both the number of samples and the number of attributes may influence the number of support vectors, making the model more complex. I believe you use words or even n-grams as attributes, so there are quite a lot of them, and natural language models are very complex themselves. So, 800 support vectors out of 1000 samples seems to be OK. (Also pay attention to @karenu's comments about the C/nu parameters, which also have a large effect on the number of SVs.)
To get intuition about this recall SVM main idea. SVM works in a multidimensional feature space and tries to find hyperplane that separates all given samples. If you have a lot of samples and only 2 features (2 dimensions), the data and hyperplane may look like this:
Here there are only 3 support vectors, all the others are behind them and thus don't play any role. Note, that these support vectors are defined by only 2 coordinates.
Now imagine that you have 3 dimensional space and thus support vectors are defined by 3 coordinates.
This means that there's one more parameter (coordinate) to be adjusted, and this adjustment may need more samples to find the optimal hyperplane. In other words, in the worst case SVM finds only 1 hyperplane coordinate per sample.
When the data is well structured (i.e. holds patterns quite well), only a few support vectors may be needed; all the others will stay behind those. But text is very, very badly structured data. SVM does its best, trying to fit the samples as well as possible, and thus takes more samples as support vectors than it drops. With an increasing number of samples this "anomaly" is reduced (more insignificant samples appear), but the absolute number of support vectors stays very high.
SVM classification is linear in the number of support vectors (SVs). The number of SVs is in the worst case equal to the number of training samples, so 800/1000 is not yet the worst case, but it's still pretty bad.
Then again, 1000 training documents is a small training set. You should check what happens when you scale up to 10000s or more documents. If things don't improve, consider using linear SVMs, trained with LibLinear, for document classification; those scale up much better (model size and classification time are linear in the number of features and independent of the number of training samples).
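The reason a linear model does not need to keep its support vectors can be sketched as follows: with a linear kernel, the weighted sum over support vectors collapses into one weight vector. The function names and numbers are illustrative and are not the LibLinear API.

```python
def collapse_linear_svm(support_vectors, dual_coefs):
    """Fold all support vectors into one weight vector w = sum_i coef_i * sv_i."""
    n_features = len(support_vectors[0])
    w = [0.0] * n_features
    for sv, coef in zip(support_vectors, dual_coefs):
        for j in range(n_features):
            w[j] += coef * sv[j]
    return w

def predict_score(w, bias, x):
    """One dot product with w: cost depends on the number of features only."""
    return sum(wj * xj for wj, xj in zip(w, x)) + bias

# Toy model (made-up values); dual_coefs are alpha_i * y_i
support_vectors = [[1.0, 1.0], [-1.0, -1.0], [2.0, 0.0]]
dual_coefs = [0.5, -0.5, 0.25]

w = collapse_linear_svm(support_vectors, dual_coefs)
score = predict_score(w, -1.0, [2.0, 1.0])
print(w, score)
```

After the collapse, the model is just w and the bias, no matter whether 8 or 800 training samples ended up as support vectors.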
There is some confusion between sources. In the textbook ISLR 6th Ed, for instance, C is described as a "boundary violation budget", from which it follows that a higher C will allow for more boundary violations and more support vectors.
But in SVM implementations in R and Python the parameter C is implemented as a "violation penalty", which is the opposite, and then you will observe that for higher values of C there are fewer support vectors.
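The "budget" convention can be written out as follows. This is the support vector classifier formulation as given in ISLR, with M the margin width and \epsilon_i the slack variables; the notation is the textbook's, not that of the answers above.

```latex
\max_{\beta_0,\,\beta,\,\epsilon,\,M}\ M
\quad\text{subject to}\quad
\lVert\beta\rVert = 1,\qquad
y_i\,(\beta_0 + \beta^{\top}x_i)\ \ge\ M\,(1-\epsilon_i),\qquad
\epsilon_i \ge 0,\qquad \sum_{i=1}^{n}\epsilon_i \le C .
```

Here C caps the total slack, so a larger C permits more violations. Implementations that follow the LIBSVM convention instead minimize \frac{1}{2}\lVert w\rVert^{2} + C\sum_i \xi_i, where C multiplies the slack, so a larger C permits fewer violations. The two parameters play inverse roles even though they share a letter.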
