Plotting values onto a histogram in Rails

I realize there are several histogram gems, but my question is a bit unique. I don't need a graph or image of any kind. My rails app has an algorithm that gives each user a score between 0 and 1. e.g. billybob's raw_score might be .00901 and frankiejoe's raw_score might be .00071.
Without going into why, I'd like to plot these values on a histogram, then map the mean raw_score to 50%, the mean plus one standard deviation to about 65% (and the mean minus one standard deviation to 35%), the mean plus two standard deviations to 80%, and so on: 15 percentile points per standard deviation unit.
I don't need the actual histogram chart/image; I just want each score's corresponding histogram value. I am essentially converting the numbers into a more aesthetically pleasing score, e.g. billybob's histogram_score might now be .987 and frankiejoe's might be .471. For now it's only dozens of users or scores, but I'd like the ability to handle thousands of users/scores.
I'd like to store the converted value in my database. The numbers I have now are raw_score:decimal and I'll store them as histogram_score:decimal.
How might I do this in my rails app?
Thank you!

Figured this out: the descriptive_statistics gem accomplishes this for me.
require 'descriptive_statistics'
data = [0.15, 0.25, 0.10, 0.05, 0.35, 0.10]
data.map {|score| data.percentile_rank(score)}
=> [66.66666666666666, 83.33333333333334, 50.0, 16.666666666666664, 100.0, 50.0]
In my case, I'm just looping through each user's raw_score and storing it as the percentile_score. Works great!
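For anyone wanting the same conversion outside Ruby, here's a minimal sketch of the percentile-rank idea in Python using SciPy (kind='weak' counts values <= score, which reproduces the gem's output above; dividing by 100 gives the 0-to-1 histogram_score the question describes):
from scipy import stats

data = [0.15, 0.25, 0.10, 0.05, 0.35, 0.10]

# kind='weak' counts values <= score, matching the output shown above
ranks = [stats.percentileofscore(data, s, kind='weak') for s in data]
print(ranks)  # ~ [66.7, 83.3, 50.0, 16.7, 100.0, 50.0]

# Scale to 0..1 if you want to store it as histogram_score
histogram_scores = [r / 100 for r in ranks]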

Related

Imbalanced dataset, size limitation of 60mb, email categorization

I have a highly imbalanced dataset (approx. 1:100) of 1 GB of raw emails, and I have to categorize these emails into 15 categories.
The problem I have is that the file used to train the model cannot be larger than 40 MB.
So I want to filter out, for each category, the emails that best represent the whole category.
For example: for a category A there are 100 emails in the dataset; due to the size limitation I want to keep only the 10 emails that best represent the features of all 100.
I read that TF-IDF can be used to do this: for each category, create a corpus of all the emails for that particular category and then try to find the emails that best represent it, but I'm not sure how to do that. A code snippet would be of great help.
Also, there are a lot of junk words and hash values in the dataset. Should I clean all of those out? Even if I try, it's a lot to clean, and doing it manually is hard.
TF-IDF stands for Term Frequency–Inverse Document Frequency. The idea is to find out which words are more representative, based on generality and specificity.
The approach you were offered is not bad and could work as a shallow approach. Here's a snippet to help you understand how to do it:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
## Suppose docs1 and docs2 are the two groups of e-mails. Notice that docs1 has more entries than docs2
docs1 = ['In digital imaging, a pixel, pel,[1] or picture element[2] is a physical point in a raster image, or the smallest addressable element in an all points addressable display device; so it is the smallest controllable element of a picture represented on the screen',
'Each pixel is a sample of an original image; more samples typically provide more accurate representations of the original. The intensity of each pixel is variable. In color imaging systems, a color is typically represented by three or four component intensities such as red, green, and blue, or cyan, magenta, yellow, and black.',
'In some contexts (such as descriptions of camera sensors), pixel refers to a single scalar element of a multi-component representation (called a photosite in the camera sensor context, although sensel is sometimes used),[3] while in yet other contexts it may refer to the set of component intensities for a spatial position.',
'The word pixel is a portmanteau of pix (from "pictures", shortened to "pics") and el (for "element"); similar formations with \'el\' include the words voxel[4] and texel.[4]',
'The word "pixel" was first published in 1965 by Frederic C. Billingsley of JPL, to describe the picture elements of video images from space probes to the Moon and Mars.[5] Billingsley had learned the word from Keith E. McFarland, at the Link Division of General Precision in Palo Alto, who in turn said he did not know where it originated. McFarland said simply it was "in use at the time" (circa 1963).[6]'
]
docs2 = ['In applied mathematics, discretization is the process of transferring continuous functions, models, variables, and equations into discrete counterparts. This process is usually carried out as a first step toward making them suitable for numerical evaluation and implementation on digital computers. Dichotomization is the special case of discretization in which the number of discrete classes is 2, which can approximate a continuous variable as a binary variable (creating a dichotomy for modeling purposes, as in binary classification).',
'Discretization is also related to discrete mathematics, and is an important component of granular computing. In this context, discretization may also refer to modification of variable or category granularity, as when multiple discrete variables are aggregated or multiple discrete categories fused.',
'Whenever continuous data is discretized, there is always some amount of discretization error. The goal is to reduce the amount to a level considered negligible for the modeling purposes at hand.',
'The terms discretization and quantization often have the same denotation but not always identical connotations. (Specifically, the two terms share a semantic field.) The same is true of discretization error and quantization error.'
]
## We sum them up to have a universal TF-IDF dictionary, so that we can 'compare oranges to oranges'
docs3 = docs1+docs2
## Using sklearn's TfidfVectorizer - it is easy and straightforward!
vectorizer = TfidfVectorizer()
## Now we make the universal TF-IDF dictionary. MAKE SURE TO USE THE MERGED LIST AND fit() [not fit_transform()]
X = vectorizer.fit(docs3)
## Checking the array shapes after using transform (fitting them to the tf-idf dictionary)
## Notice that they are the same size but with distinct number of lines
print(X.transform(docs1).toarray().shape, X.transform(docs2).toarray().shape)
(5, 221) (4, 221)
## Now, to "merge" them all, there are many ways to do it - here I used a simple "mean" method.
transformed_docs1 = np.mean(X.transform(docs1).toarray(), axis=0)
transformed_docs2 = np.mean(X.transform(docs2).toarray(), axis=0)
print(transformed_docs1)
print(transformed_docs2)
[0.02284796 0.02284796 0.02805426 0.06425141 0. 0.03212571
0. 0.03061173 0.02284796 0. 0. 0.04419432
0.08623564 0. 0. 0. 0.03806573 0.0385955
0.04569592 0. 0.02805426 0.02805426 0. 0.04299283
...
0. 0.02284796 0. 0.05610853 0.02284796 0.03061173
0. 0.02060219 0. 0.02284796 0.04345487 0.04569592
0. 0. 0.02284796 0. 0.03061173 0.02284796
0.04345487 0.07529817 0.04345487 0.02805426 0.03061173]
## These are the final Shapes.
print(transformed_docs1.shape, transformed_docs2.shape)
(221,) (221,)
About removing junk words: TF-IDF averages rare words out (numbers, hashes, etc.); if a token is too rare, it won't matter much. But junk tokens can greatly increase the size of your input vectors, so I'd advise you to find a way to clean them. Also, consider some NLP preprocessing steps, such as lemmatization, to reduce dimensionality.
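The original question also asked how to pick the handful of emails that best represent a category. One simple, shallow way to do that with the vectors above is to rank each email by cosine similarity to its category's mean TF-IDF vector and keep the top k. A sketch under that assumption (the helper name is illustrative):
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def most_representative(emails, fitted_vectorizer, k=10):
    # Vectorize with the shared (universal) TF-IDF dictionary
    vecs = fitted_vectorizer.transform(emails).toarray()
    # The category centroid - the same "mean" idea as above
    centroid = vecs.mean(axis=0, keepdims=True)
    # Rank emails by closeness to the centroid and keep the top k
    sims = cosine_similarity(vecs, centroid).ravel()
    top = np.argsort(sims)[::-1][:k]
    return [emails[i] for i in top]

## e.g. with the fitted vectorizer X from the snippet above:
## best_of_docs1 = most_representative(docs1, X, k=2)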

Get a blob's skewness by inspecting the 3rd order moment?

I've been reading up on Image Moments and it looks like they are very useful for efficiently describing a blob. Apparently, the 3rd order moment represents/can tell me about a blob's skewness (is this correct?).
How can I get the 3rd order moment in OpenCV? Do you have a calculation/formula you can point me to?
Moments m = moments(contour, false);
// Are any of these the 3rd order moment?
m.m03;
m.mu03;
m.nu03;
As stated in the OpenCV docs for moments():
Calculates all of the moments up to the third order of a polygon or rasterized shape.
So yes, moments() does return what you're after.
The three quantities you mention, m03, mu03, and nu03, are all variants of the third order moment.
m03 is the raw third order moment.
mu03 is the third order central moment, i.e. the same as m03 but computed as if the blob were centered at (0, 0).
nu03 is mean-shifted and normalized, i.e. the same as mu03 but divided by a power of the area, which makes it scale-invariant.
Let's say you wanted to describe a shape but be agnostic to its size or location in your image. Then you would use the mean-shifted and normalized descriptor, nu03. If you wanted to keep the size as part of the description, you'd use mu03. And if you wanted to keep the location information as well, you'd use m03.
You can think about it the same way you think about distributions in general. Saying that I have a sample of x = 500 from a normal distribution with mean 450 and standard deviation 25 is basically the same thing as saying I have a sample of x = 2 from a normal distribution with mean 0 and std dev 1. Sometimes you might want to talk about the distribution in terms of its actual parameters (mean 450, std dev 25), sometimes as if it's mean-centered (mean 0, std dev 25), and sometimes as if it's the standard Gaussian (mean 0, std dev 1).
This excellent answer goes over how to manually calculate the moments, which in tandem with the 'basic' descriptions I gave above, should make the formulas in the OpenCV docs for moments() a little more comfortable.
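For completeness, here's a minimal Python sketch (the C++ API is analogous) showing where these three values live in what cv2.moments returns; the triangle contour is just a made-up blob for illustration:
import cv2
import numpy as np

# A hypothetical blob: a triangle given as a contour
contour = np.array([[0, 0], [100, 0], [30, 80]], dtype=np.int32)

M = cv2.moments(contour)   # all moments up to 3rd order, as the docs say
print(M['m03'])            # raw 3rd order moment (location-dependent)
print(M['mu03'])           # central: as if the blob were centered at (0, 0)
print(M['nu03'])           # normalized central: scale-invariant as well

# A common skewness-style measure along y built from the central moments:
skew_y = M['mu03'] / (M['mu02'] ** 1.5) if M['mu02'] > 0 else 0.0
print(skew_y)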

AudioKit FFT conversion to dB?

First time posting, thanks for the great community!
I am using AudioKit and trying to add frequency-weighting filters to the microphone input, so I am trying to understand the values that come out of AudioKit's AKFFTTap.
Currently I am just trying to print the FFT buffer converted into dB values:
for i in 0..<self.bufferSize {
    let db = 20 * log10((self.fft?.fftData[Int(i)])!)
    print(db)
}
I was expecting values in the range of about -128 to 0, but I am getting strange values of nearly -200 dB, and when I blow on the microphone to peg out the readings it only reaches about -60. Am I not approaching this correctly? I was assuming that the values output by the EZAudioFFT engine would be plain amplitude values and that the normal dB conversion math would work. Anyone have any ideas?
Thanks in advance for any discussion about this issue!
You need to add up all of the values from self.fft?.fftData (take the absolute value of each, since some may be negative, before adding) and then convert that sum to decibels.
The values in the array correspond to the values of the bins in the FFT. Having a single bin contain a magnitude value close to 1 would mean that a great amount of energy is in that narrow frequency band e.g. a very loud sinusoid (a signal with a single frequency).
Normal sounds, such as the one caused by you blowing on the mic, spread their energy across the entire spectrum, that is, in many bins instead of just one. For this reason, usually the magnitudes get lower as the FFT size increases.
A magnitude of -40 dB in a single bin is quite loud. If you try to play a pure tone, you should see a clear peak in one of the bins.
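A quick numeric illustration of the energy-spreading point, sketched with plain NumPy (the sample rate and FFT size are made-up example values): a pure tone and broadband noise at the same RMS level give very different single-bin dB readings.
import numpy as np

fs = 44100                       # sample rate (assumed)
n = 1024                         # FFT size (assumed)
t = np.arange(n) / fs
f = 16 * fs / n                  # exactly on bin 16, so no leakage

tone = np.sin(2 * np.pi * f * t)             # energy lands in one bin
noise = np.random.randn(n)
noise *= np.sqrt(np.mean(tone**2) / np.mean(noise**2))  # match the RMS

for x in (tone, noise):
    mags = np.abs(np.fft.rfft(x)) / n        # normalized bin magnitudes
    print(20 * np.log10(mags.max()))         # loudest single bin, in dB

# The tone's peak bin reads about -6 dB; the equally loud noise
# spreads across ~512 bins, so its loudest bin reads far lower.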

How to find the frequency from FFT data in MATLAB

From the question: How do I obtain the frequencies of each value in an FFT?
I have a similar question. I understand the answers to the previous question, but I would like further clarification on frequency. Is frequency the same as the index?
Let's go for an example: assume we have a 1x200 array of data in MATLAB. When you apply abs(fft(x)) to that array, it gives an array of the same size (1x200) as the result. So, does this mean this array contains magnitudes? And do the indices of these magnitudes correspond to frequencies, like 1, 2, 3, 4, ..., 200? Or, if this assumption is wrong, please tell me how to find the frequency from the magnitude.
Instead of using the FFT directly you can use MATLAB's periodogram function, which takes care of a lot of the housekeeping for you, and which will plot the X (frequency) axis correctly if you supply the sample rate. See e.g. this answer.
For clarification though: the index of the FFT corresponds to frequency, but not one-to-one. In MATLAB, element i of the FFT output corresponds to frequency (i-1)*fs/N, where fs is the sample rate and N is the length of the array, and the magnitude of the complex value at each index tells you the amplitude of the signal at that frequency.
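To make the index-to-frequency mapping concrete, here's a small sketch in Python (NumPy's fftfreq does the same bookkeeping as the formula above; the sample rate is an assumed example value):
import numpy as np

fs = 1000.0                       # sample rate in Hz (assumed)
n = 200                           # array length, as in the question
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 50 * t)    # a 50 Hz test tone

mags = np.abs(np.fft.fft(x))           # 200 magnitudes, one per bin
freqs = np.fft.fftfreq(n, d=1/fs)      # the frequency in Hz for each index

peak = np.argmax(mags[:n // 2])        # search the positive-frequency half
print(freqs[peak])                     # -> 50.0: the frequency, not the index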

Kohonen SOM Maps: Normalizing the input with unknown range

According to "Introduction to Neural Networks with Java By Jeff Heaton", the input to the Kohonen neural network must be the values between -1 and 1.
It is possible to normalize inputs where the range is known beforehand:
For instance RGB (125, 125, 125), where the range is known to be values between 0 and 255:
1. Divide by 255: (125/255) = 0.5 >> (0.5,0.5,0.5)
2. Multiply by two and subtract one: ((0.5*2)-1)=0 >> (0,0,0)
The question is: how can we normalize the input where the range is unknown beforehand, as with height or weight?
Also, some other papers mention that the input must be normalized to the values between 0 and 1. Which is the proper way, "-1 and 1" or "0 and 1"?
You can always use a squashing function to map an infinite interval to a finite interval. E.g. you can use tanh.
You might want to use tanh(x * l) with a manually chosen l, though, in order not to put too many objects in the same region. So if you have a good guess that the maximal values of your data are about +/- 500, you might want to use tanh(x / 1000) as a mapping, where x is the value of your object. It might even make sense to subtract your guess of the mean from x, yielding tanh((x - mean) / max).
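A minimal sketch of that suggestion in Python (the center and scale values are guesses you'd supply per feature, not learned):
import numpy as np

def squash(x, center=0.0, scale=1000.0):
    # Map unbounded inputs into (-1, 1) via tanh, as described above
    return np.tanh((x - center) / scale)

heights_cm = np.array([150.0, 170.0, 195.0])
weights_kg = np.array([55.0, 80.0, 120.0])

# Guessed characteristic center/scale per feature
print(squash(heights_cm, center=170.0, scale=50.0))
print(squash(weights_kg, center=75.0, scale=50.0))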
From what I know about Kohonen SOMs, the specific normalization does not really matter.
Well, it might matter through specific choices for the values of the learning algorithm's parameters, but the most important thing is that the different dimensions of your input points have to be of the same magnitude.
Imagine that each data point is not a pixel with the three RGB components but a vector with statistical data for a country, e.g. area, population, ....
It is important for the convergence of the learning part that all these numbers are of the same magnitude.
Therefore, it does not really matter if you don't know the exact range, you just have to know approximately the characteristic amplitude of your data.
For weight and size, I'm sure that if you divide them respectively by 200 kg and 3 meters, all your data points will fall in the ]0, 1] interval. You could even use 50 kg and 1 meter; the important thing is that all coordinates would be of order 1.
Finally, you could consider running a linear analysis tool such as POD on the data, which would automatically give you a way to normalize your data and a subspace for the initialization of your map.
Hope this helps.
