y-axis ticks of seaborn displot - histogram

Can someone explain what the ticks on the y-axis mean?
I created a seaborn.distplot of the Age column from the Titanic dataset from Kaggle. I know distplot is similar to a histogram and values fall between bins, but I was not able to understand what the y-axis ticks meant. Any help is appreciated. Thanks

The graph shows the probability density. You may read it as follows: take, e.g., the bar from 20 to 24 years, which shows a probability density of 0.035. The probability of someone being between 20 and 24 years old is hence (24-20)*0.035 = 0.14 = 14%.
Apart from that, you can learn a lot qualitatively; e.g. it is interesting that there are more than twice as many children aged below 4 years as between 4 and 8 years.
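What makes the bar-area calculation above work is that a density-normalized histogram has total bar area 1. A minimal sketch with numpy (the ages array here is made up for illustration, not the actual Titanic data):

```python
import numpy as np

# hypothetical ages; in the original question this would be the Titanic Age column
ages = np.array([2, 3, 22, 23, 24, 25, 30, 35, 40, 50, 60])

# density=True normalizes so that sum(height * bin_width) == 1,
# which is the same y-axis scale that seaborn's distplot shows
heights, edges = np.histogram(ages, bins=5, density=True)
widths = np.diff(edges)

print((heights * widths).sum())  # total area under the bars, essentially 1.0
```

So each tick on the y-axis is "probability per year of age", and multiplying a bar's height by its width gives the probability of falling in that bin.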

Related

How do I create a function that will run input-dependent sums or outcomes in Google Sheets?

I have built a spreadsheet for a game I play. Its purpose is to track the number of points earned after each round. The number of points earned depends on the win/loss of that round, and the points then compound or retract accordingly.
There are 24 matches played in this game, with a possible result of win/loss for each round, so there are 48 different individual results that can occur. The complicated thing is that each round's points depend on win/loss as well as the previous round's individual earnings. For example, if Round 2 is won and earns 120 points, and Round 3 is won, it earns 150 points. But if Round 3 is lost, 120 points is earned.
I am looking to build a program or function that will compute the final score for every one of the 16,777,216 (2^24) possible combinations of outcomes.
Thanks in advance!
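Since the exact compounding rule isn't spelled out in the question, here is a hedged Python sketch: `final_score` uses a placeholder rule (wins compound the total by 25%, losses leave it unchanged) that you would replace with the game's real rule, and `itertools.product` enumerates every win/loss sequence; with 24 rounds that is the full 16,777,216-combination enumeration (the demo uses fewer rounds to stay fast):

```python
from itertools import product

def final_score(outcomes, start=100):
    """Hypothetical scoring rule -- replace with the game's real rule:
    a win compounds the running total by 25%, a loss leaves it unchanged."""
    points = start
    for won in outcomes:
        if won:
            points *= 1.25
    return points

# every win/loss combination for n rounds is 2**n sequences;
# with n = 24 this would be all 16,777,216 outcome combinations
n = 10
scores = [final_score(seq) for seq in product([True, False], repeat=n)]
print(len(scores))  # 2**10 == 1024 sequences
```

For the full 24-round enumeration, a spreadsheet is the wrong tool; a script like this runs the 16.7M cases in seconds.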

How do I combine two electromagnetic readings to predict the position of a sensor?

I have an electromagnetic sensor and electromagnetic field emitter.
The sensor will read power from the emitter. I want to predict the position of the sensor using the reading.
Let me simplify the problem: suppose the sensor and the emitter are in a one-dimensional world, where there is only a position X (not X, Y, Z), and the emitter emits power as a function of distance squared.
From the painted image below, you will see that the emitter is drawn as a circle and the sensor is drawn as a cross.
E.g. if the sensor is 5 meter away from the emitter, the reading you get on the sensor will be 5^2 = 25. So the correct position will be either 0 or 10, because the emitter is at position 5.
So, with one emitter, I cannot know the exact position of the sensor. I only know that there is a 50% chance it's at 0 and a 50% chance it's at 10.
So if I have two emitters like the following image:
I will get two readings. And I can know exactly where the sensor is. If the reading is 25 and 16, I know the sensor is at 10.
So from this fact, I want to use 2 emitters to locate the sensor.
Now that I've explained the situation, my problems are these:
1. The emitter has a more complicated function of distance; it's not just distance squared. It also has noise, so I'm trying to model it using machine learning.
2. In some areas the emitter doesn't work well. E.g. if you are between 3 and 4 meters away, the emitter will always give you a fixed reading of 9 instead of going from 9 to 16.
3. When I train the machine learning model with 2 inputs, the prediction is very accurate. E.g. if the input is 25,36 then the output will be position 0. But it means that after training, I cannot move the emitters at all. If I move one of the emitters further away, the prediction breaks immediately, because the reading will be something like 25,49 when the right emitter moves 1 meter to the right, and the prediction can be anything because the model has never seen this input pair before. And I cannot afford to train the model on every possible distance between the 2 emitters.
4. The emitters may not be exactly identical. The difference will be in scale; e.g. one of the emitters can give a 10% bigger reading. But you can ignore this problem for now.
My question is: how do I make the model work when the emitters are allowed to move? Give me some ideas.
Some of my ideas:
1. I think I have to figure out the positions of both emitters relative to each other dynamically. But after knowing the positions of both emitters, how do I tell the model?
2. I have tried training on each emitter separately instead of pairing them as input. But that means there are many positions that cause conflicts: when you get reading=25, the model will predict the average of 0 and 10, because both are valid positions for reading=25. You might suggest training to predict distance instead of position; that would work if it weren't for problem number 2. Because of problem number 2, the prediction between 3 and 4 meters away will be wrong: the model gets 9 as input, and the output will be the average distance of 3.5 meters, or somewhere between 3 and 4 meters.
3. Use the model to predict a position probability density function instead of the position itself. E.g. when the reading is 9, the model should predict a uniform density function from 3 to 4 meters, and then you combine the 2 density functions from the 2 readings somehow. But I don't think this will be as accurate as modeling the 2 emitters together, because the density function can be quite complicated; we cannot assume a normal distribution or even a uniform one.
4. Use some kind of optimizer to predict the position separately for each emitter, under the constraint that both predictions must be the same point. If the predictions differ, the optimizer tries to move them until they coincide. Maybe reinforcement learning, where the actions are "move left", "move right", etc.
I've shared my ideas in the hope that they evoke some of yours; this is my best so far, but it doesn't solve the issue elegantly yet.
So ideally, I would want an end-to-end model that is fed the 2 readings and gives me the position even when the emitters have been moved. How would I go about that?
PS. The emitters are only allowed to move before usage; during usage/prediction, the model can assume that the emitters will not move anymore. This gives you time to run an emitter-position calibration algorithm before usage. Maybe this is a helpful thing for you to know.
You're confusing memoizing a function with training a model; the former merely recalls previous results, while the latter is the province of AI. To train with two emitters, you need to give the model useful input data and appropriate labels (right answers), and design your model topology such that it can be trained to a useful functional response for cases it has never seen.
Let the first emitter be at position 0 by definition. Your data then consists of the position of the second emitter and the two readings; the label is the sensor's position. Your given examples would look like this:

emit2   read1   read2   sensor
1       25      36      0
1       25      16      5
2       25      49      0
1.5     25      9       5       (a distance of 3 < d < 4 always reads as 3^2)
Since you know that you have a squared relationship in the underlying physics, you need to include quadratic capability in your model. To handle noise, you'll want some dampening capability, such as an extra node or two in a hidden layer after the first. For more complex relationships, you'll need other topologies, non-linear activation functions, etc.
Can you take it from there?
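To make the "include the emitter position as an input" point concrete, here is a minimal numpy sketch (not a full neural model, and assuming idealized noiseless squared-distance readings): with emitter 1 fixed at 0 and emitter 2 at e, the identity x = (read1 - read2)/(2e) + e/2 means the sensor position is exactly linear in the engineered features read1/e, read2/e and e, so plain least squares generalizes to emitter spacings never seen during training:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic training data: sensor at x, second emitter at e (first emitter at 0)
x = rng.uniform(0, 10, 500)
e = rng.uniform(1, 5, 500)           # emitter spacing varies across examples
read1 = x ** 2                       # idealized squared-distance readings
read2 = (x - e) ** 2

# engineered features in which the position is exactly linear:
# x = 0.5*read1/e - 0.5*read2/e + 0.5*e
features = np.column_stack([read1 / e, read2 / e, e])
coef, *_ = np.linalg.lstsq(features, x, rcond=None)

# predict for an emitter spacing the model has never seen
x_new, e_new = 7.0, 3.3
pred = np.array([x_new**2 / e_new, (x_new - e_new)**2 / e_new, e_new]) @ coef
print(pred)  # recovers 7.0
```

With noise, dead zones, and the real emitter response, the least-squares fit would become a small network, but the lesson carries over: feed the model features that encode the geometry, not just the raw reading pair.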

Google Sheets - Incorrect result

I am confused with a Google Sheet I created.
https://docs.google.com/spreadsheets/d/1k0osuq_WFztRxNGcxXBhG5Hi6LSrHj8A5RCEwMnQZUs/edit?usp=sharing
These are bike times taken to complete 90 km, split into 5 km chunks.
The interesting thing is that I input the time for each 5 km chunk and calculate the speed in km/h from it. Then I calculate the total time taken using SUM and the average speed using AVERAGE. However, this is incorrect: for Zell am See 2017 I get an average speed of 31 km/h when it should be around 28 km/h.
I can't seem to find the error. Initially I thought it was due to rounding, but even if I change the data format to scientific, nothing changes.
It is an incorrect assumption about the mathematics. You cannot average the averages; the correct average speed is total distance over total time. The lower speeds pull the average down more than the higher speeds pull it up, because you spend more time at them. You might want to Google "harmonic mean" for more.
For example, suppose you go 120 km at 40 km/h and then ride back at 30 km/h. You have traveled 240 km in 7 hours (3 h out + 4 h back), so your average speed is 240/7 ≈ 34.3 km/h, under 35 km/h.
EDIT: Total distance over total time is the way to go. But if you want to satisfy yourself that it is the harmonic mean you want, add a column F to the right of your speeds: in F3 enter =1/E3 and drag that down through F20; in F21 enter =1/AVERAGE(F3:F20), and behold, you have the harmonic mean, which is the desired answer.
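The same check can be done in Python: for equal distances ridden at each speed, the standard library's statistics.harmonic_mean gives the correct average speed, while the plain mean overestimates it:

```python
from statistics import mean, harmonic_mean

speeds = [40, 30]  # km/h, with 120 km travelled at each speed

print(mean(speeds))           # 35 -- the naive "average of averages"
print(harmonic_mean(speeds))  # 240/7 ≈ 34.29 km/h, i.e. 240 km in 7 h
```

Note that the harmonic mean is only the right formula when the chunks cover equal distances, which is exactly the 5 km-chunk setup in the question.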

Is there a way to summarize the features of many time series?

I'm actually trying to detect characteristics of the time series for a very big region composed of many smaller subregions (in my case pixels). I don't know much about this, so the only way I can come up with is an averaged time series for the entire region, although I know this would definitely conceal many features by averaging.
I'm just wondering if there are any widely used techniques that can detect the common features of a suite of time series, like pattern recognition or time series classification?
Any ideas/suggestions are much appreciated!
Thanks!
Some extra explanations: I'm dealing with remote sensing images spanning several years with a time step of 7 days. So for each pixel there is an associated time series, with values extracted from that pixel on different dates. So if I define a region consisting of many pixels, is there a way to detect or extract some common features characterizing all or most of the time series of the pixels within this region? Such as the shape of the time series, or a date around which there's an obvious increase in the values?
You could compute the correlation matrix for the pixels. This would simply be:
# data has shape (npix, ntime): one time series per pixel
corr = np.zeros((npix, npix))
for i in range(npix):
    for j in range(npix):
        corr[i, j] = np.sum(data[i] * data[j]) / np.sqrt(np.sum(data[i] ** 2) * np.sum(data[j] ** 2))
If you want more information, you can compute this as a function of time, i.e. divide your time series into blocks (say minutes) and compute the correlation for each of them. Then you can see how the correlation changes over time.
If the correlation changes a lot, you may be more interested in the cross-power spectrum of the pixels. This is defined as
cpow[i, j, :] = np.fft.fft(data[i]) * np.conj(np.fft.fft(data[j]))
This will tell you how much pixel i and j tend to change together on various time-scales. For example, they could be moving in unison in time-scales of a second (1 Hz), but also have changes on a time-scale of, say, 10 seconds which are not correlated with each other.
It all depends on what you need, really.
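As a runnable illustration of the cross-power idea (with made-up signals, not real remote-sensing data): two pixels sharing a 5-cycle oscillation plus independent noise produce a cross-power peak at that frequency bin:

```python
import numpy as np

n = 128
t = np.arange(n)
shared = np.sin(2 * np.pi * 5 * t / n)         # common 5-cycle component

rng = np.random.default_rng(1)
pix_i = shared + 0.1 * rng.standard_normal(n)  # two pixels, independent noise
pix_j = shared + 0.1 * rng.standard_normal(n)

# cross-power spectrum: fft(i) * conj(fft(j))
cpow = np.fft.fft(pix_i) * np.conj(np.fft.fft(pix_j))

# the dominant shared time-scale shows up as the peak bin (ignoring DC)
peak = np.argmax(np.abs(cpow[1 : n // 2])) + 1
print(peak)  # 5
```

For the 7-day-step imagery in the question, the frequency axis would be in cycles per year rather than per sample, but the interpretation is the same: large |cpow| at a bin means the two pixels vary together on that time-scale.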

How to compute histograms using weka

Given a dataset with 23 points spread out over 6 dimensions, in the first part of this exercise we should do the following, and I am stuck on the second half of this:
Compute the first step of the CLIQUE algorithm (detection of all dense cells). Use
three equal intervals per dimension in the domain 0..100, and consider a cell as dense if it contains at least five objects.
Now this is trivial and simply a matter of counting. The next part asks the following though:
Identify a way to compute the above CLIQUE result by only using the functions of
Weka provided in the tabs of Preprocess, Classify , Cluster , or Associate .
Hint : Just two tabs are needed.
I've been trying this for over an hour now, but I can't seem to get anywhere near a solution. If anyone has a hint, or maybe a useful tutorial that gives me a little more insight into Weka, it would be very much appreciated!
I am assuming you have 23 instances (rows) and 6 attributes (dimensions).
Use three equal intervals per dimension
Use the Preprocess tab to discretize your data into 3 equal bins (see the image or the command line below); the 3 bins are your intervals. You may also try switching useEqualFrequency between false and true; I think true may give better results.
weka.filters.unsupervised.attribute.Discretize -B 3 -M -1.0 -R first-last
After that, cluster your data; this will show you which instances are near each other. Since you would like to find dense cells, I think SOM (self-organizing maps) may be appropriate.
a cell as dense if it contains at least five objects.
You have 23 instances, so try 2x2=4 cluster centers, then 2x3=6, 2x4=8, and 3x3=9. If your data points are close together, some of the cluster centers should always hold at least 5 instances, no matter how many cluster centers you choose.
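For reference, the dense-cell step itself (which the question calls trivial counting) can be sketched outside Weka in a few lines of Python; the bin edges and threshold follow the exercise (3 equal intervals over 0..100, dense = at least 5 objects), and the toy data at the end is made up:

```python
import numpy as np
from collections import Counter

def dense_cells(data, n_bins=3, lo=0.0, hi=100.0, min_pts=5):
    """First CLIQUE step, per dimension: discretize each dimension into
    n_bins equal intervals and keep the cells holding >= min_pts objects."""
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]   # interior cut points
    dense = {}
    for dim in range(data.shape[1]):
        counts = Counter(np.digitize(data[:, dim], edges))
        dense[dim] = sorted(cell for cell, cnt in counts.items() if cnt >= min_pts)
    return dense

# toy example: 23 points, 6 dimensions, all values in the first interval
data = np.full((23, 6), 10.0)
print(dense_cells(data))  # every dimension reports cell 0 as dense
```

This mirrors what the Discretize filter does in the Preprocess tab, with the counting that Weka's clustering tab only approximates.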
