I've been using Mathemathica in order to collect wind speed data in a certain location. Mathematica gives me a TimeSeries from which I would like to know what is the maximum value and the minimum one. I've used "FindPeaks" function, but as it is not a regularly sampled time series the function does not work. Is there a way to get the maximum value?
The following is the data:
data = WindSpeedData[{19.416258, -99.719266}, {DateObject[{2016, 1,
1}], DateObject[{2017, 1, 2}]}]
This
Normal[data]
will undo part of the layer of Timeseries that your actual wind speeds are wrapped up inside of.
In that you can see that what you probably want is part of the second item in each list, that appears to be the actual speed. So
Max[Map[#[[2, 1]] &, Normal[data]]]
will look at the beginning of the second part, which is your speed, and then try to find the maximum value.
For your example data that shows Max[29.0802, "NotAvailable"] and that 29.0802 is the maximum wind speed.
That trailing "NotAvailable" is still some part of their wrapping the actual data up inside layers. It is possible that there is one missing wind speed in there and Max doesn't know what to make of that.
Thanks #Bill,I used
Max[data]
It threw me ["Not available", 46.8 km/h]
I just used your code and gave me the same result but in different order [46.8 km/h, "Not available"]. I double checked the datum (46.8) by displaying the data in a grid. Thank you very much, I will use the normal function further on.
Related
I have some features that are zero-centered values and supposed to represent change between a current value and previous value. Generally speaking i believe there should be some symmetry between these values. Ie. there should be roughly the same amount of positive values as negative values and roughly these values should operate on the same scale.
When i try to scale my samples using MaxAbsScaler, i notice that my negative values for this feature get almost completely drowned out by the positive values. And i don't really have any reason to believe my positive values should be that much larger than my negative values.
So what i've noticed is that fundamentally, the magnitude of percentage change values are not symmetrical in scale. For example if i have a value that goes from 50 to 200, that would result in a 300.0% change. If i have a value that goes from 200 to 50 that would result in a -75.0% change. I get there is a reason for this, but in terms of my feature, i don't see a reason why a change of 50 to 100 should be 3x+ more "important" than the same change in value but the opposite direction.
Given this information, i do not believe there would be any reason to want my model to treat a change of 200-50 as a "lesser" change than a change of 50-200. Since i am trying to represent the change of a value over time, i want to abstract this pattern so that my model can "visualize" the change of a value over time that same way a person would.
Right now i am solving this by using this formula
if curr > prev:
return curr / prev - 1
else:
return (prev / curr - 1) * -1
And this does seem to treat changes in value, similarly regardless of the direction. Ie from the example of above 50>200 = 300, 200>50 = -300. Is there a reason why i shouldn't be doing this? Does this accomplish my goal? Has anyone ran into similar dilemmas?
This is a discussion question and it's difficult to know the right answer to it without knowing the physical relevance of your feature. You are calculating a percentage change, and a percent change is dependent on the original value. I am not a big fan of a custom formula only to make percent change symmetric since it adds a layer of complexity when it is unnecessary in my opinion.
If you want change to be symmetric, you can try direct difference or factor change. There's nothing to suggest that difference or factor change are less correct than percent change. So, depending on the physical relevance of your feature, each of the following symmetric measures would be correct ways to measure change -
Difference change -> 50 to 200 yields 150, 200 to 50 yields -150
Factor change with logarithm -> 50 to 200 yields log(4), 200 to 50 yields log(1/4) = -log(4)
You're having trouble because you haven't brought the abstract questions into your paradigm.
"... my model can "visualize" ... same way a person would."
In this paradigm, you need a metric for "same way". There is no such empirical standard. You've dropped both of the simple standards -- relative error and absolute error -- and you posit some inherently "normal" standard that doesn't exist.
Yes, we run into these dilemmas: choosing a success metric. You've chosen a classic example from "How To Lie With Statistics"; depending on the choice of starting and finishing proportions and the error metric, you can "prove" all sorts of things.
This brings us to your central question:
Does this accomplish my goal?
We don't know. First of all, you haven't given us your actual goal. Rather, you've given us an indefinite description and a single example of two data points. Second, you're asking the wrong entity. Make your changes, run the model on your data set, and examine the properties of the resulting predictions. Do those properties satisfy your desired end result?
For instance, given your posted data points, (200, 50) and (50, 200), how would other examples fit in, such as (1, 4), (1000, 10), etc.? If you're simply training on the proportion of change over the full range of values involved in that transaction, your proposal is just what you need: use the higher value as the basis. Since you didn't post any representative data, we have no idea what sort of distribution you have.
I might not get something. How can I plot a raw time series with Timelion without applying any further aggregation? Just the raw data of a field over time that I have in an index. Of course I select the proper time window for the data.
I was trying to achieve the same thing, but didn't fully get what I wanted, but maybe these steps will help you.
My data was on by minute basis, so I don't want any more frequent fragmentation. Selecting interval = 1m helps only for short periods of time, but adding "interval=1m" into .es() block works on long periods, too.
To have lines not to return to 0 in between points, use .es().fit(carry)
.es().scale_interval(1m).fit(scale) - this is my chart to return to 0 if there were no data for certain period rather than carrying the line on the same level.
.es(metric=max:value_field) helps not to sum up the values, but show the max of the aggregated set.
My charts are still weirdly aggregated, but maybe it'll help someone.
Useful links:
Sparse time series in timelion
https://www.elastic.co/blog/sparse-timeseries-and-timelion
Scaling issue 1
https://discuss.elastic.co/t/diferent-value-on-y-axis-depending-on-time-interval/67785
Scaling issue 2
https://discuss.elastic.co/t/timelion-giving-wrong-metric-aggregate-value-on-enlarging/132789
Scaling issue 3
https://discuss.elastic.co/t/re-timelion-giving-wrong-metric-aggregate-value-on-enlarging/132925
I'm building an app using database.
I have a words table and everytime user types something, this app will record and update word the database.
And the frequency field will be auto increase after user enter one matched word.
But the trouble is user type day by day and i afraid the search performance will be reduce after times and also the Int field will reach to the limit (max limit Int) someday.
So, i limit the database to around less than 50.000 records.
I delete less-used records after a certain time.
But i don't know how to deal with frequency Int field of each word?
How to know exactly frequency usage of each word without increasing the field forever?
I recommend that you use a logarithmic scale for the frequency values. That's what is often done in situations like this. See Wikipedia to learn about logarithmic scales.
For example, if you have a word MAN that has a frequency of 15, the value you store in the database would be log(15) ~= 1.17609125906.
If you then find 4 new occurrences of MAN, then you want to add 4 to the field. You cannot add the log values directly because log(x)+log(y)=log(x*y). (See the Logarithm Rules section of this article for more information on log rules.)
Instead -- assuming you use a base 10 logarithm, you would use this formula:
SET frequency = log(10^frequency+4)
Depending on the length of your words, the few bytes for the frequency don't matter. With an unsigned four bytes integer, you can count up to more than two billion, which is way above the number of words what the user can type in in their whole lifespan.
So may want to go for two or three bytes, but the savings may be negligible.
Anyway, there are the following approaches for preventing overflow:
You can detect it, and then undo the operations, scale everything down by some factor of two, and then redo.
You can periodically check all your numbers and do the scaling when approaching the limit.
You can do a probabilistic update like below.
Probabilistic update
Instead of simply incrementing the frequency every time by one, you do it only with a probability which gets lower and lower as the counter grows. For example, you can do the increment with a probability of 1.0 / (oldValue + 1) or 2 ** -oldValue. The latter leads to a logarithmic growth, but, unlike the idea in the other answer, it works.
There are obviously some disadvantages due to the randomness and precision loss, but when all you care about is the relative frequency, it should be good enough.
I have a 2 part question regarding downsampling on OpenTSDB.
The first is I was wondering if anyone knows whether OpenTSDB takes the last end point inclusive or exclusive when it calculates downsampling, or does it count the end data point twice?
For example, if my time interval is 12:30pm-1:30pm and I get DPs every 5 min starting at 12:29:44pm and my downsample interval is summing every 10 minute block, does the system take the DPs from 12:30-12:39 and summing them, 12:40-12:49 and sum them, etc or does it take the DPs from 12:30-12:40, then from 12:40-12:50, etc. Yes, I know my data is off by 15 sec but I don't control that.
I've tried to calculate it by hand but the data I have isn't helping me. The numbers I'm calculating aren't adding up to the above, nor is it matching what the graph is showing. I don't have access to the system that's pushing numbers into OpenTSDB so I can't setup dummy data to check.
The second question is how does downsampling plot its points on the graph from my time range and downsample interval? I set downsample to sum 10 min blocks. I set my range to be 12:30pm-1:30pm. The graph shows the first point of the downsampled graph to start at 12:35pm. That makes logical sense.I change the range to be 12:24pm-1:29pm and expected the first point to start at 12:30 but the first point shown is 12:25pm.
Hopefully someone can answer these questions for me. In the meantime, I'll continue trying to find some data in my system that helps show/prove how downsampling should work.
Thanks in advance for your help.
Downsampling isn't currently working the way you expect, although since this is a reasonable and commonly made expectations, we are thinking of changing this in a later release of OpenTSDB.
You're assuming that if you ask for a "10 min sum", the data points will be summed up within each "round" (or "aligned") 10 minute block (e.g. 12:30-12:39 then 12:40-12:49 in your example), but that's not what happens. What happens is that the code will start a 10-minute block from whichever data point is the first one it finds. So if the first one is at time 12:29:44, then the code will sum all subsequent data points until 600 seconds later, meaning until 12:39:44.
Within each 600 second block, there may be a varying number of data points. Some blocks may have more data points than others. Some blocks may have unevenly spaced data points, e.g. maybe all the data points are within one second of each other at the beginning of the 600s block. So in order to decide what timestamp will result from the downsampling operation, the code uses the average timestamp of all the data points of the block.
So if all your data points are evenly spaced throughout your 600s block, the average timestamp will fall somewhere in the middle of the block. But if you have, say, all the data points are within one second of each other at the beginning of the 600s block, then the timestamp returned will reflect that by virtue of being an average. Just to be clear, the code takes an average of the timestamps regardless of what downsampling function you picked (sum, min, max, average, etc.).
If you want to experiment quickly with OpenTSDB without writing to your production system, consider setting up a single-node OpenTSDB instance. It's very easy to do as is shown in the getting started guide.
In mahout, I'm setting up a GenericUserBasedRecommender, pretty straight forward for now, typical settings.
In generating a "preference" value for an item, we have the following 5 data points:
Positive interest
User converted on item (highest possible sign of interest)
Normal like (user expressed interest, e.g. like buttons)
Indirect expression of interest (clicks, cursor movements, measuring "eyeballs")
Negative interest
Indifference (items the user ignored when active on other items, a vague expression of disinterest)
Active dislike (thumbs down, remove item from my view, etc)
Over what range I should express these different attributes, let's use a 1-100 scale for discussion?
Should I be keeping the 'Active dislike' and 'Indifference' clustered close together, for example, at 1 and 5 respectively, with all the likes clustered in the 90-100 range?
Should 'Indifference' and 'Indirect expressions of interest' by closer to the center? As in 'Indifference' in the 20-35 range and 'Indirect like' in the 60-70 range?
Should 'User conversion' blow the scale away and be heads and tails higher than the others? As in: 'User Conversion' # 100, 'Lesser likes' # ~65, 'Dislikes' clustered in the 1-10 range?
On the scale of 1-100 is 50 effectively "null", or equivalent to no data point at all?
I know the final answer lies in trial and error and in the meaning of our data, but as far as the algorithm goes, I'm trying to understand at what point I need to tip the scales between interest and disinterest for the algorithm to function properly.
The actual range does not matter, not for this implementation. 1-100 is OK, 0-1 is OK, etc. The relative values are all that really matters here.
These values are estimated by a simple (linearly) weighted average. Therefore the response ought to be "linear". It ought to match an intuition that if action X gets a score 2x higher than action Y, then X should be an indicator of twice as much interest in real life.
A decent place to start is to simply size them relative to their frequency. If click-to-conversion rate is 2%, you might make a click worth 2% of a conversion.
I would ignore the "Indifference" signal you propose. It is likely going to be too noisy to be of use.