Keeping min/max value in a BigTable cell - google-cloud-dataflow

I have a problem where it would be very helpful if I was able to send a ReadModifyWrite request to BigTable where it only overwrites the value if the new value is bigger/smaller than the existing value. Is this somehow possible?
Note: I thought of a hacky way where I use the timestamp as my actual value, and have the max number of versions 1, so that would keep the "latest" value which is the higher timestamp. But those timestamps would have values from 1 to 10 instead of 1.5bn. Would this work?
I looked into the existing APIs but haven't found anything that would help me do this. It seems like it is available in DynamoDB, so I guess it's reasonable to ask for BigTable to have it as well https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_UpdateItem.html#API_UpdateItem_RequestSyntax

Your timestamp approach could probably be made to work, but would interact poorly with stuff like age-based garbage collection.
I also assume you mean CheckAndMutate as opposed to ReadModifyWrite? The former lets you do conditional overwrites, the latter lets you do unconditional increments/appends. If you actually want an increment that only works if the result will be larger, just make sure you only send positive increments ;)
My suggestion, assuming your client language supports it, would be to use a CheckAndMutateRow request with a value_range_filter. This will require you to use a fixed-width encoding for your values, but that's no different than re-using the timestamp.
Example: if you want to set the value to 000768, but only if that would be an increase, use a value_range_filter from 000000 to 000767, inclusive, and do your write in the true_mutation of the CheckAndMutate.

Related

In InfluxDB/Telegraf How to compute difference between 2 fields based on 3rd field

I have the current use case:
We have a system that computes different response time metrics for messages that we want to insert in InfluxDB. This system writes JSON entries to a file.
We use telegraf with JSON plugin to extract the fields we want and insert into InfluxDB.
So far so good.
But we have an issue with 1 particular information.
The system will emit messages where mId is the Unique identifier, in the below examples we have 2 uuidXXXX and uuidYYYY:
{“meta1”:“value”, “mId”:“uuidXXXX”, “resTime1”:1232332233, “timeWeEnterBus”:startTimestamp}
{“meta1”:“value2”, “mId”:“uuidYYYY”, “resTime1”:1232331111, “timeWeEnterBus”:startTimestamp}
{“meta1”:“value”, “mId”:“uuidXXXX”, “resTime1”:1232332233, “timeWeExitBus”:endTimestamp}
{“meta1”:“value2”, “mId”:“uuidYYYY”, “resTime1”:1232331111, “timeWeEnterBus”:startTimestamp}
And what we want here is to graph the timeInBus which is equal to “timeWeExitBus-timeWeEnterBus” for each unique mId.
So my questions are:
IMU, uuid would be a field not a tag as it is unlimited, same for timeWeExitBus and timeWeEnterBus which would be numeric fields since we want to use functions on them. And timeInBus would be the measurement. Am I right ?
Is this use case a good one for Influx / Telegraf or are we misusing it for this ? IMU, it doesn’t look like a good use case to try to compute this on telegraf side, but I don’t see how to do it in InfluxDB, I initially thought ELAPSED function could help but I end up thinking it doesn’t work here
If it’s a good use case, could you point me to documentation helping implementing this ?

ELKI: Normalization undo for result

I am using the ELKI MiniGUI to run LOF. I have found out how to normalize the data before running by -dbc.filter, but I would like to look at the original data records and not the normalized ones in the output.
It seems that there is some flag called -normUndo, which can be set if using the command-line, but I cannot figure out how to use it in the MiniGUI.
This functionality used to exist in ELKI, but has effectively been removed (for now).
only a few normalizations ever supported this, most would fail.
there is no longer a well defined "end" with the visualization. Some users will want to visualize the normalized data, others not.
it requires carrying over normalization information along, which makes data structures more complex (albeit the hierarchical approach we have now would allow this again)
due to numerical imprecision of floating point math, you would frequently not get out the exact same values as you put in
keeping the original data in memory may be too expensive for some use cases, so we would need to add another parameter "keep non-normalized data"; furthermore you would need to choose which (normalized or non-normalized) to use for analysis, and which for visualization. This would not be hard with a full-blown GUI, but you are looking at a command line interface. (This is easy to do with Java, too...)
We would of course appreciate patches that contribute such functionality to ELKI.
The easiest way is this: Add a (non-numerical) label column, and you can identify the original objects, in your original data, by this label.

Redis optimal hash set entry size

I have some questions regarding the optimal entry size setting for Redis hash sets.
In this example memory-optimization they use 100 hash entries
per key but use hash-max-zipmap-entries 256 ? Why not
hash-max-zipmap-entries 100 or 128?
On the redis website (above link) they used max hash entry size of
100, but in this post instagram, they mention 1000 entries. So
does this mean the optimal setting is a function of the product of
hash-max-zipmap-entries & hash-max-zipmap-value ?(ie in this case
Instagram has smaller hash-values than memory optimization example?)
Your comments/clarifications are much appreciated.
The key is, from here:
manipulating the compact versions of these [ziplist] structures can become slow as they grow longer
and
[as ziplists grow longer] fetching/updating individual fields of a HASH, Redis will have to decode many individual entries, and CPU caches won’t be as effective
So to your questions
This page just shows an example and I doubt the author gave much thought to the exact values. In real life, IF you wanted to take advantage of ziplists, and you knew your number of entries per hash was <100, then setting it at 100, 128 or 256 would make no difference. hash-max-zipmap-entries is only the LIMIT over which you're telling Redis to change the encoding from ziplist to hash.
There may be some truth in your "product of hash-max-zipmap-entries & hash-max-zipmap-value" idea, but I'm speculating. More importantly, first you have to define "optimal" based on what you want to do. If you want to do lots of HSET/HGETs in a large ziplist, it will be slower than if you used a hash. But if you never get/update single fields only ever do HMSET/HGETALL on a key, large ziplists wouldn't slow you down. The Instagram 1000 was THEIR optimal number based on THEIR specific data, use cases, and Redis function call frequencies.
You encouraged me to read both links and it seems that you are asking for "default value for hash table size".
I don't think that it's possible to say that one number is universal for all possibilities. The described mechanism is similar to standard hash mapping. Look at http://en.wikipedia.org/wiki/Hash_table
If you have small size of hash-table, it means that many various hash values point into the same array, where the equals method is used to find out the item.
On the other hand, large hash table means that it allocates large memory along with many empty fields. But this scales well as the algorithm uses O(1) big O notation and there is no equals searching for the item.
In general the size of the table IMHO depends on the overall count of all elements you expect to put into the table and it also depends on the diversity of the key. I mean if every hash start with "0001" not even size=100000 would help you.

Is there any point to using Any() linq expression for optimisation purposes?

I have a MVC application which returns 2 types of Json responses from 2 controller methods; AnyRemindersExist() and GetAllUserReminders(). The first returns a boolean, 2nd returns an array, both wrapped as Json.
I have a JavaScript timer checking for calendar reminders against a user. It makes the first call (AnyRemindersExist) to check whether reminders exist and whether the client should then make the 2nd call.
For example, if the result of the Json response is false from the Any() query, it doesn't then make the 2nd controller action which makes a LINQ select call. If there are reminders that exist, it then goes further and then requests them (making use of the LINQ SELECT).
Imagine a system ramped up where 100-1000s users use the system and on the client, every 30-60 seconds a request comes in to load in the reminders. Does this Any() call help in anyway in reducing load on the server?
If you're always going to get the actual values afterwards, then no - it would make more sense to have fewer requests, and just always give the full results. I very much doubt that returning no results is slower than returning an indication that there are no results.
EDIT: tvanfosson's latest comment (at the time of this writing) is worth promoting:
You can really only tell by measuring and I'd only resort to it IFF the performance of the select only option didn't meet the requirements.
That's the most important thing about performance: the value of a guess is much less than the value of test data.
I would say that it depends on how the underlying queries are translated. If the any call is translated into an indexed lookup when the select (perhaps due to a join to get related data) must do some sort of table scan, then it will save some work in the case when there are no reminders to be found. It will cause a little extra work when there are reminders. It might be useful if the majority of the calls don't result in any results.
In the general case, though, I would just select the data and only try to optimize IF that turns out to not be fast enough. The conditions under which it will actually save effort on the server are pretty narrow and might only apply if you hand-craft the SQL rather than depend on your ORM.
Any only checks to see if there is at least one item in the Collection that is being returned. Versus using something like Count > 0 which counts the total amount of items in the collection then yes this is more optimal.
If your AnyRemindersExist method is operating on a similar principle then not calling a second call to the server would reduce your load.
So you are asking if not doing work the application doesn't need to do would reduce the workload on the server?
Of course. How would this answer every be "yes, doing extra work for no reason won't effect the server load".
It ultimately depends on how much faster the Any check is compared to getting the results and how often it will be false.
If the Any call takes near as long as the select then it pretty
much never makes sense.
If the Any call is much faster than the select but 90% of the
time it's true, then it probably isn't worth it (best case you
get 10% improvement, worst case it's actually more work).
If the Any call is much faster than the select and 90% of the
time it's false, then it probably makes sense to check if there
are any before actually getting results.
So the answer is it depends on your specific scenario. Ultimately you're going to need to measure both the relative performance (on different typical loads, maybe some queries are more intensive than others) as well as the frequency that there are no results to return.
Actually it should almost never make sense to check Any in this case.
If Any returns false then you don't need to grab the results.
However this means it would have returned no results anyway, so
unless your Any check is significantly faster than a select
returning 0 results, there's no added benefit here.
On the other hand, if Any returns true, then you'll need to get the
results anyway, so in this case Any is purely additional work done.

Should I always return IQueryable<> instead of IList<>?

I came across this post while I was looking for things to improve performance. Currently, in my application we are returning IList<> all over the place. Is it a good idea to change all of these returns to AsQueryable() ?
Here is what I found -
AsQueryable() - Context needs to be
open and you cannot control the
lifetime of the database context
it need to be disposed properly. Also
it is deferred execution('faster
filtering' as compared to Lists)
IList<> - This should be preferred
over List<> as it provides a barebone
and lightweight implementation.
Also when should be one preferred over another ? I know the basics but I am sorry I am still not clear when and how should we use them correctly in an application. It would be great to know this as the next time I would try to keep it in mind before returning anything..Thanks a lot.
Basically, you should try to reference the widest type you need. For example, if some variable is declared as List<...>, you put a constraint for the type of the values that can be assigned to it. It may happen that you need only sequential access, so it would be enough to declare the variable as IEnumerable<...> instead. That will allow you to assign the values of other types to the variable, as well as the results of LINQ operations.
If you see that your variable needs access by index, you can again declare it as IList<...> and not just List<...>, allowing other types implementing IList<...> be assigned to it.
For the function return types, it depends upon you. If you think it's important that the function returns exactly List<...>, you declare it to return exactly List<...>. If the only important thing is access to the result by index, perhaps you don't need to constrain yourself to return exactly List<...>, you may declare return type as IList<...> (but return actually an instance of List<...> in this implementation, and possibly of some another type supporting IList<...> later). Again, if you see that the only important thing about the return value of your function is that it can be enumerated (and the access by index is not needed), you should change the function return type to IEnumerable<...>, giving yourself more freedom.
Now, about AsQueriable, again it depends on your logic. If you think that possible delayed evaluation is a good thing in your case, as it may aid to avoid the unneeded calculations, or you intend to use it as a part of some another query, you use it. If you think that the results have to be "materialized", i.e., calculated at this very moment, you would better return a List<...>. You would especially need to materialize your result if the calculation later may result in a different list!
With the database a good rule of thumb is to use AsQueriable for the short-term intermediate results, but List for the "final" results which will be used within some longer time. Of course having a non-materialized query hanging around makes closing the database impossible (since at the moment of actual evaluation of the the database should be still open).
if you do not intend to do any further queries over sql server then you should return IList because it produces in-memory data
If you are concerned about performance you should also try to run your queries on as few DB requests as possible and cache the most used queries. It is very common to reduce significantly the request process time using batch approachs.
Which ORM do you use to retrieve data from DB? If you use NHibernate, see this post about how to use Future, Multi Criteria 1, Multi Criteria 2 and Multi Query.
Greetings.

Resources