Is there any point to using the Any() LINQ expression for optimisation purposes?

I have an MVC application which returns 2 types of Json responses from 2 controller methods: AnyRemindersExist() and GetAllUserReminders(). The first returns a boolean, the 2nd returns an array, both wrapped as Json.
I have a JavaScript timer checking for calendar reminders against a user. It makes the first call (AnyRemindersExist) to check whether reminders exist and whether the client should then make the 2nd call.
For example, if the Json response from the Any() query is false, the client doesn't call the 2nd controller action, which makes a LINQ select call. If reminders do exist, it goes further and requests them (making use of the LINQ select).
Imagine a system ramped up to where hundreds or thousands of users use it and, on each client, a request comes in every 30-60 seconds to load the reminders. Does this Any() call help in any way in reducing load on the server?

If you're always going to get the actual values afterwards, then no - it would make more sense to have fewer requests, and just always give the full results. I very much doubt that returning no results is slower than returning an indication that there are no results.
EDIT: tvanfosson's latest comment (at the time of this writing) is worth promoting:
You can really only tell by measuring and I'd only resort to it IFF the performance of the select only option didn't meet the requirements.
That's the most important thing about performance: the value of a guess is much less than the value of test data.

I would say that it depends on how the underlying queries are translated. If the any call is translated into an indexed lookup when the select (perhaps due to a join to get related data) must do some sort of table scan, then it will save some work in the case when there are no reminders to be found. It will cause a little extra work when there are reminders. It might be useful if the majority of the calls don't result in any results.
In the general case, though, I would just select the data and only try to optimize IF that turns out to not be fast enough. The conditions under which it will actually save effort on the server are pretty narrow and might only apply if you hand-craft the SQL rather than depend on your ORM.

Any only checks whether there is at least one item in the collection being returned. Compared with something like Count > 0, which counts every item in the collection, then yes, Any is more efficient.
If your AnyRemindersExist method operates on a similar principle, then skipping the second call to the server would reduce your load.
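As a rough illustration of that short-circuit difference (a Python sketch rather than LINQ; `reminders` and the `pulled_*` lists are hypothetical names introduced here), an existence check can stop at the first element, while a count has to consume the whole sequence:

```python
def reminders(pulled, total=1000):
    """Yield hypothetical reminder ids, recording each one that is produced."""
    for i in range(total):
        pulled.append(i)
        yield i

# Existence check: stops as soon as one element is seen.
pulled_by_any = []
exists = any(True for _ in reminders(pulled_by_any))

# Count-based check: walks every element before comparing with zero.
pulled_by_count = []
has_some = sum(1 for _ in reminders(pulled_by_count)) > 0

print(exists, len(pulled_by_any))      # True 1
print(has_some, len(pulled_by_count))  # True 1000
```

Whether the same gap shows up against a database depends on how the query provider translates each call, so this only models the in-memory case.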

So you are asking if not doing work the application doesn't need to do would reduce the workload on the server?
Of course. How could the answer ever be "yes, doing extra work for no reason won't affect the server load"?

It ultimately depends on how much faster the Any check is compared to getting the results and how often it will be false.
If the Any call takes near as long as the select then it pretty much never makes sense.
If the Any call is much faster than the select but 90% of the time it's true, then it probably isn't worth it (best case you get 10% improvement, worst case it's actually more work).
If the Any call is much faster than the select and 90% of the time it's false, then it probably makes sense to check if there are any before actually getting results.
So the answer is it depends on your specific scenario. Ultimately you're going to need to measure both the relative performance (on different typical loads, maybe some queries are more intensive than others) as well as the frequency that there are no results to return.
Actually it should almost never make sense to check Any in this case.
If Any returns false then you don't need to grab the results. However this means it would have returned no results anyway, so unless your Any check is significantly faster than a select returning 0 results, there's no added benefit here.
On the other hand, if Any returns true, then you'll need to get the results anyway, so in this case Any is purely additional work done.
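The trade-off described above can be written down as a small expected-cost model. This is only a sketch: the function names and the cost figures are made up for illustration, the costs are in arbitrary units, and the extra HTTP round trip of the two-step approach is ignored.

```python
def expected_cost_two_step(p_empty, cost_any, cost_select):
    """Average cost per poll when Any runs first and the select only on a hit."""
    return cost_any + (1.0 - p_empty) * cost_select

def expected_cost_select_only(p_empty, cost_select, cost_empty_select):
    """Average cost per poll when the select always runs."""
    return p_empty * cost_empty_select + (1.0 - p_empty) * cost_select

def any_first_is_cheaper(p_empty, cost_any, cost_select, cost_empty_select):
    """True when checking Any first lowers the average cost per poll."""
    two_step = expected_cost_two_step(p_empty, cost_any, cost_select)
    single = expected_cost_select_only(p_empty, cost_select, cost_empty_select)
    return two_step < single

# Hypothetical numbers: 90% of polls find nothing.
# An empty select that is as cheap as Any: the extra check does not pay off.
print(any_first_is_cheaper(0.9, cost_any=1.0, cost_select=10.0, cost_empty_select=1.0))
# An empty select that is ten times the cost of Any: the check pays off.
print(any_first_is_cheaper(0.9, cost_any=0.5, cost_select=10.0, cost_empty_select=5.0))
```

Subtracting the two expressions shows that the pre-check only wins when `cost_any` is smaller than `p_empty * cost_empty_select`, which is why the answer hinges on how cheap a select returning zero rows already is. The real inputs have to come from measurement.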


Why is “group by” so much slower than my custom CombineFn or asList()?

Mistakenly, many months ago, I used the following logic to essentially implement the same functionality as the PCollectionView's asList() method:
I assigned a dummy key of “000” to every element in my collection
I then did a groupBy on this dummy key so that I would essentially get a list of all my elements in a single array list
The above logic worked fine when I only had about 200 elements in my collection. When I ran it on 5000 elements, it ran for hours until I finally killed it. Then, I created a custom “CombineFn” in which I essentially put all the elements into my own local hash table. That worked, and even on my 5000 element situation, it ran within about a minute or less. Similarly, I later learned that I could use the asList() method, and that too ran in less than a minute. However, what concerns me – and what I don't understand – is why the group by took so long to run (even with only 200 elements it would take more than a few seconds) and with 5000 it ran for hours without seeming to accomplish anything.
I took a look at the group by code, and it seems to be doing a lot of steps I don't quite understand… Is it somehow related to the fact that the group by statement is attempting to run things across a cluster? Or is it maybe related to using an inefficient coder? Or am I missing something else? The reason I ask is that there are certain situations in which I'm forced to use a group by statement, because the data set is way too large to fit in any single computer's RAM. However, I am concerned that I'm not understanding how to properly use a group by statement, since it seems to be so slow…
There are a couple of things that could be contributing to the slow performance. First, if you are using SerializableCoder, that is quite slow, and AvroCoder will likely work better for you. Second, if you are iterating through the Iterable in your ParDo after the GBK, if you have enough elements you will exceed the cache and end up fetching the same data many times. If you need to do that, explicitly putting the data in a container will help. Both the CombineFn and asList approaches do this for you.
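To make the second point concrete, here is a toy Python model (not the Dataflow SDK; `RefetchingIterable` is a hypothetical stand-in that re-reads its data on every pass, which is the worst case once the cache is exceeded). It compares nested iteration over the grouped values with copying them into a local container first:

```python
class RefetchingIterable:
    """Toy stand-in for grouped values that are re-read on every pass."""

    def __init__(self, values):
        self._values = list(values)
        self.fetches = 0

    def __iter__(self):
        for value in self._values:
            self.fetches += 1  # count each simulated fetch
            yield value

def pairwise_naive(grouped):
    # Nested iteration re-reads the whole group once per outer element.
    return sum(1 for a in grouped for b in grouped if a < b)

def pairwise_materialized(grouped):
    local = list(grouped)  # one pass, then everything is in local memory
    return sum(1 for a in local for b in local if a < b)

naive = RefetchingIterable(range(200))
copied = RefetchingIterable(range(200))
naive_pairs = pairwise_naive(naive)
copied_pairs = pairwise_materialized(copied)
print(naive_pairs == copied_pairs)    # same answer either way
print(naive.fetches, copied.fetches)  # 40200 200
```

With 200 elements the nested version performs 200 + 200 × 200 simulated fetches against 200 for the materialized one, and the gap grows quadratically with the group size.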

Memory usage increases with every record read

I have a couple of database management tasks that need to go through every record in the database. It was my understanding that with the CakePHP 3.x ORM, I could do something like this, and it would only ever have one record in memory at a time:
$records = TableRegistry::get('Whatever')->find();
foreach ($records as $record) {
    // do some processing
}
However, this is eventually crashing with an "out of memory" exception. I've added a bit of logging of memory_get_peak_usage, and it's increasing with every iteration, even if there is nothing other than the logging happening inside the foreach loop. The delta is around 12K every time through the loop.
I'm running 3.2.7, and results are similar whether I have debugging and/or SQL logging enabled or not. Adding frequent calls to gc_collect_cycles() only slows the process down, it doesn't help with the memory usage.
Is this expected, or a bug? If the former, is there anything I can do differently in this code to prevent it? (Obviously, I could process it in smaller batches, but that's not an elegant solution.)
The CakePHP 3.x ORM has built-in caching of results in the ResultSet object. When you iterate over the result set, the entities are stored in an internal array. This is done so that you can rewind the iterator and loop again.
If you are going to iterate over a large result set only once, and you want to reduce memory usage then you have to disable result buffering.
$records = TableRegistry::get('Whatever')->find()->bufferResults(false);
foreach ($records as $record) {
    // do some processing
}
With buffering turned off the entity is fetched from the result set and there should be no references to it afterwards.
Documentation for this feature is available in the CakePHP book, and the method is also covered in the API reference.
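The effect of buffering can be pictured with a toy model in Python. This is a sketch of the idea only, not CakePHP's implementation; `ToyResultSet`, `buffer_results` and `fake_rows` are hypothetical names.

```python
class ToyResultSet:
    """Hypothetical model of a result set that can buffer rows for replay."""

    def __init__(self, row_source, buffer_results=True):
        self._rows = iter(row_source)
        self._buffer_results = buffer_results
        self.buffer = []  # grows by one entry per row when buffering is on

    def __iter__(self):
        for row in self._rows:
            if self._buffer_results:
                self.buffer.append(row)  # kept so the set could be replayed
            yield row

def fake_rows(n):
    # Stand-in for rows streamed from a database cursor.
    return ({"id": i} for i in range(n))

buffered = ToyResultSet(fake_rows(10_000), buffer_results=True)
unbuffered = ToyResultSet(fake_rows(10_000), buffer_results=False)
for _ in buffered:
    pass
for _ in unbuffered:
    pass
print(len(buffered.buffer), len(unbuffered.buffer))  # 10000 0
```

In the buffered case every row stays referenced after the loop body has finished with it, which matches the steady per-iteration growth reported in the question.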
From my understanding this is the expected behaviour, as the query built with the ORM is executed when you start iterating over the object ($records). Thus all the data is loaded into memory, and you then iterate over each entry one by one.
If you want to limit the memory usage I would suggest you look into limit and offset. With these you can extract subsets to work on, thus limiting memory usage.
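A minimal sketch of the limit/offset approach, written in Python against an in-memory SQLite database so that it runs on its own. The table name `whatever`, its columns and the helper `iter_in_batches` are assumptions made for the example; the same loop shape would apply with the ORM's own limit and offset calls.

```python
import sqlite3

def iter_in_batches(conn, batch_size):
    """Yield rows from the hypothetical `whatever` table one batch at a time."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM whatever ORDER BY id LIMIT ? OFFSET ?",
            (batch_size, offset),
        ).fetchall()
        if not rows:
            break
        yield from rows  # only one batch is held in memory at a time
        offset += len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE whatever (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO whatever (payload) VALUES (?)",
    [(f"row {i}",) for i in range(25)],
)

processed = [row_id for row_id, _payload in iter_in_batches(conn, batch_size=10)]
print(len(processed))  # 25
```

The ORDER BY matters: without a stable ordering, consecutive pages are not guaranteed to partition the table cleanly.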

HHVM statically typing lookup tables and keeping them fully cached in RAM

I'm doing scientific research, processing through millions of combinations of multi-megabyte arrays.
For you to be capable of answering this question you will need to have knowledge/experience of all of the following
how HHVM is able to cache data structures in RAM between requests
how to tell HHVM data structures will be constant
how to declare array index and value types
I need to process the entire arrays, so it's a lot of data to be loaded and processed. (millions of requests within minutes on a LAN). The faster I can complete requests the quicker I can complete my work. If HHVM has to do work loading this data on each request, it accounts for a significant fraction of the time to complete the request (sometimes more than half, it depends on the complexity of the analysis I'm doing at the time).
I have found a method that has allowed me to keep these data structures cached in RAM (no loading from files, interpreting code, pushing to the array hundreds of thousands of times for no reason, no pointless repetitive unserialize etc), and thus I have eliminated this massive measurable delay.
I have 3 questions regarding how I can make this even faster:
Is the way I'm doing it now creating a global scope penalty?
How can I declare my arrays as constant and tell HHVM what data types to expect?
If I declare my arrays as constant is it even necessary to declare the types for HHVM?
Instead of using nested arrays, if I use 3 separate data structures ImmVector, PackedArray, or define a class would it be faster?
Keep in mind that anything that prevents HHVM from caching the data structure in RAM between requests should be regarded as unacceptable.
$data = [
    ["uuid (20 chars)", 5336, 7373],
    ["uuid (20 chars)", 5336, 7373],
    # more lines as above
];
Some of these files are many MB in size and there are a lot of them
function main() {
    require '/path/to/Lookuptable35543.php';
    # (Do stuff with $data)
}
This is working quite well. Main.php gets thousands of requests in a short period of time, and HHVM keeps Lookuptable.php's data structure in memory, avoiding pointless processing and IO; it just sits in RAM, ready for use. (I have more than enough RAM.)
Unfortunately, the only way I know to make HHVM hold the lookup table in RAM is to set $data in the global scope inside my lookup####.php file (then require the lookup file into a function in the data processing file, Main.php). This way HHVM doesn't bother loading the file or re-executing the code to create $data, because it can see that $data can be determined at compile time and will never change during runtime. This works, but I don't know if there is a penalty from having $data exist in the lookup###.php file's global scope. (Or maybe it's not global at all because it is required into main.php's function?)
What if I return $data from a function inside Lookup.php and call that function from Main.php like this
Would the HHVM JIT the result of getData() in RAM?
Somehow I associate functions with unpredictability... but maybe HHVM is clever enough to know that the function's result can be determined at compile time, and never changes?
I can't put the lookup table inside Main.php because I require different lookup tables based on the type of request.
Is there a way I can tell HHVM that my outer array will always have an integer index that never changes, and the values of the outer array will always be an array?
Perhaps I need to use ImmVector?
Then is there a way to tell HHVM that my inner array will always be a fixed length string followed by 2 integers, always, no extra elements, contents never changes?
I'd prefer not to use OO or create a class. How can I declare types, procedural style?
If a class is absolutely necessary can you please give example code suitable for my requirements above?
Will it be faster if I don't nest arrays?
I just realized I could have one array with integer index and values of fixed length string. Then a 2nd array with integer index and integer values, and a 3rd one with integer index and integer values.
If you're not familiar with this HHVM caching technique, please do not waste mutual time suggesting a database, redis, APC, unserialize, etc. The fastest is for HHVM to just keep my various $data variables in RAM. Even unserializing $data from a ramdisk file is slow, because then the entire data structure must be parsed as a string and converted into a data structure in memory for every request. APC has the same problem as far as I know. I don't want to even have to copy $data. The lookup tables are immutable, read only. They must just stay fully structured in RAM. My current caching solution (at the top of this question) has already given me huge gains, but as per my 3 questions I think there may be more gains to be had?
In case you're wondering, I have measured the latency of various data loading or caching methods.
Now I basically want to keep the caching situation I have, but give the HHVM JIT maximum confidence about how to type my data, so it can save time not running type or even bound (array size) checks.
Ok so nobody has been able to give me any code examples yet, so I'm just trying stuff out.
Here's what I've found out so far.
const arrays don't work yet in HHVM. const foo = ['uuid1',43,43];
throws an error about HHVM only supporting constants with scalar values.
Vector with Array values: I don't know how it will perform yet... I expect it will be better than a normal array. This is valid HH code.
This is progress, because HHVM should be able to cache this in the same way, HHVM knows this whole structure is constant, and HHVM knows the indexes are all integers.
What I'm still not entirely happy about with this structure is this:
Consider this code
for ($n=0;$n<count($iv);++$n) if ($x > $iv[$n][1]) dosomething();
Will HHVM perform a type check on $iv[$n][1] on every loop iteration?
In my definition of $iv above, there is nothing that says the 2nd element of the inner array will be an integer.
How can I improve on this?
Can disabling the type checker be of any use? Does this only hide errors from the external type checker, or does it prevent HHVM from constantly doing type checks? (I'm thinking it's the first thing)
Perhaps if I could make my own user-defined type that would solve the problem?
#I don't know what mechanisms for UDT's exist, so this code is made-up
CreateUDT foo = <string,int,int>;
$iv = ImmVector<foo> {
I found a reference to this at Hack Collections Literal Syntax (Vector<Foo>); unfortunately it might not be available to use yet.
I'm a software engineer at Facebook working on HHVM.
This entire question reeks of premature optimization to me. Have you done profiling and determined that loading this array is actually a bottleneck for your app? (Not just microbenchmarks, but how it actually affects the performance, latency, RPS, etc of realistic pageloads.) And also isolated from other effects, e.g., if this array is a cache or some sort of precomputed data, you need to isolate the win of precomputing the data from the actual time to load it by caching it in various different ways.
In general, HHVM is very good at dealing with arrays, since they are so hot in nearly every codepath -- and in particular at constant arrays like this one. To your questions about how to inform it of the shape and types of things in the arrays, HHVM can figure that all out for itself, and is very good at doing so on constant arrays composed entirely of constants. (And the ways it thinks about arrays aren't quite the ways you think about arrays, so it can probably do a better job anyway!) Basically, unless profiling says this is actually a hotspot -- which I'm pretty skeptical of -- I wouldn't worry too much about it. A couple general notes to be aware of:
Measure every performance diff. Don't prematurely optimize -- use profiling to guide. The developer productivity lost by premature optimizations getting in the way can be lethal.
Get things out of toplevel ("pseudomains") as much as possible. A function which returns a static or constant array should be just fine, and will in general help HHVM optimize code even better.
Avoid references as much as possible, especially in this array if you care about performance so much.
You should probably look into repo authoritative mode, which can help HHVM optimize lots of things even more -- but in particular for this case, the more aggressive inlining that repo auth mode can do might be a win.
Edit, aside:
because then the entire data structure must be parsed as a string and converted into a data structure in memory for every request. APC has the same problem as far as i know
This is exactly what I mean by premature optimization: you're rejecting APC without even trying it, even if it might be a cleaner way of doing what you want. It turns out that, in most cases, HHVM actually can optimize away the serialization/deserialization of storing arrays in APC, particularly if they are constant arrays that are never modified. As above, HHVM is very good at optimizing lots of common patterns. Just write code that's clean, profile it, and fix the hotspots.
Okay I've solved my first question.
I don't have any global scope issues. My require is being done from inside function main(), so it's as if the code from lookuptable####.php is being inserted into function main().
HHVM docs: "If the include occurs inside a function..."
Basically if you were to open lookuptable####.php it looks like the code is in global scope, but that's not the file that is being requested from hhvm. main.php is the one being requested, thus there is no code in global scope.
I think I've answered my 2nd question, it's currently at the bottom of my question. I'm not 100% convinced, but I'm pretty happy to move ahead and test it.

Best way to detect and store path combinations for analysing purpose later

I am searching for ideas/examples on how to store path patterns from users - with the goal of analysing their behaviours and optimizing on "most used path" when we can detect them somehow.
Eg. which action do they do after what, so that we later on can check to see if certain actions are done over and over again - therefore developing a shortcut or assembling some of the actions into a combined multiaction.
My first guess would be some sort of "simple log", perhaps stored in some SQL-manner, where we can keep each action as an index and then just record everything.
Problem is that the path/action might be dynamically changed - even while logging - so we need to be able to take care of this fact too, when looking for patterns later.
Would you log everything "bigtime" first and then post-process every detail after some time, or do you have good experience with other tactics?
My worry is that this is going to take up a lot of space when logging 1000 users each day for a month or more.
I hope this makes sense, and I am curious to see if anyone can provide sample code, pseudocode or perhaps links to something useful.
Our tools will be C#, SQL-database, XML and .NET 3.5 - clients could also get .NET 4.0 if needed.
Patterns examples as we expect them
User #1001: A-B-A-A-A-B-C-E-F-G-H-A-A-A-C-B-A
User #1002: B-A-A-B-C-E-F
User #1003: F-B-B-A-E-C-A-A-A
User #1002: C-E-F
etc. no real way to know what they do next nor how many they will use, how often they will do it.
A secondary goal, if possible: if we later on add a new "action" called G (just a sample to illustrate, there will be hundreds of actions), how could we detect this new behaviour's influence on the previous patterns?
To explain it better, my thought here would be some way to detect "patterns within patterns", sort of like how compression works, so that repetitive patterns are spotted. We don't know how long these patterns might be, nor how often they might occur. How do we break this down into "small bits and pieces"? What's the best approach, do you think?
I am not sure what you mean by path, but, if you gave every action in a path a unique symbol, you could reduce the problem to longest common substring or subsequence.
Or have a map of paths to the number of times that action occurred. Every time a certain path happens, increment the count for that path. Then sort to find the most common.
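As a sketch of the longest-common-substring reduction, the following Python uses the first two sample paths from the question. The function name and the choice of a row-by-row dynamic programming table are illustrative assumptions, not a prescribed implementation.

```python
def longest_common_substring(a, b):
    """Return the longest run of consecutive actions shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous positions.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

user_1001 = "A-B-A-A-A-B-C-E-F-G-H-A-A-A-C-B-A".split("-")
user_1002 = "B-A-A-B-C-E-F".split("-")
shared = longest_common_substring(user_1001, user_1002)
print("-".join(shared))  # A-A-B-C-E-F
```

This compares two paths at a time in O(len(a) × len(b)); finding runs common to many users at once would need a different structure, such as the count map suggested above.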
Pseudo idea/implementation so far
Log every user's actions into a list/series of actions, in bulk (text files/SQL, whatever; just store the whole thing for post-processing)
Start counting every "1 action", "2 actions", "3 actions" up to a certain amount (let's say 30 levels)
Sort them all, giving weights of importance to some of the actions (possibly those producing end results)
A useful result perhaps?
If we count all [A], [A-A], [A-B], [A-C], [A-A-A], [A-A-B] etc., it's going to make a LONG and fine-grained list of which actions are frequently used in a row, and that's in the right direction, because if some of these counts get too high, we might need a shorter path. The problem is then: what's too few actions to be optimized, and what's the longest action list we need to search for? My guess is that we need to do this counting first, then examine the numbers.
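The counting step could be sketched like this in Python, using the sample paths from the question. The function name `count_action_runs` and the cut-off of 3 are assumptions for the example; the question proposes going up to around 30 levels.

```python
from collections import Counter

def count_action_runs(sessions, max_len):
    """Count every run of 1..max_len consecutive actions across all sessions."""
    counts = Counter()
    for actions in sessions:
        for n in range(1, max_len + 1):
            for start in range(len(actions) - n + 1):
                counts[tuple(actions[start:start + n])] += 1
    return counts

sessions = [
    "A-B-A-A-A-B-C-E-F-G-H-A-A-A-C-B-A".split("-"),
    "B-A-A-B-C-E-F".split("-"),
    "F-B-B-A-E-C-A-A-A".split("-"),
    "C-E-F".split("-"),
]
counts = count_action_runs(sessions, max_len=3)

# Show the most frequent runs; weighting by importance would be a later step.
for run, seen in counts.most_common(5):
    print("-".join(run), seen)
```

Short runs always dominate a raw count because every long run contains them, so the numbers are more useful compared within one run length, or after the importance weighting from the list above.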
The problem is that this would be part of an analysis tool we are developing, and we don't have data until it is implemented, so we don't know what to look for before it's actually done. Hmm... wondering if there really IS an answer to this one.

Should I always return IQueryable<> instead of IList<>?

I came across this post while I was looking for things to improve performance. Currently, in my application we are returning IList<> all over the place. Is it a good idea to change all of these returns to AsQueryable() ?
Here is what I found -
AsQueryable() - The context needs to be open, and you cannot control the lifetime of the database context; it needs to be disposed properly. Also, it uses deferred execution ('faster filtering' as compared to Lists).
IList<> - This should be preferred over List<> as it provides a barebones and lightweight implementation.
Also, when should one be preferred over the other? I know the basics, but I am still not clear on when and how to use them correctly in an application. It would be great to know this so that next time I can keep it in mind before returning anything. Thanks a lot.
Basically, you should try to reference the widest type you need. For example, if some variable is declared as List<...>, you put a constraint for the type of the values that can be assigned to it. It may happen that you need only sequential access, so it would be enough to declare the variable as IEnumerable<...> instead. That will allow you to assign the values of other types to the variable, as well as the results of LINQ operations.
If you see that your variable needs access by index, you can again declare it as IList<...> and not just List<...>, allowing other types implementing IList<...> to be assigned to it.
For the function return types, it depends on you. If you think it's important that the function returns exactly List<...>, you declare it to return exactly List<...>. If the only important thing is access to the result by index, perhaps you don't need to constrain yourself to return exactly List<...>; you may declare the return type as IList<...> (but actually return an instance of List<...> in this implementation, and possibly of some other type supporting IList<...> later). Again, if you see that the only important thing about the return value of your function is that it can be enumerated (and access by index is not needed), you should change the function return type to IEnumerable<...>, giving yourself more freedom.
Now, about AsQueryable: again, it depends on your logic. If you think that possible delayed evaluation is a good thing in your case, as it may help avoid unneeded calculations, or you intend to use it as part of another query, you use it. If you think that the results have to be "materialized", i.e., calculated at this very moment, you had better return a List<...>. You would especially need to materialize your result if the calculation later may result in a different list!
With the database, a good rule of thumb is to use AsQueryable for short-term intermediate results, but List for the "final" results which will be used over some longer time. Of course, having a non-materialized query hanging around makes closing the database impossible (since at the moment of actual evaluation the database connection must still be open).
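The point about a later evaluation producing a different list can be seen in a small Python analogue, where a generator expression plays the role of the deferred query and a list comprehension plays the role of the materialized result. This is an illustration of deferred evaluation in general, not of IQueryable itself, and the `reminders` data is made up.

```python
reminders = [{"id": 1, "done": False}, {"id": 2, "done": False}]

# Deferred: the filter is not applied until the generator is consumed.
pending_deferred = (r["id"] for r in reminders if not r["done"])

# Materialized: the result is computed now and stored.
pending_snapshot = [r["id"] for r in reminders if not r["done"]]

# The source changes before the deferred query is consumed.
reminders[0]["done"] = True

deferred_result = list(pending_deferred)
print(deferred_result)   # [2]
print(pending_snapshot)  # [1, 2]
```

Neither behaviour is wrong; the choice is whether callers should see the data as it was when the call returned or as it is when they finally enumerate it.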
If you do not intend to run any further queries against SQL Server, then you should return IList, because it holds in-memory data.
If you are concerned about performance, you should also try to run your queries in as few DB requests as possible and cache the most used queries. It is very common to significantly reduce request processing time by using batch approaches.
Which ORM do you use to retrieve data from the DB? If you use NHibernate, see this post about how to use Future, Multi Criteria 1, Multi Criteria 2 and Multi Query.
