Count() on pull queries in ksqlDB

I have explored different ways to obtain the count of records in a materialized view in ksqlDB. Note: the materialized view is created with CREATE SOURCE TABLE. It does not have an output topic, and that's on purpose.
So far the only approach that seemed to work was a push query, which supports GROUP BY (necessary for a count). The trick was to use select count(*) from table group by 1.
This works, but not fully as expected.
Indeed, although we issue the query once the materialized view is done pulling from the topic, we sometimes get 1 result, sometimes 2 or 3, but the final result is always the right count.
This is rather strange, given that the push query is supposed to do a full scan at first.
Such behavior doesn't work for our workflow. We need a firm count, delivered once.
Hence I was thinking about using a UDAF.
However, before I jump into it, I wanted to know whether pull queries support them, and whether there are known limitations here. Given that a count via a pull query sounds like something that should be provided out of the box, I wonder why it is not. Maybe there is some limitation that prevents offering such a function with pull queries?
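Since a push query with GROUP BY 1 streams intermediate results, one workaround to get a single firm number is to materialize the count itself as its own table and then read it with a pull query. This is only a sketch: the table and column names (my_source_table, row_count, k, cnt) are made up, and tables created with CREATE SOURCE TABLE have restrictions, so ksqlDB may or may not accept a derived table over yours.

```sql
-- Persistent query (sketch): maintain a one-row table holding the count.
-- Grouping by the constant 1 puts every row in a single bucket.
CREATE TABLE row_count AS
  SELECT 1 AS k, COUNT(*) AS cnt
  FROM my_source_table
  GROUP BY 1;

-- Pull query: a key lookup returns the current count once,
-- instead of streaming intermediate updates.
SELECT cnt FROM row_count WHERE k = 1;
```

Note that even this count is only eventually consistent with the underlying topic, so the question of when the view is "done pulling" still applies.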

Related

What's the most efficient way to pull only the records created since last pull?

I am building a Shopify App, and I want my customer (the store owner) to 'sync' the app with their store data occasionally, specifically for their Order data (ignore the fact that I will also have a webhook that pushes this data to my app whenever a new order is created).
Right now, just for crude illustration purposes, I am doing this (which works in dev mode):
store_orders = ShopifyAPI::Order.find(:all)
store_orders.each do |sorder|
  new_order = Order.find_or_create_by(s_order_id: sorder.id)
  new_order.update(
    currency: sorder.currency,
    ...
  )
end
So, I am pulling all of the orders directly from the ShopifyAPI. Ideally, what I want to happen is to only pull the new orders that have been made that have not been synced.
I have two constraints:
The Order IDs produced by the ShopifyAPI are not in a sequential order, and can be relatively haphazard
I can't do a where query on the ShopifyAPI according to the updated_at date, i.e. to only select the records after my last synced date (the ShopifyAPI is not allowing this at the moment, I am not sure if this will be fixed in the future).
So that leaves me with two questions:
What's another, efficient way, for me to quickly find the records that have not been pulled and only pull those?
How do I make sure to update a local record only when an attribute has actually changed, or create one only when no record existed before? (i.e., I am trying to avoid updating every record that has been pulled.)
I am not sure where you figure the Shopify order ID is "haphazard" and not in "sequential" order? If you study these things, you will find that they are integers and in fact, they are in order in the sense that an order created after another will have a bigger ID.
So the coolest little thing you can do, really quite easy, is to use the Shopify API filter known as "since_id". You get the luxury of pulling only orders that hit Shopify SINCE the last pull, assuming you stored the last pulled ID in a since_id field on your shop model.
Try it. It works perfectly. I have been doing that for years. Just update the since_id in your DB once you're done processing a batch of orders, and then next time you want more, filter using your since_id.
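The pattern can be sketched in plain Ruby. In the classic ActiveResource-based gem the API call would be something like ShopifyAPI::Order.find(:all, params: { since_id: last_id }); here a hypothetical in-memory array stands in for the API so the filtering and bookkeeping are runnable:

```ruby
# Hypothetical order data standing in for what the ShopifyAPI returns.
ORDERS = [
  { id: 1001, currency: "USD" },
  { id: 1003, currency: "EUR" },
  { id: 1007, currency: "USD" }
]

# Pretend the shop model last stored since_id = 1001.
last_since_id = 1001

# Shopify's since_id filter returns only orders with a strictly greater ID;
# we simulate it with a select.
new_orders = ORDERS.select { |o| o[:id] > last_since_id }

# After processing the batch, persist the highest ID seen as the new since_id.
last_since_id = new_orders.map { |o| o[:id] }.max || last_since_id
```

On the next sync, only orders created after ID 1007 would come back, so each pull does work proportional to the new orders rather than the whole history.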

One single Azure SQL query is consuming almost all query_stats.total_worker_time and query_stats.execution_count

I've been running a production website on Azure SQL for 4 years.
With the help of the 'Top Slow Request' query from alexsorokoletov on GitHub, I found one super slow query according to Azure query stats.
The one on top is the one that uses a lot of CPU.
When looking at the linq query and the execution plans / live stats, I can't find the bottleneck yet.
And the live stats: [screenshot omitted]
The join from results to project is not direct; there is a projectsession table in between, not visible in the query, but perhaps added under the hood by Entity Framework.
Might I be affected by parameter sniffing? Can I reset a plan hash? Maybe the query plan was optimized back in 2014, and now the result table is about 4 million rows and the plan is far from optimal?
If I run this query in Management Studio, it's very fast!
Is it just the stats that are wrong?
Regards
Vincent - The Netherlands.
I would suggest you try adding option(hash join) at the end of the query, if possible. Once you start getting into large arity, loops join is not particularly efficient. That would prove out if there is a more efficient plan (likely yes).
Without seeing more of the details (your screenshots are helpful but cut off whether auto-param or forced parameterization has kicked in and auto-parameterized your query), it is hard to confirm/deny this explicitly. You can read more about parameter sniffing in a blog post I wrote a bit longer ago than I care to admit ;) :
https://blogs.msdn.microsoft.com/queryoptteam/2006/03/31/i-smell-a-parameter/
Ultimately, if you update stats, dbcc freeproccache, or otherwise cause this plan to recompile, your odds of getting a faster plan in the cache are higher if you have this particular query + parameter values being executed often enough to sniff that during plan compilation. Your other option is to add optimize for unknown hints which will disable sniffing and direct the optimizer to use an average value for the frequency of any filters over parameter values. This will likely encourage more hash or merge joins instead of loops joins since the cardinality estimates of the operators in the tree will likely increase.
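Both suggestions can be expressed as query hints. The query below is a hypothetical reconstruction of the shape described (results joined to projects through a projectsession table; table, column, and parameter names are made up); the OPTION clauses are the point:

```sql
-- Force hash joins to compare against the current loops-join plan.
SELECT r.Id
FROM Results AS r
JOIN ProjectSessions AS ps ON ps.Id = r.ProjectSessionId
JOIN Projects AS p ON p.Id = ps.ProjectId
WHERE p.Id = @projectId
OPTION (HASH JOIN);

-- Alternative: keep the optimizer's join choice but disable sniffing,
-- so cardinality estimates use average density rather than the
-- parameter value sniffed at first compilation.
SELECT r.Id
FROM Results AS r
JOIN ProjectSessions AS ps ON ps.Id = r.ProjectSessionId
JOIN Projects AS p ON p.Id = ps.ProjectId
WHERE p.Id = @projectId
OPTION (OPTIMIZE FOR UNKNOWN);
```

If the hinted plan is dramatically faster, that confirms the cached sniffed plan is the problem rather than the query itself.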

Breeze.js reverses the query order when executed locally

So a slightly weird one that I can't find any cause for really.
My app is set up to basically run almost all queries through one standard method that handles things like querying against the local cache etc. So essentially the queries are all pretty standardised.
Then I have just one, with a strange orderby issue. The query includes a specific orderby clause, and if I run the query first time, the cache is checked, no results found, queries the remote data source, get data, all correct and ordered.
When I return to the page, the query is executed again, this time against the local cache, where it does find the data and returns it... the weird part is that the order is reversed. Bear in mind the parameters going in are exactly the same; the only difference is that executeQueryLocally finds results this time (on the first run it is also executed with executeQueryLocally, it's just that no results are found and it goes on to execute remotely).
I really can't see any specific issue as to why the results are reversed (I say they are reversed, I can't actually guarantee that - they might just be unordered and happen to come out in a reversed order)
This isn't really causing a headache, it's just weird, especially as it appears to be only one query where this happens.
Thoughts?
Server side queries and client side queries are not guaranteed to return results in any specific order UNLESS you have an "orderBy" clause specified. The reason that order may be different without the "orderBy" clause is that the data is being stored very differently on the server vs the client and unless a specific order is specified both will attempt to satisfy the query as efficiently as possible given the storage implementation.
One interesting side note is that, per the ANSI 92 SQL standard, even your SQL database is not required to return data in the same order for the same query (again, unless you have an ORDER BY clause). It's just that it's very rare to see it happen.

Why does a select with consistent read from Amazon SimpleDB yield different results?

I have a domain on SimpleDB and I never delete from it.
I am doing the following query on it.
select count(*) from table_name where last_updated > '2012-09-25';
Though I am setting the ConsistentRead parameter to true, it still returns different results on different executions. As I never delete anything from this domain, the count should only ever increase, but that is not happening.
Am I missing something here?
If I understand your use case correctly, you might be misreading the semantics of the ConsistentRead parameter in the context of Amazon SimpleDB, see Select:
When set to true, ensures that the most recent data is returned. For
more information, see Consistency
The phrase most recent can admittedly be misleading, but it doesn't address or affect result ordering in any way; rather, it means most recently updated, and ConsistentRead guarantees that every update operation preceding your select statement is already visible to that select operation. See the description:
Amazon SimpleDB keeps multiple copies of each domain. When data is
written or updated, all copies of the data are updated. However, it
takes time for the update to propagate to all storage locations. The
data will eventually be consistent, but an immediate read might not
show the change. If eventually consistent reads are not acceptable for
your application, use ConsistentRead. Although this operation might
take longer than a standard read, it always returns the last updated
value. [emphasis mine]
The linked section on Consistency provides more details and an illustration regarding this concept.
Sort order
To achieve the results you presumably desire, a simple order by statement should do the job, e.g.:
select * from table_name where last_updated > '2012-09-25' order by last_updated;
There are a couple of constraints/subtleties regarding this operation on SimpleDB, so make sure to skim the short documentation of Sort for details.

Is a full list returned first and then filtered when using linq to sql to filter data from a database or just the filtered list?

This is probably a very simple question that I am working through in an MVC project. Here's an example of what I am talking about.
I have an rdml file linked to a database with a table called Users that has 500,000 rows. But I only want to find the Users who were entered on 5/7/2010. So let's say I do this in my UserRepository:
from u in db.GetUsers() where u.CreatedDate == "5/7/2010" select u
(doing this from memory so don't kill me if my syntax is a little off, it's the concept I am looking for)
Does this statement first return all 500,000 rows and then filter it or does it only bring back the filtered list?
It filters in the database, since you're building your expression atop an ITable, returning an IQueryable<T> data source.
Linq to SQL translates your query into SQL before sending it to the database, so only the filtered list is returned.
When the query is executed it will create SQL to return the filtered set only.
One thing to be aware of is that if you do nothing with the results of that query nothing will be queried at all.
The query will be deferred until you enumerate the result set.
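To illustrate, the generated SQL for the filtered query looks roughly like the following (the column list and parameter name are illustrative, not the exact text LINQ to SQL emits):

```sql
-- The WHERE clause is part of the SQL sent to the server,
-- so only the matching rows ever leave the database.
SELECT [t0].[Id], [t0].[Name], [t0].[CreatedDate]
FROM [Users] AS [t0]
WHERE [t0].[CreatedDate] = @p0
```

Nothing like `SELECT * FROM Users` followed by in-memory filtering is ever issued.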
These folks are right and one recommendation I would have is to monitor the queries that LinqToSql is creating. LinqToSql is a great tool but it's not perfect. I've noticed a number of little inefficiencies by monitoring the queries that it creates and tweaking it a bit where needed.
The DataContext has a "Log" property that you can work with to view the queries created. I created a simple HttpModule that outputs the DataContext's Log (formatted for sweetness) to my output window. That way I can see the SQL it used and adjust if need be. It's been worth its weight in gold.
Side note - I don't mean to be negative about the SQL that LinqToSql creates as it's very good and efficient almost every time. Another good side effect of monitoring the queries is you can show your friends that are die-hard ADO.NET - Stored Proc people how efficient LinqToSql really is.
