Snowflake Stream Consumption

I have a question on stream consumption.
As per my setup, I have a parent table, telemetry_data, which is constantly updated with app analytics data. I create a stream, telemetry_data_stream, on top of the parent table to read new records added to it. To consume the stream, I execute a CREATE TABLE AS statement, which consumes the stream and resets it to accept new changes after the consumption.
For example, if the SQL statement took 3 seconds to execute, what happens to the data that was added to the parent table during those 3 seconds? Would that data show up in the new table I create from the stream using create table as SELECT * FROM catalog_returns_stream; or would it be kept in the stream as part of the new version? Is there a chance of data loss due to the time taken during stream consumption? If so, is there a way to be sure we won't face such a situation?
More to the point, I would like to understand how the stream manages consumption and the reading of new data when both happen at the same time. Does the stream advance to the point at which the transaction on it started, or to the point at which it ended?
I tried constantly adding data to the parent table while also consuming the stream. The SQL executes quickly enough that it is hard to tell whether there is any data loss.

A stream is a living object. It does not store any data, so the word "reset" is not really applicable. When you consume data, you are basically advancing the offset.
You can read more about it here: https://docs.snowflake.com/en/user-guide/streams-intro.html#offset-storage
create table as SELECT * FROM catalog_returns_stream; may not be the best way to consume a frequently updated stream.
Kindly also note that each table can have multiple streams on it at the same time.
A table that is being updated while the stream on it is being consumed is a very common scenario. The stream's offset then points to the first row that was not consumed during the last read operation, so rows added while your statement runs are simply returned by the next read; they are not lost.
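As a minimal sketch of a more robust pattern (all object names are placeholders, and the base table's columns are assumed here to be event_id, payload, created_at): consume the stream with DML into a persistent target table inside an explicit transaction, so the offset only advances when the statement commits, and any rows inserted into the parent table in the meantime are picked up by the next run.

CREATE STREAM IF NOT EXISTS telemetry_data_stream ON TABLE telemetry_data;

BEGIN;
INSERT INTO telemetry_data_consumed (event_id, payload, created_at)
SELECT event_id, payload, created_at   -- the METADATA$ columns are intentionally left out
FROM telemetry_data_stream;            -- using the stream in DML advances its offset at COMMIT
COMMIT;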
Hope that helps

Related

Event Store DB: temporal queries

Regarding the question asked here:
Suppose we have ProductCreated and ProductRenamed events, both of which contain the title of the product. Now we want to query EventStoreDB for all events of type ProductCreated and ProductRenamed with a given title. I want these events so I can check whether any product in the system has been created with, or renamed to, the given title, so that I can throw a duplicate-title exception in the domain.
I am using MongoDB to build UI reports from all the published events, and everything is fine there. But for checking some invariants, like uniqueness of values, I have to query the event store for certain events matching some criteria and, by iterating over them, decide whether there is a product created with the same title that has not been renamed, or a product renamed to the same title.
For such queries, the only mechanism Event Store provides is creating a one-time projection with the appropriate JavaScript code, which filters and emits the required events to a new stream. Then all I have to do is fetch the events from the newly generated stream that the projection fills.
Now the odd thing is: projections are great for subscriptions and for generating new streams, but they seem awkward for real-time queries. Immediately after I create a projection with the HTTP API, I check the new resulting stream for the query result, but it seems the workers have not yet had a chance to produce the result, and I get a 404 response. After waiting a few seconds, the new stream pops up and gets filled with the result.
There are too many things wrong with this approach:
First, if the event store is filled with millions of events across many streams, it won't be able to process and filter all of them into the resulting stream immediately. It doesn't even create the stream immediately, let alone populate it, so I have to wait for some time and then check for the result, hoping the projection is done.
Second, I have to fetch multiple times and issue multiple HTTP GET requests, which seems slow. The new JVM client is not ready yet.
Third, I have to delete the resulting stream once I'm done with the result; failing to do so would leave the event store with millions of orphaned query-result streams.
I wish I could pass the JavaScript to some API and get the result page by page, like querying MongoDB, without worrying about the projection, new streams, and timing issues.
I have seen a Query section in the Admin UI, but I don't know what it's for, and unfortunately the documentation doesn't help much.
Am I expecting the event store to do something that is impossible?
Do I have to create a read model inside the bounded context for doing such checks?
I am using my events to rehydrate the aggregates, and I would like to use the same events for such simple queries without bringing in other techniques.
I believe it would not be a separate bounded context since the check you want to perform belongs to the same bounded context where your Product aggregate lives. So, the projection that is solely used to prevent duplicate product names would be a part of the same context.
You can indeed use a custom projection to check it, but I believe the complexity of such a solution would be higher than having a simple read model in MongoDB.
It is also fine to use an existing projection for the check if you have one, although that might not be what you would otherwise prefer if the aim of the existing projection is to show things in the UI.
For the collection that you could use for duplicates check, you can have the document schema limited to the id only (string), which would be the product title. Since collections are automatically indexed by the id, you won't need any additional indexes to support the duplicate check query. When the product gets renamed, you'd need to delete the document for the old title and add a new one.
Again, there will be a small time window during which a duplicate can slip in. It's then up to the business to decide whether the concern is real (most of the time it isn't) and what the consequences are if it does happen one day. You'd be able to spot a duplicate quite easily when projecting events and decide what to do when it happens.
Practically, when you have such a projection, all it takes is to build a simple domain service bool ProductTitleAlreadyExists.

Is there any standard stored procedure to capture table refresh details in Snowflake?

I am trying to log table refresh details in a Snowflake DWH.
The details include the following:
batch date, source table name, target table name, rows loaded, timestamp, status, error message.
Is there any standard SQL/Snowflake stored procedure that could serve as a common one for the entire DWH to trace/audit table refresh details and log them into a single table?
I have variables that capture the batch date, target table name, source table name, etc.
A standard stored procedure that can log the start and the end of the activity would be really helpful.
Regards,
Srinivas
If you are looking for some ideas moving forward, here are a couple of things that can help you out:
Query History is useful but hard to filter. If you use a QUERY_TAG in your batch processes, you can then reference query_history for information.
In addition, if you want to capture information as it's running, you could use Streams and Tasks on your tables to capture counts of updates/inserts/deletes, etc. for each batch in the background.
There is no standard stored procedure that you can leverage within Snowflake to query this information, but there is a lot of data available in the snowflake.account_usage share.
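For example, a hedged sketch of the QUERY_TAG idea (the tag value is made up, and note that ACCOUNT_USAGE views lag by up to roughly 45 minutes):

ALTER SESSION SET QUERY_TAG = 'nightly_load_2021-06-01';

-- ... run the batch DML in this session ...

SELECT query_text, query_type, execution_status, error_message, start_time, rows_produced
FROM snowflake.account_usage.query_history
WHERE query_tag = 'nightly_load_2021-06-01'
ORDER BY start_time;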
Not sure what exactly you're trying to achieve here, but:
you can use LAST_ALTERED on a table to see when the data was last modified
you can filter query_history to see which queries modified the table:
https://docs.snowflake.com/en/sql-reference/functions/query_history.html
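A small sketch of both suggestions (the table name is a placeholder, and the INFORMATION_SCHEMA function only covers recent history, up to the last 7 days):

SELECT table_name, last_altered
FROM information_schema.tables
WHERE table_name = 'TELEMETRY_DATA';

SELECT query_text, start_time, execution_status, error_message
FROM TABLE(information_schema.query_history(result_limit => 1000))
WHERE query_text ILIKE '%telemetry_data%'
ORDER BY start_time DESC;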
You can take advantage of Snowflake Streams: https://docs.snowflake.com/en/sql-reference/sql/create-stream.html
When you create a stream, you point it at a table, and the stream records the changes produced on that table (INSERTs, UPDATEs and DELETEs) between two points in time.
You can query your stream like any other table to look at the changes.
What's great about streams is that after a DML operation successfully consumes data from a stream, the stream is purged (its offset advances past the consumed rows), so when you query it again it will be empty.
Use them guilt-free: streams don't duplicate your data, they just store the offset and the CDC metadata, so the data itself remains in your table.
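For example, here is a minimal sketch that logs per-batch change counts into an audit table using a dedicated stream and a scheduled task. All object names, the warehouse, and the schedule are placeholders, and a second stream is used so it does not interfere with whatever else consumes the main one:

CREATE TABLE IF NOT EXISTS refresh_audit (
  batch_ts     TIMESTAMP_NTZ,
  source_table STRING,
  n_inserts    NUMBER,
  n_updates    NUMBER,
  n_deletes    NUMBER
);

CREATE STREAM IF NOT EXISTS telemetry_audit_stream ON TABLE telemetry_data;

CREATE TASK IF NOT EXISTS log_telemetry_changes
  WAREHOUSE = my_wh
  SCHEDULE  = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('TELEMETRY_AUDIT_STREAM')
AS
INSERT INTO refresh_audit
SELECT CURRENT_TIMESTAMP()::timestamp_ntz,
       'TELEMETRY_DATA',
       COUNT_IF(metadata$action = 'INSERT' AND NOT metadata$isupdate),
       COUNT_IF(metadata$action = 'INSERT' AND metadata$isupdate),   -- updates appear as DELETE+INSERT pairs
       COUNT_IF(metadata$action = 'DELETE' AND NOT metadata$isupdate)
FROM telemetry_audit_stream;

ALTER TASK log_telemetry_changes RESUME;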
Some useful guides that build something close to what you need:
- Part 1: https://www.snowflake.com/blog/building-a-type-2-slowly-changing-dimension-in-snowflake-using-streams-and-tasks-part-1/
- Part 2: https://www.snowflake.com/blog/building-a-type-2-slowly-changing-dimension-in-snowflake-using-streams-and-tasks-part-2/

CSV Import to multiple tables - speed consideration

I have an app that takes sales data made available to vendors at Whole Foods and processes the daily sales data by store and item. All of the parent information is stored in one downloaded CSV with about 10,000 lines per month.
The importing process checks for new stores before importing the sales information.
I don't know how to track the 'time' of processes in Ruby and Rails, but I was wondering whether it would be 'faster' to process one line at a time into each table, or to process the file once for one table (stores) and then again for the other table (sales).
If it matters, new stores are not added often, though stores might be closed (and the import checks for that as well), so the scan through the stores might only add a few new entries, whereas every row of the CSV is added to the sales.
If this isn't appropriate, I apologize; I'm still working out the kinks of the rules.
When it comes to processing data with Ruby, memory consumption is what you should be concerned about.
With CSV processing in Ruby, the best you can do is read line by line:
require "csv"
# Read one row at a time; only the current row needs to be held in memory.
CSV.open("data.csv") do |file|
  while (line = file.readline)   # CSV#readline (alias of #shift) returns nil at end of file
    # do stuff with line
  end
end
This way, no matter how many lines are in the file, only a single one (plus the previously processed one) is loaded into memory at a time; the GC will collect processed lines as your program executes. This approach consumes almost no memory, and it will speed up the parsing process too.
i was wondering if it would be 'faster' to process one line at a time
to each table or to process the file for one table (stores) and then
to the other table (sales)
I would go with one line at a time to each table.
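A minimal sketch of that row-by-row approach, assuming hypothetical Store and Sale ActiveRecord models and made-up CSV column names (adjust to your schema):

require "csv"

CSV.foreach("daily_sales.csv", headers: true) do |row|
  # Find or create the store for this row, then record the sale against it.
  store = Store.find_or_create_by!(code: row["store_code"]) do |s|
    s.name = row["store_name"]
  end

  Sale.create!(
    store:    store,
    item:     row["item"],
    quantity: row["quantity"],
    sold_on:  row["date"]
  )
end

If speed becomes a concern, wrapping the loop in a single database transaction usually helps more than the order in which the tables are processed.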

How should I auto-expire entries in an ETS table, while also limiting its total size?

I have a lot of analytics data which I'm looking to aggregate every so often (let's say one minute.) The data is being sent to a process which stores it in an ETS table, and every so often a timer sends it a message to process the table and remove old data.
The problem is that the amount of data that comes in varies wildly, and I basically need to do two things to it:
If the amount of data coming in is too big, drop the oldest data and push the new data in. This could be viewed as a fixed size queue, where if the amount of data hits the limit, the queue would start dropping things from the front as new data comes to the back.
If the queue isn't full, but the data has been sitting there for a while, automatically discard it (after a fixed timeout.)
If these two conditions are kept, I could basically assume the table has a constant size, and everything in it is newer than X.
The problem is that I haven't found an efficient way to do these two things together. I know I could use match specs to delete all entries older than X, which should be pretty fast if the key is the timestamp. But I'm not sure this is the best way to periodically trim the table.
The second problem is keeping the total table size under a certain limit, which I'm not really sure how to do. One solution that comes to mind is to use an auto-incrementing field with each insert, and when the table is being trimmed, look at the first and the last index, calculate the difference, and again use match specs to delete everything below the threshold.
Having said all this, it feels that I might be using the ETS table for something it wasn't designed to do. Is there a better way to store data like this, or am I approaching the problem correctly?
You can determine the amount of memory occupied using ets:info(Tab, memory). The result is in words. But there is a catch: if you are storing binaries, only heap binaries are included. So if you are storing mostly normal Erlang terms you can use it, and together with a timestamp as you described, it is a way to go. For the size in bytes, just multiply by erlang:system_info(wordsize).
I haven't used ETS for anything like this, but in other NoSQL DBs (DynamoDB) an easy solution is to use multiple tables: If you're keeping 24 hours of data, then keep 24 tables, one for each hour of the day. When you want to drop data, drop one whole table.
I would do the following: create a server responsible for
- receiving all the data-storage messages. These messages should be timestamped by the client process (so it doesn't matter if a message waits a little in the server's message queue). The server then stores them in the ETS table, configured as an ordered_set and using the timestamp, converted to an integer, as the key (if the timestamps are produced by erlang:now in a single VM they will all be different; if you are using several nodes, you will need to add some information, such as the node name, to guarantee uniqueness).
- receiving a tick (using, for example, timer:send_interval) and then processing the messages received in the last N µsec: start from Key = current time - N, walk forward with ets:next(Table, Key), and continue to the last message. Finally, you can discard all the messages via ets:delete_all_objects(Table). If you had to add information such as a node name, it is still possible to use ets:next (for example, if the keys are {TimeStamp :: integer(), Node :: atom()} you can compare against {Time, 0}, since a number is smaller than any atom).
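A minimal Erlang sketch of the two trimming operations themselves (age cut-off and size cap) on such an ordered_set table. The module and function names are made up, the keys are assumed to be {monotonic time in µs, unique integer} tuples, and erlang:monotonic_time is used instead of the now-deprecated erlang:now:

-module(metrics_buffer).
-export([new/0, put/2, trim_older_than/2, trim_to_size/2]).

new() ->
    ets:new(?MODULE, [ordered_set, public]).

put(Tab, Value) ->
    Key = {erlang:monotonic_time(microsecond), erlang:unique_integer([monotonic])},
    ets:insert(Tab, {Key, Value}).

%% Drop everything older than MaxAgeUs microseconds (age-based expiry).
trim_older_than(Tab, MaxAgeUs) ->
    Cutoff = erlang:monotonic_time(microsecond) - MaxAgeUs,
    ets:select_delete(Tab, [{{{'$1', '_'}, '_'}, [{'<', '$1', Cutoff}], [true]}]).

%% Drop the oldest entries until at most MaxSize remain (size cap).
trim_to_size(Tab, MaxSize) ->
    case ets:info(Tab, size) - MaxSize of
        Excess when Excess > 0 -> drop_oldest(Tab, Excess);
        _ -> ok
    end.

drop_oldest(_Tab, 0) -> ok;
drop_oldest(Tab, N) ->
    case ets:first(Tab) of
        '$end_of_table' -> ok;
        Key ->
            ets:delete(Tab, Key),
            drop_oldest(Tab, N - 1)
    end.

The periodic sweep would then just call both functions from the tick handler described above.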

Can TCustomClientDataset apply updates in a batch mode?

I've got a DB Express TSimpleDataset connected to a Firebird database. I've just added several thousand rows of data to the dataset, and now it's time to call ApplyUpdates.
Unfortunately, this results in several thousand database hits as it tries to INSERT each row individually. That's a bit disappointing. What I'd really like to see is the dataset generate a single transaction with a few thousand INSERT statements in it and send the whole thing at once. I could set that up myself if I had to, but first I'd like to know if there's any method for it built in to the dataset or the DBX framework.
Don't know if it is possible with a TSimpleDataset (never used it), but you surely can if you use a TClientDataset + TDatasetProvider + <put your db dataset here>. You can write a BeforeUpdateRecord handler to manage the apply process yourself. Basically, it allows you to bypass the standard apply process, access the dataset delta with the changes made to records, and then use your own code and components to apply the changes to the database. For example, you could call stored procedures to modify the data, and so on.
However, there is a difference between a transaction and what is called "array DML", "bulk insert" or the like. Even if you use a single transaction (and an "apply" AFAIK happens in a single transaction), within that transaction you may still need to send "n" INSERTs. Some databases support a way of sending a single INSERT (or UPDATE, DELETE) with an array of parameters, reducing the number of individual statements to be sent - but that is very database specific and AFAIK dbExpress/DataSnap do not support it. You could still use the BeforeUpdateRecord event to take advantage of specific database capabilities.
