Table of all users registered in the last X days - ksqldb

I have a topic derived from a MySQL users table and now want to create a table in ksqlDB that always contains all users registered within the last 50 days. After spending some time in the docs I still can't find a solution for this. Windowing doesn't seem to work because it only applies to stream-stream joins, as far as I can see. Is this sort of thing even possible using ksqlDB alone, or do I need to look for other solutions?
Thanks!

I don't think this is currently possible with ksqlDB. This is because a filter such as WHERE DATEDIFF('days', user.registerDate, now()) < 50 would only be evaluated at the time the source row was received, so a row that passed the filter would never be removed later as it aged past 50 days.
This isn't too different to standard SQL dbs, where a materialized view built with a similar filter would also not update automagically as time advanced.
You could likely build a system using Kafka Streams that could use punctuation to evict old entries.
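To illustrate the eviction idea in isolation, here is a minimal in-memory sketch in Python. It is not Kafka Streams code: the `RecentUsers` class and its `put` and `punctuate` methods are hypothetical names that only mimic what a scheduled punctuator over a state store would do, namely scan the store periodically and drop entries older than the retention period.

```python
from datetime import datetime, timedelta


class RecentUsers:
    """Hypothetical stand-in for a state store of recently registered users."""

    def __init__(self, retention_days=50):
        self.retention = timedelta(days=retention_days)
        self.store = {}  # user_id -> registration time

    def put(self, user_id, registered_at):
        # Called for every incoming record.
        self.store[user_id] = registered_at

    def punctuate(self, now):
        # Called on a schedule; evicts every entry older than the retention period.
        cutoff = now - self.retention
        expired = [uid for uid, ts in self.store.items() if ts < cutoff]
        for uid in expired:
            del self.store[uid]
        return expired


now = datetime(2024, 3, 1)
users = RecentUsers(retention_days=50)
users.put("alice", now - timedelta(days=10))
users.put("bob", now - timedelta(days=60))
evicted = users.punctuate(now)  # "bob" is older than 50 days and is dropped
```

In a real topology the eviction step would probably also need to emit a delete (tombstone) for each evicted key so that downstream tables drop it too; that part is left out of this sketch.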
Some enhancement requests on ksqlDB may also be of use:
Table over topic with time based retention
View updates

Related

Kafka KTables missing data when joining KStream to KTable

Has anyone posted a response to this problem? There have been other posts with no answers. Our situation is that we are pushing messages onto a topic that is backing a KTable in the first step of our stream process. We are then pulling a small amount of data from those messages and passing them along. We are doing multiple computations on that smaller amount of data for grouping and aggregation. At the end of the streaming process, we simply want to join back to that original topic via a KTable to pick up the full message content again. The results of the join are only a subset of the data because the join cannot find the entries in the KTable.
This is just the beginning of the problem. In another case, we are using KTables as indexes for lookups meant to enrich the data coming in. Think of these lookups as identifying whether we have seen a specific pattern in the streaming message before. If we have seen the pattern, we want to tag it with an ID (used for grouping) pulled from an existing KTable. If we have not seen the pattern before, we would assign it an ID and place it back into the KTable to be used to tag future messages. What we have found is that there is no guarantee that the information will be present in the KTable for future messages. This lack of a guarantee seems to make KTables useless. We cannot figure out why there is so little discussion of this on the forums.
Finally, none of this seemed to be a problem when running with a single instance of the streams application. However, as soon as our data got large and we were forced to have 10 instances of the app, everything broke. As well, there is no way that we could use things like GlobalKTables because there is too much data to be loaded into a single machine's memory.
What can we do? We are currently planning to abandon KTables altogether and use something like Hazelcast to store the lookup data. Should we just move to Hazelcast Jet and drop Kafka Streams altogether?
Adding flow:
[image: Kafka data flow]
I'm sorry for this non-answer answer, but I don't have enough points to comment...
The behavior you describe is definitely inconsistent with my understanding and experience with streams. If you can share the topology (or a simplified one) that is causing the problem, there might be a simple mistake we can point out.
Once we get more info, I can edit this into a "real" answer...
Thanks!
-John

Pre-Made ActiveRecord to Optimize Performance / Save Resources

Essentially, each time a visitor reaches the application, the controller performs a database query to determine the most relevant items to show.
Although the items shown vary with time, they are not personally selected for each user.
This means that instead of calculating the result each time a visitor comes, it would be better to have the system run a single query every 10 minutes or so, store the result, and reuse it on each visit.
What is the best way to apply this idea? I was thinking of cron jobs and maybe storing the result in Redis, but I don't know. Some help is appreciated!
There are a number of ways to do this. One way that I've used in the past with success is to have a table in your database that represents the most relevant items and then have a cron job that updates that table.
Fragment caching like #wesley6j recommended isn't a bad way to go either and you can combine the 2 techniques as well if you want.
If you want more detailed suggestions, you can provide some more details about what you are trying to achieve.
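As a rough illustration of the precomputed-table approach, here is a minimal sketch in Python with SQLite rather than Rails/ActiveRecord. The table names (`items`, `relevant_items`), the `score` column, and both function names are assumptions made up for the example; the refresh function is what the cron job would call, and the read function is what each request would call.

```python
import sqlite3


def refresh_relevant_items(conn, limit=3):
    # Meant to run from a scheduler (e.g. cron every 10 minutes), not per request.
    with conn:  # the delete and the refill commit together
        conn.execute("DELETE FROM relevant_items")
        conn.execute(
            "INSERT INTO relevant_items (item_id, name, score) "
            "SELECT id, name, score FROM items ORDER BY score DESC LIMIT ?",
            (limit,),
        )


def relevant_items(conn):
    # Per-request path: a cheap read of the precomputed rows.
    rows = conn.execute("SELECT name FROM relevant_items ORDER BY score DESC")
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, score REAL)")
conn.execute("CREATE TABLE relevant_items (item_id INTEGER, name TEXT, score REAL)")
conn.executemany(
    "INSERT INTO items (name, score) VALUES (?, ?)",
    [("a", 1.0), ("b", 5.0), ("c", 3.0), ("d", 4.0)],
)
refresh_relevant_items(conn, limit=2)
top = relevant_items(conn)  # the two highest-scoring items at refresh time
```

Between refreshes, visitors keep seeing the stored result even if the underlying scores change, which is the trade-off being accepted here.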

Inputting Incremental Database into Apache Storm Project

I searched a lot but couldn't find quite what I was looking for. The question is simple and straightforward.
I have a database table that gets populated every second!
Next, I have mostly defined the analysis methods/classes in the Apache Storm spout/bolt classes.
All I want to do is send the new rows being inserted every second to the spout class as a stream input.
How do I do this?
Thanks,
There are several ways you could accomplish this, but without knowing more about the nature of the data it's hard to give a good answer. One way would be to use another table to track which records have already been processed by storm based on some field in the original table. For instance, if you used a timestamp column you could track the maximum timestamp you have already processed. There are some potential race conditions you have to be careful of with both the reading/updating of the metadata table as well as the actual data table, but both of those can be managed with transactions and proper time synchronization.
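A minimal sketch of the timestamp-tracking idea, in Python with SQLite standing in for the real database. The `events` and `processed_marker` tables, their columns, and the `fetch_new_rows` function are all hypothetical; in a real setup the spout would call something like this on each poll and emit the returned rows as tuples.

```python
import sqlite3


def fetch_new_rows(conn, source="events"):
    """Return rows newer than the stored high-water mark and advance the mark."""
    with conn:  # the read and the metadata update commit together
        (last_ts,) = conn.execute(
            "SELECT last_ts FROM processed_marker WHERE source = ?", (source,)
        ).fetchone()
        rows = conn.execute(
            "SELECT id, created_at, payload FROM events "
            "WHERE created_at > ? ORDER BY created_at",
            (last_ts,),
        ).fetchall()
        if rows:
            # Remember the largest timestamp handed out so far.
            conn.execute(
                "UPDATE processed_marker SET last_ts = ? WHERE source = ?",
                (rows[-1][1], source),
            )
    return rows


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, created_at INTEGER, payload TEXT)"
)
conn.execute("CREATE TABLE processed_marker (source TEXT PRIMARY KEY, last_ts INTEGER)")
conn.execute("INSERT INTO processed_marker VALUES ('events', 0)")
conn.executemany(
    "INSERT INTO events (created_at, payload) VALUES (?, ?)", [(100, "a"), (101, "b")]
)
first = fetch_new_rows(conn)   # both initial rows
conn.execute("INSERT INTO events (created_at, payload) VALUES (?, ?)", (102, "c"))
second = fetch_new_rows(conn)  # only the row added since the first call
```

Note that the strict `>` comparison would skip a row that arrives later with the same timestamp as the stored mark, which is one of the race conditions mentioned above; a monotonically increasing ID column is a common alternative to a timestamp for that reason.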
Teradata provides queue tables. These tables support a "select and consume" operation, which removes rows from the table as soon as you select them. For more information: http://www.info.teradata.com/htmlpubs/DB_TTU_14_00/index.html#page/SQL_Reference/B035_1146_111A/ch01.032.045.html#ww798205
This approach assumes that the table in Teradata is used as a buffer and nobody else needs it.
If you need both a permanent full table (for some other application) and a stream of this data into Storm, you may want to modify your loading process so that it populates the permanent table as well as a queue table. In this case, other applications can use the full history in the permanent table, and Storm will consume data from the queue table with minimal space impact.

Delete job posting after a set time

I am trying to develop a web application that allows users to post a short job description and set a time limit after which the message should cease to show on a timeline. (NB: The post is not deleted; it only ceases to show up on timelines.) The minimum is 4 hours; the other options are multiples of 4, up to 24 hours. I don't know the best way to approach this. I am thinking of doing some multi-threading, but I am not sure that is the right approach. In essence, I am trying to build something like Snapchat, but text-based.
I would like to know:
Whether I need a special hosting package to host such an application.
Whether multi-threading is a viable option.
What you would do if you were building an app like this.
NB: I am using ASP.NET with C#
You don't need any threading or special processes, just a better database design.
Also, deleting items from a database generally isn't a good idea, instead just modify your design to be like this:
JobPostings( JobPostingId bigint, Title nvarchar, Description nvarchar, VisibleUntil datetime )
then just exclude expired job postings from your queries (NOW() is the MySQL spelling; on SQL Server, use GETDATE() or SYSUTCDATETIME() instead):
SELECT * FROM JobPostings WHERE VisibleUntil >= NOW()
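The same design can be tried end to end with a small sketch. It is written in Python with SQLite purely for illustration (the question uses ASP.NET with C#), the column types are simplified, and the `visible_postings` helper and sample postings are made up for the example. The current time is passed in as a parameter so the filter is easy to exercise.

```python
import sqlite3
from datetime import datetime, timedelta


def visible_postings(conn, now):
    # Expired rows stay in the table; they are only filtered out of the timeline.
    rows = conn.execute(
        "SELECT Title FROM JobPostings WHERE VisibleUntil >= ? ORDER BY VisibleUntil",
        (now.isoformat(),),
    )
    return [title for (title,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE JobPostings ("
    "JobPostingId INTEGER PRIMARY KEY, Title TEXT, Description TEXT, VisibleUntil TEXT)"
)
posted_at = datetime(2024, 1, 1, 12, 0)
for title, hours in [("Painter", 4), ("Courier", 24)]:
    # VisibleUntil is computed once, when the posting is created.
    conn.execute(
        "INSERT INTO JobPostings (Title, Description, VisibleUntil) VALUES (?, ?, ?)",
        (title, "", (posted_at + timedelta(hours=hours)).isoformat()),
    )

after_five_hours = visible_postings(conn, posted_at + timedelta(hours=5))
```

Because visibility is decided at query time, nothing has to run in the background when a posting expires.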

ASP.NET MVC 3 - Web Application - Efficiently Aggregate Data

I am running an ASP.NET MVC 3 web application and would like to gather statistics such as:
How often is a specific product viewed
Which search phrases typically return specific products in their result list
How often (for specific products) does a search result convert to a view
I would like to aggregate this data and break it down:
By product
By product by week
etc.
I'm wondering what are the cleanest and most efficient strategies for aggregating the data. I can think of a couple but I'm sure there are many more:
Insert the data into a staging table, then run a job to aggregate the data and push it into permanent tables.
Use a queuing system (MSMQ/Rhino/etc.) and create a service to aggregate this data before it ever gets pushed to the database.
My concerns are:
I would like to limit the number of moving parts.
I would like to reduce impact on the database. The fewer round trips and less extraneous data stored the better
In certain scenarios (not listed) I would like the data to be somewhat close to real-time (accurate to the hour may be appropriate)
Does anyone have real-world experience with this and, if so, which approach would you suggest and what are the positives and negatives? If there is a better solution that I am not thinking of, I'd love to hear it...
Thanks
JP
I needed to do something similar in a recent project. We've implemented a full audit system in a secondary database; it tracks changes on every record in the live db. Essentially, every insert, update and delete actually updates 2 records, one in the live db and one in the audit db.
Since we have this data in real time in the audit db, we use this second database to fill any reports we might need. One of the tricks I've found when working with a reporting DB is to forget about normalisation. Just create a table for each report you want, and have it carry just the data you want for that report. It's duplicating data, but the performance gains are worth it.
As to filling the actual data in the reports, we use a mixture. Daily reports are generated by a scheduled task at around 3am, ditto for the weekly and monthly reports, normally over weekends or late at night.
Other reports are generated on demand, using mostly the data since the last daily report, so it's not that many records, once again all from the secondary database.
I agree that you should create a separate database for your statistics, it will reduce the impact on your database.
You can go with your idea of having "Staging" tables and "Aggregate" tables; that way, if you want to access the near-real-time data you go to the staging table, and when you want historical data you go to the aggregates.
Finally, I would recommend you use an asynchronous call to save your statistics; that way, recording them will not affect your pages' response time.
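A minimal sketch of the staging/aggregate split, in Python with SQLite instead of SQL Server. The table names (`product_views_staging`, `product_views_weekly`), their columns, and the `roll_up_views` function are assumptions for illustration; the function is what a scheduled job would run to fold staged view events into per-product, per-week counts.

```python
import sqlite3


def roll_up_views(conn):
    """Fold staged view events into weekly per-product counts, then clear staging."""
    with conn:  # the aggregation and the staging cleanup commit together
        grouped = conn.execute(
            "SELECT product_id, strftime('%Y-%W', viewed_at) AS week, COUNT(*) "
            "FROM product_views_staging GROUP BY product_id, week"
        ).fetchall()
        for product_id, week, views in grouped:
            updated = conn.execute(
                "UPDATE product_views_weekly SET views = views + ? "
                "WHERE product_id = ? AND week = ?",
                (views, product_id, week),
            )
            if updated.rowcount == 0:
                # First time this product/week pair is seen.
                conn.execute(
                    "INSERT INTO product_views_weekly (product_id, week, views) "
                    "VALUES (?, ?, ?)",
                    (product_id, week, views),
                )
        conn.execute("DELETE FROM product_views_staging")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_views_staging (product_id INTEGER, viewed_at TEXT)")
conn.execute(
    "CREATE TABLE product_views_weekly ("
    "product_id INTEGER, week TEXT, views INTEGER, PRIMARY KEY (product_id, week))"
)
conn.executemany(
    "INSERT INTO product_views_staging VALUES (?, ?)",
    [(1, "2024-01-01"), (1, "2024-01-03"), (1, "2024-01-08"), (2, "2024-01-02")],
)
roll_up_views(conn)
```

Near-real-time questions can be answered from the staging table before it is rolled up, and historical ones from the weekly table afterwards.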
I suggest that you create a separate database for this. The best way is to use BI techniques; SQL Server has separate services for BI.
