I have a system that produces thousands of messages per hour: a location tracking system that gathers events from different devices and performs different calculations based on those messages.
I'm trying to evaluate whether an event store suits this use case. My plan is to associate a stream with each device and accumulate messages in those streams.
Now the question: will I be able to read those messages for a specific time frame in the past? I don't want to replay all the messages from the beginning; I just need fast access to messages from date1 to date2.
Any ideas? So far, what I saw in the docs only covers reading all messages either from the beginning or from the end and filtering along the way. But this pattern doesn't look optimal to me. Am I doing something wrong?
The EventStoreDB index allows you to read events from a specific stream by event number, but not by date. What you wrote about reading from the beginning or the end is not entirely correct, as you can read from any position in the stream, both backwards and forwards, but again that has nothing to do with dates.
Essentially, the date when an event was written to the database is considered unimportant and transient. For example, if you decide to move your data to another store using replication, all the events will get a new date. That's why, if the date is important, it should be stored somewhere in the event data or metadata. EventStoreDB doesn't inspect the event payload (or metadata) and doesn't index it.
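If the device's own timestamp matters to you, one option is to write it into the event metadata and filter on it after reading the stream. Below is a minimal sketch, assuming the esdbclient Python package and a per-device stream named device-42 (names and method signatures may differ between client versions); note that it still scans the stream, because there is no date index to use.

```python
# Sketch only: assumes the esdbclient Python package. EventStoreDB does not
# index by date, so the reader scans the stream and filters on a timestamp
# the writer put into the event metadata.
import json
from datetime import datetime, timezone

from esdbclient import EventStoreDBClient, NewEvent, StreamState

client = EventStoreDBClient(uri="esdb://localhost:2113?tls=false")

# Writer side: store the device timestamp in the event metadata.
event = NewEvent(
    type="LocationReported",
    data=json.dumps({"lat": 52.52, "lon": 13.40}).encode(),
    metadata=json.dumps({"recorded_at": datetime.now(timezone.utc).isoformat()}).encode(),
)
client.append_to_stream("device-42", current_version=StreamState.ANY, events=[event])

# Reader side: scan the device stream and keep events between date1 and date2.
date1 = datetime(2024, 1, 1, tzinfo=timezone.utc)
date2 = datetime(2024, 2, 1, tzinfo=timezone.utc)

for recorded in client.read_stream("device-42"):
    meta = json.loads(recorded.metadata)
    when = datetime.fromisoformat(meta["recorded_at"])
    if date1 <= when <= date2:
        print(recorded.stream_position, recorded.data)
```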
If you are looking for a database that allows you to query records by time, your best bet is a time-series database like Prometheus or InfluxDB. These databases are specifically designed to index primarily by timestamp, and are optimised to store data like sensor readings, where each reading supersedes the previous one. EventStoreDB is not designed for that purpose; it's a database built to support event-sourced applications, and sensor readings are not that.
I was going to start developing programs with Google Cloud Pub/Sub. Just wanted to confirm this once.
From the Beam documentation, data loss can only occur if data is declared late by Pub/Sub. Is it safe to assume that the data will always be delivered without any message drops (late data) when using a global window?
From the concepts of watermark and lateness, I have come to the conclusion that these metrics are critical when custom windowing with event-based triggers is applied over the data being received.
When you're working with streaming data, choosing a global window basically means that you are going to completely ignore event time. Instead, you will be taking snapshots of your data in processing time (that is, as elements arrive) using triggers. Therefore, you can no longer define data as "late" (nor "early" or "on time", for that matter).
You should choose this approach if you are not interested in the time at which these events actually happened but, instead, you just want to group them according to the order in which they were observed. I would suggest that you go through this great article on streaming data processing, especially the part under When/Where: Processing-time windows, which includes some nice visuals comparing different windowing strategies.
I have a Dataflow pipeline that collects user data like navigation, purchases, CRUD actions, etc. I have a requirement to identify patterns in real time and then dispatch Pub/Sub events that other services can listen to, in order to give the user real-time tips, offers or promotions.
I'm thinking of grouping the events by user id and then, if they match a pattern, creating a PCollection that contains the event names that need to be triggered via Pub/Sub.
Is this the right approach? Is there a better way?
This could certainly work for some use cases.
If you use session-based windowing in combination with early firings (triggering upon the arrival of each element), you can have all the data needed to identify patterns each time a new element arrives.
However, depending on the rate at which user data is pushed and the size of the session, this might result in holding a lot of data in the PCollection and repeating the pattern matching many times (on the same data), since you have to reuse all the data in the session. Furthermore, you cannot use elements that arrived before this session.
Sometimes you might be better off keeping a state for each user (without redoing the pattern matching on all of the user's data for the session). Using state would in fact remove the need to work with windowing.
The new process would now look like this:
For each element that arrives:
Fetch the current state
Calculate the new state (based on the old state and the new element)
If needed, emit a message to PubSub.
To hold your state, you could use BigTable or Datastore.
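For illustration, here is a rough sketch of that loop using Beam's built-in per-key state as an alternative to an external store like Bigtable or Datastore; the pattern check and the emitted payload are placeholders, not a real API.

```python
# Sketch only: per-user state via Beam's state API instead of an external
# store such as Bigtable or Datastore. The pattern check is a placeholder.
import json

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec


class DetectPattern(beam.DoFn):
    # One counter per key (user id); a real pipeline might keep richer state.
    COUNT = ReadModifyWriteStateSpec("event_count", VarIntCoder())

    def process(self, element, count_state=beam.DoFn.StateParam(COUNT)):
        user_id, event_name = element  # input must be (user_id, event) pairs

        # 1. Fetch the current state.
        count = count_state.read() or 0
        # 2. Calculate the new state from the old state and the new element.
        count += 1
        count_state.write(count)
        # 3. If needed, emit a message (to be published to Pub/Sub downstream).
        if count >= 3:  # placeholder "pattern"
            yield json.dumps({"user_id": user_id, "offer": "loyalty_tip"}).encode()


with beam.Pipeline() as p:
    (
        p
        | beam.Create([("u1", "purchase"), ("u1", "purchase"), ("u1", "purchase")])
        | beam.ParDo(DetectPattern())
        # In a streaming pipeline this would be beam.io.WriteToPubSub(topic=...).
        | beam.Map(print)
    )
```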
I may be wrong, but GTFS is mainly used to plan or describe a public transportation system, and GTFS-realtime is mainly used to make real-time operational data available. I think I need something that is not covered by either of these formats.
I need to record operational data like how many passengers were transported, how much they paid, when each trip left its initial stop, etc. This data must be recorded daily and kept in a database for later use.
Does GTFS somehow address this?
Not really. Using a GTFS and a GTFS-realtime feed together, you should be able to identify when a trip departed from its origin and whether it was on time. If your transit agency includes "alert" data in its GTFS-realtime feed, you may also be able to identify exceptional events that affect particular trips, such as roadwork or collisions.
Beyond that, I think you will have to look for other sources for the data you need (most likely the transit agency itself).
GTFS data describes the static features of a transit network, including its stops, routes and timetables. A GTFS-realtime feed provides live, operational data, but data of the sort riders can use to know when their bus will arrive, not the data transit operators track internally, like ridership and fare revenue.
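As an illustration of what a GTFS-realtime feed does give you, here is a small sketch using the gtfs-realtime-bindings Python package; the feed URL is a placeholder, and ridership or fare figures simply are not part of the feed.

```python
# Sketch only: reads a GTFS-realtime TripUpdates feed (URL is a placeholder)
# and prints the departure delay at each trip's first listed stop.
import requests
from google.transit import gtfs_realtime_pb2

FEED_URL = "https://example.com/gtfs-rt/trip-updates"  # placeholder

feed = gtfs_realtime_pb2.FeedMessage()
feed.ParseFromString(requests.get(FEED_URL).content)

for entity in feed.entity:
    if not entity.HasField("trip_update"):
        continue
    update = entity.trip_update
    if update.stop_time_update:
        first_stop = update.stop_time_update[0]
        if first_stop.HasField("departure"):
            print(update.trip.trip_id, "departure delay (s):", first_stop.departure.delay)
```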
How much data can a column of Mnesia store? Is there any limit on it, or can we store as much as we want? Any pointers? (The table is a disc_only_copy.)
As with any potentially large data set (in terms of total entries, not total volume of bytes), the real question isn't how much you can cram into a single table, but how you want to partition the data and how unified or distinct those partitions should appear to the system.
In the context of a chat system, for example, you may want to be able to save the chat history forever, which is a reasonable goal. But you may not want all chat entries to be in the same table forever and ever (10 years? how long? who knows!) right next to chat entries made yesterday. You may also discover, as time moves on, that storing every chat message in a single table was a painfully naive decision that you'll have to overcome later on down the road.
So this brings up the issue of partitioning. How do you want to do it? (Staying within the context of a chat system, but easily transferable to another problem...) By time? By channel? By user? By time and channel?
How do you want to locate the data later? This brings up obvious answers that are the same as above: By time? By channel? By user? By time and channel?
This issue exists whether you're dealing with Mnesia or with Postgres -- or any database -- when you're contemplating the storage of lots of entries. So think about your problem in the context of how you want to partition the data.
The second issue is the volume of the data in bytes, and the most natural representation of that data. Considering basic chat data, it's not that hard to imagine simply plugging everything into the database. But if it's a chat system that can have large files attached to a message, I would probably want to have those files stored as what they are (files) somewhere in a system made for that (like a file system!) and store only a reference to them in the database. If I were creating a movie archive, I would certainly feel comfortable using Mnesia to store titles, actors, years, and a pointer (URL or file system path) to the movie, but I wouldn't dream of storing movie file data in my database, even if I were using Postgres (which can actually stand up to that sort of abuse... but think about the new awkwardness of database dumps and backups, and the massive bottleneck introduced in the form of everyone's download/upload speed being whatever the core service's bandwidth to the database backend is!).
In addition to these issues, you want to think about how the data backend will interface with the rest of the system. What is the API you wish you could use? Write it now and think it through to see if it's silly. Once it seems perfect, go back through critically and toss out any elements you don't have an immediate need to actually use right now.
So, that gives us:
Partition scheme
Context of future queries
Volume of data in bytes
Natural state of the different elements of data you want to store
Interface to the overall system you wish you could use
When you start wondering how much data you can put into a database, these are the questions you have to start asking yourself.
Now that all that's been written, here is a question that discusses Mnesia in terms of entries, bytes, and how many bytes different types of entries might represent: What is the storage capacity of a Mnesia database?
Mnesia started as an in-memory database. This means that it is not designed to store large amounts of data. When you ask yourself this question, it means you should look at another ejabberd backend.
I'm building an app that needs to store a fair number of events that users carry out. (Think LOTS, as in millions per month.)
I need to report on these events (total of type x in the last month, etc.) and need something resilient and fast.
I've toyed with Redis and the like to store aggregates of the data, but this could just mean that I'm building up a massive store of single-figure aggregates that aren't rebuildable.
Whilst this isn't a bad solution, I'm looking at storing the raw event data in tables that I can then query as needed, and potentially generating aggregate counters on a periodic basis. This would give me the ability to add counters over time, and also to carry out ad hoc inspections of what is going on, something which aggregates alone don't allow.
The question is, what's the best way to do this? I obviously don't want to have to create a model for each table (which is what Rails would prefer), so do I just create the tables and interact with them via raw SQL as needed, or is there some other choice for dealing with this sort of data?
I've worked on an app that had that type of data flow, and the solution was the following:
-> store everything
-> create aggregates
-> delete everything after a short period (1 week or something) to free up resources
So you can simply store events with Rails, have aggregates built in the background by another fast script (cron + SQL), read the aggregates with Rails, and have yet another background script handle raw event deletion.
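For illustration, here is a sketch of that background aggregate/cleanup job as a standalone script; the table and column names are made up, and sqlite3 stands in for whatever database the Rails app actually uses.

```python
# Sketch only: the periodic "build aggregates, then purge old raw rows" job.
# Table and column names are made up; sqlite3 stands in for the app's real DB.
import sqlite3

conn = sqlite3.connect("events.db")
conn.execute("CREATE TABLE IF NOT EXISTS raw_events "
             "(event_type TEXT, payload TEXT, created_at TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS event_daily_counts "
             "(day TEXT, event_type TEXT, total INTEGER)")

# Roll yesterday's raw events up into per-type daily counters.
conn.execute(
    """
    INSERT INTO event_daily_counts (day, event_type, total)
    SELECT date(created_at), event_type, COUNT(*)
    FROM raw_events
    WHERE date(created_at) = date('now', '-1 day')
    GROUP BY date(created_at), event_type
    """
)

# Delete raw events older than a week to free up resources.
conn.execute("DELETE FROM raw_events WHERE created_at < date('now', '-7 days')")

conn.commit()
conn.close()
```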
Also... Rails and performance don't usually go hand in hand ;)