How to use GTFS to record or analyse operating time series? - time-series

I may be wrong, but GTFS is mainly used to plan or describe a public transportation system, and GTFS-realtime is mainly used to make realtime operational data available. I think I need something that is not covered by either of these frameworks.
I need to record operational data such as how many passengers were transported, how much they paid, when each trip left its initial stop, etc. This data must be recorded daily and kept in a database for later use.
Does GTFS somehow address this?

Not really. Using a GTFS feed and a GTFS-realtime feed together, you should be able to identify when a trip departed from its origin and whether it was on time. If your transit agency includes "alert" data in its GTFS-realtime feed, you may also be able to identify exceptional events that affect particular trips, such as roadwork or collisions.
Beyond that, I think you will have to look for other sources for the data you need (most likely the transit agency itself).
GTFS data describes the static features of a transit network, including its stops, routes and timetables. A GTFS-realtime feed provides live, operational data, but only the sort of data riders can use to know when their bus will arrive, not the data transit operators track internally, such as ridership and fare revenue.
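For illustration, here is a minimal sketch of polling a GTFS-realtime TripUpdates feed and logging actual origin departures, assuming the gtfs-realtime-bindings Java library; the feed URL and the stop-sequence check are placeholders you would adapt to your agency's feed.

import com.google.transit.realtime.GtfsRealtime.FeedEntity;
import com.google.transit.realtime.GtfsRealtime.FeedMessage;
import com.google.transit.realtime.GtfsRealtime.TripUpdate;
import java.net.URL;

public class DepartureLogger {
    public static void main(String[] args) throws Exception {
        // Hypothetical feed URL; use your agency's TripUpdates endpoint.
        URL feedUrl = new URL("https://example.org/gtfs-realtime/trip-updates");
        FeedMessage feed = FeedMessage.parseFrom(feedUrl.openStream());
        for (FeedEntity entity : feed.getEntityList()) {
            if (!entity.hasTripUpdate()) continue;
            TripUpdate update = entity.getTripUpdate();
            for (TripUpdate.StopTimeUpdate stu : update.getStopTimeUpdateList()) {
                // Many feeds number the origin stop as stop_sequence 1; check your GTFS stop_times.txt.
                if (stu.getStopSequence() == 1 && stu.hasDeparture() && stu.getDeparture().hasTime()) {
                    // Persist (trip_id, departure time) to your own database for later analysis.
                    System.out.printf("trip %s departed its origin at %d%n",
                            update.getTrip().getTripId(), stu.getDeparture().getTime());
                }
            }
        }
    }
}

You would run something like this on a schedule and join the results against the static GTFS trips and stop_times files to build your daily operational records; ridership and fare data would still have to come from the agency itself.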

Related

EventStore - read specific time frame from stream

I have a system that produces thousands of messages per hour: a location-tracking system that gathers events from different devices and does different calculations based on those messages.
I'm trying to evaluate whether EventStore suits this use case. My plan is to associate a stream with each device and accumulate messages in those streams.
Now the question: will I be able to read those messages for a specific time frame in the past? I don't want to replay all the messages from the beginning; I just need fast access to messages from date1 to date2.
Any ideas? So far, what I saw in the docs only covers reading all messages either from the beginning or from the end and filtering during the process. But this pattern doesn't look optimal to me. Am I doing something wrong?
The EventStoreDB index allows you to read events from a specific stream by event number, but not by date. What you wrote about reading from the beginning or the end is not entirely correct: you can read from any position in the stream, both backwards and forwards, but again that has nothing to do with dates.
Essentially, the date when the event was written to the database is considered unimportant and transient. For example, if you decide to move your data to another store using replication, all the events will get a new date. That's why, if the date is important, it should be stored somewhere in the event data or metadata. EventStoreDB doesn't inspect the event payload (or metadata) and doesn't index it.
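To illustrate that point, here is a minimal sketch of carrying the business timestamp inside the event payload so it survives replication or re-import; the class, its fields and the appendToStream/toJson helpers are hypothetical stand-ins, since the real append call depends on which EventStoreDB client you use.

import java.time.Instant;
import java.util.UUID;

// Hypothetical event payload: occurredAt carries the business timestamp,
// because the database's own write date is transient and not indexed.
class PositionReported {
    UUID deviceId;
    double lat;
    double lon;
    Instant occurredAt; // filter on this field when reading the stream back

    PositionReported(UUID deviceId, double lat, double lon, Instant occurredAt) {
        this.deviceId = deviceId;
        this.lat = lat;
        this.lon = lon;
        this.occurredAt = occurredAt;
    }
}

// In the write path, something along these lines (appendToStream and toJson are placeholders
// for your client library's append call and your JSON serializer):
//   appendToStream("device-" + deviceId, "PositionReported",
//           toJson(new PositionReported(deviceId, lat, lon, Instant.now())));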
If you are looking for a kind of database that lets you query records by time, your best bet is a time-series database such as Prometheus or InfluxDB. These databases are specifically designed to index primarily by timestamp and are optimised for data like sensor readings, where each reading replaces the previous one. EventStoreDB is not designed for that purpose; it is a database built to support event-sourced applications, and sensor readings are not that.

Getting Granular Data from Google Analytics to enable Machine Learning applications

In the context of Google Analytics, I wonder if I can get granular data for an account in the form of a table (or multiple tables that could be joined) containing all relevant information collected per user and then per session.
For each user there should be rows describing in detail the activities and outcomes (micro and macro) of each session. Features would include source, time of visit, duration of visit, pages visited, time per page, goal conversions, etc.
Having the raw data in a granular form would enable me to apply machine learning algorithms that would help me explore the data and optimize decisions (web design, budget allocation, bidding).
This is possible, though not by default. You will need to set up custom dimensions to identify individual clients, sessions, and timestamps so that you can get row-level user data rather than pre-aggregated data. A good place to start is https://www.simoahava.com/analytics/improve-data-collection-with-four-custom-dimensions/
There is no way to collect all data per user in one simple query. You will need to run multiple queries, pivot the tables, and merge the results to get the full dataset you are envisaging.
Beyond that, there is also the problem of downloading the data.
1) There is a 10,000 row limit, so you will need to make a loop to download all available rows.
2) Depending on your traffic, you are likely to encounter sampled data, so you will need to download the data per day, or per hour, to avoid Google Analytics sampling (see the sketch after this list).
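Here is a rough sketch of that download loop; fetchDailyReport, ReportPage and Row are hypothetical placeholders standing in for whatever Reporting API client wrapper you use, not the actual Google client API.

import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class GaExporter {
    // Placeholder types: a real implementation would map these onto the Reporting API response.
    record Row(List<String> dimensions, List<String> metrics) {}
    record ReportPage(List<Row> rows, String nextPageToken) {}

    // Hypothetical helper wrapping a single Reporting API request for one day and one page.
    static ReportPage fetchDailyReport(LocalDate day, String pageToken) {
        throw new UnsupportedOperationException("wrap your Reporting API client here");
    }

    public static void main(String[] args) {
        List<Row> allRows = new ArrayList<>();
        LocalDate start = LocalDate.of(2023, 1, 1), end = LocalDate.of(2023, 1, 31); // illustrative range
        // Query one day at a time to reduce the chance of sampled data...
        for (LocalDate day = start; !day.isAfter(end); day = day.plusDays(1)) {
            String pageToken = null;
            do { // ...and page through each day's results to get past the 10,000 row limit.
                ReportPage page = fetchDailyReport(day, pageToken);
                allRows.addAll(page.rows());
                pageToken = page.nextPageToken();
            } while (pageToken != null);
        }
        // allRows can now be keyed on the custom client/session dimensions and exported for ML.
    }
}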

How much data can a column of an Mnesia table store

How much data can a column of an Mnesia table store? Is there any limit, or can we store as much as we want? Any pointers? (The table is a disc_only_copy.)
As with any potentially large data set (in terms of total entries, not total volume of bytes), the real question isn't how much you can cram into a single table, but how you want to partition the data and how unified or distinct those partitions should appear to the system.
In the context of a chat system, for example, you may want to be able to save the chat history forever, which is a reasonable goal. But you may not want all chat entries to be in the same table forever and ever (10 years? how long? who knows!) right next to chat entries made yesterday. You may also discover as time moves on that storing every chat message in a single table was a painfully naive decision that you will have to overcome later on down the road.
So this brings up the issue of partitioning. How do you want to do it? (Staying within the context of a chat system, but easily transferable to another problem...) By time? By channel? By user? By time and channel?
How do you want to locate the data later? This brings up obvious answers that are the same as above: By time? By channel? By user? By time and channel?
This issue exists whether you're dealing with Mnesia or with Postgres -- or any database -- when you're contemplating the storage of lots of entries. So think about your problem in the context of how you want to partition the data.
The second issue is the volume of the data in bytes, and the most natural representation of that data. Considering basic chat data, it's not that hard to imagine simply plugging everything into the database. But if it's a chat system that can have large files attached within a message, I would probably want to have those files stored as what they are (files) somewhere in a system made for that (like a file system!) and store only a reference to them in the database. If I were creating a movie archive, I would certainly feel comfortable using Mnesia to store titles, actors, years, and a pointer (URL or file system path) to the movie, but I wouldn't dream of storing movie file data in my database, even if I were using Postgres (which can actually stand up to that sort of abuse... but think about the new awkwardness of database dumps and backups, and the massive bottleneck introduced in the form of everyone's download/upload speed being whatever the core service's bandwidth to the database backend is!).
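To make the "store a pointer, not the payload" idea concrete, here is a small illustrative record (written in Java purely as an example; the same shape applies to an Erlang record stored in Mnesia), with hypothetical field and table names.

import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;

// Illustrative chat-message record: the attachment itself lives on a file store,
// the database row only keeps a reference plus the small, queryable fields.
class ChatMessage {
    String channel;        // candidate partition key
    String author;
    Instant sentAt;        // candidate partition key (e.g. one table per month)
    String body;           // small text payload: fine to store inline
    String attachmentPath; // a file-system path or URL, never the file bytes themselves
}

// A hypothetical time-based partitioning helper: derive the table name from the timestamp,
// so old months can be archived or dropped without touching the current table.
class ChatTables {
    static String tableFor(Instant sentAt) {
        YearMonth ym = YearMonth.from(sentAt.atZone(ZoneOffset.UTC));
        return "chat_messages_" + ym; // e.g. chat_messages_2024-01
    }
}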
In addition to these issues, you want to think about how the data backend will interface with the rest of the system. What is the API you wish you could use? Write it now and think it through to see if it's silly. Once it seems perfect, go back through critically and toss out any elements you don't have an immediate need to actually use right now.
So, that gives us:
Partition scheme
Context of future queries
Volume of data in bytes
Natural state of the different elements of data you want to store
Interface to the overall system you wish you could use
When you start wondering how much data you can put into a database these are the questions you have to start asking yourself.
Now that all that's been written, here is a question that discusses Mnesia in terms of entries, bytes, and how many bytes different types of entries might represent: What is the storage capacity of a Mnesia database?
Mnesia started as an in-memory database, which means it is not designed to store large amounts of data. If you are asking yourself this question, you should look at another ejabberd backend.

Correction of historical ADT Data in HL7 V2

Just out of curiosity
How do you transmit corrections of historic ADT data in HL7 V2, e.g. a patient transfer, if you do not have a ZBE segment for historic movements as is used in Germany?
Do you cancel all relevant events and build a new patient history or do you use some of the already defined fields of the segments of ADT messages to mark the event, that should be corrected?
How do you deal with multiple transfers ward A --> ward B --> ward A --> ward B?
ADT^A02 would be the "correct" way to transfer a patient from one room/bed to another. However, as #Sid stated, I cannot recall a time when I've ever seen an ADT^A02 implemented in the real world.
This is most likely due to changing status/demographics while transferring a patient. Most of the time there will be a specific reason why the transfer is happening: moving from outpatient to inpatient (or vice versa), a change in facility, etc. It's much easier to bunch this information into one ADT^A08 than to send both an ADT^A02 and an ADT^A08 to satisfy these constraints.
If the transfer information is erroneous, as you've stated in the comment above, then a transfer cancellation will need to be triggered: ADT^A12. Again, this is another one that I've rarely seen used, but if the transfer was done accidentally or incorrectly, you wouldn't want to keep that information in the system. You'd want to get rid of it and only have the correct information updated.
Since your Health/Hospital Information System (HIS) is usually the same system in which your patient census is done, blasting this cancellation message out to every individually connected system isn't usually worth doing, because most specialty applications attached to the HIS couldn't care less about a patient's previous room/bed; they only want the most current information. Because of this, again, an ADT^A08 is more widely used.
Previous Room/Bed information is usually kept up by the HIS from the application standpoint. When a patient room/bed is updated, it will write the current PatientRoom or PatientBed information in the database columns to something like PreviousRoom or PreviousBed. I've seen this implemented down to the "Previous-Previous" Room and Bed. It will then write the new room/bed to the PatientRoom or PatientBed.
This is done so that the HIS can locally process cancellations of transfers. Most of the time these cancellations are done directly in the HIS by a user, and then an ADT^A08 is sent out to the appropriate interoperable applications using the new room/bed, with the connected applications none the wiser about what the previous room and bed were, or that the transfer may have been cancelled and redone. If a transfer is cancelled in the HIS, it reverts to the values in the PreviousRoom and PreviousBed database columns and updates accordingly.
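As a rough illustration of that column-shifting approach (the class, field and method names are hypothetical, not taken from any particular HIS):

// Hypothetical census record inside the HIS: updating the location shifts the
// current values into the "previous" columns so a local transfer cancellation can revert them.
class PatientCensus {
    String patientRoom, patientBed;
    String previousRoom, previousBed;

    void transferTo(String newRoom, String newBed) {
        previousRoom = patientRoom;
        previousBed = patientBed;
        patientRoom = newRoom;
        patientBed = newBed;
        // ...then broadcast an ADT^A08 with the new room/bed to downstream systems.
    }

    void cancelTransfer() {
        patientRoom = previousRoom;
        patientBed = previousBed;
        // ...again followed by an ADT^A08, so downstream systems never see the cancellation itself.
    }
}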
Hope this didn't confuse you too much.
TL;DR - The HIS is predominantly the only system that cares about previous room/bed data. ADT^A08 is what's used the majority of the time to update patient room/bed information, even though that is wrong according to the standard.

Implementing offline Item based recommendation using Mahout

I am trying to add recommendations to our e-commerce website using Mahout. I have decided to use an item-based recommender; I have around 60K products, 200K users and 4M user-product preferences. I am looking for a way to provide recommendations by calculating the item similarities offline, so that the recommender.recommend() method returns results in under 100 milliseconds.
DataModel dataModel = new FileDataModel(new File("/FilePath")); // FileDataModel expects a java.io.File, not a String path
_itemSimilarity = new TanimotoCoefficientSimilarity(dataModel);
_recommender = new CachingRecommender(new GenericBooleanPrefItemBasedRecommender(dataModel, _itemSimilarity));
I was hoping someone could point me to a method or a blog post to help me understand the procedure and challenges of computing the item similarities offline. Also, what is the recommended way of storing the pre-computed item similarities: should they be stored in a separate DB, or in memcache?
PS - I plan to refresh the user-product preference data every 10-12 hours.
MAHOUT-1167 introduced a way to calculate similarities in parallel on a single machine into the (soon to be released) Mahout 0.8 trunk. I'm just mentioning it so you keep it in mind.
If you are just going to refresh the user-product preference data every 10-12 hours, you are better off having a batch process that stores these precomputed recommendations somewhere and then delivers them to the end user from there. I cannot give detailed information or advice, because this will vary greatly according to many factors, such as your current architecture, software stack, network capacity and so on. In other words, in your batch process, just run over all your users, ask for 10 recommendations for every one of them, then store the results somewhere to be delivered to the end user.
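For example, here is a minimal sketch of such a batch job using the Taste classes from the question; the data file path is the placeholder from the question, and the "store somewhere" step is left as a comment because it depends on your serving store.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class RecommendationBatchJob {
    public static void main(String[] args) throws Exception {
        DataModel dataModel = new FileDataModel(new File("/FilePath")); // same preference file as in the question
        Recommender recommender = new CachingRecommender(
                new GenericBooleanPrefItemBasedRecommender(dataModel,
                        new TanimotoCoefficientSimilarity(dataModel)));

        LongPrimitiveIterator userIds = dataModel.getUserIDs();
        while (userIds.hasNext()) {
            long userId = userIds.nextLong();
            List<RecommendedItem> items = recommender.recommend(userId, 10);
            // Store (userId -> recommended item IDs) in your serving store (DB, memcache, ...)
            // so the web tier only does a key lookup at request time.
            for (RecommendedItem item : items) {
                System.out.println(userId + "\t" + item.getItemID() + "\t" + item.getValue());
            }
        }
    }
}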
If you need a response within 100 milliseconds, it's better to do batch processing in the background on your server, and that may include the following jobs.
Fetching data from your own user database (60K products, 200K users and 4M user-product preferences).
Prepare your data model based on the nature of your data (number of parameters, size of data, preference values, etc.). This could be an important step.
Run algorithm on the data model (need to choose the right algorithm according to your requirement). Recommendation data is available here.
May need to process the resultant data as per the requirement.
Store this data in a database (it is NoSQL in all my projects).
The above steps should be running periodically as a batch process.
Whenever a user requests for recommendations, your service provides a response by reading the recommendation data from the pre-calculated DB.
You may look at Apache Mahout (for recommendations) for this kind of task.
These are the steps in brief... Hope this helps!
