Storing Statistics Data in InfluxDB

We want to analyze the usage of our application and therefore want to store the usage data in InfluxDB. We want to store data like Session Start Time, Browser, Browser Version, OS, Language, Available Languages, etc.
We then want to know, for example, what the top 5 browsers are (as a percentage of sessions or of users), or which OS is used most often.
How would I store this data in InfluxDB so that I can get reports like those described above, or are there better databases for storing such data?

As I understand it, you are not dealing with time series data; your data is structured/relational. You should use one of the relational databases, like MySQL or PostgreSQL.
You can go ahead with InfluxDB if you are dealing with regular or irregular time series data.

Yes, you can use InfluxDB for this. You'd want to store this information as events in the database.
So when a user browses to your application, that would be an event and you'd store something like:
user_events,browser=Chrome,country=UK,os=Linux,language=en_GB,url=/home some_field=1 TIMESTAMP
This is Line Protocol:
https://docs.influxdata.com/influxdb/v1.7/write_protocols/line_protocol_tutorial/
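As a rough illustration, writing such an event with the influxdb Python client (for InfluxDB 1.x) might look like the sketch below; the measurement, tag and field names are just the ones from the example event above, and the connection details are placeholders.

    # Minimal sketch using the influxdb Python client (InfluxDB 1.x).
    # Measurement, tag and field names follow the example event above.
    from influxdb import InfluxDBClient

    client = InfluxDBClient(host="localhost", port=8086, database="app_stats")

    event = {
        "measurement": "user_events",
        "tags": {
            "browser": "Chrome",
            "country": "UK",
            "os": "Linux",
            "language": "en_GB",
            "url": "/home",
        },
        # At least one field is required; a constant makes count() queries easy.
        "fields": {"some_field": 1},
        # Omit "time" to let the server assign the current timestamp.
    }

    client.write_points([event])

From there, a query like SELECT count(some_field) FROM user_events GROUP BY browser gives per-browser session counts, which you can turn into percentages on the client side.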

Related

Is it okay if an operational data store holds data for a rolling time period?

We are planning to build an operational data store (ODS) for the front-end users' data extraction requirements.
As far as I know, Kimball's approach to building an ODS/DW is that it should hold data for the complete time period, not just a rolling time period.
The reason is that there could be a need to extract older data from the ODS/DW.
So I need your thoughts on this. How should I approach it?
I would create a snapshot table that holds the values for the rolling period for each day, and filter on the client side which snapshot to display.
Once the period is over, the final values can be stored in the permanent data mart.
Kimball's approach for a data warehouse is to load transactional data into the warehouse if you can, because it is more flexible in terms of how it can be rolled up later. Certainly at the ODS stage you wouldn't want to 'pre-aggregate' your data if there could be a need to get hold of older data.
If you store both the transactional data and then pre-aggregated versions of the data (in aggregate fact tables, with indexes/views or with a cube, or just filtering on the report side as the other answer suggests), you can get the best of both worlds.
(Note: Kimball's approach in fact does not require an ODS: they're fine if you want to build one, but their focus is on the dimensionally modelled data warehouse.)
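As a rough, database-agnostic sketch of keeping the transactional grain and building the rolling-period snapshot/aggregate from it (Python with sqlite3 purely for illustration; the table and column names are made up):

    # Hypothetical sketch: keep atomic transactions, build a rolling-period
    # snapshot/aggregate from them. Table and column names are illustrative only.
    import sqlite3

    con = sqlite3.connect("ods.db")
    con.execute("""CREATE TABLE IF NOT EXISTS sales_txn (
                       sale_date TEXT, product_id INTEGER, amount REAL)""")
    con.execute("""CREATE TABLE IF NOT EXISTS sales_daily_snapshot (
                       snapshot_date TEXT, product_id INTEGER, total REAL)""")

    def build_snapshot(snapshot_date, period_start):
        """Aggregate the rolling period [period_start, snapshot_date] into the
        snapshot table; older snapshots stay available for client-side filtering."""
        con.execute(
            """INSERT INTO sales_daily_snapshot (snapshot_date, product_id, total)
               SELECT ?, product_id, SUM(amount)
               FROM sales_txn
               WHERE sale_date BETWEEN ? AND ?
               GROUP BY product_id""",
            (snapshot_date, period_start, snapshot_date),
        )
        con.commit()

    build_snapshot("2024-01-31", "2024-01-01")

Because the transactional rows are never thrown away, the snapshot can be rebuilt for any period, which is the flexibility the Kimball-style atomic grain buys you.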

How to use GTFS to record or analyse operating time series?

I may be wrong, but GTFS is mainly used to plan or describe a public transportation system, and GTFS-realtime is mainly used to make realtime operational data available. I think I need something that is not covered by either of these frameworks.
I need to record operational data like how many passengers were transported, how much they paid, when each trip left the initial stop, etc. Data that must be recorded daily and kept in a database for later use.
Does GTFS somehow address this?
Not really. Using a GTFS and a GTFS-realtime feed together you should be able to identify when a trip departed from its origin and whether it was on-time. If your transit agency includes "alert" data in its GTFS-realtime feed you may also be able to identify exceptional events that affect particular trips, such as roadwork or collisions.
Beyond that, I think you will have to look for other sources for the data you need (most likely the transit agency itself).
GTFS data describes the static features of a transit network, including its stops, routes and timetables. A GTFS-realtime feed provides live, operational data, but data of the sort riders can use to know when their bus will be arriving, not data transit operators track internally like ridership and fare revenues.
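For the departure-time part, the gtfs-realtime-bindings package can at least get you the events to record yourself. A minimal sketch (the feed URL is a placeholder, and ridership/fare data is simply not in the feed at all):

    # Sketch: record actual departure times from a GTFS-realtime TripUpdate feed.
    # The feed URL is a placeholder for your agency's trip-updates endpoint.
    import requests
    from google.transit import gtfs_realtime_pb2

    feed = gtfs_realtime_pb2.FeedMessage()
    feed.ParseFromString(
        requests.get("https://example.com/gtfs-rt/trip-updates").content)

    for entity in feed.entity:
        if not entity.HasField("trip_update"):
            continue
        trip = entity.trip_update.trip
        for stu in entity.trip_update.stop_time_update:
            # The lowest stop_sequence is normally the origin stop of the trip.
            if stu.stop_sequence == 1 and stu.HasField("departure"):
                print(trip.trip_id, trip.start_date, stu.departure.time)

Polling a feed like this daily and storing the results is one way to build the operational history the question asks about, but anything beyond schedule adherence has to come from the agency's own systems.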

How much data can a column of a Mnesia table store?

How much data can a column of a Mnesia table store? Is there any limit on it, or can we store as much as we want? Any pointers? (The table is a disc_only_copy.)
As with any potentially large data set (in terms of total entries, not total volume of bytes) the real question isn't how much you can cram into a single table, but how you want to partition the data and how unified or distinct those partitions should appear to the system.
In the context of a chat system, for example, you may want to be able to save the chat history forever, which is a reasonable goal. But you may not want all chat entries to be in the same table forever and ever (10 years? how long? who knows!) right next to chat entries made yesterday. You may also discover as time moves on that storing every chat message in a single table was a painfully naive decision to overcome later on down the road.
So this brings up the issue of partitioning. How do you want to do it? (Staying within the context of a chat system, but easily transferable to another problem...) By time? By channel? By user? By time and channel?
How do you want to locate the data later? This brings up obvious answers that are the same as above: By time? By channel? By user? By time and channel?
This issue exists whether you're dealing with Mnesia or with Postgres -- or any database -- when you're contemplating the storage of lots of entries. So think about your problem in the context of how you want to partition the data.
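The question is about Mnesia/Erlang, but the partitioning idea itself is language-agnostic. A hypothetical sketch (in Python purely for brevity; the same idea maps to Mnesia table name atoms) of deriving a per-channel, per-month partition from an entry's key:

    # Hypothetical sketch: derive the partition a chat message belongs to from
    # its channel and timestamp, so old partitions can be archived independently.
    from datetime import datetime, timezone

    def partition_name(channel: str, ts: datetime) -> str:
        """Partition by channel and by month, e.g. 'chat_general_2024_06'."""
        return f"chat_{channel}_{ts:%Y_%m}"

    print(partition_name("general", datetime(2024, 6, 3, tzinfo=timezone.utc)))
    # -> chat_general_2024_06

Whatever scheme you pick, the write path and the query path both have to be able to compute the partition cheaply, which is why the "how will I locate it later?" question above matters so much.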
The second issue is the volume of the data in bytes, and the most natural representation of that data. Considering basic chat data, it's not that hard to imagine simply plugging everything into the database. But if it's a chat system that can have large files attached within a message, I would probably want to have those files stored as what they are (files) somewhere in a system made for that (like a file system!) and store only a reference to them in the database. If I were creating a movie archive I would certainly feel comfortable using Mnesia to store titles, actors, years, and a pointer (URL or file system path) to the movie, but I wouldn't dream of storing movie file data in my database, even if I were using Postgres (which can actually stand up to that sort of abuse... but think about the new awkwardness of database dumps and backups, and the massive bottleneck introduced in the form of everyone's download/upload speed being whatever the core service's bandwidth to the database backend is!).
In addition to these issues, you want to think about how the data backend will interface with the rest of the system. What is the API you wish you could use? Write it now and think it through to see if it's silly. Once it seems perfect, go back through critically and toss out any elements you don't have an immediate need to actually use right now.
So, that gives us:
Partition scheme
Context of future queries
Volume of data in bytes
Natural state of the different elements of data you want to store
Interface to the overall system you wish you could use
When you start wondering how much data you can put into a database these are the questions you have to start asking yourself.
Now that all that's been written, here is a question that discusses Mnesia in terms of entries, bytes, and how many bytes different types of entries might represent: What is the storage capacity of a Mnesia database?
Mnesia started as an in-memory database, which means it is not designed to store large amounts of data. If you are asking yourself this question, it means you should look at another ejabberd backend.

iOS app with remote server - I don't need data to persist in the app; should I still use Core Data?

Design question:
My app talks to a server. JSON data is being sent and received.
Data on the server is always changing, and I want users to see the most current data, not stored/cached data. So I require a user to be logged in in order to use the app, and don't care about persisting data in the app.
Should I still use Core Data and map the JSON to it?
Or can I just create custom model classes, map the JSON to their properties, and have NSArray properties which point to child objects, etc.?
Which is better?
Thanks
If you don't want to persist data, I personally think Core Data would be overkill for this application.
Core Data is really for local persistence. If the data were not changing so often and you didn't want users to have to fetch updated data every time they visit the page, then you would load the JSON and store it locally using Core Data.
Use plain old objective-c objects for now. It's not hard to switch to Core Data in future, but once you've done so it gets a lot harder to change your schema.
That depends on what your needs are.
If you need the app to work offline, you need to store your information somehow in the client.
In order to save on network usage, you could store locally, then query the server to see if it had an updated answer -- you could do this by sending a time stamp to the server and return a 304 Not Modified if the entity hasn't changed.
Generally, it depends on how much time you have to put into the app and what your specific requirements are, but as a general rule I would optimise for as low bandwidth usage as possible, as that not only reduces potential data costs, but also means the answers will be more quickly available to your users (when online and they have not changed) and also available offline.
If you do not wish to store data locally at all, plain model classes mapped from the JSON are all you need.
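A minimal sketch of the "ask the server whether anything changed" idea mentioned above, using a standard HTTP conditional request (shown in Python rather than Objective-C just to illustrate the mechanics; the URL is a placeholder and the headers are standard HTTP, so the same approach works from an iOS client):

    # Sketch of a conditional request: reuse the local copy on 304 Not Modified.
    import requests

    _cache = {"body": None, "last_modified": None}

    def fetch_latest(url):
        headers = {}
        if _cache["last_modified"]:
            headers["If-Modified-Since"] = _cache["last_modified"]
        resp = requests.get(url, headers=headers)
        if resp.status_code == 304:            # unchanged: reuse the local copy
            return _cache["body"]
        _cache["body"] = resp.json()
        _cache["last_modified"] = resp.headers.get("Last-Modified")
        return _cache["body"]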

ASP.NET MVC 3 - Web Application - Efficiently Aggregate Data

I am running an ASP.NET MVC 3 web application and would like to gather statistics such as:
How often is a specific product viewed
Which search phrases typically return specific products in their result list
How often (for specific products) does a search result convert to a view
I would like to aggregate this data and break it down:
By product
By product by week
etc.
I'm wondering what are the cleanest and most efficient strategies for aggregating the data. I can think of a couple but I'm sure there are many more:
Insert the data into a staging table, then run a job to aggregate the data and push it into permanent tables.
Use a queuing system (MSMQ/Rhino/etc.) and create a service to aggregate this data before it ever gets pushed to the database.
My concerns are:
I would like to limit the number of moving parts.
I would like to reduce the impact on the database. The fewer round trips and the less extraneous data stored, the better.
In certain scenarios (not listed) I would like the data to be somewhat close to real-time (accurate to the hour may be appropriate)
Does anyone have real-world experience with this, and if so, which approach would you suggest, and what are the positives and negatives? If there is a better solution that I am not thinking of, I'd love to hear it...
Thanks
JP
I needed to do something similar in a recent project. We've implemented a full audit system in a secondary database; it tracks changes to every record in the live DB. Essentially every insert, update and delete actually updates 2 records, one in the live DB and one in the audit DB.
Since we have this data in realtime in the audit DB, we use this second database to fill any reports we might need. One of the tricks I've found when working with a reporting DB is to forget about normalisation. Just create a table for each report you want, and have it carry just the data you want for that report. It duplicates data, but the performance gains are worth it.
As to filling the actual data in the reports, we use a mixture. Daily reports are generated by a scheduled task at around 3am; ditto for the weekly and monthly reports, normally over weekends or late at night.
Other reports are generated on demand, using mostly the data since the last daily run, so it's not that many records, once again all from the secondary database.
I agree that you should create a separate database for your statistics; it will reduce the impact on your main database.
You can go with your idea of having "staging" tables and "aggregate" tables; that way, when you want near-real-time data you go to the staging table, and when you want historical data, you go to the aggregates.
Finally, I would recommend you use an asynchronous call to save your statistics; that way your pages will not take a hit in response time.
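The question's stack is ASP.NET, but the "record the hit asynchronously, batch it into the staging table" idea looks roughly the same in any language. A language-neutral sketch (Python here; the helper and names are hypothetical):

    # Sketch: the request path only enqueues; a background worker batches
    # writes into the statistics staging table in one round trip.
    import queue, threading

    stats_queue = queue.Queue()

    def record_product_view(product_id):
        # Called from the request path: O(1), no database round trip.
        stats_queue.put(("product_view", product_id))

    def write_batch_to_staging_table(batch):
        pass  # hypothetical helper: one bulk insert into the staging table

    def worker():
        while True:
            batch = [stats_queue.get()]
            while not stats_queue.empty() and len(batch) < 500:
                batch.append(stats_queue.get())
            write_batch_to_staging_table(batch)

    threading.Thread(target=worker, daemon=True).start()

A scheduled job can then roll the staging rows up into the aggregate tables (by product, by product and week, and so on), keeping the hourly-accuracy requirement without touching the main database on every page view.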
I suggest that you create a separate database for this. The best way is to use BI techniques; SQL Server has separate services for BI.
