Query high-frequency Firebase data at a lower frequency - firebase-realtime-database

We're currently logging measurements to a Firebase database every 3 seconds.
I want to graph the data over different periods. Sometimes that's 5 minutes, in which case 3-second resolution is fine (~100 points). However, if I want to see how it changes over 12 hours at the 3-second resolution, I'll have 14,400 points.
For the longer time periods, I want to drop the resolution to reduce the number of data points.
As we're using Firebase there's no backend to query the database and then filter the data; it's the UI that queries the DB, so the downsampling has to happen in the query.
Are there any standard methodologies for handling this (or common names for it that I can search on)?
Does anyone know a Firebase-specific solution that works at query time?
Is it best to record whether a sample is 1 min, 5 min, 10 min or 1 hr data when it is first saved? (This is a less preferred option, as the data is sent to Firebase from a small ESP8266 microcontroller without a huge amount of memory.)
Many thanks in advance
Example data:

In the end I went with a variant of the third option, pushing the data into three different locations depending on the time interval: 3 sec, 1 min and 10 min.
This means I'm not sending up any extra information, just sending each sample to a different location.
When doing the queries I'm limited to those fixed intervals, but I can simply query whichever of the three locations matches the period I'm graphing.
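For reference, here is roughly what the query side looks like with the Firebase Android SDK. This is only a sketch: the paths logs/3sec, logs/1min and logs/10min, and keys that are zero-padded epoch-millisecond timestamps, are assumptions for illustration rather than anything Firebase mandates.

import com.google.firebase.database.DataSnapshot;
import com.google.firebase.database.DatabaseError;
import com.google.firebase.database.DatabaseReference;
import com.google.firebase.database.FirebaseDatabase;
import com.google.firebase.database.Query;
import com.google.firebase.database.ValueEventListener;

public class MeasurementReader {

    private final DatabaseReference root = FirebaseDatabase.getInstance().getReference();

    // Pick the coarsest location that still gives roughly the desired point count.
    private String locationFor(long windowMillis) {
        if (windowMillis <= 5L * 60 * 1000) return "logs/3sec";      // ~100 points over 5 min
        if (windowMillis <= 2L * 60 * 60 * 1000) return "logs/1min";
        return "logs/10min";
    }

    public void loadWindow(long fromMillis, long toMillis) {
        // Keys are assumed to be zero-padded epoch-millis strings, so lexical order == time order.
        Query query = root.child(locationFor(toMillis - fromMillis))
                .orderByKey()
                .startAt(String.format("%013d", fromMillis))
                .endAt(String.format("%013d", toMillis));

        query.addListenerForSingleValueEvent(new ValueEventListener() {
            @Override
            public void onDataChange(DataSnapshot snapshot) {
                for (DataSnapshot sample : snapshot.getChildren()) {
                    // feed sample.getKey() / sample.getValue() into the chart
                }
            }

            @Override
            public void onCancelled(DatabaseError error) {
                // log the error
            }
        });
    }
}

The ESP8266 side stays simple: it writes every sample to the 3 sec location and only writes to the 1 min and 10 min locations when those intervals tick over.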

Related

Firebase Realtime Database load issues / brownout

We experience intermittent, seemingly random brownouts of a Firebase Realtime Database. We are beginning to shard our data into multiple databases; however, we are not sure this will solve our problem. It appears to us that Firebase cannot scale to meet our needs in terms of doing frequent writes to a specific data set.
We sync data from a third-party data source in cycles (every 4-10 minutes, 1000 active jobs). Each update has the potential to change a few thousand nodes in Firebase, most of which sit fairly low in the tree; however, most of the time the number of low-level nodes actually changed is much smaller. We do differential updates on the synced data in order to keep the writes to the lower-level nodes very small, which helps prevent our users from downloading a ton of additional data. We also batch all of our updates per cycle into only a handful of writes, between 10 and 20 (we're not sure of the performance impact of a batched write to multiple nodes vs. a write to a single node).
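For concreteness, each of those 10-20 batched writes is a single multi-path update along the lines of the sketch below (written here against the Firebase Admin SDK for Java purely for illustration; the paths and values are made up):

import com.google.firebase.database.DatabaseReference;
import com.google.firebase.database.FirebaseDatabase;
import java.util.HashMap;
import java.util.Map;

public class DifferentialSync {

    // Commit all leaf-level changes detected in one sync cycle as a single
    // multi-path update instead of thousands of individual writes.
    public void pushCycleDiff(Map<String, Object> changedLeaves) {
        // changedLeaves maps full paths to new values, e.g.
        // "jobs/1234/status/lastRun" -> "2019-05-01T09:00:00Z"   (illustrative only)
        DatabaseReference root = FirebaseDatabase.getInstance().getReference();
        root.updateChildrenAsync(new HashMap<>(changedLeaves));
    }
}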
Here is an image of the database load graph, which includes some sharding:
[Image: database load graph]
The blue line is our "main" database. The orange line is a database containing only the data that requires many writes, as described above. Currently, the main (blue) database is supporting normal operations, including reads and writes; the shard (orange) database is only handling writes. The way the two lines mirror each other is pretty indicative of a "write" load issue, given that a large percentage of writes occurs in the morning.
At times, the database load reaches 100% and remains in this state for 30+ minutes.
Please let me know if I can expand on anything or explain anything in more detail. Would appreciate any suggestions on debugging strategies or explanations as to why this may be occurring.
We are actively refactoring a lot of code to mitigate this issue; however, it is not obvious what the main driver is.

Getting Granular Data from Google Analytics to enable Machine Learning applications

In the context of Google Analytics, I wonder whether I can get granular data for an account in the form of a table (or multiple tables that could be joined) containing all relevant information collected per user and then per session.
For each user there should be rows describing in detail the activities and outcomes, micro and macro, of each session. Features would include source, time of visit, duration of visit, pages visited, time per page, goal conversions etc.
Having the raw data in granular form would enable me to apply machine learning algorithms that would help me explore the data and optimize decisions (web design, budget allocation, bidding).
This is possible, however not by default. You will need to set up custom dimensions to identify individual clients, sessions and timestamps, so that you can get row-level user data rather than pre-aggregated data. A good place to start is https://www.simoahava.com/analytics/improve-data-collection-with-four-custom-dimensions/
There is no way to collect all data per user in one simple query. You will need to run multiple queries, pivot tables, etc. and merge them to get the full dataset you are currently envisaging.
Beyond the problem you currently have, there is also the problem of downloading the data.
1) There is a 10,000-row limit per request, so you will need to loop to download all available rows (see the sketch below).
2) Depending on your traffic, you are likely to encounter sampled data, so you will need to download the data per day, or per hour, to avoid Google Analytics sampling.
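Here is a rough sketch of that download loop, using the Core Reporting API v3 Java client purely as an example. It assumes you already have an authorized Analytics instance; the view ID, date, metrics and custom-dimension indexes are placeholders you would swap for your own:

import com.google.api.services.analytics.Analytics;
import com.google.api.services.analytics.model.GaData;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class GaDownloader {

    private static final int PAGE_SIZE = 10000; // hard limit per request

    // Download every row for a single day to stay under the sampling thresholds.
    public List<List<String>> downloadDay(Analytics analytics, String viewId, String day)
            throws IOException {
        List<List<String>> allRows = new ArrayList<>();
        int startIndex = 1;
        GaData page;
        do {
            page = analytics.data().ga()
                    .get(viewId, day, day, "ga:sessions,ga:pageviews,ga:sessionDuration")
                    .setDimensions("ga:dimension1,ga:dimension2,ga:pagePath") // clientId/sessionId custom dims
                    .setStartIndex(startIndex)
                    .setMaxResults(PAGE_SIZE)
                    .execute();
            if (page.getRows() != null) {
                allRows.addAll(page.getRows());
            }
            startIndex += PAGE_SIZE;
        } while (page.getRows() != null && page.getRows().size() == PAGE_SIZE);
        return allRows;
    }
}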

Implementing offline Item based recommendation using Mahout

I am trying to add recommendations to our e-commerce website using Mahout. I have decided to use an item-based recommender; I have around 60K products, 200K users and 4M user-product preferences. I am looking for a way to provide recommendations by calculating the item similarities offline, so that the recommender.recommend() method returns results in under 100 milliseconds.
DataModel dataModel = new FileDataModel(new File("/FilePath"));
_itemSimilarity = new TanimotoCoefficientSimilarity(dataModel);
_recommender = new CachingRecommender(new GenericBooleanPrefItemBasedRecommender(dataModel, _itemSimilarity));
I was hoping someone could point me to a method or a blog post to help me understand the procedure and challenges of computing the item similarities offline. Also, what is the recommended approach for storing the pre-computed results: should they be kept in a separate DB, or in memcache?
PS - I plan to refresh the user-product preference data every 10-12 hours.
MAHOUT-1167, which will be in the soon-to-be-released Mahout 0.8 (it is already in trunk), introduces a way to calculate similarities in parallel on a single machine. I'm just mentioning it so you keep it in mind.
If you are only going to refresh the user-product preference data every 10-12 hours, you are better off having a batch process that stores precomputed recommendations somewhere and then delivers them to the end user from there. I cannot give detailed advice, because this will vary greatly according to many factors such as your current architecture, software stack, network capacity and so on. In other words: in your batch process, just run over all your users, ask for 10 recommendations for every one of them, and store the results somewhere to be delivered to the end user.
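Roughly, that batch job could look like the sketch below, reusing the Taste classes from your snippet; storeRecommendations is only a placeholder for whatever database or cache you end up choosing:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class RecommendationBatchJob {

    public static void main(String[] args) throws Exception {
        DataModel dataModel = new FileDataModel(new File("/FilePath"));
        ItemSimilarity similarity = new TanimotoCoefficientSimilarity(dataModel);
        ItemBasedRecommender recommender =
                new GenericBooleanPrefItemBasedRecommender(dataModel, similarity);

        // Walk every user once and persist their top-10 list; the online service
        // later answers requests with a simple key lookup instead of recomputing.
        LongPrimitiveIterator userIds = dataModel.getUserIDs();
        while (userIds.hasNext()) {
            long userId = userIds.nextLong();
            List<RecommendedItem> top10 = recommender.recommend(userId, 10);
            storeRecommendations(userId, top10); // placeholder: write to your DB / cache
        }
    }

    private static void storeRecommendations(long userId, List<RecommendedItem> items) {
        // e.g. key "recs:<userId>" -> list of item IDs; implementation depends on your store
    }
}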
If you need a response within 100 milliseconds, it's better to do the heavy lifting as batch processing in the background on your server. That may include the following jobs.
Fetch data from your own user database (60K products, 200K users and 4M user-product preferences).
Prepare your data model based on the nature of your data (number of parameters, size of data, preference values and so on). This can be an important step.
Run the algorithm on the data model (choose the right algorithm according to your requirements). The recommendation data is available after this step.
Post-process the resulting data as required.
Store this data in a database (NoSQL in all my projects).
The above steps should be running periodically as a batch process.
Whenever a user requests recommendations, your service responds by reading the recommendation data from the pre-calculated DB (see the lookup sketch below).
You may look at Apache Mahout (for recommendations) for this kind of task.
These are the steps in brief... Hope this helps!
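To make that last step concrete, the request-time path can be a single key lookup. The sketch below uses Redis via Jedis purely as an example store, and assumes the batch job wrote each user's item IDs to a list under a key like recs:<userId>:

import java.util.List;
import redis.clients.jedis.Jedis;

public class RecommendationService {

    // A real service would use a connection pool; a single client keeps the sketch short.
    private final Jedis jedis = new Jedis("localhost", 6379);

    // Request-time path: one key lookup, comfortably inside a 100 ms budget.
    public List<String> recommendationsFor(long userId) {
        return jedis.lrange("recs:" + userId, 0, 9);
    }
}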

Quickest way to load total number of points within a set of iterations?

I am creating an app which graphs the total number of accepted points on an iteration-by-iteration basis, compared to all points accepted within that iteration (regardless of project). Currently, I am using a WsapiDataStore call with filters to pull only from the chosen iterations. However, this requires pulling all user stories within the iteration and then summing the Plan Estimate fields of each. It works, but it takes a fairly long time (about 20-30 seconds) to pull data which I would have assumed could be retrieved in a single call. Am I correct in my thinking, or is this really the easiest way?
Rally's API does not support server-side aggregations. Unfortunately, pulling that data into local memory is the only way to do calculations like this.

Tracking impressions/visits per web page

I have a site with several pages for each company, and I want to show how their page is performing in terms of the number of people visiting the profile.
We have already made sure that bots are excluded.
Currently, we are recording each hit in a DB with either an insert (for the first request of the day to a profile) or an update (for subsequent requests that day to the same profile). But, given that requests have gone from a few thousand per day to tens of thousands per day, these inserts/updates are causing major performance issues.
Assuming no JS solution, what will be the best way to handle this?
I am using Ruby on Rails, MySQL, Memcache, Apache and HAProxy to run the overall show.
Any help will be much appreciated.
Thanks
http://www.scribd.com/doc/49575/Scaling-Rails-Presentation-From-Scribd-Launch
You should start reading from slide 17.
I don't think performance is a problem if it's possible to build a solution like this for a website as big as Scribd.
Here are four ways to address this, from easy estimates to complex and accurate:
Track only a percentage (10% or 1%) of users, then multiply to get an estimate of the count.
After the first 50 counts for a given page, start updating the count only 1/13th of the time, but by an increment of 13. This helps when a few pages do most of the counting, while keeping small counts accurate. (13 is used because it's hard to notice that the increment isn't 1.)
Save exact counts in a cache layer like Memcache or local server memory, and write them all to disk when they reach 10 counts or have been in the cache for a certain amount of time (see the sketch after this list).
Build a separate counting layer that 1) always has the current count available in memory, 2) persists the count to its own tables/database, and 3) has calls that adjust both places.
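A sketch of option 3, in Java purely for illustration (a Rails version would buffer in Memcache rather than a local map): hits accumulate in memory and are flushed in batches, so each profile generates at most one UPDATE per flush interval instead of one write per hit. The table and column names are made up.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class BufferedHitCounter {

    private final Map<String, AtomicLong> pending = new ConcurrentHashMap<>();
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    public BufferedHitCounter() {
        // Flush accumulated counts every 30 seconds (or when a counter reaches a threshold).
        flusher.scheduleAtFixedRate(this::flush, 30, 30, TimeUnit.SECONDS);
    }

    // Called on every profile hit; no database work happens here.
    public void recordHit(String profileId) {
        pending.computeIfAbsent(profileId, id -> new AtomicLong()).incrementAndGet();
    }

    // Drain the buffer and apply each accumulated delta as one bulk update per profile.
    private void flush() {
        for (Map.Entry<String, AtomicLong> entry : pending.entrySet()) {
            long delta = entry.getValue().getAndSet(0);
            if (delta > 0) {
                persist(entry.getKey(), delta); // placeholder for the batched SQL write
            }
        }
    }

    private void persist(String profileId, long delta) {
        // e.g. UPDATE profile_hits SET count = count + delta WHERE profile_id = ? AND day = CURRENT_DATE
    }
}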
