Adobe Analytics eVars, props & events in AppMeasurement 2.0.0 - adobe-analytics

I have known Adobe Analytics/SiteCatalyst for a while now (and I know all the rules like "don't combine props and success events"), but I am still confused about the results I see in my reports: what exactly are those numbers telling me?
Background: I stumbled across the idea of "page view success events", but I am not sure whether this is still state of the art.
For my example I use one prop and one eVar, which contain exactly the same values (prop = eVar).
Prop + page views + visits + instances + orders
Result: 0 page views < 100 visits < 120 instances (orders not selectable)
My interpretation: this prop is set in an s.tl() call, so no page views are related (?). It was set 120 times in 100 sessions, so some sessions triggered the prop more than once. Success metrics (purchase metrics) cannot be combined with props.
eVar + page views + visits + instances + orders
Result: 20 orders < 100 visits < 120 instances < 6,000 page views
My interpretation: the eVar was set in the same s.tl() call as the prop above, which is why visits and instances match. After this variable was set, 20 orders were triggered. Furthermore, after the s.tl() call that set the variables, the 100 sessions triggered 6,000 additional s.t() calls (?).
I guess it must somehow depend on the sequence of s.t() and s.tl() calls, but I am not sure. I would be very glad if someone could shed some light :)

eVars persist their value, so the 6,000 page views are all the page views that occurred after the eVar was set and before it expired (the expiration defaults to visit).
Page views count only s.t() calls; instances count the number of times the variable was set, in both s.t() and s.tl() calls.
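To make the counting rule above concrete, here is a toy model in Python. It is only a sketch of the behaviour described in this answer, not Adobe's actual processing: the Hit class and report function are made up for illustration, the prop and eVar always receive the same value, and the eVar is assumed to expire at the end of the visit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    visit: int                   # visit (session) the hit belongs to
    call: str                    # "t" for s.t() page views, "tl" for s.tl() link calls
    value: Optional[str] = None  # value set on both the prop and the eVar in this hit
    order: bool = False          # True if the hit carries a purchase event


def report(hits):
    """Count metrics for a prop and an eVar that are always set together."""
    prop = {"instances": 0, "page_views": 0}
    evar = {"instances": 0, "page_views": 0, "orders": 0}
    persisted = set()  # visits in which the eVar currently holds a value
    for hit in hits:
        if hit.value is not None:
            prop["instances"] += 1
            evar["instances"] += 1
            persisted.add(hit.visit)
            if hit.call == "t":
                prop["page_views"] += 1  # a prop only sees the hit it was set on
        if hit.visit in persisted:
            # the persisted eVar value gets credit for everything later in the visit
            if hit.call == "t":
                evar["page_views"] += 1
            if hit.order:
                evar["orders"] += 1
    return prop, evar


hits = [
    Hit(visit=1, call="t"),                  # page view before the variables are set
    Hit(visit=1, call="tl", value="promo"),  # link call sets prop and eVar
    Hit(visit=1, call="t"),
    Hit(visit=1, call="t"),
    Hit(visit=1, call="t", order=True),      # order confirmation page
]
prop, evar = report(hits)
```

In this small visit the prop ends up with one instance and no page views, while the eVar gets one instance plus the three later page views and the order, which is the same shape as the numbers in the question.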


How to dedupe across overlapping sliding windows in Apache Beam / Dataflow

I have the following requirement:
read events from a Pub/Sub topic
take a sliding window of duration 30 minutes and period 1 minute
in that window, if 3 events for a given id all match some predicate, then I need to raise an event on a different Pub/Sub topic
The event should be raised as soon as the 3rd event comes in for the grouping id, as this is for detecting fraudulent behaviour. In one pane there may be many ids that have 3 events that match my predicate, so I may need to emit multiple events per pane.
I am able to write a function which consumes a PCollection, does the necessary grouping, logic and filtering, and emits events according to my business logic.
Questions:
The output PCollection contains duplicates due to the overlapping sliding windows. I understand this is the expected behaviour of sliding windows, but how can I avoid this whilst staying in the same Dataflow pipeline? I realise I could dedupe in an external system, but that just adds complexity to my system.
I also need to write some sort of trigger that fires each and every time my condition is reached in a window.
Is Dataflow suitable for this type of real-time detection scenario?
Many thanks
You can rewindow the output PCollection into the global window (using the regular Window.into()) and dedupe using a GroupByKey.
It sounds like you're already returning the events of interest as a PCollection. In order to "do something for each event", all you need is a ParDo.of(whatever action you want) applied to this collection. Triggers do something else: they control what happens when a new value V arrives for a particular key K in a GroupByKey<K, V>: whether to drop the value, or buffer it, or to pass the buffered KV<K, Iterable<V>> for downstream processing.
And yes, Dataflow is suitable for this kind of real-time detection scenario :)
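To see why the duplicates appear and why grouping by key in a single global window removes them, here is a plain-Python simulation. It deliberately does not use the Beam SDK: window_starts, alerts_per_window and dedupe are hypothetical helpers, the events are assumed to have already passed the predicate, and using (id, first three timestamps) as the dedupe key is my assumption about what identifies an alert.

```python
from collections import defaultdict

WINDOW = 30 * 60  # window duration in seconds
PERIOD = 60       # a new window starts every minute


def window_starts(ts):
    """Start times of every sliding window [start, start + WINDOW) containing ts."""
    last = ts - ts % PERIOD
    return range(last, ts - WINDOW, -PERIOD)


def alerts_per_window(events, threshold=3):
    """Emit one alert per (window, id) that holds at least `threshold` events."""
    grouped = defaultdict(list)
    for ts, key in events:
        for start in window_starts(ts):
            grouped[(start, key)].append(ts)
    alerts = []
    for (_start, key), stamps in grouped.items():
        if len(stamps) >= threshold:
            alerts.append((key, tuple(sorted(stamps)[:threshold])))
    return alerts


def dedupe(alerts):
    """Stand-in for re-windowing into the global window and grouping by the alert key."""
    return list(dict.fromkeys(alerts))


# three matching events for id "a", one for id "b" (timestamps in seconds)
events = [(0, "a"), (100, "a"), (200, "a"), (50, "b")]
raw = alerts_per_window(events)
unique = dedupe(raw)
```

Every overlapping window that contains all three "a" events produces its own copy of the alert, so raw holds many identical entries, while unique holds exactly one. In a real streaming pipeline the global-window grouping also needs a trigger so that results are emitted before the window closes.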

Is it possible to use one multi-index query to get results with per-index limits?

I'm working on a Rails application that uses Elasticsearch to index and search three types of documents; let's call them A, B and C. They are related, but it's not too important how. There is a search view in which one can search and have items returned under the 3 different categories. At first the setup was to show, say, the 20 top results across all the categories in one list, but that's not working so well now.
Now the view will have 3 different tabs, essentially one for the results from each index. The current approach would then break down, as I want to have up to 10 results from each category independently, and not, say, 23 of A, 2 of B and 5 of C, which would happen if I just increased the results limit in hopes of getting a spread. If I were doing this in, say, Go, I'd happily split this into 3 simpler concurrent requests, but I'm hesitant to try this in Ruby as I'm fairly new to it. From my research it seems my options, in order of preference, are:
Have the current Elasticsearch query return up to 30 results, with a limit of 10 from each index (my question)
Delegate to 3 background wget calls on the system and spin later waiting for the results
Use a multi-threaded approach (Ruby processes take too long to start). The prospect of the thread-safety worms that will come out of this can, given all the gems I'm using, frightens me.
Number one would be perfect; I just can't work out from the docs how to accomplish it. I know that you can aggregate results into buckets, which sounds close to what I want, but is it possible to then also limit the number of results from each index independently?

Why does SimpleFacetedSearch result in an OutOfMemory exception in Lucene.Net?

I am using Lucene.Net to perform faceted searches for an MVC-based web app hosted on Azure.
The index consists of approx. 2 million entries.
Each entry has 1 analysed field and about 25 non-analysed fields.
All fields need to be stored.
Currently the entire app works fine with a 25% complete sample index but falls over when the full index is created.
At that point I start getting an OutOfMemory exception from this line:
sfs = new SimpleFacetedSearch(Newreader, "Product_Id");
Each document is an SKU and each product has about 44 SKUs.
My intention (and what was working before the full index was created) was to perform a facet search on "Product_Id", giving the total number of unique products, in order to allow for paging and to create object models only for the required number of products (24 products per page, for example).
The layout of the page is such that I need all the SKU data for a product but need to limit by unique products (i.e. 24 products per page, not 24 SKUs per page).
So in essence I either need to figure out why I am getting the OutOfMemory exception (Lucene seems to handle much larger indexes for other people, so maybe I am doing something wrong)
OR
I need to filter the SKUs down by unique product ID in another way.
I tried iterating through the results in a loop, tracking the documents' product IDs and only grabbing the full model when required. This worked well for the first few results pages (as you don't have to iterate many times to meet the 24-per-page quota) but was awful for the last few pages.

State machine transitions at specific times

Simplified example:
I have a to-do. It can be future, current, or late based on what time it is.
Time      State
8:00 am   Future
9:00 am   Current
10:00 am  Late
So, in this example, the to-do is "current" from 9 am to 10 am.
Originally, I thought about adding fields for "current_at" and "late_at" and then using an instance method to return the state. I can query for all "current" todos with now > current_at and now < late_at.
In short, I'd calculate the state each time or use SQL to pull the set of states I need.
If I wanted to use a state machine, I'd have a set of states and would store that state name on the to-do. But, how would I trigger the transition between states at a specific time for each to-do?
Run a cron job every minute to pull anything in a state but past the transition time and update it
Use background processing to queue transition jobs at the appropriate times in the future, so in the above example I would have two jobs: "transition to current at 9 am" and "transition to late at 10 am" that would presumably have logic to guard against deleted todos and "don't mark late if done" and such.
Does anyone have experience with managing either of these options when trying to handle a lot of state transitions at specific times?
It feels like a state machine, I'm just not sure of the best way to manage all of these transitions.
Update after responses:
Yes, I need to query for "current" or "future" todos
Yes, I need to trigger notifications on state change ("your todo wasn't to-done")
Hence my desire for more of a state-machine-like idea, so that I can encapsulate the transitions.
I have designed and maintained several systems that manage huge numbers of these little state machines. (Some systems, up to 100K/day, some 100K/minute)
I have found that the more state you explicitly fiddle with, the more likely it is to break somewhere. Or to put it a different way, the more state you infer, the more robust the solution.
That being said, you must keep some state. But try to keep it as minimal as possible.
Additionally, keeping the state-machine logic in one place makes the system more robust and easier to maintain. That is, don't put your state machine logic in both code and the database. I prefer my logic in the code.
Preferred solution. (Simple pictures are best).
For your example I would have a very simple table:
task_id, current_at, current_duration, is_done, is_deleted, description...
and infer the state based on now in relation to current_at and current_duration. This works surprisingly well. Make sure you index/partition your table on current_at.
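As a rough illustration of inferring the state instead of storing it, here is a small Python sketch. The function name and the state labels are my own; only the columns (current_at, current_duration, is_done, is_deleted) come from the table above.

```python
from datetime import datetime, timedelta


def infer_state(now, current_at, current_duration, is_done=False, is_deleted=False):
    """Derive a task's state from the clock and its stored columns."""
    if is_deleted:
        return "DELETED"
    if is_done:
        return "DONE"
    if now < current_at:
        return "FUTURE"
    if now < current_at + current_duration:
        return "CURRENT"
    return "LATE"


# the to-do from the question: current from 9 am to 10 am
current_at = datetime(2024, 1, 1, 9, 0)
duration = timedelta(hours=1)
state_at_half_past_nine = infer_state(datetime(2024, 1, 1, 9, 30), current_at, duration)
```

Nothing ever has to be updated when the clock ticks over, which is why this variant is hard to break.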
Handling logic on transition change
Things are different when you need to fire an event on the transition change.
Change your table to look like this:
task_id, current_at, current_duration, state, locked_by, locked_until, description...
Keep your index on current_at, and add one on state if you like. You are now mangling state, so things are a little more fragile due to concurrency or failure, so we'll have to shore it up a little bit using locked_by and locked_until for optimistic locking which I'll describe below.
I assume your program will fail in the middle of processing on occasion, even if only for a deployment.
You need a mechanism to transition a task from one state to another. To simplify the discussion, I'll concern myself with moving from FUTURE to CURRENT, but the logic is the same no matter the transition.
If your dataset is large enough, you constantly poll the database to discover tasks requiring transition (of course, with linear or exponential back-off when there's nothing to do); otherwise you use your favorite scheduler, whether it is cron or Ruby-based, or Quartz if you subscribe to Java/Scala/C#.
Select all entries that need to be moved from FUTURE to CURRENT and are not currently locked.
(updated:)
    -- move from FUTURE to CURRENT
    select task_id
    from tasks
    where now() >= current_at
      and (locked_until is null or locked_until < now())
      and state = 'FUTURE'
      and current_at >= now() - interval '3 days' -- optimization; interval syntax varies by database
    limit :LIMIT -- optimization
Throw all these task_ids into your reliable queue. Or, if you must, just process them in your script.
When you start to work on an item, you must first lock it using our optimistic locking scheme:
    update tasks
    set locked_by = :worker_id -- unique identifier for host + process + thread
      , locked_until = now() + interval '5 minutes' -- however this looks in your SQL dialect
    where task_id = :task_id -- you can lock multiple tasks here if necessary
      and (locked_until is null or locked_until < now()) -- only if it's not locked!
Now, if you actually updated the record, you own the lock. You may now fire your special on-transition logic. (Applause. This is what makes you different from all the other task managers, right?)
When that is successful, update the task state, make sure you still use the optimistic locking:
    update tasks
    set state = :new_state
      , locked_until = null -- explicitly release the lock (an optimization, really)
    where task_id = :task_id
      and locked_by = :worker_id -- make sure we still own the lock
    -- no one really cares if we overstep our time bounds
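If you want to play with the locking scheme, here is a minimal runnable sketch using Python's built-in sqlite3 module. The helper names (open_demo_db, try_lock, finish) are made up, time is passed in as plain epoch seconds so the example is deterministic, and SQLite merely stands in for whatever database you actually use.

```python
import sqlite3

LOCK_SECONDS = 5 * 60  # lock timeout, matching the 5 minutes used above


def open_demo_db():
    """In-memory table holding a single FUTURE task (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (task_id INTEGER PRIMARY KEY, state TEXT, "
        "locked_by TEXT, locked_until INTEGER)"
    )
    conn.execute("INSERT INTO tasks (task_id, state) VALUES (1, 'FUTURE')")
    conn.commit()
    return conn


def try_lock(conn, task_id, worker_id, now):
    """Take the lock only if it is free or expired; True means we own it."""
    cur = conn.execute(
        "UPDATE tasks SET locked_by = ?, locked_until = ? "
        "WHERE task_id = ? AND (locked_until IS NULL OR locked_until < ?)",
        (worker_id, now + LOCK_SECONDS, task_id, now),
    )
    conn.commit()
    return cur.rowcount == 1  # the row count tells us whether the update happened


def finish(conn, task_id, worker_id, new_state):
    """Store the new state and release the lock, but only if we still own it."""
    cur = conn.execute(
        "UPDATE tasks SET state = ?, locked_until = NULL "
        "WHERE task_id = ? AND locked_by = ?",
        (new_state, task_id, worker_id),
    )
    conn.commit()
    return cur.rowcount == 1


conn = open_demo_db()
first = try_lock(conn, 1, "worker-1", now=1000)   # worker-1 gets the lock
second = try_lock(conn, 1, "worker-2", now=1001)  # worker-2 is refused
```

If worker-1 dies without finishing, worker-2 can take over once the timeout has passed, and a late finish call from worker-1 then updates nothing because locked_by no longer matches.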
Multi-thread/process optimization
Only do this when you have multiple threads or processes updating tasks in batch (such as in a cron job, or polling the database)! The problem is that they'll each get similar results from the database and will then contend to lock each row. This is inefficient both because it will slow down the database, and because you have threads basically doing nothing but slowing down the others.
So, add a limit to how many results the query returns and follow this algorithm:
    results = database.tasks_to_move_to_current_state :limit => BATCH_SIZE
    until results.empty?
      results.shuffle! # make sure we're not in lock step with another worker
      contention_count = 0
      results.each do |task_id|
        if database.lock_task(:task_id => task_id)
          on_transition_to_current task_id
        else
          contention_count += 1
        end
        break if contention_count > MAX_CONTENTION_COUNT # too much contention!
      end
      results = database.tasks_to_move_to_current_state :limit => BATCH_SIZE
    end
Fiddle around with BATCH_SIZE and MAX_CONTENTION_COUNT until the program is super-fast.
Update:
The optimistic locking allows for multiple processors in parallel.
Having the lock time out (via the locked_until field) allows for failure while processing a transition. If the processor fails, another processor is able to pick up the task after a timeout (5 minutes in the above code). It is important, then, to a) only lock the task when you are about to work on it; and b) lock the task for as long as it will take to do the task, plus a generous leeway.
The locked_by field is mostly for debugging purposes, (which process/machine was this on?) It is enough to have the locked_until field if your database driver returns the number of rows updated, but only if you update one row at a time.
Managing all those transitions at specific times does seem tricky. Perhaps you could use something like DelayedJob to schedule the transitions, so that a cron job every minute wouldn't be necessary, and recovering from a failure would be more automated?
Otherwise - if this is Ruby, is using Enumerable an option?
Like so (in untested pseudo-code, with simplistic methods)
    class ToDo < ActiveRecord::Base # assuming an ActiveRecord model
      def state
        if future?
          "Future"
        elsif current?
          "Current"
        elsif late?
          "Late"
        else
          "must not have been important"
        end
      end

      def future?
        Time.now.hour <= 8
      end

      def current?
        Time.now.hour == 9
      end

      def late?
        Time.now.hour >= 10
      end

      def self.find_current_to_dos
        # load the candidates, then filter in Ruby on the computed state
        find(:all, :conditions => " 1=1 /* or whatever */ ").select { |to_do| to_do.state == 'Current' }
      end
    end
One simple solution for moderately large datasets is to use a SQL database. Each todo record should have "state_id", "current_at", and "late_at" fields. You can probably omit the "future_at" unless you really have four states.
This allows three states:
Future: when now < current_at
Current: when current_at <= now < late_at
Late: when late_at <= now
Storing the state as state_id (optionally make a foreign key to a lookup table named "states" where 1: Future, 2: Current, 3: Late) is basically storing de-normalized data, which lets you avoid recalculating the state as it rarely changes.
If you aren't actually querying todo records according to state (eg ... WHERE state_id = 1) or triggering some side-effect (eg sending an email) when the state changes, perhaps you don't need to manage state. If you're just showing the user a todo list and indicating which ones are late, the cheapest implementation might even be to calculate it client side. For the purpose of answering, I'll assume you need to manage the state.
You have a few options for updating state_id. I'll assume you are enforcing the constraint current_at < late_at.
The simplest is to update every record: UPDATE todos SET state_id = CASE WHEN late_at <= NOW() THEN 3 WHEN current_at <= NOW() THEN 2 ELSE 1 END;.
You will probably get better performance with something like the following (in one transaction): UPDATE todos SET state_id = 3 WHERE state_id <> 3 AND late_at <= NOW(); UPDATE todos SET state_id = 2 WHERE state_id <> 2 AND NOW() < late_at AND current_at <= NOW(); UPDATE todos SET state_id = 1 WHERE state_id <> 1 AND NOW() < current_at;. This avoids touching rows that don't need to be updated, but you'll want indices on "late_at" and "current_at" (you can try indexing "state_id", see note below). You can run these three updates as frequently as you need.
Slight variation of the above is to get the IDs of records first, so you can do something with the todos that have changed states. This looks something like SELECT id FROM todos WHERE state_id <> 3 AND late_at <= NOW() FOR UPDATE. You should then do the update like UPDATE todos SET state_id = 3 WHERE id IN (:ids). Now you've still got the IDs to do something with later (eg email a notification "20 tasks have become overdue").
Scheduling or queuing update jobs for each todo (eg update this one to "current" at 10AM and "late" at 11PM) will result in a lot of scheduled jobs, at least two times the number of todos, and poor performance -- each scheduled job is updating only a single record.
You could schedule batch updates like UPDATE todos SET state_id = 2 WHERE id IN (1,2,3,4,5,...) where you've pre-calculated the list of todo IDs that will become current near some specific time. This probably won't work out so nicely in practice, for several reasons; one is that a todo's current_at and late_at fields might change after you've scheduled the updates.
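The "get the IDs first" variation from the list above can be sketched like this in Python with the built-in sqlite3 module. It is only an illustration: mark_late is a made-up helper, the timestamps are plain integers, and SQLite has no FOR UPDATE clause, so the select and the update are simply wrapped in one transaction instead.

```python
import sqlite3

LATE = 3  # state_id for "Late", as in the lookup table described above


def mark_late(conn, now):
    """Move overdue todos to Late and return the ids that actually changed."""
    with conn:  # one transaction: commits on success, rolls back on error
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM todos WHERE state_id <> ? AND late_at <= ? ORDER BY id",
                (LATE, now),
            )
        ]
        conn.executemany(
            "UPDATE todos SET state_id = ? WHERE id = ?",
            [(LATE, todo_id) for todo_id in ids],
        )
    return ids


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE todos (id INTEGER PRIMARY KEY, state_id INTEGER, "
    "current_at INTEGER, late_at INTEGER)"
)
conn.executemany(
    "INSERT INTO todos VALUES (?, ?, ?, ?)",
    [(1, 2, 540, 600), (2, 1, 700, 800), (3, 3, 100, 200)],
)
conn.commit()
changed = mark_late(conn, now=650)  # only todo 1 has just become late
```

The returned ids are what you would hand to the notification code (for example "20 tasks have become overdue"); running the function again at the same time returns an empty list.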
Note: you might not gain much by indexing "state_id" as it only divides your dataset into three sets. This is probably not good enough for a query planner to consider using it in a query like SELECT * FROM todos WHERE state_id = 1.
The key to this problem that you didn't discuss is what happens to completed todos? If you leave them in this todos table, the table will grow indefinitely and your performance will degrade over time. The solution is partitioning the data into two separate tables (like "completed_todos" and "pending_todos"). You can then use UNION to concatenate both tables when you actually need to.
State machines are driven by something: user interaction or the last input from a stream, right? In this case, time drives the state machine. I think a cron job is the right play; it would be the clock driving the machine.
For what it's worth, it is pretty difficult to set up an efficient index on two columns when you have to do a range query like that.
now > current && now < late is going to be hard to represent in the database in a performant way as an attribute of a task:
id|title|future_time|current_time|late_time
1|hello|8:00am|9:00am|10:00am
Never try to force patterns onto problems; it works the other way around. So go directly for a good solution to the problem itself.
Here is an idea (based on what I understood of your problem):
Use persistent alerts and one monitored process to "consume" them. Secondarily, query them.
That will allow you to:
keep it simple
keep it cheap to maintain. Secondarily, it will also keep you mentally fresher to do something else.
keep all the logic in code only (as it should).
I stress the point of having that process monitored with some kind of watchdog so you are ensured to send those alerts in time (or, in a worst case scenario, with some delay after a crash or things like that).
Note that persisting those alerts gives you these two things:
it makes and keeps your system resilient (more fault tolerant), and
it lets you query future and current items (by querying the alerts' time range as best fits your needs)
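A very small Python sketch of the "persistent alerts plus one consumer" idea, just to show the moving parts. Everything here is hypothetical: an in-memory heap stands in for the persisted alert store, times are minutes since midnight, and the watchdog that keeps the consumer process alive is not shown.

```python
import heapq


def schedule(alerts, fire_at, todo_id, new_state):
    """Record that todo_id should move to new_state at time fire_at."""
    heapq.heappush(alerts, (fire_at, todo_id, new_state))


def consume_due(alerts, now, is_cancelled=lambda todo_id: False):
    """Pop every alert that is due and return the transitions to fire."""
    fired = []
    while alerts and alerts[0][0] <= now:
        _fire_at, todo_id, new_state = heapq.heappop(alerts)
        if not is_cancelled(todo_id):  # e.g. the todo was deleted or already done
            fired.append((todo_id, new_state))
    return fired


def pending_between(alerts, start, end):
    """Query the stored alerts by time range without consuming them."""
    return sorted(alert for alert in alerts if start <= alert[0] < end)


alerts = []
schedule(alerts, 9 * 60, 1, "current")  # 9:00 am
schedule(alerts, 10 * 60, 1, "late")    # 10:00 am
fired_at_9_05 = consume_due(alerts, now=9 * 60 + 5)
```

After a crash the consumer just calls consume_due again with the current time, so overdue alerts are delivered late rather than lost.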
In my experience, a state machine in SQL is most useful when you have an external process acting on something and updating the database with its state. For example, we have a process that uploads and converts videos. We use the database to keep track of what is happening to a video at any time, and what should happen to it next.
In your case, I think you can (and should) use SQL to solve your problem instead of worrying about using a state machine:
Make a todo_states table:
todo_id  todo_state_id  datetime  notified
1        1 (future)     8:00      0
1        2 (current)    9:00      0
1        3 (late)       10:00     0
Your SQL query, where all the real work happens:
    SELECT todo_id, MAX(todo_state_id) AS todo_state_id
    FROM todo_states
    WHERE datetime < NOW()
    GROUP BY todo_id
The currently active state is always the one you select. If you want to notify the user just once, insert the original state with notified = 0, and bump it on the first select.
Once the task is "done", you can either insert another state into the todo_states table, or simply delete all the states associated with a task and raise a "done" flag in the todo item, or whatever is most useful in your case.
Don't forget to clean out stale states.
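Here is a runnable sketch of the todo_states approach using Python's built-in sqlite3 module. It rests on a few assumptions of mine: the time column is called starts_at instead of datetime, times are zero-padded "HH:MM" strings so that string comparison orders them correctly, and the helper names are invented.

```python
import sqlite3


def open_demo_db():
    """In-memory todo_states table with the three rows from the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE todo_states (todo_id INTEGER, todo_state_id INTEGER, "
        "starts_at TEXT, notified INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO todo_states (todo_id, todo_state_id, starts_at) VALUES (?, ?, ?)",
        [(1, 1, "08:00"), (1, 2, "09:00"), (1, 3, "10:00")],
    )
    conn.commit()
    return conn


def current_states(conn, now):
    """Map each todo to its active state: the highest state already started."""
    rows = conn.execute(
        "SELECT todo_id, MAX(todo_state_id) FROM todo_states "
        "WHERE starts_at < ? GROUP BY todo_id",
        (now,),
    )
    return dict(rows.fetchall())


def take_unnotified(conn, now):
    """Return started states not yet notified and mark them, so each fires once."""
    with conn:  # select and update in one transaction
        rows = conn.execute(
            "SELECT todo_id, todo_state_id FROM todo_states "
            "WHERE starts_at < ? AND notified = 0 ORDER BY todo_id, todo_state_id",
            (now,),
        ).fetchall()
        conn.execute(
            "UPDATE todo_states SET notified = 1 WHERE starts_at < ? AND notified = 0",
            (now,),
        )
    return rows


conn = open_demo_db()
states_at_half_past_nine = current_states(conn, "09:30")
```

At 09:30 the todo is in state 2 (current); calling take_unnotified twice with the same time returns rows only the first time, which gives you the "notify just once" behaviour.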

How to group similar items in an activity feed

For a social network site, I have an activity feed of events from people you follow, and I'd like to group similar types of events made within a short timeframe together, for a more compact activity feed. Imagine how Facebook displays a comma-separated list when you 'like' several things in rapid succession: 'Joe likes beer, football and chips.'
I understand using the group_by method on ActiveRecord Enumerable results, but there needs to be some initial work done populating a property that I can group by later. My questions deal with both storing activity data in a way that these groupings can be marked, and then later retrieving them again.
Right now I have an Activity model, which is a join association between the user that committed the activity and the item that it's linked to (in my example above, assume 'beer', 'football' and 'chips' are records of a Like model). There are other activity types aside from 'likes' too (events, saving favorites, etc.). What I'm considering is that, as this association is created, a check is made for when the last association of that type was created, and if that was more than a certain time period ago, an 'activity block' counter that is part of the Activity model is incremented. Later, when rendering this activity feed, I can group by user, then type, then this activity block counter.
Example: Let's say 2 blocks of updates are made within the same day. A user likes 2 things at 2:05 and later 3 more things at 5:45. After the third update (the start of the 2nd block) happens at 5:45, the model detects too much time has passed and increments its activity block counter by 1, thus forcing any following updates into a new block when they are rendered via a group_by call:
2:05 Joe likes beer nuts and Hooters.
5:45 Joe likes couches, chips and salsa.
7:00 Joe is attending the Football Viewing Party At Joe's
My first question: What's an efficient way to increment a counter like this? It's no longer auto_increment, so the easiest thing I can think of is looking at the counter for the last record as a reference point. However, this couldn't be from the same query that checked for when the last update of that type was made, since a later update of another type could have already received the next counter value. They don't have to be globally unique, but that would be nice.
The other overall strategy I thought of was another model called ActivityBlock, which joins groups of similar activities together. In many cases, though, updates will be isolated by themselves, so it seems a little inefficient to have one record for each individual activity.
Does either of these seem like a solid strategy?
My final question revolves around pagination. Now that we're dealing with blocks, it's harder to always display exactly a certain number of entries before pagination kicks in. Either an individual (isolated) Activity update or a block of them should count as just 1, so at the lowest layer of my group_by I can incorporate a counter to track how many rows I've displayed, but this means I can't just make one DB query anymore and simply specify a limit statement. Is there any way I could still do this without repeatedly performing additional SQL queries until I've reached my page limit?
This would be one advantage of the ActivityBlock model approach, since I could easily apply a limit call to that, and blocks could contain an auto increment counter as well.
Check out http://railscasts.com/episodes/406-public-activity
He also posted one on how to do it from scratch in episode 407 (it's a Pro episode though).
You could use the epoch time, or a variation of it, as the counter, since that's semi-unique and deterministic.
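Here is roughly what the epoch-time idea could look like, sketched in Python rather than Rails. The 30-minute gap, the dictionary layout and the helper names are all assumptions for illustration: a new activity reuses the block id of the previous activity of the same user and type if it is recent enough, otherwise its own timestamp becomes the new block id.

```python
MAX_GAP = 30 * 60  # seconds between activities before a new block starts (hypothetical)


def assign_block_id(previous, created_at, max_gap=MAX_GAP):
    """Reuse the previous block if it is recent enough, else start a new one."""
    if previous is not None and created_at - previous["created_at"] <= max_gap:
        return previous["block_id"]
    return created_at  # the epoch time of the block's first activity is its id


def record(feed, user, kind, item, created_at):
    """Append an activity, looking only at the last activity of the same user and type."""
    previous = next(
        (a for a in reversed(feed) if a["user"] == user and a["kind"] == kind),
        None,
    )
    activity = {
        "user": user,
        "kind": kind,
        "item": item,
        "created_at": created_at,
        "block_id": assign_block_id(previous, created_at),
    }
    feed.append(activity)
    return activity


def blocks(feed):
    """Group items by (user, type, block id), in order of first appearance."""
    grouped = {}
    for a in feed:
        grouped.setdefault((a["user"], a["kind"], a["block_id"]), []).append(a["item"])
    return grouped


feed = []
record(feed, "Joe", "like", "beer nuts", 7500)  # 2:05, as seconds since midnight
record(feed, "Joe", "like", "Hooters", 7560)
record(feed, "Joe", "like", "couches", 20700)   # 5:45
record(feed, "Joe", "like", "chips", 20760)
record(feed, "Joe", "like", "salsa", 20820)
record(feed, "Joe", "attend", "Football Viewing Party", 25200)  # 7:00
grouped = blocks(feed)
```

Because the block id only depends on the previous activity of the same user and type, a later update of another type cannot steal the next counter value, and pagination could apply its limit to distinct block ids instead of individual rows.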
