Druid Cohort Analysis?

We collect data on our website traffic, which results in about 50k to 100k
unique visits a day.
Cohort analysis:
Find the percentage of users within a 24-hour period who register at the
website and then actually go to our purchasing page (calculate the
percentage of users who do this within the first, second, third, etc.
hour after registration).
Two very abbreviated sample documents:
sessionId: our unique identifier for performing counts
url: the url for evaluating cohorts
time: unix timestamp for event
{
"sessionId": "some-random-id",
"time": 1428238800000, (unix timestamp: Apr 5th, 3:00 pm)
"url": "/register"
}
{
"sessionId": "some-random-id",
"time": 1428241500000, (unix timestamp: Apr 5th, 3:45 pm)
"url": "/buy"
}
What if I want to do the same aggregation over a period of, say, 6
months, and would like to perform cohort analysis for returning customers? The
data set would be too immense.
On a side note: I am also not interested in getting 100% accurate results;
an approximation would be sufficient for trend analysis.
Can we achieve this with Druid, or is it not suitable for this kind of analysis? Is there anything else that is better suited to cohort analysis?

I think you can do this with Druid and DataSketches.
Look at the last example on this page.
In case you want to go with this approximation method, you can look here to understand the error bounds of the approximation and the trade-off you can make between memory and accuracy.
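To make that concrete, here is a rough sketch of what the "registered and then purchased" count could look like with the DataSketches theta extension, written as a Python dict posted to Druid's native query API. The datasource, dimension names, interval and broker URL are placeholders for your setup, and the hour-by-hour breakdown would need additional filtered aggregators (one per hour bucket) on top of this pattern.

    import json
    import requests  # assumes the requests library is installed

    # Hypothetical datasource and field names; adjust to your schema.
    query = {
        "queryType": "groupBy",
        "dataSource": "web_events",
        "granularity": "all",
        "dimensions": [],
        "intervals": ["2015-04-05/2015-04-06"],
        "aggregations": [
            # Theta sketch of distinct sessionIds that hit /register
            {"type": "filtered",
             "filter": {"type": "selector", "dimension": "url", "value": "/register"},
             "aggregator": {"type": "thetaSketch", "name": "registered", "fieldName": "sessionId"}},
            # Theta sketch of distinct sessionIds that hit /buy
            {"type": "filtered",
             "filter": {"type": "selector", "dimension": "url", "value": "/buy"},
             "aggregator": {"type": "thetaSketch", "name": "purchased", "fieldName": "sessionId"}},
        ],
        "postAggregations": [
            # Approximate number of sessions that did both: intersect the two sketches
            {"type": "thetaSketchEstimate",
             "name": "registered_and_purchased",
             "field": {"type": "thetaSketchSetOp",
                       "name": "registered_and_purchased_sketch",
                       "func": "INTERSECT",
                       "fields": [{"type": "fieldAccess", "fieldName": "registered"},
                                  {"type": "fieldAccess", "fieldName": "purchased"}]}},
        ],
    }

    # Broker URL is an assumption about your deployment.
    resp = requests.post("http://localhost:8082/druid/v2", json=query)
    print(json.dumps(resp.json(), indent=2))

The cohort percentage is then registered_and_purchased divided by the estimate for the registered sketch alone, and the error bounds linked above apply to both estimates.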

Related

Extract some keywords like rent, deposit, liabilities etc. from unstructured document

I am writing an algorithm to extract some keywords like rent, deposit, liabilities etc. from a rent agreement document. I used a naive Bayes classifier, but it is not giving the desired output.
My training data looks like this:
train = [
("refundable security deposit Rs 50000 numbers equal 5 months","deposit"),
("Lessee pay one month's advance rent Lessor","security"),
("eleven (11) months commencing 1st march 2019","duration"),
("commence 15th feb 2019 valid till 14th jan 2020","startdate")]
The code below is not giving the desired keyword:
classifier.classify(test_data_features)
Please share any NLP libraries that could accomplish this.
It seems like you need to build your own specific NER (Named Entity Recognizer) for parsing your unstructured documents,
where you tag every word of your sentences with certain labels. Based on the surrounding words and the context window, your trained NER will be able to give you the results you are looking for.
Check the Stanford CoreNLP implementation of NER.
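If you would rather stay in Python than run Stanford CoreNLP, one option (not mentioned in the answer above, purely an illustration) is a rule-based first pass with spaCy's EntityRuler; the labels and token patterns below are made-up examples for the rent-agreement domain, not a trained model.

    import spacy

    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns([
        # "deposit Rs 50000" style amounts
        {"label": "DEPOSIT", "pattern": [{"LOWER": "deposit"},
                                         {"TEXT": {"REGEX": "^Rs\\.?$"}, "OP": "?"},
                                         {"LIKE_NUM": True}]},
        # "eleven months" / "11 months" style durations
        {"label": "DURATION", "pattern": [{"LIKE_NUM": True},
                                          {"LOWER": {"IN": ["month", "months"]}}]},
        # "commencing 1st march 2019" style start dates
        {"label": "STARTDATE", "pattern": [{"LOWER": {"IN": ["commencing", "commence"]}},
                                           {"TEXT": {"REGEX": "^\\d+(st|nd|rd|th)$"}},
                                           {"IS_ALPHA": True},
                                           {"LIKE_NUM": True}]},
    ])

    doc = nlp("refundable security deposit Rs 50000 for eleven months commencing 1st march 2019")
    for ent in doc.ents:
        print(ent.label_, "->", ent.text)

A statistical NER trained on annotated agreements (the Stanford CoreNLP route suggested above) should generalise better than hand-written patterns, but a ruler like this is a quick baseline and a way to bootstrap training data.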

Cyclic ordinal features in random forest

How do you prepare cyclic ordinal features like time in a day or day in a week for the random forest algorithm?
If you just encode time as minutes after midnight, the numerical difference between 23:55 and 00:05 will be very large, although it is only a 10-minute difference.
I found a solution here where the time feature is split into two features using the cosine and sine of the seconds-after-midnight value. But will that be appropriate for a random forest? With a random forest one can't be sure that all features will be present at every split, so half of the time information will often be missing for a decision.
Looking forward to your thoughts!
If you have a date variable, with values like '2019/11/09', you can extract individual features like year (2019), month (11), day (09), day of the week (Monday), quarter (4), semester (2). You can go ahead and add additional features like "is bank holiday", "is weekend", or "advertisement campaign", if you know the dates of specific events.
If you have a time variable with values like 23:55, you can extract the hour (23), the minutes (55) and, if you had them, seconds, nanoseconds, etc. If you have info about the timezone, you can also extract that.
If you have a datetime variable with values like '2019/11/09 23:55', you can combine the above.
If you have more than 1 datetime variable, you can capture differences between them, for example if you have date of birth, and date of application, you can determine the feature "age at time of application".
More info about the options for datetime can be found in pandas dt module. Check methods here.
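As a small illustration (the DataFrame and column names are just placeholders), the extraction with pandas could look like this:

    import pandas as pd

    df = pd.DataFrame({"datetime": pd.to_datetime([
        "2019-11-09 23:55:00", "2019-11-10 00:05:00"])})

    # Calendar features via the .dt accessor
    df["year"] = df["datetime"].dt.year
    df["month"] = df["datetime"].dt.month
    df["day"] = df["datetime"].dt.day
    df["dayofweek"] = df["datetime"].dt.dayofweek      # Monday = 0
    df["quarter"] = df["datetime"].dt.quarter
    df["is_weekend"] = df["datetime"].dt.dayofweek >= 5

    # Time features
    df["hour"] = df["datetime"].dt.hour
    df["minute"] = df["datetime"].dt.minute

    # With two datetime columns you can take differences, e.g. a (hypothetical)
    # age at application: (df["application_date"] - df["date_of_birth"]).dt.days
    print(df)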
The cyclical transformation in your link is used to re-code circular variables like hours of the day or months of the year, where for example December (month 12) is closer to January (month 1) than to July (month 7), whereas if you encode them with plain numbers this relationship is not captured. You would use this transformation if this is what you want to represent, but it is not the standard go-to method for transforming these variables (to my knowledge).
You can check Scikit-learn's tutorial on time-related feature engineering.
Random forests capture non-linear relationships between features and targets, so they should be able to handle either plain numerical features like month or the cyclical encoding.
To be absolutely sure, the best way is to try both engineering methods and see which feature returns better model performance.
You can apply the cyclical transformation straightaway with the open source package Feature-engine. Check the CyclicalTransformer.
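If you prefer not to add a dependency, here is a minimal sketch of the same sine/cosine encoding with plain pandas and numpy, using minutes after midnight so that 23:55 and 00:05 end up close together:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"time": pd.to_datetime(["23:55", "00:05", "12:00"], format="%H:%M")})
    minutes = df["time"].dt.hour * 60 + df["time"].dt.minute

    # Map minutes-after-midnight onto a circle with period 1440 (minutes per day)
    df["time_sin"] = np.sin(2 * np.pi * minutes / 1440)
    df["time_cos"] = np.cos(2 * np.pi * minutes / 1440)

    # 23:55 and 00:05 now have nearly identical (sin, cos) pairs,
    # while 12:00 sits on the opposite side of the circle.
    print(df[["time_sin", "time_cos"]])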

Calculating effective APR of a mortgage

I am trying to reproduce in Google Docs the calculations underlying the sample mortgage Closing Disclosure provided by the CFPB at:
http://files.consumerfinance.gov/f/201311_cfpb_kbyo_closing-disclosure.pdf
That document describes a mortgage with the following parameters:
Loan Amount: $162000
Annual Interest Rate: 3.875%
Monthly PMI: $82.35
Total Loan Costs: $4694.05
Prepaid Interest: $279.04
and summarizes it as follows (page 5):
Total Payments: $285803.36
Finance Charge: $118830.27
Amount Financed: $162000.00
Annual Percentage Rate: 4.174%
Total Interest Percent: 69.46%
Almost everything I calculate seems to agree, but I can't get the formula right for the effective APR (the 4th line of the summary).
I currently calculate it as follows
=100*12*rate(12*30, -1*(4694.05+279.04+162000+-1*cumipmt(0.03875/12, 30*12, 162000, 1, 30*12, 0)+82.35*80)/360, 162000, 0, 0)
This comes out to 4.218%, not 4.174% as published.
What am I missing?
The code I'm using is here:
https://docs.google.com/spreadsheets/d/1VQshp3A55brVv17eS9REdBjBUG0EmwTrcwhgXBVK8m8/edit#gid=0
APR has many nuances. It's primarily a customer-facing metric, so banks feel pressured to keep the number low.
Many institutions assume 12 months of 30 days each; some take other approaches. There are different ways of treating leap years, escrow, etc. Since you are close but not exact, this institution likely has a non-standard APR calculation.
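For what it's worth, here is a rough Python sketch of one common actuarial convention: solve for the monthly rate that discounts the actual payment stream (principal and interest, plus PMI while it lasts) back to the disclosed amount financed. Unlike the RATE formula above, this version does not fold the $4694.05 of loan costs or the $279.04 of prepaid interest into the payment; whether that matches the CFPB's exact method is an assumption on my part, which is rather the point: the convention you pick moves the answer.

    from scipy.optimize import brentq  # SciPy used only for the root finder

    loan = 162_000.00               # also the disclosed "Amount Financed"
    note_rate = 0.03875 / 12        # monthly note rate
    n = 360                         # 30-year term, monthly payments
    pmi = 82.35                     # monthly PMI
    pmi_months = 80                 # assumption: PMI runs for the first 80 months (82.35 * 80, as above)

    # Regular principal-and-interest payment implied by the note rate (~761.78)
    pi_payment = loan * note_rate / (1 - (1 + note_rate) ** -n)

    # Cash flows the borrower actually pays each month
    payments = [pi_payment + (pmi if m <= pmi_months else 0) for m in range(1, n + 1)]

    def pv(monthly_rate):
        return sum(p / (1 + monthly_rate) ** m for m, p in enumerate(payments, start=1))

    # APR = annualised monthly rate that discounts the payment stream to the amount financed
    apr_monthly = brentq(lambda r: pv(r) - loan, 1e-6, 0.05)
    print(f"APR = {12 * apr_monthly * 100:.3f}%")   # compare with the published 4.174% and the 4.218% above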

What to do when daylight savings results in duplicate data rows?

I have a fact table for energy consumption as follows:
f_meter_data:
utc_calendar_id
local_calendar_id
meter_id
reading
timestamp
The calendar table is structured as per the Kimball recommendations; following the advice in the Data Warehouse Toolkit, I have two calendar IDs so that users can query on both local and UTC time.
This is all well and good but the problems arise when daylight savings kicks in.
As the granularity is half-hour periods, there will be duplicate fact records when the clocks change.
And when the clocks change in the other direction there will be a gap in the data.
How can I handle this situation?
Should I average the duplicate values and store that instead?
And for when it's a gap in the data, should I use an average of the point immediately before and the point immediately after the gap?
I have a feeling this question may end up getting closed as "primarily opinion based", but my particular opinion is that the system should be set up to deal with the fact that not every day has exactly 24 hours. There may be 23, 24 or 25. (Or, if you're on Lord Howe Island, 23.5, 24 or 24.5).
Depending on when your extra hour falls (which will be different for each time zone), you may have something like:
00 01a 01b 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Or you might consider coupling the hour with the local UTC offset, like:
00-04:00 01-04:00 01-05:00 02-05:00 03-05:00 etc...
Or if you're doing half-hour buckets:
00:00-04:00 00:30-04:00 01:00-04:00 01:30-04:00 01:00-05:00 01:30-05:00 ...
It probably wouldn't be appropriate to do any averaging to align to 24 hours. If you did, then totals would be off.
You also should consider how people will be using the data. Will they be trying to figure out trends across a given hour of the day? If so, then how will they compensate for a spike or dip caused by the DST transition? It may be as simple as putting an asterisk and footnote on the output report. Or it may be much more involved than that, depending on the usage.
Also, you said you're working with 30-minute intervals. Be aware that there are some time zones that are 45-minute offset (Nepal, Chatham Islands, and a small region in Australia). So if you're trying to cover the entire world then you would need 15-minute interval buckets.
And, as Whichert pointed out in comments, if you're using UTC then there is no daylight saving time. It's only when you group by local-time that you'll have this concern.
You may also find the graphs in the DST tag wiki useful.
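As a small illustration of the "couple the local time with its UTC offset" idea (Python 3.9+ with zoneinfo; the zone and date are just examples), here is one way to generate unambiguous half-hour bucket keys for a local calendar day:

    from datetime import datetime, timedelta, timezone
    from zoneinfo import ZoneInfo

    tz = ZoneInfo("America/New_York")   # example zone

    def half_hour_buckets(year, month, day):
        """Yield unambiguous half-hour bucket keys for one local calendar day."""
        start = datetime(year, month, day, tzinfo=tz)    # local midnight
        end = start + timedelta(days=1)                  # next local midnight
        t = start.astimezone(timezone.utc)
        end_utc = end.astimezone(timezone.utc)
        while t < end_utc:
            local = t.astimezone(tz)
            yield local.strftime("%H:%M%z")              # e.g. '01:30-0400' vs '01:30-0500'
            t += timedelta(minutes=30)

    # 2021-11-07 was a US fall-back date: the local day has 50 half-hour buckets,
    # and local 01:00/01:30 each appear twice, distinguished only by the offset.
    keys = list(half_hour_buckets(2021, 11, 7))
    print(len(keys))    # 50
    print(keys[:8])     # ['00:00-0400', '00:30-0400', '01:00-0400', '01:30-0400', '01:00-0500', ...]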
I think you should simplify this with your business: when the clock is turned back, you turn back your records as well, by pushing the old records out into a warning or error table and inserting the new ones for the same interval.
As Matt suggested, reports run by local time would not tell the true story anyway, so why give wrong data in the reports?
Or, to follow up on Matt's advice, change your interval records. You should then not bind the time interval to the local_id. Instead use an Interval_seq_id that runs in intervals of 30 minutes, which might give 46 records (1-46), 48 records (1-48) or 50 records (1-50) for a given day depending on your region. This technically removes the duplicate problem on Local_Int_starttime and Time_interval_Endtime, since the record is no longer dependent on, or bound to, the time intervals.
This does move the issue to your reports/query tools, which have to decide how to display time in graphs that have duplicates on local time, especially if you want to do some analytics based on local time and meter readings. But this way the database design differentiates the records through Interval_seq_id rather than the time interval.
There is a similar thread about daylight saving problems in C# here.
The answer goes into deep detail about daylight saving time. I believe the problem is somewhat similar.

Is timezone just an offset number or "more information"?

I live in a country where they change the time twice a year. That is: there is a period of the year when the offset from UTC is -3 hours (-180 min) and another period where the offset is -4 hours (-240 min).
Graphically:
|------- (offset = -3) -------|------- (offset = -4) -------|
start of year               mid                 end of year
My question is:
the "timezone" is just the number representing the offset? that is: my country has two timezones? or the timezone includes this information?
This is important because I save every date in the UTC timezone (offset = 0) in my database.
Should I, instead, be saving the dates in the local timezone, together with their offset (at the moment of saving)?
Here is an example of a problem I see with saving the dates in the UTC timezone:
Lets say I have a system where people send messages.
I want to have a statistics section where I plot "messages sent v/s hour" (ie: "Messages sent by hour in a regular day")
Lets say there are just two messages in the whole database:
Message 1, sent on March 1 at 5 pm UTC (local time 2 pm)
Message 2, sent on August 1 at 5 pm UTC (local time 1 pm)
Then, if I create the plot on August 2, converting those UTC dates to local time would give me "2 messages were sent at 1 pm", which is incorrect!
From the timezone tag wiki here on StackOverflow:
TimeZone != Offset
A time zone can not be represented solely by an offset from UTC. Many
time zones have more than one offset due to "daylight savings time" or
"summer time" rules. The dates that offsets change are also part of
the rules for the time zone, as are any historical offset changes.
Many software programs, libraries, and web services disregard this
important detail, and erroneously call the standard or current offset
the "zone". This can lead to confusion, and misuse of the data. Please
use the correct terminology whenever possible.
There are two commonly used databases, the Microsoft Windows time zone db and the IANA/Olson time zone db. See the wiki for more detail.
Your specific questions:
the "timezone" is just the number representing the offset? that is: my country has two timezones? or the timezone includes this information?
You have one "time zone". It includes two "offsets".
Should I, instead, be saving the dates in the local timezone, together with their offset (at the moment of saving)?
If you are recording the precise moment an event occurred or will occur, then you should store the offset of that particular time with it. In .Net and SQL Server, this is represented using a DateTimeOffset. There are similar datatypes in other platforms. It only contains the offset information - not the time zone that the offset originated from. Commonly, it is serialized in ISO8601 format, such as:
2013-05-09T13:29:00-04:00
If you might need to edit that time, then you cannot just store the offset. Somewhere in your system, you also need to have the time zone identifier. Otherwise, you have no way to determine what the new offset should be after the edit is made. If you desire, you can store this with the value itself. Some platforms have objects for exactly this purpose - such as ZonedDateTime in NodaTime. Example:
2013-05-09T13:29:00-04:00 America/New_York
Even when storing the zone id, you still need to record the offset. This is to resolve ambiguity during a "fall-back" transition from a daylight offset to a standard offset.
Alternatively, you could store the time at UTC with the time zone name:
2013-05-09T17:29:00Z America/New_York
This would work just as well, but you'd have to apply the time zone before displaying the value to anyone. TIMESTAMP WITH TIME ZONE in Oracle and PostgreSQL work this way.
You can read more about this in this post; while .Net focused, the idea is applicable to other platforms as well. The example problem you gave is what I call "maintaining the perspective of the observer", which is discussed in the same article.
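To tie this back to the messages example: store the instant (UTC or local-plus-offset), keep the zone id somewhere, and convert with the zone of the event before bucketing by hour. A small Python sketch (zoneinfo, Python 3.9+), assuming the country in the question behaves like the IANA zone America/Santiago (-3 in summer, -4 in winter):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    tz = ZoneInfo("America/Santiago")   # one zone id; the offset rules live inside it

    # The two messages from the question, stored as UTC instants
    messages = [
        datetime(2013, 3, 1, 17, 0, tzinfo=timezone.utc),   # March 1, 5 pm UTC
        datetime(2013, 8, 1, 17, 0, tzinfo=timezone.utc),   # August 1, 5 pm UTC
    ]

    for sent_utc in messages:
        sent_local = sent_utc.astimezone(tz)
        # Bucket by the local hour *of the event*, not by today's offset:
        # this prints 14:00 -0300 for March and 13:00 -0400 for August.
        print(sent_utc.isoformat(), "->", sent_local.strftime("%H:%M %z"))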
that is: does my country have two timezones? or does the timezone include this information?
The term "timezone" usually includes that information. For example, in Java, "TimeZone represents a time zone offset, and also figures out daylight savings" (link), and on Unix-like systems, the tz database contains DST information.
However, for a single timestamp, I think it's more common to give just a UTC offset than a complete time-zone identifier.
[…] in my database.
Naturally, you should consult your database's documentation, or at least indicate what database you're using, and what tools (e.g., what drivers, what languages) you're using to access it.
Here's an example of a very popular format for describing timezones (though not what Windows uses).
You can see that it's more than a simple offset; it's more along the lines of offsets plus the set of rules (changing over time) for when to use which offset.
