I am trying to reproduce in Google Sheets the calculations underlying the sample mortgage Closing Disclosure provided by the CFPB at:
http://files.consumerfinance.gov/f/201311_cfpb_kbyo_closing-disclosure.pdf
That document describes a mortgage with the following parameters:
Loan amount: $162,000
Annual interest rate: 3.875%
Monthly PMI: $82.35
Total loan costs: $4,694.05
Prepaid interest: $279.04
and summarizes it as follows (page 5):
Total of Payments: $285,803.36
Finance Charge: $118,830.27
Amount Financed: $162,000.00
Annual Percentage Rate (APR): 4.174%
Total Interest Percentage (TIP): 69.46%
Almost everything I calculate agrees, but I can't get the formula right for the APR (fourth line of the summary).
I currently calculate it as follows:
=100*12*rate(12*30, -1*(4694.05+279.04+162000+-1*cumipmt(0.03875/12, 30*12, 162000, 1, 30*12, 0)+82.35*80)/360, 162000, 0, 0)
This comes out to 4.218%, not 4.174% as published.
What am I missing?
The code I'm using is here:
https://docs.google.com/spreadsheets/d/1VQshp3A55brVv17eS9REdBjBUG0EmwTrcwhgXBVK8m8/edit#gid=0
APR has many nuances. It's primarily a customer-facing metric, so banks feel pressured to keep the number low.
Many institutions assume 12 months of 30 days each; some take other approaches. There are different ways of treating leap years, escrow, and so on. Since you are close, but not exact, this institution likely has a non-standard APR calculation.
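For comparison, the regulatory definition (Regulation Z, Appendix J) does not average the total of payments over 360 equal installments the way your RATE formula does; it solves for the rate that discounts the actual payment stream (principal and interest, plus mortgage insurance while it lasts) back to the amount financed. Below is a minimal sketch of that actuarial method in Python, assuming simple monthly unit-periods and 80 months of PMI (mirroring the 82.35*80 term in your formula); which figure to use as the amount financed (the disclosed $162,000.00, or the loan amount net of prepaid finance charges) is exactly where implementations differ.

from scipy.optimize import brentq

loan = 162_000.00
note_rate = 0.03875 / 12
n_payments = 360
pmi = 82.35
pmi_months = 80  # assumption: mirrors the 82.35*80 term in the question

# standard amortizing principal & interest payment (about $761.78 on this loan)
pi_payment = loan * note_rate / (1 - (1 + note_rate) ** -n_payments)

def pv_of_payments(i):
    # present value of the actual payment stream at monthly rate i
    return sum(
        (pi_payment + (pmi if t <= pmi_months else 0.0)) / (1 + i) ** t
        for t in range(1, n_payments + 1)
    )

# the disclosed Amount Financed; try loan - (4694.05 + 279.04) to see how much
# the choice of prepaid finance charges moves the result
amount_financed = 162_000.00

monthly_rate = brentq(lambda i: pv_of_payments(i) - amount_financed, 1e-6, 0.05)
print("APR approx. %.3f%%" % (12 * monthly_rate * 100))

With these assumptions the solved rate should land much closer to the published 4.174% than to your 4.218%, which points at the averaging in your formula as the main difference; I can't vouch that this is exactly how the sample figure was produced, though.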
I am writing an algorithm to extract keywords like rent, deposit, liabilities, etc. from rent agreement documents. I used a naive Bayes classifier, but it is not giving the desired output.
My training data looks like this:
train = [
    ("refundable security deposit Rs 50000 numbers equal 5 months", "deposit"),
    ("Lessee pay one month's advance rent Lessor", "security"),
    ("eleven (11) months commencing 1st march 2019", "duration"),
    ("commence 15th feb 2019 valid till 14th jan 2020", "startdate")]
The code below is not giving the desired keyword:
classifier.classify(test_data_features)
Please share if there are any NLP libraries that can accomplish this.
It seems like you need to build your own NER (Named Entity Recognizer) for parsing your unstructured documents.
You need to tag every word of your sentences with a label; based on the surrounding words and the context window, your trained NER will be able to give you the results you are looking for.
Check the Stanford CoreNLP implementation of NER.
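As an illustration (using spaCy rather than Stanford CoreNLP, simply because it is easy to call from Python), here is what a pre-trained NER gives you out of the box on one of your training sentences; generic labels such as MONEY, DATE and CARDINAL come for free, while domain-specific labels like "deposit" or "duration" would require annotating sentences and training your own model:

import spacy

# pre-trained English pipeline; install with:
#   pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("refundable security deposit Rs 50000 numbers equal 5 months")
for ent in doc.ents:
    # prints whatever generic entities the model finds (amounts, durations, ...)
    print(ent.text, ent.label_)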
How do you prepare cyclic ordinal features like time in a day or day in a week for the random forest algorithm?
If you just encode time as minutes after midnight, the numeric difference between 23:55 and 00:05 will be very large, even though it is only a 10-minute difference.
I found a solution here where the time feature is split into two features using the cosine and sine of the seconds after midnight. But is that appropriate for a random forest? With a random forest you can't be sure that all features will be considered at every split, so often half of the time information would be missing for a decision.
Looking forward to your thoughts!
If you have a date variable, with values like '2019/11/09', you can extract individual features like year (2019), month (11), day (09), day of the week (Monday), quarter (4), semester (2). You can go ahead and add additional features like "is bank holiday", "is weekend", or "advertisement campaign", if you know the dates of specific events.
If you have a time variable with values like 23:55, you can extract the hour (23), the minutes (55), and, if you had them, seconds, nanoseconds, etc. If you have information about the timezone, you can extract that as well.
If you have a datetime variable with values like '2019/11/09 23:55', you can combine the above.
If you have more than one datetime variable, you can capture differences between them; for example, if you have the date of birth and the date of application, you can derive the feature "age at time of application".
More info about the options for datetimes can be found in the pandas .dt accessor. Check the methods here.
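For instance, a minimal pandas sketch of the extractions above (the column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "application_date": pd.to_datetime(["2019/11/09 23:55"]),
    "date_of_birth": pd.to_datetime(["1990/05/01"]),
})

# calendar and time features from the .dt accessor
df["year"] = df["application_date"].dt.year
df["month"] = df["application_date"].dt.month
df["day"] = df["application_date"].dt.day
df["day_of_week"] = df["application_date"].dt.dayofweek  # Monday = 0
df["quarter"] = df["application_date"].dt.quarter
df["is_weekend"] = df["application_date"].dt.dayofweek >= 5
df["hour"] = df["application_date"].dt.hour
df["minute"] = df["application_date"].dt.minute

# difference between two datetime variables, e.g. age at time of application
df["age_at_application"] = (df["application_date"] - df["date_of_birth"]).dt.days / 365.25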
The cyclical transformation in your link is used to re-encode circular variables like hours of the day or months of the year, where, for example, December (month 12) is closer to January (month 1) than to July (month 7); if you encode months with plain numbers, this relationship is not captured. You would use this transformation if this is what you want to represent, but it is not the standard go-to method for transforming these variables (to my knowledge).
You can check Scikit-learn's tutorial on time-related feature engineering.
Random forests capture non-linear relationships between features and targets, so they should be able to handle either numerical features like month or the cyclical encoding.
To be absolutely sure, the best way is to try both engineering methods and see which one gives better model performance.
You can apply the cyclical transformation straightaway with the open source package Feature-engine. Check the CyclicalTransformer.
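If you prefer to do the cyclical transformation by hand rather than with the package, it is only a couple of lines; a sketch with numpy for hour of day (the same idea works for month of year with a period of 12):

import numpy as np
import pandas as pd

hours = pd.Series([23, 0, 12])  # 23:00, 00:00 and 12:00
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
# 23:00 and 00:00 now map to nearby points on the unit circle,
# while 12:00 ends up on the opposite side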
Has anyone noticed that Quandl WIKI or EOD data are different from Yahoo or Bloomberg? I noticed this while comparing data providers, using AAPL as a test. AAPL split its stock 7-for-1 on Jun 9, 2014, so I think it is an ideal candidate for comparing data.
Here is a picture of data comparison:
Do you know why they are different and which one I should trust? If I should trust neither, is there any other free data provider I can trust?
It depends on whether you adjust for dividends too (the series starting at 52 is adjusted for dividends, the series starting at 58 is not).
For reference, Bloomberg data:
Date         Unadjusted price   Adjusted for split   Adjusted for split & div
----------   ----------------   ------------------   ------------------------
27/12/2011   406.53             58.0757              52.4533
28/12/2011   402.64             57.52                51.9514
29/12/2011   405.12             57.8743              52.2714
etc.
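For what it's worth, the split adjustment is easy to verify yourself: divide the unadjusted price by the cumulative split factor (7 after AAPL's 7-for-1 split). Dividend adjustment additionally scales earlier prices down by a factor of (1 - dividend / close) at each ex-dividend date. A quick sketch; the dividend figures below are placeholders to show the mechanics, not actual AAPL dividends:

# split adjustment: divide by the cumulative split ratio (7-for-1 for AAPL in 2014)
unadjusted = 406.53
split_factor = 7.0
print(unadjusted / split_factor)  # ~58.0757, matching the "Adjusted for split" column

# dividend adjustment (usual convention): each ex-dividend date scales all earlier
# prices by (1 - dividend / prior close); hypothetical figures, for illustration only
hypothetical_dividends = [(0.47, 94.48), (0.47, 100.75)]  # (dividend, close) pairs
div_factor = 1.0
for dividend, close in hypothetical_dividends:
    div_factor *= 1 - dividend / close
print(unadjusted / split_factor * div_factor)  # split- and dividend-adjusted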
We collect data on our website traffic, which results in about 50k to 100k unique visits a day.
Cohort analysis:
Find the percentage of users within a 24-hour period who register at the website and then actually go to our purchasing page (i.e. calculate the percentage of users who do this within the first, second, third, etc. hour after registration).
Two very abbreviated sample documents:
sessionId: our unique identifier for performing counts
url: the url for evaluating cohorts
time: unix timestamp for event
{
"sessionId": "some-random-id",
"time": 1428238800000, (unix timestamp: Apr 5th, 3:00 pm)
"url": "/register"
}
{
"sessionId": "some-random-id",
"time": 1428241500000, (unix timestamp: Apr 5th, 3:45 pm)
"url": "/buy"
}
What if I want to do the same aggregation over a period of, say, 6 months, and would like to compute cohorts for returning customers? The data set would be too immense.
On a side note: I am also not interested in getting 100% accurate results; an approximation would be sufficient for trend analysis.
Can we achieve this with Druid? Or is it not suitable for this kind of analysis? Is there anything else that is better suited to cohort analysis?
I think you can do this with Druid and DataSketches.
Look at the last example on this page.
If you want to go with this approximation method, you can look here to understand the error bounds of the approximation and the trade-off you can make between memory and accuracy.
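To make the idea concrete, here is the session-level set intersection that such a sketch-based query estimates, written out with plain Python sets instead of Druid (exact rather than approximate, purely for illustration; the event shape follows the sample documents above):

from collections import defaultdict

# toy events shaped like the sample documents: (sessionId, url, unix timestamp in ms)
events = [
    ("s1", "/register", 1428238800000),
    ("s1", "/buy",      1428241500000),  # 45 minutes later -> converts in hour 1
    ("s2", "/register", 1428238800000),
    ("s2", "/buy",      1428247800000),  # 2.5 hours later  -> converts in hour 3
    ("s3", "/register", 1428238800000),  # never buys
]

registered_at = {sid: t for sid, url, t in events if url == "/register"}
buyers_by_hour = defaultdict(set)
for sid, url, t in events:
    if url == "/buy" and sid in registered_at:
        hour = int((t - registered_at[sid]) // 3_600_000) + 1  # 1st, 2nd, ... hour
        buyers_by_hour[hour].add(sid)

registered = set(registered_at)
for hour in sorted(buyers_by_hour):
    # Druid + DataSketches estimates the size of this intersection approximately
    cohort = registered & buyers_by_hour[hour]
    print("hour %d: %.1f%% of registrants bought" % (hour, 100 * len(cohort) / len(registered)))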
I have many emails that I would like to extract data from. The emails contain data but in different formats.
The below example contains data about a request for a shipment:
Account: SugarHigh Inc
Qty: 1,000 Tons Sugar
Date: 9 - 15 July
From: NY
To: IL
I would like to extract the above into the following format:
Account Quantity Product FromDate ToDate From To
------- -------- ------- -------- ------ ---- --
SugarHigh Inc 1000 Sugar 9 July 15 July NY IL
The same request can arrive in different formats. For example:
Acc: SugarHigh Inc
Qty/Date: 1,000 Tons Sugar/9 - 15 July
From/To: NY/IL
Some requests can even have more or fewer fields, or describe things differently.
Can machine learning be used to help extract this data fully or partially? If so, what types of algorithms/models exist for this kind of problem? I am assuming I might also need some kind of dictionary for known words such as products or locations.
Yes, it can; start by reading this post on text mining. That said, I'd recommend just using some (fuzzy) string searching. The variability in such data is limited; every time you encounter a new pattern, just add it to the algorithm. That should yield better results and cost less time.
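As a starting point, the pattern-matching approach for the two formats shown above could look roughly like this (a sketch with Python's re module; the patterns are specific to these two examples and would grow as new formats turn up, and a dictionary of known products/locations could be used to validate the captured values):

import re

def parse_request(text):
    """Extract shipment fields from the two example formats above."""
    fields = {}

    # account: "Account: SugarHigh Inc" or "Acc: SugarHigh Inc"
    m = re.search(r"^(?:Account|Acc):\s*(.+)$", text, re.MULTILINE)
    if m:
        fields["Account"] = m.group(1).strip()

    # quantity and product: "1,000 Tons Sugar"
    m = re.search(r"([\d,]+)\s*Tons\s+(\w+)", text)
    if m:
        fields["Quantity"] = int(m.group(1).replace(",", ""))
        fields["Product"] = m.group(2)

    # date range: "9 - 15 July"
    m = re.search(r"(\d{1,2})\s*-\s*(\d{1,2})\s+(\w+)", text)
    if m:
        fields["FromDate"] = "%s %s" % (m.group(1), m.group(3))
        fields["ToDate"] = "%s %s" % (m.group(2), m.group(3))

    # locations: "From: NY" / "To: IL" or "From/To: NY/IL"
    m = re.search(r"^From(?:/To)?:\s*(\w+)(?:/(\w+))?\s*$", text, re.MULTILINE)
    if m:
        fields["From"] = m.group(1)
        if m.group(2):
            fields["To"] = m.group(2)
    m = re.search(r"^To:\s*(\w+)\s*$", text, re.MULTILINE)
    if m:
        fields["To"] = m.group(1)

    return fields

email_1 = "Account: SugarHigh Inc\nQty: 1,000 Tons Sugar\nDate: 9 - 15 July\nFrom: NY\nTo: IL"
email_2 = "Acc: SugarHigh Inc\nQty/Date: 1,000 Tons Sugar/9 - 15 July\nFrom/To: NY/IL"

print(parse_request(email_1))
print(parse_request(email_2))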