Problem:
I have a dataset about hedge funds. It contains monthly hedge fund returns and several financial metrics, calculated for every month from 2010 through December 2019 (2,889 monthly observations). I want to perform binary classification and predict each hedge fund's class for the following month based on these metrics, i.e. predict T+1 from data available at T. I plan to use random forest and other classifiers (decision tree, KNN, SVM, logistic regression). I know this dataset is a time series problem; how do I convert it into a machine learning problem?
I am open to your suggestions and advice on what methods or approaches should be followed for modeling, feature engineering, and preparing this dataset.
Additional Questions:
1) How should I split this data into training and test sets? 0.80/0.20? Is there any other validation method you can recommend?
2) Some funds were added to the data later, so not all funds have series of equal length; for example, the "AEB" fund, established in 2015, has no data before 2015. There are a few such funds. Do they cause problems, or would it be better to delete them and remove them from the dataset? I have data on 27 different funds in total.
3) In addition, I have replaced the hedge funds' tickers/names with numeric IDs. Is it possible to do dummy encoding instead, and would that be better for performance? (A rough sketch of what I mean is shown below.)
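For question 3, this is roughly what I mean (a minimal sketch; "fund_id" is a placeholder name for the numeric ID column I created):

import pandas as pd

# Minimal sketch of dummy (one-hot) encoding the numeric fund ID.
# "fund_id" is a placeholder name for the ID column I created.
df = pd.DataFrame({
    "fund_id": [1, 2, 3, 1],
    "sharpe":  [0.10, 0.41, -0.05, 0.05],
})

# One indicator column per fund. Tree-based models can often use the raw
# integer ID directly, but KNN/SVM/logistic regression usually do better
# with dummies, since the integer has no ordinal meaning.
encoded = pd.get_dummies(df, columns=["fund_id"], prefix="fund")
print(encoded)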
Sample Dataset:
Date | Fund Name / Ticker | sharpe | sortino | beta | alpha | target |
------------|--------------------|--------|---------|-------|-------|--------|
31.03.2010 | ABC | -0,08 | 0,025 | 0,6 | 0,13 | 1 |
31.03.2010 | DEF | 0,41 | 1,2 | 1,09 | 0,045 | 0 |
31.03.2010 | SDF | 0,03 | 0,13 | 0,99 | -0,07 | 1 |
31.03.2010 | CBD | 0,71 | -0,05 | 1,21 | 0,2 | 1 |
30.04.2010 | ABC | 0,05 | -0,07 | 0,41 | 0,04 | 0 |
30.04.2010 | DEF | 0,96 | 0,2 | 1,09 | 1,5 | 0 |
30.04.2010 | SDF | -0,06 | 0,23 | 0,13 | 0,23 | 0 |
30.04.2010 | CBD | 0,75 | -0,01 | 0,97 | -0,06 | 1 |
: | : | : | : | : | : | : |
: | : | : | : | : | : | : |
30.12.2019 | ABC | 0,05 | -0,07 | 0,41 | 0,04 | 1 |
30.12.2019 | DEF | 0,96 | 0,2 | 1,09 | 1,5 | 0 |
30.12.2019 | SDF | -0,06 | 0,23 | 0,13 | 0,23 | 0 |
30.12.2019 | CBD | 0,75 | -0,01 | 0,97 | -0,06 | 1 |
30.12.2019 | FGF | 1,45 | 0,98 | -0,03 | 0,55 | 1 |
30.12.2019 | AEB | 0,25 | 1,22 | 0,17 | -0,44 | 0 |
My Idea and First Try:
I modeled one example. I shifted the target variable back by one period (shift(-1)), so each row shows the class the fund belonged to in the following month. I did this because I want to predict the next month before that month starts, i.e. predict T+1 from T. But this model gave a very poor result (43% accuracy).
View of the dataset for this model:
Date | Fund Name / Ticker | sharpe | sortino | beta | alpha | target |
------------|--------------------|--------|---------|-------|-------|--------|
31.03.2010 | ABC | -0,08 | 0,025 | 0,6 | 0,13 | 1 |
31.03.2010 | DEF | 0,41 | 1,2 | 1,09 | 0,045 | 0 |
31.03.2010 | SDF | 0,03 | 0,13 | 0,99 | -0,07 | 1 |
31.03.2010 | CBD | 0,71 | -0,05 | 1,21 | 0,2 | 1 |
30.04.2010 | ABC | 0,05 | -0,07 | 0,41 | 0,04 | 0 |
30.04.2010 | DEF | 0,96 | 0,2 | 1,09 | 1,5 | 0 |
30.04.2010 | SDF | -0,06 | 0,23 | 0,13 | 0,23 | 0 |
30.04.2010 | CBD | 0,75 | -0,01 | 0,97 | -0,06 | 1 |
: | : | : | : | : | : | : |
: | : | : | : | : | : | : |
30.12.2019 | ABC | 0,05 | -0,07 | 0,41 | 0,04 | 0 |
30.12.2019 | DEF | 0,96 | 0,2 | 1,09 | 1,5 | 0 |
30.12.2019 | SDF | -0,06 | 0,23 | 0,13 | 0,23 | 1 |
30.12.2019 | CBD | 0,75 | -0,01 | 0,97 | -0,06 | 1 |
30.12.2019 | FGF | 1,45 | 0,98 | -0,03 | 0,55 | 0 |
30.12.2019 | AEB | 0,25 | 1,22 | 0,17 | -0,44 | ? |
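Here is a rough sketch of that first try in pandas/scikit-learn (the column names, the CSV file name, and the 2018 split date are illustrative placeholders, not my actual files):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Placeholder file and column names for illustration only.
df = pd.read_csv("funds.csv", parse_dates=["Date"], dayfirst=True)
df = df.sort_values(["Ticker", "Date"])

# Row at month T gets the class observed at T+1, computed within each fund
# so one fund's label never leaks into another fund's rows.
df["target_next"] = df.groupby("Ticker")["target"].shift(-1)
labeled = df.dropna(subset=["target_next"])

# Chronological split instead of a random 80/20 split, so the test set
# only contains months that come after the training months.
features = ["sharpe", "sortino", "beta", "alpha"]
train = labeled[labeled["Date"] < "2018-01-01"]
test = labeled[labeled["Date"] >= "2018-01-01"]

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(train[features], train["target_next"])
print(accuracy_score(test["target_next"], clf.predict(test[features])))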
There are many approaches out there that you can find. Time series are challenging, and it's okay to have poor results at the beginning. I advise you to do the following:
Add some lags as additional columns in your dataset. You want to predict t+1 and you have t, so also compute t-1, t-2, t-3, etc. (see the sketch after this list).
To choose how many lags to keep, make ACF and PACF plots and look at the first lags that fall outside the shaded confidence region.
Lags might boost your accuracy.
Try to normalize/standardize your data when modeling.
Check whether your time series is a random walk; if it is, there are many recent papers that tackle the problem of random-walk prediction.
If your dataset is big enough, try some neural networks such as LSTMs, RNNs, or GANs, which might do better than the shallow models you have mentioned.
I really advise you to look at Jason Brownlee's tutorials on time series. He knows the subject very well, you can always add comments to his tutorials, and he is responsive.
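As a rough sketch of the lag idea (the column and file names are assumptions based on your sample data; statsmodels and matplotlib are only needed for the ACF/PACF plots):

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Assumed file and column names, mirroring the sample table in the question.
df = pd.read_csv("funds.csv", parse_dates=["Date"], dayfirst=True)
df = df.sort_values(["Ticker", "Date"])

# t-1, t-2, t-3 values of each metric, computed within each fund so that
# one fund's history never bleeds into another fund's rows.
metrics = ["sharpe", "sortino", "beta", "alpha"]
for lag in (1, 2, 3):
    for col in metrics:
        df[f"{col}_lag{lag}"] = df.groupby("Ticker")[col].shift(lag)

# The first months of each fund have no lags available; drop them.
df = df.dropna()

# ACF/PACF of a single fund's Sharpe series to judge how many lags matter:
# the lags that stick out of the shaded confidence band are the candidates.
one_fund = df[df["Ticker"] == "ABC"].set_index("Date")["sharpe"]
fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(one_fund, ax=axes[0])
plot_pacf(one_fund, ax=axes[1])
plt.show()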
I want to use foreach in Stata to cut the data into equal-sized samples and export each one.
I have written the following code snippet:
foreach i of numlist 0/11 {
    preserve
    keep id projectno nickname
    gen start=`i'*30000+1
    gen end=(`i'+1)*30000
    outsheet using d:\profile\nickname_`i'.xls in `start'/`end'
    restore
}
However, I receive the error below despite having revised it many times:
'/' invalid observation number
How can I correct my code?
This isn't a complete answer -- and focuses on a side-issue to your question -- but it won't fit easily into a comment.
Together with changes explained elsewhere, I would change the order of your commands to
preserve
keep id projectno nickname

forval i = 0/11 {
    local start = `i' * 30000 + 1
    local end = (`i' + 1) * 30000
    outsheet using d:\profile\nickname_`i'.xls in `start'/`end'
}

restore
The in statement in the outsheet command is wrong because start and end are generated as variables rather than local macros. You need to initialize both start and end as follows:
local start = `i' * 30000 + 1
local end = (`i' + 1) * 30000
Consider the following toy example using Stata's auto dataset:
sysuse auto, clear
foreach i of numlist 0/11 {
    preserve
    keep price mpg make
    local start = (`i' * 3) + 1
    local end = (`i' + 1) * 3
    list in `start' / `end'
    restore
}
Results:
+---------------------------+
| make price mpg |
|---------------------------|
1. | AMC Concord 4,099 22 |
2. | AMC Pacer 4,749 17 |
3. | AMC Spirit 3,799 22 |
+---------------------------+
+-----------------------------+
| make price mpg |
|-----------------------------|
4. | Buick Century 4,816 20 |
5. | Buick Electra 7,827 15 |
6. | Buick LeSabre 5,788 18 |
+-----------------------------+
+------------------------------+
| make price mpg |
|------------------------------|
7. | Buick Opel 4,453 26 |
8. | Buick Regal 5,189 20 |
9. | Buick Riviera 10,372 16 |
+------------------------------+
+------------------------------+
| make price mpg |
|------------------------------|
10. | Buick Skylark 4,082 19 |
11. | Cad. Deville 11,385 14 |
12. | Cad. Eldorado 14,500 14 |
+------------------------------+
+-------------------------------+
| make price mpg |
|-------------------------------|
13. | Cad. Seville 15,906 21 |
14. | Chev. Chevette 3,299 29 |
15. | Chev. Impala 5,705 16 |
+-------------------------------+
+---------------------------------+
| make price mpg |
|---------------------------------|
16. | Chev. Malibu 4,504 22 |
17. | Chev. Monte Carlo 5,104 22 |
18. | Chev. Monza 3,667 24 |
+---------------------------------+
+------------------------------+
| make price mpg |
|------------------------------|
19. | Chev. Nova 3,955 19 |
20. | Dodge Colt 3,984 30 |
21. | Dodge Diplomat 4,010 18 |
+------------------------------+
+-------------------------------+
| make price mpg |
|-------------------------------|
22. | Dodge Magnum 5,886 16 |
23. | Dodge St. Regis 6,342 17 |
24. | Ford Fiesta 4,389 28 |
+-------------------------------+
+----------------------------------+
| make price mpg |
|----------------------------------|
25. | Ford Mustang 4,187 21 |
26. | Linc. Continental 11,497 12 |
27. | Linc. Mark V 13,594 12 |
+----------------------------------+
+---------------------------------+
| make price mpg |
|---------------------------------|
28. | Linc. Versailles 13,466 14 |
29. | Merc. Bobcat 3,829 22 |
30. | Merc. Cougar 5,379 14 |
+---------------------------------+
+-----------------------------+
| make price mpg |
|-----------------------------|
31. | Merc. Marquis 6,165 15 |
32. | Merc. Monarch 4,516 18 |
33. | Merc. XR-7 6,303 14 |
+-----------------------------+
+------------------------------+
| make price mpg |
|------------------------------|
34. | Merc. Zephyr 3,291 20 |
35. | Olds 98 8,814 21 |
36. | Olds Cutl Supr 5,172 19 |
+------------------------------+
Note that it is not necessary for the commands preserve, keep and restore to be within your loop, as they are one-time operations and repeating them is simply inefficient.
I am using EmoKit (https://github.com/openyou/emokit) to retrieve data. The sample data looks like as follows:
+========================================================+
| Sensor | Value | Quality | Quality L1 | Quality L2 |
+--------+----------+----------+------------+------------+
| F3 | -768 | 5672 | None | Excellent |
| FC5 | 603 | 7296 | None | Excellent |
| AF3 | 311 | 7696 | None | Excellent |
| F7 | -21 | 296 | Nothing | Nothing |
| T7 | 433 | 104 | Nothing | Nothing |
| P7 | 581 | 7592 | None | Excellent |
| O1 | 812 | 7760 | None | Excellent |
| O2 | 137 | 6032 | None | Excellent |
| P8 | 211 | 5912 | None | Excellent |
| T8 | -51 | 6624 | None | Excellent |
| F8 | 402 | 7768 | None | Excellent |
| AF4 | -52 | 7024 | None | Excellent |
| FC6 | 249 | 6064 | None | Excellent |
| F4 | 509 | 5352 | None | Excellent |
| X | -2 | N/A | N/A | N/A |
| Y | 0 | N/A | N/A | N/A |
| Z | ? | N/A | N/A | N/A |
| Batt | 82 | N/A | N/A | N/A |
+--------+----------+----------+------------+------------+
|Packets Received: 3101 | Packets Processed: 3100 |
| Sampling Rate: 129 | Crypto Rate: 129 |
+========================================================+
Are these values in microvolts? If so, how can they be more than 200 microvolts? EEG data is typically in the range of 0-200 microvolts. Or does this require some kind of processing? If so, what?
As described in the frequently asked questions of emokit:
What unit is the data I'm getting back in? How do I get volts out of it?
One least-significant-bit of the fourteen-bit value you get back is 0.51 microvolts. See the specification for more details.
Looking for the details in the specification (via archive.org), we find the following for the "Emotiv EPOC Neuroheadset":
Resolution | 14 bits 1 LSB = 0.51μV (16 bit ADC,
| 2 bits instrumental noise floor discarded)
Dynamic range (input referred) | 8400μV (pp)
As a validation, we can check that for a 14-bit linear ADC, the 8400 microvolts (peak-to-peak) would be divided into steps of 8400 / 16384, or approximately 0.5127 microvolts.
For the Epoc+, the comparison chart indicates a 14-bit and a 16-bit version (with a +/- 4.17 mV dynamic range, i.e. 8340 microvolts peak-to-peak). The 16-bit version would then have raw data steps of 8340 / 65536, or approximately 0.127 microvolts. If that is what you are using, then the largest value of 812 you listed would correspond to 812 * 0.127 ≈ 103 microvolts.
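As a quick numeric check (a small sketch; the LSB sizes are the ones derived above, and which one applies depends on your headset model):

# Convert raw Emotiv readings to microvolts, using the LSB sizes derived above.
LSB_14BIT_UV = 8400 / 2**14   # ~0.5127 uV per count (14-bit EPOC)
LSB_16BIT_UV = 8340 / 2**16   # ~0.1273 uV per count (16-bit EPOC+)

# Sample readings taken from the table in the question.
raw_values = {"F3": -768, "FC5": 603, "O1": 812}

for sensor, raw in raw_values.items():
    print(f"{sensor}: {raw * LSB_14BIT_UV:7.1f} uV (14-bit)  "
          f"{raw * LSB_16BIT_UV:7.1f} uV (16-bit)")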
I'm a bit stumped.
In my database, I have a relationship like this:
(u:User)-[r1:LISTENS_TO]->(a:Artist)<-[r2:LISTENS_TO]-(u2:User)
I want to perform a query where for a given user, I find the common artists between that user and every other user.
To give an idea of size of my database, I have about 600 users, 47,546 artists, and 184,211 relationships between users and artists.
The first query I was trying was the following:
START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH
pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
WHERE
other:User
WITH other, COUNT(DISTINCT pMutualArtists) AS mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
RETURN other.username, mutualArtists
This was taking around 20 seconds to return. The profile for this query is as follows:
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
| ColumnFilter(0) | 10 | 0 | | keep columns other.username, mutualArtists |
| Extract | 10 | 20 | | other.username |
| ColumnFilter(1) | 10 | 0 | | keep columns other, mutualArtists |
| Top | 10 | 0 | | { AUTOINT0}; Cached( INTERNAL_AGGREGATEb6facb18-1c5d-45a6-83bf-a75c25ba6baf of type Integer) |
| EagerAggregation | 563 | 0 | | other |
| OptionalMatch | 52806 | 0 | | |
| Eager(0) | 563 | 0 | | |
| NodeByIndexQuery(1) | 563 | 564 | other, other | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | me, me | Literal(List(553314)) |
| Eager(1) | 82 | 0 | | |
| ExtractPath | 82 | 0 | pMutualArtists | |
| Filter(0) | 82 | 82 | | (hasLabel(a:Artist(1)) AND NOT(ar1 == ar2)) |
| SimplePatternMatcher | 82 | 82 | a, me, ar2, ar1, other | |
| Filter(1) | 1 | 3 | | ((hasLabel(me:User(3)) AND hasLabel(other:User(3))) AND hasLabel(other:User(3))) |
| NodeByIndexQuery(1) | 563 | 564 | other, other | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | me, me | Literal(List(553314)) |
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
I was frustrated. It didn't seem like this should take 20 seconds.
I came back to the problem later on, and tried debugging it from the start.
I started to break down the query, and I noticed I was getting much faster results. Without the Neo4J Spatial query, I was getting results in about 1.5 seconds.
I finally added things back, and ended up with the following query:
START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH
pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
WHERE
u2:User
WITH u2, COUNT(DISTINCT pMutualArtists) AS mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
RETURN u2.username, mutualArtists
This query returns in 4240 ms. A 5X improvement! The profile for this query is as follows:
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
| ColumnFilter(0) | 10 | 0 | | keep columns u2.username, mutualArtists |
| Extract | 10 | 20 | | u2.username |
| ColumnFilter(1) | 10 | 0 | | keep columns u2, mutualArtists |
| Top | 10 | 0 | | { AUTOINT0}; Cached( INTERNAL_AGGREGATEbdf86ac1-8677-4d45-967f-c2dd594aba49 of type Integer) |
| EagerAggregation | 563 | 0 | | u2 |
| OptionalMatch | 52806 | 0 | | |
| Eager(0) | 563 | 0 | | |
| NodeByIndexQuery(1) | 563 | 564 | u2, u2 | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | u, u | Literal(List(553314)) |
| Eager(1) | 82 | 0 | | |
| ExtractPath | 82 | 0 | pMutualArtists | |
| Filter(0) | 82 | 82 | | (hasLabel(a:Artist(1)) AND NOT(ar1 == ar2)) |
| SimplePatternMatcher | 82 | 82 | a, u2, u, ar2, ar1 | |
| Filter(1) | 1 | 3 | | ((hasLabel(u:User(3)) AND hasLabel(u2:User(3))) AND hasLabel(u2:User(3))) |
| NodeByIndexQuery(1) | 563 | 564 | u2, u2 | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | u, u | Literal(List(553314)) |
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
And, to prove that I ran them both in a row and got very different results:
neo4j-sh (?)$ START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
>
> OPTIONAL MATCH
> pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
> WHERE
> u2:User
>
> WITH u2, COUNT(DISTINCT pMutualArtists) AS mutualArtists
> ORDER BY mutualArtists DESC
> LIMIT 10
> RETURN u2.username, mutualArtists
> ;
+------------------------------+
| u2.username | mutualArtists |
+------------------------------+
| "573904765" | 644 |
| "28600291" | 601 |
| "1092510304" | 558 |
| "1367963461" | 521 |
| "1508790199" | 455 |
| "1335360028" | 447 |
| "18200866" | 444 |
| "1229430376" | 435 |
| "748318333" | 434 |
| "5612902" | 431 |
+------------------------------+
10 rows
4240 ms
neo4j-sh (?)$ START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
>
> OPTIONAL MATCH
> pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
> WHERE
> other:User
>
> WITH other, COUNT(DISTINCT pMutualArtists) AS mutualArtists
> ORDER BY mutualArtists DESC
> LIMIT 10
> RETURN other.username, mutualArtists;
+--------------------------------+
| other.username | mutualArtists |
+--------------------------------+
| "573904765" | 644 |
| "28600291" | 601 |
| "1092510304" | 558 |
| "1367963461" | 521 |
| "1508790199" | 455 |
| "1335360028" | 447 |
| "18200866" | 444 |
| "1229430376" | 435 |
| "748318333" | 434 |
| "5612902" | 431 |
+--------------------------------+
10 rows
20418 ms
Unless I have gone crazy, the only difference between these two queries is the names of the nodes (I've changed "me" to "u" and "other" to "u2").
Why does that cause a 5X improvement??!?!
If anyone has any insight into this, I would be eternally grateful.
Thanks,
-Adam
EDIT 8.1.14
Based on @ulkas's suggestion, I tried simplifying the query.
The results were:
START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
RETURN u2.username, COUNT(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
~4 seconds
START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
RETURN other.username, COUNT(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
~20 seconds
So bizarre. It seems as though merely naming the nodes "other" and "me" instead of "u" and "u2" causes the query time to jump tremendously. I'm very confused.
Thanks,
-Adam
That sounds like you're seeing the effect of caching. Upon the first access the cache is not populated. Subsequent queries hitting the same graph will be much faster since the nodes/relationships are already available in the cache.
Using OPTIONAL MATCH followed by WHERE other:User makes no sense here, since the end node other (u2) has to be matched anyway. Try to perform the queries without OPTIONAL MATCH and WHERE, and without the last WITH; simply:
START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
MATCH
pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
RETURN other.username, count(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
We've been trying to figure out how to reduce the boot-up memory footprint of our rails app by identifying memory-hungry gems and finding alternatives or solutions.
But there's one behavior on OS X that I find baffling.
With a brand new generated rails app (rails new memoryusage), with no Gemfile, no models, no data and no transactions, upon starting up rails c the memory OS X displays for the corresponding ruby process will vary each time it is started up, from as low as 60MB to as high as 65MB, with no discernible pattern as to why the same app might require less or more memory per execution.
I imagine this has to do in some way with how Ruby allocates memory, but I'm not completely clear on why its memory allocation would vary so wildly for the same code and no variable processing.
We have similarly unpredictable behavior when we try to calculate the memory consumed by the process after each gem in the Gemfile is required. We load up a vanilla rails process, then in rails c we run a script that parses the Gemfile and requires each Gem individually, logging the memory pre- and post-require, and what we notice is that not only does the memory footprint not have a consistent starting point, but also the incremental 'steps' in our memory consumption vary wildly.
We booted up our process three times, one after the other, and measured the startup memory and the incremental memory required by each gem. Not only did the startup memory footprints bounce between 60MB and 92MB, but the points at which we saw memory jumps when loading each gem were inconsistent: sometimes loading SASS would eat up an additional 5MB, sometimes it wouldn't; sometimes active_merchant would demand an additional 10MB, other times it wouldn't.
: BOOT UP #1 : BOOT UP #2 : BOOT UP #3
gem : increment | total : increment | total : increment | total
rails : 0.00 | 59.71 : 0.00 | 92.54 : 0.18 | 67.76
unicorn : 0.52 | 60.24 : 0.52 | 93.06 : 3.35 | 71.12
haml : 8.77 | 69.02 : 1.88 | 94.94 : 9.45 | 80.57
haml-rails : 0.00 | 69.02 : 0.00 | 94.94 : 0.00 | 80.57
sass : 4.36 | 73.38 : 6.95 | 101.89 : 0.99 | 81.55
mongoid : 0.00 | 73.38 : 0.00 | 101.89 : 0.00 | 81.55
compass : 11.56 | 84.93 : 3.23 | 105.12 : 8.41 | 89.96
compass-rails : 0.00 | 84.93 : 0.08 | 105.20 : 0.00 | 89.96
compass_twitter_bootstrap: 0.00 | 84.93 : 0.00 | 105.20 : 0.00 | 89.96
profanalyzer : 0.59 | 85.52 : 0.46 | 105.66 : 0.64 | 90.60
simple_form : 0.34 | 85.87 : 0.35 | 106.01 : 0.00 | 90.60
sorcery : 0.00 | 85.87 : 0.25 | 106.26 : 1.07 | 91.67
validates_timeliness: 1.47 | 87.34 : 1.82 | 108.07 : 1.62 | 93.29
mongoid_token : 0.00 | 87.34 : 0.00 | 108.07 : 0.00 | 93.29
nested_form : 0.00 | 87.34 : 0.00 | 108.07 : 0.01 | 93.30
nokogiri : 0.86 | 88.20 : 1.16 | 109.24 : 1.37 | 94.67
carmen : 0.00 | 88.20 : 0.07 | 109.30 : 0.00 | 94.67
carrierwave/mongoid : 2.78 | 90.98 : 0.38 | 109.69 : 0.13 | 94.80
yajl : 0.04 | 91.02 : 0.04 | 109.73 : 0.04 | 94.84
multi_json : 0.00 | 91.02 : 0.00 | 109.73 : 0.00 | 94.84
uuid : 0.00 | 91.03 : 0.00 | 109.73 : 0.41 | 95.25
tilt : 0.00 | 91.03 : 0.00 | 109.73 : 0.00 | 95.25
dynamic_form : 0.00 | 91.04 : 0.00 | 109.73 : 0.00 | 95.25
forem : 0.03 | 91.07 : 0.00 | 109.73 : 0.00 | 95.25
browser : 0.00 | 91.07 : 0.00 | 109.73 : 0.00 | 95.25
activemerchant : 2.17 | 93.24 : 1.18 | 110.92 : 10.58 | 105.83
kaminari : 0.00 | 93.24 : 0.00 | 110.92 : 0.00 | 105.83
merit : 0.00 | 93.24 : 0.00 | 110.92 : 0.00 | 105.83
memcachier : 0.00 | 93.24 : 0.00 | 110.92 : 0.00 | 105.83
dalli : 0.01 | 93.25 : 0.05 | 110.96 : 0.34 | 106.17
bitly : 2.47 | 95.72 : 9.43 | 120.40 : 1.53 | 107.70
em-synchrony : 1.00 | 96.72 : 0.18 | 120.57 : 0.55 | 108.24
em-http-request : 5.56 | 102.28 : 2.15 | 122.72 : 1.40 | 109.64
httparty : 0.00 | 102.28 : 0.00 | 122.72 : 0.00 | 109.64
rack-block : 0.00 | 102.28 : 0.00 | 122.72 : 0.00 | 109.64
resque/server : 1.21 | 103.49 : 1.73 | 124.45 : 1.68 | 111.32
resque_mailer : 0.00 | 103.49 : 0.00 | 124.45 : 0.00 | 111.32
rack-timeout : 0.00 | 103.49 : 0.00 | 124.45 : 0.00 | 111.32
chronic : 1.66 | 105.15 : 0.67 | 125.12 : 0.64 | 111.96
oink : 0.00 | 105.15 : 0.00 | 125.12 : 0.00 | 111.96
dotenv-rails : 0.00 | 105.15 : 0.00 | 125.12 : 0.00 | 111.96
jquery-rails : 0.00 | 105.15 : 0.03 | 125.15 : 0.00 | 111.96
jquery-ui-rails : 0.00 | 105.15 : 0.00 | 125.15 : 0.00 | 111.96
It's clear to me that there's something very basic that I'm missing and don't understand about how memory is allocated to Ruby processes, but I'm having a hard time figuring out why it could be this seemingly stochastic. Anyone have any thoughts?
I'm going to take a wild guess and say this is caused by address space layout randomization and by interactions with shared libraries whose footprints are influenced by running programs that are not in your test case.
OS X has received increasing support for ASLR starting with 10.5 and as of 10.8 even the kernel is randomly relocated.
In some cases, ASLR can cause a program segment to use extra pages depending on whether the offset causes a page boundary to be crossed. Since there are many segments and many libraries, this effect is difficult to predict.
I'm also wondering if (given the huge differences you are seeing) perhaps this is a reporting issue in OS X. I wonder if the overhead for shared objects is being charged unfairly depending on load order.
You can test this by consulting this Stack Overflow question: Disabling ASLR in Mac OS X Snow Leopard.