Periods from grouped data based on two columns - frames

I have a file named example.csv with this data:
day,number,price,pr
2010-01-01 00:01:00,1,0.4,2
2010-01-01 00:02:00,1,1.2,4
2010-01-01 00:03:00,1,2.5,6
2010-01-01 00:04:00,1,9.1,2
2010-01-01 00:05:00,2,3.4,7
2010-01-01 00:06:00,2,6.9,9
2010-01-01 00:07:00,2,8.9,2
2010-01-01 00:08:00,3,9.1,5
2010-01-01 00:09:00,3,4.2,9
2010-01-01 00:10:00,3,11.2,2
2010-01-01 00:11:00,4,53.12,4
2010-01-01 00:12:00,4,45.21,1
2010-01-01 00:12:00,4,1.1,5
2010-01-01 00:13:00,4,3.43,2
2010-01-01 00:14:00,4,21.42,4
Loading the data:
example = read.csv(file="path/example.csv", header=TRUE, sep=",")
Creating an xts object ordered by the day column:
ddx <- xts(x = example[, c("number", "price", "pr" )], order.by = as.POSIXct(example[, "day"], tz = "GMT", format = "%Y-%m-%d %H:%M:%S"))
Applying this, the output it gives is the day and price columns:
period.apply(ddx$number, endpoints(ddx, on = "minutes", k = 3), sum)

Your method of creating the xts object is rather convoluted. Try the approach below.
txt <- 'day,number,price
2010-01-01 00:01:00,1,0.4
2010-01-01 00:02:00,1,1.2
2010-01-01 00:03:00,1,2.5
2010-01-01 00:04:00,2,9.1
2010-01-01 00:05:00,2,3.4
2010-01-01 00:06:00,2,6.9
2010-01-01 00:07:00,3,8.9
2010-01-01 00:08:00,3,9.1
2010-01-01 00:09:00,3,4.2
2010-01-01 00:10:00,4,11.2
2010-01-01 00:11:00,4,53.12
2010-01-01 00:12:00,4,45.21
2010-01-01 00:12:00,4,1.1
2010-01-01 00:13:00,4,3.43
2010-01-01 00:14:00,4,21.42'
DD <- read.csv(text = txt, stringsAsFactors = FALSE)
# DD is already a dataframe
DD
## day number price
## 1 2010-01-01 00:01:00 1 0.40
## 2 2010-01-01 00:02:00 1 1.20
## 3 2010-01-01 00:03:00 1 2.50
## 4 2010-01-01 00:04:00 2 9.10
## 5 2010-01-01 00:05:00 2 3.40
## 6 2010-01-01 00:06:00 2 6.90
## 7 2010-01-01 00:07:00 3 8.90
## 8 2010-01-01 00:08:00 3 9.10
## 9 2010-01-01 00:09:00 3 4.20
## 10 2010-01-01 00:10:00 4 11.20
## 11 2010-01-01 00:11:00 4 53.12
## 12 2010-01-01 00:12:00 4 45.21
## 13 2010-01-01 00:12:00 4 1.10
## 14 2010-01-01 00:13:00 4 3.43
## 15 2010-01-01 00:14:00 4 21.42
ddx <- xts(x = DD[, c("number", "price")], order.by = as.POSIXct(DD[, "day"], tz = "GMT", format = "%Y-%m-%d %H:%M:%S"))
ddx
## number price
## 2010-01-01 00:01:00 1 0.40
## 2010-01-01 00:02:00 1 1.20
## 2010-01-01 00:03:00 1 2.50
## 2010-01-01 00:04:00 2 9.10
## 2010-01-01 00:05:00 2 3.40
## 2010-01-01 00:06:00 2 6.90
## 2010-01-01 00:07:00 3 8.90
## 2010-01-01 00:08:00 3 9.10
## 2010-01-01 00:09:00 3 4.20
## 2010-01-01 00:10:00 4 11.20
## 2010-01-01 00:11:00 4 53.12
## 2010-01-01 00:12:00 4 45.21
## 2010-01-01 00:12:00 4 1.10
## 2010-01-01 00:13:00 4 3.43
## 2010-01-01 00:14:00 4 21.42
To use period.apply on the number column, just specify ddx$number instead of ddx:
period.apply(ddx$number, endpoints(ddx, on = "minutes", k = 3), sum)
## number
## 2010-01-01 00:02:00 2
## 2010-01-01 00:05:00 5
## 2010-01-01 00:08:00 8
## 2010-01-01 00:11:00 11
## 2010-01-01 00:14:00 16
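As an aside, a minimal sketch (assuming ddx as built above): to get per-period sums of every column at once, apply colSums over the whole object rather than a single column:
ep <- endpoints(ddx, on = "minutes", k = 3)
period.apply(ddx, ep, colSums)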


Set up expansion EEPROM i2c-2 BeagleBoneBlack Rev-C

The BeagleBoneBlack comes with an "internal" EEPROM connected to the i2c-0 line. I can see it clearly when I do i2cdetect:
debian@beaglebone:~$ i2cdetect -y -r 0
0 1 2 3 4 5 6 7 8 9 a b c d e f
00: -- -- -- -- -- -- -- -- -- -- -- -- --
10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
20: -- -- -- -- UU -- -- -- -- -- -- -- -- -- -- --
30: -- -- -- -- UU -- -- -- -- -- -- -- -- -- -- --
40: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
50: UU -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
60: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
70: UU -- -- -- -- -- -- --
It is showing under address 0x50. When I try to do a hexdump I get the following values with no issue:
sudo hexdump -C /sys/class/i2c-dev/i2c-0/device/0-0050/eeprom | head -5
00000000 aa 55 33 ee 41 33 33 35 42 4e 4c 54 30 30 30 43 |.U3.A335BNLT000C|
00000010 31 38 33 37 42 42 42 47 30 36 32 32 ff ff ff ff |1837BBBG0622....|
00000020 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
*
00001000 aa 55 33 ee 41 33 33 35 42 4e 4c 54 30 30 30 43 |.U3.A335BNLT000C|
Now I want to add another EEPROM (with a cape) on the i2c-2 line, which is supported according to BBB SRM section 8.2. It is the CAT24C256 as mentioned in the SRM. The allowable address range for the expansion cards is 0x54-0x57. When I do i2cdetect I can see the following:
debian@beaglebone:~$ i2cdetect -r -y 2
0 1 2 3 4 5 6 7 8 9 a b c d e f
00: -- -- -- -- -- -- -- -- -- -- -- -- --
10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
20: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
40: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
50: -- -- -- -- UU UU UU UU -- -- -- -- -- -- -- --
60: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
70: -- -- -- -- -- -- -- --
I can see the addresses 0x54-0x57 showing, but when I try hexdump I get an error:
hexdump: /sys/class/i2c-dev/i2c-2/device/2-0054/eeprom: Connection timed out
Questions:
Why are they showing as UU and not actual address numbers? I understand UU means the address is in use by a driver?
Why am I failing to read from that EEPROM? I have tried all addresses from 0x54-0x57 with no luck. I can confirm that those addresses are showing in /sys/class/i2c-dev/i2c-2/device and each dir has the following in it:
debian@beaglebone:~$ ls /sys/class/i2c-dev/i2c-2/device/2-0054/ -la
total 0
drwxr-xr-x 4 root root 0 Oct 26 19:46 .
drwxr-xr-x 8 root root 0 Oct 26 19:46 ..
drwxr-xr-x 3 root root 0 Oct 26 19:47 2-00540
lrwxrwxrwx 1 root root 0 Oct 26 19:47 driver -> ../../../../../../bus/i2c/drivers/at24
-rw------- 1 root root 32768 Oct 26 19:47 eeprom
-r--r--r-- 1 root root 4096 Oct 26 19:47 modalias
-r--r--r-- 1 root root 4096 Oct 26 19:47 name
lrwxrwxrwx 1 root root 0 Oct 26 19:47 of_node -> ../../../../../../firmware/devicetree/base/ocp/i2c#4819c000/cape_eeprom0#54
drwxr-xr-x 2 root root 0 Oct 26 19:47 power
lrwxrwxrwx 1 root root 0 Oct 26 19:47 subsystem -> ../../../../../../bus/i2c
-rw-r--r-- 1 root root 4096 Oct 26 19:47 uevent
I can see the addresses mapping into the kernel, but when I try to hexdump eeprom it doesn't work at all. I thought this was supposed to be set up by the kernel, since it is mentioned in the BeagleBone SRM. Am I going to need an overlay to add to u-boot for this?
All I'm trying to do is read from the EEPROM like I did with the "internal" one to confirm it is working. What am I doing wrong?
The issue was that the cape manager was "hogging" those addresses on i2c-2. We need to disable the cape manager in order to free those addresses. After doing that, 0x57 shows under i2c-2, so reads should work afterwards.
Please check out the following link on how to disable the cape manager on the BeagleBone:
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/beagleboard/NG8cDWuv2Y0/69vk5F5ZAgAJ
Make sure to edit the am335x-boneblack-uboot.dts file: remove the include on line 11 and replace it with the following:
#include "am335x-bone-common-no-capemgr.dtsi"
Note this will disable your i2c-2 line by default, so either enable it via overlays or edit am335x-bone-common-no-capemgr.dtsi and add the following after &i2c0 (around line 245):
&i2c2 {
    pinctrl-names = "default";
    pinctrl-0 = <&i2c2_pins>;
    status = "okay";
    clock-frequency = <100000>;
};
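After rebooting with the modified tree, the same kind of read used for the internal EEPROM should succeed, assuming the cape still enumerates at 0x54:
sudo hexdump -C /sys/class/i2c-dev/i2c-2/device/2-0054/eeprom | head -5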

scikit-learn GridSearchCV does not work properly with random forest

I have a grid search implementation for random forest models.
train_X, test_X, train_y, test_y = train_test_split(features, target, test_size=.10, random_state=0)
# A bit of performance gain can be obtained from standardization
train_X, test_X = standarize(train_X, test_X)

tuned_parameters = [{
    'n_estimators': [5],
    'criterion': ['mse', 'mae'],
    'random_state': [0]
}]
scores = ['neg_mean_squared_error', 'neg_mean_absolute_error']

for n_fold in [5]:
    for score in scores:
        print("# Tuning hyper-parameters for %s with %d-fold" % (score, n_fold))
        start_time = time.time()
        print()

        # TODO: RandomForestRegressor
        clf = GridSearchCV(RandomForestRegressor(verbose=2), tuned_parameters, cv=n_fold,
                           scoring=score, verbose=2, n_jobs=-1)
        clf.fit(train_X, train_y)
        # ... rest omitted
Before using it for this grid search, I had used the exact same dataset for many other tasks, so there should not be any problem with the data. In addition, as a test, I first used LinearRegression to see if the entire pipeline runs smoothly; it works. Then I switched to RandomForestRegressor and set a very small number of estimators to test it again. A very strange thing happens then; I'll attach the verbose output. There is a very significant decrease in performance and I don't know what happened. There is no reason to spend 30+ minutes running one small grid search.
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.0s remaining: 0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.0s remaining: 0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.1s remaining: 0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.1s remaining: 0.0s
building tree 2 of 5
building tree 3 of 5
building tree 3 of 5
building tree 3 of 5
building tree 3 of 5
building tree 4 of 5
building tree 4 of 5
building tree 4 of 5
building tree 4 of 5
building tree 5 of 5
building tree 5 of 5
building tree 5 of 5
building tree 5 of 5
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 5.0s finished
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 5.0s finished
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 5.0s finished
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.2s finished
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 5.0s finished
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.3s finished
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.3s finished
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.2s finished
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.8s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total= 5.3s
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.8s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total= 5.3s
building tree 1 of 5
[CV] criterion=mae, n_estimators=5, random_state=0 ...................
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.9s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total= 5.3s
building tree 1 of 5
[CV] criterion=mae, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.9s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total= 5.3s
[CV] criterion=mae, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.0s remaining: 0.0s
building tree 2 of 5
building tree 3 of 5
building tree 4 of 5
building tree 5 of 5
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 5.3s finished
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.2s finished
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.5s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total= 5.6s
[CV] criterion=mae, n_estimators=5, random_state=0 ...................
building tree 1 of 5
The above log is printed in a few seconds, then things seem to get stuck starting here...
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 7.4min remaining: 0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 7.5min remaining: 0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 7.5min remaining: 0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 7.8min remaining: 0.0s
building tree 2 of 5
building tree 3 of 5
building tree 3 of 5
building tree 3 of 5
building tree 3 of 5
building tree 4 of 5
building tree 4 of 5
building tree 4 of 5
building tree 4 of 5
building tree 5 of 5
building tree 5 of 5
building tree 5 of 5
These lines took more than 20 minutes.
BTW, for each GridSearchCV run, linear regression takes less than 1 sec.
Do you have any idea why the performance decreases that much?
Any suggestions and comments are appreciated. Thank you.
Try setting max_depth for the RandomForestRegressor. This should reduce fitting time. By default max_depth=None.
For example:
tuned_parameters = [{
    'n_estimators': [5],
    'criterion': ['mse', 'mae'],
    'random_state': [0],
    'max_depth': [4],
}]
Edit: Also, by default RandomForestRegressor has n_jobs=1. It will build one tree at a time with this setting. Try setting n_jobs=-1.
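For example, a minimal sketch of that change, reusing the names from the question's code (note that n_jobs=-1 on both the estimator and GridSearchCV can oversubscribe cores, so you may want parallelism at only one level):
clf = GridSearchCV(RandomForestRegressor(verbose=2, n_jobs=-1),
                   tuned_parameters, cv=n_fold, scoring=score, verbose=2)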
In addition, instead of looping over the scoring parameters and calling GridSearchCV once per metric, you can specify multiple metrics at once. When doing so, you must also specify the metric you want GridSearchCV to select on as the value of refit. Then you can access all scores in the cv_results_ dictionary after the fit.
clf = GridSearchCV(RandomForestRegressor(verbose=2), tuned_parameters,
                   cv=n_fold, scoring=scores, refit='neg_mean_squared_error',
                   verbose=2, n_jobs=-1)
clf.fit(train_X, train_y)
results = clf.cv_results_
print(np.mean(results['mean_test_neg_mean_squared_error']))
print(np.mean(results['mean_test_neg_mean_absolute_error']))
http://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html#sphx-glr-auto-examples-model-selection-plot-multi-metric-evaluation-py

Neo4j Performance Challenge - How to Improve?

I've been wrangling with Neo4J for the last few weeks, trying to resolve some extremely challenging performance problems. At this point, I need some additional help because I can't determine how to move forward.
I have a graph with a total of approx 12.5 Million nodes and 64 Million relationships. The purpose of the graph is going to be analyzing suspicious financial behavior, so it is customers, accounts, transactions, etc.
Here is an example of the performance challenge:
This query for total nodes takes 96,064ms to complete, which is extremely long.
neo4j-sh (?)$ MATCH (n) RETURN count(n);
+----------+
| count(n) |
+----------+
| 12519940 |
+----------+
1 row
96064 ms
The query for total relationships takes 919,449ms to complete, which seems silly.
neo4j-sh (?)$ MATCH ()-[r]-() return count(r);
+----------+
| count(r) |
+----------+
| 64062508 |
+----------+
1 row
919449 ms
I have 6.6M Transaction nodes. When I attempt to search for transactions that have an amount above $8,000, the query takes 653,637ms, also way too long.
neo4j-sh (?)$ MATCH (t:Transaction) WHERE t.amount > 8000.00 return count(t);
+----------+
| count(t) |
+----------+
| 10696 |
+----------+
1 row
653637 ms
Relevant Schema
ON :Transaction(baseamount) ONLINE
ON :Transaction(type) ONLINE
ON :Transaction(amount) ONLINE
ON :Transaction(currency) ONLINE
ON :Transaction(basecurrency) ONLINE
ON :Transaction(transactionid) ONLINE (for uniqueness constraint)
Profile of Query:
neo4j-sh (?)$ PROFILE MATCH (t:Transaction) WHERE t.amount > 8000.00 return count(t);
+----------+
| count(t) |
+----------+
| 10696 |
+----------+
1 row
ColumnFilter
|
+EagerAggregation
|
+Filter
|
+NodeByLabel
+------------------+---------+----------+-------------+------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+---------+----------+-------------+------------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(t) |
| EagerAggregation | 1 | 0 | | |
| Filter | 10696 | 13216382 | | Property(t,amount(62)) > { AUTODOUBLE0} |
| NodeByLabel | 6608191 | 6608192 | t, t | :Transaction |
+------------------+---------+----------+-------------+------------------------------------------+
I am running these in the neo4j shell.
The performance challenges here are starting to create substantial doubt about whether I can even use Neo4j, and seem the opposite of the potential the platform offers.
I fully admit that I may have misconfigured something (I'm relatively new to Neo4j), so guidance on what to fix or what to look at is much appreciated.
Here are details of my setup:
System: Linux, Ubuntu, 16GB RAM, 3.5 i5 Proc, 256GB SSD HD
CPU
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
stepping : 3
microcode : 0x12
cpu MHz : 4230.625
cache size : 6144 KB
Memory
$ cat /proc/meminfo
MemTotal: 16115020 kB
MemFree: 224856 kB
MemAvailable: 8807160 kB
Buffers: 124356 kB
Cached: 8429964 kB
SwapCached: 8388 kB
Disk
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/data1--vg-root 219G 32G 177G 16% /
Neo4J.properties
neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=1G
neostore.relationshipgroupstore.db.mapped_memory=200M
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=500M
neostore.propertystore.db.arrays.mapped_memory=50M
neostore.propertystore.db.index.keys.mapped_memory=200M
relationship_auto_indexing=true
Neo4J-Wrapper.properties
wrapper.java.additional=-Dorg.neo4j.server.properties=conf/neo4j-server.properties
wrapper.java.additional=-Djava.util.logging.config.file=conf/logging.properties
wrapper.java.additional=-Dlog4j.configuration=file:conf/log4j.properties
#********************************************************************
# JVM Parameters
#********************************************************************
wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
wrapper.java.additional=-XX:-OmitStackTraceInFastThrow
# Uncomment the following lines to enable garbage collection logging
wrapper.java.additional=-Xloggc:data/log/neo4j-gc.log
wrapper.java.additional=-XX:+PrintGCDetails
wrapper.java.additional=-XX:+PrintGCDateStamps
wrapper.java.additional=-XX:+PrintGCApplicationStoppedTime
wrapper.java.additional=-XX:+PrintPromotionFailure
wrapper.java.additional=-XX:+PrintTenuringDistribution
# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size in MB.
wrapper.java.initmemory=4096
wrapper.java.maxmemory=6144
Other:
Changed the open file settings for Linux to 40k
I am not running anything else on this machine, no X Windows, no other DB server. Here is a snippet of top while running a query:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15785 neo4j 20 0 12.192g 8.964g 2.475g S 100.2 58.3 227:50.98 java
1 root 20 0 33464 2132 1140 S 0.0 0.0 0:02.36 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
The total file size in the graph.db directory is:
data/graph.db$ du --max-depth=1 -h
1.9G ./schema
36K ./index
26G .
Data loading was extremely hit or miss. Some merges would take less than 60 seconds (even for ~200-300K inserts), while some merges would last for over 3 hours (11,898,514ms for a CSV file with 189,999 rows merging on one date).
I get constant GC thread blocking:
2015-03-27 14:56:26.347+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 15422ms.
2015-03-27 14:56:39.011+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 12363ms.
2015-03-27 14:56:57.533+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 13969ms.
2015-03-27 14:57:17.345+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 14657ms.
2015-03-27 14:57:29.955+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 12309ms.
2015-03-27 14:58:14.311+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 1928ms.
Please let me know if I should add anything else that would be salient to the discussion
Update 1
Thank you very much for your help, I just moved so I was delayed in responding.
Size of Neostore Files:
/data/graph.db$ ls -lah neostore.*
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.id
-rw-rw-r-- 1 neo4j neo4j 110 Apr 2 13:03 neostore.labeltokenstore.db
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.labeltokenstore.db.id
-rw-rw-r-- 1 neo4j neo4j 874 Apr 2 13:03 neostore.labeltokenstore.db.names
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.labeltokenstore.db.names.id
-rw-rw-r-- 1 neo4j neo4j 200M Apr 2 13:03 neostore.nodestore.db
-rw-rw-r-- 1 neo4j neo4j 41 Apr 2 13:03 neostore.nodestore.db.id
-rw-rw-r-- 1 neo4j neo4j 68 Apr 2 13:03 neostore.nodestore.db.labels
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.nodestore.db.labels.id
-rw-rw-r-- 1 neo4j neo4j 2.8G Apr 2 13:03 neostore.propertystore.db
-rw-rw-r-- 1 neo4j neo4j 128 Apr 2 13:03 neostore.propertystore.db.arrays
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.propertystore.db.arrays.id
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.propertystore.db.id
-rw-rw-r-- 1 neo4j neo4j 720 Apr 2 13:03 neostore.propertystore.db.index
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.propertystore.db.index.id
-rw-rw-r-- 1 neo4j neo4j 3.1K Apr 2 13:03 neostore.propertystore.db.index.keys
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.propertystore.db.index.keys.id
-rw-rw-r-- 1 neo4j neo4j 1.7K Apr 2 13:03 neostore.propertystore.db.strings
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.propertystore.db.strings.id
-rw-rw-r-- 1 neo4j neo4j 47M Apr 2 13:03 neostore.relationshipgroupstore.db
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.relationshipgroupstore.db.id
-rw-rw-r-- 1 neo4j neo4j 1.1G Apr 2 13:03 neostore.relationshipstore.db
-rw-rw-r-- 1 neo4j neo4j 1.6M Apr 2 13:03 neostore.relationshipstore.db.id
-rw-rw-r-- 1 neo4j neo4j 165 Apr 2 13:03 neostore.relationshiptypestore.db
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.relationshiptypestore.db.id
-rw-rw-r-- 1 neo4j neo4j 1.3K Apr 2 13:03 neostore.relationshiptypestore.db.names
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.relationshiptypestore.db.names.id
-rw-rw-r-- 1 neo4j neo4j 3.5K Apr 2 13:03 neostore.schemastore.db
-rw-rw-r-- 1 neo4j neo4j 25 Apr 2 13:03 neostore.schemastore.db.id
I read that mapped memory settings are replaced by another cache, and I have commented out those settings.
Java Profiler
JvmTop 0.8.0 alpha - 16:12:59, amd64, 4 cpus, Linux 3.16.0-33, load avg 0.30
http://code.google.com/p/jvmtop
Profiling PID 4260: org.neo4j.server.Bootstrapper
68.67% ( 14.01s) org.neo4j.kernel.impl.nioneo.store.StoreFileChannel.read()
18.73% ( 3.82s) org.neo4j.kernel.impl.nioneo.store.StoreFailureException.<init>()
2.86% ( 0.58s) org.neo4j.kernel.impl.cache.ReferenceCache.put()
1.11% ( 0.23s) org.neo4j.helpers.Counter.inc()
0.87% ( 0.18s) org.neo4j.kernel.impl.cache.ReferenceCache.get()
0.65% ( 0.13s) org.neo4j.cypher.internal.compiler.v2_1.parser.Literals$class.PropertyKeyName()
0.63% ( 0.13s) org.parboiled.scala.package$.getCurrentRuleMethod()
0.62% ( 0.13s) scala.collection.mutable.OpenHashMap.<init>()
0.62% ( 0.13s) scala.collection.mutable.AbstractSeq.<init>()
0.62% ( 0.13s) org.neo4j.kernel.impl.cache.AutoLoadingCache.get()
0.61% ( 0.13s) scala.collection.TraversableLike$$anonfun$map$1.apply()
0.61% ( 0.12s) org.neo4j.kernel.impl.transaction.TxManager.assertTmOk()
0.61% ( 0.12s) org.neo4j.cypher.internal.compiler.v2_1.commands.EntityProducerFactory.<init>()
0.61% ( 0.12s) scala.collection.AbstractTraversable.<init>()
0.61% ( 0.12s) scala.collection.immutable.List.toStream()
0.60% ( 0.12s) org.neo4j.kernel.impl.nioneo.store.NodeStore.getRecord()
0.57% ( 0.12s) org.neo4j.kernel.impl.transaction.TxManager.getTransaction()
0.37% ( 0.08s) org.parboiled.scala.Parser$class.rule()
0.06% ( 0.01s) scala.util.DynamicVariable.value()
Unfortunately the schema indexes (i.e. those created using CREATE INDEX ON :Label(property)) do not yet support greater-than/less-than conditions. Therefore Neo4j falls back to scanning all nodes with the given label and filtering on their properties. This is of course expensive.
I do see two different approaches to tackle this:
1) If your condition always has a pre-defined maximum granularity, e.g. tens of USD, you can build up an "amount-tree" similar to a time-tree (see http://graphaware.com/neo4j/2014/08/20/graphaware-neo4j-timetree.html); a rough sketch follows at the end of this answer.
2) If you don't know the granularity upfront, the other option is to set up a manual or auto index for the amount property, see http://neo4j.com/docs/stable/indexing.html. The easiest approach is probably auto indexing. In neo4j.properties, set the following options:
node_auto_indexing=true
node_keys_indexable=amount
Note that this will not automatically add all existing transactions to that index; it only indexes those that have been written since auto indexing was enabled.
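A hedged sketch of one common workaround to get pre-existing nodes into the auto index: rewrite the property so the indexer sees a write (on a graph this size you would want to batch this rather than run it in a single transaction):
MATCH (t:Transaction)
SET t.amount = t.amount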
You can do an explicit range query on the auto index using:
START t=node:node_auto_index("amount:[6000 TO 999999999]")
RETURN count(t)
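As a rough, hypothetical sketch of option 1 (the label AmountBucket, relationship IN_BUCKET, and property floor are illustrative names of mine, not from the time-tree article): pre-sort each Transaction into a fixed-width bucket node, so a range query only traverses the relevant buckets. Since there are only a handful of bucket nodes, scanning them stays cheap even though the comparison is unindexed:
// all transactions in buckets entirely above the 8000 boundary (1000-wide buckets)
MATCH (b:AmountBucket)-[:IN_BUCKET]->(t:Transaction)
WHERE b.floor > 8000
RETURN count(t)
// the boundary bucket itself still needs a property filter:
// MATCH (b:AmountBucket {floor: 8000})-[:IN_BUCKET]->(t:Transaction)
// WHERE t.amount > 8000 RETURN count(t)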

Examples of Deflate Compression

I am interested in learning about the DEFLATE compression algorithm, particularly how it is represented in a data stream, and feel that I would greatly benefit from some extra examples (e.g. the compression of a short string of text, or the decompression of a compressed chunk).
I am continuing to study some resources I have found: ref1, ref2, ref3, but these do not have many examples of what the actual compression looks like as a data stream.
If I could get a few examples of how some strings would look before and after being compressed, and an explanation of the relationship between them, that would be fantastic.
Also, if there are other resources I could be looking at, please add those.
You can compress example data with gzip or zlib and use infgen to disassemble and examine the resulting compressed data. infgen also has an option to see the detail in the dynamic headers.
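If you would rather generate test streams programmatically, here is a minimal Python sketch (my own addition) using the standard zlib module; wbits=-15 yields a raw DEFLATE stream with no zlib or gzip wrapper, which is the same bit stream infgen disassembles inside a gzip file:
import zlib

data = b"hello hello hello hello"
co = zlib.compressobj(level=9, wbits=-15)    # raw DEFLATE, max compression
deflated = co.compress(data) + co.flush()
print(deflated.hex())                        # inspect the stream bytes
print(zlib.decompress(deflated, wbits=-15))  # round-trip to verify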
+1 for infgen, but here's a slightly more detailed answer.
You can take a look at the before and after using gzip and any hex editor. For example, xxd is included on most Linux distros. I've included both the raw hex output (not that interesting without interpretation) and infgen's output.
hello hello hello hello (triggers static Huffman coding, like most short strings):
~ $ echo -n "hello hello hello hello" | gzip | xxd
00000000: 1f8b 0800 0000 0000 0003 cb48 cdc9 c957 ...........H...W
00000010: c840 2701 e351 3d8d 1700 0000 .@'..Q=.....
~ $ echo -n "hello hello hello hello" | gzip | ./infgen/a.out -i
! infgen 2.4 output
!
gzip
!
last
fixed
literal 'hello h
match 16 6
end
!
crc
length
\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8\xf7\xf6\xf5\xf4\xf3\xf2\xf1 (triggers uncompressed mode):
~ $ echo -ne "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8\xf7\xf6\xf5\xf4\xf3\xf2\xf1" | gzip | xxd
00000000: 1f8b 0800 0000 0000 0003 010f 00f0 ffff ................
00000010: fefd fcfb faf9 f8f7 f6f5 f4f3 f2f1 c6d3 ................
00000020: 157e 0f00 0000 .~....
~ $ echo -ne "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8\xf7\xf6\xf5\xf4\xf3\xf2\xf1" | gzip | ./infgen/a.out -i
! infgen 2.4 output
!
gzip
!
last
stored
data 255 254 253 252 251 250 249 248 247 246 245 244 243 242 241
end
!
crc
length
abaabbbabaababbaababaaaabaaabbbbbaa (triggers dynamic Huffman coding):
~ $ echo -n "abaabbbabaababbaababaaaabaaabbbbbaa" | gzip | xxd
00000000: 1f8b 0800 0000 0000 0003 1dc6 4901 0000 ............I...
00000010: 1040 c0ac a37f 883d 3c20 2a97 9d37 5e1d .@.....=< *..7^.
00000020: 0c6e 2934 9423 0000 00 .n)4.#...
~ $ echo -n "abaabbbabaababbaababaaaabaaabbbbbaa" | gzip | ./infgen/a.out -i -d
! infgen 2.4 output
!
gzip
!
last
dynamic
count 260 7 18
code 1 4
code 2 1
code 4 4
code 16 4
code 17 4
code 18 2
zeros 97
lens 1 2
zeros 138
zeros 19
lens 4
repeat 3
lens 2
zeros 3
lens 2 2 2
! litlen 97 1
! litlen 98 2
! litlen 256 4
! litlen 257 4
! litlen 258 4
! litlen 259 4
! dist 0 2
! dist 4 2
! dist 5 2
! dist 6 2
literal 'abaabbba
match 4 7
match 3 9
match 5 6
literal 'aaa
match 5 5
literal 'b
match 4 1
literal 'aa
end
!
crc
length
I found infgen still did not give enough detail to fully understand the format. I walk through decompressing all three examples here bit by bit, by hand, in detail on my blog.
For concepts, in addition to RFC 1951 (DEFLATE), which is pretty good, I would recommend Feldspar's conceptual overview of Huffman codes and LZ77 in DEFLATE.

Thinking Sphinx not working in test mode

I'm trying to get Thinking Sphinx to work in test mode in Rails. Basically this:
ThinkingSphinx::Test.init
ThinkingSphinx::Test.start
freezes and never comes back.
My database configuration is the same for test and development:
dry_setting: &dry_setting
  adapter: mysql
  host: localhost
  encoding: utf8
  username: rails
  password: blahblah

development:
  <<: *dry_setting
  database: proj_devel
  socket: /tmp/mysql.sock # sphinx requires it

test:
  <<: *dry_setting
  database: proj_test
  socket: /tmp/mysql.sock # sphinx requires it
and sphinx.yml
development:
  enable_star: 1
  min_infix_len: 2
  bin_path: /opt/local/bin

test:
  enable_star: 1
  min_infix_len: 2
  bin_path: /opt/local/bin

production:
  enable_star: 1
  min_infix_len: 2
The generated config files, config/development.sphinx.conf and config/test.sphinx.conf, only differ in database names, directories and similar things; nothing functional.
Generating the index for devel goes without an issue:
$ rake ts:in
(in /Users/pupeno/proj)
default config
Generating Configuration to /Users/pupeno/proj/config/development.sphinx.conf
Sphinx 0.9.8.1-release (r1533)
Copyright (c) 2001-2008, Andrew Aksyonoff
using config file '/Users/pupeno/proj/config/development.sphinx.conf'...
indexing index 'user_core'...
collected 7 docs, 0.0 MB
collected 0 attr values
sorted 0.0 Mvalues, 100.0% done
sorted 0.0 Mhits, 99.8% done
total 7 docs, 422 bytes
total 0.098 sec, 4320.80 bytes/sec, 71.67 docs/sec
indexing index 'user_delta'...
collected 0 docs, 0.0 MB
collected 0 attr values
sorted 0.0 Mvalues, nan% done
total 0 docs, 0 bytes
total 0.010 sec, 0.00 bytes/sec, 0.00 docs/sec
distributed index 'user' can not be directly indexed; skipping.
but when I try to do it for test it freezes:
$ RAILS_ENV=test rake ts:in
(in /Users/pupeno/proj)
DEPRECATION WARNING: require "activeresource" is deprecated and will be removed in Rails 3. Use require "active_resource" instead.. (called from /Users/pupeno/.rvm/gems/ruby-1.8.7-p249/gems/activeresource-2.3.5/lib/activeresource.rb:2)
default config
Generating Configuration to /Users/pupeno/proj/config/test.sphinx.conf
Sphinx 0.9.8.1-release (r1533)
Copyright (c) 2001-2008, Andrew Aksyonoff
using config file '/Users/pupeno/proj/config/test.sphinx.conf'...
indexing index 'user_core'...
It's been there for more than 10 minutes; the user table has 4 records.
The database directories look quite different, but I don't know what to make of it:
$ ls -l db/sphinx/development/
total 96
-rw-r--r-- 1 pupeno staff 196 Mar 11 18:10 user_core.spa
-rw-r--r-- 1 pupeno staff 4982 Mar 11 18:10 user_core.spd
-rw-r--r-- 1 pupeno staff 417 Mar 11 18:10 user_core.sph
-rw-r--r-- 1 pupeno staff 3067 Mar 11 18:10 user_core.spi
-rw-r--r-- 1 pupeno staff 84 Mar 11 18:10 user_core.spm
-rw-r--r-- 1 pupeno staff 6832 Mar 11 18:10 user_core.spp
-rw-r--r-- 1 pupeno staff 0 Mar 11 18:10 user_delta.spa
-rw-r--r-- 1 pupeno staff 1 Mar 11 18:10 user_delta.spd
-rw-r--r-- 1 pupeno staff 417 Mar 11 18:10 user_delta.sph
-rw-r--r-- 1 pupeno staff 1 Mar 11 18:10 user_delta.spi
-rw-r--r-- 1 pupeno staff 0 Mar 11 18:10 user_delta.spm
-rw-r--r-- 1 pupeno staff 1 Mar 11 18:10 user_delta.spp
$ ls -l db/sphinx/test/
total 0
-rw-r--r-- 1 pupeno staff 0 Mar 11 18:11 user_core.spl
-rw-r--r-- 1 pupeno staff 0 Mar 11 18:11 user_core.tmp0
-rw-r--r-- 1 pupeno staff 0 Mar 11 18:11 user_core.tmp1
-rw-r--r-- 1 pupeno staff 0 Mar 11 18:11 user_core.tmp2
-rw-r--r-- 1 pupeno staff 0 Mar 11 18:11 user_core.tmp7
Nothing gets added to a log when this happens. Any ideas where to go from here?
I can run the command line manually:
/opt/local/bin/indexer --config config/test.sphinx.conf --all
which generates the same output as rake ts:in, so no help there.
The problem was the random IDs generated by fixtures. The solution is described at http://freelancing-god.github.com/ts/en/common_issues.html#slow_indexing:
Slow Indexing

If Sphinx is taking a while to process all your records, there are a few common reasons for this happening. Firstly, make sure you have database indexes on any foreign key columns and any columns you filter or sort by.

Secondly – are you using fixtures? Rails’ fixtures have randomly generated IDs, which are usually extremely large integers, and Sphinx isn’t set up to process disparate IDs efficiently by default. To get around this, you’ll need to set sql_range_step in your config/sphinx.yml file for the appropriate environments:

development:
  sql_range_step: 10000000
I added it to both the development and test environments.
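For reference, a sketch of the resulting sphinx.yml, merging the new setting into the environments shown earlier:
development:
  enable_star: 1
  min_infix_len: 2
  bin_path: /opt/local/bin
  sql_range_step: 10000000

test:
  enable_star: 1
  min_infix_len: 2
  bin_path: /opt/local/bin
  sql_range_step: 10000000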
