When using the default 'randomForest' algorithm for classification, why doesn't the number of terminal nodes match the number of cases?

According to https://cran.r-project.org/web/packages/randomForest/randomForest.pdf, classification trees are fully grown, meaning nodesize = 1.
However, if trees are really grown to the maximum, shouldn't each terminal node contain a single case (data point)?
If I run:
library(randomForest)
data(iris) #150 cases
set.seed(352)
rf <- randomForest(Species ~ ., iris)
hist(treesize(rf), main = "number of nodes")
I can see that most "fully grown" trees only have about 10 terminal nodes, so nodesize can't be equal to 1... right?
For example, in the output below a status of -1 marks a terminal node of the 134th tree in the forest. Only 8 terminal nodes!?
> getTree(rf, 134)
   left daughter right daughter split var split point status prediction
1              2              3         3        2.50      1          0
2              0              0         0        0.00     -1          1
3              4              5         4        1.75      1          0
4              6              7         3        4.95      1          0
5              8              9         3        4.85      1          0
6             10             11         4        1.60      1          0
7             12             13         1        6.50      1          0
8             14             15         1        5.95      1          0
9              0              0         0        0.00     -1          3
10             0              0         0        0.00     -1          2
11             0              0         0        0.00     -1          3
12             0              0         0        0.00     -1          3
13             0              0         0        0.00     -1          2
14             0              0         0        0.00     -1          2
15             0              0         0        0.00     -1          3
I would be grateful if someone could explain this.

"Fully grown" -> "Nothing left to split". A (node of a-) decision tree is fully grown, if all data records assigned to it hold/make the same prediction.
In the iris dataset case, once you reach a node with 50 setosa data records in it, it doesn't make sense to split it into two child nodes with 25 and 25 setosas each.
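To see the same behaviour outside of R, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on the same iris data (an analogy only, since the question is about R's randomForest). A tree grown with no depth limit and min_samples_leaf = 1 still ends up with only a handful of leaves, because splitting stops as soon as a node is pure:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# No depth limit, min_samples_leaf=1: the tree is "fully grown".
tree = DecisionTreeClassifier(min_samples_leaf=1, random_state=0)
tree.fit(iris.data, iris.target)

# Far fewer than 150 leaves: growth stops when a leaf is pure,
# not when every leaf holds exactly one case.
print(tree.get_n_leaves())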

Related

Where are class names stored in a machine learning dataset in Python?

I'm learning machine learning using the iris dataset on Python 3.6 with sklearn, and I don't understand where the class names that are being retrieved are stored. In Iris, there are 3 classes, and each class contains 50 observations. You can use several commands to print the classes, and their associated numerical values:
print(iris.target)
print(iris.target_names)
This will result in the output:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
['setosa' 'versicolor' 'virginica']
So as can be seen, the classes are setosa, versicolor, and virginica. What I don't understand is where these class names are being stored, or how they're called upon within the model. If you use the shape command on the data or the target, the results are (150, 4) and (150,), meaning there are 150 observations with 4 features in the data, and 150 values in the target. I am just not able to bridge the gap in my mind as to where the class names come from.
What I don't understand is where the class names are supposed to be stored. If I made a brand new dataset of pokemon types with ice, fire, water, and flying, where could I store these types? Would they be required to be numerical as well, like iris, with 0, 1, 2, 3?
Sklearn uses a custom type of object to store its datasets, exactly so that it can store metadata along with the raw data.
If you load the iris dataset
In [2]: from sklearn import datasets
In [3]: iris = datasets.load_iris()
You can inspect the type of object with type:
In [4]: type(iris)
Out[4]: sklearn.utils.Bunch
You can look at the attributes inside the object with dir:
In [5]: dir(iris)
Out[5]: ['DESCR', 'data', 'feature_names', 'target', 'target_names']
And then use . notation to take a look at the attributes themselves:
In [6]: type(iris.data)
Out[6]: numpy.ndarray
In [7]: type(iris.target)
Out[7]: numpy.ndarray
In [8]: type(iris.feature_names)
Out[8]: list
If you want to mimic this for your own datasets, you will have to define a custom object type with the same structure, which means writing your own class.
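As a minimal sketch of what that could look like for the pokemon example from the question: rather than writing a class from scratch, it reuses sklearn.utils.Bunch itself (which is just a dict subclass with attribute access); the feature values and names here are made up for illustration.

import numpy as np
from sklearn.utils import Bunch

# Hypothetical pokemon dataset: 4 observations, 2 made-up features.
pokemon = Bunch(
    data=np.array([[1.2, 0.5], [0.3, 2.1], [1.8, 1.1], [0.9, 0.4]]),
    target=np.array([0, 1, 2, 3]),                              # numeric class labels
    target_names=np.array(['ice', 'fire', 'water', 'flying']),  # their human-readable names
    feature_names=['height', 'weight'],
)

# The target stays numeric (as in iris); target_names maps the numbers back to names.
print(pokemon.target_names[pokemon.target])   # ['ice' 'fire' 'water' 'flying']

Estimators only ever see pokemon.data and pokemon.target; target_names is plain metadata kept alongside them.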

ERROR while implementing Cox PH model for recurrent event survival analysis using counting process

I have been trying to run a Cox PH model on a sample data set of 10k customers (randomly taken from a 32 million customer base) to predict the probability of survival at time t (which is a month in my case). I am using recurrent event survival analysis with a counting process formulation for e-commerce. For this:
1. Observation starting point: right after a customer makes first purchase
2. Start/Stop times: Months of two consecutive purchases (as in the data)
I have a few independent variables as in the sample data below:
id  start  stop  status  tenure  orders  revenue  Quantity
A       0    20       0       0       1    $89.0         1
B       0    17       0       0       1   $556.0         2
C       0    17       0       0       1   $900.0         2
D      32    33       0    1679       9   $357.8         9
D      26    32       1    1497       7   $326.8         7
D      23    26       1    1405       4   $142.9         4
D      17    23       1    1219       3    $63.9         3
D       9    17       1     978       2    $50.0         2
D       0     9       1     694       1    $35.0         1
E       0    15       0      28       2   $156.0         2
F       0    15       0       0       1   $348.0         1
F      12    14       0     375       2   $216.8         3
F       0    12       1       0       1    $67.8         2
G       9    15       0     277       2   $419.0         2
G       0     9       1       0       1   $359.0         1
While running the Cox PH model with the following code:
fit10 = coxph(Surv(start, stop, status) ~ orders + tenure + Quantity + revenue, data = test)
I keep getting the following error:
Warning: X matrix deemed to be singular; variable 1 2 3 4
I tried searching for the same error online but the answers I found said this could be because of interacting independent variables, whereas my variables are individual and continuous.

Why is the result of the mostSimilarItems method in Mahout not ordered by weight?

I have the following code:
ItemSimilarity itemSimilarity = new UncenteredCosineSimilarity(dataModel);
recommender = new GenericItemBasedRecommender(dataModel,itemSimilarity);
List<RecommendedItem> items = recommender.mostSimilarItems(10, 5);
My data model is like this:
userid itemid score
1 6 5
1 10 3
1 11 5
1 12 4
1 13 5
2 2 3
2 6 5
2 10 3
2 12 5
When I run the code above, the result is like this:
13
6
11
2
12
I debugged the code and found that List<RecommendedItem> items = recommender.mostSimilarItems(10, 5); returns items that all have the same similarity score, namely one!
So I have a problem. In my opinion, mostSimilarItems should take the item co-occurrence matrix into account:
    2  6 10 11 12 13
 2  0  1  1  0  1  0
 6  1  0  2  1  2  1
10  1  2  0  1  2  1
11  0  1  1  0  1  1
12  1  2  2  1  0  1
13  0  1  1  1  1  0
In the matrix above, item 10's most similar items should be [6, 12, 11, 13, 2], because items 6 and 12 co-occur with item 10 more often than the other items do, shouldn't they?
Can anyone explain this for me? Thanks!
In your matrix you have much more data than in your input; in particular, you seem to be imputing 0 values that are not in the data. That is likely why you are getting answers different from what you expect.
Mahout expects your IDs to be contiguous integers starting from 0, and this applies to both row and column IDs. Your matrix looks like it has missing IDs; just having integers is not enough.
Could this be the problem? Not sure what Mahout would do with the input above.
I always keep a dictionary to map Mahout IDs to/from my own.
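As an illustration of that dictionary idea (a language-agnostic sketch written in Python for brevity, not Mahout API code), the external item IDs from the question could be translated to contiguous 0-based IDs before being handed to the recommender and translated back when reading results:

# External item IDs as they appear in the raw data (note the gaps: 3, 4, 5, ... are missing).
external_ids = [2, 6, 10, 11, 12, 13]

# Forward map: external ID -> contiguous 0-based ID, e.g. 2 -> 0, 6 -> 1, 10 -> 2, ...
to_internal = {ext: i for i, ext in enumerate(external_ids)}
# Reverse map: contiguous ID -> external ID, used when reporting results.
from_internal = {i: ext for ext, i in to_internal.items()}

print(to_internal[12])    # 4  (what the recommender sees)
print(from_internal[4])   # 12 (what you report back)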

Automatically learning clusters

Hi, complete newbie question here: I have a table consisting of two columns. The first column contains "bins" coded by where the fruit flies live. The second column is either 0 or 1: neutral vs. really likes sugar, respectively. I have two questions:
1) I suspect there is a single variable, something about where they live, that determines how much they like sugar. Is there a way to have the computer group the bins into just 2 clusters: the bins that like sugar vs. the neutral ones? That way we can do further experiments to determine what it is about those bins.
2) Can the computer automatically determine how many clusters there might be driving this behavior? For example, maybe there are 4 variables (4 clusters) that determine the outcome of sugar preference.
Apologies if this is trivial. The table is listed below. Thanks!
Bin sugar
1 1
1 1
1 0
1 0
2 1
2 0
2 0
3 1
3 0
3 1
3 1
4 1
4 1
4 1
5 1
5 0
5 1
6 0
6 0
6 0
7 0
7 1
7 1
8 1
8 0
8 1
9 1
9 0
9 0
9 0
10 0
10 0
10 0
11 1
11 1
11 1
12 0
12 0
12 0
12 0
13 0
13 0
13 1
13 0
13 0
14 0
14 0
14 0
14 0
15 1
15 0
15 0
16 1
16 1
17 1
17 1
18 0
18 1
18 1
17 1
19 1
20 1
20 0
20 0
20 1
21 0
21 0
21 1
21 0
22 1
22 0
22 1
22 1
23 1
23 1
24 1
24 0
25 0
25 1
25 0
26 1
26 1
27 1
27 1
Okay, assuming I understood what you meant, one approach to problem 1) is Bayesian filtering.
Say event L is "a fly likes sugar", event B is "a fly is in bin B".
So what you have is:
number of flies = 84
size of each bin (e.g. the size of bin 1 is 4)
probability that a fly likes sugar:
P(L) = flies that like sugar / total number of flies = 43/84
probability that a fly doesn't like sugar:
P(notL) = 1 - P(L) = 41/84
probability that a fly is in a given bin:
P(B) = size of the bin / sum of the sizes of all bins = 4/84 (for bin 1)
probability that a fly isn't in a given bin:
P(notB) = 1 - P(B) = 80/84 (for bin 1)
probability that a fly likes sugar, knowing that's in bin B:
P(L|B) = flies that like sugar in a bin / size of the bin
(eg for bin 1 is 2/4 = 1/2)
probability that a fly likes sugar, knowing that it's not in bin B:
P(L|notB) = (total flies that like sugar - flies that like sugar in the bin) / (total number of flies - size of the bin) = 41/80 (for bin 1)
You want to know the probability that a fly is in a given bin B knowing that it likes sugar, which you can obtain with:
P(B|L) = (P(L|B) * P(B)) / (P(L|B) * P(B) + P(L|notB) * P(notB))
If you compute P(B|L) and P(B|notL) for each bin, then you know which of the bins have the highest probability of containing flies that like sugar. Then you can further study those bins.
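Here is a small sketch of that computation in Python, using bin 1 from the table as the worked case (4 flies in the bin, 2 of which like sugar; 84 flies and 43 sugar-likers overall):

total_flies = 84
total_likers = 43

bin_size = 4      # flies in bin 1
bin_likers = 2    # flies in bin 1 that like sugar

p_l = total_likers / total_flies                  # P(L) = 43/84
p_b = bin_size / total_flies                      # P(B) = 4/84
p_not_b = 1 - p_b                                 # P(notB) = 80/84
p_l_given_b = bin_likers / bin_size               # P(L|B) = 1/2
p_l_given_not_b = (total_likers - bin_likers) / (total_flies - bin_size)  # P(L|notB) = 41/80

# Bayes' rule: P(B|L) = P(L|B)P(B) / (P(L|B)P(B) + P(L|notB)P(notB))
p_b_given_l = (p_l_given_b * p_b) / (p_l_given_b * p_b + p_l_given_not_b * p_not_b)
print(p_b_given_l)   # about 0.047 (i.e. 2/43) for bin 1; repeat per bin and compare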
Hope I was clear; my statistics is a bit rusty and I'm not even sure I'm doing everything correctly. Take it as a hint to point you in the right direction.
You can refer here to get more accurate reasoning and results.
As for problem 2)... I have to think about it a bit more.

Which of the IDS 11.70 onconfig parameters can be changed to maximize performance for a DSS app?

Informix 11.70.TC5DE,
Windows Vista with Dual Core Processor, 8GB RAM, 1TB HDD:
During the installation of this server, I specified it was going to be used for a data warehousing application. These are the onconfig parameters the install script generated.
Can any of these parameters be changed to maximize the performance of the server?
#(onconfig.ol_informix1170) - for data warehousing app.
ROOTNAME rootdbs
ROOTPATH C:\PROGRA~1\IBM\Informix\11.70\OL_INF~2\dbspaces\rootdbs.000
ROOTOFFSET 0
ROOTSIZE 312992
MIRROR 0
MIRRORPATH
MIRROROFFSET 0
PHYSFILE 49152
PLOG_OVERFLOW_PATH
PHYSBUFF 512
LOGFILES 6
LOGSIZE 10000
DYNAMIC_LOGS 2
LOGBUFF 256
LTXHWM 70
LTXEHWM 80
MSGPATH C:\PROGRA~1\IBM\Informix\11.70\ol_informix1170_1.log
CONSOLE C:\PROGRA~1\IBM\Informix\11.70\ol_informix1170_1.con
TBLTBLFIRST 0
TBLTBLNEXT 0
TBLSPACE_STATS 1
DBSPACETEMP tempdbs
SBSPACETEMP
SBSPACENAME sbspace
SYSSBSPACENAME
ONDBSPACEDOWN 2
SERVERNUM 6
DBSERVERNAME ol_informix1170_1
DBSERVERALIASES dr_informix1170_1
NETTYPE olsoctcp,1,150,NET
LISTEN_TIMEOUT 60
MAX_INCOMPLETE_CONNECTIONS 1024
FASTPOLL 1
NS_CACHE host=900,service=900,user=900,group=900
MULTIPROCESSOR 0
VPCLASS cpu,num=1,noage
VP_MEMORY_CACHE_KB 0
SINGLE_CPU_VP 1
#VPCLASS aio,num=1
CLEANERS 2
AUTO_AIOVPS 1
DIRECT_IO 0
LOCKS 2000
DEF_TABLE_LOCKMODE page
RESIDENT 0
SHMBASE 0xc000000L
SHMVIRTSIZE 209920
SHMADD 6560
EXTSHMADD 8192
SHMTOTAL 0
SHMVIRT_ALLOCSEG 0,3
#SHMNOACCESS 0x70000000-0x7FFFFFFF
CKPTINTVL 300
AUTO_CKPTS 1
RTO_SERVER_RESTART 60
BLOCKTIMEOUT 3600
CONVERSION_GUARD 2
RESTORE_POINT_DIR $INFORMIXDIR\tmp
TXTIMEOUT 300
DEADLOCK_TIMEOUT 60
HETERO_COMMIT 0
TAPEDEV \\.\TAPE0
TAPEBLK 16
TAPESIZE 0
LTAPEDEV
LTAPEBLK 16
LTAPESIZE 0
BAR_ACT_LOG $INFORMIXDIR\tmp\bar_act.log
BAR_DEBUG_LOG $INFORMIXDIR\tmp\bar_dbug.log
BAR_DEBUG 0
BAR_MAX_BACKUP 0
BAR_RETRY 1
BAR_NB_XPORT_COUNT 20
BAR_XFER_BUF_SIZE 15
RESTARTABLE_RESTORE ON
BAR_PROGRESS_FREQ 0
BAR_BSALIB_PATH
BACKUP_FILTER
RESTORE_FILTER
BAR_PERFORMANCE 0
BAR_CKPTSEC_TIMEOUT 15
ISM_DATA_POOL ISMData
ISM_LOG_POOL ISMLogs
DD_HASHSIZE 31
DD_HASHMAX 10
DS_HASHSIZE 31
DS_POOLSIZE 127
PC_HASHSIZE 31
PC_POOLSIZE 127
PRELOAD_DLL_FILE
STMT_CACHE 0
STMT_CACHE_HITS 0
STMT_CACHE_SIZE 512
STMT_CACHE_NOLIMIT 0
STMT_CACHE_NUMPOOL 1
USEOSTIME 0
STACKSIZE 64
ALLOW_NEWLINE 0
USELASTCOMMITTED NONE
FILLFACTOR 90
MAX_FILL_DATA_PAGES 0
BTSCANNER num=1,threshold=5000,rangesize=-1,alice=6,compression=default
ONLIDX_MAXMEM 188928
MAX_PDQPRIORITY 100
DS_MAX_QUERIES 1
DS_TOTAL_MEMORY 188928
DS_MAX_SCANS 1
DS_NONPDQ_QUERY_MEM 188928
DATASKIP
OPTCOMPIND 2
DIRECTIVES 1
EXT_DIRECTIVES 0
OPT_GOAL -1
IFX_FOLDVIEW 0
AUTO_REPREPARE 1
USTLOW_SAMPLE 0
RA_PAGES 64
RA_THRESHOLD 16
BATCHEDREAD_TABLE 1
BATCHEDREAD_INDEX 1
BATCHEDREAD_KEYONLY 0
EXPLAIN_STAT 1
#SQLTRACE level=low,ntraces=1000,size=2,mode=global
#DBCREATE_PERMISSION informix
#DB_LIBRARY_PATH
IFX_EXTEND_ROLE 1
SECURITY_LOCALCONNECTION
UNSECURE_ONSTAT
ADMIN_USER_MODE_WITH_DBSA
ADMIN_MODE_USERS
PLCY_POOLSIZE 127
PLCY_HASHSIZE 31
USRC_POOLSIZE 127
USRC_HASHSIZE 31
STAGEBLOB
OPCACHEMAX 0
SQL_LOGICAL_CHAR OFF
SEQ_CACHE_SIZE 10
ENCRYPT_HDR
ENCRYPT_SMX
ENCRYPT_CDR 0
ENCRYPT_CIPHERS
ENCRYPT_MAC
ENCRYPT_MACFILE
ENCRYPT_SWITCH
CDR_EVALTHREADS 1,2
CDR_DSLOCKWAIT 5
CDR_QUEUEMEM 4096
CDR_NIFCOMPRESS 0
CDR_SERIAL 0
CDR_DBSPACE
CDR_QHDR_DBSPACE
CDR_QDATA_SBSPACE
CDR_SUPPRESS_ATSRISWARN
CDR_DELAY_PURGE_DTC 0
CDR_LOG_LAG_ACTION ddrblock
CDR_LOG_STAGING_MAXSIZE 0
CDR_MAX_DYNAMIC_LOGS 0
DRAUTO 0
DRINTERVAL 30
DRTIMEOUT 30
HA_ALIAS
DRLOSTFOUND $INFORMIXDIR\etc\dr.lostfound
DRIDXAUTO 0
LOG_INDEX_BUILDS
SDS_ENABLE
SDS_TIMEOUT 20
SDS_TEMPDBS
SDS_PAGING
SDS_LOGCHECK 0
UPDATABLE_SECONDARY 0
FAILOVER_CALLBACK
FAILOVER_TX_TIMEOUT 0
TEMPTAB_NOLOG 0
DELAY_APPLY 0
STOP_APPLY 0
LOG_STAGING_DIR
RSS_FLOW_CONTROL 0
ENABLE_SNAPSHOT_COPY 0
SMX_COMPRESS 0
ON_RECVRY_THREADS 2
OFF_RECVRY_THREADS 5
DUMPDIR $INFORMIXDIR\tmp
DUMPSHMEM 1
DUMPGCORE 0
DUMPCORE 0
DUMPCNT 1
ALARMPROGRAM $INFORMIXDIR\etc\alarmprogram.bat
ALRM_ALL_EVENTS 0
#SYSALARMPROGRAM $INFORMIXDIR\etc\evidence.bat
STORAGE_FULL_ALARM 600,3
RAS_PLOG_SPEED 10982
RAS_LLOG_SPEED 0
EILSEQ_COMPAT_MODE 0
QSTATS 0
WSTATS 0
#VPCLASS MQ,noyield
MQSERVER
MQCHLLIB
MQCHLTAB
#VPCLASS jvp,num=1
#JVPJAVAHOME $INFORMIXDIR\extend\krakatoa\jre
#JVPHOME $INFORMIXDIR\extend\krakatoa
JVPPROPFILE $INFORMIXDIR\extend\krakatoa\.jvpprops
JVPLOGFILE $INFORMIXDIR\jvp.log
#JDKVERSION 1.5
#JVPJAVALIB \bin
#JVPJAVAVM jvm
#JVPARGS -verbose:jni
#JVPCLASSPATH $INFORMIXDIR\extend\krakatoa\krakatoa_g.jar;$INFORMIXDIR\extend\krakatoa\jdbc_g.jar
JVPARGS -Dcom.ibm.tools.attach.enable=no
JVPCLASSPATH $INFORMIXDIR\extend\krakatoa\krakatoa.jar;$INFORMIXDIR\extend\krakatoa\jdbc.jar
BUFFERPOOL default,buffers=10000,lrus=8,lru_min_dirty=50.00,lru_max_dirty=60.50
BUFFERPOOL size=4K,buffers=13108,lrus=16,lru_min_dirty=70.00,lru_max_dirty=80.00
AUTO_LRU_TUNING 1
USERMAPPING OFF
SP_AUTOEXPAND 1
SP_THRESHOLD 0
SP_WAITTIME 30
DEFAULTESCCHAR \
LOW_MEMORY_RESERVE 0
LOW_MEMORY_MGR 0
REMOTE_SERVER_CFG
REMOTE_USERS_CFG
S6_USE_REMOTE_SERVER_CFG 0
GSKIT_VERSION
NETTYPE drsoctcp,1,150,NET
If it is a multiprocessor machine, definitely consider turning on MULTIPROCESSOR by setting it to a non-zero value.
The ONCONFIG parameters of greatest interest to you for DSS are those related to Parallel Data Query, or PDQ: the block that commences with MAX_PDQPRIORITY. It is worth perusing the fine manual on these specifically, because the inter-relationship between them and some other parameters is too complex to go into here.
But in essence, DS_MAX_QUERIES is the maximum number of parallel queries permitted at any time, and DS_MAX_SCANS determines the number of IO threads for scanning your tables. DS_TOTAL_MEMORY determines the amount of memory allocated for PDQ processing, and there is an algorithm in the manual that shows how these variables and the user's PDQPRIORITY setting combine.
You might also want to consider lifting the RA_PAGES and RA_THRESHOLD values - these determine how many pages are read into memory as 'blocks' before grabbing the next batch. If you want to favour table scans (which generally you do in DSS), then increasing these to something like 256 and 128 might improve performance.
My experience is with SMP and MPP unix boxes, rather than Windows, so I'm not sure how much you can wring out of your architecture, but this is where you want to start.
I would recommend identifying a good DSS query that runs for a decent length of time, and changing one parameter at a time to see the effect. SET EXPLAIN ON is your friend here, too.
One last thing - 11.7 supports table compression, and the tests I've seen show dramatic improvements in a DSS environment with large reads and irregular writes.
