Label vectorized-features in pipeline to original array name (PySpark) - machine-learning

Very similar to this problem: Pyspark random forest feature importance mapping after column transformations but with arrays instead of categorical values
I'm running a feature importance test on my final model. The features were arrays (for example device, which has elements "mobile" or "desktop", and city, whose elements are the cities visited, e.g. London, New York, ...) that I encoded using a CountVectorizer.
Afterwards, I ran a pipeline and constructed the featureImportances (similar to this: https://gist.github.com/colbyford/5443a525fe76b602f813ff7904c4dfff)
The end-result is the following:
idx   name             score
17    device_vector_0  0.483894
693   city_vector_69   0.001882
649   city_vector_25   0.001292
1172  city_vector_548  0.000000
1176  city_vector_552  0.000000
1177  city_vector_553  0.000000
My question is, how can I map the name (which is based on the array element, I presume?) to the array value (device_vector_0=mobile, city_vector_69=London, ...)?
Cleaned Code:
vector_list=list(set(['mobile','country']))
vectorizer=[CountVectorizer(inputCol=column, outputCol=column+"_vector").fit(df) for column in vector_list]
pipeline_vector = Pipeline(stages=vectorizer)
pipeline = Pipeline().setStages([pipeline_vector,assembler,rf])
bestModel=rf_model.bestModel
ExtractFeatureImportance(bestModel.stages[-1].featureImportances, resultDF, "features")
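One way to approach the mapping, sketched under assumptions: a fitted CountVectorizerModel keeps its learned terms in its vocabulary attribute, ordered by vector index, so the numeric suffix of a name such as city_vector_69 can be read as a position in that list. This assumes the names really follow the pattern outputCol + "_" + index. The vocabularies below are made-up stand-ins, and label_feature is a hypothetical helper, not part of PySpark.

```python
# Made-up stand-ins for the fitted models' vocabularies. With PySpark these
# would come from each fitted CountVectorizerModel, for example:
#   vocabularies = {m.getOutputCol(): m.vocabulary for m in vectorizer}
vocabularies = {
    "device_vector": ["mobile", "desktop"],
    "city_vector": ["London", "New York", "Paris"],
}


def label_feature(name, vocabularies):
    """Return the vocabulary term behind a name like 'city_vector_1'.

    Names that do not match a known vector column are returned unchanged.
    """
    column, _, position = name.rpartition("_")
    if column not in vocabularies or not position.isdigit():
        return name
    vocabulary = vocabularies[column]
    index = int(position)
    if index >= len(vocabulary):
        return name
    return vocabulary[index]


for feature in ["device_vector_0", "city_vector_1", "age"]:
    print(feature, "->", label_feature(feature, vocabularies))
```

Applied to the name column of the importance table, this would replace each generated name with the term it stands for, while leaving ordinary numeric columns untouched.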

Related

(SAS)how to make prediction to new data using a trained logistic regression model?

I have a simulated dataset for personal loans; it contains borrowers' financial history and their requested loans. I'm trying to write a logistic regression model to assess loan status: current (0) or default (1).
I have already split the dataset into 70% train and 30% test.
My code looks like:
/*Logistic regression*/
ods graphics on;
proc logistic data=train outmodel=model.log plots=all;
class purpose term grade yearsemployment homeownership incomeVerified;
model bad_good (event='0') =purpose term grade yearsemployment homeownership incomeVerified
date
isJointApplication
loanAmount
interestRate
monthlyPayment
annualIncome
dtiRatio
lengthCreditHistory
numTotalCreditLines
numOpenCreditLines
numOpenCreditLines1Year
revolvingBalance
revolvingUtilizationRate
numDerogatoryRec
numDelinquency2Years
numChargeoff1year
numInquiries6Mon
/
selection=stepwise
details
lackfit;
score data= test out=score1;
store log_model;
run;
/*Score model*/
proc logistic inmodel=model.log;
score data=train out=score2 fitstat;
run;
proc logistic inmodel=model.log;
score data=test out=score3 fitstat;
run;
/*confusion matrix*/
proc freq data=score2;
tables f_bad_good*i_bad_good / nocol norow;
run;
proc freq data=score3;
tables f_bad_good*i_bad_good / nocol norow;
run;
My next step is to use this trained model to make predictions on new production data, update that data, and store it. How would I do that?
Also I wonder if anyone could take a look at my code and see if there's anything I should improve on. I'm new to SAS and statistics, any help is much appreciated!
You're very close, and your code is looking great so far. When scoring data in production, there are two things that you need:
An input dataset
A model to apply to the data
It looks like you are storing your model as a binary file that can be processed with proc plm, but you do not need to do it this way since you've already saved your model with the outmodel= option in proc logistic. The store statement is just another way to store the model if you'd like to use it that way, but I would stick with outmodel= since it's a little more straightforward. Let's look at a really simple example using sashelp.class:
data train
prod;
set sashelp.class;
if(_N_ LE 15) then output train;
else output prod;
run;
proc logistic data=train outmodel=sasuser.logmodel;
model sex = age height weight;
run;
We've saved our model into sasuser.logmodel. Now we want to score new production data. In a new SAS program, you'll use code that looks like this:
proc logistic inmodel=sasuser.logmodel;
score data=prod out=predictions;
run;
Assume prod is your new production data coming in.
Let's take a look at the predictions output dataset:
Name     Sex  Age  Height  Weight  F_Sex  I_Sex  P_F           P_M
Robert   M    12   64.8    128     M      M      0.0023352346  0.9976647654
Ronald   M    15   67      133     M      M      0.1822442826  0.8177557174
Thomas   M    11   57.5    85      M      M      0.148103678   0.851896322
William  M    15   66.5    112     M      F      0.7322326277  0.2677673723
The column I_Sex (which stands for Into) is the prediction. The other columns starting with P are probabilities for predicting male or female, and the column starting with F (which stands for From) is the actual value. In reality, you will not have this actual value since production data is predicting an unknown value.
It's generally a good practice to always append your predictions to a final master dataset and give them a timestamp. You'll want to keep a history of your predictions and see how they change over time, especially if you need to debug something in the future. This may be a production database, or it could even be a SAS dataset. Below is an example of how you could do this.
/* This ensures you're always using the exact same timestamp down to the ms */
%let now = %sysfunc(datetime());
/* Add a timestamp and clean up the dataset */
data predictions;
set predictions;
prediction_ts = &now;
format prediction_ts datetime.;
keep name age height weight i_sex prediction_ts;
rename i_sex = predicted_sex;
run;
/* Append to the master dataset if it exists */
%if(%sysfunc(exist(master_dataset) ) ) %then %do;
proc append base=master_dataset data=predictions force;
run;
%end;
/* Otherwise, create it */
%else %do;
data master_dataset;
set predictions;
run;
%end;
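You can then pull the most recent prediction for any given primary key (name, in this example). For example:
proc sql;
select *
from master_dataset
/* Group by the key so the max timestamp is taken per key, not overall */
group by name
having prediction_ts = max(prediction_ts)
;
quit;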
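For readers who want to see the same append-then-select-latest pattern outside SAS, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the sample rows, and the use of name as the primary key are assumptions made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE master_dataset (name TEXT, predicted_sex TEXT, prediction_ts TEXT)"
)

# Two scoring runs appended to the same history table (made-up rows).
runs = [
    ("Robert", "M", "2024-01-01T00:00:00"),
    ("William", "F", "2024-01-01T00:00:00"),
    ("Robert", "M", "2024-02-01T00:00:00"),
    ("William", "M", "2024-02-01T00:00:00"),
]
conn.executemany("INSERT INTO master_dataset VALUES (?, ?, ?)", runs)

# Keep only the newest row for each key; ISO timestamps sort correctly as text.
latest = conn.execute(
    """
    SELECT m.name, m.predicted_sex, m.prediction_ts
    FROM master_dataset AS m
    JOIN (
        SELECT name, MAX(prediction_ts) AS max_ts
        FROM master_dataset
        GROUP BY name
    ) AS newest
      ON newest.name = m.name AND newest.max_ts = m.prediction_ts
    ORDER BY m.name
    """
).fetchall()
print(latest)
```

Older rows stay in the table, so the full prediction history remains available for debugging.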
You could have a separate process that applies actual values as well to see how the predictions compare to reality. This extends beyond the scope of what you're asking, but this is a fantastic question that you have asked and is very, very important for productionalizing a model.

Decision Tree sample count on main node doesn't match total count of observations

I have a dataset that includes 248 observations and I am trying to visualize a decision tree from a random forest model I put together. The article below suggests the sample value in the root node is the number of samples (observations) in the dataset. However, the sample value in the root node of my decision tree does not equal 248; it equals 198 instead, as seen in the image below.
https://towardsdatascience.com/scikit-learn-decision-trees-explained-803f3812290d
[Image: Root node of decision tree with wrong sample value]
the code to my model is:
rf = RandomForestClassifier(max_depth=3,
n_estimators=30,
random_state=42,
bootstrap=False)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
and the code to my tree is:
fig = plt.figure(figsize=(20, 10))
fig = tree.plot_tree(rf.estimators_[13],
feature_names=x_df.columns,
class_names=y_train,
filled=True,
impurity=True,
rounded=True,
proportion=False)
fig = fig
Unfortunately, I can't share the data due to an NDA but does anyone know why the sample field on the root node of the tree does not equal 248?
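The sample size on the root node is 198 because the tree is fitted on the training data (X_train), not on the entire 248-row dataset. Since bootstrap=False, every tree in the forest sees all of the training rows, so the root node's sample count equals the size of the training set.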
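A minimal sketch that reproduces the effect on synthetic data. The 80/20 split is an assumption (the question does not show how the data was split), and make_classification merely stands in for the real, unavailable dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 248-row dataset (the real data is not available).
X, y = make_classification(n_samples=248, n_features=5, random_state=42)

# Assumed 80/20 split; the question does not show how the split was made.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(
    max_depth=3, n_estimators=30, random_state=42, bootstrap=False
)
rf.fit(X_train, y_train)

# Node 0 is the root; n_node_samples counts the training rows reaching a node.
root_samples = rf.estimators_[13].tree_.n_node_samples[0]
print(len(X), len(X_train), root_samples)
```

Under these assumptions the root count matches len(X_train) rather than len(X), which is the same number plot_tree displays as samples on the root node.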

Understanding FastRP vs scaleProperties

I am trying to understand the difference between these two steps, and an error I am receiving. I followed this tutorial to practice KNN with my own data (https://towardsdatascience.com/create-a-similarity-graph-from-node-properties-with-neo4j-2d26bb9d829e)
During the process we project our graph of interest; mine contains three properties: bd_load, weight, and length of organisms. In the example we use the code below to create scaledProperties embeddings from the 3 variables.
Project graph
//(5) project graph of interest
CALL gds.graph.project('bd_graph',
'node_sim',
'*',
{nodeProperties:['bd_load', 'weight', 'length']})
Scale variables of interest between 0-1 for future Euclidean distance calculation
//(6) add scalar 0-1
CALL gds.alpha.scaleProperties.mutate('bd_graph',
{nodeProperties:['bd_load', 'weight', 'length'],
scaler:'MinMax',
mutateProperty:'scaledProperties'})
YIELD nodePropertiesWritten
We then can run KNN based on euclidean distance
//(8) project relationship to graph
CALL gds.knn.mutate("bd_graph",
{nodeProperties: {scaledProperties: "EUCLIDEAN"},
topK: 15,
mutateRelationshipType: "IS_SIMILAR",
mutateProperty: "similarity",
similarityCutoff: 0.6409912109375,
sampleRate:1,
randomSeed:42,
concurrency:1}
)
However, as I continue up the learning curve with Neo4j and FastRP, I am trying to understand the difference between scaleProperties and FastRP. Today I tried to create graph embeddings for my 3 variables using FastRP with 8 dimensions on my projected graph, without running the scaled-property step. My thought was that increasing the dimensions would be better for finding similarities between nodes. The code below runs fine and produces an embedding vector with 8 elements.
FastRP
CALL gds.fastRP.mutate(
'bd_graph',
{
embeddingDimension: 8,
mutateProperty: 'fastrp-embedding',
featureProperties: ['bd_load', 'weight', 'length']
}
)
YIELD nodePropertiesWritten
But when I run the below code
CALL gds.knn.stats("bd_graph",
{
nodeProperties:{fastrp-embedding:"EUCLIDEAN"},
topK:10,
sampleRate:1,
randomSeed:42,
concurrency:1
}
) YIELD similarityDistribution
RETURN similarityDistribution
I receive an error:
Invalid input '{': expected "+" or "-" (line 4, column 22 (offset: 97))
nodeProperties:{fastrp-embedding:"EUCLIDEAN"},
Does the embedding length have to match the number of variables in the node? Am I using FastRP correctly, and is my understanding right that I create embeddings within nodes and then calculate Euclidean distance for a similarity score?
I am glad you are finding the tutorial helpful and getting into GDS!
Map keys in Cypher must be strings. https://neo4j.com/docs/cypher-manual/current/syntax/maps/
The - in your property name fastrp-embedding is not allowed in an unquoted map key. If you enclose that property name in backticks, the hyphen is treated as part of the map key. This should work for you:
CALL gds.knn.stats("bd_graph",
{
nodeProperties:{`fastrp-embedding`:"EUCLIDEAN"},
topK:10,
sampleRate:1,
randomSeed:42,
concurrency:1
}
) YIELD similarityDistribution
RETURN similarityDistribution
The recommended format for Neo4j property names is camel case. If you name your property fastrpEmbedding instead of fastrp-embedding, you would not need to use the back ticks.
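If the query text is assembled from Python, the quoting can be automated. The sketch below is a hypothetical helper, not part of any Neo4j library. It makes the conservative assumption that only names built from ASCII letters, digits, and underscores (and not starting with a digit) are safe to leave unquoted; everything else is wrapped in backticks, with any embedded backtick doubled.

```python
import re

# Conservative assumption: names made only of ASCII letters, digits and
# underscores (not starting with a digit) are safe to leave unquoted.
_PLAIN_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_cypher_name(name):
    """Wrap a property name in backticks unless it is a plain identifier."""
    if _PLAIN_NAME.fullmatch(name):
        return name
    # A backtick inside a quoted name is written as two backticks.
    return "`" + name.replace("`", "``") + "`"


def knn_node_properties(metrics):
    """Build the text of a nodeProperties map from {property: metric}."""
    pairs = [f'{quote_cypher_name(key)}: "{value}"' for key, value in metrics.items()]
    return "{" + ", ".join(pairs) + "}"


print(knn_node_properties({"fastrp-embedding": "EUCLIDEAN"}))
print(knn_node_properties({"fastrpEmbedding": "EUCLIDEAN"}))
```

Quoting a name that did not strictly need it is harmless, which is why the check errs on the side of adding backticks.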

Finding the Jacobian of a frame with respect to the joints of a given model in Pydrake

Is there any way to find the Jacobian of a frame with respect to the joints of a given model (as opposed to the whole plant), or alternatively to determine which columns of the full plant Jacobian correspond to a given model’s joints? I’ve found MultibodyPlant.CalcJacobian*, but I’m not sure if those are the right methods.
I also tried mapping the JointIndex of each joint in the model to a column of MultibodyPlant.CalcJacobian*, but the results didn't make sense -- the joint indices are sequential (all of one model followed by all of the other), but the Jacobian columns look interleaved (a column corresponding to one model followed by one corresponding to the other).
Assuming you are computing with respect to velocities, you'll want to use Joint.velocity_start() and Joint.num_velocities() to create a mask or set of indices. If you are in Python, then you can use NumPy's array slicing to select the desired columns of your Jacobian.
(If you compute w.r.t. position, then make sure you use Joint.position_start() and Joint.num_positions().)
Example notebook:
https://nbviewer.jupyter.org/github/EricCousineau-TRI/repro/blob/eb7f11d/drake_stuff/notebooks/multibody_plant_jacobian_subset.ipynb
(TODO: Point to a more official source.)
Main code to pay attention to:
import numpy as np


def get_velocity_mask(plant, joints):
    """
    Generates a mask according to the supplied set of ``joints``.

    The binary mask is unable to preserve ordering for joint indices, thus
    ``joints`` is required to be a ``set`` (for simplicity).
    """
    assert isinstance(joints, set)
    # np.bool is unavailable in some NumPy releases; the built-in bool
    # works in all of them.
    mask = np.zeros(plant.num_velocities(), dtype=bool)
    for joint in joints:
        start = joint.velocity_start()
        end = start + joint.num_velocities()
        mask[start:end] = True
    return mask


def get_velocity_indices(plant, joints):
    """
    Generates a list of indices according to the supplied list of ``joints``.

    The indices are generated according to the order of ``joints``, thus
    ``joints`` is required to be a list (for simplicity).
    """
    indices = []
    for joint in joints:
        start = joint.velocity_start()
        end = start + joint.num_velocities()
        for i in range(start, end):
            indices.append(i)
    return indices
...
# print(Jv1_WG1) # Prints 7 dof from a 14 dof plant
[[0.000 -0.707 0.354 0.707 0.612 -0.750 0.256]
[0.000 0.707 0.354 -0.707 0.612 0.250 0.963]
[1.000 -0.000 0.866 -0.000 0.500 0.612 -0.079]
[-0.471 0.394 -0.211 -0.137 -0.043 -0.049 0.000]
[0.414 0.394 0.162 -0.137 0.014 0.008 0.000]
[0.000 -0.626 0.020 0.416 0.035 -0.064 0.000]]
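To see how a boolean mask and an index list differ when selecting Jacobian columns, and why sequential joint indices need not line up with column positions, here is a small NumPy-only sketch. FakeJoint is a hypothetical stand-in that exposes the same velocity_start() and num_velocities() accessors as a Drake joint, and the interleaved layout is made up for illustration.

```python
import numpy as np


class FakeJoint:
    """Hypothetical stand-in for a Drake joint (not part of pydrake)."""

    def __init__(self, velocity_start, num_velocities=1):
        self._start = velocity_start
        self._count = num_velocities

    def velocity_start(self):
        return self._start

    def num_velocities(self):
        return self._count


# Made-up layout: 4 velocities; model A owns columns 0 and 2, model B owns 1 and 3.
model_a_joints = [FakeJoint(2), FakeJoint(0)]
num_velocities = 4

# Stand-in 6 x 4 Jacobian whose entries encode their own column number.
J_full = np.tile(np.arange(num_velocities, dtype=float), (6, 1))

# Index list: keeps the order in which the joints were given.
indices = []
for joint in model_a_joints:
    start = joint.velocity_start()
    indices.extend(range(start, start + joint.num_velocities()))

# Boolean mask: always yields columns in ascending order.
mask = np.zeros(num_velocities, dtype=bool)
mask[indices] = True

J_by_index = J_full[:, indices]  # columns 2, 0 (joint order)
J_by_mask = J_full[:, mask]      # columns 0, 2 (ascending order)
print(J_by_index[0], J_by_mask[0])
```

Both selections contain the same columns; only the ordering differs, which matters if the result is later multiplied by a velocity vector arranged in joint order.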

How to compute Document Length and Average Document Length in BM25

Can anyone tell me how to compute document length (dl) and average document length (avdl) in BM25? For example, we have the following 4 documents:
new york times east // Doc1
los angeles times west // Doc2
washington post district columbia // Doc3
wall street journal north // Doc4
The first step is to remove stop-words and perform stemming so that we can consider a document d as a set of constituent terms with corresponding term frequencies {tf(t,d) : t \in d}.
Now, the notion of document length is slightly different in vector space and probabilistic models, e.g. BM25, language model etc. While in the former, document length refers to the norm of a vector, in the latter it typically refers to total number of terms in a document.
Nonetheless, the vector norm notion of document length can, in principle, also be applied to probabilistic models because the term frequency values still remain normalized between 0 and 1. However, the normalized term frequency values would no longer sum to 1.
To illustrate with your example: in the case of the vector space model, the length is defined as the norm of a vector, which, in the case of doc1, is norm(doc1) = square root of the sum of squares of the term frequency values for each unique term in doc1 = sqrt(1^2 + 1^2 + 1^2 + 1^2) = sqrt(4) = 2.
For the probabilistic models, length would be defined as the sum of the term frequencies of the component terms = 1 + 1 + 1 + 1 = 4. The normalized term frequency of a term t would be P(t,d) = tf(t,d)/dl(d), so that \sum_{t \in d} P(t,d) = 1, e.g. 1/4 + 1/4 + 1/4 + 1/4 = 1. The average document length (avdl) is then the mean of dl over all documents in the collection, here (4 + 4 + 4 + 4)/4 = 4.
The BM25Similarity implementation of Lucene uses vector norms as document lengths whereas Terrier uses the sum of tfs of constituent terms as document lengths.
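Both notions of length can be checked on the four example documents with a few lines of Python. This is only a sketch: it splits on whitespace and applies no stop-word removal or stemming, since none of the example terms would be affected by either.

```python
import math
from collections import Counter

docs = {
    "Doc1": "new york times east",
    "Doc2": "los angeles times west",
    "Doc3": "washington post district columbia",
    "Doc4": "wall street journal north",
}

# Term frequencies per document (no stop-word removal or stemming applied here).
tf = {name: Counter(text.split()) for name, text in docs.items()}

# Probabilistic notion: dl is the total number of term occurrences.
dl = {name: sum(counts.values()) for name, counts in tf.items()}
avdl = sum(dl.values()) / len(dl)

# Vector-space notion: length is the Euclidean norm of the tf vector.
norm = {
    name: math.sqrt(sum(count * count for count in counts.values()))
    for name, counts in tf.items()
}

print(dl)            # every document has dl = 4
print(avdl)          # 4.0
print(norm["Doc1"])  # 2.0
```

Because every dl equals avdl here, the ratio dl/avdl is 1 for all four documents, so length normalization has no effect on this particular collection.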
