What is wrong here with OneHotEncoding()? - machine-learning

Please Open the Image for the problem
The whole problem is with the Embarked attribute. Whenever I remove column no. 11 from the OneHotEncoder step, fit_transform() works fine. But when I add the 11th column again, I get a ValueError saying the input contains NaN.

ColumnTransformer does not apply its transformers sequentially; it applies them in parallel, each one to the original columns. So the 11th column reaches the encoder without having been imputed first, and OneHotEncoder fails on data with missing values.
You could use a Pipeline with steps Impute and Encode, and use that only for column 11 in the ColumnTransformer.
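A minimal sketch of that suggestion, assuming a toy two-column array in place of the real data (the column indices, the step names and the imputation strategies are illustrative choices, not taken from the question):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the real data: column 0 is numeric, column 1 is an
# "Embarked"-like categorical column; both contain a missing value.
X = np.array(
    [[22.0, "S"], [38.0, "C"], [np.nan, np.nan], [35.0, "S"]],
    dtype=object,
)

# Steps of a Pipeline run sequentially: impute first, then encode.
embarked_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="mean"), [0]),
        ("embarked", embarked_pipeline, [1]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_t = preprocess.fit_transform(X)
print(X_t.shape)  # one numeric column plus one column per category
```

Because the imputer and the encoder now live in the same Pipeline, the encoder only ever sees the imputed column.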

Related

Trying to Add Double Quotes Before and After the String Created by formula

I have been working with Google Sheets for a couple of years and find it very useful, but I have now run into a situation that I am struggling to fix.
Currently, when a cell value includes a line break and the cell is copied and pasted, the pasted text becomes """test""". I want the pasted value to be "test" instead of """test""".
In some cases I have an IF condition in the formula, so the result sometimes contains line breaks and sometimes is a single line.
=CHAR(34)&SUBSTITUTE(A1,",",CHAR(10))&CHAR(34)
This formula works for a single-line string, but when the value contains a line break it adds extra quotation marks, like """test""".
I want the result to always be "test", whether the value is a single line or contains multiple line breaks.
Your help will be much appreciated.
This is a known issue. One way to counter it is to copy the content of the fx bar (or of the active cell) instead of the cell itself. You may also want to move the output down a row so that you do not catch the formula itself. Try:
={""; CHAR(34)&SUBSTITUTE(A1,",",CHAR(10))&CHAR(34)}
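A rough model of why the extra quotes appear, written as a Python sketch. It assumes, without confirmation from the thread, that the clipboard text is quoted CSV-style only when a value contains a line break or tab, with embedded quotes doubled. The function name clipboard_field is hypothetical.

```python
def clipboard_field(value: str) -> str:
    """Assumed model: quote a field only if it contains a line break or tab."""
    if "\n" in value or "\t" in value:
        # Embedded quotes are doubled, then the whole field is wrapped in quotes.
        return '"' + value.replace('"', '""') + '"'
    return value


single_line = '"test"'    # what the formula yields for a one-line value
multi_line = '"te\nst"'   # same formula, but the value contains a line break

print(clipboard_field(single_line))
print(clipboard_field(multi_line))
```

Under this assumption the single-line value is pasted unchanged, while the multi-line value gains a doubled quote on each side, which matches the """test""" symptom described in the question.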
This formula should output the expected value, even with several line returns and several " characters.
=CHAR(34)&SUBSTITUTE(A1,"""","")&CHAR(34)

Parsing Error trying to import Coinbase Pro API into Google Sheets

New to APIs in Google sheets, but I feel like I'm 95% of the way to where I'm trying to go.
I'm trying to pull crypto data into my spreadsheet to do a simple 24 hour price comparison and gauge whether the price has gone up or down, maybe use some conditional code to change the background to green or red. Simple enough. Most of the sites that offer APIs have given me various errors, though, so coinbase pro (and weirdly the deprecated gdax) have been most reliable (although I haven't ruled out that it started breaking because I'm now putting in too many call requests).
Found this as a way to get the current price of ETH, for instance:
=VALUE(SUBSTITUTE(SUBSTITUTE(INDEX(IMPORTDATA("https://api.gdax.com/products/ETH-USD/ticker"),0,2), "price:",""), """", ""))
Works like a charm. So I changed the request to target different info, specifically the 24hr stats listed on the API doc, and the first value in that section, "open" for opening price (this ensures I get the price exactly 24hrs earlier). But I got a weird parsing error using the request, which is here:
=VALUE(SUBSTITUTE(SUBSTITUTE(INDEX(IMPORTDATA("https://api.pro.coinbase.com/products/ETH-USD/stats"),0,1), "open:",""), """", ""))
I've figured out the issue, but not the solution. Google Sheets says I am pulling in text. Because the "open" (opening price) value is the first listed in the JSON code, it is pulling in the code bracket from the nested HTML/JSON code. For instance, it says I can't parse "{open" into a number. And I get the same problem in reverse when I pull the last value listed in the stats section, which is "volume_30day:"
=VALUE(SUBSTITUTE(SUBSTITUTE(INDEX(IMPORTDATA("https://api.pro.coinbase.com/products/ETH-USD/stats"),0,6), "volume_30day:",""), """", ""))
This returns an error saying "volume_30day: #}" can't be parsed, so now it is the closing bracket of the JSON code. So I can't use "open", the first item in the API 24hr stats section, or volume_30day, which is the sixth item on that list, but items 2-5 work just fine. Seems super weird to me, but I've tested it and it seems to be what's going on.
There must be something stupid I need to tweak here, but I don't know what it is.
Answer 1:
About =VALUE(SUBSTITUTE(SUBSTITUTE(INDEX(IMPORTDATA("https://api.pro.coinbase.com/products/ETH-USD/stats"),0,1), "open:",""), """", ""))
When I checked =SUBSTITUTE(SUBSTITUTE(INDEX(IMPORTDATA("https://api.pro.coinbase.com/products/ETH-USD/stats"),0,1), "open:",""), """", ""), the value is {open:617. Because of the leftover {open: prefix, VALUE cannot convert this text to a number, so the error occurs.
To retrieve the value you expect, I propose using REGEXREPLACE instead of SUBSTITUTE. The modified formula is as follows.
=VALUE(REGEXREPLACE(INDEX(IMPORTDATA("https://api.pro.coinbase.com/products/ETH-USD/stats"),0,1), "open|""|{|}|:",""))
In this modified formula, open|""|{|}|: is used as the regex. Every match is replaced with an empty string.
In this case, I think that =VALUE(REGEXEXTRACT(INDEX(IMPORTDATA("https://api.pro.coinbase.com/products/ETH-USD/stats"),0,1), "\d+")) can also be used. But considering your 2nd question, I thought that the formula above would be more useful.
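For readers who want to try the same alternation outside Sheets, here is a small Python sketch. The sample cell strings are made-up stand-ins for what IMPORTDATA might place in the first and last cells, and strip_key is a hypothetical helper name.

```python
import re


def strip_key(cell: str, key: str) -> float:
    # Same idea as the formula: remove the key, quotes, braces and colons,
    # then convert what is left to a number.
    pattern = re.escape(key) + r'|"|\{|\}|:'
    return float(re.sub(pattern, "", cell))


first_cell = '{"open":"617.25"'                    # assumed shape of cell 1
last_cell = '"volume_30day":"7101445.64098932"}'   # assumed shape of cell 6

print(strip_key(first_cell, "open"))
print(strip_key(last_cell, "volume_30day"))
```

Putting the braces in the alternation is what lets one pattern handle both the first and the last cell.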
Answer 2:
About =VALUE(SUBSTITUTE(SUBSTITUTE(INDEX(IMPORTDATA("https://api.pro.coinbase.com/products/ETH-USD/stats"),0,6), "volume_30day:",""), """", ""))
When I checked =SUBSTITUTE(SUBSTITUTE(INDEX(IMPORTDATA("https://api.pro.coinbase.com/products/ETH-USD/stats"),0,6), "volume_30day:",""), """", ""), the value is 7101445.64098932}. Because of the trailing }, VALUE cannot convert this text to a number, so the error occurs.
To retrieve the value you expect, I propose using REGEXREPLACE instead of SUBSTITUTE. The modified formula is as follows.
=VALUE(REGEXREPLACE(INDEX(IMPORTDATA("https://api.pro.coinbase.com/products/ETH-USD/stats"),0,6), "volume_30day|""|{|}|:",""))
In this modified formula, volume_30day|""|{|}|: is used as the regex. Every match is replaced with an empty string.
This is the same regex as open|""|{|}|: above, with open replaced by volume_30day.
Other pattern 1:
As another pattern using only built-in functions, how about the following modified formulas?
=VALUE(TEXTJOIN("",TRUE,ARRAYFORMULA(IFERROR(VALUE(REGEXREPLACE(IMPORTDATA("https://api.pro.coinbase.com/products/ETH-USD/stats"), "open|""|{|}|:","")),""))))
=VALUE(TEXTJOIN("",TRUE,ARRAYFORMULA(IFERROR(VALUE(REGEXREPLACE(IMPORTDATA("https://api.pro.coinbase.com/products/ETH-USD/stats"), "volume_30day|""|{|}|:","")),""))))
In these formulas, other values can be retrieved by replacing KEY in the regex KEY|""|{|}|: with the key you want.
Other pattern 2:
The value returned from https://api.pro.coinbase.com/products/ETH-USD/stats is JSON, so a custom function created with Google Apps Script can also be used.
The Google Apps Script is as follows.
const SAMPLE = (url, key) => JSON.parse(UrlFetchApp.fetch(url).getContentText())[key] || "no value";
To use this script, copy and paste it into the script editor of the Spreadsheet and save it. Then put the custom function in a cell, for example =SAMPLE("https://api.pro.coinbase.com/products/ETH-USD/stats","open") or =SAMPLE("https://api.gdax.com/products/ETH-USD/ticker","price"). The cell then shows the retrieved value.
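The same look-up-by-key idea can be sketched in Python for comparison. This is an illustration only: it parses a made-up sample payload instead of calling the API, and get_stat is a hypothetical name.

```python
import json


def get_stat(payload: str, key: str):
    """Parse a JSON object and return payload[key] as a number if possible."""
    data = json.loads(payload)
    value = data.get(key)
    if value is None:
        return "no value"
    try:
        return float(value)
    except (TypeError, ValueError):
        return value  # leave non-numeric values untouched


# Made-up sample shaped like a stats response; the real field set may differ.
sample = '{"open": "617.25", "high": "640.00", "volume_30day": "7101445.64098932"}'

print(get_stat(sample, "open"))
print(get_stat(sample, "volume_30day"))
```

Parsing the JSON directly avoids the bracket problem altogether, because the position of a key in the response no longer matters. The explicit None check also means a legitimate value of 0 is not replaced by "no value".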
References:
REGEXREPLACE
Custom Functions in Google Sheets

Encoding categorical data to dummy variables using modified OneHotEncoder in python?

This is my code. I was trying to dummy-encode the categorical data in the first column of X, but it isn't working. When I visited the OneHotEncoder documentation page, it said that OneHotEncoder has been changed, and I wasn't able to figure out how to use the changed OneHotEncoder.
from sklearn.preprocessing import OneHotEncoder
onehotencoder=OneHotEncoder(categorical_features = [0])
X[:, 0]=onehotencoder.fit_transform(X).toarray()
There are several issues here.
First, the one-hot encoder returns an array with several columns, while the slot you assign it to, X[:, 0], is a single column. Therefore, your assignment will fail.
Then, OneHotEncoder returns a SciPy sparse matrix by default, so toarray() is only needed when you want a dense NumPy array.
Finally, you probably want to apply the encoding to some of the columns and leave the other columns untouched (or maybe apply different processing to them). In this case, you want to use what is called a ColumnTransformer. You can look at the following example, which illustrates how to do such preprocessing: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py
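As a hedged sketch of that approach: the array below is invented for illustration, and the transformer name "onehot" is arbitrary. It encodes column 0 and passes the remaining columns through unchanged.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Invented data: column 0 is categorical, columns 1-2 are numeric.
X = np.array(
    [["France", 44, 72000],
     ["Spain", 27, 48000],
     ["Germany", 30, 54000],
     ["Spain", 38, 61000]],
    dtype=object,
)

ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), [0])],
    remainder="passthrough",  # keep the other columns as they are
    sparse_threshold=0,       # always return a dense array
)

# The encoded columns come first, followed by the passed-through columns.
X_encoded = ct.fit_transform(X).astype(float)
print(X_encoded.shape)
```

The result is assigned to a new variable rather than back into X[:, 0], since one input column becomes one output column per category.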

Can a cross-referenced data stream be achieved using Code in Zapier? Specifically, to generate rows in GSheets from 2 different data streams?

My goal is to create automatic to-do's in several projects in Basecamp 3 on a weekly basis.
To do this, I am using the Schedule app in Zapier, which triggers the creation of several rows in a Google spreadsheet. These rows then trigger the creation of to-do's in specific projects in Basecamp 3. The input for these rows should be: the project's name (used for the search step) and the to-do text.
I am using the Formatter app in Zapier to try to achieve this. In a first Formatter action, I split the comma-separated text containing the names of all our projects and return all the segments. In a second Formatter action, I split the text containing all the to-do's and, again, return all the segments.
Formatter 1 Input (Projects): AA,BB,CC,DD
Formatter 2 Input (To-Do's Text): buy it, use it, break it, fix it
Now, the goal I am trying to achieve is illustrated in the attached diagram. Also illustrated is what the zap (as it is) is achieving with the data. Data Stream Diagrams
We often work with Grasshopper, a Rhino 3D plug-in used for parametric modeling and data-driven design. In Grasshopper this would be called a "cross-referenced" data stream. Is this possible to achieve using Code in Zapier? If so, can I get a little help?
Thanks!
Zapier Zap GIF: https://drive.google.com/open?id=0B85_sQemABgmQVd6MENRd0NsNGc
I don't have permission to view your Google drive link, but I think I get the gist of what you're trying to do.
Rather than use Formatter, you're probably better off using Python as you'll have more control over what's getting returned. You can use Python's itertools.product to get every combination of your two lists.
# zapier runs in a vanilla python environment
# so you can import standard packages
import itertools
letters = ['AA', 'BB', 'CC', 'DD']
actions = ['buy it', 'use it', 'break it', 'fix it']
combos = list(itertools.product(letters, actions))
# [('AA', 'buy it'), ('AA', 'use it'), ... ('DD', 'fix it')]
From there, you'll want to format that list as your action step expects (probably via a list comprehension such as [{'code': c[0], 'action': c[1]} for c in combos]) and return the list from the code step.
A hidden feature of Zapier is that if an array is returned from a code step, the zap will run the action for each array element. Be careful though, that means for each input to the zap, the output will be run 16x (which can quickly eat into your task limit).
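Putting those pieces together, a minimal sketch might look like the following. The function name cross_reference and the keys "project" and "todo" are hypothetical, and the two strings stand in for whatever fields you map into the code step.

```python
import itertools


def cross_reference(projects_csv, todos_csv):
    """Return one dict per (project, to-do) pair from two comma-separated strings."""
    projects = [p.strip() for p in projects_csv.split(",") if p.strip()]
    todos = [t.strip() for t in todos_csv.split(",") if t.strip()]
    return [
        {"project": project, "todo": todo}
        for project, todo in itertools.product(projects, todos)
    ]


rows = cross_reference("AA,BB,CC,DD", "buy it, use it, break it, fix it")
print(len(rows))  # 4 projects x 4 to-do's
print(rows[0])
```

Inside a Zapier code step you would hand this list back as the step's output, so that the following action runs once per dict.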
Hopefully that sets you on the right track. Let me know if something was unclear!

Problem with PMML generation of Random Forest in R

I am trying to generate a PMML file from a random forest model I obtained using R. I am using the randomForest package 4.6-12 and the latest version of PMML for R. But every time I try to generate the PMML, I get an error. Here is the code:
data_train.rf <- randomForest( TARGET ~ ., data = train, ntree=100, na.action=na.omit, importance=TRUE)
pmml_file = pmml(data_train.rf)
[1] "Now converting tree 1 to PMML"
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I haven't been able to find the origin of the problem, any thoughts?
Thanks in advance,
Alvaro
Looks like the variable splitNode has not been initialized inside the "pmml" package. The initialization pathway depends on the data type of the split variable (eg. numeric, logical, factor). Please see the source code of the /R/pmml.randomForest.R file inside "pmml" package.
So, what are the columns in your train data.frame object?
Alternatively, you could try out the r2pmml package as it is much better at handling the randomForest model type.
The pmml code assumes that the data type of each variable is numeric, simple logical or factor. It won't work if the data you use are of some other type, DateTime for example.
It would help if your problem is reproducible; ideally you would provide the dataset you used. If not, at least a sample of it or a description of it...maybe summarize it.
You should also consider emailing the package maintainers directly.
I may have found the origin of this problem. My dataset has approx. 500000 events and 30 variables. 10 of these variables are factors, and some of them have sparsely populated levels, in some cases with as few as 1 event.
I built several Random Forest models, each time adding an extra variable to the model. I started by adding the numerical variables, and the PMML was generated without a problem. The same happened for the categorical variables whose levels are all well populated. When I tried to include categorical variables with sparsely populated levels, I got the error:
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I suppose the origin of the problem is that, in some situations, when a tree is built on a variable with sparsely populated levels, there is no split because there is only one case. The randomForest package knows how to handle these cases, but the pmml package does not.
My tests show that this problem appears when the number of levels of a categorical variable goes beyond the maximum number allowed by the randomForest function. The split defined in the forest sublist is no longer a positive integer which is required by the split definition for categorical objects. Reducing the number of levels fixed the problem.
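One common way to reduce the number of levels is to merge rarely seen levels into a single catch-all level before training. The sketch below shows the idea in Python rather than R, purely as an illustration. The threshold, the "Other" label and the function name lump_rare_levels are all assumptions, not something taken from the thread.

```python
from collections import Counter


def lump_rare_levels(values, min_count=2, other="Other"):
    """Replace levels seen fewer than min_count times with a catch-all level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


levels = ["a", "a", "b", "c", "a", "b", "d"]
lumped = lump_rare_levels(levels, min_count=2)
print(lumped)
print(len(set(levels)), "->", len(set(lumped)))
```

In R the equivalent step would be applied to each factor column of the training data before calling randomForest.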
