How do I pass categorical features in CatBoostRegressor?

How do I pass categorical features in CatBoostRegressor? - machine-learning

I have dataframe cars. And the structure of it described below:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 91313 entries, 0 to 93099
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Manufacturer 91313 non-null string
1 Model 91313 non-null string
2 Year 91313 non-null Int64
3 Category 91313 non-null string
4 Mileage 91313 non-null Int64
5 FuelType 91313 non-null string
6 EngineVolume 91313 non-null float64
7 DriveWheels 91313 non-null string
8 GearBox 91313 non-null string
9 Doors 91313 non-null string
10 Wheel 91313 non-null string
11 Color 91313 non-null string
12 InteriorColor 91313 non-null string
13 LeatherInterior 91313 non-null boolean
14 Price 91313 non-null Int64
15 Clearance 91313 non-null boolean
dtypes: Int64(3), boolean(2), float64(1), string(10)
memory usage: 11.1 MB
I want to make a model that predicts the price of the car using CatBoostRegressor. And I am trying it like this:
train_dataset = cb.Pool(X_train, y_train)
test_dataset = cb.Pool(X_test, y_test)
cat_features = ['Manufacturer','Model','Category','FuelType','DriveWheels','GearBox','Doors','Wheel','Color','InteriorColor','LeatherInterior','Clearance']
model = cb.CatBoostRegressor(loss_function = 'RMSE',eval_metric = 'R2',cat_features = cat_features)
grid = {'iterations': [250, 300, 400],
'learning_rate': [0.1,0.2],
'depth': [2, 4, 6, 8],
'l2_leaf_reg': [0.2, 0.5, 1, 3],
'cat_features' : cat_features
}
model.grid_search(grid, train_dataset)
I am tried to put cat_features in the model and grid also. But both cases did not help.
TypeError Traceback (most recent call last)
<ipython-input-34-2cb43214da9d> in <module>
----> 1 train_dataset = cb.Pool(X_train, y_train)
2 test_dataset = cb.Pool(X_test, y_test)
3 cat_features = ['Manufacturer','Model','Category','FuelType','DriveWheels','GearBox','Doors','Wheel','Color','InteriorColor','LeatherInterior','Clearance']
4 model = cb.CatBoostRegressor(loss_function = 'RMSE',eval_metric = 'R2',cat_features = cat_features)
5 grid = {'iterations': [250, 300, 400],
~\anaconda3\lib\site-packages\catboost\core.py in __init__(self, data, label, cat_features, text_features, embedding_features, column_description, pairs, delimiter, has_header, ignore_csv_quoting, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
586 )
587
--> 588 self._init(data, label, cat_features, text_features, embedding_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
589 super(Pool, self).__init__()
590
~\anaconda3\lib\site-packages\catboost\core.py in _init(self, data, label, cat_features, text_features, embedding_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
1100 baseline = np.reshape(baseline, (samples_count, -1))
1101 self._check_baseline_shape(baseline, samples_count)
-> 1102 self._init_pool(data, label, cat_features, text_features, embedding_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
1103
1104
_catboost.pyx in _catboost._PoolBase._init_pool()
_catboost.pyx in _catboost._PoolBase._init_pool()
_catboost.pyx in _catboost._PoolBase._init_features_order_layout_pool()
_catboost.pyx in _catboost._set_features_order_data_pd_data_frame()
TypeError: Cannot convert StringArray to numpy.ndarray
How I can handle this error?

If you're using the name of the features in cat_features, you must as well provide them in the features_name parameter. Otherwise, providing the index of the categorical features in cat_features would be enought.
In your case that'd be:
cat_features = [0, 1, 3, 5, 7, 8, 9, 10, 11, 12, 13, 15]

Related

Tidymodels (Fitting Bagged Trees with 10-Fold Cross Validation in R): x Fold01: model: Error: Input must be a vector, not NULL

Overview:
I have produced four models using the tidymodels package with the data frame FID (see below):
General Linear Model
Bagged Tree
Random Forest
Boosted Trees
The data frame contains three predictors:
Year (numeric)
Month (Factor)
Days (numeric)
The dependent variable is Frequency (numeric)
Issue
I am attempting to fit the bagged tree model and I am experiencing this error message below:
Any idea why I am getting an error when using bag_tree(), and fit_resamples()?
There is not much material online, except I found this post; however, this problem relates to logistic regression, not bagged tree models.
x Fold01: model: Error: Input must be a vector, not NULL.
x Fold02: model: Error: Input must be a vector, not NULL.
x Fold03: model: Error: Input must be a vector, not NULL.
x Fold04: model: Error: Input must be a vector, not NULL.
x Fold05: model: Error: Input must be a vector, not NULL.
x Fold06: model: Error: Input must be a vector, not NULL.
x Fold07: model: Error: Input must be a vector, not NULL.
x Fold08: model: Error: Input must be a vector, not NULL.
x Fold09: model: Error: Input must be a vector, not NULL.
x Fold10: model: Error: Input must be a vector, not NULL.
Warning message:
All models failed in [fit_resamples()]. See the `.notes` column.
If anyone can help with solving this error message, I would be deeply appreciative for your advice.
Many thanks in advance
R-code
##Open the tidymodels package
library(tidymodels)
library(glmnet)
library(parsnip)
library(rpart.plot)
library(rpart)
library(tidyverse) # manipulating data
library(skimr) # data visualization
library(baguette) # bagged trees
library(future) # parallel processing & decrease computation time
library(xgboost) # boosted trees
library(ranger)
library(yardstick)
library(purrr)
library(forcats)
library(rlang)
library(poissonreg)
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data <- testing(data_split)
# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data, v=10)
###########################################################
##Produce the recipe
rec <- recipe(Frequency ~ ., data = FID) %>%
step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels
step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% # replaces missing numeric observations with the median
step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables
#####Bagged Trees
mod_bag <- bag_tree() %>%
set_mode("regression") %>%
set_engine("rpart", times = 10) #10 bootstrap resamples
##Update the model with cost complexity
##A positive number for the cost/complexity parameter, and
##The cost/complexity parameter
Updated_bag<-update(mod_bag, cost_complexity=1)
##Create workflow
wflow_bag <- workflow() %>%
add_recipe(rec) %>%
add_model(Updated_bag)
##Fit and predict the general linear model
bag_fit_model <- fit(wflow_bag, data = train_data)
##We can access the fit using pull_workflow_fit(), and even
##tidy() the model coefficient results into a convenient dataframe format.
##STACKOVERFLOW
bag_fit_model %>%
pull_workflow_fit()
##Predict the model
bag_predict<-predict(bag_fit_model, train_data)
##Fit the model
plan(multisession)
fit_bag <- fit_resamples(
wflow_bag,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE,
extract = function(x) extract_model(x)))
x Fold01: model: Error: Input must be a vector, not NULL.
x Fold02: model: Error: Input must be a vector, not NULL.
x Fold03: model: Error: Input must be a vector, not NULL.
x Fold04: model: Error: Input must be a vector, not NULL.
x Fold05: model: Error: Input must be a vector, not NULL.
x Fold06: model: Error: Input must be a vector, not NULL.
x Fold07: model: Error: Input must be a vector, not NULL.
x Fold08: model: Error: Input must be a vector, not NULL.
x Fold09: model: Error: Input must be a vector, not NULL.
x Fold10: model: Error: Input must be a vector, not NULL.
Warning message:
All models failed in [fit_resamples()]. See the `.notes` column.
Data Frame - FID
structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October",
"November", "December"), class = "factor"), Frequency = c(36,
28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33, 33, 29, 31, 23, 8, 9,
7, 40, 41, 41, 30, 30, 44, 37, 41, 42, 20, 0, 7, 27, 35, 27,
43, 38), Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15,
29, 29, 31, 31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27, 31,
28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)), row.names = c(NA,
-36L), class = "data.frame")

The cost_complexity for decision trees is sometimes called alpha, and it should be a positive number smaller than one. Your model runs fine when you should a cost_complexity less than one:
library(tidymodels)
library(baguette)
FID <- structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2017),
Month = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L),
.Label = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October",
"November", "December"), class = "factor"),
Frequency = c(36,
28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33, 33, 29, 31, 23, 8, 9,
7, 40, 41, 41, 30, 30, 44, 37, 41, 42, 20, 0, 7, 27, 35, 27,
43, 38),
Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15,
29, 29, 31, 31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27, 31,
28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)), row.names = c(NA,
-36L), class = "data.frame")
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data <- testing(data_split)
# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data, v = 10)
rec <- recipe(Frequency ~ ., data = FID) %>%
step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels
step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% # replaces missing numeric observations with the median
step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables
mod_bag <- bag_tree(cost_complexity = 0.1) %>%
set_mode("regression") %>%
set_engine("rpart", times = 10) #10 bootstrap resamples
wflow_bag <- workflow() %>%
add_recipe(rec) %>%
add_model(mod_bag)
fit(wflow_bag, data = train_data)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bag_tree()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 4 Recipe Steps
#>
#> ● step_nzv()
#> ● step_novel()
#> ● step_medianimpute()
#> ● step_dummy()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Bagged CART (regression with 10 members)
#>
#> Variable importance scores include:
#>
#> # A tibble: 12 x 4
#> term value std.error used
#> <chr> <dbl> <dbl> <int>
#> 1 Days 4922. 369. 10
#> 2 Month_June 2253. 260. 9
#> 3 Month_July 1375. 139. 8
#> 4 Month_November 306. 96.4 3
#> 5 Year 272. 519. 2
#> 6 Month_May 270. 103. 4
#> 7 Month_February 191. 116. 4
#> 8 Month_August 105. 30.2 3
#> 9 Month_April 45.8 42.5 2
#> 10 Month_September 13.4 0 1
#> 11 Month_December 11.9 0 1
#> 12 Month_March 10.1 0 1
Created on 2020-12-17 by the reprex package (v0.3.0.9001)
I bet you tried a value of 1 because that is shown in the docs here, and this is very misleading. We'll get that fixed.

How to compute mean/max of HuggingFace Transformers BERT token embeddings with attention mask?

I'm using the HuggingFace Transformers BERT model, and I want to compute a summary vector (a.k.a. embedding) over the tokens in a sentence, using either the mean or max function. The complication is that some tokens are [PAD], so I want to ignore the vectors for those tokens when computing the average or max.
Here's an example. I initially instantiate a BertTokenizer and a BertModel:
import torch
import transformers
from transformers import AutoTokenizer, AutoModel
transformer_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(transformer_name, use_fast=True)
model = AutoModel.from_pretrained(transformer_name)
I then input some sentences into the tokenizer and get out input_ids and attention_mask. Notably, an attention_mask value of 0 means that the token was a [PAD] that I can ignore.
sentences = ['Deep learning is difficult yet very rewarding.',
'Deep learning is not easy.',
'But is rewarding if done right.']
tokenizer_result = tokenizer(sentences, max_length=32, padding=True, return_attention_mask=True, return_tensors='pt')
input_ids = tokenizer_result.input_ids
attention_mask = tokenizer_result.attention_mask
print(input_ids.shape) # torch.Size([3, 11])
print(input_ids)
# tensor([[ 101, 2784, 4083, 2003, 3697, 2664, 2200, 10377, 2075, 1012, 102],
# [ 101, 2784, 4083, 2003, 2025, 3733, 1012, 102, 0, 0, 0],
# [ 101, 2021, 2003, 10377, 2075, 2065, 2589, 2157, 1012, 102, 0]])
print(attention_mask.shape) # torch.Size([3, 11])
print(attention_mask)
# tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
# [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])
Now, I call the BERT model to get the 768-D token embeddings (the top-layer hidden states).
model_result = model(input_ids, attention_mask=attention_mask, return_dict=True)
token_embeddings = model_result.last_hidden_state
print(token_embeddings.shape) # torch.Size([3, 11, 768])
So at this point, I have:
token embeddings in a [3, 11, 768] matrix: 3 sentences, 11 tokens, 768-D vector for each token.
attention mask in a [3, 11] matrix: 3 sentences, 11 tokens. A 1 value indicates non-[PAD].
How do I compute the mean / max over the vectors for the valid, non-[PAD] tokens?
I tried using the attention mask as a mask and then called torch.max(), but I don't get the right dimensions:
masked_token_embeddings = token_embeddings[attention_mask==1]
print(masked_token_embeddings.shape) # torch.Size([29, 768] <-- WRONG. SHOULD BE [3, 11, 768]
pooled = torch.max(masked_token_embeddings, 1)
print(pooled.values.shape) # torch.Size([29]) <-- WRONG. SHOULD BE [3, 768]
What I really want is a tensor of shape [3, 768]. That is, a 768-D vector for each of the 3 sentences.

For max, you can multiply with attention_mask:
pooled = torch.max((token_embeddings * attention_mask.unsqueeze(-1)), axis=1)
For mean, you can sum along the axis and divide by attention_mask along that axis:
mean_pooled = token_embeddings.sum(axis=1) / attention_mask.sum(axis=-1).unsqueeze(-1)

In addition to #Quang, you can have a look at sentence_transformers Pooling layer.
For max pooling, they do this:
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
token_embeddings[input_mask_expanded == 0] = -1e9 # Set padding tokens to large negative value
pooled = torch.max(token_embeddings, 1)[0]
And for mean pooling they do the following:
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
sum_mask = input_mask_expanded.sum(1)
sum_mask = torch.clamp(sum_mask, min=1e-9)
pooled = sum_embeddings / sum_mask
The max pooling presented in the accepted answer will suffer when the max is negative, and the implementation from sentence transformers changes token_embeddings, which throw an error when you want to use the embedding for back propagation:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:
If you're interested on anything back-prop related, you can do something like this:
input_mask_expanded = torch.where(attention_mask==0, -1e-9, 0.).unsqueeze(-1).expand(token_embeddings.size()).float()
pooled = torch.max(token_embeddings-input_mask_expanded, 1)[0] # Set padding tokens to large negative value
It's the same idea of making all masked tokens to be very small, but it doesn't change the token_embeddings while at it.

Alex is right.
Look on hidden states for strings that go into tokenizer. For different strings, padding will have different embeddings.
So, in order to properly pool embeddings, you need to ignore those padding vectors.
Let's say you want to get embeddings out of the last 4 layers of BERT (as it yields the best classification results):
#iterate over the last 4 layers and get embeddings for
#strings without having embeddings from PAD tokens
m = []
for i in range(len(hidden_states[0])):
m.append([hidden_states[j+9][i,:,:][tokens["attention_mask"][i] !=0] for j in range(4)])
#average over all tokens embeddings
means = []
for i in range(len(hidden_states[0])):
means.append(torch.stack(m[i]).mean(dim=1))
#stack embeddings for all strings
pooled = torch.stack(means).reshape(-1,1,3072)

postgresql Get latest value before date

Let's say I have the following Inventory table.
id item_id stock_amount Date
(1, 1, 10, '2020-01-01T00:00:00')
(2, 1, 9, '2020-01-02T00:00:00')
(3, 1, 8, '2020-01-02T10:00:00')
(4, 3, 11, '2020-01-03T00:00:00')
(5, 3, 13, '2020-01-04T00:00:00')
(6, 4, 7, '2020-01-05T00:00:00')
(7, 2, 12, '2020-01-06T00:00:00')
Basically, per each day, I want to get the sum of stock_amount for each unique item_id but it should exclude the current day's stock amount. The item_id chosen should be the latest one. This is to calculate the starting stock on each day. So the response in this case would be:
Date starting_amount
'2020-01-01T00:00:00' 0
'2020-01-02T00:00:00' 10
'2020-01-03T00:00:00' 8
'2020-01-04T00:00:00' 19 -- # -> 11 + 8 (id 5 + id 3)
'2020-01-05T00:00:00' 21 -- # -> 13 + 8
'2020-01-06T00:00:00' 28 -- # -> 7 + 13 + 8
Any help would be greatly appreciated.

Using nested subqueries like this:
select
Date,
coalesce(sum(stock_amount), 0) starting_amount
from
(
select
row_number() over(partition by i1.Date, item_id order by i2.Date desc) i,
i1.Date,
i2.item_id,
i2.stock_amount
from
(select distinct date_trunc('day', Date) as Date from Inventory) i1
left outer join
Inventory i2
on i2.Date < i1.Date
) s
where i = 1
group by Date
order by Date
This query sorts in descending order and uses the first row.

how to compute the classification report for sentiment analysis with scikit-learn

how can I get the classification report measures precision, recall, accuracy, and support for 3 class classification and the classes are "positive", "negative" and "neutral". below is the code:
vec_clf = Pipeline([('vectorizer', vec), ('pac', svm_clf)])
print vec_clf.fit(X_train.values.astype('U'),y_train.values.astype('U'))
y_pred = vec_clf.predict(X_test.values.astype('U'))
print "SVM Accuracy-",metrics.accuracy_score(y_test, y_pred)
print "confuson metrics :\n", metrics.confusion_matrix(y_test, y_pred, labels=["positive","negative","neutral"])
print(metrics.classification_report(y_test, y_pred))
and it is giving error as:
SVM Accuracy- 0.850318471338
confuson metrics :
[[206 9 67]
[ 4 373 122]
[ 9 21 756]]
Traceback (most recent call last):
File "<ipython-input-62-e6ab3066790e>", line 1, in <module>
runfile('C:/Users/HP/abc16.py', wdir='C:/Users/HP')
File "C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
execfile(filename, namespace)
File "C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/HP/abc16.py", line 133, in <module>
print(metrics.classification_report(y_test, y_pred))
File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 1391, in classification_report
labels = unique_labels(y_true, y_pred)
File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\utils\multiclass.py", line 104, in unique_labels
raise ValueError("Mix of label input types (string and number)")
ValueError: Mix of label input types (string and number)
please guide me where I am getting wrong
EDIT 1: this is how the y_true and y_pred looks
print "y_true :" ,y_test
print "y_pred :",y_pred
y_true : 5985 neutral
899 positive
2403 neutral
3963 neutral
3457 neutral
5345 neutral
3779 neutral
299 neutral
5712 neutral
5511 neutral
234 neutral
1684 negative
3701 negative
2886 neutral
.
.
.
2623 positive
3549 neutral
4574 neutral
4972 positive
Name: sentiment, Length: 1570, dtype: object
y_pred : [u'neutral' u'positive' u'neutral' ..., u'neutral' u'neutral' u'negative']
EDIT 2: output for type(y_true) and type(y_pred)
type(y_true): <class 'pandas.core.series.Series'>
type(y_pred): <type 'numpy.ndarray'>

Cannot reproduce your error:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# toy data, similar to yours:
data = {'id':[5985,899,2403, 1684], 'sentiment':['neutral', 'positive', 'neutral', 'negative']}
y_true = pd.Series(data['sentiment'], index=data['id'], name='sentiment')
y_true
# 5985 neutral
# 899 positive
# 2403 neutral
# 1684 negative
# Name: sentiment, dtype: object
type(y_true)
# pandas.core.series.Series
y_pred = np.array(['neutral', 'positive', 'negative', 'neutral'])
# all metrics working fine:
accuracy_score(y_true, y_pred)
# 0.5
confusion_matrix(y_true, y_pred)
# array([[0, 1, 0],
# [1, 1, 0],
# [0, 0, 1]], dtype=int64)
classification_report(y_true, y_pred)
# result:
precision recall f1-score support
negative 0.00 0.00 0.00 1
neutral 0.50 0.50 0.50 2
positive 1.00 1.00 1.00 1
total 0.50 0.50 0.50 4

Find elements that sum to a given number

I have to find the list of orders that have sum of order amount equal to or greater than a given number. For example,
order # amount
o1 100
o2 50
o3 90
o4 150
o5 20
o6 30
o7 50
And if I need to find the orders in which sum of order amount is equal to 300 or greater than 300, then I should get o5, o6, o2, o7, o3,o1 or o1, o4, o3. It does not matter if order is min to max or max to min. How can I do it in a minimal way? I know first step would be to sort. I can use array sum to get sum of all elements but how do I get the elements that add up to or are just greater than a given number?
I am using Ruby on Rails with Oracle as db.

Your problem is actually quite simple. First, order the orders by decreasing quantity:
orders = [["o1", 100], ["o2", 50], ["o3", 90], ["o4", 150],
["o5", 20], ["o6", 30], ["o7", 50]]
sorted_orders = orders.sort_by(&:last).reverse
#=> [["o4", 150], ["o1", 100], ["o3", 90], ["o7", 50],
# ["o2", 50], ["o6", 30], ["o5", 20]]
Suppose:
min_req = 300
First see if min_req can be achieved by using all the items:
orders.reduce(0) { |tot,(_,qty)| tot+qty } < min_req
#=> false
Had this returned true we'd be finished: since the quantities are all non-negative, we would have computed the largest possible value for the sum of a subset of quantities.
Then simply take items in the sorted order until the quantities sum to at least min_req:
tot = 0
sorted_orders.take_while { |_,qty| tot < min_req && tot += qty }
#=> [["o4", 150], ["o1", 100], ["o3", 90]]
We can wrap this in a method:
def smallest_combination(orders, min_req)
return nil if orders.reduce(0) { |tot,(_,qty)| tot+qty } < min_req
tot = 0
orders.sort_by(&:last)
.reverse
.take_while { |_,qty| tot < min_req && tot += qty }
end
smallest_combination(orders, 300)
#=> [["o4", 150], ["o1", 100], ["o3", 90]]
smallest_combination(orders, 400)
#=> [["o4", 150], ["o1", 100], ["o3", 90], ["o7", 50], ["o2", 50]]
smallest_combination(orders, 500)
#=> nil

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

How do I pass categorical features in CatBoostRegressor? - machine-learning

If you're using the name of the features in cat_features, you must as well provide them in the features_name parameter. Otherwise, providing the index of the categorical features in cat_features would be enought. In your case that'd be: cat_features = [0, 1, 3, 5, 7, 8, 9, 10, 11, 12, 13, 15]

Related

Tidymodels (Fitting Bagged Trees with 10-Fold Cross Validation in R): x Fold01: model: Error: Input must be a vector, not NULL

How to compute mean/max of HuggingFace Transformers BERT token embeddings with attention mask?

postgresql Get latest value before date

how to compute the classification report for sentiment analysis with scikit-learn

Find elements that sum to a given number

Categories

Resources