How to perform a Linear Regression by group in PySpark?

The goal is to perform linear regression for each user in a scalable way in PySpark. Features: x1 and x2. Output: y
Regression equation (zero intercept): y = m*x1 + n*x2
Example:
pdf = pd.DataFrame(
    {
        "user": [1, 1, 1, 2, 2, 2],
        "x1": [1, 2, 3, 1, 2, 3],
        "x2": [2, 3, 4, 5, 6, 7],
        "y": [2, 4, 6, 3, 6, 9],
    }
)
df = spark.createDataFrame(pdf)
df.show()
Data looks like:
+----+---+---+---+
|user| x1| x2|  y|
+----+---+---+---+
|   1|  1|  2|  2|
|   1|  2|  3|  4|
|   1|  3|  4|  6|
|   2|  1|  5|  3|
|   2|  2|  6|  6|
|   2|  3|  7|  9|
+----+---+---+---+

I have used a Pandas UDF, which works for my use case. Ben Webber explains it in his post; from his post we can take the following approach (DBR 6.4; Spark 2.4.5):
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import statsmodels.api as sm
import pandas as pd
pdf = pd.DataFrame(
    {
        "user": [1, 1, 1, 2, 2, 2],
        "x1": [1, 2, 3, 1, 2, 3],
        "x2": [2, 3, 4, 5, 6, 7],
        "y": [2, 4, 6, 3, 6, 9],
    }
)
df = spark.createDataFrame(pdf)
schema = StructType([StructField('user', DoubleType(), True),
                     StructField('r_squared', DoubleType(), True)])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def train_LR(input_pd):
    usr = input_pd.iloc[0]['user']
    # Fit the linear regression; adapt as per your needs
    model = sm.OLS(input_pd['y'], input_pd[['x1', 'x2']]).fit()
    R_sq = model.rsquared
    return pd.DataFrame({'user': usr, 'r_squared': R_sq}, index=[0])

results = df.groupby('user').apply(train_LR)
display(results)
Note that the UDF receives a pandas DataFrame containing only the rows of a single group, so each user's regression is fitted independently.
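As a side note: on Spark 3.x the GROUPED_MAP flavour of pandas_udf is superseded by applyInPandas. A minimal sketch of the same regression, assuming Spark 3.x and the df defined above (the function name train_lr is mine):
import pandas as pd
import statsmodels.api as sm

def train_lr(input_pd: pd.DataFrame) -> pd.DataFrame:
    # Each call receives only the rows of one user, as with the GROUPED_MAP UDF.
    model = sm.OLS(input_pd['y'], input_pd[['x1', 'x2']]).fit()
    return pd.DataFrame({'user': [float(input_pd.iloc[0]['user'])],
                         'r_squared': [model.rsquared]})

results = df.groupby('user').applyInPandas(train_lr, schema='user double, r_squared double')
results.show()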

Related

Problems with the same query in a Mongoid/Rails DB but different parameters

I'm using the mongoid gem 'mongoid', '~> 7.2.4' (MongoDB 3.6) with Rails 5, and I have a database with customer and bill collections with this relation:
class Bill
  ...
  belongs_to :customer, index: true
  ...
end

class Customer
  ...
  has_many :bills
  ...
end
Then, in a pry console, I test with two customers:
[55] pry(main)> c_slow.class
=> Customer
[58] pry(main)> c_slow.bills.count
MONGODB | pro-app-mongodb-05:27017 req:1030 conn:1:1 | db_api_production.aggregate | STARTED | {"aggregate"=>"bills", "pipeline"=>[{"$match"=>{"deleted_at"=>nil, "customer_id"=>BSON::ObjectId('60c76b9e21225c002044f6c5')}}, {"$group"=>{"_id"=>1, "n"=>{"$sum"=>1}}}], "cursor"=>{}, "$db"=>"db_api_production", "$clusterTime"=>{"clusterTime"=>#...
MONGODB | pro-app-mongodb-05:27017 req:1030 | db_api_production.aggregate | SUCCEEDED | 0.008s
=> 523
[59] pry(main)> c_fast.bills.count
MONGODB | pro-app-mongodb-05:27017 req:1031 conn:1:1 | db_api_production.aggregate | STARTED | {"aggregate"=>"bills", "pipeline"=>[{"$match"=>{"deleted_at"=>nil, "customer_id"=>BSON::ObjectId('571636f44a506256d6000003')}}, {"$group"=>{"_id"=>1, "n"=>{"$sum"=>1}}}], "cursor"=>{}, "$db"=>"db_api_production", "$clusterTime"=>{"clusterTime"=>#...
MONGODB | pro-app-mongodb-05:27017 req:1031 | db_api_production.aggregate | SUCCEEDED | 0.135s
=> 35913
until this moment it seems correct but when I execute this query:
[60] pry(main)> c_slow.bills.excludes(_id: BSON::ObjectId('62753df4a54d7584e56ea829')).order(reference: :desc).limit(1000).pluck(:reference, :_id)
MONGODB | pro-app-mongodb-05:27017 req:1083 conn:1:1 | db_api_production.find | STARTED | {"find"=>"bills", "filter"=>{"deleted_at"=>nil, "customer_id"=>BSON::ObjectId('60c76b9e21225c002044f6c5'), "_id"=>{"$ne"=>BSON::ObjectId('62753df4a54d7584e56ea829')}}, "limit"=>1000, "sort"=>{"reference"=>-1}, "projection"=>{"reference"=>1, "_id"=>1...
MONGODB | pro-app-mongodb-05:27017 req:1083 | db_api_production.find | SUCCEEDED | 10.075s
MONGODB | pro-app-mongodb-05:27017 req:1087 conn:1:1 | db_api_production.getMore | STARTED | {"getMore"=>#<BSON::Int64:0x0000558bcd7ba5f8 #value=165481790189>, "collection"=>"bills", "$db"=>"db_api_production", "$clusterTime"=>{"clusterTime"=>#<BSON::Timestamp:0x0000558bcd7a4b90 #seconds=1652511506, #increment=1>, "signature"=>{"hash"=><...
MONGODB | pro-app-mongodb-05:27017 req:1087 | db_api_production.getMore | SUCCEEDED | 1.181s
[61] pry(main)> c_fast.bills.excludes(_id: BSON::ObjectId('62753df4a54d7584e56ea829')).order(reference: :desc).limit(1000).pluck(:reference, :_id)
MONGODB | pro-app-mongodb-05:27017 req:1091 conn:1:1 | db_api_production.find | STARTED | {"find"=>"bills", "filter"=>{"deleted_at"=>nil, "customer_id"=>BSON::ObjectId('571636f44a506256d6000003'), "_id"=>{"$ne"=>BSON::ObjectId('62753df4a54d7584e56ea829')}}, "limit"=>1000, "sort"=>{"reference"=>-1}, "projection"=>{"reference"=>1, "_id"=>1...
MONGODB | pro-app-mongodb-05:27017 req:1091 | db_api_production.find | SUCCEEDED | 0.004s
MONGODB | pro-app-mongodb-05:27017 req:1092 conn:1:1 | db_api_production.getMore | STARTED | {"getMore"=>#<BSON::Int64:0x0000558bcd89c4d0 #value=166614148534>, "collection"=>"bills", "$db"=>"db_api_production", "$clusterTime"=>{"clusterTime"=>#<BSON::Timestamp:0x0000558bcd88eab0 #seconds=1652511516, #increment=1>, "signature"=>{"hash"=><...
MONGODB | pro-app-mongodb-05:27017 req:1092 | db_api_production.getMore | SUCCEEDED | 0.013s
The slow customer takes 10 seconds and the fast one takes 0.004 s for the same query, yet the slow customer has fewer than 600 documents and the fast one more than 35,000. It makes no sense to me.
We ran a reIndex on the bills collection and re-ran the query over all customers. It seemed to work at the beginning, but on the second run it went slow again, and the same customers are always slower than the fast ones:
[1] pry(main)> Customer.all.collect do |c|
[1] pry(main)* starting = Process.clock_gettime(Process::CLOCK_MONOTONIC)
[1] pry(main)* c.bills.excludes(_id: BSON::ObjectId('62753df4a54d7584e56ea829')).order(reference: :desc).limit(1000).pluck(:reference_string, :id);nil
[1] pry(main)* ending = Process.clock_gettime(Process::CLOCK_MONOTONIC)
[1] pry(main)* [c.acronym, ending - starting]
[1] pry(main)* end
I cannot apply explain to a pluck query. I reviewed the index and it is correctly in place on the collection, but explain on the same query is just as slow:
MONGODB | pro-app-mongodb-05:27017 req:1440 | dbapiproduction.explain | SUCCEEDED | 10.841s
MONGODB | pro-app-mongodb-05:27017 req:2005 | dbapiproduction.explain | SUCCEEDED | 0.006s
Obviously the time differs, but so does docsExamined. The query is the same, apart from, obviously, the ids:
[23] pry(main)> h_slow["queryPlanner"]["parsedQuery"]
=> {"$and"=>
[{"customer_id"=>{"$eq"=>BSON::ObjectId('60c76b9e21225c002044f6c5')}},
{"deleted_at"=>{"$eq"=>nil}},
{"$nor"=>[{"_id"=>{"$eq"=>BSON::ObjectId('62753df4a54d7584e56ea829')}}]}]}
[24] pry(main)> h_fast["queryPlanner"]["parsedQuery"]
=> {"$and"=>
[{"customer_id"=>{"$eq"=>BSON::ObjectId('571636f44a506256d6000003')}},
{"deleted_at"=>{"$eq"=>nil}},
{"$nor"=>[{"_id"=>{"$eq"=>BSON::ObjectId('62753df4a54d7584e56ea829')}}]}]}
"inputStage": {
"advanced": 1000,
"direction": "backward",
"dupsDropped": 0,
"dupsTested": 0,
"executionTimeMillisEstimate": 0,
"indexBounds": {
"reference": [
"[MaxKey, MinKey]"
]
},
"indexName": "reference_1",
"indexVersion": 2,
"invalidates": 0,
"isEOF": 0,
"isMultiKey": false,
"isPartial": false,
"isSparse": false,
"isUnique": false,
"keyPattern": {
"reference": 1
},
"keysExamined": 1000,
"multiKeyPaths": {
"reference": []
},
"nReturned": 1000,
"needTime": 0,
"needYield": 0,
"restoreState": 10,
"saveState": 10,
"seeks": 1,
"seenInvalidated": 0,
"stage": "IXSCAN",
"works": 1000
},
"invalidates": 0,
"isEOF": 0,
"nReturned": 1000,
"needTime": 0,
"needYield": 0,
"restoreState": 10,
"saveState": 10,
"stage": "FETCH",
"works": 1000
},
"invalidates": 0,
"isEOF": 1,
"limitAmount": 1000,
"nReturned": 1000,
"needTime": 0,
"needYield": 0,
"restoreState": 10,
"saveState": 10,
"stage": "LIMIT",
"works": 1001
},
"executionSuccess": true,
"executionTimeMillis": 7,
"nReturned": 1000,
"totalDocsExamined": 1000,
"totalKeysExamined": 1000
}
"inputStage": {
"advanced": 604411,
"direction": "backward",
"dupsDropped": 0,
"dupsTested": 0,
"executionTimeMillisEstimate": 320,
"indexBounds": {
"reference": [
"[MaxKey, MinKey]"
]
},
"indexName": "reference_1",
"indexVersion": 2,
"invalidates": 0,
"isEOF": 1,
"isMultiKey": false,
"isPartial": false,
"isSparse": false,
"isUnique": false,
"keyPattern": {
"reference": 1
},
"keysExamined": 604411,
"multiKeyPaths": {
"reference": []
},
"nReturned": 604411,
"needTime": 0,
"needYield": 0,
"restoreState": 6138,
"saveState": 6138,
"seeks": 1,
"seenInvalidated": 0,
"stage": "IXSCAN",
"works": 604412
},
"invalidates": 0,
"isEOF": 1,
"nReturned": 523,
"needTime": 603888,
"needYield": 0,
"restoreState": 6138,
"saveState": 6138,
"stage": "FETCH",
"works": 604412
},
"invalidates": 0,
"isEOF": 1,
"limitAmount": 1000,
"nReturned": 523,
"needTime": 603888,
"needYield": 0,
"restoreState": 6138,
"saveState": 6138,
"stage": "LIMIT",
"works": 604412
},
"executionSuccess": true,
"executionTimeMillis": 9472,
"nReturned": 523,
"totalDocsExamined": 604411,
"totalKeysExamined": 604411
}
Why do these differences happen, and what can I do to fix this collection?
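Both plans walk the reference_1 index backwards over its entire range ("reference": ["[MaxKey, MinKey]"]) and only filter by customer_id at the FETCH stage, so a customer whose bills sit far from the end of the index forces hundreds of thousands of keys to be examined before 1000 (or all 523) matches are found. One possibility worth testing (my assumption, not something verified against this database) is a compound index covering both the filter and the sort. A sketch in Python with pymongo, using the database and collection names from the logs:
from pymongo import MongoClient, ASCENDING, DESCENDING

# Connection string is illustrative; point it at your own deployment.
client = MongoClient("mongodb://pro-app-mongodb-05:27017")
bills = client["db_api_production"]["bills"]

# Matches the filter (customer_id, deleted_at) and the sort (reference desc),
# so the planner can seek straight to one customer's bills in sorted order.
bills.create_index(
    [("customer_id", ASCENDING), ("deleted_at", ASCENDING), ("reference", DESCENDING)]
)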

How to filter a number of records and get only the outermost records from a Postgres ltree structure?

I have database records arranged in an ltree structure (Postgres ltree extension).
I want to filter these items down to the outermost ancestors of the current selection.
Test cases:
[11, 111, 1111, 2, 22, 222, 2221, 2222] => [11, 2];
[1, 11, 111, 1111, 1112, 2, 22, 222, 2221, 2222, 3, 4, 5] => [1, 2, 3, 4, 5];
[1111, 1112, 2221, 2222] => [1111, 1112, 2221, 2222];
The hierarchy looks like this:
1
|_1.1
| |_1.1.1
| |_1.1.1.1
| |_1.1.1.2
2
|_2.2
| |_2.2.2
| |_2.2.2.1
| |_2.2.2.2
3
|
4
|
5
I have implemented this in Ruby like so.
def fetch_outer_most_items(identifiers)
  ordered_items = Item.where(id: identifiers).order("path DESC")
  items_array = ordered_items.to_a
  outer_most_item_ids = []
  while items_array.size > 0
    item = items_array.pop
    outer_most_item_ids.push(item.id)
    # "<@" is the ltree descendant-of operator
    duplicate_ids = ordered_items.where("items.path <@ '#{item.path}'").pluck(:id)
    if duplicate_ids.any?
      items_array = items_array.select { |i| !duplicate_ids.include?(i.id) }
    end
  end
  ordered_items.where(id: outer_most_item_ids)
end
This eliminates descendants as duplicates while popping items off the list. I'm pretty sure there is a pure-SQL way of doing this, which would be the preferred solution, as this one triggers N+1 queries. Ideally I would add this function as a named scope on the Item model.
Any pointers, please?
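Not a tested answer, but a sketch of the single-query idea: an anti-join that keeps only the rows whose path has no ancestor within the same selection (ltree's @> operator tests "is ancestor of"). It is driven from Python with psycopg2 here purely to keep the examples in one language; the SQL itself is the point, and the table and column names follow the question:
import psycopg2

# Illustrative connection settings; adjust to your database.
conn = psycopg2.connect("dbname=mydb")

SQL = """
SELECT i.id, i.path
FROM items i
WHERE i.id = ANY(%(ids)s)
  AND NOT EXISTS (
      SELECT 1
      FROM items a
      WHERE a.id = ANY(%(ids)s)
        AND a.id <> i.id
        AND a.path @> i.path  -- a is an ancestor of i within the selection
  );
"""

with conn, conn.cursor() as cur:
    cur.execute(SQL, {"ids": [11, 111, 1111, 2, 22, 222, 2221, 2222]})
    print(cur.fetchall())  # expected: the rows for ids 11 and 2
In Rails, the same NOT EXISTS condition could back a named scope on the Item model.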

Can I use cvxpy to split an integer 2D array into two arrays?

I have a problem that I wonder if I can solve using cvxpy:
The problem:
I have a two-dimensional integer array and I want to split it into two arrays in such a way that each row of the source array ends up in either the first or the second array.
The requirement on these arrays is that, for each column, the sum of the integers in array #1 is as close as possible to twice the sum of the integers in array #2.
Example:
Consider the input array:
[
    [1, 2, 3, 4],
    [4, 6, 2, 5],
    [3, 9, 1, 2],
    [8, 1, 0, 9],
    [8, 4, 0, 5],
    [9, 8, 0, 4]
]
The sums of its columns are [33, 30, 6, 29], so ideally we are looking for two arrays whose column sums would be:
Array #1: [22, 20, 4, 19]
Array #2: [11, 10, 2, 10]
Of course this is not always possible, but I am looking for the best solution to this problem.
A possible solution for this specific example might be:
Array #1:
[
    [1, 2, 3, 4],
    [4, 6, 2, 5],
    [8, 4, 0, 5],
    [9, 8, 0, 4]
]
with column sums [22, 20, 5, 18], and
Array #2:
[
    [3, 9, 1, 2],
    [8, 1, 0, 9]
]
with column sums [11, 10, 1, 11].
Any suggestions?
You can use a boolean vector variable to select rows. The only thing left to decide is how much to penalize errors. In this case I just used the norm of the difference vector.
import cvxpy as cp
import numpy as np

data = np.array([
    [1, 2, 3, 4],
    [4, 6, 2, 5],
    [3, 9, 1, 2],
    [8, 1, 0, 9],
    [8, 4, 0, 5],
    [9, 8, 0, 4]
])

x = cp.Variable(data.shape[0], boolean=True)
# Column sums of array #1 are x @ data and of array #2 are (1 - x) @ data,
# so minimize the norm of (sums of #1) - 2 * (sums of #2).
prob = cp.Problem(cp.Minimize(cp.norm((x - 2 * (1 - x)) @ data)))
prob.solve()

A = np.round(x.value) @ data      # column sums of array #1
B = np.round(1 - x.value) @ data  # column sums of array #2
A and B are the column sums of the two row subsets:
(array([21., 20., 4., 19.]), array([12., 10., 2., 10.]))
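To recover the actual row splits rather than just the column sums, the rounded solution can index the original array directly; a short follow-on to the code above:
mask = np.round(x.value).astype(bool)
array_1 = data[mask]    # rows assigned to array #1
array_2 = data[~mask]   # rows assigned to array #2
print(array_1.sum(axis=0), array_2.sum(axis=0))  # matches A and B above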

Case-sensitive formula to perform a COUNTIF for number-letter combination of IDs

We have IDs that are 15 to 18 symbols long, a mix of letters and numbers.
Regularly, we need to perform a COUNTIF() to determine the exact number of unique IDs.
The issue is that sometimes the only difference between one ID and another is whether the case of one letter is upper or lower.
COUNTIF() is not case-sensitive, so we have to apply a very long formula that converts the IDs to a unique combination of numbers in a separate column and then perform the COUNTIF() in yet another column.
It is very important that one of the duplicate IDs is marked with 1, as this is key for further processes.
Is there a simpler but still accurate way to do this with a single formula?
The formula in question:
=IFERROR(CODE(MID(AL3,1,1))&CODE(MID(AL3,2,1))&CODE(MID(AL3,3,1))&CODE(MID(AL3,4,1))&CODE(MID(AL3,5,1))&CODE(MID(AL3,6,1))&CODE(MID(AL3,7,1))&CODE(MID(AL3,8,1))&CODE(MID(AL3,9,1))&CODE(MID(AL3,10,1))&CODE(MID(AL3,11,1))&CODE(MID(AL3,12,1))&CODE(MID(AL3,13,1))&CODE(MID(AL3,14,1))&CODE(MID(AL3,15,1))&IFERROR(CODE(MID(AL3,16,1)),""))
Some dummy sample IDs:
003B999992CcVWS
003B999992GdEDo
003B999992D4afI
003B999992CcVWs
003B999992CcVWZ
003B999992D40gR
003B999992D40gR
003B999992CcVWz
Formula's output:
484851665757575757506799868783
48485166575757575750711006968111
4848516657575757575068529710273
4848516657575757575067998687115
484851665757575757506799868790
4848516657575757575068524810382
4848516657575757575068524810382
4848516657575757575067998687122
The desired output can be seen in the last column on the right:
+---+-----------------+----------------------------------+---------+
| # | Account ID | Formula ID | Countif |
+---+-----------------+----------------------------------+---------+
| 1 | 003B999992CcVWS | 484851665757575757506799868783 | 1 |
+---+-----------------+----------------------------------+---------+
| 2 | 003B999992GdEDo | 48485166575757575750711006968111 | 1 |
+---+-----------------+----------------------------------+---------+
| 3 | 003B999992D4afI | 4848516657575757575068529710273 | 1 |
+---+-----------------+----------------------------------+---------+
| 4 | 003B999992CcVWs | 4848516657575757575067998687115 | 1 |
+---+-----------------+----------------------------------+---------+
| 5 | 003B999992CcVWZ | 484851665757575757506799868790 | 1 |
+---+-----------------+----------------------------------+---------+
| 6 | 003B999992D40gR | 4848516657575757575068524810382 | 1 |
+---+-----------------+----------------------------------+---------+
| 7 | 003B999992D40gR | 4848516657575757575068524810382 | 2 |
+---+-----------------+----------------------------------+---------+
| 8 | 003B999992CcVWz | 4848516657575757575067998687122 | 1 |
+---+-----------------+----------------------------------+---------+
={"#", "Account ID", "Formula ID", "Countif";
ARRAYFORMULA({ROW(INDIRECT("A1:A"&COUNTA(A21:A))), ARRAY_CONSTRAIN({A21:A,
IFERROR(CODE(MID(A21:A, 1, 1))&
CODE(MID(A21:A, 2, 1))&CODE(MID(A21:A, 3, 1))&
CODE(MID(A21:A, 4, 1))&CODE(MID(A21:A, 5, 1))&
CODE(MID(A21:A, 6, 1))&CODE(MID(A21:A, 7, 1))&
CODE(MID(A21:A, 8, 1))&CODE(MID(A21:A, 9, 1))&
CODE(MID(A21:A, 10, 1))&CODE(MID(A21:A, 11, 1))&
CODE(MID(A21:A, 12, 1))&CODE(MID(A21:A, 13, 1))&
CODE(MID(A21:A, 14, 1))&CODE(MID(A21:A, 15, 1))&
IFERROR(CODE(MID(A21:A, 16, 1)), )),
IF(LEN(A21:A), MMULT((
IFERROR(CODE(MID(A21:A, 1, 1))&
CODE(MID(A21:A, 2, 1))&CODE(MID(A21:A, 3, 1))&
CODE(MID(A21:A, 4, 1))&CODE(MID(A21:A, 5, 1))&
CODE(MID(A21:A, 6, 1))&CODE(MID(A21:A, 7, 1))&
CODE(MID(A21:A, 8, 1))&CODE(MID(A21:A, 9, 1))&
CODE(MID(A21:A, 10, 1))&CODE(MID(A21:A, 11, 1))&
CODE(MID(A21:A, 12, 1))&CODE(MID(A21:A, 13, 1))&
CODE(MID(A21:A, 14, 1))&CODE(MID(A21:A, 15, 1))&
IFERROR(CODE(MID(A21:A, 16, 1)), )) = TRANSPOSE(
IFERROR(CODE(MID(A21:A, 1, 1))&
CODE(MID(A21:A, 2, 1))&CODE(MID(A21:A, 3, 1))&
CODE(MID(A21:A, 4, 1))&CODE(MID(A21:A, 5, 1))&
CODE(MID(A21:A, 6, 1))&CODE(MID(A21:A, 7, 1))&
CODE(MID(A21:A, 8, 1))&CODE(MID(A21:A, 9, 1))&
CODE(MID(A21:A, 10, 1))&CODE(MID(A21:A, 11, 1))&
CODE(MID(A21:A, 12, 1))&CODE(MID(A21:A, 13, 1))&
CODE(MID(A21:A, 14, 1))&CODE(MID(A21:A, 15, 1))&
IFERROR(CODE(MID(A21:A, 16, 1)), )))) * (ROW(A21:A) >= TRANSPOSE(ROW(A21:A))),
SIGN(ROW(A21:A))), IFERROR(1/0))}, COUNTA(A21:A), 3)})}
How about MMULT?
Case-sensitive COUNTIFS
=ARRAYFORMULA(MMULT(
N(EXACT(A2:A9,TRANSPOSE(A2:A9))),
ROW(A2:A9)^0
))
Case-sensitive COUNTIFS with increment
=ARRAYFORMULA(VLOOKUP(
{ROW(A2:A9)&A2:A9},
{
QUERY(
{ROW(A2:A9)&A2:A9,A2:A9},
"select Col1,Col2 order by Col2 label Col1'',Col2''"
),
TRANSPOSE(SPLIT(TEXTJOIN("|",0,
IF(TRANSPOSE(ROW(A2:A9)-1)<=QUERY(
{A2:A9},
"select count(Col1) where Col1<>'' group by Col1 label count(Col1)''",
),
TRANSPOSE(ROW(A2:A9)-1),
)
),"|"))
},
{3},
))
Update (2019-09-26 08:01:56): the final formula is
=ARRAYFORMULA(MMULT(
(ROW(A2:A17)>=TRANSPOSE(ROW(A2:A17))) *
EXACT(A2:A17,TRANSPOSE(A2:A17))^1,
ROW(A2:A17)^0
))
Sheet example
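The MMULT trick may be easier to see outside of Sheets: EXACT builds a case-sensitive equality matrix, the ROW(...)>=TRANSPOSE(ROW(...)) mask keeps only matches at or above the current row, and multiplying by a column of ones sums each row into a running count. A small numpy sketch of the same logic, using a few of the dummy IDs above:
import numpy as np

ids = np.array(['003B999992CcVWS', '003B999992D40gR',
                '003B999992D40gR', '003B999992CcVWs'])

# equal[i, j] is True iff ids[i] == ids[j], case-sensitively (like EXACT).
equal = ids[:, None] == ids[None, :]

# Keep only matches in the current row or earlier (ROW >= TRANSPOSE(ROW)).
mask = np.tril(np.ones((len(ids), len(ids)), dtype=bool))

# Row sums (MMULT against a column of ones) give the running count.
print((equal & mask).sum(axis=1))  # -> [1 1 2 1]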

How would I find the mode (stats) of pixel values of an image?

I'm using OpenCV and I'm able to get a pixel of an image (a 3-element tuple) via the code below. However, I'm not quite sure how to calculate the mode of the pixel values in the image.
import cv2
import numpy as np
import matplotlib.pyplot as plt

img = cv2.imread('C:\\Users\\Moondra\\ABEO.png')
#px = img[100,100]  # gets pixel value
#print (px)
I tried:
from scipy import stats
stats.mode(img)[0]
But this returns an array of shape
stats.mode(img)[0].shape
(1, 800, 3)
I'm not sure exactly along which dimension stats.mode is computing the mode, but I'm looking for each pixel value (a 3-element tuple) to count as one element.
EDIT:
For clarity, I'm going to lay out exactly what I'm looking for.
Let's say we have an array of shape (3, 5, 3) that looks like this:
array([[[1, 1, 2],   # [1, 1, 2] represents the RGB values
        [2, 2, 2],
        [1, 2, 2],
        [2, 1, 1],
        [1, 2, 2]],

       [[1, 2, 2],
        [2, 2, 2],
        [2, 2, 2],
        [1, 2, 2],
        [1, 2, 1]],

       [[2, 2, 1],
        [2, 2, 1],
        [1, 1, 2],
        [2, 1, 2],
        [1, 1, 2]]])
I would then convert it to an array that looks like this, for easier calculation:
array([[1, 1, 2],
       [2, 2, 2],
       [1, 2, 2],
       [2, 1, 1],
       [1, 2, 2],
       [1, 2, 2],
       [2, 2, 2],
       [2, 2, 2],
       [1, 2, 2],
       [1, 2, 1],
       [2, 2, 1],
       [2, 2, 1],
       [1, 1, 2],
       [2, 1, 2],
       [1, 1, 2]])
which has shape (15, 3).
I would like to calculate the mode by counting each distinct RGB triple; for this example the counts are:
[1, 1, 2] = 3
[2, 2, 2] = 3
[1, 2, 2] = 4
[2, 1, 1] = 1
[1, 2, 1] = 1
[2, 2, 1] = 2
[2, 1, 2] = 1
so the mode here is [1, 2, 2].
Thank you.
From the description, it seems you are after the pixel value that occurs most often in the input image. Here's one efficient approach using the concept of views -
def get_row_view(a):
    void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[-1])))
    a = np.ascontiguousarray(a)
    return a.reshape(-1, a.shape[-1]).view(void_dt).ravel()

def get_mode(img):
    unq, idx, count = np.unique(get_row_view(img), return_index=True, return_counts=True)
    return img.reshape(-1, img.shape[-1])[idx[count.argmax()]]
We can also make use of np.unique with its axis argument, like so -
def get_mode(img):
    unq, count = np.unique(img.reshape(-1, img.shape[-1]), axis=0, return_counts=True)
    return unq[count.argmax()]
Sample run -
In [69]: img = np.random.randint(0,255,(4,5,3))
In [70]: img.reshape(-1,3)[np.random.choice(20,10,replace=0)] = 120
In [71]: img
Out[71]:
array([[[120, 120, 120],
[ 79, 105, 218],
[ 16, 55, 239],
[120, 120, 120],
[239, 95, 209]],
[[241, 18, 221],
[202, 185, 142],
[ 7, 47, 161],
[120, 120, 120],
[120, 120, 120]],
[[120, 120, 120],
[ 62, 41, 157],
[120, 120, 120],
[120, 120, 120],
[120, 120, 120]],
[[120, 120, 120],
[ 0, 107, 34],
[ 9, 83, 183],
[120, 120, 120],
[ 43, 121, 154]]])
In [74]: get_mode(img)
Out[74]: array([120, 120, 120])
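For completeness, the reshape-then-count procedure described in the question's EDIT can also be written directly with collections.Counter; a straightforward, if slower, sketch (the function name get_mode_counter is mine):
from collections import Counter
import numpy as np

def get_mode_counter(img):
    # Flatten (H, W, 3) to (H*W, 3), count each RGB triple, take the most common.
    pixels = img.reshape(-1, img.shape[-1])
    counts = Counter(map(tuple, pixels))
    return np.array(counts.most_common(1)[0][0])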
