Implementing LSTM with Theano scan, way slower than using loops

I am using Theano/Pylearn2 to implement an LSTM model inside my own network. However, I've found that Theano scan is much, much slower than using plain loops. I used the Theano profiler and got the following results:
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
95.4% 95.4% 25.255s 4.31e-02s Py 586 3 theano.scan_module.scan_op.Scan
1.8% 97.2% 0.466s 4.72e-05s C 9864 41 theano.sandbox.cuda.basic_ops.GpuElemwise
0.8% 97.9% 0.199s 8.75e-05s C 2276 10 theano.sandbox.cuda.basic_ops.GpuAlloc
0.7% 98.7% 0.196s 1.14e-04s C 1724 8 theano.sandbox.cuda.blas.GpuDot22
0.3% 99.0% 0.087s 1.06e-04s C 828 3 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.2% 99.2% 0.051s 1.66e-04s Py 310 2 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
and the Ops,
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
77.2% 77.2% 20.433s 7.40e-02s Py 276 1 forall_inplace,gpu,grad_of_lstm__layers}
18.2% 95.4% 4.822s 1.56e-02s Py 310 2 forall_inplace,gpu,lstm__layers}
So lots and lots of time is spent on Scan (which is kind of expected, but I didn't expect it to be so slow).
The main body of my code is
def fprop(self, state_below, state_prev=None, cell_prev=None):
    if state_prev is None:
        state_prev = self.state_prev
    if cell_prev is None:
        cell_prev = self.cell_prev
    i_gate = T.nnet.sigmoid(T.dot(state_below, self.Wi) +
                            T.dot(state_prev, self.Ui))
    f_gate = T.nnet.sigmoid(T.dot(state_below, self.Wf) +
                            T.dot(state_prev, self.Uf))
    C = T.tanh(T.dot(state_below, self.Wc) +
               T.dot(state_prev, self.Uc))
    C = i_gate * C + f_gate * cell_prev
    o_gate = T.nnet.sigmoid(T.dot(state_below, self.Wo) +
                            T.dot(state_prev, self.Uo) +
                            T.dot(C, self.Vo))
    h_out = o_gate * T.tanh(C)
    return h_out, C
And I wrote my scan as:
[h, c, out], _ = theano.scan(fn=self.fprop_with_output,
                             sequences=[X.T, Y[:,1:].T],
                             outputs_info=[dict(initial=h_, taps=[-1]),
                                           dict(initial=c_, taps=[-1]),
                                           None],
                             n_steps=X.shape[1]-1)
One thing I've noticed is that the Theano scan op is listed with a Python (Py) implementation. Is that the reason why this is so ridiculously slow, or did I do something wrong? And why does Theano use a Python implementation of Scan instead of a C one?
(I said using loops is faster, but that is only true at runtime; for a large model I did not manage to compile the loop-based version within a reasonable amount of time.)

This was asked a while ago, but I had (and still have) the same problem. The answer is that scan is slow on the GPU.
See: https://github.com/Theano/Theano/issues/1168

It takes time for the Theano developers to implement scan and the gradient of scan in C and on the GPU, because they are much more complicated than other functions. That is why, when you profile, you see GpuElemwise, GpuGemv, GpuDot22, etc., but no GpuScan or GpuGradofScan.
Meanwhile, you can only fall back to for loops.
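For what it's worth, here is a minimal sketch of that fallback (my own illustration, assuming a fixed Python-int number of timesteps n_steps and the fprop method from the question). The recurrence is unrolled at graph-construction time, which is exactly why compilation gets slow for large models:

# Hypothetical unrolled version, not from the original post.
h, c = h_, c_                      # initial states, as in the scan version
outputs = []
for t in range(n_steps):           # n_steps must be a plain Python int
    h, c = self.fprop(X[:, t], state_prev=h, cell_prev=c)
    outputs.append(h)
H = T.stack(outputs, axis=0)       # one hidden state per timestep, like scan's output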

Related

Reasons why swifter/dask/ray only use one core for an apply task?

I have this function that I would like to apply to a large dataframe in parallel:
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')
def standardize_smiles(smiles):
    if smiles is None:
        return None
    try:
        mol = Chem.MolFromSmiles(smiles)
        # removeHs, disconnect metal atoms, normalize the molecule, reionize the molecule
        clean_mol = rdMolStandardize.Cleanup(mol)
        # if many fragments, get the "parent" (the actual mol we are interested in)
        parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)
        # try to neutralize the molecule
        uncharger = rdMolStandardize.Uncharger()  # annoying, but necessary as no convenience method exists
        uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)
        # note that no attempt is made at reionization at this step
        # nor at ionization at some pH (rdkit has no pKa calculator)
        # the main aim is to represent all molecules from different sources
        # in a (single) standard way, for use in ML, catalogue, etc.
        te = rdMolStandardize.TautomerEnumerator()  # idem
        taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)
        return Chem.MolToSmiles(taut_uncharged_parent_clean_mol)
    except Exception:
        return False
standardize_smiles('CCC')
'CCC'
However, neither Dask, nor Swifter, nor Ray can do the job. All frameworks use a single CPU for some reason.
Native Pandas
import pandas as pd
N = 1000
smiles_test = pd.DataFrame({'smiles': ['CCC']*N})
smiles_test['standardized_smiles'] = smiles_test.smiles.apply(standardize_smiles)
CPU times: user 3.58 s, sys: 0 ns, total: 3.58 s
Wall time: 3.58 s
Swifter 1.3.4
smiles_test['standardized_smiles'] = smiles_test.smiles.swifter.allow_dask_on_strings(True).apply(standardize_smiles)
CPU times: user 892 ms, sys: 31.4 ms, total: 923 ms
Wall time: 5.14 s
While this WORKS with the dummy data, it does not with the real data, whose strings are a bit more complicated than the ones in the dummy data.
It seems swifter first needs some time to prepare the parallel execution and only uses one core, but then uses more cores. However, for the real data, it only uses 3 out of 8 cores.
I have the same issue with other frameworks such as dask, ray, modin, swifter.
Is there something that I am missing here? Is there a problem when the dataframe contains strings? Why does the parallel execution take so much time even on a single computer (with multiple cores)? Or is there an issue with the RDKit library that I am using that makes it difficult to parallelize the above function?
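One way to narrow this down (a sketch of my own, not part of the original question) is to compare against a plain multiprocessing.Pool baseline; if that also fails to keep all cores busy, the bottleneck is the function or the data rather than swifter/dask/ray:

# Hypothetical baseline: run the same function through multiprocessing directly.
# standardize_smiles is the function defined above.
import multiprocessing as mp
import pandas as pd

if __name__ == '__main__':
    smiles_test = pd.DataFrame({'smiles': ['CCC'] * 1000})
    with mp.Pool(processes=mp.cpu_count()) as pool:
        smiles_test['standardized_smiles'] = pool.map(standardize_smiles,
                                                      smiles_test['smiles'])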

For each element, loop over all previous elements

I have a 2D JAX array containing an image.
For each pixel P[y, x] of the image, I would like to loop over all pixels P[y, x-i] to the left of that pixel and reduce those to a single value. The exact reduction computation involves finding a particular maximum over a weighted sum involving those pixels' values, as well as i and x. Therefore, the result (or any intermediate results) for P[y, x] can't be reused for P[y, x+1] either; this is an O(x²y) operation overall.
Can I accomplish this somewhat efficiently in JAX? If so, how?
JAX does not provide any native tool to do this sort of operation for an arbitrary function. It can be done via lax.scan or perhaps jnp.cumsum for functions where each successive value can be computed from the last, but it sounds like that is not the case here.
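For completeness, here is a minimal sketch (my own illustration, not from the original answer) of the lax.scan pattern that does work when each value can be computed from a running carry, e.g. a cumulative sum along a row:

import jax.numpy as jnp
from jax import lax

def cumsum_scan(row):
    # carry the running sum from left to right; equivalent to jnp.cumsum(row)
    def step(carry, x):
        carry = carry + x
        return carry, carry
    _, out = lax.scan(step, 0.0, row)
    return out

print(cumsum_scan(jnp.arange(5.0)))  # [ 0.  1.  3.  6. 10.]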
I believe the best you can do is to combine vmap with Python for-loops to achieve what you want: just be aware that during JIT compilation JAX will flatten all for loops, so if your image size is very large, the compilation time will be long. Here's a short example:
import jax.numpy as jnp
from jax import vmap
def reduction(x):
    # some 1D reduction
    assert x.ndim == 1
    return len(x) + jnp.sum(x)

def cumulative_apply(row, reduction=reduction):
    return jnp.array([reduction(row[:i]) for i in range(1, len(row) + 1)])
P = jnp.arange(20).reshape(4, 5)
result = vmap(cumulative_apply)(P)
print(result)
# [[ 1 3 6 10 15]
# [ 6 13 21 30 40]
# [11 23 36 50 65]
# [16 33 51 70 90]]
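A small usage note (mine, not from the original answer): wrapping the vmapped function in jit makes the compile-time caveat concrete, since the Python loop is unrolled into the compiled program and compilation cost grows with the row length:

from jax import jit

fast = jit(vmap(cumulative_apply))
print(fast(P))  # the first call compiles the fully unrolled loop; later calls are fast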

Can I get extra information to a custom scorer function in sklearn?

I am performing a classification task which is essentially doing algorithm configuration, i.e. trying to pick a configuration (or 'mode') which is likely to make the problem-solving algorithm finish in the quickest time.
I am learning to classify the "best" configuration based on features of problem instances. I see that scikit-learn enables you to create your own scoring function to use in tuning the models. However the score_func only takes the true label and the predicted label as input.
Is it possible to identify which row in the dataset a prediction came from (when passing to this custom scorer)? That way I could figure out the performance hit of a predicted ("wrong") config and score the model accordingly. Basically sometimes a "wrong" selection can still be very good and close to the best, but a naive classification has no way of knowing this when the classification labels are purely based on the best config.
Here's a contrived example to illustrate what I'm trying to do
import random as rnd
import pandas as pd
rnd.seed('hello')
probs = [f'instance_{i}' for i in range(6)]
confs = ('analytic', 'bruteforce', 'hybrid')
times = [(p,c,60*rnd.random()) for p in probs for c in confs]
df_alltimes = pd.DataFrame(times, columns=('problem', 'config', 'time'))
print(df_alltimes)
bestrows = df_alltimes.groupby(['problem'])['time'].idxmin()
dataset = df_alltimes.loc[bestrows,['config']].\
    rename(columns={'config':'best_config'})
feats = [[rnd.random() for p in range(len(probs))] for f in range(5) ]
for i in range(len(feats)):
    dataset[f'feature_{i}'] = feats[i]
print(dataset)
df_alltimes:
problem config time
0 instance_0 analytic 15.307044
1 instance_0 bruteforce 36.742846
2 instance_0 hybrid 35.053416
3 instance_1 analytic 57.781358
4 instance_1 bruteforce 31.723275
5 instance_1 hybrid 8.080238
6 instance_2 analytic 4.211297
7 instance_2 bruteforce 24.034830
8 instance_2 hybrid 39.073023
9 instance_3 analytic 36.325485
10 instance_3 bruteforce 14.717841
11 instance_3 hybrid 57.103908
12 instance_4 analytic 7.358539
13 instance_4 bruteforce 10.805536
14 instance_4 hybrid 2.605044
15 instance_5 analytic 0.489870
16 instance_5 bruteforce 42.888858
17 instance_5 hybrid 58.634073
dataset:
best_config feature_0 feature_1 feature_2 feature_3 feature_4
0 analytic 0.645388 0.641626 0.975619 0.680713 0.209235
5 hybrid 0.993443 0.221038 0.893763 0.408532 0.254791
6 analytic 0.263872 0.142887 0.264538 0.166985 0.800054
10 bruteforce 0.155023 0.601300 0.258767 0.614732 0.850529
14 hybrid 0.766183 0.993692 0.597047 0.401482 0.275133
15 analytic 0.386327 0.065699 0.349115 0.370136 0.357329
I am using sklearn with the dataset where the X would be the feature columns and the y would be the best_config column. In this example, the "bad" choices for instance_0 are both almost equally bad, but for instance_1, the two wrong choices are not equally bad. So I'd like my custom scorer to be able to reflect this somehow. Is that possible?
In the end I did find a way to get the information I was after in the original question. If you're passing a pandas.Series as your target labels, the index attribute is available, so you can look up whatever you want in the full dataset.
In the solution below, the first part is pretty much the same as the original minimal working example - i.e. generating a fake dataset.
In the second part, a custom scorer function is defined, which is then passed to the cross-validating hyperparameter tuner, RandomizedSearchCV. Please bear in mind the data is garbage, so the "results" are meaningless; this is just a demo of how to refer back to a fuller set of results so that you can evaluate the quality of predictions made during hyperparameter tuning based on more specialised information rather than just "match / fail" when doing a classification.
import numpy as np
import pandas as pd
import random as rnd
INSTANCES = 200
FEATURES = 5
HP_ITER = 10
SEED = 1984
# invent timings for some problems run with different configurations
rnd.seed(SEED)
probs = [f'p_{i:03d}' for i in range(INSTANCES)]
confs = ('analytic', 'bruteforce', 'hybrid')
times = [(p,c,60*rnd.random()) for p in probs for c in confs]
df_times = pd.DataFrame(times, columns=('problem', 'config', 'time'))
# pick out the fastest config for each problem
bestrows = df_times.groupby(['problem'])['time'].idxmin()
dataset = df_times.loc[bestrows,['config','problem']]\
    .rename(columns={'config':'target'})\
    .reset_index(drop=True)
# invent some features for each problem
feats = [[rnd.random() for _ in probs] for f in range(FEATURES) ]
for i in range(len(feats)):
    dataset[f'feature_{i}'] = feats[i]
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
# split our data into training and test sets
df_trn = dataset.sample(frac=0.8, replace=False, random_state=SEED)
df_tst = dataset.loc[~dataset.index.isin(df_trn.index)]
def _vb_loss(xvals, yvals, validation=False):
    """A custom scorer for cross-validation which uses distance to Virtual Best"""
    # use the .index attribute to access the relevant rows in the
    # timing data frame
    source = df_tst if validation else df_trn
    data = source.loc[xvals.index].reindex(columns=['problem','target'])
    data['truevals'] = xvals
    data['predvals'] = yvals
    # what's the best time available for each problem?
    data = data.merge(
        df_times, left_on=['problem','truevals'], right_on=['problem', 'config']
    ).rename(columns={'time' : 'best_time'}).drop(columns=['config'])
    # what's the time for our predicted choices?
    data = data.merge(
        df_times, left_on=['problem','predvals'], right_on=['problem','config']
    ).rename(columns={'time' : 'pred_time'}).drop(columns=['config'])
    # how far away were the predictions in total?
    residual_seconds = np.sum( data['pred_time'] - data['best_time'] )
    return residual_seconds

def fitAndPredict(use_custom_scorer=False):
    """Fit a model and make some predictions """
    our_scorer = make_scorer(_vb_loss, greater_is_better=False)
    hyperparameters = {'criterion' : ['gini', 'entropy'],
                       'n_estimators' : list(range(50,250)),
                       'max_depth' : list(range(2,32))
                       }
    model = RandomizedSearchCV(
        RandomForestClassifier(random_state=SEED),
        hyperparameters,
        n_iter = HP_ITER,
        scoring = our_scorer if use_custom_scorer else None,
        verbose = 1,
        random_state = SEED,
    )
    model.fit(
        df_trn.drop(columns=['target','problem']),
        df_trn['target']
    )
    preds = model.predict(df_tst.drop(columns=['target','problem']))
    return _vb_loss(df_tst['target'], preds, validation=True)
print("Timings for all configs:", df_times, "", sep="\n")
print("Labelled dataset:", dataset, "", sep="\n")
print("Test loss with default CV scorer :", fitAndPredict(False))
print("Test loss with custom CV scorer :", fitAndPredict(True))
Here's the output:
** Timings for all configs **
problem config time
0 p_000 analytic 21.811701
1 p_000 bruteforce 29.652341
2 p_000 hybrid 20.376605
3 p_001 analytic 12.989269
4 p_001 bruteforce 51.759137
.. ... ... ...
595 p_198 bruteforce 10.874092
596 p_198 hybrid 14.723661
597 p_199 analytic 24.984775
598 p_199 bruteforce 4.899111
599 p_199 hybrid 36.188729
[600 rows x 3 columns]
** Labelled dataset **
target problem feature_0 feature_1 feature_2 feature_3 feature_4
0 hybrid p_000 0.864952 0.487293 0.946654 0.863503 0.310866
1 analytic p_001 0.514093 0.007643 0.948784 0.582419 0.258159
2 bruteforce p_002 0.319059 0.872320 0.321495 0.807644 0.158471
3 analytic p_003 0.421063 0.955742 0.114808 0.980013 0.900057
4 hybrid p_004 0.325935 0.125824 0.697967 0.037196 0.923626
.. ... ... ... ... ... ... ...
195 hybrid p_195 0.179126 0.578338 0.391535 0.632501 0.442677
196 bruteforce p_196 0.827637 0.641567 0.710201 0.833341 0.215357
197 hybrid p_197 0.116661 0.480170 0.253893 0.623913 0.465419
198 bruteforce p_198 0.670555 0.037084 0.954332 0.408546 0.935973
199 bruteforce p_199 0.371541 0.463060 0.549176 0.581093 0.391114
[200 rows x 7 columns]
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=None)]: Done 50 out of 50 | elapsed: 8.8s finished
Test loss with default CV scorer : 542.5191014477357
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=None)]: Done 50 out of 50 | elapsed: 9.1s finished
Test loss with custom CV scorer : 522.3236277796698
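One detail worth spelling out (my note, not from the original answer): with greater_is_better=False, make_scorer hands RandomizedSearchCV the negated loss so that "higher is better" still holds during the search, while the final test losses above come from calling _vb_loss directly and are therefore positive seconds. A tiny standalone illustration of the sign flip, with made-up toy data:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer

# toy scorer: count the mismatches (a "loss", hence greater_is_better=False)
toy_scorer = make_scorer(lambda y, y_pred: float(np.sum(y != y_pred)),
                         greater_is_better=False)
clf = DummyClassifier(strategy='most_frequent').fit([[0], [1]], [0, 0])
print(toy_scorer(clf, [[0], [1]], np.array([0, 1])))  # -1.0: one mismatch, negated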

CVXPY trying to formulate big problem: ValueError: negative dimensions are not allowed. Is RAM usage the problem here?

I am trying to run what is, in my opinion, a fairly large optimization problem; you can see the code below. The variable b has size 500 x 96. What I am trying to do is to match a sum of timeseries profiles (35136 15-minute timesteps) to a bigger profile by minimizing their difference. With the same formulation and a much smaller problem (672 timesteps and a b variable of size 10 x 5), the problem is solved in under 2 seconds without an issue. But when I run it for the full-scale problem I get the error you see below.
I am running this on Jupyter Lab and python 3.7.4. The python installation is done with conda.
I would expect the problem to solve as with the much smaller problem. But when I run this one, RAM usage explodes up to 100 GB (about 99% of the available RAM on the server). After a while the RAM usage goes down and then a periodic swing begins (RAM goes up and down from 50% to 100% every few minutes). From the error, and after a lot of googling, my suspicion is that the problem is too big for the memory and that at some point the data is getting broken down into smaller pieces. I do not think it even reaches the point where the solver does its work. I tried to optimize the code by vectorizing everything (current version) and avoiding loops etc. in the formulation, but this did not change anything. Do you have any clue whether this is a bug or a limitation? Or do you maybe have an idea of how to solve this?
X_opt = cp.Constant(np.asarray(X.iloc[:,:500])) # the array size is (35136,500)
K_opt = cp.Constant(np.asarray(K.YearlyDemand)) # the vector size is 96
b = cp.Variable((500,96),boolean = True, value = np.zeros((500,96)))
Y_opt = cp.Constant(np.asarray(y)) # the vector size is 35136
constraints = []
constraints.append( cp.sum(b, axis = 0) == 1 ) # the sum of the elements of every column of b must be equal to 1
constraints.append( cp.sum(b, axis = 1) <= 1 ) # the sum of the elements of every row of b must be smaller or equal to 1
objective = cp.Minimize(cp.sum(cp.abs(Y_opt-cp.sum((cp.diag(K_opt)*((X_opt@b).T)).T, axis = 1))))
prob = cp.Problem(objective, constraints)
prob.solve(solver = cp.GLPK_MI, verbose = True)
ValueError Traceback (most recent call last)
in
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\problems\problem.py in solve(self, *args, **kwargs)
287 else:
288 solve_func = Problem._solve
--> 289 return solve_func(self, *args, **kwargs)
290
291 @classmethod
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\problems\problem.py in _solve(self, solver, warm_start, verbose, parallel, gp, qcp, **kwargs)
567 self._construct_chains(solver=solver, gp=gp)
568 data, solving_inverse_data = self._solving_chain.apply(
--> 569 self._intermediate_problem)
570 solution = self._solving_chain.solve_via_data(
571 self, data, warm_start, verbose, kwargs)
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\reductions\chain.py in apply(self, problem)
63 inverse_data = []
64 for r in self.reductions:
---> 65 problem, inv = r.apply(problem)
66 inverse_data.append(inv)
67 return problem, inverse_data
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\reductions\matrix_stuffing.py in apply(self, problem)
98 # Batch expressions together, then split apart.
99 expr_list = [arg for c in cons for arg in c.args]
--> 100 Afull, bfull = extractor.affine(expr_list)
101 if 0 not in Afull.shape and 0 not in bfull.shape:
102 Afull = cvxtypes.constant()(Afull)
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\utilities\coeff_extractor.py in affine(self, expr)
76 size = sum([e.size for e in expr_list])
77 op_list = [e.canonical_form[0] for e in expr_list]
---> 78 V, I, J, b = canonInterface.get_problem_matrix(op_list, self.id_map)
79 A = sp.csr_matrix((V, (I, J)), shape=(size, self.N))
80 return A, b.flatten()
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\cvxcore\python\canonInterface.py in get_problem_matrix(linOps, id_to_col, constr_offsets)
65
66 # Unpacking
---> 67 V = problemData.getV(len(problemData.V))
68 I = problemData.getI(len(problemData.I))
69 J = problemData.getJ(len(problemData.J))
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\cvxcore\python\cvxcore.py in getV(self, values)
320
321 def getV(self, values):
--> 322 return _cvxcore.ProblemData_getV(self, values)
323
324 def getI(self, values):
ValueError: negative dimensions are not allowed
This problem is solved here:
https://github.com/cvxgrp/cvxpy/issues/826#issuecomment-648618636
Note that the general problem is that large problems create underlying matrices too large for the numpy.int32 indexing that CVXPY uses. You can modify the code in CVXPY fairly easily to continue using the SCS solver.
You will have to modify the file canonInterface.py here:
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\cvxcore\python\
If you have trouble finding the second file to modify, just modify the first one, and use the traceback to find the second file.
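If you are not sure where that file lives in your own environment (a small helper of my own, not from the answer), you can ask Python for the path of the installed module:

# Print the location of the canonInterface module to be patched.
import cvxpy.cvxcore.python.canonInterface as canonInterface
print(canonInterface.__file__)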

Is there an SIMD instruction to achieve batch array memory index mapping?

In my RGB to grey case:
Y = (77*R + 150*G + 29*B) >> 8;
I know SIMD (NEON, SSE2) can do something like:
foreach 8 elements:
{A0,A1,A2,A3,A4,A5,A6,A7} = 77*{R0,R1,R2,R3,R4,R5,R6,R7}
{B0,B1,B2,B3,B4,B5,B6,B7} = 150*{G0,G1,G2,G3,G4,G5,G6,G7}
{C0,C1,C2,C3,C4,C5,C6,C7} = 29*{B0,B1,B2,B3,B4,B5,B6,B7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {A0,A1,A2,A3,A4,A5,A6,A7} + {B0,B1,B2,B3,B4,B5,B6,B7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} + {C0,C1,C2,C3,C4,C5,C6,C7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} >> 8
However, the multiply instruction takes at least 2 clock cycles, and since R, G, B are in [0, 255],
we can use three lookup tables (arrays of length 256) to store the partial results
77*R (call it X), 150*G (call it Y), and 29*B (call it Z).
So I'm looking for instructions that can do this:
foreach 8 elements:
{A0,A1,A2,A3,A4,A5,A6,A7} = {X[R0],X[R1],X[R2],X[R3],X[R4],X[R5],X[R6],X[R7]}
{B0,B1,B2,B3,B4,B5,B6,B7} = {Y[G0],Y[G1],Y[G2],Y[G3],Y[G4],Y[G5],Y[G6],Y[G7]}
{C0,C1,C2,C3,C4,C5,C6,C7} = {Z[B0],Z[B1],Z[B2],Z[B3],Z[B4],Z[B5],Z[B6],Z[B7]}
{D0,D1,D2,D3,D4,D5,D6,D7} = {A0,A1,A2,A3,A4,A5,A6,A7} + {B0,B1,B2,B3,B4,B5,B6,B7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} + {C0,C1,C2,C3,C4,C5,C6,C7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} >> 8
Any good suggestions?
There are no byte or word gather instructions in AVX2 / AVX512, and no gathers at all in NEON. The DWORD gathers that do exist are much slower than a multiply! e.g. one per 5 cycle throughput for vpgatherdd ymm,[reg + scale*ymm], ymm, according to Agner Fog's instruction table for Skylake.
You can use shuffles as a parallel table-lookup. But your table for each lookup is 256 16-bit words. That's 512 bytes. AVX512 has some shuffles that select from the concatenation of 2 registers, but that's "only" 2x 64 bytes, and the byte or word element-size versions of those are multiple uops on current CPUs. (e.g. AVX512BW vpermi2w). They are still fantastically powerful compared to vpshufb, though.
So using a shuffle as a LUT won't work in your case, but it does work very well for some cases, e.g. for popcount you can split bytes into 4-bit nibbles and use vpshufb to do 32 lookups in parallel from a 16-element table of bytes.
Normally for SIMD you want to replace table lookups with computation, because computation is much more SIMD friendly.
Suck it up and use pmullw / _mm_mullo_epi16. You have instruction-level parallelism, and Skylake has 2 per clock throughput for 16-bit SIMD multiply (but 5 cycle latency). For image processing, normally throughput matters more than latency, as long as you keep the latency within reason so out-of-order execution can hide it.
If your multipliers ever have few enough 1 bits in their binary representation, you could consider using shift/add instead of an actual multiply. e.g. B * 29 = B * 32 - B - B * 2. Or B<<5 - B<<1 - B. That many instructions probably has more throughput cost than a single multiply, though. If you could do it with just 2 terms, it might be worth it. (But then again, still maybe not, depending on the CPU. Total instruction throughput and vector ALU bottlenecks are a big deal.)
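As a quick sanity check of that shift/add decomposition (my own sketch, using NumPy as a scalar stand-in for the vector elements):

# Verify B*29 == (B << 5) - (B << 1) - B for every possible byte value.
import numpy as np
B = np.arange(256, dtype=np.int32)
assert np.array_equal(B * 29, (B << 5) - (B << 1) - B)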

Resources