dask read_parquet with pyarrow memory blow up

I am using dask to write and read parquet. I write with the fastparquet engine and read with the pyarrow engine.
My worker has 1 GB of memory. With fastparquet the memory usage is fine, but when I switch to pyarrow it just blows up and causes the worker to restart.
I have a reproducible example below which fails with pyarrow on a worker with a 1 GB memory limit.
In reality my dataset is much bigger than this. The only reason for using pyarrow is the speed boost it gives me while scanning compared to fastparquet (somewhere around 7x-8x).
dask : 0.17.1
pyarrow : 0.9.0.post1
fastparquet : 0.1.3
import dask.dataframe as dd
import numpy as np
import pandas as pd

size = 9900000
tmpdir = '/tmp/test/outputParquet1'

d = {'a': np.random.normal(0, 0.3, size=size).cumsum() + 50,
     'b': np.random.choice(['A', 'B', 'C'], size=size),
     'c': np.random.choice(['D', 'E', 'F'], size=size),
     'd': np.random.normal(0, 0.4, size=size).cumsum() + 50,
     'e': np.random.normal(0, 0.5, size=size).cumsum() + 50,
     'f': np.random.normal(0, 0.6, size=size).cumsum() + 50,
     'g': np.random.normal(0, 0.7, size=size).cumsum() + 50}
df = dd.from_pandas(pd.DataFrame(d), 200)

df.to_parquet(tmpdir, compression='snappy', write_index=True,
              engine='fastparquet')

# engine = 'pyarrow'     # fails due to worker restart
engine = 'fastparquet'   # works fine
df_partitioned = dd.read_parquet(tmpdir + "/*.parquet", engine=engine)
print(df_partitioned.count().compute())
df_partitioned.query("b == 'A'").count().compute()
Edit: My original setup has Spark jobs running that write data in parallel into partitions using fastparquet, so the metadata file is created in the innermost partition rather than the parent directory. Hence I use glob paths instead of the parent directory (fastparquet is much faster when reading from the parent directory, whereas pyarrow wins when scanning with a glob path).

I recommend selecting the columns you need in the read_parquet call:
df = dd.read_parquet('/path/to/*.parquet', engine='pyarrow', columns=['b'])
This lets you efficiently read only the few columns you need rather than all of the columns at once.

Some timing results on my non-memory-restricted system
With your example data
In [17]: df_partitioned = dd.read_parquet(tmpdir, engine='fastparquet')
In [18]: %timeit df_partitioned.count().compute()
2.47 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: df_partitioned = dd.read_parquet(tmpdir, engine='pyarrow')
In [20]: %timeit df_partitioned.count().compute()
1.93 s ± 96.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With columns b and c converted to categorical before writing
In [30]: df_partitioned = dd.read_parquet(tmpdir, engine='fastparquet')
In [31]: %timeit df_partitioned.count().compute()
1.25 s ± 83.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [32]: df_partitioned = dd.read_parquet(tmpdir, engine='pyarrow')
In [33]: %timeit df_partitioned.count().compute()
1.82 s ± 63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With fastparquet direct, single-threaded
In [36]: %timeit fastparquet.ParquetFile(tmpdir).to_pandas().count()
1.82 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With 20 partitions instead of 200 (fastparquet, categories)
In [42]: %timeit df_partitioned.count().compute()
863 ms ± 78.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You could also filter as you load the data, e.g. by a specific column:
df = dd.read_parquet('/path/to/*.parquet', engine='fastparquet', filters=[(COLUMN, 'operation', 'SOME_VALUE')])
Here 'operation' stands for comparisons such as ==, >, <, and so on.
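For instance, with the example data you could restrict the read to row groups that can contain b == 'A' (a sketch; filters only prune whole row groups/partitions based on their statistics, so the benefit depends on how the data is laid out):
# Hypothetical usage with the example data: only row groups whose statistics
# may contain b == 'A' are read, and only column 'b' is loaded.
df = dd.read_parquet(tmpdir + "/*.parquet", engine='fastparquet',
                     columns=['b'], filters=[('b', '==', 'A')])
print(df.query("b == 'A'").count().compute())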

Related

Drift Detection in categorical variables of high cardinality (10000+)

I am trying to solve a drift detection problem where I have to find the drift in high-cardinality (10000+) categorical variables such as ip_address, zipcode, and city. I have data points in the order of millions. I have tried the following methods:
chi-square test from the evidently Python package: https://github.com/evidentlyai/evidently/blob/main/src/evidently/analyzers/stattests/chisquare_stattest.py
maximum mean discrepancy (MMD) test with a GaussianRBF kernel from alibi-detect: https://github.com/SeldonIO/alibi-detect/blob/master/alibi_detect/cd/mmd.py
I have faced the following problems while applying these methods to my data:
In the chi-square test, there is a constraint that we should have the same set of categories in both the training and inference datasets. This is highly unlikely for features like ip_address and zipcode. Some categories are present in the training data but not in the inference data; for those I don't get an observed frequency, and I can assume their frequency is 0 as a workaround.
But there are also categories which are newly introduced in the inference dataset and are not present in the training dataset. For those I cannot derive an expected frequency from the training dataset, so I would have 0 in the denominator of the chi-square formula and the test statistic would be NaN. As a workaround, I can assume a minimum expected frequency of 1. But I wonder whether this is the correct way to approach drift detection.
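For what it's worth, that workaround (union the category sets, fill missing observed counts with 0, and floor the expected counts at 1) can be sketched roughly as below; the helper name chi2_drift and the frequency floor are illustrative choices, not from any package:
import numpy as np
import pandas as pd
from scipy.stats import chisquare

def chi2_drift(train_col, infer_col, min_expected=1.0):
    # Align both frequency tables on the union of categories.
    train_counts = train_col.value_counts()
    infer_counts = infer_col.value_counts()
    categories = train_counts.index.union(infer_counts.index)
    observed = infer_counts.reindex(categories, fill_value=0).to_numpy(dtype=float)
    expected = train_counts.reindex(categories, fill_value=0).to_numpy(dtype=float)
    # Scale expected counts to the inference sample size, then floor them so
    # categories unseen in training do not put a zero in the denominator.
    expected = expected * observed.sum() / expected.sum()
    expected = np.clip(expected, min_expected, None)
    expected = expected * observed.sum() / expected.sum()  # re-normalise after clipping
    return chisquare(f_obs=observed, f_exp=expected)

# Example with hypothetical dataframes:
# stat, p_value = chi2_drift(train_df['zipcode'], infer_df['zipcode'])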
Moreover, the larger problem is the following:
The nature of these categorical features is such that they can take any possible value from a very, very large set, and I have no control over which values occur: the users of the system can log in from any IP address and any zipcode. This makes it very difficult to detect real drift in the data, and methods like the chi-square test can always give a significant result for such features. Is there any method that can handle drift detection for such features, taking into account the high cardinality and the nature of the data described above?
In the MMD test with the GaussianRBF kernel, we calculate pairwise distances between two sets of vectors. X is my training dataset, which has 15 million records and 10 features, and Y is my inference dataset, which has 10 million records and 10 features. When I perform the MMD test on these datasets, I get the following:
a) K_XX = within similarity of X
b) K_YY = within similarity of Y
c) K_XY = cross similarity between X and Y
K_XX will try to allocate a 15 million x 15 million matrix, which gives me a ResourceExhaustedError:
ResourceExhaustedError Traceback (most recent call last)
<command-574623167633967> in <module>
1 from alibi_detect.cd import MMDDrift
---> 2 detector = MMDDrift(x_ref=X, backend='tensorflow')
3 res = detector.predict(x=Y)
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/warnings.py in wrapper(*args, **kwargs)
15 def wrapper(*args, **kwargs):
16 _rename_kwargs(f.__name__, kwargs, aliases)
---> 17 return f(*args, **kwargs)
18 return wrapper
19 return deco
/databricks/python/lib/python3.8/site-packages/alibi_detect/cd/mmd.py in __init__(self, x_ref, backend, p_val, x_ref_preprocessed, preprocess_at_init, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, device, input_shape, data_type)
101 if backend == 'tensorflow' and has_tensorflow:
102 kwargs.pop('device', None)
--> 103 self._detector = MMDDriftTF(*args, **kwargs) # type: ignore
104 else:
105 self._detector = MMDDriftTorch(*args, **kwargs) # type: ignore
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/warnings.py in wrapper(*args, **kwargs)
15 def wrapper(*args, **kwargs):
16 _rename_kwargs(f.__name__, kwargs, aliases)
---> 17 return f(*args, **kwargs)
18 return wrapper
19 return deco
/databricks/python/lib/python3.8/site-packages/alibi_detect/cd/tensorflow/mmd.py in __init__(self, x_ref, p_val, x_ref_preprocessed, preprocess_at_init, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, input_shape, data_type)
86 # compute kernel matrix for the reference data
87 if self.infer_sigma or isinstance(sigma, tf.Tensor):
---> 88 self.k_xx = self.kernel(self.x_ref, self.x_ref, infer_sigma=self.infer_sigma)
89 self.infer_sigma = False
90 else:
/databricks/python/lib/python3.8/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
65 except Exception as e: # pylint: disable=broad-except
66 filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67 raise e.with_traceback(filtered_tb) from None
68 finally:
69 del filtered_tb
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/tensorflow/kernels.py in call(self, x, y, infer_sigma)
75 y = tf.cast(y, x.dtype)
76 x, y = tf.reshape(x, (x.shape[0], -1)), tf.reshape(y, (y.shape[0], -1)) # flatten
---> 77 dist = distance.squared_pairwise_distance(x, y) # [Nx, Ny]
78
79 if infer_sigma or self.init_required:
/databricks/python/lib/python3.8/site-packages/alibi_detect/utils/tensorflow/distance.py in squared_pairwise_distance(x, y, a_min, a_max)
28 x2 = tf.reduce_sum(x ** 2, axis=-1, keepdims=True)
29 y2 = tf.reduce_sum(y ** 2, axis=-1, keepdims=True)
---> 30 dist = x2 + tf.transpose(y2, (1, 0)) - 2. * x # tf.transpose(y, (1, 0))
31 return tf.clip_by_value(dist, a_min, a_max)
32
ResourceExhaustedError: Exception encountered when calling layer "gaussian_rbf_20" (type GaussianRBF).
OOM when allocating tensor with shape[14335347,14335347] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:AddV2]
The error seems pretty obvious given the quadratic complexity of MMD. Is there any way to mitigate this issue given the constraints: record counts in the millions and high-cardinality categorical features?
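To put a rough number on that obviousness, the dense float32 kernel matrix in the error message alone would need on the order of 800 TB (a quick back-of-the-envelope check):
n = 14_335_347           # reference-set size taken from the error message
print(n * n * 4 / 1e12)  # float32 bytes for K_XX -> roughly 820 TB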

When doing classification, why do I get different precision for the same testing data?

I am testing a dataset with two labels 'A' and 'B' on a decision tree classifier. I accidentally found out that the model gets different precision results on the same testing data, and I want to know why.
Here is what I do: I train the model and test it on
1. the testing set,
2. the data only labelled 'A' in the testing set,
3. and the data only labelled 'B'.
Here is what I got:
for testing dataset
precision recall f1-score support
A 0.94 0.95 0.95 25258
B 0.27 0.22 0.24 1963
for data only labelled 'A' in testing dataset
precision recall f1-score support
A 1.00 0.95 0.98 25258
B 0.00 0.00 0.00 0
for data only labelled 'B' in testing dataset
precision recall f1-score support
A 0.00 0.00 0.00 0
B 1.00 0.22 0.36 1963
The training dataset and model are the same, and the data in the 2nd and 3rd tests are the same as the corresponding data in the 1st. Why do the precisions for 'A' and 'B' differ so much? What is the real precision of this model? Thank you very much.
You sound confused, and it is not at all clear why you are interested in metrics for which you have completely removed one of the two labels from your evaluation set.
Let's explore the issue with some reproducible dummy data:
from sklearn.metrics import classification_report
import numpy as np
y_true = np.array([0, 1, 0, 1, 1, 0, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1])
target_names = ['A', 'B']
print(classification_report(y_true, y_pred, target_names=target_names))
Result:
precision recall f1-score support
A 0.50 0.50 0.50 4
B 0.33 0.33 0.33 3
avg / total 0.43 0.43 0.43 7
Now, let's keep only class A in our y_true:
indA = np.where(y_true==0)
print(indA)
print(y_true[indA])
print(y_pred[indA])
Result:
(array([0, 2, 5, 6], dtype=int64),)
[0 0 0 0]
[0 1 0 1]
Now, here is the definition of precision from the scikit-learn documentation:
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
For class A, a true positive (tp) would be a case where the true class is A (0 in our case) and we have indeed predicted A (0); from the output above, it is apparent that tp=2.
The tricky part is the false positives (fp): they are the cases where we predicted A (0) but the true label is B (1). It is apparent that we cannot have any such cases here, since we have (intentionally) removed all the B's from our y_true (why would we want to do such a thing? I don't know, it does not make any sense at all); hence fp=0 in this (weird) setting, and our precision for class A is tp / (tp+0) = tp/tp = 1.
Which is the exact same result given by the classification report:
print(classification_report(y_true[indA], y_pred[indA], target_names=target_names))
# result:
precision recall f1-score support
A 1.00 0.50 0.67 4
B 0.00 0.00 0.00 0
avg / total 1.00 0.50 0.67 4
and obviously the case for B is identical.
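The same tp/fp arithmetic can also be checked by hand on the filtered arrays (a quick sanity check, reusing y_true, y_pred and indA from above):
tp = np.sum((y_true[indA] == 0) & (y_pred[indA] == 0))  # 2 true positives for A
fp = np.sum((y_true[indA] == 1) & (y_pred[indA] == 0))  # 0, since all B's were removed
print(tp / (tp + fp))                                   # 1.0, matching the report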
Why is the precision not 1 in case #1 (for both A and B)? The data are the same.
No, they are very obviously not the same - the ground truth is altered!
Bottom line: removing classes from your y_true before computing precision etc. does not make any sense at all (i.e. your reported results in cases #2 and #3 are of no practical use whatsoever); but since, for whatever reason, you decided to do so, the reported results are exactly as expected.

How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?

I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering, so I'm going to stick to that name.
The current method used by the system I'm on is K-means, but that seems like overkill.
Is there a better way of performing this task?
Answers to some other posts mention KDE (Kernel Density Estimation), but that is a density estimation method, so how would that work?
I see how KDE returns a density, but how do I tell it to split the data into bins?
How do I get a fixed number of bins independent of the data (that's one of my requirements)?
More specifically, how would one pull this off using scikit learn?
My input file looks like:
str ID sls
1 10
2 11
3 9
4 23
5 21
6 11
7 45
8 20
9 11
10 12
I want to group the sls number into clusters or bins, such that:
Cluster 1: [10 11 9 11 11 12]
Cluster 2: [23 21 20]
Cluster 3: [45]
And my output file will look like:
str ID sls Cluster ID Cluster centroid
1 10 1 10.66
2 11 1 10.66
3 9 1 10.66
4 23 2 21.33
5 21 2 21.33
6 11 1 10.66
7 45 3 45
8 20 2 21.33
9 11 1 10.66
10 12 1 10.66
Write code yourself. Then it fits your problem best!
Boilerplate: Never assume code you download from the net to be correct or optimal... make sure to fully understand it before using it.
%matplotlib inline
import numpy as np
from numpy import array, linspace
from sklearn.neighbors import KernelDensity
from matplotlib.pyplot import plot
from scipy.signal import argrelextrema

a = array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
s = linspace(0, 50)
e = kde.score_samples(s.reshape(-1, 1))
plot(s, e)

mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]
print("Minima:", s[mi])
print("Maxima:", s[ma])
> Minima: [ 17.34693878 33.67346939]
> Maxima: [ 10.20408163 21.42857143 44.89795918]
Your clusters therefore are
print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])
> [10 11 9 11 11 12] [23 21 20] [45]
and visually, we did this split:
plot(s[:mi[0]+1], e[:mi[0]+1], 'r',
     s[mi[0]:mi[1]+1], e[mi[0]:mi[1]+1], 'g',
     s[mi[1]:], e[mi[1]:], 'b',
     s[ma], e[ma], 'go',
     s[mi], e[mi], 'ro')
We cut at the red markers. The green markers are our best estimates for the cluster centers.
There is a little error in the accepted answer by @Has QUIT--Anony-Mousse (I can't comment nor suggest an edit due to my reputation).
The line:
print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])
Should be edited into:
print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a <= s[mi][1])], a[a >= s[mi][1]])
That's because mi and ma are indices, while s[mi] and s[ma] are the values. If you use mi[0] as the limit, you risk an incorrect split when the range of your linspace is much wider than the range of your data. For example, run this code and see the difference in the split result:
import numpy as np
from numpy import array, linspace
from sklearn.neighbors import KernelDensity
from matplotlib.pyplot import plot
from scipy.signal import argrelextrema
a = array([10,11,9,23,21,11,45,20,11,12]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
s = linspace(0,100)
e = kde.score_samples(s.reshape(-1,1))
mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]
print('Grouping by HAS QUIT:')
print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])
print('Grouping by yasirroni:')
print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a < s[mi][1])], a[a >= s[mi][1]])
result:
Grouping by HAS QUIT:
[] [10 11 9 11 11 12] [23 21 45 20]
Grouping by yasirroni:
[10 11 9 11 11 12] [23 21 20] [45]
Further improving the responses above by @yasirroni, to dynamically print all clusters (not just the 3 above), the line:
print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a <= s[mi][1])], a[a >= s[mi][1]])
can be changed into:
print(a[a < s[mi][0]])  # print most left cluster
# print all middle clusters
for i_cluster in range(len(mi) - 1):
    print(a[(a >= s[mi][i_cluster]) * (a <= s[mi][i_cluster + 1])])
print(a[a >= s[mi][-1]])  # print most right cluster
This would ensure that all the clusters are taken into account.
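To get from these split points to the output format asked for in the question (a cluster ID and a centroid per row), a minimal sketch along the following lines should work; it reuses a, s and mi from the snippets above:
import numpy as np

# np.digitize maps each value to the interval between consecutive KDE minima,
# which serves directly as a 0-based cluster ID.
cluster_id = np.digitize(a.ravel(), s[mi])
centroids = np.array([a.ravel()[cluster_id == c].mean()
                      for c in np.unique(cluster_id)])
for value, c in zip(a.ravel(), cluster_id):
    print(value, c + 1, round(centroids[c], 2))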

optimize hive query for multitable join

INSERT OVERWRITE TABLE result
SELECT /*+ STREAMTABLE(product) */
i.IMAGE_ID,
p.PRODUCT_NO,
p.STORE_NO,
p.PRODUCT_CAT_NO,
p.CAPTION,
p.PRODUCT_DESC,
p.IMAGE1_ID,
p.IMAGE2_ID,
s.STORE_ID,
s.STORE_NAME,
p.CREATE_DATE,
CASE WHEN custImg.IMAGE_ID is NULL THEN 0 ELSE 1 END,
CASE WHEN custImg1.IMAGE_ID is NULL THEN 0 ELSE 1 END,
CASE WHEN custImg2.IMAGE_ID is NULL THEN 0 ELSE 1 END
FROM image i
JOIN PRODUCT p ON i.IMAGE_ID = p.IMAGE1_ID
JOIN PRODUCT_CAT pcat ON p.PRODUCT_CAT_NO = pcat.PRODUCT_CAT_NO
JOIN STORE s ON p.STORE_NO = s.STORE_NO
JOIN STOCK_INFO si ON si.STOCK_INFO_ID = pcat.STOCK_INFO_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg ON i.IMAGE_ID = custImg.IMAGE_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg1 ON p.IMAGE1_ID = custImg1.IMAGE_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg2 ON p.IMAGE2_ID = custImg2.IMAGE_ID;
I have a join query where I am joining huge tables, and I am trying to optimize this Hive query. Here are some facts about the tables:
image table has 60m rows,
product table has 1b rows,
product_cat has 1000 rows,
store has 1m rows,
stock_info has 100 rows,
customizable_image has 200k rows.
A product can have one or two images (image1 and image2), and product-level information is stored only in the product table. I tried moving the join with product to the bottom, but I couldn't, as all of the following joins require data from the product table.
Here is what I have tried so far:
1. I gave Hive the hint to stream the product table, as it is the biggest one.
2. I bucketed the table (during create table) into 256 buckets (on image_id) and then did the join - it didn't give me any significant performance gain.
3. I changed the input format from text files (gzip) to sequence files, so that the input is splittable and Hive can run more mappers if it wants to.
Here are some key logs from the Hive console. I ran this Hive query on AWS. Can anyone help me understand the primary bottleneck here? This job is only processing a subset of the actual data.
Stage-14 is selected by condition resolver.
Launching Job 1 out of 11
Number of reduce tasks not specified. Estimated from input data size: 22
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Kill Command = /home/hadoop/bin/hadoop job -kill job_201403242034_0001
Hadoop job information for Stage-14: number of mappers: 341; number of reducers: 22
2014-03-24 20:55:05,709 Stage-14 map = 0%, reduce = 0%
.
2014-03-24 23:26:32,064 Stage-14 map = 100%, reduce = 100%, Cumulative CPU 34198.12 sec
MapReduce Total cumulative CPU time: 0 days 9 hours 29 minutes 58 seconds 120 msec
.
2014-03-25 00:33:39,702 Stage-30 map = 100%, reduce = 100%, Cumulative CPU 20879.69 sec
MapReduce Total cumulative CPU time: 0 days 5 hours 47 minutes 59 seconds 690 msec
.
2014-03-26 04:15:25,809 Stage-14 map = 100%, reduce = 100%, Cumulative CPU 3903.4 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 5 minutes 3 seconds 400 msec
.
2014-03-26 04:25:05,892 Stage-30 map = 100%, reduce = 100%, Cumulative CPU 2707.34 sec
MapReduce Total cumulative CPU time: 45 minutes 7 seconds 340 msec
.
2014-03-26 04:45:56,465 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3901.99 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 5 minutes 1 seconds 990 msec
.
2014-03-26 04:54:56,061 Stage-26 map = 100%, reduce = 100%, Cumulative CPU 2388.71 sec
MapReduce Total cumulative CPU time: 39 minutes 48 seconds 710 msec
.
2014-03-26 05:12:35,541 Stage-4 map = 100%, reduce = 100%, Cumulative CPU 3792.5 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 3 minutes 12 seconds 500 msec
.
2014-03-26 05:34:21,967 Stage-5 map = 100%, reduce = 100%, Cumulative CPU 4432.22 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 13 minutes 52 seconds 220 msec
.
2014-03-26 05:54:43,928 Stage-21 map = 100%, reduce = 100%, Cumulative CPU 6052.96 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 40 minutes 52 seconds 960 msec
MapReduce Jobs Launched:
Job 0: Map: 59 Reduce: 18 Cumulative CPU: 3903.4 sec HDFS Read: 37387 HDFS Write: 12658668325 SUCCESS
Job 1: Map: 48 Cumulative CPU: 2707.34 sec HDFS Read: 12658908810 HDFS Write: 9321506973 SUCCESS
Job 2: Map: 29 Reduce: 10 Cumulative CPU: 3901.99 sec HDFS Read: 9321641955 HDFS Write: 11079251576 SUCCESS
Job 3: Map: 42 Cumulative CPU: 2388.71 sec HDFS Read: 11079470178 HDFS Write: 10932264824 SUCCESS
Job 4: Map: 42 Reduce: 12 Cumulative CPU: 3792.5 sec HDFS Read: 10932405443 HDFS Write: 11812454443 SUCCESS
Job 5: Map: 45 Reduce: 13 Cumulative CPU: 4432.22 sec HDFS Read: 11812679475 HDFS Write: 11815458945 SUCCESS
Job 6: Map: 42 Cumulative CPU: 6052.96 sec HDFS Read: 11815691155 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 days 7 hours 32 minutes 59 seconds 120 msec
OK
The query still takes longer than 5 hours in Hive, whereas in an RDBMS it takes only 5 hrs. I need some help optimizing this query so that it executes much faster. Interestingly, when I ran the task with 4 large core instances, the time taken improved by only 10 minutes compared to the run with 3 large core instances, but when I ran the task with 3 medium core instances it took 1 hr 10 mins more.
This brings me to the question: is Hive even the right choice for such complex joins?
I suspect the bottleneck is just in sorting your product table, since it seems much larger than the others. I think joins with Hive for tables over a certain size become untenable, simply because they require a sort.
There are parameters to optimize sorting, like io.sort.mb, which you can try setting so that more of the sort happens in memory rather than spilling to disk, re-reading, and re-sorting. Look at the number of spilled records and see whether it is much larger than your inputs. There are a variety of ways to optimize sorting. It might also help to break your query up into multiple subqueries so it doesn't have to sort as much at one time.
For the stock_info and product_cat tables, you could probably keep them in memory since they are so small (check out the 'distributed_map' UDF in Brickhouse: https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/dcache/DistributedMapUDF.java). For the customizable_image table, you might be able to use a bloom filter, if having a few false positives is not a big problem.
To completely remove a join, perhaps you could store the image info in a key-value store like HBase and do lookups there instead. Brickhouse also has UDFs for HBase, like hbase_get and hbase_cached_get.

extremely slow program from using AVX instructions

I'm trying to write a geometric mean sqrt(a * b) using AVX intrinsics, but it runs slower than molasses!
#include <immintrin.h>

__m128i g_data[8];   // assumed declaration; the original post uses g_data without showing it

int main()
{
    int count = 0;
    for (int i = 0; i < 100000000; ++i)
    {
        __m128i v8n_a = _mm_set1_epi16((++count) % 16),
                v8n_b = _mm_set1_epi16((++count) % 16);
        __m128i v8n_0 = _mm_set1_epi16(0);
        __m256i temp1, temp2;
        __m256 v8f_a = _mm256_cvtepi32_ps(temp1 = _mm256_insertf128_si256(_mm256_castsi128_si256(_mm_unpacklo_epi16(v8n_a, v8n_0)), _mm_unpackhi_epi16(v8n_a, v8n_0), 1)),
               v8f_b = _mm256_cvtepi32_ps(temp2 = _mm256_insertf128_si256(_mm256_castsi128_si256(_mm_unpacklo_epi16(v8n_b, v8n_0)), _mm_unpackhi_epi16(v8n_b, v8n_0), 1));
        __m256i v8n_meanInt32 = _mm256_cvtps_epi32(_mm256_sqrt_ps(_mm256_mul_ps(v8f_a, v8f_b)));
        __m128i v4n_meanLo = _mm256_castsi256_si128(v8n_meanInt32),
                v4n_meanHi = _mm256_extractf128_si256(v8n_meanInt32, 1);
        g_data[i % 8] = v4n_meanLo;
        g_data[(i + 1) % 8] = v4n_meanHi;
    }
    return 0;
}
The key to this mystery is that I'm using Intel ICC 11, and it's only slow when compiling with icc -O3 sqrt.cpp. If I compile with icc -O3 -xavx sqrt.cpp, it runs 10x faster.
But it's not obvious whether any emulation is happening, because I used performance counters and the number of instructions executed for both versions is roughly 4G:
Performance counter stats for 'a.out':
16867.119538 task-clock # 0.999 CPUs utilized
37 context-switches # 0.000 M/sec
8 CPU-migrations # 0.000 M/sec
281 page-faults # 0.000 M/sec
35,463,758,996 cycles # 2.103 GHz
23,690,669,417 stalled-cycles-frontend # 66.80% frontend cycles idle
20,846,452,415 stalled-cycles-backend # 58.78% backend cycles idle
4,023,012,964 instructions # 0.11 insns per cycle
# 5.89 stalled cycles per insn
304,385,109 branches # 18.046 M/sec
42,636 branch-misses # 0.01% of all branches
16.891160582 seconds time elapsed
-----------------------------------with -xavx----------------------------------------
Performance counter stats for 'a.out':
1288.423505 task-clock # 0.996 CPUs utilized
3 context-switches # 0.000 M/sec
2 CPU-migrations # 0.000 M/sec
279 page-faults # 0.000 M/sec
2,708,906,702 cycles # 2.102 GHz
1,608,134,568 stalled-cycles-frontend # 59.36% frontend cycles idle
798,177,722 stalled-cycles-backend # 29.46% backend cycles idle
3,803,270,546 instructions # 1.40 insns per cycle
# 0.42 stalled cycles per insn
300,601,809 branches # 233.310 M/sec
15,167 branch-misses # 0.01% of all branches
1.293986790 seconds time elapsed
Is there some kind of processor-internal emulation going on? I know that for denormal numbers, adds end up being 64 times slower than normal.
You need vzeroupper when mixing VEX-encoded (AVX) and non-VEX (legacy SSE) vector instructions; otherwise you get huge stalls on Intel hardware from the AVX-to-SSE transition penalty. Compiling with -xavx tells ICC to use VEX encodings for the code it generates itself, so that build avoids the mixing and the stalls.
