Condition for memory access conflict in memory-banked vector processors - vectorization

The Hennessy-Patterson book on Computer Architecture (Quantitative Approach 5ed) says that in a vector architecture with multiple memory banks, a bank conflict can happen if the following condition is met (Page 279 in 5ed):
(Number of banks) / LeastCommonMultiple(Number of banks, Stride) < Bank busy time
However, I think it should be GreatestCommonFactor instead of LCM, because a memory conflict would occur if the effective number of banks is less than the busy time. By effective number of banks I mean this: say you have 8 banks and a stride of 2. Then you effectively have 4 banks, because the memory accesses will land on only four of them (e.g., if your accesses are all even addresses starting from 0, they will land on banks 0, 2, 4, 6).
In fact, this formula even fails for the example given right below it: "Suppose we have 8 memory banks with a busy time of 6 clock cycles and a total memory latency of 12 clock cycles. How long will it take to complete a 64-element vector load with a stride of 1?" Here they calculate the time as 12+64=76 clock cycles. However, according to the condition given, a memory bank conflict will occur, so we clearly can't have one access per cycle (the 64 in the equation).
Am I getting it wrong, or has the wrong formula managed to survive 5 editions of this book (unlikely)?

GCD(banks, stride) should come into it; your argument about that is correct.
Let's try this for a few different strides and see what we get,
for number of banks = b = 8.
# generated with the calc(1) function
define f(s) { print s, " | ", lcm(s,8), " | ", gcd(s,8), " | ", 8/lcm(s,8), " | ", 8/gcd(s,8) }
stride | LCM(s,b) | GCF(s,b) | b/LCM(s,b) | b/GCF(s,b)
1 | 8 | 1 | 1 | 8 # 8 < 6 = false: no conflict
2 | 8 | 2 | 1 | 4 # 4 < 6 = true: conflict
3 | 24 | 1 | ~0.333 | 8 # 8 < 6 = false: no conflict
4 | 8 | 4 | 1 | 2 # 2 < 6 = true: conflict
5 | 40 | 1 | 0.2 | 8
6 | 24 | 2 | ~0.333 | 4
7 | 56 | 1 | ~0.143 | 8
8 | 8 | 8 | 1 | 1
9 | 72 | 1 | ~0.111 | 8
x | >=8 | 2^0..3 | <=1 | 1, 2, 4 or 8
b/LCM(s,b) is always <=1, so it always predicts conflicts.
I think GCF (aka GCD) looks right for the stride values I've looked at so far. You only have a problem if the stride doesn't distribute the accesses over all the banks, and that's what b/GCF(s,b) tells you.
Stride = 8 should be the worst-case, using the same bank every time. gcd(8,8) = lcm(8,8) = 8. So both expressions give 8/8 = 1 which is less than the bank busy/recovery time, thus correctly predicting conflicts.
Stride=1 is of course the best case (no conflicts if there are enough banks to hide the busy time). gcd(8,1) = 1 correctly predicts no conflicts: (8/1 = 8, which is not less than 6). lcm(8,1) = 8. (8/8 < 6 is true) incorrectly predicts conflicts.
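To double-check the table, here is a minimal Python sketch that compares the GCD form of the condition against a toy simulation. The function names and the timing model (one access issued per cycle, a bank stays busy for `busy_time` cycles after accepting an access) are my own assumptions for illustration, not something taken from the book.

```python
from math import gcd


def predicts_conflict(banks, stride, busy_time):
    """GCD form of the condition: conflict if banks / gcd(banks, stride) < busy_time."""
    return banks // gcd(banks, stride) < busy_time


def simulate_stalls(banks, stride, busy_time, n_accesses):
    """Toy model: try to issue one access per cycle to bank (i * stride) % banks.

    Returns the total number of cycles spent waiting for a busy bank.
    """
    free_at = [0] * banks  # cycle at which each bank can accept a new access
    cycle = 0
    stalls = 0
    for i in range(n_accesses):
        bank = (i * stride) % banks
        if free_at[bank] > cycle:  # bank still busy: wait until it is free
            stalls += free_at[bank] - cycle
            cycle = free_at[bank]
        free_at[bank] = cycle + busy_time
        cycle += 1
    return stalls


# 8 banks, busy time 6, 64-element vector, strides 1..9 as in the table
for s in range(1, 10):
    print(s, predicts_conflict(8, s, 6), simulate_stalls(8, s, 6, 64))
```

Under this model the simulation reports stalls exactly for the strides where b/GCF(s,b) < 6, i.e. the even strides, and none for the odd ones.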


machine learning model different inputs

I have a dataset that consists of: date, id (the id of the event), number_of_activities, and running_sum (the running sum of activities per id).
This is a part of my data:
date | id (id of the event) | number_of_activities | running_sum |
2017-01-06 | 156 | 1 | 1 |
2017-04-26 | 156 | 1 | 2 |
2017-07-04 | 156 | 2 | 4 |
2017-01-19 | 175 | 1 | 1 |
2017-03-17 | 175 | 3 | 4 |
2017-04-27 | 221 | 3 | 3 |
2017-05-05 | 221 | 7 | 10 |
2017-05-09 | 221 | 10 | 20 |
2017-05-19 | 221 | 1 | 21 |
2017-09-03 | 221 | 2 | 23 |
My goal is to predict the future number of activities for a given event. My question: can I train my model on the whole dataset (all the events) to predict the next one, and if so, how? The number of inputs is unequal (each event has a different number of rows). Is it also possible to exploit the date column?
Sure you can. But a lot more information is needed, which you know best yourself.
I guess we are talking about time series here, as you want to predict the future.
You might want to have a look at recurrent neural nets and LSTMs:
A recurrent layer takes a time series as input and outputs a vector that contains compressed information about the whole time series. So let's take event 156, which has 3 steps:
The event is your sequence of features, which has 3 timesteps. Each timestep has a different number of activities (the features). To handle the unequal lengths, take the maximum length that occurs and add a padding value (most often simply zero) so that all sequences have the same length. Then you have a shape that is suitable for a recurrent neural net (where LSTMs are currently a good choice).
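As a rough illustration of the padding idea, here is a small plain-Python sketch. The helper name `pad_to_max_length` is hypothetical, and padding at the end with zeros is just one possible choice (some frameworks pad at the front by default).

```python
def pad_to_max_length(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


# number_of_activities per event, taken from the sample rows in the question
events = {
    156: [1, 1, 2],
    175: [1, 3],
    221: [3, 7, 10, 1, 2],
}

padded = pad_to_max_length(list(events.values()))
for event_id, row in zip(events, padded):
    print(event_id, row)
```

Every row then has the same length (5 here), so the events can be stacked into one array of shape (events, timesteps) before being fed to a recurrent layer.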
Update
You said in the comments that padding is not an option for you; let me try to convince you. LSTMs handle sequences of different lengths well. However, for this to work you also need longer sequences that the model can learn patterns from. What I mean is: when some of your sequences have only a few timesteps, like 3, while others have 50 or more, the model might have difficulty predicting these correctly, as you have to specify which timestep you want to use. So either you prepare your data differently for a clearly defined question, or you dig deeper into the topic using sequence-to-sequence learning, which is very good at handling sequences of different lengths. For this you will need to set up an encoder-decoder network.
The encoder squashes the whole sequence into one vector, whatever its length. This vector is a compressed representation of the information in the sequence.
The decoder then learns to use this vector for predicting the next outputs of the sequence. This is a known technique for machine translation, but it is suitable for any kind of sequence-to-sequence task. So I would recommend creating such an encoder-decoder network, which should improve your results. Have a look at this tutorial, which might help you further.

How to optimize pyspark join with fill last occurrence?

I have two dataframes: stage_changes, journeys with their description as follows,
In my app, every candidate gets associated with a level_id on their starting day. A candidate's level changes whenever some progress is made, so the number of days between level changes is not fixed.
For example, candidate-A is on level-0 on day-0, then level-1 on day-5, then directly level-4 on day-25. This data is tracked in the dataframe stage_changes.
stage_changes:
account_id | candidate_id | day_num | level_id
21 | 23097 | 0 | 0
21 | 23097 | 5 | 1
21 | 23097 | 25 | 4
45 | 53838 | 4 | 0
45 | 23097 | 30 | 7
21 | 23056 | 45 | 1
Every candidate is active for a specific period, described as starting_day to ending_day. And this is tracked in another dataframe journeys as,
journeys:
account_id | candidate_id | starting_day | ending_day
21 | 23097 | 0 | 76
45 | 53838 | 4 | 45
21 | 23056 | 45 | 101
I want to get level_id of every candidate on each day during his/her journey. I am currently doing this as follows,
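from pyspark.sql import Window
from pyspark.sql.functions import col, explode, last, udf

# UDF returning the inclusive list of days in a journey
@udf("array<integer>")
def day_range(start_day, end_day):
    return list(range(start_day, end_day + 1))

# One row per (account_id, candidate_id, day_num) for every day of every journey
all_journey_days = journeys \
    .withColumn("day_range", day_range(col("starting_day"), col("ending_day"))) \
    .select(["account_id", "candidate_id", explode("day_range").alias("day_num")])

# Frame: all rows from the start of the partition up to the current row
window = Window.partitionBy(["account_id", "candidate_id"]).orderBy("day_num") \
    .rowsBetween(Window.unboundedPreceding, 0)

# Left join the level changes, then carry the last non-null level_id forward
all_day_stage_changes = all_journey_days \
    .join(stage_changes, on=["account_id", "candidate_id", "day_num"], how="left") \
    .withColumn("final_level_id", last(col("level_id"), ignorenulls=True).over(window))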
I'm getting the correct output as per this code, but this takes several minutes considering huge data size at my end. Is there any optimization to this, so that the whole process can finish quickly?
Important points:
1. Every candidate has different activity period.
2. Level changes of every candidate are different, there is no hidden logic here.
3. The number of candidates is ~1M and their avg. activity period ~100 days.
4. Every pair of account_id, candidate_id is unique in the system and not only the candidate_id.
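To make the intended result concrete, here is a plain-Python sketch of the fill-forward rule for a single candidate. It is not Spark code and not an optimization by itself; the function name `levels_per_day` and the in-memory list layout are assumptions for illustration only.

```python
from bisect import bisect_right


def levels_per_day(changes, starting_day, ending_day):
    """Return {day: level_id} for every day of the journey.

    changes is a list of (day_num, level_id) tuples sorted by day_num.
    A day before the first recorded change gets None.
    """
    change_days = [day for day, _ in changes]
    result = {}
    for day in range(starting_day, ending_day + 1):
        idx = bisect_right(change_days, day) - 1  # last change at or before this day
        result[day] = changes[idx][1] if idx >= 0 else None
    return result


# candidate 23097 of account 21, from the sample tables above
changes = [(0, 0), (5, 1), (25, 4)]
levels = levels_per_day(changes, 0, 76)
print(levels[4], levels[5], levels[24], levels[25], levels[76])
```

Seen this way, each row of stage_changes covers the days from its own day_num up to (but not including) the next change of the same candidate, which may be a useful starting point when looking for ways to avoid the per-day join.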

Can logistic regression be used for variables containing lists?

I'm pretty new to Machine Learning and I was wondering if certain algorithms/models (e.g. logistic regression) can handle lists as values for their variables. Until now I've always used pretty standard datasets, where you have a couple of variables, associated values, and then a classification for that set of values (see example 1). However, I now have a similar dataset but with lists for some of the variables (see example 2). Is this something logistic regression models can handle, or would I have to do some kind of feature extraction to transform this dataset into a normal dataset like example 1?
Example 1 (normal):
+---+------+------+------+-----------------+
| | var1 | var2 | var3 | classification |
+---+------+------+------+-----------------+
| 1 | 5 | 2 | 526 | 0 |
| 2 | 6 | 1 | 686 | 0 |
| 3 | 1 | 9 | 121 | 1 |
| 4 | 3 | 11 | 99 | 0 |
+---+------+------+------+-----------------+
Example 2 (lists):
+-----+-------+--------+---------------------+-----------------+--------+
| | width | height | hlines | vlines | class |
+-----+-------+--------+---------------------+-----------------+--------+
| 1 | 115 | 280 | [125, 263, 699] | [125, 263, 699] | 1 |
| 2 | 563 | 390 | [11, 211] | [156, 253, 399] | 0 |
| 3 | 523 | 489 | [125, 255, 698] | [356] | 1 |
| 4 | 289 | 365 | [127, 698, 11, 136] | [458, 698] | 0 |
| ... | ... | ... | ... | ... | ... |
+-----+-------+--------+---------------------+-----------------+--------+
To provide some additional context on my specific problem: I'm attempting to represent drawings. Drawings have a width and a height (regular variables), but they also have, for example, a set of horizontal and vertical lines (represented as a list of their coordinates on their respective axis). This is what you see in example 2. The actual dataset I'm using is even bigger, also containing variables that hold lists with the thickness of each line, lists with the extension of each line, lists with the colors of the spaces between the lines, etc. In the end I would like my logistic regression to pick up on what results in nice drawings. For example, if there are too many lines too close together, the drawing is not nice. The model should pick up by itself on these 'characteristics' of what makes a nice or a bad drawing.
I didn't include these because the way this data is set up is a bit confusing to explain, and if I can solve my question for the above dataset I feel like I can apply the principle of this solution to the remaining dataset as well. However, if you need additional (full) details, feel free to ask!
Thanks in advance!
No, it cannot directly handle that kind of input structure. The input must be a homogeneous 2D array. What you can do is come up with new features that capture some of the relevant information contained in the lists. For instance, for the lists that contain the coordinates of the lines along an axis, features other than the actual values themselves could be the spacing between lines, the total number of lines, or some statistics such as the mean location.
So the way to deal with this is through feature engineering. This is in fact, something that has to be dealt with in most cases. In many ML problems, you may not only have variables which describe a unique aspect or feature of each of the data samples, but also many of them might be aggregates from other features or sample groups, which might be the only way to go if you want to consider certain data sources.
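As a minimal sketch of such feature engineering, the following turns each variable-length list into a few fixed summary values. The helper names (`line_features`, `to_row`), the particular features, and the `default_gap` used when a list holds a single line are all assumptions chosen for illustration; which features actually matter depends on the data.

```python
from statistics import mean


def line_features(positions, default_gap=0):
    """Summarize a non-empty list of line coordinates as fixed-size features."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return {
        "count": len(ordered),                # total number of lines
        "mean_position": mean(ordered),       # mean location along the axis
        # smallest spacing between neighbours; default_gap is an arbitrary
        # placeholder for lists with only one line
        "min_gap": min(gaps) if gaps else default_gap,
    }


def to_row(width, height, hlines, vlines):
    """Flatten one drawing into a fixed-length numeric row."""
    h = line_features(hlines)
    v = line_features(vlines)
    return [width, height,
            h["count"], h["mean_position"], h["min_gap"],
            v["count"], v["mean_position"], v["min_gap"]]


# row 4 of example 2 in the question
row = to_row(289, 365, [127, 698, 11, 136], [458, 698])
print(row)
```

Every drawing then becomes a row of the same length, so the rows can be stacked into the homogeneous 2D array that a logistic regression expects.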
Wow, great question. I have never considered this, but having seen other people's responses, I concur 100%. Convert the lists into a data frame and run your code on that object.
import pandas as pd
data = [["col1", "col2", "col3"], [0, 1, 2],[3, 4, 5]]
column_names = data.pop(0)
df = pd.DataFrame(data, columns=column_names)
print(df)
Result:
   col1  col2  col3
0     0     1     2
1     3     4     5
You can easily do any multi regression on the fields/features of the data frame and you'll get what you need. See the link below for some ideas of how to get started.
https://pythonfordatascience.org/logistic-regression-python/
Post back if you have additional questions related to this. Or, start a new post if you have similar, but unrelated, questions.

prepare clickstream for k-means clustering

I'm new to machine learning algorithms and I'm trying to do a user segmentation based on the users' clickstreams of a news website. I have prepared the clickstreams so that I know which user ID read which news category and how many times.
So my table looks something like this:
-------------------------------------------------------
| UserID | Category 1 | Category 2 | ... | Category 20
-------------------------------------------------------
| 123 | 4 | 0 | ... | 2
-------------------------------------------------------
| 124 | 0 | 10 | ... | 12
-------------------------------------------------------
I'm wondering whether k-means works well for so many categories. Would it be better to use percentages instead of whole numbers for the read articles?
E.g. user 123 read 6 articles overall and 4 of the 6 were in Category 1, so that's a 66.6% interest in Category 1.
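The percentage idea can be sketched in a few lines of plain Python. The function name `to_shares` is hypothetical, and the example assumes a user with only three categories so that the numbers match the 4-of-6 case.

```python
def to_shares(counts):
    """Convert per-category read counts into fractions of the user's total reads."""
    total = sum(counts)
    if total == 0:
        # a user without any reads gets all-zero shares
        return [0.0] * len(counts)
    return [c / total for c in counts]


# hypothetical user with three categories: 4, 0 and 2 reads (6 overall)
shares = to_shares([4, 0, 2])
print([round(100 * s, 1) for s in shares])
```

Each row then sums to 1, so users are compared by how their interest is distributed rather than by how much they read overall.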
Another idea would be to pick the 3 most-read categories of each user and transform the table into something like this, where Interest 1 : 12 means that the user is most interested in Category 12:
-------------------------------------------------------
| UserID | Interest 1 | Interest 2 | Interest 3
-------------------------------------------------------
| 123 | 1 | 12 | 7
-------------------------------------------------------
| 124 | 12 | 13 | 20
-------------------------------------------------------
K-means will not work well for two main reasons:
1. It is for continuous, dense data. Your data is discrete.
2. It is not robust to outliers, you probably have a lot of noisy data.
Well, the number of users is not defined because it's a theoretical approach, but since it's a news website let's assume there are millions of users...
Would there be another, better algorithm for clustering user groups based on their category interests? And if I prepare the data of the first table so that I have each user's interest in each category as a percentage, the data would be continuous and not discrete, or am I wrong?

Description matching in record linkage using Machine learning Approach

We are working on a record linkage project.
In simple terms, we are searching for a product in a database just by looking at the similarity of its description. It is a very interesting problem to solve, but the machine learning approach we have adopted currently results in very low accuracy. If you can suggest a very different (lateral) approach, it will help our project a lot.
Input description
+----+-----------------------------------------------+
| ID | description                                   |
+----+-----------------------------------------------+
| 1  | delta t17267-ss ara 17 series shower trim ss  |
| 2  | delta t14438 chrome lahara tub shower trim on |
| 3  | delta t14459 trinsic tub/shower trim          |
| 4  | delta t17497 cp cassidy tub/shower trim only  |
| 5  | delta t14497-rblhp cassidy tub & shower trim  |
| 6  | delta t17497-ss cassidy 17 series tub/shower  |
+----+-----------------------------------------------+
Description in Database
+---+-----------------------------------------------------------------------------------------------------+
|ID | description |
+---+-----------------------------------------------------------------------------------------------------+
| 1 | delta monitor17 ara® shower trim 2 gpm 1 lever handle stainless commercial |
| 2 | delta monitor 14 lahara® tub and shower trim 2 gpm 1 handle chrome plated residential |
| 3 | delta monitor 14 trinsic® tub and shower trim 2 gpm 1 handle chrome plated residential |
| 4 | delta monitor17 addison™ tub and shower trim 2 gpm 1 handle chrome plated domestic residential|
| 5 | delta monitor 14 cassidy™ tub and shower trim 2 gpm venetian bronze |
| 6 | delta monitor 17 addison™ tub and shower trim 2 gpm 1 handle stainless domestic residential |
+---+-----------------------------------------------------------------------------------------------------+
Background information
1. The records in the database are fundamentally very similar to each other, which causes a huge issue.
2. There are around 2 million records in the database, but when we search for a specific manufacturer the search space is reduced to a few hundred.
3. The record with ID 1 in “Input description” is the same as the record with ID 1 in “Description in Database” (we know that from manual matching).
4. We use a random forest to train and predict.
Current approach
1. We tokenize the description.
2. We remove stopwords.
3. We add abbreviation information.
4. For each record pair we calculate scores from different string metrics such as Jaccard, Sørensen-Dice, and cosine; the average of all these scores is also calculated.
5. Then we calculate the score for the manufacturer ID using the Jaro-Winkler metric.
6. So if there are 5 records of a manufacturer in “Input description” and 10 records for that manufacturer in the database, the total is 50 record pairs, that is, 10 pairs per record, which results in scores that are very close to each other. We keep the top 4 record pairs from each set of 10 pairs. Where more than one record pair has the same score, we keep all of them.
7. We arrive at the following learning data set format.
+---------+---------------------------+-----------------------+------------------------------+--------------+-----------+
| ISMatch | Description average score | manufacturer ID score | Jaccard score of description | sorensenDice | cosine(3) |
+---------+---------------------------+-----------------------+------------------------------+--------------+-----------+
| 1       | 1:0.19                    | 2:0.88                | 3:0.12                       | 4:0.21       | 5:0.23    |
| 0       | 1:0.14                    | 2:0.66                | 3:0.08                       | 4:0.16       | 5:0.17    |
| 0       | 1:0.14                    | 2:0.68                | 3:0.08                       | 4:0.15       | 5:0.19    |
| 0       | 1:0.14                    | 2:0.58                | 3:0.08                       | 4:0.16       | 5:0.16    |
| 0       | 1:0.12                    | 2:0.55                | 3:0.08                       | 4:0.14       | 5:0.14    |
+---------+---------------------------+-----------------------+------------------------------+--------------+-----------+
We train on the above dataset. When we predict in real time using the same approach, the accuracy is very low.
Please suggest any alternative approach.
We planned to use TF-IDF, but an initial investigation suggests that it may not improve the accuracy by much either.
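For reference, the three description scores listed under "Current approach" can be sketched in plain Python as below. This is only an assumed token-level version (whitespace tokenization, no stopword or abbreviation handling); the post does not say exactly how the scores are computed, and "cosine(3)" may refer to a different variant.

```python
from collections import Counter
from math import sqrt


def tokens(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def jaccard(a, b):
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def sorensen_dice(a, b):
    sa, sb = set(a), set(b)
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 0.0


def cosine(a, b):
    # cosine similarity of the token-count vectors
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


# record ID 3 from both tables above
query = tokens("delta t14459 trinsic tub/shower trim")
record = tokens("delta monitor 14 trinsic® tub and shower trim 2 gpm 1 handle "
                "chrome plated residential")

print(jaccard(query, record), sorensen_dice(query, record), cosine(query, record))
```

With this tokenization only "delta" and "trim" are shared, because "trinsic" and "trinsic®" as well as "tub/shower" and "tub", "shower" count as different tokens, so all three scores come out low even for a true match.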
