The tutorial page requested that we ask questions here.
On tutorial 01_dask.delayed, there is the following code:
Parallelizing Increment
Prep
from time import sleep
from dask import delayed  # needed for the Calc cell below

def inc(x):
    sleep(1)
    return x + 1

def add(x, y):
    sleep(1)
    return x + y

data = [1, 2, 3, 4, 5, 6, 7, 8]
Calc
results = []
for x in data:
    y = delayed(inc)(x)
    results.append(y)

total = delayed(sum)(results)

print("Before computing:", total)  # Let's see what type of thing total is
result = total.compute()
print("After computing :", result)  # After it's computed
This code takes 1 second to run. That makes sense: each of the 8 inc calls takes 1 second, everything else is essentially instantaneous, and the inc calls can all run fully in parallel.
Parallelizing Increment and Double
Prep
def double(x):
    sleep(1)
    return 2 * x

def is_even(x):
    return not x % 2

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Calc
results = []
for x in data:
    if is_even(x):  # even
        y = delayed(double)(x)
    else:  # odd
        y = delayed(inc)(x)
    results.append(y)

#total = delayed(sum)(results)
total = sum(results)
result = total.compute()  # this is the call being timed below
This takes 2 seconds, which seems strange to me. The situation is the same as above: there are 10 tasks that each take 1 second, and they can again be run fully in parallel.
The only thing I can imagine is that my machine can only run 8 tasks in parallel, but this is tough to know for sure because I have an Intel Core i7, and some of those have 8 threads and some have 16. (I have a MacBook Pro, and Apple notoriously likes to hide this kind of detail from us plebs.)
Can anyone confirm that this is what is going on? I am nearly certain, because bumping the data object for the first portion from data = [1, 2, 3, 4, 5, 6, 7, 8] to data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] also bumps the time up to 2 seconds.
I believe your analysis is correct, and you have 8 threads running in parallel for 1s each, before moving on to the remaining data, which do not fill all the threads, but still take 1s to complete.
You may want to try with the distributed scheduler, which provides dashboards for more feedback on what is going on (see later in the tutorial).
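If you want to check this yourself, something like the following should do it (a minimal sketch: it just reads the CPU count, starts a local distributed cluster with default settings, and reuses your total from the second example):

import os
from dask.distributed import Client

print(os.cpu_count())         # logical CPUs Python sees; the default threaded
                              # scheduler sizes its pool from this, so with 8
                              # threads, 10 one-second tasks need two rounds (~2 s)

client = Client()             # local cluster, default settings
print(client.dashboard_link)  # open this URL to watch the tasks execute
print(total.compute())        # now runs on the distributed scheduler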
As the documentation states
the last state for each sample at index i in a batch will be used as
initial state for the sample of index i in the following batch
does it mean that, to split the data into batches, I need to do it the following way?
e.g. let's assume that I am training a stateful RNN to predict the next integer in range(0, 5) given the previous one
# batch_size = 3
# 0, 1, 2 etc in x are samples (timesteps and features omitted for brevity of the example)
x = [0, 1, 2, 3, 4]
y = [1, 2, 3, 4, 5]
batches_x = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
batches_y = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
then the state after learning on batches_x[0][0] will be the initial state for batches_x[1][0], the state after batches_x[0][1] the initial state for batches_x[1][1], and so on (batch 0 feeding batch 1, batch 1 feeding batch 2, etc.)?
Is it the right way to do it?
Based on this answer, for which I performed some tests.
Stateful=False:
Normally (stateful=False), you have one batch with many sequences:
batch_x = [
[[0],[1],[2],[3],[4],[5]],
[[1],[2],[3],[4],[5],[6]],
[[2],[3],[4],[5],[6],[7]],
[[3],[4],[5],[6],[7],[8]]
]
The shape is (4,6,1). This means that you have:
1 batch
4 individual sequences = this is batch size and it can vary
6 steps per sequence
1 feature per step
Every time you train, whether you repeat this batch or pass a new one, the layer sees individual sequences. Every sequence is a unique entry.
Stateful=True:
When you use a stateful layer, you are not going to pass individual sequences anymore. You are going to pass very long sequences divided into small batches, so you will need more batches:
batch_x1 = [
[[0],[1],[2]],
[[1],[2],[3]],
[[2],[3],[4]],
[[3],[4],[5]]
]
batch_x2 = [
[[3],[4],[5]], #continuation of batch_x1[0]
[[4],[5],[6]], #continuation of batch_x1[1]
[[5],[6],[7]], #continuation of batch_x1[2]
[[6],[7],[8]] #continuation of batch_x1[3]
]
Both shapes are (4,3,1). And this means that you have:
2 batches
4 individual sequences = this is batch size and it must be constant
6 steps per sequence (3 steps in each batch)
1 feature per step
The stateful layers are meant for huge sequences, long enough to exceed your memory or your available time for some task. You then slice your sequences and process them in parts. There is no difference in the results; the layer is not smarter and has no additional capabilities. It just doesn't assume that the sequences have ended after it processes one batch. It expects the continuation of those sequences.
In this case, you decide yourself when the sequences have ended and call model.reset_states() manually.
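A minimal training-loop sketch of the above, assuming TensorFlow's Keras (the targets batch_y1/batch_y2 and the layer sizes are made up for illustration):

import numpy as np
import tensorflow as tf

batch_x1 = np.array([[[0], [1], [2]], [[1], [2], [3]], [[2], [3], [4]], [[3], [4], [5]]], dtype=float)
batch_x2 = np.array([[[3], [4], [5]], [[4], [5], [6]], [[5], [6], [7]], [[6], [7], [8]]], dtype=float)
batch_y1 = np.array([[3], [4], [5], [6]], dtype=float)  # hypothetical targets: next value after each window
batch_y2 = np.array([[6], [7], [8], [9]], dtype=float)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(16, stateful=True, batch_input_shape=(4, 3, 1)),  # batch size is fixed for stateful layers
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

for epoch in range(10):
    model.train_on_batch(batch_x1, batch_y1)  # states carry over to the next batch
    model.train_on_batch(batch_x2, batch_y2)  # sample i continues sample i of batch_x1
    model.reset_states()                      # the long sequences end here, so clear the states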
I have N vertices, one being the source. I would like to find the shortest path that connects all the vertices together (so an N-step path) with the constraint that not every vertex can be visited at every step.
A network is defined by N (the number of vertices), the source, the cost to travel between each pair of vertices and, for each step, the list of vertices that can be visited.
For example, if N=5 and the vertices are 1 (the source), 2, 3, 4 and 5, the list [[2, 3, 4], [2, 3, 4, 5], [2, 3, 4, 5], [3, 4, 5]] means that at step 2 only vertices 2, 3 and 4 can be visited, and so forth.
I can't figure out how to adapt Dijkstra's algorithm to my problem and would really like some ideas. Or maybe a better solution is something else entirely: are there other algorithms that can handle this problem?
Note: I posted the same question at math.stackexchange; I apologize if it is considered a duplicate.
You don't need any adaptation; Dijkstra's algorithm will work fine under these constraints.
Following your example:
Starting from vertex 1 we can get to 2 (let's suppose distance d = 2), 3 (d = 7) and 4 (d = 11), so the current distance values are [0, 2, 7, 11, N/A].
Next, pick the unvisited vertex with the shortest distance (vertex 2). From it we can get back to 1 (which shouldn't be counted), or to 3 (d = 3), 4 (d = 4) or 5 (d = 9). We see that we can reach vertex 3 with distance 2 + 3 = 5 < 7, which is shorter than 7, so we update the value. The same goes for vertex 4 (2 + 4 = 6 < 11), and vertex 5 becomes reachable with 2 + 9 = 11, so the current values are [0, 2, 5, 6, 11].
Mark the vertices we have visited and follow the algorithm until all the vertices are selected.
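A minimal sketch of that procedure, using the edge costs from the example (only the costs above are real; everything else is filled in for illustration). The per-step restriction just means skipping a neighbour that is not in the allowed list for the step being taken:

import heapq

cost = {(1, 2): 2, (1, 3): 7, (1, 4): 11,   # edge costs from the example
        (2, 3): 3, (2, 4): 4, (2, 5): 9}
graph = {}
for (a, b), c in cost.items():
    graph.setdefault(a, []).append((b, c))
    graph.setdefault(b, []).append((a, c))

def dijkstra(source, n):
    dist = {v: float('inf') for v in range(1, n + 1)}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue                      # already settled with a shorter distance
        for w, c in graph.get(v, []):
            if d + c < dist[w]:           # (skip w here if it is not allowed at this step)
                dist[w] = d + c
                heapq.heappush(heap, (dist[w], w))
    return dist

print(dijkstra(1, 5))  # {1: 0, 2: 2, 3: 5, 4: 6, 5: 11}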
I am wondering whether I can predict if a user will like an item or not, given the similarities between items and the user's ratings of items.
I know the equation used in item-based collaborative filtering: the predicted rating is determined by the item's overall rating and the similarities between items.
The equation is:
r_{u,i} = \bar{r}_i + \frac{\sum S_{i,j} (r_{u,j} - \bar{r}_j)}{\sum S_{i,j}}
My question is,
If I got the similarities using other approaches (e.g. content-based approach), can I still use this equation?
Besides, for each user I only have a list of the user's favourite items, not actual rating values.
In this case, the rating of user u for item j and the average rating of item j are missing. Are there better ways or equations to handle this?
Another problem: I wrote some Python code to test the above equation. The code is
import numpy
from scipy import spatial

mat = numpy.array([[0, 5, 5, 5, 0], [5, 0, 5, 0, 5], [5, 0, 5, 5, 0], [5, 5, 0, 5, 0]])
print(mat)

def prediction(u, i):
    r = numpy.mean(mat[:, i])
    a = 0.0
    b = 0.0
    for j in range(5):
        if j != i:
            simi = 1 - spatial.distance.cosine(mat[:, i], mat[:, j])
            dert = mat[u, j] - numpy.mean(mat[:, j])
            a += simi * dert
            b += simi
    return r + a / b

for u in range(4):
    lst = []
    for i in range(5):
        lst.append(str(round(prediction(u, i), 2)))
    print(" ".join(lst))
The result is:
[[0 5 5 5 0]
[5 0 5 0 5]
[5 0 5 5 0]
[5 5 0 5 0]]
4.6 2.5 3.16 3.92 0.0
3.52 1.25 3.52 3.58 2.5
3.72 3.75 3.72 3.58 2.5
3.16 2.5 4.6 3.92 0.0
The first matrix is the input and the block below it contains the predicted values. They don't look close to each other; is anything wrong here?
Yes, you can use different similarity functions. For instance, cosine similarity over ratings is common but not the only option. In particular, similarity using content-based filtering can help with a sparse rating dataset (if you have relatively dense content metadata for items) because you're mapping users' preferences to the smaller content space rather than the larger individual item space.
If you only have a list of items that users have consumed (but not the magnitude of their preferences for each item), another algorithm is probably better. Try market basket analysis, such as association rule mining.
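For instance, a tiny sketch of pairwise association rules computed from like-lists by counting co-occurrences (the baskets here are made up for illustration; real association-rule mining would also prune by minimum support):

from itertools import combinations
from collections import Counter

baskets = [{'A', 'B', 'C'}, {'A', 'C'}, {'B', 'C', 'D'}, {'A', 'B', 'C'}]  # hypothetical like-lists

n = len(baskets)
item_count = Counter(i for b in baskets for i in b)
pair_count = Counter(frozenset(p) for b in baskets for p in combinations(sorted(b), 2))

for pair, c in pair_count.items():
    x, y = sorted(pair)
    # Rule X -> Y: support = P(X and Y), confidence = P(Y | X)
    print(f'{x} -> {y}: support={c / n:.2f}, confidence={c / item_count[x]:.2f}')
    print(f'{y} -> {x}: support={c / n:.2f}, confidence={c / item_count[y]:.2f}')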
What you are referring to is a typical situation of implicit ratings (i.e. users do not give explicit ratings to items; say you just have likes and dislikes).
As for the approaches, you can use neighbourhood models or latent factor models.
I suggest you read this paper, which proposes a well-known machine-learning-based solution to the problem.
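Just to show the shape of a latent factor approach on implicit data, here is a toy SGD sketch fit to the 0/1 version of your matrix (this is only an illustration, not the method from the linked paper):

import numpy as np

likes = (np.array([[0, 5, 5, 5, 0],
                   [5, 0, 5, 0, 5],
                   [5, 0, 5, 5, 0],
                   [5, 5, 0, 5, 0]]) > 0).astype(float)  # 1 = liked, 0 = unobserved

rng = np.random.default_rng(0)
n_users, n_items, k = likes.shape[0], likes.shape[1], 2
U = 0.1 * rng.standard_normal((n_users, k))   # user factors
V = 0.1 * rng.standard_normal((n_items, k))   # item factors
lr, reg = 0.05, 0.01

for _ in range(500):                          # plain SGD, treating the 0s as weak negatives
    for u in range(n_users):
        for i in range(n_items):
            err = likes[u, i] - U[u] @ V[i]
            u_old = U[u].copy()
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])

print(np.round(U @ V.T, 2))                   # higher score = more likely to be liked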
I have a dataset with k examples and I want to partition it into m sets.
How can I do that programmatically?
For example, if k = 5 and m = 2, then 5 / 2 = 2.5.
How do I partition it into 2 and 3, and not 2, 2 and 1?
Similarly, if k = 10 and m = 3, I want it to be partitioned into 3, 3 and 4, but not 3, 3, 3 and 1.
Usually, this sort of functionality is built into tools. But, assuming that your observations are independent, just set up a random number generator and do something like:
import random

for i in range(k):
    r = random.random()
    if r < 0.5:
        data[i].which = 'set1'
    else:
        data[i].which = 'set2'
You can extend this for any number of sets and probabilities.
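For example, in Python (random assignment generalised to m sets, plus a shuffled deterministic split for when the partition sizes must differ by at most one, e.g. 4, 3, 3 for k = 10 and m = 3):

import random

k, m = 10, 3
indices = list(range(k))

# Random assignment: each observation independently lands in one of the m sets.
assignment = [random.randrange(m) for _ in indices]

# Shuffled deterministic split: sizes differ by at most one.
random.shuffle(indices)
sizes = [k // m + (1 if r < k % m else 0) for r in range(m)]
parts, start = [], 0
for s in sizes:
    parts.append(indices[start:start + s])
    start += s
print([len(p) for p in parts])  # [4, 3, 3]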
In an example where k = 5, you could actually end up with all the rows in a single set (about 6% of the time: 2 × 0.5^5). However, the point of splitting data is to deal with larger amounts of data. If you only have 5 or 10 rows, then splitting your observations into different partitions is probably not the way to go.
I am building my first complex app (in RoR) and as I think about passing it on to new programmers, I have been thinking about the best ways to document what I'm building.
How do you like to document your work?
Are there software tools or websites that allow one to easily accumulate sections of documentation, perhaps with tagging for easy reference later on?
If I'm honest: I don't document my apps. When I get new programmers on my team, I give them an introduction to the domain, and that's it. They can read the specs and cucumber features themselves. If there is any special setup required, it's in the README. They can check out the CI configuration too.
That's the power of convention over configuration for ya!
I like to use a wiki. I think it would meet all the goals you named:
an easy way to have various pages and sections
searching and tagging is usually built-in
Plus, there are other features:
You can allow others to help out with the documentation
The docs can grow as they need to: Start out with just a simple one-page site. Then expand when it makes sense.
My two favorites are pbworks.com for private projects (it's free for some uses and lets you set permissions to private) and GitHub, which includes a wiki with every project you create.
I add lots of comments, everywhere. I took the time to write out, in human-readable form, what logic is happening on every single line of my 500-line music-generation algorithm, and it saved me, and the friends who were helping, a lot of time.
Here's what I did (as a start; the import and enclosing class are implied by the methods):

import re

class MusicTheory:  # enclosing class assumed; only these two methods are shown
    def __init__(self):
        self.chromatic = ['C', ['C#', 'Db'], 'D', ['D#', 'Eb'], 'E', 'F', ['F#', 'Gb'], 'G', ['G#', 'Ab'], 'A', ['A#', 'Bb'], 'B']
        self.steps = {}
        self.steps['major'] = [2, 2, 1, 2, 2, 2, 1]
        self.steps['natural minor'] = [2, 1, 2, 2, 1, 2, 2]
        self.steps['harmonic minor'] = [2, 1, 2, 2, 1, 3, 1]
        self.steps['melodic minor up'] = [2, 1, 2, 2, 2, 2, 1]
        self.steps['melodic minor down'] = [2, 2, 1, 2, 2, 1, 2]
        self.steps['dorian'] = [2, 1, 2, 2, 2, 1, 2]
        self.steps['mixolydian'] = [2, 2, 1, 2, 2, 1, 2]
        self.steps['ahava raba'] = [1, 3, 1, 2, 1, 2, 2]
        self.steps['minor penatonic blues'] = [3, 2, 2, 3, 2]
        self.list = []

    def scale(self, note, name):  # Generates a scale from the required base note.
        if re.sub('[^0-9]', '', note) == '':  # Checks for a missing octave number
            octave = 5  # Defaults to 5
        else:  # If an octave number exists
            octave = int(re.sub('[^0-9]', '', note))  # Extracts the octave number from the note
        note = re.sub('[0-9]', '', note)  # Strips all numbers from the note
        scale = []  # Initializes the scale to be empty
        for i in range(len(self.chromatic)):  # Loops through all elements of the chromatic scale
            if not isinstance(self.chromatic[i], list):  # If the entry is just a natural note
                if note == self.chromatic[i]: scale = [i + 1]  # If the note matches, start the scale there.
            else:
                if note in self.chromatic[i]: scale = [i + 1]  # If the note is inside a sharp/flat pair, start there too.
        for i in range(len(self.steps[name])):  # Loops through the desired scale's step pattern
            scale.append(self.steps[name][i] + scale[i])  # Adds the next note following the step pattern of the scale
            scale[i + 1] = scale[i + 1] % len(self.chromatic)  # Wraps around modulo the length of the chromatic scale
It's a start (and an example with cruddy code), but it helps me debug code really quickly.
How about rake doc:app along with expected code commenting?