I am building my first complex app (in RoR) and as I think about passing it on to new programmers, I have been thinking about the best ways to document what I'm building.
How do you like to document your work?
Is there software, or are there websites, that make it easy to accumulate sections of documentation, perhaps with tagging for easy reference later on?
If I'm honest: I don't document my apps. When I get new programmers on my team, I give them an introduction to the domain, and that's it. They can read the specs and Cucumber features themselves. If any special setup is required, it's in the README. They can check out the CI configuration too.
That's the power of convention over configuration for ya!
I like to use a wiki. I think it would meet all the goals you named:
an easy way to have various pages and sections
searching and tagging is usually built-in
Plus, there are other features:
You can allow others to help out with the documentation
The docs can grow as they need to: Start out with just a simple one-page site. Then expand when it makes sense.
I have two favorites. For private projects I use pbworks.com: it's free for some uses and lets you set permissions to private. The other is GitHub, which includes a wiki with every project you create.
I add lots of comments, everywhere. I took the time to write out, in human-readable form, what logic is happening on every single line of my 500-line music-generation algorithm, and it saved me and the friends who were helping a lot of time.
Here's what I did (as a start):
import re

class Scales:  # Class name is assumed; the original snippet shows only the methods
    def __init__(self):
        self.chromatic = ['C', ['C#', 'Db'], 'D', ['D#', 'Eb'], 'E', 'F', ['F#', 'Gb'], 'G', ['G#', 'Ab'], 'A', ['A#', 'Bb'], 'B']
        self.steps = {}
        self.steps['major'] = [2, 2, 1, 2, 2, 2, 1]
        self.steps['natural minor'] = [2, 1, 2, 2, 1, 2, 2]
        self.steps['harmonic minor'] = [2, 1, 2, 2, 1, 3]
        self.steps['melodic minor up'] = [2, 1, 2, 2, 2, 2, 1]
        self.steps['melodic minor down'] = [2, 2, 1, 2, 2, 1, 2]
        self.steps['dorian'] = [2, 1, 2, 2, 2, 1, 2]
        self.steps['mixolydian'] = [2, 2, 1, 2, 2, 1, 2]
        self.steps['ahava raba'] = [1, 3, 1, 2, 1, 2, 2]
        self.steps['minor pentatonic blues'] = [3, 2, 2, 3, 2]
        self.list = []

    def scale(self, note, name):  # Function to generate a scale from the required base note.
        if re.sub('[^0-9]', '', note) == '':  # Checks for nonexistent octave number
            octave = 5  # Defaults to 5
        else:  # If octave number exists
            octave = int(re.sub('[^0-9]', '', note))  # Extracts octave number from note
        note = re.sub('[0-9]', '', note)  # Strips all numbers from note
        scale = []  # Initializes the scale to be empty
        for i in range(len(self.chromatic)):  # Loops through all elements of the chromatic scale
            if not isinstance(self.chromatic[i], list):  # If the entry is a single natural note
                if note == self.chromatic[i]:  # The base note is this natural
                    scale = [i + 1]  # Start the scale at its 1-based position
            else:  # The entry is a sharp/flat pair
                if note in self.chromatic[i]:  # The base note is one of the two spellings
                    scale = [i + 1]
        for i in range(len(self.steps[name])):  # Loops through the steps of the desired scale
            scale.append(self.steps[name][i] + scale[i])  # Adds the next position by applying the step
            scale[i + 1] = scale[i + 1] % len(self.chromatic)  # Modulo length of chromatic scale
It's a start (and an example with cruddy code), but it helps me debug code really quickly.
How about rake doc:app along with expected code commenting?
In the scenario of having three sets
A train set of e.g. 80% (for model training)
A validation set of e.g. 10% (for model validation)
A test set of e.g. 10% (for final model testing)
let's say I perform k-fold cross validation (CV) on the example dataset of [1,2,3,4,5,6,7,8,9,10]. Let's also say
10 is the test set in this example
the remaining [1,2,3,4,5,6,7,8,9] will be used for training and validation
Leave-one-out CV would then look something like this:
# Fold 1
[2, 3, 4, 5, 6, 7, 8, 9] # train
[1] # validation
# Fold 2
[1, 3, 4, 5, 6, 7, 8, 9] # train
[2] # validation
# Fold 3
[1, 2, 4, 5, 6, 7, 8, 9] # train
[3] # validation
# Fold 4
[1, 2, 3, 5, 6, 7, 8, 9] # train
[4] # validation
# Fold 5
[1, 2, 3, 4, 6, 7, 8, 9] # train
[5] # validation
# Fold 6
[1, 2, 3, 4, 5, 7, 8, 9] # train
[6] # validation
# Fold 7
[1, 2, 3, 4, 5, 6, 8, 9] # train
[7] # validation
# Fold 8
[1, 2, 3, 4, 5, 6, 7, 9] # train
[8] # validation
# Fold 9
[1, 2, 3, 4, 5, 6, 7, 8] # train
[9] # validation
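As a rough illustration, the folds listed above can be generated with a few lines of plain Python. The helper name `leave_one_out_folds` is made up for this sketch and no model is involved; it only shows how the train and validation indices are formed once the test point has been set aside.

```python
def leave_one_out_folds(samples):
    """Yield (train, validation) pairs, holding out one sample per fold."""
    for i, held_out in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        yield train, [held_out]


data = list(range(1, 11))
test_set = [10]  # held back for the final evaluation
train_val = [x for x in data if x not in test_set]

folds = list(leave_one_out_folds(train_val))
for number, (train, validation) in enumerate(folds, start=1):
    print(f"Fold {number}: train={train} validation={validation}")
```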
Great, now the model has been built and validated using each data point of the combined train and validation set once.
Next, I would test my model on the test set (10) and get some performance.
What I am wondering now is why we do not also perform CV over the test set and average the results to see the impact of different test sets. In other words, why don't we repeat the above process 10 times so that each data point also ends up in the test set once?
It would obviously be extremely expensive computationally, but I was thinking about it because it seems difficult to choose an appropriate test set. For example, my model from above might have performed quite differently if I had chosen 1 as the test set and trained and validated on the remaining points.
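To make the cost concrete: repeating the process for every possible test point amounts to an outer loop over test points wrapped around the inner leave-one-out loop (a setup usually called nested cross-validation). The sketch below only enumerates the splits; the function name `nested_leave_one_out` is hypothetical and no model is trained.

```python
def nested_leave_one_out(samples):
    """Yield (test, train, validation) for every outer/inner combination."""
    for i, test_point in enumerate(samples):
        remaining = samples[:i] + samples[i + 1:]
        for j, val_point in enumerate(remaining):
            train = remaining[:j] + remaining[j + 1:]
            yield [test_point], train, [val_point]


data = list(range(1, 11))
splits = list(nested_leave_one_out(data))
print(len(splits))  # 10 outer test points x 9 inner folds = 90 model fits
```

For ten data points this is already 90 fits instead of 9, and the count grows roughly quadratically with the dataset size.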
I wondered about this in scenarios where I have groups in my data. For example
[1,2,3,4] comes from group A,
[5,6,7,8] comes from group B and
[9,10] comes from group C.
In this case, choosing 10 as the test set could give quite different performance than choosing 1, right? Or am I missing something here?
All your train-validation-test splits should be randomly sampled and sufficiently big. Hence if your data comes from different groups you should have roughly the same distribution of groups across train, validation and test pools. If your test performance varies based on the sampling seed you're definitely doing something wrong.
As for why not to use the test set for cross-validation: this would result in overfitting. Usually you run your cross-validation many times with different hyperparameters and use the CV score to select the best models. If you don't have a separate test set to evaluate your model at the end of model selection, you will never know whether you overfitted to the training pool during the model selection iterations.
The tutorial page requested that we ask questions here.
In tutorial 01_dask.delayed, there is the following code:
Parallelizing Increment
Prep
from time import sleep

def inc(x):
    sleep(1)
    return x + 1

def add(x, y):
    sleep(1)
    return x + y

data = [1, 2, 3, 4, 5, 6, 7, 8]
Calc
from dask import delayed

results = []
for x in data:
    y = delayed(inc)(x)
    results.append(y)

total = delayed(sum)(results)
print("Before computing:", total)  # Let's see what type of thing total is
result = total.compute()
print("After computing :", result)  # After it's computed
This code takes 1 second. This makes sense; each of the 8 inc calculations takes 1 second, the rest are ~ instantaneous, and it can all be run fully in parallel.
Parallelizing Increment and Double
Prep
def double(x):
    sleep(1)
    return 2 * x

def is_even(x):
    return not x % 2

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Calc
results = []
for x in data:
    if is_even(x):  # even
        y = delayed(double)(x)
    else:  # odd
        y = delayed(inc)(x)
    results.append(y)

#total = delayed(sum)(results)
total = sum(results)
This takes 2 seconds, which seems strange to me. The situation is the same as above: there are 10 actions that take 1 second each and can again be run fully in parallel.
The only thing I can imagine is that my machine can only run 8 tasks in parallel, but this is tough to know for sure because I have an Intel Core i7, and it seems that some have 8 threads and some have 16. (I have a MacBook Pro, and Apple notoriously likes to hide this detailed information from us plebs.)
Can anyone confirm if this is what is going on? I am nearly certain, because bumping the data object for the first portion from data = [1, 2, 3, 4, 5, 6, 7, 8] to data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] also bumps the time up to 2 seconds.
I believe your analysis is correct, and you have 8 threads running in parallel for 1s each, before moving on to the remaining data, which do not fill all the threads, but still take 1s to complete.
You may want to try with the distributed scheduler, which provides dashboards for more feedback on what is going on (see later in the tutorial).
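The timing in the analysis above can be sketched as a simple "waves" model: if every task takes the same time and at most `n_workers` run at once, the wall time is the number of waves times the task duration. The function `estimated_wall_time` is a made-up helper, and the assumption that the thread pool is sized by the logical core count is worth checking against `os.cpu_count()` on your own machine.

```python
import math
import os


def estimated_wall_time(n_tasks, n_workers, seconds_per_task=1.0):
    """Rough wall time when equal-length tasks run in waves of n_workers."""
    waves = math.ceil(n_tasks / n_workers)
    return waves * seconds_per_task


print(os.cpu_count())              # logical cores Python sees on this machine
print(estimated_wall_time(8, 8))   # 1.0: all eight tasks fit in one wave
print(estimated_wall_time(10, 8))  # 2.0: two leftover tasks need a second wave
```

With 16 workers, the same model gives 1 second for both the 8-task and the 10-task version.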
As the documentation states
the last state for each sample at index i in a batch will be used as
initial state for the sample of index i in the following batch
Does this mean that, to split the data into batches, I need to do it the following way?
For example, let's assume that I am training a stateful RNN to predict the next integer in range(0, 5) given the previous one:
# batch_size = 3
# 0, 1, 2 etc in x are samples (timesteps and features omitted for brevity of the example)
x = [0, 1, 2, 3, 4]
y = [1, 2, 3, 4, 5]
batches_x = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
batches_y = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
Then the state after learning on x[0, 0] will be the initial state for x[1, 0],
and the state after x[0, 1] will be the initial state for x[1, 1] (0 for 1, 1 for 2, etc.)?
Is it the right way to do it?
Based on this answer, for which I performed some tests.
Stateful=False:
Normally (stateful=False), you have one batch with many sequences:
batch_x = [
[[0],[1],[2],[3],[4],[5]],
[[1],[2],[3],[4],[5],[6]],
[[2],[3],[4],[5],[6],[7]],
[[3],[4],[5],[6],[7],[8]]
]
The shape is (4,6,1). This means that you have:
1 batch
4 individual sequences = this is batch size and it can vary
6 steps per sequence
1 feature per step
Every time you train, whether you repeat this batch or pass a new one, the layer will see individual sequences. Every sequence is a unique entry.
Stateful=True:
When you go to a stateful layer, you are not going to pass individual sequences anymore. You are going to pass very long sequences divided into small batches. You will need more batches:
batch_x1 = [
[[0],[1],[2]],
[[1],[2],[3]],
[[2],[3],[4]],
[[3],[4],[5]]
]
batch_x2 = [
[[3],[4],[5]], #continuation of batch_x1[0]
[[4],[5],[6]], #continuation of batch_x1[1]
[[5],[6],[7]], #continuation of batch_x1[2]
[[6],[7],[8]] #continuation of batch_x1[3]
]
Both shapes are (4,3,1). And this means that you have:
2 batches
4 individual sequences = this is batch size and it must be constant
6 steps per sequence (3 steps in each batch)
1 feature per step
Stateful layers are meant for huge sequences, long enough to exceed your memory or your available time for some task. You slice your sequences and process them in parts. There is no difference in the results; the layer is not smarter and has no additional capabilities. It just doesn't assume that the sequences have ended after it processes one batch. It expects the continuation of those sequences.
In this case, you decide yourself when the sequences have ended and call model.reset_states() manually.
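The slicing described above can be sketched in plain Python, without any Keras calls. The helper `split_for_stateful` is hypothetical and assumes all sequences have the same length, divisible by the number of steps per batch; it reproduces batch_x1 and batch_x2 from the long sequences in batch_x.

```python
def split_for_stateful(sequences, steps_per_batch):
    """Cut each long sequence into consecutive chunks, one chunk per batch.

    Row i of every returned batch continues row i of the previous batch,
    which is the alignment a stateful layer relies on.
    """
    total_steps = len(sequences[0])
    batches = []
    for start in range(0, total_steps, steps_per_batch):
        batches.append([seq[start:start + steps_per_batch] for seq in sequences])
    return batches


batch_x = [
    [[0], [1], [2], [3], [4], [5]],
    [[1], [2], [3], [4], [5], [6]],
    [[2], [3], [4], [5], [6], [7]],
    [[3], [4], [5], [6], [7], [8]],
]
batch_x1, batch_x2 = split_for_stateful(batch_x, steps_per_batch=3)
print(batch_x1[0], batch_x2[0])  # first half and continuation of sequence 0
```

The batches must then be fed in this order, without shuffling, and the states reset once the full sequences have been consumed.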
I have N vertices, one of which is the source. I would like to find the shortest path that connects all the vertices together (so an N-step path), with the constraint that not every vertex can be visited at every step.
A network is defined by N (the number of vertices), the source, the cost to travel between each pair of vertices and, for each step, the list of vertices that can be visited.
For example, if N=5 and the vertices are 1 (the source), 2, 3, 4 and 5, the list [[2, 3, 4], [2, 3, 4, 5], [2, 3, 4, 5], [3, 4, 5]] means that at step 2 only vertices 2, 3 and 4 can be visited, and so forth.
I can't figure out how to adapt Dijkstra's algorithm to my problem. I would really like some ideas. Or maybe a better solution is to use something else: are there other algorithms that can handle this problem?
Note: I posted the same question at math.stackexchange; I apologize if it is considered a duplicate.
You don't need any adaptation. Dijkstra's algorithm will work fine under these constraints.
Following your example:
Starting from vertex 1, we can get to 2 (let's suppose distance d = 2), 3 (d = 7) and 4 (d = 11). The current distances are [0, 2, 7, 11, N/A].
Next, pick the vertex with the shortest distance (vertex 2). From it we can get to 2 again (which shouldn't be counted), 3 (d = 3), 4 (d = 4) or 5 (d = 9). We can now reach vertex 3 with distance 2 + 3 = 5 < 7, so update the value. The same holds for vertex 4 (2 + 4 = 6 < 11), and vertex 5 becomes reachable at 2 + 9 = 11. The current distances are [0, 2, 5, 6, 11].
Mark all the vertices we visited and follow the algorithm until all the vertices are selected.
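The relaxation steps walked through above can be reproduced with a standard heap-based Dijkstra sketch. The graph below contains only the edge costs quoted in the example (the remaining costs are not given), and this plain version does not encode the per-step lists of allowed vertices; it only illustrates how the distance values are updated.

```python
import heapq


def dijkstra(graph, source):
    """Return the shortest known distance from source to each reachable vertex."""
    distances = {source: 0}
    queue = [(0, source)]
    visited = set()
    while queue:
        dist, vertex = heapq.heappop(queue)
        if vertex in visited:
            continue  # stale queue entry for an already finalized vertex
        visited.add(vertex)
        for neighbour, cost in graph.get(vertex, {}).items():
            candidate = dist + cost
            if candidate < distances.get(neighbour, float("inf")):
                distances[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return distances


# Edge costs taken from the worked example; all other edges are unknown.
graph = {
    1: {2: 2, 3: 7, 4: 11},
    2: {3: 3, 4: 4, 5: 9},
}
distances = dijkstra(graph, source=1)
print(distances)
```

Vertices 3 and 4 end up at 5 and 6 via vertex 2, and vertex 5 at the accumulated cost 2 + 9 = 11.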
Given a very large table in the following format (snippet):
Subject, Condition, VPH, Task, Round, Item, Decision, Self, Other, RT
1, 1, 1, SVO, 0, 0, 4, 2.5, 2.0, 8.598
1, 1, 1, SVO, 1, 5, 3, 4.1, 3.4, 7.785
1, 1, 1, SVO, 2, 4, 3, 3.2, 3.4, 15.713
2, 2, 1, SVO, 0, 0, 4, 2.5, 2.0, 15.439
2, 2, 1, SVO, 1, 2, 7, 4.9, 2.3, 30.777
2, 2, 1, SVO, 2, 3, 8, 4.3, 4.3, 13.549
3, 3, 1, SVO, 0, 0, 5, 2.8, 1.5, 9.066
... (And so on)
Needed: compute the mean of Self and Other over all rounds for each subject.
What I have so far:
I sorted the roughly 100 MB .txt file using bash sort so that each subject and its rounds appear one after another (as the example shows). After that I imported the .txt file into SPSS 24. Right now I have no idea how to write a function that computes, for each subject, the mean of the variables Self and Other over the three rounds. For example (some pseudo-code):
for n = 1 to last_subject do:
    get the rows of Self where the line's Subject is n
    compute the mean over these values
    write the result as a new variable self_mean after variable RT at line n
    increase n by one
As I am totally new to SPSS, I would really appreciate detailed help. I would also be happy with references that specifically deal with computation over rows (I found lots of material on computation over columns).
Thank you very much!
Edit: example output
After computing the table should look like this:
Subject, Mean_Self, Mean_Others
1, 3.27, 2.9
2, ..., ...
3,
... (And so on)
So we computed Mean_Self from the top example like so:
mean(2.5, 4.1, 3.2)
where:
2.5 was taken from line 1 of variable Self
4.1 was taken from line 2 of variable Self
3.2 was taken from line 3 of variable Self
The 2.5 on line 4 of variable Self was not used because variable Subject changed; therefore we want to repeat the process with the new Subject (here 2) until it changes again. The results should form a table like the one above. The same procedure applies to variable Other.
If I understand correctly, what you need is the AGGREGATE command. AGGREGATE can create a new dataset/file with your aggregated data, or add the aggregated data to your active dataset, as you described above:
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/BREAK=Subject
/Self_mean=MEAN(Self)
/Other_mean=MEAN(Other).
To get the new variables in a new, separate table, look up the other AGGREGATE options. For example, /OUTFILE=* (without MODE=ADDVARIABLES) will replace the original file in the window with the new aggregated data, while /OUTFILE="path/filename" will save the aggregated data to a file.
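As a cross-check outside SPSS, the same per-subject means can be sketched in plain Python. This is not SPSS syntax: the function name `mean_by_subject` and the inline `SNIPPET` (the first six data rows of the example table) are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# First six data rows of the example table (subjects 1 and 2).
SNIPPET = """Subject, Condition, VPH, Task, Round, Item, Decision, Self, Other, RT
1, 1, 1, SVO, 0, 0, 4, 2.5, 2.0, 8.598
1, 1, 1, SVO, 1, 5, 3, 4.1, 3.4, 7.785
1, 1, 1, SVO, 2, 4, 3, 3.2, 3.4, 15.713
2, 2, 1, SVO, 0, 0, 4, 2.5, 2.0, 15.439
2, 2, 1, SVO, 1, 2, 7, 4.9, 2.3, 30.777
2, 2, 1, SVO, 2, 3, 8, 4.3, 4.3, 13.549
"""


def mean_by_subject(text):
    """Return {subject: (mean_self, mean_other)} grouped by the Subject column."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # sum of Self, sum of Other, row count
    for row in reader:
        entry = totals[int(row["Subject"])]
        entry[0] += float(row["Self"])
        entry[1] += float(row["Other"])
        entry[2] += 1
    return {s: (t[0] / t[2], t[1] / t[2]) for s, t in totals.items()}


means = mean_by_subject(SNIPPET)
for subject, (mean_self, mean_other) in means.items():
    print(subject, round(mean_self, 2), round(mean_other, 2))
```

Because the grouping uses a dictionary keyed by subject, the rows do not need to be sorted beforehand.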