sliding window in verilog when doing convolution

sliding window in verilog when doing convolution - memory

I am working on my CNN project in Verilog , but I am having some problems of implementing convolution procedure of Image with 3x3 Filter. I wrote a code for convolutional module, but now when it comes to convolution, I have to read the values from memory, which contains the pixels of the image. The thing is that I have to read these values in particular order, since convolution takes the dot product of 2 matrices and then strides it by 1 to the right. So let's say if the image is 5x5 matrix which stored in a memory array
[ a1 a2 a3 a4 a5
a6 a7 a8 a9 a10
a11 a12 a13 a14 a15 ] - memory Ram
how can I read the values of the memory in the following order:
a1 then a2 then a3 , then a6 then a7 then a8, and last row a11 a12 a13, and then stride and start over starting with a2 , a3, etc. til I reach the end of my array. Please suggest any solution how I should address the memory in this situation, the code snippet would be highly appreciated. Thank you.
p.s. my memory array will contain a lot of data, approximately will be a matrix of [400x300] , where the filter is [3x3].

Looks like a simple case of nested for-loops. This walk through the 16-entry memory as you wanted:
for (start=0; start<3; start=start+1)
for(i=1; i<16; i=i+5)
for (j=0; j<3; j=j+1)
data = mem[start+i+j]; // C: printf("%d\n",start+i+j);
Note that the code is both C and Verilog compatible so you can test your sequence in a C-compiler if you want (I did).
If you don't like the for loops you can make them into counters. In HDL you always reverse the order and start with the inner loop:
if (j<3)
j <= j + 1;
else
begin
j <=0;
if (i<16) // should be 15 if you start from 0
i <= i + 5;
else
begin
i <= 1; // You sure it should not be zero?
if (start<3)
start <= start + 1;
else
begin
start <= 0;
all_done <= 1'b1
end // nd of start
end // end of j
end // end of i
In a different part pf the design you can now use start+i+j as address.
Last : I would start with indices 0,1,2 as your picture is likely to start from memory address 0. You need to change the 'i' loop for that.
(HDL code is not compiled or tested)

Related

Vectorization of FOR loop

Is there a way to vectorize this FOR loop I know about gallery ("circul",y) thanks to user carandraug
but this will only shift the cell over to the next adjacent cell I also tried toeplitz but that didn't work).
I'm trying to make the shift adjustable which is done in the example code with circshift and the variable shift_over.
The variable y_new is the output I'm trying to get but without having to use a FOR loop in the example (can this FOR loop be vectorized).
Please note: The numbers that are used in this example are just an example the real array will be voice/audio 30-60 second signals (so the y_new array could be large) and won't be sequential numbers like 1,2,3,4,5.
tic
y=[1:5];
[rw col]= size(y); %get size to create zero'd array
y_new= zeros(max(rw,col),max(rw,col)); %zero fill new array for speed
shift_over=-2; %cell amount to shift over
for aa=1:length(y)
if aa==1
y_new(aa,:)=y; %starts with original array
else
y_new(aa,:)=circshift(y,[1,(aa-1)*shift_over]); %
endif
end
y_new
fprintf('\nfinally Done-elapsed time -%4.4fsec- or -%4.4fmins- or -%4.4fhours-\n',toc,toc/60,toc/3600);
y_new =
1 2 3 4 5
3 4 5 1 2
5 1 2 3 4
2 3 4 5 1
4 5 1 2 3
Ps: I'm using Octave 4.2.2 Ubuntu 18.04 64bit.

I'm pretty sure this is a classic XY problem where you want to calculate something and you think it's a good idea to build a redundant n x n matrix where n is the length of your audio file in samples. Perhaps you want to play with autocorrelation but the key point here is that I doubt that building the requested matrix is a good idea but here you go:
Your code:
y = rand (1, 3e3);
shift_over = -2;
clear -x y shift_over
tic
[rw col]= size(y); %get size to create zero'd array
y_new= zeros(max(rw,col),max(rw,col)); %zero fill new array for speed
for aa=1:length(y)
if aa==1
y_new(aa,:)=y; %starts with original array
else
y_new(aa,:)=circshift(y,[1,(aa-1)*shift_over]); %
endif
end
toc
my code:
clear -x y shift_over
tic
n = numel (y);
y2 = y (mod ((0:n-1) - shift_over * (0:n-1).', n) + 1);
toc
gives on my system:
Elapsed time is 1.00379 seconds.
Elapsed time is 0.155854 seconds.

Image Processing Pipelining in VHDL

I am currently trying to develop a Sobel filter in VHDL. I am using a 640x480 picture that is stored in a BRAM. The algorithm uses a 3x3 matrix of pixels of the image for processing each output pixel. My problem is that I currently only know of putting an image into a BRAM where each address of the BRAM holds one pixel value. This means I can only read one pixel per clock. My problem is that I am trying to pipeline the data so I would ideally need to be able to get three pixel values (one from each row of the picture) per clock so after my initial latency, I can load in three new pixel values per clock and get an output pixel on every clock. I am looking for a way to do this but cannot figure it out.
The only way I can think of to fix this is to have the image in 3 BRAMs. that way I can read in values from 3 rows per each clock cycle. However, there is not enough memory space to fit even one more RAM large enough to fit a 640x480 image let alone three. I could lower the picture size to do it this way, but I really want to do it with my current 640x480 image size.
Any help or guidance would be greatly appreciated.

A simple solution would be to store 1/4th of the image in 4 separate memories. First memory contain every 4th line, second every 4th line, starting from second line, etc. I would use 4 even if you need 3 lines, since 4 evenly divides 480 and every other standard resolution. Also, finding a binary number modulo 4 is trivial, which is needed to order the memories.
You can use the MSB of the line number to address your RAM, and the LSBs to figure out the relative order of each RAM output (code is only to demonstrate idea, it's not usable as is...):
address <= line(line'left downto 2) & col; -- Or something more efficent on packing
data0 <= ram0(address);
data1 <= ram1(address);
data2 <= ram2(address);
data3 <= ram3(address);
case line(1 downto 0) is
when "00" =>
line0 <= data0;
line1 <= data1;
line2 <= data2;
when "01" =>
line0 <= data1;
line1 <= data2;
line2 <= data3;
when "10" =>
line0 <= data2;
line1 <= data3;
line2 <= data0;
when "11" =>
line0 <= data3;
line1 <= data0;
line2 <= data1;
when others => null;
end case;

I made a sobel filter few years ago. To do that, i wrote a pipeline that gives 9 pixels at each clock cycle:
architecture rtl of matrix_3x3_builder_8b is
type fifo_t is array (0 to 2*IM_WIDTH + 2) of std_logic_vector(7 downto 0);
signal fifo_int : fifo_t;
begin
p0_build_5x5: process(rst_i,clk_i)
begin
if( rst_i = '1' )then
fifo_int <= (others => (others => '0'));
elsif( rising_edge(clk_i) )then
if(data_valid_i = '1')then
for i in 1 to 2*IM_WIDTH + 2 loop
fifo_int(i) <= fifo_int(i-1);
end loop;
fifo_int(0) <= data_i;
end if;
end if;
end process p0_build_5x5;
data_o1 <= fifo_int(0*IM_WIDTH + 0);
data_o2 <= fifo_int(0*IM_WIDTH + 1);
data_o3 <= fifo_int(0*IM_WIDTH + 2);
data_o4 <= fifo_int(1*IM_WIDTH + 0);
data_o5 <= fifo_int(1*IM_WIDTH + 1);
data_o6 <= fifo_int(1*IM_WIDTH + 2);
data_o7 <= fifo_int(2*IM_WIDTH + 0);
data_o8 <= fifo_int(2*IM_WIDTH + 1);
data_o9 <= fifo_int(2*IM_WIDTH + 2);
end rtl;
Here you read the image pixel by pixel to build your 3x3 matrix. The pipeline is longer to fill up but once completed, you have a new matrix each clock pulse.

If you want to continue storing the whole image, then I would do as Jonathan Drolet recommended and cycle between four rams while writing and read all 4 at once (muxing the three you care about into 3 registers).
This works because your rams will be deep enough that you will still be able to get full BRAM utilization at 1/4 the depth (77k deep still) and your reads can be predictably segmented.
For the specifics of this problem, Nicolas Roudel's method is much cheaper with BRAM, although you can't store the whole image at one time, so wherever you send your results can't backpressure you unless you can backpressure your data source. That may or may not matter for your application.
When you try to do something like this with extremely wide, but fairly shallow (1k deep) rams segmenting will use more block ram (or even start inferring distributed ram). When the reads do not follow a particular pattern (the pattern in your case is that they are all sequential and adjacent locations), the ram cannot be segmented. The best strategy to maintain efficient BRAM use is often to build quad port rams from the natively dual port block rams by clocking them with a 2x clock that is phase aligned with your normal clock, allowing you to do a write and 3 reads every 1x clock cycle.

constraint programming mesh network

I have a mesh network as shown in figure.
Now, I am allocating values to all edges in this sat network. I want to propose in my program that, there are no closed loops in my allocation. For example the constraint for top-left most square can be written as -
E0 = 0 or E3 = 0 or E4 = 0 or E7 = 0, so either of the link has to be inactive in order not to form a loop. However, in this kind of network, there are many possible loops.
For example loop formed by edges - E0, E3, E7, E11, E15, E12, E5, E1.
Now my problem is that I have to describe each possible combination of loop which can occur in this network. I tried to write constraints in one possible formula, however I was not able to succeed.
Can anyone throw any pointers if there is a possible way to encode this situation?
Just for information, I am using Z3 Sat Solver.

The following encoding can be used with any graph with N nodes and M edges. It is using (N+1)*M variables and 2*M*M 3-SAT clauses. This ipython notebook demonstrates the encoding by comparing the SAT solver results (UNSAT when there is a loop, SAT otherwise) with the results of a straight-forward loop finding algorithm.
Disclaimer: This encoding is my ad-hoc solution to the problem. I'm pretty sure that it is correct but I don't know how it compares performance-wise to other encodings for this problem. As my solution works with any graph it is to be expected that a better solution exists that uses some of the properties of the class of graphs the OP is interested in.
Variables:
I have one variable for each edge. The edge is "active" or "used" if its corresponding variable is set. In my reference implementation the edges have indices 0..(M-1) and this variables have indices 1..M:
def edge_state_var(edge_idx):
assert 0 <= edge_idx < M
return 1 + edge_idx
Then I have an M bits wide state variable for each edge, or a total of N*M state bits (nodes and bits are also using zero-based indexing):
def node_state_var(node_idx, bit_idx):
assert 0 <= node_idx < N
assert 0 <= bit_idx < M
return 1 + M + node_idx*M + bit_idx
Clauses:
When an edge is active, it links the state variables of the two nodes it connects together. The state bits with the same index as the node must be different on both sides and the other state bits must be equal to their corresponding partner on the other node. In python code:
# which edge connects which nodes
connectivity = [
( 0, 1), # edge E0
( 1, 2), # edge E1
( 2, 3), # edge E2
( 0, 4), # edge E3
...
]
cnf = list()
for i in range(M):
eb = edge_state_var(i)
p, q = connectivity[i]
for k in range(M):
pb = node_state_var(p, k)
qb = node_state_var(q, k)
if k == i:
# eb -> (pb != qb)
cnf.append([-eb, -pb, -qb])
cnf.append([-eb, +pb, +qb])
else:
# eb -> (pb == qb)
cnf.append([-eb, -pb, +qb])
cnf.append([-eb, +pb, -qb])
So basically each edge tries to segment the graph it is part of into a half that is on one side of the edge and has all the state bits corresponding to the edge set to 1 and a half that is on the other side of the edge and has the state bits corresponding to the edge set to 0. This is not possible for a loop where all nodes in the loop can be reached from both sides of each edge in the loop.

Need a specific example of U-Matrix in Self Organizing Map

I'm trying to develop an application using SOM in analyzing data. However, after finishing training, I cannot find a way to visualize the result. I know that U-Matrix is one of the method but I cannot understand it properly. Hence, I'm asking for a specific and detail example how to construct U-Matrix.
I also read an answer at U-matrix and self organizing maps but it only refers to 1 row map, how about 3x3 map? I know that for 3x3 map:
m(1) m(2) m(3)
m(4) m(5) m(6)
m(7) m(8) m(9)
a 5x5 matrix must me created:
u(1) u(1,2) u(2) u(2,3) u(3)
u(1,4) u(1,2,4,5) u(2,5) u(2,3,5,6) u(3,6)
u(4) u(4,5) u(5) u(5,6) u(6)
u(4,7) u(4,5,7,8) u(5,8) u(5,6,8,9) u(6,9)
u(7) u(7,8) u(8) u(8,9) u(9)
but I don't know how to calculate u-weight u(1,2,4,5), u(2,3,5,6), u(4,5,7,8) and u(5,6,8,9).
Finally, after constructing U-Matrix, is there any way to visualize it using color, e.g. heat map?
Thank you very much for your time.
Cheers

I don't know if you are still interested in this but I found this link
http://www.uni-marburg.de/fb12/datenbionik/pdf/pubs/1990/UltschSiemon90
which explains very speciffically how to calculate the U-matrix.
Hope it helps.
By the way, the site were I found the link has several resources referring to SOMs I leave it here in case anyone is interested:
http://www.ifs.tuwien.ac.at/dm/somtoolbox/visualisations.html

The essential idea of a Kohonen map is that the data points are mapped to a
lattice, which is often a 2D rectangular grid.
In the simplest implementations, the lattice is initialized by creating a 3D
array with these dimensions:
width * height * number_features
This is the U-matrix.
Width and height are chosen by the user; number_features is just the number
of features (columns or fields) in your data.
Intuitively this is just creating a 2D grid of dimensions w * h
(e.g., if w = 10 and h = 10 then your lattice has 100 cells), then
into each cell, placing a random 1D array (sometimes called "reference tuples")
whose size and values are constrained by your data.
The reference tuples are also referred to as weights.
How is the U-matrix rendered?
In my example below, the data is comprised of rgb tuples, so the reference tuples
have length of three and each of the three values must lie between 0 and 255).
It's with this 3D array ("lattice") that you begin the main iterative loop
The algorithm iteratively positions each data point so that it is closest to others similar to it.
If you plot it over time (iteration number) then you can visualize cluster
formation.
The plotting tool i use for this is the brilliant Python library, Matplotlib,
which plots the lattice directly, just by passing it into the imshow function.
Below are eight snapshots of the progress of a SOM algorithm, from initialization to 700 iterations. The newly initialized (iteration_count = 0) lattice is rendered in the top left panel; the result from the final iteration, in the bottom right panel.
Alternatively, you can use a lower-level imaging library (in Python, e.g., PIL) and transfer the reference tuples onto the 2D grid, one at a time:
for y in range(h):
for x in range(w):
img.putpixel( (x, y), (
SOM.Umatrix[y, x, 0],
SOM.Umatrix[y, x, 1],
SOM.Umatrix[y, x, 2])
)
Here img is an instance of PIL's Image class. Here the image is created by iterating over the grid one pixel at a time; for each pixel, putpixel is called on img three times, the three calls of course corresponding to the three values in an rgb tuple.

From the matrix that you create:
u(1) u(1,2) u(2) u(2,3) u(3)
u(1,4) u(1,2,4,5) u(2,5) u(2,3,5,6) u(3,6)
u(4) u(4,5) u(5) u(5,6) u(6)
u(4,7) u(4,5,7,8) u(5,8) u(5,6,8,9) u(6,9)
u(7) u(7,8) u(8) u(8,9) u(9)
The elements with single numbers like u(1), u(2), ..., u(9) as just the elements with more than two numbers like u(1,2,4,5), u(2,3,5,6), ... , u(5,6,8,9) are calculated using something like the mean, median, min or max of the values in the neighborhood.
It's a nice idea calculate the elements with two numbers first, one possible code for that is:
for i in range(self.h_u_matrix):
for j in range(self.w_u_matrix):
nb = (0,0)
if not (i % 2) and (j % 2):
nb = (0,1)
elif (i % 2) and not (j % 2):
nb = (1,0)
self.u_matrix[(i,j)] = np.linalg.norm(
self.weights[i //2, j //2] - self.weights[i //2 +nb[0], j // 2 + nb[1]],
axis = 0
)
In the code above the self.h_u_matrix = self.weights.shape[0]*2 - 1 and self.w_u_matrix = self.weights.shape[1]*2 - 1 are the dimensions of the U-Matrix. With that said, for calculate the others elements it's necessary obtain a list with they neighboors and apply a mean for example. The following code implements that's idea:
for i in range(self.h_u_matrix):
for j in range(self.w_u_matrix):
if not (i % 2) and not (j % 2):
nodelist = []
if i > 0:
nodelist.append((i-1,j))
if i < 4:
nodelist.append((i+1, j))
if j > 0:
nodelist.append((i,j -1))
if j < 4:
nodelist.append((i,j+1))
meanlist = [self.u_matrix[u_node] for u_node in nodelist]
self.u_matrix[(i,j)] = np.mean(meanlist)
elif (i % 2) and (j % 2):
meanlist = [
(i - 1, j),
(i + 1, j),
(i, j - 1),
(i, j + 1)]
self.u_matrix[(i,j)] = np.mean(meanlist)

linear transformation function

I need to write a function that takes 4 bytes as input, performs a reversible linear transformation on this, and returns it as 4 bytes.
But wait, there is more: it also has to be distributive, so changing one byte on the input should affect all 4 output bytes.
The issues:
if I use multiplication it won't be reversible after it is modded 255 via the storage as a byte (and its needs to stay as a byte)
if I use addition it can't be reversible and distributive
One solution:
I could create an array of bytes 256^4 long and fill it in, in a one to one mapping, this would work, but there are issues: this means I have to search a graph of size 256^8 due to having to search for free numbers for every value (should note distributivity should be sudo random based on a 64*64 array of byte). This solution also has the MINOR (lol) issue of needing 8GB of RAM, making this solution nonsense.
The domain of the input is the same as the domain of the output, every input has a unique output, in other words: a one to one mapping. As I noted on "one solution" this is very possible and I have used that method when a smaller domain (just 256) was in question. The fact is, as numbers get big that method becomes extraordinarily inefficient, the delta flaw was O(n^5) and omega was O(n^8) with similar crappiness in memory usage.
I was wondering if there was a clever way to do it. In a nutshell, it's a one to one mapping of domain (4 bytes or 256^4). Oh, and such simple things as N+1 can't be used, it has to be keyed off a 64*64 array of byte values that are sudo random but recreatable for reverse transformations.

Balanced Block Mixers are exactly what you're looking for.
Who knew?

Edit! It is not possible, if you indeed want a linear transformation. Here's the mathy solution:
You've got four bytes, a_1, a_2, a_3, a_4, which we'll think of as a vector a with 4 components, each of which is a number mod 256. A linear transformation is just a 4x4 matrix M whose elements are also numbers mod 256. You have two conditions:
From Ma, we can deduce a (this means that M is an invertible matrix).
If a and a' differ in a single coordinate, then Ma and Ma' must differ in every coordinate.
Condition (2) is a little trickier, but here's what it means. Since M is a linear transformation, we know that
M(a - a) = Ma - Ma'
On the left, since a and a' differ in a single coordinate, a - a has exactly one nonzero coordinate. On the right, since Ma and Ma' must differ in every coordinate, Ma - Ma' must have every coordinate nonzero.
So the matrix M must take a vector with a single nonzero coordinate to one with all nonzero coordinates. So we just need every entry of M to be a non-zero-divisor mod 256, i.e., to be odd.
Going back to condition (1), what does it mean for M to be invertible? Since we're considering it mod 256, we just need its determinant to be invertible mod 256; that is, its determinant must be odd.
So you need a 4x4 matrix with odd entries mod 256 whose determinant is odd. But this is impossible! Why? The determinant is computed by summing various products of entries. For a 4x4 matrix, there are 4! = 24 different summands, and each one, being a product of odd entries, is odd. But the sum of 24 odd numbers is even, so the determinant of such a matrix must be even!

Here are your requirements as I understand them:
Let B be the space of bytes. You want a one-to-one (and thus onto) function f: B^4 -> B^4.
If you change any single input byte, then all output bytes change.
Here's the simplest solution I have thusfar. I have avoided posting for a while because I kept trying to come up with a better solution, but I haven't thought of anything.
Okay, first of all, we need a function g: B -> B which takes a single byte and returns a single byte. This function must have two properties: g(x) is reversible, and x^g(x) is reversible. [Note: ^ is the XOR operator.] Any such g will do, but I will define a specific one later.
Given such a g, we define f by f(a,b,c,d) = (a^b^c^d, g(a)^b^c^d, a^g(b)^c^d, a^b^g(c)^d). Let's check your requirements:
Reversible: yes. If we XOR the first two output bytes, we get a^g(a), but by the second property of g, we can recover a. Similarly for the b and c. We can recover d after getting a,b, and c by XORing the first byte with (a^b^c).
Distributive: yes. Suppose b,c, and d are fixed. Then the function takes the form f(a,b,c,d) = (a^const, g(a)^const, a^const, a^const). If a changes, then so will a^const; similarly, if a changes, so will g(a), and thus so will g(a)^const. (The fact that g(a) changes if a does is by the first property of g; if it didn't then g(x) wouldn't be reversible.) The same holds for b and c. For d, it's even easier because then f(a,b,c,d) = (d^const, d^const, d^const, d^const) so if d changes, every byte changes.
Finally, we construct such a function g. Let T be the space of two-bit values, and h : T -> T the function such that h(0) = 0, h(1) = 2, h(2) = 3, and h(3) = 1. This function has the two desired properties of g, namely h(x) is reversible and so is x^h(x). (For the latter, check that 0^h(0) = 0, 1^h(1) = 3, 2^h(2) = 1, and 3^h(3) = 2.) So, finally, to compute g(x), split x into four groups of two bits, and take h of each quarter separately. Because h satisfies the two desired properties, and there's no interaction between the quarters, so does g.

I'm not sure I understand your question, but I think I get what you're trying to do.
Bitwise Exclusive Or is your friend.
If R = A XOR B, R XOR A gives B and R XOR B gives A back. So it's a reversible transformation, assuming you know the result and one of the inputs.

Assuming I understood what you're trying to do, I think any block cipher will do the job.
A block cipher takes a block of bits (say 128) and maps them reversibly to a different block with the same size.
Moreover, if you're using OFB mode you can use a block cipher to generate an infinite stream of pseudo-random bits. XORing these bits with your stream of bits will give you a transformation for any length of data.

I'm going to throw out an idea that may or may not work.
Use a set of linear functions mod 256, with odd prime coefficients.
For example:
b0 = 3 * a0 + 5 * a1 + 7 * a2 + 11 * a3;
b1 = 13 * a0 + 17 * a1 + 19 * a2 + 23 * a3;
If I remember the Chinese Remainder Theorem correctly, and I haven't looked at it in years, the ax are recoverable from the bx. There may even be a quick way to do it.
This is, I believe, a reversible transformation. It's linear, in that af(x) mod 256 = f(ax) and f(x) + f(y) mod 256 = f(x + y). Clearly, changing one input byte will change all the output bytes.
So, go look up the Chinese Remainder Theorem and see if this works.

What you mean by "linear" transformation?
O(n), or a function f with f(c * (a+b)) = c * f(a) + c * f(b)?
An easy approach would be a rotating bitshift (not sure if this fullfils the above math definition). Its reversible and every byte can be changed. But with this it does not enforce that every byte is changed.
EDIT: My solution would be this:
b0 = (a0 ^ a1 ^ a2 ^ a3)
b1 = a1 + b0 ( mod 256)
b2 = a2 + b0 ( mod 256)
b3 = a3 + b0 ( mod 256)
It would be reversible (just subtract the first byte from the other, and then XOR the 3 resulting bytes on the first), and a change in one bit would change every byte (as b0 is the result of all bytes and impacts all others).

Stick all of the bytes into 32-bit number and then do a shl or shr (shift left or shift right) by one, two or three. Then split it back into bytes (could use a variant record). This will move bits from each byte into the adjacent byte.
There are a number of good suggestions here (XOR, etc.) I would suggest combining them.

You could remap the bits. Let's use ii for input and oo for output:
oo[0] = (ii[0] & 0xC0) | (ii[1] & 0x30) | (ii[2] & 0x0C) | (ii[3] | 0x03)
oo[1] = (ii[0] & 0x30) | (ii[1] & 0x0C) | (ii[2] & 0x03) | (ii[3] | 0xC0)
oo[2] = (ii[0] & 0x0C) | (ii[1] & 0x03) | (ii[2] & 0xC0) | (ii[3] | 0x30)
oo[3] = (ii[0] & 0x03) | (ii[1] & 0xC0) | (ii[2] & 0x30) | (ii[3] | 0x0C)
It's not linear, but significantly changing one byte in the input will affect all the bytes in the output. I don't think you can have a reversible transformation such as changing one bit in the input will affect all four bytes of the output, but I don't have a proof.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart