I'm having a bit of trouble with a homework problem, and I was wondering if anyone could point me in the right direction.
Suppose we are compiling for a machine
with 1-byte characters, 2-byte
shorts, 4-byte integers, and 8-byte
reals, and with alignment rules that
require the address of every primitive
data element to be an even multiple of
the element’s size. Suppose further
that the compiler is not permitted to
reorder fields. How much space will be
consumed by the following array?
A : array [0..9] of record
    s : short;
    c : char;
    t : short;
    d : char;
    r : real;
    i : integer;
end;
Now I understand the problem, for the most part, but the thing that is really throwing me for a loop is the "alignment rules that require the address of every primitive data element to be an even multiple of the element’s size". My book isn't very descriptive when it comes to alignment rules and, to be completely honest, I'm not even sure what an even multiple is. Any help would be appreciated.
Also, I believe the answer is 240 bytes, I just need some help getting there.
Let's break that down:
"alignment rules" that "require the address of every primitive data element to be an even multiple of the element’s size". It's not very interesting that we're talking about alignment rules; we knew that already.
"require the address" of "every primitive data element" to be "an even multiple of the element’s size". Now we're getting somewhere. We have a requirement and a scope:
Requirement: The address is an even multiple of the element's size.
Scope: Every primitive data element.
So, every time we position an element, we must impose the requirement.
Let us try to position an element in memory. The first thing we will position is the short labelled s. Since a short takes up 2 bytes of memory, and we must make its address a multiple of that size, the address must be a multiple of 2. Let's call that address N.
So, s takes up the space from N up to N + 2. (NOTE: For all of these ranges, the first endpoint is included, but the last endpoint is not. This is the normal way to describe ranges of integers in computer science; in most cases it is by far the most useful and least error-prone way to do it. Trust me.)
We continue with each other field.
c takes up one byte, from N + 2 to N + 3.
We are at N + 3, but we cannot start t there, because N + 3 is odd (since N is even). So we must skip a byte. Thus t ranges from N + 4 to N + 6.
Continuing with this sort of logic, we end up with d from N + 6 to N + 7; r from N + 8 to N + 16; i from N + 16 to N + 20. (NOTE that this only works if we restrict N to be a multiple of 8, or r will be unaligned. This is ok; when we allocate the memory for the array, we can align the start of it however we want - we just have to be consistent about the sequence of data after that point.)
So we need at least 20 bytes for this structure. (That's one of the advantages of the half-open ranges: the difference between the endpoints equals the size. If we included or excluded both endpoints from the range, we'd have to make a +1 or -1 correction.)
Now let's say we try to lay out the array as ten consecutive chunks of 20 bytes. Will this work? No; say that element 0 is at address 256 (a multiple of 8). Now r in element 1 will be unaligned, because it will start at 256 + 20 + 8, which is not divisible by 8. That's not allowed.
So what do we do now? We can't just insert an extra 4 bytes before r in element 1, because every element of the array must have the same layout (not to mention size). But there is a simple solution: we insert 4 bytes of additional padding at the end of each element. Now, as long as the array starts at some multiple of 8, every element will also start at a multiple of 8 (which, in turn, keeps r aligned), because the size is now a multiple of 8.
We conclude that we need 24 bytes for the structure, and thus 24 * 10 = 240 bytes for the array.
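The walk-through above can be checked mechanically. This is only a minimal sketch: it assumes each field's alignment equals its size and that the record is padded up to its largest alignment, and the helper names align_up and layout are made up for illustration.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment

def layout(fields):
    """Place fields in declaration order; each field is aligned to its own size.

    fields is a list of (name, size) pairs. Returns (offsets, padded_size),
    where padded_size is rounded up to the largest alignment so that every
    element of an array of such records stays aligned.
    """
    offsets = {}
    offset = 0
    for name, size in fields:
        offset = align_up(offset, size)   # skip padding bytes if needed
        offsets[name] = offset
        offset += size
    max_align = max(size for _, size in fields)
    return offsets, align_up(offset, max_align)

record = [("s", 2), ("c", 1), ("t", 2), ("d", 1), ("r", 8), ("i", 4)]
offsets, record_size = layout(record)
print(offsets)           # {'s': 0, 'c': 2, 't': 4, 'd': 6, 'r': 8, 'i': 16}
print(record_size)       # 24
print(record_size * 10)  # 240
```

The offsets are relative to N, the (8-aligned) start of the record.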
The phrase "an even multiple of the element’s size" may indicate that a 2-byte short must be aligned on a 4-byte boundary, for example.
That seems a bit wasteful to me but, since it's homework, it's certainly possible.
Using those rules, you would have (for an array of size 2):
Offset  Variable  Size  Range
------  --------  ----  -----
     0  s            2  0-1
     2  c            1  2-2
     4  t            2  4-5
     6  d            1  6-6
    16  r            8  16-23
    24  i            4  24-27
    28  *            4  28-31
    32  s            2  32-33
    34  c            1  34-34
    36  t            2  36-37
    38  d            1  38-38
    48  r            8  48-55
    56  i            4  56-59
    60  *            4  60-63
The reason you have the padding is to bring each array element up to a multiple of 16 so that the r variable in each can be aligned to 16 bytes.
So the ten array elements would take up 320 bytes in that case.
It may also mean "even" as in "integral" rather than "multiple of two" (far more likely since it matches reality).
That would make the array:
Offset  Variable  Size  Range
------  --------  ----  -----
     0  s            2  0-1
     2  c            1  2-2
     4  t            2  4-5
     6  d            1  6-6
     8  r            8  8-15
    16  i            4  16-19
    20  *            4  20-23
    24  s            2  24-25
    26  c            1  26-26
    28  t            2  28-29
    30  d            1  30-30
    32  r            8  32-39
    40  i            4  40-43
    44  *            4  44-47
In that case, you have 24 bytes per element for a total of 240 bytes. Again, you need padding to ensure that r is aligned correctly.
I disagree - I read "an even multiple of the element’s size" as "2-byte shorts must have even addresses", or "4-byte ints must be 4-byte aligned". So an int at any address from 0x101 to 0x103 is a bus error, but one at 0x100 or 0x104 is correct.
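Hopefully this clears things up to some extent (the answer would be 236, i.e. 232 + 4, to be exact):
import pprint
l=[2,1,2,1,8,4]        # field sizes in declaration order: s, c, t, d, r, i
count=0
i=0                    # next free address
d={}                   # start address -> size of the field placed there
while(count<10):
    for ele in l:
        while True:
            if(i%ele==0):   # address is a multiple of the field size
                d[i]=ele
                i=i+ele
                break
            i=i+1           # skip one padding byte
    count+=1
pprint.pprint(d)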
Output :
{0: 2,
2: 1,
4: 2,
6: 1,
8: 8,
16: 4,
20: 2,
22: 1,
24: 2,
26: 1,
32: 8,
40: 4,
44: 2,
46: 1,
48: 2,
50: 1,
56: 8,
64: 4,
68: 2,
70: 1,
72: 2,
74: 1,
80: 8,
88: 4,
92: 2,
94: 1,
96: 2,
98: 1,
104: 8,
112: 4,
116: 2,
118: 1,
120: 2,
122: 1,
128: 8,
136: 4,
140: 2,
142: 1,
144: 2,
146: 1,
152: 8,
160: 4,
164: 2,
166: 1,
168: 2,
170: 1,
176: 8,
184: 4,
188: 2,
190: 1,
192: 2,
194: 1,
200: 8,
208: 4,
212: 2,
214: 1,
216: 2,
218: 1,
224: 8,
232: 4}
The answer should be 240.
The size of the structure is rounded up to a multiple of the alignment of its largest member.
https://www.google.com/amp/s/www.geeksforgeeks.org/data-structure-alignment/amp/
- s  2  short, size 2 bytes
- c  2  char, 1 byte + 1 byte of padding
- t  2
- d  2
- r  8
- i  8  int, 4 bytes + 4 bytes of padding
= 24
so
24*10=240
The largest member is 8 bytes, so the structure size must be divisible by 8
according to the alignment rules.
I know that Common Lisp discourages a programmer from touching raw memory, but I would like to know whether it is possible to see how an object is stored on a byte level. Of course, a garbage collector moves objects in memory space and two subsequent calls of a function (obj-as-bytes obj) could yield different results, but let us assume that we need just a memory snapshot. How would you implement such function?
My attempt with SBCL looks as follows:
(defun obj-as-bytes (obj)
  (let* ((addr (sb-kernel:get-lisp-obj-address obj)) ;; get obj address in memory
         (ptr (sb-sys:int-sap addr))                 ;; make pointer to this area
         (size (sb-ext:primitive-object-size obj))   ;; get object size
         (output))
    (dotimes (idx size)
      (push (sb-sys:sap-ref-8 ptr idx) output))      ;; collect raw bytes into list
    (nreverse output)))                              ;; undo the reversal caused by push
Let's try:
(obj-as-bytes #(1)) =>
(0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 111 40 161 4 16 0 0 0 23 1 16 80 0 0 0)
(obj-as-bytes #(2)) =>
(0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 95 66 161 4 16 0 0 0 23 1 16 80 0 0 0)
From this output I conclude that there is a lot of garbage, which occupies space for future memory allocations. And we see it because (sb-ext:primitive-object-size obj) seems to return a chunk of memory which is large enough to fit the object.
This code demonstrates it:
(loop for n from 0 below 64 collect
(sb-ext:primitive-object-size (make-string n :initial-element #\a))) =>
(16 32 32 32 32 48 48 48 48 64 64 64 64 80 80 80 80 96 96 96 96 112 112 112 112 128 128 128 128 144 144 144 144 160 160 160 160 176 176 176 176 192 192 192 192 208 208 208 208 224 224 224 224 240 240 240 240 256 256 256 256 272 272 272)
So, obj-as-bytes would give a correct result if sb-ext:primitive-object-size were more accurate. But I cannot find any alternative.
Do you have any suggestions how to fix this function or how to implement it differently?
As I mentioned in a comment the layout of objects in memory is very implementation-specific and the tools to explore it are necessarily also implementation-dependent.
This answer discusses the layout for 64-bit versions of SBCL and only for 64-bit versions which have 'wide fixnums'. I'm not sure in which order these two things arrived in SBCL as I haven't looked seriously at any of this since well before SBCL and CMUCL diverged.
This answer also may be wrong: I'm not an SBCL developer and I'm only adding it because no one who is has (I suspect tagging the question properly might help with this).
Information below comes from looking at the GitHub mirror, which seems to be very up to date with the canonical source but a lot faster.
Pointers, immediate objects, tags
[Information from here.] SBCL allocates on two-word boundaries. On a 64-bit system this means that the low four bits of any address are always zero. These low four bits are used as a tag (the documentation calls this the 'lowtag') to tell you what sort of thing is in the rest of the word.
A lowtag of xyz0 means that the rest of the word is a fixnum, and in particular xyz will then be the low bits of the fixnum, rather than tag bits at all. This means both that there are 63 bits available for fixnums and that fixnum addition is trivial: you don't need to mask off any bits.
A lowtag of xy01 means that the rest of the word is some other immediate object. Some of the bits to the left of the lowtag (which I think SBCL calls a 'widetag' although I am confused about this as the term seems to be used in two ways) will say what the immediate object is. Examples of immediate objects are characters and single-floats (on a 64-bit platform!).
The remaining lowtag patterns are xy11, and they all mean that things are pointers to some non-immediate object:
0011 is an instance of something;
0111 is a cons;
1011 is a function;
1111 is something else.
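To make the tagging scheme concrete, here is a small sketch that classifies a raw 64-bit word purely according to the description above. It is written in Python rather than Lisp, it is not SBCL code, and the function name classify_word is made up; negative fixnums are not handled.

```python
def classify_word(word):
    """Classify a 64-bit word by its low four bits, following the scheme above."""
    if word & 0b1 == 0:
        # xyz0: a fixnum; the value is the word shifted right by one bit
        # (this sketch only handles non-negative values).
        return ("fixnum", word >> 1)
    if word & 0b11 == 0b01:
        # xy01: some other immediate object (character, single-float, ...).
        return ("immediate", None)
    # xy11: a pointer; bits 2-3 say what kind, and clearing the low
    # four bits gives the untagged two-word-aligned address.
    kinds = {0b00: "instance", 0b01: "cons", 0b10: "function", 0b11: "other"}
    return (kinds[(word >> 2) & 0b11], word & ~0b1111)

print(classify_word(0x2))                 # ('fixnum', 1)
kind, address = classify_word(0x10024D6E07)
print(kind, hex(address))                 # cons 0x10024d6e00
```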
Conses
Because conses don't need any additional type information (a cons is a cons) the lowtag is enough: a cons is then just two words in memory, each of which in turn has lowtags &c.
Other non-immediate objects
I think (but am not sure) that all other non-immediate objects have a word which says what they are (which may also be called a 'widetag') and at least one other word (because allocation is on two-word boundaries). I suspect that the special tag for functions means that function call can just jump to the entry point of the function's code.
Looking at this
room.lisp has a nice function called hexdump which knows how to print out non-immediate objects. Based on that I wrote a little shim (below) which tries to tell you useful things. Here are some examples.
> (hexdump-thing 1)
lowtags: 0010
fixnum: 0000000000000002 = 1
1 is a fixnum and its representation is just the value shifted left one bit as described above. Note that the lowtags actually contain the whole value in this case!
> (hexdump-thing 85757)
lowtags: 1010
fixnum: 0000000000029DFA = 85757
... but not in this case.
> (hexdump-thing #\c)
lowtags: 1001
immediate: 0000000000006349 = #\c
> (hexdump-thing 1.0s0)
lowtags: 1001
immediate: 3F80000000000019 = 1.0
Characters and single floats are immediate: some of the bits to the left of the lowtag tells the system what they are, I think?
> (hexdump-thing '(1 . 2))
lowtags: 0111
cons: 00000010024D6E07 : 00000010024D6E00
10024D6E00: 0000000000000002 = 1
10024D6E08: 0000000000000004 = 2
> (hexdump-thing '(1 2 3))
lowtags: 0111
cons: 00000010024E4BC7 : 00000010024E4BC0
10024E4BC0: 0000000000000002 = 1
10024E4BC8: 00000010024E4BD7 = (2 3)
Conses. In the first case you can see the two fixnums sitting as immediate values in the two fields of the cons. In the second, if you decoded the lowtag of the second field it would be 0111: it's another cons.
> (hexdump-thing "")
lowtags: 1111
other: 00000010024FAE8F : 00000010024FAE80
10024FAE80: 00000000000000E5
10024FAE88: 0000000000000000 = 0
> (hexdump-thing "x")
lowtags: 1111
other: 00000010024FC22F : 00000010024FC220
10024FC220: 00000000000000E5
10024FC228: 0000000000000002 = 1
10024FC230: 0000000000000078 = 60
10024FC238: 0000000000000000 = 0
> (hexdump-thing "xyzt")
lowtags: 1111
other: 00000010024FDDAF : 00000010024FDDA0
10024FDDA0: 00000000000000E5
10024FDDA8: 0000000000000008 = 4
10024FDDB0: 0000007900000078 = 259845521468
10024FDDB8: 000000740000007A = 249108103229
Strings. These have some type information, a length field, and then characters are packed two to a word. A single-character string needs four words, the same as a four-character one. You can read the character codes out of the data.
> (hexdump-thing #())
lowtags: 1111
other: 0000001002511C3F : 0000001002511C30
1002511C30: 0000000000000089
1002511C38: 0000000000000000 = 0
> (hexdump-thing #(1))
lowtags: 1111
other: 00000010025152BF : 00000010025152B0
10025152B0: 0000000000000089
10025152B8: 0000000000000002 = 1
10025152C0: 0000000000000002 = 1
10025152C8: 0000000000000000 = 0
> (hexdump-thing #(1 2))
lowtags: 1111
other: 000000100252DC2F : 000000100252DC20
100252DC20: 0000000000000089
100252DC28: 0000000000000004 = 2
100252DC30: 0000000000000002 = 1
100252DC38: 0000000000000004 = 2
> (hexdump-thing #(1 2 3))
lowtags: 1111
other: 0000001002531C8F : 0000001002531C80
1002531C80: 0000000000000089
1002531C88: 0000000000000006 = 3
1002531C90: 0000000000000002 = 1
1002531C98: 0000000000000004 = 2
1002531CA0: 0000000000000006 = 3
1002531CA8: 0000000000000000 = 0
Same deal for simple vectors: header, length, but now each entry takes a word of course. Above all entries are fixnums and you can see them in the data.
And so it goes on.
The code that did this
This may be wrong and an earlier version of it definitely did not like small bignums (I think hexdump doesn't like them). If you want real answers either read the source or ask an SBCL person. Other implementations are available, and will be different.
(defun hexdump-thing (obj)
  ;; Try and hexdump an object, including immediate objects. All the
  ;; work is done by sb-vm:hexdump in the interesting cases.
  #-(and SBCL 64-bit)
  (error "not a 64-bit SBCL")
  (let* ((address/thing (sb-kernel:get-lisp-obj-address obj))
         (tags (ldb (byte 4 0) address/thing)))
    (format t "~&lowtags: ~12T~4,'0b~%" tags)
    (cond
      ((zerop (ldb (byte 1 0) tags))
       (format t "~&fixnum:~12T~16,'0x = ~S~%" address/thing obj))
      ((= (ldb (byte 2 0) tags) #b01)
       (format t "~&immediate:~12T~16,'0x = ~S~%" address/thing obj))
      ((= (ldb (byte 2 0) tags) #b11)   ;must be true
       (format t "~&~A:~12T~16,'0x : ~16,'0x~%"
               (case (ldb (byte 2 2) tags)
                 (#b00 "instance")
                 (#b01 "cons")
                 (#b10 "function")
                 (#b11 "other"))
               address/thing (dpb #b0000 (byte 4 0) address/thing))
       ;; this tells you at least something (and really annoyingly
       ;; does not pad addresses on the left)
       (sb-vm:hexdump obj))
      ;; can't happen
      (t (error "mutant"))))
  (values))
Minimum number of states in the DFA accepting strings (in base 3, i.e. ternary form) congruent to 5 modulo 6?
I have tried but couldn't do it.
At first sight, it seems to have 6 states, but then it can be minimised further.
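Let's first see the state transition table (on reading digit d, the state for remainder r moves to the state for remainder (3r + d) mod 6):
State   0    1    2
q0      q0   q1   q2
q1      q3   q4   q5
q2      q0   q1   q2
q3      q3   q4   q5
q4      q0   q1   q2
q5      q3   q4   q5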
Here, the states q0, q1, q2, ..., q5 correspond to the remainders 0, 1, 2, ..., 5 respectively when the number read so far is divided by 6. q0 is our initial state, and since we need remainder 5, our final state is q5.
A few observations drawn from the above state transition table:
states q0, q2 and q4 have exactly the same transitions
states q1, q3 and q5 have exactly the same transitions
States which make transitions to the same states on the same inputs can be merged into a single state.
Note: Final and non-final states can never be merged.
Therefore, we can merge q0, q2, q4 together and q1, q3 together, leaving the final state q5 on its own.
The final Minimal DFA has 3 states as shown below:
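The merging argument can also be checked with a few lines of code. The sketch below assumes the usual construction in which reading digit d in the state for remainder r leads to the state for remainder (3r + d) mod 6, and then refines the accepting/non-accepting split until it stops changing. The function names mod_dfa and minimize are made up for this example, and leading zeros and the empty string are treated as the value 0.

```python
def mod_dfa(base, modulus, remainder):
    """State r means 'the digits read so far have value r modulo modulus'."""
    delta = {(r, d): (base * r + d) % modulus
             for r in range(modulus) for d in range(base)}
    return delta, {remainder}

def minimize(num_states, num_symbols, delta, accepting):
    """Partition refinement: returns the groups of states that can be merged."""
    # Start by separating accepting from non-accepting states.
    block_of = {s: int(s in accepting) for s in range(num_states)}
    while True:
        ids = {}
        new_block_of = {}
        for s in range(num_states):
            # States stay together only if they are in the same block and
            # every symbol sends them to the same block.
            signature = (block_of[s],) + tuple(
                block_of[delta[s, a]] for a in range(num_symbols))
            new_block_of[s] = ids.setdefault(signature, len(ids))
        if len(ids) == len(set(block_of.values())):
            break  # no block was split: the partition is stable
        block_of = new_block_of
    blocks = {}
    for s, b in block_of.items():
        blocks.setdefault(b, []).append(s)
    return sorted(blocks.values())

delta, accepting = mod_dfa(3, 6, 5)
blocks = minimize(6, 3, delta, accepting)
print(blocks)       # [[0, 2, 4], [1, 3], [5]]
print(len(blocks))  # 3
```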
Let's look at a few strings in the language:
12 = 1*3 + 2 = 5 ~ 5 (mod 6)
102 = 1*9 + 0*3 + 2 = 11 ~ 5 (mod 6)
122 = 1*9 + 2*3 + 2 = 17 ~ 5 (mod 6)
212 = 2*9 + 1*3 + 2 = 23 ~ 5 (mod 6)
1002 = 1*27 + 0*9 + 0*3 + 2 = 29 ~ 5 (mod 6)
We notice that all the strings end in 2. This makes sense since 6 is a multiple of 3 and the only way to get 5 from a multiple of 3 is to add 2. Based on this, we can try to solve the problem of strings congruent to 3 modulo 6:
10 = 3
100 = 9
120 = 15
210 = 21
1000 = 27
There's not a real pattern emerging, but consider this: every base-3 number ending in 0 is definitely divisible by 3. The ones that are even are also divisible by 6; so the odd numbers whose base-3 representation ends in 0 must be congruent to 3 mod 6. Because all the powers of 3 are odd, we know we have an odd number if the number of 1s in the string is odd.
So, our conditions are:
the string begins with a 1;
the string has an odd number of 1s;
the string ends with 2;
the string can contain any number of 2s and 0s.
To get the minimum number of states in such a DFA, we can use the Myhill-Nerode theorem beginning with the empty string:
the empty string can be followed by any string in the language. Call its equivalence class [e]
the string 0 cannot be followed by anything since valid base-3 representations don't have leading 0s. Call its equivalence class [0].
the string 1 must be followed with stuff that has an even number of 1s in it ending with a 2. Call its equivalence class [1].
the string 2 can be followed by anything in the language. Indeed, you can verify that putting a 2 at the front of any string in the language gives another string in the language. However, it can also be followed by strings beginning with 0. Therefore, its class is new: [2].
the string 00 can't be followed by anything to fix it; its class is the same as its prefix 0, [0]. same for the string 01.
the string 10 can be followed by any string with an even number of 1s that ends in a 2; it is therefore equivalent to the class [1].
the string 11 can be followed by any string in the language whatever; indeed, you can verify prepending 11 in front of any string in the language gives another solution. However, it can also be followed by strings beginning with 0. Therefore, its class is the same as [2].
12 can be followed by a string with an even number of 1s ending in 2, as well as by the empty string (since 12 is in fact in the language). This is a new class, [12].
21 is equivalent to 1; class [1]
22 is equivalent to 2; class [2]
20 is equivalent to 2; class [2]
120 is indistinguishable from 1; its class is [1].
121 is indistinguishable from [2].
122 is indistinguishable from [12].
We have seen no new equivalence classes on new strings of length 3; so, we know we have seen all the equivalence classes. They are the following:
[e]: any string in the language can follow this
[0]: no string can follow this
[1]: a string with an even number of 1s ending in 2 can follow this
[2]: same as [e] but also strings beginning with 0
[12]: same as [1] but also the empty string
This means that a minimal DFA for our language has five states. Here is the DFA:
[0]
^
|
0
|
----->[e]--2-->[2]<-\
| ^ |
| | |
1 __1__/ /
| / /
| | 1
V V |
[1]--2-->[12]
^ |
| |
\___0___/
(transitions not pictured are self-loops on the respective states).
Note: I expected this DFA to have 6 states, as Welbog pointed out in the other answer, so I might have missed an equivalence class. However, the DFA seems right after checking a few examples and thinking about what it's doing: you can only get to accepting state [12] by seeing a 2 as the last symbol (definitely necessary) and you can only get to state [12] from state [1] and you must have seen an odd number of 1s to get to [1]…
The minimum number of states for almost all modulus problems is equal to the modulus. The general strategy is one state for every remainder, as transitions between remainders are independent of what the previous digits were. For example, if you're in state r4 (representing x = 4 (mod 6)), and you encounter a 1 as your next input, your new remainder is 4x3+1 = 13 = 1 (mod 6) (the multiplier is 3 because the input is in base 3), so the transition from r4 on input 1 is to r1. You'll find that the start state and r0 can be merged, for a total of 6 states.
I'm new to this field and I'm trying to explore the data for a time series: find the missing values, count them, study the distribution of their lengths, and fill in the gaps. I have, let's say, 10 .txt files, and each file has 2 columns as follows:
C1 C2
944 0
920 1
920 2
928 3
912 7
920 8
920 9
880 10
888 11
920 12
944 13
and so on, let's say up to 100. The 10 files do not necessarily have the same number of observations.
Here, for example, the missing values are 4, 5 and 6 in C2, together with the corresponding values of the first column C1 (measured in milliseconds, so the value 928 ms is not a time neighbor of 912 ms). Missing values do not necessarily appear in all of the files. I want to find those gaps (the total missing values across all 10 files) and show a histogram of their lengths.
I wrote a piece of code in R, but the problem is that I don't get the exact total number of missing values that I should have.
path = "files path"
out.file <- data.frame(TS = 0, Index = 0, File = '')
file.names <- dir(path, pattern = ".txt")
for (i in 1:length(file.names)) {
  file <- cbind(read.table(file.names[i],
                           header = F,
                           sep = "\t",
                           stringsAsFactors = FALSE),
                file.names[i])
  colnames(file) <- c('TS', 'Index', 'File')
  out.file <- rbind(out.file, file)
}
d = dim(out.file)[1]
misDa = 0
for (i in 2:(d-1)) {
  if (abs(out.file$Index[i] - out.file$Index[i+1]) > 1)
    misDa = misDa + 1
}
Hard to give specific hints without having a more extensive example of your data that contains some of the actual NAs.
If you are using R (like it seems) the naniar and the imputeTS packages offer nice functions for missing data visualizations.
Some examples from the naniar package, which is especially good for multivariate data (more plot examples):
Some examples from the imputeTS package, which is especially good for time series data (additional plot examples):
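Separately from the plotting packages, the counting itself is easy to sketch. One possible reason the total in the question comes out wrong is that the loop adds 1 per gap rather than the number of values missing inside the gap, and that it runs over all files stacked together, so the jump from the end of one file to the start of the next is counted too. The sketch below is in Python rather than R, the function name gap_lengths is made up, and it assumes the index column of a single file is sorted.

```python
from collections import Counter

def gap_lengths(indices):
    """Return the length of every run of missing values in one sorted index column."""
    gaps = []
    for prev, cur in zip(indices, indices[1:]):
        missing = cur - prev - 1   # how many index values were skipped
        if missing > 0:
            gaps.append(missing)
    return gaps

# Index column (C2) from the example in the question.
index_column = [0, 1, 2, 3, 7, 8, 9, 10, 11, 12, 13]
gaps = gap_lengths(index_column)
print(gaps)           # [3]
print(sum(gaps))      # 3 missing values in total
print(Counter(gaps))  # Counter({3: 1}), i.e. the histogram of gap lengths
```

Calling gap_lengths once per file and concatenating the resulting lists gives the data for a histogram over all files.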
Example
3 2 5 5
a b c d
Joining first two
5 | 5 5
3 2 | c d
a b |
I have to put the new tree of five into the queue
Am I obligated to put it in the end like this:
5 5 5
c d / \
3 2
a b
Or can I put it in the beginning:
5 5 5
3 2 c d
a b
Or even in the middle of 'c' and 'd'
Is it my choice or is there a rule?
It's not your choice: the queue needs to be sorted at all times (by number of occurrences and, for equal numbers of occurrences, by the depth of the tree). So the new tree needs to be inserted where it belongs in that order.
This is needed so that the sub-trees with the fewest occurrences (and, where there is a choice, the shallowest of them) can be picked by simply popping them.
If you simply re-sort after every insertion (this is inefficient and should not be done), the insertion position obviously doesn't matter.
Yes, it's your choice. Either way you will get an optimal Huffman code, even though the two resulting codes can be manifestly different.
You can get:
a - 00
b - 01
c - 10
d - 11
or you can get:
a - 111
b - 110
c - 10
d - 0
Now if I multiply the number of bits in each symbol times the number of occurrences, I get for the first code: 2*3 + 2*2 + 2*5 + 2*5 = 30 bits. For the second code: 3*3 + 3*2 + 2*5 + 1*5 = 30 bits. So both codes will code the original message to exactly 30 bits.
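The same comparison can be run in code. This is only a sketch: it uses Python's heapq with an explicit tie-breaking number to imitate putting the newly merged tree before or after entries of equal weight, and the names huffman_lengths and total_bits are made up for this example.

```python
import heapq

def huffman_lengths(freqs, new_tree_first):
    """Return {symbol: code length}.

    new_tree_first=True queues a merged tree ahead of entries with the same
    weight; False queues it behind them.
    """
    heap = []
    counter = 0
    for sym, weight in freqs.items():
        heap.append((weight, counter, {sym: 0}))
        counter += 1
    heapq.heapify(heap)
    tie = -1 if new_tree_first else counter
    step = -1 if new_tree_first else 1
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Every symbol under the new root ends up one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += step
    return heap[0][2]

def total_bits(freqs, lengths):
    return sum(freqs[s] * lengths[s] for s in freqs)

freqs = {"a": 3, "b": 2, "c": 5, "d": 5}
for first in (False, True):
    lengths = huffman_lengths(freqs, first)
    print(lengths, total_bits(freqs, lengths))
# new tree last:  every symbol gets 2 bits, 30 bits in total
# new tree first: a and b get 3 bits, c gets 2, d gets 1, again 30 bits in total
```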
I have just entered the field of data mining, machine learning and clustering. I have a particular problem and do not know which technique to use to solve it.
I want to cluster observations (objects, or whatever) in a specific data format. All variables in each observation are numeric. My input data looks like this:
1 2 3 4 5 6
1 3 5 7
2 9 10 11 12 13 14
45 1 22 23 24
Let's say that n is the number of rows (observations, or 1-D vectors) and m is the number of columns (variables) in a row. n could be a very large number, and 0 < m < 100. Another important point is that a single observation (row) cannot contain duplicate values (within one row, a value can appear only once).
So I want to perform a clustering that puts observations into the same cluster based on the number of identical values the rows share.
If there are two rows like:
1
1 2 3 4 5
They should be put in the same cluster; if there is no match, then certainly not. Also, the number of rows in one cluster should not go above 100.
A tough problem? Just for info, there is also a time dimension that I haven't mentioned, but let's skip that for now.
So, any directions from you guys,
Thanks and best regards,
JDK
It's hard to recommend anything since your problem is totally vague, and we have no information on the data. Data mining (and in particular explorative techniques like clustering) is all about understanding the data. So we cannot provide the ultimate answer.
Two things for you to consider:
1. if the data indicates presence of species or traits, Jaccard similarity (and other set based metrics) are worth a try.
2. if absence is less informative, maybe you should be mining association rules, not clusters
Either way, without understanding your data these numbers are as good as random numbers. You can easily cluster random numbers, and spend weeks to get the best useless result!
Can your problem be treated as a Bag-of-words model, where each article (observation row) has no more than 100 terms?
Anyway, I think you have to give more information and examples about "why" and "how" you want to cluster these data. For example, we have:
1 2 3
2 3 4
2 3 4 5
1 2 3 4
3 4 6
6 7 8
9 10
9 11
10 11 12 13 14
What is your expected clustering? How many clusters are there in this clustering? Only two clusters?
Until you give more information, and going by your current description, I think you do not need a clustering algorithm but a connected-components structure. In the first round you process the dataset to build the connected components, and in the second round you check which connected component each row belongs to. Taking the example above, the first round is:
1 2 3 : 1 <- 1, 1 <- 2, 1 <- 3 (all points are linked to the smallest point to
        show that they belong to the same cluster as the smallest point)
2 3 4 : 2 <- 4 (2 and 3 are already linked to 1, which is <= 2, so they do
        not need to change)
2 3 4 5 : 2 <- 5
1 2 3 4 : 1 <- 4 (this change is not essential, because we already have
        1 <- 2 <- 4, but making it can speed up the second round)
3 4 6 : 3 <- 6
6 7 8 : 6 <- 7, 6 <- 8
9 10 : 9 <- 9, 9 <- 10
9 11 : 9 <- 11
10 11 12 13 14 : 10 <- 12, 10 <- 13, 10 <- 14
Now we have a forest structure that represents the connected components of the points. In the second round you can simply pick one point in each row (the smallest one is best) and trace it to its root in the forest. The rows which have the same root are in the same, in your words, cluster. For example:
1 2 3 : 1 <- 1, cluster root 1
2 3 4 5 : 1 <- 1 <- 2, cluster root 1
6 7 8 : 1 <- 1 <- 3 <- 6, cluster root 1
9 10 : 9 <- 9, cluster root 9
10 11 12 13 14 : 9 <- 9 <- 10, cluster root 9
This process takes O(k) space, where k is the number of points, and O(nm + nh) time, where h is the height of the forest structure and h << m.
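The two rounds can be sketched as a small union-find. This is a minimal illustration and not a tuned implementation: the function name cluster_rows is made up, rows are assumed to be non-empty lists of integers, and the smaller value is kept as the root to match the trace above.

```python
def cluster_rows(rows):
    """Group rows that share values, directly or through a chain of other rows."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smaller value becomes the root

    # First round: link every value in a row to the row's first value.
    for row in rows:
        find(row[0])  # also registers rows with a single value
        for value in row[1:]:
            union(row[0], value)

    # Second round: the root of any value in a row identifies its cluster.
    clusters = {}
    for row in rows:
        clusters.setdefault(find(row[0]), []).append(row)
    return clusters

rows = [[1, 2, 3], [2, 3, 4], [2, 3, 4, 5], [1, 2, 3, 4], [3, 4, 6],
        [6, 7, 8], [9, 10], [9, 11], [10, 11, 12, 13, 14]]
clusters = cluster_rows(rows)
for root, members in sorted(clusters.items()):
    print(root, members)
# root 1 collects the first six rows, root 9 the last three
```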
I am not sure if this is the result you want.