How are Urbit phonetic names encoded? - urbit

Urbit points (network addresses) are identified by 32-bit integers, but they're typically not referred to by their number. Instead, I usually see them represented in a human-pronounceable form where every byte is converted into a three-letter syllable. For example:
8 bits galaxy ~lyt
16 bits star ~diglyt
32 bits planet ~picder-ragsyt
64 bits moon ~diglyt-diglyt-picder-ragsyt
128 bits comet ~racmus-mollen-fallyt-linpex--watres-sibbur-modlux-rinmex
I initially assumed that every byte had a single text representation, but have seen that planets names usually don't include the name of their star, so it must be more complicated than that.
How does Urbit's phonetic name encoding system (#p-names) work?

Urbit's phonetic naming system encodes unsigned integers as human-readable strings. These unsigned integers sometimes represent the byte strings they encode to in big-endian (although that representation can't track leading zeros so the byte length must communicated out-of-band if needed). The phonetic naming scheme operates on these big-endian bytes.
The phonetic naming system has two variants. For general use there is #q-encoding, which is suitable for values of any length, and is frequently used to represent binary data in Hoon code or when interacting with the Dojo REPL. For Urbit point names there is #p-encoding, which is based on #q-encoding but modifies certain cases.
#q-Encoding: Pairs of Syllables
Urbit phonetic names are made up of 3-letter syllables, organized in two lists of 256 syllables each. Each syllable consists of a consonant, a vowel, then another consonant. The "prefix" syllable list uses the vowels a, i, and o, and the "suffix" syllable list uses the vowels e, u, and y, with one exception: zod, the first entry in the suffix list. The full syllable lists are included below.
Values fitting in one byte, from 0x00 to 0xFF, are encoded by taking the corresponding syllable from the suffix list. Examples: 0x00 becomes ~zod, 0x01 becomes ~nec.
Values fitting in two bytes, from 0x0100 to 0xFFFF, are encoded by looking up the syllable corresponding to the high byte in the prefix list and concatenating the syllable corresponding to the low byte in the suffix list. Examples: 0x0100 becomes ~marzod, 0x0101 becomes ~marnec.
Larger values are encoded by splitting them into two-byte pairs in big-endian order, encoding each as described above for values fitting in two bytes, and joining the results with - hyphen/minus characters. If the value is an odd number of bytes, the first byte pair is padded with a leading zero. Examples: 0x01_0000 becomes ~doznec-dozzod, 0x0101_0101 becomes ~marnec-marnec.
#p-Encoding: Scrambling Planets
The #p-encoding scheme is the same as #q for most values. However, it is different for values between 17 and 64 bits, which correspond to the IDs of planets and moons.
Planets are intended to correspond to real individuals on the Urbit network. Each planet is spawned from a star, and the 16 lower bits of the planet's ID are those of its parent star's ID. Under the #q-encoding system, this would also mean that the last two syllables of every planet's name would be its star's name. The Urbit developers didn't want each individual's name on the network to include the name of the star that happened to spawn their planet initially: that would artificially associate them with the star forever, even though they could immediately transfer their planet to a different star.
Their solution was to scramble all planet names randomly, to obfuscate the relationship between a planet's name and its parent star's name. This is implemented as a custom (obviously non-secure) cipher over the space of possible planet IDs. Because each star has 216 - 1 planets, the number of planets is not a power of two, so a conventional block cipher won't work directly. Instead, they use the construction described in Ciphers with Arbitrary Finite Domains (Black and Rockway 2002) over a custom Feistel-style block cipher optimized for speed (and compatibility).
This scrambling is applied on planet IDs, and on the lower 32 bits of a moon ID (which correspond to its parent planet's ID). Under #p-encoding, the planet with ID 0x01_0101 becomes ~ralnyt-botdyt, showing no connections to its parent star ~marnec. The star-planet relationship is the only one that is obfuscated. If you look at the names of a planet's moons, they include the name of the planet directly: for example, ~ralnyt-botdyt's moon 0x01_0001_0101 becomes ~doznec-ralnyt-botdyt, and 0x02_0001_0101 becomes ~dozbud-ralnyt-botdyt.
Implementations
When writing Hoon code, such as at the Dojo REPL, you can use the standard #p and #q functions directly to encode values to the corresponding phonetic names. In Hoon, a #p-encoded value is identified with the prefix ~ and a #q-encoded value is identified with the prefix .~, and either can be decoded back with the #u function. Hoon also uses . the period character as a (mandatory) thousands separator in integer literals.
> `#p`1.529.729.032
~diglyt-diglyt
> `#q`1.529.729.032
.~fonbyn-mopful
> `#u`~diglyt-diglyt
1.529.729.032
> `#u`.~diglyt-diglyt
3.246.440.832
In JavaScript, the official urbit-ob package provides similar functions.
import ob from "urbit-ob";
ob.patp(1529729032); // ~diglyt-diglyt
ob.patq(1529729032); // ~fonbyn-mopful
ob.patp2dec("~diglyt-diglyt"); // 1529729032
ob.patq2dec("~diglyt-diglyt"); // 3246440832
Full Syllable Lists
prefixes = ["doz","mar","bin","wan","sam","lit","sig","hid","fid","lis","sog",
"dir","wac","sab","wis","sib","rig","sol","dop","mod","fog","lid","hop","dar",
"dor","lor","hod","fol","rin","tog","sil","mir","hol","pas","lac","rov","liv",
"dal","sat","lib","tab","han","tic","pid","tor","bol","fos","dot","los","dil",
"for","pil","ram","tir","win","tad","bic","dif","roc","wid","bis","das","mid",
"lop","ril","nar","dap","mol","san","loc","nov","sit","nid","tip","sic","rop",
"wit","nat","pan","min","rit","pod","mot","tam","tol","sav","pos","nap","nop",
"som","fin","fon","ban","mor","wor","sip","ron","nor","bot","wic","soc","wat",
"dol","mag","pic","dav","bid","bal","tim","tas","mal","lig","siv","tag","pad",
"sal","div","dac","tan","sid","fab","tar","mon","ran","nis","wol","mis","pal",
"las","dis","map","rab","tob","rol","lat","lon","nod","nav","fig","nom","nib",
"pag","sop","ral","bil","had","doc","rid","moc","pac","rav","rip","fal","tod",
"til","tin","hap","mic","fan","pat","tac","lab","mog","sim","son","pin","lom",
"ric","tap","fir","has","bos","bat","poc","hac","tid","hav","sap","lin","dib",
"hos","dab","bit","bar","rac","par","lod","dos","bor","toc","hil","mac","tom",
"dig","fil","fas","mit","hob","har","mig","hin","rad","mas","hal","rag","lag",
"fad","top","mop","hab","nil","nos","mil","fop","fam","dat","nol","din","hat",
"nac","ris","fot","rib","hoc","nim","lar","fit","wal","rap","sar","nal","mos",
"lan","don","dan","lad","dov","riv","bac","pol","lap","tal","pit","nam","bon",
"ros","ton","fod","pon","sov","noc","sor","lav","mat","mip","fip"]
suffixes = ["zod","nec","bud","wes","sev","per","sut","let","ful","pen","syt",
"dur","wep","ser","wyl","sun","ryp","syx","dyr","nup","heb","peg","lup","dep",
"dys","put","lug","hec","ryt","tyv","syd","nex","lun","mep","lut","sep","pes",
"del","sul","ped","tem","led","tul","met","wen","byn","hex","feb","pyl","dul",
"het","mev","rut","tyl","wyd","tep","bes","dex","sef","wyc","bur","der","nep",
"pur","rys","reb","den","nut","sub","pet","rul","syn","reg","tyd","sup","sem",
"wyn","rec","meg","net","sec","mul","nym","tev","web","sum","mut","nyx","rex",
"teb","fus","hep","ben","mus","wyx","sym","sel","ruc","dec","wex","syr","wet",
"dyl","myn","mes","det","bet","bel","tux","tug","myr","pel","syp","ter","meb",
"set","dut","deg","tex","sur","fel","tud","nux","rux","ren","wyt","nub","med",
"lyt","dus","neb","rum","tyn","seg","lyx","pun","res","red","fun","rev","ref",
"mec","ted","rus","bex","leb","dux","ryn","num","pyx","ryg","ryx","fep","tyr",
"tus","tyc","leg","nem","fer","mer","ten","lus","nus","syl","tec","mex","pub",
"rym","tuc","fyl","lep","deb","ber","mug","hut","tun","byl","sud","pem","dev",
"lur","def","bus","bep","run","mel","pex","dyt","byt","typ","lev","myl","wed",
"duc","fur","fex","nul","luc","len","ner","lex","rup","ned","lec","ryd","lyd",
"fen","wel","nyd","hus","rel","rud","nes","hes","fet","des","ret","dun","ler",
"nyr","seb","hul","ryl","lud","rem","lys","fyn","wer","ryc","sug","nys","nyl",
"lyn","dyn","dem","lux","fed","sed","bec","mun","lyr","tes","mud","nyt","byr",
"sen","weg","fyr","mur","tel","rep","teg","pec","nel","nev","fes"]

Related

Opposite function to IBITS? How to use a subset of bits to store an integer in FORTRAN?

I have a memory-critical FORTRAN program, which at the moment uses two integer arrays, the first uses 4 BYTES integers and the second uses 1 BYTE integers.
Neither array is "fully" used, i.e. I know that the number stored in the 4 byte array won't exceed a million or so (so I need more than 2 bytes, but I'm not using all the bits of 4 bytes) and likewise I only use 2 bits of the 1-byte array as "flags".
To save memory I wanted to try and combine the two, i.e. use the two left-most bits of the 4-byte array to store the 2 flags.
Now I know if I set the INTEGER value first, I can subsequently set the flags using IBSET/IBCLR, and still extract my integer safely using the IBITS function to extract the rightmost X bits (X=28 in this case):
integer(kind=4) :: imark,inum
imark=600245 ! example number
! use bits 29 and 30 to store flags...
imark=IBSET(imark,29)
imark=IBSET(imark,30)
print *,imark ! gives 1611212981
! extract right most bits:
inum=IBITS(imark,0,28)
print *,inum ! gives back 600245 as desired.
However, I don't know how to modify the integer, while retaining the memory of the flags. In other words, I need the opposite function to IBITS, to copy an integer assigning only the X rightmost bits.
Obviously if I assign an new integer value (imark=343525) it will reset the flag bits. I could write a function that cycles through the rightmost X bits of the new integer value one at a time and then sets them using IBSET/IBCLR in my 4-byte array, but that seems very long-winded. But I do not find any intrinsic function to perform this task.
There is intrinsic function merge_bits(i, j, mask) in Fortran 2008. It merges the bits from two integers according to a logical mask.
You must prepare a logical mask as an integer which has ones in the first two bits and zeros elsewhere (or the opposite one).

Is information stored in registers/memory structured as binary?

Looking at this question on Quora HERE ("Are data stored in registers and memory in hex or binary?"), I think the top answer is saying that data persistence is achieved through physical properties of hardware and is not directly relatable to either binary or hex.
I've always thought of computers as 'binary', but have just realized that that only applies to the usage of components (magnetic up/down or an on/off transistor) and not necessarily the organisation of, for example, memory contents.
i.e. you could, theoretically, create an abstraction in memory that used 'binary components' but that wasn't binary, like this:
100000110001010001100
100001001001010010010
111101111101010100001
100101000001010010010
100100111001010101100
And then recognize that as the (badly-drawn) image of 'hello', rather than the ASCII encoding of 'hello'.
An answer on SO (What's the difference between a word and byte?) mentions that processors can handle 'words', i.e. several bytes at a time, so while information representation has to be binary I don't see why information processing has to be.
Can computers do arithmetic on hex directly? In this case, would the internal representation of information in memory/registers be in binary or hex?
Perhaps "digital computer" would be a good starting term and then from there "binary digit" ("bit"). Electronically, the terms for the values are sometimes "high" and "low". You are right, everything after that depends on the operation. Most of the time, groups of bits are operated on together. Commonly groups are 1, 8, 16, 32 and 64 bits. The meaning of the bits depends on the program but some operations go hand-in-hand with some level of meaning.
When the meaning of a group of bits is not known or important, humans like to be able to decern the value of each bit. Binary could be used but more than 8 bits is hard to read. Although it is rare to operate on groups of 4 bits, hexadecimal is much more readable and is generally used regardless of the number of bits. Sometimes octal is used but that's based on contexts where there is some meaning to a subgrouping of the 3 bits or an avoidance of digits beyond 9.
Integers can be stored in two's complement format and often CPUs have instructions for such integers. Once such operation is negation. For a group of 8 bits, it would map 1 to -1,… 127 to -127, and -1 to 1, … -127 to 127, and 0 to 0 and -128 to -128. Decimal is likely the most valuable to humans here, not base 256, base 2 or base 16. In unsigned hexadecimal, that would be 01 to FF, …, 00 to 00, 80 to 80.
For an intro to how a CPU might do integer addition on a group of bits, see adder circuits.
Other number formats include IEEE-754 floating point and binary-coded decimal.
I think you understand that digital circuits are binary. So, based on the above, yes, operations do operate on a higher conceptual level despite the actual storage.

Is there some byte combination that can be used as a separator of streams of Int16

I was given the task to specify a file format for internal use inside an application.
One of the intended requirements says:
The data section of the file should be made up of a series of streams of type Int16 values (short integers), delimited by a suitable combination of one or more bytes.
As I understand, Int16 can contain any single byte value, so I don't know how I could choose some sequence of bytes that is guaranteed not to appear incidentally inside a stream. Is there such a sequence?
(And also, if the answer is "no", what would be a good way to determine the position and size of each stream in the file?)
By "streams," I assume the request indicates that the length is unknown when the writing of the data begins.
Therefore, I'd suggest a "chunked" encoding, where each substream is parcelled out into variable-size pieces, with the length of each piece written at the beginning as a fixed size integer. An empty chunk signals the end of the substream. Normally, there would be a maximum length of a chunk to facilitate allocation of buffers for efficient reading and writing.
This is patterned after HTTP's "chunked" transfer encoding and a similar approach is used in many other formats, such as the indefinite length encoding supported by the basic encoding rules for ASN.1.
I would suggest prefixing each stream with a length field, rather than trying to use delimiters, for the reason you've already given (no suitable unique delimiter). E.g.:
<length>
<stream>
<length>
<stream>
<length>
<stream>
...
where <length> is, say, a 4 byte integer which defines the number of 16 bit elements in the following stream.

Why bytes of one word has opposite order in binary files?

I was reading BMP file in hex editor while discovered something odd. Two first letters "BM" are written in order, however the next word(2B), which is means file size, is 36 30 in hex. Actual size is 0x3036. I've noticed that other numbers are stored the same way.
I'm also using MARS MIPS emulator which can display memory by words. String in.bmp is stored as b . n i / \0 p m.
Why data isn't stored continuously?
It depends not on the data itself but on how you store this data: per byte, per word (2 bytes, usually), or per long (4 bytes -- again, usually). As long as you store data per byte you don't see anything unusual; data appears "continuous". However, with longer units, you are subject to endianness.
It appears your emulator is assuming all words need to have their bytes reversed; and you can see in your example that this assumption is not always valid.
As for the BM "magic" signature: it's not meant to be read as a word value "BM", but rather as "first, a single byte B, then a single byte M". All next values are written in little-endian order, not only 'exchanging' your 36 and 30 but also the 2 zeroes 'before' (or 'after') (the larger values in the BMP header are of 4 bytes long type).

Why is the smallest value that can be stored is a Byte(8bit) & not a Bit(1bit)?

Why is the smallest value that can be stored a Byte(8bit) & not a Bit(1bit) in memory?
Even booleans are stored as Bytes. Will we ever bump the smallest number to 32 or 64bits like register's on the CPU?
EDIT: To clarify as many answers seemed confused about the nature of questing. This question is about why isn't a byte 7-bit, 1-bit, 32-bit, etc (not why lower bit primitives must fit within the hardware's byte at min). Is the 8-bit byte simply historical as some hardware has 10-bit bytes for example. Or is there a mathematical reason 8-bit is ideal vs say 10-bit for general processing?
The hardware is built to read data in blocks (bytes, later words and dwords). This provides greater efficiency, than accessing individual bits, and also offers more addressing range. So most data is aligned to at least byte boundary. There exist encodings that operate with bit sequences, rather than bytes, but they are quite rare.
Nowadays the data is most often aligned to dword (32-bits) boundary anyway. Moreover, some hardware (ARM, for example), can't access misaligned multibyte variables, i.e. 16-bit word can't "cross" dword boundary - exception will be thrown.
Because computers address memory at the byte level, so anything smaller than a byte is not addressable.
The underlying methods of processor access are limited to the size of the smallest usable register. On most architectures, that size is 8 bits. You can use smaller portions of these; for instance, C has the bitfield feature in structs that will allow combining fields that only need to be certain bit lengths. Access will still require that the whole byte be read.
Some older exotic architectures actually did have different a "word size." In these machines, 10 bits might be the common size.
Lastly, processors are almost always backwards compatible. Intel, for instance, has maintained complete instruction compatibility from the 386 on up. If you take a program compiled for the 386, it will still run on an i7 processor. Changing the word size would break compatibility. So while it is possible, no manufacturer will ever do it.
Assume that we have native language that consist of 2 character such as a , b
to distinguish two characters we need at least 1 bit for example 0 to represent char a and 1 to represent char b
so that if we count number of characters and special characters and symbols, there are 128 character and to distinguish one character from another, you need log2(128) = 7 bit and 8th bit for transmission

Resources