Copyright © Dodeca Technologies Ltd. 2016

Unicode and All That

Chapter Three, Structure, points out the range of different writing systems in the world, and notes that only a tiny subset of the symbols comprising those systems can be found on a given computer keyboard. It shows too that you can use what are known as Character Entity References to address this, such that you emplace ‘unusual’ characters within your content by means of a short mnemonic that represents a given character, prefixed with an ampersand, and suffixed with a semi-colon.

The set of all such mnemonics is limited, however, and some e-readers do not even recognise those defined within the standard. Given this, and as Chapter Three states, you must use a variant on the character-entity approach called ‘Numeric Entity References’. That is, you must cite a number rather than a mnemonic between the ampersand and semi-colon, and this number must be a valid instance of what is called a Unicode ‘Codepoint’.

Similarly, you may state colour values in your CSS code using base-16, or using numbers falling within the seemingly arbitrary range of 0 to 255. This appendix explains from the ground up why these strange numbers and ‘bases’ litter the code-scape, and why we use things like Unicode and related standards such as UTF-8.

Why Bits, Why Binary?

Throughout human history, every form of writing has come down to making distinct differentiations in a medium of some form – an impression in a clay or wax tablet; an inky mark scratched with quill on parchment. Our modern computing machines are no different, as they also represent symbols by means of variations at discrete locations within some storage medium.

Consider the clay tablet with the cuneiform writing in the sidebar. Each symbol corresponds to a unique and well-defined concept, and could represent, for example, the number of sheep a farmer sold at the market each week.

Now imagine that the farmer had used a different 'encoding', and had represented numbers by using a piece of dowel to make a cylindrical impression in the clay (rather than the wedge-shaped indentations of cuneiform), where the depth of the depression represented magnitude. The drawing in the next sidebar captures this idea, showing, say, the number of sheep sold at market on three successive weeks as three cylindrical depressions in a block of clay.

Determining the number of sheep sold in a given week would require no more than the same piece of dowel, marked along the side in regular intervals (like a little cylindrical ruler). The depth to which the dowel penetrated a given hole in the clay would then give the quantity symbolised by the hole. This seems like a reasonable system, but the fact is that the dowel could become worn and therefore shorter with repeated use, and the holes could fill with dirt or other matter, which would change their depth. Equivalently, the farmer would have to be precise when making the depressions in the first place. All such factors promote inaccuracy.

To compound this, a farmer going to market each week would choose wax over clay because it allows continuous creation of records (a given clay tablet can acquire no new records after it has been fired). However, wax would present greater risk of inaccuracy because it softens in warmer weather. That is, repeated measurements during the summer (using the depth gauge) could deepen a given hole over time to the point that it failed to represent the original quantity.

Now imagine that the farmer used a different encoding by doing no more than making a simple hole in the clay for each sheep sold – a ‘unary’ counting model. That is, the presence or absence of a hole would be all that mattered, rendering hole depth immaterial and thereby precluding the risk of creeping inaccuracy pointed out above. However, selling, say, nineteen sheep would require nineteen holes, and so the farmer would need to provide much clay surface-area to allow for such large numbers. Yet these would be relatively rare, and so much of that surface area would go to waste in the average case (an inefficient use of the medium). Moreover, selling hundreds of sheep over a given year would require hundreds of holes, which would make counting them a laborious and error-prone process when the tax man called.

Far better would be the use of sets of holes and no-holes in the clay or wax – just two states, a ‘binary’ counting system. This too would preclude the creeping inaccuracy of the depth-of-hole approach, but it would also allow the representation of quantity in terms of the permutation of holes/no-holes in a given set. This would use the medium far more efficiently because, as the next section shows, you can represent a large range of numbers with just a handful of holes/no-holes in a given set. Moreover, counting the number of sheep sold in a given time would be considerably easier.

Thus it is, by analogy, with modern, digital methods of representing information. The characters of a given alphabet are represented by numbers, each of which is represented by a set of ‘holes/no-holes’, where a given permutation of holes/no-holes corresponds to a given number, and where a given hole/no-hole is called a ‘bit’ (a contraction of ‘Binary digIT’). In a computer's 'memory' (RAM), electrical charges take the place of physical depressions (the holes), while microscopic electronic ‘buckets’ take the place of the clay or wax.

In the case of hard disks, the presence or absence of tiny magnetic fields comprise the holes/no-holes, and a magnetic surface plays the role of the ‘clay’, and with pre-recorded CDs and DVDs, the medium is a thin layer of reflective aluminium. Here, notably, the microscopic ‘depressions’ that constitute bits in that medium are every bit as physical as the wedge-shaped marks in a clay tablet from 6000 years ago.

Moreover, the very same concerns of wear and inaccuracy lie behind the use of the binary hole/no-hole predicate in computing systems. That is, electronic circuits degrade over time thus precluding an equivalent to the 'depth of hole' approach considered above (where that equivalent would be having to measure voltage levels very precisely). The only difference between modern information representation and the earliest writing in human history is that electricity and light work considerably faster than styli and soft clay. Ironically, however, clay tablets from the dawn of civilisation are more durable and thus longer lasting than the electronic/magnetic media we use currently – our most primitive writing is the most persistent.

This leaves just one question: what counting system could the farmer have employed such that a given permutation of holes/no-holes represented a given number? There are many permutation schemes to choose from when using binary encoding – Morse code is a familiar example – but the principal scheme that computing systems employ treats each set of holes/no-holes (by analogy: ones and zeroes) as a number expressed in base-2.

There are compelling reasons for this choice: implementing arithmetic functionality is easy when using base-2, and it permits the application of elegant mathematical techniques when solving certain classes of problem in computer programming. The next section explains the relevant details of base-2 representation, but to pre-empt that, the sidebar shows the kind of binary-encoded clay tablet our digital farmer might have created (thirteen, six and nine sheep in weeks One, Two and Three respectively).
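To make the tablet in the sidebar concrete, the sketch below (Python, used here purely for illustration) writes each weekly tally as a four-bit base-2 pattern, where a ‘1’ stands for a hole and a ‘0’ for a no-hole:

```python
# Express each weekly tally as a four-bit base-2 pattern, where a '1'
# corresponds to a hole and a '0' to a no-hole in the clay.
for sheep in (13, 6, 9):
    pattern = format(sheep, '04b')   # e.g. 13 -> '1101'
    print(sheep, pattern)
```

Running this prints 1101, 0110 and 1001 for thirteen, six and nine respectively – three small sets of holes/no-holes in place of twenty-eight unary holes.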

Why Bytes, Why Eight Bits?

The collections or sets of bits with which computers work are referred to as ‘bytes’, a term coined in 1956 by noted IBM researcher Werner Buchholz. Phonetically, it suggests a little chunk of information, and it misspells the word ‘bite’ deliberately so as to avoid accidental misquotation as ‘bit’, but the concept raises the question of size: how big should each collection of bits be?

The greatest decimal number that two bits can represent is three, because two states are possible in the first bit, and two are possible in the second, giving four permutations in all (the values zero to three). That is, when the first bit is zero, the second can be zero or one (two permutations of the two bits), and when the first bit is one, the second can again be zero or one (two more permutations).

Another way of saying this is as follows:

2 × 2 = 2²
      = 4

It follows that grouping bits together in threes gives:

2 × 2 × 2 = 2³
          = 8

…and a group of four is equivalent to 2⁴, which is sixteen possible permutations, and so on. Over the early decades of computing, and depending on the design of the machine in question, bytes varied in the number of bits they contained, but a byte is defined these days as being formed from eight bits.
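A short Python sketch (again, purely illustrative) makes the arithmetic concrete, listing the number of permutations a given group of bits affords, along with the greatest value that group can represent (all bits set to one):

```python
# For each group size: the count of distinct permutations, and the
# greatest value representable (which is all-ones, i.e. one less than
# the permutation count because zero takes up one permutation).
for n_bits in (2, 3, 4, 8):
    permutations = 2 ** n_bits
    greatest = permutations - 1
    print(n_bits, permutations, greatest)
```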

This came about because, in the western Roman character set, there are 26 lower-case letters, 26 upper-case equivalents, ten digits, and a motley collection of punctuation and mathematical symbols, along with outliers such as ‘&’ and ‘@’. Six bits can represent a number between zero and sixty-three (sixty-four values in all), but this is too limited to be able to represent the western character set. However, seven bits corresponds to 2⁷ = 128, which is sufficient to represent all the characters mentioned above, while leaving a number of unassigned permutations that can be used for other purposes, such as representing the end of a line of text.

Even so, seven is not a power of two, which is of significance, as a power-of-two byte size facilitates some of the elegant mathematical techniques mentioned in the previous section, and so system designers started using the next power of two beyond seven, which is eight. An eighth bit allowed machines to represent an extended range of characters, or to employ error detection in the early computer networks (the precursors of the modern Internet). Avoiding a mass of technicalities, the idea here is that you can set the eighth bit to one or zero given the cumulative state of the other seven bits. This allows a system that is in receipt of an eight-bit byte of information to compare the value of the seven data-bits with the value of the eighth ‘meta’ bit. Inconsistency between the two indicates that corruption of the data occurred during transmission, thus telling the receiver that it should ask the sender to re-send the information.
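The parity idea sketched above can be captured in a few lines of Python (an even-parity scheme is assumed here for illustration; real systems varied in their conventions):

```python
def add_even_parity(seven_bits: int) -> int:
    """Place a parity bit in the eighth (most significant) position so
    that the total count of one-bits in the byte is always even."""
    parity = bin(seven_bits).count('1') % 2
    return (parity << 7) | seven_bits

def looks_corrupted(byte: int) -> bool:
    """Under even parity, an odd count of one-bits signals corruption."""
    return bin(byte).count('1') % 2 == 1

byte = add_even_parity(0b1000001)         # 'A' in seven-bit ASCII
print(looks_corrupted(byte))              # parity checks out
print(looks_corrupted(byte ^ 0b0000100))  # one bit flipped in transit
```

A single flipped bit always changes the count of ones from even to odd, which is exactly the inconsistency the receiver looks for.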

Modern networking systems now use error detection methods of greater sophistication, but the convention of eight bits to a byte became a de facto standard, and was eventually standardised formally. Do note here, however, that because there has been some variance in the size of a byte in the past, UTF-8 nomenclature (see below) uses the term ‘octet’ to indicate eight bits, thus avoiding any uncertainty.

Why 255, Why Base-16?

Eight bits to a byte is also the reason why the number 255 crops up so frequently in technical discussions of computer-related matters. That is:

2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 = 2⁸
                              = 256

This allows the representation of numerical values from 0 to 255 (256 values in all), and this is why, if you assign colour values in a CSS rule using the rgb(rr,gg,bb) form, you must supply a value between 0 and 255 for the red, green and blue parameters. For example:

background-color : rgb(64,127,32);

Another way of saying this is that CSS models the RGB ‘colour space’ as requiring one byte to represent a value for the Red degree of freedom, one byte for the Green, and one for the Blue.

This gives a faithful mapping between what you say in your CSS code and what happens on a so-called ‘True Colour’ display system. Such display systems commonly employ four bytes to represent the colour of a given pixel: Red, Green and Blue get one byte apiece, and the fourth either goes unused, or represents a value such as transparency. This, in turn, is why it is possible to state a colour value in CSS using the alternative rgba(rr,gg,bb,aa) form, where the ‘aa’ parameter stands for ‘alpha’, and should be a number between 0 and 1 that denotes the degree of opacity.

For example, the following:

background-color : rgba(64,64,64,0.5);

…sets the background colour of the element to which it is applied to semi-transparent charcoal-grey (which appears slate-grey due to the transparency).

The powers-of-two factor that pervades matters here also explains why you can use numbers in base-16 (‘hexadecimal’ or simply ‘hex’) to express colour values in CSS. For example:

background-color : #D0D0D0;

This rule states that the background colour should be D0 in strength for each of the red, green and blue components, and the use of base-16 here is simply a notational convenience that arises because any value between 0 and 255 can be expressed in just two hexadecimal digits.
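If you wish to check the correspondence between the two CSS notations yourself, the conversion is a one-liner (a Python sketch for illustration):

```python
def rgb_to_hex(r: int, g: int, b: int) -> str:
    """Render an rgb(r,g,b) triplet in CSS #RRGGBB notation; each
    byte becomes exactly two hexadecimal digits."""
    return '#{:02X}{:02X}{:02X}'.format(r, g, b)

print(rgb_to_hex(208, 208, 208))  # #D0D0D0
print(rgb_to_hex(64, 127, 32))    # #407F20
```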

If you are uncertain about this, consider that when counting in decimal (base-10) we go from zero to nine, and then count ‘10’; one lot of ten, zero units. In base-16, the units go from zero to fifteen, yet we do not have a special symbol for the numbers 10 to 15, so hexadecimal notation borrows the first six letters from the alphabet to denote those values. That is, counting from zero to fifteen in base-16 goes as follows:

0 1 2 3 4 5 6 7 8 9 A B C D E F

It is only when you reach F and need to count one more that you write ‘10’, meaning ‘one lot of sixteen, zero units’.

With that in hand, the fact that 16 × 16 = 256 allows representation of any value for a given eight-bit byte using just two digits in base-16. Hexadecimal notation is simply useful shorthand when you need to express a byte-related value in CSS, and, scary as it may seem to the neophyte, it really is entirely harmless.

Finally, in this section, you may have seen the value 65,535 (or 65,536) in technical discussions. The reason for this again-seemingly-arbitrary number is that software systems sometimes need to use two bytes to represent larger values (and four, and eight, and so on). Two bytes is 16 bits, and 2¹⁶ = 65,536. Similarly, a kilobyte is not exactly 1000 bytes, but is 1024 because that is the value of 2¹⁰.
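All of these seemingly arbitrary values fall straight out of the powers of two, as a few lines of Python confirm:

```python
# The recurring 'magic numbers' of this section, derived from scratch.
print(2 ** 8)    # one byte: 256 values (0 to 255)
print(2 ** 16)   # two bytes: 65,536 values (0 to 65,535)
print(2 ** 10)   # a traditional 'kilobyte': 1024 bytes
print(16 * 16)   # two hex digits cover exactly one eight-bit byte
```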

Why Unicode?

The use of seven bits to represent a character from the western Roman alphabet saw the definition of a standard called ‘ASCII’, which stands for the American Standard Code for Information Interchange, and which began life in 1960.

In this character encoding (as these things are known), each of the characters in the Roman alphabet is assigned one of 128 numbers (or ‘codepoints’ as the jargon has it). For example, a space is 32, whereas an uppercase ‘A’ is 65, a ‘B’ is 66, and a ‘C’ is 67. Similarly, a lower-case ‘a’ is 97 and a lower-case ‘b’ is 98, and so on. This worked well enough in the early decades of computing technologies in the Twentieth Century, but it defines only 95 printable characters; for example, ‘$’ is mapped to 36, but there is no British pound sign or ‘£’.
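You can confirm these mappings for yourself; in Python, for example, the ord function returns the codepoint of a character:

```python
# The ASCII codepoints cited above.
for ch in (' ', 'A', 'B', 'C', 'a', 'b', '$'):
    print(repr(ch), ord(ch))

# The pound sign lies outside seven-bit ASCII altogether.
print('£'.isascii())  # its codepoint exceeds 127
```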

While use of the eighth bit in a byte allowed extension of ASCII encoding (as mentioned in the section before last), system manufacturers extended it in a variety of incompatible ways. These incompatibilities engendered serious problems, and so development of a standard called Unicode began in 1987. Its aim, in essence, is to assign a unique codepoint to every character in most of the writing systems ever known, while leaving room for writing systems that archaeologists have yet to discover, and room for new systems that humanity has yet to invent. It includes codepoints for systems as diverse as mathematical and musical notation, the hexagrams of the I Ching, and the symbols on the Phaistos Disc (a Minoan artefact discovered in Crete); although some have yet to make the cut, Klingon being an example.

More precisely, it assigns codepoints out of a space of 1,114,112 to writing system graphemes, where ‘grapheme’ means the concept of, say, the letter ‘A’, rather than the actual appearance of a glyph rendered in a given typeface. Unicode also encompasses things such as collation, and the rules covering situations where certain characters occur together, and it is backward compatible with ASCII to boot. That means that a capital ‘A’ is 65 in Unicode, as it is in ASCII.
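A couple of Python probes make that backward compatibility tangible (the sample characters here are merely illustrative):

```python
print(ord('A'))        # 65: the same codepoint in ASCII and Unicode
print(ord('£'))        # 163: beyond ASCII, but in Unicode
print(ord('€'))        # 8364: the euro sign's codepoint
print(hex(ord('€')))   # the same codepoint in hexadecimal
```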

Unicode is complex, as its remit demands, but do not let this deter you. To encode an exotic character (from the western point of view) in your e-book, you should look that character up in a Unicode reference, determine the number to which it is mapped (the codepoint), and then cite that number as a numeric character reference in the appropriate tract of HTML.

If you cite it as a decimal number then the syntax, where dddd indicates the number itself, is:

&#dddd;

However, if you cite the codepoint as a hexadecimal number then the syntax is a little different, and is as follows, where hhhh indicates the number:

&#xhhhh;

Note that these examples assume that you are citing a codepoint from what is known as Unicode's ‘Basic Multilingual Plane’. This section of the standard covers characters from the more-common written languages, but if you need to render something truly extraneous from the higher end of Unicode's codepoint space (jokingly called the ‘astral planes’) then you should use as many digits as are cited in the reference you consult.
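Generating such references mechanically is straightforward; the sketch below (Python, purely illustrative) produces both the decimal and hexadecimal forms from a character's codepoint:

```python
def decimal_reference(ch: str) -> str:
    """Numeric character reference in decimal form, e.g. '&#163;'."""
    return '&#{};'.format(ord(ch))

def hex_reference(ch: str) -> str:
    """Numeric character reference in hexadecimal form, e.g. '&#xA3;'."""
    return '&#x{:X};'.format(ord(ch))

print(decimal_reference('£'))  # the pound sign, cited in decimal
print(hex_reference('£'))      # the same character, cited in hex
print(hex_reference('€'))      # the euro sign, cited in hex
```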

Do note too that, while getting the character reference right is one issue, the ability of a given user agent to render it is another. That is, a glyph for the character in question must exist within the font file that the user agent employs when presenting the content. If not, it will display a glyph that looks like this: �. This is Unicode's so-called ‘replacement character’, which indicates that no glyph exists in the typeface that the user agent is using; its precise appearance differs from typeface to typeface.

Notably here, the current Wikipedia entry on Unicode reports that ‘fewer than a dozen fonts – sometimes described as “pan-Unicode” fonts – attempt to support the majority of Unicode's character repertoire’. This is entirely understandable, given that such typefaces incorporate designs for a veritable mountain of glyphs, and so, when faced with a need to encode and thus render exotic characters, you should research the capabilities of the default typefaces that the various user agents support. If they do not support the desired glyph(s), you will have to find a typeface that does, and use the embedded-fonts facility in CSS to employ it in the presentation of your content. Alternatively, you can create an SVG graphic that renders a given glyph while avoiding costly overheads. Appendix E addresses this.

Why UTF-8?

The original seven-bit ASCII can be represented in a computer's storage media, or transmitted across a network bit-for-bit, byte-for-byte, but Unicode does not define how codepoints should be written/transmitted (or ‘formatted’). In principle, and given that you need three bytes to represent the gamut of codepoints in Unicode directly, you could just encode each character bit-for-bit in a given medium.

However, for a large proportion of communications, certainly those based around the western languages, this would be incredibly inefficient. A simple word like ‘CAT’ requires just three bytes in ASCII, but would require nine bytes if encoded as raw Unicode, where two bytes in each triplet would be redundant.

The diagram in the sidebar shows this, by depicting the mapping from alphabetic to decimal, and then to binary. The leading two bytes of each letter in the binary representation are filled with useless zeros. The problem here is that documents consisting largely of seven-bit ASCII characters would be around three times as long as their single-byte equivalents. This would impinge on storage and network resources, yet 66% of each document's bit-wise contents would be redundant.

The solution is to use some form of flexible encoding, such that (in general) only the number of bytes that are needed are used. In such a scheme, a simple character sequence like ‘CAT’ should only ever consume three bytes, but inclusion within that sequence of some obscure ideograph from, say, Mandarin, may consume as many bytes as it needs to express its codepoint.

A variety of such ‘transformation’ schemes have been developed over time, but the one that is in ever-greater use on the Web, and which will encode any Unicode codepoint faithfully, is ‘Universal Coded Character Set Transformation Format – 8-bit’, or simply UTF-8. In this standard, the first 128 codepoints are each encoded in a single byte whose value maps directly to ASCII, so a capital ‘A’, which is 65 in ASCII and Unicode, is also 65 in UTF-8. This means that ‘CAT’ in UTF-8 looks just like it does in seven-bit ASCII, and thus the highest or eighth bit will always be zero for those characters.
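You can verify that ASCII compatibility directly in Python:

```python
cat = 'CAT'.encode('utf-8')
print(cat)                         # three bytes, exactly as in ASCII
print(len(cat))                    # no padding bytes at all
print(all(b < 128 for b in cat))   # the eighth bit is zero throughout
```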

A value of one for that bit, however, acts as a signal to the program in question that the next byte along forms a part of the same character, and the format uses the values of other bits thereafter to indicate whether or not a third and fourth byte form part of the character too.
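The multi-byte behaviour is easy to observe too. The sketch below prints the bit patterns of some encoded characters, showing the high bit set on every byte of a multi-byte sequence (it glosses over the precise leading-bit conventions that distinguish the first byte from its continuations):

```python
# One-, two- and three-byte UTF-8 sequences, shown bit by bit.
for ch in ('A', '£', '€'):
    encoded = ch.encode('utf-8')
    patterns = [format(b, '08b') for b in encoded]
    print(ch, len(encoded), patterns)
```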

Character encoding is a truly complex and labyrinthine affair, and it is true to say that few programmers who wrestle with its thorny issues come away unscathed (anyone for latin5?). Thankfully, as an e-book producer, you need ensure only that, when saving your content in your word processor as plain text, or pasting it from that application, you convert the ‘smart quotes’ to ordinary straight quotes (as Step E in Stage Three in Chapter Seven, Production directs). Suggested search terms for help on this are:

word smart quotes convert

You should also ensure that any HTML files you create have the following at the top of the <head> element, as Chapter Three advises:

 <meta charset = 'UTF-8'/>

Finally, if you conduct your own research on UTF-8, or when using your text editor, you may come across the concept of a ‘Byte Order Mark’ or ‘BOM’. Similarly, some text editors give you the option to save a file with or without a BOM. In all cases, you can forget about BOMs with impunity, as they are of value only when processing a text file from an external source (e.g. opening and manipulating it within a text editor), and even then they come into play only when handling text that is represented using older non-UTF-8 encodings (a ‘legacy’ issue, as these things are known).

That is, a BOM is a short sequence of bytes at the beginning of a stream of text that tells a text processing program certain things it needs to know about the character encoding used in the remainder of the file. However, given that you will create new HTML and CSS files when formatting your book's content, a BOM will play no useful role in your work. It follows that, as long as you create only ASCII code (i.e. where you use numeric entities in your HTML to signify ‘exotic’ characters) and/or save your files only in UTF-8 you will never need to instruct your editor to include a BOM. This will allow you to forget about this issue.
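If curiosity gets the better of you nonetheless, the UTF-8 BOM is just three fixed bytes, and Python's ‘utf-8-sig’ codec strips it automatically on decoding:

```python
import codecs

print(codecs.BOM_UTF8)           # the three bytes EF BB BF
text = codecs.BOM_UTF8 + 'Hello'.encode('utf-8')
print(text.decode('utf-8-sig'))  # the BOM is stripped on decoding
```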

A closing thought in this appendix: there is a certain gratification in the fact that pre-recorded CDs and DVDs utilise pits in a reflective surface that are every bit as physical as the wedge-shaped depressions that comprise clay-tablet cuneiform from the very dawn of civilisation.

That simple insight brings us full circle, and yields a refreshing consideration: the very first human writers could have understood binary and from there base-2 encoding, along with the sound reasons for using it. Those authors and scribes of old (most likely one and the same at the time) could have understood ASCII too, and Unicode from there; and yes, even UTF-8 would have been within their grasp.

Times and technologies may change for writers, but the people that we are, and our motivations and symbolic means vary not.