From ASCII to UTF-8
Every character you see on screen is stored as a number. The mapping from numbers to characters is called a character encoding. Getting it wrong produces mojibake — garbled text like Ã© instead of é.
ASCII
The original 7-bit encoding. Maps 128 characters (0-127): English letters, digits, punctuation, and control characters. Every byte's top bit is 0. Simple, but only covers English.
Unicode
A universal character set assigning a unique code point (e.g., U+0041 = A, U+1F600 = a smiley) to every character in every script. Unicode defines what characters exist; encodings define how to store them as bytes.
UTF-8
The dominant encoding on the web. Uses 1 to 4 bytes per character with a clever variable-length scheme:
| Byte Count | Code Point Range | Example |
|---|---|---|
| 1 byte | U+0000 — U+007F | A, z, 9 (ASCII range) |
| 2 bytes | U+0080 — U+07FF | é, ñ, Cyrillic |
| 3 bytes | U+0800 — U+FFFF | Chinese, Japanese, Korean |
| 4 bytes | U+10000 — U+10FFFF | Emoji, rare scripts |
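The byte counts in the table are easy to verify in any language with native Unicode strings; here is a quick check in Python:

```python
# Each string below is a single code point; encoding to UTF-8 shows
# how many bytes that code point needs.
for ch in ["A", "é", "漢", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s), hex {encoded.hex(' ')}")
```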
UTF-16 and UTF-32
UTF-16 uses 2 or 4 bytes per character. It's used internally by Java, JavaScript, and Windows. Characters outside the Basic Multilingual Plane need surrogate pairs (two 16-bit units).
UTF-32 uses exactly 4 bytes per character. Simple indexing but wasteful — rarely used for storage or transmission.
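The surrogate-pair mechanics show up directly when you encode a character outside the BMP. A quick Python check with U+1F600 (an emoji):

```python
smiley = "😀"  # U+1F600, outside the Basic Multilingual Plane

utf16 = smiley.encode("utf-16-be")  # big-endian, no BOM
utf32 = smiley.encode("utf-32-be")

print(utf16.hex(" "))  # d8 3d de 00: the surrogate pair U+D83D, U+DE00
print(utf32.hex(" "))  # 00 01 f6 00: the raw code point in 4 bytes
```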
Byte Order Mark (BOM)
A special code point (U+FEFF) placed at the start of a file to indicate encoding and byte order. Essential for UTF-16 (which needs to signal big vs. little endian), but optional and generally discouraged for UTF-8.
A classic mismatch: the UTF-8 bytes for "café" (63 61 66 C3 A9) decoded as Latin-1 produce "cafÃ©". Always declare your encoding explicitly via HTTP headers or meta tags.
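This round trip is easy to reproduce: encode as UTF-8, decode as Latin-1, and the two bytes of é become two separate characters.

```python
text = "café"
raw = text.encode("utf-8")

print(raw.hex(" "))           # 63 61 66 c3 a9
print(raw.decode("latin-1"))  # cafÃ©: each UTF-8 byte of é read as its own Latin-1 char
```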
Integers, Floats & Endianness
Integer Representation: Two's Complement
Modern CPUs represent signed integers using two's complement. The most significant bit is the sign bit (0 = positive, 1 = negative). To negate a number: flip all bits, then add 1.
5 in 8-bit: 00000101
-5 in 8-bit: 11111011 (flip → 11111010, +1 → 11111011)
This gives a range of -128 to +127 for 8-bit, and -2,147,483,648 to +2,147,483,647 for 32-bit. The beauty: addition and subtraction work identically for signed and unsigned values — no special hardware needed.
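The flip-and-add-one rule can be checked with a small helper; in Python, masking with `& ((1 << bits) - 1)` extracts the two's-complement bit pattern of a signed value:

```python
def twos_complement(value: int, bits: int = 8) -> str:
    """Return the two's-complement bit pattern of a signed integer."""
    return format(value & ((1 << bits) - 1), f"0{bits}b")

print(twos_complement(5))    # 00000101
print(twos_complement(-5))   # 11111011
```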
IEEE 754 Floating Point
Floats store real numbers as sign × mantissa × 2^exponent. A 64-bit double has 1 sign bit, 11 exponent bits, and 52 mantissa bits.
Because 0.1 and 0.2 have no exact binary representation, 0.1 + 0.2 = 0.30000000000000004. For exact decimal math, use libraries like BigDecimal or integer cents.
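A quick demonstration, using Python's stdlib `decimal` module as the exact-arithmetic option:

```python
from decimal import Decimal

print(0.1 + 0.2)                        # 0.30000000000000004
print(0.1 + 0.2 == 0.3)                 # False
print(Decimal("0.1") + Decimal("0.2"))  # 0.3, computed in exact decimal
```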
Endianness
Multi-byte values can be stored with the most significant byte first (big-endian, aka network byte order) or least significant byte first (little-endian, used by x86/ARM). The integer 0x01020304:
Big-endian: 01 02 03 04
Little-endian: 04 03 02 01
Network protocols (TCP/IP) use big-endian. Most modern CPUs use little-endian. Serialization formats must specify or detect byte order to avoid silent data corruption.
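The two byte orders are easy to compare with Python's `struct` module, which also shows what silent corruption looks like when the reader guesses wrong:

```python
import struct

n = 0x01020304
print(struct.pack(">I", n).hex(" "))  # 01 02 03 04  (big-endian, network order)
print(struct.pack("<I", n).hex(" "))  # 04 03 02 01  (little-endian)

# Writing big-endian but reading little-endian yields a different number:
wrong = struct.unpack("<I", struct.pack(">I", n))[0]
print(hex(wrong))  # 0x4030201
```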
Base64, Base32 & Hex
Base encodings represent arbitrary binary data using a restricted alphabet of printable characters. They're essential when binary data must travel through text-only channels (email, JSON, URLs).
Base64
Encodes 3 bytes (24 bits) into 4 characters from a 64-character alphabet (A-Z, a-z, 0-9, +, /). Adds ~33% overhead. Padding with = aligns the output to a multiple of 4 characters.
"Man" → TWFu
M = 77 = 01001101
a = 97 = 01100001
n = 110 = 01101110
010011 010110 000101 101110
T W F u
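The standard library reproduces this worked example directly, and shows how = padding fills out inputs that aren't a multiple of 3 bytes:

```python
import base64

print(base64.b64encode(b"Man").decode())  # TWFu
print(base64.b64encode(b"Ma").decode())   # TWE=  (padded to a multiple of 4)
print(base64.b64decode("TWFu"))           # b'Man'
```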
Standard Base64 uses + and /, which have special meaning in URLs. URL-safe Base64 replaces them with - and _. Some implementations also omit padding. Always know which variant you're using.
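The difference between the variants is visible on any input whose Base64 output contains + or /; the bytes below are chosen only to force those characters:

```python
import base64

data = b"\xfb\xff\xfe"
print(base64.b64encode(data).decode())         # +//+
print(base64.urlsafe_b64encode(data).decode()) # -__-
```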
Base32 & Hex
Base32 uses 32 characters (A-Z, 2-7). Case-insensitive and avoids ambiguous characters. Used in TOTP codes and some file systems. 60% overhead.
Hex (Base16) maps each byte to two characters (0-9, a-f). 100% overhead but trivial to read and debug. Used for hashes, colors, and memory addresses.
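Encoding the same three bytes in hex and Base32 makes the overhead concrete:

```python
import base64

data = b"Man"
print(data.hex())                       # 4d616e  (2 chars per byte)
print(base64.b32encode(data).decode())  # JVQW4===  (padded to 8 chars)
```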
| Encoding | Alphabet Size | Overhead | Use Case |
|---|---|---|---|
| Hex | 16 | 100% | Hashes, debugging |
| Base32 | 32 | 60% | TOTP, case-insensitive contexts |
| Base64 | 64 | 33% | Email attachments, data URIs, JWTs |
Turning Objects into Bytes
Serialization converts in-memory data structures into a format that can be stored or transmitted. The choice of format affects performance, interoperability, and developer experience.
JSON
Human-readable, text-based. The lingua franca of web APIs. No schema, no comments, limited types (no distinction between integers and floats, no dates). Fairly verbose.
XML
Verbose but powerful — supports namespaces, schemas (XSD), and transformations (XSLT). Still dominant in enterprise/SOAP systems. Large overhead.
Protocol Buffers
Google's binary format. Schema-defined (.proto files), compact, fast. Supports schema evolution. Not human-readable.
Avro
Schema embedded with data. Popular in Hadoop/Kafka. Schema evolution built-in. Good for data pipelines and long-term storage.
MessagePack
Binary JSON — same data model, smaller and faster. No schema. Drop-in replacement where JSON is too slow or large.
CBOR
Concise Binary Object Representation (RFC 8949). Like MessagePack but IETF-standardized. Supports tags for dates, bigints, etc. Used in WebAuthn and COSE.
Protocol Buffers Deep Dive
Protobuf assigns each field a numeric tag. The wire format encodes tag + type + value, skipping unset fields entirely. This makes it compact and allows schema evolution.
```proto
// person.proto
syntax = "proto3";

message Person {
  string name = 1;   // field tag 1
  int32 age = 2;     // field tag 2
  string email = 3;  // field tag 3
}
```
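The wire format is simple enough to hand-roll. This sketch (plain Python, not the official protobuf library) encodes a Person with name "Bob" and age 30, using the field tags from the message above; each key byte is (field_number << 3) | wire_type:

```python
def varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint (7 bits per byte)."""
    out = bytearray()
    while True:
        n, low = n >> 7, n & 0x7F
        out.append(low | (0x80 if n else 0))
        if not n:
            return bytes(out)

def key(field_number: int, wire_type: int) -> bytes:
    return varint((field_number << 3) | wire_type)

name = b"Bob"
msg = key(1, 2) + varint(len(name)) + name  # field 1, wire type 2 (length-delimited)
msg += key(2, 0) + varint(30)               # field 2, wire type 0 (varint)
print(msg.hex(" "))  # 0a 03 42 6f 62 10 1e
```

Note that the unset email field (tag 3) contributes zero bytes, which is why sparse messages stay small.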
| Property | JSON | Protobuf | Avro | MessagePack |
|---|---|---|---|---|
| Format | Text | Binary | Binary | Binary |
| Schema | None | .proto file | Embedded | None |
| Human readable | Yes | No | No | No |
| Size | Large | Small | Small | Medium |
| Speed | Slow | Fast | Fast | Fast |
| Schema evolution | Ad hoc | Strong | Strong | Ad hoc |
Changing Formats Without Breaking Things
In distributed systems, producers and consumers of data are updated independently. Schema evolution rules ensure old readers can handle new data and vice versa.
Compatibility Types
| Type | Meaning | Safe Operations |
|---|---|---|
| Backward compatible | New code reads old data | Add optional fields, remove fields (with defaults) |
| Forward compatible | Old code reads new data | Remove fields, add fields (old code ignores unknowns) |
| Full compatible | Both directions work | Only add/remove optional fields with defaults |
Evolution in Practice
```proto
// v1: original schema
message Event {
  string id = 1;
  string type = 2;
}

// v2: backward + forward compatible
message Event {
  string id = 1;
  string type = 2;
  int64 timestamp = 3;  // NEW — old readers ignore it
  reserved 4;           // REMOVED field: tag retired, never reused
}
```
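The forward-compatibility rule, old readers ignoring unknown fields, can be sketched with plain dicts (hypothetical reader code, not tied to any serialization library):

```python
def read_event_v1(event: dict) -> tuple:
    """A v1 reader: knows only id and type, silently skips anything newer."""
    return event["id"], event["type"]

# A v2 producer added a timestamp field the v1 reader has never seen.
v2_event = {"id": "e-1", "type": "click", "timestamp": 1700000000}
print(read_event_v1(v2_event))  # ('e-1', 'click')
```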
Making Data Smaller
Compression trades CPU time for smaller data. The right algorithm depends on whether you need speed (real-time streaming) or maximum size reduction (archival storage).
| Algorithm | Ratio | Speed | Best For |
|---|---|---|---|
| gzip / deflate | Good | Moderate | HTTP responses, general files. Universal support. |
| Brotli | Excellent | Slow compress, fast decompress | Static web assets (pre-compressed). 15-25% smaller than gzip. |
| zstd | Excellent | Fast both ways | Databases, logs, real-time pipelines. Trainable dictionaries. |
| LZ4 | Moderate | Extremely fast | In-memory caching, real-time streaming. Prioritizes speed. |
| Snappy | Moderate | Very fast | Google's internal format. Used in Bigtable, Hadoop. Speed over ratio. |
Lowest latency: LZ4 or Snappy — decompress at memory-copy speeds.
Universal compatibility: gzip — supported everywhere, good enough ratio.
Best overall: zstd — near-Brotli ratios at near-LZ4 speeds. Increasingly the default choice.
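A minimal round trip with the stdlib `zlib` module, which implements DEFLATE, the algorithm behind gzip; repetitive input compresses dramatically:

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 100
compressed = zlib.compress(data, level=6)

print(len(data), "->", len(compressed), "bytes")
assert zlib.decompress(compressed) == data  # lossless round trip
```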
HTTP Compression
Browsers and servers negotiate compression via HTTP headers. The client advertises what it supports; the server picks the best match.
```
# Request
GET /api/data HTTP/1.1
Accept-Encoding: gzip, deflate, br

# Response
HTTP/1.1 200 OK
Content-Encoding: br
Vary: Accept-Encoding
Content-Type: application/json

[compressed body]
```
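Server-side negotiation can be sketched as follows; this is a minimal parser that handles q-values but not wildcards (real servers also handle `*` and `identity;q=0`):

```python
def pick_encoding(accept_encoding: str, supported: list) -> str:
    """Pick the first server-supported encoding the client accepts."""
    weights = {}
    for part in accept_encoding.split(","):
        name, _, params = part.strip().partition(";")
        q = 1.0
        if params.strip().startswith("q="):
            q = float(params.strip()[2:])
        weights[name.strip()] = q
    for enc in supported:        # iterate in server preference order
        if weights.get(enc, 0) > 0:
            return enc
    return "identity"            # fall back to no compression

print(pick_encoding("gzip, deflate, br", ["br", "zstd", "gzip"]))  # br
```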
Test Yourself
The client sends Accept-Encoding: gzip, br, zstd to tell the server what it can decompress. The server responds with Content-Encoding: br (or whichever it chose). Transfer-Encoding is for hop-by-hop encoding like chunked, not end-to-end compression.