Encoding, Serialization & Compression

01 / Character Encoding

From ASCII to UTF-8

Every character you see on screen is stored as a number. The mapping from numbers to characters is called a character encoding. Getting it wrong produces mojibake: garbled text like Ã© instead of é.

ASCII

The original 7-bit encoding. Maps 128 characters (0-127): English letters, digits, punctuation, and control characters. Every byte's top bit is 0. Simple, but only covers English.

Unicode

A universal character set assigning a unique code point (e.g., U+0041 = A, U+1F600 = 😀) to every character in every script. Unicode defines what characters exist; encodings define how to store them as bytes.

UTF-8

The dominant encoding on the web. Uses 1 to 4 bytes per character with a clever variable-length scheme:

Byte Count	Code Point Range	Example
1 byte	U+0000 to U+007F	A, z, 9 (ASCII range)
2 bytes	U+0080 to U+07FF	é, ñ, Cyrillic
3 bytes	U+0800 to U+FFFF	Chinese, Japanese, Korean
4 bytes	U+10000 to U+10FFFF	Emoji, rare scripts
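The byte counts in the table can be checked directly. The following is a minimal sketch; the four sample characters are illustrative picks, one per row.

```python
# One sample character per row of the table (illustrative choices).
samples = ["A", "é", "中", "😀"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Print the code point, the number of bytes, and the bytes in hex.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
```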

Why UTF-8 Won

UTF-8 is backward compatible with ASCII — any valid ASCII file is already valid UTF-8. It's self-synchronizing (you can find character boundaries from any position), and it wastes no space on English text. Over 98% of the web uses UTF-8.

UTF-16 and UTF-32

UTF-16 uses 2 or 4 bytes per character. It's used internally by Java, JavaScript, and Windows. Characters outside the Basic Multilingual Plane need surrogate pairs (two 16-bit units).

UTF-32 uses exactly 4 bytes per character. Simple indexing but wasteful — rarely used for storage or transmission.
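To make surrogate pairs concrete, here is a small sketch that splits a code point above U+FFFF into its two 16-bit units and compares the result with Python's own UTF-16 encoder. The helper name to_surrogates is hypothetical.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is inside the BMP or out of range")
    offset = code_point - 0x10000          # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
utf16_bytes = "😀".encode("utf-16-be")     # big-endian, no BOM
utf32_bytes = "😀".encode("utf-32-be")     # always 4 bytes per character
```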

Byte Order Mark (BOM)

A special code point (U+FEFF) placed at the start of a file to indicate encoding and byte order. Essential for UTF-16 (which needs to signal big vs. little endian), but optional and generally discouraged for UTF-8.
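A short sketch of how the BOM shows up in practice, using Python's standard codecs. The "utf-8-sig" codec is Python's name for UTF-8 with a BOM; other languages expose this differently.

```python
import codecs

text = "hi"

utf16 = text.encode("utf-16")         # Python prepends a BOM in native byte order
utf8_plain = text.encode("utf-8")     # no BOM
utf8_sig = text.encode("utf-8-sig")   # UTF-8 with the optional BOM (EF BB BF)

# Decoding with plain "utf-8" keeps the BOM as a stray U+FEFF character,
# which is one reason a UTF-8 BOM is generally discouraged.
leaked = utf8_sig.decode("utf-8")
clean = utf8_sig.decode("utf-8-sig")
```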

Mojibake

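Mojibake happens when bytes are decoded with the wrong encoding. Example: the UTF-8 bytes for "café" (63 61 66 C3 A9) decoded as Latin-1 produce "cafÃ©", because the two bytes C3 A9 are read as two separate characters. Always declare your encoding explicitly, for example with a Content-Type: text/html; charset=utf-8 header or a meta charset tag.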
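The example can be reproduced in a few lines. This sketch also shows that this particular kind of mojibake is reversible as long as no bytes were lost along the way.

```python
raw = "café".encode("utf-8")            # bytes 63 61 66 c3 a9

# Wrong decoder: each of the two bytes c3 a9 becomes its own Latin-1 character.
garbled = raw.decode("latin-1")

# Undo the mistake: re-encode with the wrong codec, decode with the right one.
repaired = garbled.encode("latin-1").decode("utf-8")
```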

02 / Number Encoding

Integers, Floats & Endianness

Integer Representation: Two's Complement

Modern CPUs represent signed integers using two's complement. The most significant bit is the sign bit (0 = positive, 1 = negative). To negate a number: flip all bits, then add 1.

 5 in 8-bit:  00000101
-5 in 8-bit:  11111011  (flip → 11111010, +1 → 11111011)

This gives a range of -128 to +127 for 8-bit, and -2,147,483,648 to +2,147,483,647 for 32-bit. The beauty: addition and subtraction work identically for signed and unsigned values — no special hardware needed.
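A minimal sketch of the representation, assuming fixed-width integers are simulated with Python's arbitrary-precision ints and a bit mask. The function names are hypothetical.

```python
def to_twos_complement(value: int, bits: int = 8) -> str:
    """Return the two's complement bit pattern of value as a string."""
    lowest, highest = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lowest <= value <= highest:
        raise OverflowError(f"{value} does not fit in {bits} bits")
    # Masking a negative int yields its two's complement pattern.
    return format(value & ((1 << bits) - 1), f"0{bits}b")


def from_twos_complement(pattern: str) -> int:
    """Interpret a bit string as a signed two's complement integer."""
    raw = int(pattern, 2)
    # A leading 1 means the value is negative: subtract 2**bits.
    return raw - (1 << len(pattern)) if pattern[0] == "1" else raw


# "Flip all bits, then add 1", restricted to 8 bits.
negated = (~5 + 1) & 0xFF
```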

IEEE 754 Floating Point

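Floats store real numbers as sign × mantissa × 2^exponent. A 64-bit double has 1 sign bit, 11 exponent bits, and 52 mantissa bits. The exponent is stored with a bias of 1023, and normal numbers carry an implicit leading 1, so the effective precision is 53 bits.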

IEEE 754 Double (64-bit) Layout: Sign (1 bit) | Exponent (11 bits) | Mantissa (52 bits)
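The three fields can be pulled out of a real double with the struct module. This is a sketch; the helper name double_fields is hypothetical, and the exponent it returns is the raw stored value, which carries a bias of 1023.

```python
import struct


def double_fields(x: float) -> tuple:
    """Return (sign, stored_exponent, mantissa) of a 64-bit IEEE 754 double."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                      # 1 bit
    exponent = (bits >> 52) & 0x7FF        # 11 bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)      # 52 bits, implicit leading 1 not stored
    return sign, exponent, mantissa


one = double_fields(1.0)
minus_two = double_fields(-2.0)
tenth = double_fields(0.1)
```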

Why 0.1 + 0.2 != 0.3

0.1 in binary is a repeating fraction (like 1/3 in decimal). It gets rounded to the nearest representable value. When you add two rounded values, the rounding errors compound: 0.1 + 0.2 = 0.30000000000000004. For exact decimal math, use a decimal type such as Java's BigDecimal, or store money as integer cents.
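A short sketch of the problem and the two workarounds. Python's decimal.Decimal stands in here for a decimal library; it is an analogue chosen for this example.

```python
import math
from decimal import Decimal

total = 0.1 + 0.2
is_equal = total == 0.3                 # False: both sides carry rounding error

# Compare floats with a tolerance instead of ==.
is_close = math.isclose(total, 0.3)

# Exact decimal arithmetic: build Decimals from strings, not from floats.
decimal_sum = Decimal("0.1") + Decimal("0.2")

# Integer cents: 10 cents + 20 cents is exactly 30 cents.
cents = 10 + 20
```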

Endianness

Multi-byte values can be stored with the most significant byte first (big-endian, aka network byte order) or least significant byte first (little-endian, used by x86 and, in practice, most ARM systems). The integer 0x01020304:

Big-endian:    01 02 03 04
Little-endian: 04 03 02 01

Network protocols (TCP/IP) use big-endian. Most modern CPUs use little-endian. Serialization formats must specify or detect byte order to avoid silent data corruption.
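The sketch below writes the same integer in both byte orders and shows what happens when a reader assumes the wrong one. It uses only int.to_bytes, int.from_bytes, and struct from the standard library.

```python
import struct

value = 0x01020304

big = value.to_bytes(4, "big")          # most significant byte first
little = value.to_bytes(4, "little")    # least significant byte first

# "!" in a struct format string means network (big-endian) byte order.
network = struct.pack("!I", value)

# Reading big-endian bytes as little-endian silently yields a different number.
misread = int.from_bytes(big, "little")
```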

03 / Base Encoding

Base64, Base32 & Hex

Base encodings represent arbitrary binary data using a restricted alphabet of printable characters. They're essential when binary data must travel through text-only channels (email, JSON, URLs).

Base64

Encodes 3 bytes (24 bits) into 4 characters from a 64-character alphabet (A-Z, a-z, 0-9, +, /). Adds ~33% overhead. Padding with = aligns the output to a multiple of 4 characters.

Base64 Encoding Process: 3 bytes input → split into 4 × 6-bit groups → map each group to a Base64 character

"Man" → TWFu

M = 77 = 01001101
a = 97 = 01100001
n = 110 = 01101110

010011 010110 000101 101110
  T      W      F      u
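The worked example can be reproduced by hand. This is a sketch for one full 3-byte block only; padding is deliberately left out, and encode_block is a hypothetical helper name.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_block(block: bytes) -> str:
    """Encode exactly 3 bytes as 4 Base64 characters (no padding handling)."""
    if len(block) != 3:
        raise ValueError("this sketch only handles full 3-byte blocks")
    bits = int.from_bytes(block, "big")            # 24 bits as one integer
    # Take four 6-bit groups, most significant first, and look each one up.
    return "".join(ALPHABET[(bits >> shift) & 0x3F] for shift in (18, 12, 6, 0))


by_hand = encode_block(b"Man")
by_library = base64.b64encode(b"Man").decode("ascii")
```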

URL-Safe Base64

Standard Base64 uses + and / which have special meaning in URLs. URL-safe Base64 replaces them with - and _. Some implementations also omit padding. Always know which variant you're using.
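A small sketch comparing the two variants on input chosen so that the standard alphabet produces both + and /. The helper decode_unpadded is hypothetical and shows one way to restore stripped padding before decoding.

```python
import base64

data = b"\xfb\xff"   # chosen so the standard encoding contains '+' and '/'

standard = base64.b64encode(data).decode("ascii")
url_safe = base64.urlsafe_b64encode(data).decode("ascii")
unpadded = url_safe.rstrip("=")          # some implementations drop the padding


def decode_unpadded(text: str) -> bytes:
    """Decode URL-safe Base64 whose '=' padding may have been stripped."""
    padding = "=" * (-len(text) % 4)     # restore length to a multiple of 4
    return base64.urlsafe_b64decode(text + padding)


round_trip = decode_unpadded(unpadded)
```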

Base32 & Hex

Base32 uses 32 characters (A-Z, 2-7). Case-insensitive, and the alphabet omits the digits 0 and 1, which are easily confused with O and I. Used for TOTP shared secrets and some file systems. 60% overhead (5 bytes become 8 characters).

Hex (Base16) maps each byte to two characters (0-9, a-f). 100% overhead but trivial to read and debug. Used for hashes, colors, and memory addresses.

Encoding	Alphabet Size	Overhead	Use Case
Hex	16	100%	Hashes, debugging
Base32	32	60%	TOTP secrets, case-insensitive contexts
Base64	64	33%	Email attachments, data URIs, JWTs
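The overhead column can be measured rather than taken on faith. In this sketch the payload length of 240 bytes is an arbitrary choice that divides evenly by 3 and 5, so neither Base64 nor Base32 needs padding.

```python
import base64

payload = bytes(range(240))   # 240 bytes: divisible by 3 and by 5, so no padding

sizes = {
    "hex": len(payload.hex()),
    "base32": len(base64.b32encode(payload)),
    "base64": len(base64.b64encode(payload)),
}

# Overhead as a rounded percentage of the original size.
overhead = {
    name: round((size / len(payload) - 1) * 100) for name, size in sizes.items()
}
```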

04 / Serialization Formats

Turning Objects into Bytes

Serialization converts in-memory data structures into a format that can be stored or transmitted. The choice of format affects performance, interoperability, and developer experience.

JSON

Human-readable, text-based. The lingua franca of web APIs. No schema, no comments, and limited types (a single number type with no distinction between integers and floats, and no date type). Verbose.
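The missing date type shows up as soon as a timestamp needs to be serialized. A minimal sketch, assuming ISO 8601 strings are the agreed convention between producer and consumer; the event dictionary is made-up sample data.

```python
import json
from datetime import datetime, timezone

# Hypothetical sample record.
event = {"id": 7, "at": datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)}

# The standard encoder has no JSON type to map a datetime onto.
try:
    json.dumps(event)
    native_ok = True
except TypeError:
    native_ok = False

# Convention: carry the timestamp as an ISO 8601 string.
encoded = json.dumps({**event, "at": event["at"].isoformat()})
decoded = json.loads(encoded)

# The reader gets a plain string back and must know to parse it.
restored = datetime.fromisoformat(decoded["at"])
```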

XML

Verbose but powerful — supports namespaces, schemas (XSD), and transformations (XSLT). Still dominant in enterprise/SOAP. Large overhead.

Protocol Buffers

Google's binary format. Schema-defined (.proto files), compact, fast. Supports schema evolution. Not human-readable.

Avro

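Schema travels with the data: Avro container files store the writer's schema in the file header, and Kafka setups commonly reference it through a schema registry. Popular in Hadoop/Kafka. Schema evolution built-in. Good for data pipelines and long-term storage.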

MessagePack

Binary JSON — same data model, smaller and faster. No schema. Drop-in replacement where JSON is too slow or large.

CBOR

Concise Binary Object Representation (RFC 8949). Like MessagePack but IETF-standardized. Supports tags for dates, bigints, etc. Used in WebAuthn and COSE.

Protocol Buffers Deep Dive

Protobuf assigns each field a numeric tag. The wire format encodes tag + type + value, skipping unset fields entirely. This makes it compact and allows schema evolution.

// person.proto
syntax = "proto3";

message Person {
  string name = 1;    // field tag 1
  int32  age  = 2;    // field tag 2
  string email = 3;   // field tag 3
}
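To see what "tag + type + value" looks like in bytes, the sketch below hand-encodes a Person with name "Al" and age 30. It is an illustration of the wire format, not the protobuf library, and it covers only the two wire types this message needs. All function names are hypothetical.

```python
VARINT, LEN = 0, 2   # wire types used by int32 and string fields


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("negative values are not handled in this sketch")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # more bytes follow: set the high bit
        else:
            out.append(byte)
            return bytes(out)


def key(field_number: int, wire_type: int) -> bytes:
    """Field key: the field number shifted left by 3, OR-ed with the wire type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_string(field_number: int, text: str) -> bytes:
    data = text.encode("utf-8")
    return key(field_number, LEN) + encode_varint(len(data)) + data


def encode_int32(field_number: int, value: int) -> bytes:
    return key(field_number, VARINT) + encode_varint(value)


# email (field 3) is unset, so nothing is written for it.
message = encode_string(1, "Al") + encode_int32(2, 30)
```

The result is six bytes: 0a 02 41 6c for the name and 10 1e for the age.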

Property	JSON	Protobuf	Avro	MessagePack
Format	Text	Binary	Binary	Binary
Schema	None	.proto file	Embedded	None
Human readable	Yes	No	No	No
Size	Large	Small	Small	Medium
Speed	Slow	Fast	Fast	Fast
Schema evolution	Ad hoc	Strong	Strong	Ad hoc

05 / Schema Evolution

Changing Formats Without Breaking Things

In distributed systems, producers and consumers of data are updated independently. Schema evolution rules ensure old readers can handle new data and vice versa.

Compatibility Types

Type	Meaning	Safe Operations
Backward compatible	New code reads old data	Add fields that have defaults, remove fields
Forward compatible	Old code reads new data	Add fields (old code ignores unknowns), remove fields that have defaults
Fully compatible	Both directions work	Add or remove only fields that have defaults

Golden Rules

Never reuse field tags/numbers. Once a field is removed, its tag is retired forever. Never change a field's type. Always provide defaults for new fields. In Protobuf, every proto3 field is implicitly optional with a zero-value default.

Evolution in Practice

// v1: original schema
message Event {
  string id = 1;
  string type = 2;
}

// v2: backward + forward compatible
message Event {
  string id = 1;
  string type = 2;
  int64 timestamp = 3;   // NEW: old readers ignore it
  reserved 4;            // tag of a previously removed field, never reuse it
}

Schema Evolution Flow: Producer v2 → message with new fields → Consumer v1 (ignores unknown tags)
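The flow can be simulated with a hand-rolled reader that knows only the v1 field numbers and skips everything else. This is a sketch of the idea, not the protobuf library: it handles only varint and length-delimited fields, and every name in it is hypothetical.

```python
def encode_varint(value: int) -> bytes:
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint(buf: bytes, pos: int) -> tuple:
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_fields(buf: bytes):
    """Yield (field_number, value) for every field in the message."""
    pos = 0
    while pos < len(buf):
        field_key, pos = decode_varint(buf, pos)
        number, wire_type = field_key >> 3, field_key & 0x7
        if wire_type == 0:                      # varint
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:                    # length-delimited
            length, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError("wire type not handled in this sketch")
        yield number, value


V1_FIELDS = {1: "id", 2: "type"}                # all that a v1 consumer knows


def read_as_v1(buf: bytes) -> dict:
    event = {"id": "", "type": ""}
    for number, value in read_fields(buf):
        name = V1_FIELDS.get(number)
        if name is not None:
            event[name] = value.decode("utf-8")
        # Unknown field numbers (such as 3 = timestamp) are skipped.
    return event


def string_field(number: int, text: str) -> bytes:
    data = text.encode("utf-8")
    return encode_varint((number << 3) | 2) + encode_varint(len(data)) + data


# A v2 producer writes id, type, and the new timestamp field.
v2_message = (
    string_field(1, "e1")
    + string_field(2, "click")
    + encode_varint((3 << 3) | 0) + encode_varint(1700000000)
)
v1_view = read_as_v1(v2_message)
```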

06 / Compression

Making Data Smaller

Compression trades CPU time for smaller data. The right algorithm depends on whether you need speed (real-time streaming) or maximum size reduction (archival storage).

Algorithm	Ratio	Speed	Best For
gzip / deflate	Good	Moderate	HTTP responses, general files. Universal support.
Brotli	Excellent	Slow compress, fast decompress	Static web assets (pre-compressed). 15-25% smaller than gzip.
zstd	Excellent	Fast both ways	Databases, logs, real-time pipelines. Trainable dictionaries.
LZ4	Moderate	Extremely fast	In-memory caching, real-time streaming. Prioritizes speed.
Snappy	Moderate	Very fast	Developed at Google. Used in Bigtable and Hadoop. Favors speed over ratio.

Decision Framework

Maximum compression: Brotli (static assets) or zstd (dynamic data).
Lowest latency: LZ4 or Snappy — decompress at memory-copy speeds.
Universal compatibility: gzip — supported everywhere, good enough ratio.
Best overall: zstd — near-Brotli ratios at near-LZ4 speeds. Increasingly the default choice.
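Brotli, zstd, LZ4, and Snappy need third-party packages in Python, so the sketch below uses only the standard library's zlib (DEFLATE, the algorithm behind gzip) at three compression levels as a stand-in for the same speed versus ratio trade-off. The sample data is synthetic log text, and timings will vary by machine.

```python
import time
import zlib

# Synthetic, repetitive log lines (made-up sample data).
data = b"".join(
    b"2024-01-01 INFO request id=%d status=200\n" % i for i in range(20_000)
)

sizes = {}
for level in (1, 6, 9):                      # fastest, default, smallest
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert zlib.decompress(packed) == data   # compression is lossless
    sizes[level] = len(packed)
    print(f"level {level}: {len(packed):>7} bytes in {elapsed_ms:.1f} ms")

print(f"original: {len(data):>7} bytes")
```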

HTTP Compression

Browsers and servers negotiate compression via HTTP headers. The client advertises what it supports; the server picks the best match.

HTTP Compression Negotiation: client sends Accept-Encoding: gzip, br, zstd → server picks best → response carries Content-Encoding: br

# Request
GET /api/data HTTP/1.1
Accept-Encoding: gzip, deflate, br

# Response
HTTP/1.1 200 OK
Content-Encoding: br
Vary: Accept-Encoding
Content-Type: application/json

[compressed body]
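A simplified sketch of the server side of this negotiation. The preference order in SERVER_PREFERENCE is an assumption for illustration, and the parser handles only plain names and q-values; the "*" wildcard and other details of the full HTTP grammar are ignored.

```python
# Hypothetical server preference, best first.
SERVER_PREFERENCE = ["br", "zstd", "gzip"]


def parse_accept_encoding(header: str) -> dict:
    """Map each encoding named in the header to its q-value (default 1.0)."""
    weights = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0            # treat a malformed weight as "not acceptable"
        weights[name.strip().lower()] = quality
    return weights


def choose_encoding(header: str):
    """Return the Content-Encoding to use, or None to send the body uncompressed."""
    weights = parse_accept_encoding(header)
    acceptable = [enc for enc in SERVER_PREFERENCE if weights.get(enc, 0.0) > 0]
    if not acceptable:
        return None
    # Highest client weight wins; ties fall back to server preference order,
    # because max() returns the first of several equal maxima.
    return max(acceptable, key=lambda enc: weights[enc])


chosen = choose_encoding("gzip, deflate, br")
```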

Don't Double-Compress

Already-compressed formats (JPEG, PNG, MP4, ZIP) don't benefit from HTTP compression — they may even get larger. Configure your server to skip compression for these MIME types. Also avoid compressing small responses (<1KB) where the overhead exceeds the savings.
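Both effects are easy to observe. In this sketch, seeded pseudo-random bytes stand in for an already-compressed file (an assumption: real JPEG or MP4 data behaves similarly because it is also high-entropy), and a tiny JSON body stands in for a small response.

```python
import random
import zlib

rng = random.Random(0)   # fixed seed so the run is repeatable

# High-entropy bytes as a stand-in for already-compressed media.
high_entropy = bytes(rng.getrandbits(8) for _ in range(10_000))
text = b"hello world " * 1000
tiny = b'{"ok":true}'

for label, body in (("high entropy", high_entropy), ("text", text), ("tiny", tiny)):
    packed = zlib.compress(body)
    print(f"{label:>12}: {len(body):>6} -> {len(packed):>6} bytes")

grew_high_entropy = len(zlib.compress(high_entropy)) > len(high_entropy)
grew_tiny = len(zlib.compress(tiny)) > len(tiny)
shrank_text = len(zlib.compress(text)) < len(text)
```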

Test Yourself

Question 01

How many bytes does UTF-8 use to encode the emoji character U+1F600?

U+1F600 falls in the range U+10000 — U+10FFFF, which requires 4 bytes in UTF-8. Only code points up to U+007F fit in 1 byte (ASCII range).

Question 02

Why does 0.1 + 0.2 not equal 0.3 in IEEE 754 floating point?

0.1 in binary is a repeating fraction (like 1/3 in decimal). It must be rounded to fit the 52-bit mantissa. When two rounded values are added, their rounding errors compound, producing 0.30000000000000004 instead of 0.3.

Question 03

What is the approximate size overhead of Base64 encoding?

Base64 encodes 3 bytes (24 bits) into 4 characters (each representing 6 bits). So 3 bytes become 4 bytes — a 33% increase. Hex encoding, by contrast, has 100% overhead (1 byte becomes 2 characters).

Question 04

Which serialization format embeds the schema within the data itself?

Avro embeds the writer's schema with the data, so readers can always resolve differences between the writer's schema and their own. Protobuf uses separate .proto files, and JSON/MessagePack have no schema at all.

Question 05

In Protocol Buffers, why should you never reuse a retired field number?

Protobuf identifies fields by their numeric tag on the wire. If you reuse tag 4 for a new string field that was previously an int32, old messages still in storage or in-flight will have their int32 data decoded as a string, causing silent data corruption.

Question 06

Which compression algorithm offers the best balance of compression ratio and speed for dynamic data?

zstd achieves compression ratios close to Brotli while maintaining speeds close to LZ4. It also supports trainable dictionaries for small data. Brotli compresses better but is much slower for dynamic compression. LZ4 is faster but compresses less.

Question 07

What HTTP header does a client send to indicate which compression algorithms it supports?

The client sends Accept-Encoding: gzip, br, zstd to tell the server what it can decompress. The server responds with Content-Encoding: br (or whichever it chose). Transfer-Encoding is for hop-by-hop encoding like chunked, not end-to-end compression.

Question 08

What does "backward compatible" mean in the context of schema evolution?

Backward compatibility means new (updated) consumers can read data produced by old producers. Forward compatibility is the reverse — old consumers can read data from new producers. Full compatibility means both directions work.

Question 09

Which encoding is backward compatible with ASCII?

UTF-8 encodes ASCII characters (U+0000 to U+007F) as a single byte with the same value as ASCII. Any valid ASCII document is already valid UTF-8. UTF-16 and UTF-32 use different byte widths, so they are not byte-compatible with ASCII.

Question 10

In a little-endian system, how is the 32-bit integer 0x01020304 stored in memory?

Little-endian stores the least significant byte first. So 0x01020304 becomes 04 03 02 01 in memory. Big-endian (network byte order) would store it as 01 02 03 04.