Encoding, Serialization & Compression

How data is represented, transformed, and shrunk — from character sets and number formats to wire protocols and compression algorithms.

01 / Character Encoding

From ASCII to UTF-8

Every character you see on screen is stored as a number. The mapping from numbers to characters is called a character encoding. Getting it wrong produces mojibake: garbled text like Ã© instead of é.

ASCII

The original 7-bit encoding. Maps 128 characters (0-127): English letters, digits, punctuation, and control characters. Every byte's top bit is 0. Simple, but only covers English.

Unicode

A universal character set assigning a unique code point (e.g., U+0041 = A, U+1F600 = 😀) to every character in every script. Unicode defines what characters exist; encodings define how to store them as bytes.

UTF-8

The dominant encoding on the web. Uses 1 to 4 bytes per character with a clever variable-length scheme:

| Byte Count | Code Point Range | Example |
|---|---|---|
| 1 byte | U+0000 to U+007F | A, z, 9 (ASCII range) |
| 2 bytes | U+0080 to U+07FF | é, ñ, Cyrillic |
| 3 bytes | U+0800 to U+FFFF | Chinese, Japanese, Korean |
| 4 bytes | U+10000 to U+10FFFF | Emoji, rare scripts |

Why UTF-8 Won
UTF-8 is backward compatible with ASCII — any valid ASCII file is already valid UTF-8. It's self-synchronizing (you can find character boundaries from any position), and it wastes no space on English text. Over 98% of the web uses UTF-8.
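
To make the byte-count table concrete, here is a small Python sketch that encodes one sample character from each length class. The helper names (`utf8_summary`, `is_continuation_byte`) and the choice of sample characters are illustrative assumptions, not part of any standard API.

```python
# One sample character per UTF-8 length class, written as escapes so the
# result does not depend on how this source file itself is encoded.
SAMPLES = ["A", "\u00e9", "\u4e2d", "\U0001F600"]  # A, e-acute, a CJK ideograph, an emoji


def utf8_summary(ch: str) -> tuple[str, int, str]:
    """Return (code point label, byte count, hex bytes) for one character."""
    encoded = ch.encode("utf-8")
    return (f"U+{ord(ch):04X}", len(encoded), encoded.hex(" "))


def is_continuation_byte(b: int) -> bool:
    """Continuation bytes always look like 10xxxxxx, which is what lets a
    decoder find the next character boundary from any position."""
    return (b & 0xC0) == 0x80


for ch in SAMPLES:
    label, size, hex_bytes = utf8_summary(ch)
    print(f"{label}: {size} byte(s): {hex_bytes}")
```

The single ASCII sample encodes to the same byte value it has in ASCII, which is the backward compatibility described above.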

UTF-16 and UTF-32

UTF-16 uses 2 or 4 bytes per character. It's used internally by Java, JavaScript, and Windows. Characters outside the Basic Multilingual Plane need surrogate pairs (two 16-bit units).

UTF-32 uses exactly 4 bytes per character. Simple indexing but wasteful — rarely used for storage or transmission.
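
The surrogate-pair arithmetic can be sketched in a few lines of Python. The function name `to_surrogate_pair` is a hypothetical helper for illustration; the final assertion cross-checks it against Python's built-in `utf-16-be` codec.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
expected = "\U0001F600".encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

This is also why string lengths in UTF-16 based runtimes count an emoji as two units rather than one character.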

Byte Order Mark (BOM)

A special code point (U+FEFF) placed at the start of a file to indicate encoding and byte order. Essential for UTF-16 (which needs to signal big vs. little endian), but optional and generally discouraged for UTF-8.
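
A quick way to see the BOM in practice is to compare Python's codecs. This sketch uses only the standard `codecs` module constants and the built-in `utf-16`, `utf-8` and `utf-8-sig` codecs; the sample text is arbitrary.

```python
import codecs

text = "hi"

# The generic "utf-16" codec prepends a BOM in the machine's native byte order.
utf16 = text.encode("utf-16")
assert utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# "utf-8-sig" writes the 3-byte UTF-8 BOM; plain "utf-8" never does.
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")
assert with_bom == codecs.BOM_UTF8 + without_bom

# Decoding with plain "utf-8" leaves U+FEFF in the string,
# while "utf-8-sig" strips it if present.
leaked = with_bom.decode("utf-8")
clean = with_bom.decode("utf-8-sig")
print(repr(leaked), repr(clean))
```

The leaked U+FEFF is invisible in most editors, which is one reason a UTF-8 BOM tends to cause confusing bugs downstream.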

Mojibake
Mojibake happens when bytes are decoded with the wrong encoding. Example: the UTF-8 bytes for "café" (63 61 66 C3 A9) decoded as Latin-1 produce "cafÃ©". Always declare your encoding explicitly via HTTP headers or meta tags.
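
The five-byte example above can be reproduced directly. This is a minimal sketch using Python's built-in codecs; the variable names are arbitrary.

```python
original = "caf\u00e9"                 # the word with an accented final e
raw = original.encode("utf-8")         # 63 61 66 c3 a9
garbled = raw.decode("latin-1")        # wrong decoder: every byte becomes one character
print(raw.hex(" "), "->", garbled)

# The damage is reversible as long as no bytes were lost or replaced:
# re-encode with the wrong codec, then decode with the right one.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

The repair only works because Latin-1 maps every byte value to a character; with other mismatched encodings some bytes may already have been replaced and cannot be recovered.
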

02 / Number Encoding

Integers, Floats & Endianness

Integer Representation: Two's Complement

Modern CPUs represent signed integers using two's complement. The most significant bit acts as the sign bit (0 = zero or positive, 1 = negative). To negate a number: flip all bits, then add 1.

 5 in 8-bit:  00000101
-5 in 8-bit:  11111011  (flip → 11111010, +1 → 11111011)

This gives a range of -128 to +127 for 8-bit, and -2,147,483,648 to +2,147,483,647 for 32-bit. The beauty: addition and subtraction work identically for signed and unsigned values — no special hardware needed.
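
The flip-and-add-one rule is equivalent to masking the value to the chosen width, which is how this Python sketch computes it. Python integers are unbounded, so the bit width is passed explicitly; `to_twos_complement` and `from_twos_complement` are hypothetical helper names.

```python
def to_twos_complement(value: int, bits: int = 8) -> str:
    """Return the two's complement bit pattern of value at the given width."""
    lowest, highest = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lowest <= value <= highest:
        raise OverflowError(f"{value} does not fit in {bits} bits")
    return format(value & ((1 << bits) - 1), f"0{bits}b")


def from_twos_complement(pattern: str) -> int:
    """Interpret a bit string as a signed two's complement integer."""
    bits = len(pattern)
    raw = int(pattern, 2)
    # If the top bit is set, the pattern represents raw - 2**bits.
    return raw - (1 << bits) if raw & (1 << (bits - 1)) else raw


print(to_twos_complement(5))     # 00000101
print(to_twos_complement(-5))    # 11111011

# Signed and unsigned addition share the same bit-level operation:
# 5 + (-5) as raw 8-bit patterns wraps around to zero.
wrapped_sum = (0b00000101 + 0b11111011) & 0xFF
```
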

IEEE 754 Floating Point

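Floats store real numbers as sign × mantissa × 2^exponent. A 64-bit double has 1 sign bit, 11 exponent bits (stored with a bias of 1023), and 52 mantissa bits (normal numbers also carry an implicit leading 1, giving 53 bits of precision).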

IEEE 754 Double (64-bit) Layout: Sign (1 bit) | Exponent (11 bits) | Mantissa (52 bits)

Why 0.1 + 0.2 != 0.3
0.1 in binary is a repeating fraction (like 1/3 in decimal). It gets rounded to the nearest representable value. When you add two rounded values, the rounding errors compound: 0.1 + 0.2 = 0.30000000000000004. For exact decimal math, use a decimal type such as Java's BigDecimal, or store amounts as integers (for example, money as cents).
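
The following Python sketch pulls a double apart into the three fields of the layout and shows two common workarounds for the rounding problem. `double_fields` is a hypothetical helper; `struct`, `math.isclose` and `decimal.Decimal` are standard library.

```python
import math
import struct
from decimal import Decimal


def double_fields(x: float) -> tuple[int, int, int]:
    """Split a double into (sign, biased exponent, 52-bit mantissa field)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


print(double_fields(1.0))      # exponent field is the bias itself: 1023
print(0.1 + 0.2)               # 0.30000000000000004
print(Decimal(0.1))            # the exact value the double actually stores

# Workaround 1: compare with a tolerance instead of ==.
close_enough = math.isclose(0.1 + 0.2, 0.3)

# Workaround 2: do the arithmetic in decimal (or in integer cents).
exact_sum = Decimal("0.1") + Decimal("0.2")
cents = 10 + 20
```

Note that `Decimal` is built from strings here; constructing it from the float `0.1` would carry the binary rounding error along.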

Endianness

Multi-byte values can be stored with the most significant byte first (big-endian, aka network byte order) or least significant byte first (little-endian, used by x86 and by ARM in its usual configuration). The integer 0x01020304:

Big-endian:    01 02 03 04
Little-endian: 04 03 02 01

Network protocols (TCP/IP) use big-endian. Most modern CPUs use little-endian. Serialization formats must specify or detect byte order to avoid silent data corruption.
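
In Python, the `struct` module makes byte order explicit through the `>` (big-endian) and `<` (little-endian) prefixes. The sketch below packs the same integer both ways and shows what happens when a reader assumes the wrong order; the variable names are arbitrary.

```python
import struct
import sys

value = 0x01020304

big = struct.pack(">I", value)      # network byte order
little = struct.pack("<I", value)   # x86-style byte order
print(big.hex(" "))                 # 01 02 03 04
print(little.hex(" "))              # 04 03 02 01
print("this machine is", sys.byteorder, "endian")

# A reader that assumes the wrong order gets a valid but wrong number,
# which is why the corruption is silent.
(misread,) = struct.unpack("<I", big)
print(hex(misread))
```
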

03 / Base Encoding

Base64, Base32 & Hex

Base encodings represent arbitrary binary data using a restricted alphabet of printable characters. They're essential when binary data must travel through text-only channels (email, JSON, URLs).

Base64

Encodes 3 bytes (24 bits) into 4 characters from a 64-character alphabet (A-Z, a-z, 0-9, +, /). Adds ~33% overhead. Padding with = aligns the output to a multiple of 4 characters.

Base64 Encoding Process: 3 bytes input → split into 4 × 6-bit groups → map each group to a Base64 character

"Man" → TWFu

M = 77 = 01001101
a = 97 = 01100001
n = 110 = 01101110

010011 010110 000101 101110
  T      W      F      u
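
The same steps can be written out by hand and checked against the standard library. `encode_group` is a hypothetical helper that handles exactly one 3-byte group; padding for shorter input is left to `base64.b64encode`.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_group(three_bytes: bytes) -> str:
    """Encode exactly 3 bytes into 4 Base64 characters."""
    if len(three_bytes) != 3:
        raise ValueError("this sketch only handles full 3-byte groups")
    n = int.from_bytes(three_bytes, "big")                       # one 24-bit integer
    indices = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]  # four 6-bit groups
    return "".join(ALPHABET[i] for i in indices)


print(encode_group(b"Man"))          # TWFu

# Shorter input is padded with "=" so the output length stays a multiple of 4.
print(base64.b64encode(b"Ma"))       # b'TWE='
print(base64.b64encode(b"M"))        # b'TQ=='
```
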

URL-Safe Base64
Standard Base64 uses + and / which have special meaning in URLs. URL-safe Base64 replaces them with - and _. Some implementations also omit padding. Always know which variant you're using.
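
The difference between the two alphabets shows up only for certain byte values. In this sketch the input bytes are chosen so that every output character is one of the two that differ; `decode_unpadded` is a hypothetical helper showing one way to accept input whose padding was stripped.

```python
import base64

data = bytes([0xFB, 0xFF, 0xFE])     # chosen to hit alphabet indices 62 and 63

standard = base64.b64encode(data)
url_safe = base64.urlsafe_b64encode(data)
print(standard, url_safe)            # b'+//+' b'-__-'


def decode_unpadded(text: str) -> bytes:
    """Decode URL-safe Base64 whose trailing '=' padding was omitted."""
    padding = "=" * (-len(text) % 4)  # restore length to a multiple of 4
    return base64.urlsafe_b64decode(text + padding)


print(decode_unpadded("TQ"))         # b'M'
```
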

Base32 & Hex

Base32 uses 32 characters (A-Z, 2-7). Case-insensitive and avoids ambiguous characters. Used for TOTP shared secrets and in some file systems. 60% overhead (every 5 bytes become 8 characters).

Hex (Base16) maps each byte to two characters (0-9, a-f). 100% overhead but trivial to read and debug. Used for hashes, colors, and memory addresses.
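
The overhead figures follow directly from the output lengths. This sketch measures them for a 15-byte input, chosen (as an assumption for clean numbers) because it is a multiple of both 3 and 5, so neither Base64 nor Base32 needs padding.

```python
import base64


def overhead_percent(encoded_len: int, raw_len: int) -> float:
    """Extra size of the encoded form relative to the raw bytes."""
    return (encoded_len - raw_len) / raw_len * 100


raw = bytes(range(15))
sizes = {
    "hex": len(raw.hex()),
    "base32": len(base64.b32encode(raw)),
    "base64": len(base64.b64encode(raw)),
}

for name, size in sizes.items():
    print(f"{name}: {size} chars, {overhead_percent(size, len(raw)):.0f}% overhead")
```

For inputs that are not a multiple of the group size, padding pushes the overhead slightly above these figures.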

| Encoding | Alphabet Size | Overhead | Use Case |
|---|---|---|---|
| Hex | 16 | 100% | Hashes, debugging |
| Base32 | 32 | 60% | TOTP, case-insensitive contexts |
| Base64 | 64 | 33% | Email attachments, data URIs, JWTs |

04 / Serialization Formats

Turning Objects into Bytes

Serialization converts in-memory data structures into a format that can be stored or transmitted. The choice of format affects performance, interoperability, and developer experience.

JSON

Human-readable, text-based. The lingua franca of web APIs. No built-in schema, no comments, limited types (a single number type with no integer/float distinction, no dates). Relatively verbose.

XML

Verbose but powerful: supports namespaces, schemas (XSD), and transformations (XSLT). Still common in enterprise systems and SOAP services. Large overhead.

Protocol Buffers

Google's binary format. Schema-defined (.proto files), compact, fast. Supports schema evolution. Not human-readable.

Avro

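Schema travels with the data: Avro container files store the writer's schema in the file header, while Kafka deployments commonly send a schema ID that is resolved through a schema registry. Popular in Hadoop/Kafka. Schema evolution built-in. Good for data pipelines and long-term storage.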

MessagePack

Binary JSON — same data model, smaller and faster. No schema. Drop-in replacement where JSON is too slow or large.

CBOR

Concise Binary Object Representation (RFC 8949). Like MessagePack but IETF-standardized. Supports tags for dates, bigints, etc. Used in WebAuthn and COSE.
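
JSON's limited type system, mentioned above, is easy to run into with dates. The sketch below uses only Python's standard `json` module; the `event` record and the `encode_default` hook are hypothetical, and encoding dates as ISO 8601 strings is a common convention rather than part of JSON itself.

```python
import json
from datetime import datetime, timezone

event = {
    "id": 7,
    "at": datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
}

# JSON has no date type, so the standard encoder rejects datetime objects.
try:
    json.dumps(event)
    rejected = False
except TypeError:
    rejected = True


def encode_default(obj):
    """Fallback hook: represent datetimes as ISO 8601 strings."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"cannot serialize {type(obj).__name__}")


text = json.dumps(event, default=encode_default)
decoded = json.loads(text)

# The type information is gone: the reader gets a plain string and has to
# know, out of band, that this field should be parsed as a date.
restored = datetime.fromisoformat(decoded["at"])
print(text)
```

Schema-based formats and CBOR's tags exist partly to carry this kind of type information in the data or its schema instead of in a convention.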

Protocol Buffers Deep Dive

Protobuf assigns each field a numeric tag. The wire format encodes tag + type + value, skipping unset fields entirely. This makes it compact and allows schema evolution.

// person.proto
syntax = "proto3";

message Person {
  string name = 1;    // field tag 1
  int32  age  = 2;    // field tag 2
  string email = 3;   // field tag 3
}
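
To show what "tag + type + value" looks like in bytes, here is a hand-rolled Python sketch of the wire encoding for the Person message above. It is a teaching aid under stated assumptions, not the real Protobuf library: it handles only non-negative integers and strings, and the function names are hypothetical. Wire type 0 is a varint and wire type 2 is a length-delimited value.

```python
VARINT, LENGTH_DELIMITED = 0, 2   # Protobuf wire types used in this sketch


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """A field key is (field_number << 3) | wire_type, itself a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_string_field(field_number: int, value: str) -> bytes:
    data = value.encode("utf-8")
    return encode_key(field_number, LENGTH_DELIMITED) + encode_varint(len(data)) + data


def encode_person(name: str = "", age: int = 0, email: str = "") -> bytes:
    """Encode Person, skipping fields that hold their zero-value default."""
    out = bytearray()
    if name:
        out += encode_string_field(1, name)
    if age:
        out += encode_key(2, VARINT) + encode_varint(age)
    if email:
        out += encode_string_field(3, email)
    return bytes(out)


print(encode_person(name="Al", age=30).hex(" "))   # 0a 02 41 6c 10 1e
```

Six bytes carry the whole message; the field names never appear on the wire, only the tags 1 and 2, which is why tags must stay stable across schema versions.
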
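
| Property | JSON | Protobuf | Avro | MessagePack |
|---|---|---|---|---|
| Format | Text | Binary | Binary | Binary |
| Schema | None | .proto file | Embedded | None |
| Human readable | Yes | No | No | No |
| Size | Large | Small | Small | Medium |
| Speed | Slow | Fast | Fast | Fast |
| Schema evolution | Ad hoc | Strong | Strong | Ad hoc |

05 / Schema Evolution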

Changing Formats Without Breaking Things

In distributed systems, producers and consumers of data are updated independently. Schema evolution rules ensure old readers can handle new data and vice versa.

Compatibility Types

| Type | Meaning | Safe Operations |
|---|---|---|
| Backward compatible | New code reads old data | Add fields with defaults; remove fields |
| Forward compatible | Old code reads new data | Add fields (old code ignores unknowns); remove fields that had defaults |
| Fully compatible | Both directions work | Only add or remove optional fields with defaults |

Golden Rules
Never reuse field tags/numbers. Once a field is removed, its tag is retired forever. Never change a field's type. Always provide defaults for new fields. In Protobuf, every proto3 field is implicitly optional with a zero-value default.

Evolution in Practice

// v1: original schema
message Event {
  string id = 1;
  string type = 2;
}

// v2: backward + forward compatible
message Event {
  string id = 1;
  string type = 2;
  int64 timestamp = 3;   // NEW — old readers ignore it
  reserved 4;            // tag of a removed field, retired so it is never reused
}
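
The two compatibility directions can be simulated without a real Protobuf runtime. In this Python sketch a message is modeled as a plain dict from field tag to value, which is an assumption made purely for illustration; the field tables mirror the v1 and v2 Event schemas above, and `read` is a hypothetical helper.

```python
# Hypothetical stand-in for a decoded wire message: {field tag: value}.
# Real Protobuf works on bytes; only the tag-based lookup is modeled here.
V1_FIELDS = {1: "id", 2: "type"}
V2_FIELDS = {1: "id", 2: "type", 3: "timestamp"}
DEFAULTS = {"id": "", "type": "", "timestamp": 0}   # proto3-style zero values


def read(message: dict, known_fields: dict) -> dict:
    """Decode a message using only the tags this reader's schema knows."""
    record = {name: DEFAULTS[name] for name in known_fields.values()}
    for tag, value in message.items():
        name = known_fields.get(tag)
        if name is not None:          # unknown tags are skipped, not rejected
            record[name] = value
    return record


old_message = {1: "evt-1", 2: "click"}                  # written by a v1 producer
new_message = {1: "evt-2", 2: "click", 3: 1700000000}   # written by a v2 producer

forward = read(new_message, V1_FIELDS)    # old reader, new data: tag 3 ignored
backward = read(old_message, V2_FIELDS)   # new reader, old data: default filled in
print(forward)
print(backward)
```

If tag 3 were later reused for a different field, the old reader would still skip it, but any stored v2 messages would be silently reinterpreted by newer readers, which is the failure the golden rules guard against.
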

Schema Evolution Flow: Producer v2 → message with new fields → Consumer v1 (ignores unknown tags)

06 / Compression

Making Data Smaller

Compression trades CPU time for smaller data. The right algorithm depends on whether you need speed (real-time streaming) or maximum size reduction (archival storage).

| Algorithm | Ratio | Speed | Best For |
|---|---|---|---|
| gzip / deflate | Good | Moderate | HTTP responses, general files. Universal support. |
| Brotli | Excellent | Slow compress, fast decompress | Static web assets (pre-compressed). Typically 15-25% smaller than gzip. |
| zstd | Excellent | Fast both ways | Databases, logs, real-time pipelines. Trainable dictionaries. |
| LZ4 | Moderate | Extremely fast | In-memory caching, real-time streaming. Prioritizes speed. |
| Snappy | Moderate | Very fast | Developed at Google. Used in Bigtable, Hadoop. Speed over ratio. |

Decision Framework
Maximum compression: Brotli (static assets) or zstd (dynamic data).
Lowest latency: LZ4 or Snappy — decompress at memory-copy speeds.
Universal compatibility: gzip — supported everywhere, good enough ratio.
Best overall: zstd, with ratios close to Brotli's and, at its fastest levels, speeds that approach LZ4's. Increasingly the default choice.
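
The CPU-versus-size trade-off can be observed with any algorithm that exposes compression levels. The sketch below uses `zlib` (deflate) because it ships with Python; Brotli, zstd, LZ4 and Snappy generally need third-party packages and are not shown. The log-like payload is made up for illustration, and timings will vary by machine.

```python
import time
import zlib

# Hypothetical, highly repetitive payload resembling a structured log.
payload = b'{"user_id": 12345, "event": "page_view", "path": "/products/42"}\n' * 2000


def measure(data: bytes, level: int) -> tuple[float, float]:
    """Return (compression ratio, seconds spent compressing) at a zlib level."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    assert zlib.decompress(compressed) == data   # lossless round trip
    return len(data) / len(compressed), elapsed


for level in (1, 6, 9):
    ratio, seconds = measure(payload, level)
    print(f"level {level}: ratio {ratio:.1f}x in {seconds * 1000:.2f} ms")
```

On realistic, less repetitive data the gap between levels is usually more visible than on this synthetic input, so it is worth measuring with a sample of your own payloads.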

HTTP Compression

Browsers and servers negotiate compression via HTTP headers. The client advertises what it supports; the server picks the best match.

HTTP Compression Negotiation: client sends Accept-Encoding: gzip, br, zstd → server picks best → response carries Content-Encoding: br

# Request
GET /api/data HTTP/1.1
Accept-Encoding: gzip, deflate, br

# Response
HTTP/1.1 200 OK
Content-Encoding: br
Vary: Accept-Encoding
Content-Type: application/json

[compressed body]
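
The server's side of this negotiation can be sketched as a small header parser. This is a simplified illustration, not a complete implementation of the HTTP rules: it honors q-values and the `*` wildcard but ignores the special handling of `identity`. The preference order in `SERVER_PREFERENCE` and both function names are assumptions.

```python
from typing import Optional

SERVER_PREFERENCE = ["br", "zstd", "gzip"]   # assumed server-side ranking


def parse_accept_encoding(header: str) -> dict:
    """Parse 'gzip, br;q=0.8' into {'gzip': 1.0, 'br': 0.8}."""
    weights = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, params = part.partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0        # treat a malformed weight as "not acceptable"
        weights[name.strip().lower()] = quality
    return weights


def choose_encoding(header: str, supported=SERVER_PREFERENCE) -> Optional[str]:
    """Pick the client's highest-weighted encoding; ties go to server preference."""
    weights = parse_accept_encoding(header)
    wildcard = weights.get("*", 0.0)
    best_name, best_quality = None, 0.0
    for name in supported:           # earlier entries win ties
        quality = weights.get(name, wildcard)
        if quality > best_quality:
            best_name, best_quality = name, quality
    return best_name                 # None means: send the body uncompressed


print(choose_encoding("gzip, deflate, br"))       # br
print(choose_encoding("gzip;q=1.0, br;q=0.5"))    # gzip
```

Whatever the server picks goes into Content-Encoding, and the Vary: Accept-Encoding header tells caches that the response depends on the request header.
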

Don't Double-Compress
Already-compressed formats (JPEG, PNG, MP4, ZIP) don't benefit from HTTP compression — they may even get larger. Configure your server to skip compression for these MIME types. Also avoid compressing small responses (<1KB) where the overhead exceeds the savings.
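
Both effects are easy to measure. In this sketch, random bytes from `os.urandom` stand in for already-compressed media (an assumption: both are high-entropy data with little redundancy left), and `gzip_delta` is a hypothetical helper.

```python
import gzip
import os


def gzip_delta(data: bytes) -> int:
    """Bytes saved by gzip; a negative result means the output grew."""
    return len(data) - len(gzip.compress(data))


text_like = b"GET /api/data HTTP/1.1\r\n" * 400   # repetitive, compresses well
random_like = os.urandom(10_000)                   # stand-in for JPEG/MP4/ZIP content
tiny = b'{"ok":true}'                              # smaller than gzip's own framing

for label, data in [("text", text_like), ("random", random_like), ("tiny", tiny)]:
    print(f"{label}: {len(data)} bytes, saved {gzip_delta(data)}")
```

The gzip container adds a fixed header and trailer, which is why very small bodies and incompressible bodies both come out larger than they went in.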

Test Yourself

Question 01
How many bytes does UTF-8 use to encode the emoji character U+1F600?
U+1F600 falls in the range U+10000 — U+10FFFF, which requires 4 bytes in UTF-8. Only code points up to U+007F fit in 1 byte (ASCII range).
Question 02
Why does 0.1 + 0.2 not equal 0.3 in IEEE 754 floating point?
0.1 in binary is a repeating fraction (like 1/3 in decimal). It must be rounded to fit the 52-bit mantissa. When two rounded values are added, their rounding errors compound, producing 0.30000000000000004 instead of 0.3.
Question 03
What is the approximate size overhead of Base64 encoding?
Base64 encodes 3 bytes (24 bits) into 4 characters (each representing 6 bits). So 3 bytes become 4 bytes — a 33% increase. Hex encoding, by contrast, has 100% overhead (1 byte becomes 2 characters).
Question 04
Which serialization format embeds the schema within the data itself?
Avro embeds the writer's schema with the data, so readers can always resolve differences between the writer's schema and their own. Protobuf uses separate .proto files, and JSON/MessagePack have no schema at all.
Question 05
In Protocol Buffers, why should you never reuse a retired field number?
Protobuf identifies fields by their numeric tag on the wire. If you reuse tag 4 for a new field with a different type or meaning, old messages still in storage or in flight will be read under the new definition: the old bytes may be misinterpreted as the new field, dropped as unknown, or fail to parse, causing data corruption that is hard to detect.
Question 06
Which compression algorithm offers the best balance of compression ratio and speed for dynamic data?
zstd achieves compression ratios close to Brotli's while, at its fastest levels, approaching LZ4's speed. It also supports trainable dictionaries for small data. Brotli compresses better at its highest levels but is much slower for dynamic compression. LZ4 is faster but compresses less.
Question 07
What HTTP header does a client send to indicate which compression algorithms it supports?
The client sends Accept-Encoding: gzip, br, zstd to tell the server what it can decompress. The server responds with Content-Encoding: br (or whichever it chose). Transfer-Encoding is for hop-by-hop encoding like chunked, not end-to-end compression.
Question 08
What does "backward compatible" mean in the context of schema evolution?
Backward compatibility means new (updated) consumers can read data produced by old producers. Forward compatibility is the reverse — old consumers can read data from new producers. Full compatibility means both directions work.
Question 09
Which encoding is backward compatible with ASCII?
UTF-8 encodes ASCII characters (U+0000 to U+007F) as a single byte with the same value as ASCII. Any valid ASCII document is already valid UTF-8. UTF-16 and UTF-32 use different byte widths, so they are not byte-compatible with ASCII.
Question 10
In a little-endian system, how is the 32-bit integer 0x01020304 stored in memory?
Little-endian stores the least significant byte first. So 0x01020304 becomes 04 03 02 01 in memory. Big-endian (network byte order) would store it as 01 02 03 04.