How Docker image layers actually work under the hood
I used Docker for over a year before I understood what layers actually are. I knew the basics: images are built from Dockerfiles, containers run from images, put the stuff that changes least at the top of your Dockerfile. But I had no idea why any of that worked the way it did.
Once I dug into the internals, a lot of things clicked. Why builds are sometimes instant and sometimes painfully slow. Why 20 containers can run from the same image without eating all your disk space. Why deleting a file in a Dockerfile does not make your image smaller. This post is everything I learned.
Every layer is just a tar archive
When Docker builds an image, each instruction in your Dockerfile produces a layer. A layer is not some special Docker format. It is a tar archive containing the filesystem changes that instruction introduced. New files, modified files, deleted files. That is it.
Take this Dockerfile:
```dockerfile
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y curl
COPY app.js /app/app.js
RUN chmod +x /app/app.js
```

This produces four layers:

- The base Ubuntu filesystem (from ubuntu:22.04)
- The files added and modified by installing curl
- The app.js file
- The permission change on app.js
Each layer only contains the diff from the previous state. Layer 2 does not contain the entire Ubuntu filesystem plus curl. It contains only the new and changed files that apt-get install produced. Layer 3 contains only app.js. Layer 4 contains only the updated metadata for app.js.
You can see this yourself:
```shell
docker history my-image
```

This shows each layer, its size, and the command that created it. The sizes make a lot more sense once you realize each one is just the delta.
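To make "a layer is just a tar archive" concrete, here is a minimal Python sketch that builds a one-file layer archive and hashes it. The file name and contents are made up for the demo; real layers are produced by capturing a container's filesystem diff.

```python
import hashlib
import io
import tarfile

# Build a "layer" the way Docker conceptually does: a tar archive containing
# only the files this step added or changed (hypothetical file for the demo).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"console.log('hello');\n"
    info = tarfile.TarInfo(name="app/app.js")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

layer_bytes = buf.getvalue()

# A layer's DiffID (covered below) is simply the SHA256 of this uncompressed tar.
diff_id = "sha256:" + hashlib.sha256(layer_bytes).hexdigest()
print(diff_id)
```

No special format, no metadata container: a plain tar of the changed paths, identified by its hash.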
OverlayFS: stacking layers into a filesystem
So we have a stack of tar archives, each containing filesystem diffs. Docker needs to combine them into a single coherent filesystem that a container can use. This is where union filesystems come in.
Docker uses OverlayFS (merged into the Linux kernel in version 3.18). Before that, it used AUFS, which was never accepted into the mainline kernel and required patched distributions. OverlayFS being built into the kernel is a big part of why Docker became so portable.
OverlayFS works with four directories:
- lowerdir: One or more read-only directories stacked on top of each other. These are your image layers.
- upperdir: A single writable directory. This is where changes get written.
- workdir: An internal scratch directory OverlayFS uses for atomic operations.
- merged: The unified view that combines everything. This is what gets mounted as the container's root filesystem.
```mermaid
graph BT
    L1[Layer 1: Base OS<br/>lowerdir - read only] --> M[Merged View<br/>Container Root Filesystem]
    L2[Layer 2: apt install curl<br/>lowerdir - read only] --> M
    L3[Layer 3: COPY app.js<br/>lowerdir - read only] --> M
    U[Container Writable Layer<br/>upperdir - read/write] --> M
```
The mount command looks something like this:
```shell
mount -t overlay overlay \
  -o lowerdir=/layer3:/layer2:/layer1,\
upperdir=/container-rw,\
workdir=/container-work \
  /container-merged
```

When a process inside the container reads a file, OverlayFS checks the upperdir first. If the file is not there, it walks through the lowerdirs from top to bottom. The first match wins. When a process writes a file, the write goes to the upperdir. The lowerdirs are never modified.
This is how Docker stacks image layers. Each layer becomes a lowerdir, ordered from most recent to oldest. The container gets a thin upperdir for its own writes. The merged directory becomes the container's root filesystem.
How file operations work in the overlay
Reading a file: OverlayFS looks up through the layers. If the file exists in a lower layer, it is read directly from there. No copying happens. Fast and efficient.
Creating a new file: Written directly to the upperdir. Nothing interesting here.
Modifying an existing file: This is where it gets interesting. OverlayFS performs a "copy-up" operation. The entire file is copied from the lower layer to the upper layer, then the modification is applied to the upper copy. This happens even if you only change a single byte. After the copy-up, all future reads and writes of that file go to the upper layer copy at native filesystem speed.
Deleting a file: The file in the lower layer cannot be removed (it is read-only). Instead, OverlayFS creates a special "whiteout" file in the upper layer. This is a character device with 0/0 major/minor numbers. The whiteout masks the file below, making it invisible in the merged view. The original file still physically exists in the lower layer.
Deleting a directory: An "opaque whiteout" is created. This is a directory in the upper layer with a special extended attribute (trusted.overlay.opaque) that tells OverlayFS to hide everything in the corresponding lower directory.
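When layers are shipped as tar archives (the format registries store), the overlayfs character-device whiteout is translated into a marker file: per the OCI image spec, an empty entry named .wh.&lt;name&gt; masks that path, and .wh..wh..opq marks an opaque directory. Here is a small sketch that lists what a layer deletes, using a simulated layer tar:

```python
import io
import tarfile

WHITEOUT_PREFIX = ".wh."
OPAQUE_MARKER = ".wh..wh..opq"

def whiteouts(layer: bytes) -> list[str]:
    """Return the paths a layer deletes, per the OCI tar whiteout convention."""
    deleted = []
    with tarfile.open(fileobj=io.BytesIO(layer)) as tar:
        for member in tar.getmembers():
            base = member.name.rsplit("/", 1)[-1]
            if base == OPAQUE_MARKER:
                # Everything under this directory in lower layers is hidden.
                parent = member.name[: -len(base)].rstrip("/")
                deleted.append(parent + "/ (opaque)")
            elif base.startswith(WHITEOUT_PREFIX):
                # ".wh.foo" in the tar masks "foo" from the layers below.
                parent = member.name[: -len(base)]
                deleted.append(parent + base[len(WHITEOUT_PREFIX):])
    return deleted

# Simulate the layer produced by removing a package: it contains only whiteouts.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    tar.addfile(tarfile.TarInfo(name="usr/bin/.wh.some-big-binary"))
print(whiteouts(buf.getvalue()))
```

Tools like dive do essentially this when they show you which files a layer removes.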
This explains a common Dockerfile gotcha. If you do this:
```dockerfile
RUN apt-get install -y some-big-package
RUN apt-get remove -y some-big-package
```

Your image is not smaller. The first layer contains all the files from the package. The second layer contains whiteout files that hide them. Both layers are part of the image. The data is still there, it is just invisible. To actually save space, you need to install and remove in a single RUN instruction so the files never make it into a committed layer.
Layer limits and performance
Modern kernels (5.11+) support up to 500 lower layers in OverlayFS. Docker's overlay2 driver has historically capped images at around 128 layers. Images with many layers also start more slowly, because the overlay mount takes longer to set up.
There is a clever implementation detail here. The kernel has a page-size limit (~4096 bytes) on mount option strings. With many layers, the colon-separated list of lowerdir paths can exceed this. Docker works around it by creating shortened symbolic links in /var/lib/docker/overlay2/l/ that map to the actual layer directories. The mount options use these short symlinks instead of the full paths.
Content-addressable storage: identifying layers by their content
Before Docker 1.10 (released February 2016), layers were identified by randomly generated UUIDs. This meant that two identical layers built on different machines had different IDs. No deduplication, no integrity verification. Not great.
Docker 1.10 switched to content-addressable storage. Every layer is now identified by the SHA256 hash of its content. Identical content always produces the same hash, regardless of when or where it was built.
There are two types of layer hashes, and the distinction matters:
DiffID: The SHA256 hash of the layer's uncompressed tar archive. This is what appears in the image configuration and is used locally to identify layers. Example: sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4.
Distribution digest: The SHA256 hash of the compressed tar archive (gzip or zstd). This is what registries use and what appears in image manifests. Since compression can vary, the compressed hash differs from the DiffID even though they represent the same layer content.
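The distinction is easy to demonstrate. This sketch builds a tiny one-file layer tar, then computes both hashes; the content is made up, and mtime=0 is set only to make the gzip output reproducible:

```python
import gzip
import hashlib
import io
import tarfile

# A tiny one-file layer tar (contents are arbitrary for the demo).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"hello\n"
    info = tarfile.TarInfo(name="etc/motd")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
layer = buf.getvalue()

# DiffID: hash of the *uncompressed* tar (what the image config records).
diff_id = "sha256:" + hashlib.sha256(layer).hexdigest()

# Distribution digest: hash of the *compressed* blob (what registries store).
compressed = gzip.compress(layer, mtime=0)
dist_digest = "sha256:" + hashlib.sha256(compressed).hexdigest()

print("DiffID:            ", diff_id)
print("Distribution digest:", dist_digest)
```

Same layer, two different identifiers, because one hashes the raw bytes and the other hashes the compressed transport form.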
ChainIDs: layers in context
A layer's identity depends on what is beneath it. The same set of file changes applied on top of different base layers should not be considered the same layer. Docker handles this with ChainIDs, computed recursively:
ChainID(L0) = DiffID(L0)
ChainID(L0, L1) = SHA256(ChainID(L0) + " " + DiffID(L1))
ChainID(L0, L1, L2) = SHA256(ChainID(L0, L1) + " " + DiffID(L2))
This means a layer is only shared if it has the same content and the same parent chain. Which is exactly the right behavior for a layered filesystem.
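The recursion above is short enough to implement directly. A sketch, using made-up DiffIDs:

```python
import hashlib

def chain_ids(diff_ids: list[str]) -> list[str]:
    """Compute ChainIDs per the OCI image spec recursion."""
    chains = []
    for diff_id in diff_ids:
        if not chains:
            chains.append(diff_id)  # ChainID(L0) = DiffID(L0)
        else:
            combined = chains[-1] + " " + diff_id
            chains.append("sha256:" + hashlib.sha256(combined.encode()).hexdigest())
    return chains

# Hypothetical DiffIDs for a three-layer image:
diffs = [
    "sha256:" + hashlib.sha256(b"layer0").hexdigest(),
    "sha256:" + hashlib.sha256(b"layer1").hexdigest(),
    "sha256:" + hashlib.sha256(b"layer2").hexdigest(),
]
chains = chain_ids(diffs)

# The same diff applied on a different parent gets a different ChainID:
other = chain_ids([diffs[1], diffs[2]])
print(chains[2] == other[1])
```

The last two layers have identical content in both chains, but because the parent chain differs, the ChainIDs differ, so they are stored as distinct layers.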
Deduplication in practice
Content-addressable storage enables deduplication at every level:
On disk: Docker stores each unique layer once under /var/lib/docker/overlay2/. If you have 20 microservice images all based on node:20-alpine, the ~50MB base layers are stored exactly once. Without deduplication, that is roughly 950MB wasted on disk.
During pulls: When you docker pull an image, Docker checks which layers you already have locally. Only missing layers are downloaded. If you already have python:3.12-slim and pull an image based on it, you download only the new layers on top.
In registries: When pushing, Docker checks each layer against the registry with a HEAD request. If the blob already exists, the upload is skipped. Registries even support cross-repository blob mounting. If myregistry/app-a already has a layer and you push myregistry/app-b with the same layer, the registry can mount the existing blob without re-uploading.
You can see shared vs unique storage on your Docker host:
```shell
docker system df -v
```

The "SHARED SIZE" column shows how much of each image's storage is shared with other images.
The integrity chain
Content addressing also provides integrity verification for free. Every layer's hash is computed from its content. The image manifest references layers by their digests. The manifest itself has a digest. Changing a single byte in any layer changes its hash, which changes the manifest hash, which changes the image digest. It is a Merkle DAG (directed acyclic graph) from top to bottom. If the top-level hash matches, you know nothing underneath has been tampered with.
The OCI image specification
Docker images used to be a Docker-specific format. Today they follow the Open Container Initiative (OCI) Image Specification (currently version 1.1). This means images built with Docker work with Podman, containerd, CRI-O, and any other OCI-compliant runtime.
An OCI image consists of three main components wired together by content descriptors:
Image manifest
The manifest ties everything together. It points to the image configuration and lists all the layers in order:
```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:b5b2b2c507a0...",
    "size": 7023
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:9834876dcfb0...",
      "size": 32654
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:3c3a4604a545...",
      "size": 16724
    }
  ]
}
```

Each entry uses a content descriptor with three fields: the media type, the SHA256 digest, and the byte size. The layers array is ordered bottom to top. Base layer first, subsequent layers stacked on top.
Layer compression can be gzip (most common), zstd (newer, better compression ratio and 3-5x faster decompression), or uncompressed tar.
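Verifying a descriptor is mechanical: hash the blob, compare digest and size. A sketch with a made-up config blob:

```python
import hashlib
import json

def verify_descriptor(blob: bytes, descriptor: dict) -> bool:
    """Check a blob against its OCI content descriptor (digest + size)."""
    digest = "sha256:" + hashlib.sha256(blob).hexdigest()
    return digest == descriptor["digest"] and len(blob) == descriptor["size"]

# A hypothetical config blob and the descriptor a manifest would carry for it.
config_blob = json.dumps({"architecture": "amd64", "os": "linux"}).encode()
descriptor = {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:" + hashlib.sha256(config_blob).hexdigest(),
    "size": len(config_blob),
}

print(verify_descriptor(config_blob, descriptor))        # intact blob passes
print(verify_descriptor(config_blob[:-1] + b"!", descriptor))  # one changed byte fails
```

This per-descriptor check, applied recursively from the manifest down, is the Merkle DAG integrity guarantee described later.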
Image index (for multi-architecture images)
For images that support multiple architectures (like amd64 and arm64), an image index sits above the manifest. It lists multiple manifests, each tagged with a platform:
```json
{
  "schemaVersion": 2,
  "manifests": [
    {
      "digest": "sha256:e692...",
      "platform": { "architecture": "amd64", "os": "linux" }
    },
    {
      "digest": "sha256:5b0b...",
      "platform": { "architecture": "arm64", "os": "linux" }
    }
  ]
}
```

When you docker pull nginx, Docker checks the index and downloads the manifest matching your architecture. This is why the same image tag works on an x86 server and an ARM-based Mac.
Image configuration
The configuration contains the runtime settings (environment variables, entrypoint, exposed ports) and critically, the ordered list of DiffIDs:
```json
{
  "architecture": "amd64",
  "os": "linux",
  "rootfs": {
    "type": "layers",
    "diff_ids": [
      "sha256:4fc242d58285...",
      "sha256:a3ed95caeb02..."
    ]
  },
  "config": {
    "Env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"],
    "Cmd": ["/bin/sh"],
    "WorkingDir": "/app"
  },
  "history": [
    { "created_by": "/bin/sh -c #(nop) ADD file:... in /" },
    { "created_by": "/bin/sh -c apt-get update && apt-get install -y curl" }
  ]
}
```

The history array is useful for debugging. It records the exact Dockerfile instruction that produced each layer, including whether the instruction was metadata-only (like ENV or LABEL) and did not produce a filesystem layer.
Not every Dockerfile instruction creates a filesystem layer. Instructions like ENV, LABEL, CMD, EXPOSE, and WORKDIR only modify the image configuration. They are marked with "empty_layer": true in the history. This is why docker history shows some layers with a size of 0B.
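Because history has one entry per instruction but diff_ids lists only non-empty layers, matching them up means skipping the empty_layer entries. A sketch with hypothetical config fragments (rendered as Python dicts):

```python
# Hypothetical config fragments: history has one entry per instruction,
# but only layer-producing instructions appear in rootfs.diff_ids.
history = [
    {"created_by": "ADD file:... in /"},
    {"created_by": "ENV PATH=/usr/local/bin", "empty_layer": True},
    {"created_by": "RUN apt-get update && apt-get install -y curl"},
    {"created_by": 'CMD ["/bin/sh"]', "empty_layer": True},
]
diff_ids = ["sha256:aaa...", "sha256:bbb..."]

# Walk history in order, attaching each DiffID to the instruction
# that produced it and None to metadata-only instructions.
pairs = []
it = iter(diff_ids)
for entry in history:
    layer = None if entry.get("empty_layer") else next(it)
    pairs.append((entry["created_by"], layer))

for created_by, layer in pairs:
    print(layer or "(no layer)", "|", created_by)
```

This is exactly the alignment docker history performs when it shows 0B for metadata-only instructions.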
Copy-on-write: sharing without copying
Copy-on-write (CoW) is the mechanism that makes it practical to run dozens or hundreds of containers from the same image simultaneously.
When Docker starts a container, it does not copy the image. It creates a thin writable layer on top of the image's read-only layers. The container sees the full merged filesystem through OverlayFS. Reads go through to the image layers. Writes land in the container's own layer.
What this looks like on disk
When docker run creates a container, Docker creates a new directory under /var/lib/docker/overlay2/<container-id>/:
```
/var/lib/docker/overlay2/<container-id>/
├── diff/    # The writable upper layer (starts empty)
├── work/    # OverlayFS workdir
├── merged/  # The union mount point
├── lower    # File listing paths to image layer directories
└── link     # Shortened symlink ID
```
The diff/ directory starts essentially empty. Every file the container creates or modifies ends up here. Every file the container deletes produces a whiteout here.
The copy-up cost
The copy-up operation is the most important performance characteristic to understand. When a container first modifies a file from an image layer:
- The entire file is copied from the lower layer to the upper layer. Not just the modified blocks. The whole file.
- The modification is applied to the upper copy.
- All subsequent reads and writes go to the upper copy at native speed.
This means the first write to a large file from a base layer can be slow. Writing a single byte to a 500MB log file copies 500MB to the writable layer. After that, it is fast. But that initial copy-up is file-level, not block-level.
Since Linux kernel 4.19, OverlayFS supports metadata-only copy-up (the metacopy feature) for operations that only change file metadata (like chmod or chown). In these cases the file data stays in the lower layer and only the metadata is copied up, avoiding the full file copy.
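The file-level granularity is easy to simulate without an actual overlay mount. This sketch mimics what copy-up does for a one-byte write to a file in a read-only layer (the directory layout and file are invented for the demo):

```python
import os
import shutil
import tempfile

# Simulate file-level copy-up: modifying one byte of a file in a read-only
# lower layer forces a copy of the *entire* file into the upper layer.
with tempfile.TemporaryDirectory() as root:
    lower = os.path.join(root, "lower")
    upper = os.path.join(root, "upper")
    os.makedirs(lower)
    os.makedirs(upper)

    # A 5 MB file living in the read-only image layer.
    src = os.path.join(lower, "big.log")
    with open(src, "wb") as f:
        f.write(b"\0" * 5 * 1024 * 1024)

    # "Write one byte": the whole file is copied up first,
    # then the modification is applied to the upper copy.
    dst = os.path.join(upper, "big.log")
    shutil.copy2(src, dst)       # the copy-up: all 5 MB move
    with open(dst, "r+b") as f:
        f.write(b"x")            # the actual one-byte modification

    copied = os.path.getsize(dst)
    print(f"wrote 1 byte, copied {copied} bytes up")
```

One byte written, five megabytes copied. That is the cost profile to keep in mind for write-heavy paths.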
Performance in practice
Some benchmarks paint a clear picture:
- For sequential write-heavy workloads, OverlayFS can be two orders of magnitude slower than direct volume mounts because copy-up converts sequential I/O into random I/O.
- For metadata-heavy workloads (creating many small files), OverlayFS can actually outperform volume mounts. Creating 500 files was measured at 173ms on OverlayFS vs 834ms on a volume mount.
- For reads, OverlayFS is excellent. Multiple containers reading the same file from a shared image layer share a single page cache entry in kernel memory. This makes high-density deployments very memory-efficient.
The practical takeaway: use volumes for write-heavy workloads. Databases, logs, build artifacts. These bypass OverlayFS entirely and write directly to the host filesystem. The container's writable layer should be for small, ephemeral changes.
Sharing math
If you run 100 containers from a 500MB image:
- Without CoW: 100 copies = 50GB of disk usage.
- With CoW: 1 copy of the read-only image (500MB) + 100 thin writable layers (starting near 0). Total: ~500MB until containers start writing data.
This is why you can run many containers on modest hardware. The read-only image layers are shared, and each container only pays for its own changes.
Ephemeral by design
When you remove a container with docker rm, the writable layer directory is deleted. All data not stored in volumes is gone. This is intentional. Containers are disposable. If you need data to persist, use volumes.
You can inspect what a container has changed before removing it:
```shell
docker diff my-container
```

This shows files that were added (A), changed (C), or deleted (D) in the writable layer.
Build cache: why Dockerfile order matters so much
Layer caching is the single most impactful performance optimization in Docker builds. Understanding the cache invalidation rules turns 10-minute builds into 5-second builds.
How the cache works
When Docker builds an image, it processes each Dockerfile instruction sequentially. For each instruction, it checks:
- Does a cached layer exist from a previous build?
- Does the cached layer have the same parent chain (identical ChainID of all preceding layers)?
- Does the instruction match?
If all three are true, Docker reuses the cached layer instantly. If any check fails, the cache is "busted" for this instruction and every instruction after it. This cascade is the key to understanding cache behavior.
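The cascade falls out of how the cache key is constructed: each layer's key folds in its parent's key, so changing any instruction changes every key after it. A simplified sketch (the real BuildKit key also covers file checksums, build args, and more):

```python
import hashlib

def cache_key(parent_key: str, instruction: str) -> str:
    """Simplified cache key: depends on the parent chain and the instruction."""
    return hashlib.sha256((parent_key + "\n" + instruction).encode()).hexdigest()

def build_keys(instructions: list[str]) -> list[str]:
    keys, parent = [], ""
    for ins in instructions:
        parent = cache_key(parent, ins)
        keys.append(parent)
    return keys

a = build_keys(["FROM node:20-alpine", "COPY package.json .", "RUN pnpm install"])
b = build_keys(["FROM node:20-alpine", "COPY package.json .", "RUN pnpm install"])
c = build_keys(["FROM node:20-alpine", "COPY . .", "RUN pnpm install"])

print(a == b)        # identical Dockerfile: every layer is a cache hit
print(a[0] == c[0])  # same base image: the first layer still hits
print(a[2] == c[2])  # a changed instruction busts everything after it
```

Note how the third RUN instruction is textually identical in both Dockerfiles, yet its key differs because an earlier instruction changed.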
Instruction-specific cache rules
RUN commands: The cache key is the literal command string. RUN apt-get update is cached based on the text "apt-get update", not on whether new packages are available upstream. Same string = cache hit, regardless of what the command would actually produce if re-run. This is why you should combine apt-get update and apt-get install into a single RUN instruction. A cached apt-get update followed by a new apt-get install will use stale package lists.
COPY and ADD: Docker computes a checksum of the file contents being copied. Not timestamps, just content. If any file's content has changed, the cache is invalidated. This is smart. Touching a file without changing its content does not bust the cache. Actually changing the content does.
ARG: Changing a build argument's value invalidates the cache for all subsequent instructions that reference it.
ENV, LABEL, WORKDIR: These modify image metadata. If the value changes, the cache busts for this instruction and everything after.
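The content-not-timestamps behavior of COPY is worth seeing in action. A sketch of a content-only checksum, exercised against a temp file (the file contents are arbitrary):

```python
import hashlib
import os
import tempfile

def copy_cache_checksum(path: str) -> str:
    """Content-only checksum, in the spirit of COPY's cache check:
    file bytes matter, modification times do not."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read())
    return h.hexdigest()

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"const x = 1;\n")
    path = f.name

before = copy_cache_checksum(path)

os.utime(path, (0, 0))                 # "touch": timestamp changes, content doesn't
after_touch = copy_cache_checksum(path)

with open(path, "ab") as f:
    f.write(b"const y = 2;\n")         # a real content change
after_edit = copy_cache_checksum(path)
os.unlink(path)

print(after_touch == before)  # touching the file: still a cache hit
print(after_edit == before)   # editing the file: cache busted
```

This is why CI checkouts, which reset every file's mtime, still get COPY cache hits as long as the file contents match.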
The ordering rule
Since cache invalidation cascades downward, instruction order determines cache efficiency. The rule: put things that change least frequently at the top, and things that change most frequently at the bottom.
The classic example for a Node.js app:
```dockerfile
FROM node:20-alpine
WORKDIR /app

# These change rarely (only when dependencies change)
COPY package.json pnpm-lock.yaml ./
RUN npm install -g pnpm && pnpm install

# This changes on every code edit
COPY . .
RUN pnpm build
```

If you only changed application code, Docker reuses the cached layers for everything up to and including pnpm install. Only COPY . . and pnpm build need to run. On a project with hundreds of dependencies, this turns a 2-minute build into a 10-second build.
If you had written it the other way around:
```dockerfile
FROM node:20-alpine
WORKDIR /app
COPY . .
RUN npm install -g pnpm && pnpm install
RUN pnpm build
```

Every code change invalidates COPY . ., which cascades and forces pnpm install to run again. Every single build reinstalls all dependencies. Slow and wasteful.
BuildKit cache mounts
BuildKit (the default builder since Docker 23.0) adds cache mounts that persist across builds even when layers are invalidated:
```dockerfile
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt
```

The /root/.cache/pip directory survives between builds. Even if the layer is rebuilt because requirements.txt changed, pip only downloads new or updated packages. The same pattern works for apt, npm, Go modules, and basically any package manager with a local cache.
Remote cache for CI/CD
Locally, Docker keeps its build cache on disk. But in CI/CD, every build typically starts with an empty cache. BuildKit solves this with remote cache backends:
```shell
docker buildx build \
  --cache-from type=registry,ref=myregistry/myapp:cache \
  --cache-to type=registry,ref=myregistry/myapp:cache \
  -t myregistry/myapp:latest .
```

This pushes cache metadata to a registry and pulls it on subsequent builds. Supports registries, local directories, S3, Azure Blob, and GitHub Actions cache. This makes CI builds nearly as fast as local builds after the first run.
Multi-stage builds: keeping production images small
Multi-stage builds solve a fundamental problem: building an application requires tools and dependencies that should not exist in the production image. Compilers, build tools, dev dependencies, test frameworks. All dead weight at runtime.
Before multi-stage builds (introduced in Docker 17.05), people used shell scripts to build in one container, extract the artifact, and build a second image from it. Multi-stage builds make this a single Dockerfile.
How they work
```dockerfile
# Stage 1: Build
FROM node:20 AS builder
WORKDIR /app
COPY package.json pnpm-lock.yaml ./
RUN npm install -g pnpm && pnpm install
COPY . .
RUN pnpm build

# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
CMD ["node", "dist/index.js"]
```

Each FROM starts a new stage with a fresh layer stack. COPY --from=builder copies files from the builder stage's filesystem into the current stage. When the build finishes, only the final stage's layers become part of the output image. The builder stage's layers (including all of node:20, all the source code, and all dev dependencies) are discarded.
BuildKit is smart about this. If you run docker build --target=production, it will not even execute stages that are not needed for the target.
Size reductions are dramatic
The numbers are striking, especially for compiled languages:
Go: A single-stage build with golang:1.22 produces an image around 800MB-1.1GB (the entire Go toolchain). Multi-stage with a scratch or distroless runtime: 5-15MB. That is a 99% reduction. The final image contains only the compiled binary.
```dockerfile
FROM golang:1.22 AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -ldflags="-s -w" -o server .

FROM scratch
COPY --from=builder /app/server /server
CMD ["/server"]
```

The -ldflags="-s -w" strips the symbol table and debug info, shaving off a few more megabytes. FROM scratch means the final image is literally just the binary. No shell, no package manager, no OS. This also reduces the attack surface to almost zero.
Rust: Similar to Go. Build stage with the full Rust toolchain (~1.4GB), production stage with just the binary (~5MB).
Node.js: Build stage with node:20 (~1.1GB), production stage with node:20-alpine and only production dependencies (~150MB). Around 87% smaller.
Python: Build stage with build tools for compiled dependencies (~1GB), production stage with python:3.12-slim and only runtime packages (~85-180MB). Around 82-91% smaller.
Java: Maven build stage, then a JRE-only runtime stage. Spring Boot's layered JAR format makes this even better by separating dependency layers from application code, improving cache hits on subsequent builds.
The three-stage pattern
For maximum cache efficiency, some projects use three stages:
```dockerfile
# Stage 1: Install production dependencies only
FROM node:20-alpine AS deps
WORKDIR /app
COPY package.json pnpm-lock.yaml ./
RUN npm install -g pnpm && pnpm install --prod

# Stage 2: Build with all dependencies
FROM node:20 AS builder
WORKDIR /app
COPY package.json pnpm-lock.yaml ./
RUN npm install -g pnpm && pnpm install
COPY . .
RUN pnpm build

# Stage 3: Production runtime
FROM node:20-alpine AS runner
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist
CMD ["node", "dist/index.js"]
```

The final image has production dependencies from stage 1 and built artifacts from stage 2. No source code, no dev dependencies, no build tools.
Putting it all together
Here is what happens when you run docker build and then docker run, with everything we have covered:
Build time:
- Docker reads the Dockerfile and processes each instruction.
- For each instruction, it checks the build cache. If the parent chain and instruction match a cached layer, it reuses it.
- When cache misses, the instruction runs in a temporary container. The filesystem changes are captured as a tar archive (the layer diff).
- Each layer gets a content-addressable DiffID (SHA256 of the uncompressed tar).
- Docker produces an image configuration (with the ordered list of DiffIDs) and an image manifest (with the compressed layer digests).
- The SHA256 digest of the image configuration becomes the image ID. (The manifest has its own digest, which is what registries report when you pull or push by digest.)
Run time:
- Docker creates a new directory for the container's writable layer.
- It sets up an OverlayFS mount: image layers as lowerdirs, container writable layer as upperdir.
- The merged directory becomes the container's root filesystem.
- The container process starts and sees a normal filesystem. Reads go through to image layers. Writes land in the thin writable layer via copy-on-write.
- Multiple containers from the same image share the read-only layers and even share page cache entries in kernel memory.
Push/pull time:
- Layers are compressed (gzip or zstd) and uploaded/downloaded as blobs identified by their compressed digest.
- The registry deduplicates: layers that already exist are not re-uploaded.
- Pulls only download layers not already present locally.
- Cross-repository blob mounting avoids re-uploading shared layers even across different image names on the same registry.
Understanding this pipeline changed how I write Dockerfiles. Ordering instructions for cache efficiency, using multi-stage builds, keeping the writable layer thin, and using volumes for write-heavy data are not arbitrary best practices. They follow directly from how the system works under the hood.