What HuggingFace actually is
I've been downloading GGUF files from HuggingFace for months. The workflow is always the same. See a model mentioned somewhere, search for it on the Hub, find the GGUF repo (usually one by Bartowski or TheBloke), pick the right quantization level, download, load it into Ollama. Done.
That was my entire interaction with a platform used by 13 million people.
I never clicked on "Spaces" in the top nav. Never looked at Datasets. I knew there was a Python library called transformers but I'd never installed it. HuggingFace, to me, was a download button with a search bar on top.
I figured it was GitHub but for AI models. The files were in repos. There were commits and branches. People could fork things. GitHub for AI. Simple mental model.
Then I actually sat down and looked at what the platform is. What it does, where it came from, why it's shaped the way it is. And that mental model fell apart fast.
It started as a chatbot for teenagers
The first thing that surprised me was the origin story.
HuggingFace was founded in 2016 by Clément Delangue, Julien Chaumond, and Thomas Wolf. Their first product was a chatbot aimed at teenagers. Not a model hosting platform. Not a developer tool. A chatbot app. It was named after the 🤗 emoji, had around 100,000 daily active users at its peak, and was doing fine as a consumer product.
The pivot happened in 2018. The team had been building NLP technology to power the chatbot and realized the tech underneath was more interesting than the product on top. The chatbot had decent traction, but improvements to the underlying model quality weren't translating into user growth. The technology was advancing faster than the consumer product could absorb. So they open-sourced the Transformers library, a Python package that made it easy to use pre-trained language models with a few lines of code. That library changed the company's trajectory entirely. Within months, the ML research community was building on it, and HuggingFace started evolving into a platform rather than a product.
By August 2023, HuggingFace raised $235 million in a Series D round at a $4.5 billion valuation. The investor list reads like a roll call of the biggest names in tech: Google, Amazon, Nvidia, Intel, AMD, Qualcomm, Salesforce. By late 2024, annual revenue hit $130 million.
And the platform itself, as of early 2026: over 2 million public models, more than 500,000 datasets, over a million hosted applications, and 13 million users. These are not "GitHub for AI" numbers. GitHub hosts code. HuggingFace hosts models, the data they trained on, live applications running those models, compute infrastructure to serve them, and the libraries that tie it all together.
The comparison is not wrong. It's just incomplete by about four dimensions.
The mental model that actually fits
Here's how I think about it now: imagine if GitHub, npm, Vercel, and Docker Hub merged into one platform, but for machine learning.
Models are the artifacts (like packages on npm). Datasets are the training data (no real equivalent on GitHub). Spaces are the deployed apps (like Vercel). The Inference API is the compute layer. And the libraries (transformers, datasets, huggingface_hub) are the developer tools that stitch it all together.
No single comparison captures it because nothing else in software works this way. In traditional software development, you use GitHub for code, npm or PyPI for packages, Vercel or Netlify for deployment, and AWS or GCP for compute. They're separate services with separate accounts, separate billing, and no inherent connection between them. HuggingFace collapsed all of those layers into one platform for ML. A model repo on the Hub links to the dataset it was trained on, the Space that demos it, and the Inference API that serves it. Everything is connected by default.
That's what HuggingFace built. And most people who use it, myself included until recently, interact with about 5% of it.
The Hub is built on Git, but not the Git you're used to
The part I was already familiar with, the model repository, sits on top of the Hub. The Hub uses Git for version control, which is why repos look familiar. Commits, branches, diffs, pull requests. Standard Git operations work.
But there's a fundamental problem with using Git for AI: the files are enormous. A single model can be 4 to 16 GB at typical quantization levels. Full-precision weights for a 7B parameter model run about 14 GB. Git was designed for source code measured in kilobytes. It was never meant to handle files measured in gigabytes.
HuggingFace initially solved this with Git LFS (Large File Storage). LFS stores the actual file contents in Amazon S3 and puts small pointer files in the Git repo. The pointer contains a SHA-256 hash and the file size. When you download, Git LFS fetches the real file from S3 using that pointer.
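Concretely, what lands in the Git repo is just a small pointer file in the LFS spec format. The hash is truncated and the size illustrative here:

```
version https://git-lfs.github.com/spec/v1
oid sha256:9a6f1e...
size 4920734592
```

Git sees a three-line text file; the 4.9 GB of actual weights live in object storage.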
This worked but hit limits at the scale HuggingFace reached. In August 2024, they acquired XetHub, a Seattle-based startup, and started building a replacement storage backend called Xet. The key difference: Git LFS deduplicates at the file level. Change one byte in a 10 GB file and you re-upload the entire 10 GB. Xet deduplicates at the chunk level. Change one byte and only the modified chunks get uploaded. For workflows like iterating on model checkpoints or appending rows to a dataset, the savings are significant.
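The difference is easy to see in a toy sketch. This uses fixed-size chunks for brevity; Xet's actual content-defined chunking is smarter and also survives insertions that shift byte offsets:

```python
# Toy illustration of chunk-level dedup: hash fixed-size chunks and
# upload only the ones the server hasn't already seen.
import hashlib

def chunks(data: bytes, size: int = 64) -> list:
    return [data[i:i + size] for i in range(0, len(data), size)]

def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

v1 = b"A" * 1024                      # original file
v2 = b"A" * 512 + b"B" + b"A" * 511   # one byte modified

known = {digest(c) for c in chunks(v1)}                    # already stored
to_upload = [c for c in chunks(v2) if digest(c) not in known]
print(f"{len(to_upload)} of {len(chunks(v2))} chunks need uploading")
```

With file-level dedup (Git LFS), that one-byte change re-uploads all 16 chunks; here it re-uploads one.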
Xet was deployed in January 2025 and became the default for new repos by May 2025. By October 2025, over 77 petabytes of data across 6 million repositories had been migrated to the new backend.
Model cards are not READMEs
On GitHub, a repo has a README. On HuggingFace, a model has a model card: a Markdown file with a YAML frontmatter section containing structured metadata.
```yaml
---
language: en
license: apache-2.0
tags:
- text-generation
- llama
base_model: meta-llama/Llama-3.1-8B
datasets:
- teknium/OpenHermes-2.5
metrics:
- perplexity
---
```

This is not just for humans to read. The YAML powers the platform's search and filtering. When you filter models by task, license, base model, or language on the Hub, you're querying this structured metadata. It's what makes it possible to find a specific Apache-licensed, English-language, text-generation model based on Llama 3.1 in a catalog of over 2 million.
The base_model field is especially useful. It tells you what a model was fine-tuned from, so you can trace the lineage. A model card might show that a coding assistant was fine-tuned from Llama 3.1 8B, which was itself a Meta release. The datasets field tells you what training data was used. The license field tells you whether you can use it commercially. All of this is machine-readable, which means tools and integrations can parse it automatically.
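To see what "machine-readable" buys you, here's a toy, stdlib-only parser pulling fields out of a frontmatter block like the one above. Real tools would use a proper YAML library; this naive version only handles the flat keys and simple lists shown:

```python
# Hypothetical model card text; the frontmatter sits between the
# first two '---' markers.
card = """\
---
language: en
license: apache-2.0
tags:
- text-generation
- llama
base_model: meta-llama/Llama-3.1-8B
---
Benchmark tables and prose follow here.
"""

def parse_frontmatter(text: str) -> dict:
    _, block, _ = text.split("---", 2)
    meta, key = {}, None
    for line in block.strip().splitlines():
        if line.startswith("- ") and key:
            meta[key].append(line[2:].strip())   # list item under last key
        else:
            key, _, value = line.partition(":")
            key = key.strip()
            meta[key] = value.strip() or []      # empty value starts a list
    return meta

meta = parse_frontmatter(card)
print(meta["license"])     # apache-2.0
print(meta["base_model"])  # meta-llama/Llama-3.1-8B
```

The Hub's filters are, in effect, queries over exactly this kind of structure across 2 million repos.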
I'd looked at model cards hundreds of times without thinking about what the YAML section at the top was doing. I was reading the benchmark tables and skipping right past the infrastructure that makes the whole search and filtering experience work.
The CLI I should have been using
The huggingface_hub SDK hit version 1.0 in October 2025. It comes with a new hf CLI that replaced the older huggingface-cli:
```shell
# Download a specific GGUF file
hf download bartowski/Llama-3.1-8B-Instruct-GGUF --include "*Q4_K_M.gguf"

# Upload files to a repo
hf upload my-org/my-model ./weights.safetensors

# Log in to the Hub
hf auth login
```

This SDK gets 113.5 million downloads per month. It's a dependency for over 200,000 repositories on GitHub and 3,000 packages on PyPI. I'd been downloading model files through my browser the entire time when there was a proper CLI for it.
The GGUF supply chain I didn't know about
The thing I actually use HuggingFace for, downloading GGUF files, turns out to be more interesting than I assumed.
```mermaid
graph TD
    A[Research Lab releases model] --> B[Full-precision weights on Hub]
    B --> C[Community downloads original]
    C --> D[Quantize to GGUF variants]
    D --> E[Upload GGUF repos to Hub]
    E --> F[Users download via CLI or browser]
    F --> G[Run locally in Ollama / llama.cpp]
```
HuggingFace doesn't create quantized models. The community does. When a lab like Meta or Mistral releases a new model, it goes up on the Hub in its original format: full-precision or bfloat16 safetensors files. These are big. Most people running models locally don't use them directly.
What happens next is a community-driven supply chain. Within hours of a major model release, community members download the original weights, quantize them to GGUF format at various quality levels, and upload the results as new repos on the Hub. A single model often ends up with a dozen or more quantization variants. If you've seen filenames like Q4_K_M, Q5_K_S, or Q8_0 when downloading from the Hub, those aren't random strings. They describe how aggressively the model weights have been compressed.
The naming follows a pattern: the number (Q4, Q5, Q8) is the bit depth. Lower means smaller files but more quality loss. The letter after the underscore (K for k-quant, I for importance-based) indicates the quantization method. The final letter (S, M, L) is the size variant within that method. So Q4_K_M means 4-bit k-quant at medium size, which for most people is the sweet spot between file size and output quality. I wrote about the VRAM requirements for different quantization levels recently, and this naming scheme is how you match a model to your hardware.
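The bit depth also gives you a back-of-the-envelope file size. A rough sketch, using nominal bits only; real k-quants mix precisions per tensor, so actual files run somewhat larger than this estimate:

```python
# Estimate a GGUF download size from the quantization tag's bit depth.
import re

def estimate_gguf_gb(params_billions: float, quant: str) -> float:
    bits = int(re.match(r"Q(\d+)", quant).group(1))   # "Q4_K_M" -> 4
    return params_billions * 1e9 * bits / 8 / 1e9     # weights x bytes each

for quant in ("Q4_K_M", "Q5_K_S", "Q8_0"):
    print(f"8B at {quant}: ~{estimate_gguf_gb(8, quant):.1f} GB")
```

For an 8B model this lands near 4, 5, and 8 GB respectively, which is why Q4_K_M fits comfortably on consumer GPUs where the full-precision 14 GB original doesn't.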
TheBloke (Tom Jobbins) pioneered the mass-quantization pattern. For a long stretch, if you were downloading GGUF files from HuggingFace, there was a good chance they came from TheBloke's repos. He quantized hundreds of models to GGUF, GPTQ, and AWQ formats, each with multiple quantization levels so users could pick the right tradeoff for their hardware. His model repos became the de facto standard. People would search for "TheBloke [model name] GGUF" rather than navigating the Hub directly.
When TheBloke became less active, Bartowski stepped in as the primary GGUF quantizer. Bartowski uses llama.cpp releases with imatrix calibration datasets, a technique where the quantization process is guided by a calibration dataset to figure out which weights matter most and should be preserved at higher precision. This helps maintain model quality at lower bit depths where naive quantization would cause noticeable degradation. Unsloth also contributes heavily, sometimes producing 25 or more GGUF variants for a single model to cover every possible hardware configuration.
Every GGUF I've ever downloaded was created by someone in this community, not by the people who trained the model. The platform provides the storage and discoverability. The community provides the labor. It's an informal supply chain running on volunteer effort and HuggingFace infrastructure. The GGUF format that makes local inference practical for people like me was popularized through this ecosystem, and the models I pull through Ollama trace their lineage back through it.
Datasets, Spaces, and Inference: the platform I'd been ignoring
These are the three major components I'd never explored. Each one turned out to be bigger than I expected.
Datasets
This is the one that surprised me most. HuggingFace hosts over 500,000 public datasets in more than 8,000 languages. Not just text. Audio, images, video, and robotics data. The robotics category has been one of the fastest growing areas on the platform, expanding from around a thousand datasets to tens of thousands in under two years.
The datasets Python library loads these with Apache Arrow as the backend, which means zero-serialization-cost memory mapping. You can work with datasets larger than your RAM because the library streams and maps data from disk instead of loading it all into memory at once. Loading a dataset is about as simple as importing it:
```python
from datasets import load_dataset

dataset = load_dataset("teknium/OpenHermes-2.5")
print(dataset["train"][0])
```

That pulls a dataset used to train some of the most popular open-source models, and you can start working with it immediately. The library integrates natively with PyTorch, TensorFlow, NumPy, Pandas, and Polars, so it slots into whatever ML workflow you're already using.
I'd thought of HuggingFace as a place for models. The dataset ecosystem is just as large and arguably just as important. Models are useless without training data. Having both under one roof, with model card metadata linking models to the datasets they were trained on, creates a connection that doesn't exist when code lives on GitHub and data lives somewhere else.
Spaces
Spaces is the platform for hosting ML applications. Over a million of them. You build a demo with Gradio (a Python UI framework designed for ML models), Streamlit, or plain Docker. The code lives in a Git repo on the Hub. Push to it and the app auto-deploys. Free CPU tier for basic demos, paid GPU upgrades for anything that needs to run inference.
This is where people build model comparison leaderboards, text-to-image demos, chatbot interfaces, and community evaluation challenges. The Open LLM Leaderboard, which benchmarks open-source language models against standardized tests, runs as a Space. So do many of the interactive demos that let you try a new model without downloading anything. When a new image generation model drops and someone tweets a link where you can try it immediately, that's usually a Gradio app running on Spaces.
The social layer matters too. Every model, dataset, and Space on the Hub has a discussion tab where the community can ask questions, report issues, and share results. Models have like counts and download stats. There's a trending page. It's more like a social network for ML artifacts than a static file host.
I'd never clicked "Spaces" in the top nav. I had no idea there were a million live applications and an entire community layer running on the same platform where I download model files.
Inference API
The Inference API is how HuggingFace makes money. Two tiers.
Serverless: free and rate-limited at a few hundred requests per hour ($9/month for a PRO account with higher limits). It runs on shared infrastructure and dynamically loads whatever model you request. Good for testing.
Dedicated Inference Endpoints: you pick the model, the hardware, the cloud provider, and you get a private API endpoint. Fully managed by HuggingFace but running on hardware reserved for you. This is the production tier, for teams that want to run an open-source model as a service without managing the infrastructure themselves.
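Under the hood, the serverless tier is a plain HTTPS endpoint. A stdlib-only sketch of the classic request shape (the model id, token handling, and helper names here are illustrative):

```python
# Build and send a request to the serverless Inference API:
# POST {"inputs": ...} to /models/<model-id> with a bearer token.
import json
import urllib.request

API_BASE = "https://api-inference.huggingface.co/models"

def build_request(model: str, prompt: str, token: str) -> urllib.request.Request:
    return urllib.request.Request(
        f"{API_BASE}/{model}",
        data=json.dumps({"inputs": prompt}).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )

def query(model: str, prompt: str, token: str) -> object:
    # Network call; requires a valid HuggingFace token.
    with urllib.request.urlopen(build_request(model, prompt, token)) as resp:
        return json.load(resp)
```

The huggingface_hub library wraps this same surface in a higher-level client, but the raw shape shows why "testing tier" is the right description: one POST, one JSON response.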
The business model is interesting. Hosting models on the Hub is free. Hosting datasets is free. Spaces on free-tier hardware is free. HuggingFace gave away the storage and community layer, and monetized the compute. It's the same playbook as GitHub (free repos, paid features and enterprise), but applied to ML infrastructure. HuggingFace's $130 million in revenue (as of late 2024) comes largely from these compute services. The models are the hook. The compute is the business.
The libraries are the glue
The Transformers library is what started the company's pivot and it's still the core of the developer experience. The Pipeline API is the simplest entry point:
```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I love running models locally")
# [{'label': 'POSITIVE', 'score': 0.9998}]
```

Three lines of Python. The library handles downloading the model, tokenizing the input, running inference, and formatting the output. It supports text classification, named entity recognition, question answering, translation, summarization, image classification, object detection, speech recognition, and more. Point it at a task and a model, and it handles the rest.
This is what most ML practitioners interact with when they use HuggingFace. For someone like me who downloads GGUFs and runs them through Ollama, the Transformers library exists in a parallel universe. I don't need it because Ollama and llama.cpp handle inference for me at a lower level. But it's the library that drives HuggingFace's adoption in research and production ML. It's the reason the platform has 13 million users instead of a few hundred thousand. Without the Transformers library, HuggingFace would be a file hosting service. With it, the platform became the default place where the ML community shares and builds on each other's work.
Then there's smolagents, which launched on December 31, 2024. It's a lightweight agent framework in about 1,000 lines of code. You can build an agent with built-in tool access using any LLM backend: local models through transformers or Ollama, cloud models through LiteLLM. The fact that HuggingFace is building agent infrastructure on top of its model hosting signals where the platform is heading next. Not just storing models, but providing the tools to build autonomous systems with them.
I still just download GGUFs
My actual HuggingFace workflow hasn't changed. I still browse model cards, check the quantization options, download the GGUF, and load it into Ollama. Same as before.
But now when I do that, I understand what's underneath. A file quantized by a community volunteer using imatrix calibration, hosted on a storage backend that deduplicates at the chunk level, discoverable through structured YAML metadata, sitting on a platform that started as a chatbot for teenagers and became the infrastructure layer for open-source AI.
It's not GitHub for AI. That comparison gets the shape right but misses the scale. GitHub hosts code. HuggingFace hosts the models, the data that trained them, the apps that use them, the compute to run them, and the community that converts them into formats the rest of us can actually use.
Next time you run ollama pull, think about where that model actually came from. A research lab trained it on a dataset that's probably hosted on the Hub. Someone in the community quantized the weights and uploaded the GGUF. The model card tells you the license, the training data, the base model it was fine-tuned from, and the benchmarks it was tested against. The path from a lab's GPU cluster to your local machine runs through a much bigger ecosystem than I realized. And almost all of it runs through HuggingFace.