Over the past weeks I’ve been building something that started as a small utility and slowly turned into a proper backend project. The goal is simple: create a self-hosted archive server that lets me import and preserve files from different machines in a structured, deterministic way.
The Problem: Scattered Data and Hidden Duplicates
The motivation came from a familiar situation. Old hard drives, random folders copied over the years, backups with slightly different names, and no clear overview of what is actually stored.
I wanted a system where:
- Every file has a stable identity
- Duplicates are handled automatically
- Historical data can be imported in batches
- The archive itself becomes the single source of truth
Instead of relying on folder structures and naming conventions, I wanted something more deterministic.
Content-Based Storage
At the core of the system is content-based storage.
When a file is imported, the server computes a BLAKE3 hash of the file contents. That hash becomes the file’s identity. If the same file is uploaded again from another machine, even under a different name, it is detected immediately and stored only once.
Deduplication happens at the storage layer, not by comparing filenames.
This means the archive does not care about where a file came from. It only cares about what the file actually is.
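The identity computation can be sketched as follows. The real server uses BLAKE3; since that depends on the external blake3 library, this sketch substitutes FNV-1a (a far weaker hash, not collision-safe) purely to show the streaming shape: read the file in chunks, feed each chunk into the hasher, and use the final digest as the object ID. The function names here are illustrative, not the project’s actual API.

```cpp
#include <cstdint>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for the content hash. The real project uses BLAKE3; FNV-1a is
// used here only so the sketch compiles without external dependencies.
std::string hash_stream(std::istream& in) {
    std::uint64_t h = 1469598103934665603ull;  // FNV-1a offset basis
    std::vector<char> buf(64 * 1024);          // stream in 64 KiB chunks
    while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
        for (std::streamsize i = 0; i < in.gcount(); ++i) {
            h ^= static_cast<unsigned char>(buf[i]);
            h *= 1099511628211ull;             // FNV-1a prime
        }
    }
    std::ostringstream hex;
    hex << std::hex << std::setw(16) << std::setfill('0') << h;
    return hex.str();                          // the object's identity
}

std::string hash_file_contents(const std::string& path) {
    std::ifstream f(path, std::ios::binary);
    return hash_stream(f);
}
```

Because the identity is derived only from the bytes, two copies of the same file under different names on different machines hash to the same ID, which is exactly what makes the dedup automatic.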
A Custom Object Format
Each imported file is stored as a custom container file on disk.
The container includes:
- A small binary header
- Embedded metadata in JSON
- The raw file bytes
Embedding metadata directly inside the object file was a deliberate choice. It ensures that even if the database layer is lost or corrupted, the archive can be reconstructed by scanning the stored objects. The database acts as an index, not as the only source of truth.
That separation between storage and indexing makes the system more robust.
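A container along these lines can be sketched like this. The project’s real header layout isn’t documented here, so the magic bytes, field order, and sizes below are assumptions chosen for illustration: a fixed magic, a length-prefixed JSON metadata block, then the raw payload.

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical container layout (the real header may differ):
//   bytes 0-3   : magic "ARC1"
//   bytes 4-11  : metadata length, uint64 in host byte order
//                 (a real format would pin the endianness)
//   next N bytes: metadata as JSON text
//   remainder   : the raw file bytes
constexpr char kMagic[4] = {'A', 'R', 'C', '1'};

void write_object(const std::string& path,
                  const std::string& meta_json,
                  const std::string& payload) {
    std::ofstream out(path, std::ios::binary);
    out.write(kMagic, 4);
    std::uint64_t len = meta_json.size();
    out.write(reinterpret_cast<const char*>(&len), sizeof len);
    out.write(meta_json.data(), meta_json.size());
    out.write(payload.data(), payload.size());
}

// Reading back only the metadata is what makes index reconstruction
// possible: scanning every object rebuilds the database from disk alone.
std::string read_metadata(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    char magic[4];
    in.read(magic, 4);
    if (!in || std::string(magic, 4) != std::string(kMagic, 4))
        throw std::runtime_error("not an archive object");
    std::uint64_t len = 0;
    in.read(reinterpret_cast<char*>(&len), sizeof len);
    std::string meta(len, '\0');
    in.read(meta.data(), len);
    return meta;
}
```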
Metadata and Indexing with MongoDB
For metadata indexing I integrated MongoDB using the official C++ driver.
Every imported file gets a document in the archive database. The content hash is indexed as unique, which enforces deduplication at the database level as well.
The import pipeline performs an upsert operation. If the file already exists, nothing breaks and no duplicate metadata is created. If it does not exist, a new document is inserted.
This design keeps the storage layer and metadata layer aligned without complex logic.
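The upsert semantics can be illustrated with an in-memory sketch, using a `std::unordered_map` keyed by content hash as a stand-in for the collection (in the real pipeline this is a MongoDB update with the upsert option through the C++ driver; the struct and field names below are illustrative):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// In-memory stand-in for the metadata collection, keyed by the unique
// content hash. In the real system this is a MongoDB collection with a
// unique index on the hash.
struct FileDoc {
    std::string original_name;
    std::uint64_t size = 0;
};

using Index = std::unordered_map<std::string, FileDoc>;

// Upsert: insert if the hash is unseen, otherwise leave the existing
// document alone. Importing the same content twice is therefore a no-op.
bool upsert(Index& idx, const std::string& hash, FileDoc doc) {
    auto [it, inserted] = idx.try_emplace(hash, std::move(doc));
    (void)it;
    return inserted;  // true = new document, false = duplicate import
}
```

The unique index is the safety net: even if two imports race, only one document per hash can ever exist.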
Architecture and Build System
The project is structured into modules.
There is a core library responsible for:
- Hashing
- Object storage
- Restore logic
- Metadata extraction
On top of that sits a server layer that handles MongoDB integration and will later expose an HTTP API.
The entire system is built with CMake and compiles on both Windows and Linux. Integrating the Mongo C++ driver through vcpkg and configuring CMake correctly took more effort than expected, but it reinforced how important a clean and reproducible build setup is in backend systems.
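For reference, the CMake wiring for the driver roughly takes this shape when vcpkg provides the package. Target and library names below are my assumptions about a typical setup, not the project’s actual build files:

```cmake
# Assumed minimal wiring for the mongocxx driver via vcpkg.
# Configure with the vcpkg toolchain file:
#   cmake -B build -DCMAKE_TOOLCHAIN_FILE=<vcpkg-root>/scripts/buildsystems/vcpkg.cmake
find_package(mongocxx CONFIG REQUIRED)

add_library(archive_core src/hash.cpp src/object_store.cpp)   # core: hashing, storage, restore
add_executable(archive_server src/main.cpp)                   # server layer on top
target_link_libraries(archive_server PRIVATE archive_core mongo::mongocxx_shared)
```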
Current State and Next Steps
Right now the system supports:
- Importing individual files
- Deduplicated storage
- Metadata indexing in MongoDB
- Restoring files by hash
The next step is to support recursive folder imports and something I call import sessions. An import session would track which machine and which drive a batch of files came from, along with timestamps and totals. That would make it possible to look back and see exactly when a specific external drive was archived.
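One possible shape for such a session record, sketched as a plain struct. Every field name here is a guess at what the schema might track, not the project’s actual design:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical import-session record; field names are illustrative.
struct ImportSession {
    std::string session_id;       // e.g. a UUID assigned at import start
    std::string machine;          // hostname the batch came from
    std::string drive_label;      // which external drive was scanned
    std::int64_t started_at = 0;  // unix timestamps
    std::int64_t finished_at = 0;
    std::uint64_t files_seen = 0;
    std::uint64_t files_new = 0;  // files_seen - files_new = dedup hits
    std::vector<std::string> imported_hashes;  // content IDs in this batch
};
```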
Why Build This?
This project is not about building another cloud storage platform. It is about understanding the full stack of a storage system and having direct control over how my own historical data is stored and indexed.
Building it in C++ forces careful thinking about file I/O, atomic writes, binary formats, database integration, and cross-platform build systems.
There is something satisfying about knowing that every file in the archive is uniquely identified by its actual content. No guesswork. No manual organization tricks. Just a deterministic system built from first principles.