Building My Own Archive Server in C++, Part 2


#Projects

Over the past few months, the archive server project has evolved from a simple storage utility into a much larger system focused on long-term preservation, deduplication, indexing and recovery of personal data.

What started as a backend-focused experiment has grown into a complete self-hosted archive platform with a graphical desktop client, chunked uploads, folder importing, metadata indexing and restore workflows. The main goal of the project remains the same: creating a deterministic archive where files are identified by their actual content instead of by filenames or folder structures.

The Original Goal

The motivation behind the project came from dealing with years of scattered files across old hard drives, NAS systems and backups.

I wanted a system where:

  • Every file has a stable identity
  • Duplicate files are automatically detected
  • Historical data can be imported in large batches
  • Files can always be reconstructed even if metadata is lost
  • The archive itself becomes the source of truth

Instead of organizing files manually through folders and naming conventions, the system relies entirely on content-based storage.

Content Based Storage

The archive server still uses BLAKE3 hashing as the foundation of the storage layer.

When a file is imported, the server computes a hash of the file contents. That hash becomes the permanent identity of the object. If the same file is uploaded again from another machine or drive, the archive immediately detects it and avoids storing duplicate data.

This means deduplication happens entirely at the storage layer rather than by comparing filenames or metadata.

The archive does not care where a file came from. It only cares about the contents of the file itself.
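
As a rough illustration, hashing a file with the official BLAKE3 C API looks something like the sketch below. The function name, buffer size and hex conversion are illustrative, not the project's actual code.

```cpp
#include <blake3.h>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Stream a file through BLAKE3 and return the digest as a hex string.
std::string hash_file_blake3(const std::string& path) {
    blake3_hasher hasher;
    blake3_hasher_init(&hasher);

    std::ifstream in(path, std::ios::binary);
    std::vector<char> buffer(1 << 20);  // 1 MiB read buffer
    while (in.read(buffer.data(), buffer.size()) || in.gcount() > 0) {
        blake3_hasher_update(&hasher, buffer.data(),
                             static_cast<size_t>(in.gcount()));
    }

    uint8_t digest[BLAKE3_OUT_LEN];  // 32-byte digest
    blake3_hasher_finalize(&hasher, digest, BLAKE3_OUT_LEN);

    char hex[2 * BLAKE3_OUT_LEN + 1];
    for (size_t i = 0; i < BLAKE3_OUT_LEN; ++i)
        std::snprintf(hex + 2 * i, 3, "%02x", static_cast<unsigned>(digest[i]));
    return std::string(hex, 2 * BLAKE3_OUT_LEN);
}
```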

Custom Object Storage Format

One of the most important design decisions in the project has been the custom archive object format.

Each stored object contains:

  • A binary header
  • Embedded JSON metadata
  • The raw file contents

The metadata is intentionally embedded directly inside each stored archive object rather than kept only in MongoDB.

This allows the archive to be reconstructed by scanning the object storage directly if the database is lost or corrupted. MongoDB acts as an index and query layer rather than the only source of truth.

That separation between storage and indexing has become one of the core architectural principles of the project.
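
The exact on-disk layout is specific to the project, but a minimal sketch of the idea could look like this. The magic value, field widths and helper name are hypothetical; only the three-part layout (binary header, embedded JSON metadata, raw contents) comes from the design described above.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <nlohmann/json.hpp>

// Write one archive object: binary header, embedded JSON metadata, raw contents.
// Header fields are written in native byte order for brevity.
void write_archive_object(const std::string& object_path,
                          const nlohmann::json& metadata,
                          const std::string& file_contents) {
    std::ofstream out(object_path, std::ios::binary);

    const char magic[4] = {'A', 'R', 'C', '1'};  // hypothetical magic value
    const uint32_t version = 1;
    const std::string meta = metadata.dump();
    const uint64_t meta_len = meta.size();

    // Binary header.
    out.write(magic, sizeof(magic));
    out.write(reinterpret_cast<const char*>(&version), sizeof(version));
    out.write(reinterpret_cast<const char*>(&meta_len), sizeof(meta_len));

    // Embedded JSON metadata, followed by the raw file contents.
    out.write(meta.data(), static_cast<std::streamsize>(meta.size()));
    out.write(file_contents.data(),
              static_cast<std::streamsize>(file_contents.size()));
}
```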

MongoDB Integration

The backend uses MongoDB together with the official MongoDB C++ driver for metadata indexing and querying.

Each archived object is represented as a document containing:

  • Content hash
  • Original filenames
  • File size
  • Import timestamps
  • Folder relationships
  • Additional metadata

The content hash is indexed as unique, which guarantees deduplication at the database level as well.

The import pipeline uses upsert operations so repeated imports never create duplicate metadata entries.
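
A minimal sketch of that pattern with the mongocxx driver might look like the following; the database, collection and field names are illustrative rather than the project's actual schema.

```cpp
#include <cstdint>
#include <string>
#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/options/update.hpp>
#include <mongocxx/uri.hpp>

using bsoncxx::builder::basic::kvp;
using bsoncxx::builder::basic::make_document;

int main() {
    mongocxx::instance instance{};  // must exist once per process
    mongocxx::client client{mongocxx::uri{"mongodb://localhost:27017"}};
    auto objects = client["archive"]["objects"];

    // Unique index on the content hash enforces deduplication in the database.
    objects.create_index(make_document(kvp("content_hash", 1)),
                         make_document(kvp("unique", true)));

    // Upsert keyed on the hash: repeated imports update the same document
    // instead of creating duplicates.
    const std::string hash = "blake3-hex-digest-of-the-file";  // placeholder
    objects.update_one(
        make_document(kvp("content_hash", hash)),
        make_document(kvp("$set", make_document(
            kvp("original_name", "photo.jpg"),
            kvp("size", static_cast<std::int64_t>(12345))))),
        mongocxx::options::update{}.upsert(true));
}
```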

Recursive Folder Imports

One of the largest additions since the original version has been support for recursive folder imports.

The archive can now import entire directory structures while preserving relative paths and metadata. This makes it possible to archive large external drives and historical collections in a single operation.

Folder imports introduced several architectural challenges:

  • Tracking import state
  • Handling interrupted uploads
  • Preserving directory structures
  • Managing duplicate detection efficiently
  • Associating files with import sessions

The system now supports importing nested folders through both the backend API and the graphical desktop client.
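
Under the hood, a recursive import boils down to walking the tree and recording each file's path relative to the import root. A minimal sketch with std::filesystem; the import_file call is a hypothetical stand-in for the real pipeline.

```cpp
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

// Walk a directory tree and hand each regular file to the import pipeline
// together with its path relative to the import root, so the original
// structure can be recorded in the archive metadata.
void import_folder(const fs::path& root) {
    for (const auto& entry : fs::recursive_directory_iterator(root)) {
        if (!entry.is_regular_file())
            continue;
        const fs::path relative = fs::relative(entry.path(), root);
        // import_file(entry.path(), relative.generic_string());  // hypothetical pipeline call
        std::cout << "importing " << relative.generic_string() << "\n";
    }
}
```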

Import Sessions

The project now includes a concept called import sessions.

An import session tracks metadata about a larger archive operation such as:

  • Source machine
  • Imported folders
  • Start and end timestamps
  • Number of files processed
  • Total imported size
  • Failed or skipped files

This makes the archive significantly more useful as a historical record: it becomes possible to see exactly when, and from where, files entered the system.
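
As a rough sketch, a session record could be represented as a JSON document like the one below; the field names and placeholder values are hypothetical, not the project's actual schema.

```cpp
#include <string>
#include <nlohmann/json.hpp>

// Build an illustrative import session record.
nlohmann::json make_import_session(const std::string& source_machine,
                                   const std::string& folder) {
    return {
        {"source_machine", source_machine},
        {"imported_folders", nlohmann::json::array({folder})},
        {"started_at", "2024-01-01T12:00:00Z"},  // placeholder timestamp
        {"finished_at", nullptr},                // filled in when the import completes
        {"files_processed", 0},
        {"total_bytes", 0},
        {"failed_files", nlohmann::json::array()},
        {"skipped_files", nlohmann::json::array()}
    };
}
```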

HTTP Server and API

The backend has evolved into a proper HTTP based archive service.

The server is written in C++ using cpp-httplib together with JSON serialization through nlohmann/json.

Current API functionality includes:

  • Starting uploads
  • Chunked upload handling
  • Upload finalization
  • Recursive folder importing
  • File restoration
  • Metadata queries
  • Search functionality
  • Folder browsing
  • Archive statistics

Chunked uploads were added to support large files and improve reliability during transfers.
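
A stripped-down sketch of how chunked upload routes can be wired up with cpp-httplib; the route names and JSON fields here are illustrative, not the server's actual API surface.

```cpp
#include <string>
#include <httplib.h>
#include <nlohmann/json.hpp>

int main() {
    httplib::Server svr;

    // Begin an upload and hand back an upload id for subsequent chunks.
    svr.Post("/upload/start", [](const httplib::Request&, httplib::Response& res) {
        nlohmann::json reply = {{"upload_id", "example-id"}};
        res.set_content(reply.dump(), "application/json");
    });

    // Receive one chunk of file data in the request body.
    svr.Post("/upload/chunk", [](const httplib::Request& req, httplib::Response& res) {
        // req.body holds the raw chunk bytes; append them to a staging file here.
        nlohmann::json reply = {{"received", req.body.size()}};
        res.set_content(reply.dump(), "application/json");
    });

    // Finalize: hash the assembled file and move it into content-addressed storage.
    svr.Post("/upload/finish", [](const httplib::Request&, httplib::Response& res) {
        res.set_content(R"({"status":"stored"})", "application/json");
    });

    svr.listen("0.0.0.0", 8080);
}
```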

Graphical Desktop Client

One of the biggest changes in the project is the addition of a native desktop GUI client.

The client is built using:

  • Dear ImGui
  • GLFW
  • OpenGL

The interface provides:

  • File explorer views
  • Drag and drop uploads
  • Upload progress tracking
  • Search functionality
  • Restore workflows
  • Archive statistics
  • Folder browsing

This transformed the project from a backend experiment into something much closer to a usable archive platform.
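
As one concrete example of the GUI plumbing, native drag and drop in a GLFW-based client comes down to registering a drop callback. The sketch below shows only the wiring, with the upload hand-off reduced to a print statement; window size and title are arbitrary.

```cpp
#include <cstdio>
#include <GLFW/glfw3.h>

// GLFW reports dropped files through a callback; each path would normally be
// handed to the upload queue.
static void on_drop(GLFWwindow*, int count, const char** paths) {
    for (int i = 0; i < count; ++i)
        std::printf("queued for upload: %s\n", paths[i]);
}

int main() {
    if (!glfwInit()) return 1;
    GLFWwindow* window = glfwCreateWindow(1280, 720, "Archive Client", nullptr, nullptr);
    if (!window) { glfwTerminate(); return 1; }

    glfwMakeContextCurrent(window);
    glfwSetDropCallback(window, on_drop);

    while (!glfwWindowShouldClose(window)) {
        glfwPollEvents();       // drag-and-drop events arrive here
        glfwSwapBuffers(window);
    }
    glfwTerminate();
}
```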

Restore System

The restore pipeline has also improved significantly.

Files and folders can now be restored directly from the archive while preserving their original structure.

The restore system streams object contents directly from storage and reconstructs files using the embedded metadata stored inside each archive object.
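
A sketch of that read path, mirroring the hypothetical object layout from the storage sketch earlier: skip the header, parse the embedded JSON metadata, then stream the remaining bytes to the restore location. The original_path field is an assumed metadata key.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

// Read the header and embedded JSON metadata, then stream the remaining
// bytes into the restored file at its original relative path.
void restore_object(const std::string& object_path,
                    const std::filesystem::path& restore_root) {
    std::ifstream in(object_path, std::ios::binary);

    char magic[4];
    uint32_t version = 0;
    uint64_t meta_len = 0;
    in.read(magic, sizeof(magic));
    in.read(reinterpret_cast<char*>(&version), sizeof(version));
    in.read(reinterpret_cast<char*>(&meta_len), sizeof(meta_len));

    std::string meta(meta_len, '\0');
    in.read(meta.data(), static_cast<std::streamsize>(meta_len));
    const auto metadata = nlohmann::json::parse(meta);

    const auto target =
        restore_root / metadata.value("original_path", std::string("unnamed"));
    std::filesystem::create_directories(target.parent_path());

    std::ofstream out(target, std::ios::binary);
    std::vector<char> buffer(1 << 20);
    while (in.read(buffer.data(), buffer.size()) || in.gcount() > 0)
        out.write(buffer.data(), in.gcount());
}
```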

Cross Platform Development

The project is designed to compile and run on both Linux and Windows.

The build system uses CMake together with vcpkg for dependency management.

Cross-platform support introduced several challenges, including:

  • UTF-8 filename handling
  • Filesystem path compatibility
  • MongoDB driver integration
  • Build reproducibility
  • Native drag and drop handling
  • Platform-specific file APIs

A large amount of development time has gone into making the project behave consistently across operating systems.
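
UTF-8 filename handling is a good example. One workable approach is to treat every string crossing the API and database boundary as UTF-8 and convert only at the filesystem edge, roughly like this (a sketch, not the project's actual helpers):

```cpp
#include <filesystem>
#include <string>

// Convert between UTF-8 strings (as stored in archive metadata) and
// std::filesystem::path, which uses UTF-16 natively on Windows.
std::filesystem::path path_from_utf8(const std::string& utf8) {
    return std::filesystem::u8path(utf8);  // deprecated in C++20, still fine in C++17
}

std::string path_to_utf8(const std::filesystem::path& p) {
    const auto u8 = p.u8string();          // std::string in C++17, std::u8string in C++20
    return std::string(u8.begin(), u8.end());
}
```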

Current State

The archive server currently supports:

  • Content-based deduplicated storage
  • Recursive folder imports
  • Chunked uploads
  • MongoDB metadata indexing
  • Restore functionality
  • Search and browsing
  • Import session tracking
  • Native graphical desktop client
  • Cross platform builds

The project has moved far beyond the original prototype and is now approaching a fully usable self-hosted archival system.

Future Plans

There are still several areas planned for future development.

Some of the next larger goals include:

  • Multi user support
  • Authentication and permissions
  • Background indexing workers
  • Object compression
  • Incremental synchronization
  • Better metadata extraction
  • Media previews
  • Snapshot style restore points
  • Remote replication between archive servers

In the long term, the goal is not to compete with cloud storage platforms but to build a deterministic archival system focused on long-term preservation and ownership of personal data.

Why Continue Building It?

One of the most rewarding aspects of the project has been understanding the deeper layers of how storage systems actually work.

Building the system in C++ forces careful thinking about:

  • File IO
  • Atomic operations
  • Binary formats
  • Memory management
  • Network communication
  • Cross platform compatibility
  • Database synchronization

The archive server has become both a practical tool and a long term systems programming project.

There is something satisfying about knowing that every archived file is uniquely identified by its actual content and can always be reconstructed from the storage layer itself without relying entirely on external metadata.