Building My Own Archive Server in C++, Part 2


#Projects

Over the past few months, the archive server project has evolved from a simple storage utility into a much larger system focused on long-term preservation, deduplication, indexing and recovery of personal data.

What started as a backend-focused experiment has grown into a complete self-hosted archive platform with a graphical desktop client, chunked uploads, folder importing, metadata indexing and restore workflows. The main goal of the project remains the same: creating a deterministic archive where files are identified by their actual content instead of by filenames or folder structures.

The Original Goal

The motivation behind the project came from dealing with years of scattered files across old hard drives, NAS systems and backups.

I wanted a system where:

  • Every file has a stable identity
  • Duplicate files are automatically detected
  • Historical data can be imported in large batches
  • Files can always be reconstructed even if metadata is lost
  • The archive itself becomes the source of truth

Instead of organizing files manually through folders and naming conventions, the system relies entirely on content-based storage.

Content Based Storage

The archive server still uses BLAKE3 hashing as the foundation of the storage layer.

When a file is imported, the server computes a hash of the file contents. That hash becomes the permanent identity of the object. If the same file is uploaded again from another machine or drive, the archive immediately detects it and avoids storing duplicate data.

This means deduplication happens entirely at the storage layer rather than by comparing filenames or metadata.

The archive does not care where a file came from. It only cares about the contents of the file itself.
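
As a rough illustration, hashing a file with the official BLAKE3 C API looks something like the sketch below. The function name, buffer size and hex conversion are illustrative, not the project's actual code.

```cpp
#include <blake3.h>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Stream a file through BLAKE3 and return the digest as a hex string.
std::string hash_file_blake3(const std::string& path) {
    blake3_hasher hasher;
    blake3_hasher_init(&hasher);

    std::ifstream in(path, std::ios::binary);
    std::vector<char> buffer(1 << 20);  // 1 MiB read buffer
    while (in.read(buffer.data(), buffer.size()) || in.gcount() > 0) {
        blake3_hasher_update(&hasher, buffer.data(),
                             static_cast<size_t>(in.gcount()));
    }

    uint8_t digest[BLAKE3_OUT_LEN];  // 32-byte digest
    blake3_hasher_finalize(&hasher, digest, BLAKE3_OUT_LEN);

    char hex[2 * BLAKE3_OUT_LEN + 1];
    for (size_t i = 0; i < BLAKE3_OUT_LEN; ++i)
        std::snprintf(hex + 2 * i, 3, "%02x", static_cast<unsigned>(digest[i]));
    return std::string(hex, 2 * BLAKE3_OUT_LEN);
}
```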

Custom Object Storage Format

One of the most important design decisions in the project has been the custom archive object format.

Each stored object contains:

  • A binary header
  • Embedded JSON metadata
  • The raw file contents

The metadata is intentionally embedded directly inside each stored archive object rather than kept only in MongoDB.

This allows the archive to be reconstructed by scanning the object storage directly if the database is lost or corrupted. MongoDB acts as an index and query layer rather than the only source of truth.

That separation between storage and indexing has become one of the core architectural principles of the project.
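
The exact on-disk layout is specific to the project, but a minimal sketch of the idea could look like this. The magic value, field widths and helper name are hypothetical; only the three-part layout (binary header, embedded JSON metadata, raw contents) comes from the design described above.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <nlohmann/json.hpp>

// Write one archive object: binary header, embedded JSON metadata, raw contents.
// Header fields are written in native byte order for brevity.
void write_archive_object(const std::string& object_path,
                          const nlohmann::json& metadata,
                          const std::string& file_contents) {
    std::ofstream out(object_path, std::ios::binary);

    const char magic[4] = {'A', 'R', 'C', '1'};  // hypothetical magic value
    const uint32_t version = 1;
    const std::string meta = metadata.dump();
    const uint64_t meta_len = meta.size();

    // Binary header.
    out.write(magic, sizeof(magic));
    out.write(reinterpret_cast<const char*>(&version), sizeof(version));
    out.write(reinterpret_cast<const char*>(&meta_len), sizeof(meta_len));

    // Embedded JSON metadata, followed by the raw file contents.
    out.write(meta.data(), static_cast<std::streamsize>(meta.size()));
    out.write(file_contents.data(),
              static_cast<std::streamsize>(file_contents.size()));
}
```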

MongoDB Integration

The backend uses MongoDB together with the official MongoDB C++ driver for metadata indexing and querying.

Each archived object is represented as a document containing:

  • Content hash
  • Original filenames
  • File size
  • Import timestamps
  • Folder relationships
  • Additional metadata

The content hash is indexed as unique, which guarantees deduplication at the database level as well.

The import pipeline uses upsert operations so repeated imports never create duplicate metadata entries.
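
A minimal sketch of that pattern with the mongocxx driver might look like the following; the database, collection and field names are illustrative rather than the project's actual schema.

```cpp
#include <cstdint>
#include <string>
#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/options/update.hpp>
#include <mongocxx/uri.hpp>

using bsoncxx::builder::basic::kvp;
using bsoncxx::builder::basic::make_document;

int main() {
    mongocxx::instance instance{};  // must exist once per process
    mongocxx::client client{mongocxx::uri{"mongodb://localhost:27017"}};
    auto objects = client["archive"]["objects"];

    // Unique index on the content hash enforces deduplication in the database.
    objects.create_index(make_document(kvp("content_hash", 1)),
                         make_document(kvp("unique", true)));

    // Upsert keyed on the hash: repeated imports update the same document
    // instead of creating duplicates.
    const std::string hash = "blake3-hex-digest-of-the-file";  // placeholder
    objects.update_one(
        make_document(kvp("content_hash", hash)),
        make_document(kvp("$set", make_document(
            kvp("original_name", "photo.jpg"),
            kvp("size", static_cast<std::int64_t>(12345))))),
        mongocxx::options::update{}.upsert(true));
}
```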

Recursive Folder Imports

One of the largest additions since the original version has been support for recursive folder imports.

The archive can now import entire directory structures while preserving relative paths and metadata. This makes it possible to archive large external drives and historical collections in a single operation.

Folder imports introduced several architectural challenges:

  • Tracking import state
  • Handling interrupted uploads
  • Preserving directory structures
  • Managing duplicate detection efficiently
  • Associating files with import sessions

The system now supports importing nested folders through both the backend API and the graphical desktop client.
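
Under the hood, a recursive import boils down to walking the tree and recording each file's path relative to the import root. A minimal sketch with std::filesystem; the import_file call is a hypothetical stand-in for the real pipeline.

```cpp
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

// Walk a directory tree and hand each regular file to the import pipeline
// together with its path relative to the import root, so the original
// structure can be recorded in the archive metadata.
void import_folder(const fs::path& root) {
    for (const auto& entry : fs::recursive_directory_iterator(root)) {
        if (!entry.is_regular_file())
            continue;
        const fs::path relative = fs::relative(entry.path(), root);
        // import_file(entry.path(), relative.generic_string());  // hypothetical pipeline call
        std::cout << "importing " << relative.generic_string() << "\n";
    }
}
```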

Import Sessions

The project now includes a concept called import sessions.

An import session tracks metadata about a larger archive operation such as:

  • Source machine
  • Imported folders
  • Start and end timestamps
  • Number of files processed
  • Total imported size
  • Failed or skipped files

This makes the archive significantly more useful as a historical record: it becomes possible to see exactly when, and from where, files entered the system.
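
As a rough sketch, a session record could be represented as a JSON document like the one below; the field names and placeholder values are hypothetical, not the project's actual schema.

```cpp
#include <string>
#include <nlohmann/json.hpp>

// Build an illustrative import session record.
nlohmann::json make_import_session(const std::string& source_machine,
                                   const std::string& folder) {
    return {
        {"source_machine", source_machine},
        {"imported_folders", nlohmann::json::array({folder})},
        {"started_at", "2024-01-01T12:00:00Z"},  // placeholder timestamp
        {"finished_at", nullptr},                // filled in when the import completes
        {"files_processed", 0},
        {"total_bytes", 0},
        {"failed_files", nlohmann::json::array()},
        {"skipped_files", nlohmann::json::array()}
    };
}
```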

HTTP Server and API

The backend has evolved into a proper HTTP based archive service.

The server is written in C++ using cpp-httplib together with JSON serialization through nlohmann/json.

Current API functionality includes:

  • Starting uploads
  • Chunked upload handling
  • Upload finalization
  • Recursive folder importing
  • File restoration
  • Metadata queries
  • Search functionality
  • Folder browsing
  • Archive statistics

Chunked uploads were added to support large files and improve reliability during transfers.
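
A stripped-down sketch of how chunked upload routes can be wired up with cpp-httplib; the route names and JSON fields here are illustrative, not the server's actual API surface.

```cpp
#include <string>
#include <httplib.h>
#include <nlohmann/json.hpp>

int main() {
    httplib::Server svr;

    // Begin an upload and hand back an upload id for subsequent chunks.
    svr.Post("/upload/start", [](const httplib::Request&, httplib::Response& res) {
        nlohmann::json reply = {{"upload_id", "example-id"}};
        res.set_content(reply.dump(), "application/json");
    });

    // Receive one chunk of file data in the request body.
    svr.Post("/upload/chunk", [](const httplib::Request& req, httplib::Response& res) {
        // req.body holds the raw chunk bytes; append them to a staging file here.
        nlohmann::json reply = {{"received", req.body.size()}};
        res.set_content(reply.dump(), "application/json");
    });

    // Finalize: hash the assembled file and move it into content-addressed storage.
    svr.Post("/upload/finish", [](const httplib::Request&, httplib::Response& res) {
        res.set_content(R"({"status":"stored"})", "application/json");
    });

    svr.listen("0.0.0.0", 8080);
}
```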

Graphical Desktop Client

One of the biggest changes in the project is the addition of a native desktop GUI client.

The client is built using:

  • Dear ImGui
  • GLFW
  • OpenGL

The interface provides:

  • File explorer views
  • Drag and drop uploads
  • Upload progress tracking
  • Search functionality
  • Restore workflows
  • Archive statistics
  • Folder browsing

This transformed the project from a backend experiment into something much closer to a usable archive platform.
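
As one concrete example of the GUI plumbing, native drag and drop in a GLFW-based client comes down to registering a drop callback. The sketch below shows only the wiring, with the upload hand-off reduced to a print statement; window size and title are arbitrary.

```cpp
#include <cstdio>
#include <GLFW/glfw3.h>

// GLFW reports dropped files through a callback; each path would normally be
// handed to the upload queue.
static void on_drop(GLFWwindow*, int count, const char** paths) {
    for (int i = 0; i < count; ++i)
        std::printf("queued for upload: %s\n", paths[i]);
}

int main() {
    if (!glfwInit()) return 1;
    GLFWwindow* window = glfwCreateWindow(1280, 720, "Archive Client", nullptr, nullptr);
    if (!window) { glfwTerminate(); return 1; }

    glfwMakeContextCurrent(window);
    glfwSetDropCallback(window, on_drop);

    while (!glfwWindowShouldClose(window)) {
        glfwPollEvents();       // drag-and-drop events arrive here
        glfwSwapBuffers(window);
    }
    glfwTerminate();
}
```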

Restore System

The restore pipeline has also improved significantly.

Files and folders can now be restored directly from the archive while preserving their original structure.

The restore system streams object contents directly from storage and reconstructs files using the embedded metadata stored inside each archive object.
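
A sketch of that read path, mirroring the hypothetical object layout from the storage sketch earlier: skip the header, parse the embedded JSON metadata, then stream the remaining bytes to the restore location. The original_path field is an assumed metadata key.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

// Read the header and embedded JSON metadata, then stream the remaining
// bytes into the restored file at its original relative path.
void restore_object(const std::string& object_path,
                    const std::filesystem::path& restore_root) {
    std::ifstream in(object_path, std::ios::binary);

    char magic[4];
    uint32_t version = 0;
    uint64_t meta_len = 0;
    in.read(magic, sizeof(magic));
    in.read(reinterpret_cast<char*>(&version), sizeof(version));
    in.read(reinterpret_cast<char*>(&meta_len), sizeof(meta_len));

    std::string meta(meta_len, '\0');
    in.read(meta.data(), static_cast<std::streamsize>(meta_len));
    const auto metadata = nlohmann::json::parse(meta);

    const auto target =
        restore_root / metadata.value("original_path", std::string("unnamed"));
    std::filesystem::create_directories(target.parent_path());

    std::ofstream out(target, std::ios::binary);
    std::vector<char> buffer(1 << 20);
    while (in.read(buffer.data(), buffer.size()) || in.gcount() > 0)
        out.write(buffer.data(), in.gcount());
}
```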

Cross Platform Development

The project is designed to compile and run on both Linux and Windows.

The build system uses CMake together with vcpkg for dependency management.

Cross-platform support introduced several challenges, including:

  • UTF-8 filename handling
  • Filesystem path compatibility
  • MongoDB driver integration
  • Build reproducibility
  • Native drag and drop handling
  • Platform-specific file APIs

A large amount of development time has gone into making the project behave consistently across operating systems.
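
UTF-8 filename handling is a good example. One workable approach is to treat every string crossing the API and database boundary as UTF-8 and convert only at the filesystem edge, roughly like this (a sketch, not the project's actual helpers):

```cpp
#include <filesystem>
#include <string>

// Convert between UTF-8 strings (as stored in archive metadata) and
// std::filesystem::path, which uses UTF-16 natively on Windows.
std::filesystem::path path_from_utf8(const std::string& utf8) {
    return std::filesystem::u8path(utf8);  // deprecated in C++20, still fine in C++17
}

std::string path_to_utf8(const std::filesystem::path& p) {
    const auto u8 = p.u8string();          // std::string in C++17, std::u8string in C++20
    return std::string(u8.begin(), u8.end());
}
```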

Current State

The archive server currently supports:

  • Content-based deduplicated storage
  • Recursive folder imports
  • Chunked uploads
  • MongoDB metadata indexing
  • Restore functionality
  • Search and browsing
  • Import session tracking
  • Native graphical desktop client
  • Cross platform builds

The project has moved far beyond the original prototype and is now approaching a fully usable self-hosted archival system.

Future Plans

There are still several areas planned for future development.

Some of the next larger goals include:

  • Multi user support
  • Authentication and permissions
  • Background indexing workers
  • Object compression
  • Incremental synchronization
  • Better metadata extraction
  • Media previews
  • Snapshot style restore points
  • Remote replication between archive servers

In the long term, the goal is not to compete with cloud storage platforms but to build a deterministic archival system focused on long-term preservation and ownership of personal data.

Why Continue Building It?

One of the most rewarding aspects of the project has been understanding the deeper layers of how storage systems actually work.

Building the system in C++ forces careful thinking about:

  • File IO
  • Atomic operations
  • Binary formats
  • Memory management
  • Network communication
  • Cross platform compatibility
  • Database synchronization

The archive server has become both a practical tool and a long term systems programming project.

There is something satisfying about knowing that every archived file is uniquely identified by its actual content and can always be reconstructed from the storage layer itself without relying entirely on external metadata.