Connor Yoh 2025-09-10 20:24:54 +01:00
parent 3b3a2df392
commit 4e5789e8f4
3 changed files with 248 additions and 206 deletions


@@ -2,275 +2,318 @@
## Overview ## Overview
Stirling PDF implements a comprehensive file history tracking system that embeds metadata directly into PDF documents using the PDF keywords field. This system tracks tool operations, version progression, and file lineage through the processing pipeline. Stirling PDF implements a client-side file history system using IndexedDB storage. File metadata, including version history and tool chains, is stored as `StirlingFileStub` objects that travel alongside the actual file data. This enables comprehensive version tracking, tool history, and file lineage management without modifying PDF content.
## PDF Metadata Format ## Storage Architecture
### Storage Mechanism ### IndexedDB-Based Storage
File history is stored in the PDF **Keywords** field as a JSON string with the prefix `stirling-history:`. File history is stored in the browser's IndexedDB using the `fileStorage` service, providing:
- **Persistent storage**: Survives browser sessions and page reloads
- **Large capacity**: Supports files up to 100GB+ with full metadata
- **Fast queries**: Optimized for file browsing and history lookups
- **Type safety**: Structured TypeScript interfaces
### Metadata Structure ### Core Data Structures
```typescript ```typescript
interface PDFHistoryMetadata { interface StirlingFileStub extends BaseFileMetadata {
stirlingHistory: { id: FileId; // Unique file identifier (UUID)
originalFileId: string; // UUID of the root file in the version chain quickKey: string; // Deduplication key: name|size|lastModified
parentFileId?: string; // UUID of the immediate parent file thumbnailUrl?: string; // Generated thumbnail blob URL
versionNumber: number; // Version number (1, 2, 3, etc.) processedFile?: ProcessedFileMetadata; // PDF page data and processing results
toolChain: ToolOperation[]; // Array of applied tool operations
formatVersion: '1.0'; // Metadata format version // File Metadata
}; name: string;
size: number;
type: string;
lastModified: number;
createdAt: number;
// Version Control
isLeaf: boolean; // True if this is the latest version
versionNumber?: number; // Version number (1, 2, 3, etc.)
originalFileId?: string; // UUID of the root file in version chain
parentFileId?: string; // UUID of immediate parent file
// Tool History
toolHistory?: ToolOperation[]; // Complete sequence of applied tools
} }
interface ToolOperation { interface ToolOperation {
toolName: string; // Tool identifier (e.g., 'compress', 'sanitize') toolName: string; // Tool identifier (e.g., 'compress', 'sanitize')
timestamp: number; // When the tool was applied timestamp: number; // When the tool was applied
parameters?: Record<string, any>; // Tool-specific parameters (optional) }
interface StoredStirlingFileRecord extends StirlingFileStub {
data: ArrayBuffer; // Actual file content
fileId: FileId; // Duplicate for indexing
} }
``` ```
### Standard PDF Metadata Fields Used ## Version Management System
The system uses industry-standard PDF document information fields:
- **Creator**: Set to "Stirling-PDF" (identifies the application)
- **Producer**: Set to "Stirling-PDF" (identifies the PDF library/processor)
- **Title, Author, Subject, CreationDate**: Automatically preserved by pdf-lib during processing
- **Keywords**: Enhanced with Stirling history data while preserving user keywords
**Date Handling Strategy**:
- **PDF CreationDate**: Preserved automatically (document creation date)
- **File.lastModified**: Source of truth for "when file was last changed" (original upload time or tool processing time)
- **No duplication**: Single timestamp approach using File.lastModified for all UI displays
### Example PDF Document Information
```
PDF Document Info:
Title: "User Document Title" (preserved from original)
Author: "Document Author" (preserved from original)
Creator: "Stirling-PDF"
Producer: "Stirling-PDF"
CreationDate: "2025-01-01T10:30:00Z" (preserved from original)
Keywords: ["user-keyword", "stirling-history:{\"stirlingHistory\":{\"originalFileId\":\"abc123\",\"versionNumber\":2,\"toolChain\":[{\"toolName\":\"compress\",\"timestamp\":1756825614618},{\"toolName\":\"sanitize\",\"timestamp\":1756825631545}],\"formatVersion\":\"1.0\"}}"]
File System:
lastModified: 1756825631545 (tool processing time - source of truth for "when file was last changed")
```
## Version Numbering System
### Version Progression ### Version Progression
- **v0**: Original uploaded file (no Stirling PDF processing) - **v1**: Original uploaded file (first version)
- **v1**: First tool applied to original file - **v2**: First tool applied to original
- **v2**: Second tool applied (inherits from v1) - **v3**: Second tool applied (inherits from v2)
- **v3**: Third tool applied (inherits from v2) - **v4**: Third tool applied (inherits from v3)
- **etc.** - **etc.**
### Version Relationships ### Leaf Node System
Only the latest version of each file family is marked as `isLeaf: true`:
- **Leaf files**: Show in default file list, available for tool processing
- **History files**: Hidden by default, accessible via history expansion
### File Relationships
``` ```
document.pdf (v0) document.pdf (v1, isLeaf: false)
↓ compress ↓ compress
document.pdf (v1: compress) document.pdf (v2, isLeaf: false)
↓ sanitize ↓ sanitize
document.pdf (v2: compress → sanitize) document.pdf (v3, isLeaf: true) ← Current active version
↓ ocr
document.pdf (v3: compress → sanitize → ocr)
``` ```
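The leaf and lineage rules above can be sketched as follows. This is a minimal illustration that assumes only the `originalFileId`, `versionNumber`, and `isLeaf` fields described in the new documentation; `VersionStub`, `groupByOriginal`, and `findLeaf` are hypothetical names and are not taken from the codebase.

```typescript
// Hypothetical, trimmed-down view of StirlingFileStub for illustration only.
interface VersionStub {
  id: string;
  isLeaf: boolean;
  versionNumber?: number;
  originalFileId?: string;
}

// Group stubs into file families keyed by the root file's id.
// A root file has no originalFileId, so its own id is used as the key.
function groupByOriginal(stubs: VersionStub[]): Record<string, VersionStub[]> {
  const groups: Record<string, VersionStub[]> = {};
  for (let i = 0; i < stubs.length; i++) {
    const stub = stubs[i];
    const key = stub.originalFileId || stub.id;
    if (!groups[key]) groups[key] = [];
    groups[key].push(stub);
  }
  return groups;
}

// Return the stub marked as leaf in a family, or undefined if none is marked.
function findLeaf(family: VersionStub[]): VersionStub | undefined {
  return family.filter(s => s.isLeaf)[0];
}

// Mirrors the diagram: v1 -> compress -> v2 -> sanitize -> v3 (leaf).
const chain: VersionStub[] = [
  { id: 'a', isLeaf: false, versionNumber: 1 },
  { id: 'b', isLeaf: false, versionNumber: 2, originalFileId: 'a' },
  { id: 'c', isLeaf: true, versionNumber: 3, originalFileId: 'a' },
];
const families = groupByOriginal(chain);
const leafStub = findLeaf(families['a']);
```

Keying the root file by its own `id` matches the fallback `parentStub.originalFileId || parentStub.id` used by `createChildStub` in this commit.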
## File Lineage Tracking
### Original File ID
The `originalFileId` remains constant throughout the entire version chain, enabling grouping of all versions of the same logical document.
### Parent-Child Relationships
Each processed file references its immediate parent via `parentFileId`, creating a complete audit trail.
### Tool Chain
The `toolChain` array maintains the complete sequence of tool operations applied to reach the current version.
## Implementation Architecture ## Implementation Architecture
### Frontend Components ### 1. FileStorage Service (`fileStorage.ts`)
#### 1. PDF Metadata Service (`pdfMetadataService.ts`) **Core Methods:**
- **PDF-lib Integration**: Uses pdf-lib for metadata injection/extraction
- **Caching**: ContentCache with 10-minute TTL for performance
- **Encryption Support**: Handles encrypted PDFs with `ignoreEncryption: true`
**Key Methods:**
```typescript ```typescript
// Inject metadata into PDF // Store file with complete metadata
injectHistoryMetadata(pdfBytes: ArrayBuffer, originalFileId: string, parentFileId?: string, toolChain: ToolOperation[], versionNumber: number): Promise<ArrayBuffer> async storeStirlingFile(stirlingFile: StirlingFile, stub: StirlingFileStub): Promise<void>
// Extract metadata from PDF // Load file with metadata
extractHistoryMetadata(pdfBytes: ArrayBuffer): Promise<PDFHistoryMetadata | null> async getStirlingFile(id: FileId): Promise<StirlingFile | null>
async getStirlingFileStub(id: FileId): Promise<StirlingFileStub | null>
// Create new version with incremented number // Query operations
createNewVersion(pdfBytes: ArrayBuffer, parentFileId: string, toolOperation: ToolOperation): Promise<ArrayBuffer> async getLeafStirlingFileStubs(): Promise<StirlingFileStub[]>
async getAllStirlingFileStubs(): Promise<StirlingFileStub[]>
// Version management
async markFileAsProcessed(fileId: FileId): Promise<boolean> // Set isLeaf = false
async markFileAsLeaf(fileId: FileId): Promise<boolean> // Set isLeaf = true
``` ```
#### 2. File History Utilities (`fileHistoryUtils.ts`) ### 2. File Context Integration
- **FileContext Integration**: Links PDF metadata with React state management
- **Version Management**: Handles version grouping and latest version filtering
- **Tool Integration**: Prepares files for tool processing with history injection
**Key Functions:** **FileContext** manages runtime state with `StirlingFileStub[]` in memory:
```typescript ```typescript
// Extract history from File and update FileRecord interface FileContextState {
extractFileHistory(file: File, record: FileRecord): Promise<FileRecord> files: {
ids: FileId[];
// Inject history before tool processing byId: Record<FileId, StirlingFileStub>;
injectHistoryForTool(file: File, sourceFileRecord: FileRecord, toolName: string, parameters?): Promise<File> };
}
// Group files by original ID for version management
groupFilesByOriginal(fileRecords: FileRecord[]): Map<string, FileRecord[]>
// Get only latest version of each file group
getLatestVersions(fileRecords: FileRecord[]): FileRecord[]
``` ```
#### 3. Tool Operation Integration (`useToolOperation.ts`) **Key Operations:**
- **Automatic Injection**: All tool operations automatically inject history metadata - `addFiles()`: Stores new files with initial metadata
- **Version Progression**: Reads current version from PDF and increments appropriately - `addStirlingFileStubs()`: Loads existing files from storage with preserved metadata
- **Universal Support**: Works with single-file, multi-file, and custom tool patterns - `consumeFiles()`: Processes files through tools, creating new versions
### Data Flow ### 3. Tool Operation Integration
``` **Tool Processing Flow:**
1. User uploads PDF → No history (v0) 1. **Input**: User selects files (marked as `isLeaf: true`)
2. Tool processing begins → prepareFilesWithHistory() injects current state 2. **Processing**: Backend processes files and returns results
3. Backend processes PDF → Returns processed file with embedded history 3. **History Creation**: New `StirlingFileStub` created with:
4. FileContext adds result → extractFileHistory() reads embedded metadata - Incremented version number
5. UI displays file → Shows version badges and tool chain - Updated tool history
- Parent file reference
4. **Storage**: Both parent (marked `isLeaf: false`) and child (marked `isLeaf: true`) stored
5. **UI Update**: FileContext updated with new file state
**Child Stub Creation:**
```typescript
export function createChildStub(
parentStub: StirlingFileStub,
operation: { toolName: string; timestamp: number },
resultingFile: File,
thumbnail?: string
): StirlingFileStub {
return {
id: createFileId(),
name: resultingFile.name,
size: resultingFile.size,
type: resultingFile.type,
lastModified: resultingFile.lastModified,
quickKey: createQuickKey(resultingFile),
createdAt: Date.now(),
isLeaf: true,
// Version Control
versionNumber: (parentStub.versionNumber || 1) + 1,
originalFileId: parentStub.originalFileId || parentStub.id,
parentFileId: parentStub.id,
// Tool History
toolHistory: [...(parentStub.toolHistory || []), operation],
thumbnailUrl: thumbnail
};
}
``` ```
## UI Integration ## UI Integration
### File Manager ### File Manager History Display
- **Version Toggle**: Switch between "Latest Only" and "All Versions" views
- **Version Badges**: v0, v1, v2 indicators on file items
- **History Dropdown**: Version timeline with restore functionality
- **Tool Chain Display**: Complete processing history in file details panel
### Active Files Workbench **FileManager** (`FileManager.tsx`) provides:
- **Version Metadata**: Version number in file metadata line (e.g., "PDF file - 3 Pages - v2") - **Default View**: Shows only leaf files (`isLeaf: true`)
- **Tool Chain Overlay**: Bottom overlay showing tool sequence (e.g., "compress → sanitize") - **History Expansion**: Click to show all versions of a file family
- **Real-time Updates**: Immediate display after tool processing - **History Groups**: Nested display using `FileHistoryGroup.tsx`
## Storage and Persistence **FileListItem** (`FileListItem.tsx`) displays:
- **Version Badges**: v1, v2, v3 indicators
- **Tool Chain**: Complete processing history in tooltips
- **History Actions**: "Show/Hide History" toggle, "Restore" for history files
### PDF Metadata ### FileManagerContext Integration
- **Embedded in PDF**: History travels with the document across downloads/uploads
- **Keywords Field**: Uses standard PDF metadata field for maximum compatibility
- **Multiple Keywords**: System handles multiple history entries and extracts latest version
### IndexedDB Storage **File Selection Flow:**
- **Client-side Persistence**: FileMetadata includes extracted history information
- **Lazy Loading**: History extracted when files are accessed from storage
- **Batch Processing**: Large collections processed in batches of 5 to prevent memory issues
### Memory Management
- **ContentCache**: 10-minute TTL, 50-file capacity for metadata extraction results
- **Cleanup**: Automatic cache eviction and expired entry removal
- **Large File Support**: No artificial size limits (supports 100GB+ PDFs)
## Tool Configuration
### Filename Preservation
Most tools preserve the original filename to maintain file identity:
**No Prefix (Filename Preserved):**
- compress, repair, sanitize, addPassword, removePassword, changePermissions, removeCertificateSign, unlockPdfForms, ocr, addWatermark
**With Prefix (Different Content):**
- split (`split_` - creates multiple files)
- convert (`converted_` - changes file format)
### Configuration Pattern
```typescript ```typescript
export const toolOperationConfig = { // Recent files (from storage)
toolType: ToolType.singleFile, onRecentFileSelect: (stirlingFileStubs: StirlingFileStub[]) => void
operationType: 'toolName', // Calls: actions.addStirlingFileStubs(stirlingFileStubs, options)
endpoint: '/api/v1/category/tool-endpoint',
filePrefix: '', // Empty for filename preservation // New uploads
buildFormData: buildToolFormData, onFileUpload: (files: File[]) => void
defaultParameters // Calls: actions.addFiles(files, options)
```
**History Management:**
```typescript
// Toggle history visibility
const { expandedFileIds, onToggleExpansion } = useFileManagerContext();
// Restore history file to current
const handleAddToRecents = (file: StirlingFileStub) => {
fileStorage.markFileAsLeaf(file.id); // Make this version current
}; };
``` ```
### Metadata Preservation Strategy ## Data Flow
The system uses a **minimal touch approach** for PDF metadata:
```typescript ### New File Upload
// Only modify necessary fields, let pdf-lib preserve everything else ```
pdfDoc.setCreator('Stirling-PDF'); 1. User uploads files → addFiles()
pdfDoc.setProducer('Stirling-PDF'); 2. Generate thumbnails and page count
pdfDoc.setKeywords([...existingKeywords, historyKeyword]); 3. Create StirlingFileStub with isLeaf: true, versionNumber: 1
4. Store both StirlingFile + StirlingFileStub in IndexedDB
// File.lastModified = Date.now() for processed files (source of truth) 5. Dispatch to FileContext state
// PDF internal dates (CreationDate, etc.) preserved automatically by pdf-lib
``` ```
**Benefits:** ### Tool Processing
- **Automatic Preservation**: pdf-lib preserves Title, Author, Subject, CreationDate without explicit re-setting ```
- **No Duplication**: File.lastModified is single source of truth for "when file changed" 1. User selects tool + files → useToolOperation()
- **Simpler Code**: Minimal metadata operations reduce complexity and bugs 2. API processes files → returns processed File objects
- **Better Performance**: Fewer PDF reads/writes during processing 3. createChildStub() for each result:
- Parent marked isLeaf: false
- Child created with isLeaf: true, incremented version
4. Store all files with updated metadata
5. Update FileContext with new state
```
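Step 3 of the tool processing flow, the leaf hand-over from parent to child, might look roughly like this. The sketch uses a plain object as a stand-in for IndexedDB; `LeafStub`, `StubStore`, `copyStub`, and `applyToolResult` are hypothetical names for illustration and do not reflect the actual `fileStorage` implementation.

```typescript
// Hypothetical in-memory stand-in for the IndexedDB store.
interface LeafStub {
  id: string;
  isLeaf: boolean;
  versionNumber: number;
  parentFileId?: string;
}

type StubStore = Record<string, LeafStub>;

// Copy a stub with a new isLeaf flag, preserving the parent link if present.
function copyStub(stub: LeafStub, isLeaf: boolean): LeafStub {
  const copy: LeafStub = { id: stub.id, isLeaf: isLeaf, versionNumber: stub.versionNumber };
  if (stub.parentFileId !== undefined) copy.parentFileId = stub.parentFileId;
  return copy;
}

// Record a tool result: demote the parent and add the child as the new leaf.
// Returns a new store object; the input store is not mutated.
function applyToolResult(store: StubStore, parentId: string, childId: string): StubStore {
  const parent = store[parentId];
  if (!parent) throw new Error('Unknown parent: ' + parentId);
  const next: StubStore = {};
  Object.keys(store).forEach(key => { next[key] = store[key]; });
  next[parentId] = copyStub(parent, false);
  next[childId] = {
    id: childId,
    isLeaf: true,
    versionNumber: parent.versionNumber + 1,
    parentFileId: parentId,
  };
  return next;
}

const initialStore: StubStore = { v1: { id: 'v1', isLeaf: true, versionNumber: 1 } };
const updatedStore = applyToolResult(initialStore, 'v1', 'v2');
```

Returning a new store rather than mutating in place is a choice made for this sketch only; the real service persists each record separately.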
## Error Handling and Resilience ### File Loading (Recent Files)
```
1. User selects from FileManager → onRecentFileSelect()
2. addStirlingFileStubs() with preserved metadata
3. Load actual StirlingFile data from storage
4. Files appear in workbench with complete history intact
```
## Performance Optimizations
### Metadata Regeneration
When loading files from storage, missing `processedFile` data is regenerated:
```typescript
// In addStirlingFileStubs()
const needsProcessing = !record.processedFile ||
!record.processedFile.pages ||
record.processedFile.pages.length === 0;
if (needsProcessing) {
const result = await generateThumbnailWithMetadata(stirlingFile);
record.processedFile = createProcessedFile(result.pageCount, result.thumbnail);
}
```
### Memory Management
- **Blob URL Tracking**: Automatic cleanup of thumbnail URLs
- **Lazy Loading**: Files loaded from storage only when needed
- **LRU Caching**: File objects cached in memory with size limits
## File Deduplication
### QuickKey System
Files are deduplicated using `quickKey` format:
```typescript
const quickKey = `${file.name}|${file.size}|${file.lastModified}`;
```
This prevents duplicate uploads while allowing different versions of the same logical file.
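A minimal sketch of how the `quickKey` format could drive deduplication, assuming only the `name|size|lastModified` format shown above. `FileLike`, `quickKeyOf`, and `dedupeByQuickKey` are hypothetical names; the project's own `createQuickKey` helper may differ.

```typescript
// Hypothetical minimal shape; browser File objects carry these three fields.
interface FileLike {
  name: string;
  size: number;
  lastModified: number;
}

// Build the key in the documented name|size|lastModified format.
function quickKeyOf(file: FileLike): string {
  return `${file.name}|${file.size}|${file.lastModified}`;
}

// Keep the first file seen for each quickKey and drop later duplicates.
function dedupeByQuickKey(files: FileLike[]): FileLike[] {
  const seen: Record<string, boolean> = {};
  return files.filter(file => {
    const key = quickKeyOf(file);
    if (seen[key]) return false;
    seen[key] = true;
    return true;
  });
}

const uploads: FileLike[] = [
  { name: 'document.pdf', size: 1024, lastModified: 1756825614618 },
  { name: 'document.pdf', size: 1024, lastModified: 1756825614618 }, // same upload again
  { name: 'document.pdf', size: 900, lastModified: 1756825631545 },  // processed version
];
const uniqueUploads = dedupeByQuickKey(uploads);
```

A processed version normally differs in `size` or `lastModified`, so it gets a different key even when the filename is preserved.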
## Error Handling
### Graceful Degradation ### Graceful Degradation
- **Extraction Failures**: Files display normally without history if metadata extraction fails - **Storage Failures**: Files continue to work without persistence
- **Encrypted PDFs**: System handles encrypted documents with `ignoreEncryption` option - **Metadata Issues**: Missing metadata regenerated on demand
- **Corrupted Metadata**: Invalid history metadata is silently ignored with fallback to basic file info - **Version Conflicts**: Automatic version number resolution
### Performance Considerations ### Recovery Scenarios
- **Caching**: Metadata extraction results are cached to avoid re-parsing - **Corrupted Storage**: Automatic cleanup and re-initialization
- **Batch Processing**: Large file collections processed in controlled batches - **Missing Files**: Stubs cleaned up automatically
- **Async Extraction**: History extraction doesn't block file operations - **Version Mismatches**: Automatic version chain reconstruction
## Developer Guidelines ## Developer Guidelines
### Adding History to New Tools ### Adding File History to New Components
1. **Set `filePrefix: ''`** in tool configuration to preserve filenames
2. **Use existing patterns**: Tool operations automatically inherit history injection 1. **Use FileContext Actions**:
3. **Custom processors**: Must handle history injection manually if using custom response handlers ```typescript
const { actions } = useFileActions();
await actions.addFiles(files); // For new uploads
await actions.addStirlingFileStubs(stubs); // For existing files
```
2. **Preserve Metadata When Processing**:
```typescript
const childStub = createChildStub(parentStub, {
toolName: 'compress',
timestamp: Date.now()
}, processedFile, thumbnail);
```
3. **Handle Storage Operations**:
```typescript
await fileStorage.storeStirlingFile(stirlingFile, stirlingFileStub);
const stub = await fileStorage.getStirlingFileStub(fileId);
```
### Testing File History ### Testing File History
1. **Upload a PDF**: Should show no version (v0), original File.lastModified preserved
2. **Apply any tool**: Should show v1 with tool name, File.lastModified updated to processing time
3. **Apply another tool**: Should show v2 with tool chain sequence
4. **Check file manager**: Version toggle, history dropdown, standard PDF metadata should all work
5. **Check workbench**: Tool chain overlay should appear on thumbnails
### Backend Tool Monitoring 1. **Upload files**: Should show v1, marked as leaf
The system automatically logs metadata preservation: 2. **Apply tool**: Should create v2, mark v1 as non-leaf
- **Success**: `✅ METADATA PRESERVED: Tool 'ocr' correctly preserved all PDF metadata` 3. **Check FileManager**: History should show both versions
- **Issues**: `⚠️ METADATA LOSS: Tool 'compress' did not preserve PDF metadata: CreationDate modified, Author stripped` 4. **Restore old version**: Should mark old version as leaf
5. **Check storage**: Both versions should persist in IndexedDB
This helps identify which backend tools need to be updated to preserve standard PDF metadata fields.
### Debugging
Enable development mode logging to see:
- History injection: `📄 Injected PDF history metadata`
- History extraction: `📄 History extraction completed`
- Version progression: Version number increments and tool chain updates
- Metadata issues: Warnings for tools that strip PDF metadata
## Future Enhancements ## Future Enhancements
### Possible Extensions ### Potential Improvements
- **Branching**: Support for parallel processing branches from same source - **Branch History**: Support for parallel processing branches
- **Diff Tracking**: Track specific changes made by each tool - **History Export**: Export complete version history as JSON
- **User Attribution**: Add user information to tool operations - **Conflict Resolution**: Handle concurrent modifications
- **Timestamp Precision**: Enhanced timestamp tracking for audit trails - **Cloud Sync**: Sync history across devices
- **Export Options**: Export complete processing history as JSON/XML - **Compression**: Compress historical file data
### Compatibility ### API Extensions
- **PDF Standard Compliance**: Uses standard PDF Keywords field for broad compatibility - **Batch Operations**: Process multiple version chains simultaneously
- **Backwards Compatibility**: PDFs without history metadata work normally - **Search Integration**: Search within tool history and file metadata
- **Future Versions**: Format version field enables future metadata schema evolution - **Analytics**: Track usage patterns and tool effectiveness
--- ---
**Last Updated**: January 2025 **Last Updated**: January 2025
**Format Version**: 1.0 **Implementation**: Stirling PDF Frontend v2
**Implementation**: Stirling PDF Frontend v2 **Storage Version**: IndexedDB with fileStorage service


@@ -28,7 +28,7 @@ const FileHistoryGroup: React.FC<FileHistoryGroupProps> = ({
// Sort history files by version number (oldest first, excluding the current leaf file) // Sort history files by version number (oldest first, excluding the current leaf file)
const sortedHistory = historyFiles const sortedHistory = historyFiles
.filter(file => file.id !== leafFile.id) // Exclude the leaf file itself .filter(file => file.id !== leafFile.id) // Exclude the leaf file itself
.sort((a, b) => (a.versionNumber || 1) - (b.versionNumber || 1)); .sort((a, b) => (b.versionNumber || 1) - (a.versionNumber || 1));
if (!isExpanded || sortedHistory.length === 0) { if (!isExpanded || sortedHistory.length === 0) {
return null; return null;


@@ -21,7 +21,6 @@ const FileListArea: React.FC<FileListAreaProps> = ({
recentFiles, recentFiles,
filteredFiles, filteredFiles,
selectedFilesSet, selectedFilesSet,
fileGroups,
expandedFileIds, expandedFileIds,
loadedHistoryFiles, loadedHistoryFiles,
onFileSelect, onFileSelect,
@@ -72,7 +71,7 @@ const FileListArea: React.FC<FileListAreaProps> = ({
isHistoryFile={false} // All files here are leaf files isHistoryFile={false} // All files here are leaf files
isLatestVersion={true} // All files here are the latest versions isLatestVersion={true} // All files here are the latest versions
/> />
<FileHistoryGroup <FileHistoryGroup
leafFile={file} leafFile={file}
historyFiles={historyFiles} historyFiles={historyFiles}