Hangarx

File Processing

Supported File Types

Category       Extensions                  Max Size  Processing
Documents      PDF, DOC, DOCX, TXT, RTF    50 MB     Text extraction + OCR
Spreadsheets   XLS, XLSX, CSV              25 MB     Tabular data parsing
Presentations  PPT, PPTX                   25 MB     Slide text extraction
Images         JPG, PNG, GIF, WEBP, TIFF   10 MB     OCR + visual analysis
Code           JS, TS, PY, GO, JAVA, etc.  5 MB      Syntax-aware parsing
Data           JSON, XML, YAML, MD         5 MB      Structured parsing
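
The limits in the table above could be enforced before upload with a check along these lines. This is a minimal sketch, not the product's actual validation code: the names `LIMITS` and `check_upload` are hypothetical, "MB" is assumed to mean 1024 * 1024 bytes, and only the extensions the table names explicitly are listed (the "etc." for code files is left unfilled).

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: the table's "MB" is 1024 * 1024 bytes

# Category -> (extensions, max size in bytes), copied from the table above.
# The "Code" row also says "etc."; only the extensions it names are listed.
LIMITS = {
    "Documents": ({"pdf", "doc", "docx", "txt", "rtf"}, 50 * MB),
    "Spreadsheets": ({"xls", "xlsx", "csv"}, 25 * MB),
    "Presentations": ({"ppt", "pptx"}, 25 * MB),
    "Images": ({"jpg", "png", "gif", "webp", "tiff"}, 10 * MB),
    "Code": ({"js", "ts", "py", "go", "java"}, 5 * MB),
    "Data": ({"json", "xml", "yaml", "md"}, 5 * MB),
}


def check_upload(filename: str, size_bytes: int) -> str:
    """Return the category of an accepted file, or raise ValueError."""
    ext = Path(filename).suffix.lower().lstrip(".")
    for category, (extensions, max_bytes) in LIMITS.items():
        if ext in extensions:
            if size_bytes > max_bytes:
                raise ValueError(
                    f"{filename}: exceeds the {max_bytes // MB} MB limit for {category}"
                )
            return category
    raise ValueError(f"{filename}: unsupported extension {ext!r}")
```

For example, `check_upload("report.pdf", 10 * MB)` returns `"Documents"`, while an 11 MB PNG raises `ValueError`. Extension checks are only a first filter; the pipeline below identifies the real file type from its content.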

Extraction Pipeline

1. Upload: File uploaded to Vercel Blob Storage (temporary, 24h TTL)
2. Detection: File type detection via magic numbers
3. Extraction: Content extracted using appropriate processor
4. Chunking: RecursiveCharacterTextSplitter (2000 tokens, 200 overlap)
5. Embedding: Vector embedding via sentence-transformers/all-MiniLM-L6-v2
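
The detection and chunking steps above can be illustrated with a short sketch. It is a simplified stand-in, not the pipeline's actual code: `detect_type` and `chunk_tokens` are hypothetical names, the signature list covers only some of the supported formats, and the chunker is a plain sliding window over a pre-tokenized list rather than the recursive, separator-aware splitting that RecursiveCharacterTextSplitter performs.

```python
# Leading-byte signatures ("magic numbers") for some of the supported formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
    (b"PK\x03\x04", "zip"),  # DOCX, XLSX and PPTX are ZIP containers
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy DOC, XLS, PPT
]


def detect_type(head: bytes) -> str:
    """Guess a file type from the first bytes of its content."""
    # WEBP is a RIFF container: "RIFF", a 4-byte length, then "WEBP".
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "webp"
    for magic, kind in SIGNATURES:
        if head.startswith(magic):
            return kind
    # Plain-text formats (TXT, CSV, JSON, ...) carry no signature.
    return "unknown"


def chunk_tokens(tokens: list, size: int = 2000, overlap: int = 200) -> list:
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the input
        start += size - overlap
    return chunks
```

With the defaults, a 3,800-token document yields two chunks: tokens 0 to 1999 and tokens 1800 to 3799, so the 200 tokens around the boundary appear in both. The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.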

Content Extractors

// Content extraction by file type
PDF      → pdf-parse + Tesseract OCR (scanned)
Office   → mammoth (DOCX) / xlsx-populate (XLSX)
Images   → Hugging Face Vision API
           └─ Model: microsoft/trocr-large-printed
Code     → tree-sitter for syntax trees
General  → raw text extraction
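
The mapping above amounts to a dispatch on the detected file type. The sketch below shows one way to express that routing; it only returns the processor names listed above and does not call the libraries themselves. The function name `plan_extraction`, the `scanned` flag, and the extension sets are illustrative assumptions, and OCR is assumed to run only for scanned PDFs, as the "(scanned)" note suggests.

```python
IMAGE_KINDS = {"jpg", "png", "gif", "webp", "tiff"}
CODE_KINDS = {"js", "ts", "py", "go", "java"}


def plan_extraction(kind: str, scanned: bool = False) -> list[str]:
    """Return the processors, in order, that would handle a file of this kind."""
    kind = kind.lower()
    if kind == "pdf":
        # Assumption: OCR is added only when the PDF has no text layer.
        return ["pdf-parse", "Tesseract OCR"] if scanned else ["pdf-parse"]
    if kind == "docx":
        return ["mammoth"]
    if kind == "xlsx":
        return ["xlsx-populate"]
    if kind in IMAGE_KINDS:
        return ["microsoft/trocr-large-printed"]
    if kind in CODE_KINDS:
        return ["tree-sitter"]
    # Everything else falls back to the "General" row.
    return ["raw text extraction"]
```

Keeping the routing in one function makes the fallback explicit: any type without a dedicated processor is treated as raw text instead of failing.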

OCR Capabilities

Engine: Tesseract.js + Hugging Face Vision
Languages: 100+ languages supported
Printed Text: 95%+ accuracy
Handwriting: 85%+ accuracy

Special handling: Table structure recognition, handwriting detection, receipt/invoice parsing, ID document extraction.

Multi-modal Processing

When processing images, Sprout combines multiple analysis techniques:

// Image + Text combined analysis
1. Extract text via OCR
2. Generate caption via Salesforce/blip-image-captioning-base
3. Detect objects via facebook/detr-resnet-50
4. Combine context for AI model
5. Response includes visual and textual insights
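
Step 4 above, combining the separate analyses into one context for the model, might look like the following. This is a sketch under assumptions: `ImageAnalysis`, `build_context`, the 0.5 score threshold, and the output layout are all hypothetical. In a real pipeline the three inputs would come from the OCR engine, the captioning model, and the object detector named above.

```python
from dataclasses import dataclass, field


@dataclass
class ImageAnalysis:
    ocr_text: str  # step 1: text read from the image
    caption: str  # step 2: generated caption
    objects: list = field(default_factory=list)  # step 3: (label, score) pairs


def build_context(analysis: ImageAnalysis, min_score: float = 0.5) -> str:
    """Merge caption, detected objects and OCR text into one prompt context."""
    # Keep confident detections only, de-duplicated and in a stable order.
    labels = sorted(
        {label for label, score in analysis.objects if score >= min_score}
    )
    parts = [f"Caption: {analysis.caption.strip()}"]
    if labels:
        parts.append("Objects: " + ", ".join(labels))
    if analysis.ocr_text.strip():
        parts.append("Text in image: " + analysis.ocr_text.strip())
    return "\n".join(parts)
```

Sections with nothing to report are omitted, so an image without readable text produces a context with only the caption and object lines.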