File Processing
Supported File Types
| Category | Extensions | Max Size | Processing |
|---|---|---|---|
| Documents | PDF, DOC, DOCX, TXT, RTF | 50 MB | Text extraction + OCR |
| Spreadsheets | XLS, XLSX, CSV | 25 MB | Tabular data parsing |
| Presentations | PPT, PPTX | 25 MB | Slide text extraction |
| Images | JPG, PNG, GIF, WEBP, TIFF | 10 MB | OCR + visual analysis |
| Code | JS, TS, PY, GO, JAVA, etc. | 5 MB | Syntax-aware parsing |
| Data | JSON, XML, YAML, MD | 5 MB | Structured parsing |
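The limits in the table can be enforced before a file enters the pipeline. The following is a minimal sketch rather than Sprout's actual validation code: the names `RULES` and `checkUpload` are hypothetical, 1 MB is assumed to mean 1024 × 1024 bytes, and the "etc." in the Code row is not expanded beyond the extensions the table names.

```typescript
// Hypothetical pre-upload check built from the table above.
type Category = "documents" | "spreadsheets" | "presentations" | "images" | "code" | "data";

interface CategoryRule {
  extensions: string[];
  maxBytes: number;
}

interface UploadCheck {
  ok: boolean;
  category: Category | null; // set when the extension is recognized
  reason: string | null;     // set when the upload is rejected
}

// Assumption: the table's "MB" means 1024 * 1024 bytes.
const MB = 1024 * 1024;

const RULES: Record<Category, CategoryRule> = {
  documents: { extensions: ["pdf", "doc", "docx", "txt", "rtf"], maxBytes: 50 * MB },
  spreadsheets: { extensions: ["xls", "xlsx", "csv"], maxBytes: 25 * MB },
  presentations: { extensions: ["ppt", "pptx"], maxBytes: 25 * MB },
  images: { extensions: ["jpg", "png", "gif", "webp", "tiff"], maxBytes: 10 * MB },
  // Only the extensions the table names explicitly; "etc." is left unexpanded.
  code: { extensions: ["js", "ts", "py", "go", "java"], maxBytes: 5 * MB },
  data: { extensions: ["json", "xml", "yaml", "md"], maxBytes: 5 * MB },
};

function checkUpload(fileName: string, sizeBytes: number): UploadCheck {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";

  for (const category of Object.keys(RULES) as Category[]) {
    const rule = RULES[category];
    if (rule.extensions.indexOf(ext) === -1) continue;
    if (sizeBytes > rule.maxBytes) {
      const limitMb = rule.maxBytes / MB;
      return { ok: false, category, reason: `${category} are limited to ${limitMb} MB` };
    }
    return { ok: true, category, reason: null };
  }
  return { ok: false, category: null, reason: `unsupported extension: "${ext}"` };
}
```

For example, `checkUpload("scan.png", 11 * MB)` is rejected because images are capped at 10 MB, while `checkUpload("report.PDF", 10 * MB)` is accepted as a document. An extension check like this is only a first filter; the pipeline itself identifies file types from their content.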
Extraction Pipeline
1. Upload: File uploaded to Vercel Blob Storage (temporary, 24h TTL)
2. Detection: File type detection via magic numbers
3. Extraction: Content extracted using appropriate processor
4. Chunking: RecursiveCharacterTextSplitter (2000 tokens, 200 overlap)
5. Embedding: Vector embedding via sentence-transformers/all-MiniLM-L6-v2
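Two of the steps above, detection and chunking, can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline's real code: `detectType` and `chunkTokens` are hypothetical names, the signature list covers only a handful of well-known formats, and the chunker is a plain sliding window over an already-tokenized input, which is simpler than the separator-aware splitting that RecursiveCharacterTextSplitter performs.

```typescript
// Leading byte signatures ("magic numbers") for a few common formats.
const SIGNATURES: { label: string; bytes: number[] }[] = [
  { label: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { label: "png", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { label: "jpeg", bytes: [0xff, 0xd8, 0xff] },
  { label: "gif", bytes: [0x47, 0x49, 0x46, 0x38] }, // "GIF8"
  // DOCX, XLSX, and PPTX are ZIP containers, so they share this signature.
  { label: "zip", bytes: [0x50, 0x4b, 0x03, 0x04] },
];

// Hypothetical helper: compare the first bytes of a file against known signatures.
function detectType(head: Uint8Array): string {
  for (const sig of SIGNATURES) {
    const matches =
      head.length >= sig.bytes.length && sig.bytes.every((b, i) => head[i] === b);
    if (matches) return sig.label;
  }
  return "unknown";
}

// Hypothetical helper: fixed-size windows where consecutive chunks share
// `overlap` tokens. Defaults mirror the sizes quoted in the chunking step.
function chunkTokens(tokens: string[], size: number = 2000, overlap: number = 200): string[][] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("expected size > 0 and 0 <= overlap < size");
  }
  const chunks: string[][] = [];
  const step = size - overlap;
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    // Stop once a window has reached the end of the input.
    if (start + size >= tokens.length) break;
  }
  return chunks;
}
```

Plain-text formats such as TXT, CSV, or JSON have no magic number, so a detector along these lines returns `"unknown"` for them and a real pipeline would need a fallback. With the default sizes, each chunk after the first begins with the last 200 tokens of the previous one, so text that straddles a boundary appears in full in at least one chunk.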
Content Extractors
// Content extraction by file type
PDF → pdf-parse + Tesseract OCR (scanned)
Office → mammoth (DOCX) / xlsx-populate
Images → Huggingface Vision API
└─ Model: microsoft/trocr-large-printed
Code → tree-sitter for syntax trees
General → raw text extraction
OCR Capabilities
Engine: Tesseract.js + Huggingface Vision
Languages: 100+ languages supported
Printed Text: 95%+ accuracy
Handwriting: 85%+ accuracy
Special handling: table structure recognition, handwriting detection, receipt/invoice parsing, and ID document extraction.
Multi-modal Processing
When processing images, Sprout combines multiple analysis techniques:
// Image + Text combined analysis
1. Extract text via OCR
2. Generate caption via Salesforce/blip-image-captioning-base
3. Detect objects via facebook/detr-resnet-50
4. Combine context for AI model
5. Response includes visual and textual insights
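The "combine context" step could look roughly like the sketch below. It assumes the OCR text, caption, and detected objects have already been produced by the earlier steps; the `ImageAnalysis` shape, the `buildImageContext` name, the output layout, and the 0.5 confidence cutoff are all illustrative choices, not Sprout's actual format.

```typescript
// Hypothetical shapes for the outputs of the OCR, captioning, and detection steps.
interface DetectedObject {
  label: string;
  score: number; // detector confidence in [0, 1]
}

interface ImageAnalysis {
  ocrText: string;
  caption: string;
  objects: DetectedObject[];
}

// Merge the three signals into one text block to hand to the AI model.
// Assumption: detections below `minScore` are dropped; 0.5 is an arbitrary default.
function buildImageContext(analysis: ImageAnalysis, minScore: number = 0.5): string {
  const labels = analysis.objects
    .filter((o) => o.score >= minScore)
    .map((o) => o.label);
  // Keep each label once, in first-seen order.
  const uniqueLabels = labels.filter((label, i) => labels.indexOf(label) === i);

  const parts: string[] = [];
  const caption = analysis.caption.trim();
  const ocrText = analysis.ocrText.trim();

  // Empty signals are skipped so the model is not given blank sections.
  if (caption) parts.push("Caption: " + caption);
  if (uniqueLabels.length > 0) parts.push("Objects: " + uniqueLabels.join(", "));
  if (ocrText) parts.push("Text in image:\n" + ocrText);

  return parts.join("\n");
}
```

Keeping the merge as a pure function over already-computed results means each upstream model can be swapped or skipped independently; an image with no legible text simply produces a context without the "Text in image" section.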