LLM Models

OneNode uses a two-step embedding architecture to enable true multimodal search across text and images.

The Two-Step Process

OneNode's unique approach uses two specialized models working in sequence:

1. Vision Model: Visual → Text

Converts images into detailed text descriptions that capture visual content, context, and relationships.

Example

Input: [Image of a red Tesla in a parking lot]
Vision Model Output:
"A red Tesla Model 3 electric sedan parked in an outdoor parking lot with white painted lines. The vehicle features sleek aerodynamic design, chrome door handles, and LED headlights."
2. Embedding Model: Text → Vectors

Converts all text (original + vision-generated) into semantic vectors for mathematical comparison.

Example

Text Input 1: "I bought a red Tesla last month"
Text Input 2: "A red Tesla Model 3 electric sedan parked..." (from vision)
Result: Both get similar embedding vectors → semantically related
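
OneNode's embedding model isn't named in this section either, so here is a minimal sketch of the idea using sentence-transformers as a stand-in: both pieces of text are encoded into vectors, and their cosine similarity is high because they describe the same thing.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

original_text = "I bought a red Tesla last month"
vision_text = "A red Tesla Model 3 electric sedan parked in an outdoor parking lot"

vectors = model.encode([original_text, vision_text])
score = util.cos_sim(vectors[0], vectors[1]).item()
print(f"cosine similarity: {score:.2f}")  # high score -> semantically related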

Step-by-Step: Document Processing

Here's exactly what happens when you store multimodal data (a code sketch follows the list):

1. Submit Document: Upload document with Text and Image objects
2. Vision Processing: Each Image object → detailed text description via vision model
3. Text Consolidation: Original text + vision descriptions = unified text format
4. Embedding Generation: All text → semantic vectors via embedding model
5. Unified Search: Single query finds content across all modalities
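
The exact OneNode client calls are covered elsewhere in the docs, so the following is a hypothetical, self-contained walk-through of the five steps in plain Python. The Text and Image classes, the stubbed vision call, and the sentence-transformers embedder are all illustrative assumptions, not OneNode's real API.

from dataclasses import dataclass

import numpy as np
from sentence_transformers import SentenceTransformer

@dataclass
class Text:          # hypothetical stand-in for OneNode's Text object
    content: str

@dataclass
class Image:         # hypothetical stand-in for OneNode's Image object
    url: str

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

def describe_image(image: Image) -> str:
    # Step 2: vision processing. A real vision model would run here;
    # this stub returns a canned description purely for illustration.
    return "A red Tesla Model 3 electric sedan parked in an outdoor parking lot."

def store_document(fields):
    # Step 3: text consolidation (original text + vision descriptions).
    parts = [describe_image(f) if isinstance(f, Image) else f.content for f in fields]
    unified_text = "\n".join(parts)
    # Step 4: embedding generation.
    vector = embedder.encode(unified_text, normalize_embeddings=True)
    return unified_text, vector

# Step 1: submit a document containing both Text and Image objects.
unified, doc_vector = store_document([
    Text("I bought a red Tesla last month"),
    Image("https://example.com/red-tesla.jpg"),
])

# Step 5: unified search - one text query matches content from any modality.
query_vector = embedder.encode("electric car in a parking lot", normalize_embeddings=True)
print("match score:", float(np.dot(doc_vector, query_vector)))
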
🤔 Why Two Steps?

Simplicity: One embedding model handles all final processing
Interpretability: You can see the text description that caused a match
Extensibility: Add new modalities by converting them to text
Efficiency: Reuses mature text processing infrastructure
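
The interpretability point can be seen directly in the hypothetical sketch above: the consolidated text stored for a document is plain, human-readable prose, so printing it shows exactly which description produced a match.

print(unified)
# I bought a red Tesla last month
# A red Tesla Model 3 electric sedan parked in an outdoor parking lot.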

Learn More About Specific Models

Dive deeper into the specific models available in OneNode and learn how to optimize them for your use cases.

