Text
Overview
OneNode uses vector embeddings to understand the meaning of text beyond simple keyword matching. The Text class provides semantic indexing capabilities with a fluent builder pattern, enabling powerful contextual and conceptual search across your text content.
Note: The Text class is designed specifically for semantic search. For simple text storage without search capabilities, use regular string fields instead.
Key Features:
- Semantic Indexing: Enable intelligent text understanding with the fluent .enable_index() method.
- Automatic Chunking: Large text is intelligently split into smaller pieces for efficient embedding.
- Asynchronous Processing: Embeddings are generated in the background without blocking your application.
- Contextual Search: Find content based on meaning and context, not just keywords.
- Server Defaults: Unspecified parameters automatically use optimized server-side defaults.
Basic Usage
The Text class should be used with the .enable_index() method to enable semantic search capabilities:
from onenode import Text
# Step 1: Create Text instance
bio_text = Text("Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing.")
# Step 2: Enable indexing
bio_text.enable_index()
# Step 3: Use in document
{
    "field_name": bio_text
}
This creates a Text object with semantic indexing enabled, using server defaults for the embedding model and chunking strategy.
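Because enable_index() returns the Text instance (the fluent builder pattern), creation and indexing can also be chained in a single expression. Here's a minimal end-to-end sketch, assuming a collection handle like the ones used in the examples below:
from onenode import Text
# Chain creation and indexing in one expression (enable_index() returns the instance)
bio_text = Text("Alice is a data scientist with expertise in AI and machine learning.").enable_index()
# Save the document; embeddings are generated asynchronously in the background
collection.insert([{"field_name": bio_text}])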
Configuration Reference
Note: All configuration parameters are optional and intended for advanced users only. OneNode automatically uses optimized defaults that work well for most use cases.
| Parameter | Type | Description | Default |
|---|---|---|---|
| emb_model | string | Embedding model to use | Server optimized |
| max_chunk_size | number | Maximum chunk size in characters | Server optimized |
| chunk_overlap | number | Character overlap between chunks | Server optimized |
| separators | string[] | Text splitting patterns | Server optimized |
| is_separator_regex | boolean | Enable regex in separators | false |
| keep_separator | boolean | Preserve separators in chunks | false |
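All of these parameters are passed as keyword arguments to enable_index() (the examples below unpack a config dict with **). As a minimal sketch, overriding just the chunking behavior, with illustrative values rather than recommendations:
from onenode import Text
notes = Text("Meeting notes covering the roadmap, hiring plans, and budget.")
# Override any subset of parameters; everything unspecified
# falls back to optimized server-side defaults
notes.enable_index(max_chunk_size=400, chunk_overlap=40)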
Advanced Customization
The following examples show how to customize Text indexing behavior for specific use cases. These configurations are optional and should only be used when you need specific behavior.
Embedding Model
Specify an embedding model to optimize for quality, speed, or cost:
from onenode import Text, Models
# Using a specific embedding model for higher quality embeddings
content_text = Text("Research paper abstract on machine learning algorithms and their applications in healthcare.")
# Configure with a high-quality embedding model
emb_config = {
    "emb_model": Models.TextToEmbedding.OpenAI.TEXT_EMBEDDING_3_LARGE
}
content_text.enable_index(**emb_config)
# Use in document
{
    "abstract": content_text
}
Chunk Size
Control chunk size for different content types:
from onenode import Text
# For short content - smaller chunks for precise matching
short_content = Text("Product description: High-quality wireless headphones with noise cancellation.")
short_config = {
    "max_chunk_size": 100  # Smaller chunks for short content
}
short_content.enable_index(**short_config)
# For long articles - larger chunks to maintain context
long_article = Text("""
Long article content here with multiple paragraphs discussing
various aspects of artificial intelligence, machine learning,
and their applications across different industries...
""")
long_config = {
    "max_chunk_size": 800  # Larger chunks for long content
}
long_article.enable_index(**long_config)
# Use in documents
{
    "product_description": short_content,
    "article_content": long_article
}
Chunk Overlap
Configure overlap between chunks to preserve context:
from onenode import Text
# High overlap for better context preservation
technical_doc = Text("Technical documentation with interconnected concepts and cross-references between sections.")
high_overlap_config = {
    "max_chunk_size": 300,
    "chunk_overlap": 50  # High overlap to preserve context
}
technical_doc.enable_index(**high_overlap_config)
# Low overlap for distinct content sections
news_article = Text("News article with clear paragraph separations and distinct topics in each section.")
low_overlap_config = {
    "max_chunk_size": 300,
    "chunk_overlap": 10  # Low overlap for distinct sections
}
news_article.enable_index(**low_overlap_config)
# Use in documents
{
    "technical_documentation": technical_doc,
    "news_content": news_article
}
Custom Separators
Define how text should be split for structured content:
from onenode import Text
# Custom separators for structured content
structured_content = Text("""
Section 1: Introduction
This is the introduction section.
Section 2: Methods
This section describes the methods used.
Section 3: Results
Here are the results of our study.
""")
section_config = {
    "separators": [r"Section \d+:", "\n\n"],  # Split by section headers and paragraphs
    "is_separator_regex": True,  # "Section \d+:" is a regex pattern
    "max_chunk_size": 200
}
structured_content.enable_index(**section_config)
# Different separators for code documentation
code_doc = Text("""
### Function: processData()
This function processes input data.
### Function: validateInput()
This function validates user input.
### Function: generateReport()
This function generates the final report.
""")
code_config = {
    "separators": ["### Function:", "\n\n"],  # Split by function headers
    "max_chunk_size": 150
}
code_doc.enable_index(**code_config)
# Use in documents
{
    "research_paper": structured_content,
    "api_documentation": code_doc
}
Regex Separators
Use regex patterns for complex text splitting:
from onenode import Text
# Using regex patterns for complex splitting
email_content = Text("""
From: alice@example.com
Subject: Project Update
Date: 2024-01-15
Hello team,
Here's the weekly project update...
From: bob@example.com
Subject: Meeting Notes
Date: 2024-01-16
Meeting summary from today...
""")
# Use regex to split by email headers
regex_config = {
    "separators": [r"^From: .+@.+\..+$"],  # Regex pattern for email headers
    "is_separator_regex": True,  # Enable regex mode
    "max_chunk_size": 300
}
email_content.enable_index(**regex_config)
# Use in document
{
    "email_thread": email_content
}
Preserve Separators
Control whether to keep or remove separator text in chunks:
from onenode import Text
# Keep separators for context preservation
dialogue_content = Text("""
Speaker A: What are your thoughts on AI development?
Speaker B: I think it's progressing rapidly.
Speaker A: Do you see any concerns?
Speaker B: Yes, particularly around ethics and safety.
""")
# Keep speaker labels for context
dialogue_config = {
    "separators": [r"Speaker [AB]:"],  # Regex matching either speaker label
    "is_separator_regex": True,  # "[AB]" is a regex character class
    "keep_separator": True,  # Keep the speaker labels in chunks
    "max_chunk_size": 100
}
dialogue_content.enable_index(**dialogue_config)
# Remove separators for cleaner chunks
content_with_headers = Text("""
=== Chapter 1 ===
This is the content of chapter 1.
=== Chapter 2 ===
This is the content of chapter 2.
""")
clean_config = {
    "separators": [r"=== Chapter \d+ ==="],
    "is_separator_regex": True,  # "\d+" requires regex mode
    "keep_separator": False,  # Remove chapter headers from chunks
    "max_chunk_size": 150
}
content_with_headers.enable_index(**clean_config)
# Use in documents
{
    "interview_transcript": dialogue_content,
    "book_content": content_with_headers
}
Start with server defaults and customize only when you have specific requirements. You can combine any of these parameters for your use case, as in the sketch below.
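For example, a sketch combining a specific model, chunk sizing, and regex separators in a single call (the model choice and values here are illustrative):
from onenode import Text, Models
report = Text("""
Q1 Summary: Revenue grew steadily across all regions.
Q2 Summary: Two new markets were opened.
""")
# Combine any subset of the configuration parameters in one call
report.enable_index(
    emb_model=Models.TextToEmbedding.OpenAI.TEXT_EMBEDDING_3_LARGE,
    max_chunk_size=250,
    chunk_overlap=25,
    separators=[r"Q\d Summary:"],
    is_separator_regex=True,
    keep_separator=True,
)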
After Processing
Once your document is saved and processed, you can access key properties of the Text object. Focus on these essential properties:
# After processing, access key properties of your Text
documents = collection.find({"_id": "document_id"})
document = documents[0] # find() returns a list
text_obj = document["field_name"]
# Access the original text
print(text_obj.text)
# Output: "Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing."
# Access the chunks (most important for understanding how search works)
print(text_obj.chunks)
# Output: [
# "Alice is a data scientist with expertise in AI and machine learning.",
# "She has led several projects in natural language processing."
# ]
# Check if indexing is enabled
print(text_obj.index_enabled)
# Output: True
Semantic Search Targeting Chunks
Important: Semantic search targets individual chunks, not the entire Text object. This means you get precise matches even from long documents, making search more accurate and relevant.
# Semantic search targets individual chunks, not the whole text
# This allows precise matching even in long documents
# Insert a document with long text content
article_text = Text("""
Machine learning has revolutionized data science in recent years.
Companies are now able to extract valuable insights from large datasets.
Natural language processing enables computers to understand human language.
Deep learning models can process complex patterns in data.
""").enable_index()
collection.insert([{
    "title": "AI Article",
    "content": article_text
}])
# Use collection.query() for semantic search - it targets individual chunks
results = collection.query("language processing")
# The search will match the specific chunk containing "language processing"
# rather than returning the entire long text
for match in results:
    print(f"Matched chunk: {match.chunk}")
    print(f"From document: {match.document['title']}")
    print(f"Score: {match.score}")
    # Access the full Text object from the document if needed
    text_obj = match.document["content"]
    print(f"Total chunks in text: {len(text_obj.chunks)}")
Nested Fields
Text objects can be used in nested structures:
from onenode import Text
# Step 1: Create Text instance for bio
bio_text = Text("Bob has over a decade of experience in AI, focusing on neural networks and deep learning.")
# Step 2: Enable indexing
bio_text.enable_index()
# Step 3: Use in nested document structure
{
    "profile": {
        "name": "Bob",
        "bio": bio_text
    }
}
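Nested Text fields are indexed the same way as top-level ones. A brief sketch, assuming the nested document above is inserted and that semantic search covers nested fields just like top-level ones:
# Insert the nested document, then query as usual
collection.insert([{
    "profile": {
        "name": "Bob",
        "bio": bio_text
    }
}])
results = collection.query("experience with neural networks")
for match in results:
    print(match.chunk)  # Matching chunk from the nested bio field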
Once your text is indexed, explore powerful search capabilities and learn about related operations to get the most out of your indexed content.
Best Practices
- Use Text for semantic search only: For simple text storage, use regular string fields instead.
- Always call .enable_index(): Required to enable semantic search features.
- Start with defaults: Server defaults work well for most use cases.
- Customize sparingly: Only adjust parameters when you have specific requirements.
Got a question? Email us and we'll get back to you within 24 hours.