Text

Overview

OneNode uses vector embeddings to understand the meaning of text beyond simple keyword matching. The Text class provides semantic indexing capabilities with a fluent builder pattern, enabling powerful contextual and conceptual search across your text content.

Note: The Text class is designed specifically for semantic search. For simple text storage without search capabilities, use regular string fields instead.
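To illustrate the distinction, here is a minimal sketch using plain Python dicts (field names are illustrative):

```python
# Plain strings are stored as-is and support only exact lookups;
# Text fields (shown later in this guide) are chunked and embedded
# for semantic search.
user_doc = {
    "name": "Alice",            # regular string: stored, not semantically indexed
    "title": "Data Scientist",  # regular string: stored, not semantically indexed
    # "bio" is the kind of field worth wrapping in Text(...).enable_index(),
    # since you may want to search it by meaning rather than exact words.
    "bio": "Alice leads NLP projects and builds ML pipelines.",
}
```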

Key Features:

  • Semantic Indexing: Enable intelligent text understanding with the fluent .enable_index() method.
  • Automatic Chunking: Large text is intelligently split into smaller pieces for efficient embeddings.
  • Asynchronous Processing: Embeddings are generated in the background without blocking your application.
  • Contextual Search: Find content based on meaning and context, not just keywords.
  • Server Defaults: Unspecified parameters automatically use optimized server-side defaults.

Basic Usage

The Text class should be used with the .enable_index() method to enable semantic search capabilities:

from onenode import Text

# Step 1: Create Text instance
bio_text = Text("Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing.")

# Step 2: Enable indexing
bio_text.enable_index()

# Step 3: Use in document
{
  "field_name": bio_text
}

This creates a Text object with semantic indexing enabled using server defaults for the embedding model and chunking strategy.


Configuration Reference

Note: All configuration parameters are completely optional and recommended only for advanced users. OneNode automatically uses optimized defaults that work well for most use cases.

Parameter            Type       Description                         Default
-------------------  ---------  ----------------------------------  ----------------
emb_model            string     Embedding model to use              Server optimized
max_chunk_size       number     Maximum chunk size in characters    Server optimized
chunk_overlap        number     Character overlap between chunks    Server optimized
separators           string[]   Text splitting patterns             Server optimized
is_separator_regex   boolean    Enable regex in separators          false
keep_separator       boolean    Preserve separators in chunks       false
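The parameters above can be combined in a single enable_index() call. A sketch of a combined configuration follows; the values are illustrative, not recommendations:

```python
# Combined configuration sketch: every key maps to a parameter in the
# table above. Values here are illustrative, not recommended defaults.
combined_config = {
    "max_chunk_size": 400,           # characters per chunk
    "chunk_overlap": 40,             # 10% overlap to preserve context
    "separators": [r"\n\n", r"\n"],  # prefer paragraph breaks, then lines
    "is_separator_regex": True,      # treat separators as regex patterns
    "keep_separator": False,         # drop separators from chunk text
}

# Usage (assuming the Text API shown in this guide):
# from onenode import Text
# Text("...").enable_index(**combined_config)
```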

Advanced Customization

The following examples show how to customize Text indexing behavior for specific use cases. These configurations are optional and should only be used when you need specific behavior.

Embedding Model

Specify a specific embedding model for quality, speed, or cost optimization:

from onenode import Text, Models

# Using a specific embedding model for higher quality embeddings
content_text = Text("Research paper abstract on machine learning algorithms and their applications in healthcare.")

# Configure with a high-quality embedding model
emb_config = {
    "emb_model": Models.TextToEmbedding.OpenAI.TEXT_EMBEDDING_3_LARGE
}

content_text.enable_index(**emb_config)

# Use in document
{
    "abstract": content_text
}

Chunk Size

Control chunk size for different content types:

from onenode import Text

# For short content - smaller chunks for precise matching
short_content = Text("Product description: High-quality wireless headphones with noise cancellation.")

short_config = {
    "max_chunk_size": 100  # Smaller chunks for short content
}

short_content.enable_index(**short_config)

# For long articles - larger chunks to maintain context
long_article = Text("""
Long article content here with multiple paragraphs discussing 
various aspects of artificial intelligence, machine learning, 
and their applications across different industries...
""")

long_config = {
    "max_chunk_size": 800  # Larger chunks for long content
}

long_article.enable_index(**long_config)

# Use in documents
{
    "product_description": short_content,
    "article_content": long_article
}

Chunk Overlap

Configure overlap between chunks to preserve context:

from onenode import Text

# High overlap for better context preservation
technical_doc = Text("Technical documentation with interconnected concepts and cross-references between sections.")

high_overlap_config = {
    "max_chunk_size": 300,
    "chunk_overlap": 50  # High overlap to preserve context
}

technical_doc.enable_index(**high_overlap_config)

# Low overlap for distinct content sections
news_article = Text("News article with clear paragraph separations and distinct topics in each section.")

low_overlap_config = {
    "max_chunk_size": 300,
    "chunk_overlap": 10  # Low overlap for distinct sections
}

news_article.enable_index(**low_overlap_config)

# Use in documents
{
    "technical_documentation": technical_doc,
    "news_content": news_article
}

Custom Separators

Define how text should be split for structured content:

from onenode import Text

# Custom separators for structured content
structured_content = Text("""
Section 1: Introduction
This is the introduction section.

Section 2: Methods
This section describes the methods used.

Section 3: Results
Here are the results of our study.
""")

section_config = {
    "separators": [r"Section \d+:", "\n\n"],  # Split by section headers and paragraphs
    "is_separator_regex": True,  # "Section \d+:" is a regex pattern
    "max_chunk_size": 200
}

structured_content.enable_index(**section_config)

# Different separators for code documentation
code_doc = Text("""
### Function: processData()
This function processes input data.

### Function: validateInput()
This function validates user input.

### Function: generateReport()
This function generates the final report.
""")

code_config = {
    "separators": ["### Function:", "\n\n"],  # Split by function headers
    "max_chunk_size": 150
}

code_doc.enable_index(**code_config)

# Use in documents
{
    "research_paper": structured_content,
    "api_documentation": code_doc
}

Regex Separators

Use regex patterns for complex text splitting:

from onenode import Text

# Using regex patterns for complex splitting
email_content = Text("""
From: alice@example.com
Subject: Project Update
Date: 2024-01-15

Hello team,
Here's the weekly project update...

From: bob@example.com  
Subject: Meeting Notes
Date: 2024-01-16

Meeting summary from today...
""")

# Use regex to split by email headers
regex_config = {
    "separators": [r"^From: .+@.+\..+$"],  # Regex pattern for email headers
    "is_separator_regex": True,  # Enable regex mode
    "max_chunk_size": 300
}

email_content.enable_index(**regex_config)

# Use in document
{
    "email_thread": email_content
}

Preserve Separators

Control whether to keep or remove separator text in chunks:

from onenode import Text

# Keep separators for context preservation
dialogue_content = Text("""
Speaker A: What are your thoughts on AI development?
Speaker B: I think it's progressing rapidly.
Speaker A: Do you see any concerns?
Speaker B: Yes, particularly around ethics and safety.
""")

# Keep speaker labels for context
dialogue_config = {
    "separators": [r"Speaker [AB]:"],
    "is_separator_regex": True,  # "[AB]" is a regex character class
    "keep_separator": True,  # Keep the speaker labels in chunks
    "max_chunk_size": 100
}

dialogue_content.enable_index(**dialogue_config)

# Remove separators for cleaner chunks
content_with_headers = Text("""
=== Chapter 1 ===
This is the content of chapter 1.

=== Chapter 2 ===  
This is the content of chapter 2.
""")

clean_config = {
    "separators": [r"=== Chapter \d+ ==="],
    "is_separator_regex": True,  # "\d+" is a regex pattern
    "keep_separator": False,  # Remove chapter headers from chunks
    "max_chunk_size": 150
}

content_with_headers.enable_index(**clean_config)

# Use in documents
{
    "interview_transcript": dialogue_content,
    "book_content": content_with_headers
}

💡 Pro Tip

Start with server defaults and only customize when you have specific requirements. You can combine any parameters for your use case.


After Processing

Once your document is saved and processed, you can access key properties of the Text object. Focus on these essential properties:

# After processing, access key properties of your Text
documents = collection.find({"_id": "document_id"})
document = documents[0]  # find() returns a list
text_obj = document["field_name"]

# Access the original text
print(text_obj.text)
# Output: "Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing."

# Access the chunks (most important for understanding how search works)
print(text_obj.chunks)
# Output: [
#   "Alice is a data scientist with expertise in AI and machine learning.",
#   "She has led several projects in natural language processing."
# ]

# Check if indexing is enabled
print(text_obj.index_enabled)
# Output: True

Semantic Search Targeting Chunks

Important: Semantic search targets individual chunks, not the entire Text object. This means you get precise matches even from long documents, making search more accurate and relevant.

# Semantic search targets individual chunks, not the whole text
# This allows precise matching even in long documents

# Insert a document with long text content
article_text = Text("""
Machine learning has revolutionized data science in recent years. 
Companies are now able to extract valuable insights from large datasets. 
Natural language processing enables computers to understand human language. 
Deep learning models can process complex patterns in data.
""").enable_index()

collection.insert([{
    "title": "AI Article",
    "content": article_text
}])

# Use collection.query() for semantic search - it targets individual chunks
results = collection.query("language processing")

# The search will match the specific chunk containing "language processing"
# rather than returning the entire long text
for match in results:
    print(f"Matched chunk: {match.chunk}")
    print(f"From document: {match.document['title']}")
    print(f"Score: {match.score}")
    
    # Access the full Text object from the document if needed
    text_obj = match.document["content"]
    print(f"Total chunks in text: {len(text_obj.chunks)}")

Nested Fields

Text objects can be used in nested structures:

from onenode import Text

# Step 1: Create Text instance for bio
bio_text = Text("Bob has over a decade of experience in AI, focusing on neural networks and deep learning.")

# Step 2: Enable indexing
bio_text.enable_index()

# Step 3: Use in nested document structure
{
  "profile": {
    "name": "Bob",
    "bio": bio_text
  }
}

Learn More About Text Search

Once your text is indexed, explore powerful search capabilities and learn about related operations to get the most out of your indexed content.

Best Practices

  • Use Text for semantic search only: For simple text storage, use regular string fields instead.
  • Always use .enable_index(): Required to enable semantic search features.
  • Start with defaults: Server defaults work well for most use cases.
  • Customize sparingly: Only adjust parameters when you have specific requirements.

Got a question? Email us and we'll get back to you within 24 hours.