mosuka/sage

Sage

Crates.io Documentation License: MIT

A fast, featureful full-text search library for Rust, inspired by Lucene and its alternatives.

✨ Features

  • Pure Rust Implementation - Memory safety and fast performance
  • Flexible Text Analysis - Configurable tokenization, stemming, and filtering pipeline
  • Multiple Storage Backends - File system, memory-mapped files, and in-memory storage
  • Advanced Query Types - Term, phrase, range, boolean, fuzzy, wildcard, and geographic queries
  • Vector Search - HNSW-based approximate nearest neighbor search with multiple distance metrics
  • Text Embeddings - Built-in support for generating embeddings with Candle (local BERT models) and OpenAI
  • Multimodal Search - Cross-modal search with CLIP models for text-to-image and image-to-image similarity
  • BM25 Scoring - Industry-standard relevance scoring with customizable parameters
  • Spell Correction - Built-in spell checking and query suggestion system
  • Faceted Search - Multi-dimensional search with facet aggregation
  • Real-time Search - Near real-time search with background index optimization
  • SIMD Acceleration - Optimized vector operations for improved performance
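
To make the BM25 bullet concrete, here is a minimal standalone sketch of the per-term score. It does not use Sage's API: the function name `bm25_term_score` and its parameter list are assumptions made for this illustration, and the IDF shown is the Lucene-style variant, which may differ from what Sage implements. `k1` and `b` are the customizable parameters; 1.2 and 0.75 are common defaults.

```rust
/// Okapi BM25 contribution of one query term to one document's score.
///
/// * `tf` - occurrences of the term in the document
/// * `doc_len`, `avg_doc_len` - document length and corpus average, in tokens
/// * `n_docs` - number of documents in the corpus
/// * `doc_freq` - number of documents that contain the term
/// * `k1`, `b` - tunable parameters (term-frequency saturation, length normalization)
pub fn bm25_term_score(
    tf: f64,
    doc_len: f64,
    avg_doc_len: f64,
    n_docs: f64,
    doc_freq: f64,
    k1: f64,
    b: f64,
) -> f64 {
    // Lucene-style IDF: stays non-negative even for terms found in every document.
    let idf = (1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5)).ln();
    // Documents longer than average are penalized in proportion to `b`.
    let length_norm = 1.0 - b + b * (doc_len / avg_doc_len);
    idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)
}
```

Under this formula, a term that occurs once in a document of average length scores exactly its IDF, and raising `b` toward 1.0 strengthens the penalty for long documents.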

🚀 Quick Start

Add Sage to your Cargo.toml:

[dependencies]
sage = "0.1"

Basic Usage

use sage::prelude::*;
use tempfile::TempDir;

fn main() -> Result<()> {
    // Create a temporary directory for the index
    let temp_dir = TempDir::new().unwrap();
    
    // Define a schema
    let mut schema = Schema::new();
    schema.add_field("title", Box::new(TextField::new().stored(true).indexed(true)))?;
    schema.add_field("content", Box::new(TextField::new().indexed(true)))?;
    
    // Create a search engine
    let mut engine = SearchEngine::create_in_dir(
        temp_dir.path(), 
        schema, 
        IndexConfig::default()
    )?;
    
    // Add documents
    let documents = vec![
        Document::builder()
            .add_text("title", "Rust Programming")
            .add_text("content", "Rust is a systems programming language")
            .build(),
        Document::builder()
            .add_text("title", "Python Guide") 
            .add_text("content", "Python is a versatile programming language")
            .build(),
    ];
    
    engine.add_documents(documents)?;
    engine.commit()?;
    
    // Search documents
    let query = TermQuery::new("content".to_string(), "programming".to_string());
    let results = engine.search(&query, 10)?;
    
    println!("Found {} matches", results.total_hits);
    for hit in results.hits {
        println!("Score: {:.2}, Document: {:?}", hit.score, hit.document);
    }
    
    Ok(())
}

🏗️ Architecture

Sage is built with a modular architecture:

Core Components

  • Schema & Fields - Define document structure with typed fields (text, numeric, boolean, geographic)
  • Analysis Pipeline - Configurable text processing with tokenizers, filters, and stemmers
  • Storage Layer - Pluggable storage backends with transaction support
  • Index Structure - Inverted index with posting lists and term dictionaries
  • Query Engine - Flexible query system supporting multiple query types
  • Search Engine - High-level interface combining indexing and search operations
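
The "inverted index with posting lists and term dictionaries" idea can be sketched in a few lines of standard-library Rust. This is a toy model for illustration only, not Sage's index format: `ToyInvertedIndex` and its methods are hypothetical names, and tokenization here is plain whitespace splitting plus lowercasing.

```rust
use std::collections::{BTreeMap, HashMap};

/// A toy inverted index: term -> posting list of (doc_id, term_frequency).
#[derive(Default)]
pub struct ToyInvertedIndex {
    // A sorted map plays the role of the term dictionary.
    postings: BTreeMap<String, Vec<(u64, u32)>>,
}

impl ToyInvertedIndex {
    /// Index one document. Doc ids are assumed to arrive in increasing
    /// order, so every posting list stays sorted by doc id.
    pub fn add_document(&mut self, doc_id: u64, text: &str) {
        let mut counts: HashMap<String, u32> = HashMap::new();
        for token in text.split_whitespace() {
            *counts.entry(token.to_lowercase()).or_insert(0) += 1;
        }
        for (term, tf) in counts {
            self.postings.entry(term).or_default().push((doc_id, tf));
        }
    }

    /// Posting list for a term (empty if the term was never indexed).
    pub fn postings(&self, term: &str) -> &[(u64, u32)] {
        self.postings.get(term).map(|v| v.as_slice()).unwrap_or(&[])
    }
}
```

A term query then amounts to one dictionary lookup followed by a walk over the posting list, which is why lookups stay cheap as the corpus grows.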

Field Types

// Text field for full-text search
TextField::new().stored(true).indexed(true)

// Numeric field for range queries
NumericField::new().indexed(true)

// Boolean field for filtering
BooleanField::new().indexed(true)

// ID field for exact matches
IdField::new()

// Geographic field for spatial queries  
GeoField::new().indexed(true)

// Vector field for similarity search
VectorField::new(128).indexed(true) // 128-dimensional vectors

Query Types

// Term query
TermQuery::new("field".to_string(), "term".to_string())

// Phrase query
PhraseQuery::new("field".to_string(), vec!["hello".to_string(), "world".to_string()])

// Range query
RangeQuery::new("price".to_string(), Some(100.0), Some(500.0))

// Boolean query
BooleanQuery::new()
    .add_must(TermQuery::new("category".to_string(), "book".to_string()))
    .add_should(TermQuery::new("author".to_string(), "tolkien".to_string()))

// Fuzzy query
FuzzyQuery::new("title".to_string(), "progamming".to_string(), 2) // max edit distance: 2

// Wildcard query  
WildcardQuery::new("filename".to_string(), "*.pdf".to_string())

// Geographic query
GeoQuery::within_radius("location".to_string(), 40.7128, -74.0060, 10.0) // NYC, 10km radius
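
For intuition about how a pattern such as `*.pdf` is evaluated against a single term, here is a standalone matcher. It is a sketch under assumptions: the function name `wildcard_match` is hypothetical, and the `?` single-character wildcard follows the Lucene convention, which Sage may or may not share. A real engine would typically run such a pattern against the term dictionary rather than against each document.

```rust
/// Match `text` against a pattern where `*` matches any run of characters
/// (including none) and `?` matches exactly one character.
pub fn wildcard_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0usize, 0usize);
    // Pattern index of the last `*` and the text index it currently covers up to.
    let mut star: Option<(usize, usize)> = None;

    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            // Tentatively let `*` match the empty string; remember it for backtracking.
            star = Some((pi, ti));
            pi += 1;
        } else if pi < p.len() && (p[pi] == '?' || p[pi] == t[ti]) {
            pi += 1;
            ti += 1;
        } else if let Some((sp, st)) = star {
            // Mismatch: let the last `*` absorb one more character and retry.
            pi = sp + 1;
            ti = st + 1;
            star = Some((sp, st + 1));
        } else {
            return false;
        }
    }
    // Text is exhausted; only trailing `*`s may remain in the pattern.
    p[pi..].iter().all(|&c| c == '*')
}
```

For example, `wildcard_match("*.pdf", "report.pdf")` is true, while `wildcard_match("*.pdf", "report.pdf.bak")` is false.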

🎯 Advanced Features

Vector Search with Text Embeddings

Sage supports semantic search using text embeddings. You can use local BERT models via Candle or OpenAI's API.

Using Candle (Local BERT Models)

[dependencies]
sage = { version = "0.1", features = ["embeddings-candle"] }

use sage::embedding::{CandleTextEmbedder, TextEmbedder};
use sage::vector::*;

// Initialize embedder with a sentence-transformers model
let embedder = CandleTextEmbedder::new("sentence-transformers/all-MiniLM-L6-v2")?;

// Generate embeddings for documents
let documents = vec![
    "Rust is a systems programming language",
    "Python is great for data science",
    "Machine learning with neural networks",
];

let vectors = embedder.embed_batch(&documents).await?;

// Build vector index
let config = VectorIndexBuildConfig {
    dimension: embedder.dimension(),
    distance_metric: DistanceMetric::Cosine,
    index_type: VectorIndexType::Flat,
    normalize_vectors: true,
    ..Default::default()
};

let mut builder = VectorIndexBuilderFactory::create_builder(config)?;
let doc_vectors: Vec<(u64, Vector)> = documents
    .iter()
    .enumerate()
    .zip(vectors.iter())
    .map(|((idx, _), vec)| (idx as u64, vec.clone()))
    .collect();

builder.add_vectors(doc_vectors)?;
builder.finalize()?;

// Search with query embedding
let query_vector = embedder.embed("programming languages").await?;
// Perform similarity search...
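
The similarity search step left open above can be pictured as follows for a flat index with the cosine metric. This is a standalone sketch, not Sage's search API: `normalize`, `cosine_similarity`, and `flat_search` are hypothetical helper names, and vectors are plain `f32` slices rather than Sage's `Vector` type.

```rust
/// Scale a vector to unit length (an all-zero vector is returned unchanged).
pub fn normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        v.to_vec()
    } else {
        v.iter().map(|x| x / norm).collect()
    }
}

/// Cosine similarity of two equal-length vectors.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let (a, b) = (normalize(a), normalize(b));
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

/// Exhaustive ("flat") search: score every stored vector, keep the best `k`.
pub fn flat_search(index: &[(u64, Vec<f32>)], query: &[f32], k: usize) -> Vec<(u64, f32)> {
    let mut scored: Vec<(u64, f32)> = index
        .iter()
        .map(|(id, v)| (*id, cosine_similarity(v, query)))
        .collect();
    // Highest similarity first; total_cmp gives a total order even if a NaN appears.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

A flat scan is exact but linear in the number of vectors; an HNSW index trades a little recall for much faster lookups on large collections.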

Using OpenAI Embeddings

[dependencies]
sage = { version = "0.1", features = ["embeddings-openai"] }

use sage::embedding::{OpenAIEmbedder, TextEmbedder};

// Initialize with API key
let embedder = OpenAIEmbedder::new(
    "your-api-key",
    "text-embedding-3-small"
)?;

// Generate embeddings
let vector = embedder.embed("your text here").await?;

Multimodal Search (Text + Images)

Sage supports cross-modal search using CLIP (Contrastive Language-Image Pre-Training) models, enabling semantic search across text and images. This allows you to:

  • Text-to-Image Search: Find images using natural language queries
  • Image-to-Image Search: Find visually similar images using an image query
  • Semantic Understanding: Search based on content meaning, not just keywords

Setup

Add the embeddings-multimodal feature to your Cargo.toml:

[dependencies]
sage = { version = "0.1", features = ["embeddings-multimodal"] }

Text-to-Image Search Example

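use sage::embedding::{CandleMultimodalEmbedder, TextEmbedder, ImageEmbedder};
use sage::vector::index::{VectorIndexBuildConfig, VectorIndexBuilderFactory};
// DistanceMetric and VectorIndexType are used below; this assumes the glob
// import from the Candle example also brings them into scope here.
use sage::vector::*;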

// Initialize CLIP embedder (automatically downloads model from HuggingFace)
let embedder = CandleMultimodalEmbedder::new("openai/clip-vit-base-patch32")?;

// Create vector index with CLIP's embedding dimension (512)
let config = VectorIndexBuildConfig {
    dimension: embedder.dimension(), // 512 for CLIP ViT-Base-Patch32
    distance_metric: DistanceMetric::Cosine,
    index_type: VectorIndexType::HNSW,
    ..Default::default()
};
let mut builder = VectorIndexBuilderFactory::create_builder(config)?;

// Index your image collection
let mut image_vectors = Vec::new();
for (id, image_path) in image_paths.iter().enumerate() {
    let vector = embedder.embed_image(image_path).await?;
    image_vectors.push((id as u64, vector));
}
builder.add_vectors(image_vectors)?;
let index = builder.finalize()?;

// Search images using natural language
let query_vector = embedder.embed("a photo of a cat playing").await?;
let results = index.search(&query_vector, 10)?;

Image-to-Image Search Example

// Find visually similar images using an image as query
let query_image_vector = embedder.embed_image("query.jpg").await?;
let similar_images = index.search(&query_image_vector, 5)?;

How It Works

  1. CLIP Model: Uses OpenAI's CLIP model, which maps both text and images into the same 512-dimensional vector space
  2. Automatic Download: Models are automatically downloaded from HuggingFace on first use
  3. GPU Acceleration: Automatically uses GPU if available (via Candle)
  4. Shared Embedding Space: Text and image embeddings can be directly compared using cosine similarity

Supported Models

Currently, only the CLIP ViT-Base-Patch32 architecture is supported:

  • Model: openai/clip-vit-base-patch32
  • Embedding Dimension: 512
  • Image Size: 224x224

Complete Examples

Working examples with detailed explanations are included as Cargo examples. Run them with:

# Text-to-image search
cargo run --example text_to_image_search --features embeddings-multimodal

# Image-to-image search
cargo run --example image_to_image_search --features embeddings-multimodal -- query.jpg

Faceted Search

use sage::search::facet::*;

// Configure faceted search
let search_request = SearchRequest::new(query)
    .add_facet("category".to_string(), FacetRequest::terms(10))
    .add_facet("price".to_string(), FacetRequest::range(vec![
        (0.0, 50.0),
        (50.0, 100.0), 
        (100.0, f64::INFINITY)
    ]));

let results = engine.search_with_facets(&search_request)?;

// Access facet results
for facet in results.facets {
    println!("Facet: {}", facet.field);
    for bucket in facet.buckets {
        println!("  {}: {} documents", bucket.label, bucket.count);
    }
}
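
Conceptually, a terms facet counts documents per distinct field value and a range facet counts them per interval. The sketch below shows both with the standard library only. The function names are hypothetical, and treating each range as half-open `[lo, hi)` is an assumption of this example, not a documented Sage behavior.

```rust
use std::collections::HashMap;

/// Count documents per distinct value and return the `limit` largest buckets.
pub fn terms_facet(values: &[&str], limit: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for v in values {
        *counts.entry(*v).or_insert(0) += 1;
    }
    let mut buckets: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(label, count)| (label.to_string(), count))
        .collect();
    // Descending count, then label, so ties come out in a deterministic order.
    buckets.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    buckets.truncate(limit);
    buckets
}

/// Count values per half-open range `[lo, hi)`; values outside every range are ignored.
pub fn range_facet(values: &[f64], ranges: &[(f64, f64)]) -> Vec<usize> {
    ranges
        .iter()
        .map(|&(lo, hi)| values.iter().filter(|&&v| v >= lo && v < hi).count())
        .collect()
}
```

With half-open ranges, a price of exactly 50.0 lands in the `(50.0, 100.0)` bucket only, so adjacent ranges never double-count a document.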

Spell Correction

use sage::spelling::*;

// Create spell corrector
let corrector = SpellCorrector::new()
    .max_edit_distance(2)
    .min_word_frequency(5);

// Check and suggest corrections
if let Some(suggestion) = corrector.suggest("progamming")? {
    println!("Did you mean: '{}'?", suggestion.suggestion);
}
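
One common way to implement such a corrector is to rank dictionary words by Levenshtein distance and corpus frequency. The sketch below illustrates that approach; it is not Sage's implementation. `edit_distance` and `suggest` are hypothetical names, and the dictionary is passed in as explicit `(word, frequency)` pairs rather than derived from an index.

```rust
/// Levenshtein edit distance, counted in Unicode scalar values.
pub fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first i-1 chars of `a` and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        let mut curr = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let substitution = prev[j - 1] + usize::from(a[i - 1] != b[j - 1]);
            curr[j] = substitution.min(prev[j] + 1).min(curr[j - 1] + 1);
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Suggest a correction for `word`, or None if it is already a known,
/// sufficiently frequent word or no candidate is close enough.
pub fn suggest<'a>(
    word: &str,
    dictionary: &[(&'a str, u32)],
    max_edit_distance: usize,
    min_word_frequency: u32,
) -> Option<&'a str> {
    if dictionary.iter().any(|(w, f)| *w == word && *f >= min_word_frequency) {
        return None;
    }
    dictionary
        .iter()
        .filter(|(_, freq)| *freq >= min_word_frequency)
        .map(|(cand, freq)| (*cand, edit_distance(word, cand), *freq))
        .filter(|(_, dist, _)| *dist <= max_edit_distance)
        // Smallest distance wins; among equals, prefer the more frequent word.
        .min_by(|a, b| a.1.cmp(&b.1).then_with(|| b.2.cmp(&a.2)))
        .map(|(cand, _, _)| cand)
}
```

The frequency floor keeps rare misspellings that happen to be indexed from being offered as corrections.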

Custom Analysis Pipeline

use sage::analysis::*;

// Create custom analyzer
let analyzer = Analyzer::new()
    .tokenizer(Box::new(RegexTokenizer::new(r"\w+")?))
    .add_filter(Box::new(LowercaseFilter::new()))
    .add_filter(Box::new(StopWordFilter::english()))
    .add_filter(Box::new(PorterStemmer::new()));

// Use in field definition
let field = TextField::new()
    .analyzer(analyzer)
    .stored(true)
    .indexed(true);
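
To show what such a pipeline does to a piece of text, here is a standalone sketch of the tokenize, lowercase, and stop-word stages. It is not Sage's `Analyzer`: `tokenize` and `analyze` are hypothetical names, the tokenizer approximates the `\w+` regex with a character test instead of a regex engine, and the stemming stage is left out.

```rust
use std::collections::HashSet;

/// Split on any character that is not alphanumeric or `_`,
/// which approximates what the regex `\w+` selects.
pub fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !(c.is_alphanumeric() || c == '_'))
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect()
}

/// Run tokenize -> lowercase -> stop-word removal, in that order.
pub fn analyze(text: &str, stop_words: &HashSet<&str>) -> Vec<String> {
    tokenize(text)
        .into_iter()
        .map(|t| t.to_lowercase())
        // Lowercasing first lets "Is" and "is" hit the same stop-word entry.
        .filter(|t| !stop_words.contains(t.as_str()))
        .collect()
}
```

Filter order matters: if the stop-word stage ran before lowercasing, capitalized stop words would slip through.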

📊 Performance

Sage is designed for high performance:

  • SIMD Acceleration - Uses wide instruction sets for vector operations
  • Memory-Mapped I/O - Efficient file access with minimal memory overhead
  • Parallel Processing - Multi-threaded indexing and search operations
  • Incremental Updates - Real-time document addition without full reindexing
  • Index Optimization - Background merge operations for optimal search performance

Benchmarks

On a modern machine with SSD storage:

  • Indexing: ~50,000 documents/second
  • Search: ~100,000 queries/second
  • Memory Usage: ~50MB per 1M documents
  • Index Size: ~60% of original document size

🛠️ Development

Building from Source

git clone https://github.com/mosuka/sage.git
cd sage
cargo build --release

Running Tests

cargo test

Running Benchmarks

cargo bench

Checking Code Quality

cargo clippy
cargo fmt --check

📖 Examples

Sage includes numerous examples demonstrating various features:

Lexical Search Examples

  • term_query - Basic term-based search
  • phrase_query - Multi-word phrase matching
  • boolean_query - Combining queries with AND/OR/NOT logic
  • fuzzy_query - Fuzzy string matching with edit distance
  • wildcard_query - Pattern matching with wildcards
  • range_query - Numeric and date range queries
  • geo_query - Geographic location-based search
  • field_specific_search - Search within specific fields
  • lexical_search - Full lexical search example
  • query_parser - Parsing user query strings

Vector Search Examples

  • vector_search - Semantic text search using CandleTextEmbedder
  • embedding_with_candle - Local BERT model embeddings
  • embedding_with_openai - OpenAI API embeddings
  • dynamic_embedder_switching - Switch between embedding providers

Advanced Features

  • parallel_search - Parallel search execution
  • schemaless_indexing - Dynamic schema management
  • synonym_graph_filter - Synonym expansion in queries
  • keyword_based_intent_classifier - Intent classification
  • ml_based_intent_classifier - ML-powered intent detection
  • document_parser - Parsing various document formats
  • document_converter - Converting between document formats

Run any example with:

cargo run --example <example_name>

# For embedding examples, use feature flags:
cargo run --example vector_search --features embeddings-candle
cargo run --example embedding_with_openai --features embeddings-openai

🔧 Feature Flags

Sage uses feature flags to enable optional functionality:

[dependencies]
# Default features only
sage = "0.1"

# With Candle embeddings (local BERT models)
sage = { version = "0.1", features = ["embeddings-candle"] }

# With OpenAI embeddings
sage = { version = "0.1", features = ["embeddings-openai"] }

# With all embedding features
sage = { version = "0.1", features = ["embeddings-all"] }

Available features:

  • embeddings-candle - Local text embeddings using Candle and BERT models
  • embeddings-openai - OpenAI API-based text embeddings
  • embeddings-multimodal - CLIP-based text and image embeddings (see Multimodal Search)
  • embeddings-all - All embedding providers

📚 Documentation

🀝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under either of

at your option.

🙏 Acknowledgments

  • Inspired by Lucene and its alternatives
  • Built with the excellent Rust ecosystem

📧 Contact


Sage - Fast, featureful full-text search for Rust 🦀
