

Overview

Zvec provides local embedding models, built on the Sentence Transformers library, that run without any API calls. The models run on your own hardware (CPU/GPU) and work offline after the initial download. Location: python/zvec/extension/sentence_transformer_embedding_function.py

Installation

pip install sentence-transformers

# For ModelScope (recommended for users in China)
pip install modelscope

DefaultLocalDenseEmbedding

Local dense embedding using all-MiniLM-L6-v2 model (or Chinese-optimized alternative).

Constructor

from zvec.extension import DefaultLocalDenseEmbedding

DefaultLocalDenseEmbedding(
    model_source: Literal["huggingface", "modelscope"] = "huggingface",
    device: Optional[str] = None,
    normalize_embeddings: bool = True,
    batch_size: int = 32,
    **kwargs
)

Parameters

model_source
Literal['huggingface', 'modelscope']
default:"huggingface"
Model source:
  • "huggingface": Use Hugging Face Hub (default, for international users)
  • "modelscope": Use ModelScope (recommended for users in China)
device
Optional[str]
default:"None"
Device to run the model on:
  • "cpu": CPU inference
  • "cuda": NVIDIA GPU
  • "mps": Apple Silicon GPU
  • None: Automatic detection
normalize_embeddings
bool
default:"True"
Whether to normalize embeddings to unit length (L2 normalization). Useful for cosine similarity.
batch_size
int
default:"32"
Batch size for encoding.

Properties

  • dimension (int): Always 384 for both models
  • model_name (str): "all-MiniLM-L6-v2" (Hugging Face) or "iic/nlp_gte_sentence-embedding_chinese-small" (ModelScope)

Methods

embed()

def embed(self, input: str) -> DenseVectorType:
    """Generate dense embedding vector for the input text."""
Parameters:
  • input (str): Input text string to embed. Maximum length is model-dependent, typically 128-512 tokens; longer input is truncated.
Returns:
  • DenseVectorType: List of floats representing the embedding vector (384 dimensions).
Raises:
  • TypeError: If input is not a string
  • ValueError: If input is empty
  • RuntimeError: If model inference fails

Usage Examples

Basic Usage (Hugging Face)

from zvec.extension import DefaultLocalDenseEmbedding

emb_func = DefaultLocalDenseEmbedding()
vector = emb_func.embed("Hello, world!")
print(len(vector))  # 384
print(isinstance(vector, list))  # True

ModelScope (For Users in China)

# Recommended for users in China
emb_func = DefaultLocalDenseEmbedding(model_source="modelscope")
vector = emb_func.embed("你好,世界!")  # Works well with Chinese
print(len(vector))  # 384

Alternative: Hugging Face Mirror

import os

# Use HF mirror for users in China
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

emb_func = DefaultLocalDenseEmbedding()  # Uses HF mirror
vector = emb_func.embed("Hello, world!")

GPU Acceleration

# Use CUDA GPU
emb_func = DefaultLocalDenseEmbedding(device="cuda")
vector = emb_func.embed("Machine learning is fascinating")

# Normalized vectors have unit length
import numpy as np
print(np.linalg.norm(vector))  # 1.0

Semantic Similarity

import numpy as np

emb_func = DefaultLocalDenseEmbedding()

v1 = emb_func.embed("The cat sits on the mat")
v2 = emb_func.embed("A feline rests on a rug")
v3 = emb_func.embed("Python programming")

similarity_high = np.dot(v1, v2)  # Similar sentences
similarity_low = np.dot(v1, v3)   # Different topics

print(similarity_high > similarity_low)  # True

DefaultLocalSparseEmbedding

Local sparse embedding using SPLADE (SParse Lexical AnD Expansion) model. Generates sparse, interpretable representations ideal for lexical matching and hybrid search.

Constructor

from zvec.extension import DefaultLocalSparseEmbedding

DefaultLocalSparseEmbedding(
    model_source: Literal["huggingface", "modelscope"] = "huggingface",
    device: Optional[str] = None,
    encoding_type: Literal["query", "document"] = "query",
    **kwargs
)

Parameters

model_source
Literal['huggingface', 'modelscope']
default:"huggingface"
Model source (ModelScope support may vary for SPLADE models).
device
Optional[str]
default:"None"
Device to run the model on ("cpu", "cuda", "mps", or None).
encoding_type
Literal['query', 'document']
default:"query"
Encoding type:
  • "query": Optimize for search queries (default)
  • "document": Optimize for indexed documents

Properties

  • model_name (str): "naver/splade-cocondenser-ensembledistil"
  • model_source (str): The model source being used

Methods

embed()

def embed(self, input: str) -> SparseVectorType:
    """Generate sparse embedding vector for the input text."""
Parameters:
  • input (str): Input text string to embed.
Returns:
  • SparseVectorType: Dictionary mapping dimension index to weight. Only non-zero dimensions included. Sorted by indices.
Raises:
  • TypeError: If input is not a string
  • ValueError: If input is empty
  • RuntimeError: If model inference fails

Cache Management

SPLADE models are cached at class level to save memory when using multiple instances:
# Clear all cached models
DefaultLocalSparseEmbedding.clear_cache()

# Get cache information
info = DefaultLocalSparseEmbedding.get_cache_info()
print(f"Cached models: {info['cached_models']}")

# Remove specific model from cache
removed = DefaultLocalSparseEmbedding.remove_from_cache(device="cuda")

Usage Examples

Basic Usage

from zvec.extension import DefaultLocalSparseEmbedding

# Query embedding
query_emb = DefaultLocalSparseEmbedding(encoding_type="query")
query_vec = query_emb.embed("machine learning algorithms")

print(type(query_vec))  # <class 'dict'>
print(len(query_vec))   # ~150-200 non-zero dimensions

Memory-Efficient Dual Encoders

# Both instances share the same model (~200MB total, not 400MB)
query_emb = DefaultLocalSparseEmbedding(encoding_type="query")
doc_emb = DefaultLocalSparseEmbedding(encoding_type="document")

query_vec = query_emb.embed("what causes aging fast")
doc_vec = doc_emb.embed(
    "UV-A light causes tanning, skin aging, and cataracts..."
)

Asymmetric Retrieval

query_emb = DefaultLocalSparseEmbedding(encoding_type="query")
doc_emb = DefaultLocalSparseEmbedding(encoding_type="document")

query_vec = query_emb.embed("machine learning")
doc_vec = doc_emb.embed("Machine learning is a subset of AI")

# Calculate similarity (dot product); only dimensions present in
# both vectors contribute, so iterate over the key intersection
similarity = sum(
    query_vec[k] * doc_vec[k]
    for k in query_vec.keys() & doc_vec.keys()
)
print(f"Similarity: {similarity}")

Inspecting Sparse Dimensions

query_vec = query_emb.embed("machine learning")

# Sorted by indices
print(list(query_vec.items())[:5])
# [(10, 0.45), (23, 0.87), (56, 0.32), (89, 1.12), (120, 0.65)]

# Sort by weight to find top terms
top_terms = sorted(query_vec.items(), key=lambda x: x[1], reverse=True)[:5]
for idx, weight in top_terms:
    print(f"Dimension {idx}: {weight:.3f}")
# Dimension 1023: 1.450
# Dimension 245: 1.230
# Dimension 8901: 0.980
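
Because each SPLADE dimension is a token id in the model's vocabulary, the indices can be mapped back to human-readable terms. A minimal sketch, continuing with query_emb from above and assuming the Hugging Face transformers tokenizer for the underlying checkpoint (this mapping is not part of the zvec API):

from transformers import AutoTokenizer

# Tokenizer for the same checkpoint the sparse model wraps
tokenizer = AutoTokenizer.from_pretrained("naver/splade-cocondenser-ensembledistil")

query_vec = query_emb.embed("machine learning")

# Map the top-weighted dimension indices back to vocabulary tokens
for idx, weight in sorted(query_vec.items(), key=lambda x: x[1], reverse=True)[:5]:
    token = tokenizer.convert_ids_to_tokens([idx])[0]
    print(f"{token}: {weight:.3f}")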

Hybrid Retrieval

Combine dense and sparse embeddings for optimal search:
from zvec.extension import (
    DefaultLocalDenseEmbedding,
    DefaultLocalSparseEmbedding
)

# Dense for semantic similarity
dense_emb = DefaultLocalDenseEmbedding()

# Sparse for lexical matching
sparse_emb = DefaultLocalSparseEmbedding()

query = "deep learning neural networks"

# Get both embeddings
dense_vec = dense_emb.embed(query)   # [0.1, -0.3, 0.5, ...]
sparse_vec = sparse_emb.embed(query)  # {12: 0.8, 45: 1.2, ...}

# Combine scores for hybrid retrieval
# final_score = α * dense_score + (1-α) * sparse_score
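
A minimal sketch of the combination step, continuing from the snippet above; it assumes normalized dense vectors (so the dot product equals cosine similarity), leaves both score ranges unnormalized for brevity, and treats alpha as an application-level tuning weight rather than a zvec setting. A production setup would use the document-optimized sparse encoder for documents, as in Asymmetric Retrieval above:

import numpy as np

def hybrid_score(query: str, document: str, alpha: float = 0.5) -> float:
    # Dense score: dot product of unit-length vectors equals cosine similarity
    dense_score = float(np.dot(dense_emb.embed(query), dense_emb.embed(document)))
    # Sparse score: dot product over dimensions present in both vectors
    q_vec = sparse_emb.embed(query)
    d_vec = sparse_emb.embed(document)
    sparse_score = sum(q_vec[k] * d_vec[k] for k in q_vec.keys() & d_vec.keys())
    # Weighted combination: alpha trades semantic vs. lexical signal
    return alpha * dense_score + (1 - alpha) * sparse_score

print(hybrid_score("deep learning neural networks",
                   "Neural networks power modern deep learning"))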

Model Information

Dense Model (all-MiniLM-L6-v2)

  • Dimensions: 384
  • Model Size: ~50-80MB
  • Speed: ~1,000 sentences/sec (CPU), ~10,000 sentences/sec (GPU)
  • Cache: ~/.cache/torch/sentence_transformers/
  • Best For: General-purpose semantic similarity

Dense Model (ModelScope Chinese)

  • Model: iic/nlp_gte_sentence-embedding_chinese-small
  • Dimensions: 384
  • Cache: ~/.cache/modelscope/hub/
  • Best For: Chinese text processing

Sparse Model (SPLADE)

  • Model: naver/splade-cocondenser-ensembledistil
  • Dimensions: ~30,000 (vocabulary size)
  • Non-zero values: ~100-200 per text
  • Model Size: ~100MB
  • Best For: Lexical matching, hybrid search

Best Practices

First Download: On first run, models are downloaded automatically (to pre-download, see the sketch after this list). Ensure you have:
  • Stable internet connection
  • ~200MB free disk space
  • Write permissions to cache directory
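
To pay the download cost ahead of time (for example, in a container build or setup script), instantiating each class once and embedding a short string is enough; a minimal sketch:

# warm_cache.py - run once during setup to download and cache both models
from zvec.extension import DefaultLocalDenseEmbedding, DefaultLocalSparseEmbedding

DefaultLocalDenseEmbedding().embed("warm-up")
DefaultLocalSparseEmbedding().embed("warm-up")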
For Users in China: Use ModelScope or HF mirror to avoid connection issues:
# Option 1: ModelScope
emb = DefaultLocalDenseEmbedding(model_source="modelscope")

# Option 2: HF Mirror
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
emb = DefaultLocalDenseEmbedding()
GPU Memory: Dense models require ~200MB GPU memory, sparse models ~300MB. Monitor GPU usage when using CUDA.
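
To verify actual usage on your hardware, a quick check with PyTorch (assumes a CUDA device is available; memory_allocated reports tensors allocated by the current process):

import torch

from zvec.extension import DefaultLocalDenseEmbedding

emb = DefaultLocalDenseEmbedding(device="cuda")
_ = emb.embed("warm-up call loads the model onto the GPU")

print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e6:.0f} MB")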

Comparison

Feature             | Dense (all-MiniLM)   | Sparse (SPLADE)
--------------------|----------------------|----------------------
Output Format       | List (384 floats)    | Dict (~150 non-zero)
Model Size          | ~80MB                | ~100MB
Inference Speed     | Fast                 | Medium
Best For            | Semantic similarity  | Keyword matching
Interpretability    | Low                  | High
Memory (per vector) | ~1.5KB               | ~1-2KB
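
The per-vector figures follow from the storage layout, assuming float32 storage and 4-byte indices: a dense vector holds 384 float32 values, and 384 × 4 bytes ≈ 1.5KB; a sparse vector with ~150 entries at roughly 8 bytes each (4-byte index plus 4-byte weight, before container overhead) lands in the 1-2KB range.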

Error Handling

try:
    vector = emb_func.embed("")  # Empty string
except ValueError as e:
    print(f"Error: {e}")
    # Error: Input text cannot be empty or whitespace only

try:
    vector = emb_func.embed(123)  # Wrong type
except TypeError as e:
    print(f"Error: {e}")
    # Error: Expected 'input' to be str, got int

Notes

  • Requires Python 3.10, 3.11, or 3.12
  • No API keys or authentication required
  • Works offline after initial download
  • First call slower due to model loading (see the timing sketch below)
  • GPU provides 5-10x speedup over CPU
  • Models stay in memory for subsequent calls
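
The one-time loading cost is easy to observe by timing the first and second calls; a minimal sketch:

import time

from zvec.extension import DefaultLocalDenseEmbedding

t0 = time.perf_counter()
emb = DefaultLocalDenseEmbedding()
emb.embed("first call")   # includes one-time model loading
t1 = time.perf_counter()
emb.embed("second call")  # model already in memory
t2 = time.perf_counter()

print(f"first call: {t1 - t0:.2f}s, second call: {t2 - t1:.3f}s")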

See Also