Configuration

ReadabilityOptions controls extraction.

The defaults are conservative. They favor article pages with enough text to be useful and avoid exposing internal scoring knobs unless they affect common integration cases.

#![allow(unused)]
fn main() {
    use lectito::{MediaRetention, ReadabilityOptions};

    // Override a few fields and take every other field from the defaults.
    let options = ReadabilityOptions {
        char_threshold: 800,
        nb_top_candidates: 8,
        content_selector: Some("article".to_string()),
        site_profiles: Vec::new(),
        media_retention: MediaRetention::Article,
        ..ReadabilityOptions::default()
    };
}

Fields:

| Field | Default | Meaning |
|---|---|---|
| max_elems_to_parse | None | Reject documents above this element count. |
| nb_top_candidates | 5 | Number of high-scoring candidates to consider. |
| char_threshold | 500 | Minimum extracted text length for an accepted attempt. |
| content_selector | None | CSS selector to force as the content root. |
| site_profiles | [] | TOML site profiles for host-scoped extraction hints. |
| mobile_viewport_width | Some(480) | Width used by recovery rules for mobile snapshots. |
| classes_to_preserve | [] | Class names kept during cleanup. |
| keep_classes | false | Keep all class attributes. |
| disable_json_ld | false | Skip JSON-LD metadata extraction. |
| link_density_modifier | 0.0 | Adjust link-density cleanup tolerance. |
| media_retention | Article | Control figure/image/media retention. |

Prefer content_selector when you already know the page shape. It bypasses root scoring for that document, then runs the normal cleanup pipeline.

When content_selector is not set, Lectito still tries a small list of common article-body containers such as #article-body and .entry-content before generic scoring. That catches many large publisher pages without site-specific profiles.
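The order described in the two paragraphs above can be sketched as a small decision function. This is an illustration only, not Lectito's implementation: `Root`, `pick_root`, and the `selectors_on_page` stand-in for a real DOM query are hypothetical, and the list holds only the two selectors named in the text.

```rust
/// How the content root was chosen (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Root<'a> {
    Forced(&'a str),
    CommonContainer(&'a str),
    GenericScoring,
}

/// Only the two selectors named in the text; the real list is not shown here.
const COMMON_CONTAINERS: [&str; 2] = ["#article-body", ".entry-content"];

/// `selectors_on_page` stands in for "which selectors match this document".
/// What happens when a forced selector matches nothing is not modelled.
fn pick_root<'a>(content_selector: Option<&'a str>, selectors_on_page: &[&str]) -> Root<'a> {
    // An explicit selector bypasses root scoring entirely.
    if let Some(selector) = content_selector {
        return Root::Forced(selector);
    }
    // Otherwise try well-known article containers, in order.
    for candidate in COMMON_CONTAINERS {
        if selectors_on_page.contains(&candidate) {
            return Root::CommonContainer(candidate);
        }
    }
    // Nothing matched: fall through to generic candidate scoring.
    Root::GenericScoring
}

fn main() {
    println!("{:?}", pick_root(Some("article"), &["#article-body"]));
    println!("{:?}", pick_root(None, &[".entry-content"]));
    println!("{:?}", pick_root(None, &["div.post"]));
}
```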

Use site_profiles when you want URL-scoped extraction hints, removal selectors, and metadata hints. Profiles are attempted before generic scoring, but weak profile output falls back to the generic extractor.
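The profile-then-fallback behaviour can be pictured as follows. This is a minimal sketch under stated assumptions: `Source` and `choose_output` are hypothetical names, and treating "weak" as "shorter than char_threshold" is a guess based on the field table, not a documented rule.

```rust
/// Which path produced the accepted text (hypothetical type, not part of Lectito).
#[derive(Debug, PartialEq)]
enum Source {
    Profile,
    Generic,
}

/// Accept the profile output when it is long enough; otherwise fall back to
/// the generic extractor's output. "Long enough" is an assumed definition.
fn choose_output(
    profile_text: Option<&str>,
    generic_text: &str,
    char_threshold: usize,
) -> (Source, String) {
    match profile_text {
        Some(text) if text.chars().count() >= char_threshold => {
            (Source::Profile, text.to_string())
        }
        // No profile matched, or its output was too short to trust.
        _ => (Source::Generic, generic_text.to_string()),
    }
}

fn main() {
    let generic = "g".repeat(900);
    let (source, text) = choose_output(Some("too short"), &generic, 500);
    println!("{:?} produced {} characters", source, text.len());
}
```

The point of the fallback is that a stale or overly narrow profile degrades to ordinary extraction instead of returning a near-empty article.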

Use max_elems_to_parse as a guardrail for untrusted input. It rejects very large documents before extraction work continues.
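As a rough illustration of the guardrail's semantics, the sketch below treats None as "no limit" and rejects only counts strictly above the limit, matching the table's wording. `check_element_limit`, the `usize` type, and the error message are assumptions for this example, not Lectito's API.

```rust
/// Hypothetical guard mirroring the documented meaning of `max_elems_to_parse`.
fn check_element_limit(
    element_count: usize,
    max_elems_to_parse: Option<usize>,
) -> Result<(), String> {
    match max_elems_to_parse {
        // Reject only documents *above* the configured count.
        Some(limit) if element_count > limit => Err(format!(
            "document has {element_count} elements, limit is {limit}"
        )),
        // `None` (the default) or a count within the limit: continue.
        _ => Ok(()),
    }
}

fn main() {
    // A caller handling untrusted HTML might fail fast like this.
    match check_element_limit(250_000, Some(100_000)) {
        Ok(()) => println!("within limit, continue extraction"),
        Err(reason) => println!("rejected: {reason}"),
    }
}
```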

Use media_retention when output fidelity matters. Article, the default, keeps body figures and images; None removes media; Conservative is text-first; All keeps media that remains in the selected article subtree.

ReadableOptions controls is_probably_readable.

Lower min_content_length for short posts or documentation pages. Raise min_score when you want the quick check to reject borderline pages.

#![allow(unused)]
fn main() {
    use lectito::ReadableOptions;

    let options = ReadableOptions {
        min_content_length: 140,
        min_score: 20.0,
    };
}
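
To see why lowering min_content_length helps short posts while raising min_score rejects borderline pages, it can help to look at how such a quick check may combine the two values. The sketch below follows the scoring used by Mozilla Readability's isProbablyReaderable (skip short blocks, add the square root of the excess length, stop once the score passes the threshold). Whether Lectito scores pages exactly this way is an assumption, and `QuickCheck` and `probably_readable` are stand-in names, not Lectito types.

```rust
/// Stand-in for the two fields shown above (not the Lectito type).
struct QuickCheck {
    min_content_length: usize,
    min_score: f64,
}

/// `block_lengths` holds the trimmed text length of each candidate block.
fn probably_readable(block_lengths: &[usize], opts: &QuickCheck) -> bool {
    let mut score = 0.0_f64;
    for &len in block_lengths {
        // Blocks shorter than the minimum contribute nothing.
        if len < opts.min_content_length {
            continue;
        }
        // Longer blocks add diminishing credit for their excess length.
        score += ((len - opts.min_content_length) as f64).sqrt();
        if score > opts.min_score {
            return true;
        }
    }
    false
}

fn main() {
    let short_post = [200, 200];
    let strict = QuickCheck { min_content_length: 140, min_score: 20.0 };
    let lenient = QuickCheck { min_content_length: 50, min_score: 20.0 };
    println!("strict:  {}", probably_readable(&short_post, &strict));
    println!("lenient: {}", probably_readable(&short_post, &lenient));
}
```

Under this model, two 200-character paragraphs score about 15.5 with a minimum length of 140, which stays below a min_score of 20.0, but about 24.5 with a minimum length of 50, which passes.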