Lectito
Lectito is a Rust library and CLI tool for extracting readable article content from HTML.
Most web pages contain far more than the text a reader came for: ads, navigation, related links, comment areas, tracking markup, hidden elements, and presentation wrappers. Lectito tries to identify the main content root and return a smaller document that is useful for reading, storage, search, and conversion.
It returns:
- cleaned article HTML
- Markdown
- plain text
- page metadata
- extraction diagnostics
Lectito is parser-first. The core API accepts HTML and an optional base URL. URL fetching exists in the CLI for convenience, but the library does not require network access.
This keeps the library usable in environments that already have HTML available: crawlers, browser extensions, desktop apps, mobile apps, tests, and offline archives.
Main APIs
use lectito::{extract, ReadabilityOptions};

let html = r#"<article><h1>Title</h1><p>Article text.</p></article>"#;
let article = extract(html, Some("https://example.com/post"), &ReadabilityOptions::default())?;

if let Some(article) = article {
    println!("{}", article.markdown);
}
Ok::<(), lectito::Error>(())
Use extract_with_diagnostics when tuning extraction or debugging a bad page.
Use is_probably_readable before extraction when you only need a quick yes/no
answer.
Installation
Lectito is split into a core library and a CLI. Use the library when your application already has HTML. Use the CLI for local inspection, shell scripts, and quick conversions.
Library
Add lectito to your Rust project:
[dependencies]
lectito = "0.1"
For local development against this workspace:
[dependencies]
lectito = { path = "crates/core" }
The Rust crate name is lectito.
The core crate has no runtime service requirement. It parses the string you pass in and returns an article result.
CLI
Install the CLI from crates.io:
cargo install lectito-cli
For local development against this workspace:
cargo install --path crates/cli
The binary is named lectito.
lectito --help
The CLI can read from a file, stdin, or a URL. URL support is a command-line convenience; it is not part of the core library contract.
Fixture helpers are workspace-only and are not part of the published CLI package.
For local fixture inspection, run the unpublished workspace helper:
cargo run -p lectito-fixtures --bin lectito-fixture -- sample-name
License
Lectito is licensed under MPL-2.0.
Quick Start
Extract From HTML
Start with extract for normal use. It takes the source HTML, an optional base
URL, and ReadabilityOptions. The base URL lets Lectito resolve relative links,
images, and metadata URLs in the extracted output.
use lectito::{extract, ReadabilityOptions};

fn main() -> Result<(), lectito::Error> {
    let html = r#"
        <html>
          <head><title>Example</title></head>
          <body>
            <article>
              <h1>Example</h1>
              <p>This is the article body.</p>
            </article>
          </body>
        </html>
    "#;

    let article = extract(html, Some("https://example.com/article"), &ReadabilityOptions::default())?;

    if let Some(article) = article {
        println!("{:?}", article.title);
        println!("{}", article.markdown);
    }

    Ok(())
}
extract returns Ok(None) when no useful article content is found.
That is different from an error. An empty or navigation-only page can be parsed
successfully and still have no article.
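The three-way result can be handled with a single match. The sketch below uses a hypothetical stand-in, extract_stub, with a toy rule; it is not the lectito API and only mirrors the Result<Option<_>, _> shape described above.

```rust
/// Hypothetical error type for this sketch; not lectito::Error.
#[derive(Debug, PartialEq)]
enum StubError {
    InvalidBaseUrl,
}

/// Hypothetical stand-in for an extractor. Toy rule: the text inside the
/// first <p>...</p> is "the article". Returns Ok(None) when nothing is found.
fn extract_stub(html: &str, base_url: Option<&str>) -> Result<Option<String>, StubError> {
    if let Some(url) = base_url {
        if !url.starts_with("http://") && !url.starts_with("https://") {
            return Err(StubError::InvalidBaseUrl);
        }
    }
    let start = match html.find("<p>") {
        Some(i) => i + 3,
        None => return Ok(None),
    };
    let end = match html[start..].find("</p>") {
        Some(i) => start + i,
        None => return Ok(None),
    };
    let text = html[start..end].trim();
    if text.is_empty() {
        Ok(None)
    } else {
        Ok(Some(text.to_string()))
    }
}

/// Keeps "no article" separate from "error" so callers can treat them differently.
fn describe(result: &Result<Option<String>, StubError>) -> &'static str {
    match result {
        Ok(Some(_)) => "article",
        Ok(None) => "no article",
        Err(_) => "error",
    }
}
```

A navigation-only input such as "<nav>Home</nav>" lands in the "no article" arm, while a malformed base URL lands in the "error" arm.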
Check Readability
Use is_probably_readable when you only need to decide whether a page is worth
running through full extraction. It is faster than full extraction and returns only a boolean.
use lectito::{is_probably_readable, ReadableOptions};

let readable = is_probably_readable(html, &ReadableOptions::default())?;
Ok::<(), lectito::Error>(())
CLI
The CLI mirrors the library. The root command extracts content, and readable
performs the quick readability check.
lectito article.html
lectito https://example.com/article --format json --pretty
lectito readable article.html
CLI Usage
The CLI is designed for inspecting extraction behavior and converting documents from the terminal.
The root command extracts article content. The CLI also has these subcommands:
- readable: check whether a document looks readable
- inspect: print extraction metadata and scoring details
- llms: fetch, parse, expand, and generate llms.txt files
Extract
Pass a URL, an AT URI, a file path, or - for stdin. Markdown with TOML
frontmatter is the default output.
lectito article.html
lectito https://example.com/article
lectito at://did:plc:abc123/site.standard.document/xyz
lectito - < article.html
When a fetched page advertises rel="site.standard.document", the CLI resolves
the ATProto record and uses the record content when it can render it. Direct
at:// inputs are supported for renderable site.standard.document records.
If a normal web URL cannot be resolved through Standard.site, the CLI extracts
from the fetched HTML.
Output formats:
Use HTML, text, or JSON when Markdown is not the right output for the next tool.
lectito article.html --format html
lectito article.html --format text
lectito article.html --format json --pretty
lectito article.html --frontmatter=false
lectito article.html --output article.md
Useful options:
The defaults work for most article pages. Tune these flags when a page is too short, too broad, or has a known content container.
lectito article.html --char-threshold 800
lectito article.html --nb-top-candidates 8
lectito article.html --content-selector article
lectito article.html --base-url https://example.com/post --site-profile example.com.toml
lectito article.html --max-elems-to-parse 10000
lectito article.html --media article
lectito article.html --media none
lectito article.html --keep-classes --preserve-class language-rust
--content-selector is the strongest extraction hint. Use it when you know the
article root for a page or fixture. Without that flag, the CLI still tries
common article-body containers before falling back to generic scoring.
--media accepts none, conservative, article, or all. The default is
article, which keeps figures/images that appear to be part of the article body.
--site-profile can be repeated. Each file must be a TOML site profile. User
profiles take precedence over bundled profiles for the same host.
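The precedence rule can be sketched as a lookup that prefers user-supplied profiles. Profile and select_profile are illustrative names, and exact host matching is an assumption; Lectito's own matching may differ.

```rust
/// Illustrative stand-in for a loaded site profile; not a Lectito type.
#[derive(Debug, PartialEq)]
struct Profile {
    name: &'static str,
    host: &'static str,
    bundled: bool,
}

/// Picks the profile for `host`, preferring user-supplied over bundled ones.
fn select_profile<'a>(profiles: &'a [Profile], host: &str) -> Option<&'a Profile> {
    profiles
        .iter()
        .filter(|p| p.host == host)
        // `false < true`, so user profiles (bundled == false) win.
        .min_by_key(|p| p.bundled)
}
```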
--disable-json-ld turns off JSON-LD metadata extraction and the JSON-LD
article-body fast path. Use it when structured data is stale or misleading.
Diagnostics are written to stderr after the main output. This keeps stdout usable for the extracted article while still showing debug information in the terminal.
lectito article.html --diagnostic-format pretty
lectito article.html --diagnostic-format json
--inspect prints a compact extraction summary to stderr while keeping article
output on stdout:
lectito article.html --inspect
Full extraction has a timeout so unusually large or hostile pages do not hang the command:
lectito article.html --timeout 10
Readable
readable checks whether the document appears to contain enough article-like
text. It does not return extracted content.
lectito readable article.html
lectito readable --stdin < article.html
lectito readable https://example.com/article
lectito readable article.html --json --pretty
lectito readable article.html --timeout 10
Thresholds:
lectito readable article.html --min-content-length 140 --min-score 20
Inspect
inspect prints extraction metadata and scoring details without printing the
article body.
lectito inspect article.html
lectito inspect https://example.com/article
lectito inspect article.html --json --pretty
llms.txt
Use the llms subcommands when a site publishes an llms.txt file or when
you want to bundle its linked resources into one Markdown context file.
lectito llms fetch https://example.com
lectito llms parse https://example.com/llms.txt --pretty
lectito llms expand https://example.com/llms.txt --output llms-full.txt
lectito llms generate https://example.com/docs/ --output llms.txt
lectito llms generate https://example.com/docs/ --output llms.txt --full llms-full.txt
lectito llms generate --sitemap https://example.com/sitemap.xml --output llms.txt
lectito llms generate https://example.com --discover --output llms.txt
fetch resolves a bare site URL to /llms.txt. parse prints structured JSON.
expand reads the linked resources, keeps Markdown resources as-is, and runs
HTML resources through Lectito before adding them to the bundle. generate
crawls same-origin links from a seed page and writes a new llms.txt index. It
uses canonical links for generated entries when pages publish them, includes
HTTP Last-Modified or sitemap lastmod values in notes, and ranks accepted
pages so likely entry points appear first. Pass --full (or --full-output) to
write the expanded Markdown context while generating the index.
Links in the special Optional section are skipped unless you pass
--include-optional:
lectito llms expand https://example.com/llms.txt --include-optional
Keep generated files small by limiting crawl depth and page count:
lectito llms generate https://example.com/docs/ --max-depth 1 --max-pages 10
lectito llms generate --sitemap https://example.com/sitemap.xml --max-pages 50
Filter generated entries and add a delay between page fetches:
lectito llms generate --sitemap https://example.com/sitemap.xml \
--filter /docs/ \
--filter '!/docs/archive/' \
--filter '!*/drafts/*' \
--delay 250
Remote generation checks robots.txt before fetching page URLs. It evaluates
rules as Lectito by default:
lectito llms generate https://example.com/docs/ --robots-agent Lectito
lectito llms generate https://example.com/docs/ --ignore-robots
See the llms.txt guide for the expected file shape and the tradeoffs.
Exit Codes
- 0: article extracted, or readability check returned true
- 1: no article was extracted, or readability check returned false
- 2: input, file, or network error
- 3: extraction, readability, configuration, or timeout error
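A script that wraps the CLI can map these codes to a small enum. This is a sketch for callers, not part of Lectito; ExitMeaning, classify_exit, and is_failure are illustrative names, and treating undocumented codes as failures is an assumption.

```rust
/// Meaning of each documented exit code.
#[derive(Debug, PartialEq)]
enum ExitMeaning {
    /// 0: article extracted, or readability check returned true.
    Positive,
    /// 1: no article extracted, or readability check returned false.
    Negative,
    /// 2: input, file, or network error.
    InputError,
    /// 3: extraction, readability, configuration, or timeout error.
    ProcessingError,
    /// Any other code; not documented above.
    Unknown(i32),
}

fn classify_exit(code: i32) -> ExitMeaning {
    match code {
        0 => ExitMeaning::Positive,
        1 => ExitMeaning::Negative,
        2 => ExitMeaning::InputError,
        3 => ExitMeaning::ProcessingError,
        other => ExitMeaning::Unknown(other),
    }
}

/// Codes 0 and 1 are both answers; anything else is treated as a failure here.
fn is_failure(code: i32) -> bool {
    !matches!(
        classify_exit(code),
        ExitMeaning::Positive | ExitMeaning::Negative
    )
}
```

The code itself can come from std::process::Command via ExitStatus::code(), which returns None when the process was terminated by a signal.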
llms.txt
llms.txt is a Markdown file that gives language models and agent tools a
curated entry point for a site. Sites usually publish it at /llms.txt.
Lectito supports the practical parts of the convention:
- fetching a site's llms.txt
- parsing its sections and links
- expanding linked pages into one Markdown context file
- crawling a bounded set of pages to generate an llms.txt index
It does not treat llms.txt as access control. Use robots.txt, HTTP
authorization, and normal server controls for that.
File Shape
A small file looks like this:
# Example Docs
> Documentation for Example's public API.
Use the current API reference when generated examples disagree with older blog
posts.
## Docs
- [Quick start](https://example.com/docs/quick-start.md): First integration
steps.
- [API reference](https://example.com/docs/api.md): Endpoint and object
reference.
## Optional
- [Changelog](https://example.com/docs/changelog.md)
Lectito expects:
- one H1 title
- an optional blockquote summary
- optional notes before the first H2
- H2 sections containing Markdown links
The Optional section has special handling. lectito llms expand skips those
links by default so the generated context stays smaller.
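The expected shape is simple enough to read line by line. The following is a minimal sketch of such a reader, not Lectito's parser; the type and function names are illustrative, link notes and free-form notes are ignored, and URLs containing parentheses are not handled.

```rust
#[derive(Debug, Default, PartialEq)]
struct LlmsTxt {
    title: Option<String>,
    summary: Option<String>,
    sections: Vec<Section>,
}

#[derive(Debug, PartialEq)]
struct Section {
    name: String,
    links: Vec<Link>,
}

#[derive(Debug, PartialEq)]
struct Link {
    label: String,
    url: String,
}

/// Parses "- [label](url)" and ignores any trailing ": note".
fn parse_link(line: &str) -> Option<Link> {
    let rest = line.trim_start().strip_prefix("- [")?;
    let (label, rest) = rest.split_once("](")?;
    let (url, _note) = rest.split_once(')')?;
    Some(Link { label: label.to_string(), url: url.to_string() })
}

fn parse_llms_txt(input: &str) -> LlmsTxt {
    let mut doc = LlmsTxt::default();
    for line in input.lines() {
        if let Some(name) = line.strip_prefix("## ") {
            // Each H2 opens a new section.
            doc.sections.push(Section { name: name.trim().to_string(), links: Vec::new() });
        } else if let Some(title) = line.strip_prefix("# ") {
            // Only the first H1 is used as the title.
            if doc.title.is_none() {
                doc.title = Some(title.trim().to_string());
            }
        } else if let Some(summary) = line.strip_prefix("> ") {
            // The blockquote summary must appear before the first H2.
            if doc.sections.is_empty() && doc.summary.is_none() {
                doc.summary = Some(summary.trim().to_string());
            }
        } else if let Some(link) = parse_link(line) {
            // Links outside any section are dropped in this sketch.
            if let Some(section) = doc.sections.last_mut() {
                section.links.push(link);
            }
        }
    }
    doc
}

/// Links that an expand step would visit; the Optional section is skipped by default.
fn expandable_links(doc: &LlmsTxt, include_optional: bool) -> Vec<&Link> {
    doc.sections
        .iter()
        .filter(|s| include_optional || s.name != "Optional")
        .flat_map(|s| s.links.iter())
        .collect()
}
```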
Fetch
Fetch a site's llms.txt:
lectito llms fetch https://example.com
For bare site URLs, Lectito requests /llms.txt. Explicit URLs are used as
given:
lectito llms fetch https://example.com/docs/llms.txt
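The resolution rule can be sketched as follows. It assumes that a "bare" URL means one with no path or only a trailing slash; resolve_llms_url is an illustrative name, not a Lectito function.

```rust
/// Appends /llms.txt to bare site URLs and leaves explicit URLs untouched.
fn resolve_llms_url(input: &str) -> String {
    let after_scheme = input.split_once("://").map_or(input, |(_, rest)| rest);
    match after_scheme.split_once('/') {
        // No path at all: https://example.com
        None => format!("{}/llms.txt", input),
        // Only a trailing slash: https://example.com/
        Some((_, "")) => format!("{}llms.txt", input),
        // Anything else is treated as an explicit URL.
        Some(_) => input.to_string(),
    }
}
```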
You can write the result to a file:
lectito llms fetch https://example.com --output llms.txt
Parse
Parse an llms.txt file into JSON:
lectito llms parse llms.txt --pretty
This is useful for checking whether section names, optional links, and notes are being read as expected.
Expand
Expand linked resources into one Markdown file:
lectito llms expand llms.txt --output llms-full.txt
Lectito keeps Markdown resources unchanged. When a linked resource looks like
HTML, Lectito extracts the readable article and inserts the extracted Markdown.
For remote links, Lectito checks the HTTP Content-Type header before falling
back to URL suffixes and simple Markdown markers.
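The detection order can be sketched as three checks in sequence. The specific MIME types, suffixes, and body markers below are assumptions chosen for illustration; ResourceKind and classify_resource are not Lectito names.

```rust
#[derive(Debug, PartialEq)]
enum ResourceKind {
    Markdown,
    Html,
}

fn classify_resource(content_type: Option<&str>, url: &str, body: &str) -> ResourceKind {
    // 1. Trust the Content-Type header when it names a known type.
    if let Some(header) = content_type {
        let mime = header.split(';').next().unwrap_or("").trim().to_ascii_lowercase();
        match mime.as_str() {
            "text/markdown" => return ResourceKind::Markdown,
            "text/html" | "application/xhtml+xml" => return ResourceKind::Html,
            _ => {}
        }
    }
    // 2. Fall back to the URL suffix, ignoring query and fragment.
    let path = url.split(|c: char| c == '?' || c == '#').next().unwrap_or(url);
    if path.ends_with(".md") || path.ends_with(".markdown") {
        return ResourceKind::Markdown;
    }
    if path.ends_with(".html") || path.ends_with(".htm") {
        return ResourceKind::Html;
    }
    // 3. Last resort: simple Markdown markers at the start of the body.
    let trimmed = body.trim_start();
    if trimmed.starts_with("# ") || trimmed.starts_with("---") {
        ResourceKind::Markdown
    } else {
        ResourceKind::Html
    }
}
```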
Each resource is separated and labeled:
---
# Source: Quick start
URL: https://example.com/docs/quick-start.md
Notes: First integration steps.
...
Use --include-optional to include the Optional section:
lectito llms expand llms.txt --include-optional --output llms-full.txt
Use --max-links when you want a smaller bundle:
lectito llms expand llms.txt --max-links 10
Generate
Generate an llms.txt file from a seed page:
lectito llms generate https://example.com/docs/ --output llms.txt
The crawler is intentionally bounded. For URL seeds, Lectito follows same-origin links only. For local HTML files, it follows relative local links. Assets such as images, stylesheets, scripts, PDFs, archives, and feeds are skipped.
To write the expanded context at the same time, pass --full:
lectito llms generate https://example.com/docs/ \
--output llms.txt \
--full llms-full.txt
--full-output is the same option with a more explicit name.
You can also generate from a sitemap:
lectito llms generate --sitemap https://example.com/sitemap.xml \
--output llms.txt
Or discover sitemaps from a URL seed:
lectito llms generate https://example.com --discover \
--output llms.txt
Discovery reads Sitemap: lines from robots.txt. When no sitemap is listed
there, Lectito tries /sitemap.xml.
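A minimal sketch of that discovery step, working on an already-fetched robots.txt body. The function names are illustrative, and matching the Sitemap field name case-insensitively is an assumption about Lectito's behavior.

```rust
/// Collects the values of `Sitemap:` lines from a robots.txt body.
fn sitemaps_from_robots(robots_txt: &str) -> Vec<String> {
    robots_txt
        .lines()
        .filter_map(|line| {
            // Split on the first colon only, so the URL's own "://" survives.
            let (field, value) = line.split_once(':')?;
            let value = value.trim();
            if field.trim().eq_ignore_ascii_case("sitemap") && !value.is_empty() {
                Some(value.to_string())
            } else {
                None
            }
        })
        .collect()
}

/// Falls back to `<origin>/sitemap.xml` when robots.txt lists no sitemap.
fn discover_sitemaps(origin: &str, robots_txt: &str) -> Vec<String> {
    let found = sitemaps_from_robots(robots_txt);
    if found.is_empty() {
        vec![format!("{}/sitemap.xml", origin.trim_end_matches('/'))]
    } else {
        found
    }
}
```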
Sitemap indexes are supported. Lectito reads child sitemaps up to
--max-sitemaps, then fetches page URLs up to --max-pages:
lectito llms generate --sitemap https://example.com/sitemap.xml \
--max-sitemaps 10 \
--max-pages 100 \
--output llms.txt
Remote sitemap generation keeps sitemap and page URLs on the same origin as the sitemap input. Local sitemap files may list any absolute page URL.
By default, generation fetches up to 25 pages and follows links up to depth 2:
lectito llms generate https://example.com/docs/ \
--max-pages 10 \
--max-depth 1
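The two bounds interact like a breadth-first walk with a page budget. The sketch below runs over an in-memory link graph instead of fetching pages; bounded_crawl is an illustrative name, and counting the seed as depth 0 is an assumption about how Lectito measures depth.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first walk bounded by page count and link depth.
/// `links` maps a page to the pages it links to and stands in for real fetching.
fn bounded_crawl(
    links: &HashMap<&str, Vec<&str>>,
    seed: &str,
    max_pages: usize,
    max_depth: usize,
) -> Vec<String> {
    let mut visited: HashSet<String> = HashSet::new();
    let mut order: Vec<String> = Vec::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();

    visited.insert(seed.to_string());
    queue.push_back((seed.to_string(), 0));

    while let Some((page, depth)) = queue.pop_front() {
        // Stop once the page budget is spent.
        if order.len() >= max_pages {
            break;
        }
        order.push(page.clone());

        // Pages at the depth limit are visited, but their links are not followed.
        if depth >= max_depth {
            continue;
        }
        if let Some(children) = links.get(page.as_str()) {
            for child in children {
                // `insert` returns false for pages that were already queued.
                if visited.insert(child.to_string()) {
                    queue.push_back((child.to_string(), depth + 1));
                }
            }
        }
    }
    order
}
```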
Use --filter for the common path and glob cases. Prefix a pattern with ! to
exclude it:
lectito llms generate --sitemap https://example.com/sitemap.xml \
--filter /docs/ \
--filter '!/docs/archive/' \
--filter '!*/drafts/*'
Patterns that start with / match URL paths. Plain path values are prefixes.
Path patterns with * or ? are globs. Other glob patterns match the full URL.
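The matching rules can be sketched with a small glob matcher. This is an illustration, not Lectito's implementation. Two points are assumptions: a URL is kept when it matches at least one include pattern (or when there are none) and no exclude pattern, and a pattern that is neither a path nor a glob is treated as a substring of the URL.

```rust
/// Minimal glob: `*` matches any run of characters, `?` matches exactly one.
/// Recursive and unoptimized; fine for short patterns.
fn glob_match(pattern: &[u8], text: &[u8]) -> bool {
    match pattern.split_first() {
        None => text.is_empty(),
        Some((&b'*', rest)) => (0..=text.len()).any(|i| glob_match(rest, &text[i..])),
        Some((&b'?', rest)) => !text.is_empty() && glob_match(rest, &text[1..]),
        Some((c, rest)) => text.first() == Some(c) && glob_match(rest, &text[1..]),
    }
}

/// Path part of an absolute URL, without query or fragment.
fn url_path(url: &str) -> &str {
    let after_scheme = url.split_once("://").map_or(url, |(_, rest)| rest);
    let path = after_scheme.find('/').map_or("/", |i| &after_scheme[i..]);
    let end = path.find(|c: char| c == '?' || c == '#').unwrap_or(path.len());
    &path[..end]
}

fn pattern_matches(pattern: &str, url: &str) -> bool {
    let is_glob = pattern.contains('*') || pattern.contains('?');
    if pattern.starts_with('/') {
        let path = url_path(url);
        if is_glob {
            glob_match(pattern.as_bytes(), path.as_bytes())
        } else {
            // Plain path values are prefixes.
            path.starts_with(pattern)
        }
    } else if is_glob {
        // Other glob patterns match the full URL.
        glob_match(pattern.as_bytes(), url.as_bytes())
    } else {
        // Assumption: not specified in the text above.
        url.contains(pattern)
    }
}

/// Applies a list of `--filter` values; a leading `!` marks an exclusion.
fn keep_url(filters: &[&str], url: &str) -> bool {
    let mut has_include = false;
    let mut included = false;
    for filter in filters {
        if let Some(pattern) = filter.strip_prefix('!') {
            if pattern_matches(pattern, url) {
                return false;
            }
        } else {
            has_include = true;
            included |= pattern_matches(filter, url);
        }
    }
    !has_include || included
}
```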
Use --delay to wait between page fetches:
lectito llms generate https://example.com/docs/ --delay 250
Remote generation checks robots.txt before fetching page URLs. Lectito keeps
the existing browser-like user agent for HTTP requests, but evaluates robots
rules as Lectito unless you pass another token:
lectito llms generate https://example.com/docs/ \
--robots-agent LectitoDocsBot
Use --ignore-robots only when you explicitly want to bypass those checks:
lectito llms generate https://example.com/docs/ --ignore-robots
Only pages that produce readable article content are included. Each accepted page becomes one link in the generated file. Lectito uses the extracted title as the link label, switches to a page's canonical URL when one is available, and uses the extracted excerpt as the link note.
Remote generation also reads Last-Modified response headers. Sitemap
generation reads lastmod values. When either value is present, Lectito adds it
to the generated note and uses it as a small ranking signal. Ranking favors
likely entry points such as docs roots, guides, API references, and pages with
useful notes. Archive-like URLs are pushed down.
Set the generated title, summary, or section name when the defaults are too generic:
lectito llms generate https://example.com/docs/ \
--title "Example Docs" \
--summary "Public documentation for Example." \
--section "Guides" \
--output llms.txt
When To Use It
Use llms.txt when you want agents to start from a small, curated list of
important pages. It works well for docs, public APIs, policy pages, and small
knowledge bases.
Do not expect every model provider or search engine to read it. The reliable use case is explicit: a developer, tool, or agent asks Lectito to fetch or expand the file.
Basic Usage
Use extract when you want article content.
The function does not fetch the page. Pass it the HTML you want parsed. This is usually cleaner in applications because networking, caching, cookies, and browser rendering are application concerns.
use lectito::{extract, ReadabilityOptions};

let options = ReadabilityOptions::default();
let article = extract(html, Some("https://example.com/post"), &options)?;

match article {
    Some(article) => println!("{}", article.text_content),
    None => eprintln!("no article content found"),
}
Ok::<(), lectito::Error>(())
The base URL is optional. Pass it when the document contains relative links, images, or metadata URLs.
Raw HTML Limits
Lectito parses the HTML string you pass in. It does not run JavaScript, keep a
browser session, submit forms, attach cookies, or fetch authenticated resources.
For pages that build their article body on the client, capture rendered HTML in
your crawler or browser automation layer before calling extract.
The CLI fetches URLs as a convenience, but it has the same raw-HTML boundary. If a site needs login state, consent flows, or browser-specific state, fetch that page in your own application (or a browser) and pass the resulting HTML through stdin or the Rust API.
When extraction succeeds, Lectito returns Some(Article). When the page parses
but does not contain a useful article, it returns None. Reserve error handling
for invalid base URLs, configured size limits, and serialization failures.
Article Output
Article contains the extracted content in several forms:
if let Some(article) = article {
    println!("{}", article.content);
    println!("{}", article.markdown);
    println!("{}", article.text_content);
}
Use extract_with_diagnostics when you need to see how extraction chose a root.
Diagnostics are meant for development and regression work. Most application code
should call extract.
use lectito::{extract_with_diagnostics, ReadabilityOptions};

let report = extract_with_diagnostics(html, base_url, &ReadabilityOptions::default())?;

if let Some(article) = report.article {
    println!("{}", article.markdown);
}
eprintln!("{:?}", report.diagnostics.outcome);
Ok::<(), lectito::Error>(())
Configuration
ReadabilityOptions controls extraction.
The defaults are conservative. They favor article pages with enough text to be useful and avoid exposing internal scoring knobs unless they affect common integration cases.
use lectito::{MediaRetention, ReadabilityOptions};

let options = ReadabilityOptions {
    char_threshold: 800,
    nb_top_candidates: 8,
    content_selector: Some("article".to_string()),
    site_profiles: Vec::new(),
    media_retention: MediaRetention::Article,
    ..ReadabilityOptions::default()
};
Fields:
| Field | Default | Meaning |
|---|---|---|
| max_elems_to_parse | None | Reject documents above this element count. |
| nb_top_candidates | 5 | Number of high-scoring candidates to consider. |
| char_threshold | 500 | Minimum extracted text length for an accepted attempt. |
| content_selector | None | CSS selector to force as the content root. |
| site_profiles | [] | TOML site profiles for host-scoped extraction hints. |
| mobile_viewport_width | Some(480) | Width used by recovery rules for mobile snapshots. |
| classes_to_preserve | [] | Class names kept during cleanup. |
| keep_classes | false | Keep all class attributes. |
| disable_json_ld | false | Skip JSON-LD metadata extraction. |
| link_density_modifier | 0.0 | Adjust link-density cleanup tolerance. |
| media_retention | Article | Control figure/image/media retention. |
Prefer content_selector when you already know the page shape. It bypasses
root scoring for that document, then runs the normal cleanup pipeline.
When content_selector is not set, Lectito still tries a small list of common
article-body containers such as #article-body and .entry-content before
generic scoring. That catches many large publisher pages without site-specific
profiles.
Use site_profiles when you want URL-scoped extraction hints, removal
selectors, and metadata hints. Profiles are attempted before generic scoring,
but weak profile output falls back to the generic extractor.
Use max_elems_to_parse as a guardrail for untrusted input. It rejects very
large documents before extraction work continues.
Use media_retention when output fidelity matters. Article keeps body figures
and images by default; None removes media; Conservative is text-first; All
keeps media that remains in the selected article subtree.
ReadableOptions controls is_probably_readable.
Lower min_content_length for short posts or documentation pages. Raise
min_score when you want the quick check to reject borderline pages.
use lectito::ReadableOptions;

let options = ReadableOptions {
    min_content_length: 140,
    min_score: 20.0,
};
Output Formats
Lectito produces all output formats during extraction.
The formats come from the same cleaned article root. That means callers can store HTML for fidelity, use Markdown for display or editing, and use plain text for search without running extraction multiple times.
let article = extract(html, base_url, &ReadabilityOptions::default())?.unwrap();

let html = article.content;
let markdown = article.markdown;
let text = article.text_content;
HTML
content is cleaned article HTML. Scripts, styles, navigation, sidebars, and
other page chrome are removed where possible. Relative URLs are resolved when a
base URL is provided.
Use HTML when you need the closest representation of the extracted article. It keeps images, links, tables, inline markup, and other structure that can be lost in plain text.
Markdown
markdown is generated from the cleaned article HTML. It preserves common
reader content:
- headings
- paragraphs
- links and images
- lists
- blockquotes
- code blocks
- tables
- math
- footnotes
Markdown cleanup also strips zero-width break hints, drops empty links, keeps images intact, and removes duplicate title headings before rendering.
The CLI Markdown output includes TOML frontmatter:
lectito article.html
Markdown is useful when the next step is a reader view, note-taking system, static archive, or editor. It is also easier to diff in tests than HTML.
Plain Text
text_content is normalized article text. Use it for indexing, previews, and
readability checks.
Plain text should not be treated as a rendering format. It discards links, images, and most document structure.
JSON
The CLI can serialize the article:
lectito article.html --format json --pretty
JSON is the best CLI format when another program needs metadata and content together.
Quality Expectations
| Output | Best use | Expect | Do not expect |
|---|---|---|---|
| Markdown | Reader views, notes, archives, editing | Good preservation of headings, paragraphs, links, images, lists, blockquotes, code, tables, math, and footnotes. | Byte-for-byte source fidelity or every custom widget. |
| HTML | Rendering or post-processing extracted articles | The closest structural view of the cleaned article root, with links and media kept according to options. | A complete sanitizer policy or the original page layout. |
| Text | Search, previews, indexing, basic summaries | Normalized article text with block boundaries for headings, paragraphs, lists, code, and definition lists. | A rich rendering format with links, images, or full table structure. |
| JSON | Programmatic CLI integrations | Metadata plus HTML, Markdown, text, length, and source-related fields in one object. | Stable values for publisher metadata when source pages disagree or omit fields. |
| inspect | Debugging extraction choices | Selected root, candidate scores, cleanup counts, recovery data, and site-rule information. | A user-facing article format. |
| readable | Cheap filtering before full extraction | A boolean estimate using text length, visibility, class/id hints, and link density. | The same answer full extraction would produce on every borderline page. |
How It Works
Lectito follows the same broad approach as Mozilla Readability, with a few fast paths for common article snapshots.
The extractor starts with a full HTML document and tries to find the subtree that behaves like an article. It uses signals that tend to survive across sites: text length, paragraph density, semantic tags, class and id names, and the ratio of links to readable text.
- Recover useful content from raw HTML snapshots, including declarative shadow DOM.
- Parse the document.
- Recover useful content from parsed snapshots, including selected mobile and shadow-root cases.
- Extract metadata, including JSON-LD before scripts are stripped.
- Accept long JSON-LD article text when structured data contains the body.
- Try known article containers such as #article-body before broad scoring.
- Try a matching site profile or code extractor when one applies.
- Remove scripts, styles, hidden nodes, and unlikely content.
- Score candidate content roots by text length, tag type, class/id hints, and link density.
- Select the best root and include useful siblings.
- Clean the selected content.
- Apply schema text fallback when structured data is clearly better.
- Return HTML, Markdown, text, and diagnostics.
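The scoring step in the list above can be illustrated with a toy model. The multiplication by (1 - link density) follows the general Readability-style approach; the weights, the cap, and the class_hint values are made up for this sketch and are not Lectito's.

```rust
/// Text statistics for one candidate node. A real extractor would gather
/// these by walking the DOM; here they are supplied directly.
struct Candidate<'a> {
    name: &'a str,
    text_len: usize,
    link_text_len: usize,
    /// Illustrative: positive for names like "article", negative for "sidebar".
    class_hint: i32,
}

/// Share of the candidate's text that sits inside links, from 0.0 to 1.0.
fn link_density(c: &Candidate) -> f64 {
    if c.text_len == 0 {
        0.0
    } else {
        c.link_text_len as f64 / c.text_len as f64
    }
}

/// Longer text and helpful class names raise the score; link-heavy text lowers it.
fn score(c: &Candidate) -> f64 {
    let base = (c.text_len as f64 / 100.0).min(30.0) + c.class_hint as f64;
    base * (1.0 - link_density(c))
}

/// Returns the highest-scoring candidate, if any.
fn best_candidate<'a>(candidates: &'a [Candidate<'a>]) -> Option<&'a Candidate<'a>> {
    candidates
        .iter()
        .max_by(|a, b| score(a).total_cmp(&score(b)))
}
```

A navigation block with 300 characters of text, 270 of them inside links, has a link density of 0.9 and scores far below a 2,000-character article body with a handful of links.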
Extraction runs several attempts. Later attempts relax cleanup rules when the
first pass produces too little text. The first attempt that reaches
char_threshold is accepted. If no attempt reaches the threshold, Lectito may
return the best non-empty attempt.
This retry model matters because pages fail in different ways. Some pages hide the useful content behind classes that look like chrome. Others include enough related links or widgets to pull the score away from the main text. Relaxed attempts give Lectito another chance without making the first pass too loose.
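The retry loop can be sketched as follows. CleanupFlags and its fields are hypothetical, the extraction step is passed in as a closure, and measuring length in characters is an assumption; only the accept-first, else-best-non-empty logic comes from the description above.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    /// An attempt reached the threshold.
    Accepted(String),
    /// No attempt reached the threshold, but some text was found.
    BestAttempt(String),
    /// Every attempt came back empty.
    NoContent,
}

/// Hypothetical cleanup switches; later attempts turn them off.
#[derive(Clone, Copy, Debug)]
struct CleanupFlags {
    strip_unlikely: bool,
    clean_conditionally: bool,
}

/// Runs attempts in order, from strictest to most relaxed.
fn run_attempts(
    attempts: &[CleanupFlags],
    char_threshold: usize,
    extract_once: impl Fn(CleanupFlags) -> String,
) -> Outcome {
    let mut best = String::new();
    let mut best_len = 0;
    for flags in attempts {
        let text = extract_once(*flags);
        let len = text.chars().count();
        // The first attempt that reaches the threshold wins outright.
        if len >= char_threshold {
            return Outcome::Accepted(text);
        }
        // Otherwise remember the longest result seen so far.
        if len > best_len {
            best_len = len;
            best = text;
        }
    }
    if best.is_empty() {
        Outcome::NoContent
    } else {
        Outcome::BestAttempt(best)
    }
}
```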
content_selector can short-circuit root selection for known documents:
let options = ReadabilityOptions {
    content_selector: Some("main article".to_string()),
    ..ReadabilityOptions::default()
};
Lectito also has a small built-in list of known content containers, including
#article-body, [itemprop='articleBody'], .article-body, and
.entry-content. These are attempted before generic scoring. They still go
through cleanup, media handling, URL rewriting, and diagnostics.
Site profiles provide URL-scoped hints without disabling generic extraction:
let options = ReadabilityOptions {
    site_profiles: vec![r#"
name = "example"
hosts = ["example.com"]
content_roots = ["article"]
remove = [".ad", "nav"]
"#.to_string()],
    ..ReadabilityOptions::default()
};
If a profile produces content below char_threshold, Lectito records the
profile decision in diagnostics and continues with generic readability attempts.
After the root is selected, cleanup removes empty nodes, normalizes links and media, preserves selected classes, and prepares the HTML for Markdown and text conversion.
Diagnostics
Use diagnostics to inspect extraction decisions.
Diagnostics are for development, fixture work, and bug reports. They explain which candidates were considered, which root was selected, and why an extraction was accepted or downgraded to a best attempt.
use lectito::{extract_with_diagnostics, ReadabilityOptions};

let report = extract_with_diagnostics(html, base_url, &ReadabilityOptions::default())?;
println!("{:?}", report.diagnostics.outcome);
ExtractionReport contains:
- article: the extracted article, if found
- diagnostics: details about attempts and candidate selection
Outcomes:
| Outcome | Meaning |
|---|---|
| Accepted | An attempt met char_threshold. |
| BestAttempt | No attempt met the threshold, but non-empty content was found. |
| NoContent | No useful content was found. |
Each attempt records:
- cleanup flags
- candidate count
- top candidates
- entry points
- selected root
- cleanup counts
- recovery counts
- extracted text length
Fast paths such as JSON-LD article text or a known content container may record
an accepted attempt with candidate_count = 0. That means Lectito accepted a
specific root before generic candidate scoring ran.
When a site profile or code extractor matches, diagnostics include site_rule.
That record reports the matched profile or extractor, whether it was bundled,
which roots were selected, how many removals ran, whether the result met
char_threshold, and any fallback reason.
Start with outcome, selected_root, and text_len. If the selected root is
wrong, inspect the candidate list. If the root is right but output is noisy,
inspect cleanup counts and preserved classes.
CLI diagnostics:
lectito article.html --diagnostic-format pretty
lectito article.html --diagnostic-format json
lectito inspect article.html
API Overview
Lectito has two public API targets:
- Rust Crate API for native Rust applications, CLIs, and server integrations.
- WASM API for browser, web worker, bundler, and Node.js integrations.
Both targets use the same core extractor and Markdown conversion logic. The Rust crate is the source of truth; the WASM crate maps that API into JavaScript types and camelCase option names.
Rust Crate API
Public exports from lectito:
The crate exposes the extraction API, output structs, diagnostics, errors, and Markdown helpers.
pub use config::{Article, MarkdownOptions, MediaRetention, ReadabilityOptions, ReadableOptions};
pub use diagnostics::{
    AttemptDiagnostic, CandidateDiagnostic, CandidateSelection, CleanupDiagnostic,
    ContentSelectorDiagnostic, ExtractionDiagnostics, ExtractionOutcome, ExtractionReport,
    FlagDiagnostic, NodeDiagnostic, RecoveryDiagnostic,
};
pub use error::Error;
pub use extract::{clean_article_html, extract, extract_with_diagnostics};
pub use markdown::{html_to_markdown, markdown_to_html, markdown_with_toml_frontmatter};
pub use readable::is_probably_readable;
Extraction
Use extract for normal application code.
pub fn extract(
    html: &str,
    base_url: Option<&str>,
    options: &ReadabilityOptions,
) -> Result<Option<Article>, Error>
Returns Ok(Some(article)) when content is found, Ok(None) when the document
has no useful article content, and Err for invalid input or processing
failures.
Extraction tries JSON-LD article text and common article-body containers before generic readability scoring.
Set content_selector when you already know the article root.
Set disable_json_ld when structured data is wrong for the page.
Use extract_with_diagnostics when you need extraction details in addition to
the article.
pub fn extract_with_diagnostics(
    html: &str,
    base_url: Option<&str>,
    options: &ReadabilityOptions,
) -> Result<ExtractionReport, Error>
Returns the same article result with extraction diagnostics.
Use clean_article_html when you only need the cleaned article HTML.
pub fn clean_article_html(
    html: &str,
    base_url: Option<&str>,
    options: &ReadabilityOptions,
) -> Result<Option<String>, Error>
Readability Check
Use is_probably_readable before full extraction when you are filtering many
documents.
pub fn is_probably_readable(
    html: &str,
    options: &ReadableOptions,
) -> Result<bool, Error>
Returns a quick readability estimate without full extraction.
Markdown
The Markdown helpers are available separately for callers that already have a clean HTML fragment, want to render Markdown as HTML, or want CLI-style frontmatter.
pub fn html_to_markdown(html: &str) -> String
Converts HTML fragments to Markdown.
pub fn markdown_to_html(markdown: &str, options: &MarkdownOptions) -> String
Converts Markdown to HTML using CommonMark/GFM options.
pub fn markdown_with_toml_frontmatter(
    article: &Article,
    source: Option<&str>,
) -> Result<String, Error>
Formats an article as Markdown with TOML frontmatter.
WASM API
The npm package @stormlightlabs/lectito exposes Lectito to JavaScript through
wasm-bindgen.
It supports browser, web worker, bundler, and Node.js use.
npm install @stormlightlabs/lectito
The Rust crate is still named lectito-wasm.
Build Targets
wasm-pack build crates/wasm --target bundler
wasm-pack build crates/wasm --target web
wasm-pack build crates/wasm --target nodejs
wasm-pack writes lectito_wasm.d.ts with the public TypeScript API.
Initialization
Bundler builds initialize when imported:
import { extract } from "@stormlightlabs/lectito";
const article = extract(html, "https://example.com/post");
The web target needs the async initializer:
import init, { extract } from "./lectito_wasm.js";
await init();
const article = extract(html, "https://example.com/post");
The nodejs target initializes when loaded:
const { extract } = require("./lectito_wasm.js");
const article = extract(html, "https://example.com/post");
Functions
export function extract(
html: string,
baseUrl?: string | null,
options?: ReadabilityOptions | null,
): Article | null;
export function extractWithDiagnostics(
html: string,
baseUrl?: string | null,
options?: ReadabilityOptions | null,
): ExtractionReport;
export function isProbablyReadable(html: string, options?: ReadableOptions | null): boolean;
export function cleanHtml(
html: string,
baseUrl?: string | null,
options?: CleanHtmlOptions | null,
): string | null;
export function htmlToMarkdown(html: string): string;
export function markdownToHtml(markdown: string, options?: MarkdownOptions | null): string;
Types
Option fields use camelCase. Returned article fields keep the core Rust snake_case names.
export type MediaRetention = "none" | "conservative" | "article" | "all";
export interface ReadabilityOptions {
maxElemsToParse?: number | null;
nbTopCandidates?: number;
charThreshold?: number;
contentSelector?: string | null;
siteProfiles?: string[];
mobileViewportWidth?: number | null;
classesToPreserve?: string[];
keepClasses?: boolean;
disableJsonLd?: boolean;
linkDensityModifier?: number;
mediaRetention?: MediaRetention;
}
export interface ReadableOptions {
minContentLength?: number;
minScore?: number;
}
export interface MarkdownOptions {
gfm?: boolean;
footnotes?: boolean;
math?: boolean;
allowRawHtml?: boolean;
}
export type CleanHtmlOptions = ReadabilityOptions;
export interface Article {
title?: string | null;
byline?: string | null;
dir?: string | null;
lang?: string | null;
content: string;
markdown: string;
text_content: string;
length: number;
excerpt?: string | null;
site_name?: string | null;
published_time?: string | null;
image?: string | null;
domain?: string | null;
favicon?: string | null;
}
export interface ExtractionReport {
article: Article | null;
diagnostics: unknown;
}
mediaRetention accepts "none", "conservative", "article", or "all".
Errors
Functions throw JavaScript Error objects for invalid base URLs, oversized
documents, option conversion failures, and serialization failures.
Sanitization
cleanHtml performs Lectito article cleanup. It is not a complete
untrusted-HTML security policy.
Browser integrations that accept arbitrary HTML should run a dedicated sanitizer such as DOMPurify before passing content into Lectito. Sanitize again before rendering returned HTML when the original input is untrusted.
Release Checks
Run the WASM tests and build all supported package targets:
pnpm --dir web exec wasm-pack test --node ../crates/wasm
pnpm --dir web exec wasm-pack build ../crates/wasm --target bundler --out-dir ../../target/wasm-pack/bundler
pnpm --dir web exec wasm-pack build ../crates/wasm --target web --out-dir ../../target/wasm-pack/web
pnpm --dir web exec wasm-pack build ../crates/wasm --target nodejs --out-dir ../../target/wasm-pack/nodejs
The build commands run wasm-opt; restricted sandboxes may need permission to
execute it.
Article
Article is the extraction result.
The struct is serializable and contains both content and metadata. The content fields are generated from the selected article root; metadata can come from document metadata, JSON-LD, Open Graph tags, or the extracted content itself.
pub struct Article {
    pub title: Option<String>,
    pub byline: Option<String>,
    pub dir: Option<String>,
    pub lang: Option<String>,
    pub content: String,
    pub markdown: String,
    pub text_content: String,
    pub length: usize,
    pub excerpt: Option<String>,
    pub site_name: Option<String>,
    pub published_time: Option<String>,
    pub image: Option<String>,
    pub domain: Option<String>,
    pub favicon: Option<String>,
}
Fields:
| Field | Meaning |
|---|---|
| title | Best title from metadata or document content. |
| byline | Author/byline when detected. |
| dir | Text direction, such as ltr or rtl. |
| lang | Document language when detected. |
| content | Cleaned article HTML. |
| markdown | Markdown generated from content. |
| text_content | Plain text generated from content. |
| length | UTF-16 length of extracted text, matching Mozilla Readability. |
| excerpt | Short summary or first useful paragraph. |
| site_name | Publisher or site name. |
| published_time | Publication timestamp when detected. |
| image | Lead image URL when detected. |
| domain | Source domain when available. |
| favicon | Favicon URL when detected. |
content, markdown, and text_content are different views of the same
extracted article. Prefer content when structure matters, markdown when the
article will be displayed or edited as text, and text_content when indexing or
summarizing.
length follows Mozilla Readability's UTF-16 convention. It can differ from a
Rust chars().count() value for text outside the Basic Multilingual Plane.
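The difference is easy to see with the standard library alone. This minimal sketch compares the three common length measures for a string that contains one character outside the Basic Multilingual Plane; `utf16_len` is a helper name chosen for this example, not a Lectito function.

```rust
/// Length in UTF-16 code units, the convention the `length` field follows.
fn utf16_len(text: &str) -> usize {
    text.encode_utf16().count()
}

fn main() {
    // "naïve" followed by a space and one emoji (U+1F600).
    let text = "na\u{ef}ve \u{1F600}";
    println!("chars: {}", text.chars().count());
    println!("utf16: {}", utf16_len(text));
    println!("bytes: {}", text.len());
}
```

The emoji is a single `char` but takes two UTF-16 code units, so the UTF-16 length is one larger than the `chars().count()` value here.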
Options
ReadabilityOptions
ReadabilityOptions changes extraction behavior. Most callers should start
with ReadabilityOptions::default() and only set fields that solve a specific
problem.
pub struct ReadabilityOptions {
    pub max_elems_to_parse: Option<usize>,
    pub nb_top_candidates: usize,
    pub char_threshold: usize,
    pub content_selector: Option<String>,
    pub site_profiles: Vec<String>,
    pub mobile_viewport_width: Option<usize>,
    pub classes_to_preserve: Vec<String>,
    pub keep_classes: bool,
    pub disable_json_ld: bool,
    pub link_density_modifier: f32,
    pub media_retention: MediaRetention,
}

pub enum MediaRetention {
    None,
    Conservative,
    Article,
    All,
}
Defaults:
ReadabilityOptions {
    max_elems_to_parse: None,
    nb_top_candidates: 5,
    char_threshold: 500,
    content_selector: None,
    site_profiles: Vec::new(),
    mobile_viewport_width: Some(480),
    classes_to_preserve: Vec::new(),
    keep_classes: false,
    disable_json_ld: false,
    link_density_modifier: 0.0,
    media_retention: MediaRetention::Article,
}
content_selector is the most direct override. Use it when the caller knows
where the article lives in the document. When it is unset, Lectito still tries a
small built-in list of common article-body containers before generic scoring.
site_profiles accepts TOML profile strings that provide host-scoped content
roots, removal selectors, metadata hints, cleanup settings, and fallback
behavior. Profiles run before generic scoring, after the JSON-LD and known
container fast paths.
char_threshold is the minimum amount of extracted text an attempt must produce
to be accepted. nb_top_candidates sets how many top-scoring candidates are
compared during generic scoring.
disable_json_ld skips JSON-LD metadata extraction and the JSON-LD article-body
fast path. It does not disable Open Graph, Twitter card, or DOM metadata.
media_retention controls image and media preservation in the extracted article:
- None: remove figures, images, and embedded media from content.
- Conservative: text-first cleanup; media survives only if the generic extractor keeps it.
- Article: keep figures/images that look like article body content. This is the default.
- All: keep media that remains in the selected article subtree, subject to unsafe/embed cleanup.
ReadableOptions
ReadableOptions only affects is_probably_readable. It does not change full
article extraction.
pub struct ReadableOptions {
    pub min_content_length: usize,
    pub min_score: f32,
}
Use lower thresholds for short-form content. Use higher thresholds when false positives are more expensive than missed articles.
Defaults:
ReadableOptions {
    min_content_length: 140,
    min_score: 20.0,
}
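To build intuition for how the two thresholds interact, the sketch below follows the heuristic used by Mozilla Readability's isProbablyReaderable, whose defaults are also 140 and 20. That Lectito scores the same way is an assumption; the exact scoring is not spelled out here. The function `probably_readable` and its input, a list of text lengths for candidate nodes, are hypothetical, and `ReadableOptions` is redefined locally so the example is self-contained.

```rust
/// Local copy of the options struct so the sketch compiles on its own.
struct ReadableOptions {
    min_content_length: usize,
    min_score: f32,
}

/// Mozilla-style estimate (assumed, not confirmed for Lectito): nodes shorter
/// than `min_content_length` are skipped, each remaining node adds
/// sqrt(len - min_content_length), and the page counts as readable once the
/// running score exceeds `min_score`.
fn probably_readable(node_text_lengths: &[usize], options: &ReadableOptions) -> bool {
    let mut score = 0.0_f32;
    for &len in node_text_lengths {
        if len < options.min_content_length {
            continue;
        }
        score += ((len - options.min_content_length) as f32).sqrt();
        if score > options.min_score {
            return true;
        }
    }
    false
}

fn main() {
    let defaults = ReadableOptions { min_content_length: 140, min_score: 20.0 };
    // Many short snippets never score, however many there are.
    println!("{}", probably_readable(&[100; 50], &defaults));
    // A few medium paragraphs add up.
    println!("{}", probably_readable(&[240, 240, 240], &defaults));
}
```

Under this model, lowering min_content_length lets short paragraphs contribute at all, while lowering min_score reduces how much qualifying text is needed in total.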
Site Profiles
Site profiles are TOML extraction hints scoped by URL host. They are useful when a site has a stable content container or predictable clutter, but still returns ordinary article-shaped HTML.
Profiles run before generic readability scoring. If a profile produces text
below char_threshold, Lectito records the profile decision in diagnostics and
continues with generic extraction.
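The fallback rule can be pictured as a small decision function. This is a sketch only: `ProfileOutcome`, `decide`, the diagnostics messages, and the choice to measure length in `char`s are all assumptions made for illustration, not Lectito internals.

```rust
#[derive(Debug, PartialEq)]
enum ProfileOutcome {
    Accepted,
    FellBackToGeneric,
}

/// Hypothetical helper: accept the profile result only when its text reaches
/// `char_threshold`; otherwise record the decision and fall back.
fn decide(
    profile_text: &str,
    char_threshold: usize,
    diagnostics: &mut Vec<String>,
) -> ProfileOutcome {
    // Assumption: length is counted in chars for this sketch.
    let len = profile_text.chars().count();
    if len >= char_threshold {
        diagnostics.push(format!("profile accepted: {len} chars"));
        ProfileOutcome::Accepted
    } else {
        diagnostics.push(format!("profile below threshold: {len} < {char_threshold}"));
        ProfileOutcome::FellBackToGeneric
    }
}

fn main() {
    let mut diagnostics = Vec::new();
    let outcome = decide("Too short.", 500, &mut diagnostics);
    println!("{outcome:?}: {diagnostics:?}");
}
```

Recording the decision either way means a surprising result can be traced back to whether the profile or the generic extractor produced it.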
Example
name = "example"
hosts = ["example.com"]
subdomains = true
path_prefixes = ["/blog"]
exclude_path_prefixes = ["/blog/comments"]
content_roots = ["article", "#content"]
remove = [".ad", "nav", "footer"]
remove_id_or_class = ["sidebar"]
[metadata]
title = ["h1"]
author = [".byline"]
date = ["time/@datetime"]
image = ["meta[property='og:image']/@content"]
site_name = "Example"
title_suffixes = [" - Example"]
[cleanup]
enabled = true
prune = true
[fallback]
generic_on_empty = true
Fields
| Field | Meaning |
|---|---|
| name | Human-readable profile name used in diagnostics. |
| hosts | Hosts matched by the profile. www. is ignored during matching. |
| subdomains | When true, subdomains of each host also match. |
| path_prefixes | Optional path prefixes. Omit to match every path on the host. |
| exclude_path_prefixes | Optional path prefixes that suppress the profile after host matching. |
| content_roots | CSS selectors or supported XPath selectors for article roots. |
| remove | CSS selectors or supported XPath selectors to remove before extraction. |
| remove_id_or_class | Exact id or class tokens to remove. |
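The host and path rules in the table can be sketched as a pair of predicates. The code below is an illustration under assumptions, not Lectito's matcher: `ProfileScope`, `example_scope`, `host_matches`, `path_matches`, and `applies` are hypothetical names, and details such as case-insensitive host comparison are guesses.

```rust
/// Hypothetical subset of a profile's scoping fields.
struct ProfileScope<'a> {
    hosts: &'a [&'a str],
    subdomains: bool,
    path_prefixes: &'a [&'a str],
    exclude_path_prefixes: &'a [&'a str],
}

/// Scope equivalent to the example profile above.
fn example_scope() -> ProfileScope<'static> {
    ProfileScope {
        hosts: &["example.com"],
        subdomains: true,
        path_prefixes: &["/blog"],
        exclude_path_prefixes: &["/blog/comments"],
    }
}

fn strip_www(host: &str) -> &str {
    host.strip_prefix("www.").unwrap_or(host)
}

/// Exact host match ignoring "www.", plus subdomains when enabled.
fn host_matches(scope: &ProfileScope, host: &str) -> bool {
    let host = host.to_ascii_lowercase();
    let host = strip_www(&host);
    scope.hosts.iter().any(|candidate| {
        let candidate = strip_www(candidate);
        host == candidate
            || (scope.subdomains && host.ends_with(format!(".{candidate}").as_str()))
    })
}

/// An empty include list matches every path; excludes always win.
fn path_matches(scope: &ProfileScope, path: &str) -> bool {
    let included = scope.path_prefixes.is_empty()
        || scope.path_prefixes.iter().any(|p| path.starts_with(*p));
    let excluded = scope.exclude_path_prefixes.iter().any(|p| path.starts_with(*p));
    included && !excluded
}

fn applies(scope: &ProfileScope, host: &str, path: &str) -> bool {
    host_matches(scope, host) && path_matches(scope, path)
}

fn main() {
    let scope = example_scope();
    println!("{}", applies(&scope, "www.example.com", "/blog/post"));
    println!("{}", applies(&scope, "example.com", "/blog/comments/1"));
}
```

The subdomain check compares against ".example.com" rather than "example.com" so that an unrelated host such as notexample.com does not match.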
Metadata fields are optional selector lists, except site_name, which is a
constant. Selectors may target attributes with the supported XPath .../@attr
form.
Cleanup defaults to enabled. prune controls conditional cleanup. Disabling
cleanup should be reserved for sites where the profile root is already clean and
generic cleanup removes useful structure.
Selector Support
Profiles accept CSS selectors directly. They also accept a focused XPath subset for compatibility with rule corpuses and older bundled rules:
- //tag
- //*[@id='value']
- //tag[@class='a b']
- //tag[contains(@class, 'value')]
- /text() suffixes
- /@attribute suffixes for metadata selectors
Unsupported XPath expressions are ignored by selector matching, so bundled profiles should have tests that prove their roots match representative pages.
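One way to picture the suffix forms is as a split between the element selector and what to read from the matched element. The sketch below is hypothetical: `Target` and `split_selector` are names invented here, and the rule for what counts as a valid attribute name is an assumption.

```rust
#[derive(Debug, PartialEq)]
enum Target<'a> {
    /// No suffix: use the matched element itself.
    Element,
    /// `/text()` suffix: use the element's text.
    Text,
    /// `/@attr` suffix: use the named attribute.
    Attribute(&'a str),
}

/// Split a selector into its base and the suffix-defined target.
fn split_selector(selector: &str) -> (&str, Target<'_>) {
    if let Some(base) = selector.strip_suffix("/text()") {
        return (base, Target::Text);
    }
    if let Some((base, attr)) = selector.rsplit_once("/@") {
        // Assumed attribute-name rule: ASCII letters, digits, '-', '_' or ':'.
        let valid = !attr.is_empty()
            && attr
                .chars()
                .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | ':'));
        if valid {
            return (base, Target::Attribute(attr));
        }
    }
    (selector, Target::Element)
}

fn main() {
    for selector in ["h1", "time/@datetime", "//h1/text()"] {
        println!("{selector} -> {:?}", split_selector(selector));
    }
}
```

Splitting on the last "/@" keeps CSS selectors that contain quoted values, such as meta[property='og:image']/@content, intact on the base side.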
User Profiles
Rust callers pass profile TOML strings through ReadabilityOptions:
let options = ReadabilityOptions {
    site_profiles: vec![std::fs::read_to_string("example.com.toml")?],
    ..ReadabilityOptions::default()
};
The CLI accepts repeatable profile paths:
lectito article.html --base-url https://example.com/post --site-profile example.com.toml
User profiles take precedence over bundled profiles. More specific host and path matches win within each source group.
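The precedence rule can be read as an ordering: source first, then specificity. The sketch below assumes that "more specific" means a longer host and then a longer path prefix; that definition, along with the `Source`, `Candidate`, and `pick` names, is an assumption made for illustration.

```rust
/// Declared so that `User` compares greater than `Bundled`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Source {
    Bundled,
    User,
}

/// Hypothetical view of a profile that already matched the URL.
struct Candidate {
    name: &'static str,
    source: Source,
    host: &'static str,
    path_prefix: &'static str,
}

/// Pick the winning profile: user profiles beat bundled ones, then the
/// longer host wins, then the longer path prefix (assumed specificity).
fn pick(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .max_by_key(|c| (c.source, c.host.len(), c.path_prefix.len()))
}

fn main() {
    let matched = [
        Candidate { name: "bundled-deep", source: Source::Bundled, host: "example.com", path_prefix: "/blog/2024" },
        Candidate { name: "user-site", source: Source::User, host: "example.com", path_prefix: "" },
        Candidate { name: "user-blog", source: Source::User, host: "example.com", path_prefix: "/blog" },
    ];
    if let Some(winner) = pick(&matched) {
        println!("{}", winner.name);
    }
}
```

Comparing the source before specificity means a broad user profile still overrides a narrowly scoped bundled one, which matches the stated precedence.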