kepano/defuddle: Get the main content of any page as Markdown.

by kepano github.com 1,507 words

de·fud·dle /diˈfʌdl/ transitive verb
to remove unnecessary elements from a web page, and make it easily readable.

Beware! Defuddle is very much a work in progress!

Defuddle extracts the main content from web pages. It cleans up web pages by removing clutter like comments, sidebars, headers, footers, and other non-essential elements, leaving only the primary content.

Overview

Defuddle takes a URL or HTML, finds the main content, and returns cleaned HTML or Markdown. Defuddle was created for the browser extension Obsidian Web Clipper, but it is designed to run in any environment.

Defuddle can be used as a replacement for Mozilla Readability, with a few differences.

Usage

Browser

import Defuddle from 'defuddle';

// Parse the current document
const defuddle = new Defuddle(document);
const result = defuddle.parse();

// Access the content and metadata
console.log(result.content);
console.log(result.title);
console.log(result.author);

Node.js

defuddle/node accepts a DOM Document from any implementation (JSDOM, linkedom, happy-dom, etc.).

import { parseHTML } from 'linkedom';
import { Defuddle } from 'defuddle/node';

// `html` is a string containing the page's HTML source
const { document } = parseHTML(html);
const result = await Defuddle(document, 'https://example.com/article', {
  markdown: true
});

console.log(result.content);
console.log(result.title);
console.log(result.author);

Or with JSDOM:

import { JSDOM } from 'jsdom';
import { Defuddle } from 'defuddle/node';

// `html` is a string containing the page's HTML source
const dom = new JSDOM(html, { url: 'https://example.com/article' });
const result = await Defuddle(dom.window.document, 'https://example.com/article');

Note: for defuddle/node to import properly, your package.json must declare the ES module format with "type": "module".

CLI

Defuddle includes a command-line interface for parsing web pages directly from the terminal. You can run it with npx or install it globally.

# Parse a local HTML file
npx defuddle parse page.html

# Parse a URL
npx defuddle parse https://example.com/article

# Output as markdown
npx defuddle parse page.html --markdown

# Output as JSON with metadata
npx defuddle parse page.html --json

# Extract a specific property
npx defuddle parse page.html --property title

# Save output to a file
npx defuddle parse page.html --output result.html

# Enable debug mode
npx defuddle parse page.html --debug

CLI Options

| Option | Alias | Description |
| --- | --- | --- |
| `--output <file>` | `-o` | Write output to a file instead of stdout |
| `--markdown` | `-m` | Convert content to markdown format |
| `--md` | | Alias for `--markdown` |
| `--json` | `-j` | Output as JSON with metadata and content |
| `--property <name>` | `-p` | Extract a specific property (e.g., title, description, domain) |
| `--debug` | | Enable debug mode |
| `--lang <code>` | `-l` | Preferred language (BCP 47, e.g. en, fr, ja) |
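
When the CLI is driven from a script, the flags above can be assembled programmatically. The following is a minimal sketch, assuming the flags can be combined freely; `buildDefuddleArgs` is a hypothetical helper and not part of Defuddle.

```javascript
// Hypothetical helper: turns an options object into an argument list
// for `defuddle parse`, using only the flags listed in the table above.
function buildDefuddleArgs(input, opts = {}) {
  const args = ['parse', input];
  if (opts.markdown) args.push('--markdown');
  if (opts.json) args.push('--json');
  if (opts.property) args.push('--property', opts.property);
  if (opts.lang) args.push('--lang', opts.lang);
  if (opts.output) args.push('--output', opts.output);
  if (opts.debug) args.push('--debug');
  return args;
}

const args = buildDefuddleArgs('page.html', { markdown: true, output: 'result.md' });
console.log(['npx', 'defuddle', ...args].join(' '));
// prints "npx defuddle parse page.html --markdown --output result.md"
```

The resulting array could be passed to a process spawner such as Node's child_process.spawn, which avoids shell-quoting problems with file names.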

Installation

npm install defuddle

For Node.js usage, install a DOM implementation:

npm install linkedom

Or use JSDOM:

npm install jsdom

CLI installation

To use the defuddle command globally, install it with the -g flag:

npm install -g defuddle

Or use npx to run the CLI without installing globally:

npx defuddle parse https://example.com/article

Response

Defuddle returns an object with the following properties:

| Property | Type | Description |
| --- | --- | --- |
| `author` | string | Author of the article |
| `content` | string | Cleaned up string of the extracted content |
| `description` | string | Description or summary of the article |
| `domain` | string | Domain name of the website |
| `favicon` | string | URL of the website's favicon |
| `image` | string | URL of the article's main image |
| `language` | string | Language of the page in BCP 47 format (e.g. en, en-US) |
| `metaTags` | object | Meta tags |
| `parseTime` | number | Time taken to parse the page in milliseconds |
| `published` | string | Publication date of the article |
| `site` | string | Name of the website |
| `schemaOrgData` | object | Raw schema.org data extracted from the page |
| `title` | string | Title of the article |
| `wordCount` | number | Total number of words in the extracted content |
| `debug` | object | Debug info including content selector and removals (when `debug: true`) |
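
As an illustration of consuming these properties, the sketch below turns a few metadata fields into a YAML-style front matter block. The `sampleResult` values and the `toFrontMatter` helper are made up for the example; only the property names come from the table above.

```javascript
// Hypothetical sample shaped like a Defuddle response; the values are made up.
const sampleResult = {
  title: 'Example article',
  author: 'Jane Doe',
  published: '2024-01-15',
  site: 'Example',
  wordCount: 1200,
  content: '<p>Hello</p>',
};

// Builds a YAML-style front matter block from selected metadata fields,
// skipping any field that is missing or empty.
function toFrontMatter(result, fields = ['title', 'author', 'published', 'site']) {
  const lines = fields
    .filter((key) => result[key] !== undefined && result[key] !== '')
    .map((key) => `${key}: ${JSON.stringify(result[key])}`);
  return ['---', ...lines, '---'].join('\n');
}

console.log(toFrontMatter(sampleResult));
```

JSON.stringify is used for quoting so that titles containing colons or quotes stay valid YAML strings.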

Bundles

Defuddle is available in three different bundles:

  1. Core bundle (defuddle): The main bundle for browser usage. No dependencies.
  2. Full bundle (defuddle/full): Includes additional features for math equation parsing and Markdown conversion.
  3. Node.js bundle (defuddle/node): For Node.js environments. Accepts any DOM Document (e.g. from linkedom, JSDOM, or happy-dom). Includes full capabilities for math and Markdown conversion.

The core bundle is recommended for most use cases. It still handles math content, but doesn’t include fallbacks for converting between MathML and LaTeX formats. The full bundle adds the ability to create reliable <math> elements using the mathml-to-latex and temml libraries.

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `debug` | boolean | `false` | Enable debug logging and return debug info in the response |
| `url` | string | | URL of the page being parsed |
| `markdown` | boolean | `false` | Convert content to Markdown |
| `separateMarkdown` | boolean | `false` | Keep content as HTML and return `contentMarkdown` as Markdown |
| `removeExactSelectors` | boolean | `true` | Remove elements matching exact selectors like ads, social buttons, etc. |
| `removePartialSelectors` | boolean | `true` | Remove elements matching partial selectors like ads, social buttons, etc. |
| `removeHiddenElements` | boolean | `true` | Remove elements hidden via CSS (display:none, visibility:hidden, etc.) |
| `removeLowScoring` | boolean | `true` | Remove non-content blocks by scoring (navigation, link lists, etc.) |
| `removeSmallImages` | boolean | `true` | Remove small images (icons, tracking pixels, etc.) |
| `removeImages` | boolean | `false` | Remove images. |
| `standardize` | boolean | `true` | Standardize HTML (footnotes, headings, code blocks, etc.) |
| `contentSelector` | string | | CSS selector to use as the main content element, bypassing auto-detection |
| `useAsync` | boolean | `true` | Allow async extractors to fetch from third-party APIs when no local content is available. |
| `language` | string | | Preferred language (BCP 47 tag, e.g. en, fr). Sets Accept-Language header and selects transcript language. |
| `includeReplies` | `boolean \| 'extractors'` | `'extractors'` | Include replies: `'extractors'` for site-specific extractors only, `true` for all, `false` for none. |
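
To make the defaults concrete, here is a small sketch that layers caller overrides on top of the documented defaults and rejects misspelled option names. It is an illustration only: `DOCUMENTED_DEFAULTS` and `resolveOptions` are hypothetical names, and Defuddle's own option handling may differ.

```javascript
// Defaults copied from the options table above. Options without a listed
// default (url, contentSelector, language) are left out.
const DOCUMENTED_DEFAULTS = {
  debug: false,
  markdown: false,
  separateMarkdown: false,
  removeExactSelectors: true,
  removePartialSelectors: true,
  removeHiddenElements: true,
  removeLowScoring: true,
  removeSmallImages: true,
  removeImages: false,
  standardize: true,
  useAsync: true,
  includeReplies: 'extractors',
};

// Hypothetical helper: rejects unknown keys, then layers overrides on the defaults.
function resolveOptions(overrides = {}) {
  const known = new Set([
    ...Object.keys(DOCUMENTED_DEFAULTS),
    'url',
    'contentSelector',
    'language',
  ]);
  for (const key of Object.keys(overrides)) {
    if (!known.has(key)) throw new Error(`Unknown option: ${key}`);
  }
  return { ...DOCUMENTED_DEFAULTS, ...overrides };
}

const options = resolveOptions({ markdown: true, removeImages: true });
console.log(options.markdown, options.removeLowScoring); // prints "true true"
```

Checking keys up front catches typos such as `removeLowScore`, which would otherwise be ignored silently.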

HTML standardization

Defuddle attempts to standardize HTML elements to provide a consistent input for subsequent manipulation such as conversion to Markdown.

Headings

Code blocks

Code blocks are standardized. If present, line numbers and syntax highlighting are removed, but the language is retained and added as a data attribute and class.

<pre>
  <code data-lang="js" class="language-js">
    // code
  </code>
</pre>
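
Because the language ends up in a predictable class name, downstream code can recover it with a simple pattern match. The helper below is a hypothetical consumer-side sketch, assuming the `language-` class prefix shown above; it is not part of Defuddle.

```javascript
// Hypothetical helper for consumers of the standardized HTML:
// extracts "js" from a class attribute such as "language-js".
function languageFromClass(className) {
  const match = /(?:^|\s)language-(\S+)/.exec(className || '');
  return match ? match[1] : null;
}

console.log(languageFromClass('language-js')); // prints "js"
console.log(languageFromClass('plain'));       // prints "null"
```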

Footnotes

Inline references and footnotes are converted to a standard format:

Inline reference<sup id="fnref:1"><a href="#fn:1">1</a></sup>.

<div id="footnotes">
  <ol>
    <li class="footnote" id="fn:1">
      <p>
        Footnote content.&nbsp;<a href="#fnref:1" class="footnote-backref">↩</a>
      </p>
    </li>
  </ol>
</div>
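
The id scheme pairs each inline reference (`fnref:N`) with its footnote (`fn:N`). The sketch below generates both halves for a given number to make that pairing explicit. The function names are hypothetical, and the markup simply mirrors the example above.

```javascript
// Hypothetical helpers that mirror the standardized footnote markup above.
function footnoteRef(n) {
  return `<sup id="fnref:${n}"><a href="#fn:${n}">${n}</a></sup>`;
}

function footnoteItem(n, html) {
  return (
    `<li class="footnote" id="fn:${n}"><p>${html}&nbsp;` +
    `<a href="#fnref:${n}" class="footnote-backref">↩</a></p></li>`
  );
}

console.log(footnoteRef(1));
console.log(footnoteItem(1, 'Footnote content.'));
```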

Math

Math elements, including MathJax and KaTeX, are converted to standard MathML:

<math xmlns="http://www.w3.org/1998/Math/MathML" display="inline" data-latex="a \neq 0">
  <mi>a</mi>
  <mo>≠</mo>
  <mn>0</mn>
</math>

Callouts

Callout and alert elements from various sources are standardized to a common structure with a data-callout attribute. When converting to Markdown, these become Obsidian-style callouts.

Supported sources:

The standardized HTML follows the Obsidian Publish format:

<div data-callout="info" class="callout">
  <div class="callout-title">
    <div class="callout-title-inner">Info</div>
  </div>
  <div class="callout-content">
    <p>This is an informational callout.</p>
  </div>
</div>

In Markdown:

> [!info] Info
> This is an informational callout.
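
The mapping from callout type, title, and body to the Markdown form can be sketched as follows. This is an illustrative helper with a hypothetical name, not Defuddle's actual converter; it only reproduces the output format shown above.

```javascript
// Hypothetical helper: renders a callout as Obsidian-style Markdown.
// Each body line is prefixed with "> "; blank lines become a bare ">".
function toCalloutMarkdown(type, title, content) {
  const body = content.split('\n').map((line) => (line ? `> ${line}` : '>'));
  return [`> [!${type}] ${title}`, ...body].join('\n');
}

console.log(toCalloutMarkdown('info', 'Info', 'This is an informational callout.'));
// prints:
// > [!info] Info
// > This is an informational callout.
```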

Development

Build

To build the package, you’ll need Node.js and npm installed. Then run:

# Install dependencies
npm install

# Clean and build
npm run build

Third-party services

When using parseAsync(), if no content can be extracted from the local HTML, Defuddle may fetch content from third-party APIs as a fallback. This only happens when the page HTML contains no usable content (e.g. client-side rendered SPAs). You can disable this by setting useAsync: false in options.

Debugging

Debug mode

You can enable debug mode by passing an options object when creating a new Defuddle instance:

const result = new Defuddle(document, { debug: true }).parse();

// Access debug info
console.log(result.debug.contentSelector); // CSS selector path of chosen main content element
console.log(result.debug.removals);        // Array of removed elements with reasons

When debug mode is enabled, the response includes a debug field that contains:

| Property | Type | Description |
| --- | --- | --- |
| `contentSelector` | string | CSS selector path of the chosen main content element |
| `removals` | array | List of elements removed during processing |

Each removal entry contains:

| Property | Type | Description |
| --- | --- | --- |
| `step` | string | Pipeline step that removed the element (e.g. removeLowScoring, removeBySelector, removeHiddenElements) |
| `selector` | string | CSS selector or pattern that matched (for selector-based removal) |
| `reason` | string | Why the element was removed (e.g. score: -20, display:none) |
| `text` | string | First 200 characters of the removed element's text content |
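
A quick way to see which pipeline step is responsible for most removals is to group the entries by `step`. The sketch below uses made-up sample entries shaped like the removal records described above; `countRemovalsByStep` is a hypothetical helper.

```javascript
// Hypothetical sample shaped like result.debug.removals; the entries are made up.
const removals = [
  { step: 'removeLowScoring', reason: 'score: -20', text: 'Home About Contact' },
  { step: 'removeHiddenElements', reason: 'display:none', text: 'Sidenote text' },
  { step: 'removeLowScoring', reason: 'score: -12', text: 'Related links' },
];

// Counts how many elements each pipeline step removed.
function countRemovalsByStep(entries) {
  const counts = {};
  for (const entry of entries) {
    counts[entry.step] = (counts[entry.step] || 0) + 1;
  }
  return counts;
}

console.log(countRemovalsByStep(removals));
// returns { removeLowScoring: 2, removeHiddenElements: 1 }
```

If one step dominates and the output is missing content, that step is a reasonable first candidate to disable.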

Pipeline toggles

You can disable individual pipeline steps to diagnose content extraction issues:

// Skip content scoring to see if it's removing content incorrectly
const withoutScoring = new Defuddle(document, { removeLowScoring: false }).parse();

// Skip hidden element removal (useful for CSS sidenote layouts)
const withHiddenElements = new Defuddle(document, { removeHiddenElements: false }).parse();

// Skip small image removal
const withSmallImages = new Defuddle(document, { removeSmallImages: false }).parse();

Content selector

Use contentSelector to bypass Defuddle’s auto-detection and specify the main content element directly:

const result = new Defuddle(document, {
  contentSelector: 'article.post-content'
}).parse();

If the selector doesn’t match any element, Defuddle falls back to auto-detection.
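
The fallback behavior can be pictured with the sketch below. It is not Defuddle's actual code: `pickContentRoot` and `autoDetect` are hypothetical stand-ins, and `fakeDoc` is a minimal object that only imitates `querySelector` so the example runs without a DOM.

```javascript
// Sketch of the documented fallback, not Defuddle's actual code.
// `autoDetect` is a hypothetical stand-in for the library's detection step.
function pickContentRoot(doc, contentSelector, autoDetect) {
  if (contentSelector) {
    const match = doc.querySelector(contentSelector);
    if (match) return match; // selector matched: use it directly
  }
  return autoDetect(doc); // no selector, or no match: fall back
}

// Minimal fake document so the sketch runs without a DOM implementation.
const fakeDoc = {
  querySelector: (selector) =>
    selector === 'article.post-content' ? { id: 'post' } : null,
};
const detect = () => ({ id: 'auto' });

console.log(pickContentRoot(fakeDoc, 'article.post-content', detect).id); // prints "post"
console.log(pickContentRoot(fakeDoc, '.missing', detect).id);             // prints "auto"
```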