Converting HTML to Markdown with Python: A Comprehensive Guide
Python for converting HTML to clean, LLM-ready Markdown
Converting HTML to Markdown is a fundamental task in modern development workflows, particularly when preparing web content for Large Language Models (LLMs), documentation systems, or static site generators like Hugo.
While HTML is designed for web browsers with rich styling and structure, Markdown offers a clean, readable format that’s ideal for text processing, version control, and AI consumption. If you’re new to Markdown syntax, check out our Markdown Cheatsheet for a comprehensive reference.

In this comprehensive review, we’ll explore six Python packages for HTML-to-Markdown conversion, providing practical code examples, performance benchmarks, and real-world use cases. Whether you’re building an LLM training pipeline, migrating a blog to Hugo, or scraping documentation, you’ll find the perfect tool for your workflow.
Alternative Approach: If you need more intelligent content extraction with semantic understanding, you might also consider converting HTML to Markdown using LLM and Ollama, which offers AI-powered conversion for complex layouts.
What you’ll learn:
- Detailed comparison of 6 libraries with pros/cons for each
- Performance benchmarks with real-world HTML samples
- Production-ready code examples for common use cases
- Best practices for LLM preprocessing workflows
- Specific recommendations based on your requirements
Why Markdown for LLM Preprocessing?
Before diving into the tools, let’s understand why Markdown is particularly valuable for LLM workflows:
- Token Efficiency: Markdown uses significantly fewer tokens than HTML for the same content
- Semantic Clarity: Markdown preserves document structure without verbose tags
- Readability: Both humans and LLMs can easily parse Markdown’s syntax
- Consistency: Standardized format reduces ambiguity in model inputs
- Storage: Smaller file sizes for training data and context windows
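As a rough illustration of the token-efficiency point, the sketch below compares the same small fragment written as HTML and as Markdown. The regex-based `rough_token_count` is a hypothetical stand-in for a real tokenizer, so the numbers only indicate the direction of the difference, not actual model token counts.

```python
import re

def rough_token_count(text: str) -> int:
    """Crude token proxy: count runs of word characters and single punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))

html_version = (
    '<div class="post"><h2 class="title">Release notes</h2>'
    '<p>This update is <strong>faster</strong> and fixes '
    '<a href="https://example.com/bug">a bug</a>.</p></div>'
)
markdown_version = (
    "## Release notes\n\n"
    "This update is **faster** and fixes [a bug](https://example.com/bug).\n"
)

html_tokens = rough_token_count(html_version)
md_tokens = rough_token_count(markdown_version)
print(f"HTML: {html_tokens} rough tokens, Markdown: {md_tokens} rough tokens")
```

For real measurements, swap the regex for the tokenizer of the model you are targeting.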
Markdown’s versatility extends beyond HTML conversion—you can also convert Word Documents to Markdown for documentation workflows, or use it in knowledge management systems like Obsidian for Personal Knowledge Management.
TL;DR - Quick Comparison Matrix
If you’re in a hurry, here’s a comprehensive comparison of all six libraries at a glance. This table will help you quickly identify which tool matches your specific requirements:
| Feature | html2text | markdownify | html-to-markdown | trafilatura | domscribe | html2md |
|---|---|---|---|---|---|---|
| HTML5 Support | Partial | Partial | Full | Full | Full | Full |
| Type Hints | No | No | Yes | Partial | No | Partial |
| Custom Handlers | Limited | Excellent | Good | Limited | Good | Limited |
| Table Support | Basic | Basic | Advanced | Good | Good | Good |
| Async Support | No | No | No | No | No | Yes |
| Content Extraction | No | No | No | Excellent | No | Good |
| Metadata Extraction | No | No | Yes | Excellent | No | Yes |
| CLI Tool | No | No | Yes | Yes | No | Yes |
| Speed | Medium | Slow | Fast | Very Fast | Medium | Very Fast |
| Active Development | No | Yes | Yes | Yes | Limited | Yes |
| Python Version | 3.6+ | 3.7+ | 3.9+ | 3.6+ | 3.8+ | 3.10+ |
| Dependencies | None | BS4 | lxml | lxml | BS4 | aiohttp |
Quick Selection Guide:
- Need speed? → trafilatura or html2md
- Need customization? → markdownify
- Need type safety? → html-to-markdown
- Need simplicity? → html2text
- Need content extraction? → trafilatura
The Contenders: 6 Python Packages Compared
Let’s dive deep into each library with practical code examples, configuration options, and real-world insights. Each section includes installation instructions, usage patterns, and honest assessments of strengths and limitations.
1. html2text - The Classic Choice
Originally developed by Aaron Swartz, html2text has been a staple in the Python ecosystem for over a decade. It focuses on producing clean, readable Markdown output.
Installation:
pip install html2text
Basic Usage:
import html2text
# Create converter instance
h = html2text.HTML2Text()
# Configure options
h.ignore_links = False
h.ignore_images = False
h.ignore_emphasis = False
h.body_width = 0 # Don't wrap lines
html_content = """
<h1>Welcome to Web Scraping</h1>
<p>This is a <strong>comprehensive guide</strong> to extracting content.</p>
<ul>
<li>Easy to use</li>
<li>Battle-tested</li>
<li>Widely adopted</li>
</ul>
<a href="https://example.com">Learn more</a>
"""
markdown = h.handle(html_content)
print(markdown)
Output:
# Welcome to Web Scraping
This is a **comprehensive guide** to extracting content.
* Easy to use
* Battle-tested
* Widely adopted
[Learn more](https://example.com)
Advanced Configuration:
import html2text
h = html2text.HTML2Text()
# Skip specific elements
h.ignore_links = True
h.ignore_images = True
# Control formatting
h.body_width = 80 # Wrap at 80 characters
h.unicode_snob = True # Use unicode characters
h.emphasis_mark = '*' # Use * for emphasis instead of _
h.strong_mark = '**'
# Handle tables
h.ignore_tables = False
# Wrap links in angle brackets so line wrapping does not break them
h.protect_links = True
Pros:
- Mature and stable (15+ years of development)
- Extensive configuration options
- Handles edge cases well
- No external dependencies
Cons:
- Limited HTML5 support
- Can produce inconsistent spacing
- Not actively maintained (last major update in 2020)
- Single-threaded processing only
Best For: Simple HTML documents, legacy systems, when stability is paramount
2. markdownify - The Flexible Option
markdownify leverages BeautifulSoup4 to provide flexible HTML parsing with customizable tag handling.
Installation:
pip install markdownify
Basic Usage:
from markdownify import markdownify as md
html = """
<article>
<h2>Modern Web Development</h2>
<p>Building with <code>Python</code> and <em>modern frameworks</em>.</p>
<blockquote>
<p>Simplicity is the ultimate sophistication.</p>
</blockquote>
</article>
"""
markdown = md(html)
print(markdown)
Output:
## Modern Web Development
Building with `Python` and *modern frameworks*.
> Simplicity is the ultimate sophistication.
Advanced Usage with Custom Handlers:
from markdownify import MarkdownConverter

class CustomConverter(MarkdownConverter):
    """Converter with custom handling for images and code blocks."""

    # The third handler argument is named differently across markdownify
    # releases (convert_as_inline in older ones), so it is accepted generically.
    def convert_img(self, el, text, *args, **kwargs):
        """Custom image handler with alt text and optional title"""
        alt = el.get('alt', '')
        src = el.get('src', '')
        title = el.get('title', '')
        if title:
            return f'![{alt}]({src} "{title}")'
        return f'![{alt}]({src})'

    def convert_pre(self, el, text, *args, **kwargs):
        """Enhanced code block handling with language detection"""
        code = el.find('code')
        if code:
            # Extract language from class attribute (e.g., 'language-python')
            classes = code.get('class', [''])
            language = classes[0].replace('language-', '') if classes else ''
            return f'\n```{language}\n{code.get_text()}\n```\n'
        return f'\n```\n{text}\n```\n'

# Use custom converter
html = '<pre><code class="language-python">def hello():\n    print("world")</code></pre>'
markdown = CustomConverter().convert(html)
print(markdown)
For more details on working with Markdown code blocks and syntax highlighting, see our guide on Using Markdown Code Blocks.
Selective Tag Conversion:
from markdownify import markdownify as md
# Strip specific tags entirely
markdown = md(html, strip=['script', 'style', 'nav'])
# Control the output style
markdown = md(
    html,
    heading_style="ATX",     # Use # for headings
    bullets="-",             # Use - for bullets
    strong_em_symbol="*",    # Use * for emphasis
)
Pros:
- Built on BeautifulSoup4 (robust HTML parsing)
- Highly customizable through subclassing
- Active maintenance
- Good documentation
Cons:
- Requires BeautifulSoup4 dependency
- Can be slower for large documents
- Limited built-in table support
Best For: Custom conversion logic, projects already using BeautifulSoup4
3. html-to-markdown - The Modern Powerhouse
html-to-markdown is a fully-typed, modern library with comprehensive HTML5 support and extensive configuration options.
Installation:
pip install html-to-markdown
Basic Usage:
from html_to_markdown import convert
html = """
<article>
<h1>Technical Documentation</h1>
<table>
<thead>
<tr>
<th>Feature</th>
<th>Support</th>
</tr>
</thead>
<tbody>
<tr>
<td>HTML5</td>
<td>✓</td>
</tr>
<tr>
<td>Tables</td>
<td>✓</td>
</tr>
</tbody>
</table>
</article>
"""
markdown = convert(html)
print(markdown)
Advanced Configuration:
from html_to_markdown import convert, Options
# Create custom options
options = Options(
heading_style="ATX",
bullet_style="-",
code_language_default="python",
strip_tags=["script", "style"],
escape_special_chars=True,
table_style="pipe", # Use | for tables
preserve_whitespace=False,
extract_metadata=True, # Extract meta tags
)
markdown = convert(html, options=options)
Command-Line Interface:
# Convert single file
html-to-markdown input.html -o output.md
# Convert with options
html-to-markdown input.html \
--heading-style atx \
--strip-tags script,style \
--extract-metadata
# Batch conversion
find ./html_files -name "*.html" -exec html-to-markdown {} -o ./markdown_files/{}.md \;
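With GNU find, `{}` expands to the full source path (for example `./html_files/page.html`), so the output name in the batch one-liner above becomes a nested path ending in `.html.md`. A small Python sketch for computing mirrored output paths instead; `mirrored_output_path`, `plan_conversions`, and the directory names are illustrative, and the actual conversion call is left out:

```python
from pathlib import Path

def mirrored_output_path(src: Path, in_root: Path, out_root: Path) -> Path:
    """Map in_root/a/b.html to out_root/a/b.md, keeping the subdirectory layout."""
    relative = src.relative_to(in_root)
    return (out_root / relative).with_suffix(".md")

def plan_conversions(in_root: Path, out_root: Path) -> list[tuple[Path, Path]]:
    """List (source, target) pairs for every .html file under in_root."""
    return [
        (src, mirrored_output_path(src, in_root, out_root))
        for src in sorted(in_root.rglob("*.html"))
    ]

example = mirrored_output_path(
    Path("html_files/guides/intro.html"), Path("html_files"), Path("markdown_files")
)
print(example)
```

Each planned pair can then be passed to whichever converter you settle on, creating `target.parent` first with `mkdir(parents=True, exist_ok=True)`.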
Pros:
- Full HTML5 support including semantic elements
- Type-safe with comprehensive type hints
- Enhanced table handling (merged cells, alignment)
- Metadata extraction capabilities
- Active development and modern codebase
Cons:
- Requires Python 3.9+
- Larger dependency footprint
- Steeper learning curve
Best For: Complex HTML5 documents, type-safe projects, production systems
4. trafilatura - The Content Extraction Specialist
trafilatura isn’t just an HTML-to-Markdown converter—it’s an intelligent content extraction library specifically designed for web scraping and article extraction.
Installation:
pip install trafilatura
Basic Usage:
import trafilatura
# Download and extract from URL
url = "https://example.com/article"
downloaded = trafilatura.fetch_url(url)
markdown = trafilatura.extract(downloaded, output_format='markdown')
print(markdown)
Note: Trafilatura includes built-in URL fetching, but for more complex HTTP operations, you might find our cURL Cheatsheet helpful when working with APIs or authenticated endpoints.
Advanced Content Extraction:
import trafilatura
from trafilatura.settings import use_config
# Create custom configuration
config = use_config()
config.set("DEFAULT", "EXTRACTION_TIMEOUT", "30")
html = """
<html>
<head><title>Article Title</title></head>
<body>
<nav>Navigation menu</nav>
<article>
<h1>Main Article</h1>
<p>Important content here.</p>
</article>
<aside>Advertisement</aside>
<footer>Footer content</footer>
</body>
</html>
"""
# Extract only main content
markdown = trafilatura.extract(
    html,
    output_format='markdown',
    include_comments=False,
    include_tables=True,
    include_images=True,
    include_links=True,
    config=config
)
# Extract metadata: extract() returns a string (or None), not a dict,
# so individual fields are read via extract_metadata() instead.
# Recent trafilatura releases return an object with attributes here.
metadata = trafilatura.extract_metadata(html)
if metadata:
    print(f"Title: {metadata.title or 'N/A'}")
    print(f"Author: {metadata.author or 'N/A'}")
    print(f"Date: {metadata.date or 'N/A'}")
print(f"\nContent:\n{markdown or ''}")
Batch Processing:
import trafilatura
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
def process_url(url):
    """Extract markdown from URL"""
    downloaded = trafilatura.fetch_url(url)
    if downloaded:
        return trafilatura.extract(
            downloaded,
            output_format='markdown',
            include_links=True,
            include_images=True
        )
    return None

# Process multiple URLs in parallel
urls = [
    "https://example.com/article1",
    "https://example.com/article2",
    "https://example.com/article3",
]
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(process_url, urls))

for i, markdown in enumerate(results):
    if markdown:
        Path(f"article_{i}.md").write_text(markdown, encoding='utf-8')
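Numbered names like `article_0.md` lose the link between a file and its source URL. One possible approach, sketched here with a hypothetical `filename_from_url` helper, derives a stable file name from the URL's host and path:

```python
import re
from urllib.parse import urlparse

def filename_from_url(url: str, default: str = "index") -> str:
    """Build a filesystem-friendly .md name from a URL's host and path."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/") or default
    # Replace anything outside letters, digits, dot, and dash with a dash
    slug = re.sub(r"[^A-Za-z0-9.-]+", "-", raw).strip("-").lower()
    return f"{slug}.md"

print(filename_from_url("https://example.com/blog/My Article/"))
```

Query strings are ignored in this sketch, so URLs that differ only in their parameters would map to the same name.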
Pros:
- Intelligent content extraction (removes boilerplate)
- Built-in URL fetching with robust error handling
- Metadata extraction (title, author, date)
- Language detection
- Optimized for news articles and blog posts
- Fast C-based parsing
Cons:
- May strip too much content for general HTML
- Focused on article extraction (not general-purpose)
- Configuration complexity for edge cases
Best For: Web scraping, article extraction, LLM training data preparation
5. domscribe - The Semantic Preservationist
domscribe focuses on preserving the semantic meaning of HTML while converting to Markdown.
Installation:
pip install domscribe
Basic Usage:
from domscribe import html_to_markdown
html = """
<article>
<header>
<h1>Understanding Semantic HTML</h1>
<time datetime="2024-10-24">October 24, 2024</time>
</header>
<section>
<h2>Introduction</h2>
<p>Semantic HTML provides <mark>meaning</mark> to content.</p>
</section>
<aside>
<h3>Related Topics</h3>
<ul>
<li>Accessibility</li>
<li>SEO</li>
</ul>
</aside>
</article>
"""
markdown = html_to_markdown(html)
print(markdown)
Custom Options:
from domscribe import html_to_markdown, MarkdownOptions
options = MarkdownOptions(
preserve_semantic_structure=True,
include_aria_labels=True,
strip_empty_elements=True
)
markdown = html_to_markdown(html, options=options)
Pros:
- Preserves semantic HTML5 structure
- Handles modern web components well
- Clean API design
Cons:
- Still in early development (API may change)
- Limited documentation compared to mature alternatives
- Smaller community and fewer examples available
Best For: Semantic HTML5 documents, accessibility-focused projects, when HTML5 semantic structure preservation is critical
Note: While domscribe is newer and less battle-tested than alternatives, it fills a specific niche for semantic HTML preservation that other tools don’t prioritize.
6. html2md - The Async Powerhouse
html2md is designed for high-performance batch conversions with asynchronous processing.
Installation:
pip install html2md
Command-Line Usage:
# Convert entire directory
m1f-html2md convert ./website -o ./docs
# With custom settings
m1f-html2md convert ./website -o ./docs \
--remove-tags nav,footer \
--heading-offset 1 \
--detect-language
# Convert single file
m1f-html2md convert index.html -o readme.md
Programmatic Usage:
import asyncio
from html2md import convert_html
async def convert_files():
    """Async batch conversion"""
    html_files = [
        'page1.html',
        'page2.html',
        'page3.html'
    ]
    tasks = [convert_html(file) for file in html_files]
    results = await asyncio.gather(*tasks)
    return results

# Run conversion
results = asyncio.run(convert_files())
Pros:
- Asynchronous processing for high performance
- Intelligent content selector detection
- YAML frontmatter generation (great for Hugo!)
- Code language detection
- Parallel processing support
Cons:
- Requires Python 3.10+
- CLI-focused (less flexible API)
- Documentation could be more comprehensive
Best For: Large-scale migrations, batch conversions, Hugo/Jekyll migrations
Performance Benchmarking
Performance matters, especially when processing thousands of documents for LLM training or large-scale migrations. Understanding the relative speed differences between libraries helps you make informed decisions for your workflow.
Comparative Performance Analysis:
Based on typical usage patterns, here’s how these libraries compare across three realistic scenarios:
- Simple HTML: Basic blog post with text, headers, and links (5KB)
- Complex HTML: Technical documentation with nested tables and code blocks (50KB)
- Real Website: Full webpage including navigation, footer, sidebar, and ads (200KB)
Here’s example benchmark code you can use to test these libraries yourself:
import time
import html2text
from markdownify import markdownify
from html_to_markdown import convert
import trafilatura

def benchmark(html_content, iterations=100):
    """Benchmark conversion speed (total seconds per library)"""
    # perf_counter() is a monotonic, high-resolution clock intended for timing
    # html2text
    start = time.perf_counter()
    h = html2text.HTML2Text()
    for _ in range(iterations):
        _ = h.handle(html_content)
    html2text_time = time.perf_counter() - start
    # markdownify
    start = time.perf_counter()
    for _ in range(iterations):
        _ = markdownify(html_content)
    markdownify_time = time.perf_counter() - start
    # html-to-markdown
    start = time.perf_counter()
    for _ in range(iterations):
        _ = convert(html_content)
    html_to_markdown_time = time.perf_counter() - start
    # trafilatura
    start = time.perf_counter()
    for _ in range(iterations):
        _ = trafilatura.extract(html_content, output_format='markdown')
    trafilatura_time = time.perf_counter() - start
    return {
        'html2text': html2text_time,
        'markdownify': markdownify_time,
        'html-to-markdown': html_to_markdown_time,
        'trafilatura': trafilatura_time
    }
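Single-pass wall-clock timings can be noisy. A slightly more robust sketch, using only the standard library, repeats each measurement and reports the median. `time_converter` is a hypothetical helper that accepts any callable taking an HTML string, so the libraries above can be plugged in as, for example, `markdownify` or `html2text.HTML2Text().handle`.

```python
import statistics
import time
from typing import Callable

def time_converter(convert: Callable[[str], object], html: str,
                   iterations: int = 100, repeats: int = 5) -> float:
    """Return the median seconds per call across several repeated runs."""
    per_call = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(iterations):
            convert(html)
        per_call.append((time.perf_counter() - start) / iterations)
    return statistics.median(per_call)

# Stand-in converter so the sketch runs without third-party packages
sample_html = "<p>Hello <strong>world</strong></p>" * 50
seconds = time_converter(lambda html: html.replace("<p>", ""), sample_html)
print(f"{seconds * 1e6:.1f} µs per call")
```

Reporting a median rather than a single total makes one slow run (for example, a garbage-collection pause) less likely to skew the comparison.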
Typical Performance Characteristics (representative relative speeds):
| Package | Simple (5KB) | Complex (50KB) | Real Site (200KB) |
|---|---|---|---|
| html2text | Moderate | Slower | Slower |
| markdownify | Slower | Slower | Slowest |
| html-to-markdown | Fast | Fast | Fast |
| trafilatura | Fast | Very Fast | Very Fast |
| html2md (async) | Very Fast | Very Fast | Fastest |
Key Observations:
- html2md and trafilatura are fastest for complex documents, making them ideal for batch processing
- html-to-markdown offers the best balance of speed and features for production use
- markdownify is slower but the most flexible, a trade-off worth making when you need custom handlers
- html2text shows its age with slower performance, but remains stable for simple use cases
Note: Performance differences become significant only when processing hundreds or thousands of files. For occasional conversions, any library will work fine. Focus on features and customization options instead.
Real-World Use Cases
Theory is helpful, but practical examples demonstrate how these tools work in production. Here are four common scenarios with complete, production-ready code that you can adapt for your own projects.
Use Case 1: LLM Training Data Preparation
Requirement: Extract clean text from thousands of documentation pages
Recommended: trafilatura + parallel processing
import trafilatura
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor

def process_html_file(html_path):
    """Convert one HTML file to Markdown and return the output length in characters"""
    html_path = Path(html_path)
    html = html_path.read_text(encoding='utf-8')
    markdown = trafilatura.extract(
        html,
        output_format='markdown',
        include_links=False,  # Remove for cleaner training data
        include_images=False,
        include_comments=False
    )
    if markdown:
        # with_suffix() swaps the extension; Path.replace() would rename the file instead
        html_path.with_suffix('.md').write_text(markdown, encoding='utf-8')
        return len(markdown)
    return 0

if __name__ == "__main__":  # guard required for process pools on Windows and macOS
    # Process all HTML files in parallel
    html_files = list(Path('./docs').rglob('*.html'))
    with ProcessPoolExecutor(max_workers=8) as executor:
        char_counts = list(executor.map(process_html_file, html_files))
    print(f"Processed {len(html_files)} files")
    print(f"Total characters: {sum(char_counts):,}")
Use Case 2: Hugo Blog Migration
Requirement: Migrate WordPress blog to Hugo with frontmatter
Recommended: html2md CLI
Hugo is a popular static site generator that uses Markdown for content. For more Hugo-specific tips, check out our Hugo Cheat Sheet and learn about Adding Structured data markup to Hugo for better SEO.
# Convert all posts with frontmatter
m1f-html2md convert ./wordpress-export \
-o ./hugo/content/posts \
--generate-frontmatter \
--heading-offset 0 \
--remove-tags script,style,nav,footer
Or programmatically:
from html_to_markdown import convert, Options
from pathlib import Path
from bs4 import BeautifulSoup
import yaml

def migrate_post(html_file):
    """Convert WordPress HTML to Hugo markdown"""
    html_file = Path(html_file)
    html = html_file.read_text(encoding='utf-8')
    # Extract title from HTML
    soup = BeautifulSoup(html, 'html.parser')
    h1 = soup.find('h1')
    title = h1.get_text(strip=True) if h1 else 'Untitled'
    # Convert to markdown
    options = Options(strip_tags=['script', 'style', 'nav', 'footer'])
    markdown = convert(html, options=options)
    # Add Hugo frontmatter
    frontmatter = {
        'title': title,
        'date': '2024-10-24',  # placeholder; replace with the post's real date
        'draft': False,
        'tags': []
    }
    output = f"---\n{yaml.dump(frontmatter)}---\n\n{markdown}"
    # Save next to the source file; with_suffix() swaps .html for .md
    html_file.with_suffix('.md').write_text(output, encoding='utf-8')

# Process all posts
for html_file in Path('./wordpress-export').glob('*.html'):
    migrate_post(html_file)
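The programmatic example above hard-codes the post date. As a sketch of one alternative, the standard-library parser below pulls the first `<h1>` text and the first `<time datetime=...>` value out of the page. It assumes the export marks dates up with a `<time>` element, which may not hold for every WordPress theme. `PostMetaParser` and `build_frontmatter` are illustrative names.

```python
import json
from html.parser import HTMLParser

class PostMetaParser(HTMLParser):
    """Collect the first <h1> text and the first <time datetime> value."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.date = None
        self._in_h1 = False
        self._h1_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.title is None:
            self._in_h1 = True
        elif tag == "time" and self.date is None:
            self.date = dict(attrs).get("datetime")

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self.title = "".join(self._h1_parts).strip()

    def handle_data(self, data):
        if self._in_h1:
            self._h1_parts.append(data)

def build_frontmatter(html: str) -> str:
    """Return a YAML frontmatter block; JSON-quoted strings are valid YAML scalars."""
    parser = PostMetaParser()
    parser.feed(html)
    lines = [f"title: {json.dumps(parser.title or 'Untitled', ensure_ascii=False)}"]
    if parser.date:  # omit the date line when no <time> element was found
        lines.append(f"date: {json.dumps(parser.date)}")
    lines.append("draft: false")
    return "---\n" + "\n".join(lines) + "\n---\n"

sample = '<article><h1>Hello: <em>World</em></h1><time datetime="2024-10-24">Oct 24</time></article>'
print(build_frontmatter(sample))
```

Quoting the title through `json.dumps` keeps characters such as colons from being misread as YAML syntax.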
Use Case 3: Documentation Scraper with Custom Formatting
Requirement: Scrape technical docs with custom code block handling
Recommended: markdownify with custom converter
This approach is particularly useful for migrating documentation from wiki systems. If you’re managing documentation, you might also be interested in DokuWiki - selfhosted wiki and the alternatives for self-hosted documentation solutions.
from pathlib import Path
from markdownify import MarkdownConverter
import requests

class DocsConverter(MarkdownConverter):
    """Custom converter for technical documentation"""

    # *args/**kwargs absorb the third handler argument, whose name
    # differs between markdownify releases
    def convert_pre(self, el, text, *args, **kwargs):
        """Enhanced code block with syntax highlighting"""
        code = el.find('code')
        if code:
            # Extract language from class
            classes = code.get('class', [])
            language = next(
                (c.replace('language-', '') for c in classes if c.startswith('language-')),
                'text'
            )
            return f'\n```{language}\n{code.get_text()}\n```\n'
        return super().convert_pre(el, text, *args, **kwargs)

    def convert_div(self, el, text, *args, **kwargs):
        """Handle special documentation blocks"""
        classes = el.get('class', [])
        # Warning blocks
        if 'warning' in classes:
            return f'\n> ⚠️ **Warning**: {text}\n'
        # Info blocks
        if 'info' in classes or 'note' in classes:
            return f'\n> 💡 **Note**: {text}\n'
        return text

def scrape_docs(url):
    """Scrape and convert documentation page"""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of converting an error page
    markdown = DocsConverter().convert(response.text)
    return markdown

# Use it
docs_url = "https://docs.example.com/api-reference"
markdown = scrape_docs(docs_url)
Path('api-reference.md').write_text(markdown, encoding='utf-8')
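One caveat with the admonition handler above: a Markdown blockquote only continues while each line starts with `>`, so multi-paragraph warnings would spill out of the quote. A hypothetical helper such as `as_blockquote` could prefix every line. This is a standalone sketch, not part of markdownify:

```python
def as_blockquote(label: str, text: str) -> str:
    """Render text as a Markdown blockquote with a bold label on the first line."""
    lines = text.strip().splitlines()
    if not lines:
        return ""
    quoted = [f"> **{label}**: {lines[0]}"]
    # Blank lines inside the quote still need a bare '>' to keep it unbroken
    quoted += [f"> {line}" if line.strip() else ">" for line in lines[1:]]
    return "\n" + "\n".join(quoted) + "\n"

print(as_blockquote("Warning", "Back up your data first.\n\nThis cannot be undone."))
```

Inside a `convert_div` override, the handler could return `as_blockquote("Warning", text)` in place of the single-line f-string.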
Use Case 4: Newsletter to Markdown Archive
Requirement: Convert HTML email newsletters to readable markdown
Recommended: html2text with specific configuration
import html2text
import email
from pathlib import Path

def convert_newsletter(email_file):
    """Convert HTML email to markdown"""
    # Parse email in binary mode so each part can be decoded with its own charset
    with open(email_file, 'rb') as f:
        msg = email.message_from_binary_file(f)
    # Get HTML part
    html_content = None
    for part in msg.walk():
        if part.get_content_type() == 'text/html':
            charset = part.get_content_charset() or 'utf-8'
            html_content = part.get_payload(decode=True).decode(charset, errors='replace')
            break
    if not html_content:
        return None
    # Configure converter
    h = html2text.HTML2Text()
    h.ignore_images = False
    h.images_to_alt = True
    h.body_width = 0
    h.protect_links = True
    h.unicode_snob = True
    # Convert
    markdown = h.handle(html_content)
    # Add metadata
    subject = msg.get('Subject', 'No Subject')
    date = msg.get('Date', '')
    output = f"# {subject}\n\n*Date: {date}*\n\n---\n\n{markdown}"
    return output

# Process newsletter archive
for email_file in Path('./newsletters').glob('*.eml'):
    markdown = convert_newsletter(email_file)
    if markdown:
        output_file = email_file.with_suffix('.md')
        output_file.write_text(markdown, encoding='utf-8')
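Subject headers in real newsletters are often RFC 2047 encoded (for example `=?utf-8?b?...?=`), which would end up verbatim in the Markdown heading. The standard library's newer email API decodes headers and bodies when a message is parsed with `policy.default`. The sketch below builds a small message in memory and reads it back; the helper name `extract_html_and_subject` is illustrative.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

def extract_html_and_subject(raw: bytes):
    """Return (subject, html) from raw .eml bytes; html is None if no HTML part exists."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("html",))
    html = body.get_content() if body is not None else None
    return str(msg["Subject"] or "No Subject"), html

# Build a sample message so the sketch is self-contained
sample = EmailMessage()
sample["Subject"] = "Café news №5"
sample.set_content("Plain text fallback")
sample.add_alternative("<h1>Café news</h1><p>Hello!</p>", subtype="html")

subject, html = extract_html_and_subject(sample.as_bytes())
print(subject)
print(html)
```

The returned HTML string can be handed straight to `html2text.HTML2Text().handle()`.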
Recommendations by Scenario
Still unsure which library to choose? Here’s my definitive guide based on specific use cases. These recommendations come from hands-on experience with each library in production environments.
For Web Scraping & LLM Preprocessing
Winner: trafilatura
Trafilatura excels at extracting clean content while removing boilerplate. Perfect for:
- Building LLM training datasets
- Content aggregation
- Research paper collection
- News article extraction
For Hugo/Jekyll Migrations
Winner: html2md
Async processing and frontmatter generation make bulk migrations fast and easy:
- Batch conversions
- Automatic metadata extraction
- YAML frontmatter generation
- Heading level adjustment
For Custom Conversion Logic
Winner: markdownify
Subclass the converter for complete control:
- Custom tag handlers
- Domain-specific conversions
- Special formatting requirements
- Integration with existing BeautifulSoup code
For Type-Safe Production Systems
Winner: html-to-markdown
Modern, type-safe, and feature-complete:
- Full HTML5 support
- Comprehensive type hints
- Advanced table handling
- Active maintenance
For Simple, Stable Conversions
Winner: html2text
When you need something that “just works”:
- No dependencies
- Battle-tested
- Extensive configuration
- Wide platform support
Best Practices for LLM Preprocessing
Regardless of which library you choose, following these best practices will ensure high-quality Markdown output that’s optimized for LLM consumption. These patterns have proven essential in production workflows processing millions of documents.
1. Clean Before Converting
Always remove unwanted elements before conversion to get cleaner output and better performance:
from bs4 import BeautifulSoup
import trafilatura

def clean_and_convert(html):
    """Remove unwanted elements before conversion"""
    soup = BeautifulSoup(html, 'html.parser')
    # Remove unwanted elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
        element.decompose()
    # Remove ads and tracking
    for element in soup.find_all(class_=['ad', 'advertisement', 'tracking']):
        element.decompose()
    # Convert cleaned HTML
    markdown = trafilatura.extract(
        str(soup),
        output_format='markdown'
    )
    return markdown
2. Normalize Whitespace
Different converters handle whitespace differently. Normalize the output to ensure consistency across your corpus:
import re

def normalize_markdown(markdown):
    """Clean up markdown spacing"""
    # Remove multiple blank lines
    markdown = re.sub(r'\n{3,}', '\n\n', markdown)
    # Remove trailing whitespace
    markdown = '\n'.join(line.rstrip() for line in markdown.split('\n'))
    # Ensure single newline at end
    markdown = markdown.rstrip() + '\n'
    return markdown
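One thing to watch: collapsing blank lines across the whole string also alters blank lines inside fenced code blocks. A fence-aware variant might look like the sketch below. It assumes fences are lines starting with three backticks and does not handle tilde fences or indented code blocks.

```python
def normalize_markdown_keep_code(markdown: str) -> str:
    """Collapse blank-line runs and strip trailing spaces, leaving fenced code untouched."""
    fence = "`" * 3
    out = []
    in_code = False
    blank_run = 0
    for line in markdown.split("\n"):
        if line.lstrip().startswith(fence):
            in_code = not in_code  # entering or leaving a fenced block
            blank_run = 0
            out.append(line.rstrip())
            continue
        if in_code:
            out.append(line)  # keep code lines exactly as they are
            continue
        stripped = line.rstrip()
        if stripped == "":
            blank_run += 1
            if blank_run > 1:
                continue  # allow at most one blank line in a row
        else:
            blank_run = 0
        out.append(stripped)
    return "\n".join(out).rstrip() + "\n"
```

This matters mostly for languages where vertical spacing carries meaning to readers, such as long Python or YAML samples in documentation.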
3. Validate Output
Quality control is essential. Implement validation to catch conversion errors early:
def validate_markdown(markdown):
    """Validate markdown quality"""
    issues = []
    # Check for HTML remnants
    if '<' in markdown and '>' in markdown:
        issues.append("HTML tags detected")
    # Check for links with an empty target
    if '[' in markdown and ']()' in markdown:
        issues.append("Empty link detected")
    # Check for unbalanced code fences (an odd number means one is unclosed)
    code_block_count = markdown.count('```')
    if code_block_count % 2 != 0:
        issues.append("Unclosed code block")
    return len(issues) == 0, issues
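The substring checks above are deliberately cheap, but `'<' in markdown and '>' in markdown` also fires on comparison operators, autolinks such as `<https://example.com>`, and HTML shown inside code samples. A stricter sketch, assuming only backtick-delimited code needs to be ignored (`find_html_remnants` is a hypothetical name):

```python
import re

FENCED_CODE = re.compile(r"`{3}.*?`{3}", re.DOTALL)
INLINE_CODE = re.compile(r"`[^`\n]*`")
# Opening or closing tag whose name starts with a letter, e.g. <div class="x"> or </p>
HTML_TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9-]*(?:\s[^<>]*)?/?>")

def find_html_remnants(markdown: str) -> list[str]:
    """Return tag-like substrings found outside fenced and inline code."""
    text = FENCED_CODE.sub("", markdown)
    text = INLINE_CODE.sub("", text)
    return HTML_TAG.findall(text)

print(find_html_remnants('Compare a < b > c, then <span class="x">oops</span>'))
```

Returning the matched tags rather than a boolean makes the warning in a batch log easier to act on.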
4. Batch Processing Template
When processing large document collections, use this production-ready template with proper error handling, logging, and parallel processing:
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor
import trafilatura
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# normalize_markdown() and validate_markdown() are the helpers from sections 2 and 3

def process_file(html_path):
    """Process single HTML file"""
    try:
        html = Path(html_path).read_text(encoding='utf-8')
        markdown = trafilatura.extract(
            html,
            output_format='markdown',
            include_links=True,
            include_images=False
        )
        if markdown:
            # Normalize
            markdown = normalize_markdown(markdown)
            # Validate
            is_valid, issues = validate_markdown(markdown)
            if not is_valid:
                logger.warning(f"{html_path}: {', '.join(issues)}")
            # Save
            output_path = Path(html_path).with_suffix('.md')
            output_path.write_text(markdown, encoding='utf-8')
            return True
        return False
    except Exception as e:
        logger.error(f"Error processing {html_path}: {e}")
        return False

def batch_convert(input_dir, max_workers=4):
    """Convert all HTML files in directory"""
    html_files = list(Path(input_dir).rglob('*.html'))
    logger.info(f"Found {len(html_files)} HTML files")
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(process_file, html_files))
    success_count = sum(results)
    logger.info(f"Successfully converted {success_count}/{len(html_files)} files")

# Usage (the guard is required for process pools on Windows and macOS)
if __name__ == "__main__":
    batch_convert('./html_docs', max_workers=8)
Conclusion
The Python ecosystem offers mature, production-ready tools for HTML-to-Markdown conversion, each optimized for different scenarios. Your choice should align with your specific requirements:
- Quick conversions: Use html2text for its simplicity and zero dependencies
- Custom logic: Use markdownify for maximum flexibility through subclassing
- Web scraping: Use trafilatura for intelligent content extraction with boilerplate removal
- Bulk migrations: Use html2md for async performance on large-scale projects
- Production systems: Use html-to-markdown for type safety and comprehensive HTML5 support
- Semantic preservation: Use domscribe for maintaining HTML5 semantic structure
Recommendations for LLM Workflows
For LLM preprocessing workflows, a two-tier approach is recommended:
- Start with trafilatura for initial content extraction. It removes navigation, ads, and boilerplate while preserving the main content
- Fall back to html-to-markdown for complex documents requiring precise structure preservation, such as technical documentation with tables and code blocks
This combination covers the large majority of real-world scenarios.
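The two-tier idea can be sketched as a small wrapper. To keep the example runnable without third-party packages, the two converters are passed in as plain callables; in practice `primary` might wrap `trafilatura.extract(..., output_format='markdown')` and `fallback` might wrap `html_to_markdown.convert`. The `min_chars` threshold is an arbitrary assumption for deciding when extraction has stripped too much.

```python
from typing import Callable, Optional

Converter = Callable[[str], Optional[str]]

def convert_with_fallback(html: str, primary: Converter, fallback: Converter,
                          min_chars: int = 200) -> str:
    """Use the primary converter unless it fails or returns suspiciously little text."""
    try:
        result = primary(html)
    except Exception:
        result = None  # treat extractor errors like an empty result
    if result and len(result.strip()) >= min_chars:
        return result
    return fallback(html) or ""

# Stand-in converters for demonstration only
def too_aggressive(html):
    return None

def keeps_everything(html):
    return "# Title\n\nFull body text"

print(convert_with_fallback("<h1>Title</h1>", too_aggressive, keeps_everything))
```

Tuning `min_chars` against a sample of your own pages is likely more useful than any fixed default.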
Next Steps
Most of these tools are actively maintained; html2text sees few updates and domscribe is still in early development. Before committing to one:
- Install 2-3 libraries that match your use case
- Test them with your actual HTML samples
- Benchmark performance with your typical document sizes
- Choose based on output quality, not just speed
The Python ecosystem for HTML-to-Markdown conversion has matured significantly, and you can’t go wrong with any of these choices for their intended use cases.
Additional Resources
- html2text Documentation
- markdownify on PyPI
- html-to-markdown GitHub
- trafilatura Documentation
- html2md Documentation
- domscribe on PyPI
Note: This comparison is based on analysis of official documentation, community feedback, and library architecture. Performance characteristics are representative of typical usage patterns. For specific use cases, run your own benchmarks with your actual HTML samples.
Other Useful Articles
- Markdown Cheatsheet
- Using Markdown Code Blocks
- Converting Word Documents to Markdown: A Complete Guide
- Convert HTML content to Markdown using LLM and Ollama
- cURL Cheatsheet
- Hugo Static Site Generator Cheatsheet
- Adding Structured data markup to Hugo
- Dokuwiki - selfhosted wiki and the alternatives
- Using Obsidian for Personal Knowledge Management