When processing text scraped from the web or user-generated content, you’ll often need to remove HTML tags while keeping the readable text. Many developers reach for external packages like BeautifulSoup, but you can achieve the same goal using only Python’s standard library.
The following snippet is a minimal and dependency-free solution for stripping HTML tags from a string. It’s based on a community contribution by Eloff on Stack Overflow.
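from io import StringIO
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collect only the text content of an HTML string."""

    def __init__(self):
        # convert_charrefs=True (the default) decodes references such as &amp;
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, d):
        # Called by the parser for every run of text between tags
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.get_data()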
How it works
- HTMLParser is part of Python’s standard library and can process HTML input incrementally.
- The custom subclass MLStripper overrides the handle_data() method to capture only text content.
- StringIO efficiently collects the output as the parser processes the HTML.
- The helper function strip_tags() simply feeds the HTML input and returns the collected text.
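The incremental processing mentioned in the list can be sketched as follows. This is a minimal illustration rather than part of the original snippet: the class name TextCollector and the helper strip_tags_chunks() are hypothetical, and the chunk boundaries are arbitrary. The parser buffers incomplete markup between feed() calls, so a tag split across two chunks should still be recognized.

```python
from html.parser import HTMLParser
from io import StringIO


class TextCollector(HTMLParser):
    """Hypothetical stand-in for MLStripper: keeps text, drops tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, data):
        self.text.write(data)


def strip_tags_chunks(chunks):
    """Feed HTML piece by piece and return the collected text."""
    parser = TextCollector()
    for chunk in chunks:
        parser.feed(chunk)  # incomplete tags wait in the buffer for more data
    parser.close()          # signal end of input and flush the buffer
    return parser.text.getvalue()


# The <strong> tag is deliberately split across two chunks.
chunks = ["<p>Hello <stro", "ng>world</strong>!</p>"]
print(strip_tags_chunks(chunks))
```

This prints Hello world!, the same text a single feed() call would produce.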
Example usage
html = "<p>Hello <strong>world</strong>! This is <a href='#'>a link</a>.</p>"
text = strip_tags(html)
print(text)
Output:
Hello world! This is a link.
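Character references are worth a quick check as well. With convert_charrefs=True, the parser decodes references such as &amp; before calling handle_data(), so the result is plain text, not HTML. The sketch below assumes a compact equivalent of the snippet above (the names _TextOnly and plain_text() are hypothetical) and uses html.escape() from the standard library for the case where the text is placed back into a page.

```python
import html
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Hypothetical compact variant: collects text fragments in a list."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_text(markup):
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


text = plain_text("<p>Fish &amp; chips &lt;3</p>")
print(text)               # Fish & chips <3
print(html.escape(text))  # Fish &amp; chips &lt;3
```

Because the decoded text can contain characters like < and &, it should be escaped again before it is embedded in HTML.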
Why this approach?
- No dependencies: uses only Python’s built-in modules.
- Lightweight and fast: suitable for small to medium HTML snippets.
- Safe and controlled: the parser only tokenizes the input; it never executes scripts or pulls in external libraries.
For larger or malformed HTML documents, you might still prefer robust parsers like BeautifulSoup or lxml. But for most basic HTML cleanup tasks, this standard-library solution is elegant and effective.
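One limitation of the basic version is that HTMLParser also passes the contents of <script> and <style> elements to handle_data(), so JavaScript and CSS end up in the output as text. A possible extension, sketched here on the assumption that such content should simply be dropped, tracks whether the parser is currently inside one of those elements. The names VisibleTextStripper, SKIPPED_TAGS, and strip_visible_text() are made up for this example.

```python
from html.parser import HTMLParser
from io import StringIO

# Elements whose contents are code or styling, not readable text (assumed list)
SKIPPED_TAGS = {"script", "style"}


class VisibleTextStripper(HTMLParser):
    """Hypothetical variant of MLStripper that ignores script and style bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = StringIO()
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside <script> and <style>
        if self._skip_depth == 0:
            self.text.write(data)


def strip_visible_text(markup):
    parser = VisibleTextStripper()
    parser.feed(markup)
    parser.close()
    return parser.text.getvalue()


sample = (
    "<p>Hi </p><script>alert('hi');</script>"
    "<style>p { color: red; }</style><p>there</p>"
)
print(strip_visible_text(sample))
```

For this sample the sketch prints Hi there, leaving out both the script and the stylesheet.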