Description
This little example demonstrates how easy HTML sanitization might be with BeautifulSoup:
from bs4 import BeautifulSoup
html_content = """
<html>
<body>
<h1 class="title" >Title</h1>
<script>alert('This is malicious');</script>
<p id="para1" style="color: red;">This is a paragraph.</p>
</body>
</html>
"""
soup = BeautifulSoup(html_content, "html.parser")
# Remove specific tags
for tag in soup(["script", "style"]):
    tag.decompose()
# Sanitize attributes
allowed_attributes = {"p": ["id"], "h1": []}
for tag in soup.find_all(True):
    if tag.name in allowed_attributes:
        tag.attrs = {key: value for key, value in tag.attrs.items() if key in allowed_attributes[tag.name]}
    else:
        tag.attrs = {}  # Remove all attributes for tags not in the allowed list
print(soup.prettify())
We should consider this as part of #631
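
To make the idea easier to discuss, the example could be wrapped into a reusable function. This is only a rough sketch: the name `sanitize_html`, its parameters, and the default lists are hypothetical and not part of any existing API in this project. It uses the same BeautifulSoup calls as the example (`decompose()`, `find_all(True)`, `tag.attrs`).

```python
from bs4 import BeautifulSoup

# Hypothetical defaults for illustration, taken from the example in this issue.
DEFAULT_REMOVED_TAGS = ("script", "style")
DEFAULT_ALLOWED_ATTRIBUTES = {"p": ["id"], "h1": []}


def sanitize_html(html, removed_tags=DEFAULT_REMOVED_TAGS, allowed_attributes=None):
    """Drop removed_tags and keep only allowlisted attributes (hypothetical helper)."""
    if allowed_attributes is None:
        allowed_attributes = DEFAULT_ALLOWED_ATTRIBUTES
    soup = BeautifulSoup(html, "html.parser")

    # Remove unwanted elements together with their contents.
    for tag in soup(list(removed_tags)):
        tag.decompose()

    # Keep only allowlisted attributes; tags without an entry lose all attributes.
    for tag in soup.find_all(True):
        allowed = allowed_attributes.get(tag.name, [])
        tag.attrs = {key: value for key, value in tag.attrs.items() if key in allowed}

    return str(soup)


print(sanitize_html('<p id="a" onclick="x()">Hi<script>alert(1)</script></p>'))
```

Note that this sketch, like the example, only allowlists attributes. Tags outside `removed_tags` are kept (with their attributes stripped), so a real implementation would probably also want an allowlist for tag names.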