Andrew Sutherland wrote:
> For B2G, the stock options for capability limitation seem to be:
> - HTML5 iframe sandbox support, still in-progress. It's not clear to
> me that this will be as effective as nsIContentPolicy in preventing
> information leakage.
AFAICT, non-leakage of the information that some resources got loaded
is not a goal for sandboxed iframes today. If we wanted to use
sandboxing to prevent fetches of non-attachment images, we'd have to
design a new flag/feature for sandboxing.
Or for fragments you could create a document using
document.implementation.createHTMLDocument("") (to obtain a document
flagged as "loaded as data") and then set innerHTML on a node owned by
that document. I'm not sure whether this works cross-engine, but it
should work in Gecko. (It's been on my list of things to test and to
email the WHATWG list about, but I haven't gotten around to it.)
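As a sketch of what I mean (parseInertFragment is just an illustrative
name, and whether loads stay inert cross-engine is exactly the open
question):

  // Parse untrusted markup inside a document created via
  // createHTMLDocument(), which Gecko flags as "loaded as data",
  // so the parse should not trigger network fetches or run script.
  function parseInertFragment(untrustedHtml) {
    var doc = document.implementation.createHTMLDocument("");
    var container = doc.createElement("div");
    container.innerHTML = untrustedHtml; // parsed in the inert document
    return container; // sanitize before importing into a live document
  }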
I think it would not be a good idea to expose that functionality to
Web content. History suggests that Web content will become dependent
on the weirdest corner cases when provided with the opportunity to do
so. The functionality in nsIParserUtils hasn't been designed carefully
enough or with multi-vendor participation to be supported forever
after Web content becomes dependent on its corner cases.
> It seems like being able to reuse the existing HTML sanitizer (and its
> convertToPlainText method!) would be a huge win for the B2G e-mail
> client, but the use case does seem pretty specific to e-mail clients.
nsIParserUtils is motivated by the needs of Thunderbird. But we are
better positioned to change the callers in Thunderbird than to change
callers all over the Web. Even in the case of Firefox extensions, we
are in a better position to evangelize and curate extensions than all
the Web content out there.
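For concreteness, chrome-privileged code calls convertToPlainText
roughly like this (a sketch from memory; the flag constants come from
nsIDocumentEncoder, so check the IDL before copying):

  // Chrome code only; nsIParserUtils is not exposed to Web content.
  var parserUtils = Components.classes["@mozilla.org/parserutils;1"]
                              .getService(Components.interfaces.nsIParserUtils);
  // Convert untrusted HTML to plain text, wrapping at 72 columns.
  var text = parserUtils.convertToPlainText(
      untrustedHtml,
      Components.interfaces.nsIDocumentEncoder.OutputFormatted |
      Components.interfaces.nsIDocumentEncoder.OutputWrap,
      72);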
> cc'ing Henri Sivonen as he recently overhauled the HTML sanitizer for
> Gecko and Thunderbird and it's not clear if he's on this list.
I wasn't. Thanks. I'm on the list now.
Jonas Sicking wrote:
> I would love to expose an API like this to web content. I suspect this
> is something that would improve security on the web since it would
> mean using libraries which we can ensure are more correct, rather than
> ones of various quality that are floating around out there.
In principle, Mozilla could publish a library in JS with the same level
of correctness we'd get from baking the feature into Gecko's C++.
> IE has for a long time had an API called toStaticHTML. Unfortunately
> it has many shortcomings (including bad performance, and only being
> aimed at solving issues with scripting, not callbacks like images), so
> we probably want to find something else.
In a presentation to folks who might want to do browser-related
security research and feature design, Collin Jackson used toStaticHTML
as an example of a bad security feature. Even though it looks
conceptually simple on the surface, the details are complicated and
ill-defined.
> I'd love to see an API which lets you pass in a set of configuration
> options and an HTML string, and returns a DOM. Something like:
>
> options = {
>   allowElements: ["div", "p", "table", "tr", "td", "span", "hr", "font"],
>   allowAttributes: { table: "bgcolor", font: ["color", "size"] },
>   allowScript: false, // disables onX attributes and javascript: URLs too
The above could be implemented as a JavaScript library with relative
ease. Is there a reason to believe that putting the tree traversal on
the C++ side would be necessary for performance?
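For illustration, the core of such a library could be a plain
recursive walk over an inert tree. This is a rough sketch against the
options object above; a real library would also need to handle URLs in
attribute values and decide whether to keep the children of removed
elements:

  // Remove elements and attributes that aren't in the whitelists.
  // Assumes the tree was parsed into an inert ("loaded as data") document.
  function sanitize(node, options) {
    var child = node.firstElementChild;
    while (child) {
      var next = child.nextElementSibling;
      if (options.allowElements.indexOf(child.localName) === -1) {
        node.removeChild(child); // drops the whole subtree
      } else {
        // allowAttributes values may be a string or an array above.
        var allowed = [].concat(options.allowAttributes[child.localName] || []);
        for (var i = child.attributes.length - 1; i >= 0; i--) {
          var name = child.attributes[i].name;
          if (allowed.indexOf(name) === -1) {
            child.removeAttribute(name);
          }
        }
        sanitize(child, options);
      }
      child = next;
    }
  }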
> allowResourcesFrom: ["http://mysite"],
> allowStyleProperties: ["color", "background", "font-size", "font-weight"]
> }
Sanitizing CSS is much more difficult than sanitizing HTML. That's
because sometimes sites want to roundtrip CSS properties and values
that the browser doesn't support, and the behavior of the CSS parser
depends on which properties and values the parser supports. It's worth
noting that Daniel Glazman wrote a CSS parser in JavaScript for
BlueGriffon in order to retain properties that Gecko's CSS parser
would discard without making them available in the object model.
Note that whenever nsTreeSanitizer finds -moz-binding, all CSS that
Gecko doesn't support (even the kind that some
contenteditable/designMode-based editors would want to keep) is
removed as collateral damage.
> myDocFragment = SafeDOMParser("untrusted html here", options);
Chances are there are use cases for fragment parsing and use cases for
complete-document parsing.
> It might also be very cool to allow callbacks for various things so
> that sites can have more complex policies, as well as allowing the
> page to modify elements as they are being created. But I think it's
> important that the "easy" way of doing things is to pass whitelists to
> the API.
Even though I have resisted supporting caller-supplied whitelists in
nsTreeSanitizer, I think site-supplied whitelists are the right way to
go for a Web-exposed API. That's because the contents of the
whitelists will change over time and Web content will become dependent
on the exact contents of the whitelists, so if the whitelist were
browser-provided, we could never change it in the future. On the other
hand, in the case of APIs that aren't exposed to Web content, such as
nsTreeSanitizer, I think it makes more sense to make the configuration
use-case-driven than to make the API support caller-supplied
whitelists.
> There is also the problem that I think we every now and then
> pre-cache DNS names for normal <a> anchors. This would allow for
> checking if someone has read an email.
Whoa! I hope this doesn't happen if the document is marked as "loaded
as data". Does it?
Paul Theriault wrote:
> +1, this would be a very useful API, and for much more than B2G.
>
> As an example use case: Instant messaging in Thunderbird currently
> takes the approach of using DOMParser.parseFromString to parse
> untrusted message strings, then removing all elements/attributes based
> on a whitelist. (Similar to Ian's proposal above.)
nsIParserUtils.parseFragment exists in order to facilitate the
implementation of IM in Thunderbird.
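For reference, the call from chrome JS looks roughly like this (the
sanitizer flags live on nsIParserUtils; treat the details as
approximate and check the IDL):

  // Chrome code only. Parse an untrusted IM message into a
  // sanitized, inert document fragment.
  var parserUtils = Components.classes["@mozilla.org/parserutils;1"]
                              .getService(Components.interfaces.nsIParserUtils);
  var fragment = parserUtils.parseFragment(
      untrustedMessage,
      Components.interfaces.nsIParserUtils.SanitizerDropForms,
      false,          // the input is HTML, not XML
      null,           // no base URI for resolving links
      targetElement); // context node the fragment will be inserted under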
> At a glance Jonas's approach seems safer to me, so that only allowed
> elements/tags actually get instantiated at all, but my opinion isn't
> very well informed.
I'd expect any solution to involve instantiating all element nodes
first as inert and then throwing some of the nodes away. That's how
all Gecko-provided sanitization features work, because making the HTML
tree builder work without some nodes actually being there would be
more complicated.
> The real work to me though is coming up with the list of
> elements/attributes/allowed styles - to be useful I think we should be
> providing a safe default (even if we allow developers to override them
> with a policy of some kind).
For a Web-exposed API, I think we shouldn't have a default whitelist,
because in the future we couldn't change the whitelist out of fear of
existing sites depending on weird corner cases, and it would be sad to
have sites throw away harmless new elements introduced after the
whitelist was created.
> Which is why I really like the idea of combining capabilities with
> sanitisation - something like CSP allows developers to explicitly
> define what behavior is allowed, rather than trying to achieve this
> effect by restricting a DOM to a subset of HTML tags.
To me, that sort of thing looks like an approach that won't be
successful on the Web, because Web authors will manage to make their
code rely on the craziest and most esoteric corner-case behaviors of
the browsers they test with. That is, there's the risk that sites
would become dependent on the exact mapping from permitted behaviors
to specific elements and attributes, so we wouldn't be able to keep
adjusting what the permitted behaviors mean in the future anyway.
Ian Melven wrote:
> yes, toStaticHTML has been on the security roadmap as an idea for
> some time. nothing seems to have happened on it since it has issues
> as Jonas mentioned, but I think there's significant desire to make
> something like this available to content.
Do we have a precisely reverse-engineered spec for toStaticHTML yet? I
think we shouldn't support toStaticHTML for the same reasons I think
we shouldn't expose any browser API with a default whitelist to Web
content.
I think it would be fine for Mozilla to publish a JavaScript library
with a default whitelist, though. Users of a JavaScript library would
get a snapshot of the whitelist at the time they copy the library onto
their own site, but this wouldn't prevent further development of new
versions of the library the way further development of a
browser-provided API would be prevented. Users of the library could
update to newer versions at their own pace, testing that the new
versions make sense for their site.
I suggest starting out by packaging the sort of sanitizing
functionality Jonas proposed as a JavaScript library written on top of
DOMParser for a full document from a string, XHR for a full document
from a URL, and document.implementation.createHTMLDocument("").body.innerHTML
for fragments. If there's a very good reason why the sanitization
should be browser-assisted (i.e. partially in C++), I think the issue
should be raised on the WHATWG list instead of just putting something
in B2G.
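Concretely, the three entry points would look something like this (a
sketch; sanitize() stands for the hypothetical whitelist walker
sketched earlier):

  // Full document from a string:
  var doc = new DOMParser().parseFromString(untrustedHtml, "text/html");
  sanitize(doc.documentElement, options);

  // Full document from a URL:
  var xhr = new XMLHttpRequest();
  xhr.open("GET", untrustedUrl);
  xhr.responseType = "document";
  xhr.onload = function () {
    sanitize(xhr.response.documentElement, options);
  };
  xhr.send();

  // Fragment:
  var inertDoc = document.implementation.createHTMLDocument("");
  inertDoc.body.innerHTML = untrustedFragment;
  sanitize(inertDoc.body, options);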
--
Henri Sivonen
[email protected]
http://hsivonen.iki.fi/