Andrew Sutherland wrote:
> For B2G, the stock options for capability limitation seem to be:
> - HTML5 iframe sandbox support, still in-progress. It's not clear to
> me that this will be as effective as nsIContentPolicy in preventing
> information leakage.
AFAICT, non-leakage of the information that some resources got loaded
is not a goal for sandboxed iframes today. If we wanted to use
sandboxing to prevent fetches of non-attachment images, we'd have to
design a new flag/feature for sandboxing.
Or for fragments you could create a document using
document.implementation.createHTMLDocument("") (to obtain a document
flagged as "loaded as data") and then set innerHTML on a node owned by
that document. I'm not sure whether this works cross-engine, but it
should work in Gecko. (It's been on my list of things to test and to
email the WHATWG list about, but I haven't gotten around to it.)
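As a sketch of what I mean (parseInertFragment is just an illustrative
name, and whether loads stay inert cross-engine is exactly the open
question):

  // Parse untrusted markup inside a document created via
  // createHTMLDocument(), which Gecko flags as "loaded as data",
  // so the parse should not trigger network fetches or run script.
  function parseInertFragment(untrustedHtml) {
    var doc = document.implementation.createHTMLDocument("");
    var container = doc.createElement("div");
    container.innerHTML = untrustedHtml; // parsed in the inert document
    return container; // sanitize before importing into a live document
  }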
I think it would not be a good idea to expose that functionality to
Web content. History suggests that Web content will become dependent
on the weirdest corner cases when provided with the opportunity to do
so. The functionality in nsIParserUtils hasn't been designed carefully
enough or with multi-vendor participation to be supported forever
after Web content becomes dependent on its corner cases.
> It seems like being able to reuse the existing HTML sanitizer (and its
> convertToPlainText method!) would be a huge win for the B2G e-mail
> client, but the use case does seem pretty specific to e-mail clients.
nsIParserUtils is motivated by the needs of Thunderbird. But we are
better positioned to change the callers in Thunderbird than to change
callers all over the Web. Even in the case of Firefox extensions, we
are in a better position to evangelize and curate extensions than all
the Web content out there.
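For concreteness, chrome-privileged code calls convertToPlainText
roughly like this (a sketch from memory; the flag constants come from
nsIDocumentEncoder, so check the IDL before copying):

  // Chrome code only; nsIParserUtils is not exposed to Web content.
  var parserUtils = Components.classes["@mozilla.org/parserutils;1"]
                              .getService(Components.interfaces.nsIParserUtils);
  // Convert untrusted HTML to plain text, wrapping at 72 columns.
  var text = parserUtils.convertToPlainText(
      untrustedHtml,
      Components.interfaces.nsIDocumentEncoder.OutputFormatted |
      Components.interfaces.nsIDocumentEncoder.OutputWrap,
      72);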
> cc'ing Henri Sivonen as he recently overhauled the HTML sanitizer for
> Gecko and Thunderbird and it's not clear if he's on this list.
I wasn't. Thanks. I'm on the list now.
Jonas Sicking wrote:
> I would love to expose an API like this to web content. I suspect this
> is something that would improve security on the web since it would
> mean using libraries which we can ensure are more correct, rather than
> ones of various quality that are floating around out there.
In principle, Mozilla could publish a library in JS with the same level
of correctness we'd get from baking the feature into Gecko's C++.
> IE has for a long time had an API called toStaticHTML. Unfortunately
> it has many shortcomings (including bad performance, and only being
> aimed at solving issues with scripting, not callbacks like images), so
> we probably want to find something else.
In a presentation to folks who might want to do browser-related
security research and feature design, Collin Jackson used toStaticHTML
as an example of a bad security feature. Even though it looks
conceptually simple on the surface, the details are complicated and
ill-defined.
> I'd love to see an API which lets you pass in a set of configuration
> options and an HTML string, and returns a DOM. Something like:
>
> options = {
>   allowElements: ["div", "p", "table", "tr", "td", "span", "hr", "font"],
>   allowAttributes: { table: "bgcolor", font: ["color", "size"] },
>   allowScript: false, // disables onX attributes and javascript: URLs too
The above could be implemented as a JavaScript library with relative
ease. Is there a reason to believe that putting the tree traversal on
the C++ side would be necessary for performance?
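For illustration, the core of such a library could be a plain
recursive walk over an inert tree. This is a rough sketch against the
options object above; a real library would also need to handle URLs in
attribute values and decide whether to keep the children of removed
elements:

  // Remove elements and attributes that aren't in the whitelists.
  // Assumes the tree was parsed into an inert ("loaded as data") document.
  function sanitize(node, options) {
    var child = node.firstElementChild;
    while (child) {
      var next = child.nextElementSibling;
      if (options.allowElements.indexOf(child.localName) === -1) {
        node.removeChild(child); // drops the whole subtree
      } else {
        // allowAttributes values may be a string or an array above.
        var allowed = [].concat(options.allowAttributes[child.localName] || []);
        for (var i = child.attributes.length - 1; i >= 0; i--) {
          var name = child.attributes[i].name;
          if (allowed.indexOf(name) === -1) {
            child.removeAttribute(name);
          }
        }
        sanitize(child, options);
      }
      child = next;
    }
  }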
> allowResourcesFrom: ["http://mysite"],
> allowStyleProperties: ["color", "background", "font-size", "font-weight"]
> }
Sanitizing CSS is much more difficult than sanitizing HTML. That's
because sometimes sites want to roundtrip CSS properties and values
that the browser doesn't support, and the behavior of the CSS parser
depends on which properties and values the parser supports. It's worth
noting that Daniel Glazman wrote a CSS parser in JavaScript for
BlueGriffon in order to retain properties that Gecko's CSS parser
would discard without making them available in the object model.
Note that whenever nsTreeSanitizer finds -moz-binding, all CSS that
Gecko doesn't support (even the kind that some
contenteditable/designMode-based editors would want to keep) is
removed as collateral damage.
> myDocFragment = SafeDOMParser("untrusted html here", options);
Chances are there are use cases for fragment parsing and use cases for
complete-document parsing.
> It might also be very cool to allow callbacks for various things so
> that sites can have more complex policies, as well as allowing the
> page to modify elements as they are being created. But I think it's
> important that the "easy" way of doing things is to pass whitelists to
> the API.
Even though I have resisted supporting caller-supplied whitelists in
nsTreeSanitizer, I think site-supplied whitelists are the right way to
go for a Web-exposed API. That's because the contents of the
whitelists will change over time and Web content will become dependent
on the exact contents of the whitelists, so if the whitelist were
browser-provided, we could never change it in the future. On the other
hand, in the case of APIs that aren't exposed to Web content, such as
nsTreeSanitizer, I think it makes more sense to make the configuration
use-case-driven than to make the API support caller-supplied
whitelists.
> There is also the problem that I think we every now and then
> pre-cache DNS names for normal <a> anchors. This would allow for
> checking if someone has read an email.
Whoa! I hope this doesn't happen if the document is marked as "loaded
as data". Does it?
Paul Theriault wrote:
> +1, this would be a very useful API, and for much more than B2G.
>
> As an example use case: Instant messaging in Thunderbird currently
> takes the approach of using DOMParser.parseFromString to parse
> untrusted message strings, then removing all elements/attributes based
> on a whitelist. (Similar to Ian's proposal above.)
nsIParserUtils.parseFragment exists in order to facilitate the
implementation of IM in Thunderbird.
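For reference, the call from chrome JS looks roughly like this (the
sanitizer flags live on nsIParserUtils; treat the details as
approximate and check the IDL):

  // Chrome code only. Parse an untrusted IM message into a
  // sanitized, inert document fragment.
  var parserUtils = Components.classes["@mozilla.org/parserutils;1"]
                              .getService(Components.interfaces.nsIParserUtils);
  var fragment = parserUtils.parseFragment(
      untrustedMessage,
      Components.interfaces.nsIParserUtils.SanitizerDropForms,
      false,          // the input is HTML, not XML
      null,           // no base URI for resolving links
      targetElement); // context node the fragment will be inserted under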
> At a glance Jonas's approach seems safer to me, so that only allowed
> elements/tags actually get instantiated at all, but my opinion isn't
> very well informed.
I'd expect any solution to involve instantiating all element nodes
first as inert and then throwing some of the nodes away. That's how
all Gecko-provided sanitization features work, because making the HTML
tree builder work without some nodes actually being there would be
more complicated.
> The real work to me though is coming up with the list of
> elements/attributes/allowed styles - to be useful I think we should be
> providing a safe default (even if we allow developers to override them
> with a policy of some kind).
For a Web-exposed API, I think we shouldn't have a default whitelist,
because in the future we couldn't change the whitelist out of fear of
existing sites depending on weird corner cases, and it would be sad to
have sites throw away harmless new elements introduced after the
whitelist was created.
> Which is why I really like the idea of combining capabilities with
> sanitisation - something like CSP allows developers to explicitly
> define what behavior is allowed, rather than trying to achieve this
> effect by restricting a DOM to a subset of HTML tags.
To me, that sort of thing looks like an approach that won't be
successful on the Web, because Web authors will manage to make their
code rely on the craziest and most esoteric corner-case behaviors of
the browsers they test with. That is, there's the risk that sites
would become dependent on the exact mapping from permitted behaviors
to specific elements and attributes, so we wouldn't be able to keep
adjusting what the permitted behaviors mean in the future anyway.
Ian Melven wrote:
> yes, toStaticHTML has been on the security roadmap as an idea for
> some time. nothing seems to have happened on it since it has issues
> as Jonas mentioned, but I think there's significant desire to make
> something like this available to content.
Do we have a precisely reverse-engineered spec for toStaticHTML yet? I
think we shouldn't support toStaticHTML for the same reasons I think
we shouldn't expose any browser API with a default whitelist to Web
content.
I think it would be fine for Mozilla to publish a JavaScript library
with a default whitelist, though. Users of a JavaScript library would
get a snapshot of the whitelist at the time they copy the library onto
their own site, but this wouldn't prevent further development of new
versions of the library the way further development of a
browser-provided API would be prevented. Users of the library could
update to newer versions at their own pace, testing that the new
versions make sense for their site.
I suggest starting out by packaging the sort of sanitizing
functionality Jonas proposed as a JavaScript library written on top of
DOMParser for a full document from a string, XHR for a full document
from a URL, and document.implementation.createHTMLDocument("").body.innerHTML
for fragments. If there's a very good reason why the sanitization
should be browser-assisted (i.e. partially in C++), I think the issue
should be raised on the WHATWG list instead of just putting something
in B2G.
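Concretely, the three entry points would look something like this (a
sketch; sanitize() stands for the hypothetical whitelist walker
sketched earlier):

  // Full document from a string:
  var doc = new DOMParser().parseFromString(untrustedHtml, "text/html");
  sanitize(doc.documentElement, options);

  // Full document from a URL:
  var xhr = new XMLHttpRequest();
  xhr.open("GET", untrustedUrl);
  xhr.responseType = "document";
  xhr.onload = function () {
    sanitize(xhr.response.documentElement, options);
  };
  xhr.send();

  // Fragment:
  var inertDoc = document.implementation.createHTMLDocument("");
  inertDoc.body.innerHTML = untrustedFragment;
  sanitize(inertDoc.body, options);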
--
Henri Sivonen
[email protected]
http://hsivonen.iki.fi/