# Agenda for the Toronto WebRender hack week Jan 16 - 20, 2017
## Refactoring how / when tiling is used
A recap of the latest WR changes and the stuff to come.
Reasoning for killing tiles, complications, and performance results.
Using the depth buffer for the win.
Tiling vs software occlusion culling.
## General optimization
Look at https://github.com/servo/servo
## Clipping optimizations
Providing the optimal code path for a single rounded-cornered rectangle.
Avoiding the allocation and rendering of the opaque parts of complex intersecting clips.
Figuring out the progressive clipping logic for the complex cases.
## 3D transform support
What is there currently. What is missing.
Primitive sorting by Z depth.
## SVG API/implementation
Generic API for SVG and Glyphs.
How to make the API work reasonably with big SVGs (hundreds of thousands of elements).
## Image integration
Replacing the `Arc<Vec<u8>>` with a trait that would let us avoid copies in Gecko and unlock the memory after upload.
Figuring out `update_image` API changes for progressive image loading.
## C++ bindings organization
General organization: namespaces, avoiding proliferation of types, etc.
## CI testing
Adding reftests to WR repo
- Also, if we want to follow stylo and put webrender into the servo/servo repo and have it synced to m-c
Automated performance regression tests
Recording # of pixels written per test
Storing and displaying these statistics
## Angle compatibility
Discuss all the details with Glenn so that we are ready to work on it the week after when Sotaro is in.
Look into the glShaderBinary-providing Angle extension to pre-cache shaders (on Windows)
## More graphics API backends for WR
Do we need any for MVP?
Some possible other topics:
Subpixel positioning
Incremental vertex texture updates
Zoom / pan
Pixel snapping
# ============================ #
# ======== Discussion ======== #
# ============================ #
## Recap of WR state
Tile problems:
- transformed stuff clips to tiles, processes more pixels than needed
- some effects like blur need to sample outside the bounds
Tile goods:
- clip can skip the tiles completely inside the inner rectangles
- better batching (even at the cost of more vertices)
- - because when tiling, we know in advance that some primitives don't overlap, so we don't have to split the batches that often
TODO:
- get the worst case scenario for the tile-less rendering
- can still do tiling (selectively) for batching, not shaders
- still need a bit of shader changes to finish (*)
- figure out the slowness of Servo page (from 2.5ms to 5ms)
- check that rounded cornered rects have their own performance bar!
Document the discussion points, see https://github.com/servo/webrender/issues/744
## Missing image API features
- handle large images by tiling them into 2D array textures.
- map external images to texture atlas regions.
- provide WR's internal timing to the external image callbacks, so that the vsync expectation is taken into account.
## CI
- WR recently got reftest support
- - runs on Travis CI
- - compares YAML to (a simplified) YAML, not to raw images at the moment
- can we add simple profiling to CI (Travis)?
- need dedicated hardware for proper graphics timing
## 3D Transforms
We can't rely on the depth buffer because of transparent surfaces (which would require some sort of order-independent transparency).
We need to split the intersecting planes and order the non-intersecting results before processing them with our current depth ordering.
There is a flag passed in with the StackingContext for "preserve 3D", which forces us to re-order the children every time we update the spatial graph.
For 3D-transformed nodes, we need 2 paths:
- simple nodes with solid color - just drawn straight to the frame
- complex nodes with stuff inside them:
- we draw to an off-screen target (render task)
- that simplifies our plane splitting, since the split quads would just have different UVs of the vertices
Need to do this after we figure out the scrolling root/stacking context architecture.
There is an open question about transformed text w.r.t. sub-pixel AA. Perhaps we don't need to worry about AA for animated text?
TODO: check Gecko on Windows (Jeff M)
TODO: add support for the transformed text rasterization in WR
- add the rotation angle to the glyph key
- handle that in the vertex shader
## Naming and directory structure
- Directory structure will look like this:
gfx/
## Image formats
MinVP:
- remove RGB - not supported by HW
- add a flag for transparency, affecting how RGBA is handled
- move that flag, format, width, and height into some sort of ImageDescriptor
- bring the format names close to DXGI/Vulkan
## Scrolling roots VS Stacking contexts VS Containing blocks VS clipping
Main problem: scrolling roots are NOT stacking contexts, in general. The work to separate them has started, but there was no general model for processing the scroll regions correctly.
Extra problem: stacking contexts are NOT always propagating their transformation/position changes to the child items.
Solution:
- API should define scrolling roots, independently from pushing/popping the stacking contexts.
- Each item, stacking context, or clip region is defined within the current stacking context, but also carries the ID of a previously defined scrolling root.
- WR builds a tree of reference frames (RF) and scrolling roots (SR). An RF's scroll root is the first one on the way up the tree.
- Stacking contexts get decomposed internally into:
- has a transform?
- yes - becomes a reference frame (RF)
- no - becomes an extra position offset for the contained elements
- Effect groups (like opacity and blend modes) (may need clarification here, since it may require the whole stacking context to render into an off-screen target)
- Each of RF, SR, clip, or item is associated with both a containing RF and a SR. The world transformation is computed as:
- `T = local_transform * T_rf + delta_scroll * T_sr`
- local_transform is just an offset for anything but RF
- T_rf and T_sr are expected to be resolved here. If there is a dependency loop, we consider the input to be invalid (assuming it's not possible in CSS).
- Using this equation, we can descend into the SRF tree and re-compute all the world transformations each frame
- The GPU representation for shaders will have an entry for each `(RF, Option<SR>)` pair, shaders don't even need to be changed
One side effect: less work for Servo to prepare the display lists for us.
The clip stack gets accumulated via the same SRF tree.
## More graphics API backends for WR
We still keep Angle for:
- WebGL; it will need to provide a texture for us through the API we use natively.
- DXGI interop (video decoding)
- Tooling (better graphics debugging)
- better ES3 compatibility
- D3D9 compatibility
MinVP - keep using Angle.
MaxVP - get D3D12 first.
## SVG/Path rendering plans
MinVP plan:
WR knows what area is covered by a path, splits it into tiles of equal size, allocated in a single texture array.
When the WR backend processes visibility, it knows which tiles (or regions of them) are visible. It then has a trait for path rendering: for each visible tile, it hands the path renderer a blob to draw into a CPU-side image. The blob can be an SkPicture, an FT_Outline, or whatever else.
Benefits of this approach are:
- no hard dependencies on skia or freetype
- potentially less work to do since we only care about visible tiles
- automatic support for large geometry objects, because of tiling
- off-loading the content thread, which no longer needs to draw those images
- unlocks the way to remove the painted layers entirely, cleaning up the Gecko code
The previous plan was to use Gecko tiled painted layers in the content process to render into a texture client and share the produced textures with WebRender using the external image mechanism.
The new plan is to use a special kind of painted layer that creates a recording of the drawing commands (using SkPicture or Moz2D recording or some other serialization mechanism) and pass that to WebRender as described above. The advantages are that we preserve the layerization logic thanks to having a recording per painted layer, while moving all of the remaining painting off the main thread, and avoiding the memory and CPU overheads associated with texture sharing and the buffering it requires.
The problem with SkPicture is that it serializes everything into each stream, including images and font data, which can cause some recordings to be huge (for example, 3 layers containing text using the same web font will cause the font data to be serialized 3 times). So we may decide not to use SkPicture. This should be transparent to WebRender, however: the idea is to register a "VectorImageRenderer" which takes blobs containing the recording and produces the rasterized output images. This way the VectorImageRenderer trait can be implemented by Gecko without WebRender having to depend on Skia or any other solution we might choose.
As a bonus the research folks can experiment with crazy GPU vector rendering by implementing their own VectorImageRenderer without colliding with Gecko requirements.
nical has an incomplete webrender patch that adds this VectorImage concept (but doesn't do any rendering yet).
MaxVP:
WR needs to pave a way for a native GPU path renderer, be it either Lyon or something new and fancy that Patrick is working on.
## Image integration
Shared memory for image data:
For add_image, hiding the image data behind a trait is tricky, since the data has to be serializable (hence, the trait object would need to be boxed and implement the serde traits).
This is not really needed for MinVP, since Gecko will use the extended external image API stuff instead (see "missing image API features").
Related - https://github.com/servo/webrender/issues/723
Progressive images:
- interlacing (PNG) - to be handled by Gecko internally, WR will get the full image contents on each update. Alternatively, we may expect the stride given to update_image to be a multiple of the actual stride.
- chunks (JPEG) - TODO: change the update_image API to receive a sub-image, thus providing a rectangle with stride. We also need to fill up the image as transparent when it's just created.
### Sub-pixel positioning
TODO: Add the sub-pixel offset to the glyph key, quantized to 1/3 or 1/4 of the pixel.
Doesn't make sense for large glyphs, since it may force multiple copies of a glyph, consuming VRAM.
### Zoom / pan
Logic is essentially similar to our `device_pixel_ratio`, minus the snapping part.
TODO: need to add the zoom/offset API setters on the pipeline level (as well as add the logic to the shaders), which would support panning/zooming in the document or iframes.
### Incremental vertex texture updates
Instead of updating the textures directly and tracking the dirty areas, we can scatter the changes by drawing a bunch of point primitives into an FBO with those textures.
Uploading the data would then be a stream operation (with discard), and we'd only need to upload the changed information, and nothing else.
See https://github.com/servo/webrender/issues/458
## Clipping optimizations
General case solution:
Find out the inner rectangle of the clip stack - the axis-aligned (in world space) intersection of all the clip instances, i.e. the area contained by all clips.
The difference between the outer rectangle of the stack and the inner one is split into 4 axis-aligned rectangles that are allocated independently in the mask cache. All the clip instances are then drawn into all 4 sub-rectangles, with some room for culling on the way.
When we draw an affected primitive, we have the cache coordinates of all 4 sub-rectangles, and we figure out which to sample first (or none).
Each affected primitive can be drawn in 2 passes then. The intersection of it with the inner rectangle (of the clip stack) can be drawn in the opaque pass.
The problem is that the inner rectangle is given in world space while the primitive is in local space, so intersecting them is difficult.
Then the full primitive can be drawn in the transparent pass. The pixels already filled out by the opaque pass will be rejected by the Z test.
Optimizations: any clip instance that contains another clip instance completely can be discarded from the stack.
Likewise, a primitive that is completely inside the inner rectangle of the clip stack can render as if there are no clips on it.
Popular custom case - one rounded cornered rectangle:
Split the rectangle into 4 parts: 3 parts are covering all the opaque space inside it, and the rest belongs to the rounded corners.
In the opaque pass, the first 3 parts are drawn.
The mask is generated for the corners independently.
The corners are then drawn as little rectangles in the transparency pass, each fetching from a single mask region.
Rendering the clip stack itself into the mask:
Current approach:
- figure out the axis-aligned bounding box of the intersection of all clips in the stack
- allocate the target with the size of that bounding box, assume it was previously cleared as opaque (1)
- render each clip instance into the mask with multiplicative blending
- the vertex shader stretches the vertices, making sure that each clip is covering the whole mask (!)
- it's done. The primitive then just takes a sample from the mask for each pixel.
Problems:
- For a common case of a rounded cornered rectangle, we allocate all the space it takes, which may span over multiple screens (!). This is now being addressed by the "Popular custom case" logic.
- Drawing clip instances is piggy-backing on the transform shaders, which are no longer needed for general primitives because of the tile removal
- Drawing clip instances spreads pixels outside of the clips, to cover the whole bounding box area
- If some pixel is not covered by a clip instance, we still blend everything on top of it and process the clips.
Problems 2 and 3 could be addressed by using the stencil buffer. We'd have it as 0 initially; each instance would bump it to the next value. The primitive shader would then just need to read the value and compare it to the total number of clips that affect it.
Problem 4 can also be addressed. For each clip stack *level*, we'd only draw on top of pixels that passed all the clip tests to this level, and bump their values.
This optimization requires the stencil state to change for each clip stack level, so it would introduce some batch breaks. It seems to be the most efficient approach overall.
For both stencil techniques, there are 2 options to deal with clipped pixels:
- fetch the stencil in the fragment shaders of the primitives and compare the value to something we pass with the primitive
- do an extra pass over the bounding box of the clip stack, drawing into the mask and filling all the pixels that didn't pass some of the tests with 0