2015-01-15, 13:35
(2015-01-15, 00:34)smallint Wrote: * On behalf of wolfgar and smallint *
Abstract
The proposed new iMX implementation improves rendering in terms of memory bandwidth and hardware utilization. User reports and our tests over the last months have shown that we are currently hitting a hard limit when de-interlacing
HD video streams. We identified buffer copies and GPU utilization as the main bottlenecks. The new implementation addresses these limitations.
Description
The old implementation uses the following path to render a de-interlaced picture:
1. Decode data to buffer1
2. De-interlace buffer1 into buffer2
3. Render buffer2 with GPU to back buffer
4. Copy back buffer to fb0
5. Send fb0 buffer to IPU for display
This involves color conversions and buffer copies at every stage. Since CSC is very fast, it is most likely not the primary performance penalty. According to our tests, the main drawbacks of this method are the memory bandwidth used and the high GPU utilization:
Note that "memory bandwidth" may be an improper name and that "bus load" may be more appropriate. Indeed, one can measure considerably greater memory bandwidth when using only the GPU, for instance. But that does not mean we have large
margins on the memory bus. The GPU is able to perform long read/write bursts (64 bytes) from/to memory, whereas some other components (in particular the VDIC) are structurally limited to smaller burst sizes and intrinsically use the memory bus in a non-optimal way, triggering contention while the theoretical maximum bandwidth is far from being reached.
We are unable to de-interlace with the VDIC and render with the GPU in time for demanding video streams.
The Freescale profiling tool mmdc (named after the multi-mode DDR controller that provides the debug functions) shows that these cases exhibit a bus load (the ratio of busy cycles to total cycles) greater
than 97%, which means that almost all cycles are busy. Unfortunately, these busy cycles are not all used efficiently to transfer data, so the measured bandwidth is not that high even though the bus is definitely unable to deliver more.
The high bus load sometimes shows up as a black screen: the HDMI display loses sync with the box because the DP cannot read memory fast enough. That happens regularly with my (smallint) installation when de-interlacing 1080i50 with the render method to be replaced. The new implementation aims to avoid buffer copies as well as to limit bus
usage as much as possible, and draws as follows:
1. Decode data to buffer1
2. De-interlace buffer1 into buffer2
3. Show buffer2 (framebuffer panning, no copy)
4. Send fb0 and fb1 to IPU for display, combining them on the fly (thanks
to DP capability)
Note that steps 1, 2 and 3 are performed in separate threads with each method.
We avoid two buffer copies (to GPU and from GPU) by rendering directly into another framebuffer (fb1), which is composited with fb0 at each sync by the display processor (DP).
This allows very fast de-interlacing of HD streams and even enables double-rate rendering at 50 fps.
It also reduces hardware utilization, since the GPU is not used at all during fullscreen playback, which leads to lower power consumption, less thermal dissipation and more memory bandwidth left for other tasks.
There are still possible improvements by playing with fb0/fb1 usage, but the current results are already impressive.
As far as Android is concerned, it should work as well but needs to be tested.
What does change?
Since we are now rendering into another framebuffer, the current GLES code no longer has access to that framebuffer. 3D rotations, color correction, gamma control and the like no longer work through GL, but the IPU
implements the corresponding functionality. As a consequence, the screenshot feature is currently broken and does not save the video content. This can be fixed quite easily (not yet done).
Furthermore, the framebuffer needs to be set up with 32 bpp to make compositing of fb0 and fb1 work with transparency. To also address lower-end hardware based on the i.MX6 Solo, like the Hummingboard, 16 bpp can work as well but with limitations. The implementation checks the current number of bits per pixel and switches to alpha blending (bpp == 32) or color keying (bpp == 16) of the GUI overlay.
Figures
Code:
File                       Progressive  De-interlacing  Double rate
-------------------------  -----------  --------------  -----------
1080i50_h264_stream        -            29ms            16ms
1080i50_h264_mbaff_stream  -            16ms            11ms
burosch1_stream            -            7ms             7ms
Those measurements were taken on wolfgar's Cubox-i4 with VPU@352MHz and tweaked VPU priority on the AXI bus (devmem 0x00c49100 w 7 && devmem 0x00c49104 w 7).
The following numbers were gathered on my (smallint) box with a vanilla ArchLinux installation on a Wandboard Quad running kernel 3.10.17.
Code:
File                       Progressive  De-interlacing  Double rate
-------------------------  -----------  --------------  -----------
1080i50_h264_mbaff         26ms         22ms            19ms
1080i50_h264_mbaff (GPU)   26ms         35ms            28ms
1080i50                    28ms         44ms            39ms
1080i50 (GPU)              45ms         53ms            42ms
Conclusion
After running this code for some time now on a box used on a daily basis, we don't want to switch back to GPU rendering anymore. Playback feels much smoother, and the full-HD double-rate feature in combination with Stéphan's
kernel fix is something one doesn't want to miss anymore.
There are most likely additional issues that need to be addressed with this approach, but those can be fixed in software, and we don't have to deal with hardware limitations to the same extent as with the old implementation.
NOTE: This implementation requires a merge of PR 6090.
The implementation is available at [1].
wolfgar & smallint
[1] https://github.com/smallint/xbmc/tree/thread
Could the basic concept of this implementation also be reproduced on other, non-iMX hardware platforms?
I'm thinking about both other embedded systems and low-end or older desktop GPUs.