2015-01-15, 13:35
(2015-01-15, 00:34)smallint Wrote: * On behalf of wolfgar and smallint *
Abstract
The proposed new iMX implementation improves rendering in terms of memory bandwidth and hardware utilization. User reports and our tests over the last months have shown that we are currently hitting a hard limit when de-interlacing
HD video streams. We identified buffer copies and GPU utilization as the main bottlenecks. The new implementation addresses these limitations.
Description
The old implementation uses the following path to render a de-interlaced picture:
1. Decode data to buffer1
2. De-interlace buffer1 into buffer2
3. Render buffer2 with GPU to back buffer
4. Copy back buffer to fb0
5. Send fb0 buffer to IPU for display
This involves color conversions and buffer copies at every stage. Since CSC is very fast, it is most likely not the primary performance penalty. According to our tests, the main drawbacks of this method are the memory bandwidth used and the high GPU utilization:
Note that "memory bandwidth" may be an improper name and that "bus load" may be more appropriate. Indeed, one can measure considerably greater memory bandwidth when using only the GPU, for instance. But that does not mean we have large
margins on the memory bus. The GPU is able to perform long read/write bursts (64 bytes) from/to memory, whereas some other components (in particular the VDIC) are structurally limited to smaller burst sizes and intrinsically use the memory bus in a non-optimal way, triggering contention while the theoretical maximum bandwidth is far from being reached.
We are unable to de-interlace with the VDIC and render with the GPU in time for demanding video streams.
The Freescale profiling tool mmdc (named after the multi-mode DDR controller that provides the debug functions) shows that these cases exhibit a bus load (the ratio of busy cycles to total cycles) greater
than 97%, which means that almost all cycles are busy. Unfortunately, these busy cycles are not all used efficiently to transfer data, so the measured bandwidth is not that high even though the bus is definitely unable to deliver more.
The high bus load sometimes shows up as a black screen: the HDMI display loses sync with the box because the DP cannot read memory fast enough. That happens regularly with my (smallint) installation when de-interlacing 1080i50 with the render method to be replaced. The new implementation aims to avoid buffer copies as well as to limit bus
usage as much as possible, and draws as follows:
1. Decode data to buffer1
2. De-interlace buffer1 into buffer2
3. Show buffer2 (framebuffer panning, no copy)
4. Send fb0 and fb1 to IPU for display, combining them on the fly (thanks
to DP capability)
Note that steps 1, 2 and 3 are performed in separate threads with each method.
We avoid two buffer copies (to GPU and from GPU) by rendering directly into another framebuffer (fb1), which is composited with fb0 at each sync by the display processor (DP).
This allows very fast de-interlacing of HD streams and even enables double-rate rendering at 50 fps.
It also reduces hardware utilization, since the GPU is not used at all during fullscreen playback, which leads to lower power consumption, less thermal dissipation and more memory bandwidth left for other tasks.
There are still possible improvements by playing with fb0/fb1 usage, but the current results are already impressive.
As far as Android is concerned, it should work as well but needs to be tested.
What does change?
Since we are now rendering into another framebuffer, the current GLES code no longer has access to that framebuffer. 3D rotations, color correction, gamma control and the like no longer work through GL, but the IPU
implements the corresponding functionality. As a consequence, the screenshot feature is currently broken and does not save the video content. This can be fixed quite easily (not yet done).
Furthermore, the framebuffer needs to be set up with 32 bpp to make compositing of fb0 and fb1 work with transparency. To also address lower-end hardware based on the i.MX6 Solo, like the Hummingboard, 16 bpp can work as well but with limitations. The implementation checks the current number of bits per pixel and switches to alpha blending (bpp == 32) or color keying (bpp == 16) of the GUI overlay.
Figures
Code:
File                       Progressive  De-interlacing  Double rate
-------------------------  -----------  --------------  -----------
1080i50_h264_stream        -            29ms            16ms
1080i50_h264_mbaff_stream  -            16ms            11ms
burosch1_stream            -            7ms             7ms
Those measurements were taken on wolfgar's Cubox-i4 with VPU@352MHz and tweaked VPU priority on the AXI bus (devmem 0x00c49100 w 7 && devmem 0x00c49104 w 7).
The following numbers were gathered on my (smallint) box with a vanilla ArchLinux installation on a Wandboard Quad running kernel 3.10.17.
Code:
File                       Progressive  De-interlacing  Double rate
-------------------------  -----------  --------------  -----------
1080i50_h264_mbaff         26ms         22ms            19ms
1080i50_h264_mbaff (GPU)   26ms         35ms            28ms
1080i50                    28ms         44ms            39ms
1080i50 (GPU)              45ms         53ms            42ms
Conclusion
After running this code for some time now on a box used on a daily basis, we don't want to switch back to GPU rendering anymore. Playback feels much smoother, and the full-HD double-rate feature in combination with Stéphan's
kernel fix is something one doesn't want to miss anymore.
There are most likely additional issues that need to be addressed with this approach, but those can be fixed in software, and we don't have to deal with hardware limitations to the same extent as with the old implementation.
NOTE: This implementation requires a merge of PR 6090.
The implementation is available at [1].
wolfgar & smallint
[1] https://github.com/smallint/xbmc/tree/thread
Could the basic concept of this implementation also be reproduced on other, non-iMX hardware platforms?
I'm thinking about both other embedded systems and low-end or older desktop GPUs.