Uses a different cache primitive and a differend madd(u) encoding.
Also added a flag for BGR vs RGB color output (since PSP is assuming to
be BGR for speed).
Aside from that the ABI required some special function calls for PIC.
This allows us to emit the handlers directly in a more efficient manner.
At the same time it allows for an easy fix to emit PIC code, which is
necessary for libretro. This also enables more platform specific
optimizations and variations, perhaps even run-time multiplatform
support.