VR software and hardware

📅 2024-08-14 📄 source

Linux-centric overview

General VR stack simplified:

[VR app]
↕ VR API
[VR runtime]
↕ stuff between runtime and VR hardware, incl. sensor protocols
[VR hardware]

All common VR apps use VR APIs, with VR libraries or runtimes implementing them. These abstract away the complexities and differences of VR hardware, presenting it in a uniform, convenient manner and taking care of the stuff app developers won't be happy to bother with. You can see VR APIs as an analog of 3D graphics APIs, and VR runtimes as an analog of the implementations of those APIs, some being specific to a single VR hardware family (like graphics drivers from GPU vendors), others more broad/modular (like Mesa). Each HMD usually has "perfect support" in a specific vendor-supplied VR runtime, but can also be supported by an alternative FOSS project (currently all effort is converging on the Monado project becoming something like a "Mesa for VR"). There are also "translation layers" between APIs and other "bridges" which allow using apps developed against one API with HMDs that don't have that API supported in their vendor's runtime, but do have another one which is simpler to develop against than to reverse engineer the HMD stuff (like all those D3D<->Vulkan wrappers). Finally, there are "VR streaming" solutions for using apps running on a PC with "standalone HMDs", which involves sending frames as a compressed video stream over a network.

Note: VR APIs don't handle 3D rendering; rendering is done using existing 3D graphics APIs. VR APIs are used to obtain data which affects rendering, such as the spatial positions of cameras for projections and FOV, and to pass rendered frames to the VR runtime, which can transform them or even use them as input to synthesize the actually displayed frames (compositing, reprojection, etc.)
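
Schematically, an app's per-frame interaction with a runtime through such an API looks roughly like this (a hedged sketch in Python-like pseudocode; runtime, swapchain and render_eye are hypothetical placeholders, not any real binding):

# Hypothetical VR app frame loop; names are illustrative, not a real VR API.
def frame_loop(runtime, swapchain, render_eye):
    while runtime.session_running():
        timing = runtime.wait_frame()          # runtime paces the app against display refresh
        views = runtime.locate_views(timing.predicted_display_time)  # per-eye pose + FOV
        images = []
        for view in views:
            # the rendering itself happens via a normal 3D graphics API (Vulkan/GL/D3D);
            # the VR API only supplies pose/FOV and receives the finished image
            images.append(render_eye(view.pose, view.fov, swapchain.acquire()))
        # the runtime may composite, distort and reproject these before scanout
        runtime.submit_frame(timing, views, images)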

Note: in the tech world the term "XR" (extended reality) is preferred as a generalization of the overlapping concepts of VR/AR (augmented reality)/MR (mixed reality); all of these, including VR as understood by most ("an HMD with a stereoscopic display, sensors and possibly some controllers"), are just use cases of more generic tech:

Note: another common non-VR XR use case is the "magic window", when a virtual environment is "projected" on a phone display (not placed in some "phone VR viewer"), adjusting to its spatial position. Besides that, there are developments in the area of volumetric displays.

Note: VR runtimes are userspace things and all VR-specific stuff is computed in userspace (well, most GPU drivers have their 3D graphics APIs implemented in userspace too, with the kernel parts doing only basic stuff). If a runtime has a modular architecture allowing it to add support for different HMDs, its modules are usually called drivers. Runtimes themselves usually aren't called drivers, even though monolithic ones kind of are.

There are now 3 relevant cross-vendor VR APIs:

Notable vendor-specific APIs:

HMDs

Wired PC VR HMD connectivity: most PC VR HMDs still use separate USB/HDMI/power cables, even if "bound" together. There was a push from Nvidia for an "all-in-one" USB-C connector and cable using USB-C DP alt mode for video out and USB PD for power delivery, which they called VirtualLink (it was just a subset of modern USB, nothing proprietary), but they quickly abandoned it; the "VirtualLink" USB-C port was only present on some RTX 20xx cards. An "all-in-one" USB-C port is also present on some newer AMD cards, including the RX 7000 series. HMDs which support a single USB-C cable are still hard to find. The standalone Oculus Quest supports the Oculus Link USB-C cable, but it sends video via proprietary Oculus software using H.264/H.265 compression.

Wired PC VR HMDs all use DP/HDMI for video out, but for sensor data there is no cross-vendor protocol standard. The closest one is probably Valve "lighthouse" tracking, which is used by multiple manufacturers ("minor independent" manufacturers tend to target the Valve ecosystem).

Most PC VR HMDs are designed for "room-scale VR" and require placing some stationary beacons or external cameras at least for positional tracking/6DoF. HMDs with "self-sufficient" positional tracking include WMR HMDs and standalone HMDs such as the Oculus Quest.

Note: 3DoF (3 degrees of freedom, i.e. 3 axes of rotation) is "looking around from a single point", requiring rotational tracking only; 6DoF (3 axes of rotation + 3 axes of translation) is "looking and moving around", requiring rotational+positional tracking. Rotational tracking is relatively easy to implement "well enough" for a device with a gyro (for inyalowda lucky to have a great "beacon" underneath); positional tracking is very hard, especially "inside-out" (using only sensors placed in the HMD) without beacons, and it's a common situation that a FOSS project for HMDs with a full set of sensors supports rotational tracking very well, while positional tracking is missing or stuck in an experimental state with poor experience.

Some specific HMDs which are of interest to me:

VR smoothing technologies

VR experience suffers a lot from the visuals of the frame currently seen by the user not matching their current position. 60 FPS is fine for most people, and 120 FPS for almost everyone, for perceiving the movement of on-screen objects as smooth, and HMD displays have appropriate refresh rates; however:

One approach to dealing with these problems is to modify rendered frames before display to better match the current/predicted user position, generally known as reprojection, https://en.wikipedia.org/wiki/Asynchronous_reprojection (called "asynchronous" because it's not synchronous with the app's rendering thread, and "reprojection" because it is seen as reprojecting the 3D scene from an altered view point/direction, even though with the most sophisticated solutions that's not the only thing which is happening). The simplest rotational reprojection treats projections as "celestial spheres"; more sophisticated ones use depth maps with special handling of VR UI elements and head-locked objects (a minimal sketch of the rotational case is given below). Chronology of developments in this area:

TODO: pros and cons, adoption outside of VR as alternative to Freesync?
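
A minimal sketch of the rotational ("celestial sphere") case mentioned above, assuming an undistorted pinhole eye buffer with intrinsics K (real runtimes also handle lens distortion, layers and prediction):

import numpy as np

def rotational_reprojection_homography(K, R_render, R_display):
    """Homography mapping pixels of a frame rendered at head orientation R_render
    to where they should appear for the newer orientation R_display.
    K: 3x3 pinhole intrinsics of the eye buffer.
    R_render, R_display: 3x3 camera-to-world rotation matrices from head tracking.
    Pure rotation only: scene depth and head translation are ignored."""
    return K @ R_display.T @ R_render @ np.linalg.inv(K)

# Warping: each pixel of the displayed frame samples the rendered frame at
# inv(H) @ [u, v, 1] (divided by the third coordinate); in practice this is a
# textured-quad / compute pass on the GPU, not a CPU loop.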

Another approach is to reduce the amount of computation needed to render frames

Other:

TODO: current state of VR smoothing in relevant VR stacks, especially in Monado

There are claims that Google Daydream supports reprojection, so Cardboard probably doesn't

Using a phone as a PC VR HMD (and Android-based standalone HMDs)

In such "VR streaming" setups PC software which receives tracking data from phone, feeds it to app and streams video rendered by app to phone is called "server", and phone software which streams tracking data to PC and receives and displays rendered video is called "client".

Currently the most mature partially-FOSS (runs on Linux via SteamVR) solution is the ALVR project. It consists of a server https://github.com/alvr-org/ALVR (implemented as a SteamVR driver) and a client https://github.com/alvr-org/PhoneVR (an Android app using the Cardboard library, i.e. 3DoF only), https://www.youtube.com/watch?v=_5k9htTdpuI

Standalone HMDs like the Oculus Quest also run Android, allow installing Android .apks and expose the OpenXR API to apps, which creates an appeal to have a single streaming solution "to rule them all". The whole stack would be like:

Alternatively, a phone Android VR streaming client could avoid relying on any VR API on the phone and just send sensor data and a camera stream to the PC for processing (one of the greatest failures of phone VR is the phone getting hot from computations, so it might be a good idea to offload them to the PC as much as possible, though it's a question whether compressing and transmitting the data will be lighter).

Relevant projects:

There are claims that the Oculus streaming solution has some advanced optimizations which are very hard to implement in alternative projects, or even impossible to implement without using some private interfaces available only to the Oculus streaming client. However, there are alternative proprietary streaming solutions such as the Steam Link app for Oculus Quest (currently works only with the Windows version of SteamVR) and Virtual Desktop, which seem to compete well (VD is often considered better than the official one). Complaints about ALVR mostly fall into 3 groups:

TODO: Android "VR helper" setting, OpenXR "standard loader"

TODO: which protocols are used between server and client? Video is universally sent as an H.264/H.265/AV1-compressed stream, but what about VR-specific metadata for frames, such as frame submission timing for reprojection? Clock sync? https://github.com/alvr-org/ALVR/wiki/How-ALVR-works#the-streaming-pipeline-overview .

QR codes of Cardboard VR viewers

The QR code actually encodes a URL of the format https://google.com/cardboard/cfd?p={device_params_string} (or the URL https://g.co/cardboard, which indicates the standard "Cardboard Viewer v1" device, or a URL which redirects to such a URL, see https://github.com/googlevr/cardboard/issues/444 https://developers.google.com/cardboard/reference/unity/class/Google/XR/Cardboard/Api#savedeviceparams ), where {device_params_string} is a base64-encoded serialized protobuf message of type DeviceParams, defined and explained in https://github.com/googlevr/cardboard/blob/master/proto/cardboard_device.proto . To parse it with Python ( https://protobuf.dev/getting-started/pythontutorial/ ):

import base64
# cardboard_device_pb2 is generated from cardboard_device.proto (linked above) with:
#   protoc --python_out=. cardboard_device.proto
import cardboard_device_pb2

# device_params_string is the base64 part of the scanned URL
device_params = cardboard_device_pb2.DeviceParams()
device_params.ParseFromString(base64.b64decode(device_params_string))
print(device_params)

Example output for device_params_string CgZIb21pZG8SDUhvbWlkbyAibWluaSIdhxZZPSW28309KhAAAEhCAABIQgAASEIAAEhCWAE1KVwPPToIexQuPs3MTD1QAGAC:

vendor: "Homido"
model: "Homido \"mini\""
screen_to_lens_distance: 0.05299999937415123
inter_lens_distance: 0.06199999898672104
left_eye_field_of_view_angles: 50.0
left_eye_field_of_view_angles: 50.0
left_eye_field_of_view_angles: 50.0
left_eye_field_of_view_angles: 50.0
tray_to_lens_distance: 0.03500000014901161
distortion_coefficients: 0.17000000178813934
distortion_coefficients: 0.05000000074505806
vertical_alignment: CENTER
primary_button: TOUCH
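
To get {device_params_string} out of a scanned URL in the first place, the standard library is enough; a small sketch (padding/URL-safe handling added because strings embedded in URLs are often unpadded web-safe base64):

import base64
from urllib.parse import urlparse, parse_qs

def device_params_from_url(url):
    """Extract and decode the p= parameter of a https://google.com/cardboard/cfd?p=... URL."""
    device_params_string = parse_qs(urlparse(url).query)["p"][0]
    padded = device_params_string + "=" * (-len(device_params_string) % 4)
    return base64.urlsafe_b64decode(padded)  # raw bytes for DeviceParams.ParseFromString()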

If "Skip" button is pressed in Cardboard app QR code scan dialog, Cardboard sets params for "Cardboard Viewer v1" which are defined in class CardboardV1DeviceParams in https://github.com/googlevr/cardboard/blob/master/sdk/device_params/android/java/com/google/cardboard/sdk/deviceparams/CardboardV1DeviceParams.java#L21

Linux VR desktop

I'm currently interested in a "simple" solution with a single desktop "projected" onto a VR surface, like a giant screen floating in space in front of the user, initially positioned at the HMD's virtual image plane, with the possibility to lock it to the head / re-center it in case it drifts.
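
Sizing such a floating screen is simple trigonometry; a sketch (the 2.0 m distance and 80-degree span are arbitrary example values, not properties of any particular HMD):

import math

def virtual_screen_width(distance_m, horizontal_span_deg):
    """Width of a flat screen quad placed distance_m in front of the viewer
    so that it spans horizontal_span_deg of the field of view."""
    return 2.0 * distance_m * math.tan(math.radians(horizontal_span_deg) / 2.0)

print(virtual_screen_width(2.0, 80.0))  # ~3.36 m wide; height follows from the desktop aspect ratio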

3DoF VR videos

This is the only type of VR video which is currently affordable to produce. It is filmed with a double camera with wide-angle lenses (180-degree) or 2 opposite-facing double cameras (360-degree, via "stitching" of two 180-degree videos). Perception is perfect only as long as head movements are restricted to positions in which the eyes' positions match the cameras' (only pitch/nodding movements, following the central vertical line of the video), but in practice the feeling of depth is good for most people when looking around "naturally".

To be more precise about the video format: there are different options, but almost all use "SBS EQR" (side-by-side equirectangular), popularized by YouTube (its Android app has a VR mode for viewing such videos in Cardboard VR viewers).
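
The mapping behind EQR is just longitude/latitude to texture coordinates, with the two eyes sharing one frame side by side; a rough sketch (axis conventions and which half belongs to which eye vary between players, so treat the signs and offsets as assumptions):

import math

def sbs_eqr_uv(direction, eye):
    """Texture coordinates (u, v) in a side-by-side equirectangular frame for a
    unit view direction (x = right, y = up, z = backward) and eye 'left'/'right'."""
    x, y, z = direction
    lon = math.atan2(x, -z)                  # longitude, 0 at the frame center
    lat = math.asin(max(-1.0, min(1.0, y)))  # latitude
    u = lon / (2.0 * math.pi) + 0.5          # 0..1 across a full 360-degree panorama
    v = 0.5 - lat / math.pi                  # 0..1 from top to bottom
    # side-by-side: each eye occupies a horizontal half of the frame;
    # for 180-degree content each half covers only the front hemisphere,
    # so the longitude scaling would be lon / math.pi + 0.5 instead
    return (u * 0.5 + (0.5 if eye == "right" else 0.0), v)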

Most websites with VR videos currently seem to use the "Delight XR" video player, which relies on browser WebXR support and can be recognized by its common UI and the <dl8-video> tag in the page source: https://delight-vr.com/documentation/dl8-video/

Solutions for local playback:

Note: smartphones (mass-produced mobile SoCs) are currently limited to 3840x2160 resolution for hardware-accelerated video decoding (for comparison, NVDEC on Nvidia cards can do 8192x8192); software decoding, even if available, usually doesn't play well with VR projection rendering pipelines. The easiest way to check the hardware-accelerated video decoding limitations of an Android smartphone is to open the "chrome://gpu" URL in Chrome. Probably the only mobile SoC family capable of ultra-high-resolution video decoding is Qualcomm Snapdragon XR, used by virtually all standalone HMDs.
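
A quick consequence of this limit for SBS EQR content (back-of-the-envelope arithmetic against the 3840x2160 ceiling above):

# At the common 3840x2160 decode ceiling, each eye of a side-by-side frame
# gets only 1920x2160 pixels stretched over the whole panorama.
frame_w, frame_h = 3840, 2160
eye_w, eye_h = frame_w // 2, frame_h
print(eye_w / 360, eye_h / 180)  # ~5.3 x 12 pixels per degree for 360-degree video
print(eye_w / 180, eye_h / 180)  # ~10.7 x 12 pixels per degree for 180-degree video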

TODO: future VR video formats with compression redundancy between half-frames, efficient usage of available rectangular frame size

TODO: affordable VR video cameras

aframe-vr-player notes

aframe-vr-player uses A-Frame, which uses three.js. three.js is a JS 3D graphics library; it doesn't touch HTML/DOM, it has its own model with a "renderer" rendering a "scene" using a "camera", the "scene" is filled with "mesh" 3D objects, and the only thing related to HTML/DOM is the renderer.domElement canvas, which is intended to be added to the page to display rendered frames somewhere. A-Frame is a JS XR framework: it uses the DOM to allow describing an XR scene as a hierarchy of custom HTML tags (<a-scene>, <a-entity>, <a-assets>, <a-image>), parses them, and sets up and renders the scene using three.js (and the WebXR API of course, using it to adjust the camera and handing the renderer canvas over to it so that the browser can pass projections to the lower-level VR runtime for display on the HMD).

Main entities in aframe-vr-player scene:

There's no projection maths in the player code: it just sets up spheres textured from the video and a camera with FOV and direction, and the underlying 3D stack does the rest.

There's no explicit access to the decoded video frame pixmap: it just sets the <video> element id as the texture source for the spheres, with an offset for the left/right half of the frame, and the underlying stack does the rest.

Stereo is disabled when the UI is shown by making the "default eye" sphere visible to the camera for both eyes and hiding the other one.

A-Frame scene hierarchy elements have attributes which reference A-Frame "components", and the values of these attributes contain arguments for those components.

Approximate flow:

6DoF VR videos

Currently the most impressive thing I've seen is the light fields demo from a Google team https://augmentedperception.github.io/deepviewvideo/ ; they also released a demo app on Steam with static 6DoF images https://store.steampowered.com/app/771310/Welcome_to_Light_Fields/

There's also the Pseudoscience VR player, popular in the Reddit 6DoF community, which seems to "guess" depth for 3DoF videos and for hidden parts of objects.

Games and sims

Physical-physiological FAQ

Other notes and links

