5G Multimedia Standardization
Frédéric Gabin1, Gilles Teniou2, Nikolai Leung3 and Imre Varga4
1Chairman of 3GPP SA4, Standardisation Manager at Ericsson, France
2Vice Chairman of 3GPP SA4, Senior Standardization Manager at Orange, France
3Vice Chairman of 3GPP SA4, Director of Technical Standards at Qualcomm, Philippines
4Chairman of 3GPP SA4 EVS SWG, Director of Technical Standards at Qualcomm, Germany
E-mail: frederic.gabin@ericsson.com; gilles.teniou@orange.com; nleung@qti.qualcomm.com; ivarga@qti.qualcomm.com
Received 30 March 2018;
Accepted 3 May 2018
In the past 10 years, the smartphone and its 4G mobile broadband connection supported the now well-established era of video multimedia services. Future mass market multimedia services are expected to be highly immersive and interactive. This paper presents an overview of 5G multimedia aspects as specified by 3GPP for various services that will be provisioned over the 5G network. Specifically, we cover the evolution of streaming services for 5G, Virtual Reality 360∘ video streaming, the VR evolution of real-time speech and audio communication services, and user generated multimedia content.
In the past 10 years, the smartphone and its 4th generation (4G) Mobile Broadband (MBB) connection supported the now well-established era of video multimedia services. Future mass market multimedia services are expected to be highly immersive and interactive. Multimedia services evolve rapidly and require capable devices with flexible Application Programming Interfaces (APIs) and scalable distribution networks. The 5th Generation (5G) system enhances the support of MBB applications that consume high bandwidth and require low latencies.
This paper presents an overview of 5G multimedia aspects as specified by the Third Generation Partnership Project (3GPP) for various services that will be provisioned over the 5G network. Specifically, we cover the evolution of streaming services for 5G, Virtual Reality 360∘ video streaming, the VR evolution of real-time speech and audio communication services, and user generated multimedia content.
Content providers, broadcasters and operators intend to leverage 5G systems using enhanced Mobile Broadband (eMBB) slices to deliver on-demand and live multimedia content to their subscribers. 3GPP is currently studying the required evolution of the media delivery service specifications.
Virtual Reality (VR) is currently the hottest topic in the field of new audio-visual experiences. It is a rendered version of a delivered audio and visual scene, designed to mimic the sensory stimuli of the real world as naturally as possible to an observer as he moves within the limits defined by the application and the equipment. Providing a 360∘ experience relies on a new set of representation formats for both audio and video signals that 3GPP intends to specify.
3GPP launched the new Work Item on the Enhanced Voice Services (EVS) Codec Extension for Immersive Voice and Audio Services (IVAS) at its September 2017 meeting. IVAS is the next generation 3GPP codec for 4G/5G, built upon the success of the EVS codec. It intends to cover use cases for real-time conversational voice, multi-stream teleconferencing, VR conversational services, and user generated live and non-live content streaming. In addition to addressing the increasing demand for rich multimedia services, teleconferencing applications over 4G/5G will benefit from this next generation codec used as an improved conversational coder supporting multi-stream coding (e.g., channel, object and scene-based audio).
User generated content, especially video, has recently become some of the leading content viewed by Internet users, surpassing the popularity of branded videos and movies. Slightly preceding this trend has been the rapid increase in video traffic uploaded to popular streaming sites, with surveys showing that most Internet users upload or share a video at least once a month. The initial 3GPP version of the Framework for Live Uplink Streaming focused on a fast time-to-market solution leveraging IP Multimedia System (IMS) Multimedia Telephony-based implementations, while also allowing a Hyper Text Transfer Protocol (HTTP [21]) Representational State Transfer (RESTful) interface for control signalling so that 3rd party services can keep their own specific user plane protocol stacks, e.g. the RTMP streaming protocol.
3GPP Packet Switched Streaming (PSS) services have been specified and maintained by 3GPP since its Release 4 (Rel-4). While initially most deployments were operator-managed audio-visual services, the specifications have evolved into a set of enablers covering streaming application, network and User Equipment (UE) capabilities.
The latest PSS architecture as defined in TS 26.233 [1] in Rel-15 is depicted in Figure 1:
The PSS architecture maps to 3G, 4G and 5G systems, with the PSS server mapped to an Application Function (AF) and the PSS Client residing in the UE. The PSS Client supports the 3GPP Dynamic Adaptive Streaming over HTTP (3GP-DASH) [6] protocol and format for acquisition of the Media Presentation Description (MPD) playlist and for the acquisition and playback of media segments. The UE supports a set of audio and video decoders and subtitling formats as specified in TS 26.234 [14], the video profiles in TS 26.116 [2] and the yet to be published virtual reality profiles in TS 26.118 [3]. It also supports the Scene Description format as specified in TS 26.307 [4]. The PSS server handles content acquisition and delivery according to the same set of codecs and formats. The server may also act as an Application Function and interact with the Policy and Charging Rules Function (PCRF) via the Rx reference point for Quality of Service (QoS) control, as specified in TS 29.214 [5]. The Server And Network Assisted DASH (SAND) protocol, as specified in TS 26.247 [6], enables network assistance, proxy caching, and consistent Quality of Experience (QoE) and QoS operations.
Content providers, broadcasters and operators intend to leverage 5G systems using eMBB slices (see TS 23.501 [7]) to deliver on-demand and live multimedia content to their subscribers.
3GPP is currently studying the required evolution of the media delivery service specifications; the findings are documented in TR 26.891 [8].
Figure 2 illustrates the current work in progress with regard to media functions and interfaces in the 5G system architecture.
Most media distribution over 5G is expected to be based on Adaptive Bit Rate streaming (e.g. 3GP-DASH) with HTTP 1.1 to deliver file-based video content. The expected common video container format is fragmented MPEG-4 (fMP4), which is based on the ISO Base Media File Format (ISOBMFF) as currently used in the 3GP-DASH profile. The recently specified Moving Picture Experts Group (MPEG) Common Media Application Format (CMAF) [9] is a profile of fMP4 and can be used with various manifest formats. Typically, media segments are addressed with Uniform Resource Locators (URLs) where the domain name indicates the content provider, i.e. the domain name of the Content Origin.
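To illustrate how a client derives such segment addresses, the following minimal sketch expands a typical 3GP-DASH SegmentTemplate with $RepresentationID$ and $Number$ placeholders into the segment URLs the client would fetch over HTTP 1.1; the base URL, template and numbering are invented values, not taken from any real MPD.

```python
# Minimal sketch: expanding a DASH SegmentTemplate into segment URLs.
# The base URL, template and numbering below are illustrative values.

def segment_urls(base_url, template, rep_id, start_number, count):
    """Resolve $RepresentationID$ and $Number$ placeholders as defined
    by the DASH SegmentTemplate addressing scheme."""
    for number in range(start_number, start_number + count):
        path = (template
                .replace("$RepresentationID$", rep_id)
                .replace("$Number$", str(number)))
        yield base_url + path

# Example: the domain name identifies the Content Origin.
for url in segment_urls("https://cdn.example-origin.com/vod/",
                        "video/$RepresentationID$/seg-$Number$.m4s",
                        rep_id="720p-3000k", start_number=1, count=3):
    print(url)
```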
The major components of media distribution are Content Preparation, Content Origin, and Delivery. The media delivery network elements and their functions are currently being studied and mapped to the 5G system. The media delivery functions are: playlist and media segment acquisition/delivery, capability exchange, QoE metrics collection, Digital Rights Management (DRM) protection, scene description, network assistance, Domain Name System (DNS) address resolution and load balancing, Content Distribution Network (CDN) and QoS management.
For example, the streaming server may reside within the Mobile Network Operator’s (MNO’s) network as a dedicated application function or it may reside externally and interact with the network through the Network Exposure Function (NEF, see [7]). As another example, the media data flows through the Data Network (DN, see [7]) to the User Plane Function (UPF, see [7]) directly or through one or more AFs. In the latter case, the AF may act as if it were the origin server.
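To make the external-AF case concrete, the sketch below shows what a QoS request through the NEF northbound interface could look like, written in the style of the AsSessionWithQoS API of TS 29.122/29.522; the host, identifiers and field names here are indicative only and should be checked against the actual specification versions.

```python
# Indicative sketch: an external streaming AF requesting QoS for a media
# flow via the NEF northbound interface (AsSessionWithQoS style, see
# TS 29.122/29.522). Host, identifiers and field names are illustrative.
import json
import urllib.request

subscription = {
    "notificationDestination": "https://af.example.com/qos-notifications",
    "ueIpv4Addr": "198.51.100.10",
    # Reference to pre-defined QoS information agreed with the operator.
    "qosReference": "qos-hd-streaming",
    "flowInfo": [{
        "flowId": 1,
        "flowDescriptions": [
            "permit in ip from 198.51.100.10 to 203.0.113.5 5004"
        ],
    }],
}

req = urllib.request.Request(
    "https://nef.operator.example/3gpp-as-session-with-qos/v1/af-12345/subscriptions",
    data=json.dumps(subscription).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(req)  # returns the created subscription
```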
TR 26.891 “5G enhanced mobile broadband; Media distribution” [8] is planned to be completed in June 2018 and is expected to become the foundation for Rel-16 normative work on an evolved 3GPP 5G streaming specification.
Virtual Reality is certainly the hottest topic in the field of new audio-visual experiences these days. It is a rendered version of a delivered audio and visual scene, designed to mimic the sensory stimuli of the real world as naturally as possible to an observer as he moves within the limits defined by the application and the equipment.
Virtual reality usually, but not necessarily, requires a user to wear a Head Mounted Display (HMD), to completely replace the user’s field of view with a simulated visual component, and to wear headphones, to provide the user with the accompanying audio. Some form of head and motion tracking of the user in VR is usually also necessary to allow the simulated visual and audio components to be updated to ensure that, from the user’s perspective, items and sound sources remain consistent with the user’s movements.
Apart from the complexity of producing VR content, the demanding distribution bitrate and end-to-end latency requirements make 5G the most appropriate access network technology for ensuring the quality of experience.
Providing a 360∘ experience relies on a new set of representation formats for both audio and video signals. Wherever the user looks, he must receive a consistent set of images and sounds allowing him to understand where the “objects” around him are located.
This is achieved with 360∘ video representations for which each pixel corresponds to a particular viewing orientation. At the production side, camera systems acquire pictures from every direction, which are stitched together at a later stage depending on the projection surface selected by the transmission system.
The most popular projections are the EquiRectangular Projection (ERP), where the 360∘ picture on the sphere is mapped to a rectangle, and the Cube Map Projection (CMP), where the picture is mapped to the faces of a cube, as illustrated in Figure 3.
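The correspondence between ERP pixels and viewing orientations can be written down directly; the short sketch below maps a pixel position in an equirectangular picture to an (azimuth, elevation) pair, using one common convention (azimuth increasing to the left, elevation upwards; other tools may differ in sign).

```python
def erp_pixel_to_orientation(x, y, width, height):
    """Map a pixel (x, y) of an equirectangular (ERP) picture to a viewing
    orientation in degrees. Convention: azimuth in [-180, 180] increasing
    to the left, elevation in [-90, 90] increasing upwards; sign
    conventions vary between tools."""
    u = (x + 0.5) / width   # normalized horizontal position in [0, 1]
    v = (y + 0.5) / height  # normalized vertical position in [0, 1]
    azimuth = (0.5 - u) * 360.0
    elevation = (0.5 - v) * 180.0
    return azimuth, elevation

# A pixel at the picture centre maps (to within half a pixel) to (0, 0),
# the default viewing orientation.
print(erp_pixel_to_orientation(1920, 960, 3840, 1920))
```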
These video signals can then be encoded with existing codecs such as MPEG-4 Advanced Video Coding (AVC) and High Efficiency Video Coding (HEVC), together with some metadata describing the 360∘ nature of the content, which is required for correct rendering.
On audio aspects, the sound coming from the 360∘ scene needs to be rendered according to the viewer’s instantaneous orientation. This means that a spatial audio representation is required, together with a binaural renderer feeding the user’s headphones.
The audio capture can be achieved with an appropriate microphone array capturing the surrounding sound field and/or using multiple microphones associated with each audio source in the scene.
There are 3 common representation models for spatial audio: channel-based audio (e.g. 5.1 or 7.1 loudspeaker feeds), object-based audio (individual sound sources with positional metadata), and scene-based audio (e.g. First-Order or Higher-Order Ambisonics); a head-tracking sketch for the scene-based case is given below.
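For the scene-based representation, head tracking reduces to a rotation of the sound field before binaural rendering. The following is a minimal sketch assuming first-order Ambisonics with the conventional W/X/Y/Z components and rotation about the vertical axis only (full 3-degrees-of-freedom tracking also requires pitch and roll):

```python
import math

def rotate_foa_yaw(w, x, y, z, yaw_deg):
    """Rotate a first-order Ambisonics frame (W, X, Y, Z components) to
    compensate a listener head yaw, so that sources stay fixed in the
    world as the head turns: a source at azimuth phi appears at
    (phi - yaw) after the listener turns left by yaw degrees."""
    a = math.radians(yaw_deg)
    x_rot = math.cos(a) * x + math.sin(a) * y
    y_rot = -math.sin(a) * x + math.cos(a) * y
    return w, x_rot, y_rot, z  # W (omni) and Z (height) are unchanged

# A frontal source (X only) heard after the head turns 90 degrees left
# ends up at the listener's right (negative Y in this convention).
print(rotate_foa_yaw(1.0, 1.0, 0.0, 0.0, 90.0))
```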
3GPP Service and Architecture Working Group #4 (SA4, the media and codec working group) has conducted a study on Virtual Reality, documented in the technical report TR 26.918 [10]. The group has investigated the possible relevant VR 360∘ use cases impacting the 3GPP ecosystem (access networks and user equipment).
The Rel-15 normative work of 3GPP consists of defining the technical enablers for the delivery of an audiovisual 360∘ scene. The specification TS 26.118, “3GPP Virtual Reality profiles for streaming applications” [3], defines interoperability points for VR 360∘ streaming services.
Similarly to what has been achieved in the past with the TV profiles [2], this specification, still under development at the time of writing, will provide a set of operation points describing the media formats for VR 360∘ together with their mapping to DASH delivery.
To achieve this, a reference architecture of the 3GPP client has been described as depicted in Figure 4.
This reference client architecture defines 3 interoperability points, highlighted by the vertical red bars in the figure.
The suitability of 5G for VR 360∘ relies mainly on two features of this radio access network: the capacity to deliver the demanding distribution bitrates and the low end-to-end latency required to react to the user’s movements.
Operation points defined in TS 26.118 [3] will certainly enable VR 360∘ services over 4G, but the 5G capacity will allow the service to scale to a larger population.
3GPP launched the new Work Item on EVS Codec Extension for Immersive Voice and Audio Services (IVAS Codec) at the TSG-SA September 2017 meeting.
IVAS is the next generation 3GPP codec for 4G/5G, built upon the success of the EVS codec. The 3GPP real-time Enhanced Voice Services (EVS) codec has delivered a highly significant improvement in user experience with the introduction of super-wideband (SWB) and full-band (FB) speech and audio coding, together with improved packet loss resiliency.
The basic idea behind the IVAS codec work item is to cover use cases for real-time conversational voice, multi-stream teleconferencing, VR conversational services, and user generated live and non-live content streaming. In addition to addressing the increasing demand for rich multimedia services, teleconferencing applications over 4G/5G will benefit from this next generation codec used as an improved conversational coder supporting multi-stream coding (e.g., channel, object and scene-based audio).
The introduction of 4G/5G high-speed wireless access to telecommunications networks, combined with the availability of increasingly powerful hardware platforms, will enable advanced communications and multimedia services to be deployed more quickly and easily than ever before.
Immersive services and applications, as envisioned in 3GPP TR 22.891 [11] and especially VR services and applications described in TR 26.918 [10], are expected to provide an immersive user experience which, when compared to existing media services, will deliver a quantum leap in the quality of experience. An immersive audio-visual experience implies, for the audio component, that a spatial sound impression is convincingly consistent with the presented visual scene. In addition, the user should be able to move, within certain limits defined by the application, throughout the scene, and the audio component will adjust to reflect the user’s spatial orientation/position.
3GPP TR 22.891 [11] and TR 26.918 [10] identify various immersive use cases and application scenarios that may be broadly subdivided into either UE-originated (user generated) or professionally generated content.
The approach proposed is to build upon the EVS codec with the goal of developing a single codec with attractive features and performance (e.g. excellent audio quality, low delay, spatial audio coding support, appropriate range of bit rates, high-quality error resiliency, practical implementation complexity). In the scope of 3GPP, the predominant audio rendering instrument is envisaged to be headphones but configurations with e.g. tablet speaker playback may also be of relevance.
The overall objective of the IVAS work item is to develop a single general-purpose audio codec for immersive 4G and 5G services and applications including the VR use cases envisioned in 3GPP TR 26.918 [10].
The objectives of the standardization work are detailed in the work item description. The developments under this work item should lead to a set of new specifications defining, among other things, the textual description, fixed-point C code, floating-point C code and associated test vectors of the IVAS codec, as well as the Real-time Transport Protocol (RTP) payload format, Session Description Protocol (SDP) parameter definitions, jitter buffer management, rendering and packet loss concealment methods. It is envisioned that subsequent work outside this work item will address suitable acoustic send and receive end requirements enabling an immersive user experience.
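The IVAS RTP payload format and SDP parameters are yet to be defined; as an indication of the shape they may take, the sketch below prints an SDP media description in the style used for EVS in MTSI sessions (see TS 26.114 and TS 26.445). The payload type and fmtp values are examples only, and the corresponding IVAS parameters will be defined by the work item deliverables.

```python
# Illustrative SDP media description for a 3GPP conversational audio
# session in the style used for EVS; payload type and fmtp values are
# examples only.
payload_type = 97

sdp_audio = "\r\n".join([
    "m=audio 49152 RTP/AVP {pt}",
    "a=rtpmap:{pt} EVS/16000",
    "a=fmtp:{pt} br=13.2-24.4; bw=swb",  # bit-rate range and audio bandwidth
    "a=ptime:20",
]).format(pt=payload_type)

print(sdp_audio)
```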
The standardization process consists of two major parts: setting requirements that successful IVAS codec candidates must fulfil, and defining the rigorous testing and selection framework to ensure the selection and subsequent standardization of the most attractive candidate based on well-known technical characteristics. Several permanent project documents are being prepared to support the standardization process. The current target is for the IVAS codec to become part of Rel-16.
In recent years, user generated content, especially video, has become some of the leading content viewed by Internet users, surpassing the popularity of branded videos and movies. Slightly preceding this trend has been the rapid increase in video traffic uploaded to popular streaming sites, with surveys showing that most Internet users upload or share a video at least once a month. The latest statistics report that more video content is uploaded in 30 days than the major U.S. television networks have created in 30 years [12]. It is expected that consumption will become even more compelling as user generated media becomes richer in quality, resolution, timeliness, and immersiveness.
As the revenue shift from MNO-managed services to Over-The-Top (OTT) services continues, traditional companies in the 3GPP and wireless ecosystem often face a choice between resisting this trend and finding a way to generate revenue working with these new OTT business models. With their very low bandwidth requirements allowing operation over best-effort QoS, OTT speech services have commoditized the voice services market. For higher bandwidth applications such as video streaming, some MNOs have tried to provide their own content offerings, while others work with existing content providers in reduced- or zero-rate subscription models that have had slightly more success, but with unclear monetization models for the MNO beyond subscriber growth. Such efforts are often questioned as steps in a race to the bottom with their competitors.
Live Uplink Streaming has the potential to address this competition between these two segments with a collaborative model that is necessary for successfully addressing the user generated content market. Neither can manage without the other: QoS support by the MNO becomes more relevant for uploading richer media in a timely manner (a.k.a. “Live”) while the existing OTT user base is necessary for widespread use and commercial adoption.
After voice, the earliest form of user generated content could be considered to be the Short Messaging Service (SMS), with its 160-character limit still providing value in today’s political climate. SMS evolved into the Multimedia Messaging Service (MMS), which upgraded the media formats to include speech, audio, synthetic audio, still images, bitmap graphics, video, vector graphics, etc. [13]. While not as ubiquitous as SMS, MMS was well-adopted in some markets where photos and video clips were used primarily for mobile advertising.
A less well-known but also standardized messaging service was the IMS Messaging Service [15–17], which supports immediate messaging and session-based messaging, and provides descriptions of Combining CS and IMS Services (CSI), which allows the sharing of video or still images during a CS voice call.
The support of multiple media types (e.g., speech/audio, video, and timed text) in a single Multimedia Message sent from the MMS client to the MMS proxy and MMS servers is provided by the 3GPP File Format [18], which is an instance of the ISO Base Media File Format. This specification also provides the timing, structure, and media data for multimedia streams used by PSS [14] and MBMS [19]. HTTP streaming extensions are also defined for use with DASH [6].
Aside from defining a structure for the integration of speech/audio codecs (including the Adaptive Multi-Rate Wideband (AMR-WB) codec, EVS, and Enhanced aacPlus) and video codecs (including AVC/H.264 and HEVC/H.265), the 3GPP File Format also integrates location timed metadata which, along with camera orientation information, could be leveraged for immersive media experiences captured from multiple perspectives using mobile devices (e.g., multiple drones filming an event, see Figure 5).
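Since the 3GPP File Format is an ISOBMFF instance, its top-level structure can be inspected with a few lines of code. The following minimal sketch walks the size/type box headers of a .3gp/.mp4 file, including the 64-bit "largesize" case:

```python
import struct

def list_top_level_boxes(path):
    """Walk the top-level boxes of an ISOBMFF file (e.g. .3gp, .mp4).
    Each box starts with a 32-bit big-endian size and a 4-character
    type; size == 1 means a 64-bit 'largesize' follows the type, and
    size == 0 means the box extends to the end of the file."""
    boxes = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack(">I4s", header)
            header_len = 8
            if size == 1:  # 64-bit largesize follows the type field
                size = struct.unpack(">Q", f.read(8))[0]
                header_len = 16
            boxes.append((box_type.decode("ascii", "replace"), size))
            if size == 0:  # box extends to end of file
                break
            f.seek(size - header_len, 1)  # skip the box payload
    return boxes

# Typical output for a 3GP file: [('ftyp', 24), ('moov', ...), ('mdat', ...)]
# print(list_top_level_boxes("clip.3gp"))
```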
Personal and semi-professional “Live” broadcasting of user generated content has become more popular, especially among users of social networks and streaming services. However, current applications that operate OTT using best-effort QoS can only provide low resolution (480p, sometimes 720p) streams with quality often considered unacceptable. Semi-professionals in the field have reportedly resorted to streaming High Definition (HD) and Ultra High Definition (UHD) video uploads over multiple simultaneous links via multiple mobile devices and across different radio access technologies (see Figure 6). Some form of guaranteed QoS is needed to provide a viable service with practical capturing and transmitting devices that could support wider adoption and usage.
Tests in commercial 4G Long Term Evolution (LTE) networks demonstrate that uplink transmission of high-quality video requires QoS and can only support 2-3 users per cell. With its ability to provide even higher data rates at lower latencies, 5G has the potential to provide a QoS level that supports multiple live HD and UHD video streams in the same geographic area.
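A back-of-envelope calculation shows why the per-cell numbers come out so low; the figures below are purely illustrative assumptions, not measured values from the tests mentioned above.

```python
# Purely illustrative uplink budget: assumed numbers, not measurements.
usable_uplink_mbps = 30.0   # assumed usable LTE uplink capacity per cell
hd_stream_mbps = 10.0       # assumed bitrate of one high-quality HD stream

print(int(usable_uplink_mbps // hd_stream_mbps), "simultaneous streams")  # -> 3
```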
Live Streaming typically does not require the same low latencies as conversational media, even when there is some interaction between the viewers and the operator of the capture device. The additional latency budget provides flexibility for uplink schedulers to improve cell capacity when supporting these high data rate streams.
The initial Release 15 version of the Framework for Live Uplink Streaming (FLUS) TS [20] focused on a fast time-to-market solution leveraging implementations based on the Multimedia Telephony Service for IMS (MTSI). The architecture (see Figure 7) re-used the IMS session control and MTSI protocol stack to support delivering an uplink stream to a server from which it could be forwarded on to viewers. It also supported live streaming directly to another MTSI client to provide a richer form of the “See What I See” CSI service.
SA4 also recognized that support for 3rd party service providers (e.g., social networks and streaming sites for UGC), with their large user bases, would be important for a successful service. Initial support for 3rd party services was provided in Release 15 by defining a non-IMS framework that enables use of the more web-friendly HTTP RESTful interface for control signalling (F-C) and a flexible user plane (F-U) that allows 3rd party services to continue using their own user plane protocol stacks over the 3GPP radio interface. For example, the widely used Real Time Messaging Protocol (RTMP) streaming protocol can be used within the Live Uplink Streaming framework. The specific codec formats for this feature were left unspecified in Release 15 to allow maximum flexibility, leaving interoperability between servers and clients to be handled by the 3rd party.
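As an illustration of this non-IMS control plane, the sketch below creates a FLUS session over an HTTP RESTful F-C interface and points its user plane at an RTMP ingest. The endpoint path and resource fields here are hypothetical; TS 26.238 [20] defines the actual interface.

```python
# Hypothetical F-C exchange: create a FLUS session whose user plane (F-U)
# carries an RTMP stream. Endpoint paths and field names are invented for
# illustration; TS 26.238 specifies the actual RESTful interface.
import json
import urllib.request

session = {
    "generic": {"name": "backyard-concert"},
    # F-U left to the 3rd party service: here an RTMP ingest URL.
    "userPlane": {
        "protocol": "rtmp",
        "ingestUrl": "rtmp://ingest.example-sns.com/live/stream-key-123",
    },
}

req = urllib.request.Request(
    "https://flus-sink.example.com/flus/v1/sessions",  # hypothetical F-C endpoint
    data=json.dumps(session).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(req)  # returns the created session resource
```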
The plans for Release 16 include further enhancing the support for 3rd party service providers by providing a QoS network API that would enable 3rd parties to request the necessary QoS for the uplink. Along with this, SA4 plans to investigate developing new QoS Class Identifiers (QCIs) that would provide other latency operating points to trade off between latency, bandwidth, and capacity. APIs between the terminal/application and the uplink server/sink will also be developed to enable control of network-based processing, such as stitching and transcoding, and of how the media is to be distributed (e.g., over PSS, DASH, or MBMS).
The area of multimedia technology is highly dynamic and advances rapidly. The adoption and growth of new services requires high performance, reliability and scalability of the 5G system and its multimedia enablers. 3GPP relies solely on the contributions of its members to define those enablers. 5G commercial launches are around the corner, and the standardization work on 5G multimedia has started and will continue in the years to come to support new advanced immersive and interactive services offered by operators and third parties to their subscribers.
[1] 3GPP TS 26.233: “Transparent end-to-end Packet-switched Streaming service (PSS); General description”.
[2] 3GPP TS 26.116: “Television (TV) over 3GPP services; Video profiles”.
[3] 3GPP TS 26.118: “3GPP Virtual reality profiles for streaming applications”.
[4] 3GPP TS 26.307: “Presentation layer for 3GPP services”.
[5] 3GPP TS 29.214: “Policy and Charging Control over Rx reference point”.
[6] 3GPP TS 26.247: “Transparent end-to-end Packet-switched Streaming Service (PSS); Progressive Download and Dynamic Adaptive Streaming over HTTP (3GP-DASH)”.
[7] 3GPP TS 23.501: “System Architecture for the 5G System”.
[8] 3GPP TR 26.891: “5G enhanced mobile broadband; Media distribution”.
[9] ISO/IEC CD 23000-19: “Common Media Application Format” (MPEG CMAF).
[10] 3GPP TR 26.918: “Virtual Reality (VR) media services over 3GPP”.
[11] 3GPP TR 22.891: “Study on New Services and Markets Technology Enablers”.
[12] https://www.wordstream.com/blog/ws/2017/03/08/video-marketing-statistics
[13] 3GPP TS 26.140: “Multimedia Messaging Service (MMS); Media formats and codecs”.
[14] 3GPP TS 26.234: “Transparent end-to-end Packet-switched Streaming Service (PSS); Protocols and codecs”.
[15] 3GPP TS 26.141: “IP Multimedia System (IMS) Messaging and Presence; Media formats and codecs”.
[16] 3GPP TS 22.340: “IP Multimedia Subsystem (IMS) messaging; Stage 1”.
[17] 3GPP TS 22.141: “Presence service; Stage 1”.
[18] 3GPP TS 26.244: “Transparent end-to-end packet switched streaming service (PSS); 3GPP file format (3GP)”.
[19] 3GPP TS 26.346: “Multimedia Broadcast/Multicast Service (MBMS); Protocols and codecs”.
[20] 3GPP TS 26.238: “Uplink streaming”.
[21] IETF RFC 2616: “Hypertext Transfer Protocol – HTTP/1.1”.
Frédéric Gabin received his M.Sc. degree in electronics and digital telecommunication systems in 1997 from Telecom ParisTech. He worked in the areas of speech and radio signal processing as a research and standardization engineer at Nortel Networks, as a systems engineer, standardization project manager and then research manager for the NEC terminal division, and as a standardization manager for Ericsson Mobile Platform and Ericsson. He is Standardization Manager for the Media area at Ericsson. Frédéric Gabin has served as delegate and chairman of several standardization groups at ETSI, 3GPP, DVB, GSMA and IMTC. He is the Chairman of the 3GPP SA4 Codec and Multimedia Working Group.
Gilles Teniou received the Master’s degree in Computer Vision Engineering from the Education and Research Department in Computer Science and Electrical Engineering, University of Rennes, France. He has been Head of video coding standardization activities at Orange. Gilles is currently Senior Standardization Manager on Content and TV services at Orange. He is in charge of managing the technical and operational standardization activities related to TV and audiovisual services, including service architecture, technologies used for audiovisual streams (media formats and protocols), as well as the application environment used for TV. In 3GPP, Gilles is the Vice Chair of the 3GPP SA4 Working Group and the Chairman of the Video Sub-Working Group.
Nikolai Leung received his B.S. Degree in Electrical Engineering from the University of the Philippines and his M.S. Degree in Electrical Engineering Communication Systems from the University of Michigan. He has been responsible for leading different engineering teams at QUALCOMM Technologies Incorporated and is currently a Director of Technical Standards. He is also serving as the Vice Chair of the 3GPP SA4 Working Group and the Chair of the Multimedia Telephony Services Sub-Working Group.
Imre Varga received his M.Sc. and Ph.D. degrees in electrical engineering and worked in various positions (R&D, project lead, line manager, department head for multimedia) on signal processing for professional audio, multimedia communication, speech coding and transmission, acoustics pre-processing, video applications, and command-and-control systems. He is Director of Technical Standards at QUALCOMM Technologies Incorporated responsible for speech and audio standardization.
Imre Varga served as delegate and chairman of standardization groups for speech and audio coding at ITU-T, 3GPP and IMTC. He also serves as the Chairman of the Enhanced Voice Services Sub-Working Group of 3GPP SA4.
Journal of ICT, Vol. 6, Nos. 1 & 2, 117–136. River Publishers
doi: 10.13052/jicts2245-800X.618
This is an Open Access publication. © 2018 the Author(s). All rights reserved.