NVIDIA has revealed more details about its GeForce RTX 30 series graphics cards during a Q&A session held on the official NVIDIA subreddit (via Videocardz and Hardwareluxx). Several NVIDIA employees answered community questions about the newly unveiled cards.
NVIDIA GeForce RTX 30 Series GPUs & Features Get Detailed In Reddit Q&A Session – New SM Design, New Features and Much More
The Reddit session amounted to almost 2,000 comments, and the questions that were answered revealed interesting new details. NVIDIA still hasn’t detailed as much as the community wanted, since it wants to let independent reviewers showcase what the GeForce RTX 30 series cards have to offer. That will have to wait until the 17th of September, which is also when NVIDIA will publish its Ampere gaming architecture whitepaper. Until then, let’s sit back and take a look at what NVIDIA has to say about its new gaming lineup.
New NVIDIA Streaming Multiprocessor For Ampere Gaming GPU
The first question that I will highlight asks NVIDIA about the new architectural design specific to the gaming Ampere GPUs. It is answered by Tony Tamasi, NVIDIA’s Senior Vice President of content and technology.
- Could you elaborate a little on this doubling of CUDA cores?
- How does it affect the general architectures of the GPCs?
- How much of a challenge is it to keep all those FP32 units fed?
- What was done to ensure high occupancy?
[Tony Tamasi] One of the key design goals for the Ampere 30-series SM was to achieve twice the throughput for FP32 operations compared to the Turing SM. To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.
Doubling the processing speed for FP32 improves performance for a number of common graphics and compute operations and algorithms. Modern shader workloads typically have a mixture of FP32 arithmetic instructions such as FFMA, floating point additions (FADD), or floating point multiplications (FMUL), combined with simpler instructions such as integer adds for addressing and fetching data, floating point compare, or min/max for processing results, etc. Performance gains will vary at the shader and application level depending on the mix of instructions. Ray tracing denoising shaders are good examples that might benefit greatly from doubling FP32 throughput.
Doubling math throughput required doubling the data paths supporting it, which is why the Ampere SM also doubled the shared memory and L1 cache performance for the SM (128 bytes/clock per Ampere SM versus 64 bytes/clock in Turing). Total L1 bandwidth for GeForce RTX 3080 is 219 GB/sec versus 116 GB/sec for GeForce RTX 2080 Super.
Like prior NVIDIA GPUs, Ampere is composed of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), Raster Operators (ROPS), and memory controllers.
The GPC is the dominant high-level hardware block with all of the key graphics processing units residing inside the GPC. Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs. More details on the NVIDIA Ampere architecture can be found in NVIDIA’s Ampere Architecture White Paper, which will be published in the coming days.
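As a quick sanity check on the L1 figures quoted above, here is a minimal sketch that multiplies the bytes-per-clock numbers from the answer by each card's boost clock. The boost clocks (1710 MHz for the RTX 3080, 1815 MHz for the RTX 2080 Super) are assumptions taken from the public spec sheets, not from the Q&A. Under those assumptions the result lands on 219 GB/sec and 116 GB/sec, which suggests the quoted totals are per-SM figures. That reading is my inference, not something NVIDIA stated.

```python
def l1_bandwidth_gb_per_s(bytes_per_clock: int, clock_mhz: float) -> float:
    """Per-SM L1/shared memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_per_clock * clock_mhz * 1e6 / 1e9


# Bytes per clock come from the Q&A answer; the clocks are assumed boost clocks.
ampere_sm = l1_bandwidth_gb_per_s(128, 1710)  # RTX 3080
turing_sm = l1_bandwidth_gb_per_s(64, 1815)   # RTX 2080 Super

print(f"Ampere SM: {ampere_sm:.2f} GB/s")
print(f"Turing SM: {turing_sm:.2f} GB/s")
```

Note that the ratio is slightly below 2x because the assumed RTX 2080 Super boost clock is higher than the RTX 3080's.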
A representation of the Ampere Gaming SM block diagram for next-gen NVIDIA GeForce RTX 30 series graphics cards created by Hardwareluxx & compared to the Turing Gaming SM.
Based on the information Tony provided, Hardwareluxx created a block diagram representation of the Ampere SM. The new SM block looks close to the final one, and you can see the FP32 units in both data paths. Each SM consists of 128 CUDA cores, which is why we have seen a doubling of the core count on the Ampere GPUs. We will have a more detailed article on the Ampere GPUs and the underlying architecture on the 17th of September, so look forward to it.
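The per-clock arithmetic in Tony's answer can be written out as a short sketch. The lane counts and the four partitions per SM come straight from the answer. The helper names are hypothetical, and the 68 SM count used for the RTX 3080 at the end is an assumption based on NVIDIA's published 8704 CUDA core figure rather than anything said in the Q&A.

```python
PARTITIONS_PER_SM = 4

# Each Ampere partition has one FP32-only datapath and one datapath
# that executes either FP32 or INT32 on a given clock.
FP32_ONLY_LANES = 16
SHARED_LANES = 16


def ampere_sm_ops_per_clock(shared_runs_int32: bool) -> dict:
    """Operations per clock for one Ampere SM in the two extreme cases."""
    fp32 = FP32_ONLY_LANES + (0 if shared_runs_int32 else SHARED_LANES)
    int32 = SHARED_LANES if shared_runs_int32 else 0
    return {"fp32": fp32 * PARTITIONS_PER_SM, "int32": int32 * PARTITIONS_PER_SM}


# The answer states 128 FP32/clock is double the Turing SM rate.
TURING_FP32_PER_SM = ampere_sm_ops_per_clock(False)["fp32"] // 2


def cuda_cores(sm_count: int) -> int:
    """Marketing CUDA core count: FP32-capable lanes per SM times SM count."""
    return sm_count * (FP32_ONLY_LANES + SHARED_LANES) * PARTITIONS_PER_SM


print(ampere_sm_ops_per_clock(False))  # all-FP32 case
print(ampere_sm_ops_per_clock(True))   # mixed FP32 + INT32 case
print(cuda_cores(68))                  # assumed SM count for RTX 3080
```

The mixed case is the reason real-world gains depend on the instruction mix: whenever the shared datapath is busy with INT32 work, the SM's FP32 rate falls back to the Turing level.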
NVIDIA RTX IO – How It Works, What’s Needed To Make It Work & More
Moving on, Tony also answered a whole bunch of questions regarding RTX IO, which NVIDIA unveiled during the live event. RTX IO is described as a ‘suite of technologies’ that delivers GPU-based loading and decompression of game assets, which the company claims can speed up I/O performance by a hundred times compared to standard hard drives and storage APIs.
NVIDIA had already provided some details, such as the requirement for NVMe drives rather than SATA SSDs, but there is a lot more to learn about this next-generation feature, so keep reading below:
Pengwin17523 – Will there be a certain SSD speed requirement for RTX I/O?
[Tony Tamasi] There is no SSD speed requirement for RTX IO, but obviously, faster SSDs such as the latest generation of Gen4 NVMe SSDs will produce better results, meaning faster load times, and the ability for games to stream more data into the world dynamically. Some games may have minimum requirements for SSD performance in the future, but those would be determined by the game developers. RTX IO will accelerate SSD performance regardless of how fast it is, by reducing the CPU load required for I/O, and by enabling GPU-based decompression, allowing game assets to be stored in a compressed format and offloading potentially dozens of CPU cores from doing that work. Compression ratios are typically 2:1, so that would effectively amplify the read performance of any SSD by 2x.
SBMS-A-Man108 – Does RTX IO allow the use of SSD space as VRAM? Or am I completely misunderstanding?
[Tony Tamasi] RTX IO allows reading data from SSDs at much higher speed than traditional methods, and allows the data to be stored and read in a compressed format by the GPU, for decompression and use by the GPU. It does not allow the SSD to replace frame buffer memory, but it allows the data from the SSD to get to the GPU, and GPU memory, much faster, with much less CPU overhead.
Aztec47 – Could we see RTX IO coming to machine learning libraries such as Pytorch? This would be great for performance in real-time applications
[Tony Tamasi] NVIDIA delivered high-speed I/O solutions for a variety of data analytics platforms roughly a year ago with NVIDIA GPUDirect Storage. It provides for high-speed I/O between the GPU and storage, specifically for AI and HPC type applications and workloads. For more information please check out: https://developer.nvidia.com/blog/gpudirect-storage/
Qrios1ty – I am excited for the RTX I/O feature but I partially don’t get how exactly it works?
[Tony Tamasi] RTX IO and DirectStorage will require applications to support those features by incorporating the new APIs. Microsoft is targeting a developer preview of DirectStorage for Windows for game developers next year, and NVIDIA RTX gamers will be able to take advantage of RTX IO enhanced games as soon as they become available.
[Nestledrink] Yes. Turing and Ampere.
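The 2:1 compression point from the answers above can be put into numbers with a small sketch. The function names and the example drive speeds are hypothetical and only illustrative; the one figure taken from the Q&A is the typical 2:1 compression ratio. The model also assumes GPU decompression keeps pace with the drive, which the Q&A does not quantify.

```python
def effective_read_gb_per_s(raw_gb_per_s: float, compression_ratio: float = 2.0) -> float:
    """Uncompressed data delivered per second when assets are stored compressed."""
    return raw_gb_per_s * compression_ratio


def load_time_s(uncompressed_gb: float, raw_gb_per_s: float,
                compression_ratio: float = 2.0) -> float:
    """Time to stream a given amount of uncompressed asset data from the SSD."""
    return uncompressed_gb / effective_read_gb_per_s(raw_gb_per_s, compression_ratio)


# Illustrative drive speeds only (hypothetical Gen3 and Gen4 NVMe SSDs).
for name, raw in [("Gen3 NVMe", 3.5), ("Gen4 NVMe", 7.0)]:
    print(name, effective_read_gb_per_s(raw), "GB/s effective,",
          load_time_s(28.0, raw), "s to load 28 GB of assets")
```

This also shows why there is no hard SSD speed requirement: the 2x factor applies to whatever raw read speed the drive offers.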
How Much of a Difference Is There Between PCIe 4.0 & PCIe 3.0?
Another important question that Tony answered concerns the difference between the PCIe 4.0 and PCIe 3.0 interfaces. For NVIDIA Ampere Gaming GPUs, the performance difference between a Gen 3 and a Gen 4 x16 link is typically less than a few percent, and CPU selection often has a larger impact. This shouldn’t discourage PCIe Gen 4 platform owners, as NVIDIA does mention potential performance increases with a full Gen 4 platform, and those upgrading their PCs should keep that in mind.
Will PCIe 3.0 bottleneck the RTX 3090? Concerned because my Intel system does not support 4.0.
[Tony Tamasi] System performance is impacted by many factors and the impact varies between applications. The impact is typically less than a few percent going from a x16 PCIe 4.0 to x16 PCIe 3.0. CPU selection often has a larger impact on performance. We look forward to new platforms that can fully take advantage of Gen4 capabilities for potential performance increases.
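For context on the two interfaces, here is a minimal sketch of the theoretical one-direction bandwidth of an x16 link. The transfer rates (8 GT/s for PCIe 3.0, 16 GT/s for PCIe 4.0) and the 128b/130b encoding come from the PCIe specifications, not from the Q&A, and the function name is hypothetical. The numbers describe the link only; as the answer above notes, the doubling does not translate into a comparable difference in game performance.

```python
def pcie_link_gb_per_s(gt_per_s: float, lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth of a PCIe 3.0/4.0 link in GB/s.

    Both generations use 128b/130b encoding, so 128 of every 130 bits
    on the wire are payload. Protocol overhead is ignored.
    """
    payload_bits_per_s = gt_per_s * lanes * (128 / 130)
    return payload_bits_per_s / 8  # bits to bytes


gen3_x16 = pcie_link_gb_per_s(8.0)
gen4_x16 = pcie_link_gb_per_s(16.0)

print(f"PCIe 3.0 x16: {gen3_x16:.2f} GB/s")
print(f"PCIe 4.0 x16: {gen4_x16:.2f} GB/s")
```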
NVIDIA DLSS 2.1, Reflex, RTX Encoder Detailed
The next couple of questions are related to certain features of the NVIDIA Ampere Gaming GPUs such as DLSS 2.1, Reflex, and the Ampere Encoder which were answered by different employees.
EeK09 – What kind of advancements can we expect from DLSS? Most people were expecting a DLSS 3.0, or, at the very least, something like DLSS 2.1. Are you going to keep improving DLSS and offer support for more games while maintaining the same version?
[NV-Randy] DLSS SDK 2.1 is out and it includes three updates:
– New ultra performance mode for 8K gaming. Delivers 8K gaming on GeForce RTX 3090 with a new 9x scaling option.
– VR support. DLSS is now supported for VR titles.
– Dynamic resolution support. The input buffer can change dimensions from frame to frame while the output size remains fixed. If the rendering engine supports dynamic resolution, DLSS can be used to perform the required upscale to the display resolution.
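To make the 9x scaling and dynamic resolution points above concrete, here is a minimal sketch of the resolution arithmetic. It assumes that "9x" refers to total pixel count (3x per axis) and that 8K means 7680x4320. The function names are hypothetical and are not part of the DLSS SDK.

```python
import math


def dlss_input_resolution(out_w: int, out_h: int, pixel_scale: float) -> tuple:
    """Render resolution needed for a given output size and pixel-count scale."""
    per_axis = math.sqrt(pixel_scale)
    return round(out_w / per_axis), round(out_h / per_axis)


def pixel_scale(in_w: int, in_h: int, out_w: int, out_h: int) -> float:
    """Pixel-count upscale factor for one frame; with dynamic resolution the
    input size changes per frame while the output size stays fixed."""
    return (out_w * out_h) / (in_w * in_h)


# Ultra performance mode at 8K, assuming 9x means 9 times the pixels.
print(dlss_input_resolution(7680, 4320, 9))

# Dynamic resolution: varying input sizes, fixed 4K output (illustrative sizes).
for in_w, in_h in [(1920, 1080), (2560, 1440)]:
    print((in_w, in_h), "->", pixel_scale(in_w, in_h, 3840, 2160))
```

Under that assumption, the ultra performance mode at 8K renders internally at 2560x1440.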
Carmen813 – Will there be any improvements to the RTX encoder in the Ampere series cards, similar to what we saw for the Turing Release? I did see info on the Broadcast software, but I’m thinking more along the lines of improvements in overall image quality at the same bitrate.
[Jason Paul] For RTX 30 Series, we decided to focus improvements on the video decode side of things and added AV1 decode support. On the encode side, RTX 30 Series has the same great encoder as our RTX 20 Series GPU. We have also recently updated our NVIDIA Encoder SDK. In the coming months, livestream applications will be updating to this new version of the SDK, unlocking new performance options for streamers.
Akanash94 – Will Nvidia reflex work with pascal GPUs or is this only a Turing/Ampere feature?
[NV_Tim] It will work with 900 series + GPUs, including RTX 20-series.
NVIDIA RTX 30 Series Founders Edition Cooler – Quieter & More Efficient Than Turing Founders Edition
NVIDIA’s Product Manager of GeForce, Qi Lin, explains to a community member that the GeForce RTX 30 series Founders Edition design is cooler and quieter than the Founders Edition used on the Turing cards. He also says that most users don’t have to worry about airflow as long as their chassis is configured to bring fresh air to the GPU and move the air out of the PC case efficiently.
iCinn – Any idea if the dual airflow design is going to be messed up for inverted cases? More than previous designs? Seems like it would blow it down on the CPU. But the CPU cooler would still blow it out the case. Maybe it’s not so bad.
Second question. 10x quieter than the Titan for the 3090 is more or less quieter than an NVIDIA GeForce RTX2080 Super (Evga ultra fx for example)?
[Qi Lin] u/iCinn The new flow through cooling design will work great as long as chassis fans are configured to bring fresh air to the GPU, and then move the air that flows through the GPU out of the chassis. It does not matter if the chassis is inverted.
The Founders Edition RTX 3090 is quieter than both the Titan RTX and the Founders Edition RTX 2080 Super. We haven’t tested it against specific partner designs, but I think you’ll be impressed with what you hear… or rather, don’t hear. 🙂
As for the launch and prices, NVIDIA has announced that the GeForce RTX 3080 will be the first to hit retail on the 17th of September, followed by the GeForce RTX 3090 on the 24th of September and, lastly, the GeForce RTX 3070 in October. The graphics cards will retail at prices of $1499 US (RTX 3090), $699 US (RTX 3080), and $499 US (RTX 3070). Standard custom models will stick to the reference prices, while the more premium models will carry higher prices.