NVIDIA seems to have provided more information to the press regarding its GeForce RTX 30 series graphics cards and the Ampere GPUs that they utilize. The information is part of a deep-dive NDA’d session which takes a closer look at both GA102 and GA104 Gaming Ampere GPUs which will land in the gaming market in the coming weeks.
NVIDIA GeForce RTX 30 Series Graphics Cards Specs, Performance & GA102/GA104 GPUs Further Detailed in Deep-Dive
The deep-dive session includes information on the NVIDIA GeForce RTX 30 series, some of which we have already seen during the official unveil on 1st September and some of which is new and provides a more detailed look at the Ampere gaming GPUs. NVIDIA also detailed a small amount of information during its Reddit Q&A session, where it talked about the new SM design for its Ampere GPUs. But before that, let’s take a look at the GPUs powering NVIDIA’s brand new GeForce RTX 30 series lineup. The following images are courtesy of Hardwareluxx.de.
NVIDIA GA102 GPU – The Flagship Ampere Gaming GPU For GeForce RTX 3090 & RTX 3080
The NVIDIA GA102 GPU is the flagship gaming chip which features a die size of 628mm2 and packs in a total of 28 Billion transistors. According to NVIDIA, the GA102 GPU is organized into GPCs (Graphics Processing Clusters), each holding up to 6 TPCs (Texture Processing Clusters) with 2 SMs per TPC; the full chip has 7 GPCs. The GA102 GPU on the RTX 3090 makes use of 41 TPCs or 82 SMs, while the one on the GeForce RTX 3080 makes use of 34 TPCs or 68 SMs. Each SM on the Ampere GPU features 128 CUDA cores along with a redesigned structure which we will detail in a bit. The GA102 GPU on the RTX 3090 features a total of 10,496 cores while the one on the RTX 3080 features 8,704 cores.
In terms of GPU density, the GA102 GPU is nearly twice as dense as the Turing TU102 GPU, with 44.56 million transistors per square millimeter versus 24.67 million transistors per square millimeter on Turing, and that’s all on the Samsung 8nm process node.
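The density figures above can be reproduced with a quick calculation. The sketch below uses the GA102 numbers from this article; the TU102 transistor count and die size (18.6 billion, 754 mm2) are not stated here and are assumed from NVIDIA's published Turing specifications.

```python
def density_mtr_per_mm2(transistors_billion: float, die_area_mm2: float) -> float:
    """Return transistor density in millions of transistors per mm^2."""
    return transistors_billion * 1000.0 / die_area_mm2


# GA102 values are quoted in the article.
ga102 = density_mtr_per_mm2(28.0, 628.4)
# TU102 values are an assumption (NVIDIA's published Turing specs).
tu102 = density_mtr_per_mm2(18.6, 754.0)

print(f"GA102: {ga102:.2f} MTr/mm^2")
print(f"TU102: {tu102:.2f} MTr/mm^2")
print(f"Ratio: {ga102 / tu102:.2f}x")
```

Under these assumptions the ratio works out to roughly 1.8x, which is where the "nearly twice as dense" description comes from.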
Each SM consists of four tensor cores and 1 RT core. The GA102 GPU features a shared L2 cache. It is 6 MB for the GeForce RTX 3090 and 5 MB for the RTX 3080. The specific GPU block diagram that’s been shared shows a total of ten 32-bit memory controllers for the GeForce RTX 3080 which deliver a 320-bit bus. The GeForce RTX 3090 will feature a total of twelve 32-bit memory controllers for a 384-bit bus interface.
NVIDIA GA104 GPU – The Efficiency and Gaming Optimized GPU For The GeForce RTX 3070
At the heart of the NVIDIA GeForce RTX 3070 graphics card lies the GA104 GPU. The GA104 is one of the many Ampere GPUs that we will be getting in the gaming segment. The GA104 GPU is the second-fastest Ampere chip in the stack. The GPU is based on Samsung’s 8nm (8N) process node. The GPU measures 395.2mm2 and features 17.4 Billion transistors, which is almost 93% of the transistors featured on the TU102 GPU. At the same time, the GA104 GPU is almost half the size of the TU102 GPU, which is an insane increase in density.
For the GeForce RTX 3070, NVIDIA has enabled a total of 46 SM units, which results in a total of 5888 CUDA cores. In addition to the CUDA cores, NVIDIA’s GeForce RTX 3070 also comes packed with next-generation RT (Ray-Tracing) cores, Tensor cores, and brand new SM or streaming multi-processor units. The GPU features a total of 184 Tensor cores and 46 RT cores. There’s a strong possibility that the full-fat GA104 GPU comes with a 6144 core configuration, which could launch in a future graphics card variant. The GA104 GPU features a 4 MB shared L2 cache and has a total of eight 32-bit memory controllers for a 256-bit wide bus interface.
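The core counts and bus widths quoted for GA102 and GA104 all follow from the per-SM figures (128 CUDA cores, 4 Tensor cores, 1 RT core) and the 32-bit memory controllers. A minimal sketch of that arithmetic is below; the `GpuConfig` class and its field names are our own illustration, not anything NVIDIA ships.

```python
from dataclasses import dataclass

CUDA_CORES_PER_SM = 128
TENSOR_CORES_PER_SM = 4
RT_CORES_PER_SM = 1
MEMORY_CONTROLLER_WIDTH_BITS = 32


@dataclass(frozen=True)
class GpuConfig:
    """Hypothetical container for the per-card figures quoted in the article."""
    name: str
    sm_count: int
    memory_controllers: int

    @property
    def cuda_cores(self) -> int:
        return self.sm_count * CUDA_CORES_PER_SM

    @property
    def tensor_cores(self) -> int:
        return self.sm_count * TENSOR_CORES_PER_SM

    @property
    def rt_cores(self) -> int:
        return self.sm_count * RT_CORES_PER_SM

    @property
    def bus_width_bits(self) -> int:
        return self.memory_controllers * MEMORY_CONTROLLER_WIDTH_BITS


cards = [
    GpuConfig("RTX 3070", sm_count=46, memory_controllers=8),
    GpuConfig("RTX 3080", sm_count=68, memory_controllers=10),
    GpuConfig("RTX 3090", sm_count=82, memory_controllers=12),
]

for card in cards:
    print(card.name, card.cuda_cores, card.tensor_cores,
          card.rt_cores, f"{card.bus_width_bits}-bit")
```

The same arithmetic shows that a 6144 core GA104 would correspond to 48 SMs, since 48 x 128 = 6144.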
NVIDIA GeForce RTX 30 Series ‘Ampere’ Graphics Card Specifications:
Graphics Card Name | NVIDIA GeForce RTX 3070 | NVIDIA GeForce RTX 3080 | NVIDIA GeForce RTX 3090 |
---|---|---|---|
GPU Name | Ampere GA104-300 | Ampere GA102-200 | Ampere GA102-300 |
Process Node | Samsung 8nm | Samsung 8nm | Samsung 8nm |
Die Size | 395.2mm2 | 628.4mm2 | 628.4mm2 |
Transistors | 17.4 Billion | 28 Billion | 28 Billion |
CUDA Cores | 5888 | 8704 | 10496 |
TMUs / ROPs | TBA | TBA | TBA |
Tensor / RT Cores | 184 / 46 | 272 / 68 | 328 / 82 |
Base Clock | 1500 MHz | 1440 MHz | 1400 MHz |
Boost Clock | 1730 MHz | 1710 MHz | 1700 MHz |
FP32 Compute | 20 TFLOPs | 30 TFLOPs | 36 TFLOPs |
RT TFLOPs | 40 TFLOPs | 58 TFLOPs | 69 TFLOPs |
Tensor-TOPs | 163 TOPs | 238 TOPs | 285 TOPs |
Memory Capacity | 8/16 GB GDDR6 | 10/20 GB GDDR6X | 24 GB GDDR6X |
Memory Bus | 256-bit | 320-bit | 384-bit |
Memory Speed | 14 Gbps | 19 Gbps | 19.5 Gbps |
Bandwidth | 448 GB/s | 760 GB/s | 936 GB/s |
TDP | 220W | 320W | 350W |
Price (MSRP / FE) | $499 US | $699 US | $1499 US |
Launch (Availability) | October 2020 | 17th September | 24th September |
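Two rows of the table can be derived from the others, which is a handy sanity check. The sketch below assumes the usual convention that each CUDA core retires one fused multiply-add (2 FLOPs) per clock at the boost clock, and that memory bandwidth is the bus width times the per-pin data rate divided by 8 bits per byte. The function names are ours.

```python
def fp32_tflops(cuda_cores: int, boost_mhz: float) -> float:
    """Peak FP32 throughput, counting one FMA (2 FLOPs) per core per clock."""
    return cuda_cores * 2 * boost_mhz * 1e6 / 1e12


def memory_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s from bus width and per-pin data rate."""
    return bus_width_bits * data_rate_gbps / 8


# (name, CUDA cores, boost MHz, bus width in bits, memory speed in Gbps)
specs = [
    ("RTX 3070", 5888, 1730, 256, 14.0),
    ("RTX 3080", 8704, 1710, 320, 19.0),
    ("RTX 3090", 10496, 1700, 384, 19.5),
]

for name, cores, boost, bus, rate in specs:
    print(f"{name}: {fp32_tflops(cores, boost):.1f} TFLOPs, "
          f"{memory_bandwidth_gb_s(bus, rate):.0f} GB/s")
```

Rounded to whole numbers, the results line up with the FP32 Compute and Bandwidth rows above.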
NVIDIA Ampere SM (Streaming Multiprocessor Design) – Twice The FP32 Throughput
The NVIDIA GeForce RTX 30 series cards with Ampere GPUs also come with a brand new SM design which was recently explained by Tony Tamasi. Following are the full details of what’s new in the Ampere SM architecture:
One of the key design goals for the Ampere 30-series SM was to achieve twice the throughput for FP32 operations compared to the Turing SM. To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.
Doubling the processing speed for FP32 improves performance for a number of common graphics and compute operations and algorithms. Modern shader workloads typically have a mixture of FP32 arithmetic instructions such as FFMA, floating-point additions (FADD), or floating-point multiplications (FMUL), combined with simpler instructions such as integer adds for addressing and fetching data, floating-point compare, or min/max for processing results, etc. Performance gains will vary at the shader and application level depending on the mix of instructions. Ray tracing denoising shaders are good examples that might benefit greatly from doubling FP32 throughput.
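To get a feel for why the gain depends on the instruction mix, here is a deliberately idealized throughput model. It assumes a Turing SM has 64 FP32 lanes plus 64 separate INT32 lanes that run concurrently (our assumption, not stated above), while an Ampere SM has 64 FP32-only lanes plus 64 lanes that can do either FP32 or INT32, as described above. It ignores scheduling, memory and every other real-world limit, so treat it as a sketch rather than a performance predictor.

```python
TURING_FP32_LANES = 64        # assumption: FP32 lanes per Turing SM
TURING_INT32_LANES = 64       # assumption: concurrent INT32 lanes per Turing SM
AMPERE_FP32_ONLY_LANES = 64   # 4 partitions x 16 FP32-only cores
AMPERE_SHARED_LANES = 64      # 4 partitions x 16 cores doing FP32 or INT32


def turing_clocks(fp32_ops: int, int32_ops: int) -> float:
    """Idealized clocks on a Turing SM: FP32 and INT32 run side by side."""
    return max(fp32_ops / TURING_FP32_LANES, int32_ops / TURING_INT32_LANES)


def ampere_clocks(fp32_ops: int, int32_ops: int) -> float:
    """Idealized clocks on an Ampere SM.

    INT32 work can only use the shared datapath, while FP32 work can use
    both datapaths, so the SM is limited by whichever bound is larger.
    """
    int_bound = int32_ops / AMPERE_SHARED_LANES
    total_bound = (fp32_ops + int32_ops) / (
        AMPERE_FP32_ONLY_LANES + AMPERE_SHARED_LANES
    )
    return max(int_bound, total_bound)


def ampere_speedup(fp32_ops: int, int32_ops: int) -> float:
    """Per-SM, per-clock speedup of Ampere over Turing in this toy model."""
    if fp32_ops + int32_ops == 0:
        raise ValueError("workload must contain at least one operation")
    return turing_clocks(fp32_ops, int32_ops) / ampere_clocks(fp32_ops, int32_ops)


for fp32_ops, int32_ops in [(1000, 0), (750, 250), (500, 500)]:
    print(fp32_ops, int32_ops, ampere_speedup(fp32_ops, int32_ops))
```

In this model a pure FP32 workload sees the full 2x, a 75/25 split sees 1.5x, and an even FP32/INT32 split sees no per-clock gain at all, which matches the point that gains vary with the shader's instruction mix.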
Doubling math throughput required doubling the data paths supporting it, which is why the Ampere SM also doubled the shared memory and L1 cache performance for the SM. (128 bytes/clock per Ampere SM versus 64 bytes/clock in Turing). Total L1 bandwidth for GeForce RTX 3080 is 219 GB/sec versus 116 GB/sec for GeForce RTX 2080 Super.
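The 219 GB/sec and 116 GB/sec figures look small next to the memory bandwidth numbers, and the arithmetic suggests why: they are consistent with per-SM bandwidth at the boost clock rather than a chip-wide total. The check below uses the RTX 3080 boost clock from the table above; the 1815 MHz boost clock for the RTX 2080 Super is an assumption taken from NVIDIA's published specs.

```python
def l1_bandwidth_gb_s(bytes_per_clock: int, boost_mhz: float) -> float:
    """Per-SM L1/shared memory bandwidth in GB/s at the given clock."""
    return bytes_per_clock * boost_mhz * 1e6 / 1e9


# Ampere SM: 128 bytes/clock at the RTX 3080 boost clock of 1710 MHz.
rtx_3080_per_sm = l1_bandwidth_gb_s(128, 1710)
# Turing SM: 64 bytes/clock; 1815 MHz is an assumed RTX 2080 Super boost clock.
rtx_2080_super_per_sm = l1_bandwidth_gb_s(64, 1815)

print(f"RTX 3080 per SM: {rtx_3080_per_sm:.0f} GB/s")
print(f"RTX 2080 Super per SM: {rtx_2080_super_per_sm:.0f} GB/s")
```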
Like prior NVIDIA GPUs, Ampere is composed of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), Raster Operators (ROPS), and memory controllers.
The GPC is the dominant high-level hardware block with all of the key graphics processing units residing inside the GPC. Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs. More details on the NVIDIA Ampere architecture can be found in NVIDIA’s Ampere Architecture White Paper, which will be published in the coming days.
Taking a closer look at the Ampere SM unit, each SM consists of 128 FP32 units. However, one of the two FP32 data paths can also execute INT32 operations instead. Each SM also has four Tensor cores, four texture units, and a single RT core.
For its 3rd Gen Tensor cores, NVIDIA is using the same sparsity architecture that it used on the Ampere HPC line of GPUs. While Ampere features 4 Tensor cores per SM compared to Turing’s 8 Tensor cores per SM, they are not only based on the new 3rd Generation design but also get an increased count with the larger SM array. Each Ampere Tensor core can execute 128 FP16 FMA operations per clock, and with sparsity it can do up to 256. The total FP16 FMA operations per SM come to 512, or 1024 with sparsity. With sparsity, that’s a 2x increase over the Turing GPU in terms of inference performance with the updated Tensor design.
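The per-SM Tensor figures can be scaled up to whole-card numbers. The sketch below counts each FMA as 2 operations and uses the SM counts and boost clocks from the specifications table. The results happen to match the Tensor-TOPs row of that table when the sparse rate is used; that correspondence is our inference from the arithmetic, not something NVIDIA states here.

```python
TENSOR_CORES_PER_SM = 4
DENSE_FMA_PER_TENSOR_CORE = 128   # FP16 FMA operations per clock
SPARSE_FMA_PER_TENSOR_CORE = 256  # with sparsity enabled


def fma_per_sm(sparse: bool) -> int:
    """FP16 FMA operations per clock for one Ampere SM."""
    per_core = SPARSE_FMA_PER_TENSOR_CORE if sparse else DENSE_FMA_PER_TENSOR_CORE
    return TENSOR_CORES_PER_SM * per_core


def tensor_tops(sm_count: int, boost_mhz: float, sparse: bool = True) -> float:
    """Peak tensor throughput in tera-operations/s, counting an FMA as 2 ops."""
    return sm_count * fma_per_sm(sparse) * 2 * boost_mhz * 1e6 / 1e12


print(fma_per_sm(sparse=False), fma_per_sm(sparse=True))
for name, sms, boost in [("RTX 3070", 46, 1730),
                         ("RTX 3080", 68, 1710),
                         ("RTX 3090", 82, 1700)]:
    print(f"{name}: {tensor_tops(sms, boost):.0f} TOPs with sparsity")
```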
The same goes for the ray tracing cores, which in their 2nd iteration deliver twice the number of ray intersections compared to the Turing architecture. The higher number of SMs also amounts to a higher number of RT cores, and that also improves the overall performance of ray-tracing acceleration on Ampere.
GDDR6X – The Next Evolution in Graphics Memory, Designed Exclusively For NVIDIA’s GeForce RTX 30 Series Graphics Cards
The Micron GDDR6X memory brings a lot to the table. It is faster, doubles the I/O data rate, and is the first to implement PAM4 multi-level signaling in memory dies. With GeForce RTX 3090 class products, Micron’s GDDR6X memory achieves a bandwidth of up to 1 TB/s, which is used to power next-generation gaming experiences at high-fidelity resolutions such as 8K.
The new GDDR6X SGRAM:
- Doubles the data rate of SGRAM at a lower power per transaction while breaking the 1 Terabyte per second (TB/s) system memory bandwidth boundary for graphics card applications;
- Is the first discrete graphics memory device that employs PAM4 encoded signaling between the processor and the DRAM, using four voltage levels to encode and transfer two bits of data per interface clock;
- Can be designed and operated stably at high speeds, and built in mass production.
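The idea behind PAM4 is easy to show in a few lines: instead of one bit per symbol with two voltage levels, each symbol carries two bits using four levels. The sketch below uses a Gray-coded mapping chosen purely for illustration; it is not Micron's actual GDDR6X encoding, and the level numbers 0 to 3 simply stand in for the four voltages.

```python
# Hypothetical Gray-coded mapping from a bit pair to one of four levels.
PAM4_LEVELS = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}


def nrz_encode(bits: list[int]) -> list[int]:
    """Two-level signaling: one symbol per bit."""
    return list(bits)


def pam4_encode(bits: list[int]) -> list[int]:
    """Four-level signaling: one symbol per pair of bits."""
    if len(bits) % 2 != 0:
        raise ValueError("PAM4 needs an even number of bits")
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


payload = [1, 0, 0, 1, 1, 1, 0, 0]
print("NRZ symbols: ", nrz_encode(payload))
print("PAM4 symbols:", pam4_encode(payload))
```

The same eight bits take eight symbols with two-level signaling but only four with PAM4, which is how the data rate doubles without doubling the symbol rate.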
As mentioned, GDDR6X features the brand new PAM4 multilevel signaling technique, which helps transfer data much faster and doubles the I/O rate, pushing the capability of each memory die from 64 GB/s to 84 GB/s. The Micron GDDR6X memory dies are also the only graphics DRAM that can be mass-produced while featuring PAM4 signaling.
What is interesting is that Micron states that its GDDR6X memory can hit speeds of up to 21 Gbps, whereas we have only seen 19.5 Gbps in action on the GeForce RTX 3090. It is likely that AIBs could utilize higher binned dies as they become available. Micron also confirms that it plans to offer speeds higher than 21 Gbps moving into 2021, but we will have to wait and see whether any cards will utilize them.
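The per-die figures follow directly from the per-pin speed if we assume each die exposes a 32-bit interface (two x16 channels, as in the comparison table further down). A small sketch, with a function name of our own choosing:

```python
DIE_INTERFACE_WIDTH_BITS = 32  # assumption: 2 channels x 16 bits per die


def per_die_bandwidth_gb_s(data_rate_gbps: float) -> float:
    """Peak bandwidth of a single memory die in GB/s."""
    return DIE_INTERFACE_WIDTH_BITS * data_rate_gbps / 8


for label, rate in [("GDDR6 at 16 Gbps", 16.0),
                    ("GDDR6X at 19.5 Gbps", 19.5),
                    ("GDDR6X at 21 Gbps", 21.0)]:
    print(f"{label}: {per_die_bandwidth_gb_s(rate):.0f} GB/s per die")
```

At 16 Gbps and 21 Gbps this gives the 64 GB/s and 84 GB/s quoted above, while the 19.5 Gbps used on the GeForce RTX 3090 sits in between.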
It’s not just about faster speeds: Micron’s GDDR6X provides higher bandwidth while consuming 15% less power per transferred bit compared to the previous generation GDDR6 memory.
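Lower power per bit does not automatically mean lower total memory power, because more bits are moved per second. As a rough illustration, the sketch below compares 19 Gbps GDDR6X against 14 Gbps GDDR6 (the speeds in the specifications table above), assuming the 15% figure applies uniformly at those speeds. That is a simplification on our part, not a Micron statement.

```python
def relative_interface_power(energy_per_bit_saving: float,
                             new_rate_gbps: float,
                             old_rate_gbps: float) -> float:
    """Power of the new memory relative to the old one (1.0 = equal).

    Power scales with energy per bit times bits per second, so a saving
    per bit can be offset by a higher data rate.
    """
    return (1.0 - energy_per_bit_saving) * new_rate_gbps / old_rate_gbps


ratio = relative_interface_power(0.15, new_rate_gbps=19.0, old_rate_gbps=14.0)
print(f"GDDR6X at 19 Gbps vs GDDR6 at 14 Gbps: {ratio:.2f}x the power per pin")
```

Under these assumptions the faster memory still draws somewhat more power per pin, but it moves about 36% more data for roughly 15% more power.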
Micron GDDR6X Memory
Feature | GDDR5 | GDDR5X | GDDR6 | GDDR6X |
---|---|---|---|---|
Density | From 512Mb to 8Gb | 8Gb | 8Gb, 16Gb | 8Gb, 16Gb |
VDD and VDDQ | Either 1.5V or 1.35V | 1.35V | Either 1.35V or 1.25V | Either 1.35V or 1.25V |
VPP | N/A | 1.8V | 1.8V | 1.8V |
Data rates | Up to 8 Gb/s | Up to 12Gb/s | Up to 16 Gb/s | 19 Gb/s, 21 Gb/s, >21 Gb/s |
Channel count | 1 | 1 | 2 | 2 |
Access granularity | 32 bytes | 64 bytes; 2x 32 bytes in pseudo 32B mode | 2 ch x 32 bytes | 2 ch x 32 bytes |
Burst length | 8 | 16 / 8 | 16 | 8 in PAM4 mode; 16 in RDQS mode |
Signaling | POD15/POD135 | POD135 | POD135/POD125 | PAM4 POD135/POD125 |
Package | BGA-170, 14mm x 12mm, 0.8mm ball pitch | BGA-190, 14mm x 12mm, 0.65mm ball pitch | BGA-180, 14mm x 12mm, 0.75mm ball pitch | BGA-180, 14mm x 12mm, 0.75mm ball pitch |
I/O width | x32/x16 | x32/x16 | 2 ch x16/x8 | 2 ch x16/x8 |
Signal count | 61 (40 DQ, DBI, EDC; 15 CA; 6 CK, WCK) | 61 (40 DQ, DBI, EDC; 15 CA; 6 CK, WCK) | 70 or 74 (40 DQ, DBI, EDC; 24 CA; 6 or 10 CK, WCK) | 70 or 74 (40 DQ, DBI, EDC; 24 CA; 6 or 10 CK, WCK) |
PLL, DCC | PLL | PLL | PLL, DCC | DCC |
CRC | CRC-8 | CRC-8 | 2x CRC-8 | 2x CRC-8 |
VREFD | External or internal per 2 bytes | Internal per byte | Internal per pin | Internal per pin; 3 sub-receivers per pin |
Equalization | N/A | RX/TX | RX/TX | RX/TX |
VREFC | External | External or Internal | External or Internal | External or Internal |
Self refresh (SRF) | Yes; Temp. Controlled SRF | Yes; Temp. Controlled SRF; Hibernate SRF | Yes; Temp. Controlled SRF; Hibernate SRF; VDDQ-off | Yes; Temp. Controlled SRF; Hibernate SRF; VDDQ-off |
Scan | SEN | IEEE 1149.1 (JTAG) | IEEE 1149.1 (JTAG) | IEEE 1149.1 (JTAG) |
NVIDIA GeForce RTX 30 Series Cooling Design & Thermals
NVIDIA has developed one of its best and most powerful Founders Edition cooling designs to date for the GeForce RTX 30 series graphics cards. NVIDIA explained that higher performance requires a new form of cooling solution and, as such, it has prepared a unique cooling solution for its next-gen cards which will keep the GPUs running cool and quiet by utilizing several new & existing technologies.
The Founders Edition cooler uses a full aluminum alloy heatsink with a hybrid vapor chamber and dual-sided axial-tech based fans. The heatsink has a nano-carbon coating and should do a really good job of keeping temperatures under control.
The design is interesting in the sense that it goes all out with a fin and heat pipe design. This is the first design of its kind since the original Founders Edition GeForce GTX 780 that makes use of a much larger heatsink area.
It also comes with a unique fan placement, one on the front and one at the bottom. This push & pull fan configuration, as it is referred to, is said to push heat out of the exhaust vents much more effectively. Some air will be blown into the case from the back of the card itself, but that shouldn’t be a major cause for concern as modern CPU air or liquid coolers do a really good job of venting air out of the case.
Acoustically, the new Founders Edition design is quieter than traditional dual axial coolers, while still delivering nearly 2x the cooling performance of previous-generation solutions. The NVLink and power design changes help here, creating more space for airflow through the largest fin stack seen to date, and the larger bracket vents improve airflow in concert with individually shaped shroud fins. In fact, wherever you look, every aspect of the Founders Edition cards is designed to maximize airflow, minimize temperatures, and enable the highest levels of performance with the least possible noise.
In terms of cooler noise and performance, the GeForce RTX 3080 operates at a peak temperature of 78C when hitting its peak TBP of 320W with a noise output of just 30dBA. For comparison, the Turing Founders Edition coolers peak out at 81C with a noise output of 32dBA when hitting their TBP of 240W (RTX 2080 SUPER). In NVIDIA’s own testing, they reveal that the GeForce RTX 3080 averages at around 1920 MHz with a GPU power draw of 310W and a peak temperature of 76C.
This is also where NVIDIA gets its 1.9x efficiency figure from, as the RTX 3080 can deliver over 100 FPS while being cooler and quieter versus the 60 FPS of its Turing-generation predecessor.
NVIDIA GeForce RTX 3090 & RTX 3080 Graphics Card PCB & Power – Designed To Be Overclocked!
One of the biggest changes on the Founders Edition GeForce RTX 3090 graphics card is the PCB design. The GeForce RTX 3090 & GeForce RTX 3080 come with a unique & compact PCB package that is unlike anything we’ve seen in the consumer space before. But being compact doesn’t mean that the cards don’t pack a punch. There’s some serious horsepower on these compact PCBs that NVIDIA has designed.
The PCB features over 20 power chokes, which makes it a more premium design than the flagship non-reference RTX 20 series cards. The GPU is powered by 18 phases while the memory receives power from 2 phases. NVIDIA touts this PCB as an overclocking marvel with unprecedented GPU overclock headroom that most users can leverage to gain even faster performance. But as we pointed out earlier, the Founders Edition PCB is not the reference design, which will come with a standard rectangular PCB. Water block manufacturers have also confirmed this, as we reported earlier.
In addition to that, GeForce RTX 30 series Founders Edition cards will feature the 12-pin Micro-Fit 3.0 power connector. This connector doesn’t require a power supply upgrade, as the cards will ship with a bundled 2x 8-pin to 1x 12-pin adapter so you can run your latest graphics card without any compatibility issues.
The placement of the 12-pin connector on the PCB is also noteworthy. It is placed in a vertical position and, judging by the PCB design, we can tell why NVIDIA moved to a single 12-pin plug instead of the standard dual 8-pin design. There’s limited room on the PCB and, as such, it was necessary to go for a smaller, more compact power input.
NVIDIA GeForce RTX 30 Series Performance, Launch & Prices
NVIDIA has also shared some additional performance slides for its GeForce RTX 3090, GeForce RTX 3080, and GeForce RTX 3070 graphics cards.
NVIDIA isn’t sharing exact performance numbers right now, but from what has been showcased, the GeForce RTX 3070 is faster than an RTX 2080 Ti, the RTX 3080 is a good bit ahead of the RTX 2080 Ti, and the RTX 3090 is as much as 50% faster than the RTX 2080 Ti, which is very impressive for the full lineup.
In other news, NVIDIA has already showcased a new performance demonstration of its RTX 3080 in Doom Eternal which absolutely wrecks the GeForce RTX 2080 Ti and also revealed that the card is able to handle 4K gaming at maxed-out settings with ease, delivering north of 60 FPS in several AAA gaming titles.
As for the launch and prices, NVIDIA has announced that the GeForce RTX 3080 will be the first to hit retail on the 17th of September, followed by the GeForce RTX 3090 on the 24th of September and, lastly, the GeForce RTX 3070 in October. The graphics cards will retail at prices of $1499 US (RTX 3090), $699 US (RTX 3080), and $499 US (RTX 3070). Custom models will stick to the reference prices while the more premium models will feature higher prices.
Credits: wccftech