Analysis of Ampere Rumors by Jouni Osmala
This is my analysis of the rumors about what Ampere is going to be. Source: wcctech article.
1. SM Design changes
Each instruction is a 16-wide vector operation. An SM has 4 sub-blocks, each with its own copy of the units listed below, except for the cache and the texture units, which are shared. Here are my calculations of what the numbers shown in the leaks mean in hardware.
| | Turing | Ampere |
|---|---|---|
| Schedulers per SM | 4 | 4 |
| Registers/scheduler | 16384 | 16384 |
| Execution ports/scheduler | 2 | 2 |
| FP32 inst/cycle | 1 | 2 |
| Tensor inst/cycle | 1 | 2 |
| Integer inst/cycle | 1 | 1 |
| LD/ST | 4 | 4 |
| Special Function Unit (SFU) | 1 | 1 |
| L1 cache/local memory | 96 KB | 128 KB |
| Texture units | 4 | 4 |
| RT cores | Basic | Improved |
The significant power consumption that goes to scheduling and moving data to the execution units hasn't changed, nor has the load/store unit count. The SM still has 256 KB of registers serving the same number of execution ports. So this is not a doubling of the SM; it looks more like a 20-33% increase in SM size, just enough to deliver what the rumors claim. And because a GPU die contains many units other than SMs, the percentage increase in overall die size from these changes is smaller.
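
The per-SM figures above can be checked with a few lines of arithmetic. This is only a sketch built on the rumored table values; the 4-byte register size and the function names are my own assumptions for illustration.

```python
# Per-SM figures implied by the rumored table values (not confirmed specs).
SCHEDULERS_PER_SM = 4
REGISTERS_PER_SCHEDULER = 16384
BYTES_PER_REGISTER = 4   # assumption: 32-bit registers
VECTOR_WIDTH = 16        # lanes per issued instruction


def register_file_kib():
    """Total register file per SM in KiB."""
    total_bytes = SCHEDULERS_PER_SM * REGISTERS_PER_SCHEDULER * BYTES_PER_REGISTER
    return total_bytes // 1024


def fp32_lanes_per_sm(fp32_inst_per_cycle):
    """FP32 lanes ("CUDA cores") per SM for a given FP32 issue rate."""
    return SCHEDULERS_PER_SM * VECTOR_WIDTH * fp32_inst_per_cycle


print(register_file_kib())     # 256
print(fp32_lanes_per_sm(1))    # 64  (Turing)
print(fp32_lanes_per_sm(2))    # 128 (rumored Ampere)
```

The register file stays at 256 KB in both cases while the FP32 lane count doubles, which is why the SM itself should grow by much less than 2x.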
What has changed, according to the rumor, is that certain instruction types can be issued to both pipelines at the same time. Let's dive into FP32 first. Turing added the ability to execute integer instructions in parallel with FP32, which according to Nvidia gave on average about 36% extra concurrency in shaders. Now Nvidia has made that second pipeline capable of executing FP32 in addition to integer, which would potentially increase the IPC from 1.36 to 2 in computationally intensive shaders that don't stall on other issues. Besides the performance benefit, FP32 FLOPS is the number most people quote when they talk about CUDA cores and performance, and this change doubles that number at a smaller increase in cost.
What I would have expected for the next generation was making the integer pipeline capable of handling all the FP operations that are cheap in hardware, leaving the multiplier to the other pipeline. That would significantly increase the utilization of the integer execution port, free the other pipeline for FMA operations, and get most of the benefit at a fraction of the cost. However, the doubled tensor cores alongside a full-sized FP32 block in the diagram suggest that this is a full FP32 pipeline, although the diagram does not guarantee it.
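
To make the 1.36 to 2 IPC step concrete, here is a minimal sketch of the issue-rate reasoning. It is an idealized model of my own (no stalls, perfect scheduling), and the function names are made up for illustration.

```python
def turing_ipc(int_per_100_fp=36):
    """IPC when the FP32 pipe is the bottleneck and integer work
    issues alongside it on the second pipe."""
    fp = 100
    return (fp + int_per_100_fp) / fp


def dual_fp32_ipc():
    """Idealized IPC when either pipe can take FP32, so both
    execution ports stay busy every cycle."""
    return 2.0


speedup = dual_fp32_ipc() / turing_ipc()
print(round(turing_ipc(), 2))  # 1.36
print(round(speedup, 2))       # 1.47
```

Under these assumptions the per-SM gain on such shaders is roughly 1.47x, not 2x, even though the headline FP32 FLOPS number doubles.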
Then the tensor cores: the rumor claims double the tensor core performance, and the tensor block is also twice as wide in the diagram. My reading is that Nvidia has realized that tensor instructions are used in bursts, so the other pipeline is often idle while tensor instructions execute, and they could double that throughput without paying twice the cost.
The L1 cache/shared memory increase is what should be expected. It is still smaller than the register file, it increases to the next power of 2, and it helps to feed the increased compute on the GPU side.
2. Overall rumored designs
The 60 SM version with 10 memory buses was reported as the 3080 before the "double the CUDA cores per SM" rumor, and after that rumor the claim changed to it being the 3080 Ti. Personally, I think the leakers don't know what it will be called and named it based on their own perception of the information.
Let's compare it with existing chips. The L2 cache size is not stated in any of the leaks, but Nvidia ties each L2 cache segment to a specific memory controller, so let's assume it scales with the number of memory controllers, with each segment either doubling or staying the same size.
| | TU104 | TU102 | GA103 | GA104 | 2080 Ti |
|---|---|---|---|---|---|
| SM | 48 | 72 | 60 | 48 | 68 |
| SM cost (GA × 1.25) | 48 | 72 | 75 | 60 | 72 |
| Memory controllers | 8 | 12 | 10 | 8 | 11 |
| L2 cache (MB) | 4 | 6 | (5/10?) | (4/8?) | (6/5.5)? |
| ROPs | 64 | 96 | 80 | 80 | 88 |
| NVLink(s) | 1 | 2 | 2 | 1 | 2 |
Besides the items listed, each chip contains other blocks that may be bigger or smaller and that we don't know about. Considering everything GA103 has, its chip cost looks a lot like that of a die-shrunk, optimized 2080 Ti, whatever name it ends up with.
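
The derived rows of the comparison can be reproduced as follows. This is a sketch under the stated assumptions only: L2 scales with memory controllers at Turing's 0.5 MB per controller (or double that), and an Ampere SM costs about 1.25 Turing SMs. The helper names are hypothetical.

```python
MB_PER_CONTROLLER_TURING = 4 / 8   # TU104: 4 MB of L2 over 8 controllers
SM_COST_FACTOR_AMPERE = 1.25       # assumption: Ampere SM ~25% larger


def l2_options_mb(controllers):
    """L2 size if each segment stays the same, and if it doubles."""
    same = controllers * MB_PER_CONTROLLER_TURING
    return same, 2 * same


def sm_cost(sm_count, is_ampere):
    """SM area cost expressed in Turing-SM equivalents."""
    return sm_count * (SM_COST_FACTOR_AMPERE if is_ampere else 1.0)


print(l2_options_mb(10))    # (5.0, 10.0)  GA103
print(l2_options_mb(8))     # (4.0, 8.0)   GA104
print(sm_cost(60, True))    # 75.0         GA103
print(sm_cost(48, True))    # 60.0         GA104
```

On this measure GA103 (75) lands close to the full TU102 die that the 2080 Ti is cut from (72).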
Would GA103 be bandwidth starved? The 2080 Ti has 14 Gbps memory, while mass-produced GDDR6 currently offers 16 Gbps. So a full GA103 should have at least a little more bandwidth than the 2080 Ti: 16/14 × 10/11 ≈ 1.04, a ~4% increase in bandwidth.
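
The bandwidth comparison can be written out as a short calculation. This sketch assumes 32-bit GDDR6 memory controllers for the absolute numbers; the function names are mine.

```python
BITS_PER_CONTROLLER = 32  # assumption: one 32-bit GDDR6 channel per controller


def bandwidth_gb_s(rate_gbps, controllers):
    """Peak memory bandwidth in GB/s for a given per-pin data rate."""
    return rate_gbps * controllers * BITS_PER_CONTROLLER / 8


def relative_bandwidth(rate_a, controllers_a, rate_b, controllers_b):
    """Bandwidth of configuration A relative to configuration B."""
    return (rate_a * controllers_a) / (rate_b * controllers_b)


ratio = relative_bandwidth(16, 10, 14, 11)
print(f"{(ratio - 1) * 100:.1f}%")                       # 3.9%
print(bandwidth_gb_s(16, 10), bandwidth_gb_s(14, 11))    # 640.0 616.0
```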
Then, if we take a 1.5 multiplier for compute per SM, we get 60 × 1.5 = 90 Turing-equivalent SMs, and 90/68 gives a ratio of about 1.32 between the full GA103 and the 2080 Ti. Because we don't know anything about frequencies, I make no assumptions about them. The Turing lineup has several SKUs with a worse bandwidth/compute ratio than the 2080 Ti, such as the 2060 or the 2080. Ampere also has a slightly bigger L1 cache/local memory, and if it is used as local memory it potentially reduces memory bandwidth requirements as well as L2 cache bandwidth requirements. So overall the ratio shouldn't be as bad as it looks at first glance when comparing the number of CUDA cores to the memory channels of the 2080 Ti. As for costs, adding more memory bandwidth is expensive, whereas increased transistor counts allow more compute at the same cost. This is what has changed.
Now GA103 looks like a 600-999$ card instead of a 1300$ card, one that performs somewhat better than the 2080 Ti on average. Whether Nvidia calls it 3080 or 3080 Ti shouldn't be the interesting question.
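
Putting the compute estimate next to the bandwidth estimate gives a rough compute-per-bandwidth figure. This is only a sketch: the 1.5 per-SM multiplier is the assumption from the text, clocks are ignored, and the variable names are my own.

```python
PER_SM_COMPUTE_MULTIPLIER = 1.5   # assumed gain per SM, clocks ignored

ga103_sm, rtx2080ti_sm = 60, 68

# Compute of a full GA103 relative to the 2080 Ti.
compute_ratio = ga103_sm * PER_SM_COMPUTE_MULTIPLIER / rtx2080ti_sm

# Bandwidth ratio: 16 vs 14 Gbps memory, 10 vs 11 memory controllers.
bandwidth_ratio = (16 * 10) / (14 * 11)

# How much more compute each unit of bandwidth has to feed.
compute_per_bandwidth = compute_ratio / bandwidth_ratio

print(round(compute_ratio, 2))          # 1.32
print(round(compute_per_bandwidth, 2))  # 1.27
```

So under these assumptions a full GA103 would have roughly 27% more compute per unit of memory bandwidth than the 2080 Ti, before any help from the larger L1.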
For GA100, the only rumor is 826 mm² on a 7 nm process. That makes it a really expensive die, which makes sense for professional cards. If we take the GA103 SM design and scale it to that die size, it would look more like a 120 SM design. The die area per SM could be higher because of added professional features, so the SM count could be smaller, but I assumed in the 120 SM number that some of the die area already goes to that. The end result would more than double the FLOPS of the high-end Turing design. That would need double the bandwidth of GA103, and professional applications would definitely prefer even more. Taking that as a requirement, it would mean a 640-bit bus for GDDR6, or going for HBM. HBM reduces the power consumption of the memory and memory controllers and reduces the die area of the memory controllers, while adding a lot more bandwidth. Let's consider something like 400-500 GB/s per stack: a 4-stack design would deliver 1.6 to 2 TB/s. Those professional cards will probably yield some "gaming" cards from partially disabled dies, but those would be extremely pricey because of the cost of adding HBM2.
Overall, these rumors suggest that Nvidia has one design that is much cheaper than the 2080 Ti while being somewhat better, and another design that is a lot more costly than the 2080 Ti but insanely better.
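
The GA100 memory options can be compared numerically. This is a rough sketch using the figures from the text (double GA103's bus, 400-500 GB/s per HBM stack) plus my own assumptions of 32-bit GDDR6 controllers and 16 Gbps memory; the function names are illustrative.

```python
BITS_PER_CONTROLLER = 32  # assumption: 32-bit GDDR6 controllers


def gddr6_bus_bits(controllers):
    """Total GDDR6 bus width in bits."""
    return controllers * BITS_PER_CONTROLLER


def gddr6_bandwidth_gb_s(bus_bits, rate_gbps=16):
    """Peak GDDR6 bandwidth in GB/s (16 Gbps assumed by default)."""
    return bus_bits * rate_gbps / 8


def hbm_bandwidth_gb_s(stacks, gb_s_per_stack):
    """Peak HBM bandwidth in GB/s for a given number of stacks."""
    return stacks * gb_s_per_stack


ga103_bus = gddr6_bus_bits(10)
doubled_bus = 2 * ga103_bus
print(ga103_bus, doubled_bus)                                 # 320 640
print(gddr6_bandwidth_gb_s(doubled_bus))                      # 1280.0
print(hbm_bandwidth_gb_s(4, 400), hbm_bandwidth_gb_s(4, 500)) # 1600 2000
```

Even a very wide 640-bit GDDR6 bus would fall short of a 4-stack HBM configuration on these numbers.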
3. Competitive pressure?
The amount of competition can result in different SKUs. Here are a couple of possibilities for what Nvidia could do, depending on how big a threat they think Big Navi is to their business. Nvidia needs months to make these decisions, whereas what each chip physically contains is fixed years in advance.
| | High competition | Price | Low competition | Price |
|---|---|---|---|---|
| Titan | GA100 (3-4 stack) | 2000$+ | GA100 (2/3 stack) | 1600$/2400$ |
| 3080 Ti | GA100 (2-3 stack) | 1200$ | GA103 | 900$ |
| 3080 | GA103 | 700$ | GA103/cut | 700$ |
| 3070 | GA103/cut | 500$ | GA104 | 500$ |
All GA100 dies in this list are cut down; fully functional GA100 dies are sold only in cards costing several thousand dollars more than any card in this list. As you can see, the 3080 Ti can either deliver a smaller improvement and be cheaper, or deliver a far bigger improvement and be more expensive. Nvidia could improve price/performance and absolute performance at the same time while raising the 3080 Ti's price relative to its predecessor in order to win against the competition. They could also position the Titan class either as a replacement for people who bought a 2080 Ti or as a card for semi-professionals. The low-competition scenario has lower prices for the equivalent name, and the high-competition scenario has lower prices for the equivalent chip. The point is that it doesn't really matter whether they lower or raise the 3080 Ti price; what matters more is performance relative to price. And the scenario where they would get praise for lowering prices is the one where they actually ask higher prices for the same chips.
4. Disclaimer
This is just an analysis of the situation based on rumors, and those rumors could very well be wrong. There simply is not enough information to be certain about anything. As for pricing, nobody really knows, but you can always make guesstimates based on profit margins and costs.