At this point, it seems all but confirmed that the 3090 will have 936 GB/s of memory bandwidth, 5248 CUDA cores, and clocks quite similar to a 2080 Ti.
This is interesting to me, since it means that the raw cores*clocks increase is just around 21%, while the memory bandwidth increase is 52%. NVidia has basically never, at least in recent memory,
- released a card that has much more bandwidth than it actually needs, or
- regressed in effective utilization of bandwidth across generations.

If anything, each new architecture usually gets *more* final performance out of a given amount of bandwidth (thanks to improved caches, compression, or both).
Even if we assume that this last point has hit an engineering/algorithmic limit and bandwidth efficiency is merely on par with Turing, that still means a ~25% IPC increase is required for the bandwidth provisioning to make sense. We have heard a lot about how this memory is expensive, drives up board cost, and increases power consumption, so they're not putting it on there just because.
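Just to make the arithmetic explicit, here's a quick back-of-the-envelope sketch. The 3090 figures are the rumoured ones; the 2080 Ti numbers are the public FE specs (616 GB/s, 4352 cores), with clocks treated as roughly equal:

```cpp
#include <cstdio>

int main()
{
    // Rumoured 3090 vs. 2080 Ti FE, clocks assumed roughly equal.
    const double bw_3090      = 936.0;   // GB/s (rumoured)
    const double bw_2080ti    = 616.0;   // GB/s
    const double cores_3090   = 5248.0;  // rumoured
    const double cores_2080ti = 4352.0;

    const double bw_gain      = bw_3090 / bw_2080ti;        // ~1.52
    const double compute_gain = cores_3090 / cores_2080ti;  // ~1.21 (cores*clocks, clocks ~equal)

    // If bandwidth-per-unit-of-work stays where Turing left it, the extra
    // bandwidth only pays off if per-core throughput rises by roughly:
    const double ipc_needed = bw_gain / compute_gain;       // ~1.26

    printf("BW gain: %.0f%%, compute gain: %.0f%%, implied IPC gain: ~%.0f%%\n",
           (bw_gain - 1) * 100, (compute_gain - 1) * 100, (ipc_needed - 1) * 100);
}
```

That works out to roughly 25–26%, which is where the figure above comes from.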
A >25% IPC increase in a single generation is huge, and not something which is likely to just happen with incremental refinement. So what is going on? I can only see a few possible scenarios:
1. The workloads they envision for the future are more BW-intensive than current ones. This would be a direct reversal of the trend over many years in which workloads got generally more compute-intensive. It could of course be a function of raytracing, but I think that's, if anything, more cache/latency-heavy. Tensor operations are essentially dense MMULs, so even if they pump out a ton of FLOPs I still don't think they would be too severely limited by external BW, since there's a lot of reuse potential (O(N²) memory traffic for O(N³) ops; see the sketch after this list).
2. NV totally messed up the balance on their entire new product line. I just don't think that's likely.
3. There's actually something to the vague "2xFP32" rumour.
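To put a rough number on the reuse argument in point 1: the arithmetic intensity of a dense N×N matrix multiply grows linearly with N, so for any realistically sized MMUL the FLOP-to-DRAM-traffic ratio is enormous. A minimal sketch (the "few hundred FLOPs per byte" comparison in the comment is my rough order-of-magnitude assumption for a Turing-class card, not a spec):

```cpp
#include <cstdio>

int main()
{
    // Back-of-the-envelope arithmetic intensity of a dense N x N matmul in FP16:
    // 2*N^3 FLOPs against an (idealised, perfect-reuse) lower bound of ~3*N^2
    // elements of DRAM traffic (read A, read B, write C).
    const double bytes_per_elem = 2.0;  // FP16
    for (int n = 1024; n <= 8192; n *= 2) {
        const double flops = 2.0 * n * (double)n * n;
        const double bytes = 3.0 * (double)n * n * bytes_per_elem;
        printf("N = %5d  ->  ~%.0f FLOPs per byte of DRAM traffic\n", n, flops / bytes);
    }
    // A Turing-class card delivers on the order of a couple hundred FP16 tensor
    // FLOPs per byte of DRAM bandwidth (rough assumption), so even with
    // imperfect tiling there is a lot of reuse headroom before external
    // bandwidth becomes the limiter.
}
```

Finite on-chip storage makes the real traffic higher than that lower bound, but the linear-in-N scaling is the point.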
Now, the third is what I want to discuss. I initially dismissed this completely (you can find my post about it) for a few reasons, primarily (i) that it's a huge departure from the incarnation of Ampere we already know, and (ii) that it seems silly to make a CUDA core superscalar when you could instead simply put more of them on the card. These are very simple cores after all, right?
The first one is still a valid point I believe, but I’m not so sure about the second one any more.
CUDA cores have gotten a lot more complicated in terms of control logic. In the early days they were little more than glorified SIMD lanes, but now there is some potential for independent scheduling, all the per-warp magic stuff, and much more. So maybe replicating them or making them wider isn't actually as good a tradeoff as it seems anymore.
Turing actually made each CUDA core superscalar already by introducing an independent INT ALU, and that was very successful in some workloads (and seems to be getting incrementally more successful with more compute-focused and modern engines). It's the reason we see Turing pull ahead much more in some scenarios than others, even outside of RT or any other new architecture features.
It might not be a full-functionality FP32 duplication, but just some subset of instructions. I don't know enough about hardware, die area per instruction type, etc. to know whether this actually makes sense, but I feel like NV would have sufficient data about the workload composition of shaders in games to determine that. (Which is also important, since this entire thing presumes there are enough independent FP ops in each SIMT instruction stream, on average, that you can actually make meaningful use of your superscalar CUDA core; see the toy kernel below for what I mean by independent.)
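To illustrate what "independent FP ops" means here, a purely hypothetical toy kernel: it carries two FP32 dependency chains that don't depend on each other, plus some INT32 index math. As I understand it, Turing's separate INT pipe can already overlap the integer work with the float work; a second FP32 pipe would analogously need the two float chains to be independent in order to co-issue them. Whether average game shaders actually contain this much FP-level ILP is exactly the open question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel, for illustration only: two independent FP32 chains (a, b) plus
// INT32 index arithmetic. Each chain is serial within itself, but the two
// chains (and the integer work) have no dependencies on each other, which is
// the kind of instruction-level parallelism a superscalar CUDA core could use.
__global__ void ilp_demo(const float* __restrict__ in, float* __restrict__ out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float a = in[i];
    float b = in[(i + n / 2) % n];  // INT32 work: index/modulo math

    #pragma unroll
    for (int k = 0; k < 8; ++k) {
        a = fmaf(a, 1.0009765625f,  0.5f);  // chain 1
        b = fmaf(b, 0.9990234375f, -0.5f);  // chain 2, independent of chain 1
    }
    out[i] = a + b;
}

int main()
{
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = i * 1e-6f;

    ilp_demo<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[123] = %f\n", out[123]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

A single dependent chain of FMAs, by contrast, gives a second FP32 pipe nothing to do, no matter how many of them you add.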
So, what do you think?
- Am I reading too much into a few rumoured numbers? (probably)
- Is it just a change in workloads? (maybe)
- Did they get the IPC improvement in another way? (ideas?)
- Did NV simply mess up?
- Am I missing something obvious?
PS: I should note that I'm not a hardware engineer in any way, just someone who has been doing low-level GPU programming and optimization for a long time and has a basic CS education on the HW side. So anything that touches on the nitty-gritty of hardware might be way off, and I appreciate corrections.