Mamba Paper: No Further a Mystery
Discretization has deep connections to continuous-time systems, which can endow the discretized models with additional properties such as resolution invariance and automatically ensuring that the model is properly normalized. Operating on byte-sized tokens, Transformers scale poorly, as every token must "attend" to every other token, leading to O(n²) scaling legi
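
To make the discretization remark concrete, here is a minimal sketch of the zero-order-hold (ZOH) rule applied to a scalar continuous-time system x'(t) = a*x(t) + b*u(t). The function names, the scalar setting, and the parameter values are illustrative assumptions and are not taken from the Mamba code base. With a constant input, the state reached at a fixed time should not depend on the step size, which is one simple way to see resolution invariance.

```python
import math


def zoh_discretize(a, b, delta):
    """Zero-order-hold discretization of x'(t) = a*x(t) + b*u(t).

    Hypothetical helper for illustration; assumes a != 0.
    """
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar


def simulate(a, b, delta, total_time, u=1.0):
    """Run the discrete recurrence x <- a_bar*x + b_bar*u from x = 0."""
    a_bar, b_bar = zoh_discretize(a, b, delta)
    steps = round(total_time / delta)
    x = 0.0
    for _ in range(steps):
        x = a_bar * x + b_bar * u
    return x


# Same system, same end time, two different step sizes (assumed values).
coarse = simulate(a=-0.5, b=1.0, delta=0.1, total_time=1.0)
fine = simulate(a=-0.5, b=1.0, delta=0.01, total_time=1.0)

# Closed-form state at t = 1 for a constant input u = 1 and x(0) = 0.
exact = (math.exp(-0.5) - 1.0) / -0.5
```

In this sketch `coarse`, `fine`, and `exact` agree up to floating-point error, because ZOH is exact when the input is held constant over each step. With a time-varying input, the two step sizes would agree only approximately.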