Is it possible to have a NEON/SIMD-optimized RSA implementation?

I recently came across https://onlinelibrary.wiley.com/doi/pdf/10.1002/sec.1706 (Efficient arithmetic on ARM-NEON and its application for high-speed RSA implementation). A purely software-based RSA implementation, albeit portable, is usually quite slow, so perhaps we could speed it up using NEON/SIMD CE on SoCs that support it, which should be quite common these days?