Intuitively, applying softmax twice might seem to produce a sharper distribution:

softmax(softmax(X))

But if a logarithm is inserted between the two applications, the second softmax changes nothing:
$$P = [p_1, p_2, \dots, p_n] = \mathrm{softmax}(X)$$

$$P_{\log} = \mathrm{logsoftmax}(X) = \log P$$

$$\mathrm{softmax}(P_{\log})_i = \frac{e^{\log p_i}}{\sum_{k=1}^{n} e^{\log p_k}} = \frac{p_i}{\sum_{k=1}^{n} p_k} = p_i$$

where the last step uses the fact that softmax outputs sum to 1, i.e. $\sum_{k=1}^{n} p_k = 1$. Hence:

$$\mathrm{softmax}(\mathrm{logsoftmax}(X)) = \mathrm{softmax}(P_{\log}) = P = \mathrm{softmax}(X)$$
Therefore: softmax(log(softmax(X))) == softmax(X)
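A quick numerical check of the identity, sketched with NumPy (the `softmax` and `log_softmax` helpers below are defined inline for illustration, not taken from any particular library):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def log_softmax(x):
    return np.log(softmax(x))

x = np.array([1.0, 2.0, 3.0, 4.0])

# softmax(log(softmax(x))) reproduces softmax(x) exactly.
print(np.allclose(softmax(log_softmax(x)), softmax(x)))  # True

# Plain double softmax does NOT: without the log, the second
# softmax re-normalizes already-positive values and flattens them.
print(np.allclose(softmax(softmax(x)), softmax(x)))  # False
```

The contrast in the last two lines shows why the logarithm matters: it exactly inverts the exponentiation inside the second softmax, leaving only a re-normalization by $\sum_k p_k = 1$.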