The FlipDiagonal() method assumes that the first two axes correspond to the spatial dimensions, i.e., an (X,Y,C) layout, as expressed in line 513:
This creates a problem when training, since the usual convention there is (N,C,X,Y): with that layout the first two axes are the batch and channel axes, not the spatial ones. I am working on a fix that automatically detects which axes correspond to the spatial dimensions. I am aware the same fix must also be applied to update_properties() for supervised learning purposes.
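To make the mismatch concrete, here is a minimal NumPy sketch. It is not the library's implementation: `flip_diagonal` and its `spatial_axes` parameter are hypothetical names used only for illustration, and the sketch takes the spatial axes explicitly rather than attempting the automatic detection mentioned above.

```python
import numpy as np


def flip_diagonal(image, spatial_axes=(0, 1)):
    """Transpose the two spatial axes of `image`.

    Hypothetical helper, not the library's API. The default (0, 1)
    mirrors the (X, Y, C) assumption described in this issue.
    """
    first, second = spatial_axes
    return np.swapaxes(image, first, second)


# Channel-last image (X, Y, C): swapping axes 0 and 1 is a diagonal flip.
img_xyc = np.zeros((32, 48, 3))
flipped_xyc = flip_diagonal(img_xyc)

# Training batch (N, C, X, Y): the same default swaps N and C instead.
batch = np.zeros((8, 3, 32, 48))
wrong = flip_diagonal(batch)
# Passing the last two axes explicitly flips the spatial dimensions.
right = flip_diagonal(batch, spatial_axes=(-2, -1))

print(flipped_xyc.shape, wrong.shape, right.shape)
```

With the default axes, the (N,C,X,Y) batch comes back with its batch and channel axes exchanged and the spatial axes untouched, which is the failure described above. Passing the last two axes gives the intended flip.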