* added support of b2b bmm
* fixed arguments and params structures
* added batch_count argument
* removed SplitKSerial and added new test case with b2b bmm
* fixed support of Kbatched and added new test case with batch stride
* added batch support for bias and scale
* make test
* small changes
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>