The SGD batch size is used when the training set is too large to process at once: instead of using all instances, you randomly select a subset (for example, 128 instances) for each update. The whole purpose is to speed up the learning process, and in practice it works well.
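A minimal sketch of what that looks like, using a hypothetical linear-regression toy problem with NumPy (the data, sizes, and learning rate here are illustrative, not from the answer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1000 instances, 5 features, linear target (hypothetical example).
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(5)
lr = 0.1
batch_size = 128  # only 128 of the 1000 instances are used per update

for step in range(50):
    # Randomly select a mini-batch instead of the full training set.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size  # MSE gradient on the batch
    w -= lr * grad
```

Each gradient step touches only `batch_size` rows, so the per-step cost is independent of the total training-set size, which is where the speed-up comes from.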
In my opinion, all these parameters must be set empirically according to your data. For instance, on the CIFAR-10 dataset a batch size of 128 works fine; on the other hand, on ImageNet (the 32x32 version) a batch size of 64 achieves better results. In addition, parameters involving the learning rate (i.e., the base value and its decay) are even more sensitive to the dataset used. For instance, an initial learning rate of 0.1 on ImageNet achieves satisfying accuracy, while on CIFAR-10 the network does not converge with this learning rate.
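Setting these empirically can be as simple as a small grid search over learning rates and batch sizes, scored on your data. A sketch under the same toy linear-regression assumptions as above (the grids and scoring metric are illustrative choices, not a recommendation from the answer):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset with a linear target.
X = rng.normal(size=(500, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=500)

def train(lr, batch_size, steps=40):
    """Run mini-batch SGD and return the final training MSE as a crude score."""
    w = np.zeros(3)
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        w -= lr * 2 * Xb.T @ (Xb @ w - yb) / batch_size
    return np.mean((X @ w - y) ** 2)

# Try each (learning rate, batch size) combination and keep the best.
results = {
    (lr, bs): train(lr, bs)
    for lr, bs in itertools.product([0.001, 0.01, 0.1], [32, 64, 128])
}
best = min(results, key=results.get)
```

On a real problem you would score on a held-out validation set rather than the training loss, but the loop structure is the same.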