Aerial scene classification, which aims to automatically assign a semantic category to an aerial image, remains a challenging task in computer vision. As the spatial resolution of aerial images improves, the texture details of ground objects become increasingly abundant. Although many classification algorithms have been proposed in recent years, the accuracy and completeness of the classification results remain open problems. We propose an algorithm based on a two-stage voting fusion strategy. First, superpixels obtained by the simple linear iterative clustering (SLIC) algorithm are used as the input to a random forest classifier. Then, a region growing strategy is applied to cluster similar local regions while simultaneously learning the fragmentary mosaic pieces. Finally, a Bayesian framework is used to fuse the classification results into the final result. Experiments on two benchmarks show that the proposed method (TSS) outperforms other state-of-the-art methods in both the integrity of large targets and the accuracy of small objects.
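
The final fusion step can be illustrated with a minimal sketch. The abstract does not give the fusion formula, so the code below is an assumption rather than the TSS method itself: it supposes that each of the two stages yields a per-class probability vector for the same image location and combines them with a naive-Bayes product rule, P(c | a, b) ∝ P(c | a) P(c | b) / P(c), which holds when the two sources are conditionally independent given the class. The function names `fuse_posteriors` and `fuse_label`, the argument names, and the example probabilities are hypothetical.

```python
def fuse_posteriors(p_stage1, p_stage2, prior=None):
    """Fuse two per-class posterior vectors with a naive-Bayes product rule.

    Assumption (not from the paper): the two stages are conditionally
    independent given the class, so fused[c] is proportional to
    p_stage1[c] * p_stage2[c] / prior[c].
    """
    n = len(p_stage1)
    if len(p_stage2) != n:
        raise ValueError("both stages must score the same set of classes")
    if prior is None:
        prior = [1.0 / n] * n  # uniform class prior (assumed)
    if len(prior) != n or any(p <= 0.0 for p in prior):
        raise ValueError("prior must be positive and cover every class")

    unnormalized = [a * b / c for a, b, c in zip(p_stage1, p_stage2, prior)]
    total = sum(unnormalized)
    if total == 0.0:
        # The stages assign zero joint support to every class; fall back to the prior.
        norm = sum(prior)
        return [p / norm for p in prior]
    return [u / total for u in unnormalized]


def fuse_label(p_stage1, p_stage2, prior=None):
    """Return the index of the class with the highest fused posterior."""
    fused = fuse_posteriors(p_stage1, p_stage2, prior)
    return max(range(len(fused)), key=fused.__getitem__)


if __name__ == "__main__":
    # Hypothetical probabilities for three classes at one location.
    stage1 = [0.6, 0.3, 0.1]  # e.g. superpixel-level classifier output
    stage2 = [0.2, 0.7, 0.1]  # e.g. region-level output
    print(fuse_posteriors(stage1, stage2))
    print(fuse_label(stage1, stage2))
```

In this example the first stage favors class 0 and the second favors class 1; under a uniform prior the product rule selects class 1, because the second stage is more confident in its choice than the first stage is in its own.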