Gram-negative bacteria have evolved an extraordinary array of secretion systems to export substrates either into target cells or into the surrounding environment. These substrates differ significantly in structure, in function, and in the secretory pathway they use, so it is particularly difficult to design a program that accurately predicts substrate type. Several platforms are currently available to predict the secretory pathway of a given substrate, but their practicality is severely limited because they cover a restricted range of substrates and/or are not amenable to large-scale sequence input. Because of these limitations, a universal predictor has remained elusive until now.
In this work, we present an integrative prediction system, BastionX, to comprehensively and accurately predict each type of secreted substrate in Gram-negative bacteria in high throughput. BastionX improves on existing substrate predictors through three notable upgrades: 1) BastionX incorporates the first predictor for type II secreted substrates, includes more accurate predictors for types I, III, IV, and VI, and achieves state-of-the-art performance for each individual substrate predictor through a stacking strategy that combines multiple machine learning algorithms with a wide array of feature encoding methods; 2) in the output window, BastionX lists the most likely secretory pathway (if any) used by a given protein and includes additional prediction scores for each of the other pathways; 3) BastionX can be run in high throughput using an efficient and extensible distributed framework, which outperforms the existing single-server-based predictors by up to 10 times. With the additional standalone toolkit provided, BastionX can readily conduct sequence analysis locally and can easily be integrated into a user’s own pipeline for downstream analysis. Taken together, BastionX can simultaneously annotate thousands of protein sequences with their possible substrate types, and thereby provide a global map of how secreted substrates are distributed in bacterial genomes.
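The output behaviour described in point 2 can be illustrated with a minimal Python sketch. It is not BastionX code: the function name `summarize_prediction`, the pathway labels, and the 0.5 decision threshold are all assumptions made for illustration, and the per-pathway scores are taken as given (in BastionX they would come from the stacked predictors).

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical labels for the five substrate types named in the text.
PATHWAYS = ("T1SS", "T2SS", "T3SS", "T4SS", "T6SS")


def summarize_prediction(
    scores: Dict[str, float], threshold: float = 0.5
) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Return the most likely pathway (or None) and all scores, ranked.

    `scores` maps a pathway label to a prediction score in [0, 1].
    `threshold` is an assumed cutoff; the source does not state one.
    """
    if not scores:
        raise ValueError("scores must contain at least one pathway")
    # Rank pathways from highest to lowest score.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top_score = ranked[0]
    # Report no pathway when even the best score falls below the cutoff.
    best = top_label if top_score >= threshold else None
    return best, ranked


if __name__ == "__main__":
    example = {"T1SS": 0.10, "T2SS": 0.20, "T3SS": 0.90, "T4SS": 0.30, "T6SS": 0.05}
    best, ranked = summarize_prediction(example)
    print("most likely pathway:", best)
    for label, score in ranked:
        print(f"  {label}: {score:.2f}")
```

Returning the full ranked list alongside the top call keeps the scores for the other pathways visible, which mirrors the "if any" wording: a protein whose best score is below the cutoff is reported without a pathway but still with all of its scores.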