Accelerating the performance of sequence alignment using machine learning with RAPIDS enabled GPU
Keywords:
deep learning, graphical processing unit, machine learning, multiple sequence alignment, random forest
Abstract
In bioinformatics, sequence alignment is used to identify similarities among DNA sequences and thereby infer common evolutionary and structural relationships. As the number of sequences in sequence databases grows, traditional methods take too long to align two or more sequences because they are inherently sequential. Even when sequential algorithms are modified to run in parallel, the reduction in time is not proportional, because the number of sequences in databases is increasing at an exponential rate. Machine learning can overcome this limitation if sequences are treated as big data from which knowledge can be extracted at scale. When machine learning techniques are combined with GPUs, processing time is reduced owing to the parallel architecture of GPUs. A GPU-based approach is therefore proposed to accelerate multiple sequence alignment while significantly improving accuracy. An efficient model is proposed and implemented to predict the classes of biological sequences. Several challenges are addressed by pre-processing the sequence data, a step that machine learning techniques require. The model represents DNA sequences with an embedding method and combines the capabilities of GPUs with a random forest algorithm. Results show that the model achieves an accuracy of 99.5% while reducing the time required to align sequences, and that it takes less computation time than CPU-based methods. Fast and accurate alignment is vital in evolutionary studies and can help in designing new drugs or modifying existing drugs for new diseases.
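The pre-processing step mentioned in the abstract, in which DNA sequences are turned into numeric vectors before a classifier can use them, can be illustrated with a minimal sketch. The abstract does not specify the embedding, so the k-mer count encoding below is an assumption chosen for illustration, and the function names (`build_kmer_index`, `kmer_counts`, `encode`) are hypothetical rather than taken from the paper.

```python
from itertools import product

ALPHABET = "ACGT"


def build_kmer_index(k):
    """Map every possible k-mer over ACGT to a column index."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmer_counts(sequence, k, index):
    """Return a fixed-length vector of k-mer counts for one DNA sequence.

    Windows containing characters outside ACGT (for example N) are skipped.
    """
    vector = [0] * len(index)
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        column = index.get(seq[start:start + k])
        if column is not None:
            vector[column] += 1
    return vector


def encode(sequences, k=3):
    """Encode a list of DNA sequences as equal-length count vectors."""
    index = build_kmer_index(k)
    return [kmer_counts(s, k, index) for s in sequences]


# Illustrative input only; these are not sequences from the study.
features = encode(["ACGTACGT", "ttgacn"], k=2)
```

Each sequence becomes a vector of length 4^k regardless of its original length, which is what a tree-based classifier needs. Such vectors could then be passed, after conversion to a numeric array, to a random forest implementation such as scikit-learn's `RandomForestClassifier` on CPU or RAPIDS cuML's `cuml.ensemble.RandomForestClassifier` on GPU; the classifier is left out here to keep the sketch free of external dependencies.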
License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.