Accelerating the performance of sequence alignment using machine learning with RAPIDS enabled GPU

Authors

  • Karamjeet Kaur School of Engineering and Technology, Sharda University, Greater Noida, U.P., 201306, India
  • Anil Kumar Sagar School of Engineering and Technology, Sharda University, Greater Noida, U.P., 201306, India
  • Sudeshna Chakraborty Galgotias University, Greater Noida, U.P., 203201, India

Keywords:

deep learning, graphical processing unit, machine learning, multiple sequence alignment, random forest

Abstract

In bioinformatics, sequence alignment is used to identify similarities among DNA sequences and thereby infer common evolutionary and structural relationships. As the number of sequences in sequence databases grows, traditional methods take too long to align two or more sequences because they are inherently sequential. Even when sequential algorithms are modified to run in parallel, the time saved does not keep pace, because the number of sequences in databases is increasing at an exponential rate. Machine learning can overcome this limitation if sequences are treated as big data from which knowledge can be extracted at scale. Combining machine learning techniques with GPUs reduces processing time further, owing to the parallel architecture of GPUs. A GPU-based approach is therefore proposed to accelerate multiple sequence alignment while also improving accuracy significantly. An efficient model is proposed and implemented to predict the classes of biological sequences. Pre-processing is applied to the sequence data, which is necessary for machine learning techniques to work and overcomes many challenges. The model uses an embedding method to represent DNA sequences and combines the capabilities of GPUs with a random forest algorithm. Results show that the model achieves an accuracy of 99.5% while reducing the time required to align sequences. Compared with CPU-based methods, this GPU-based model takes less computation time. Fast and accurate alignment is vital in evolutionary studies and can help in designing new drugs or modifying existing drugs for new diseases.
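
The pre-processing step mentioned in the abstract has to turn variable-length DNA strings into fixed-length numeric vectors before a classifier such as a random forest can use them. The abstract does not say which embedding is used, so the minimal sketch below substitutes a plain k-mer count encoding purely as an illustration. The function name `kmer_vector`, the choice `k=2`, and the rule of skipping ambiguous bases are all assumptions, not the authors' method.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed alphabet; ambiguous bases such as N are ignored


def kmer_vector(sequence, k=2):
    """Count every possible k-mer over ACGT, in lexicographic order.

    Hypothetical helper for illustration only: it maps a DNA string of any
    length to a vector of length 4**k.
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Windows containing a character outside ACGT are skipped.
        if window in index:
            counts[index[window]] += 1
    return counts


# Each sequence becomes one fixed-length feature row.
features = [kmer_vector(s) for s in ["ACGTAC", "GGGTTN"]]
```

Rows built this way could then be passed to a random forest implementation, on CPU or on GPU. Which library and which embedding the paper actually uses is not stated in the abstract beyond the mention of RAPIDS in the title.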

Published

2022-12-26

How to Cite

Kaur, K., Sagar, A. K., & Chakraborty, S. (2022). Accelerating the performance of sequence alignment using machine learning with RAPIDS enabled GPU. Journal of Current Science and Technology, 12(3), 462–481. Retrieved from https://ph04.tci-thaijo.org/index.php/JCST/article/view/276

Section

Research Article