Polish scientists develop ultra-fast protein analysis tool for large-scale biology research

Polish scientists have developed a new bioinformatics tool capable of analysing millions of protein sequences significantly faster than existing methods while maintaining high accuracy, a breakthrough researchers say could accelerate studies of evolution, protein function and drug discovery.

The tool, called FAMSA2, was created by Sebastian Deorowicz, Adam Gudyś and Andrzej Zieleziński in collaboration with a researcher from the Centre for Genomic Regulation.

The scientists, affiliated with the Silesian University of Technology and Adam Mickiewicz University, said the software addresses a growing problem in modern biology: advances in DNA and RNA sequencing are generating data faster than researchers can analyse it.

According to the researchers, global gene and protein databases already contain hundreds of millions of sequences and are expected to expand to billions in the near future.

“One of the fundamental bioinformatics methods is multiple sequence alignment. It provides a basis for various phylogenetic and evolutionary analyses, as well as a starting point for predicting the spatial structure and function of proteins”, Zieleziński said in an interview with PAP.

Multiple sequence alignment involves arranging protein sequences so that similar fragments line up in the same positions, allowing researchers to identify conserved regions linked to biological functions.
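
To make the idea concrete, here is a minimal Python sketch that scans an already aligned set of toy sequences and reports the columns in which every sequence carries the same residue. The sequences, the function name conserved_columns and the threshold parameter are illustrative assumptions; they are not taken from FAMSA2.

```python
from collections import Counter


def conserved_columns(aligned, threshold=1.0):
    """Return indices of columns whose most common residue (gaps ignored)
    occurs in at least `threshold` fraction of all rows.

    Hypothetical helper for illustration only; not part of FAMSA2.
    """
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("all aligned rows must have the same length")

    hits = []
    for col in range(length):
        # Count residues in this column, skipping gap characters.
        counts = Counter(row[col] for row in aligned if row[col] != "-")
        if not counts:
            continue  # column made only of gaps
        top_count = counts.most_common(1)[0][1]
        if top_count / len(aligned) >= threshold:
            hits.append(col)
    return hits


# Toy alignment: '-' marks a gap inserted so that similar fragments line up.
toy = [
    "MKT-LLVA",
    "MKTALLIA",
    "MRT-LLVA",
]
print(conserved_columns(toy))  # [0, 2, 4, 5, 7]
```

Lowering the threshold (for example to 0.6) also reports columns where most, but not all, sequences agree.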

“This can be compared to a situation where we want to line up different versions of the same text to see which fragments are identical and which are written slightly differently. The best alignment is the one that most faithfully reflects all the changes. This has been a major problem in bioinformatics for years, especially with very large datasets. Today, with the number of available sequences growing rapidly, the majority of existing programs are no longer able to handle such a scale. This is where the idea for our study came from”, Zieleziński added.

The first version of the software, FAMSA, was released in 2016 and gained recognition as one of the fastest tools for multiple sequence alignment. The researchers said the new version was designed to cope with the much larger datasets now common in biological research.

Instead of directly comparing every sequence with every other sequence, FAMSA2 groups sequences around representative examples, allowing the analysis to be divided into smaller, manageable parts.

Gudyś said earlier approaches often selected representative sequences randomly, which could reduce stability and accuracy.

“We have mitigated this problem by randomly selecting a larger subset of sequences and then performing auxiliary grouping within it, based on which we select representatives. This improves the quality of the results”, he said.
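
The general pattern Gudyś describes (sample a subset, group within it, pick representatives, then attach every sequence to its nearest representative) can be sketched as follows. This is a simplified illustration under stated assumptions, not the FAMSA2 algorithm: the k-mer Jaccard distance and the greedy farthest-point selection are stand-ins for whatever similarity measure and auxiliary grouping the tool actually uses, and all function names are hypothetical.

```python
import random


def kmers(seq, k=3):
    """Set of overlapping k-letter fragments of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def distance(a, b, k=3):
    """Jaccard distance on k-mer sets (an assumed stand-in measure)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def pick_representatives(seqs, sample_size, n_reps, seed=0):
    """Draw a random subset, then choose spread-out representatives in it."""
    rng = random.Random(seed)
    sample = rng.sample(seqs, min(sample_size, len(seqs)))
    reps = [sample[0]]
    # Greedy farthest-point selection: each new representative is the
    # sampled sequence farthest from all representatives chosen so far.
    while len(reps) < min(n_reps, len(sample)):
        farthest = max(sample, key=lambda s: min(distance(s, r) for r in reps))
        reps.append(farthest)
    return reps


def assign_to_representatives(seqs, reps):
    """Group every sequence under its nearest representative."""
    groups = {r: [] for r in reps}
    for s in seqs:
        nearest = min(reps, key=lambda r: distance(s, r))
        groups[nearest].append(s)
    return groups


# Two made-up protein "families" of three sequences each.
sequences = [
    "MKTAYIAKQR", "MKTAYIAKQK", "MKTAYIARQR",
    "GSHWLLPPEE", "GSHWLLPPED", "GSHWLMPPEE",
]
reps = pick_representatives(sequences, sample_size=6, n_reps=2)
groups = assign_to_representatives(sequences, reps)
for rep, members in groups.items():
    print(rep, "->", members)
```

With n sequences and r representatives, the assignment step needs about n × r comparisons instead of the roughly n² / 2 required to compare every sequence with every other one.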

The algorithm then constructs a so-called guide tree that determines the order in which sequences are aligned, beginning with the most similar groups and gradually combining them into a complete alignment.
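
A guide tree of this kind can be illustrated with a naive agglomerative procedure that repeatedly merges the two closest groups. The sketch below uses single-linkage distances on a hand-written toy distance table; the linkage rule, the data and the function names are assumptions for illustration and may differ from what FAMSA2 does internally.

```python
from itertools import combinations


def single_linkage(members_a, members_b, dist):
    """Smallest pairwise distance between two groups of leaves."""
    return min(dist[frozenset((a, b))] for a in members_a for b in members_b)


def guide_tree(names, dist):
    """Build a nested-tuple tree and record the order of merges.

    `dist` maps frozenset({x, y}) to the distance between leaves x and y.
    Naive O(n^3) sketch; real tools use far more efficient algorithms.
    """
    # Each cluster is (subtree, list of leaf names under it).
    clusters = [(name, [name]) for name in names]
    merges = []
    while len(clusters) > 1:
        left, right = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0][1], pair[1][1], dist),
        )
        clusters.remove(left)
        clusters.remove(right)
        merged = ((left[0], right[0]), left[1] + right[1])
        merges.append(merged[0])
        clusters.append(merged)
    return clusters[0][0], merges


# Toy distances: A and B are very similar, C and D fairly similar.
toy_dist = {
    frozenset(("A", "B")): 0.1,
    frozenset(("C", "D")): 0.2,
    frozenset(("A", "C")): 0.6,
    frozenset(("A", "D")): 0.7,
    frozenset(("B", "C")): 0.5,
    frozenset(("B", "D")): 0.8,
}
tree, order = guide_tree(["A", "B", "C", "D"], toy_dist)
print(tree)   # (('A', 'B'), ('C', 'D'))
print(order)  # [('A', 'B'), ('C', 'D'), (('A', 'B'), ('C', 'D'))]
```

The merge order is the order in which a progressive aligner would combine sequences and partial alignments: the closest pair first, the most distant groups last.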

The researchers said the tool’s performance also depends on deep optimisation for modern computer hardware.

“We have devoted a significant amount of time to thoroughly understanding how to fully utilise the capabilities of each processor core and RAM for the fastest possible sequence alignment. This allows us to achieve computational parallelism on three levels. We not only treat the processor as a multi-core system, but also utilise vector operations within a single core, and even more deeply, we perform calculations at the bit level”, Deorowicz said.
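
The article does not specify which bit-level technique is meant. One well-known example from sequence comparison is the bit-parallel computation of the longest common subsequence (LCS) length, in which a single integer operation updates many positions of the comparison at once. The Python sketch below illustrates that general technique only; it is an assumption that something similar applies here, and production code would use fixed-width machine words and vector instructions rather than Python integers.

```python
def lcs_length_bitparallel(a, b):
    """Length of the longest common subsequence of a and b.

    Bit-parallel formulation: bit i of `v` tracks position i of `a`,
    so each character of `b` is processed with a handful of integer
    operations instead of a loop over all of `a`.
    """
    m = len(a)
    if m == 0:
        return 0
    mask = (1 << m) - 1

    # For each letter, a bitmask of the positions where it occurs in `a`.
    match = {}
    for i, ch in enumerate(a):
        match[ch] = match.get(ch, 0) | (1 << i)

    v = mask  # all ones: no matches recorded yet
    for ch in b:
        u = v & match.get(ch, 0)
        v = ((v + u) | (v - u)) & mask

    # Each zero bit in `v` corresponds to one matched position.
    return m - bin(v).count("1")


print(lcs_length_bitparallel("AGGTAB", "GXTXAYB"))  # 4
```

A conventional dynamic-programming version performs len(a) × len(b) cell updates; here the inner loop over `a` is replaced by whole-word arithmetic, which is what makes the approach attractive on 64-bit and vector hardware.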

According to the scientists, this approach speeds up computations by a factor of thousands compared with conventional methods, allowing analyses of hundreds of thousands or even millions of sequences to be completed within hours on advanced workstations or high-end personal computers.

The researchers also developed a compression system for storing massive alignment files that can otherwise occupy hundreds of gigabytes of disk space.

“The history of sequence alignment software is very long, and it was usually a compromise: either speed or accuracy. If a tool was fast, it made too many simplifications. If it was accurate, it could run for days. We managed to speed up the analysis without sacrificing quality, as demonstrated by tests in structural, phylogenetic, and functional applications”, Zieleziński said.

The team published FAMSA2 as open-access software, making it freely available to researchers worldwide. According to the authors, the tool has already been downloaded more than 130,000 times.

The research describing FAMSA2 was published in Nature Biotechnology (https://doi.org/10.1038/s41587-026-03095-3).

Katarzyna Czechowicz (PAP)

