Polish tool brings order to viral chaos

Polish researchers have developed Vclust, a computer program that can compare millions of virus sequences in a matter of hours and organise them according to their degree of similarity. Analysing such huge genetic data sets with traditional methods would take years.

'With Vclust, the analysis of a set of 15 million sequences takes about four hours, whereas the most accurate previously used tools would need about four years. This is an important step in the development of virology and metagenomics, because it will facilitate the identification and classification of new viruses, which in recent years have been discovered en masse thanks to modern sequencing technologies', the authors of the solution emphasise in an interview with PAP.

In the journal Nature Methods, a team of scientists from the Faculty of Biology of Adam Mickiewicz University in Poznań and the Faculty of Automatic Control, Electronics and Computer Science of the Silesian University of Technology, in collaboration with an expert from Friedrich Schiller University in Jena, described a tool that makes it possible to distinguish known viruses from new ones and to analyse their diversity in various environments, which is crucial for monitoring new pathogens and for research on the microbiome.

The researchers explain that modern microbiology is struggling with a flood of data. Up to a million new viruses are discovered every year, producing data sets so large that analysing and classifying them is a growing challenge for research teams.

'This explosion of data is due to metagenomics, a method that makes it possible to read all the DNA present in a given environmental sample, e.g. from the ocean, soil or human intestine. Until now, there was no tool that could effectively analyse and group such large numbers of sequences. There were very accurate methods, but they could not handle data on this scale. Therefore, we decided to create a program that would be equally precise, but much more efficient and capable of handling millions of genomes at once', explains the co-author of the publication, Andrzej Zieleziński, PhD, from Adam Mickiewicz University in Poznań.

Why are viruses so difficult?

He adds that in biology, the classification of organisms - i.e. taxonomy - is usually based on comparing specific genes present in all representatives of a given group. Thanks to this, it is possible to create phylogenetic trees of organisms, group them, distinguish families or species and determine their degree of kinship. With viruses, it is completely different.

'Unlike bacteria, viruses do not share a single common gene that could be compared. They differ too much from each other. As a result, the classic phylogenetic methods do not work. The approach based on their morphology, e.g. the shape of capsids, turned out to be too slow and not very scalable. The only thing we could do was to compare the sequences of entire genomes, letter by letter', Zieleziński says.

This is hard to do when there are millions of such genomes. The project manager, Professor Sebastian Deorowicz from the Silesian University of Technology, explains that there are tools that can group these huge data sets, but they do so at a huge computing cost that is difficult to bear in everyday research. 'It is not that no one has done it before, but it required such large resources (e.g. supercomputers) that it would be difficult to repeat this process regularly, especially when we are dealing with growing data sets', he says.

'That is why we focused on optimisation, i.e. designing the most effective algorithms and the most efficient code, which enabled us to reduce computing time by several orders of magnitude. All this to move the computation from supercomputers to regular workstations', he adds.

Three steps to organising viruses

Vclust works in three stages. The first stage is preliminary filtering, in which the program quickly identifies sequence pairs showing at least minimal similarity. Thanks to this, instead of comparing each sequence with every other one - which would mean a trillion possible combinations - the algorithm limits the analysis to a much smaller number: hundreds of millions of the most promising pairs.
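
The idea of such a pre-filter can be illustrated with a minimal Python sketch. It is not the code used in Vclust: the function names, the choice of short substrings (k-mers) as the similarity signal, and the parameters `k` and `min_shared` are all assumptions made for illustration. The sketch indexes every k-mer and keeps only those sequence pairs that share at least a few of them.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(seqs, k=4, min_shared=2):
    """Return pairs of sequence names that share at least min_shared k-mers.

    seqs maps a sequence name to its nucleotide string.
    k and min_shared are illustrative values, not Vclust settings.
    """
    # Inverted index: k-mer -> names of the sequences that contain it.
    index = defaultdict(set)
    for name, seq in seqs.items():
        for km in kmers(seq, k):
            index[km].add(name)

    # Count shared k-mers only for sequences that co-occur in the index.
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1

    return {pair for pair, count in shared.items() if count >= min_shared}
```

Pairs that have nothing in common never meet in the index, so they are never compared at all. A real tool would additionally need to handle k-mers that occur in a very large number of genomes, which this sketch ignores.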

The second stage is a detailed comparison of the selected sequences. The researchers use their own LZ-ANI algorithm, based on techniques inspired by the data compression algorithms used in the ZIP and RAR formats. The principle of its operation is simple: the more similar two sequences are, the better they 'compress' together, i.e. the less space they take up after processing. This effect is used as a measure of similarity.
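
The compression principle described above can be tried out with Python's standard `zlib` module. The sketch below is not the researchers' algorithm. It uses a generic, well-known measure, the normalized compression distance, as a stand-in, and the function names are hypothetical.

```python
import zlib


def compressed_size(text):
    """Size in bytes of text after zlib compression at the highest level."""
    return len(zlib.compress(text.encode("ascii"), 9))


def ncd(x, y):
    """Normalized compression distance between two sequences.

    Values near 0 mean the sequences compress well together (similar),
    values near 1 mean that joining them saves almost nothing (dissimilar).
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def compression_similarity(x, y):
    """Rough similarity score derived from the compression distance."""
    return 1.0 - ncd(x, y)
```

With a real compressor the distance is only approximate and can slightly exceed 1. In addition, zlib only looks back about 32 kB, so this sketch is meaningful only for short sequences, whereas a dedicated genome comparison tool has no such restriction.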

The last stage is clustering, i.e. grouping sequences based on their similarity. Viruses whose genomes are closer to each other end up in the same group. Thanks to this, it is easier to determine which of them are related to each other and form 'families', and which are completely separate. This makes it possible to better understand the variety of viruses and their evolutionary connections.
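
As a rough illustration of the grouping step, the sketch below joins sequences whose pairwise similarity reaches a threshold and reports the resulting groups. Single-linkage grouping with a union-find structure is chosen here purely for simplicity. The article does not say which clustering method Vclust applies, and the function name and the threshold of 0.95 are assumptions.

```python
def cluster(names, similarities, threshold=0.95):
    """Group sequence names connected by similarities >= threshold.

    similarities maps a (name_a, name_b) pair to a score between 0 and 1.
    The threshold is an illustrative value, not a Vclust default.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the group representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of every sufficiently similar pair.
    for (a, b), score in similarities.items():
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Collect the members of each group.
    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())
```

With single linkage, two genomes can land in the same group through a chain of intermediate relatives even if they are not directly similar to each other. Stricter methods avoid this at the cost of more bookkeeping.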

'Thanks to this, the program uses the power of the computer to the maximum. Everyone who tested Vclust was amazed at its speed', Zieleziński emphasises.

The creators of Vclust made sure that the tool is completely free and publicly available. It can be downloaded from the Internet and run on the user's own computer. For those who do not have advanced equipment, the researchers prepared a browser version: Vclust.org.

The tool works in a very simple way: users can paste their own sequences, start the analysis and after a short time receive the result - without the need to log in or register. Currently, the browser version can analyse up to a thousand sequences at once, which turns out to be quite sufficient in many cases.

Deorowicz and Zieleziński say that the project will be developed further. 'We plan to add more functions, and in the future we would also like to extend Vclust with the ability to analyse bacterial genomes', they announce.

PAP - Science in Poland, Katarzyna Czechowicz (PAP)
