Technology

Search engine Nekst to bring order to the Polish Internet

It is estimated that the Polish Internet now contains around a billion documents in the Polish language. These data are being organised and analysed by the creators of the Polish semantic search engine Nekst. The system is designed for precise searching of Polish texts.

The search engine is being developed by teams from the Institute of Computer Science PAS in Warsaw and Wrocław University of Technology. By June, the researchers want to scan a total of 500 million documents (texts, including articles and PDF files), which is half of the Polish Internet. At that point, Nekst will be made available to users. Ultimately, the researchers want to scan all Polish texts on the Internet and keep that information up to date.

"Even Google and Yahoo, with all their power, probably have not collected the entire Polish Internet" - admits the project leader, Prof. Jacek Koronacki, director of the Institute of Computer Science PAS. According to his estimate, these search engines could collect, for example, only one in five Polish documents.

So far, the creators of the Polish search engine have managed to collect 160 million Polish-language documents, or about 16 percent of the Polish Internet. As one of the creators, Dariusz Czerski of the Institute of Computer Science PAS, noted in an interview with PAP, storing the raw texts on the institute's servers is not the problem. Compressed, the texts occupy about 3 terabytes, which would fit on three small portable hard drives. However, the texts must be described and organised before they can be processed for search.

The Polish search engine will use completely different algorithms from those of the largest international search engines. "These search engines do not have mechanisms that would mimic language understanding" - said Prof. Koronacki, who emphasised that Nekst would be the first large semantic search engine for a national language in Europe.

Nekst will not merely search for strings (keywords); instead, it will analyse the collected texts for their most frequently occurring important words and multi-word phrases (e.g. noun phrases). This will allow the search engine to "conclude" what a text is about and classify it as accurately as possible. "We need to build language comprehension mechanisms that differ strongly from those engineered for the English language. In particular, we must take into account inflection and free word order" - explained Prof. Koronacki.
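The idea of looking for frequently occurring words and multi-word phrases can be illustrated with a very small sketch. It is not Nekst's algorithm: the stopword list, the function name top_terms and the use of simple two-word sequences in place of real noun phrases are all assumptions made for illustration, and English text is used only for readability.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "for"}


def top_terms(text, n=3):
    """Return the n most frequent words and two-word sequences in text."""
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS]
    unigrams = Counter(words)
    # Pairs are built after stopword removal and ignore sentence boundaries,
    # so they only approximate real multi-word phrases.
    bigrams = Counter(zip(words, words[1:]))
    return unigrams.most_common(n), bigrams.most_common(n)


text = ("The search engine indexes documents. "
        "The search engine ranks documents for the search.")
frequent_words, frequent_pairs = top_terms(text)
```

A system for Polish would first have to reduce inflected forms to a common base form, which this sketch does not attempt.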

For example, if an Internet user searches for "speaker", Nekst will disambiguate the query and ask whether the user is looking for texts about a person who speaks (rather than, for example, loudspeakers). It will also display links to sites that do not contain the word "speaker" itself but do contain its root, along with other words indicating that the site really is about speaking.
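A toy version of this behaviour might look as follows. Everything in it is an assumption for illustration: the two senses, their cue words, and the crude suffix stripping stand in for the ontology and morphological analysis that a real system would need, and they are not taken from Nekst.

```python
# Hypothetical sense inventory: each sense is described by a few cue words.
SENSES = {
    "person who speaks": {"conference", "lecture", "audience", "talk"},
    "loudspeaker": {"audio", "bass", "amplifier", "stereo"},
}

SUFFIXES = ("ers", "ing", "er", "s")


def crude_root(word):
    """Strip one common English suffix; a stand-in for real lemmatisation."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def guess_sense(text):
    """Return the sense whose cue words overlap most with the text, or None."""
    roots = {crude_root(w.strip('.,;:!?"')) for w in text.split()}
    scores = {
        sense: len({crude_root(cue) for cue in cues} & roots)
        for sense, cues in SENSES.items()
    }
    best = max(scores, key=scores.get)
    # With no supporting context the sketch gives up; a search engine
    # would ask the user which sense is meant.
    return best if scores[best] > 0 else None
```

Here crude_root("speaking") and crude_root("speaker") both give "speak", so a page about speaking can match a query for "speaker" even when that exact word is absent. When guess_sense returns None, the query is ambiguous and the user would be asked to choose.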

Queries to the search engine can be formulated in natural language, making them more like questions posed to a person. Its creators also want Nekst not only to show links to sites, but also to highlight the relevant section of the page. If you ask when Copernicus died, it should highlight sentences containing the answer, for example "Copernicus died in 1543".
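The Copernicus example can be sketched as a simple sentence-scoring routine. This is only an illustration under stated assumptions: the stopword list, the prefix-based word matching and the bonus for a four-digit year in "when" questions are invented here and say nothing about how Nekst selects passages.

```python
import re

# Assumed stopword list and year pattern (illustration only).
STOPWORDS = {"when", "did", "does", "the", "a", "an", "in", "of", "was", "is"}
YEAR = re.compile(r"\b\d{4}\b")


def content_words(text):
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS]


def same_root(a, b):
    """Crude match: the shorter word (3+ letters) is a prefix of the longer."""
    short, long_ = sorted((a, b), key=len)
    return len(short) >= 3 and long_.startswith(short)


def best_sentence(question, page_text):
    """Return the sentence that best matches the question, or None."""
    sentences = re.split(r"(?<=[.!?])\s+", page_text.strip())
    wants_date = question.lower().startswith("when")
    q_words = content_words(question)

    def score(sentence):
        s_words = content_words(sentence)
        hits = sum(any(same_root(q, s) for s in s_words) for q in q_words)
        bonus = 1 if wants_date and YEAR.search(sentence) else 0
        return hits + bonus

    best = max(sentences, key=score)
    return best if score(best) > 0 else None


page = ("Nicolaus Copernicus was born in 1473. Copernicus died in 1543. "
        "His main work was printed that year.")
answer = best_sentence("When did Copernicus die?", page)
```

On this sample page the second sentence scores highest, because it matches both "Copernicus" and "die" and contains a year.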

The mechanism designed to help the engine understand the Polish language is a specially developed ontology, i.e. a set of terms and their associations. It is similar to, but distinct from, the ontology of Wikipedia, and similar to the dictionary created in another project, "Słowosieć" (more about this at: http://naukawpolsce.pap.pl/aktualnosci/news,394234,wielka-siec-relacji-miedzy-polskimi-slowami-udostepniona.html). This will allow the engine to categorise sites and search for connections between them.

The scientists are also working on giving their system the ability to analyse the emotional tone of a statement. The mechanism will be able to recognize whether a given event, company or person is mentioned in a positive or negative way. "There are companies that try to study such connotations, but their capabilities are limited. We will have the entire Polish Internet at our disposal" - noted Dr. Dariusz Czerski. "With such huge amounts of data, we are able to more accurately conclude which texts have emotional overtones, and which are pure information without emotional tone".

The Nekst project participants also want their system to make it easier to detect plagiarism in the future. "Currently used anti-plagiarism tools search the database of master's theses or doctoral dissertations, but do not have access to the entire Polish Internet" - noted Prof. Koronacki, who explained that Nekst would recognize plagiarism even if the word order were changed or words were added or replaced by synonyms.
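One simple way to see why a comparison can survive reordering and synonym substitution is to compare sets of normalised words rather than exact strings. The sketch below uses Jaccard similarity and a tiny hypothetical synonym table; it is an assumption-laden illustration, not the method used in Nekst.

```python
import re

# Hypothetical synonym table mapping words to one canonical form;
# a real system would rely on a large lexical resource.
SYNONYMS = {"big": "large", "huge": "large", "quick": "fast", "rapid": "fast"}


def normalise(text):
    """Lowercase the text and map each word to its canonical synonym."""
    words = re.findall(r"\w+", text.lower())
    return {SYNONYMS.get(w, w) for w in words}


def similarity(a, b):
    """Jaccard similarity of the two normalised word sets (0.0 to 1.0)."""
    set_a, set_b = normalise(a), normalise(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


original = "The quick fox crossed the big river"
reworded = "The large river was crossed by the rapid fox"
unrelated = "Nekst indexes Polish documents"
```

Because word sets ignore order, and synonyms collapse to one form, similarity(original, reworded) stays high (0.75 here) while similarity(original, unrelated) is 0.0. Added words lower the score only gradually.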

The project teams are also working on image analysis. The document search engine will be able to recognize, to a certain extent, what is shown in illustrations, which will provide additional information about the document. Nekst could also help translators: the search engine is designed to return links to Polish sites even if the query is in English or German. "We are open to cooperation with entities that can popularise our search engine, or even want to improve it" - concluded Koronacki.

The project "Adaptive problem solving support system based on content analysis of available electronic sources", worth almost PLN 15 million, is financed from the Operational Programme Innovative Economy.

Ludwika Tomala (PAP)

lt/ agt/ mrt/

tr. RL
