About the project

The project Development and Application of a Model for Normalising the Orthography of Old Texts Printed in Latin Script (MONOGRAF) is being carried out at the Institute for the Croatian Language from 1 January 2024 to 31 December 2027. It is funded by the European Union – NextGenerationEU under the National Recovery and Resilience Plan 2021–2026. The project is led by Dr Vuk-Tadija Barbarić, with Dr Marijana Horvat as a project associate. It was preceded by the Institute’s core-activity project Development of a Model for Normalising the Orthography of Old Texts Printed in Latin Script, implemented from 1 January 2021 to 31 December 2023.

More about the project

A major obstacle to the creation of Croatian historical language resources is historical trigraphia (the use of three scripts over time), and within it especially the complexity of Latin-script orthographies. This project will develop a model to overcome that problem, opening the way to faster creation of historical-language corpora and, consequently, speeding up the otherwise painstaking work of textological production. The corpus compiled for analysis will include enough digitised old Latin-script books to reach one million tokens, among them some grammars. The books will be digitised in their original orthography using OCR (optical character recognition), following a procedure that will significantly reduce the likelihood of errors in the digitised material. The project will produce high-quality, faithfully digitised books (with quality controlled by experienced philologists, not only by technically trained staff).

There is substantial scholarship on Croatian historical Latin-script orthography. This project will build on it and complement it with precise graphematic descriptions better suited to practical computational use. The project’s results, above all the planned corpus, are expected to integrate well with existing valuable resources (both “analogue” and digital) such as the Academy Dictionary (AR) and the Dictionary of the Croatian Kajkavian Literary Language (KR). In this context, the project will collaborate with other textological projects at the Institute, which will be able to make lasting use of its results.

The following books have been selected for the corpus:

Josip Banovac, Predike od svetkovina došašća Isukrstova, 1759 (included in AR; already fully digitised)

Josip Banovac, Blagosov od polja, 1767 (included in AR)

Nikola Dešić, Raj duše, 1560 (not included in AR)

Croatian Protestants, Proroci, 1564 (included in AR, but based on Vatroslav Jagić’s edition)

Šime Starčević, Nova ričoslovica ilirička, 1812 (not included in AR; already fully digitised)

Bartol Kašić, Vanđelja i pistule, 1641 (not included in AR)

Bartol Kašić, Pismo od nasledovanja Gospodina našega Jezusa, 1641 (included in AR; digitisation is being carried out within another Institute project, De imitatione Christi na trima stilizacijama hrvatskoga književnog jezika)

Ivan Krištolovec (?), Od nasleduvanja Krištuševoga, 1710 (included in KR, but based on the 1760 edition; digitisation is currently being carried out within another Institute project, De imitatione Christi na trima stilizacijama hrvatskoga književnog jezika)

Marko Marulić, the first five editions of Judita (1521, 1522, 1523, 1586, 1627), which are treated here as a single book due to their relatively small size (included in AR, but based only on the edition in the series Stari pisci hrvatski)

Ivan Pergošić, Decretum, 1574 (included in KR, but based on the 1909 critical edition; particularly challenging for model development due to the large amount of German text)

Matija Antun Relković, Nova slavonska i nimačka gramatika, 1767 (not included in AR)

[If more text is needed to reach a corpus size of one million tokens, it will be taken from one additional book: Anton Dalmatin and Stipan Konzul, Postila, 1568 (included in AR, but based on the Glagolitic version from 1562).]

The project is funded by the European Union through the NextGenerationEU fund.