This section contains basic instructions for those in a hurry who just want to use the corpus interface for quick searches. The topics touched upon here are described at greater length in the full manual for the Polish Baroque Corpus[1]. The Sejmik Corpus largely reuses that corpus's tools and solutions, but it lacks the division into orthographic and transliterated layers, and it uses a single tagger, Concraft.
To find an occurrence of a word in the corpus (e.g., marszałka, the genitive of marszałek), just type the form into the Query field and click Search.
The rationale for the system of segmentation into words (or, strictly speaking, into tokens) is explained in more detail in an article about the National Corpus of Polish[2].
If you want to find all forms of a word, enter its base form, also known as the dictionary form or the lemma. The table of grammatical categories in the Baroque Corpus’ instructions[1] explains precisely how to determine the base form, which is often simply the form that we would look up in a dictionary.
Queries for the dictionary form are written according to the template: [base="x"]. The x here should be replaced with the dictionary form (the quotation marks must be preserved). For example, to find all occurrences of the word marszałek, we would enter the query: [base="marszałek"].
To find a sequence of tokens, you can ask for multiple tokens in one query, as a sequence of forms or as word specifications in square brackets. (To ask for a specific word form with the square brackets syntax, use the template [orth="x"]: for example, [orth="marszałek"] means the same as marszałek.)
If we want the search engine to match any one token, we just leave the square bracket pair empty: [].
For instance, [base="marszałek"] [] [base="koronny"] will find all fragments where forms of the words marszałek and koronny are separated by exactly one token: marszałka wielkiego koronnego, marszałkowi nadwornemu koronnemu, etc.
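The query templates described above can also be assembled programmatically. The following Python sketch is only an illustration: the helper names (base, orth, any_token, sequence) are hypothetical and are not part of the corpus software. They merely build query strings that can be pasted into the Query field, and they assume that the word itself contains no quotation marks.

```python
def base(lemma: str) -> str:
    """Token specification matching every form of the given dictionary form."""
    return f'[base="{lemma}"]'


def orth(form: str) -> str:
    """Token specification matching one exact word form."""
    return f'[orth="{form}"]'


def any_token() -> str:
    """Empty bracket pair: matches any single token."""
    return "[]"


def sequence(*tokens: str) -> str:
    """Join token specifications with spaces to form a sequence query."""
    return " ".join(tokens)


# Forms of "marszałek" and "koronny" separated by exactly one token.
query = sequence(base("marszałek"), any_token(), base("koronny"))
print(query)
```

Keeping each template in its own small function makes it harder to forget the quotation marks or the brackets when a longer sequence is built.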
Under the Metadata menu there are options for narrowing down the search based on features of individual documents. This is known as adding restrictions. More restrictions can be added with the Add a restriction button. Redundant restrictions can be removed with the ‘minus’ button.
You can, for example, search only among the documents from one region (the Province field) and from a period that you specify yourself (the Enactment date field). The pattern for searching should be entered in the Metadata query column if needed. The Restriction column is for choosing whether the data should match the pattern (=), should be less than the pattern (<), etc.
The contains restriction means that the text pattern is present in the field we are restricting. For example, the title Uniwersał królewski, zwołujący sejm walny warszawski, i sejmik w Lipnie. contains the text uniwersał (letter case is ignored).
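The behaviour of this restriction can be pictured as a case-insensitive substring test. The Python sketch below is an assumption about the observable behaviour only, not the interface's actual implementation; the function name contains is hypothetical.

```python
def contains(field_value: str, pattern: str) -> bool:
    """Return True if pattern occurs in field_value, ignoring letter case."""
    # casefold() is a Unicode-aware, more aggressive variant of lower().
    return pattern.casefold() in field_value.casefold()


title = "Uniwersał królewski, zwołujący sejm walny warszawski, i sejmik w Lipnie."
print(contains(title, "uniwersał"))
```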
The format of the enactment date is year-month-day: 1768-09-26 denotes the 26th day of September of the year 1768. You can also specify only the year (1700) or only the year and month (1650-01). If we are looking for a range of dates, the start date and end date should be separated by a space or a comma: 1590-03 1653-06-03 will find the documents enacted between March 1590 and June 3, 1653.
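One way to read such date patterns is to expand each of them to a concrete range of days. The Python sketch below illustrates that reading; it is not the corpus's own code. The function names are hypothetical, and the sketch assumes that a partial start date expands to its first day and a partial end date to its last day, which the interface may or may not do.

```python
import calendar
import re
from datetime import date


def earliest(pattern: str) -> date:
    """First day covered by a year, year-month or year-month-day pattern."""
    parts = [int(p) for p in pattern.split("-")]
    year = parts[0]
    month = parts[1] if len(parts) > 1 else 1
    day = parts[2] if len(parts) > 2 else 1
    return date(year, month, day)


def latest(pattern: str) -> date:
    """Last day covered by a year, year-month or year-month-day pattern."""
    parts = [int(p) for p in pattern.split("-")]
    year = parts[0]
    month = parts[1] if len(parts) > 1 else 12
    # monthrange returns (weekday of first day, number of days in month).
    day = parts[2] if len(parts) > 2 else calendar.monthrange(year, month)[1]
    return date(year, month, day)


def parse_range(text: str) -> tuple[date, date]:
    """Parse one pattern, or two patterns separated by a space or a comma."""
    patterns = [p for p in re.split(r"[\s,]+", text.strip()) if p]
    if len(patterns) == 1:
        return earliest(patterns[0]), latest(patterns[0])
    if len(patterns) == 2:
        return earliest(patterns[0]), latest(patterns[1])
    raise ValueError(f"expected one or two date patterns, got: {text!r}")


print(parse_range("1590-03 1653-06-03"))
print(parse_range("1650-01"))
```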
The Add a restriction button will automatically add a restriction to the query already entered, so that it does not match foreign tokens (such as Latin and OCR artifacts).
Clicking the Query builder button opens a window where we can build advanced queries using the detailed data.
The results show the fragment that was found, in its context. The middle, emphasized column presents the fragment of the sentence matched by the query. Aside from the words themselves, there is information in square brackets: the dictionary form (the lemma) of each word and the linguistic description of the form. More details on this description are available in Chapter 2 of the Baroque Corpus’ manual[1].
The rest of the columns give the context of the fragment inside the whole document, the unique document ID (identifier) and the date of its enactment (if it was detected by the automated system during the construction of the corpus).
If you hover over the fragment in the middle column, a bubble will appear with more information about it: the title of the document (if it was automatically detected) and the name of the institution behind the document. By clicking on the middle column, you will reveal even more data and a longer fragment of the context (which you can copy to the clipboard).
[1] https://www.korba.edu.pl/manual
[2] Adam Przepiórkowski, Grzegorz Murzynowski 2011. Manual annotation of the National Corpus of Polish with Anotatornia. In: Goźdź-Roszkowski, Stanisław (ed.), Explorations across Languages and Corpora: PALC 2009. Peter Lang, Frankfurt am Main. Pp. 95-104. http://nlp.ipipan.waw.pl/~adamp/Papers/2009-palc-anotatornia/paper.pdf