Databases on the Indonesian Prefixes PE- and PEN-

Karlina

Abstract


Introduction
PEN- and PE- are two nominalizing prefixes that create an agent, an instrument, or a patient. Several studies have investigated the prefixes' form, meaning, and corresponding verbs. PEN-, the first prefix, derives nouns through affix substitution with the verbal prefix MEN- (e.g., pembaca 'reader', cf. membaca 'to read'). PE-, the second prefix, derives nouns through affix substitution with the verbal prefixes ber- or di- (e.g., pelari 'runner', cf. berlari 'to run', and pesapa 'addressee', cf. disapa 'to be addressed').
From a semantic perspective, both forms may occur in similar semantic roles (Sneddon et al., 2010). PEN- expresses agent, instrument, or causer. For instance, the agent pengasih 'lover' is derived from the base word kasih 'to love', the instrument pemotong 'cutter' from potong 'to cut', and the causer penyakit 'disease' from sakit 'to be sick'. Words with PE-, meanwhile, express patient, agent, or instrument (e.g., pesapa 'addressee' from sapa 'to address', pelari 'runner' from lari 'to run', and pekasih 'love potion').
Nasalization in PEN-, denoted by 'N', yields five nasalized allomorphs (PENpen-, PENpem-, PENpeng-, PENpeny-, PENpenge-). Only one allomorph does not follow the nasalization rule, PENpe-, which is very similar in form to the invariant PE-. As a result, non-native speakers of Indonesian may find it difficult to differentiate PE- and PEN-, because this PEN- allomorph occasionally appears in the same phonological environment as PE- (see Table 1). For example, pelari 'runner' is PE-, whereas pelukis 'painter' is PEN-, although both precede a stem beginning with the lateral liquid /l/. The only way to differentiate PEN- and PE- in this circumstance is by relating them to the corresponding verb.
The overlap between these two prefixes has not yet been well addressed. Distinguishing PE- and PEN- is made more difficult by the lack of consensus on whether these formations are derived from one or two prefixes (Denistia, 2018). A likely reason for this inconclusive finding is the small number of observations. Therefore, a set of databases is needed to explore this phenomenon from a quantitative perspective.
Recent studies on these prefixes conducted analyses based on corpus data (Denistia & Baayen, 2019, 2022a, 2022b; Denistia et al., 2022). This research investigated whether PE- and PEN- are allomorphs in terms of their productivity, computational learning, and semantic distribution, respectively. One significant finding is that PE- and PEN- should be treated as two different prefixes because of their different productivity and semantics. PEN- is more productive than PE-. In addition, although both PE- and PEN- create agents, PEN- is productive in creating instruments, while PE- is productive in creating patients. Moreover, the number of derived words with PEN- (and all of its allomorphs) is linearly dependent on the number of base words for the MEN- allomorphs. PE-, however, is an outlier with respect to this linear relationship. Apart from the productivity analysis, Denistia et al. (2022) used distributional semantics (Mikolov et al., 2013) to measure the similarity of all possible combinations of PE- and PEN- words. They found that PE- and PEN- are semantically discriminable: the cosine similarities of PE- and PEN- words differ significantly only across prefixes. Furthermore, compared to derived words with PEN-, words starting with PE- have meanings that are more similar to their noun bases.
This paper provides a detailed explanation of the materials and databases used in Denistia & Baayen (2019) and Denistia et al. (2022). It describes the theoretical grounding for how the information in the databases was classified (e.g., the classification of PE- and PEN-, the allomorphs of PEN-, semantic role, cosine similarity, and token frequency in the corpus). The tools used to generate the two databases of PE- and PEN- are also described. The information and explanations are structured in a way that I hope will contribute to both corpus and quantitative linguistic analysis.
In what follows, I first introduce the main corpus and tools. In the next section, I present the databases. Finally, I conclude the study in the final section. Along with this paper, two databases are made publicly available and can be downloaded at http://bit.ly/PePeNProductivity and http://bit.ly/PePeNSemVector.

Leipzig Corpora Collection
The Leipzig Corpora Collection corpus includes a range of Indonesian textual registers from 2008 to 2012, including newspapers, the web, and Wikipedia (Goldhahn et al., 2012; Quasthoff et al., 2006). Text data were gathered by crawling available online newspapers [http://www.abyznewslinks.com].
In addition, the collection uses a framework for parallel Web crawling, with http://www.httrack.com as the Web site copier. The corpus was also collected by crawling the World Wide Web randomly, using FindLinks [http://wortschatz.uni-leipzig.de/findlinks/] (Heyer & Quasthoff, 2004).
Furthermore, UDHR [http://www.ohchr.org] and Wikipedia [http://sourceforge.net/projects/wikiprep/] were used as resources, so that the corpora cover more texts in various languages. The text data in the corpora were preprocessed with HTML stripping to retain only well-formed sentences, with LangSepa (Pollmächer, 2011) to separate the text by language, and with www.sonderzeichen.de to generate sentence boundaries. To sidestep copyright issues and to make it impossible to recreate the original material, the sentences were shuffled. The Indonesian Leipzig Corpora Collection corpus is available online at https://corpora.uni-leipzig.de/en?corpusId=ind_mixed_2013.

Indonesian Morphological Parser (MorphInd)
The MorphInd parser (Larasati et al., 2011), which has an overall accuracy of 84.6%, was used to perform morphological analysis on the words in the PePeN Database. It was run in non-compound mode. Before running the parser, I manually corrected 200 words beginning with PE- or PEN- that contained typos (see Table 2 for illustrations) and added the frequency of each typo to the frequency of the correctly spelled word. Additionally, with manual verification against the dictionary as the gold standard, MorphInd's recall for detecting PE- and PEN- was 0.82 and its precision was 0.98.
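As a reminder of how the recall and precision figures above are defined, the following minimal Python sketch computes both measures from counts of true positives, false positives, and false negatives. The function name and the counts are hypothetical and chosen only so that the output rounds to the reported values; they are not the counts from the actual verification.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) for a detection task."""
    if true_pos + false_pos == 0 or true_pos + false_neg == 0:
        raise ValueError("precision or recall is undefined for these counts")
    precision = true_pos / (true_pos + false_pos)  # share of detected words that are correct
    recall = true_pos / (true_pos + false_neg)     # share of gold-standard words that were detected
    return precision, recall


# Hypothetical counts for illustration only (not taken from the study).
precision, recall = precision_recall(true_pos=820, false_pos=17, false_neg=180)
print(round(precision, 2), round(recall, 2))
```

A high precision with a lower recall, as reported for MorphInd, means that the words it labels as PE- or PEN- are almost always correct, while some true PE- and PEN- words are missed.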
The open-source programming language R, version 3.3.3, was used to process the data in R Studio (R Team, 2015). R is open-source software that can be downloaded free of charge at https://cran.r-project.org (available for Windows, Mac, and Linux).
Word to Vector was used to convert all the lemmatized words in the corpus into vectors. Each word in the corpus was represented by a high-dimensional vector (Turney & Pantel, 2010). Cosine similarity, which is the length-normalized inner product of two vectors, was used to calculate the degree of semantic similarity between two lemmas based on the distributional information of the words (their co-occurrences with other words in large corpora). The cosine similarity between two vectors $\vec{v}$ and $\vec{w}$ is the cosine of the angle $\theta$ between them: $\cos\theta = \frac{\vec{v} \cdot \vec{w}}{\lVert \vec{v} \rVert \, \lVert \vec{w} \rVert}$.
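The cosine similarity described above can be illustrated with a minimal Python sketch. This is not the code used to build the database (the data were processed in R); the function name and the toy vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(v: list[float], w: list[float]) -> float:
    """Length-normalized inner product of two equally long vectors."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0.0 or norm_w == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_v * norm_w)


# Toy vectors: parallel vectors give a value near 1, orthogonal vectors give 0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because the measure is normalized by vector length, it reflects only the direction of the two vectors, so frequent and infrequent lemmas can be compared on the same scale.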
The PePeN CosSim Database stores the cosine similarity value for each possible pair of words from the set of PE- words, PEN- words, and their base words. The database includes Lemma 1, Lemma 2, Cosine Similarity (the cosine similarity value between Lemma 1 and Lemma 2), and Derived-Base Cosine Similarity (the cosine similarity of the derived word with its base word). In total, I collected 358224 permutations of derived words with PEN- and 59810 permutations of derived words with PE-, together with their cosine similarity to their base words (see Table 4 for example entries of this database). Words with a token frequency of less than 5 were not included in this database.
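The structure of such a pairwise table can be sketched in Python as follows. This is only an illustration under stated assumptions: the row keys, the helper names, the two-dimensional toy vectors, and the toy frequencies are hypothetical, and ordered pairs (permutations) are assumed because the text above counts permutations. Only the frequency threshold of 5 comes from the description above.

```python
import math
from itertools import permutations

MIN_TOKEN_FREQUENCY = 5  # words below this token frequency are excluded


def cosine(v, w):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))


def build_pair_rows(vectors, frequencies, bases):
    """One row per ordered pair of sufficiently frequent lemmas."""
    kept = sorted(l for l in vectors if frequencies.get(l, 0) >= MIN_TOKEN_FREQUENCY)
    rows = []
    for lemma1, lemma2 in permutations(kept, 2):
        base = bases.get(lemma1)  # None when lemma1 is itself a base word
        derived_base = cosine(vectors[lemma1], vectors[base]) if base in vectors else None
        rows.append({
            "Lemma1": lemma1,
            "Lemma2": lemma2,
            "CosineSimilarity": cosine(vectors[lemma1], vectors[lemma2]),
            "DerivedBaseCosineSimilarity": derived_base,
        })
    return rows


# Toy data for illustration only; real vectors are high-dimensional.
TOY_VECTORS = {"pelari": [1.0, 1.0], "lari": [1.0, 0.0],
               "pelukis": [0.5, 1.0], "lukis": [0.0, 1.0]}
TOY_FREQUENCIES = {"pelari": 40, "lari": 120, "pelukis": 3, "lukis": 25}
TOY_BASES = {"pelari": "lari", "pelukis": "lukis"}

rows = build_pair_rows(TOY_VECTORS, TOY_FREQUENCIES, TOY_BASES)
print(len(rows))
```

In this toy example, pelukis falls below the frequency threshold and is dropped, so the three remaining lemmas produce six ordered pairs.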

Results and Discussion
The PePeN Database includes a total of 3090 words: 2818 words with PEN-, 267 words with PE-, and 4 words with the unproductive variant PER- (Benjamin, 2009). The latter prefix is not discussed in this paper. For the sake of quantitative analysis, both the PePeN Database and the PePeN CosSim Database provide information on how many times the words with PE- or PEN- and their base words occur in the corpus, usually called 'token frequency' (see Table 5). These frequencies are each word's overall frequency and are not segmented by meaning.

Classifying PE-and PEN-
There are two ways to differentiate PE- and PEN-. The first is by applying the phonological conditions on PEN- and its six allomorphs: PENpen-, PENpeng-, PENpem-, PENpeny-, PENpe-, and PENpenge-. The phonological context determines the nasal allomorphy of PEN-. The phonological conditioning of the PEN- allomorphs is summarized by Ramlan (1985) and Sugerman (2016). Sneddon et al. (2010) give some exceptions to these phonological conditions. If the stem is borrowed from another language, the initial /k/, /s/, /t/, or /p/ of some bases is not lost. The resulting derived word is accepted as an Indonesian word, as in pengklasifikasi 'classifier' from the stem klasifikasi 'classification'.
Although derived verbs with MEN- can be further extended with the suffixes -i or -kan, derived nouns with PEN- cannot. Nevertheless, verbs with the MEN-/-i or MEN-/-kan affixes may have semantics similar to those of the derived nouns. For instance, pewawancara 'interviewer' is related to mewawancarai 'to interview someone'. Likewise, although the corresponding verbs with BER- can be extended with the suffixes -an or -kan, derived nouns with PE- do not carry these suffixes.
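The phonological conditioning of PEN- can be sketched as a small Python function. The rules below are a simplified version of the conditioning commonly described for Indonesian and are an assumption on my part, not a reproduction of the summary by Ramlan (1985) or Sugerman (2016); the function name, the monosyllable heuristic (exactly one vowel letter), and the treatment of initial consonant clusters as retained loanword consonants are illustrative simplifications.

```python
VOWELS = set("aeiou")


def pen_form(base: str) -> tuple[str, str]:
    """Return (allomorph, derived form) for a base word, using simplified rules."""
    base = base.lower()
    if not base.isalpha():
        raise ValueError("base must be a non-empty alphabetic string")
    first = base[0]
    second = base[1] if len(base) > 1 else ""
    # Heuristic: a base with a single vowel letter is treated as monosyllabic.
    if sum(ch in VOWELS for ch in base) == 1:
        return "penge-", "penge" + base
    # Initial k/s/t/p before a consonant (typical of loans) is retained.
    if first in "kstp" and second not in VOWELS:
        prefix = {"p": "pem", "t": "pen", "s": "pen", "k": "peng"}[first]
        return prefix + "-", prefix + base
    # Initial k/s/t/p before a vowel is lost after the nasal.
    if first in "kstp":
        prefix = {"p": "pem", "t": "pen", "s": "peny", "k": "peng"}[first]
        return prefix + "-", prefix + base[1:]
    if first in "bfv":
        return "pem-", "pem" + base
    if first in "dcjz":
        return "pen-", "pen" + base
    if first in VOWELS or first in "gh":
        return "peng-", "peng" + base
    # Remaining initials (l, r, m, n, w, y) take the non-nasalized pe-.
    return "pe-", "pe" + base


for word in ["baca", "potong", "kasih", "sakit", "lukis", "klasifikasi"]:
    print(word, pen_form(word))
```

Note that such a function only generates PEN- forms. It cannot decide whether an attested word such as pelari belongs to PE- or to the PENpe- allomorph, which is why the corresponding verb is needed as a second criterion.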

Base Word of PE-and PEN-
Indonesian nouns, verbs, and adjectives can be monomorphemic or polymorphemic. Kridalaksana (2007) explained that nouns are classified as abstract or concrete, animate or inanimate, countable or uncountable, and collective or non-collective. Verbs can be identified by adding dengan and an adjective, which together function as an adverbial of manner (comparable to the -ly suffix in English). For instance, berlari 'to run' can be modified into berlari dengan cepat 'to run fast'; therefore, berlari is a verb. Verb formations are classified as transitive or intransitive, active or passive or anti-active or anti-passive, reciprocal or nonreciprocal, reflexive or nonreflexive, copulative or equative, and performative or constative. Adjectives can be identified by negation with tidak 'not', by premodifiers (e.g., sangat 'very', agak 'pretty', lebih 'more'), and by their ability to modify nouns. They are classified as predicative or attributive and as gradable or nongradable.
Table 7 shows examples of the base word and base word category in the database. In the PePeN Database and the PePeN CosSim Database, the dictionary and MorphInd were used to determine the base word category of the PE- and PEN- nouns. The two sources may conflict in determining the base word category. In that case, I followed the base word category provided by the Indonesian dictionary. However, where the dictionary does not provide the word category of the base, I used the category identified by the MorphInd parser. I did not provide a further classification within each category (such as whether a verb is transitive or intransitive, or whether a noun is animate or inanimate).
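The decision rule for the base word category described above can be written as a short Python sketch. The function name and the category labels are hypothetical; the sketch only encodes the stated priority of the dictionary over MorphInd.

```python
from typing import Optional


def resolve_base_category(dictionary_category: Optional[str],
                          morphind_category: Optional[str]) -> Optional[str]:
    """Prefer the dictionary's category; fall back to MorphInd when it is missing."""
    if dictionary_category:
        return dictionary_category
    return morphind_category


# Conflict: the dictionary wins.
print(resolve_base_category("noun", "verb"))
# No dictionary information: MorphInd's identification is used.
print(resolve_base_category(None, "verb"))
```

Keeping the rule in one place makes the source of each category traceable: any entry whose dictionary category is missing is known to rely on the parser.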

Semantics Role of PE-and PEN-
Manually verifying every token of all PE- and PEN- words was not feasible. Therefore, I manually annotated the semantic role of all derived words with PE- and PEN-, checking each against its usage in the corpus for at least one token as well as against the dictionary (Alwi, 2012). One implication of this limitation is that the ambiguity in assigning a semantic role to PE- and PEN- words that express multiple semantic roles could not be resolved. Thus, there may be cases in which a semantic role is realized in the corpus but not registered in the database.
Table 8 shows various readings for PE- and PEN- formations. Similarly, English -er nominalizations may have a range of semantic roles (e.g., printer, which has both an instrument and an agent reading) (Booij, 2010; Booij & Lieber, 2004). I did not distinguish impersonal agents from other agents in this research. The term impersonal agent was introduced by Booij (1986) for the 'radio station' reading of the Dutch word zender, which also has both an instrumental interpretation, 'transmitter', and an agentive meaning, 'one who sends'. Although it is commonly known that PEN- creates agents, patients, and instruments (Sneddon et al., 2010), the database also contains a small number of instances of causer (e.g., penyakit 'disease') and location (e.g., penghujung 'the end'). It is plausible, and perhaps likely, that semantic roles not registered in the database are nonetheless used in the corpus. Words with more than one semantic role have multiple entries in the database, one row per role (cf. Table 8, rows 1-6). Occasionally, the prefixes PEN- and PE- attach to the same base word; often, the form with PE- denotes a profession, whereas the form with PEN- does not (cf. Table 8, rows 7 and 8). In some instances, the form with the prefix PEN- expresses the agent, causer, or instrument, while the form with the prefix PE- expresses the patient or agent (cf.

Morphological Variation of PE-and PEN-
In Indonesian, there are bound morphs for possession on nouns (first person -ku, second person -mu, and third person singular -nya), and for subject (first person ku- and second person singular kau-) and object (first person -ku, second person -mu, and third person singular -nya) marking on verbs (Sneddon et al., 2010). These bound morphemes instantiate contextual inflection, the kind of inflection that is dictated by syntax, as proposed by Booij (1996). Additionally, there are two suffixes that can be added to verbs or nouns to indicate emphasis (-lah) or a question (-kah). Bound morphemes that are phonologically reduced versions of free pronouns are termed clitics (Kridalaksana, 2008). I will refer to these morphs as inflectional because they modify existing words rather than create new ones, much as English adverbs modify verbs.
Reduplication expresses different semantic functions on verbs and adjectives, including intensification and iteration respectively, and conveys the plural for nouns (Rafferty, 2002; Chaer, 2008; Dalrymple & Mofu, 2012; Sugerman, 2016). According to Booij (1996), reduplication as well as -lah, -kah, and -pun instantiate inherent inflection. Although it may have syntactic relevance, inherent inflection is the kind of inflection that is not required by the syntactic context. In the database, reduplication is treated as more akin to syntactic modification than to word formation. Hence, reduplicated forms were classified as inflectional because their semantics still relate to a kind of plurality (e.g., intensification or iteration). Some examples of inflection are listed in Table 9.

Conclusion
Although there have been many qualitative descriptions of the Indonesian prefixes PE- and PEN-, some questions on how to discriminate them remain unanswered. PEN- has five allomorphs that follow the nasalization rule (PENpen-, PENpem-, PENpeng-, PENpeny-, PENpenge-) and only one allomorph, PENpe-, that is not nasalized. A problem arises when PENpe- and PE- compete, appearing in the same phonological environment. Moreover, theories have not agreed on whether these nominalizing prefixes are one formation or two independent formations. This paper provides detailed information on two databases, the PePeN Database and the PePeN CosSim Database, as a contribution to a quantitative approach to Indonesian linguistics. With data taken from the Leipzig Corpora Collection, I used several tools and a programming language to classify the database entries by prefix, allomorph, base word, base word class, semantic role, inflection, and cosine similarity. These databases could be used to conduct further studies on PE- and PEN- formations. This study, however, is limited to only two nominalizing prefixes, PE- and PEN-. Indonesian has other nominalizing affixes (e.g., -an as in luar 'outside' to luaran 'outcome', and ke-/-an as in makmur 'prosperous' to kemakmuran 'prosperity'). In addition, PEN- can also combine with the suffix -an to form the circumfix peN-/-an (e.g., tinggal 'stay' to peninggalan 'inheritance'). Nouns can also be derived with per-/-an, such as ubah 'to change' to perubahan 'a change'. Therefore, descriptions of databases for other nominalizing affixes would be useful for further research.

Table 1. Words with PE- and PEN- that have similar phonological conditions

Table 3. The MorphInd parser output examples

Table 5. Sample entries of PePeN Database

Table 6. Examples of the corresponding PEN- with MEN- and PE- with BER-

Table 7. Examples of PePeN base word and base word category

Table 9. Examples of inflection in PePeN Database