swell-project

Correction annotation guidelines

Introduction

The purpose of the correction annotation

The purpose of the correction annotation is to make the learner corpus searchable for different types of deviations from a standard language norm. The annotation of the learner texts according to the correction-taxonomy of the SweLL project is hence an important step in making the learner language assembled in the corpus analyzable for research and educational purposes.

The purposes of this document

This document has three interconnected purposes:

The organization of the document

The rest of this document is divided into two main sections: one section with general directions for the correction annotation, and one section on the SweLL correction annotation taxonomy.

General directions for the correction annotation

What is a correction? What is annotated?

By correction we mean a difference between the original learner text and the normalized version of the text. The correction annotation is thus a categorization of such differences.

This means that the correction annotation only indirectly indicates properties of the source text. What is directly indicated is the relationship between the original version and the normalized version of the text. The correction annotation is thus highly dependent on the preceding normalization.

While the purpose of the correction annotation is to make the texts searchable for deviations from a standard language norm, such deviations are only possible to categorize on the basis of assumptions of the learner’s intended content, along with judgements of the acceptability and suitability of standard language expressions communicating that content. This means that any annotation of “deviations”, “errors” etc. in learner texts is actually an annotation of a relationship between (a segment of) the analyzed text and an assumed standard language version of the text (segment). By the choice of the term correction annotation (rather than, for instance, error annotation) the SweLL project emphasizes these conditions of learner language analysis. The normalized text versions make explicit this necessary assumption about the standard version of the text to which the original text is related. (The principles underlying the normalization are described in the normalization guidelines.)

A consequence of the fundamental principle of annotating corrections, understood as differences between the original text and one specific interpretation of this text, rather than “errors in general”, is that certain clear deviations from the norms of written standard Swedish in the original texts are left without annotations – because they are not deviations in relation to the normalized text. This occurs for instance when a misspelled word in the original text has been corrected to another word altogether. Such a correction will be annotated as an instance of a faulty choice of word (L-W), and since the word in the original text cannot be analyzed as a misspelling of the word in the normalized text, the spelling error will be left without annotation.

Tags

The correction annotation is created by marking links between the original text and the normalized text with tags representing the categories in the SweLL correction annotation taxonomy, which is presented below. The tags are available in the Svala annotation tool in a list to the left on the screen.

A link may be tagged with one or several tags.

The links on which the tags are placed are visualized as lines in the Svala annotation tool. They run between elements of the original text and corresponding elements of the normalized text. The linked text elements typically consist of one token in each text version, but they may also consist of more than one token in one of the texts or in both texts. Moreover, links may run between a text element in one text version and “nothing”. This is normally the case when tokens have been added or removed in the normalized version of the text.
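The link structure described above can be pictured as a small data model. The following is only a sketch under stated assumptions: the class name `Link`, its fields and the example tokens are invented for illustration and do not reflect Svala's internal format. An empty token list stands for "nothing" on that side of the link.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Link:
    """Hypothetical model of one link between the two text versions."""
    original: List[str]      # tokens in the original text ([] means "nothing")
    normalized: List[str]    # tokens in the normalized text ([] means "nothing")
    tags: Set[str] = field(default_factory=set)  # a link may carry several tags

    def kind(self) -> str:
        """Classify the link by which side, if any, is empty."""
        if not self.original:
            return "addition"
        if not self.normalized:
            return "removal"
        return "unchanged" if self.original == self.normalized else "changed"


def find(links: List[Link], tag: str) -> List[Link]:
    """Return all links carrying the given tag (a simple corpus search)."""
    return [link for link in links if tag in link.tags]


# Invented example links, not taken from the corpus.
links = [
    Link(["jag"], ["Jag"], {"O-Cap"}),   # one token on each side
    Link([], ["."], {"P-M"}),            # token added in the normalized text
    Link(["bor"], ["bor"]),              # identical tokens, no tag
]
```

With this model, searching the annotated text for a correction type amounts to filtering links by tag, for example `find(links, "P-M")`.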

During the normalization process, preliminary links between the original and the normalized texts are created automatically. These links may be adjusted during the correction annotation process. The correct link must be in place before a tag is inserted. Links may be adjusted in the following way:

In order to break a link, place the marker on one of the linked text elements or on the link itself, and press the orphan button on the menu to the left.

In order to create a new link, mark all of the tokens which you wish to link (at least one token in the original text and at least one token in the normalized text) by clicking the tokens while holding ctrl (PC) or shift (Mac). Then press the group button on the menu to the left.
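The automatically created preliminary links can be thought of as the result of a token alignment between the two text versions. The following is a minimal sketch of such an alignment using Python's standard `difflib` module. It is not the algorithm used in Svala, which this document does not describe; the function name `preliminary_links` and the example sentence are invented for illustration.

```python
from difflib import SequenceMatcher


def preliminary_links(original, normalized):
    """Pair up token spans of two token lists.

    Returns a list of (original_tokens, normalized_tokens) pairs.
    An empty list on one side stands for "nothing".
    """
    matcher = SequenceMatcher(a=original, b=normalized, autojunk=False)
    links = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Identical tokens are linked one to one.
            for o, n in zip(original[i1:i2], normalized[j1:j2]):
                links.append(([o], [n]))
        else:
            # "replace", "delete" and "insert": link the differing spans as a whole.
            links.append((original[i1:i2], normalized[j1:j2]))
    return links


# Invented example: a capitalization change and an added full stop.
for link in preliminary_links(["jag", "bor", "i", "Sverige"],
                              ["Jag", "bor", "i", "Sverige", "."]):
    print(link)
```

An alignment of this kind groups whole differing spans together, which is one reason why links may need to be broken and regrouped by hand before tags are added.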

Document comments

The Document comment field provides an opportunity to comment on deviations from the standard norm regarding text properties which cannot be adequately reflected by tags on individual links. This field may for instance be used when the verb tense choices in the text are inconsistent at a global text level, but when corrections of individual verb forms have generally not been made, since there is consistency more locally in the text.

The field may also be used for any kind of general comment on the text which the annotator regards as essential for the future corpus user.

The SweLL correction annotation taxonomy

In this section the SweLL correction annotation taxonomy is presented, and directions on how to apply and interpret the tags associated with the correction categories are provided.

First, the general structure of the taxonomy is described. After that, the annotation categories are presented in the order in which they occur in the Svala annotation tool. After the presentation of all the correction categories follows a section on a few specific categorization issues, cutting through several correction categories. This includes the following sections:

The general structure of the taxonomy

The SweLL correction annotation taxonomy contains five main categories of correction types:

In addition to the correction tags included in these five main correction categories, the Svala tool provides six other tags, including tags for corrections made as a consequence of other corrections (C), corrections not covered by any of the defined correction categories (Unid), unintelligible strings (X), strings cited from a foreign language (Cit-FL), and finally two tags for notes and comments – OBS! for internal work notes, and Com! for comments intended for the corpus users.

In the following the annotation categories and their tags will be presented in the order in which they appear in the Svala annotation tool – i.e. the O tags first, followed by the four other main correction categories in alphabetic order (L, M, P, S), and, finally, the remaining six tags under the heading Other tags.
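As an overview, the structure of the taxonomy can be written down as a simple mapping from main category to tags. The sketch below uses the tag names from the section headings of this document, in the order of presentation; the variable name `TAXONOMY` and the helper `main_category` are invented for illustration and are not part of the Svala tool.

```python
# Main categories in the order in which they are presented in this document.
TAXONOMY = {
    "O": ["O", "O-Cap", "O-Comp"],
    "L": ["L-Der", "L-FL", "L-Ref", "L-W"],
    "M": ["M-Adj/adv", "M-Case", "M-Def", "M-F",
          "M-Gend", "M-Num", "M-Other", "M-Verb"],
    "P": ["P-M", "P-R", "P-Sent", "P-W"],
    "S": ["S-Adv", "S-Clause", "S-Comp", "S-Ext", "S-FinV", "S-M",
          "S-Msubj", "S-Other", "S-R", "S-Type", "S-WO"],
    "Other": ["C", "Cit-FL", "Com!", "OBS!", "X", "Unid"],
}


def main_category(tag):
    """Return the main category a tag belongs to."""
    for category, tags in TAXONOMY.items():
        if tag in tags:
            return category
    raise ValueError(f"unknown tag: {tag}")


print(main_category("M-Def"))   # M
print(main_category("Cit-FL"))  # Other
```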

O – Orthographic corrections

The O tags represent the orthographic correction category. It includes three sub-categories.

O (regular spelling correction)

The plain O tag is used for regular spelling corrections, i.e. when the string of letters is different in the original text and the normalized text, due to a spelling mistake.

Examples

O-Cap (upper/lower case)

The O-Cap tag is used for corrections regarding the choice between upper and lower case letters.

Examples

O-Comp (spaces and hyphens between words)

The O-Comp tag is used for corrections which involve the removal of a space between two words which have been interpreted as making up a compound in the normalized text version, or, more rarely, the adding of a space between two words. It may also be used for corrections regarding the use of hyphens in compounds.

This tag should only be used for corrections concerning the mere orthographic rendering – with or without a space or a hyphen – of a compound or a multi-word expression, and not for corrections which are rather to be interpreted as involving an actual alternation between a compound and a multi-word expression. The latter case is covered by the S-Comp tag. (See the S-Comp section, and the section on compounds vs multi-word expressions.)

Examples

In the following example, the change from a comma to a long dash between Finland and Sverige is tagged with P-W, and the change from a space to a hyphen between Sverige and historien is tagged with O-Comp:

Corrections which involve both a removal of a space and a change of the form of the first part of the compound are tagged both with O-Comp and L-Der:

L – Lexical corrections

The L tags represent the lexical correction category. It includes four sub-categories.

L-Der (word formation)

The L-Der tag represents the correction category deviant word formation. It is used for corrections of the internal morphological structure of word stems, both with regard to compounding and to derivation.

The L-Der tag is exclusively used for links between one-word units, where the normalized word has kept at least one root morpheme from the original word, but where another morpheme has been removed, added, exchanged or had its form altered.

Examples

(See the special section on verbal particles and reflexives.)

Note (1): Corrections of verbal particles of phrasal forms of particle verbs are not tagged with the L-Der tag, but with the L-W tag. (See the special section on verbal particles and reflexives.)

Note (2): The L-Der tag is not used for corrections which involve the addition of a -t suffix to an adjective which is used as an adverb, since this correction category is treated as a morphological correction with its own tag, M-Adj/adv.

Note (3): When the correction of a word tagged with L-Der involves a change of phrase type or part of speech, the correction is tagged with S-Type, in addition to L-Der.

Examples

Changes from båda to både, or the other way around, are tagged as L-Der (and S-Type) rather than as L-W:

L-FL (foreign word corrected to Swedish)

The L-FL tag is used for words from a foreign (non-Swedish) language which have been corrected to a Swedish word. It may also be applied to words which have certain non-Swedish traits due to influence from a foreign language.

Examples

The L-FL tag is used for corrections with the following characteristics:

L-Ref (anaphoric expressions)

The L-Ref tag is used for anaphoric expressions (particularly pronouns and pronominal adverbs) which have been corrected in order to have the grammatical form (gender, number, reflexive/non-reflexive), semantic content (masculine/feminine, directional/locational etc.), and specificity which suits its correlate and its textual position.

The L-Ref tag has higher priority than the M-Num, M-Gend and L-W tags.

The L-Ref tag has lower priority than the M-Def tag, and should only be applied in cases when the M-Def tag cannot be applied.
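The priority rules for L-Ref can be read as pairs of a higher and a lower ranked tag. The following sketch encodes only the pairs stated above; the names `PRIORITY` and `resolve` are invented for illustration, and the sketch treats each pair separately without deciding how the rules chain.

```python
# (higher priority, lower priority) pairs as stated for the L-Ref tag.
PRIORITY = [
    ("M-Def", "L-Ref"),
    ("L-Ref", "M-Num"),
    ("L-Ref", "M-Gend"),
    ("L-Ref", "L-W"),
]


def resolve(candidates):
    """Drop every candidate tag that is outranked by another candidate."""
    candidates = set(candidates)
    outranked = {low for high, low in PRIORITY
                 if high in candidates and low in candidates}
    return candidates - outranked


print(resolve({"L-Ref", "M-Gend"}))  # {'L-Ref'}
print(resolve({"M-Def", "L-Ref"}))   # {'M-Def'}
```

Tags which are not related by any stated rule are left untouched, so `resolve({"L-W", "M-Num"})` returns both tags and the choice remains with the annotator.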

Examples

The L-Ref tag may also be used when a noun which is used anaphorically has been exchanged for a pronoun, or the other way around, in order for the specificity of the anaphoric expression to suit its textual position:

L-W (wrong word or phrase)

The L-W tag represents the correction category wrong word or phrase. It is used when a word or phrase in the original text has been replaced by another word or phrase in the normalized version. The L-W tag is thus placed on strings which are exchanged rather than corrected (see note below for further explanation).

The L-W tag is only applied when at least one of the strings (the original string and the normalized string) is (an attempt at) a word or a fixed phrase.

The L-W tag has lower priority than the L-Ref tag.

One word replaced by one word:

One word replaced by a multi-word expression:

Multi-word expression replaced by one word:

Multi-word expression replaced by another multi-word expression:

Phrasal verb replaced by another phrasal verb, both verb and particle exchanged:

Compound particle verb replaced by a phrasal verb, both verb and particle replaced:

Verbal particle replaced by another verbal particle, but verb kept; the verbal particle rather than the whole phrasal verb is tagged with L-W:

Fixed expression consisting of a lexical word (e.g. a noun) and a grammatical word (e.g. a preposition) replaced by another fixed expression consisting of the same lexical word but another grammatical word; the grammatical word rather than the full fixed expression is tagged with L-W:

Note (1): Corrections consisting in the mere removal or addition of the reflexive sig or a verbal particle are not tagged with L-W, but with S-R or S-M respectively – even if both the bare verb and the phrasal verb may be characterized as lexical units.

Note (2): When a correction tagged with L-W involves a change of phrase type/part of speech, the correction is also marked with the additional tag S-Type.

Examples

Note (3): An expression tagged with L-W should be replaced rather than corrected. This means:

M – Morphological corrections

The M tags represent the category morphological corrections. It covers corrections related to inflections. This includes primarily corrections of individual inflectional forms, but in some cases also corrections of more complex grammatical constructions closely related to inflectional forms. The latter concerns basic definiteness constructions (see M-Def), the periphrastic comparative and superlative adjective constructions (see M-F), and tense-related verbal constructions involving auxiliaries (see M-Verb).

The category of morphological corrections includes eight sub-categories.

M-Adj/adv (adjective corrected to adverb)

The M-Adj/adv tag is used for corrections involving the change of an adjective to its t-form, when the t-form is called for due to the adjective being used as an adverb.

Examples

The M-Adj/adv is also used for similar changes, when an adjective or adjective-like word is changed to a morphologically closely related but distinct adverb form:

Moreover, the M-Adj/adv tag is used when an adjectival form of the word liten is changed to the adverb form of the same word, i.e. lite. This holds also when the form litet is changed to the form lite. Although the form litet is occasionally used as an adverb in standard Swedish, it is too archaic to be used in most of the SweLL text genres, and adverbial uses of the form litet are thus normally corrected to the form lite during normalization. Such changes are also tagged with M-Adj/adv:

M-Case

The M-Case tag is used for corrections regarding the choice of case form for nouns (nominative vs genitive) and pronouns (nominative vs accusative).

Examples

Note: When the form dem is changed to the form de used as a definite article, the correction is tagged as L-W, not as M-Case:

M-Def (definiteness)

The M-Def tag is used for corrections regarding definiteness constructions. The kinds of corrections which are involved in this correction category are:

Examples

M-F (wrong form, correct grammatical category)

The M-F tag is used when a declension/conjugation form (typically a suffix) which is used to express a specific grammatical category (e.g. number or plural) has been corrected to a form belonging to another declension/conjugation within the same grammatical category. The tag is used for the following correction types:

Nouns:

Note: Unsuffixed noun forms will be interpreted as singular when corrected to a suffixed plural form, although unsuffixed plurals exist. Corrections like båt -> båtar will thus be tagged with the M-Num tag and not with the M-F tag.

Verbs:

Adjectives:

The tag may also be used for some analogous changes of pronouns:

M-Gend (gender)

The M-Gend tag is used to mark corrections of gender forms (neuter vs non-neuter) of nouns, articles, adjectives, and pronouns with adjective-like functions.

Examples

The M-Gend tag is also used for corrections of the overuse of the distinctly masculine form of adjectives:

(Since the masculine form is never obligatory, corrections from the feminine/common form to the masculine form are not made during the normalization, and thus never occur in the correction annotation process.)

Note: Gender corrections of pronouns which are due to their anaphoric reference will be covered by the L-Ref tag.

M-Num (number)

The M-Num tag is used to mark number corrections of nouns, articles, adjectives, participles, and pronouns with adjective-like functions.

Examples

Note: Number corrections of pronouns which are due to their anaphoric reference will be covered by the L-Ref tag.

M-Other

The M-Other tag is used for corrections involving inflectional morphology for which none of the other M tags are suited, or for ambiguous cases when different sound interpretations of the correction lead to different M tags.

The usage of the M-Other tag covers corrections between the comparational forms of adjectives, including corrections between non-morphologically related words functioning as different comparational forms of the same adjective (e.g. dålig and sämre or många and fler):

Note: Take care not to overuse the M-Other tag. The other M tags should be carefully considered before choosing this one.

M-Verb

The M-Verb tag covers corrections regarding inflectional verb forms and basic tense constructions involving auxiliaries.

The verb-related grammatical categories involved are primarily tense, mood and voice.

The verb forms involved are the non-finite verb forms (infinitive and supine), the basic tense forms (present and past), the imperative, and the s-forms (both when used in passive constructions and in other uses).

The extra-inflectional constructions involved are the tense-related constructions including the auxiliary verbs ha, skola and komma and the non-finite verb forms.

Examples

P – Punctuation corrections

The P tags represent the category of punctuation corrections, including instances of merging or splitting sentences. It has four sub-categories.

P-M (missing punctuation)

The P-M tag is used for corrections involving the addition of a punctuation mark.

Examples

Note: Additions of hyphens between words are not included in this category, but in the O-Comp category.

P-R (redundant punctuation)

The P-R tag is used for corrections involving the removal of a punctuation mark.

Examples

Note: Removals of hyphens between the constituent words in a compound are not included in this category, but in the O-Comp category.

P-Sent (sentence segmentation)

The P-Sent tag is used for corrections involving splitting a sentence or merging two sentences into one, when this correction involves more than the pure insertion or removal of a punctuation mark – in the typical case the adding or removal of a conjunction.

Examples

In this example, the P-Sent tag is placed on a link between och in the original text and the period in the normalized text. The link between vi and Vi is tagged with C, as a consistency correction necessitated by another correction.

In this example, the P-Sent tag is placed on a link between som in the original text and the period in the normalized text. The link between hon and Hon is tagged with C, in the same way as in the example above.

P-W (wrong punctuation)

The P-W tag is used when a punctuation mark in the original text has been replaced with another punctuation mark in the normalized text.

Examples

Note (1): Instances where a space has been corrected to a hyphen between the constituent words in a compound are not marked with this tag, but with O-Comp or S-Comp. The same holds for instances where a hyphen between words has been corrected to a space.

Note (2): Instances where a hyphen has been used in the original text where a dash would be more appropriate are left uncorrected and should thus not appear as corrections to be annotated.

Note (3): Possible errors involving the incorrect placement of a space before a punctuation mark will not be corrected in the normalization process, since spaces are always inserted before punctuation marks for the sake of tokenization. Consequently, such errors will not be tagged.

Note (4): Possible errors involving the lack of a space between a punctuation mark and the following word are corrected in the normalization process (a space is inserted), but are nevertheless left untagged.

Example:

In this example, a space is inserted between the period and tack, and the t in tack is changed from lower to upper case. Tack will be tagged with O-Cap, but the insertion of the space will not be tagged.
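Notes (3) and (4) follow from the fact that punctuation marks are separated from words during tokenization, so that the spacing around a punctuation mark is no longer visible at token level. The sketch below illustrates this with a simple regular expression tokenizer. It is an assumption made for illustration and not the tokenizer actually used in the project; the function name `tokenize` and the input strings are invented.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Simplification: every non-word, non-space character (including a
    hyphen) becomes a token of its own, which a real tokenizer may
    handle differently.
    """
    return re.findall(r"\w+|[^\w\s]", text)


# Three spacing variants of the same invented string.
for variant in ["Hej.tack", "Hej .tack", "Hej. tack"]:
    print(" ".join(tokenize(variant)))  # Hej . tack
```

All three variants yield the same token sequence, which is why differences in spacing around punctuation marks do not give rise to links that could be tagged.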

S – Syntactical corrections

The S tags represent the syntactical correction category. It contains eleven sub-categories.

S-Adv (adverbial placement)

The S-Adv tag is used for corrections involving the placement of an adverbial.

This word order tag has the highest priority of the three word order tags (S-Adv, S-FinV and S-WO), and should be applied whenever a word order correction may be interpreted as concerning the placement of an adverbial. Particularly, word order corrections regarding the relative ordering between an adverbial and a finite verb, should be marked as S-Adv rather than as S-FinV.

Examples

S-Clause

The S-Clause tag is used for corrections involving changes of the most basic clause structure. The corrections in this category may be divided into the two following main types:

  1. The structure of a clause is changed in a way which involves changing the primary syntactic function (subject, finite verb, object, egentligt subjekt (‘object-positioned subject’) and predicative) of one or more of the words involved, for instance:
  2. The structure of a phrase or a clause is changed in a way which involves adding a clause to its internal structure, for instance:

Note: When a clause is changed by 1) adding an expletive det as a subject and 2) changing a subject to an egentligt subjekt (‘object-positioned subject’), two tags are needed to mark the corrections:

S-Comp (compound vs multi-word expression)

The S-Comp tag is used for:

Note: Corrections regarding the mere orthographic rendering of a string with or without a space should not be marked with the S-Comp tag but with the O-Comp tag. (See O-Comp and the section on compounds vs multi-word expressions below.)

S-Ext (extensive, complex correction)

The S-Ext tag is used for extensive, complex corrections. The syntactic structure of the normalized text segment may rather be described as created than as corrected, and the correction often also involves the addition of lexical words. The original text gives a fair indication of the intended meaning (otherwise the correction would be X-marked), but it gives a very poor basis for assuming a specific syntactic goal structure.

Examples

Note: The S-Ext tag should only be applied in cases when a correction has actually been made, and when the original text gives fairly sound support for the interpretation presented in the normalized version. Text segments which are so difficult to interpret that they are either left unchanged or normalized on the basis of guesses rather than interpretations should be tagged with X.

S-FinV (placement of finite verb)

The S-FinV tag is used for corrections concerning the placement of a finite verb, unless the correction regards the ordering between the finite verb and an adverb, in which case the correction is tagged with the S-Adv tag. The S-FinV tag thus has lower priority than the S-Adv tag, but higher priority than the S-WO tag.

Examples

S-M (word missing, other)

The S-M tag is used when a word is missing in the original text and has been added in the normalized version. This includes the addition of reflexives and verbal particles. (See the special section on verbal particles and reflexives.)

The S-M tag has lower priority than the S-Msubj tag and the M-Def tag, and should only be applied in cases when neither of these other tags may be applied.

Examples

S-Msubj (subject missing)

The S-Msubj tag is used to mark corrections involving the addition of a subject which is missing in a clause in the original text. This includes cases when the pronoun/subordinating conjunction som has been inserted as a subject.

The S-Msubj tag has higher priority than the S-M tag.

Examples

The S-Msubj tag should be placed on a det which has been inserted as a subject, also in the following cases:

Note: The S-Msubj tag should only be applied in those cases when the clause is already present in the original text. Thus, in the following example, the generic pronoun man is added as a subject as a part of a correction involving changing an infinitive phrase to a finite clause, and the correction, including the addition of man, is marked as S-Type. The S-Msubj tag should not be applied.

S-Other

The S-Other tag is used for syntactic corrections not covered by any of the other S-tags.

S-R (word redundant)

The S-R tag is used when a word is redundant in the original text and has been removed in the normalized version. This includes the removal of reflexives and verbal particles. (See the section on verbal particles and reflexives.)

The S-R tag has lower priority than the M-Def tag, and should only be applied in cases when the M-Def tag is not applicable.

Examples

S-Type (change of phrase or clause type)

The S-Type tag is used when a phrase or clause has in its entirety been changed to another phrase- or clause-type, such as:

Note (1): The S-Type tag may be combined with the L-W tag and the L-Der tag. See these sections for examples.

Note (2): Word-order corrections covered by the word-order tags (S-Adv, S-FinV, S-WO) are not in themselves a basis for considering the correction a change of clause type (but rather as corrections to suit a clause type which has been indicated by other means). The S-Type tag should thus not be applied to a correction solely on the basis of an adverbial having been moved from a typical main clause position to a typical subordinate clause position (or vice versa). That correction is covered by the S-Adv tag. For an S-Type tag to be applied to a clause which has been changed from a subordinate clause to a main clause etc., the change of clause structure has to be indicated for instance by the addition of or removal of a subjunction.

S-WO (word order, other)

The S-WO tag is used for word order corrections which are not covered by the S-Adv or the S-FinV categories.

In corrections regarding the relative placement of a phrasal head and a modifying element (for instance a noun and its attribute), the modifying element should be marked rather than the phrasal head (e.g. min rather than bostad in the example below):

In cases when the word order change may be interpreted as moving an element “out of” a normally fairly fixed structure, that element should be marked rather than an element included in the more fixed structure (dem rather than upp in the example below).

In other cases the placement of the tag may be chosen freely between the elements which have been moved relative to each other, but the readability of the resulting visualization should be taken into consideration.
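The priority order between the three word order tags (S-Adv first, then S-FinV, then S-WO) can be summarized as a simple decision procedure. The following is a minimal sketch; the function name and its two boolean parameters are invented for illustration, and deciding whether a correction concerns an adverbial or a finite verb remains a judgement made by the annotator.

```python
def word_order_tag(concerns_adverbial, concerns_finite_verb):
    """Pick the word order tag according to the stated priority order."""
    if concerns_adverbial:
        # Highest priority, also when a finite verb is involved.
        return "S-Adv"
    if concerns_finite_verb:
        return "S-FinV"
    # All remaining word order corrections.
    return "S-WO"


print(word_order_tag(True, True))    # S-Adv
print(word_order_tag(False, True))   # S-FinV
print(word_order_tag(False, False))  # S-WO
```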

Other tags

The final group of tags, collected under the heading Other, contains six tags used for various purposes:

Four of these tags, namely Cit-FL, Com!, OBS! and X, are available already during the normalization process, and are normally inserted already by the normalizer.

C – Consistency corrections

The C tag represents consistency corrections, a category which covers necessary follow-up corrections that come as a result of a previous correction. In these cases there was originally no mistake in the segment, but a correction introduced in the neighboring context makes a correction of the segment necessary. By using this tag we indicate that the error was not originally made by the learner.

In some instances it may not be self-evident which one of two related corrections should be considered as necessitating the other, but by marking one of them with C we avoid marking a single mistake in the original text as two.

Examples

The shift of word order is marked as a word order correction (S-WO). The change from definite form (bostaden) to indefinite form (bostad) of the noun is made necessary because of the shift of word order, and is thus marked as a consistency correction (C).

The insertion of the full stop is tagged as a punctuation correction (P-M). The capitalization of the following D is made necessary because of the insertion of the full stop, and is thus marked as a consistency correction (C).

The change from bostanden to bostadsområde is marked as a lexical correction (L-W) and a morphological correction (M-Def). The gender change of the pronoun, from min (non-neuter) to mitt (neuter) is made necessary because of the change of word (from the non-neuter bostad to the neuter bostadsområde), and is thus marked as a consistency correction (C).

The removal of är is marked as S-R, while the movement of går to the finite verb position (where it replaces the likewise finite verb är) is tagged with C.

The link between första and först is tagged with M-Adj/adv (see this section) and with C, which indicates that the word order change is a consequence of the change from adjective to adverb. (An alternative normalization would be to keep the adjective form första and add the noun gången.)

The addition of saker is tagged as S-M, and the correction of vad to vilka is tagged with C, since it is the congruence with saker which determines the choice of pronoun.
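One practical effect of the C tag is that follow-up corrections can be kept apart from corrections that reflect something the learner actually wrote. The sketch below shows this for the full stop example above. The link representation, the function name `learner_corrections` and the tokens det/Det are invented for illustration and are not part of the Svala tool.

```python
# Hypothetical links: a full stop is inserted (P-M), and the following
# word is capitalized as a consequence (C).
links = [
    {"original": [], "normalized": ["."], "tags": {"P-M"}},
    {"original": ["det"], "normalized": ["Det"], "tags": {"C"}},
]


def learner_corrections(links):
    """Return tagged links which are not follow-ups of another correction."""
    return [link for link in links
            if link["tags"] and "C" not in link["tags"]]


print(len(links))                       # 2
print(len(learner_corrections(links)))  # 1
```

Counting only links without C thus counts the missing full stop once, rather than as two separate mistakes.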

Cit-FL (cited foreign word judged acceptable in the normalization)

The Cit-FL tag is used for foreign (non-Swedish) words, phrases or text segments, which have been kept by the normalizer since their usage has been judged acceptable given the norms of the text type in question. This may be the case for instance for explicitly marked citations or intentional code switching appropriate for the genre. The Cit-FL tag is thus used to mark words and text segments which have not been corrected in the normalized version, but which nevertheless are not passable as standard Swedish.

Note that the only requirement for applying this tag is that the word or text segment is recognizable as belonging to another language than Swedish, and that the choice to use this other language is judged appropriate for the genre and the text. No judgement or correction of the word or text segment is made relative to the norms of the foreign language in question. For instance, spelling mistakes are left uncorrected.

The Cit-FL tag is usually added already during normalization.

Examples

Judged as appropriate code switching:

Clearly marked citation of Norwegian passage:

Com! (comments for the corpus users)

The Com! tag is connected to an edge comment field, which is open for freely composed comments. It is available already during the normalization process.

The Com! tag is used for comments on specific tokens or text sequences which are relevant for future users of the corpus, and which are thus meant to be kept in the published corpus. (Comments regarding the text as a whole or recurring properties in the text may be added in the Document comment field.)

The Com! tag may for instance be used to mark text segments which are copied from the task formulation. If a significant portion of the text consists of copied text, this should preferably be indicated also in the Document comment field, in addition to the edge comment field connected to the Com! tag which is inserted on the specific text segment or the specific text segments.

OBS! (notes and pending analyses)

The OBS! tag is, like the Com! tag, connected to an edge comment field, and is available already during the normalization process.

The OBS! tag is used to mark pending analyses to which the annotator wants to return, remarks which the normalizer wishes to pass on to the correction annotator, etc.

X (unintelligible string)

The X tag is used to mark unintelligible strings in the original text. The tag is available and usually added already during the normalization process. The marked original string may be left unchanged in the normalized version, or the normalizer may replace it with some more or less wild guess of the intended message.

The X tag may be used both when there is no reasonable interpretation of the string, and when there are several somewhat reasonable interpretations, none of which can be established as better than the others.

Examples

The text from which this example is collected has some German or possibly Dutch traits, and a fairly reasonable guess is that dar rum is meant to be darum. Another, less likely, possibility is that dar rum is intended to mean där rum, but that interpretation necessitates extensive changes in the rest of the text passage in order to create a syntactically functional string. By marking dar rum with X and keeping the rest of the sentence unchanged in the normalized text version, the normalization as a whole is better adjusted to the principle of minimal change.

Stanna is a reasonable guess at what was intended by sta, but it is not sufficiently well founded for the correction to be marked as an orthographic correction. Many other interpretations are obviously also possible.

The X tag may be combined with the S-R tag when the unintelligible string is also redundant, and thus eliminated in the normalization:

Unid (unidentified correction)

The Unid tag is used for any type of correction which cannot be covered by any of the correction categories defined in the taxonomy.

Some specific categorization issues

Compounds vs multi-word expressions

Corrections concerning the forming of an expression as a compound or a multi-word expression are divided into two categories in the SweLL correction taxonomy: Corrections which are judged to concern the mere orthographic rendering with or without a space between two words are marked with the O-Comp tag, while corrections which are judged to concern the actual choice between a compound and a multi-word expression are marked with the S-Comp tag. Borderline or unclear cases between these two categories obviously exist, but for the most part one of the options is clearly the better one.

The O-Comp tag is primarily used for corrections of standard cases of särskrivning (the faulty writing of a compound with a space in between the two compounded words):

It may also be used for corrections which involve changing the first word of the compound into a specific compound form, in addition to the removal of the space between the two words. In such cases, the correction is marked with both the O-Comp tag and the L-Der tag:

More rarely, the O-Comp tag may be applied when a multi-word expression in the original text has been corrected through the adding of a space between two of the words:

The S-Comp tag is used whenever the correction made is more complex than the mere addition or removal of a space between two words (and possibly changing the form of the first word of a compound).

However, in some cases the S-Comp tag is more suitable than the O-Comp tag, although the correction superficially involves merely the presence of a space. This is the case when two words may be correctly formed both as a compound (without a space) and as a two-word expression (with a space), and the meanings of the compound version and the two-word version are either the same or else both plausible in the context. The correction is thus made not because the chosen expression is unthinkable, but because the other expression happens to be the established lexical unit:

In this case, socialfobi is a perfectly well formed compound, but it is corrected to the two-word expression social fobi because this is the established way to express the intended meaning. The mistake made by the writer is therefore not judged to be a case of having missed a space in between two words in a two-word expression, but it is judged to be a case of the writer actually having chosen the compound expression instead of the established two-word expression. The correction is thus marked with the S-Comp tag.

Moreover, in some cases the S-Comp tag may be applied although neither the original nor the normalized string is a single orthographic word. This is the case when the string in the original text may be interpreted as an instance of a compound, although it includes spaces:

In this particular case, the string in the original text may be interpreted as a compound of a two-word phrase (sociala medier) and the word användning. According to the norms of written standard Swedish, such compounds should be written as sociala medier-användning etc. While such a minimal correction is not an unthinkable solution for the normalization, the normalizer has here judged a restructuring of the NP to be a better solution, and the correction should be tagged S-Comp.
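The choice between O-Comp and S-Comp described in this section can be summarized as a small decision procedure. The following is only an illustrative sketch, not part of the SweLL tooling: the function name and its three boolean parameters are hypothetical, and each of them stands for a judgement that the annotator makes manually.

```python
def compound_correction_tags(simple_space_change,
                             first_word_form_changed=False,
                             both_variants_acceptable=False):
    """Suggest tags for a correction involving compounds or multi-word expressions.

    All parameters are hypothetical stand-ins for annotator judgements:
    simple_space_change      -- the correction merely adds or removes a space
                                (possibly also changing the form of the first word)
    first_word_form_changed  -- the first word was changed into a compound form
    both_variants_acceptable -- the compound and the two-word expression are both
                                well formed, with the same or equally plausible meaning
    """
    if not simple_space_change:
        # Anything more complex than a space change, e.g. a restructured NP.
        return ["S-Comp"]
    if both_variants_acceptable:
        # The writer chose a well-formed but non-established variant,
        # as in socialfobi -> social fobi.
        return ["S-Comp"]
    if first_word_form_changed:
        # Space removed and first word adjusted to its compound form.
        return ["O-Comp", "L-Der"]
    # Standard case, e.g. särskrivning.
    return ["O-Comp"]
```

For the socialfobi example above, the call would be compound_correction_tags(True, both_variants_acceptable=True), which yields S-Comp. Borderline cases still require the annotator's judgement; the sketch only makes the order of the considerations explicit.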

Non-Swedish words and sequences

There are a number of ways to handle non-Swedish words during normalization and correction annotation. Many of the fundamental choices are made during the normalization process rather than during the correction annotation process.

The first judgement to be made when coming across a word stemming from another language in the material is naturally whether the word may be recognized as having been incorporated into written standard Swedish; in such cases the word is left uncorrected and untagged. This judgement is made during normalization.

When a word (or sequence) in an original text is recognized as belonging to a foreign language – or as having traits from a foreign language – and when this word may not be recognized as part of written standard Swedish, a number of options are at hand:

1) The word/sequence is judged as a genre appropriate usage of cited foreign language (explicitly signaled citations, code switching etc.). -> Not corrected, tagged Cit-FL during normalization.

2) The word is not judged as a genre appropriate usage of cited foreign language and is thus corrected to a Swedish word during normalization:

 a)	The form used may be interpreted as a misspelled Swedish word. -> Corrected during normalization, tagged O during correction annotation: **kaffee** -> **kaffe**; **can** -> **kan**
 
 b)	The form used may be interpreted as a Swedish word with an incorrect usage of derivational affixes etc. -> Corrected during normalization, tagged L-Der during correction annotation: **national helgdag** -> **nationell helgdag**
 
 c)	Neither a nor b applies. -> Corrected during normalization, tagged L-FL during correction annotation: **balkony** -> **balkong**; **family** -> **familj**; **gas bojler** -> **gaskokare**

Note: A word in the original text which is identifiable as a Swedish word, but which is used with another meaning in a way which is likely to be due to influence from a similar non-Swedish word, should be corrected and marked as L-W (not as L-FL):

In this example, it is likely that the incorrect usage of the correct Swedish word busiga is influenced by the word’s similarity to the English word busy – and it is partly based on this assumption that the writer’s intended meaning has been interpreted as ‘upptagna’. But since busiga is a correct Swedish word, with a distinctly Swedish morphological structure, the correction is tagged as L-W rather than as L-FL.
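The options for non-Swedish words listed in this section may be easier to overview as a decision procedure. The sketch below is purely illustrative: the class, its field names and the function are hypothetical, every field represents a manual judgement by the normalizer or annotator, and the position of the L-W check in the sequence is an assumption, since the guidelines do not state an explicit order for it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NonSwedishJudgement:
    """Hypothetical record of the manual judgements made about one word/sequence."""
    incorporated_in_standard_swedish: bool = False  # accepted loanword
    genre_appropriate_citation: bool = False        # explicit citation, code switching
    swedish_word_with_foreign_meaning: bool = False # e.g. busiga used for 'upptagna'
    misspelled_swedish_word: bool = False           # e.g. kaffee -> kaffe
    derivational_deviation: bool = False            # e.g. national -> nationell


def tag_for_non_swedish(j: NonSwedishJudgement) -> Optional[str]:
    """Return the suggested tag, or None when the word is left uncorrected and untagged."""
    if j.incorporated_in_standard_swedish:
        return None        # left uncorrected and untagged
    if j.genre_appropriate_citation:
        return "Cit-FL"    # option 1: not corrected, tagged during normalization
    if j.swedish_word_with_foreign_meaning:
        return "L-W"       # the note: a correct Swedish word, wrong meaning
    if j.misspelled_swedish_word:
        return "O"         # option 2a
    if j.derivational_deviation:
        return "L-Der"     # option 2b
    return "L-FL"          # option 2c: neither a nor b applies
```

With these assumptions, a form such as balkony, for which none of the flags is set, falls through to L-FL, while kaffee with misspelled_swedish_word set is tagged O.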

Verbal particles and reflexives

Several tags are used for corrections involving phrasal or compound verbs made up of a verb and a verbal particle or a reflexive marker, primarily O-Comp, S-Comp, L-Der, L-W, S-M and S-R. This section provides an overview of the usage of these six tags for this category of corrections.

Appendix with illustrated examples