Dismark

Official website of Project Fondecyt 1191481

AUTOMATIC INDUCTION OF TAXONOMIES OF DISCOURSE MARKERS FROM MULTILINGUAL CORPORA
Current version:
August 5th, 2022

This web site offers the following contents:

1. Documentation

Abstract

Project Fondecyt Regular 1191481, with title 'Automatic induction of taxonomies of discourse markers from multilingual corpora', was funded by Anid.cl and directed by Rogelio Nazar. It consists of a lexicographic research project aimed at cataloging discourse markers (DMs) by means of statistical analysis of large parallel corpora. It is based on a newly developed algorithm for the automatic induction of a multilingual taxonomy of DMs, which is then used to recursively identify and classify more units. The project started in April 2019 and finished in April 2022. This web site presents the results of the project, which include a multilingual taxonomy of discourse markers in English, Spanish, French, German and Catalan, as well as a computer program to detect and classify DMs.

DMs have been in the spotlight of linguistic theory in recent years, with an increasing number of publications devoted to the subject (e.g. Casado Velarde, 1993; Fraser, 1999; Martín Zorraquino & Portolés, 1999; Pons Bordería, 2001; Fischer, 2006; Borreguero & López, 2010; Fedriani & Sansò, 2017; Crible & Blackwell, 2020; Furkó, 2020; Loureda et al., 2020; Haselow & Hancil, 2021). Early interest in the subject began to appear in the context of discourse analysis (Halliday & Hasan, 1976; van Dijk, 1978; Halliday, 1985), where DMs were defined as particles that facilitate the interpretation of coherence relations in texts. In other words, they are instructions on how to connect propositions and organize argumentation. They are, therefore, procedural rather than conceptual or lexical units. They can belong to different categories, such as conjunctions, adverbs, prepositional phrases, idioms, and so on. DMs are classified according to their functions. Some of the most common are counter-argumentation, with expressions such as however or nevertheless, cause-consequence, such as consequently or therefore, and so on. In Spanish, a well-known taxonomy is the one by Martín Zorraquino & Portolés (1999), but there are others. Fewer are the attempts to compile extended lists of DMs. Some dictionaries have appeared (e.g. Santos Río, 2003; Briz, 2008; Fuentes Rodríguez, 2009; Holgado Lage, 2017). The main difference between our present research and these attempts is that ours is an empirical method, i.e., a bottom-up rather than a top-down approach. This is important for practical reasons, as the automation saves a lot of effort, but also, and most importantly, for scientific reasons, as the quantitative method favors objectivity.

The method applied in this project is purely statistical, i.e. without any type of external resource apart from a corpus, not even POS-taggers. It is a minimalistic approach based on parallel corpora, in this case Tiedemann's (2016) OPUS Corpus, based on co-occurrence association measures and an entropy model that quantifies the information of each DM candidate. DMs are identified according to their distribution in the corpus, and their characteristic pattern is that they are independent of the content of the text. In operational terms, this means that their occurrence cannot be used to predict the occurrence of other lexical units. They are then classified according to a novel clustering algorithm. The similarity measure used for the clustering is their shared equivalence in the parallel corpus (i.e., nevertheless and however are considered similar because they share the same equivalences in a second language). Then, the result of the clustering of the different languages is aligned to obtain a multilingual taxonomy. Once a basic taxonomy is built this way, it is then used to classify new DMs in a recursive manner. The algorithm will first classify a DM candidate by language, will then decide if it is actually a DM and, if this is the case, it will assign a category to it.

Research Proposal

1. Introduction
2. Related work
2.1. Research on Discourse Markers
2.3. Background of automatic categorization of DMs
2.4. Background on bilingual vocabulary alignment
2.5. Background of multilingual alignment of taxonomies
3. Statement of the problem
3.1. Research question
3.2. Objectives
4. Proposed methodology
4.1. Identification of the DMs of a language
4.2. Organization of the extracted DMs into groups
4.3. Labeling of clusters with functional categories
4.4. Populating the taxonomy with new specimens
4.5. Creation of a database of DMs in English, French, Spanish, German and Catalan
5. Dissemination of project results
5.1. Communication of results in the scientific field
5.2. Dissemination in society in general
6. International cooperation
7. Lines of future work
8. Publications derived from the project
References

1. Introduction

Discourse markers (DMs) constitute a large and heterogeneous set of linguistic units, not completely delimited but composed of functional, not lexical, elements which are used to guide the reader/listener, specifying the inferences they have to make in order to interpret the text according to the writer/speaker's intention and the contextual constraints. These elements do not participate in the semantics of the utterance, but serve to make propositions more cohesive or to regulate interaction. Some examples of these units are the following: for example, however, nevertheless, that is, therefore, on the one hand, first of all, certainly, among many others.

The use of DMs is very common in both writing and speaking and they have been studied in a great diversity of natural languages, such as Spanish, English, French, German, Chinese, Finnish, Italian, Japanese and Portuguese, and many others - even in sign languages. Their use has also been studied in a variety of genres and interaction contexts, such as narratives, political speeches, medical consultations, legal speeches, diaries, radio conversations and classrooms (Maschler & Schiffrin, 2015). They are presented, in turn, in different compositions in a great diversity of discursive genres, since they play a fundamental role in argumentation, exposition, narration and interaction in general.

Despite being an omnipresent phenomenon in language, it is only recently that linguistics has approached it, and its exploration and delimitation has proved to be highly problematic (Fedriani & Sansò, 2017). There is a wide margin of disagreement among scholars regarding three aspects: 1) their delimitation as a set: definition of the necessary and sufficient conditions for a linguistic segment to be a DM; 2) their categorization: determining the functions they fulfill and which DMs correspond to each one; 3) their polyfunctionality: the fact that, depending on the context, a DM can have one function or another.

In this project we propose an analysis of DMs that is different from the introspective method that has prevailed in the study of this subject in its more than 30 years of existence in modern linguistics. Specifically, we propose an approach based on quantitative linguistics: an inductive and corpus-driven analysis in which we do not start from previously established categories; rather, these arise from the statistical analysis of large volumes of texts. Our method is based on the analysis of large parallel corpora and the application of different statistical measures and a specially designed clustering algorithm which allow us, on the one hand, to construct taxonomies of DMs that are emergent or natural in each of the languages and, on the other hand, to align these different taxonomies. Our approach is multilingual because we generate taxonomies in Spanish, English, French, German and Catalan, and because the method could hypothetically be applied to any other language.

The first intuition guiding our analysis is that the delimitation of DMs can be indicated by their distributional behavior. It is evident that certain discourse genres privilege certain DMs, but within each genre, and because they have no lexical content, DMs can appear in any text regardless of content. From there, for their categorization, we believe that it is possible to use the parallel corpus as if it were a sort of semantic mirror that allows us to see the functions of the DMs reflected: if two DMs in Spanish, such as sin embargo and no obstante, are systematically translated by the same DMs in English (however, nevertheless, etc.), this is then a strong indicator that both of them fulfill the same function.

This proposal is methodologically simple, computationally efficient and the results are accurate. We highlight the following aspects: 1) the categories are emergent, they do not come from the literature; 2) the method is powerful because it allows obtaining information about languages without requiring explicit knowledge of them; 3) this work can have practical applications in fields such as lexicography, discourse parsing, information extraction, machine translation, computer-assisted writing or language learning.

The methodology of discourse marker extraction and classification is based on parallel corpus data aligned at sentence level. From this parallel corpus we extract DM candidates and classify them using a clustering process. Clustering methods are a set of unsupervised classification algorithms that group elements based on the similarity between their components (Jain & Dubes, 1988; Manning & Schütze, 1999; Rokach & Maimon, 2005; Kaufman & Rousseeuw, 2009; Everitt et al., 2011; Gries, 2013; Divjak & Fieller, 2014; Brezina, 2018). The objective of clustering is therefore to identify partitions in an unstructured set of objects described according to their attributes/components (unsupervised classification). This identification is based only on these attributes and does not require any manually annotated data, a detail that for this research is fundamental, as it is what justifies that the categories are natural or emergent.

Regarding the data to be used as reference corpus for classification, this research resorts to the OPUS Corpus parallel data (Tiedemann, 2016), which provides sets of parallel corpora in various domains and genres, aligned at sentence level, in TMX format, a standard in the use of translation memories (Savourel, 2005). The corpus is publicly and freely available. So far, in this project we have used the Spanish-English, French-English, German-English and Spanish-Catalan language pairs. The number of words per language exceeds one billion tokens in most cases.

To process this data, we designed an algorithm to automatically read the TMX files and extract a first list of DM candidates aligned by language (Section 4.1). This interlinguistic alignment is then used for the intralinguistic alignment using the clustering technique (Section 4.2), from which the taxonomies are generated.
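As an illustration of the reading step only, the following is a minimal Python sketch, not the project's actual program. It assumes TMX files with the usual `tu`/`tuv`/`seg` layout and language codes stored in `xml:lang` (or `lang` in older files); the function name `read_tmx_pairs` and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx_pairs(source, src_lang, tgt_lang):
    """Yield (source, target) segment pairs from a TMX file or file object."""
    # iterparse streams the file, so large corpora need not fit in memory.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # release the translation unit once it has been read

# Hypothetical two-unit sample standing in for a real OPUS file.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>However, it rained.</seg></tuv>
    <tuv xml:lang="es"><seg>Sin embargo, llovió.</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Therefore, we left.</seg></tuv>
    <tuv xml:lang="es"><seg>Por lo tanto, nos fuimos.</seg></tuv></tu>
</body></tmx>""".encode("utf-8")

pairs = list(read_tmx_pairs(io.BytesIO(SAMPLE), "en", "es"))
for en, es in pairs:
    print(en, "|", es)
```

The sketch matches language codes exactly after lowercasing, so regional variants such as `en-US` would need extra handling.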

The multilingual taxonomy of DMs resulting from this process (Sections 4.3-4.5) consists of a data structure that brings together sets of DMs with an equivalent function and which are organized by language, which may be completely unknown to the person applying the method. In this way we can access information such as how to counterargue in a given language, together with the list of examples that serve to do so (however, nevertheless), or to reformulate (i.e., in other words), to introduce a digression (to all this, by the way), and so on.

Different studies have proposed classifications of DMs using automatic methods (Knott, 1996; Knott & Mellish, 1996; Marcu, 1997a, 1997b; Alonso et al., 2002a, 2002b; Hutchinson, 2003, 2004, 2005; Muller et al., 2016; Webber et al., 2016, 2019), but they have started from a pre-coded list of markers and cover only connecting elements. In our case, we include a wider variety of DMs and also prefer language-agnostic methods. The quantitative nature of our method offers a valuable complementary source of information to the classifications that other authors have previously derived with introspective methods.

In addition to theoretical consequences and because DMs are important signals of discourse structure (Popescu-Belis & Zufferey, 2006), this work has several practical applications, such as discourse segmentation (or parsing), information extraction or machine translation, but also in fields such as computer-assisted writing or language learning.

2. Related work

2.1. Research on Discourse Markers

As several authors have pointed out (Briz, 1993a, 1993b; Pons Bordería, 1996, 1998; Martín Zorraquino & Portolés, 1999; Cano Aguilar, 2003; Herrero, 2012), reference to some of these units was present in the more traditional grammars. For example, Bello (1847) noted that certain adverbial phrases such as ahora bien or ahora pues are transformed into continuative conjunctions to relate thoughts and the contents of certain segments of the text. Later, in the 20th century, in the closing chapter of his grammar, Gili-Gaya (1969) speaks of supra-sentential links to refer to formal mechanisms such as certain conjunctions, repetitions or anaphora which, although realized at the sentence level, make it possible to give coherence to the discourse as a unit. However, DMs were never the focus of interest because their scope of action is usually supra-sentential, whereas both for traditional grammar and for 20th century linguistics until the appearance of text grammar, the unit of analysis was the sentence, at least as far as syntax or grammar is concerned. As a consequence, interest in the study of DMs awakened relatively late in modern linguistics, when many linguists opened up to a functional explanation (pragmatic and discursive) of both sentence and supra-sentential elements (Llorente, 1996). However, at the moment when the subject finally began to come into focus, the difficulties were immediately evident (Portolés, 1993).

At the outset, it is possible to state that DMs are words of a grammatical and not lexical nature, since they are used to form grammatical constructions and not to refer to or predicate about entities of real or imaginary worlds nor to contribute any lexical meaning to propositions (Fraser, 1996). This leads us to consider DMs as inferential support units, since they facilitate the receiver's work of interpretation (Ler Soon Lay, 2006). However, they fulfill this function without participating at any time in the sentence structure, neither syntactically nor semantically, and it is not easy to characterize them syntactically. They come from very diverse grammatical categories, such as conjunctions (for example, but, although) and conjunctive locutions (so, so that), adverbs (besides, certainly) and adverbial locutions (therefore, of course), interjections (go, hey), performative expressions (I say, I think), prepositional phrases (by the way, in spite of that), among others. In addition to this heterogeneity of origins, DMs operate in different domains: they connect sentences, relate and organize textual segments and operate at the interpersonal level (Aijmer, 2022). Moreover, many of them are polyfunctional, that is, they can fulfill different pragmatic functions in the discourse according to the clues provided in each case by the context (Fischer, 2006).

It is precisely the pragmatic-contextual conditions that pose the greatest difficulty for the grammarian in the task of syntactically characterizing DMs: although they usually maintain the syntactic properties of the grammatical classes from which they come (Martín Zorraquino & Portolés, 1999), they do not clearly present features that are necessary and sufficient, since DMs have an operational scope at the level of the utterance (communicative or pragmatic) and not at the sentence or grammatical level (Waltereit, 2006). In fact, many authors consider that DMs are a pragmatic category and therefore have rules of play that are essentially different from those of sentence syntax (Martín Zorraquino & Portolés, 1999). This would be precisely what explains their role in the processing of coherence, cohesion, adequacy and effectiveness of discourse (Bazzanella, 2006).

From the point of view of the procedural approach advocated by Sperber and Wilson (1986) and Blakemore (1987), DMs are conceived as guides to the interlocutor's inferences. From this point of view, a distinction is made between conceptual meaning, which can be analyzed in lexical terms, and procedural meaning, which refers to a specific pragmatic meaning that serves as a guide to follow the inferential path appropriate to the interpretation process. This distinction, however, does not prevent some elements whose meaning is eminently procedural from maintaining a truth-conditional type of meaning. For example, a set of adverbs and adverbial locutions, which some authors (e.g., Portolés, 1993) include within the modalizing DMs (such as clearly, certainly, among others), offer a conceptual representation that modifies the meaning of the whole utterance in which they appear. In this sense, DMs would function, rather, as metadiscursive signals that structure and organize the discourse for the benefit of the interlocutor. They are effective metalinguistic elements at the textual level (Lenk, 1997): they signal metalinguistic, metacommunicative, metapragmatic and metadiscursive mental processes (Maschler, 1994) that allow the speaker to step out of his propositional frame and metacommunicate his attitudes, positioning and feelings (Östman, 1981).

Functional criteria are the ones that seem to prevail in descriptions, at least in Spanish (e.g., Casado Velarde, 1993). It must be said, however, that the degree of disagreement between these authors is also important. Each considers different elements, concepts and properties to categorize DMs.

Naturally, the lack of consensus among specialists about the delimiting properties of the general category of DMs and the features that define the subcategories affects the certainty that can be had in these classifications (Loureda & Asín, 2010). It is a challenge, therefore, to find a way to overcome the subjectivity of the analyst, to clearly delimit the object of study and to establish taxonomies.

This is precisely the interest of the present research proposal. The particularity of this project is that it resorts to a fully inductive methodology, based on data obtained from large textual corpora for the extraction and classification of linguistic elements that can function as discourse markers.

2.3. Background of automatic categorization of DMs

An automatic classification may involve the introduction of an objective measuring instrument as a means of overcoming the discussions involved in the subjectivity inherent in the introspective method commonly used in manual classifications of DMs. However, the main obstacle to an automatic classification is the lack of consensus among specialists on what are the class-delimiting properties of DMs. Alonso et al. (2002a) have attributed this lack of consensus to the preeminence of deductive approaches, with a significant bias for an underlying theory.

The first proposals for formal mechanisms to detect and systematize DMs were made in the 1990s by Knott and Dale (1995). Later, Hutchinson (2004) applied machine learning algorithms to characterize discourse connectors and reported high accuracy results with respect to a gold standard. However, the learning of the models, as is usual with this type of algorithm, depended on manually annotated instances, which on the one hand requires manual work before the method can be applied and, on the other hand, necessarily implies a bias relative to the annotators. In their case, moreover, the method was restricted to units used to connect propositions and the classes to which connectors were assigned were manually decided, thus necessarily biased. Also, the evaluation of the results was based on previous manual classifications, so they were only able to recreate a similar classification or, perhaps, to validate the existing one.

Alonso et al. (2002b) proposed an approach to solve this problem, presenting the construction of a computational lexicon of DMs through the application of clustering techniques, with the aim of grouping instances of connector usage extracted from a large corpus. To do so, they relied on two sets of attributes: the first was derived from a manually coded lexicon of connectors containing syntactic and rhetorical information, the latter expressed in relations such as reinforcement, concession, consequence or enablement; the second, on the other hand, was derived from the surface processing of the text for the instances and defined attributes such as the position of the connector in the segment, the words surrounding it, the presence of negation, among others. The result was that the clusters obtained contain, mainly, instances in which the connectors have similar syntactic behavior. Although this proposal solves, in part, the previous problems, the fact that the selection of attributes was made from information extracted from a hand-coded lexicon of connectors, implies that the agglomerated categories are barely corroborated with corpus data and do not come from them in an emergent manner. In addition to this restriction, the categories are limited, again, to the connecting function fulfilled by some DMs.

More recently, Muller et al. (2016) automatically obtained empirically grounded connector clusters based on the significance of the association between connectors and pairs of verbal predicates in context. To form the clusters, they used co-occurrence data collected from the Gigaword corpus of English (Graff & Cieri, 2003) and produced triplets of two predicates and a connector, along with their number of occurrences. They used the Penn Discourse Treebank (Prasad et al., 2008) list of DMs as a basis, without a priori clustering. Thus, they produced a matrix of {number of verb pairs} x {104 connectors} dimensions, so that each verb pair was represented by a set of 104 attribute values. As in the previous case, extraction was limited to connecting elements and from a pre-coded list, again leading to an inevitable bias in the results.

2.4. Background on bilingual vocabulary alignment

The first antecedents of bilingual vocabulary extraction from corpora date back to the nineties of the last century and were based on the alignment of parallel corpora. Already in our century, however, the most successful approaches were proposals for bilingual vocabulary extraction from comparable (Gaussier et al., 2004) or even unrelated (Rapp, 1999) corpora.

With respect to the alignment of parallel corpora, Véronis (2000) pointed out that the first antecedent of parallel corpus use actually dates back to 1822 with the decipherment of the Rosetta Stone carried out by Jean-François Champollion. To this can be added as a second antecedent the work of Weaver (1955), who already formulated the essence of the idea, although his contribution was not understood by his contemporaries, because the methods that prevailed at that time, and until the 1990s in the field of machine translation, were not of a quantitative-statistical type but were based on rules. In the 1990s, however, there was a shift towards statistical methods, which proved to be very robust (Gale & Church, 1991).

Gale & Church (1991) started to generate bilingual vocabularies based on the calculation of co-occurrence in aligned sentences and this was followed by a long list of publications with different ideas to improve the results (Church, 1993). Notable among these are ideas such as creating a virtuous circle between bilingual lexicon extraction and sentence alignment, feeding back results to improve the performance of each algorithm (Kay & Röscheisen, 1993). Other authors, on the other hand, have started from the original algorithm of Brown et al. (1991) to try to optimize it (Och & Ney, 2003). In the field of parallel corpus alignment, there has also been a tendency to combine quantitative approaches with explicit grammatical knowledge (Hiemstra, 1998).

It can be said, however, that this is a field that has matured and whose techniques have stabilized. A plateau seems to have been reached in terms of the performance quality of the algorithms, which has become quite high. With a near-perfect sentence alignment it is now possible to obtain a bilingual vocabulary of over 90% accuracy (Nazar, 2012).

2.5. Background of multilingual alignment of taxonomies

We find in the work of Jung (2008) and Jung et al. (2008) one of the first attempts to align taxonomies by looking for correspondences or similarities first between terms within the same taxonomy and later to align ontologies in different languages, presenting their first results in Korean and Swedish. More recently, publications along the same lines have started to appear. Declerck and Gromann (2012) proposed strategies for the alignment of lexical components in a multilingual ontology. Thomas et al. (2014) raise the issue of the alignment of multilingual taxonomies specialized in the financial field in French and Spanish using a logical reasoner. Finally, Mahdisoltani et al. (2015) present a multilingual alignment system for Wikipedia categories based on information that they extract mainly from the English version of WordNet, as a fundamental structure, and that they augment from the category system and the Wikipedia "infoboxes", information that they can project towards the different languages offered by this resource.

Although most of these taxonomies and ontologies contain nouns or nominal syntagms, i.e., elements of a syntactic and semantic behavior totally different from our object of study, we believe that these experiments represent a valuable methodological precedent. It is, in any case, an area of research that is still underdeveloped, despite the high practical impact that an automated solution would have. The investment in projects such as the alignment of lexical ontologies in European languages (EuroWordNet projects LE-2 4003 and LE-4 8328) gives an idea of this impact.

3. Statement of the problem

3.1. Research question

How can a multilingual taxonomy of discourse markers be obtained from corpora?

3.2. Objectives

3.2.1. General Objective

To propose a methodology for the automatic induction of taxonomies of discourse markers in Spanish, English, French and German from large multilingual parallel corpus data.

3.2.2. Specific Objectives

To design a statistical, inductive and language-independent methodology for extracting discourse markers.
To design an inductive methodology to form categories of discourse markers.
To apply this methodology to extract a multilingual taxonomy of discourse markers in Spanish, English, French and German.

4. Proposed methodology

The main objective of research project Fondecyt Regular 1191481 (“Inducción automática de taxonomías de marcadores discursivos a partir de corpus multilingües”) was to design a methodology for the automatic extraction of a classification of discourse markers (DMs), both in their categories and in the members that constitute them, maintaining as a principle the non-use of linguistic resources already existing for a particular language, that is, through inductive processing of corpus data. The only input is a parallel corpus (PC), so it is a method with the power of generalization, since it can be applied to any language without major adjustments. It is also an innovative method, since most of the research carried out with PC to date has been oriented towards the search for translation equivalences from one language to another. In contrast, the use of PC as an input for the extraction and organization of a taxonomy of DMs from the same language using an inductive approach is something that is being proposed for the first time with this project. In addition to obtaining the categories, the most valuable result is undoubtedly the exhaustive inventory of DMs that the method offers. From these results, it is possible to estimate that the total population of lexicalized DMs in a European language is close to one thousand units. In the case of the languages analyzed here, most of them are already in the project's database.

The general procedure is divided into four stages. The first consists of separating, from the vocabulary of a language, all those units that correspond to DMs. The second is to establish which are the categories of these DMs or, in other words, how they can be grouped. The third consists of recognizing the functions of these groups of DMs, if it is a category already known, in order to give names to the groups that are formed. The fourth, finally, is to use the taxonomy created to identify and classify new DM candidates.

4.1. Identification of the DMs of a language

The method chosen as the most effective, simple and generalizable is the filter by amount of information. It was thus possible to successfully establish the first division of the vocabulary into lexical and functional categories thanks to the characteristic distribution of the latter in the corpus, a pattern determined by the fact that the appearance of a DM in a text is independent of the content of that text. In operational terms, DMs show a uniform distribution and appear in the company of a large set of other words and are therefore not very informative, since their occurrence does not generally help to predict the occurrence of other units. In contrast, a lexical content word such as democracy will show a more restricted word company, with units such as respect, rights, freedom, etc. Figures 1 and 2 illustrate this difference, comparing the word democracy, in the first case, with the DM anyway, in the second. The co-occurrence profiles are different and, in the first case, the area under the curve is larger in relative terms (i.e., ignoring the absolute frequency).

**Figure 1:** Frequency distribution of words co-occurring with *democracia* (democracy).

**Figure 2:** Frequency distribution in the case of *de todas maneras* (anyway).

In order to capture this property, we define a coefficient (1), where $x$ is a DM candidate; $R_x$ the set of co-occurring words; $m(x)$ the set of contexts of $x$ and $R_{x,i}$ is the frequency of the word at position $i$ from the list of the $n$ most frequent in those contexts (in our experiments, $n = 20$ ).

$\displaystyle I(x) = \frac { \log_2 \sum_{i=1}^n R_{x,i } } { \log_2 \vert m(x)\vert}$

(1)

At one extreme, this coefficient produces a very low value for function words such as articles, conjunctions, prepositions, etc. At the opposite extreme are, on the other hand, the more specialized vocabulary words. The cutoff value of this continuum between lexical ( $L$ ) and functional ( $F$ ) category is given by an arbitrary parameter $k$ (2).

$\displaystyle C(x) = \left\{\begin{array}{ll} L & \mbox{if } I(x) > k \\ F & \mbox{otherwise} \end{array}\right.$

(2)
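Coefficient (1) and the cutoff in (2) can be illustrated with a minimal Python sketch. It is not the project's implementation: it assumes that each context is a list of tokens, that a co-occurring word is counted once per context, and that the target word itself is excluded. The function names, the toy contexts and the values $n = 2$ and $k = 1.0$ are chosen only for this example.

```python
import math
from collections import Counter

def information_coefficient(target, contexts, n=20):
    """I(x): log2 of the summed frequencies of the n most frequent
    co-occurring words, divided by log2 of the number of contexts."""
    counts = Counter()
    for tokens in contexts:
        # Assumption: each word counts once per context; the target is skipped.
        counts.update(set(tokens) - {target})
    top_sum = sum(freq for _, freq in counts.most_common(n))
    if len(contexts) < 2 or top_sum < 1:
        return 0.0  # coefficient undefined here; treated as uninformative
    return math.log2(top_sum) / math.log2(len(contexts))

def classify(target, contexts, k, n=20):
    """Return 'L' (lexical) if I(x) > k, otherwise 'F' (functional)."""
    return "L" if information_coefficient(target, contexts, n) > k else "F"

# Toy contexts: 'democracy' keeps a restricted word company, 'anyway' does not.
democracy_contexts = [
    ["democracy", "rights", "freedom"],
    ["democracy", "rights", "respect"],
    ["democracy", "freedom", "rights"],
    ["democracy", "respect", "freedom"],
]
anyway_contexts = [
    ["anyway", "cat", "sleeps"],
    ["anyway", "rain", "stopped"],
    ["anyway", "train", "left"],
    ["anyway", "dinner", "ready"],
]

i_democracy = information_coefficient("democracy", democracy_contexts, n=2)
i_anyway = information_coefficient("anyway", anyway_contexts, n=2)
print(round(i_democracy, 3), classify("democracy", democracy_contexts, k=1.0, n=2))
print(round(i_anyway, 3), classify("anyway", anyway_contexts, k=1.0, n=2))
```

In this toy data the content word scores higher than the DM because its two most frequent companions recur across contexts, while the companions of the DM never repeat.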

4.2. Organization of the extracted DMs into groups

For this task, we developed a new clustering algorithm that is characterized by its computational efficiency, which is necessary to be able to process such a large data set. As a similarity measure to group the DMs extracted in the previous step, this algorithm uses the common equivalence of the DMs in a second language, hence the recourse to the PC. This is effectively using the PC as a semantic mirror. For example, the DMs however and nevertheless are similar because both have sin embargo and no obstante, among others, as equivalents in Spanish. These first cross-linguistic equivalences, obtained from the parallel corpus, are then used to find the corresponding intralinguistic similarities. To find these equivalences we apply an association coefficient (3) based on the co-occurrence of the elements in the aligned segments, where $f$ denotes the frequency of a DM, or the joint frequency of a pair of DMs, in those segments. With this coefficient we estimate how likely it is that the DM $i$ in language $x$ is equivalent to the DM $j$ in language $y$.

$\displaystyle A(DM_{x,i},DM_{y,j}) = \frac{f(DM_{x,i},DM_{y,j}) } { \sqrt{f(DM_{x,i})} \cdot \sqrt{f(DM_{y,j})} }$

(3)
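The association coefficient (3) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `association` is hypothetical, frequencies are counted as the number of aligned segment pairs in which a DM appears, and matching is done by plain case-insensitive substring search rather than by proper tokenization.

```python
import math


def association(dm_x, dm_y, aligned_segments):
    """Compute f(x, y) / (sqrt(f(x)) * sqrt(f(y))) over aligned segment pairs.

    `aligned_segments` is an iterable of (segment_in_language_x, segment_in_language_y).
    """
    f_x = f_y = f_xy = 0
    for seg_x, seg_y in aligned_segments:
        # Assumption: naive substring matching stands in for real DM detection
        in_x = dm_x in seg_x.lower()
        in_y = dm_y in seg_y.lower()
        f_x += in_x
        f_y += in_y
        f_xy += in_x and in_y
    if f_x == 0 or f_y == 0:
        return 0.0
    return f_xy / (math.sqrt(f_x) * math.sqrt(f_y))
```

The coefficient equals 1 only when the two DMs always appear together in aligned segments, and 0 when they never do.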

Once the list of cross-linguistically aligned DM pairs has been obtained, the clustering process derives the intralinguistic similarities. For each pair in this set, such as por esa razón $\sim$ for that reason, the algorithm checks whether either member has appeared in a previously examined pair. If the pair por esta razón $\sim$ for that reason is found later, it may be assumed that por esta razón and por esa razón are similar, without any need to look at their lexical or orthographic similarity. When such a similarity is found, the new DM is added to the already formed cluster. Figure 3 illustrates a moment in the process, when the new DM for that reason is added to a previously formed cluster.

**Figure 3:** Capture of a moment of the clustering process

**Figure 4:** Animated illustration of the clustering process
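The pair-by-pair grouping described in this section can be approximated with a union-find structure, as in the Python sketch below. It is a simplified illustration, not the project's algorithm: the function name `cluster_pairs` and the `(language, DM)` tuple representation are hypothetical, and merging two existing clusters when a pair bridges them is an assumption that the description above does not spell out.

```python
def cluster_pairs(pairs):
    """Group DMs that are linked through shared cross-linguistic equivalents.

    `pairs` is an iterable of ((lang, dm), (lang, dm)) aligned DM pairs.
    Returns a list of sets, each mixing the languages of its members.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own cluster, then follow links to the root
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps lookups cheap
            node = parent[node]
        return node

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Assumption: a pair that bridges two clusters merges them
            parent[root_right] = root_left

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())
```

Filtering each resulting set by language gives the intralinguistic clusters. Each pair is processed in near-constant time, which suits large pair lists, although a polyfunctional DM could cause unrelated clusters to merge in this simplified version.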

4.3. Labeling of clusters with functional categories

The previous step results in an indeterminate number of DM clusters in each language, although at this point the system only identifies them with arbitrary numerical codes. In order to give these clusters meaningful names, a list of arbitrary names for the most general categories was compiled along with two to three DM examples for each. This allows automatic labeling of the clusters by grouping them into these categories, which is achieved by calculating the intersection (4) between these examples ( $E$ ) and the members of each cluster ( $CDM$ ).

$\displaystyle sim(E_p,CDM_q) = \frac { \vert \vec{E_p} \cap \vec{CDM_q} \vert } { \vert \vec{E_p} \vert }$

(4)
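A minimal Python sketch of the labeling step in (4) follows. The function names and the rule of choosing the category with the highest score are assumptions made for the example; the category names and seed examples used in the test data are illustrative only.

```python
def overlap(examples, cluster):
    """Compute |E_p ∩ CDM_q| / |E_p| for one category and one cluster."""
    examples, cluster = set(examples), set(cluster)
    if not examples:
        raise ValueError("a category needs at least one example")
    return len(examples & cluster) / len(examples)


def label_cluster(cluster, categories):
    """Return the category whose examples best overlap the cluster, or None.

    `categories` maps a category name to its two or three example DMs.
    Assumption: the highest score wins and a score of zero means no label.
    """
    best_name, best_score = None, 0.0
    for name, examples in categories.items():
        score = overlap(examples, cluster)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Because the denominator is the size of the example set, a cluster receives the maximum score of 1 as soon as it contains all the examples of a category, regardless of how many other members it has.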

4.4. Populating the taxonomy with new specimens

Once a basic DM taxonomy has been formed, it is used to classify new DMs recursively. A candidate is first classified by language; it is then decided whether it is a DM and, if so, it is assigned a category. For both tasks we use the same PC. If the candidate is a genuine DM, its status will be indicated by the PC. For example, a candidate such as de la misma manera will have as equivalents in the PC elements such as in the same way, likewise and similarly, which are already registered in the taxonomy as English DMs. It is then concluded that the candidate is genuine and belongs to the same category.
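The decision described above can be sketched in Python as a vote among the candidate's equivalents. This is a hedged illustration, not the project's classifier: the function name, the majority vote and the `min_share` threshold are assumptions, the language identification step is omitted, and the equivalents are taken as given (for instance, the top-ranked items by an association measure over the PC).

```python
from collections import Counter


def classify_candidate(equivalents, taxonomy, min_share=0.5):
    """Decide whether a candidate is a DM and, if so, return its category.

    `equivalents` lists the expressions aligned with the candidate in another language.
    `taxonomy` maps DMs already registered in that language to their category.
    Returns None when too few equivalents are known DMs of a single category.
    """
    if not equivalents:
        return None
    votes = Counter(taxonomy[e] for e in equivalents if e in taxonomy)
    if not votes:
        return None
    category, count = votes.most_common(1)[0]
    # Assumption: the winning category must cover at least `min_share` of the equivalents
    return category if count / len(equivalents) >= min_share else None
```

A newly accepted DM can then be added to `taxonomy` for its own language, so that later candidates in other languages can be classified against it.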

4.5. Creation of a database of DMs in English, French, Spanish, German and Catalan

At present, the result of the project is a multilingual DM taxonomy with 2,897 elements (619 in Spanish, 733 in English, 556 in French, 677 in German and 312 in Catalan), divided into 70 functional categories. Table 1 shows an example of one of the clusters, labeled with the category of counterargumentative connectors. The full multilingual taxonomy resulting from the project is also available for download.

Table 1: Example of one of the clusters corresponding to the category of counterargumentative connectors.

Language	DMs

English	all the same; although; and yet; but; but still; despite all; despite the fact that; despite these; despite this; even if; even so; even though; however; in spite of all; in spite of the fact; nevertheless; nonetheless; that being said; that said; though; while; yet
Spanish	a cambio; ahora bien; al contrario; aparte de eso; a pesar de ello; a pesar de eso; a pesar de esto; a pesar de todo; aun así; aun cuando; aun en; aunque; bien que; con todo; de cualquier forma; de cualquier modo; de todas formas; de todas maneras; de todos modos; dicho esto; en cambio; en lugar de eso; en vez de eso; incluso aunque; no obstante; pero; pero aun así; pese a ello; pese a todo; por el contrario; si bien; sin embargo; todo lo contrario; y sin embargo
French	cependant; et pourtant; mais encore; mais toujours; malgré cela; malgré tout; même ainsi; même si; néanmoins; pourtant; toutefois
German	aber immer noch; aber nicht; aber trotzdem; allerdings; auch wenn; auftreten müssen; dachte; dennoch; jedoch; obwohl; selbst wenn; sogar; trotzdem; trotz der tatsache; trotz dieser; trotz dieses
Catalan	al contrari; ans al contrari; ben al contrari; de qualsevol manera; de tota manera; de totes formes; de totes maneres; en comptes d'això; en lloc d'això; i no obstant això; malgrat això; no obstant; pel contrari; però tot i així; tanmateix; tot el contrari; tot i així; tot i això

A manual evaluation of these results was carried out by native speakers of each language, either collaborators or technical staff of the project, with periodic meetings to keep the criteria uniform. The evaluation consisted of two phases. The first determined the accuracy of the algorithm, defined as the proportion of correct DMs in the extracted taxonomy. For this we reviewed the entire database and recorded every entry that is not a genuine DM, is poorly segmented, or is misclassified. In all languages we found an error rate below 5%, except in German, where it reached 16%, mainly due to segmentation problems. The second phase evaluated coverage, defined as the proportion of the DMs of a language that are contained in the taxonomy. For this we took random samples of 10 texts per language and quantified the proportion of DMs appearing in those texts that were not in our taxonomy. In this way we estimated the total coverage of the resource at 88%.
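The two measures used in the evaluation reduce to simple proportions, sketched below in Python. The function names and the input formats are hypothetical; the manual judgements themselves cannot be automated and are taken as given.

```python
def error_rate(judgements):
    """Proportion of taxonomy entries rejected by the reviewers.

    `judgements` holds one boolean per entry: True if the entry was judged correct.
    """
    if not judgements:
        raise ValueError("no judgements given")
    return sum(1 for ok in judgements if not ok) / len(judgements)


def coverage(dms_in_sample, taxonomy):
    """Proportion of DMs found in sampled texts that are present in the taxonomy."""
    if not dms_in_sample:
        raise ValueError("no DMs found in the sample")
    return sum(1 for dm in dms_in_sample if dm in taxonomy) / len(dms_in_sample)
```

Accuracy as defined above is one minus the error rate, so both phases can be reported on the same scale.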

5. Dissemination of project results

5.1. Communication of results in the scientific field

The results of the project have already been disseminated in scientific publications. Some give general descriptions of the methodology [Nazar 2021a, Nazar 2021b]. Others deal with more specific aspects, related to establishing divisions in the vocabulary of a language, with methods for the detection of gender [Nazar et al. 2020], number [Nazar and Galdames 2019], hypernymy [Nazar et al. 2021, Nazar 2021c] and syntactic patterns [Robledo and Nazar (submitted)]. Methods have also been applied to the analysis of polyfunctionality [Alvarado and Nazar (submitted)]. In parallel to the methodological developments, we also conducted studies on the use of DMs in real speaker populations [Asenjo and Nazar 2020]. Along with the publication of articles, the results of this project were presented at 10 international congresses in the field. In addition, 12 undergraduate theses and one doctoral thesis were defended based on data and methods developed in the project.

5.2. Dissemination in society in general

In addition to dissemination in scientific forums, the project has also invested in communicating its results to society in general, with a strong commitment to digital media. The first measure was the development of this website (http://www.tecling.com/dismark), which contains the complete DM database in the languages we have worked with, detailed documentation, and an implementation of the DM classifier prototype, which allows the user to reproduce experiments by entering one or more DM candidates.

In addition to this website, we have invested in the creation of short audiovisual pieces that explain, in 10-15 minutes, different phases and possible applications of the project. For these we have opened a YouTube channel, accessible from the project website. To date, four months after we began distributing this type of material, we have more than 1,700 views in total, counting the videos on that channel and appearances of our videos on the channels of other colleagues.

Video presentation of the DM Dictionary (in Spanish):

Description of the methodology (in Spanish):

Video presentation of the project in English:

6. International cooperation

International cooperation included the visits of Professors David Lindemann (University of the Basque Country) and Antonio Balvet (University of Lille), which had to be carried out virtually due to the health emergency. Nevertheless, the objective of these visits, which was to support the coordination of the data review teams in German and French, respectively, was achieved without major inconvenience. Also worth mentioning is the short research stay of the lead researcher at Universidad Nacional de Cuyo (Argentina), with the objective of collaborating with the research group that carries out the Corpus Espada Project. Given this group's interest in the study of textual genres, the two projects share an area of interest: the study of the correlation between the use of DMs and different genres, with the aim of classifying texts by genre.

7. Lines of future work

This project set out to design and evaluate a methodology, and for this reason it was applied only to a small group of languages. Now that the method is established, a first phase of future work will consist of replicating experiments in different languages. Moreover, the project was limited to the synchronic study of DMs, but the algorithm is designed to operate dynamically, so the diachronic study presents itself as an attractive possibility for future work to document the process of DM grammaticalization.

8. Publications derived from the project

Alvarado, C. and R. Nazar. submitted. Detección de marcadores discursivos: el caso de los conectores causal-consecutivos y su polifuncionalidad. ...

Asenjo, S. and R. Nazar. 2020. Marcadores discursivos en niños de 7 años con trastorno específico del lenguaje: Estudio descriptivo. RLA. Revista de lingüística teórica y aplicada, 58(1):93–114.

Nazar, R. 2021a. Automatic induction of a multilingual taxonomy of discourse markers. Electronic lexicography in the 21st century (eLex 2021) Post-editing lexicography, pages 440–454.

Nazar, R. 2021b. Inducción automática de una taxonomía multilingüe de marcadores discursivos: primeros resultados en castellano, inglés, francés, alemán y catalán. Procesamiento del Lenguaje Natural, 67:127–138.

Nazar, R. 2021c. Kind: Un proyecto de inducción automática de taxonomías léxicas. Anales de Lingüística, 2(7):175–201.

Nazar, R., A. Balvet, G. Ferraro, R. Marín, and I. Renau. 2021. Pruning and repopulating a lexical taxonomy: experiments in Spanish, English and French. Journal of Intelligent Systems, 30(1):376–394.

Nazar, R. and A. Galdames. 2019. Formalización de reglas para la detección del plural en castellano en el caso de unidades no diccionarizadas. Linguamática, 11(2):17–32.

Nazar, R., I. Renau, N. Acosta, H. Robledo, M. Soliman, and S. Zamora. 2020. Corpus-based methods for recognizing the gender of anthroponyms. Names, pages 1–17.

Robledo, H. and R. Nazar. submitted. A proposal for the inductive categorisation of parenthetical discourse markers in Spanish using parallel corpora. International Journal of Corpus Linguistics, ...

People involved in the project

Lead researcher:
Rogelio Nazar

Co-researcher:
Irene Renau

Collaborator:
Hernán Robledo

International collaborators:
David Lindemann
Antonio Balvet

Research assistants:
Nicolás Acosta
Andrea Alcaíno
Patricio Arriagada
Scarlette Gatica
Ricardo Martínez
Natalie Mies
Mathilde Guernut
Maureen Noble
Valentina Ravest
Diego Sánchez
Jana Strohbach

Theses funded by this project:

PhD Thesis:

Hernán Robledo Nakagawa (2021). Categorización de los marcadores del discurso del español: una propuesta inductiva guiada por corpus paralelo.

Undergraduate theses:

Héctor Ramos (2022). Polifuncionalidad de los marcadores discursivos conectores en textos científicos y literarios.

Gabriela Paz Cacciuttolo Provoste (2022). Estudio cuantitativo de variables estilísticas en la traducción literaria.

Pedro Andrés Bolbarán Gálvez (2022). Análisis descriptivo de la incidencia de errores en transcriptores automáticos.

Camila Pérez Lagos (2022). Extracción y clasificación de neologismos especializados en un corpus de divulgación científica de astronomía.

Javiera Silva Espinoza (2022). Estudio comparativo del aprendizaje de neologismos en hispanohablantes en una traducción y una interpretación.

Bahony Saavedra Tapia (2022). Análisis comparativo para establecer la relación entre la condición de un préstamo y su categoría gramatical.

Belén Guerrero Carreño (2021). Análisis contrastivo del uso de préstamos simples en las variantes chilena y peninsular del castellano.

Camila Alvarado Barbosa (2020). Detección de marcadores discursivos: el caso de los conectores causal-consecutivos y su polifuncionalidad.

Sara Asenjo Sotelo (2019). Estudio descriptivo del uso de marcadores discursivos en niños de 7 años con trastorno específico del lenguaje.

José Bahamonde Cano (2019). La terminología médica en la traducción audiovisual: análisis de la terminología médica en subtítulos profesionales y fansub de textos audiovisuales no especializados.

Paolo Caballería Rodríguez (2019). Modelo algorítmico para el reconocimiento automático de referentes y la resolución de anáfora pronominal en textos en español.

Yerko Leiva (2019). Análisis de modalizadores en el discurso de revistas psiquiátricas chilenas y españolas.

References

Aijmer, K. (2002). English discourse particles: Evidence from a corpus. John Benjamins, Amsterdam/Philadelphia.

Aijmer, K., Foolen, A., and Vandenbergen, A.-M. (2006). Pragmatic markers in translation: a methodological proposal. In Fischer, K., editor, Approaches to discourse particles, pages 101–114. Elsevier, Oxford.

Alonso, L., Castellón, I., Gibert, K., and Padró, L. (2002a). An empirical approach to discourse markers by clustering. In Proceedings of the 5th Catalonian Conference on AI: Topics in Artificial Intelligence, pages 173–183.

Alonso, L., Castellón, I., and Padró, L. (2002b). Lexicón computacional de marcadores del discurso. Procesamiento del lenguaje natural, 29:239–246.

Bazzanella, C. (2006). Discourse markers in italian: towards a ‘compositional’ meaning. In Fischer, K., editor, Approaches to discourse particles, pages 449–464. Elsevier, Oxford/Amsterdam.

Bello, A. (1970[1847]). Gramática de la lengua castellana. Sopena, Buenos Aires.

Blakemore, D. (1987). Semantic constraints on relevance. Blackwell, Oxford.

Borreguero, M. and López, A. (2010). Marcadores del discurso: de la descripción a la definición. Vol 45. Iberoamericana.

Braune, F. and Fraser, A. (2010). Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 81–89. Association for Computational Linguistics.

Brezina, V. (2018). Statistics in corpus linguistics: A practical Guide. Cambridge/New York: Cambridge University Press.

Brinton, L. J. (1996). Pragmatic markers in English: Grammaticalization and discourse functions. Mouton de Gruyter, Berlin/New York.

Briz, A. (1993a). Los conectores pragmáticos en la conversación coloquial (I): su papel argumentativo. Contextos, 11(21-22), 145-188.

Briz, A. (1993b). Los conectores pragmáticos en la conversación coloquial (II): su papel metadiscursivo. Español Actual, 9, 39-56.

Briz, A., Pons, S., and Portolés, J. (2008). Diccionario de partículas discursivas del español (dpde)[online]: http://www.dpde.es

Brown, P., Lai, J. C., and Mercer, R. L. (1991). Aligning sentences in parallel corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL), pages 169–176, Berkeley.

Calsamiglia, H. and Tusón, A. (1999). Las cosas del decir. Manual de análisis del discurso. Ariel, Barcelona.

Cano Aguilar, R. (2003). Función sintáctica, significación gramatical y valor léxico en la conexión supraoracional. In J. L. Girón Alconchel, F. J. Herrero Ruiz de Loizaga, S. Iglesias Recuero & A. Narbona Jiménez (eds.), Estudios ofrecidos al profesor José Jesús de Bustos Tovar (Vol. 1, pp. 297-314). Madrid: Editorial Complutense.

Casado Velarde, M. (1993). Introducción a la gramática del texto del español. Arco/Libros, Madrid.

Church, K. (1993). Charalign: a program for aligning parallel texts at the character level. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL), pages 1–8, Columbus, Ohio.

Cook, H. (2001). Particelle/particles. In Duranti, A., editor, Culture e discorso. Un lessico per le scienze umane, pages 251–255. Meltemi, Roma.

Crible, L., and Blackwell, S. E. (2020). Introduction: Discourse-pragmatic markers in speech and sign. Journal of Pragmatics, 156, 24-27.

Cuartero, J. (2002). Conectores y conexión aditiva. Los signos incluso, también y además en español actual. Gredos, Madrid.

Daille, B. and Morin, E. (2005). French-english terminology extraction from comparable corpora. In International Conference on Natural Language Processing, pages 707–718. Springer.

Declerck, T. and Gromann, D. (2012). Towards the generation of semantically enriched multilingual components of ontology labels. In Proceedings of the 3rd International Conference on Multilingual Semantic Web-Volume 936, pages 11–22. Citeseer.

Degand, L., Cornillie, B., and Pietrandrea, P., editors (2013). Discourse markers and modal particles. Categorization and description. John Benjamins, Amsterdam.

Divjak, D. and Fieller, N. (2014). Cluster analysis: Finding structure in linguistic data. In D. Glynn & J. A. Robinson (eds.), Corpus methods for semantics: Quantitative studies in polysemy and synonymy (pp. 405-442). Amsterdam/Philadelphia: John Benjamins.

Everitt, B. S. and Hothorn, T. (2009). A handbook of statistical analyses using R. Chapman and Hall/CRC, Boca Raton, FL.

Everitt, B. S., Landau, S., Leese, M. and Stahl, D. (2011). Cluster analysis. Chichester: John Wiley & Sons.

Fedriani, C. and Sansò, A. (2017). Pragmatic markers, discourse markers and modal particles: What do we know and where do we go from here? In Fedriani, C. and Sansò, A., editors, Pragmatic markers, discourse markers and modal particles. New perspectives, pages 1–33. John Benjamins, Amsterdam/Philadelphia.

Fischer, K. (2006). Towards an understanding of the spectrum of approaches to discourse particles: introduction to the volume. In Fischer, K., editor, Approaches to discourse particles, pages 1–20. Elsevier, Oxford/Amsterdam.

Fraser, B. (1990). An approach to discourse markers. Journal of pragmatics, 14(3):383–398.

Fraser, B. (1996). Pragmatic markers. Pragmatics, 6(2):167–190.

Fraser, B. (1999). What are discourse markers? Journal of pragmatics, 31(7):931–952.

Fraser, B. (2006). Towards a theory of discourse markers. In Fischer, K., editor, Approaches to discourse particles, pages 189–204. Elsevier, Oxford/Amsterdam.

Fuentes Rodríguez, C. (1987). Enlaces extraoracionales. Alfar, Sevilla.

Fuentes Rodríguez, C. (1998). Ejercicios de sintaxis supraoracional. Arco/Libros, Madrid.

Fuentes Rodríguez, C. (2009). Diccionario de conectores y operadores del español. Arco/Libros, Madrid.

Fung, P. (1995). Compiling bilingual lexicon entries from a non-parallel english-chinese corpus. In Third Workshop on Very Large Corpora.

Furkó, B. P. (2020). Discourse markers and beyond: Descriptive and critical perspectives on discourse-pragmatic devices across genres and languages. Cham: Palgrave Macmillan.

Gale, W. A. and Church, K. (1991). A program for aligning sentences in bilingual corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL), pages 177–184, Berkeley.

Gamallo, P. (2005). Extraction of translation equivalents from parallel corpora using sense-sensitive contexts. In Proceedings of the 10th Conference of the European Association for Machine Translation (EAMT'05).

Gaussier, E., Renders, J.-M., Matveeva, I., Goutte, C., and Déjean, H. (2004). A geometric view on bilingual lexicon extraction from comparable corpora. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 526. Association for Computational Linguistics.

Gili Gaya, S. (1969). Curso superior de sintaxis española. Bibliograf, Barcelona.

Gomez Guinovart, X. and Simoes, A. (2009). Parallel corpus-based bilingual terminology extraction. In Proceedings of the 8th International Conference on Terminology and Artificial Intelligence. IRIT (Institut de Recherche en Informatique de Toulouse).

Graff, D. and Cieri, C. (2003). English Gigaword LDC2003T05. Web Download. Linguistic Data Consortium, Philadelphia.

Gries, S. T. (2013). Statistics for linguistics with R: A practical introduction. Berlin/Boston: De Gruyter Mouton.

Halliday, M.A. and Hasan, R. (1976). Cohesion in English. London: Longman.

Halliday, M.A.K. (1985). An introduction to functional grammar. London: Arnold.

Haselow, A., and Hancil, S. (eds.). (2021). Studies at the grammar-discourse interface: Discourse markers and discourse-related grammatical phenomena. Amsterdam/Philadelphia: John Benjamins.

Hiemstra, D. (1998). Multilingual domain modeling in twenty-one: automatic creation of a bi-directional lexicon from a parallel corpus. Technical report, University of Twente, Parlevink Group.

Herrero, J. L. (2012). Los conectores en la historia del español: la formación del paradigma consecutivo. Salamanca: Luso-Española de Ediciones.

Holgado Lage, A. (2017). Diccionario de marcadores discursivos para estudiantes de español como segunda lengua. New York: Peter Lang.

Hutchinson, B. (2003). Automatic classification of discourse markers on the basis of their co-occurrences. In M. Stede & H. Zeevat (eds.), Proceedings of the ESSLLI Workshop: The meaning and implementation of discourse particles (p. 1). Vienna: University of Groningen.

Hutchinson, B. (2004). Acquiring the meaning of discourse markers. In Proceedings of the 42nd Annual Meeting on ACL, page 684.

Hutchinson, B. (2005). The automatic acquisition of knowledge about discourse connectives. Ph.D. thesis, The University of Edinburgh.

Jain, A. K. and Dubes, R. C. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall.

Jucker, A. and Ziv, Y. (1998). Discourse Marker: Description and Theory. John Benjamins, Amsterdam.

Jung, J. J. (2008). Taxonomy alignment for interoperability between heterogeneous virtual organizations. Expert Systems with Applications, 34(4):2721–2731.

Jung, J. J., Håkansson, A., and Hartung, R. (2009). Indirect alignment between multilingual ontologies: A case study of korean and swedish ontologies. In KES International Symposium on Agent and Multi-Agent Systems: Technologies and Applications, pages 233–241. Springer.

Kaufman, L. and Rousseeuw, P. J. (2009). Finding groups in data: An introduction to cluster analysis. Hoboken, NJ: John Wiley & Sons.

Kay, M. and Röscheisen, M. (1993). Text-translation alignment. Computational linguistics, 19(1):121–142.

Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., and Suchomel, V. (2014). The sketch engine: ten years on. Lexicography, 1(1):7–36.

Knott, A. (1996). A Data-Driven Methodology for Motivating a Set of Coherence Relations. PhD thesis, University of Edinburgh.

Knott, A. and Dale, R. (1995). Using linguistic phenomena to motivate a set of coherence relations. Discourse processes, 18(1):35–62.

Knott, A. and Mellish, C. (1996). A feature-based account of the relations signalled by sentence and clause connectives. Language and Speech, 39(2-3), 143-183.

Lakoff, R. (1972). Language in context. Language, 48(4):907–927.

Lenk, U. (1997). Discourse markers. In Verschueren, J., editor, Handbook of pragmatics. Installment, pages 1–17. John Benjamins, Amsterdam/Philadelphia.

Ler Soon Lay, V. (2006). A relevance-theoretic approach to discourse particles in singapore english. In Fischer, K., editor, Approaches to discourse particles, pages 149–166. Elsevier, Oxford/Amsterdam.

Llorente, M. T. (1996). Organizadores de la conversación: operadores discursivos en español. Salamanca: Universidad Pontificia de Salamanca.

Loureda, Ó. and Acín, E. (2010). Cuestiones candentes en torno a los marcadores del discurso en español. In Loureda, Ó. and Acín, E., editors, Los estudios sobre marcadores del discurso en español, hoy, pages 7–59. Arco/Libros, Madrid.

Loureda, Ó., Rudka, M. and Parodi, G. (eds.). (2020). Marcadores del discurso y lingüística contrastiva en las lenguas románicas. Madrid/Frankfurt am Main: Iberoamericana/Vervuert.

Manning, C. and Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: The MIT Press.

Marcu, D. (1997a). The rhetorical parsing, summarization, and generation of natural language texts. Ph.D. thesis, University of Toronto.

Marcu, D. (1997b). The rhetorical parsing of unrestricted natural language texts. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics (pp. 96-103). Madrid, Spain.

Martín Zorraquino, M. A. and Portolés, J. (1999). Los marcadores del discurso. In Bosque, I. and Demonte, V., editors, Gramática descriptiva de la lengua española, Vol. 3, pages 4051–4213. Espasa-Calpe, Madrid.

Maschler, Y. (1994). Metalanguaging and discourse markers in bilingual conversation. Language in Society, 23(3):325–366.

Maschler, Y. and Schiffrin, D. (2015). Discourse markers: Language, meaning, and context. In Tannen, D., Hamilton, H. E., and Schiffrin, D., editors, The handbook of discourse analysis, pages 189–221. John Wiley & Sons, Chichester.

McEnery, A. M. and Oakes, M. P. (1995). Sentence and word alignment in the crater project: methods and assessment. In Proceedings of the EACL-SIGDAT Workshop: from texts to tags, Issues in Multilingual Language Analysis(ACL), pages 77–86, Dublin, Ireland.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, Boston, MA.

Montolío, E. (2001). Conectores de la lengua escrita: contraargumentativos, consecutivos, aditivos y organizadores de la información. Ariel, Barcelona.

Moore, R. C. (2002). Fast and accurate sentence alignment of bilingual corpora. In Conference of the Association for Machine Translation in the Americas, pages 135–144. Springer.

Muller, P., Conrath, J., Afantenos, S., and Asher, N. (2016). Data-driven discourse markers representation and classification. In TextLink–Structuring Discourse in Multilingual Europe. Károli Gáspár University of the Reformed Church in Hungary, Budapest, page 93.

Nazar, R. (2012). Bifid: un alineador de corpus paralelo a nivel de documento, oración y vocabulario. Linguamática, 4(2):45–56.

Nazar, R., Wanner, L., and Vivaldi, J. (2008). Two-step flow in bilingual lexicon extraction from unrelated corpora. In Proceedings of the 12th Conference of the European Association for Machine Translation, pages 22–23.

Och, F. J. and Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational linguistics, 29(1):19–51.

Östman, J.-O. (1981). 'You Know': A discourse-functional study. John Benjamins, Amsterdam/Philadelphia.

Pons Bordería, S. (1996). La presencia de los enlaces extraoracionales en la tradición gramatical española: la descripción de algunas conjunciones. Otros valores conversacionales. ELUA. Estudios de Lingüística, 11, 261-283.

Pons Bordería, S. (1998). Conexión y conectores: estudio de su relación en el registro informal de la lengua. Anejo XXVII de la revista Cuadernos de filología. Universitat de València, Valencia.

Pons Bordería, S. (2000). Los conectores. In Briz, A. and Val.Es.Co, editors, ¿Cómo se comenta un texto coloquial?, pages 193–220. Ariel, Barcelona.

Popescu-Belis, A. and Zufferey, S. (2006). Contrasting the automatic identification of two discourse markers in multiparty dialogues. ISSCO Working Paper 65.

Portolés, J. (1993). La distinción entre los conectores y otros marcadores del discurso en español. Verba, (20):141–170.

Portolés, J. (2001). Marcadores del discurso. Ariel, Barcelona.

Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, A. L., Joshi, A., and Webber, B. (2008). The penn discourse treebank 2.0. In Proceedings of the 6th LREC 2008.

Ranger, G. (2018). Discourse Markers: An Enunciative Approach. Springer International Publishing, Cham, Switzerland.

Rapp, R. (1999). Automatic identification of word translations from unrelated english and german corpora. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 519–526. Association for Computational Linguistics.

Robledo, H. and Nazar, R. (2018). Clasificación automatizada de marcadores discursivos. Procesamiento del Lenguaje Natural, 61:109–116.

Robledo, H., Nazar, R., and Renau, I. (2017). Un enfoque inductivo y de corpus para la categorización de los marcadores del discurso en español. In Proceedings of the 5th International Conference “Discourse Markers in Romance Languages: Boundaries and Interfaces”, pages 91–93. Université Catholique de Louvain, Belgium.

Rokach, L. and Maimon, O. (2005). Clustering methods. In Maimon, O. and Rokach, L., editors, The Data Mining and Knowledge Discovery Handbook, pages 321–352. Springer, Boston, MA.

Santos Río, L. (2003). Diccionario de partículas. Salamanca: Luso-española de ediciones.

Savourel, Y. (2005). TMX 1.4b Specification. Localization Industry Standards Association (LISA). http://www.gala-global.org/oscarStandards/tmx/tmx14b.html.

Schiffrin, D. (1987). Discourse markers. Cambridge University Press, Cambridge.

Sperber, D. and Wilson, D. (1986). Relevance: communication and cognition. Harvard University Press, Cambridge, MA.

Tiedemann, J. (2016). Opus - parallel corpora for everyone. In Proceedings of the 19th Annual Conference of the EAMT: Projects/Products, page 384.

Varga, D., Halácsy, P., Kornai, A., Nagy, V., Németh, L., and Trón, V. (2007). Parallel corpora for medium density languages. Amsterdam Studies In The Theory And History Of Linguistic Science Series 4, 292:247.

Véronis, J. (2000). From the rosetta stone to the information society: A survey of parallel text processing. In Parallel Text Processing: Alignment and Use of Translation Corpora, pages 1–24. Kluwer.

Waltereit, R. (2006). The rise of discourse markers in italian: a specific type of language change. In Fischer, K., editor, Approaches to discourse particles, pages 61–76. Elsevier, Amsterdam.

Webber, B., Prasad, R., Lee, A. and Joshi, A. (2016). A discourse annotated corpus of conjoined VPs. In Proceedings of LAW X – The 10th Linguistic Annotation Workshop (pp. 22–31). Berlin: Association for Computational Linguistics.

Webber, B., Prasad, R., Lee, A. and Joshi, A. (2019). The Penn Discourse Treebank 3.0 Annotation Manual. Tech report, University of Pennsylvania.

Weiss, D., Alberti, C., Collins, M., and Petrov, S. (2015). Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 323–333. Association for Computational Linguistics.

Wierzbicka, A. (2003). Cross-Cultural Pragmatics: The Semantics of Human Interaction. Mouton/de Gruyter, Berlin/New York.