TEXT·A·GRAM

Text analysis from the point of view of text grammars


Announcement (April 3, 2026): We have just launched "Wicacho", a new semantic tagger based on Wikipedia, which will replace the one currently in place here. For now, the new tagger has its own website (http://www.tecling.com/wicacho), but we will make it available here as well very soon.

ANALYSIS OF DISCOURSE MARKERS

This function automatically finds, classifies and labels the discourse markers in a text, following the taxonomy developed by Martín Zorraquino and Portolés (1999). Labels are distinguished by colour and by visual marks such as underlining and bold type, and the category of each marker found is shown in brackets to its right.
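
As a rough illustration of the labelling step, the following Python sketch appends a category in brackets after each marker it finds. It is not the program's actual code (which is a Perl script): the four-entry lexicon, the English category names and the function name `label_markers` are assumptions made only for this example, and the colours and other visual marks are left out.

```python
import re

# Hypothetical mini-lexicon (marker -> category), for illustration only.
# The real tool relies on the Dismark taxonomy, not on this table.
LEXICON = {
    "sin embargo": "counter-argumentative connector",
    "además": "additive connector",
    "por ejemplo": "concretion operator",
    "es decir": "explanatory reformulator",
}


def label_markers(text: str) -> str:
    """Return the text with a bracketed category after each marker found."""
    # Try longer markers first so that multi-word markers are matched whole.
    alternatives = sorted(LEXICON, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(m) for m in alternatives) + r")(?!\w)",
        re.IGNORECASE,
    )

    def annotate(match: re.Match) -> str:
        # Look the marker up in lower case, but keep its original spelling.
        category = LEXICON[match.group(1).lower()]
        return f"{match.group(1)} [{category}]"

    return pattern.sub(annotate, text)


sample = "Llovía. Sin embargo, salimos; además, hacía calor."
print(label_markers(sample))
```

A plain lexicon lookup like this one cannot tell a marker from the same word sequence used with its ordinary, non-discursive meaning, so it should be read only as a picture of the output format.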

Discourse markers

Discourse markers are among the elements required to maintain cohesion and coherence within a text (Calsamiglia & Tusón, 1999; Fraser, 1999). Definitions and classifications of these markers vary, since terminology and criteria change with the approach and the context in which they are used. According to Martín Zorraquino and Portolés (1999), discourse markers are units that can have discursive uses, emphatic uses and expressive values, among others. They include word classes such as prepositions, adverbs and conjunctions, as well as other grammatical elements that can perform these functions even when these are not their usual ones.

Discourse markers are invariable linguistic units, grammaticalized in the language and used in both conversational and written contexts. They do not perform a syntactic function within the predication of the sentence; their purpose is to guide the reader through the text, maintaining coherence between ideas.

Classification

As mentioned above, this tool uses the classification described by Martín Zorraquino and Portolés (1999), who divide discourse markers into five main types. Since the tool focuses on written text, the fifth type, conversational markers, is not included, which leaves the following four.

a) Connectors: these discourse markers link two discourse segments semantically and pragmatically, guiding the reader to understand the relationship between them. There are three subtypes: additive, which link two elements oriented in the same direction; consecutive, which express cause-and-effect relationships between two segments; and counter-argumentative, which cancel conclusions that might be drawn from the preceding segment.

b) Information structurers: these indicate the discursive organisation of the text; they carry no argumentative meaning and concern only the structure of the writing. They are also divided into three subtypes: commentators, which introduce a comment on what has been said previously; organisers, which highlight the order of the elements of the text; and digressors, which introduce a comment that is lateral to or separate from the main line of the text.

c) Argumentative operators: these markers condition the argumentative possibilities of the segment in which they appear, in relation to the previous one. Unlike the other types, they have only two subtypes: argumentative reinforcement, which strengthens the argument put forward in contrast to other possible ones, and concretion, which introduces an example to support what has been said.

d) Reformulators: these present the following statement as a reformulation of the previous one, that is, they say the same thing in other words. They are divided into four subtypes: explanatory, which introduce an explanation of the previous statement; rectifying, which correct the element referred to or define it more appropriately; distancing, which distance the writer from the commitment made in the text; and recapitulative, which conclude or summarise what has been expressed.
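
The four types and their subtypes can be pictured as a small two-level data structure. The Python sketch below is only an illustration: the English names are renderings chosen for this example, and the helper functions and the bracketed label format are hypothetical, not taken from Dismark or from the program itself.

```python
# Main types and subtypes as described in the list above
# (conversational markers are left out, as in the tool).
TAXONOMY = {
    "connector": ["additive", "consecutive", "counter-argumentative"],
    "information structurer": ["commentator", "organiser", "digressor"],
    "argumentative operator": ["argumentative reinforcement", "concretion"],
    "reformulator": ["explanatory", "rectifying", "distancing", "recapitulative"],
}


def main_type(subtype: str) -> str:
    """Return the main type a subtype belongs to; raise KeyError if unknown."""
    for type_name, subtypes in TAXONOMY.items():
        if subtype in subtypes:
            return type_name
    raise KeyError(f"unknown subtype: {subtype}")


def format_label(subtype: str) -> str:
    """Build a bracketed label (hypothetical format) for a subtype."""
    return f"[{main_type(subtype)}: {subtype}]"


print(format_label("recapitulative"))             # [reformulator: recapitulative]
print(sum(len(s) for s in TAXONOMY.values()))     # 12 subtypes in total
```

Keeping the subtypes grouped under their main type makes it easy to report results at either level, for instance counting all connectors in a text regardless of subtype.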

The taxonomy of discourse markers used by TEXT·A·GRAM is that of Dismark: http://www.tecling.com/dismark

References


Calsamiglia, H. & Tusón, A. (1999). Las cosas del decir. Barcelona: Editorial Ariel.
Fraser, B. (1999). What are discourse markers? Journal of Pragmatics, 31, 931-952.
Martín Zorraquino, M. A. & Portolés, J. (1999). Los marcadores del discurso. In I. Bosque & V. Demonte (eds.), Gramática descriptiva de la lengua española, vol. 2. Madrid: Espasa, pp. 4051-4213.

===================================

This is another product of Group Tecling.com
There will soon be a paper describing the new version of the semantic tagger (the one we have not yet installed here but will soon):
• Nazar, R.; Renau, I. (In press). Wikipedia used as a semantic tagger: some preliminary results in Spanish. Procesamiento del Lenguaje Natural, n. 76.

These other papers describe the rest of the text analyses performed by the software:
• Nazar, R. (2024). Statistical modeling of discourse genres: the case of the opinion column in Spanish. SN Computer Science 5(959):1-11.
• Nazar, R.; Renau, I.; Robledo, H. (2024). Dismark and Text·a·Gram: Automatic identification and categorization of discourse markers in texts. In: Cecilia-Mihaela Popescu & Oana-Adriana Duță (eds.), Discourse Markers in Romance Languages. Crosslinguistic Approaches in Romance and Beyond. Berlin: Peter Lang.

This is open-source software


This is the code of the Tatatag semantic tagging module:
https://tecling.com/textagram/tatatag-source.zip

Please be aware that this version is already deprecated (March 7, 2026) and will soon be replaced by the new one (Wicacho).

Evaluation dataset


This is the evaluation of the old version (Tatatag). It will also be replaced soon. The data used for evaluation consist of Wikipedia pages, 179 in Spanish and 241 in English.
https://tecling.com/textagram/evaluationData.zip

The data are presented in HTML format. They include pages about 19th-century British politicians and Argentine generals of the same period. Files are numbered, and each one is associated with an 'evaluation' file in which the results are assessed. At first we did the evaluation ourselves, but we then opted to use the Gemini API to evaluate the results. The automatic evaluation is not perfect, but it comes close to human judgement. We are now in the process of evaluating the evaluation itself. More on this soon.
You can also browse the evaluation data here.

The modules for the analysis of discourse markers, deixis and modalization are here:
https://tecling.com/textagram/text·a·gram.zip

This code allows the program to be run locally, which makes it possible to analyze many documents at the same time. It consists of a Perl script and some CSV tables. It uses R for the graphs (the online version uses the GD Graph library instead).

Concept and development: Rogelio Nazar

Collaborators: Javier Obreque, Diego Sánchez, Hernán Robledo, Paolo Caballería, Nicolás Acosta, Scarlette Gatica, Andrea Alcaíno, Ignacio Lobos and Irene Renau.

Documentation: Andrea Alcaíno & Rogelio Nazar