Bifid - Parallel corpus alignment
Update July 21, 2024: We are updating the server
We are carrying out some long-awaited maintenance on the hardware hosting this
website. Please be patient; we should finish in a few minutes.
Bifid is a program that takes a set of documents with their translations
and performs different functions:
- It separates the set of documents into the two languages
- It aligns each document with its translation
- It aligns the sentences in each pair of documents
- It extracts a bilingual vocabulary from the aligned sentences
- It exports results in csv and tmx formats
- It imports tmx documents, in case your corpus is already aligned
at the sentence level and you only want to obtain a bilingual vocabulary
- The bilingual vocabulary includes multi-word expressions.
Give it a try:
Here is a nice little parallel corpus in English
and Spanish extracted from
Revista Chilena de Neuropsiquiatría.
Download the zip file and then upload it
to your account.
You can also upload a tmx file if you already have one,
and in this way bypass the document and sentence alignment.
Here is an example file from
Opus corpus:
emea.tmx.zip (warning: this is a large file
and it takes time to process).
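If you are curious about what a tmx file contains, the sketch below reads one with Python's standard library and returns the aligned sentence pairs. It is only an illustration of the file format, not Bifid's own code: the function name read_tmx_pairs, the sample text and the default language codes are made up for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src="en", tgt="es"):
    """Return a list of (source, target) sentence pairs found in TMX markup.

    Hypothetical helper for illustration; it is not part of Bifid.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):  # one <tu> is one translation unit
        segments = {}
        for tuv in tu.iter("tuv"):  # one <tuv> per language variant
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary language code, e.g. "en" from "en-GB".
                segments[lang[:2]] = "".join(seg.itertext()).strip()
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


# A tiny invented sample; real files such as emea.tmx are much larger.
SAMPLE = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="example"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The patient sleeps.</seg></tuv>
      <tuv xml:lang="es"><seg>El paciente duerme.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Only in English.</seg></tuv>
    </tu>
  </body>
</tmx>"""

if __name__ == "__main__":
    for source, target in read_tmx_pairs(SAMPLE):
        print(source, "|", target)
```

Translation units that lack one of the two languages are skipped, since they cannot contribute to a bilingual vocabulary.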
Lastly, if you want to try with a different pair of languages, here is
a subset of the Canadian Hansards, in English and French.
Bifid has been online since 2004 (yes, it is about to turn 20),
but lately its server had gone down and the site was neglected.
But here it is again, restored to its former glory!
We are planning some kind of celebration for its 20th birthday (no one
remembers the actual date, so we will celebrate all year).
We will post updates on this soon.
Some (old) publications on the project:
Nazar, R. (2011). Parallel corpus alignment at the document, sentence and vocabulary levels.
Procesamiento del Lenguaje Natural, n. 47.
Nazar, R. (2012). Bifid: un alineador de corpus paralelo a nivel de documento, oración y vocabulario.
Linguamatica, vol. 4, no. 2.
If you have questions, feel free to send an email to: rogelio dot nazar at pucv dot cl