Each file in the Text-Fabric file format (.tf) stores a column of feature values that correspond to nodes or edges in a graph; together, these files represent annotated text. In that sense, .tf can be regarded as a markup format.

Annotated Text
In the humanities, primary research data often takes the form of texts. Many of these texts are historical artefacts and a lot of knowledge is needed to interpret them. Annotations are a preferred way to represent this knowledge. They may convey detailed linguistic information at the word level, but they can also link persons, places, materials, and concepts found in the text to external descriptions.

Texts are always structured, and annotations need an addressing mechanism to target the specific portions in the text that they are about. The annotations tend to form bodies of knowledge in themselves, and need to be shared and distributed as separate entities.

Data model
Text-Fabric is a tool that facilitates this exchange of data. In order to do so, it defines a model [TF model] for annotated text. In this model, text is an annotated graph: a system of nodes and edges between nodes, where nodes and edges are linked to other information by means of features. The nodes stand for textual concepts, such as words, sentences, chapters, and the edges for relationships between these portions of text. Features are mappings from nodes or edges to values. Nodes themselves are just integer numbers, and edges are just pairs of numbers.
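
The model can be made concrete with a few plain Python structures. This is only an illustrative sketch, not the Text-Fabric API: the names `nodes`, `edges`, `letters`, `distance`, and `text_of` are hypothetical, and the "next word" relation is an assumption chosen for the example.

```python
# Nodes are plain integers; here nodes 1..4 stand for four words.
nodes = [1, 2, 3, 4]

# Edges are pairs of node numbers. As an assumed example relation,
# each word is linked to the word that follows it.
edges = [(1, 2), (2, 3), (3, 4)]

# A node feature is a mapping from nodes to values.
letters = {1: "Everything", 2: "about", 3: "us", 4: "everything"}

# An edge feature is a mapping from edges to values
# (here an illustrative distance in words between the two nodes).
distance = {(source, target): target - source for (source, target) in edges}


def text_of(node_list, feature):
    """Join the feature values of the given nodes into one string."""
    return " ".join(feature[n] for n in node_list)


print(text_of(nodes, letters))
```

Because nodes carry no information of their own, everything that is known about a portion of text lives in features such as `letters`.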

This model is very close to the Linguistic Annotation Framework (LAF) [LAF ISO standard]. The main differences are that LAF prefers to be represented in XML whereas Text-Fabric is XML-free, and that a LAF dataset may reside in a single file or in separate files, at the choice of the corpus designer, whereas a Text-Fabric dataset always stores a single feature in a single file.

Node features
A node feature is a mapping from numbers to values: a column of values, where the position in the column corresponds to the number of the node.

Edge features
An edge feature can be seen as a mapping from nodes to other nodes, where a value may be supplied for each connection. Edge features are also columns of values, where the position in the column corresponds to the number of the node where the edges start.
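
The column view of an edge feature can be sketched as follows. This is an in-memory illustration under stated assumptions, not the on-disk .tf syntax: node numbers are assumed to start at 1, and the names `edge_column`, `edges_from`, and `as_pairs`, as well as the similarity values, are hypothetical.

```python
# Position i (1-based) holds the edges that start at node i,
# each as a (target node, value) pair. The values are made-up
# similarity scores used only for illustration.
edge_column = [
    [(4, 100), (7, 100)],  # edges starting at node 1
    [],                    # node 2 has no outgoing edges
    [(6, 100)],            # edges starting at node 3
]


def edges_from(column, node):
    """Return the (target, value) pairs of the edges that start at `node`."""
    return column[node - 1]


def as_pairs(column):
    """Flatten the column into a mapping from (source, target) pairs to values."""
    return {
        (source, target): value
        for source, entries in enumerate(column, start=1)
        for target, value in entries
    }


print(edges_from(edge_column, 1))
print(as_pairs(edge_column))
```

Indexing by the start node keeps all outgoing edges of a node together, so looking them up is a single positional access.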

File format
Text-Fabric defines an efficient way to store features in files [TF file format]. Each feature occupies a single file. A Text-Fabric dataset is just a flat collection of feature files.

Feature files are plain text files in Unicode, using the UTF-8 encoding. A file begins with a number of lines starting with @. These are metadata lines and can contain arbitrary key-value pairs. Then there is a blank line. After that, all lines are data lines. In a node feature file, the content of a data line is the value of the feature for the node corresponding to the line number. Here is a partial feature file from the example corpus banks [TF example].

@node
@compiler=Dirk Roorda
@description=the letters of a word
@name=Culture quotes from Iain Banks
@purpose=exposition
@source=Good Reads
@status=with for similarities in a separate module
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@valueType=str
@version=0.2
@writtenBy=Text-Fabric
@dateWritten=2019-06-03T07:30:41Z

Everything
about
us
everything
around
us
everything
we
know
and
can
know
of
is
composed
ultimately
of
patterns
of
nothing

These are the basics. There are also some optimizations [TF optimizations] that deal with sparse features: features that assign values to only a subset of the nodes.
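
The basic layout described above (metadata lines, a blank line, then one value per line) can be read with a few lines of Python. This is a minimal sketch, not the Text-Fabric library: the function name `parse_feature` is hypothetical, node numbering is assumed to start at 1, and the sketch deliberately ignores the sparse-feature optimizations, edge features, and any escaping of special characters.

```python
def parse_feature(text):
    """Split a basic .tf node feature file into (metadata, values)."""
    lines = text.split("\n")
    metadata = {}
    i = 0

    # Metadata: the leading lines that start with "@".
    while i < len(lines) and lines[i].startswith("@"):
        key, _, value = lines[i][1:].partition("=")
        metadata[key] = value  # a bare marker such as "@node" maps to ""
        i += 1

    # Skip the blank line that separates metadata from data.
    if i < len(lines) and lines[i] == "":
        i += 1

    data = lines[i:]
    # A final newline leaves one empty string at the end; drop it.
    if data and data[-1] == "":
        data.pop()

    # The n-th data line is the value for node n (assumed 1-based).
    values = {node: value for node, value in enumerate(data, start=1)}
    return metadata, values


sample = "@node\n@valueType=str\n\nEverything\nabout\nus\n"
metadata, values = parse_feature(sample)
print(metadata)
print(values)
```

To read a file from disk, its contents would be opened with `encoding="utf-8"` and passed to the function as a single string.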

Extension
Feature files typically have the extension .tf.

Tools
Text-Fabric is also a library [TF API] with which you can process text and annotations. It understands the .tf file format and offers an API to load and save feature files and to compute with the data contained in them. Text-Fabric compiles .tf files into binary .tfx files, which are optimised to load very fast. These .tfx files are merely a convenience: they are not suitable for archiving and should not be considered a preferred or even acceptable format, because they depend on the computer on which they were generated.

Text-Fabric is by no means required to make sense of .tf files. The format is so transparent that several users have bypassed the Text-Fabric tool and written their own programs (in languages other than Python) to ingest .tf files.

Corpora
A number of corpora [TF Corpora] have already been converted to Text-Fabric, such as the Hebrew Bible, various cuneiform tablet collections, the Quran, and more. For all these corpora there are dedicated tutorials [TF tutorials] that demonstrate the working practices that Text-Fabric supports.

References
TF model: Model – Text-Fabric (archived version)

TF file format: Format – Text-Fabric (archived version)

TF optimizations: Optimizations – Text-Fabric (archived version)

TF example: Banks: convert.ipynb (archived version)

TF API: TF – Text-Fabric (archived version)

TF Corpora: Corpora – Text-Fabric (archived version)

TF tutorials: Tutorials (archived version)

LAF ISO standard: ISO 24612:2012 – Linguistic annotation framework (LAF)

Text-Fabric is a preferred format for file type Programming languages.