The following are just some preliminary, informal hints.
Rational choices in the electronic representation of data presuppose their analysis in terms of criteria such as the following:
One major distinction between kinds of data is between digital and analog data. Digital data encode everything they represent as numbers (inside the computer, as binary digits); analog data bear an iconic resemblance to the phenomena they represent. Digital computers can only process digital data (because a bit is either set or not set, but not 37% set). Analog data therefore have to be digitized before they enter the digital computer (and they may be converted back to analog form when they leave it in order to be perceived).
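The step from analog to digital data and back can be sketched as follows. This is only an illustration under stated assumptions: the "analog" signal is a sampled sine wave, the 256 levels (8 bits) are an arbitrary choice, and the function names `digitize` and `reanalogize` are hypothetical.

```python
import math

def digitize(signal, levels=256):
    """Map analog values in [-1.0, 1.0] to integer codes in [0, levels - 1]."""
    codes = []
    for value in signal:
        clipped = max(-1.0, min(1.0, value))          # keep value in range
        codes.append(round((clipped + 1.0) / 2.0 * (levels - 1)))
    return codes

def reanalogize(codes, levels=256):
    """Map integer codes back to approximate analog values in [-1.0, 1.0]."""
    return [code / (levels - 1) * 2.0 - 1.0 for code in codes]

# Assumed input: one period of a sine wave, sampled at 8 points
samples = [math.sin(2 * math.pi * n / 8) for n in range(8)]
codes = digitize(samples)        # integers only: this is what the computer stores
restored = reanalogize(codes)    # close to, but not identical with, the samples
```

The restored values differ slightly from the original samples: the information lost by mapping a continuum onto a finite set of numbers cannot be recovered.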
When talking of computerized (digital) data, we will nevertheless distinguish between analog and symbolic data, according to their function for the user:
The technical counterparts to these various kinds of data are data types. These are specific constellations and interpretations of sequences of bits and bytes. Here are some examples of data types used at the beginning of the 3rd millennium:
The above analog data types are, at the same time, file types, which reflects the fact that the internal structure of such files is generally opaque to the user.
Symbolic data can always be represented as text; analog data cannot. That means that the user can, within certain limits, choose the data type for his symbolic data. More on this below.
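That a data type is an interpretation of a sequence of bits and bytes can be illustrated with a minimal sketch. The four bytes chosen here are arbitrary, and the big-endian byte order is an assumption of the example.

```python
import struct

# The same four bytes, read under four different data types
raw = bytes([0x41, 0x42, 0x43, 0x44])

as_text = raw.decode("ascii")              # four characters: 'ABCD'
as_uint32 = struct.unpack(">I", raw)[0]    # one unsigned 32-bit integer
as_float32 = struct.unpack(">f", raw)[0]   # one 32-bit floating-point number
as_small_ints = list(raw)                  # four 8-bit integers

print(as_text, as_uint32, as_float32, as_small_ints)
```

Nothing in the bytes themselves says which of these readings is intended; the data type is imposed by the program that processes them.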
Since the computer only processes bits and bytes, any structure of the data is defined by the user (esp., the programmer). Aspects of the structure concern data types and their configuration. Such a configuration may consist of data of the same or of different types. It may become complex by nesting at several levels. Here are some examples:
Alphanumeric data can, in principle, be treated as text (in a string). For instance, a date, instead of being stored as a long real number, may be stored as a text string of the form ‘23/04/2006’. Choice of the data type for storing a piece of data depends on the purpose it is meant to serve and on the kinds of operations that one wishes to execute on the data:
Symbolic data are represented formally if the file has a technical structure that corresponds to aspects of the logical structure of the data. If the file has no such technical structure, it is an informal representation of data.
If one assigns a piece of symbolic data to its specific data type (e.g. character, long integer, real, date, boolean etc.), one has chosen a formal representation. The software then guarantees that the data will be processed in a consistent way. For instance, it guarantees that a date will not contain a month figure above 12. If one assigns a piece of symbolic data the data type ‘text string’, that is so far an informal representation. Such a piece of data can become inconsistent and may be operated upon in inappropriate ways by human interference. In scientific contexts, it is therefore advisable to bestow a formal representation on one's data.
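The difference between storing a date as a text string and storing it under a date type may be sketched as follows. The helper `parse_date` is hypothetical; the format `dd/mm/yyyy` follows the example ‘23/04/2006’ above.

```python
from datetime import date, datetime

def parse_date(text):
    """Convert a 'dd/mm/yyyy' string to a date value; reject impossible dates."""
    return datetime.strptime(text, "%d/%m/%Y").date()

# Informal representation: a text string accepts anything, even month 13
informal = "23/13/2006"

# Formal representation: the date type checks consistency
formal = parse_date("23/04/2006")

try:
    parse_date(informal)
    accepted = True
except ValueError:
    accepted = False      # month 13 is refused by the date type

# Operations appropriate to dates become available only on the typed value
days_since_new_year = (formal - date(2006, 1, 1)).days
```

The string ‘23/13/2006’ can be stored without complaint, whereas the attempt to convert it to a date value fails. Conversely, computing an interval in days is possible only on the typed value.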
Storing data in a text file of some word processor (e.g. an OpenOffice text file or an MS Word file) means opting for an informal representation. The advantage is easy access to the data for unsophisticated users (those who use the computer for games, internet browsing and, at most, as a typewriter). However, even this advantage has narrow limits. For instance, one may store a bilingual glossary (see above for its logical structure) in a text file by using a tabulator between the lemma and its gloss, like this:
integer	ganze Zahl
One way of accessing such a glossary is to search for an entry by its English lemma. If, however, the search string happens to appear on the German side, too (like ‘date’ in the example), then the word processor cannot tell the two columns apart and highlights the German word (Datei). Still less can the word processor sort such a list alphabetically or transform it into the converse (German-English) glossary. Therefore this method is inadvisable (no matter how many glossaries in the third millennium AD still have such a form).
A minimum of formal structure would be to arrange the glossary in a two-column table, like this:
Given such a structure, even a (contemporary) word processor can perform the user's operations mentioned above.
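What the two-column structure buys can be sketched in a few lines. Only the entry ‘integer / ganze Zahl’ is taken from the example above; the entries ‘file / Datei’ and ‘date / Datum’ are assumptions added for illustration, and the function name `parse_glossary` is hypothetical.

```python
# Glossary as tab-separated text; the second and third entries are assumed
glossary_text = "integer\tganze Zahl\nfile\tDatei\ndate\tDatum\n"

def parse_glossary(text):
    """Split each line at the first tab into a (lemma, gloss) pair."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        lemma, gloss = line.split("\t", 1)
        entries.append((lemma.strip(), gloss.strip()))
    return entries

entries = parse_glossary(glossary_text)

# Unstructured search: matches the string anywhere in the line
naive_hits = [ln for ln in glossary_text.splitlines() if "date" in ln.lower()]

# Column-aware search: matches the English lemma only
lemma_hits = [(l, g) for l, g in entries if l == "date"]

# Sorting by lemma and building the converse (German-English) glossary
sorted_entries = sorted(entries, key=lambda e: e[0].casefold())
converse = sorted(((g, l) for l, g in entries), key=lambda e: e[0].casefold())
```

The unstructured search also hits the line containing ‘Datei’, while the column-aware search returns only the entry whose lemma is ‘date’.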
Naturally, the amount of formal structuring that one bestows on one's data also depends on their quantity. If one's data are limited to one Shakespearean sonnet, then surely the most efficient way of guaranteeing flexible data retrieval is to learn it by heart. However, if the data exceed a certain size, it becomes worthwhile to store them in a formal structure. For instance, if the above glossary grows into a dictionary of several thousand entries, its appropriate place is in a database.
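A minimal sketch of such a database, using the SQLite engine that ships with Python, might look as follows. The table and column names (`glossary`, `lemma`, `gloss`) and the sample entries other than ‘integer / ganze Zahl’ are assumptions made for the example.

```python
import sqlite3

# An in-memory database; a real dictionary would live in a file
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE glossary (lemma TEXT NOT NULL, gloss TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_lemma ON glossary (lemma)")  # fast lookup by lemma
conn.executemany(
    "INSERT INTO glossary (lemma, gloss) VALUES (?, ?)",
    [("integer", "ganze Zahl"), ("file", "Datei"), ("date", "Datum")],
)

def lookup(lemma):
    """Return all glosses stored for an English lemma."""
    rows = conn.execute(
        "SELECT gloss FROM glossary WHERE lemma = ? ORDER BY gloss", (lemma,)
    ).fetchall()
    return [row[0] for row in rows]

def converse():
    """Return the German-English glossary, ordered by the German word."""
    return conn.execute(
        "SELECT gloss, lemma FROM glossary ORDER BY gloss COLLATE NOCASE"
    ).fetchall()
```

Lookup, sorting and inversion are here delegated to the database engine, which is designed to handle them for any number of entries.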
There used to be a sharp distinction between a text file and a database (file); and it reflected, at the technical level, the logical distinction between informal and formal representation. This, however, has changed drastically since markup languages emerged. These are formal languages which mark the logical structure of text files. Their marks (“tags”) are interspersed with the text data themselves. The effect is that the file is, technically speaking, a text file (it may, in fact, even be a pure ASCII file), but at the same time it may have a formal structure similar to a database.¹ Markup languages (like SGML, XML etc.) are not treated here (there is sufficient information on the internet). Since they are some decades younger than databases and, as of 2011, still under development, such text files still demand more programming effort from their users and are slower to process than database files; but that may change soon.
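How tags give a plain text file a formal structure can be illustrated with a small XML sketch. The element names `glossary`, `entry`, `lemma` and `gloss` are hypothetical, as are all entries except ‘integer / ganze Zahl’.

```python
import xml.etree.ElementTree as ET

# A pure ASCII text file whose tags mark the logical structure of the data
xml_text = """<glossary>
  <entry><lemma>integer</lemma><gloss>ganze Zahl</gloss></entry>
  <entry><lemma>file</lemma><gloss>Datei</gloss></entry>
  <entry><lemma>date</lemma><gloss>Datum</gloss></entry>
</glossary>"""

root = ET.fromstring(xml_text)

# The tags allow the program to address lemma and gloss separately
entries = [(e.findtext("lemma"), e.findtext("gloss")) for e in root.iter("entry")]
lemma_hits = [pair for pair in entries if pair[0] == "date"]
```

The file can be opened in any text editor, yet a program can tell lemma from gloss just as it could in a database table.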
The distinction between a database file and a text file may still be helpful in distinguishing two principal kinds of textual data:
By ‘chunks of linguistic data’ are meant such elements as morphs, words, syllables, proverbs etc., deprived of any context. They are easily stored in a database.
Running text, as in a sentence, paragraph, chapter, book or collection, differs from chunks of linguistic data by one property that is essential from the point of view of computer engineering: it does not have a fixed length and does not consist of a fixed number of elements, but is merely sequential in structure. It is precisely for this reason that the file type ‘text file’ was invented. There are ways of fitting a running text into a database; but that remains a makeshift.
Nevertheless, taking up the distinction between standard database and free-field-structure database made in the section on databases, storing a running text in a free-field-structure database is, for many linguistic purposes, an appropriate solution. Here is one way of doing it. It is, again, oriented towards use of the program Toolbox (SIL). Implementing it as an XML file would be technically equivalent.²
As just said, this is a makeshift. On the one hand, the horizontal structure of the text is dismembered and represented only in the consecutive numbering of the records. If the record IDs got lost, the file would reduce to a collection of chunks of linguistic data. On the other hand, much of the information assembled in a record of this kind has a vertical structure in the sense that, e.g., the interlinear gloss of a morph refers to a certain morph contained in another field of the same record. This kind of vertical alignment in what is essentially a text file is not easily processed algorithmically; Toolbox provides some rudimentary help.
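The vertical alignment between fields of one record can be sketched as follows. Everything in this example is an assumption made for illustration: the field markers `\ref`, `\mb` and `\ge`, the sample record, and the simplification that morphs and glosses are paired by their order in the line rather than by column position.

```python
# A hypothetical record in backslash-marker format (markers are assumed)
record = r"""\ref 001
\mb  dog -s  bark
\ge  dog -PL bark"""

def parse_record(text):
    """Collect the fields of one record into a dict keyed by marker."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            fields[marker] = value.strip()
    return fields

def align(fields, upper="mb", lower="ge"):
    """Pair each morph with its gloss by position in the two fields."""
    morphs = fields[upper].split()
    glosses = fields[lower].split()
    if len(morphs) != len(glosses):
        raise ValueError("morph and gloss lines do not match in length")
    return list(zip(morphs, glosses))

fields = parse_record(record)
pairs = align(fields)
```

The pairing rests entirely on the two fields having the same number of elements in the same order. If one element is inserted or deleted in only one of the fields, the alignment breaks, which is one reason why this kind of structure is hard to process reliably.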
1 The essential difference is that the formal structure of a markup language file is a hierarchy (a meronymy), while the formal structure of a database generally is not.
2 One of the export formats of Toolbox 1.5 is, in fact, an XML file.