Making Dictionary Files for Stata
A dictionary file is basically a template or set of instructions that tells Stata how to read your data. You can write it in any text editor (e.g., TextEdit) and then save the file with a .dct extension so that Stata knows it's a dictionary file.
What you do next depends on your data. However, there will be a number of things you need to know in order to tell Stata "what's what." The first line in a dictionary file will tell Stata where to find the data. This will take the form of dictionary using [nameoffile].[extension on file] {
.raw is the usual extension for a raw data file. (.dta is Stata's own dataset format, which does not need a dictionary.)
Next, you explain your variables to Stata. This description covers both how the data is laid out now and how you want it to look once Stata converts it to a Stata-type data file. Each variable gets its own line. These lines start with _column( ), with the number of the column in which the data starts inside the parentheses, followed by the storage type, the variable name, and a format giving the number of characters to read. Below are three types of examples. While they differ slightly from the general form just described, they all follow the same idea. In every case, the dictionary file must end with a line break, so after you type the final brace (}) press Enter to move the cursor to a new blank line. Otherwise, Stata will not be able to read your dictionary and you will instead see an "unexpected end of file" error in the Results window.
1. Basic data might look very much like this:
123456
232456
323456
While this is fairly ugly to the human eye, it is actually very simple to write a dictionary file for. Let's pretend this data is saved as sample.raw. My dictionary file would have the following content:
dictionary using sample.raw {
_column(1) byte var1 %1f
_column(2) byte condit %2f
_column(4) int last3 %3f
}
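Once the dictionary is saved, it still has to be handed to Stata. A minimal sketch, assuming the dictionary above was saved as sample.dct (a hypothetical name) in the same working directory as sample.raw:

```stata
* Assumes sample.dct (hypothetical name) and sample.raw are in the current working directory
clear
* Read sample.raw according to the instructions in the dictionary
infile using sample.dct
* Show the result to check that the columns were split as intended
list
* Optionally keep the result as a Stata dataset (sample.dta is also a hypothetical name)
save sample.dta, replace
```

With the three sample lines above, var1 should come out as 1, 2, 3, condit as 23, 32, 23, and last3 as 456 in every row.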
2. The second way raw data might be organized is a way that makes a lot more sense to the human eye, but, somewhat ironically, a lot less sense to the program. It is set up in columns and might look like this:
Arkansas 10 34 3
Texas 12 31 2
Minnesota 11 29 2
Washington 15 33 1
Since the above example uses tabs to break the columns, no dictionary file is actually necessary. Instead, you can simply type infile str10 state prod satis rank using columns.raw
This tells Stata that the first variable is a string (a nonnumeric value) of at most ten characters that should be called "state", and that it is followed by three numeric variables named prod, satis, and rank, respectively. As long as the columns are separated by whitespace (such as tabs or spaces) and no value contains a space itself, you do not need to create a dictionary file.
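As a rough sketch of how that command might be run and checked, assuming the four-line table above is saved as columns.raw in the current working directory:

```stata
* Assumes the table above is saved as columns.raw in the working directory
clear
* Free-format read: one string variable followed by three numeric variables
infile str10 state prod satis rank using columns.raw
* Check the storage types and the values that were read
describe
list
```

Because prod, satis, and rank are given no storage type, Stata stores them with its default numeric type (float, unless that default has been changed).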
3. The last, and perhaps oddest, case involves data that is split up over multiple lines. For example, maybe you have a list of addresses but they're formatted in the style of address labels rather than databases. In this case the data might look something like the following:
Albert Johnson
bloodtype: C status: d
contact: 206-434-5065
Jenny Fiskens
bloodtype: A status: n
contact: 504-322-4056
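Once more, this is data that a human can easily read and understand. For a program like Stata, however, there is a lot to explain before it will understand what is happening. The two big differences here are that each record is spread over multiple lines and that there are a lot of labels that don't need to be read. While this might look complicated, it's not really so bad. The _newline directive tells Stata to move to the next line of the same record, and _column( ) lets you jump past the labels to where each value starts. Counting from 1 and assuming a single space after each colon, the blood type sits in column 12, the status in column 22, and the phone number starts in column 10:
dictionary using medical.data {
_column(1) str15 name %15s
_newline
_column(12) str1 blood %1s
_column(22) str1 status %1s
_newline
_column(10) str12 phone %12s
}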
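The same multi-line layout can also be described with Stata's _lines( ) and _line( ) directives, which state how many lines make up one record and which line each variable lives on. This is only a sketch of an alternative, not the form used above, and its column positions assume a single space after each colon in the sample data (counting from 1, that puts the blood type in column 12, the status in column 22, and the phone number in column 10):

```stata
dictionary using medical.data {
* Each record occupies three lines of the raw file
    _lines(3)
* Line 1: the name starts in column 1
    _line(1)
        _column(1)  str15 name   %15s
* Line 2: skip the "bloodtype:" and "status:" labels
    _line(2)
        _column(12) str1  blood  %1s
        _column(22) str1  status %1s
* Line 3: skip the "contact:" label
    _line(3)
        _column(10) str12 phone  %12s
}
```

Saved as, say, medical.dct (a hypothetical name), it would be loaded by typing infile using medical.dct and should yield one observation per person.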