State of Affairs
Here, we present a short overview of activities since the start of the project on 1 March 2003. This is done separately for each project.
A. Life Courses
B. Census data
A. Life courses (HSN)
The database of the Historical Sample of the Population of the Netherlands (HSN) covers the entire country and will contain micro-level data on the life courses of over 40,000 individuals born between 1863 and 1922. These life courses are to include data on each successive family situation in which the individuals lived, all the addresses they lived at, as well as data on the religion and occupational title of each subject and of every person with whom they co-resided (and, for married subjects, data on the occupational title and place of residence of family members of the subject's spouse). The project will be carried out during the period March 2003 - February 2008.
Organization and personnel
In the months before 1 March 2003 the whole organization of the HSN was discussed, and it was decided to draw a sharper distinction between the different parts of the work. The scope of the new project made it possible to distinguish between A) collecting the data in the region of birth of the Research Person (RP); B) evaluating and coding the incoming copies of population registers, including the mailing process necessary to obtain information about RPs who migrated out of the region of birth; C) entering the data into the database; and D) semi-automatically checking the data for errors and inconsistencies, correcting them where necessary, and finally creating data output.
In the period January-February 2003 the data-collection process was started. First, we appointed personnel in all parts of the country to start gathering data. Second, we organized the handling of the incoming copies of population registers, among other things by developing software to manage the routing of the material and to collect management information on the progress of the data stream. This program, called 'Management and mail', became operational in July. Two data typists were contracted to start data entry; by 1 October this number had increased to seven (all part-time). Data entry uses a new data entry program that had already been implemented in 2002. This program was revised to fix some bugs and to refine the entry of some variables (HSN 4.04). In June 2006, a renewed version was launched: HSN 4.05.
At the end of 2002 the HSN organization numbered about 25 employees, of whom about 80% were financed by way of unemployment schemes. At the moment, this number has grown to almost 40 persons, of whom about 50% are financed by unemployment schemes. Most of the extra 25 staff have been hired on a part-time basis in order to spread data collectors as widely as possible across the country and to reduce the risk of RSI for data entry typists. On a full-time basis the HSN numbers about 25 employees.
The work is directed by a staff of five: a general manager, two managers of the archival work, a database-manager, and an office-manager.
Situation of the data-set on 1 September 2006
To get some substantial intermediate results the project has been subdivided into three parts:
The production process of collecting and entering data is divided into four parts:
B. Digitization of Dutch Population Censuses
The digitisation of the censuses started in mid-March 2003.
The project proposal distinguishes the following groups of activities:
Ad 1. Preparatory work
The selection of the materials to be digitised and their pre-processing, including microfilming, were already carried out in an earlier project and in the preparation of this proposal.
Ad 2. Technical research table recognition
This activity was carried out by two digitization specialists of DANS, who drew partly on earlier work carried out in 1997-1998, when the census of 1899 was digitized. Several commercial OCR packages were compared: FineReader (ABBYY Bit Software Inc.), Typereader (ExperVision), Omnipage Pro (Caere) and Readiris Pro 8 (I.R.I.S.). The best results in recognising tables were obtained with FineReader (6.0 corporate edition) and Omnipage Pro.
All tested software works better on small tables than on large ones. If a table spans two pages, a shift in the horizontal rows can cause problems. Spots are sometimes recognised as commas. On badly printed pages, 6s are sometimes recognised as 0s. On the basis of this research it was decided not to pursue automatic structure analysis of the tables any further.
For the textual parts of the censuses (introductions, appendixes, etc.) experiments with PDF-files containing the original images and OCR'd text in the background gave satisfactory results.
The application of OCR will be restricted to pages consisting mainly of text, because using OCR for the large tables proved not to be cost-effective. It is not yet certain whether all small tables in the introductions can be processed.
Ad 3. Researching/preparing data storage
The data entry is being carried out on workstations with two screens. One screen displays the image of a table, the other a prepared Excel worksheet. For every table an Excel worksheet is prepared in which the structure of the printed table is reproduced as closely as possible. For each table an instruction is drawn up on the particularities of the data entry. The amount of keying is reduced where possible by using Excel formulas, copying and macros. Formulas for totals and percentages are used to check for data entry mistakes during data input. Data-entry instructions have been made for all tables for the years 1869, 1879, 1889 and 1919.
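The total checks mentioned above are done in the project with Excel formulas. Purely as an illustration of the idea, the following Python sketch compares keyed cell values with printed row and column totals; a row and a column that both fail the check point to the cell most likely to contain a keying mistake. The function name, the table layout and the sample figures are hypothetical and are not taken from the project's worksheets.

```python
from typing import List, Tuple


def find_total_mismatches(
    cells: List[List[int]],
    printed_row_totals: List[int],
    printed_col_totals: List[int],
) -> Tuple[List[int], List[int]]:
    """Return the indices of rows and columns whose keyed sum
    differs from the total printed in the source."""
    bad_rows = [
        i for i, row in enumerate(cells)
        if sum(row) != printed_row_totals[i]
    ]
    bad_cols = [
        j for j in range(len(printed_col_totals))
        if sum(row[j] for row in cells) != printed_col_totals[j]
    ]
    return bad_rows, bad_cols


# Hypothetical 3x3 table as keyed by a typist.
keyed = [
    [120, 45, 30],
    [80, 60, 25],   # assume the source prints 66 where 60 was keyed
    [200, 10, 5],
]
printed_rows = [195, 171, 215]  # row totals as printed (hypothetical)
printed_cols = [400, 121, 60]   # column totals as printed (hypothetical)

bad_rows, bad_cols = find_total_mismatches(keyed, printed_rows, printed_cols)
print("Rows to re-check:", bad_rows)
print("Columns to re-check:", bad_cols)
```

A single mistyped cell shows up as exactly one failing row and one failing column. If only a row total or only a column total fails, the discrepancy may instead lie in the printed total itself.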
Ad 6. Data-entry and OCR
All years except 1889 are being typed in by seven data-entry workers at DANS. The maximum number of working hours per employee per week on this job is 24. The data-entry work for the 1889 census, which is the largest census in terms of published pages, is out-sourced to a specialised company.
Progress of the data entry is tracked in a simple system and in fortnightly work meetings. The data entry of the census of 1919 is finished, 1869 is almost finished and 1879 is well under way. The data entry speed varies considerably, between 30 minutes and 2 hours per image (two pages of a table). The speed mainly depends on the complexity of the table structure and on the number of empty cells. Data entry statistics show that the available budget should be just enough for the work to be finished.
Ad 8. Post processing
Part of the checking and correction takes place during data entry by comparing calculated totals in the spreadsheets with printed totals in the sources. This method is very useful for finding data entry errors and mistakes in the sources. A special meeting was organised to discuss options for annotating and/or correcting source errors.
Because the structures of the tables vary and because the data entry for the digitisation of the censuses has taken place since 1997 in three projects and by different institutions and companies, the possibilities for error checking are not uniform throughout. Tables can have row totals, column totals, subtotals and other calculated numbers such as percentages and ratios. Totals are not always based on all rows or columns. The row and column headings consist of hierarchically subdivided texts that can only be checked visually. The small tables in the introductions offer additional information that could be used for checking purposes. Sometimes, the information in two or more tables overlaps, so that additional checking is possible. Finally, printed errata are published for several tables and years.
Experience, including from earlier projects, shows that thorough checking and correction are very labour-intensive and difficult to plan ahead. Therefore, the principles of checking and correction have been formulated cautiously as follows:
- Data-entry mistakes are corrected during data entry on the basis of the comparison of printed with calculated totals.
- Printed errata are processed in the digital tables.
- Source errors are recorded in the total cells. By default, the automatically calculated total is included in the table, and the printed total is mentioned in a comment. In this way, the consistency of the data is enhanced. This working method is preferred because many source errors appear to be calculation mistakes.
- When time permits, deviations of printed totals from calculated totals larger than 1000 are inspected more closely.
- No checking takes place involving other tables.
- All checks and corrections are documented.
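The principles above for handling totals can be illustrated with a short Python sketch. It assumes that data-entry mistakes have already been corrected, so that any remaining difference between the printed and the calculated total is a source error. The names `TotalCell` and `record_total` and the wording of the comment are hypothetical; only the threshold of 1000 comes from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Deviations larger than this are inspected more closely when time permits.
INSPECTION_THRESHOLD = 1000


@dataclass
class TotalCell:
    value: int               # calculated total, stored in the table
    comment: Optional[str]   # printed total, noted only when it differs
    needs_inspection: bool   # True if the deviation exceeds the threshold


def record_total(components: List[int], printed_total: int) -> TotalCell:
    """Store the calculated total; keep a differing printed total as a comment."""
    calculated = sum(components)
    deviation = abs(printed_total - calculated)
    if deviation == 0:
        return TotalCell(calculated, None, False)
    comment = f"Printed total in source: {printed_total}"
    return TotalCell(calculated, comment, deviation > INSPECTION_THRESHOLD)


# Hypothetical figures: the source prints 9000 where the parts sum to 7000.
cell = record_total([1500, 2500, 3000], 9000)
print(cell)
```

Storing the calculated value keeps every table internally consistent, while the comment preserves what the source actually prints, so the decision remains documented and reversible.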
Ad 9. Implementing retrieval/access
An initial design has been made for a website on which the data of the censuses will be published. The website makes use of an open-source content management system developed by DANS. The design contains a number of interactive elements (possibility of comments, discussions, additions, suggestions, links, etc.), which will gradually give shape to an evolving collaboratory. Work files of censuses for which the data entry is ready will be made accessible as soon as they become available. The website can be seen at www.volkstellingen.nl/en/.