ICOADS Web information page (draft) (Wednesday, 18-May-2016 19:33:26 UTC):

Appendix: Additional Details of Current ICOADS Dupelim Processing



1. Delayed-mode Dupelim processing

The Dupelim procedure used for the most recent update of ICOADS—Release 2.5 (R2.5; covering 1662-2007)—was divided into four separate steps (a-d). Additionally, an important discrete Postprocessing step (e) was applied after Dupelim (also discussed below) to convert the "intermediate" product (containing dups, flagged as such, together with other suspect reports) into the final user product. Both products are nevertheless available to users, for example so that they can study the effects of Dupelim, such as the weighting and scoring of approximate duplicates and the rules used to determine the most valuable record.

a) Preprocessing
All data inputs were checked, and corrected as needed, to ensure that they adhered to the ICOADS convention for representation of longitude (0.00°-359.99°E). Each input was then sorted into ascending order by these keys: YR, MO, DY, HR, LAT, LON, ID. All the sources were then collated together under the same sort order, thus bringing together all records with identical or approximately identical dates, positions, and IDs.
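
As a rough illustration of this step (in Python, whereas the operational processing code is Fortran), the longitude normalization, sort key, and collation might be expressed as follows; the dict-based report representation is an assumption made purely for the sketch:

    import itertools

    def normalize_lon(lon):
        """Map a longitude given as, e.g., -180..180 onto the ICOADS 0.00-359.99 E convention."""
        return lon % 360.0

    def sort_key(report):
        """Ascending sort by date/time, then position, then platform ID."""
        return (report["YR"], report["MO"], report["DY"], report["HR"],
                report["LAT"], normalize_lon(report["LON"]), report["ID"])

    def collate(*inputs):
        """Merge all sources under one sort order, so that reports with identical
        or nearly identical dates, positions, and IDs end up adjacent."""
        return sorted(itertools.chain.from_iterable(inputs), key=sort_key)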

b) Preconditioning
Preconditioning is performed in two stages: (i) report deletions (deletion of entire reports for various reasons, e.g. judged inferior to another source covering the same temporal and spatial ranges), and (ii) field modifications (corrections of systematic field errors, etc.). Another important aspect of (ii) is the calculation of important data elements that are sometimes missing, e.g. dew point temperature. Finally, one set of ICOADS quality control flags (i.e. the NCDC-QC flags) is assigned at this stage, which can serve as the basis for selecting a "best" duplicate for retention in the following step.
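
As a generic illustration of filling in a missing element such as dew point, a Magnus-type approximation from air temperature and relative humidity could be used; this is shown only as an example, and the formula actually applied in ICOADS preconditioning may differ:

    import math

    def dewpoint_from_rh(air_temp_c, rel_humidity_pct, a=17.27, b=237.7):
        """Magnus-type approximation for dew point (deg C) from air temperature
        (deg C) and relative humidity (%).  Illustrative only."""
        gamma = math.log(rel_humidity_pct / 100.0) + a * air_temp_c / (b + air_temp_c)
        return b * gamma / (a - gamma)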

c) Dupelim
The core procedure considers reports within the same 1°x1° box and within plus or minus one hour ("hour cross") or day ("day cross") as possible duplicates. It performs a check for seven weather elements—wind speed, visibility, present weather, past weather, SLP, AT, and SST—to determine the degree to which reports are duplicated ("dups").

These element checks include "allowances," under which elements (or the hour) are considered to match under some circumstances even though they are not exactly equal. No reports are eliminated at this stage; instead a dup status (DUPS) field is set to indicate the level of dup certainty, depending on a variety of factors. Later, in Postprocessing (step e), it is determined which reports advance to the pre-release record collection.
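
The following Python sketch illustrates the candidate test and element comparison just described. The 1°x1° box and plus-or-minus-one-hour window follow the text; the element tolerances ("allowances"), the DUPS levels, and the handling of date roll-over and day-crosses shown here are hypothetical simplifications, not the actual Release values:

    import math

    ELEMENTS = ("W", "VV", "WW", "W1", "SLP", "AT", "SST")   # the seven checked elements
    # Hypothetical per-element allowances; the real values differ.
    ALLOWANCE = {"W": 1.0, "VV": 1.0, "WW": 0.0, "W1": 0.0, "SLP": 0.5, "AT": 0.5, "SST": 0.5}

    def is_candidate(r1, r2, max_hour_diff=1):
        """Reports in the same 1x1 degree box and within +/- 1 hour are candidates."""
        return (math.floor(r1["LAT"]) == math.floor(r2["LAT"])
                and math.floor(r1["LON"]) == math.floor(r2["LON"])
                and abs(r1["HR"] - r2["HR"]) <= max_hour_diff)

    def dup_score(r1, r2):
        """Count how many of the seven elements agree within their allowance."""
        matches = 0
        for el in ELEMENTS:
            v1, v2 = r1.get(el), r2.get(el)
            if v1 is not None and v2 is not None and abs(v1 - v2) <= ALLOWANCE[el]:
                matches += 1
        return matches

    def flag_dups(r1, r2):
        """Set a DUPS level on one report; no report is deleted at this stage."""
        if is_candidate(r1, r2):
            score = dup_score(r1, r2)
            if score == len(ELEMENTS):
                r2["DUPS"] = 2      # hypothetical level: all elements match
            elif score >= 5:
                r2["DUPS"] = 1      # hypothetical level: near duplicate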

The quality code, as computed by the NCDC-QC (ref. Release 1, supp. J), is generally the initial basis for selecting one duplicate report over another; priority codes are used in the event that two matching reports have equal quality codes. Finally, a few special rules can be defined and applied to selected data sources, in order to shield known high-value sources from potential negative impacts of the duplicate elimination process, e.g. allowing a specific source containing relative humidity to pass through even though a relative humidity QC is not possible.
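
A minimal sketch of this decision order follows, with per-source overrides checked first, the NCDC-QC quality code next, and the source priority table as the tie-breaker. The field names (QC, SID), the priority table, and the assumption that a lower quality code indicates a better report are all illustrative choices, not the actual Release conventions:

    PROTECTED_SOURCES = set()   # e.g. sources exempted to preserve fields such as RH

    def choose_report(r1, r2, priority, protected=PROTECTED_SOURCES):
        """Pick which of two matched duplicates to retain (hypothetical sketch)."""
        if r1["SID"] in protected:          # special per-source rules take precedence
            return r1
        if r2["SID"] in protected:
            return r2
        if r1["QC"] != r2["QC"]:            # assume lower quality code = better report
            return r1 if r1["QC"] < r2["QC"] else r2
        # equal quality codes: fall back to the source priority table
        return r1 if priority[r1["SID"]] <= priority[r2["SID"]] else r2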

Because the data characteristics from the 1970s forward are somewhat simpler, the algorithm becomes more stringent starting in 1970, e.g. disallowing hour-crosses and day-crosses. Also, in very limited circumstances, "compositing" of reports can occur, i.e. copying selected data fields (e.g. platform ID information) from one report into a matching duplicate report that lacks that information.
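
A minimal sketch of such compositing, here limited to copying a missing platform ID between matched reports; which fields may be composited, and under what circumstances, is governed by the Release configuration and is not shown:

    def composite_id(kept, dup):
        """Copy the platform ID from a matching duplicate when the retained report lacks one."""
        if not kept.get("ID") and dup.get("ID"):
            kept["ID"] = dup["ID"]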

d) Postconditioning
This step subjects each marine report to another ICOADS QC procedure, called "trimming," in which field values are compared to global gridded climatologies and their quality is quantified. Again, fields are not eliminated but are assigned QC values that are used later. This is done late in the processing workflow, after all impacts from preconditioning have been completed.
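
A trimming-style check can be sketched as comparing a field value against a gridded climatology and recording a QC flag rather than discarding the field. The flag values and the 3.5-sigma threshold below are placeholders; the actual trimming limits are derived from the ICOADS climatological statistics:

    def trim_flag(value, clim_median, clim_sigma, n_sigma=3.5):
        """Return a QC flag: 0 = within the trimming limits, 1 = outside them."""
        if value is None:
            return None
        return 0 if abs(value - clim_median) <= n_sigma * clim_sigma else 1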

e) Postprocessing
Postprocessing creates an "intermediate" product that includes all the candidate records for a future release. This product includes the set of unique records, together with approximate duplicates for which a clear algorithmic decision was not possible, and landlocked reports. This is a research set of records. The formal public release is derived from this intermediate product based on QC settings, best guesses in the duplicate certainty assessment, and other rules learned from experience while developing previous releases. If unexpected anomalies are found in a Release, either the ICOADS team or other interested researchers turn to the intermediate product to gain insight and understanding. It is not unusual for the ICOADS team to make several passes through the intermediate product. A suite of Release validation tools exposes the impacts of the observing system and of new sources on each new Release.
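
A sketch of deriving the public release from the intermediate product might look like the following: keep unique reports, drop confidently flagged duplicates and landlocked reports, and apply QC-based screening. The field names and predicates stand in for the actual, more detailed Release rules:

    def select_for_release(reports):
        """Yield only the reports that advance from the intermediate product to the release."""
        for r in reports:
            if r.get("DUPS", 0) >= 2:        # confident duplicate (hypothetical level): drop
                continue
            if r.get("LANDLOCKED"):          # landlocked report: drop
                continue
            if r.get("QC_FAIL"):             # fails whatever QC screening is in effect
                continue
            yield r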

2. Real-time processing

Currently we utilize marine GTS data strictly from NOAA's National Centers for Environmental Prediction (NCEP), which have already been subjected to "dup-merge" processing. Thus no further dupelim processing is applied. However, basic QC checks are made so that the resulting "Preliminary" ICOADS data can be formatted identically to the formal Release and processed easily by interfaces and users.

3. Further documentation and software

The original COADS Release 1 (1985) Dupelim algorithms, i.e. pre-1970s and 1970s, are described in detail in:
icoads.noaa.gov/Release_1/suppK.html
Since then, for DM updates, improvements have generally been built on top of those original algorithms. For instance, we have detailed electronic documentation (e-doc) for Release 2.1 Dupelim broken out by time period, e.g.:
icoads.noaa.gov/e-doc/other/dupelim_1784
icoads.noaa.gov/e-doc/other/dupelim_1970
R2.5 Dupelim documentation is still in preparation, and not yet public.

ICOADS has not yet generally made processing codes available, except for libraries and utilities intended for common use. One of these is our "rwimma1" Fortran program (icoads.noaa.gov/software/rwimma1), which reads the International Maritime Meteorological Archive (IMMA) format in which all ICOADS marine reports are stored; also, it can be readily adjusted to write out IMMA. In addition, we have a carefully validated and referenced Fortran library (icoads.noaa.gov/software/lmrlib) to handle historical units translations (e.g. temperatures in Reaumur/Fahrenheit to Celsius) and similar adjustments (e.g. barometer corrections).
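
For illustration, a few historical unit conversions of the kind lmrlib performs are re-expressed here in Python; the validated Fortran library remains the reference, and its barometer corrections (gravity and temperature adjustments) are not shown:

    def reaumur_to_celsius(t_r):
        """1 degree Reaumur = 1.25 degrees Celsius."""
        return t_r * 1.25

    def fahrenheit_to_celsius(t_f):
        return (t_f - 32.0) * 5.0 / 9.0

    def inches_hg_to_hpa(p_inhg):
        """Unit conversion only (1 inHg = 33.8639 hPa at standard conditions)."""
        return p_inhg * 33.8639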

Looking to the future, however, we recognize the need (as resources permit) to develop software in other languages (e.g. some work done by the UK Met Office: http://brohan.org/philip/job/imma/index.html) and, as the ISTI databank plans, to make final processing codes publicly available so that they are fully open and transparent.


Document maintained by icoads@noaa.gov
Updated: May 18, 2016 19:33:26 UTC
http://icoads.noaa.gov/dupelim_appendix.html