Friday 17 June 2011

Names for MD subsets

Principles for clear communication are:
(i) call each concept by only one name
(ii) use a name that closely describes the concept

In our multisensor matchup datasets (MMDs), data will be flagged as belonging to one of four subsets. We must agree team-wide names for each of these and use them consistently. They are:

(i) 'training set': this is the subset to be used for determining the coefficients in empirically derived algorithms, for any other form of algorithm tuning, and/or as a training dataset in a supervised neural net optimisation. The training set will therefore be made available with validation (in situ) values to all algorithm developers during the algorithm selection process (the Round Robin)

(ii) 'test set': this is the subset allowing algorithm developers to get a (statistically) independent assessment of an algorithm based on the training set -- it is "reserved data" for algorithm development, with validation values included. Ideally, a developer will use it once when the algorithm tuning is done; although if the test set performance is perceived as poor, it is recognised that this might prompt another cycle of training.

(iii) 'selection set': this set is distributed to algorithm developers without validation values, and with fields sufficient only to derive SSTs, SST uncertainties and SST sensitivity for each "blind" matchup.

(iv) 'validation set': will be used exclusively for product validation after prototyping -- in the statement of work, it is the set referred to as "reference data". No algorithm developers, including the SST CCI EO team, will have access to this subset of the MMD prior to product generation. The product validation will be done by team members not involved in algorithm specification. Thus the product validation will be fully independent, in terms of both data and personnel.
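The four-subset scheme above can be sketched in code. This is an illustrative sketch only: the flag values, the even round-robin split, and the field names (`matchup_id`, `in_situ_sst`, `bt_11um`) are assumptions for the example, not the project's actual convention, and a real split would be stratified and in agreed proportions.

```python
# Names for the four MMD subsets, used consistently as flags.
SUBSETS = ("training", "test", "selection", "validation")

# Subsets distributed to developers WITH validation (in situ) values;
# the selection set is released "blind" and the validation set is
# withheld entirely until after product generation.
WITH_VALIDATION_VALUES = {"training", "test"}

def assign_subset(matchup_id: int) -> str:
    """Deterministically flag a matchup with one of the four subset
    names (a simple round-robin over the ID, purely for illustration)."""
    return SUBSETS[matchup_id % len(SUBSETS)]

def release_record(matchup_id: int, fields: dict, in_situ_sst: float) -> dict:
    """Build the record as released to algorithm developers, withholding
    the in situ value from the selection and validation subsets."""
    subset = assign_subset(matchup_id)
    record = {"matchup_id": matchup_id, "subset": subset, **fields}
    if subset in WITH_VALIDATION_VALUES:
        record["in_situ_sst"] = in_situ_sst
    return record
```

The point of the sketch is that the subset flag alone decides whether validation values travel with the record, so independence of the selection and validation sets is enforced at release time rather than by developer discipline.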

Wednesday 15 June 2011

Update on MMS

For technical reasons that are not yet clear, the processing time to build the multi-sensor matchup database underpinning the multisensor matchup system is projected to be well above the few weeks expected. Martin B is still investigating the cause of this problem. In the meantime, we still aim to launch the RRDP at GHRSST by having a sample available on DVD. This will be in the correct format, but Gary C will have to do some offline work to achieve it. Once MB's assessment is in, we will have the information needed to decide whether the Round Robin schedule can be retained (this depends on when the complete data can be made available to external participants).