Murphy's Laws for Data
I've had the privilege of digging through some of Murphy's papers, and it transpires that there is a whole collection of lesser-known variants of Murphy's Law specifically for data.
Murphy's handwriting leaves a little to be desired, and my access was fairly limited, but from what I can gather the following laws are inviolate:
Murphy's Laws for Data (ML4D)
- If data can be wrong, it will be.
- If data can be misinterpreted, it will be.
- If data can be biased, it will be.
- If data can be misformatted, it will be.
- If data can be incomplete, it will be.
- If errors in data can pass silently, some will.
- If data formats are ambiguous, all interpretations will be used.
- If data formats are unambiguous, they will be ignored.
- If summarization can destroy meaning, it will.
- If patterns can be non-linear, they will be.
- If data items can contain separators, they will.
- If data can be destroyed, it will be (except when the goal is data destruction).
- The life expectancy of any datum is inversely proportional to its utility and correctness.
- The likelihood of data being correct is inversely proportional to the importance of the decisions it will be used to inform.
- Some of the data will be case sensitive.
- If input and output encodings can be different, they will be.
- Representative samples aren't.
- If data can be encoded in EBCDIC, it will be.
- If escape conventions can differ, they will.
- If the data is correct, then the checksum will be incorrect; and vice versa.
- Encryption will render the data unreadable by the encryptor and transparent to others.
- Dates are subject to their own special versions of Murphy's Laws for Data.
- Passwords are also subject to their own special versions of Murphy's Laws for Data.
- Data that demands to be graphed won't be.
- Excel will obscure all meaning in data with a combination of chart-junk and inappropriate defaults.
- Causal relationships change immediately after detection.
- The likelihood that a confidence test on data has been applied correctly is less than the stated confidence level.
- Backups become corrupted/missing at exactly the same time as their corresponding master.
- The obvious interpretation is incorrect.
- The correct interpretation is implausible.
Data will at best be incorrect, misinterpreted, misformatted, biased, incomplete, non-linear, misgraphed and quickly lost.
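For anyone who doubts the law about separators, it is easy enough to watch it bite. The following is only a minimal sketch in Python: the record is invented for illustration, and the comparison assumes a naive join-and-split on commas set against the standard library's `csv` module, which quotes any field containing the delimiter.

```python
import csv
import io

# Hypothetical record (made up for illustration): the first field
# contains the separator itself.
row = ["Smith, Jane", "analyst", "42"]

# Naive approach: join on commas, then split on commas.
naive_line = ",".join(row)
naive_fields = naive_line.split(",")

# The csv module quotes fields that contain the delimiter,
# so the record survives a round trip.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
parsed_fields = next(csv.reader(io.StringIO(buffer.getvalue())))

print(len(row), len(naive_fields), len(parsed_fields))  # prints "3 4 3"
```

The naive round trip turns three fields into four, and nothing complains. That silence is, of course, covered by a separate law.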
Footnote (Data and Plurals)
I am aware that there is a school of thought that maintains that the word data is plural and that on this basis we should say things like "the data are wrong". Neither Murphy nor I attended that school, but it is our opinion that the data supporting this view is questionable and that such usage, in this twenty-first century, is at best archaic and possibly even affected. Those of a different and more delicate sensibility are respectfully requested to pass over these laws quickly to avoid undue distress.