Robert E. Gladd, CQE: Thesis Prospectus

Robert E. Gladd,
Thesis work-in-progress internet edition:

UNLV Institute for Ethics & Policy Studies

Chapter 3,
The business and “science” of suspicionless drug testing

The private commercial laboratories performing the bulk of workplace drug testing hawk their services with a vengeance, sometimes holding free “seminars” in which drug abuse statistics of dubious lineage and merit are put forth as incontrovertible fact to heighten the sense of urgency among prospective clients. And, when pressed on concerns regarding false positive rates, lab spokesmen usually respond with reassuringly vague statements such as “our lab methods are quite precise, our screening results are 99.9% accurate.”**

** The terms “accuracy” and “precision” are not synonyms. The former refers to closeness of agreement with agreed-upon reference standards, while the latter has to do with the extent of variability in repeated measurements. One can be quite precise, and quite precisely wrong. Precision, in a sense, is a necessary but insufficient prerequisite for the demonstration of “accuracy.” Do you hit the “bull’s eye” red center of the target all the time, or are your shots scattered all over? Are they tightly clustered lower left (high precision, poor accuracy), or widely scattered lower left (poor precision, poor accuracy)? In an analytical laboratory, the “accuracy” of production results cannot be directly determined; it is necessarily inferred from the results of quality control (“QC”) data. If the lab does not keep ongoing, meticulous (and expensive) QC records of the performance histories of all instruments and operators, determination of accuracy and precision is not possible.

Uncritical acceptance of such assertions is testimony to a widely-held naive faith in scientific exactitude:

...science has assumed an increasingly powerful role in the execution of justice. Indeed, scientific testimony is often the deciding factor for the resolution of civil and criminal cases...As one juror put it after a recent trial in Queens, N.Y. ‘you can’t argue with science.’ (Neufeld & Colman, When Science Takes the Witness Stand, Scientific American, May 1990)

Those who work in analytical chemistry, however, know first-hand just how demanding the never-ending quest for accuracy and precision is, and how equivocal lab results can be in the absence of obsessive vigilance (and reasonable workloads). Mr. Neufeld was one of the defense attorneys defending O.J. Simpson in his criminal trial. The public has by now learned through the Simpson criminal case and the efforts of attorneys Peter Neufeld and Barry Scheck therein just how thoroughly one can indeed argue with science, particularly the discipline of analytical chemistry, which is overwhelmingly “inductive” rather than “deductive.” A finding of a given concentration of an analyte in a specimen is an indirect statistical estimate, based on a long chain of interim measurements that are themselves estimates, all of which can contribute to uncertainty even as they attempt to add clarification, and any one of which can be sufficiently in error as to weaken or invalidate the final result.

Anyone taking the trouble to avail themselves of the technical literature from NIDA—available free of charge—will find much cautionary language, caveats mostly falling on the deaf ears of War on Drugs zealots:

Accuracy is the absolutely essential ingredient of laboratory analysis. The public perception of scientific measurements is that they are indisputable. If a laboratory reports the presence of a quantity of drug in a specimen, this ruling is judged to be correct, regardless of protestations to the contrary by the subject. (R.V. Blanke, Accuracy in Urinalysis, Urine Testing for Drugs of Abuse, NIDA research monograph No. 73, 1986)

NIDA scientific officials consequently advise against implementing indiscriminate blanket or random drug testing programs in the absence of evidence of true need:

Urinalysis for detection of drug use should be considered in the context of an overall plan to reduce or prevent the negative impact of drug abuse on an industry or organization. It would be inadvisable, however, to proceed without a careful assessment of the group to be affected by testing...The plan should be tailored to the extent of the problem. If no clear indication of significant drug use at a worksite or in an organization is apparent, a program beyond a preventive educational effort may not be warranted. (ibid., Richard L Hawks, Ph.D., Establishing a Urinalysis Program–Prior Considerations)

There is good reason for such circumspection: the suspicionless drug test is part of a larger issue hotly debated in the medical community concerning the clinical and economic utility of mass screening of asymptomatic individuals for low-prevalence conditions.

NOTE: In addition to blanket (100%) testing, there is a related, yet methodologically separate issue that should be addressed concerning the use of random testing. Axiomatic to inferential statistics is that random sampling assumes the random distribution of true positives in a population. Just as most diseases are unevenly distributed throughout human strata, drug users are by no means randomly distributed throughout the workforce and the larger society. Drug use occurs predominantly in fairly well-known sociological clusters. The justification offered in defense of random testing is that it is “non-discriminatory” (meaning “democratic”) and, as such, allays concerns that managers “out to get” certain workers might otherwise target them for testing. Such an assertion is, however, disingenuous in that all employers retain the right to test employees “for cause” in addition to any programs of blanket or random testing. Moreover, given the uneven distribution of workplace drug users, proper scientific statistical procedure would require stratified sampling plans in which the strata with the lower prevalence rates would be subjected to compensatory higher rates of sampling to elevate the probability of identifying true positives, which would inevitably lead right back to cries of “discrimination.” Such methodologically sound practice would never be tolerated by those in the (mostly white-collar professional) low prevalence strata; such groups would clamor for increased sampling in the allegedly high prevalence clusters, theoretical statistical principles notwithstanding.

Should all men past the age of 40 have annual PSA screens performed (Prostate-Specific Antigen) for prostate cancer? Should all women submit to annual mammograms or pap smears, irrespective of their ages or overall health status? Should everyone have their serum cholesterol analyzed routinely? Where the prevalence of an adverse condition is low, resources are inescapably wasted on the true negatives, and the probabilities of false positive results rise as the reliability of lab results degrades under the weight of the workloads. Moreover, consider the following:

No diagnostic test or screening device is perfect. Errors of omission and commission occur...the definition of an accuracy rate can be done in a few different ways, and these are often confused in casual or uninformed communication...It is an important fact that predictive values do depend on overall prevalence rates...As the prevalence of a condition becomes rare, PPV [“Positive Predictive Value”] drops too, sometimes surprisingly so. For example, a test with sensitivity and specificity each equal to 99% is generally considered quite precise, relative to most diagnostic procedures. Yet for a condition with a not-so-rare prevalence of one per hundred, the odds on being affected [a “true positive”] given a positive test outcome are (.99/.01 x .01/.99) = 1 , i.e., among all positive results only 50% are truly affected! For a prevalence rate of one per thousand, the PPV is only about .10. These low numbers raise serious ethical and legal questions concerning action to be taken following positive test outcomes.

See Finkelstein & Levin, Statistics for Lawyers. The foregoing example is an application of Bayesian statistical analysis. The Bayesian formula for dichotomous outcomes is given by:

$$p(A \mid +) = \frac{p(+ \mid A)\,p(A)}{p(+ \mid A)\,p(A) + p(+ \mid N)\,p(N)}$$

Meaning, the probability of being truly “Affected” (a true positive) given a positive test result, or p(A|+), is equal to the probability of testing positive given that one is “Affected,” p(+|A), times the proportion of “Affecteds” (the “prevalence rate”), or p(+|A) p(A), divided by the sum of that factor plus p(+|N) p(N), i.e., the probability of testing positive given that one is a true Negative (a.k.a. an empirical “false positive”) times the proportion of true negatives, [1 - p(A)]. Plug in some numbers. Assume the lab has historically had a false positive rate of 1/1000 (0.001; remember, the lab guy said their operation was 99.9% accurate), and that the proportion of true positives in your work stratum is 1% (0.01). For the sake of simplicity, graciously stipulate that p(+|A) = 1.0, meaning zero chance of a false negative. Given such assumptions, the predictive probability of being a true positive given a positive test result—that is, p(A|+)—is (.01)/[.01 + (.001 x .99)] = 0.909918, or alternatively, the predictive probability of not being a true positive even though the test says “positive” is 1 - 0.909918, or roughly 9%, not 1/10 of 1%. Were you to test positive in such a cohort, the chance that the result is a false accusation would be roughly 90 times what you might naively expect.
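The arithmetic above can be checked with a short script. This is only a sketch: the function name and argument names are illustrative choices, and the three scenarios are the ones stated in the text and in the Finkelstein & Levin passage (sensitivity and specificity of 99% at prevalences of 1 in 100 and 1 in 1,000).

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Return p(A|+) for a dichotomous test via Bayes' rule.

    prevalence          -- p(A), proportion of true positives in the group
    sensitivity         -- p(+|A), chance an affected person tests positive
    false_positive_rate -- p(+|N), chance an unaffected person tests positive
    """
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Scenario from the text: 1% prevalence, no false negatives, 0.1% false positives.
ppv_text = positive_predictive_value(0.01, 1.0, 0.001)

# Scenarios from the quoted passage: sensitivity = specificity = 99%.
ppv_one_per_hundred = positive_predictive_value(0.01, 0.99, 0.01)
ppv_one_per_thousand = positive_predictive_value(0.001, 0.99, 0.01)

print(f"text example:      p(A|+) = {ppv_text:.6f}")
print(f"1 per 100, 99/99:  p(A|+) = {ppv_one_per_hundred:.4f}")
print(f"1 per 1000, 99/99: p(A|+) = {ppv_one_per_thousand:.4f}")
```

Varying the prevalence argument while holding the lab's error rates fixed shows how quickly the share of false accusations among positives grows as drug use becomes rarer in the tested group.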

A significant concern with respect to all of the foregoing is that neither the prevalence nor the false positive rates are generally known with any degree of certainty (see the Standefer report later in this chapter), particularly the false positive rate, which, in contrast to the prevalence, is operationally specific to each lab and test parameter, and may indeed be unquantifiable in the absence of large and costly datasets of quality control sample analyses. We might ask our friend the lab spokesman: “99.9% accurate? What do you mean by that? That you correctly identify 999 out of 1,000 drug abusers? That you can calculate a spiked** sample concentration within ± 1/10 of 1% of the reference value? That you suffer only one false positive for every 999 true positives? That you suffer only one false positive in every 1,000 blind matrix blanks? For each analytical parameter? What, indeed, does ‘99.9% accurate’ mean? Can we have a look at your data?”

** A “spike” is a sample containing a “known” concentration of an analyte derived from an “NIST-traceable” reference source of established and optimal purity (NIST is the National Institute of Standards and Technology, official source of all U.S. measurement reference standards). A “matrix blank” is an actual sample specimen “known” to not contain any target analytes. Such quality control samples should be run through the lab production process “blind,” i.e., posing as normal client specimens. Blind testing is the preferred method of quality control assessment, simple in principle but difficult to administer in practice, as lab managers and technicians are usually adept at sniffing out inadequately concealed blinds, which subsequently receive special scrutiny. This is particularly true at certification or contract award time; staffs are typically put on “red alert” when Performance Evaluation samples are certain to arrive in advance of license approvals or contract competitions. Such costly vigilance may be difficult to maintain once the license is on the wall and the contracts signed and filed away.

SAMHSA/NIDA “certifies” drug testing labs for competence. Curiously, it is none other than the Research Triangle Institute—our previously cited NIDA survey analysis organization (see Chapter 2)—that also holds the contract to administer the NIDA Laboratory Certification Program. A call to the RTI number provided to me by NIDA was answered “National Laboratories Program, may I help you?” The National Laboratory Certification Program Application Form is a slim document of 16 typewritten pages containing mainly yes/no checkoff boxes (e.g., [F.5] “Is the director a full-time employee of the laboratory?”). Section “C,” pertaining to quality control, consists solely of six yes/no questions covering two pages of the application form. The accompanying instruction sheet advises that the certification program consists of the application form, three rounds of “Performance Test” (PT) sample evaluations, two on site inspections, and fees totaling $17,300.00. Performance requirements on the PT samples appear to be surprisingly lenient:

Acceptable performance for a PT shipment is no false positive result and the identification of 90% of all required drugs and/or metabolites that are used to represent a drug or drug class in the samples. In addition, the quantitative results determined for PT samples must be within ± 20% of the mean calculated for the reference laboratories for 80% of all drug challenges, and within ± 50% of the calculated mean for all samples.

No false positives, sounds reassuring; after all that’s our overriding civil liberties concern, false accusation, right? But look closely; the “90%” identification requirement provides the applicant with a safety valve allowing for up to 10% “false negatives,” so when in doubt on the PT samples, punt. Given that the NIDA specified spike concentrations of the PT matrices are typically well above backgrounds (e.g., 180 nanograms/milliliter for cocaine metabolite where, for example, its Liquid Chromatography MDC** (Minimum Detectable Concentration) in biological fluids is on the order of 20 to 50 ng/ml), and given that these PT samples are unlikely to be truly “blind,” (if even shipped as such) only the most glaringly incompetent of laboratories are likely to fail this sort of licensing process. And, according to the instructions accompanying the application, facilities so maladroit as to “not perform acceptably in the proficiency testing or any other stage of the certification process...may request reinstatement into the certification program.” The only condition for reinstatement is “the subsequent expense of repeat certification activities.”

** MDC Note: Laboratory technologies are incapable of detecting analyte concentrations all the way down to “zero” for a number of reasons, including chemical interferences in the various constituents of the sample matrices and the “noise” inherent in any electronic system (think about the “signal-to-noise ratio” specifications accompanying your stereo system). The choice of “cut-off” levels that classify results as either “positive” or “negative” on the basis of their quantification above or below administratively pre-determined concentration limits is a principal factor in relative rates of false positives and negatives. Low cut-offs risk excessive false positives, whereas high cut-offs inevitably lead to a higher false negative rate. The choices must be made with consideration for the consequences of being wrong either way, balanced against the benefits of being right. Assay “sensitivity” refers to the probability that a true positive can be identified; “specificity” denotes the probability that a true negative will be so determined. These two analytical attributes are mutually inverse, and simultaneous optimization of drug test sensitivity and specificity (or, equivalently, at once minimizing the possibility of false positives and false negatives) is not economically feasible in inexpensive mass production mode. Something has to give. The same principle applies in criminal jurisprudence, wherein “sensitivity” (the allegations) must be supported beyond a reasonable doubt by “specificity” (the particular, logically undeniable proofs). Allegations (screens) are cheap; proof (confirmation) is expensive (and made prohibitively more so in the absence of probable cause).

So, perhaps “99.9% accurate” may actually mean something like “we can analyze 90% of spiked samples within ± 20% of a ‘known’ value 80% of the time, and within ± 50% the remaining 20% of the time.” If we know ahead of time what to expect and when to expect it.

Does “± 20%” imply laboratory ineptitude? Not necessarily. Consider the “power” formula below, which researchers use to determine sample size (“n”) required to discriminate between an expected value (such as a “cut-off”) and an experimental result:

$$n = \frac{\left[\left(Z_a - (-Z_b)\right)s\right]^2}{(\mu_1 - \mu_0)^2}$$

The denominator (µ1 - µ0) represents the difference of the two means. The “Z” values refer to the bell curve standard score significance levels we can choose for false positives (a, or “alpha”) and false negatives (b, “beta”), respectively. The “sigma” (s) refers to the “standard deviation,” i.e., the expected variability based on prior measurements. Once again, plug in some numbers. Assume a measurement cutoff (µ0) of 100 ng/ml., with a sigma of 4%, or 4 ng/ml. Fix “n” at 1, the alpha level at 3.72 (meaning far less than a 0.001 chance of a false positive), and beta at 1.28 (for a 10% chance of a false negative). Set µ0 at 100 (the cutoff value), and solve for µ1. You get µ1 = 120, a 20% difference, the best you could be expected to do given a single run at specified probability and empirical process variability levels. And, in the trenches, a lab able to keep its process standard deviations at or below 5% under a heavy workload is doing very well indeed.** In this example we set n = 1 because production sample analysis is generally a one-run estimate of a “true” concentration. Were we to want finer discrimination, we would have to analyze a sample multiple times (look at the mechanics of the formula). This is problematic, in that samples are typically consumed in analysis. We would have to split samples up into multiple “aliquots” for re-runs, and commercial labs do not routinely go to such effort and expense. The client gets a one-shot assessment. And, with the proposed increases in sample throughput, the quality of those one-shot analyses will be hard to maintain or improve.
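The calculation just described can be reproduced in a few lines. The sketch below rearranges the power formula to solve for µ1, using the values given in the text (cutoff 100 ng/ml, sigma 4 ng/ml, Z values of 3.72 and 1.28). The function names are illustrative, and the n = 4 case is a hypothetical added only to show the "mechanics of the formula."

```python
import math


def detectable_mean(mu0, sigma, z_alpha, z_beta, n=1):
    """Solve n = [(Z_a + Z_b) * s]^2 / (mu1 - mu0)^2 for mu1 (with mu1 > mu0)."""
    return mu0 + (z_alpha + z_beta) * sigma / math.sqrt(n)


def required_runs(mu0, mu1, sigma, z_alpha, z_beta):
    """Number of replicate runs needed to distinguish mu1 from mu0."""
    return ((z_alpha + z_beta) * sigma / (mu1 - mu0)) ** 2


# Values from the text: single run, cutoff 100 ng/ml, sigma 4 ng/ml.
mu1_single_run = detectable_mean(100.0, 4.0, 3.72, 1.28, n=1)

# Hypothetical: four aliquots of the same specimen halve the detectable gap.
mu1_four_runs = detectable_mean(100.0, 4.0, 3.72, 1.28, n=4)

print(f"n = 1: mu1 = {mu1_single_run:.1f} ng/ml")
print(f"n = 4: mu1 = {mu1_four_runs:.1f} ng/ml")
print(f"runs needed to resolve 100 vs 120: "
      f"{required_runs(100.0, 120.0, 4.0, 3.72, 1.28):.2f}")
```

Because n enters under a square root, halving the detectable difference costs four times the analytical work, which is why one-shot production analysis is so limiting.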

** Why? Recall my earlier observation: A finding of a given concentration of an analyte in a specimen is an indirect statistical estimate, based on a long chain of interim measurements that are themselves estimates, all of which can contribute to uncertainty even as they attempt to add clarification, and any one of which can be sufficiently in error as to weaken or invalidate the final result.

In a multi-step process involving many interim measurements, each with a variability component, the overall uncertainty is a type of sum of the individual fluctuations. In formal stat-speak, the total process variance is the sum of the individual variances. You then take the square root of that to come up with the collective process standard deviation. Another way of stating this is that error terms are additive, they cannot be assumed to just cancel each other out, because of a principle known as the “random walk” phenomenon in which extended runs high or low are shown to be more likely than would be intuitive. I recall the phrase “errors don’t cancel out, they just get diluted.”
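A minimal sketch of the "variances add" rule follows. The four step standard deviations are hypothetical numbers chosen purely for illustration (they come from no laboratory), and the calculation assumes the steps vary independently of one another.

```python
import math


def combined_sd(step_sds):
    """Root-sum-of-squares: add the variances, then take the square root."""
    return math.sqrt(sum(sd ** 2 for sd in step_sds))


# Hypothetical percent standard deviations for four interim measurements
# (e.g., pipetting, extraction, calibration, detection). Illustrative only.
hypothetical_step_sds = [2.0, 1.5, 3.0, 1.0]

total_sd = combined_sd(hypothetical_step_sds)
largest_single_sd = max(hypothetical_step_sds)

print(f"largest single-step sd: {largest_single_sd:.2f}%")
print(f"combined process sd:    {total_sd:.2f}%")
```

The combined figure always exceeds the largest single contributor: adding steps can only add variability, never subtract it.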

Another point: assume that a process contains 100 independent steps, each of which is performed “correctly” 99.9% of the time. What is the overall probability of the process executing without a “failure”? The average? (99.9%?) Greater than that? No. It would be .999 raised to the 100th power, or 90.5%, meaning that, on average, roughly one run in ten will contain at least one “failed” step. It gets even worse when the steps are not independent, and one must take into account the consequence of any single process mishap. Consider now for a moment that a typical Gas Chromatography/Mass Spectrometry (GC/MS) test used to confirm drug screen positives has 30 procedural steps, 27 of which involve taking measurements. The point? Accuracy and precision do not come easily or cheaply. To those who make vague and broad assertions about their operational inerrancy, I say “Show me the data!” (operational quality control data, that is).
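The compounding can be verified directly. In the sketch below the 100-step case uses the figures from the text. Applying the same 99.9% per-step figure to a 30-step procedure is an assumption made only for illustration, since no per-step reliability is given for GC/MS.

```python
def flawless_run_probability(per_step_success, steps):
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step_success ** steps


# From the text: 100 independent steps, each correct 99.9% of the time.
p_100_steps = flawless_run_probability(0.999, 100)

# Assumption for illustration: the same per-step figure over 30 steps.
p_30_steps = flawless_run_probability(0.999, 30)

print(f"100 steps: {p_100_steps:.3%} flawless, "
      f"{1 - p_100_steps:.3%} with at least one failure")
print(f" 30 steps: {p_30_steps:.3%} flawless, "
      f"{1 - p_30_steps:.3%} with at least one failure")
```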

The impact of cumulative variance/error propagation is seen in a monograph entitled GC/MS Quantitation of Benzoylecgonine Following Liquid-Liquid Extraction of Urine (John Gerlitz, MS, Journal of Forensic Sciences, Vol. 38, Sept. 1993, pp. 1210-13). The salient paragraph follows:

The precision of the method was evaluated by the analysis of quality control samples independently spiked at 150 ng/mL. Within-run and between-run precision were determined by analyzing the control material seven times. Within-run the mean concentration found was, BE at 142 ng/mL (CV = 3.0%). Between-run the mean concentration was, BE at 145 ng/mL. (CV = 2.7%). (p. 1212)

The CV, recall, is the coefficient of variation, alternatively called the percent standard deviation. A bit of statistical math: (0.03)(142) = 4.26 ng/mL. and (0.027)(145) = 3.92 ng/mL., the “sigmas” (expected variabilities based on the experimental distributions) for the respective experimental results. Recall that the “spike” (the reference standard concentration) was 150—not “151” or “149” (the “significant figures” issue). Are “142” and “145” statistically equivalent to 150, i.e., close enough to affirm the utility of the method? Statistical significance t-test and p-values for the foregoing work out to t = -4.97 (p < 0.01) and t = -3.37 (p < 0.01) respectively at 6 degrees of freedom (n-1). Equivalently, the 99% confidence intervals for the experimental means are 142 ± 6 and 145 ± 5.5 respectively. In formal statistical terms, these results are “significantly” off the mark, low. This researcher concluded the results to be close enough, however, stating that “[T]he present guidelines of the National Institute on Drug Abuse call for a cutoff concentration of BE of 150 ng/mL. for GC/MS confirmation in urine. The procedure was found to be an accurate, reliable means for the identification and quantitation of BE at these levels.” (p. 1213)
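These figures can be recomputed from the summary statistics alone. The sketch below uses only the means, CVs, and n = 7 reported in the quoted paragraph. The critical value 3.707 is the standard two-sided 99% Student's t value for 6 degrees of freedom, taken from a t table rather than computed, and the function name is an illustrative choice.

```python
import math

# Two-sided 99% critical value of Student's t at 6 d.f. (standard table value).
T_CRIT_99_DF6 = 3.707


def one_sample_summary(mean, cv, n, reference, t_crit):
    """Return (sd, t statistic, confidence-interval half-width)."""
    sd = cv * mean                   # CV is the sd as a fraction of the mean
    se = sd / math.sqrt(n)           # standard error of the mean
    t_stat = (mean - reference) / se
    return sd, t_stat, t_crit * se


within_sd, within_t, within_hw = one_sample_summary(
    142.0, 0.030, 7, 150.0, T_CRIT_99_DF6)
between_sd, between_t, between_hw = one_sample_summary(
    145.0, 0.027, 7, 150.0, T_CRIT_99_DF6)

print(f"within-run:  sd = {within_sd:.2f}, t = {within_t:.2f}, "
      f"99% CI = 142 +/- {within_hw:.1f}")
print(f"between-run: sd = {between_sd:.2f}, t = {between_t:.2f}, "
      f"99% CI = 145 +/- {between_hw:.1f}")
```

One detail worth noticing: the between-run interval reaches up to about 150.5, just touching the 150 ng/mL reference, so whether that result counts as "significant" at the 1% level appears to depend on whether a one-sided or two-sided test is intended. The within-run result is low under either reading.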

Perhaps so; but a couple of cautionary observations are in order. First, putting aside any technical quibbles over t-test or confidence-interval statistical decision criteria, 5.3% (142/150) and 3.3% (145/150) differentials from a reference value do not qualify as—recall Justice Scalia—“99.94% accurate.” Second, this was a controlled methods development experiment in which the researcher knew what he was looking for (150 ng/mL.). These results are well within the generous NIDA accreditation PE latitude, so in that sense the method is “accurate and reliable,” but it is orders of magnitude more imprecise than would be assumed by a clinically untutored Supreme Court Justice. Moreover, this is a GC/MS (Gas Chromatography / Mass Spectrometry) quantitation experiment, using what is ostensibly the “forensic” gold standard of lab technology, the one used to confirm employment screen positives. We must ask: what kind of variability will be the norm in “blind” mass production commercial analytical settings? Are there any legitimate false positive/negative error rate concerns?

The Standefer Performance Evaluation (1990) study cites fairly recent historical false positive and false negative PE rates for specimens containing metabolite concentrations “spiked” near the administrative cut-off levels for amphetamine, benzoylecgonine, morphine, codeine, THC, and phencyclidine (PCP):

Drug class	False negative rate (%)	False positive rate (%)
Amphetamine	4.9	1.6
Benzoylecgonine	3.3	2.9
Morphine	2.3	3.3
Codeine	0.7	0.6
THC	10.5	0.7
Phencyclidine	8.6	0.4

See Forensic Urine Drug Testing, March, 1991, pp. 7-8 in The Medical Review Officer’s Guide to Drug Testing, (Robert B. Swotinsky, Van Nostrand Reinhold, NY, 1992). These data are by now a bit aged, and what counts—most importantly with respect to the interests of those tested—are the tabulations for the current quarter, but can we safely conclude that such error rates have fallen off the radar by now? That all labs are by now consistently “99.94% accurate”? Can we see your data?

Competence in the commercial lab

As alluded to above, critical to accurate laboratory specimen analysis is a solid understanding of the statistical nature of such work (review the foregoing)—most importantly the degree to which probability estimates (“is this result truly a ‘positive’?”) are impacted by distributional abnormalities, particularly in proximity to analytical cut-off limits and “MDCs” (Minimum Detectable Concentrations). Unfortunately, most chemists and lab technicians are exposed to only a cursory examination of applied mathematical statistics—the ugly and disdained stepchild of quantitative disciplines. Standard academic texts on statistics rarely venture deeply—if at all—into issues of distributional departures from the “normal” (i.e. “Gaussian,” depicted by the theoretical Bell Curve). Such is unfortunate; real-world laboratory operations demand a good bit more acuity with respect to the statistical factors that shape and validate (or negate) lab results. As pointed out, for example, by Dr. Lloyd A. Currie of the National Bureau of Standards (now NIST, the National Institute of Standards and Technology), an eminence in the field of quantitative radiochemical assessment:

Once we leave the domain of simple detection of signals, and face the question of analyte or radioactivity concentration detection, we encounter numerous added problems or difficulties with assumption validity. That is, assumptions concerning the calibration function or functions—i.e., the full analytic model—and the “propagation” of errors (and distributional characteristics) become crucial. [Currie, Lloyd A., Lower Limit of Detection..., National Bureau of Standards, NUREG/CR-4007, 1984, pp. 18-19]

Now, two conventional statistical “significance thresholds” are those of the “95% or 99% confidence levels” wherein it is assumed that roughly 95% or better than 99% of the variation in a set of measurements is confined within ± two or ± three “standard deviations” (a.k.a. “2-sigma” or “3-sigma”) of the mean, or average value (think of the standard deviation as more or less “the average variability around the average”). A measurement found outside such ± 2- or 3-sigma ranges is frequently declared to be a “significant” difference from the overall population of values, with only approximately a “5% chance” or “1% chance” of being wrong, respectively. But real-world measurement datasets only approximate, to a greater or lesser degree, the theoretical Bell Curve distribution.

“Chebychev’s Theorem,” on the other hand, provides us with a lower probability bound applicable when distributional characteristics are unknown or uncertain. Pafnutii L. Chebychev (1821-94), a Russian mathematician, proved that, for any set of measurements capable of yielding a mean and standard deviation, the proportion of data within “K” sigma is always at least (1 – 1/K^2) irrespective of the “shape” of the distribution. So, at 2-sigma, the within-limits distributional proportion could be as low as 1 – 1/2^2, or 75%, and would most often be somewhere between 75% and the 95% of the pure Gaussian distribution. Similarly, at 3-sigma the inclusive range would be from 89% to better than 99%. So, for example, when we make a claim of having only a 5% or 1% “chance of error,” we assume a lot that might not stand up to proper methodological scrutiny. And, if one’s job and reputation are on the line, scrutiny is highly recommended.
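The gap between the distribution-free bound and the Gaussian assumption is easy to tabulate. The sketch below compares the Chebychev lower bound, 1 – 1/K^2, with the exact Gaussian coverage obtained from the error function. The function names are illustrative.

```python
import math


def chebychev_lower_bound(k):
    """Minimum proportion of any distribution within k standard deviations."""
    return 1.0 - 1.0 / k ** 2


def gaussian_coverage(k):
    """Proportion of a true Gaussian distribution within k standard deviations."""
    return math.erf(k / math.sqrt(2.0))


for k in (2, 3):
    floor = chebychev_lower_bound(k)
    ideal = gaussian_coverage(k)
    print(f"{k}-sigma: at least {floor:.1%} (any shape), "
          f"{ideal:.2%} (pure Gaussian); "
          f"worst-case 'chance of error' {1 - floor:.1%} vs {1 - ideal:.2%}")
```

At 2-sigma the worst-case error probability is 25% rather than about 5%, and at 3-sigma it is about 11% rather than a fraction of 1%, which is the range of optimism built into an unexamined normality assumption.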

Statistical “tests for normality” exist, but are rarely taught to anyone except those studying advanced statistics. Most of those analyzing bioassay specimens simply believe their results are accurate if they fall within pre-set (though often unrealistic) standard statistical boundaries. The following review of a random sample of college and professional analytical chemistry books reveals just why. To the extent that the topic of “statistics” is included at all, it is usually brief and basic, with little to nothing regarding probabilistic/distributional caution.

University-level analytical chemistry texts:

Modern Methods of Chemical Analysis, (1968, Wiley & Sons, NY), Pecsok, R.L. & Shields, L.D. Nothing on laboratory statistics.
Fundamentals of Analytical Chemistry, (1962, Holt, Rinehart, & Winston, NY), Skoog & West. Chapter 3, pp. 33-68, basic univariate Gaussian statistics. No mention of non-normal distribution probability considerations.
Instrumental Analysis, (1986, Allyn & Bacon, Newton, MA), Christian, G.D. & O’Reilly, J.E. Eds., Chapter 19.6 pp. 632-35. Four pages on “Statistical Considerations in Radiochemical Analysis.” No mention of non-normal distribution probability considerations.
Instrumental Methods of Analysis, 6th Edition, (1981, Litton Educational Publishers, Belmont, CA), Willard, Merritt, Dean, & Settle. Chapter 29.7, pp. 861-67, “Evaluation of Results.” Seven pages of basic statistics & lab “precision.” No mention of non-normal distribution probability considerations.
Instrumental Methods of Analysis, 7th Edition, (1988, Wadsworth Publishing Co., Belmont, CA), Willard, Merritt, Dean, & Settle. pp. 29-37. Eight pages of basic statistics & lab “precision.” No mention of non-normal distribution probability considerations.
Instrumental Methods of Chemical Analysis, (1985, McGraw-Hill, NY), Ewing, G.W. Chapter 26, pp. 480-487. Eight pages of “precision and accuracy” issues and “error propagation.” Nothing on basic statistics, and no mention of non-normal distribution probability considerations.
Chemical Instrumentation: a systematic approach, (1989, Wiley & Sons, NY), Strobel, W.R. Chapter 10, “Statistical Control of Measurement Quality,” pp. 343-363. Basic univariate and bivariate Gaussian statistics, with discussion of precision, accuracy, and error propagation issues. No mention of non-normal distribution probability considerations.
Advanced Instrumental Methods of Chemical Analysis, (1993, Ellis Horwood, NY), Churacek, J. Ed. Chapter 16, pp. 406-415, “Chemometrics in the Instrumental Laboratory.” Short discussion of measurement error propagation and univariate Gaussian distribution. No mention of non-normal distribution probability considerations.
Introduction to Mass Spectrometry, (1997, Lippincott-Raven Publishers, Philadelphia, PA), Watson J. T. Chapter 18, “Sources of Error and Interference,” pp 414-19, focused on contamination and matrix interference issues. Section 7, pp 449-50, “Precision and Accuracy,” one and one-half pages on replicability of results (e.g. “RSD” or Relative Standard Deviation). Nothing else on laboratory statistics, including no mention of non-normal distribution probability considerations.
Advanced Analytical Chemistry, (1958, McGraw-Hill, NY), Meites, L., Thomas, H.C., & Baumann, R.P. Mention of Gaussian distribution on pp 354-55. No mention of non-normal distribution probability considerations.
Principles of Instrumental Analysis, 4th Edition, (1992, Saunders College Publishing, Ft. Worth, TX), Skoog & Leary. Appendix 1, pp. A1-A19, “Evaluation of Analytical Data.” Basic Gaussian statistics. No mention of non-normal distribution probability considerations.

Analytical chemistry math/statistics texts:

Use of Statistics to Develop and Evaluate Analytical Methods, (1985, Association of Official Analytical Chemists, Arlington, VA). Basic applied Gaussian statistics. No mention of non-normal distribution probability considerations.
Chemometrics: Applications of Mathematics and Statistics to Laboratory Systems, (1990, Ellis Horwood, NY), Brereton, R.G. Basic applied Gaussian statistics. No mention of non-normal distribution probability considerations.
Practical Guide to Chemometrics, (1992, Marcel Dekker, Inc., NY), Haswell, S.J. Chapter 2, pp. 5-15, “Statistical Evaluation of Data.” Basic applied Gaussian statistics. No mention of non-normal distribution probability considerations.
Statistics for Analytical Chemistry, (1984, Ellis Horwood/Wiley & Sons, NY), Miller, J.C. & Miller, J.N. Basic applied Gaussian statistics. Three paragraphs (pp. 76-78, Section 3.13) on “Testing for Normality.”
Statistics for Analytical Chemists, (1983, Chapman & Hall, London), Caulcutt, R. & Boddy, R. Basic applied Gaussian statistics, with three paragraphs addressing “Other Distributions” (Section 2.5, pp. 14-16), providing a brief discussion of the effect of distributional skew on probability estimation.
Use of Recovery Factors in Trace Analysis, (1996, The Royal Society of Chemistry Information Service, Cambridge, UK), Parkany, M. One paragraph on “Propagation of Error Considerations,” page 3.

Analytical chemistry reference/continuing education texts:

Samples and Standards: Analytical Chemistry by Open Learning, (1987, Wiley & Sons, NY), Woodget. Ten pages of basic applied Gaussian statistics (pp. 41-50). No mention of non-normal distribution probability considerations.
The Laboratory Handbook of Materials, Equipment, and Techniques, (1992, Prentice-Hall, Englewood Cliffs, NJ), Coyne, G.S. Some references concerning accuracy, precision, and “significant figures,” but otherwise nothing at all regarding laboratory math/statistics.
Chemical Technicians’ Ready Reference Handbook, 2nd Edition, (1981, McGraw-Hill, NY), Shugar, G.J., Shugar, R.A., Bauman, L., & Bauman, R.S. No references to laboratory math/statistics.

Only one text in the foregoing literature review sample—Statistics for Analytical Chemists—had even the barest mention of distribution asymmetry and its potential impact on lab results. Most laboratory personnel are simply unaware that the validity of statistical inference is highly contingent upon distributional assumptions that may or may not be tenable in day-to-day operations.

OJT concerns

Regrettably, statistical naivete finds its way into both the peer-reviewed literature and the continuing education texts used for on-the-job training in the lab. A training manual from my own lab tenure, Radiochemical Methods (Geary, W., 1986, Wiley & Sons, NY), illustrates the phenomenon. In Section 5.1.2 (“Radioimmunoassay”) a methods development monograph is presented from the journal Analyst (June 1982, Vol. 107, pp. 629-33) entitled Direct Radioimmunoassay for the Detection of Barbiturates in Blood and Urine (Mason, Law, Pocock, & Moffat). Among the statistical curiosities in this paper is the following:

The distribution of background levels of cross-reactivity in 50 samples of urine from normal subjects who were not receiving barbiturate medication was positively skewed with a mean and standard deviation of 15 ± 28 ng/ml. The positive/negative cut-off for urine samples was set at 100 ng/ml., thus ensuring a >99% probability of obtaining a true positive result...

What this investigator obviously calculated was “mean + roughly 3-sigma,” or “15 + (28)(3) = 99, round off at 100.” Now, since 99.7% of data in an exactly normal distribution are within ± 3 standard deviations around the mean, one would be justified in concluding that a result greater than 100 ng/ml. was indeed a “positive” at better than “99% probability.”

Well, what’s wrong with this picture? Consider that these are ratio-level data, meaning you cannot have less than zero nanograms/milliliter of an analyte in a specimen. For the distribution of these 50 sample assays to be approximately “normal” with a mean of 15, the standard deviation could not be greater than approximately 5, as mean – (3)(5) = 0. With a sigma of 28 (nearly twice the magnitude of the mean) the data are highly skewed (the author even acknowledges this), making his blithe assumption of “>99% probability” mathematically unsustainable. By the Chebychev rule, which guarantees only that at least 1 – 1/k² of any distribution lies within k standard deviations of its mean (in this instance 1 – 1/3², or 0.89), the worst-case confidence level would be more on the order of 89%, roughly a one-in-ten chance of a >100 ng/ml false positive.
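For readers who want to check the arithmetic, the following is a minimal sketch (mine, not anything from the Analyst paper) contrasting the coverage one gets by assuming an exactly normal distribution with the distribution-free Chebychev guarantee. The function names are my own illustrative choices; the mean, standard deviation, and cut-off are the figures quoted from the paper.

```python
import math

def normal_within(k: float) -> float:
    """Fraction of an exactly normal distribution lying within +/- k sigma of the mean."""
    return math.erf(k / math.sqrt(2.0))

def chebyshev_within(k: float) -> float:
    """Chebychev lower bound on the fraction within +/- k sigma, valid for any distribution."""
    return 1.0 - 1.0 / k ** 2

# Figures quoted from the paper: 15 +/- 28 ng/ml, cut-off at 100 ng/ml.
mean, sd, cutoff = 15.0, 28.0, 100.0
k = (cutoff - mean) / sd  # distance of the cut-off from the mean, in sigma units

print(f"cut-off sits {k:.2f} sigma above the mean")
print(f"normal assumption, k = 3:   {normal_within(3):.4f}")
print(f"Chebychev guarantee, k = 3: {chebyshev_within(3):.4f}")
```

The normal assumption yields about 0.9973, while Chebychev promises only about 0.889. With skewed data and no raw results to inspect, only the latter figure is defensible.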

Since the raw data (the 50 “matrix blank” sample run results) are not provided in the monograph, we have no way to assess just what the data distribution profile might be—besides obviously skewed, with some rather high values requisite to effect a sigma so large relative to such a small mean. But, assume there were no blank results >100 ng/ml (for were there just one, the false positive rate estimate would then seem to necessarily be minimally 1 out of 50, or 2%, irrespective of any conventional statistical parameter estimates). Another little known statistical principle comes into play that gives a confidence estimate close to that of the minimalist Chebychev: the “Rule of 4.6/n.”

As set forth in If Nothing Goes Wrong, Is Everything All Right? Interpreting Zero Numerators (Hanley & Lippman-Hand, Ph.D.s, JAMA, Vol. 249, No 13, pp. 1743-45.), the confidence interval estimate appropriate for circumstances where “positives” are rare (and, in seemingly problematic fashion, fail to turn up during an investigation) is given by the simple formula [ 0 <= p <= -ln(C)/n ], where p = the true proportion of positives, n = the number of samples examined, C = the acceptance level for the probability of a false positive, and ln( ) is the natural, or “Napierian,” logarithm. So, assume a “99% confidence level,” meaning C = 0.01, a 1% chance or less of a false positive. Do the math: -ln(0.01)/50 = 4.6/50 = 9.2%, meaning that we can be 99% confident that the possible proportion of positives is from zero to 9.2%, given that we sampled 50, found none, and are bereft of defensible parametric distribution indices that augur anything better. Note how close this is to the Chebychev boundary in the foregoing example. Both Chebychev and the “-ln(C)/n” principle provide us with worst-case bounds for probabilistic estimation under conditions of uncertainty, neither of which appeared on the empirical radar of the experimenter just cited, one (like so many) content with a simplistic notion of “mean + 3-sigma” for setting cut-off limits.
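The zero-numerator bound is easy to verify. The sketch below is my own illustration, with hypothetical function names. It computes the -ln(C)/n approximation alongside the exact value obtained by solving (1 - p)^n = C for p, which is the probability statement the approximation comes from: if the true rate were p, the chance of seeing no positives in n independent samples would be (1 - p)^n.

```python
import math

def zero_numerator_bound(n: int, c: float) -> float:
    """Approximate upper bound on the true proportion: -ln(C)/n."""
    return -math.log(c) / n

def zero_numerator_bound_exact(n: int, c: float) -> float:
    """Exact upper bound: the p that solves (1 - p)**n = C."""
    return 1.0 - c ** (1.0 / n)

n = 50  # matrix blank samples reported in the paper
for c in (0.05, 0.01):
    approx = zero_numerator_bound(n, c)
    exact = zero_numerator_bound_exact(n, c)
    print(f"C = {c:.2f}: approx {approx:.1%}, exact {exact:.1%}")
```

For n = 50 and C = 0.01 the approximation gives 9.2% and the exact form about 8.8%. Either way the claimed “>99%” assurance is out of reach on 50 blanks alone.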

The point of all this stochastic techno-babble? Several, actually. First, when we observe fewer than 100 events and claim to have estimated some “percentage,” we are extrapolating. (If I go to bat 8 times and get 3 hits, that my “batting average” is at that moment “.375” is mere long-division arithmetical artifact. We will have to await my 1,000th trip to the plate to see whether the “7” and the “5,” or the “3,” for that matter, are still around.) Second, given that the entire business of inferential statistics concerns extrapolating wisely from ostensibly representative random samples to their parent populations, a good deal of circumspection is in order should certain fundamental distributional assumptions not be met. Finally, we are not dealing with mere academic statistical nit-picking here; employees’ jobs, careers, and reputations are at stake.

Moving along: it gets worse.

Radiochemical Methods also contains an exercise wherein students are to construct a curvilinear “calibration function” for computing RIA (radioimmunoassay) production specimen analysis of insulin—on the basis of four data points from “reference standard” analyses. Basically, the student plots the four results, connects the dots with a smooth curve, and then subsequently “computes” the “unknown” (production) sample results by visual reference: up from the “x” axis (the radiation counts/minute) to the curve, then right, horizontally over to the “y” axis (for the estimated insulin concentration). Nowhere in this exercise is there any discussion of “error bounds” i.e., the likely variation for each of the four reference data points and what such implies for “calibration function” indeterminacy. Nor—in related fashion—is there any apparent awareness that no competent statistician would ever agree to derivation of a “higher-order” (curvilinear) calibration function on the basis of just four internal tracer results. Forty perhaps (per concentration level, across the full analytical range), but not four.

One concern we ought have with respect to the foregoing: Economic pressures are putting more and more of the in-the-trenches lab work in the hands of technicians of perhaps even more diminished training and competence. That which used to be the domain of chemists with at least a 4-year degree (with questionable statistical acumen) is now expected of community college graduates with AA degrees (and likely even more deficient in statistical skills).

Moreover, the gradual shift toward expensive, high-tech automated lab equipment and away from traditional “bench chemistry”—which turns many lab “analysts” into mere mass-production machine operators—is not without problems, either: accuracy and precision concerns merely shift to the equipment manufacturers and those who maintain the instruments—not to mention those who interpret and sign off on the results of such automated processes, many of them also woefully undereducated with respect to the statistical underpinnings of their findings. Many of them eager to just believe that the technologies and procedures are inerrant.

But, it cannot just be assumed that the numbers coming out of any analytical process, whatever its apparent sophistication, are acceptably “exact” just because the instruments and processes are assertedly error-proof.

Can we have a look at your data?

Conventional Drug Testing Methodology

The current NIDA protocol for drug urinalysis specifies the use of the Enzyme-Multiplied Immunoassay (EMIT®) test to screen samples, followed by the sophisticated GC/MS technology for confirmation of screen “positives.” GC/MS, or Gas Chromatography/Mass Spectrometry, is the most accurate, precise, and specific technology generally available to the clinical chemist, the only analytical method universally stipulated to conform to the federal Frye Standard for admissibility as forensic evidence. It is expensive, requiring sophisticated equipment and advanced training and skill. Immunological drug abuse screens such as the EMIT, on the other hand, are inexpensive, but they are markedly non-specific, meaning they yield a relatively high proportion of false positives owing to their sensitivity to “cross-reactive” compounds with molecular structures similar to those of the target analytes. EMIT screens are known to return false positives for literally hundreds of legal substances.** It is the GC/MS method that “disconfirms” or weeds out such errors.

** see Barbara A. Smith & Jean C. Joseph, EMIT Assays for Drugs of Abuse, in Analytical Aspects of Drug Testing, Dale G. Deutsch, Ed., (New York, John Wiley & Sons, 1989), pp. 35 - 58. The EMIT® is actually a “panel” or battery of immunoassay screens for detecting amphetamine, cocaine, THC (marijuana), barbiturate, opiate, and Phencyclidine (PCP) metabolites, which are biochemical derivatives of compounds in the originally ingested drugs. The monograph lists 211 false-positive reactive substances for the EMIT amphetamine assay alone! (e.g., Alka-Seltzer Plus, Contac, Sudafed, Dimetapp, Tavist-D, Robitussin NR, Actifed and so forth). Think of a clinical screen as radar picking up an unscheduled flight, highly sensitive, but inherently non-specific: Well, it must be a plane, coming this way pretty fast; must be a jet; beyond that can’t tell you much. Don’t know whose it is, how many engines, what color, who’s on board, where it’s precisely headed...

In light of the foregoing, we must now have a look at a serious ethical issue arising from mass production indiscriminate drug testing. Quoting once more from the 1994 National Academy of Sciences findings:

Preemployment drug testing may have serious consequences for job applicants. Applicants, unlike most employees, often do not enjoy safeguards commensurate with these consequences. A particular danger of unfairness arises because screening test data are reported to companies despite the known possibility of false positive classification errors. Recommendation: No positive test result should be reported for a job applicant until a positive screening test has been confirmed by GC/MS technology.

The importance of this last point cannot be overstated, especially since there are increasing economic pressures to do away with GC/MS confirmations of pre-employment tests. It is argued that a “medical” a.k.a. “clinical” standard will suffice (i.e., unconfirmed by GC/MS). The real reason for this attitude is that employers want to pay as little as possible for drug testing, and lab owners are hard-pressed to finance sufficient equipment and to recruit and retain enough GC/MS-competent chemists to perform the volume of confirmations required of a mass screening workload. Consider a few rough estimates: assume quarterly screening of merely 10% of the U.S. civilian work force of 120 million; 48 million screens would be required annually. Assume further a prevalence of “true positives,” i.e., those with sufficient metabolite concentrations in their samples to be at all detectable, to be roughly the 8.3% that NIDA has claimed, or 10 million in the work force, implying roughly one million among those sampled each quarter. If the screens can pick up 90% of the true positives while yielding a false positive rate of perhaps 2% (a conservative estimate), the resulting 3.6 million annually detected true positives plus the 880,000 false positives (2% of the 44 million yearly true negatives) imply a need for roughly 4.48 million GC/MS tests per year, or equivalently, more than 245 GC/MS runs per NIDA-certified lab per day, seven days a week, 365 days per year.
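The rough estimates above can be reproduced in a few lines. This is a sketch under the stated assumptions only; the function and variable names are mine. Because it applies the 8.3% prevalence without rounding to “one million per quarter,” its totals land slightly under the rounded figures in the text. The number of certified labs is not given above, so the last line merely backs out the lab count implied by the 245-per-day figure.

```python
def confirmation_workload(workforce, fraction_screened, rounds_per_year,
                          prevalence, sensitivity, false_positive_rate):
    """Annual screen volume and the GC/MS confirmations it would generate."""
    screens = workforce * fraction_screened * rounds_per_year
    users = screens * prevalence            # samples with detectable metabolites
    non_users = screens - users             # true negatives
    true_pos = users * sensitivity          # users the screen catches
    false_pos = non_users * false_positive_rate
    return {
        "screens": screens,
        "true_positives": true_pos,
        "false_positives": false_pos,
        "confirmations": true_pos + false_pos,
    }

# Inputs are the rough estimates from the text.
w = confirmation_workload(120e6, 0.10, 4, 0.083, 0.90, 0.02)
per_day = w["confirmations"] / 365
implied_labs = per_day / 245                # backed out from the text's per-lab figure
false_share = w["false_positives"] / w["confirmations"]

print(f"screens per year:       {w['screens']:,.0f}")
print(f"confirmations per year: {w['confirmations']:,.0f}")
print(f"confirmations per day:  {per_day:,.0f}")
print(f"implied number of labs: {implied_labs:.0f}")
print(f"share of screen positives that are false: {false_share:.1%}")
```

On these inputs roughly one screen positive in five is false, which is precisely the population left exposed if confirmation is dropped.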

The equipment and trained labor capacity to perform anywhere near such a quantity of GC/MS solely for drug screening confirmation do not exist (ignoring for the moment the even greater issue of the resources required to manage such an enormous volume of screens). The laboratory infrastructure must also deal with the remaining analytical needs of government, industry, and health care. The volume of environmental testing required under Superfund and related EPA laws and regulations is staggering. USDA, FDA, and OSHA must continually test the food supply, pharmaceuticals, and all manner of materials for a host of health and safety criteria. Hospitals and physicians require timely and accurate clinical lab results for effective treatment of the sick and injured (Think about that the next time you are ill and your health care provider’s lab vendor is operating at chronic overload owing to their employment drug screening “easy-money” infatuation). The criminal justice system has no shortage of forensic-quality analytical work to be done on an ongoing basis. Such capacity limitations alone call into serious question the general wisdom of mass drug testing, and raise serious ethical concerns with respect to non-confirmed “clinical standard” false positive rates given the cost-minimization imperatives felt by employers and commercial laboratories alike.

Is any of this truly warranted? The NAS report asks the same question:

The use of illegal drugs in recent years is thought to pose problems so severe as to justify a ‘war on drugs.’ The current war on drugs overlooks, however, the abuse of alcohol and tobacco, which cause more deaths in the United States than all illegal drugs combined (Newcomb, 1992). Whereas illegal drugs are estimated to be responsible for approximately 30,000 premature deaths in the United States per year (Reuter, 1992), tobacco is responsible for nearly 400,000 premature deaths per year and alcohol accounts for nearly 100,000 fatalities per year (Julien, 1992).

In our zeal to combat a relatively minor fraction of the overall U.S. substance abuse problem, we blithely ignore the counsel of esteemed institutions like NAS, as well as the expertise of NIDA scientists. We propose to put millions of job applicants at undue risk of false accusation in the name of a War on Drugs and in the service of commercial laboratories under continuing pressure to cut methodological corners in pursuit of profit. Dr. Richard Hawks of NIDA, further commenting on the special situation of testing of preemployment applicants:

Most of the urine samples being analyzed in industry today are associated with preemployment applications. While many of the rights usually accorded an applicant are not necessarily those of an employee, the same rights of privacy and accuracy of analysis should be accorded these individuals (emphasis mine). (Richard L. Hawks, Ph.D., Establishing a Urine Testing Program: Prior Considerations, in Urine Testing for Drugs of Abuse, (Rockville, MD, NIDA Research Monograph Series, Monograph #73, 1986)

Contrast such a sensible and fair-minded observation—one entirely consonant with the NAS report recommendation—with those of Psychemedics Corporation executives and associates:

In general, private sector employers tend to the opinion that it is excessive to apply forensic-standard testing to job applicants, since job-seekers are frequently rejected on the basis of very subjective criteria such as unsatisfactory appearance or demeanor. This opinion is shared by a number of analysts of pending drug testing legislation for the private sector. In place of forensic standards, medical standards have been effectively applied in pre-employment situations. (Werner A. Baumgartner & Virginia A. Hill, Hair Analysis for Drugs of Abuse: Some Forensic and Policy Issues, NIDA 1990 Conference Proceedings.)

No competent Human Resources manager would share such a view. Prudent hiring practices include fastidious documentation of objective rejection criteria. Psychiatrist Dr. Robert L. DuPont, however, former Director of NIDA, Presidential “Drug Czar,” and now a “consultant” to Psychemedics, has an even more curious take on the subject:

In some settings hair testing can be done without GC/MS confirmation...Preemployment is another setting where confirmation may not be needed. The best confirmation of all is not GC/MS, but admission of drug use by the person tested. (Robert L. DuPont, M.D., Hair Testing for Abused Drugs: A Practical Guide, Institute for Behavior and Health, Inc., Rockville, MD., 1990.)

Right. The term “confirmation” in this context properly means independent verification, not affirmation under duress, given the latter’s long and ignoble history.

Dr. DuPont was once awarded a Department of Energy grant for “a study described as ‘an attempt to demonstrate that opponents of nuclear power are mentally ill.’ DuPont [says] that he will study unhealthy fear, a phobia that is a denial of reality.” (see K.S. Shrader-Frechette, Risk and Rationality, p. 14) Psychiatrists are frequently big on Denial. Dr. DuPont seems to imply that since “the cardinal symptom of drug abuse is denial,” (DuPont, op. cit.) if you use illegal drugs and claim to do so without adverse consequences, you are by definition in Denial; your very dissent proves you to be an addict. And, before we can help you (given that you manage so well to not evince any overt symptoms), we must identify you through inexpensive mass drug screening.

You might as well just confess on the basis of the “clinical” screen result; after all, where there’s smoke, there’s fire, no?

In 1995 the U.S. Supreme Court handed down a major drug testing decision in the case of Vernonia School District 47-J v. Acton et ux., (Docket 94-590, suspicionless drug testing of Vernonia, Oregon high school athletes), ruling that the institution’s interest in combatting drug abuse outweighed any right to privacy on the part of student-athletes. The “scientific expert” for the school district, noted in the ACLU’s Amicus Brief, is none other than Dr. Robert L. DuPont. Dr. DuPont first came to my attention when his paper cited above came in a two-inch thick bound volume of “scientific” papers I received from Psychemedics Corporation. In his paper Dr. DuPont waxes rhapsodic with respect to the virtues of the RIAH® drug test, and enthusiastically supports its expanded utilization. Is this man a disinterested and principled scientist or a partisan advocate of mass drug testing with a financial stake in its spread to all sectors of society?

Analytical chemistry can indeed be performed to a very high degree of accuracy and precision. But to assert that it can be done so on the cheap in mass production mode without putting innocent people at unacceptable risk is open to serious question. On January 16, 1991, Dr. Donald Cathlin, Director of the Olympic Analytical Laboratory on the UCLA campus was interviewed on National Public Radio’s “Morning Edition” by reporter Ina Jaffe (10:21 EST). She began by narrating that the lab was “among the most sophisticated drug-testing labs in the country, if not the world. Begun in 1982 to prepare for the Olympics in Los Angeles, the lab still performs more than 15,000 tests per year for the United States Olympic Committee, the National Football League, the National Collegiate Athletic Association, and the Defense Department.” Dr. Cathlin, in describing the analytical process at his facility, reported that GC/MS was used as a preliminary screening tool! Samples found to be GC/MS-positives were subjected to “another three or four days of additional chemist-chemical work...to make a final determination...”

The reason for such tender care should be obvious: no one wants to risk falsely accusing a $5 million per year ballplayer on the basis of a $20 EMIT screen. But Joe & Jane Lunchbucket et al are expected to submit en masse to a largely symbolic, wasteful process that poses serious question as to its propriety, efficacy, and potential to put drug-free individuals at risk of being falsely accused of criminal activity. Beyond the purely ethical, there are compelling scientific reasons for adhering to the “probable cause” selection standard in chemical testing for illicit drug use. Hard-core users will not be deterred by such non-cause measures (Charles Manson certainly was not), and the overwhelming majority that are in fact drug-free provide no information of value by being forced to submit; the costs of processing their samples are both figuratively and literally poured down the drain. Indeed, most of the truly “hard-core” drug users are not even in the work force, and those that are typically dwell in the transient semi- and unskilled employment sectors, not in the technical and professional domains so aggressively targeted by commercial drug testing marketers. The truly high prevalence strata were long ago identified and subjected to testing.

The campaign to extend non-cause testing to all employment sectors has everything to do with political symbolism and laboratory profitability, and nothing whatever to do with effective social policy. Such token measures are not, however, without significant costs; a 1991 GAO investigation of federal employment non-cause testing programs revealed a confirmed positive rate of 0.5% (that’s a mere 0.005), and pegged the administrative cost at $77,000 per confirmed positive. Conservatively assigned cut-off limits cannot but indicate that the analyses are in fact being performed at “probable-cause” sensitivity levels anyway (with the hope that no one will notice, and be cowed into compliance), resulting in a significant proportion of “false negatives” among casual drug users tested.**

**How are we otherwise to account for the large gap between the purported unacceptably high overall work force prevalence rates and low confirmed positive findings such as those detailed in the GAO findings? Opting for the probable-cause strategy yields better science, in that the prevalence rate among those tested would by definition be greater than 50% [ p(A) > 0.5 ], thus significantly lowering the Bayesian probability of false positive findings while enabling the labs to perform more accurate analyses on a much smaller workload (for which they could charge commensurately more to cover the costs of performing forensic quality analyses), a circumstance which would also diminish the proportion of false negatives, because cut-off levels could be lowered, yielding better sensitivity.

Moreover, the drug metabolite concentration levels in the probable-cause specimens would undoubtedly be significantly higher on average, making true positive determinations far less susceptible to challenge. The rationale for the ethical and legal concept of “probable cause” is not purely “political” and “philosophical.” It is implicitly rooted in sound science.
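The Bayesian point can be made concrete with a short sketch of my own, not part of the GAO or NAS material. It borrows the illustrative screen characteristics used earlier in this chapter (90% sensitivity, 2% false positive rate) and asks what fraction of screen positives are true positives at three prevalence levels. Treating the GAO's 0.5% confirmed-positive rate as if it were the true prevalence is an assumption made purely for illustration.

```python
def positive_predictive_value(prevalence: float, sensitivity: float,
                              false_positive_rate: float) -> float:
    """Bayes' rule: P(user | screen positive)."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)

SENSITIVITY = 0.90          # illustrative figure from the earlier workload estimate
FALSE_POSITIVE_RATE = 0.02  # likewise illustrative

# 0.5%: GAO confirmed-positive rate (assumed here to stand in for prevalence);
# 8.3%: the NIDA work force claim; 50%: the probable-cause floor p(A) > 0.5.
for prevalence in (0.005, 0.083, 0.50):
    ppv = positive_predictive_value(prevalence, SENSITIVITY, FALSE_POSITIVE_RATE)
    print(f"prevalence {prevalence:6.1%}: "
          f"{ppv:.1%} of screen positives are true, {1 - ppv:.1%} false")
```

With these inputs the false share of screen positives falls from roughly 82% at 0.5% prevalence, to about 20% at 8.3%, to about 2% at 50%. The same instrument and the same cut-off give a radically different evidentiary value depending on who is selected for testing.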

The net result of mass suspicionless drug testing, its ethical poverty aside, is an egregious waste of scientific resources, to the detriment of all who have need of high-caliber analytical laboratory services.

Food for thought: You are an administrator perusing your organization’s employee drug test reports routinely sent to you by your laboratory vendor. You assume the samples were indeed processed, and that the results are “accurate.” How can you know this with certainty? The question of competence usually focuses on methodological reliability, but there is also the possibility of outright fraud to consider, with disquieting precedent. In the late 1980s a scandal came to light within the EPA’s Contractor Laboratory Program (CLP), wherein a number of CLP-certified labs performing environmental analyses for the government and Superfund liability clients were found to have doctored or simply fabricated many of their lab results, in one instance for more than a year. Eventually, disbarments and criminal convictions resulted, but such fraudulent practices had gone on undetected for quite some time, in a highly regulated environment with predominantly savvy clients requiring forensic quality analyses for use in contamination and/or exposure litigation.

In the case of mass workplace drug screening, how can technologically unsophisticated clients know whether all samples are in fact processed—that some percentage are not simply discarded to ease backlogs? After all, everyone seems content with low positive findings. If the customers are not submitting their own QC blinds—problematic in any event, given that laboratory personnel usually collect the workers’ specimens—how can there be verification that all samples are fully processed? Do the labs stabilize and archive reserve aliquots of all samples for subsequent audit re-testing? For how long? Is such even possible, given current and proposed specimen volumes? Will SAMHSA/NIDA accreditation and oversight suffice, given its conflicting multiple roles in the “War On Drugs”? Such questions are unlikely to even come to mind for most drug testing clients. But they should.

Other matrices:

Hair testing

Hair assays are based on the principle that, in addition to urine, feces, saliva, and sweat, a routine exit pathway of metabolic biochemicals is that of the hair and nails. The process is known as “keratinization,” wherein trace amounts of a breadth of chemicals become entrapped and preserved within the hair shafts and nails. These chemicals are recoverable, identifiable, and quantifiable using sophisticated analytical methods, of which the hair drug test purports to be one. Psychemedics’ RIAH®, or RadioImmunoAssay of Hair, is a patented process employing a proprietary chemical separation method in which specimens are processed through a number of chemical separation steps and adulterated with a radioactive “internal tracer.” The tracer competes with immunoassay antibodies for chemical “binding” with the analytes of interest. Analyte (e.g., illicit drug) concentration levels are inferred by comparing the remnant radioanalyte tracer “count rate” in the production sample against the “known” disintegration count rate of the radionuclide reference standard. A “calibration curve” is established and production sample drug concentrations estimated from the end-process count rate (recall the foregoing discussion of methodologically similar serum RIA methods).

While RIAH® is in fact a rather sophisticated analytical process, valid methodological concerns remain:

Nagging generalizability questions persist. Most of the “scientific” literature cited by the patent-holder is of their own paternity and consists of animal studies and investigations within clinical drug abuse rehab cohorts. Analytical accuracy extrapolations to the general population (and mass-production commercial analysis) should be viewed critically.
Quality control realism issues: The vendor asserts that the drug user cannot wash the illicit metabolites out of the hair with any type of solvent/antidotes, but that the lab can “adsorb” QC spike concentrations “in” during analysis. True quality control realism would entail hair samples taken from a large cohort of volunteers who had ingested controlled amounts of the various illicit drug metabolites. Such cannot be done, for a host of what should be obvious reasons.
Cross-reactivity concerns: Are we to believe that only illicit drug metabolites excrete into the hair? The potential cross-reactivity, false positive concerns that attend conventional urine and serum analyses (see above) also apply to hair testing. For the vendor to suggest that GC/MS confirmation assays may not need be run in certain settings is outrageous.
There are a couple of overlooked sampling bias problems. First, the hair “sampling” protocol calls for snipping approximately 50 strands of hair from a location atop the head two inches posterior to a scalp midline figuratively “drawn” from ear to ear. Psychemedics’ own technical literature admits that hairs taken from other areas of the body yield significantly less reliable results. Well, this sampling location is precisely where many males experience at least partial baldness. Consistent results require consistent sampling, recall.

On a related note, one of the ostensible “virtues” of the hair test is its ability to “see back in time” to reveal a chronology of drug use. Human head hair grows at roughly 0.5 inches per month. The RIAH specimen collection protocol calls for a 1.5" hair sample which will purportedly reveal a 90-day prior history of any illicit drug use, whereas urine tests only reveal very recent drug use, so drug users need only abstain for a short while to pass their urine tests. Well, the hair assay then discriminates in favor of bald men and against women. Balding or not, if John applies for a job or comes to work wearing a close-cropped “buzz cut,” can there be any automatic adverse inference? But, can Jane show up just as unremarkably with, well, a “G.I. Jane” coiffure?

More on the 90-day analytical “look-back.” It is telling that the RIAH technical literature touts the virtue of its lengthy “SW” (“surveillance window”) rather than referring to such in conventional drug bioassay terminology as the “DW,” or “detection window.” “Detection window” refers to the time during which metabolites are detectable prior to excretion to below analytically quantifiable concentration levels. The detection window for cocaine metabolites in urine, for example, is on the order of 1-3 days, and the urine DW for the fat-soluble, slower-to-purge THC (marijuana metabolite) is 7-30 days (depending, of course, on consumption patterns).

For Psychemedics, quarterly hair testing would ensure ongoing “surveillance” of employees adequate to effect a “drug-free workforce” (semantically distinct from the mere “drug-free workplace”). A handy little marketing hook, for “surveillance” is truly the primary function of all suspicionless drug testing.

Hair Tests Tangled in Problems: 1/7/98

Nearly 80 percent of all U.S. firms rely on standardized urine tests for drug abuse. And while there is another type of test -- the examination of hair strands, which is said to identify far more users than urine tests -- federal authorities say there are some shortcomings to the process, The Philadelphia Inquirer reported Jan. 5.

"The scope of drug testing is expanding dramatically, and with expanding hair testing, the likelihood of bias will increase, too. It's a major problem," warned J. Michael Walsh, executive director of the President's Drug Advisory Council under President Ronald Reagan and George Bush, and now a consultant to the urinalysis industry. One cause for concern, the experts say, is when hair tests are used on non-Caucasians. This group may test positive more often because researchers have found that traces of drugs last longer in thick, dark hair than thin, light-colored hair. Hair tests also can't catch recent drug use, the way urine tests can.

The experts add that there is another problem with hair tests. To date, "hair analysis for the presence of drugs is unproven, unsupported by scientific literature or controlled trials," said Food and Drug Administration spokeswoman Sharon Snider.

While urine tests can easily detect marijuana use, the experts say that hair tests are better at detecting cocaine and heroin use. With that in mind, there is a belief that hair tests may someday be used in conjunction with urine analysis. "Hair testing may turn out to have a complementary role in workplace testing," said Robert Stevenson, deputy director of the Workplace Programs Division of the federal Center for Substance Abuse Prevention, "but we have yet to resolve remaining questions about its fairness and the ability to interpret results consistently."

Source: Join Together Online, 1-8-98

Surveillance backlash—“backwash,” or, wag the dog and baste the “turkey.”

As argued previously, for drug testing to truly serve a maximally deterrent function, implementation would have to be not only highly visible, but also unpredictable. Not only ought selection be random, but also the testing intervals themselves, so that those under surveillance could know neither whether nor when they might be selected to provide a specimen. Moreover, specimen collection procedures should also be maximally vigilant to preclude the possibility of adulteration or switching.
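One way to picture what “neither whether nor when” would require is a scheme in which every person faces the same independent chance of selection on every day, so that having been tested yesterday says nothing about today. The sketch below is only an illustration of that idea under stated assumptions; the function name, roster labels, seed, and 2% daily probability are hypothetical and do not describe any actual program.

```python
# Minimal sketch of memoryless random selection: each person is selected
# independently each day with the same probability, so neither the people
# chosen nor the gaps between a given person's tests are predictable.
# All names and numbers here are hypothetical.
import random


def select_for_testing(roster, daily_probability, rng):
    """Return the people selected today; each is chosen independently."""
    return [person for person in roster if rng.random() < daily_probability]


rng = random.Random(1998)  # seeded only so the demonstration is repeatable
roster = ["employee_a", "employee_b", "employee_c", "employee_d"]

for day in range(1, 11):
    selected = select_for_testing(roster, 0.02, rng)
    print(day, selected)
```

With a daily probability p, the average gap between one person's tests is 1/p days, but any individual gap may be one day or several months, which is exactly the unpredictability the deterrence argument presupposes.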

While some programs (e.g., Defense Department testing protocols) explicitly call for close direct visual monitoring of urination during specimen collection, our social squeamishness in this regard has resulted in a good bit of laxity, and there is even much approving discussion of procedures that “allow for reasonable privacy and personal dignity concerns” in the Supreme Court opinions which have ruled thus far on testing programs. In Chandler, for example, the Court noted that the Georgia law permitted those to be tested to have their specimens collected by their own physicians at a time of their own choosing. No ostensible “privacy” issue there. In Treasury, the Court stated that specimen “donors” were subjected to neither proximate visual nor aural monitoring, also obviating privacy concerns. In Railway, it was noted by the Court that security procedures only extended to auditory monitoring by a same-sex overseer outside the secured stall, a negligible intrusion. Only in Vernonia were privacy considerations dismissed out of hand, for Opinion author Scalia concluded that (1) student-athletes are not otherwise fussy about locker-room physical exposure, including the use of communal urinals, and (2) even were they, students do not have full constitutional rights in any event.

Hard-liners insist that visual monitoring is an indispensable part of any complete “chain-of-custody” regimen meant to obviate evasion. The hard-liners have a point. But never underestimate the resourcefulness of the evader. As recounted by John Coombs in Drug-impaired Professionals (see Chapter 2):

Although testing procedures are carefully monitored to prevent cheating, addicts devise ingenious methods to escape detection (Coombs and West 1991). For example, a female clinician ran a feeder tube down through her buttocks. By leaning back against a urine-filled bag, she got the expected urine stream. A young anesthesiologist filled a small polyethylene bag with clean urine, palmed it, and, by squeezing the sides together, delivered the appropriate specimen in the appropriate arc when observed by testing personnel. At first his wife provided the specimen, but when she later refused, he substituted his dog’s urine. “We didn’t realize we were analyzing the dog’s urine until later when he told us” the testing director remarked. (Coombs, pp. 184-5)

Rex! Here, boy; lift your leg; hold it right there...Well, we can put a stop to all evasive measures by requiring a full nudity strip search, right?

Wrong. Coombs continues:

A urologist catheterized his own bladder, removed his urine, substituted clean urine, and urinated clean urine under close supervision at the testing site. (p. 185)

Wow. Former NFL lineman and author Tim Green recounts similar tactics in his recent book about life in the National Football League, The Dark Side of the Game, wherein Green describes how football players would sometimes use an ordinary kitchen turkey baster and catheter tube to inject clean urine back into their bladders prior to submitting to a drug test.

Wag the Dog; Baste the metabolite Turkey; Darth Evader strikes back.

Other matrices, continued: saliva and sweat

Saliva has been proposed as a viable alternative assay matrix for drug testing. Vendors of saliva testing services base its ostensible preferability on on-site convenience, meaning that screening can be performed on the spot by technicians requiring only minimal training. All the costly elements of laboratory assays (including chain-of-custody expenses) are obviated. Two recent items concerning saliva tests follow:

Thursday January 15, 1998, 3:06 pm Eastern Time
Company Press Release, SOURCE: Epitope, Inc.

Epitope Announces FDA Clearance of Oral Fluid Assay for Cocaine

BEAVERTON, Ore., Jan. 15 /PRNewswire/ -- Epitope, Inc. (Nasdaq: EPTO - news) today announced that its research partner, STC Technologies, Inc., located in Bethlehem, Pennsylvania, has received FDA clearance for the STC Cocaine Metabolite Micro-Plate EIA (enzyme immunoassay) Kit for use in detecting cocaine and cocaine metabolites in oral fluid collected with the OraSure(R)/EpiScreen(R) Oral Specimen Collection Device manufactured by Epitope. “This is the first oral fluid-based immunoassay for drugs of abuse cleared by the U.S. Food and Drug Administration,” said Sam Niedbala, Ph.D., executive vice president, research and development of STC...

...Drugs-of-abuse testing generally occurs in one of four basic testing segments: 1) Clinical testing including hospital emergency rooms, laboratories, and drug rehabilitation centers, 2) government mandated testing, such as testing of transportation workers (D.O.T.), defense contractors (D.O.D.), and other governmental contractors, 3) forensic testing, including applications in the criminal justice system, law enforcement, the courts, and probation/parole programs, and 4) industrial testing for employment evaluation and drug-free workplace programs. In each of these segments OraSure testing will provide an alternative for sample collection that can be performed in any setting, is non-invasive, is less embarrassing, and improves the chain of custody.

SALIVA AS AN ANALYTICAL TOOL IN TOXICOLOGY

Karin M. Höld, B.S.; Douwe de Boer, Ph.D.; Jan Zuidema, Ph.D.; Robert A.A. Maes, Ph.D.

http://big.stpt.usf.edu/~journal/hold.html, Fall 1996, International Journal of Drug Testing

from ABSTRACT

“Due to our present incomplete knowledge of saliva as a biological specimen, saliva drug levels should be used concomitantly with recorded drug concentrations in other fluids, e.g. plasma, to contribute to a more ideal interpretation of drug concentrations in clinical and forensic case studies.”

While the saliva test will be marketed initially as an alternative drug abuse assay, it will invariably end up being an addition to conventional lab methods, resulting not in savings but in additional employer expense.

The “Patch”

PharmChem, a drug testing marketer based in Menlo Park, California, recently received FDA clearance for sales of its drug abuse sweat patch. This adhesive device is actually a replacement for the urine sample vial rather than a method of detecting the presence of drugs directly. The patch, applied to the body for a week or two, absorbs sweat and retains it until the strip is analyzed via GC/MS in the lab. NIDA, while interested in the potential of this matrix (as an ongoing surveillance device), cautions that it is presently encumbered by high false positive potential. While the FDA authorization pertains to criminal justice applications, PharmChem is reported to be seeking approval to market this technology to the employment sector.

“Character” testing: Handwriting analysis and the “MMPI Lite” (Substance Abuse Subtle Screening Inventory)

Can we uncover actual or potential “drug abusers” by subjecting them to “character” tests rather than the invasive bioassay methods that are so contentious? Use of lie detectors is largely proscribed by law, and hiring private detectives to shadow people to surveil and assess their “moral habits” is impractical, so some companies have tried administering psychological assessment instruments to discern “moral character defects” such as drug-seeking propensities or, more generally, “dishonesty” and “laxity” traits. Some employers have tried handwriting analysis, for example (see below), while others administer “personality tests” to weed out undesirables.

An acquaintance of mine was turned down recently for a job. The hiring firm told her that she was rejected because their handwriting analysis had shown that she was not "open to learning." Use of handwriting analysis ("graphology") to screen prospective employees is widespread in France and is becoming more common in both the United Kingdom and America. While job interviews, applications and recommendations remain the preferred screening techniques, more than 5% of American companies used graphology when hiring as of 1990. That number has almost certainly increased during the last few years, with well-known firms like Citibank and Bristol-Myers experimenting with the technique.

Consultants who advocate graphological screening claim that such analysis is able to reveal important character traits of job applicants. While analysis cannot disclose a person's age or sex, it allegedly can discern (at relatively low cost) character traits of a potential hire--e.g., a candidate's stability, vivacity, creativity, intelligence, imagination, reasoning ability, speed of thinking, and force of character. In addition, graphologists maintain that their skill is extremely useful in identifying paedophiles, sociopaths, persons with cancer, schizophrenics, and epileptics. Some go so far as to maintain that the individual's entire personality structure appears in his or her handwriting. The individual is said to have no secrets before the graphologist who supposedly can tell one's private sexual likes and dislikes from a handwriting specimen. (Handwriting Analysis in Pre-Employment Screening, Daryl Koehn, Ph.D., The Online Journal of Ethics, DePaul University, Chicago, IL, URL: http://condor.depaul.edu/ethics/hand.html)

Like suspicionless drug testing, “character” evaluation in the workplace is also a very controversial area, for employment evaluations are by law supposed to be limited to job-skills suitability (with the limited exception of inquiry into prior criminal convictions). The Target discount store chain, for example, was successfully sued several years ago over its use of the MMPI (Minnesota Multiphasic Personality Inventory) in assessing employment candidates’ “character.” The MMPI is a clinical diagnostic instrument appropriate only for the evaluation of psychopathology in clinical settings. (See Soroka v. Dayton Hudson Corp., No. AO52157, 10-25-91. See also Getting Personal, ABA Journal, Vol.78, Jan.1992, pp.66-67.)

A Florida firm, CERA, Inc., now touts to businesses a paper-and-pencil 88-question “chemical dependency” test known as SASSI-2, the “Substance Abuse Subtle Screening Inventory,” purporting to identify actual or potential drug abusers, with a claimed “overall 94% accuracy level.” Essentially a pared-down and more narrowly-focused MMPI knock-off, this instrument is marketed as a cost-effective alternative to traditional employment urine screening. The job applicant or employee completes the test (said to be written at a 6th-grade comprehension level) in 10-15 minutes. It is then faxed to the vendor, where it is “scored” and interpreted. The results are then faxed back to the client within the hour. The SASSI instrument is said to be unaffected by age, gender, and educational level factors, i.e., that it is “generalizable” and suitable for broad employment application. Is this so?

A brief look into the SASSI validation methodology documentation follows:

This is to summarize the reliability and validity of the Substance Abuse Subtle Screening Inventory (SASSI-2). The SASSI-2 is a psychological assessment tool designed to identify those who have a substance-related disorder. This instrument is currently used by hundreds of organizations including addictions treatment centers, hospitals, other health care organizations, and employee assistance programs in corporate settings.

Sample:

These findings are based on the results of 2,954 respondents. Ninety percent of this sample consisted of clients from a variety of treatment programs throughout the country, including psychiatric hospitals, addictions treatment centers, a dual diagnosis hospital (substance abuse and psychiatric), a sex offender treatment program, a vocational rehabilitation program and a county detention center. The remaining ten percent consisted of college students and a group who responded to a newspaper ad requesting subjects with a family history of alcohol use. Sixty-six percent of the total sample was male. Sixty percent of the sample was Caucasian, 22% was African-American, and 11% was Native American. Forty-one percent reported never being married, whereas 30% were married and 25% were divorced. The average age of this group of respondents was 32, and the average educational level was tenth grade. Thirty-six percent of the sample (n =1,053) had been interviewed and diagnosed by a trained clinician. Of those interviewed, 75% were diagnosed as being dependent on alcohol and/or some other substance.

Accuracy In Identifying Substance Use Disorders...Combined Sample (n=839)

Prevalence of Disorder: 80%
Sensitivity: 94%
Specificity: 94%
PPP: 98%
NPP: 80%
Overall Accuracy: 94%

...results indicate that the SASSI-2 is a reliable and valid measurement tool and support its use for clinical assessment. The overall reliability coefficient for the SASSI-2 (coefficient alpha) was found to be .93, and supplementary analyses support the reliability of the SASSI-2 subscales. The SASSI-2 was found to correspond highly with independent clinical diagnoses. The SASSI-2 was also associated with theoretically related criteria (e.g., substance-related arrests and the number of illicit drugs used) but unassociated with theoretically unrelated criteria (e.g., intelligence, reading and arithmetic tests).

This is a rather glaring example of selection bias in a putative “validation” sample. These vendors cannot sustain generalizability claims given the homogeneity of this research group. This type of methodological myopia is something of a syndrome in drug abuse research generally. Clinicians working with drug rehab clientele routinely probe the psyches of their “patients” and sometimes “publish” (“market” would be the more apt term) their findings as characteristic of the general population without any serious effort to evaluate the salient attributes of non-problem “control group” cohorts. Indeed, our ostensible clinical understanding of the true nature of “addiction” is burdened by this type of limitation, for we cannot just randomly select subjects representative of the general population for continued controlled dosing with psychoactive chemicals with the intent of analyzing the proximate and long-term outcomes. Such would be at once illegal and unethical.

The SASSI validation report contains a couple of oddities. First, it jumps from a discussion of the demographic attributes of the 2,954 validation pool respondents straight to prevalence and “accuracy” tabulations based on a “Combined Sample (n=839)” of a “SASSI-3,” with no explanation of the difference in the cohort sizes or whether SASSI-2 and SASSI-3 are the same instruments. Second, there is no information presented anywhere in the report on the derivation of the “specificity” percentage of “94%.” Recall that specificity refers to the power of a measure to correctly identify and exclude “true negatives.” Given the small “n” and obvious sampling bias in the gathering of this cohort, it is far from clear that the number “94%” has any precise (and independently replicable) meaning. A homogeneous, mostly captive clinical cohort of undereducated, intoxicant-prone subjects purporting to represent a validation anchor certifying the general employment assessment propriety of this instrument is dubious at best.
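Why the 80% prevalence of the validation cohort matters can be shown with a short calculation. The sketch below applies Bayes' rule to the sensitivity, specificity, and prevalence figures quoted from the vendor summary, and then to a much lower prevalence of 5%. That 5% figure is purely a hypothetical stand-in for a general workforce, not a number from the report, and the function name is likewise an illustration only.

```python
# Minimal sketch: positive and negative predictive values (the report's
# "PPP" and "NPP") as functions of sensitivity, specificity, and prevalence.

def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary screen via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Figures quoted in the vendor summary: 94% sensitivity and specificity,
# 80% prevalence in the (largely clinical) validation cohort.
ppv_clinic, npv_clinic = predictive_values(0.94, 0.94, 0.80)
print(f"validation cohort:    PPV={ppv_clinic:.1%}  NPV={npv_clinic:.1%}")

# Hypothetical general-workforce prevalence of 5% (an assumption made
# only for illustration), with the same claimed 94%/94% performance.
ppv_work, npv_work = predictive_values(0.94, 0.94, 0.05)
print(f"hypothetical workforce: PPV={ppv_work:.1%}  NPV={npv_work:.1%}")
```

At 80% prevalence the calculation reproduces the reported 98% and 80% predictive values. At the hypothetical 5% prevalence, and even granting the claimed 94% specificity, fewer than half of those flagged would actually have the disorder, so the cohort's makeup is not a side issue but the heart of the generalizability question.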

Also interesting is that this vendor also touts its method as a cost-effective alternative to traditional lab drug testing, but then goes on to embellish the asserted accuracy of SASSI with the following:

The utility of the SASSI has been demonstrated in clinical research with thousands of individuals. When used sequentially with urine drug testing the published accuracy rate is 99%. (source http://www.cerainc.com/html/sassi.html)

Well, is SASSI a cost-effective replacement for the urine screen, or one more methodologically suspect auxiliary expense?

Employer expense questions aside, any drug-free individual consenting to submit to this “test” should perhaps be IQ tested as well.

Market considerations

Vendors of drug testing services operate in an intensely competitive, low-bid market. True forensic “cost-plus” pricing is impractical, for employers and administrators want to keep screening costs as low as possible (hence quiet proposals to do away with the relatively expensive GC/MS confirmations for pre-employment screens). Recent stock market performance histories of major publicly traded testing firms are, in the aggregate, less than exhilarating. Several examples below illustrate the circumstances:

Psychemedics (ticker symbol: PMD), heretofore an aggressive marketer of employment hair testing, recently began hawking a home-test kit (the PDT-90) sold through discount drug store chains and pitched to worried parents. Epitope (ticker symbol: EPTO) is banking on its recently FDA-approved saliva test OraSure® (developed in partnership with STC Technologies, Inc) to help reverse the negative slope of its stock performance. OraSure® is aimed at the employment screening market, and is portrayed as a convenient and cost-effective “onsite” alternative to traditional lab services. Chemtrak, Inc. (ticker symbol: CMTR), a vendor of a variety of home-test and physicians’ internal office lab assays, has struggled with declining share prices for years. They are hopeful that their new “Parent’s Alert” home drug-testing kit will improve their fortunes. PharmChem (ticker symbol: PCHM) is the vendor of a variety of drug testing products and services, among them the recently FDA-approved drug-detecting skin “sweat patch.”

Quite a cohort of stock performance downhill skiers. For context, a graphic illustrating aggregate market performance during the same period is provided below.

It is indeed a tough market for vendors of drug screening products and services (as the foregoing graphs indicate). In 1997 a major vendor of traditional employment-sector urine screening, U.S. Drug Testing Corporation, succumbed and was liquidated in Chapter 7 bankruptcy proceedings.

In such a difficult arena, two imperatives obtain. First, expanded market share is critical to eventual profitability, hence the enlistment of legislative collaborators through “educational” organizations whose activities focus on enactment of mandatory drug-testing laws, such as The Alliance for Model State Drug Laws and kindred lobbying groups such as The Institute for a Drug-Free Workplace. Second, there is ceaseless pressure to cut costs in the labs. Employee attitude surveys in laboratory journals reflect these circumstances, revealing a litany of discontent caused by excessive workloads, inadequate resources and pay, lack of training, understaffing, low morale, and high turnover.

Those who opted for laboratory careers with TV ad images of Glaxo glory and “Dow-Lets-Ya-Do-Great-Things” clinical whitecoat glamour in mind come to be considerably less than ecstatic working in high-stress, mass-production, methodologically banal environments having more in common with poultry processing plants than intellectually stimulating scientific enclaves. Yet we propose to exacerbate the infrastructure burden by proposing more unjustified suspicionless drug screening at every turn.

These minds, machines, and methods can be put to much better uses than those comprising a constabulary Peestone Cops.

Summary thus far:

Drug policy historical and political contexts are anything but rational (Chapter 1). Likewise for the epidemiological and social science empirical foundations (Chapter 2). The present chapter examines some of the core issues that call into serious question the wisdom of committing so much of the bioassay infrastructure to these dubious ends. Next: Is any of this truly legal? In Chapter 4 we see how historical ignorance, political dissembling, and scientific naivete combine with jurisprudential amnesia to turn the Fourth Amendment on its head.
