Friday, October 30, 2009

Methodology: Logistic Regression and Relative Risk

I was using logistic regression today and as always am troubled by how to report the results. I did a little sleuthing and here's what I decided. Odds Ratios = Bad, Relative Risk = Not so Bad.

To begin at the beginning.

Logistic regression is a form of regression which is used when the dependent variable is a dichotomous variable. In other words, continuous variables are not used as dependents in logistic regression. However, independent variables included in the model may be of any type.

Multinomial logistic regression is used when the dependent variable has more than two classes, although it can also be used for binary dependent variables. When multiple classes of the dependent variable can be ranked, then ordinal logistic regression is preferred to multinomial logistic regression.

Critically, the impact of predictor variables is usually explained in terms of odds ratios.

Unfortunately, odds ratios are not intuitive to most people and many interpret them (incorrectly) as probabilities. For example, suppose there was a group of dogs, 40 of which were male and 60 of which were female. The probability of randomly selecting a male dog is 40 / (60+40) or 40/100 = 40%. The odds of randomly selecting a male dog, however, are quite different: 40/(100-40) or 40/60 = .67, that is, 2 to 3.
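The difference between a probability and odds in the dog example can be checked with a few lines of Python. This is only a sketch; the function names are mine and not part of any statistics package.

```python
def probability(events, total):
    """Share of all cases in which the event occurs."""
    return events / total


def odds(events, total):
    """Cases with the event divided by cases without it."""
    return events / (total - events)


male_dogs, female_dogs = 40, 60
all_dogs = male_dogs + female_dogs

p_male = probability(male_dogs, all_dogs)   # 40/100
odds_male = odds(male_dogs, all_dogs)       # 40/60

# Odds can always be converted back to a probability.
p_from_odds = odds_male / (1 + odds_male)

print(f"probability = {p_male:.2f}, odds = {odds_male:.2f}")
```

The two numbers only look alike when the event is rare; as the probability grows, the odds pull away from it quickly.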

Let's look at another example. Suppose that we have a group of students of which some are classified as ADD. Of 80 boys, 13 were classified as ADD and 67 were not. Of 100 girls, 6 were classified as ADD and 94 were not. The odds of a boy being classified as ADD (as the logistic regression output would report) are 13/67 = .194; the odds of a girl being so classified are 6/94 = .064. The odds of being classified as ADD vary based upon sex. We could report the odds ratio of a boy being classified as compared to a girl as .194/.064 = 3.03125:1 or roughly 3:1. Unfortunately, again, this is not the probability that a boy will be classified as ADD compared to a girl. What it means is that for every boy not classified as ADD, 3 times as many boys will be classified as ADD than the number of girls classified for every girl not classified.

Do not write that in your report. Because it is uninterpretable (which is not a word), but you know what I mean.

So instead, reporting an estimated relative risk may be the best bet.

Estimated Relative Risk = Odds Ratio / ((1-Pr) + (Pr * Odds Ratio))
where Pr is the proportion of non-treated persons that exhibit the outcome of interest.

In our example the Odds Ratio is 3.03125 (about 3.04 if the odds are not rounded first) and Pr is the proportion of girls who are classified as ADD = 6/100 = .06

Thus the estimated relative risk of a boy being classified as ADD is:
3.03125 / ((1-.06) + (.06*3.03125)) = 3.03125 / (.94+.1819) = 3.03125/1.1219 = 2.702, or "Boys are 2.7 times as likely as girls to be classified as ADD."

Compared to the true relative risk = (13/80) / (6/100) = .1625/.06 = 2.71, our estimated risk is not a bad estimate - and much easier to explain!
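For anyone who wants to check the arithmetic, here is a minimal Python sketch of the conversion using the counts from the ADD example. The function and variable names are my own. The code keeps the odds unrounded, so its odds ratio comes out at about 3.04 rather than 3.03125.

```python
def odds_ratio(events_a, total_a, events_b, total_b):
    """Odds of the event in group A divided by the odds in group B."""
    odds_a = events_a / (total_a - events_a)
    odds_b = events_b / (total_b - events_b)
    return odds_a / odds_b


def estimated_relative_risk(or_value, baseline_risk):
    """Odds Ratio / ((1 - Pr) + (Pr * Odds Ratio)), with Pr the baseline risk."""
    return or_value / ((1 - baseline_risk) + baseline_risk * or_value)


boys_add, boys_total = 13, 80
girls_add, girls_total = 6, 100

or_boys_vs_girls = odds_ratio(boys_add, boys_total, girls_add, girls_total)
pr = girls_add / girls_total                      # baseline risk (girls)
rr_estimated = estimated_relative_risk(or_boys_vs_girls, pr)
rr_true = (boys_add / boys_total) / pr            # direct calculation

print(f"odds ratio = {or_boys_vs_girls:.3f}")
print(f"estimated RR = {rr_estimated:.3f}, true RR = {rr_true:.3f}")
```

In a simple two-group table like this one, the formula recovers the true relative risk exactly when Pr is the observed rate in the comparison group, so the small gap between 2.702 and 2.71 above comes only from rounding the odds before dividing.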

Wednesday, October 14, 2009

Surveying: Bias in Survey Response Scales - Ambiguity in Frequency Descriptors

Bias in Surveys
Surveys are a cost-effective means of obtaining information from a wide variety of persons. For that reason they are often used in research and evaluation as a central data collection activity. Unfortunately, many persons who have no training in survey development and may be unaware of the issues of bias and error develop surveys for use as data collection tools.

One of the critical concerns when developing surveys is the issue of error. Error in surveys is the difference between one’s true answer and one’s actual answer. In an ideal situation error would be zero and there would be no difference between the true and actual answer. In algebraic terms, where T= True answer, A = Actual answer, and e is error, T = A + e.

Although we cannot calculate a person’s true answer, we can reduce error by addressing the bias which is known to inflate error. In terms of actually constructing the survey, including developing survey items/questions and response options for closed-ended questions, reducing measurement error must be a critical concern to preserve the quality of data resulting from the survey.

Measurement error refers to the bias introduced as part of the measurement process, and can be introduced based on question wording, question placement, the context in which questions are asked, and the response options provided for closed-ended questions. Unfortunately, very little research exists as to ways in which response options introduce bias in survey responses. Without empirical research in this area it is hard to estimate how big a problem measurement error may be for a particular survey because of the response options provided. Understanding ways in which response options may introduce bias is critical to changing survey practices regarding the development of response options and understanding the quality of the data obtained by the survey.

Typical survey response options for a frequency scale
One type of closed-ended question often asked on a survey is the frequency with which an event occurred. For example, a survey on student engagement may ask the respondent, the student in this case, how frequently, on average, throughout a typical week, he or she raises his or her hand to answer a question in class. As respondents are very unlikely to know the exact number of times, on average in a week, that they do this, they are often asked to provide a best guess. To help them estimate this frequency, a surveyor may provide the following responses to choose from, rather than just having the respondent provide a number: Never, A couple of times, A few times, Several times. Unfortunately, when these responses are provided with no additional information, such as the number or range of numbers associated with them, bias is introduced into the responses, directly affecting the reliability of the resulting data.

Bias in the use of frequency descriptors with no numerical anchoring
To assess the degree to which bias is introduced by the above commonly used frequency descriptors, as part of a workshop on survey quality 27 persons responded to the following directions:

Given the question “On average, how often do you withdraw cash from your bank account each month?” and the following response options “Never, Once, A couple of times, A few times, and Several times”, please assign a number or range of numbers to these response options.

Never =
Once =
A couple of times =
A few times =
Several times =

The question, “On average, how often do you withdraw cash from your bank account each month?” was provided to respondents as it placed a time span or boundary to the potential values for the descriptors and ensured that all respondents were responding within the same context.

Results
The table below shows the minimum and maximum values associated with each descriptor, as well as the mean of the ranges and the standard deviation of these means. As can be seen, the only descriptor where the standard deviation was zero was “Never”. While that finding is not surprising, as by definition never means at no time, what is surprising is that the descriptor “Once” has a mean greater than 1.00 and a nonzero standard deviation. Although the mean is very close to 1.00 (1.10) and the standard deviation is quite small (0.28), it is somewhat surprising that a range of numbers was associated with “Once” instead of just the number one. “A couple of times” was also associated with numbers other than two: its minimum was two, but the maximum value associated with it was eight. While the mean value associated with that descriptor is fairly close to 2.00 (2.60), its standard deviation is quite large at 1.04. This suggests that the values respondents associate with that descriptor vary greatly. Not surprisingly, the descriptors “A few times” and “Several times” appear to be fairly ambiguous in their meaning to respondents as well, as is evident from the large standard deviations associated with them (2.19 and 4.24, respectively).

Table 1: Numeric values associated with frequency descriptors
Descriptor          n    Min.  Max.  Mean  sd
Never               27   0     0     0.00  0.00
Once                27   1     2     1.10  0.28
A couple of times   27   2     8     2.60  1.04
A few times         27   2     14    4.73  2.19
Several times       27   4     24    9.13  4.24
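As a rough illustration of how a row of such a table could be computed, here is a Python sketch. The responses in it are invented for the example and are NOT the workshop data. I am also assuming that each answer is recorded as a (low, high) pair, that a single number n is stored as (n, n), that the "mean of the ranges" is the mean of each respondent's midpoint, and that the standard deviation is the sample standard deviation.

```python
from statistics import mean, stdev


def midpoint(response):
    """Midpoint of a (low, high) answer; a single number n is stored as (n, n)."""
    low, high = response
    return (low + high) / 2


# Hypothetical answers for "A couple of times" (made up for illustration only).
hypothetical_responses = [(2, 2), (2, 3), (2, 2), (2, 4), (3, 5)]

midpoints = [midpoint(r) for r in hypothetical_responses]

summary = {
    "n": len(hypothetical_responses),
    "min": min(low for low, _ in hypothetical_responses),
    "max": max(high for _, high in hypothetical_responses),
    "mean": mean(midpoints),
    "sd": stdev(midpoints),   # sample standard deviation (an assumption)
}

print(summary)
```

A descriptor that everyone understood the same way would produce a standard deviation of zero, as “Never” does in Table 1.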

Discussion
The findings presented above suggest that frequently used response option descriptors related to the frequency of an event or behavior occurring may introduce bias and thus error into survey responses. Although this finding may not be surprising to some who believe that frequency response options should be defined numerically, the degree of error associated with “Once” and “A couple of times” is surprising, given that by convention they are assumed to mean one time and two times, respectively. Not only is the range of numbers associated with the descriptors “A couple of times” and “Several times” quite large; the standard deviation of the means of these ranges is also quite large, further suggesting that there is room for much measurement error when these descriptors are used without accompanying numbers.

The implication for survey developers is both clear and significant, as these findings suggest that there is not an agreed-upon definition for these typically used frequency descriptors. While the remedy may seem clear (rather than use a description alone, associate a number or range of numbers with each descriptor, or use only numbers to represent the frequency of events or behaviors), the fact that numbers and number ranges are so often not included indicates that surveyors are unaware of the degree to which these descriptors introduce measurement error into survey findings.

To improve survey developers’ understanding of how response options may introduce bias and thus error into survey response, thus undermining the quality of data resulting from the survey, additional research needs to be undertaken to further assess bias in other response options and to replicate the findings in this study.

Consulting: Choice and the Business of Evaluation

Attached is a link to an article about choice: http://sivers.org/jam

According to this article, Columbia professor Sheena Iyengar ran a test to see what the implications were when customers were given more choices. She set up a free tasting booth in a grocery store with six different jams available for tasting. 40% of the customers stopped to taste with 30% of those (12% total) buying one of the jams. A week later she set up the same booth in the same store, but this time with twenty-four different jams available for tasting. 60% of the customers stopped to taste, but only 3% of those (2% total) bought some.

Her study suggests that too much choice has a negative impact on customers’ buying habits. What does this mean for me, the independent consultant?

As I start my newest business venture (EvalWorks) it makes me think that I don’t want to provide too many services lest there be a negative view about my ability to provide all of them equally well. So instead of showcasing myself as a jack-of-all-trades I’ve identified my two core competencies: Program Evaluation and Survey Research.

I have been conducting program evaluations at the local, state, and federal levels for over 10 years and have a rich background in statistical methods related to data analysis. Whereas I could highlight my ability to analyze data as a separate core competency, I can in no way compete against someone who has been trained as a statistician.

Although most program evaluation specialists conduct surveys as part of their program evaluation work, I have highlighted my survey research work because I have a Certificate in Survey Methodology from UNC and conduct workshops on survey best practices. Whereas some people come into surveying not knowing the research base, I use the research base to address potential errors such as sampling error, coverage error, measurement error, and non-response error (http://www.ropercenter.uconn.edu/education/polling_fundamentals_error.html). I’ve also presented this information at the AEA/CDC Evaluation Institute for the past three years and presented on survey reliability at the Australasian Evaluation Society annual conference in Canberra, Australia in September.

So as I start this new venture I’ll let you know how it goes, concentrating on my core competencies in hopes of not falling victim to scope creep. Sounds like a mold you would find in a bathtub. Yuck!

Monday, October 12, 2009

Evaluation: PowerPoint - The Good and The Ugly

Below are links to two different articles on the use of PowerPoint. Edward Tufte offers a searing commentary on the use of PowerPoint by NASA in making decisions about safety and how a lack of information, as a direct result of slides being presented instead of a thorough report, contributed to the fatal misjudgment of the damage to the space shuttle Columbia. Hammes’ article further explores the use of PowerPoint and identifies its weakness as a decision-making tool. He does, however, offer insight as to when PowerPoint may be an appropriate choice – information (as opposed to decision) briefs and operational decisions that need to be made quickly.

I agree with the views of both men, that PowerPoint should not be used for decision-making purposes, but serves as a very useful way of presenting information. I’ve further adopted Tufte’s (and Stephen Few’s) suggestions about presenting data to make my PowerPoint even more useful. As an evaluator, I’ve started sharing with the evaluation stakeholders a PowerPoint of all the data displays I plan to use in my final report. We review the data and I ensure that with little explanation they are able to identify the data of greatest importance denoted on each slide, and that these data address the questions they have about the evaluation.

By doing this I have accomplished three things:

1. I have ensured that my clients and evaluation stakeholders have reviewed all data pre-report and understand the findings;
2. I have identified any additional questions they may have for me to further explore with the data; and
3. I have initiated a process whereby the data cannot be ignored. Even if my report is put on a shelf I know that they have interacted with the data and thus are more likely to utilize the data.

Next I develop my report around the questions of critical importance, using the data tables and graphs I have already developed to identify critical findings. I then narrate the PowerPoint I originally developed by identifying questions each slide answers or pointing out data findings. The report and PowerPoint are both final products I can then provide to my client. He or she can share whichever one they want with staff, Board members, etc., deciding for him or herself which one will have the greatest impact with which person.

So far, by taking these steps, I have improved my clients’ ability to understand data findings, increased the likelihood that the data findings will be used by the client and stakeholders, and addressed the need for multiple reports for multiple audiences. My clients have been quite pleased and, as an independent consultant, this is certainly one of my goals.

Amy A. Germuth, Ph.D.
EvalWorks, LLC

---------------------------------------------------------

PowerPoint does Rocket Science – and Better Techniques for Technical Reports – E. Tufte
http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001yB&topic_id=1

Dumb-dumb Bullets – T.X. Hammes
http://www.afji.com/2009/07/4061641

Friday, October 9, 2009

Evaluation: Evaluation Post-Mortems

My partner is a self-proclaimed computer nerd who writes software. As part of how she develops software, she and her business partners analyze, design, code, test, re-factor and document specific features of the software they are designing. At the end of this procedure (usually a 30-day period) they engage in a formal post-mortem where they analyze the successes and failures of the project to date, adding their findings to their general knowledge base for future use.

Conducting post-mortems is a common practice in business, yet I have never heard anyone talk about doing such an analysis as a systematic process within evaluation. There is very little written about evaluations that don’t work, including why they don’t work. When we have that information (often from our own unfortunate experiences) we often keep it to ourselves, as reporting it would be reporting our failures.

What if we made a post-mortem part of the evaluation process? Isn’t evaluation an iterative and cyclical process for those of us who design and conduct them? What would a formal post-mortem process at interim periods in an evaluation look like and would it be beneficial (and if so, why and how?)?

When used to assess business projects, some persons engage in a two-step process. The first step is to provide the persons involved a list of questions about the project that they think about and respond to on their own. The second step involves bringing all persons and their responses together to share what they thought and discuss lessons learned.

Michael Greer (http://www.michaelgreer.com/postmortem.htm) has developed a list of general questions to guide post-mortems. They include such questions as:

1. Are you proud of our finished deliverables (project work products)? If yes, what's so good about them? If no, what's wrong with them?

2. What was the single most frustrating part of our project?

3. How would you do things differently next time to avoid this frustration?

4. What was the most gratifying or professionally satisfying part of the project?

5. Which methods or processes worked particularly well?

6. Which methods or processes were difficult or frustrating to use?

7. If you could wave a magic wand and change anything about the project, what would you change?

8. Did our stakeholders, senior managers, customers, and sponsor(s) participate effectively? If not, how could we improve their participation?

For evaluations, specifically, we might ask such questions as:

- How accurate were our original estimates of the time, cost, and other resources required of the evaluation? What did we over- or under-estimate?

- Knowing what we know now, would we have chosen the same type of evaluation design as the one we used? If not, what could have pointed us to a design that would have been better suited for such a project?

- Were our evaluation questions the best ones, or were there other questions we did not fully explore with stakeholders, through our evaluation, etc. that needed addressing?

- How would we rate the quality of the data we gathered and what could we have done to have collected more convincing data for formative and summative purposes?

- Did our presentation of results highlight the data so that stakeholders could make their own interpretations or understand the ones we made?

- What did we do to help stakeholders understand and use the evaluation findings?

While the list is endless, questions could be further identified by management areas (identifying the evaluand, evaluation method, data collection, data analysis, reporting, etc.).

I view a post-mortem as separate from just following the program evaluation standards (which some of these questions do get at) as it has a very formal outcome, the identification of lessons learned. And I see it as clearly separate from a meta-evaluation, as meta-evaluations are themselves evaluations and are designed not so much to identify lessons learned as to identify the value in the evaluation that was conducted. Also, for meta-evaluations to be viewed as unbiased, they need to be conducted by someone outside of the original evaluation, whereas post-mortems are specifically designed to engage the original evaluators.

I’d be interested in what others think and what others practice. Is there any group that is doing this in a formal and systematic manner where questions are identified and discussed and lessons learned are developed? What are good questions we should be considering if we want to make this part of our formal evaluation practice? Is anyone doing this in collaboration with the former evaluand?

Evaluation: The Tipping Point - A new type of case?

In Purposive Sampling subjects are selected because of some characteristic. Michael Quinn Patton (1990) has proposed the following cases of purposive sampling.

1. Extreme or Deviant Case - Learning from highly unusual manifestations of the phenomenon of interest, such as outstanding success/notable failures, top of the class/dropouts, exotic events, crises.

2. Intensity - Information-rich cases that manifest the phenomenon intensely, but not extremely, such as good students/poor students, above average/below average.

3. Maximum Variation - Purposefully picking a wide range of variation on dimensions of interest...documents unique or diverse variations that have emerged in adapting to different conditions. Identifies important common patterns that cut across variations.

4. Homogeneous - Focuses, reduces variation, simplifies analysis, and facilitates group interviewing.

5. Typical Case - Illustrates or highlights what is typical, normal, average.

6. Stratified Purposeful - Illustrates characteristics of particular subgroups of interest; facilitates comparisons.

7. Critical Case - Permits logical generalization and maximum application of information to other cases because if it's true of this one case it's likely to be true of all other cases.

8. Snowball or Chain - Identifies cases of interest from people who know people who know people who know what cases are information-rich, that is, good examples for study, good interview subjects.

9. Criterion - Picking all cases that meet some criterion, such as all children abused in a treatment facility. Quality assurance.

10. Theory-Based or Operational Construct - Finding manifestations of a theoretical construct of interest so as to elaborate and examine the construct.

11. Confirming or Disconfirming - Elaborating and deepening initial analysis, seeking exceptions, testing variation.

12. Opportunistic - Following new leads during fieldwork, taking advantage of the unexpected, flexibility.

13. Random Purposeful - (still small sample size) Adds credibility to sample when potential purposeful sample is larger than one can handle. Reduces judgment within a purposeful category. (Not for generalizations or representativeness.)

14. Politically Important Cases - Attracts attention to the study (or avoids attracting undesired attention by purposefully eliminating from the sample politically sensitive cases).

15. Convenience - Saves time, money, and effort. Poorest rationale; lowest credibility. Yields information-poor cases.

16. Combination or Mixed Purposeful - Triangulation, flexibility, meets multiple interests and needs. (Patton, 1990)

I was thinking about these cases when I recently re-read Malcolm Gladwell’s book, The Tipping Point. In this book he said: "The Law of The Few says that there are exceptional people out there who are capable of starting an epidemic. All you have to do is find them." (Page 132). Hence, he suggests, in so many words, a “Tipping Point” case. But wait, these cases can be further broken down, (and are, by Gladwell) to the following:

1. Connectors - people who "link us up with the world ... people with a special gift for bringing the world together." – think Kevin Bacon….

2. Mavens - “information brokers" or people who connect us with new information – I think of bloggers…..

3. Salesmen are "persuaders", often persons with powerful negotiation skills – Barack Obama?

These persons are different from extreme cases or critical cases, under Patton’s definition. They certainly fit “cases” one might consider in mathematical sociology, social network analysis, or case studies. So has Gladwell identified additional cases that should be added to the literature on the types of cases evaluators consider today when sampling?

Consulting: The 1% Rule

The 1% rule says that if you increase something by 1% a day, in 70 days you’ll have twice as much. That goes for consulting skills: if you increase your consulting skills by just 1% each day, then in 70 days you’ll be twice as good.

I didn’t believe it at first, and you may not either, so here’s the algebra behind it:

(1.01)^x = 2
Log((1.01)^x) = Log(2)
x * Log(1.01) = Log(2)
x = Log(2)/Log(1.01)
x = 69.66, or about 70
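If you would rather check this numerically than trust the logarithms, here is a short Python sketch (variable names are mine) that solves for x and also simply compounds 1% a day until the starting value has doubled.

```python
import math

daily_growth = 0.01

# Closed-form solution: x = Log(2) / Log(1.01)
exact_days = math.log(2) / math.log(1 + daily_growth)
days_needed = math.ceil(exact_days)

# Brute-force check: compound 1% per day until the level has doubled.
level = 1.0
days = 0
while level < 2.0:
    level *= 1 + daily_growth
    days += 1

print(f"exact solution: {exact_days:.2f} days; doubled after {days} whole days")
```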

So what are you doing to improve your evaluation and consulting skills?

Start by reading my blogs where I plan to address both areas on a weekly basis! And let me know what you think or what areas you think I should address. My hope is that I will learn and you will too.

Best regards,

Amy A. Germuth
EvalWorks, LLC
October, 2009

Surveying: How Big Should Our Sample Be? Part 2

Well, I was hit with the same question again a week ago: “How big should our sample be?” In this case I have five treatment schools and need to identify comparison schools so I can truly ascertain whether any changes are related to the program I am evaluating.

A first note – yes, I must make comparisons at the school level. Even though I will be using test scores and I would love to aggregate them at the classroom/teacher level or keep them at the student level, I am restricted to the school level for two reasons – 1) the intervention is a school-wide intervention, it’s not limited to certain teachers or students, and 2) the treatment schools were purposefully selected based on school-level characteristics.

What else do I know? Mainly that I will need a large number of schools in my comparison group as I am hoping to detect an intervention that may have a small effect size (.2 – maybe a little less), according to Cohen. Remember, Effect size = (Treatment group mean – Control group mean) / Pooled standard deviation. Thus I will need to design my study so that it will account for a potentially small effect size.

Luckily, I can again use a power analysis to help me out. For this analysis, I will set alpha to .05 (this limits my Type I error to 5%, in other words, limits the chances that my test accepts an effect that does not exist to 5%) and I will set beta, my Type II error rate, to 20% (thus limiting the chance that my test rejects an effect that actually exists to 20%).

Using these limits, I find that with 5 treatment schools, the smallest effect size my test can detect decreases as the number of comparison schools increases. For example, with 5 treatment schools, the effect sizes I can detect depending upon the number of comparison schools I include are as follows:

Comparison schools    Effect size detected
5                     .297
10                    .239
20                    .189
40                    .164

Right now I would like to use comparison schools in the same district so will probably try to identify the 20 elementary schools that match the closest on key school indicators (size, %FRL, % teachers highly qualified, etc.). That will allow me to identify an effect size of .189 or greater, if one exists.
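To show the general shape of this kind of calculation, here is a Python sketch of a minimum detectable effect size for a simple two-group comparison, using the normal approximation with alpha = .05 (two-sided) and power = .80. It is an illustration only, and the function name is my own. It expresses the effect in units of the school-level standard deviation and ignores covariates and the students nested within schools, so it will NOT reproduce the values in the table above, which rest on assumptions not spelled out in this post.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_treatment, n_comparison, alpha=0.05, power=0.80):
    """Smallest standardized difference detectable with the given power.

    Normal approximation for a two-sided, two-group comparison of means.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the Type I error limit
    z_power = z.inv_cdf(power)           # value tied to the Type II error limit
    return (z_alpha + z_power) * sqrt(1 / n_treatment + 1 / n_comparison)


treatment_schools = 5
mdes_by_comparison_n = {
    n: minimum_detectable_effect(treatment_schools, n) for n in (5, 10, 20, 40)
}

for n, effect in mdes_by_comparison_n.items():
    print(f"{n:>2} comparison schools -> detectable effect of {effect:.3f} SD")
```

The pattern matches the table even though the numbers do not: each doubling of the comparison group buys a smaller improvement, because the 5 treatment schools remain the limiting factor.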

(With special thanks to John Keltz at UW-Madison, part of the Center for Educator Compensation Reform, who gave a great talk on Effect Size and Power)