Monday, February 22, 2010
Evaluation: Quasi-Experimental Designs - Advantages and Disadvantages of Each
After all my talk about causality and Rubin, the fact remains that experimental designs are very hard to carry out in education. So here's an overview of quasi-experimental designs - some of which are worth trying! But not all!
One-Shot Pre Test Only Design:
O X
Ha-Ha - I made this one up! Who would do this, although I guess it could occur if you ran out of funding (or everyone dropped out of the treatment pool). I have seen stranger things proposed.....
One-shot Post Test Only Design:
X O
Sigh. How do you know there was even a change? Enough said.
One-shot Pre-Post Test Design:
O X O
Double sigh. Okay - you may be able to detect a change but how can you attribute it to the treatment? You need another group. However, I see this proposed WAY TOO OFTEN.
Post-test Only Intact group Design:
X O
- O
Another sigh (but only one). Good that you now have two groups. However, you don't know whether your two groups started at the same place. For example, if your treatment group scores high on a test (a good thing) and your control group does not, you don't know whether your treatment group would have scored higher than the control group to begin with (and whether their final score is really one of NO CHANGE).
Pre-Test Post-test Intact Group Design:
O X O
O - O
Now we're cooking! Two groups, hopefully equivalent. If persons were randomly assigned to a group this would be an experimental design (and we would be cooking with gas!). Without random assignment it remains a quasi-experimental design. However, if we used propensity score matching to match groups it would be a HIGH QUALITY quasi-experimental design. Or we could use regression discontinuity and use a cut score to determine the two groups. Again, that would result in a HIGH QUALITY quasi-experimental design.
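To make the logic of the two-group pre/post design concrete, here is a minimal sketch in Python. It compares the gain in the treatment group with the gain in the comparison group. The scores and the function name gain_difference are invented for illustration, and the sketch ignores sampling error and any matching step, so treat it as a picture of the idea rather than a full analysis.

```python
from statistics import mean

def gain_difference(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Return (treatment gain, control gain, difference in gains)."""
    treat_gain = mean(treat_post) - mean(treat_pre)
    ctrl_gain = mean(ctrl_post) - mean(ctrl_pre)
    return treat_gain, ctrl_gain, treat_gain - ctrl_gain

# Hypothetical test scores, invented purely for illustration.
treat_pre = [60, 65, 70, 75]
treat_post = [70, 75, 80, 85]
ctrl_pre = [58, 62, 66, 70]
ctrl_post = [61, 65, 69, 73]

treat_gain, ctrl_gain, diff = gain_difference(
    treat_pre, treat_post, ctrl_pre, ctrl_post
)
print("Treatment gain:", treat_gain)
print("Control gain:", ctrl_gain)
print("Difference in gains:", diff)
```

Note that the two groups start at different places (the pre-test means differ). That is exactly why the pre-test matters: without it, the post-test gap alone would overstate or understate the change.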
Now - go forth and design better evaluations!
Monday, February 15, 2010
Evaluation: Understanding Rubin's Model of Causality
Rubin Causality (named for Donald Rubin) states that any relationship demonstrated in an experiment (where the units of analysis are randomly assigned to experimental and control groups) is a valid causal relationship and that any relationship that cannot be demonstrated in an experiment is not causal. However, the main dilemma we face is that if we show that "X causes Y", it is often impossible to show that "non-X does not cause Y", since you cannot see both potential outcomes at the same time for the same case. For example, if you show that taking statin drugs lowers Mary's LDL or "bad" cholesterol levels, you no longer have a Mary with high LDL levels in whom to see what happens if she is not given statin drugs. Thus Rubin's Model of Causality requires that we have randomly assigned control and treatment groups (in other words, that we use an experimental design) to assess causality.
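The "two potential outcomes, only one observed" idea can be sketched with a small simulation. Everything here is an assumption made for illustration: the LDL numbers, the size of the drug effect (TRUE_EFFECT), and the sample size are invented. The point is only that, with random assignment, the difference between group means lands close to the true effect even though no single person is ever seen in both conditions.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_EFFECT = -30  # hypothetical change in LDL (mg/dL) caused by the drug

# Each simulated person has two potential outcomes.
# In real life we would only ever observe one of them.
people = []
for _ in range(10_000):
    ldl_untreated = random.gauss(160, 20)       # outcome without the drug
    ldl_treated = ldl_untreated + TRUE_EFFECT   # outcome with the drug
    people.append((ldl_treated, ldl_untreated))

# Random assignment: a coin flip decides which outcome we get to see.
treated, control = [], []
for ldl_treated, ldl_untreated in people:
    if random.random() < 0.5:
        treated.append(ldl_treated)
    else:
        control.append(ldl_untreated)

estimate = sum(treated) / len(treated) - sum(control) / len(control)
print("True effect:", TRUE_EFFECT)
print("Estimated effect from group means:", round(estimate, 2))
```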
Friday, February 12, 2010
Evaluation: Five Types of Causality
I just came across an interesting table about causality, a critical area of study for evaluators. In this table, Peter Hall (see full citation at end), posits five different types of causality. I have included them here with my own examples as they relate to education.
1) Many causes for the same effect: An increase in x (teacher content knowledge) causes an increase in y (student achievement) in some cases but does not have this same effect in other cases, where y is caused by an entirely different set of causes (an example of such a cause could be increased time spent by student studying). This is one we evaluators see quite often, hence the need for a very thoughtful research design!
2) Cause dependency upon time: An increase in x (years as an educator) is associated with an increase in y (student achievement) at one point in time, but not another. Much research supports the opinion that at least some teachers become less effective as they near retirement, for multiple reasons.
3) Same cause but different outcomes: An increase in x (greater governance by school boards) causes outcome y (increased diversity across all schools in a system) in some cases, but outcome z in other cases (less diversity - more neighborhood schools). This is (unfortunately) happening in Wake County in NC, which is near where I live.
4) Outcomes are the effects of various causes that depend on each other: Outcome y is dependent upon many other variables v, w, and x - whose values are in turn jointly dependent upon each other. I couldn't think of an example for this one. Hall's example is that y (successful wage coordination) depends on the value of many other variables - v (union density), w (social democratic governance), and x (social policy regime) - whose values are in turn jointly dependent on each other.
5) Circular causality: Increases in x (student achievement) increase y (student expectations), but increases in y (student expectations) also increase x (student achievement). In this case such causality is a good thing!
Source: Peter A. Hall. 2003. "Aligning Ontology and Methodology in Comparative Research." In J. Mahoney and D. Rueschemeyer, eds. Comparative Historical Analysis in the Social Sciences. New York, NY: Cambridge University Press. Pp. 373-404.
Thursday, February 11, 2010
Evaluation: Evaluation Reporting - Another thought!
Table: A potential framework for conceptualising evaluative analysis
ANALYTICAL COMPONENTS | |||
Four components | Brief description | Types of questions | |
What | FINDINGS - raw Description of findings |
What worked, for whom, under what | |
How and Why | FINDINGS - analysed Analysis of findings Conclusions about findings | ||
So What do the findings say | CONCLUSIONS about the policy / | Merit or worth (quality or value) of a | |
So What do the findings mean | SIGNIFICANCE and implications of | Policy and programme decision-making and/or | |
Anyway, as shown, its four analytic components are linked to broad evaluation questions and may provide another way to format evaluation reports.
Question 1 is "What?" and is where one would address the findings to date of the evaluation. It's meant to be a place where raw data are presented, or, in other words, results are described.
Question 2 is "How and Why?" and is where data are actually analyzed and compared to generate conclusions about what resulted, for whom results were greatest, etc.
Question 3 is "So what do the findings say about the evaluand (i.e., program, policy, etc.)?" and relates to what most of us think about when we think evaluation - the merit or worth of the evaluand. Many reports stop short of answering this question even though the basis of evaluation is to draw such conclusions.
Question 4 is "So what do the findings mean?" and is where the significance of findings and implications for the evaluand are discussed. Again, based on my report reading, few reports address such a question, especially when results are not positive.
If anyone finds the source for this please let me know so I can give the author his or her due. Again, while it nicely identifies some of the questions driving evaluation analysis, it may also be a nice guide for framing reports!
Friday, January 15, 2010
Surveying: The Fantastic Five Checklist for Improving Survey Reliability by Writing Better Questions
Surveying is both an art and a science, and developing a high quality survey question is not always easy to do. Even factual information is a challenge to measure, as reliability and validity can easily be affected by question wording.
The “Fantastic Five” checklist includes five questions, which I have gathered from various sources, to ask about any survey question you have written. An answer of no to any single question below suggests that the survey question you have written may not be one that respondents can reliably answer and thus may need rewriting.
The five questions are:
1. Can the question be consistently understood?
Example: How many times have you been hospitalized in your life?
What counts as hospitalization? A 23-hour admit? Your birth? A day-op surgery?
Clearly define any events that may be viewed inconsistently.
2. Does the question communicate what constitutes a good answer?
Example: When did you first purchase a car?
A year ago, after college, in 1979, etc.
Indicate the answer you are looking for: In what MONTH did you first purchase a car?
3. Do all respondents have access to the information needed to answer the question?
Example: What was the annual premium for your health insurance last year?
Most persons would need their insurance records or check register to accurately answer this question. If respondents need such materials, make sure they are aware of that upfront.
4. Is the question one which all respondents will be willing to answer?
Example: Have you been tested for HIV in the last year?
Many people will respond no or not respond because they fear an answer of yes suggests they are involved in what they consider deviant activities. Instead, if an alternative, less threatening question can get at the same answer, use it. For example, one could ask: Have you donated blood in the last year?
5. Can the question be consistently communicated to respondents?
Example: What was your annual income for 2008?
This may be better written as, "Including all forms of income (e.g., wages, gratuities, social security, rent paid to you, dividends earned, tips, annuities, and alimony), what was your annual income for 2008?" However, note that this is an easier question to ask on a written survey than on a survey conducted by an interviewer. Questions that may be hard to consistently administer to respondents might be better off asked as a series of questions.
Good resources for improving your survey questions (and where these questions came from) are the following:
DeVellis, R.F. (2003). Scale development: Theory and applications, 2nd edition. Thousand Oaks, CA: Sage
Dillman, D. (1999). Mail and Internet surveys: The tailored design method, 2nd Edition. New York: John Wiley Company.
Fowler, F. J. Jr. (1995). Improving survey questions: Design and evaluation. London: Sage.
Monday, January 11, 2010
Methodology: Rasch Analyses - what they might add
The past two blogs I have written have been about factor analysis - a data reduction technique I have used frequently. For example, I have often used exploratory factor analyses to determine which items to keep as part of survey constructs. However, I was recently shown some Rasch analyses which showed me such characteristics as item fit and construct reliability, as well as how well items discriminate among persons (e.g., are they easy or hard to agree to) and how persons viewed response options.
For example, while I was able to use factor analyses to determine that a specific item loaded onto my construct at a weight of .765, I could have found the same thing ("good infit") using Rasch analyses. However, the use of Rasch analyses also showed that this was one of the easier items to agree to (meaning it had low discrimination among persons, not what I wanted). Rasch analyses also told me something about my response options: the Rasch analyses showed that persons did not follow a trajectory from Highly disagree to Somewhat disagree, to Neither agree or disagree, to Somewhat agree, to Highly agree. Rather, a person would go through this sequence but without using the middle response option (Neither agree or disagree). This suggests that instead of a 5-point scale with a middle "neutral" response option, respondents in actuality responded as if the response options represented a 4-point scale, treating the middle option as "Not applicable".
That's a lot of information from one analysis! I plan to try conducting a Rasch analysis the next time I would otherwise have used factor analysis. One popular program used to conduct such analyses is Winsteps (www.Winsteps.com). It has a primer / how-to book, but I would also suggest reading Bond and Fox's book, "Applying the Rasch Model". It shows some examples using Winsteps as well as the text one could use to reproduce their findings.
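For readers who have not seen the model itself, here is a minimal sketch of the simplest (dichotomous, agree/disagree) Rasch model in Python. This is a simplification: a 5-point survey like mine would call for a polytomous extension such as a rating scale model, and the numbers below are invented. The function name rasch_probability is my own. The sketch only shows how an item's "easiness" enters the model: the probability of agreeing depends on the gap between the person's level (theta) and the item's difficulty (b).

```python
import math

def rasch_probability(theta, b):
    """Probability that a person at level theta endorses an item of difficulty b
    under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical values on the logit scale, invented for illustration.
person = 0.0
easy_item = -2.0   # easy to agree to
hard_item = 2.0    # hard to agree to

p_easy = rasch_probability(person, easy_item)
p_hard = rasch_probability(person, hard_item)
print("P(agree) on the easy item:", round(p_easy, 3))
print("P(agree) on the hard item:", round(p_hard, 3))
```

When the person's level equals the item's difficulty, the probability is exactly one half, which is what places persons and items on the same scale.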
*With special thanks to Andrew Sunland at Learning Point Associates for making me a believer!
Sunday, January 3, 2010
Methodology: Factor Analysis - Best Practices Part 2
Following factor analysis, one usually computes factor scores for use in subsequent analyses. These scores can be computed in multiple ways, each with different features. Below I explore such ways so that the reader can make a better choice for him or herself as to which way to compute them.
Below are 5 "non-refined" or simple methods for computing factor scores
1. Sum or average raw scores (including negative scores, if that's what was produced) corresponding to all items that load on a factor
2. Sum or average raw scores corresponding to all items that load above a certain cutoff point (e.g., .4 or above)
4. Calculate a weighted sum or average of raw scores corresponding to all items that load on a factor or only those that load above a certain cutoff point, using the factor loading as the weight.
5. Calculate a sum, average, or weighted sum or average using the standardized scores - either for all items or those above a certain cutoff point. This involves standardizing raw scores to the same mean and standard deviation. This is most often used when standard deviations vary greatly across the items.
The above five scores are frequently used for simplicity's sake - the choice as to which to use is guided more by the researcher's decisions than hard science. One additional method that is worth mentioning, however, is the use of regression scores. This is considered a "refined" method, meaning a linear combination is used to calculate factor scores rather than simple addition. Scores are calculated by weighting the raw scores by their regression coefficients and then standardizing the scores to a mean of 0. These scores can easily be obtained using SPSS or SAS and are often used because these popular programs can calculate them without much additional work needed from the researcher. Researchers often believe that this method produces scores that have greater validity than those produced by "non-refined" methods.
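The simple (non-refined) methods listed above can be sketched in a few lines of Python. The responses, loadings, and the .4 cutoff below are all invented for illustration, and the helper names (mean_score, weighted_score, standardize) are my own. Here the weighted version divides by the sum of the loadings to give a weighted average; a weighted sum would simply skip that division.

```python
from statistics import mean, pstdev

# Hypothetical data: rows are respondents, columns are three survey items.
responses = [
    [4, 5, 2],
    [2, 3, 1],
    [5, 4, 4],
    [3, 4, 3],
]
loadings = [0.8, 0.6, 0.3]  # hypothetical factor loadings, one per item
CUTOFF = 0.4                # hypothetical loading cutoff

def mean_score(row, items):
    """Average of the scores for the chosen items."""
    return mean(row[i] for i in items)

def weighted_score(row, items):
    """Average of the chosen items, weighted by their factor loadings."""
    total_weight = sum(loadings[i] for i in items)
    return sum(loadings[i] * row[i] for i in items) / total_weight

def standardize(rows):
    """Convert each item (column) to z-scores: mean 0, standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

all_items = list(range(len(loadings)))
kept_items = [i for i, loading in enumerate(loadings) if loading >= CUTOFF]
z_responses = standardize(responses)

for row, z_row in zip(responses, z_responses):
    print(
        round(mean_score(row, all_items), 2),      # all items
        round(mean_score(row, kept_items), 2),     # items above the cutoff
        round(weighted_score(row, kept_items), 2), # loading-weighted
        round(mean_score(z_row, kept_items), 2),   # standardized scores
    )
```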