Volume 44 Issue 4, December 2018, pp. 289-302

Increasing the number of female evaluators could help female candidates if evaluators prefer candidates of their own gender. I study whether there is any evidence of such preferences with a unique data set containing 10,500 scores given by 105 evaluators to 3,500 students in the humanities and social sciences who applied for a doctoral scholarship. On average, I find very weak evidence of same-gender preferences for male evaluators (p = 0.133). To better understand this effect, I also study same-gender preferences across the distribution of candidates, in subcommittees with different gender composition, and for evaluators from different disciplines. I show that male evaluators give higher scores to strong male candidates relative to those given by female evaluators. At the same time, male evaluators give higher scores to male candidates than do female evaluators when there is only one male evaluator in the subcommittee. The representation of men in a discipline does not seem to affect the scores given by evaluators. Overall, there is no clear evidence that replacing a male evaluator with a female one would help female candidates.

L’augmentation du nombre des évaluatrices pourrait favoriser les femmes candidates, s’il est vrai que les évaluateurs, hommes et femmes, privilégient les candidats de leur sexe. L’auteur se demande si certains faits permettraient d’établir l’existence de telles préférences en examinant un ensemble particulier de données regroupant 10 500 notes attribuées par 105 évaluateurs, hommes et femmes, à 3 500 étudiants et étudiantes en sciences humaines et sociales ayant soumis leur candidature à un programme d’études de troisième cycle. En moyenne, l’auteur ne relève que très peu d’éléments attestant que les évaluateurs privilégieraient les candidats de leur sexe (p = 0,133). Pour mieux comprendre cette observation, il étudie également les préférences pour les candidats de même sexe qui se manifestent dans l’ensemble de la distribution de candidats et candidates, au sein des sous-comités présentant différentes compositions hommes-femmes, et chez les évaluateurs et évaluatrices appartenant à différentes disciplines. L’auteur montre que les candidats masculins performants sont mieux notés par les évaluateurs que par les évaluatrices. Parallèlement, les candidats masculins sont mieux notés par les évaluateurs que par les évaluatrices lorsque le sous-comité ne compte qu’un seul évaluateur masculin. La représentation masculine dans une discipline ne semble pas avoir d’incidence sur les notes attribuées par les évaluateurs et évaluatrices. Dans l’ensemble, l’hypothèse selon laquelle le remplacement d’un évaluateur par une évaluatrice favoriserait les femmes candidates n’est corroborée par aucun élément de preuve.

Women are underrepresented in university faculties.1 Explicit or implicit gender discrimination is one of the many factors that could explain this phenomenon.2 Throughout their careers, women could face hostile evaluation committees composed mostly of men with gender bias favouring male candidates, thus making it difficult for women to secure funding or positions and climb the academic ladder (De Paola and Scoppa 2015; Moss-Racusin et al. 2012; Wold and Wenneras 1997). To address this issue, policy-makers have considered introducing gender quotas in evaluation committees to increase the representation of women. Such policies have been introduced in Finland (1995), Spain (2007), and France (2014; Bagues, Sylos-Labini, and Zinovyeva 2017). Replacing male evaluators with female evaluators could promote the careers of female candidates if evaluators prefer candidates of their own gender.

Even though the effectiveness of this policy relies on the existence of such own-gender preferences, empirical evidence of such preferences is scant. Some research has relied on group decisions to infer the preferences of evaluators and has provided mixed findings. Although De Paola and Scoppa (2015) show that female candidates have a lower probability of being promoted when the evaluation committee is composed exclusively of men, Bagues and Esteve-Volart (2010) find that female candidates have a lower probability of being hired when the proportion of female evaluators is higher. More recently, Bagues et al. (2017) find no evidence of same-gender preference using individual data from 100,000 candidates applying to a centralized competition for promotion to associate or full professor in Italy and Spain.3

This article contributes to this literature by considering 10,500 individual scores given by 105 evaluators to 3,500 doctoral candidates applying for a scholarship in the social sciences and humanities.4 In a first step, I study the individual scores given by male and female evaluators to a given candidate using candidate fixed effects. In other words, I measure the average impact of replacing a male evaluator with a female evaluator on the scores of candidates. Overall, I find some weak evidence of same-gender preference. Male evaluators give 0.0876 more points than female evaluators to male candidates (p = 0.133).

In a second step, I consider three policy-relevant settings in which same-gender preferences could occur and influence the impact of such gender quotas. First, the perception of an evaluator may depend on the relative strength of candidates of another gender. For example, a paternalistic male evaluator could be more stringent for relatively strong female candidates but more generous toward relatively weak female candidates. There may therefore not be any impact on the average female candidate, but there could be an impact in the tails of the distribution. Such a phenomenon could influence the outcome of the competition. Indeed, the gender composition of the evaluation committee could change the funding status of marginal candidates—those whose scores are close to the threshold. Because these candidates can be relatively strong or weak depending on the competition, it is important to understand the impact of the committee’s gender composition on candidates throughout the distribution.

I find that male evaluators give 0.479 more points (25.3 percent of a standard deviation) to relatively strong male candidates than do female evaluators. Moreover, there is some weak evidence that female evaluators give 0.245 fewer points (13.0 percent of a standard deviation) to relatively weak female candidates than do male evaluators. In competitions with relatively strong marginal candidates, replacing a male evaluator with a female one could therefore level the playing field. In a competition with relatively weak marginal candidates, however, replacing a male evaluator with a female one could harm female candidates.

Second, the gender composition of the committee could affect how evaluators perceive candidates of their own gender. If male evaluators develop a positive bias for male candidates in the presence of female evaluators (Bagues et al. 2017), gender quotas could defeat the purpose of changing the committee’s gender composition. I find evidence of such an effect: male evaluators give 0.198 more points (10.5 percent of a standard deviation) to male candidates than do female evaluators when they are the only man on the evaluation committee.

Finally, the gender composition of the evaluators’ discipline could influence how the evaluator perceives candidates. Evaluators in gender-balanced disciplines may show less evidence of same-gender bias than those in male-dominated disciplines. Conversely, evaluators in male-dominated disciplines could show a positive bias toward female candidates (Breda and Hillion 2016; Breda and Ly 2015; Williams and Ceci 2015). Exploiting large variations between male-dominated disciplines such as philosophy and gender-balanced ones such as social work, I find no evidence of any difference between evaluators of different disciplines.

This article contributes to the literature by studying same-gender preferences in a novel and relevant evaluation setting. First, contrary to established researchers, doctoral candidates can provide only limited information supporting their application, thus leaving more room for bias, as in Milkman, Akinola, and Chugh (2015) or Breda and Ly (2015). Second, evidence suggests that merit-based scholarships have a long-term causal impact on the careers of their recipients (Bettinger et al. 2016; Chandler 2018). Such competitions not only allocate scholarships, they also foreshadow academic careers. Third, because candidates are randomly assigned to three evaluators who each assign a score, this article is one of the first5 to take advantage of candidate fixed effects to identify differences in scores specifically related to the gender of the evaluator. Finally, almost $750 million is spent every year by the Social Sciences and Humanities Research Council (SSHRC) to promote research in the social sciences and humanities through a variety of programs. It is important to understand whether gender bias influences the allocation of such funds.

The rest of the article is structured in the following way. I first introduce the selection procedure used by the SSHRC to allocate scholarships. Second, I present the data. Third, I discuss the methodology and show the results of the regressions. Finally, I discuss the results and present some policy implications.

The SSHRC is a Canadian federal agency that promotes and supports post-secondary-based research and training in the humanities and social sciences. One of the ways in which SSHRC achieves this goal is by awarding approximately $66 million6 in scholarships yearly to doctoral students.

Table 1: Disciplines by Committee

First committee:  Fine arts, literature (all types)
Second committee: Classical archaeology, classics, classical and dead languages, history, mediaeval studies, philosophy, religious studies
Third committee:  Anthropology, archaeology (except classical archaeology), archival science, communications and media studies, criminology, demography, folklore, geography, library and information science, sociology, urban and regional studies, environmental studies
Fourth committee: Education, linguistics, psychology, social work
Fifth committee:  Economics, industrial relations, law, management, business, administrative studies, political science

Candidates—Canadian citizens or Canadian permanent residents—submit their application in the fall with the following documentation: a project proposal, a curriculum vitae, two reference letters from faculty, and all their post-secondary transcripts.7 Although students enrolled at a Canadian university apply through their home university's preselection committee, those matriculated at foreign universities or not matriculated apply directly to a preliminary competition at SSHRC. The top-ranked candidates from each preliminary competition are forwarded to the national competition. The data used in this study stem from the 2003–2004 and 2004–2005 national competitions.

Applications forwarded to the national competition are sorted into one of the five committees on the basis of the discipline of the project proposal (Table 1). Within each committee, applications are then allocated to one of the three or four subcommittees according to the candidate’s last name. Table 2 shows the subcommittees of Committee 5 for both competition years as an example. Candidates with last names starting with the letters A–F usually went to the first subcommittee; those with last names starting with the letters G–M, to the second subcommittee; and the remaining candidates to the third subcommittee. Exceptions can be explained by the fact that evaluators cannot assess students from their own university. This table was created using only scholarship recipients, because the last names of non-recipients were not disclosed. Overall, there were 35 subcommittees. In 2004, there were 16 subcommittees (5 committees, each with 3 subcommittees, except psychology–education–social work, which had 4), and in 2005, there were 19 subcommittees (5 committees, each with 4 subcommittees, except the economics–management–political science committee, which had 3). On average, a subcommittee assessed 100 applicants and allocated C$3.76 million in scholarships.  
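The alphabetical allocation rule can be sketched in code. This is a hypothetical illustration only: the A–F / G–M / N–Z cutoffs follow the usual pattern described above for Committee 5, and the function ignores the exceptions that arise when an evaluator and a candidate share a university.

```python
# Hypothetical sketch of the alphabetical allocation rule. The cutoffs
# (A-F / G-M / N-Z) are illustrative, matching the usual pattern observed
# for Committee 5, not an official SSHRC rule.
def assign_subcommittee(last_name: str) -> int:
    initial = last_name[0].upper()
    if initial <= "F":
        return 1
    if initial <= "M":
        return 2
    return 3

assert assign_subcommittee("Brown") == 1
assert assign_subcommittee("Klein") == 2
assert assign_subcommittee("Singh") == 3
```

In practice, as Table 2 shows, a nontrivial number of candidates land outside "their" alphabetical subcommittee because of the conflict-of-interest rule.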
Table 2: Distribution of Students by First Letter of Last Name—Committee 5

First letter    Subcommittees 2004    Subcommittees 2005
                1    2    3           1    2    3
A               6    0    0           4    0    2
B              10    1    0          21    0    1
C               9    0    1          12    0    1
D               6    0    0           5    0    3
E               3    0    0           0    0    0
F               4    1    0           2    3    2
G               0    2    1           2   12    1
H               0    4    0           0    4    0
I               0    0    0           0    1    0
J               0    3    0           0    0    0
K               0    3    0           1    5    1
L               0   12    2           1    8    0
M               0   15    4           1    8    4
N               0    0    2           1    0    2
O               0    0    3           0    0    0
P               1    0    2           0    1    9
Q               0    0    0           0    0    0
R               0    0    8           0    0    7
S               1    0    8           0    2    9
T               1    0    2           0    1    2
U               0    0    0           0    0    0
V               0    0    4           0    1    0
W               0    0    4           0    0    4
X               0    0    0           0    0    0
Y               0    0    1           0    0    2
Z               0    0    0           0    0    0
Total          41   41   42          50   46   51

Note: This table reflects only information concerning candidates who received and accepted a scholarship; I have information only on the last names of these candidates. Students who are not allocated to the subcommittee corresponding to the first letter of their last name are the exceptions, which arise because evaluators cannot assess students from their own university.

Each subcommittee consists of three associate or full professors in one of the disciplines included in the committee.8 These professors are active researchers who have previously received a research grant from SSHRC and who have been chosen by SSHRC. Each evaluator gives a score between 0 and 10 to each candidate allocated to their subcommittee,9 on the basis of past academic results, the potential contribution of the program of study to the advancement of knowledge, and relevant professional and academic experience. The gender of candidates is not revealed to evaluators; it can, however, be inferred from the pronouns used in the reference letters. Evaluators know the identity of the two other evaluators in their subcommittee when assessing candidates, but the assessment is done individually. Evaluators meet once they have evaluated all candidates to discuss cases in which the scores awarded are very different.
The scores contained in the data set are the final scores submitted by the evaluators after the meeting.10 A candidate's total score is simply the sum of the three individual scores. Students are informed in April or May whether their application was successful. In the 2004 and 2005 competitions, there was no appeal procedure.

The candidates with the highest scores starting first or second year at a Canadian university receive the Canadian Graduate Scholarship. This scholarship provides recipients with three annual payments of $35,000. The second-tier candidates, or the best candidates ineligible for the Canadian Graduate Scholarship,11 receive the SSHRC Doctoral Fellowship, which provides a yearly payment of $20,000 until the fourth year of doctoral studies.12 The last tier receives no scholarship from SSHRC but could still be awarded one from their own university or from other funding agencies; the possibility of receiving other awards depends on the student's specific situation.

In the 2004 and 2005 competitions, 3,500 doctoral candidates were assessed in the national competitions by 105 evaluators grouped in 35 subcommittees. The data set contains 10,500 scores (3 scores per candidate). For all candidates, I have information on score, gender, discipline, year of study, subcommittee, and type of award received. Overall, 62.6 percent of candidates (see Table 3) and 62.8 percent of recipients are women. These candidates were assessed by 49 female and 56 male evaluators. SSHRC attempts to give equal representation to both genders, with 46.7 percent of evaluators being women. This share is clearly below the share of female candidates (62.6 percent) but above the share of female associate professors in the social sciences in 2014 (43.1 percent). Not surprisingly, the most common match was female candidate–male evaluator (3,442) and the least common was male candidate–female evaluator (1,722).
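The scoring arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming the mean funding threshold of 16.15 reported later in the text; actual thresholds vary by subcommittee.

```python
# Minimal sketch of the scoring arithmetic: a candidate's total score is
# the sum of three individual scores, and funding is decided by comparing
# the total to the subcommittee's threshold. The 16.15 default is the mean
# threshold reported in the text; each subcommittee has its own cutoff.
def total_score(individual_scores):
    assert len(individual_scores) == 3  # three evaluators per subcommittee
    return sum(individual_scores)

def is_funded(total, threshold=16.15):
    return total >= threshold

assert total_score([6, 7, 5]) == 18
assert is_funded(18) and not is_funded(14)
```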
Table 4 shows the distribution of candidates into the 27 disciplines defined by SSHRC. Literature and psychology are the most common disciplines in the data set, with each boasting more than 450 candidates. Industrial relations, folklore, and demography are the least prevalent, with fewer than 15 candidates each. Among evaluators, the most represented discipline is literature (13 evaluators), followed by history (9 evaluators). No evaluator represented the disciplines of classics, demography, folklore, industrial relations, and mediaeval studies.

Table 3: Gender Distribution of Candidates and Evaluators

                          Matches with
                          Male Evaluator    Female Evaluator    Total Candidates, n (%)
Male candidate            2,208             1,722               1,310 (37.4)
Female candidate          3,442             3,128               2,190 (62.6)
Total evaluators, n (%)   56 (53.3)         49 (46.7)

Table 4: Distribution of Candidates and Evaluators by Discipline

Discipline                        Candidates, n   Evaluators, n   % of Male Professors in Discipline
Anthropology                      132             3               48.2
Archaeology                       53              2               75.0
Classics                          39              0               68.3
Communications                    100             4               58.0
Criminology                       28              1               54.7
Demography                        13              0
Economics                         86              4               78.9
Education                         278             7               42.6
Fine arts                         185             7               57.2
Folklore                          12              0
Geography                         85              3               69.8
History                           301             9               62.8
Industrial relations              9               0
Interdisciplinary                 89              2               55.3
Law                               79              5               55.2
Library and information science   17              2               37.5
Linguistics                       81              7               45.0
Literature                        467             13              48.9
Management                        95              4               64.1
Mediaeval studies                 15              0
Philosophy                        192             8               71.4
Political science                 233             5               68.9
Psychology                        479             8               54.3
Religious studies                 103             4               65.6
Social work                       40              2               35.6
Sociology                         228             4               51.0
Urban and regional studies        61              1               56.6
Total                             3,500           105

Of these 3,500 candidates, 54.3 percent received a scholarship. The total score necessary to receive a scholarship varies across subcommittees, with a minimum of 13.7, a maximum of 18.4, and a mean of 16.15.
These variations can be explained by the relative generosity of evaluators in a subcommittee or by the covariance of the individual scores given by evaluators.

Each evaluator assigns a score between 0 and 10 to each candidate in the subcommittee. Figure 1 shows the distribution of individual scores, with an average of 6.02 and a standard deviation of 1.89. Figure 2 shows the distribution of total scores—the sum of the individual scores of the three evaluators in a subcommittee for a given candidate—which has an average of 18.1 and a standard deviation of 4.6.

Figure 1: Distribution of Individual Scores

Figure 2: Distribution of Total Scores

Finally, Figure 3 shows the distribution of individual scores based on the gender of evaluators and candidates. There is no clear indication of any difference, which is confirmed by Kolmogorov–Smirnov tests: comparing male and female evaluators for male candidates (p = 0.961), comparing male and female evaluators for female candidates (p = 0.368), comparing male and female candidates for male evaluators (p = 0.086), and comparing male and female candidates for female evaluators (p = 0.667).

Figure 3: Distribution of Individual Scores by Gender of Evaluators and Candidates
Note: In the top graphs, the dashed line represents male candidates and the solid line represents female candidates. In the bottom graphs, the dashed line represents male evaluators and the solid line represents female evaluators.

Gender representation varies by subcommittee. Table 5 shows that in 6 of 35 subcommittees only one gender was represented (2 subcommittees composed only of women and 4 composed only of men). The four male-only subcommittees were in the second (1), fourth (1), and fifth (2) committees. The two female-only subcommittees were in the first (1) and fourth (1) committees.13 Overall, 583 candidates were evaluated by single-gender subcommittees.
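The two-sample Kolmogorov–Smirnov comparisons used above can be sketched as follows. This is a minimal illustration on simulated scores (drawn with the mean 6.02 and standard deviation 1.89 reported for Figure 1), not the actual SSHRC data; the statistic is the largest gap between the two empirical distribution functions.

```python
import random

# Two-sample Kolmogorov-Smirnov statistic: the maximum absolute gap
# between the empirical CDFs, evaluated at every observed point.
def ks_statistic(a, b):
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(a + b))

random.seed(0)
# Hypothetical score samples from male and female evaluators, simulated
# with the overall mean (6.02) and standard deviation (1.89) in the text.
scores_male_eval = [random.gauss(6.02, 1.89) for _ in range(300)]
scores_female_eval = [random.gauss(6.02, 1.89) for _ in range(300)]

d = ks_statistic(scores_male_eval, scores_female_eval)
assert 0.0 <= d <= 1.0  # the statistic is a difference of CDF values
```

Because both samples are drawn from the same distribution here, the statistic stays small; in the article the corresponding tests likewise fail to reject equality of the score distributions.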
Table 5: Number of Candidates by Type of Subcommittee

Composition of Subcommittee   Candidates, n   Subcommittees, n
3 female evaluators           189             2
2 female evaluators           1,552           15
1 female evaluator            1,365           14
No female evaluator           394             4
Total                         3,500           35

Same-Gender Preferences in General

I first conduct a regression with candidate fixed effects to determine whether evaluators on average give higher scores to candidates of their own gender:

$Score_{ij} = \beta_1 SameGender_{ij} + \lambda_i + \epsilon_{ij}$   (1)

The variable Same Gender takes the value 1 if evaluator j and candidate i have the same gender and the value 0 otherwise. The candidate fixed effects are captured by λi. Because the specification includes candidate fixed effects, β1 reflects the difference between the average score given by evaluators who share the same gender as the candidate and the average score given by evaluators who do not share the same gender. If the average score given by the former group is greater than the average score given by the latter one, the coefficient will be positive. Standard errors are clustered by subcommittee, because candidates in the same subcommittee face similar risks related to the composition of the subcommittee. Column 1 in Table 6 shows no evidence that evaluators give higher scores to candidates of their own gender in comparison with evaluators who are not the same gender as the candidate.

I then distinguish between male and female evaluators to determine whether the two genders treat candidates of their own gender differently:

$Score_{ij} = \beta_1 (MaleCandidate_i \times MaleEvaluator_j) + \beta_2 (FemaleCandidate_i \times FemaleEvaluator_j) + \lambda_i + \epsilon_{ij}$   (2)

The variable Male Candidate × Male Evaluator takes the value 1 if both evaluator and candidate are men and the value 0 otherwise. The variable Female Candidate × Female Evaluator takes the value 1 if both evaluator and candidate are women and the value 0 otherwise. Again, λi captures the candidate fixed effects.
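The fixed-effects logic of equation (1) can be made concrete with a short simulation. Everything here is hypothetical except the gender shares taken from the text (62.6 percent female candidates, 46.7 percent female evaluators): the 0.1-point same-gender bonus, the candidate count, and the noise level are illustrative. The point of the sketch is that within-candidate demeaning removes candidate quality and recovers the same-gender coefficient.

```python
import random
from collections import defaultdict

random.seed(1)
TRUE_BETA = 0.1  # hypothetical same-gender bonus, in score points

# Simulate 2,000 candidates, each scored by 3 evaluators. Candidate
# "quality" is the fixed effect that the within transformation removes.
rows = []  # (candidate_id, same_gender_dummy, score)
for cid in range(2000):
    quality = random.gauss(6.0, 1.5)
    cand_female = random.random() < 0.626     # share of female candidates in the text
    for _ in range(3):
        eval_female = random.random() < 0.467  # share of female evaluators in the text
        same = 1.0 if eval_female == cand_female else 0.0
        rows.append((cid, same, quality + TRUE_BETA * same + random.gauss(0, 1)))

# Fixed-effects OLS equals OLS on within-candidate demeaned variables.
by_cand = defaultdict(list)
for cid, x, y in rows:
    by_cand[cid].append((x, y))

num = den = 0.0
for obs in by_cand.values():
    xbar = sum(x for x, _ in obs) / len(obs)
    ybar = sum(y for _, y in obs) / len(obs)
    for x, y in obs:
        num += (x - xbar) * (y - ybar)
        den += (x - xbar) ** 2
beta_hat = num / den
assert abs(beta_hat - TRUE_BETA) < 0.2  # recovers the bonus despite quality differences
```

Candidates scored by three evaluators of the same gender contribute nothing to the estimate (their demeaned dummy is zero), which is why the identification comes entirely from mixed-gender subcommittees.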
Because the specification includes candidate fixed effects, the coefficient β1 captures the difference between the average score given by male evaluators and by female evaluators to male candidates, and β2 captures the difference between the average score given by female evaluators and by male evaluators to female candidates. If β1 is positive, male evaluators on average give higher scores to male candidates relative to those given by female evaluators. If β2 is positive, female evaluators on average give higher scores to female candidates relative to those given by male evaluators. The standard errors are clustered at the subcommittee level to capture correlated risks.

Table 6: Same-Gender Preferences with Candidate Fixed Effects

                                      (1)               (2)               (3)                (4)
Same Gender                           0.0206 (0.467)
Male Candidate × Male Evaluator                         0.0876 (0.133)
Female Candidate × Female Evaluator                     –0.0182 (0.542)
Male × Male × Total Score ≤14                                             –0.0130 (0.929)
Male × Male × Total Score 14–24                                           0.0607 (0.284)
Male × Male × Total Score >24                                             0.479*** (0.001)
Female × Female × Total Score ≤14                                         –0.245** (0.013)
Female × Female × Total Score 14–24                                       0.0504 (0.205)
Female × Female × Total Score >24                                         –0.107 (0.225)
Male × Male × 1 Male                                                                         0.198** (0.034)
Male × Male × 2 Males                                                                        0.00820 (0.888)
Female × Female × 1 Female                                                                   –0.000406 (0.993)
Female × Female × 2 Females                                                                  –0.0382 (0.347)
Constant                              6.013*** (0.000)  6.010*** (0.000)  6.011*** (0.000)   6.021*** (0.000)
N                                     10,500            10,500            10,500             10,500
R²                                    0.0000646         0.000458          0.00277            0.00108

Note: p values are in parentheses. The dependent variable is the score given by evaluator j to candidate i. All independent variables are dummies. The variable Same Gender takes the value 1 if both evaluator and candidate are the same gender and the value 0 otherwise. The variable Male Candidate × Male Evaluator takes the value 1 if a male candidate is evaluated by a male evaluator and the value 0 otherwise. The variable Male × Male × Total Score 14–24 takes the value 1 if both evaluator and candidate are male and if the candidate's total score is >14 and ≤24, and the value 0 otherwise. The variable Male × Male × 1 Male takes the value 1 if both candidate and evaluator are male and if there is only one man on the subcommittee, and the value 0 otherwise. All models include candidate fixed effects (3,500). The standard errors are clustered by subcommittee (35). The average total score is 18.1, and the distribution of total scores is shown in Figure 2.
*p < 0.10; **p < 0.05; ***p < 0.01.

Column 2 in Table 6 shows no statistically significant evidence that male or female evaluators treat candidates of their own gender differently. If the standard errors are not clustered but simply robust to heteroskedasticity, the p value of β1 decreases to 0.096, making it barely significant at the .10 threshold. These results suggest that same-gender effects should be considered separately for male and female candidates.

Impact of Relative Strength of Candidate

Because the impact of gender representation matters most for marginal candidates—those whose score is close to the threshold—and because they can be relatively weak or strong depending on the competition, I study same-gender preferences across the distribution. Figures 4 and 5 provide some graphical evidence of different perceptions in the tails of the distribution using demeaned scores.

Figure 4: Scores of Male and Female Evaluators for Male Candidates
Note: All the individual scores are demeaned using the average score received by the candidate. For example, if a candidate has the scores 3, 4, and 8, the average score is 5, which means that the demeaned scores are –2, –1, and 3. There would therefore be three dots at a total score of 15.
Figure 4 shows that the individual scores of male and female evaluators diverge for male candidates with total scores higher than 20: the scores of male evaluators are slightly above those of female evaluators. In Figure 5, the scores of male and female evaluators for female candidates diverge below a total score of 15, with female evaluators giving lower scores to female candidates in that range.

Figure 5: Scores of Male and Female Evaluators for Female Candidates
Note: All the individual scores are demeaned using the average score received by the candidate. For example, if a candidate has the scores 3, 4, and 8, the average score is 5, which means that the demeaned scores are –2, –1, and 3. There would therefore be three dots at a total score of 15.

To provide statistical evidence for these figures, I interact the same-gender dummy with dummies identifying three segments of the distribution of total scores: relatively weak (total score ≤14), average (total score between 14 and 24), and relatively strong (total score >24) candidates. The categories are created to approximately capture the different components of the distribution. There are 669 candidates (19.1 percent) in the first category (≤14), 2,492 (71.2 percent) in the second (14–24), and 339 (9.7 percent) in the third (>24). The "more than 24" category was created to be strictly above the highest funding threshold.

Generally, a dummy variable based on the value of the dependent variable would lead to biased estimates as a result of the correlation between the dummy variable and the error term.14 This issue, however, does not arise in a fixed-effects model explaining individual scores using the total score as an indication of relative strength. A general fixed-effects regression can be expressed as a demeaned regression:

$y_{ij} - \bar{y}_i = \beta (x_{ij} - \bar{x}_i) + (u_i - \bar{u}_i) + \epsilon_{ij}$   (3)

where i represents the candidate and j represents the evaluator. An individual demeaned score has no relationship with the total score.15 For example, a candidate who received a score of 7 from one evaluator and a total score of 18 (average score of 6) will have a demeaned score of 1. Similarly, a candidate who received a score of 5 and a total score of 12 (average score of 4) will also have a demeaned score of 1 from this evaluator. A demeaned score therefore does not provide any information on the total score of a candidate.

Similarly, ε_ij is not related to the total score, because the sum of demeaned scores does not equal the total score, as one can see in Figure 3. The sum of the three demeaned scores is simply zero, because $\sum_{j \in J} (y_{ij} - \bar{y}_i) = \sum_{j \in J} y_{ij} - n\bar{y}_i = n\bar{y}_i - n\bar{y}_i = 0$, where J is the set of three evaluators assessing candidate i. In particular, $\epsilon_{i1} + \epsilon_{i2} + \epsilon_{i3} = 0$ for all candidates, because $\sum_{j \in J} (y_{ij} - \bar{y}_i) = 0$ and $\sum_{j \in J} (x_{ij} - \bar{x}_i) = 0$. A positive ε_ij simply means that there must be a negative ε_ij for another evaluator–candidate match for this candidate. A high or low ε_ij therefore has no influence on the total score or on the categorization of the observation. Again, the individual error term of the regression is unrelated to the total score.

I can therefore perform the following regression with dummy variables interacting the candidate–evaluator gender match with the relative strength of candidates and obtain consistent coefficients:

$Score_{ij} = \beta_1 MM_{ij} T^{\le 14}_i + \beta_2 MM_{ij} T^{14\text{–}24}_i + \beta_3 MM_{ij} T^{>24}_i + \beta_4 FF_{ij} T^{\le 14}_i + \beta_5 FF_{ij} T^{14\text{–}24}_i + \beta_6 FF_{ij} T^{>24}_i + \lambda_i + \epsilon_{ij}$   (4)

where $MM_{ij}$ ($FF_{ij}$) takes the value 1 if both candidate i and evaluator j are male (female), and $T^{s}_i$ indicates the segment of the total-score distribution in which candidate i falls. All independent variables are binary. For example, the variable Male × Male × Total Score 14–24 takes the value 1 only if both evaluator and candidate are men and if the candidate has a total score greater than 14 and less than or equal to 24. All intervals exclude the minimum of the interval but include the maximum. The candidate fixed effects are captured by λi. As in the previous specifications, the coefficients should be understood as differences between male and female average scores for different categories, given the inclusion of candidate fixed effects.
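The worked example above can be verified directly in code: a score of 7 with a total of 18 and a score of 5 with a total of 12 produce identical demeaned scores, and the demeaned scores of any candidate sum to zero, so the demeaned regression carries no information about the total score.

```python
# Demeaned scores carry no information about the total score: two
# candidates with different totals can have identical demeaned scores.
def demean(scores):
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]

strong = [7, 6, 5]  # total score 18, average 6
weak = [5, 4, 3]    # total score 12, average 4
assert demean(strong) == demean(weak) == [1.0, 0.0, -1.0]
assert sum(demean(strong)) == 0  # demeaned scores always sum to zero
```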
For example, β1 measures the difference between the average scores given to male candidates with a total score less than or equal to 14 by male and by female evaluators. If male evaluators give higher scores to this group than do female evaluators, the coefficient will be positive. As usual, the standard errors are clustered at the subcommittee level.

Column 3 of Table 6 provides econometric evidence that the gender of evaluators influences their evaluation in the tails of the distribution. First, male evaluators give 0.479 more points (25.3 percent of a standard deviation) than do female evaluators to male candidates who have total scores higher than 24. The coefficient β2 is statistically different from β3 at p = 0.0051. This result is unlikely to be explained by a strategic form of gender discrimination. Evaluators may give higher scores to candidates of their own gender who are close to the funding threshold to push them over it. However, candidates with scores higher than 24 are well above any funding threshold (the highest threshold is 18.4).16 There is therefore little incentive for evaluators to give higher scores to these candidates for strategic reasons. Second, there is some evidence that female evaluators give 0.245 fewer points (13.0 percent of a standard deviation) than do male evaluators to female candidates with total scores less than or equal to 14. The coefficient β4 is statistically different from β5 at p = 0.0133.
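The segmentation of candidates into tails, which the robustness exercise in Table 7 varies cutoff by cutoff, can be sketched as follows. The cutoff pairs are those used in the text; the classification rule follows the stated convention that intervals exclude their minimum and include their maximum.

```python
# Classify a candidate's total score as lower tail / middle / upper tail.
# Intervals exclude the minimum and include the maximum, as in the text.
def tail_category(total, lower, upper):
    if total <= lower:
        return "lower"
    if total > upper:
        return "upper"
    return "middle"

# Cutoff pairs used in the robustness check (baseline is LT 14, UT 24).
cutoff_pairs = [(17, 21), (16, 22), (15, 23), (14, 24), (13, 25), (12, 26)]
assert tail_category(14, 14, 24) == "lower"   # 14 is included in the lower tail
assert tail_category(25, 14, 24) == "upper"
assert all(tail_category(18, lo, hi) == "middle" for lo, hi in cutoff_pairs)
```

A total score of 18, close to the mean of 18.1, stays in the middle category under every definition, which is why the robustness check mainly moves candidates near the cutoffs between categories.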
Table 7: Robustness Check: Definition of the Tails

                                (1)              (2)              (3)              (4)              (5)              (6)
Key Variables                   LT 17; UT 21     LT 16; UT 22     LT 15; UT 23     LT 14; UT 24     LT 13; UT 25     LT 12; UT 26
Male × Male × Lower Tail        0.0373 (0.711)   −0.00988 (0.932) −0.0178 (0.887)  −0.0130 (0.929)  0.0107 (0.940)   0.0756 (0.709)
Male × Male × Middle            0.102 (0.370)    0.108 (0.182)    0.0789 (0.190)   0.0607 (0.284)   0.0702 (0.231)   0.0784 (0.155)
Male × Male × Upper Tail        0.146** (0.049)  0.198** (0.011)  0.296*** (0.006) 0.479*** (0.001) 0.474*** (0.008) 0.332* (0.058)
Female × Female × Lower Tail    −0.107* (0.065)  −0.0934 (0.108)  −0.129** (0.045) −0.245** (0.013) −0.147 (0.165)   −0.0759 (0.544)
Female × Female × Middle        0.0196 (0.762)   0.00413 (0.926)  0.0196 (0.674)   0.0504 (0.205)   0.00507 (0.896)  −0.0116 (0.737)
Female × Female × Upper Tail    0.0761 (0.302)   0.0540 (0.532)   0.0136 (0.875)   −0.107 (0.225)   −0.0307 (0.783)  −0.0239 (0.882)
Constant                        6.011*** (0.000) 6.011*** (0.000) 6.011*** (0.000) 6.011*** (0.000) 6.011*** (0.000) 6.010*** (0.000)
N                               10,500           10,500           10,500           10,500           10,500           10,500

Notes: Clustered p values are in parentheses. The dependent variable is the score given by evaluator j to candidate i. All independent variables are dummies. The variable Male × Male × Lower Tail takes the value 1 if both evaluator and candidate are male and if the candidate has a total score in the lower tail of the distribution. The lower tail (LT) and upper tail (UT) of the distribution are defined at the top of each column: LT 16; UT 22, for example, means that total scores ≤16 are considered lower tail and total scores >22 are considered upper tail. Scores between 16 and 22 are considered middle. All models include candidate fixed effects (3,500). The standard errors are clustered by subcommittee (35). The average total score is 18.1, and the distribution of total scores is shown in Figure 2.
*p < 0.10; **p < 0.05; ***p < 0.01.
To demonstrate the robustness of these findings, I define the upper tail of the distribution using scores ranging from 21 to 26 in increments of 1 and vary the lower tail of the distribution using scores ranging from 17 down to 12 in increments of 1. Table 7 shows that male evaluators give higher scores than do female evaluators to male candidates in the upper tail for each of these thresholds, suggesting a robust effect. The effect for female evaluators is statistically significant at the 5 percent level only at thresholds of 14 and 15 and is therefore less robust to the definition of the lower tail.

## Impact of the Gender Composition of Subcommittees

The gender composition of the selection committee could affect how evaluators treat candidates of their own gender. To study this phenomenon, I estimate the following regression:

(5)

All independent variables are binary. The candidate fixed effects are captured by λi. Because the specification includes candidate fixed effects, the coefficients measure the difference between two averages. For example, β1 captures the difference between the average scores given by male and female evaluators to male candidates in subcommittees with only one male evaluator. If β1 is positive, male evaluators on average give higher scores than female evaluators to male candidates in subcommittees with only one male evaluator. The standard errors are clustered at the subcommittee level to account for correlated errors within subcommittees.

Figure 6: Scores Given to Male Candidates by Share of Male Professors in the Evaluator’s Discipline

Column 4 of Table 6 shows that male evaluators give 0.198 (10.5 percent of a standard deviation) higher scores to male candidates than do female evaluators when they are the only man on the subcommittee. If one compares β1 with β2, this difference is significant at p = 0.0611, suggesting that male evaluators behave differently when they are in the minority.
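The key regressor in this specification can be sketched as a dummy built from each subcommittee's roster (a toy helper under my own naming; in the article's setting each of the 35 subcommittees has three evaluators):

```python
def lone_male_evaluator(evaluator_genders):
    """Return 1 if the subcommittee contains exactly one male evaluator,
    0 otherwise. evaluator_genders: e.g., ["M", "F", "F"]."""
    return int(sum(g == "M" for g in evaluator_genders) == 1)

# The triple interaction Male x Male x (lone male evaluator) equals 1 only
# when a male evaluator scores a male candidate while sitting on an
# otherwise all-female subcommittee:
assert lone_male_evaluator(["M", "F", "F"]) == 1
assert lone_male_evaluator(["M", "M", "F"]) == 0
assert lone_male_evaluator(["F", "F", "F"]) == 0
```

Analogous dummies for the other gender-composition cells give the remaining coefficients compared in Table 6.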
## Impact of the Gender Composition of Disciplines

Finally, the gender composition of the discipline could also influence the evaluators. For example, professors working in male-dominated disciplines may be more prejudiced against female candidates. Table 4 reports the share of male professors in all disciplines represented by SSHRC. With 78.9 percent male professors, economics is the most male-dominated discipline in the social sciences, and social work (35.6 percent male professors) is the most female-dominated. Figures 6 and 7 provide little evidence that male and female evaluators behave differently in disciplines with low or high male representation. Figure 6 shows the demeaned scores given to male candidates. Even though the scores of male evaluators are slightly above those of female evaluators when the share of male professors is between 50 and 70 percent, this difference is very small, and an econometric analysis shows that it is not statistically significant. Figure 7 shows that male and female evaluators evaluate female candidates in the same way regardless of the share of male professors in the evaluator’s discipline. Overall, there is no evidence that the gender of evaluators in male-dominated disciplines affects the scores given to male or female candidates, or that gender quotas are particularly needed in gender-imbalanced disciplines.

## Systematic Gender Discrimination

Because I have very little information on each candidate, I am unable to credibly use random effects to determine whether there is any systematic gender discrimination in the allocation of graduate scholarships. It is unclear what kind of information would be available about the candidates, and it would be difficult to create variables to capture the information contained in project proposals or reference letters. Moreover, there is probably little variation in the grade-point averages of students who all have excellent marks. Studying systematic discrimination in this setting is nearly impossible.
Because 62.6 percent of candidates and 62.8 percent of scholarship recipients are women, it is very unlikely that any evidence of systematic gender discrimination would be found in this setting even if some relevant information were available.

Figure 7: Scores Given to Female Candidates by Share of Male Professors in the Evaluator’s Discipline

## Conclusion

This article looks at whether evaluators prefer candidates of their own gender. Using 10,500 scores given by 105 evaluators to 3,500 doctoral candidates applying for a scholarship, I find no evidence that evaluators in general give higher scores to candidates of their own gender. This result casts some doubt on the usefulness of gender quotas on evaluation committees as a tool to promote the careers of female academics. However, one has to keep in mind that the scores considered in this study are the result of a group discussion, which may reduce the visible bias of evaluators. Moreover, there may be other reasons to increase the representation of female evaluators: male and female evaluators could provide different perspectives. I do, however, find evidence that evaluators treat candidates of their own gender differently in the tails of the skill distribution. First, I find that male evaluators give higher scores than female evaluators to relatively strong male candidates. Male evaluators may identify with relatively strong male candidates and want to encourage them. Because mentors can view protégés as younger versions of themselves (Humberd and Rouse 2015), evaluators may see themselves in the candidates. This type of identification may be prevalent only among male evaluators because male candidates are in a minority position in social sciences and humanities competitions, and a smaller group may be more cohesive (Iannaccone 1992). Second, there is some weak evidence that female evaluators grade relatively weak female candidates more harshly than do male evaluators.
This result is similar to that of Broder (1993), who finds that female evaluators for the National Science Foundation are harder on female candidates in general, and it is consistent with a variation of the Queen Bee syndrome (Staines, Tavris, and Jayaratne 1974). If women in positions of authority are harder on female subordinates, it would not be surprising that relatively weak female candidates have a more difficult time living up to the high standards of academically distinguished female evaluators. This result can also be explained as a form of paternalism on the part of male evaluators: they may feel sympathy toward relatively weak female candidates and therefore give them higher scores than do female evaluators. This interpretation is consistent with male judges being more lenient than female judges when sentencing female defendants (Nagel and Hagan 1983). Finally, I present evidence that male evaluators tend to give higher scores to male candidates than do female evaluators when they are the only man on the subcommittee. In a minority position, a male evaluator may feel the need to protect a male candidate from possible discrimination by female evaluators, or he may want to express support for his male identity (Akerlof and Kranton 2000). The first behaviour could be prevented if male evaluators were informed that female evaluators do not seem to discriminate against male candidates. Unfortunately, I cannot distinguish between these two explanations. Overall, these results provide a more nuanced perspective on the impact of increasing the representation of female evaluators on committees. There is some evidence that doing so can benefit female candidates in certain circumstances: when marginal candidates (those close to the funding threshold) are relatively strong.
At the same time, there is some weak evidence that replacing a male evaluator with a female evaluator could harm female candidates if marginal candidates are relatively weak or if the remaining male evaluator is the sole representative of his gender. An organization considering increasing the representation of female evaluators should therefore first consider the relative strength of marginal candidates and assess the reaction of male evaluators to ensure that this policy promotes the careers of female candidates.

## Acknowledgements

The author acknowledges funding from the Social Sciences and Humanities Research Council (SSHRC) through an Insight Development Grant (430-2017-00540). He is indebted to SSHRC for sharing information and particularly to Margaret Blakeney, Ariadne Legendre, Andreas Reichert, Matthew Lucas, and Jack Mintz for their help. Finally, he thanks Steve Lehrer for useful comments. The usual caveat applies.

## Notes

1 In Canada, for example, women represent 51 percent of PhD graduates (Statistics Canada 2014), 46 percent of assistant professors, 38 percent of associate professors, and 23 percent of full professors (CAUT/ACPPU 2014, Table 2.12). The National Research Council (2009) and the European Commission (2016) provide similar evidence for the United States and Europe, respectively.

2 Other explanations include the lack of a research network (Blau et al. 2010), issues relating to fertility decisions (Ceci and Williams 2011), and a female distaste for competition (Bosquet, Combes, and Garcia-Penalosa 2014; Niederle and Vesterlund 2007) and bargaining (Small et al. 2007). Ceci et al. (2014) provide a thorough survey of this literature.

3 These results are similar to those of Abrevaya and Hamermesh (2012), who show no significant difference between male and female reviewers when assessing the work of male and female academics.
4 This type of centralized selection setting is widely used, for example, to grant National Institutes of Health awards (Li 2017), to promote professors in Spain and Italy (Bagues et al. 2017), and to select candidates for the École Normale Supérieure (Breda and Ly 2015).

5 To my knowledge, only Bagues et al. (2017) use candidate fixed effects.

6 SSHRC awarded C$65,928,665 for the 2003–2004 competition and C$65,775,000 for the 2004–2005 competition, or approximately US$50,000,000.

7 It is important to note that in many graduate programs, students must apply for external funding to be eligible for internal funding. There is therefore probably little self-selection.

8 See Table 1 for a list of disciplines per committee.

9 An evaluator is only allowed to give one 0 and one 10 per competition.

10 It is unclear to what extent the preliminary and final scores are different. Unfortunately, the preliminary scores are not registered, which makes it impossible to study this difference. With the large number of candidates per subcommittee (100), one would expect files to be discussed for less than 3 minutes on average if the meeting takes less than 4 hours. Moreover, there is probably very little discussion for applications with scores greatly above or below the threshold, which is the majority of applications.

12 Students awarded the scholarship in their first year of doctoral studies will receive $80,000. Similarly, students awarded a scholarship in their fourth year will receive only one payment of $20,000.