Constructing an adolescence friendship network within the ALSPAC birth cohort using probabilistic record linkage techniques. Research Team: Simon Burgess (CMPO, Bristol), Eleanor Sanderson (CMPO, Bristol), Marcela Umaña (CMPO, Bristol), Andy Boyd (ALSPAC, Bristol).
Study Rationale • Social networks are ubiquitous and powerful • “The people with whom we interact… influence our beliefs, decisions and behaviours” (Jackson 2010) • The manner in which networks carry this influence depends in detail on the structure and characteristics of the network.
Background • Examples of network research • Add Health: longitudinal survey of school children in the US. The questionnaire included a list of pupils in the school, and respondents were asked to nominate their five best male and five best female friends • Others are based around communication networks or other defined communities
Background • Advantages of studying social networks in a cohort study: • Extensive phenotype and genotype data and extensive linkage data
Background • Advantages of studying social networks in ALSPAC: • Regional catchment area, narrow age range of participants (18 month age range, 3 school years) • Disadvantages: • Only the study participant is asked to nominate their friends
Data Collection Methodology • School based (register based) method not considered feasible • Cost • School Engagement • Questionnaire based alternative • Sent to participants still in compulsory education (age ~15-16) • Where the participant still lived in England
Data Collection Methodology • Asked the participant to nominate their 5 best friends, in no particular order
Linkage Objectives • To identify all unique individuals from the pool of nominated friends (de-duplication) • To identify which of the nominated friends are also eligible to participate in ALSPAC
Before we get to linkage…… there’s ethics • Seeking personal identifiers of participants’ friends was seen as contentious • Lawyers advised us that this is legal and within the bounds of the Data Protection Act (1998) • Personal identifiers to be used for statistical purposes only and pseudonymised prior to research use
Before we get to linkage…… there’s ethics • Once the nominated friends have been coded the personal identifiers cannot be used again. • No longitudinal follow up possible on the full data set, but it is possible on those linked to ALSPAC.
The Data • 3,132 participants returned a questionnaire • 14,500 nominated friends • Personal Identifiers include: • Name, Date of Birth, School, School year, gender • Phenotypic data includes: • How they met, duration of friendship, shared interests
Data Quality • Completeness of highly distinguishing personal identifiers • 14,414 nominated friends >=2 identifiers • 12,612 nominated friends >=3 identifiers • 6,215 nominated friends included all four identifiers
Data Quality • All data reported by a participant (age ~16) about their friends • Some of this will be unknown or prone to greater error, particularly date of birth and non-local schools • Names include many spelling errors • Names and school details include many abbreviations and familiar names
Standardisation • School names coded to the National Pupil Database ‘Unique Reference Number’ (using http://www.edubase.gov.uk) • Names converted to upper case • All spaces and symbols contained within a name removed: • O’Driscoll to ODRISCOLL • St.Claire to STCLAIRE
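The name-cleaning rule above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual code: the function name `standardise_name` is hypothetical, and it assumes that "symbols" means anything that is not a letter or a digit.

```python
def standardise_name(raw: str) -> str:
    """Upper-case a name and drop spaces and symbols.

    Assumption: letters and digits are kept, everything else
    (spaces, apostrophes, full stops, hyphens) is removed.
    """
    return "".join(ch for ch in raw.upper() if ch.isalnum())


# The two examples from the slide.
print(standardise_name("O’Driscoll"))  # ODRISCOLL
print(standardise_name("St.Claire"))   # STCLAIRE
```

Filtering on `str.isalnum()` rather than on an ASCII range keeps accented letters (for example Ñ) intact, which seems preferable for a names field.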
Standardisation • Names matched to a name Lexicon, compiled from: • NHS name lexicon • National Pupil Database • ALSPAC ‘known as’ names • Non-matching names evaluated using Jaro string comparator metrics (assesses spelling differences, typos, keying errors, string lengths) • See Herzog, Scheuren and Winkler 2007 • “A Dictionary of First Names” Oxford University Press 2006
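For readers unfamiliar with the Jaro comparator, the sketch below gives a plain-Python version of the standard Jaro similarity (a score between 0 and 1). It is an illustration of the general metric only; the study may have used a different implementation or a variant such as Jaro-Winkler, and the function name `jaro` is an assumption.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # number of transpositions

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


print(round(jaro("MARTHA", "MARHTA"), 4))  # 0.9444
```

A non-matching name would then be compared against lexicon entries and accepted, rejected or sent for review depending on the score; the cut-off used is not stated in the slides.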
Standardisation • Name Lexicon examples: • Andrew, Andy, Andi, Drew all categorised to the same male group • Abigail, Abbie, Abi, Ab1 all categorised to the same group • When are two linked names not the same? • E.g. Should Abraham and Ibrahim be categorised together? • Names can be included in multiple groups (impacts on linkage evaluation)
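One way to represent a lexicon in which a name can sit in several groups is a mapping from each name to the set of groups it belongs to. The sketch below is an assumption about structure, not the study's implementation: the first two groups come from the slide, while the ALEXANDER and ALEXANDRA groups (sharing the variant ALEX) are hypothetical and added only to show multi-group membership.

```python
from collections import defaultdict

# Group label -> name variants. The last two groups are hypothetical.
LEXICON_GROUPS = {
    "ANDREW": {"ANDREW", "ANDY", "ANDI", "DREW"},
    "ABIGAIL": {"ABIGAIL", "ABBIE", "ABI", "AB1"},
    "ALEXANDER": {"ALEXANDER", "ALEX"},
    "ALEXANDRA": {"ALEXANDRA", "ALEX"},
}


def build_index(groups):
    """Invert the lexicon: name variant -> set of group labels."""
    index = defaultdict(set)
    for label, variants in groups.items():
        for variant in variants:
            index[variant].add(label)
    return dict(index)


def names_agree(a, b, index):
    """True if the names are identical or share at least one group."""
    return a == b or bool(index.get(a, set()) & index.get(b, set()))


index = build_index(LEXICON_GROUPS)
print(names_agree("ANDY", "DREW", index))            # True
print(names_agree("ALEXANDER", "ALEXANDRA", index))  # False
```

Because ALEX agrees with both ALEXANDER and ALEXANDRA while those two do not agree with each other, agreement is not transitive, which is one reason multi-group names complicate linkage evaluation.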
Standardisation • Impact of Lexicon, unique values condensed into categories: • Forenames 2,108 into 1,339 • Surnames 5,743 into 4,895
Linkage Methodology • Used the approach developed by Fellegi & Sunter (1969) • “Aim to simulate human reasoning by comparing each of several elements from the two records… from fundamental concepts of probability” (Clark 2004)
Estimating Match Weights • For a given field with match probability M and unmatch probability U • For an agreement: log(M/U) • For a disagreement: log((1-M)/(1-U)) • Sum the weights across all field comparisons to give the overall match weight for a pair of records
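The weight calculation can be sketched as follows. This is a hedged illustration: the slides do not state the logarithm base, so base 2 is assumed here (a common convention in Fellegi-Sunter linkage), and the M and U values in `FIELD_PROBS` are invented for the example rather than taken from the study.

```python
import math

# Hypothetical (M, U) probabilities per field, for illustration only.
FIELD_PROBS = {
    "surname": (0.90, 0.01),
    "forename": (0.85, 0.02),
    "gender": (0.98, 0.50),
}


def field_weight(m: float, u: float, agrees: bool) -> float:
    """Agreement weight log2(M/U) or disagreement weight log2((1-M)/(1-U))."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_weight(comparisons: dict, probs: dict) -> float:
    """Sum the field weights over every compared field.

    `comparisons` maps a field name to True (agree) or False (disagree).
    """
    return sum(
        field_weight(*probs[field], agrees)
        for field, agrees in comparisons.items()
    )


pair = {"surname": True, "forename": True, "gender": False}
print(match_weight(pair, FIELD_PROBS))
```

Agreement on a highly discriminating field such as surname contributes a large positive weight, while agreement on gender contributes little; disagreement on a reliable field such as gender contributes a large negative weight.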
Weightings • M-Probability: Probability that the identifier agrees given a true match • Based on assessment of the quality of the data (i.e. data entry errors, missing data but accounting for improvements due to cleaning and standardisation)
Weightings • U-Probability: Probability that the identifier agrees given that the records do not constitute a true match • Based on the ‘Gold Standard’ of the existing ALSPAC – National Pupil Database linkage • Supported by the data: 95% of nominated friends were described as being in education in the ALSPAC time period
Stratification or ‘blocking’ • Large number (14,500 x 14,500) of possibilities to evaluate • So we ‘blocked’ on identifiers with low discriminatory potential (gender, school year) and high potential (name, school) • Multiple iterations so as not to exclude cases which contained errors in the blocking identifiers
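The multi-pass blocking idea can be sketched as below. This is a minimal illustration under stated assumptions: the record layout, the field names (`gender`, `school_year`, `surname`, `school`) and the function name `candidate_pairs` are hypothetical, and the two passes shown are examples rather than the passes actually used.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records: dict, passes: list) -> set:
    """Return record-id pairs that share a blocking key in any pass.

    `records` maps a record id to a dict of identifiers.
    `passes` is a list of tuples of field names to block on.
    """
    pairs = set()
    for key_fields in passes:
        blocks = defaultdict(list)
        for rec_id, rec in records.items():
            key = tuple(rec.get(field) for field in key_fields)
            if None in key:
                continue  # cannot block on a missing identifier
            blocks[key].append(rec_id)
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


# Records 1 and 2 describe the same friend, but record 2 has the
# wrong school year, so only the second pass pairs them up.
records = {
    1: {"gender": "F", "school_year": 11, "surname": "SMITH", "school": 100},
    2: {"gender": "F", "school_year": 10, "surname": "SMITH", "school": 100},
    3: {"gender": "M", "school_year": 11, "surname": "JONES", "school": 200},
}
passes = [("gender", "school_year"), ("surname", "school")]
print(candidate_pairs(records, passes))  # {(1, 2)}
```

Only the pairs returned here would go on to the full field-by-field weight calculation, which avoids scoring every one of the roughly 14,500 x 14,500 possible comparisons.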
Manual Review • Evaluated a random selection of cases to determine thresholds for accepting a match as: • Definitely ‘true’ (including some false positives) • Definitely ‘false’ (excluding some true positives)
Manual Review • Cases with results between the two thresholds all manually reviewed
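The two-threshold decision rule can be written as a small function. The threshold values below are placeholders chosen for the example; the slides do not report the thresholds that were set from the manual review sample.

```python
def classify(weight: float, lower: float, upper: float) -> str:
    """Assign a candidate pair to one of three outcomes by match weight."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "manual review"


# Hypothetical thresholds, for illustration only.
LOWER, UPPER = 0.0, 10.0
for w in (14.2, 4.7, -3.1):
    print(w, classify(w, LOWER, UPPER))
```

Widening the gap between the two thresholds reduces false positives and false negatives at the cost of a larger manual review workload.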
Results • Data • 3,123 respondents • nominated 4.64 friends on average • 14,503 nominated friends • First Phase of Linkage • 11,327 individuals identified • Linkage to ALSPAC • 6,961 nominated friends linked • 4,572 individuals linked
Results: Network Structure • Total Network • 13,056 individuals in total (1,394 respondents are also nominated as a friend) • 50% of nominations are to someone in ALSPAC • 12% of nominations are to someone who is also a respondent to the friendship questionnaire
Results: Network Structure • Largest component contains 2/3 of the individuals in the network
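The share of individuals in the largest component can be computed from a list of nominations with a breadth-first search. The sketch below assumes that nominations are treated as undirected ties (so it finds weakly connected components); the slides do not say whether direction was taken into account, and the toy edge list is invented.

```python
from collections import defaultdict, deque


def largest_component_share(nominations) -> float:
    """Fraction of individuals that fall in the largest component.

    `nominations` is an iterable of (respondent, friend) id pairs,
    treated here as undirected ties.
    """
    adjacency = defaultdict(set)
    for a, b in nominations:
        adjacency[a].add(b)
        adjacency[b].add(a)
    adjacency = dict(adjacency)
    if not adjacency:
        return 0.0

    seen = set()
    largest = 0
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search over one component.
        queue = deque([start])
        seen.add(start)
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        largest = max(largest, size)
    return largest / len(adjacency)


toy = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "E")]
print(largest_component_share(toy))  # 0.6
```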
Future Research • Structure of the network • Homophily • The tendency to establish relationships among people who share similar characteristics or attributes
Future Research • Risk taking behaviour • Antisocial behaviour • Transition into Higher Education, Employment or unemployment • And many more…
Reflections on Linkage Process Quality of the data determines the quality of the linkage • To reflect this the majority of time/resource was spent on data cleaning, standardisation and extensive manual verification
Reflections on Linkage Process Establishing the weightings • The method is not without problems, as it excludes privately educated pupils, who have different name frequencies • Weightings were established on the national population, but ALSPAC is regionally clustered • Potential to use statistical approaches instead
Reflections on Linkage Process Ultimately • While resource intensive, the methodology did allow the identification of a friendship network within ALSPAC • Little evidence to suggest that this was as ethically contentious from the cohort’s perspective as expected (based only on response rates and small numbers of complaints; further research into this would have been of interest)
Continuing Role of Linkage • Linkage to administrative records adds to the ALSPAC resource, providing new data that can be used in social network analysis
Thank You Questions? Andy Boyd a.w.boyd@bristol.ac.uk
References • Clark DE (2004) Practical introduction to record linkage for injury research. Injury Prevention 10, 186-191. • Fellegi IP & Sunter AB (1969) A theory for record linkage. Journal of the American Statistical Association 64, 1183-1210. • Herzog TN, Scheuren FJ & Winkler WE (2007) Data Quality and Record Linkage Techniques. New York: Springer. • Jackson M (2010) An overview of social networks and economic applications. In Handbook of Social Economics, edited by Benhabib J, Bisin A & Jackson M.