Geocoding self-reported addresses: Lessons learned

Geocoding self-reported addresses: Lessons learned November 29, 2012 Karyn Backus Epidemiologist CT Department of Public Health 860-509-7342 Karyn.backus@ct.gov

SECTION 1: BACKGROUND • The next few slides provide a quick overview of the purpose of geocoding health data, concerns about confidentiality and privacy, and a brief overview of spatial analysis. • These slides are intended to give you an overall perspective on how geocoded data are used for public health evaluation. • Section 2 will tackle the specific concepts of the geocoding process and several lessons learned pertaining to these concepts.

GEOCODING HEALTH DATA • “In the past, disease patterns were mapped and analyzed primarily at the level of political units, such as states, counties, or census tracts.” • “Now, with the advent of geographic information systems (GIS), researchers can now achieve a new level of precision and flexibility in geographic locating.” • “Geocoded residential addresses can be used to examine the spatial patterns of cancer incidence, staging, survival, and mortality. This emerging technology allows the mapping of many different kinds of geographies, including disease rates in relation to pollution sources.” • “Geographical detail also permits clearer identification of areas of need and for improved health service provision.” • “Balancing the availability of such precise information, however, are concerns about maintaining the privacy and confidentiality of individuals as required by law, as well as securing reliable estimates of disease rates.” Citation: Geocoding Health Data (2008) by Rushton et al.

GEOCODING INDIVIDUALS/EVENTS • “Geocodes have become useful in public health research and practice because they provide linkages between disparate data that contribute to health and disease in populations.” • “Public health has come to focus on investigating health at the level of the individual and the individual’s family. Risk factors for disease were identified as individual behaviors that needed to be changed.” • “Recently, it has become practice to add to the individual level risk factors the societal and neighborhood risk factors that also contribute to ill health and to its treatment.” Citation: Geocoding Health Data (2008) by Rushton et al.

GEOCODING CONFIDENTIAL DATA • “We also became interested in methods to protect the privacy of individuals because many geocodes are essentially personal identifiers….” • “The release of geocoded data by its owner or custodian to researchers or other users may violate privacy laws or agreements under which the data were originally collected.” • “Geographic masking is a process by which highly accurate geocodes are modified to an extent sufficient for the data to be released to users.” • Aggregation from points to areal units such as tract, town, zip code • Affine transformation, random perturbation, nearest neighbor, etc. Citation: Geocoding Health Data (2008) by Rushton et al.
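Two of the masking approaches named on this slide, random perturbation and aggregation to areal units, can be sketched in a few lines. This is a minimal illustration only: the function names, the `town` field, and the choice of a uniform-over-disk displacement are assumptions, not a method taken from Rushton et al. or from CT DPH practice.

```python
import math
import random


def perturb_point(x, y, max_radius, rng=random):
    """Move (x, y) a random distance (<= max_radius) in a random direction.

    Coordinates are assumed to be in a projected system (e.g., feet or meters).
    """
    angle = rng.uniform(0.0, 2.0 * math.pi)
    # sqrt() spreads the masked points evenly over the disk's area
    # instead of concentrating them near the true location.
    distance = max_radius * math.sqrt(rng.random())
    return x + distance * math.cos(angle), y + distance * math.sin(angle)


def aggregate_counts(records, key="town"):
    """Aggregate point records to counts per areal unit (hypothetical 'town' field)."""
    counts = {}
    for record in records:
        counts[record[key]] = counts.get(record[key], 0) + 1
    return counts
```

The maximum radius is the key design choice: a larger radius protects privacy better but adds positional error, which matters for the cluster-detection concerns raised on the next slide.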

Spatial Analysis: CAUTION • Valid spatial analysis of geocoded health data rests on the assumption that geocodes are complete and correct. Although this assumption is often sufficiently well satisfied for area-level geocodes such as ZIP codes, it is not for point-level geocodes. • 10%-30% match failure of subjects’ addresses is common • Positional errors of hundreds of meters are common • Incompleteness and inaccuracy of geocodes generally have an adverse effect on statistical analysis and can lead, if ignored, to conclusions of dubious value. • E.g., the power to detect a disease cluster becomes badly degraded when the positional error is of similar magnitude to the cluster itself. Citation: Geocoding Health Data (2008) by Rushton et al.

SPATIAL ANALYSIS: STATISTICS • Point data: does the observed pattern of point locations appear random, or do events appear to be clustered? • The cases and controls combined serve as the denominator • Use spatial scan statistics for detecting cases that show clustering • Areal data: what associations do we observe between the observed counts and covariate values measured in the same regions? • Typically use the census-based population counts for the denominators of the sub-regions under study • Use spatial scan statistic for detecting area units that show clustering • Scan statistic: repeatedly scans the area of study using scan windows (e.g., circles) of varying size and calculates the ratio of cases to controls within each window • Depending on the study question and approach, a scan window that has a ratio of cases that is significantly higher than other windows may be interpreted as a cluster Citation: Geocoding Health Data (2008) by Rushton et al.
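The scanning idea described above can be sketched as follows. This is a deliberately simplified illustration: it only counts cases and controls inside circular windows and ranks the windows by case fraction. A real spatial scan statistic also evaluates a likelihood ratio and assesses significance (typically by Monte Carlo simulation), which is omitted here. The function name and the input layout are assumptions.

```python
import math


def scan_windows(cases, controls, centers, radii):
    """Count cases and controls in circular windows of varying size.

    cases, controls: lists of (x, y) points in projected coordinates.
    Returns tuples (center, radius, n_cases, n_controls, case_fraction),
    sorted so that the window with the highest case fraction comes first.
    """
    results = []
    for cx, cy in centers:
        for radius in radii:
            n_cases = sum(
                1 for x, y in cases if math.hypot(x - cx, y - cy) <= radius
            )
            n_controls = sum(
                1 for x, y in controls if math.hypot(x - cx, y - cy) <= radius
            )
            total = n_cases + n_controls  # cases + controls = denominator
            if total == 0:
                continue  # empty window: nothing to evaluate
            results.append(((cx, cy), radius, n_cases, n_controls, n_cases / total))
    return sorted(results, key=lambda window: window[4], reverse=True)
```

A high case fraction in a small window is only a candidate cluster; with few points per window, large fractions arise by chance, which is why the significance step matters.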

SPATIAL SCAN STATISTIC Cancer Map Patterns: Are They Random or Not? Martin Kulldorff, PhD, Changhong Song, MS, David Gregorio, PhD, Holly Samociuk, BS, Laurie DeChello, MPH Background • Maps depicting the geographic variation in cancer incidence, mortality or treatment can be useful tools for developing cancer control and prevention programs, as well as for generating etiologic hypotheses. An important question with every cancer map is whether the geographic pattern seen is due to random fluctuations, as by pure chance there are always some areas with more cases than expected, or whether the map reflects true underlying geographic variation in screening, treatment practices, or etiologic risk factors. Methods • Nine different tests for spatial randomness are evaluated in very practical settings by applying them to cancer maps for different types of data at different scales of spatial resolution: breast, prostate, and thyroid cancer incidence; breast cancer treatment and prostate cancer stage in Connecticut; and nasopharynx and prostate cancer mortality in the U.S. Results • Tango’s MEET, Oden’s Ipop, and the spatial scan statistic performed well across all the data sets. Besag-Newell’s R, Cuzick-Edwards k-NN, and Turnbull’s CEPP often perform well, but the results are highly dependent on the parameter chosen. Moran’s I performs poorly for most data sets, whereas Swartz Entropy Test and Whittemore’s Test perform well for some data sets but not for others. Conclusions • When publishing cancer maps we recommend evaluating the spatial patterns observed using Tango’s MEET, a global clustering test, and the spatial scan statistic, a cluster detection test.

SECTION 2: LESSONS LEARNED • The next section will review some basic concepts of geocoding along with a variety of lessons I have learned over the years pertaining to the various concepts or steps in the process.

Geocoding Addresses • Geocoding is the process of assigning a location to an address • Purpose is to use the locational information for spatial analysis • Collect the address information • Use a geocoder to assign a location to the address • Use the locational information for analysis

BASICS: SELF-REPORTED ADDRESSES Self-reported addresses are of varying quality • The source of the address information impacts quality • Reason for reporting the address (e.g. driver’s license, hospital ER) • Collection system (e.g. electronic, paper, over the phone) • Preprocessing is typically necessary • Parse into necessary fields • Clean up the structure, remove extra information • Consider additional cleaning options (e.g. QAS) • Birth and death certificates are legal documents • Birth addresses: 99% match rate* • Death addresses: 90%-96% match* * reformatted, cleaned, processed through an external software, idiosyncratic edits applied, and hand matched

LESSON LEARNED: PRE-PROCESSING • Pre-processing is pivotal to successful geocoding • Can significantly improve overall rates • Can significantly reduce post-processing time • Automating the pre-processing is ideal • Hand-matching often results in corrections that can be implemented during pre-processing • Geocodes may be more accurate since the addresses are cleaner
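Automated pre-processing of the kind described here (parse into fields, clean up the structure, remove extra information) might look like the sketch below. The suffix table, the unit pattern, and the function names are illustrative assumptions; they do not reproduce the actual CT DPH cleaning rules or the QAS software.

```python
import re

# Hypothetical abbreviation table; a production table would be much longer.
SUFFIXES = {
    "STREET": "ST",
    "AVENUE": "AVE",
    "ROAD": "RD",
    "DRIVE": "DR",
    "TURNPIKE": "TPKE",
}

# Secondary information (apt, floor, unit) is irrelevant for centerline geocoding.
UNIT_PATTERN = re.compile(
    r"\b(APT|APARTMENT|UNIT|FL|FLOOR|STE|SUITE)\b\.?\s*\S+|#\s*\S+"
)


def clean_street(raw):
    """Uppercase, drop unit designators and punctuation, abbreviate suffixes."""
    text = raw.upper().strip()
    text = UNIT_PATTERN.sub(" ", text)
    text = re.sub(r"[.,]", " ", text)
    words = [SUFFIXES.get(word, word) for word in text.split()]
    return " ".join(words)


def parse_address(raw):
    """Split a cleaned street line into house number and street name."""
    cleaned = clean_street(raw)
    match = re.match(r"^(\d+)\s+(.*)$", cleaned)
    if match:
        return {"number": match.group(1), "street": match.group(2)}
    return {"number": None, "street": cleaned}
```

Corrections that keep recurring during hand-matching are good candidates for new entries in tables like `SUFFIXES`, so the fix is applied automatically on the next run.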


Basics: Street Database • Geocoder uses a database of street segments for reference • “Street centerlines” or “street reference dataset” • Segments are associated with geographic locations • Actual road network broken into segments • Segments have number range/street name elements/town/state/zip code • Historical street data not maintained • Delays for new street or housing construction • Alternate street names are supplied • CT uses Tele Atlas for the streets database • Tele Atlas (TA) segments use TIGER-line networks (Census-based) • Segments break at Census-based town and tract boundaries • CT uses official CT DEEP-based polygons for the town boundaries • Census ≠ DEEP → misaligned boundary lines • Segments may not break at DEEP town boundaries

LESSON LEARNED: BOUNDARY DIFFERENCES • CT DEEP Town Boundary: Cheshire/Wallingford line • Tele Atlas Town/Tract Boundary: Cheshire/Wallingford line

Basics: Geocoder • Address Locator = address geocoder in ArcMap • Complex pattern matching program • Compares input addresses to your reference database • Uses a “style” to define the pattern (odd-even numbering, mixed numbers) • Produces a score to represent the confidence in a match between the input address and the database segment • Many parameters are user-defined • Options for applying alternate street names and place names • Street segments – point location is estimated along the segment line • May not accurately reflect parcel location • Does not represent building location • Roof-top – the point is placed where the building is located • Emphasizes the actual building location • Databases are growing but not available for all areas

LESSON LEARNED: INTERPOLATION IS NOT EXACT • Segments require interpolation for placement • Geocoder matches the input address to a street segment • Uses the number range, street name, town, state, and zip code to determine a match • The placement of the point is estimated within the segment’s number range • Longer segments may have lower accuracy (rural areas) • Can be problematic at boundaries (parcels, tracts, town, state, shoreline) • Example (map): 42 Alison Ave. The point is placed about 50% of the way along the segment, on the left side. Each segment is depicted with a separate color for illustration purposes. The green background represents Cheshire town; the grey background represents Wallingford town.
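The placement step can be illustrated with simple linear interpolation along a straight segment. This is a sketch under stated assumptions: the segment is a single straight line, house numbers are spread evenly across the range, and the function name, argument layout, and number range used in the usage note are hypothetical. It is not the ArcGIS implementation.

```python
import math


def interpolate_address(number, seg_start, seg_end, range_low, range_high,
                        side="left", side_offset=0.0):
    """Estimate a point for a house number along a street segment.

    seg_start, seg_end: (x, y) endpoints in a projected coordinate system.
    range_low, range_high: house-number range on the matching side.
    side: "left" or "right" relative to the direction start -> end.
    side_offset: distance from the centerline, in map units.
    """
    if range_high == range_low:
        fraction = 0.5  # single-number range: place at the midpoint
    else:
        fraction = (number - range_low) / (range_high - range_low)
    fraction = min(max(fraction, 0.0), 1.0)  # keep the point on the segment

    (x1, y1), (x2, y2) = seg_start, seg_end
    x = x1 + fraction * (x2 - x1)
    y = y1 + fraction * (y2 - y1)

    length = math.hypot(x2 - x1, y2 - y1)
    if length == 0 or side_offset == 0:
        return x, y

    # Unit normal pointing to the left of the direction of travel.
    nx, ny = -(y2 - y1) / length, (x2 - x1) / length
    sign = 1.0 if side == "left" else -1.0
    return x + sign * side_offset * nx, y + sign * side_offset * ny
```

For example, with an assumed range of 2 to 82, house number 42 falls halfway along the segment: `interpolate_address(42, (0, 0), (100, 0), 2, 82, "left", 10)` lands at the segment midpoint, offset 10 units to the left. On a long rural segment the even-spacing assumption is where most of the positional error comes from.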

Lessons Learned: UNDERSTAND YOUR PARAMETERS • Use the user-defined parameters to your benefit • 90% minimum match score = preferred sensitivity/specificity rate for our health data • Side offset is distance from the center of the road • Beware that in urban settings, an offset of 30+ feet can place the point in a parcel behind the intended parcel or over a town/census boundary • End offset prevents clustering at intersections • Do not match if candidates tie! • ESRI noted that there was no logic for how the geocoder would choose one candidate over any other • Output x/y coordinates – isn’t that why you are geocoding? • Coordinates are in your current projection • Can add coordinates at any time with Add XY Coordinates • Convert to NAD83 if you want to add X/Y as Lat/Lon
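The minimum-score and tie rules above amount to a small decision procedure. The sketch below is illustrative only; the candidate dictionaries and the function name are assumptions and do not reflect how ArcGIS stores candidates internally.

```python
def choose_candidate(candidates, min_score=90):
    """Pick the single best candidate, or None if there is no safe choice.

    candidates: list of dicts with at least a numeric "score" key.
    Returns None when no candidate reaches min_score or when the top
    score is shared by more than one candidate (a tie).
    """
    eligible = [c for c in candidates if c["score"] >= min_score]
    if not eligible:
        return None
    best_score = max(c["score"] for c in eligible)
    top = [c for c in eligible if c["score"] == best_score]
    if len(top) > 1:
        return None  # tie: leave unmatched for interactive review
    return top[0]
```

Leaving ties unmatched sends them to hand-matching, where base maps can show whether one of the tied candidates is plausible.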

Lessons Learned: USE Alternate and Place Names • Alternate Names • Locator will not automatically evaluate alternate names • Ex: 1001 Berlin Tpke = 1001 Route 5 = 1001 North Broad St • If Berlin Tpke is the primary street name, Route 5 will fail to geocode • May require creating a standalone dataset of the alternates from the streets database • Must be defined when creating the address locator • Significantly improves match rates! • Place names table can be included in the address locator • Substitutes an address when a place name is found • Ex: Masonic Home > 22 Masonic Ave • Not provided in the street database; I culled mine from various sources • Notable improvement in matches with death data where nursing homes, elderly housing, and other facilities are more often reported • Can be added after the locator is created
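Conceptually, both tables act as lookups applied before matching. The sketch below uses the examples from this slide; the dictionary layout and the function name are assumptions for illustration. In ArcGIS these tables are attached to the address locator itself, and a real alternate-names table is keyed to individual street segments rather than to a street name alone.

```python
# Place name -> address, using the example from the slide.
PLACE_NAMES = {"MASONIC HOME": "22 MASONIC AVE"}

# Alternate street name -> primary street name, using the slide's example.
ALTERNATE_STREETS = {
    "ROUTE 5": "BERLIN TPKE",
    "NORTH BROAD ST": "BERLIN TPKE",
}


def resolve_address(raw):
    """Substitute place names and alternate street names before matching."""
    text = " ".join(raw.upper().split())
    if text in PLACE_NAMES:
        return PLACE_NAMES[text]
    number, _, street = text.partition(" ")
    if number.isdigit() and street in ALTERNATE_STREETS:
        return f"{number} {ALTERNATE_STREETS[street]}"
    return text
```

With these tables, `resolve_address("1001 Route 5")` yields the primary form `1001 BERLIN TPKE`, which the centerline reference data can match.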

Lesson Learned: USE Composite Locators • Composite locator will cycle through individual locators • Addresses are matched using the style and settings of each locator • Not possible to re-run the locator on just a portion of the input dataset, so… • …a composite locator allows the user to define a hierarchy of locators • More than one “style” (e.g., odd/even numbering, mixed numbering) • More than one reference database (e.g., roof top, centerlines) • More than one zone field (e.g., town name, town code, postal name) • Different sets of parameters (e.g., change in spelling sensitivities)
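The hierarchy a composite locator defines is essentially a fallback chain: try the most trusted locator first and move to the next only when it finds no match. The sketch below shows that control flow with toy locators; the names and the callable-based interface are assumptions, not the ArcGIS API.

```python
def composite_geocode(address, locators):
    """Try each locator in order and return the first match.

    locators: ordered list of (name, locate) pairs, where locate(address)
    returns a result or None when the address does not match.
    Returns (locator_name, result), or (None, None) if nothing matched.
    """
    for name, locate in locators:
        result = locate(address)
        if result is not None:
            return name, result
    return None, None


# Toy reference data standing in for real locators (hypothetical coordinates).
ROOFTOP = {"22 MASONIC AVE": (1.0, 2.0)}
CENTERLINE = {"22 MASONIC AVE": (1.1, 2.1), "42 ALISON AVE": (3.0, 4.0)}
LOCATORS = [("rooftop", ROOFTOP.get), ("centerline", CENTERLINE.get)]
```

Recording which locator produced each match is worth keeping in the output, since a rooftop match and an interpolated centerline match differ in positional accuracy.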

Lesson Learned: ADVANCED “Style” Customization • Adjusting the address locator “styles” • In ArcGIS 10.0, the address matching programs are written in .xml • ESRI provides a white paper on customizing address locators in .xml • The code is editable, but be careful! Coding errors may not be obvious in your results • ESRI created a custom dual-streets style for me upon special request • Increased the weight for street name elements • Previously: North Main St could match to Main St with a 90% or greater score • Now: Increased penalty results in cleaner matches at the 90% minimum • Increased the penalty for town and zip code elements • Previously: Town code 100 would match to Town code 10 with a 90% or greater score • Now: Requires an exact match to meet the minimum 90% match score • Made the cost/penalty for town and zip equal so that one isn’t prioritized • Previously: zip code was given greater weight than town • Allows for a missing town or a missing zip with no penalty • Only when one of the fields is null is there no penalty • Makes it possible to geocode using only town or only zip code • Used heavily in hand-matching: I can null a field I know is incorrect (e.g., zip code is outdated) • Allows for a swap between pre-directional and post-directional when one was null • Middle Tpke W will now match to W Middle Tpke without penalty • Added local idiosyncrasies to the pattern matching • Ex: Ella T. Grasso • Significantly reduced the time I spent hand-matching addresses
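The town/zip rules described above can be expressed as a small scoring sketch. This is not the ESRI .xml style: the penalty value is hypothetical (chosen only so that a single mismatch falls below the 90% minimum), and treating "both fields null" as fully penalized is one reading of the slide, labeled here as an assumption.

```python
MIN_SCORE = 90      # minimum match score used on the slides
ZONE_PENALTY = 15   # hypothetical; the actual style weights are not given


def zone_penalty(in_town, in_zip, ref_town, ref_zip):
    """Penalty for the zone fields, with town and zip weighted equally.

    A null (None) input field carries no penalty as long as the other
    field is present. Assumption: if both are null, both are penalized.
    """
    if in_town is None and in_zip is None:
        return 2 * ZONE_PENALTY
    penalty = 0
    if in_town is not None and in_town != ref_town:
        penalty += ZONE_PENALTY  # exact match required, so "10" != "100"
    if in_zip is not None and in_zip != ref_zip:
        penalty += ZONE_PENALTY
    return penalty


def is_match(in_town, in_zip, ref_town, ref_zip):
    """True when the zone fields alone keep the score at or above MIN_SCORE."""
    return 100 - zone_penalty(in_town, in_zip, ref_town, ref_zip) >= MIN_SCORE
```

This shows why nulling a field known to be wrong helps during hand-matching: an outdated zip code costs a penalty, while a null zip code costs nothing if the town is correct.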

Basics: Interactive Match • Able to review results or edit addresses to adjust match status • Subset the data file for selective review • Hand-match only unmatched addresses • Review matched addresses (ex: less than 100%) • Use a data variable to subset addresses (ex: QAS edit Y/N) • Use the candidates button to map candidates • This is where the base maps become valuable!

LESSON LEARNED: Interactive MEANS INTERACTIVE • Overlay relevant base maps to help decision processes • Street network, town boundaries, tract boundaries, block group boundaries, zip code-town overlap, water bodies, parcels • Edit the input address (sensibly) to adjust the match score • Misspellings or other text errors • Null a town name or zip code (when using CT-DPH custom locator) • Some ties cannot be reconciled • Two street names with the same number range may exist in separate sections of a single town or zip code • Keep notes about addresses or areas where there are repeated issues that can be corrected during pre-processing

Lessons Learned: Interactive Match This address reports Colchester for town name but falls outside of Colchester town. Colchester is reported as the town name because Colchester is the postal area name for 06415, not because the address exists in Colchester town. Nullifying the town field results in a 100% match on zip code 06415 in East Haddam town (red circle).

LESSON LEARNED: Mailing Addresses • Self-reported addresses are usually mailing addresses • Mailing addresses are a function of the USPS • Mailing address may not reflect street centerlines • PO Boxes, rural routes, housing developments • Postal name is reported as the city/town • Each postal code has a postal code name • The postal name for each zip code is a function of the USPS • Postal names may be the same as CT town names but are independent constructs • Postal codes, names, and boundaries can change over time • Secondary information (apt, floor, unit) used for mail delivery is often irrelevant for geocoding with street centerlines • Street centerlines are a function of the municipalities and the D.O.T.

Lesson Learned: Reported town ≠ OFFICIAL CT town • When entered into data systems, the reported address is often parsed into separate fields: street, town, state, zip code • If the reported address is a mailing address, the postal name will be entered into the town field • Some data systems assign one of the 169 official CT towns based on reported town and zip code • May use a cross-index between CT town and USPS zip code to assign CT town • *What happens when the postal zone extends beyond the town border? • Colchester zip code extends into East Haddam town • New Haven and East Haven zip codes bisect New Haven and East Haven towns • *What happens when the postal zone is the same name as another town? • Canaan zip code exists only in North Canaan town • North Canton zip code exists only in Barkhamsted town • *What happens when a postal zone represents more than one town? • Stafford Springs zip code contains Stafford and Union towns • Jewett City zip code contains Lisbon and Griswold • Mystic zip code spans Groton and Stonington • Moosup zip code spans Plainfield and Sterling • *See GIS DAY 2011 poster: Geocoding Vital Events
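A cross-index makes these cases easy to flag: a postal name that maps to exactly one town can be assigned directly, while one that maps to several towns cannot be resolved without geocoding. The sketch below uses only the examples listed on this slide; the table layout and the function name are assumptions, and a real cross-index would cover every CT zip code.

```python
# Postal name -> official CT towns it covers (examples from this slide only).
POSTAL_NAME_TO_TOWNS = {
    "COLCHESTER": ["COLCHESTER", "EAST HADDAM"],
    "CANAAN": ["NORTH CANAAN"],
    "NORTH CANTON": ["BARKHAMSTED"],
    "STAFFORD SPRINGS": ["STAFFORD", "UNION"],
    "JEWETT CITY": ["LISBON", "GRISWOLD"],
    "MYSTIC": ["GROTON", "STONINGTON"],
    "MOOSUP": ["PLAINFIELD", "STERLING"],
}


def assign_town(postal_name):
    """Return (town, needs_geocoding) for a reported postal name.

    town is None when the postal name is unknown or spans several towns;
    in both cases the address must be geocoded to find the town.
    """
    towns = POSTAL_NAME_TO_TOWNS.get(postal_name.upper().strip())
    if towns is not None and len(towns) == 1:
        return towns[0], False
    return None, True
```

Note that the single-town case still changes the reported value: a reported town of Canaan resolves to North Canaan, which is exactly the error that skews town-level counts when left uncorrected.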

GIS DAY 2011 POSTER: TOWN? ZIP CODE? Zoom in on slide, if necessary.

Lesson Learned: GEOCODING TO VERIFY “TOWN OF RESIDENCE” • Town of residence is used heavily for statewide reporting and analysis • Some systems may use mailing address to populate town of residence • Does not account for many zip codes that overlap into neighboring towns • Zip codes change over time but collection systems are often not updated • Cannot determine correct town of residence without geocoding • Town-level counts/rates can be skewed when corrections are not made • Births and deaths are geocoded to assign town of residence for statistical purposes • CT Tumor Registry collapses Canaan and North Canaan when reporting town-level statistics. These neighboring towns showed highly disparate rates resulting from incorrectly assigned tumor cases. • Less populous geographies are more affected by incorrect assignments

GIS DAY 2012: Birth Rates by Town of Residence • Here is an example of reporting by town of residence. • Incorrect town assignment for births could notably affect the birth rates reported for some towns. • Town of residence at birth is a primary factor in local and state population estimates and projections. • It is also a primary input for towns planning for future school enrollment and districting.

FINAL THOUGHTS • Many lessons learned over 7 years • Pre-process your data • Inspect your results • Automate your processes whenever possible • Reported address does not always provide accurate town of residence • Remember that geocoded coordinates are confidential data • Recognize and comprehend the limitations of your geocoded dataset as it pertains to further spatial analysis • Verify your assumptions at every step of the way
