Writeslike.us
Em Tonkin, Andrew Hewson
e.tonkin@ukoln.ac.uk, a.hewson@ukoln.ac.uk

Background
Relevant research themes:
• Metadata harvesting and reuse
• Automatic metadata extraction
• Text analysis
• Social network analysis
• Scholarly communication, particularly informal communication

Aim
Helping people to find each other:
• Finding other researchers with similar interests to yourself in your geographic area
• Or in your area of research
• Not everybody with similar interests will attend the same conferences!
• Helping students find potential research supervisors
• Encouraging serendipity

Relevant technologies
In fact there are an awful lot of these.
Social network analysis:
• Requires a very large dataset
• Solvable either by:
  a) being Facebook or similar (but adoption rates are far from 100%)
  b) automated analysis of relevant data
• Solution b) is cheap, simple, and very fallible
• Not a new approach: it is at the core of bibliometrics

Relevant technical problems
• Author identity disambiguation
• Formal social networks disambiguate between instances of individual names (for example, if there are many people called 'John Smith', the system can tell you which is which)
• Needs to be solved to an acceptable level
• Need to define how good 'acceptable' is
• Formal solutions usually depend on unique identifiers plus registries
• Cheap, moderately effective solution: disambiguate via textual characteristics plus metadata
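As an illustration of the cheap approach in the last bullet, the sketch below scores whether two harvested name instances might refer to the same person. It is not the project's actual method: the `AuthorRecord` shape, the Jaccard keyword overlap, the email-domain bonus, and the 0.7/0.3 weights are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorRecord:
    """One appearance of an author name on a harvested record (hypothetical shape)."""
    surname: str
    initials: str                      # e.g. "J" or "JA"
    keywords: set = field(default_factory=set)
    email_domain: str = ""             # empty when unknown


def jaccard(a, b):
    """Overlap of two keyword sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def names_compatible(x, y):
    """Same surname, and one set of initials is a prefix of the other."""
    if x.surname.lower() != y.surname.lower():
        return False
    a, b = x.initials.upper(), y.initials.upper()
    return a.startswith(b) or b.startswith(a)


def same_person_score(x, y):
    """Heuristic score in [0, 1]; the weights are illustrative, not tuned."""
    if not names_compatible(x, y):
        return 0.0
    score = 0.7 * jaccard(x.keywords, y.keywords)
    if x.email_domain and x.email_domain == y.email_domain:
        score += 0.3
    return score


# Invented records: two plausible matches and one namesake in another field.
a = AuthorRecord("Smith", "J", {"metadata", "harvesting", "repositories"}, "example.ac.uk")
b = AuthorRecord("Smith", "JA", {"metadata", "repositories", "oai"}, "example.ac.uk")
c = AuthorRecord("Smith", "J", {"protein", "folding"}, "")

print(same_person_score(a, b))
print(same_person_score(a, c))
```

Deciding how good 'acceptable' is then amounts to choosing a threshold on a score like this and checking it against a hand-labelled sample.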

Methodology
• Harvest OAI metadata: captures large list of:
  • Author names (somewhat randomly formatted)
  • Digital object titles, descriptions (sometimes), dates (sometimes) and content (sometimes)
  • Citations (sometimes)
• Spider digital objects, analyse them for formal metadata: retrieve email addresses, etc.
• Retain OAI source: useful clue regarding author affiliations (sometimes)
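The harvesting step can be sketched roughly as follows, assuming records arrive as `oai_dc` XML. The sample record, the `parse_oai_dc` and `find_emails` helpers, and the email pattern are invented for illustration; only the two namespace URIs are the standard OAI-DC and Dublin Core ones. Missing fields simply come back as empty lists, which matches the 'sometimes' nature of the data.

```python
import re
import xml.etree.ElementTree as ET

# Standard Dublin Core element namespace, in ElementTree's {uri}tag form.
DC = "{http://purl.org/dc/elements/1.1/}"

# Deliberately loose email pattern; an assumption, not a full RFC 5322 parser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Invented record for demonstration only.
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An invented example record</dc:title>
  <dc:creator>Smith, John A.</dc:creator>
  <dc:creator>J. Doe</dc:creator>
  <dc:date>2010</dc:date>
</oai_dc:dc>"""


def parse_oai_dc(xml_text, source):
    """Pull the loosely populated Dublin Core fields out of one record."""
    root = ET.fromstring(xml_text)

    def values(tag):
        return [el.text.strip() for el in root.iter(DC + tag)
                if el.text and el.text.strip()]

    return {
        "source": source,                    # retained as an affiliation clue
        "creators": values("creator"),
        "titles": values("title"),
        "descriptions": values("description"),
        "dates": values("date"),
    }


def find_emails(text):
    """Return the distinct email-like strings found in spidered text."""
    return sorted(set(EMAIL_RE.findall(text)))


record = parse_oai_dc(SAMPLE_RECORD, "repository.example.org")
print(record["creators"])
print(find_emails("Contact: j.smith@example.ac.uk."))
```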

Methodology (II)
• Analyse text for noun-phrase-like structures: useful clue as to theme
• Background information required, such as: institution name, domains/URLs associated with each institution
• Retrieved via harvesting from Wikipedia
• Much of this information is not well-structured, so unavailable via DBpedia
• Poorly structured information needs filtering: for example, author names are not consistently structured between repositories. This is a machine learning problem.
• Search with contextual network graph algorithm
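To make the author-name problem concrete, here is a minimal rule-based baseline that reduces differently formatted names to a comparable (surname, initials) key. It is a stand-in for illustration, not the machine learning approach mentioned above; `normalise_name` is a hypothetical helper, and it assumes only two conventions ("Surname, Forenames" and "Forenames Surname"), so multi-word surnames such as "van der Berg" written without a comma will be mishandled.

```python
import re


def normalise_name(raw):
    """Return (lower-cased surname, initials) for a loosely formatted name."""
    name = re.sub(r"\s+", " ", raw).strip()
    if "," in name:
        # "Surname, Forenames" convention
        surname, _, forenames = name.partition(",")
    else:
        # "Forenames Surname" convention: assume the last token is the surname
        parts = name.split(" ")
        surname, forenames = parts[-1], " ".join(parts[:-1])
    # Split forenames on spaces, dots and hyphens; keep each piece's first letter
    pieces = [p for p in re.split(r"[\s.\-]+", forenames) if p]
    initials = "".join(p[0].upper() for p in pieces)
    return surname.strip().lower(), initials


for variant in ["Smith, John A.", "J. A. Smith", "SMITH, J", "john smith"]:
    print(variant, "->", normalise_name(variant))
```

The first two variants reduce to the same key, while the last two keep only the first initial, which is why a downstream matching step still has to tolerate partial initials.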

'Sometimes' and 'usually'
• Statistics are:
  • Cheap
  • Imperfect
  • Available
• Rapid innovation philosophy:
  • Cheap is good
  • Simple is good
  • Solutions requiring novel/additional uptake of infrastructure are out of reach

Results
• Basic concept worked well
• Law of diminishing returns: beyond the first 80-90%, increasing effort led to only minor improvements in the dataset (minor niggles!)
• Interface development actually required more time than the dataset development, and exceeded project length...
• But a useful dataset can be released as linked data and reused for various purposes
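As a rough idea of what a linked data release could look like, the sketch below serialises one person as N-Triples using FOAF terms. The choice of FOAF, the `person_triples` helper, and the example.org URIs are assumptions made for this example; the slides do not say which vocabulary or serialisation the project used.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(text):
    """Escape the characters that N-Triples string literals require."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def person_triples(person_uri, name, interest_uris):
    """Describe one person and their topics of interest as N-Triples lines."""
    lines = [
        f"<{person_uri}> <{RDF_TYPE}> <{FOAF}Person> .",
        f'<{person_uri}> <{FOAF}name> "{escape_literal(name)}" .',
    ]
    for topic_uri in interest_uris:
        lines.append(f"<{person_uri}> <{FOAF}topic_interest> <{topic_uri}> .")
    return lines


# Hypothetical identifiers, for illustration only.
triples = person_triples(
    "http://example.org/person/1",
    'John "J" Smith',
    ["http://example.org/topic/metadata"],
)
print("\n".join(triples))
```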

Walkthrough: Basic search (the harder method!)

Advanced search

Walkthrough

Conclusion
• OAI-DC (and Wikipedia!) is a good source for 'semi-structured' data
• There is a great deal of potential for using this together with appropriate analysis tools, such as those explored within the FixRep project, to develop social network-like graphs
• Application of this type of data for the purpose of encouraging informal academic communication/collaboration is an interesting research field with many potential applications
