
Writeslike Em Tonkin, Andrew Hewson e.tonkin@ukoln.ac.uk a.hewson@ukoln.ac.uk



Presentation Transcript


  1. Writeslike.us Em Tonkin, Andrew Hewson e.tonkin@ukoln.ac.uk a.hewson@ukoln.ac.uk

  2. Background
     Relevant research themes:
     • Metadata harvesting and reuse
     • Automatic metadata extraction
     • Text analysis
     • Social network analysis
     • Scholarly communication, particularly informal communication

  3. Aim
     Helping people to find each other:
     • Finding other researchers with similar interests to yourself in your geographic area
     • Or in your area of research
     • Not everybody with similar interests will attend the same conferences!
     • Helping students find potential research supervisors
     • Encouraging serendipity

  4. Relevant technologies
     In fact there are an awful lot of these.
     Social network analysis:
     • Requires a very large dataset
     • Solvable either by:
       a) being Facebook or similar (but adoption rates are far from 100%)
       b) automated analysis of relevant data
     • Solution b) is cheap, simple, and very fallible
     • Not a new approach – at the core of bibliometrics

  5. Relevant technical problems
     • Author identity disambiguation
       • Formal social networks disambiguate between instances of individual names (for example, if there are many people called 'John Smith', the system can tell you which is which)
       • Needs to be solved to an acceptable level
       • Need to define how good 'acceptable' is
       • Formal solutions usually depend on unique identifiers + registries
       • Cheap, moderately effective solution: disambiguate via textual characteristics + metadata
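The "cheap, moderately effective" approach on this slide can be illustrated with a minimal sketch. It is not the project's actual method: the record fields (`name`, `terms`, `email`), the surname-plus-initial key, the Jaccard overlap measure and the 0.2 threshold are all assumptions chosen for illustration.

```python
import re
import unicodedata


def name_key(raw):
    """Reduce 'Smith, John', 'John Smith' or 'SMITH, J.' to ('smith', 'j')."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(c for c in text if not unicodedata.combining(c)).lower()
    if "," in text:
        # 'Surname, Given' form
        surname, _, given = text.partition(",")
    else:
        # 'Given Surname' form: treat the last token as the surname
        parts = text.split()
        surname = parts[-1] if parts else ""
        given = " ".join(parts[:-1])
    surname = re.sub(r"[^a-z]", "", surname)
    given = re.sub(r"[^a-z]", "", given)
    return surname, given[:1]


def jaccard(a, b):
    """Overlap between two term sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def disambiguate(records, threshold=0.2):
    """Greedily group records that plausibly belong to the same person.

    Records join an existing cluster only if the name key matches AND
    either an email address is shared or the term overlap is high enough.
    """
    clusters = []
    for rec in records:
        key = name_key(rec["name"])
        email = rec.get("email")
        terms = set(rec.get("terms", ()))
        match = None
        for cluster in clusters:
            if cluster["key"] != key:
                continue
            shares_email = bool(email) and email in cluster["emails"]
            if shares_email or jaccard(terms, cluster["terms"]) >= threshold:
                match = cluster
                break
        if match is None:
            match = {"key": key, "emails": set(), "terms": set(), "records": []}
            clusters.append(match)
        if email:
            match["emails"].add(email)
        match["terms"] |= terms
        match["records"].append(rec)
    return clusters


records = [
    {"name": "Smith, John", "terms": ["metadata", "harvesting", "repositories"],
     "email": "j.smith@example.ac.uk"},
    {"name": "John Smith", "terms": ["metadata", "repositories", "dublin core"]},
    {"name": "J. Smith", "terms": ["protein", "folding", "enzymes"]},
    {"name": "SMITH, J.", "terms": ["usability"], "email": "j.smith@example.ac.uk"},
]
clusters = disambiguate(records)
for cluster in clusters:
    print(cluster["key"], [r["name"] for r in cluster["records"]])
```

In this toy data the first, second and fourth records end up together (shared terms or a shared email address), while the 'J. Smith' writing about protein folding is kept apart. The key is deliberately crude: a multi-word surname written without a comma would be cut down to its last token, which is the kind of fallibility the slide warns about.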

  6. Methodology
     • Harvest OAI metadata: captures large list of:
       • Author names (somewhat randomly formatted)
       • Digital object titles, descriptions (sometimes), dates (sometimes) and content (sometimes)
       • Citations (sometimes)
     • Spider digital objects, analyse them for formal metadata – retrieve email addresses, etc.
     • Retain OAI source: useful clue regarding author affiliations (sometimes)
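As a rough illustration of the harvesting step, the sketch below parses an OAI-PMH `ListRecords` response in `oai_dc` format using only the Python standard library. The sample XML, the repository name `repo.example.org`, the output field names and the email regular expression are assumptions made for this example; a real harvester would also need to fetch the responses over HTTP and follow resumption tokens.

```python
import re
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH, the oai_dc container and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Deliberately loose pattern: good enough to spot candidate addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical response body, trimmed to the parts used below.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example paper</dc:title>
          <dc:creator>Smith, John</dc:creator>
          <dc:creator>J. Doe</dc:creator>
          <dc:description>Contact: j.smith@example.ac.uk</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:repo.example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A record without a description</dc:title>
          <dc:creator>DOE, Jane</dc:creator>
          <dc:date>2009</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_records(xml_text, source):
    """Turn one ListRecords response into a list of plain dicts."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        if dc is None:
            # Deleted records carry a header but no metadata.
            continue

        def values(tag):
            return [el.text.strip() for el in dc.findall(f"dc:{tag}", NS) if el.text]

        descriptions = values("description")
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "source": source,  # retained as a clue to author affiliation
            "creators": values("creator"),
            "titles": values("title"),
            "dates": values("date"),
            "descriptions": descriptions,
            "emails": sorted({m for d in descriptions for m in EMAIL_RE.findall(d)}),
        })
    return records


records = parse_records(SAMPLE_RESPONSE, source="repo.example.org")
for record in records:
    print(record["id"], record["creators"], record["emails"])
```

Every field is a list that may be empty, which mirrors the "sometimes" on the slide: dates, descriptions and email addresses are used when present and simply missing otherwise.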

  7. Methodology (II)
     • Analyse text for noun-phrase-like structures – useful clue as to theme
     • Background information required, such as: institution name, domains/URLs associated with each institution
       • Retrieved via harvesting from Wikipedia
       • Much of this information is not well-structured, so unavailable via DBpedia
     • Poorly structured information needs filtering: for example, author names are not consistently structured between repositories. This is a machine learning problem.
     • Search with contextual network graph algorithm
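The slides do not say how the contextual network graph search was implemented. One common reading of the technique is spreading activation over a graph that links items to the terms found in them, and the sketch below follows that reading as an assumption. The author names, terms, `decay` factor and `floor` cut-off are all hypothetical.

```python
from collections import defaultdict


def build_graph(author_terms):
    """Build an undirected bipartite graph linking authors to their terms."""
    graph = defaultdict(set)
    for author, terms in author_terms.items():
        for term in terms:
            graph[("author", author)].add(("term", term))
            graph[("term", term)].add(("author", author))
    return graph


def search(graph, query_terms, decay=0.5, floor=0.01):
    """Rank authors by the energy that reaches them from the query terms.

    Each query term starts with energy 1.0. A node passes on its energy,
    reduced by `decay` and split evenly between its neighbours, until the
    share per neighbour drops below `floor`.
    """
    activation = defaultdict(float)
    frontier = [(("term", t), 1.0) for t in query_terms if ("term", t) in graph]
    while frontier:
        node, energy = frontier.pop()
        activation[node] += energy
        neighbours = graph[node]
        share = energy * decay / len(neighbours)
        if share < floor:
            continue
        for neighbour in neighbours:
            frontier.append((neighbour, share))
    authors = {n[1]: e for n, e in activation.items() if n[0] == "author"}
    return sorted(authors.items(), key=lambda item: item[1], reverse=True)


author_terms = {
    "A. Author": ["metadata", "harvesting", "oai"],
    "B. Author": ["metadata", "text analysis"],
    "C. Author": ["protein folding"],
}
graph = build_graph(author_terms)
results = search(graph, ["harvesting"])
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In this toy graph 'B. Author' is returned, with a low score, despite never using the query term, because energy reaches them through the term shared with 'A. Author'. 'C. Author' is unconnected and is not returned. That indirect matching is what makes this style of search a plausible fit for the aim of encouraging serendipity. The frontier can grow quickly on densely connected graphs, so a larger dataset would need a tighter cut-off or a hop limit.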

  8. 'Sometimes' and 'usually'
     • Statistics are:
       • Cheap
       • Imperfect
       • Available
     • Rapid innovation philosophy:
       • Cheap is good
       • Simple is good
       • Solutions requiring novel/additional uptake of infrastructure are out of reach

  9. Results
     • Basic concept worked well
     • Law of diminishing returns: beyond the first 80-90%, increasing effort led to only minor improvements in the dataset (minor niggles!)
     • Interface development actually required more time than the dataset development, and exceeded the project length...
     • But a useful dataset can be released as linked data and reused for various purposes

  10. Walkthrough: Basic search (the harder method!)

  11. Advanced search

  12. Walkthrough

  13. Conclusion
     • OAI-DC (and Wikipedia!) is a good source for 'semi-structured' data
     • There is a great deal of potential for using this together with appropriate analysis tools, such as those explored within the FixRep project, to develop social network-like graphs
     • Application of this type of data for the purpose of encouraging informal academic communication/collaboration is an interesting research field with many potential applications
