Mining the web to improve semantic-based multimedia search and digital libraries
http://gate.ac.uk/
http://nlp.shef.ac.uk/
Horacio Saggion, Kalina Bontcheva
University of Sheffield
21 November 2006
IST Event 2006, Web Mining and Semantic Web: Networking with industry and academia
[This work has been partially supported by the SEKT (http://sekt.semanticweb.org/), PrestoSpace (http://www.prestospace.org) and TAO (http://www.tao-project.eu/) projects]
Web mining and semantic annotation: why?
• Semantic annotation produces an explicit representation of the knowledge in given content
• Knowledge is often implicit in the data sources
• …or hard to extract automatically with sufficient accuracy
• Frequently, knowledge can be mined from the web and merged with the original content to improve semantic search and reasoning capabilities
Web mining and semantic annotation: how?
• GATE is a widely used open-source infrastructure for text mining (http://gate.ac.uk):
  • Ten years old, with 1000s of users at 100s of sites
  • Supports major document formats and languages
• Helps build semantic annotation components
• Integrate these with content and knowledge mined from the web
• Create, test, and deploy these into an end-to-end application (some examples next)
RichNews: Multimedia Annotation
• The problem:
  • Access to archive material at the BBC is provided by some form of semantic annotation and indexing
  • Manual annotation is time consuming (up to 10x real time) and expensive
• Rich News (developed within the PrestoSpace project) aims to (partially) automate the annotation of news programs
  • Developed on BBC TV and radio news
  • Involving a human in the loop is possible if desired
  • Recordings of broadcasts go in one end
  • An index of semantic metadata describing each news story comes out the other
http://gate.ac.uk/sale/www05/web-assisted-annotation.pdf
Web mining in RichNews
• Why web mining:
  • Speech recognition (ASR) produces poor-quality transcripts with many mistakes
  • Closed captions/subtitles are not always available
  • The same news stories can also be found on the BBC and other web sites
• The solution:
  • Obtain key terms from the ASR transcripts
  • Search the web for related stories from the same date
  • Find the best matching stories
  • Obtain semantic annotations from this richer text
  • Merge with the semantic annotations on the transcript to obtain more precise knowledge, grounded in the video stream
http://gate.ac.uk/sale/www05/web-assisted-annotation.pdf
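The key-term and best-match steps above can be illustrated with a minimal Python sketch. It is not the RichNews implementation: the frequency-based term selection, the cosine similarity measure, the stopword list, and the names `key_terms`, `best_match` and `candidates` are all assumptions made for illustration. The web search itself is left out; `candidates` is assumed to already hold the text of stories retrieved for the broadcast date.

```python
import math
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not taken from RichNews).
STOPWORDS = {
    "the", "a", "an", "of", "in", "on", "and", "to", "is", "was", "for",
    "has", "have", "at", "by", "with", "that", "this", "it", "as", "are",
    "be", "from",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def key_terms(transcript, k=5):
    """Return the k most frequent non-stopword terms of a noisy transcript."""
    counts = Counter(tokenize(transcript))
    return [term for term, _ in counts.most_common(k)]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_match(transcript, candidates):
    """Pick the candidate story most similar to the transcript.

    candidates maps a (hypothetical) story URL to its text.
    Returns (score, url), or (0.0, None) when there are no candidates.
    """
    transcript_vec = Counter(tokenize(transcript))
    scored = [
        (cosine(transcript_vec, Counter(tokenize(text))), url)
        for url, text in candidates.items()
    ]
    return max(scored) if scored else (0.0, None)
```

In such a setup, the output of `key_terms` would form the web search query, and the story returned by `best_match` would be the richer text passed on to semantic annotation.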
RichNews Example
TAO – Augmenting Software Artefacts with Semantics
• TAO project – http://www.tao-project.eu
  • Transitioning Applications to Ontologies
• Case study on augmenting software artefacts with semantics
• Learning ontologies from multiple software artefacts
• Knowledge about a software project is often spread across different sources on the web:
  • Source code, discussion messages, bug descriptions, documentation
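As a rough illustration of pooling vocabulary across artefact types, the Python sketch below splits source-code identifiers into words and keeps the terms that occur in several artefacts as candidate concepts. The slides do not describe TAO's ontology learning method, so this is only an assumed, simplified first step; `split_identifier`, `candidate_concepts` and the `min_sources` threshold are hypothetical names.

```python
import re
from collections import Counter


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def terms_in(text):
    """Return the set of terms (longer than 2 characters) found in one artefact."""
    terms = set()
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        terms.update(p for p in split_identifier(token) if len(p) > 2)
    return terms


def candidate_concepts(artefacts, min_sources=2):
    """Return terms that occur in at least min_sources artefacts, sorted.

    artefacts maps an artefact label (e.g. "code", "forum") to its text.
    """
    sources_per_term = Counter()
    for text in artefacts.values():
        sources_per_term.update(terms_in(text))
    return sorted(t for t, n in sources_per_term.items() if n >= min_sources)
```

Terms shared by, say, source code and forum messages are only candidates; deciding which of them become ontology classes or properties would still need further analysis or human review.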
New Challenges
• Moving towards mining and semantically annotating Web 2.0
  • Opinion mining from blogs and discussion forums
  • Mining wikis
  • Social network analysis
  • Mining multimedia content
• Initial experiments are under way in ongoing projects, but further work is needed on this emerging, socially oriented web
Thank you!
These slides:
• http://gate.ac.uk/sale/talks/ist06/ist-event06.ppt
Further details:
• RichNews: http://gate.ac.uk/sale/www05/web-assisted-annotation.pdf
• SEKT: http://gate.ac.uk/sale/iswc06/iswc06.pdf
• TAO: http://www.tao-project.eu