
The Role of “Big Data” in Scientific Publishing

The Role of “Big Data” in Scientific Publishing. Bradley P. Allen, Chief Architect, Elsevier. Presentation for the panel on “Giving Voice to Content: Emerging Technologies”, NFAIS 56th Annual Conference, Philadelphia, PA, USA, 2014-02-24.


Presentation Transcript


  1. The Role of “Big Data” in Scientific Publishing Bradley P. Allen Chief Architect, Elsevier Presentation for panel on “Giving Voice to Content: Emerging Technologies” NFAIS 56th Annual Conference Philadelphia, PA, USA 2014-02-24

  2. Why the scare quotes? Reference: http://ajharmony.tumblr.com/post/65901268958/mostlysignssomeportents-big-data-is-like, from a quote by Dan Ariely in https://www.facebook.com/dan.ariely/posts/904383595868

  3. Audience poll: current data scales How large is the amount of data your organization currently manages to produce its online products and services? • Gigabytes • Terabytes • Petabytes • Exabytes

  4. Scientific content in the context of big data

  5. What does big data mean to scientific publishing? • Scientific publishing is the act of compressing a universe’s worth of data into small pieces of content that people can consume • In essence, this is the ultimate big data problem • But it is one in which until recently publishers have played a very simple role • That is beginning to change

  6. What are we beginning to do with big data? • Create more useful content by enhancing it with data extracted from content • Make the researcher’s life better by exploiting data about how content is used to improve her experience of using our online applications • Enable research itself by supporting the care and feeding of experimental data at scale

  7. Audience poll: big data use cases Which of these uses of big data is most important for your organization? • Extracting data from content • Improving user experience through usage analytics • Managing experimental data • All of the above • None of the above

  8. Sources of data in scientific publishing

  9. Example: collaborative filtering in ScienceDirect • When users look at articles on ScienceDirect (SD), they are provided links to other articles of interest • Related Articles was originally implemented with bag-of-words similarity via a search engine query • Goal: Increase click-through rate on Recommended Articles over the previous Related Articles offering; drive usage, engagement & revenue • Pilot: Ran from March to July 2013, with 9 variants A/B tested on ~5% of SD traffic • Production: Since Aug 2013 • Inputs • 5 years of SD usage data/events • All SD XML Articles • SNIP2 Journal Rankings [Diagram labels: ~12M articles; 6 billion events; Thor; Roxie; Similarity; Co-download matrix; Attribute Ranking; pii-684259, pii_585346, pii_491635; pii-739156; Daily updates]
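
The slide names a co-download matrix as one ingredient of the recommender but does not describe how it is built or ranked. The following is a minimal sketch of the general idea only, assuming that "co-download" means two articles were downloaded in the same user session. The function names, the session structure, and the article identifiers are hypothetical and are not taken from the ScienceDirect implementation.

```python
from collections import defaultdict
from itertools import combinations


def build_co_download_counts(sessions):
    """Count how often each pair of articles appears in the same session.

    `sessions` is a list of lists of article identifiers (an assumed,
    simplified stand-in for real usage events).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for articles in sessions:
        # De-duplicate within a session so repeat downloads count once.
        for a, b in combinations(sorted(set(articles)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(counts, article, top_n=3):
    """Return up to `top_n` articles most often co-downloaded with `article`."""
    neighbours = counts.get(article, {})
    # Highest count first; ties broken alphabetically for a stable order.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


# Hypothetical sessions with placeholder identifiers.
sessions = [
    ["pii-1", "pii-2", "pii-3"],
    ["pii-1", "pii-2"],
    ["pii-2", "pii-4"],
]
counts = build_co_download_counts(sessions)
recommendations = recommend(counts, "pii-1")
```

A production system would presumably combine such counts with the content similarity and attribute ranking signals the slide mentions; that combination is not shown here.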

  10. Audience poll: big data tools and platforms Which big data tools/platforms are you currently using? • Apache Hadoop • A Hadoop distribution (Cloudera, MapR, Amazon EMR, …) • LexisNexis HPCC • Twitter Storm • Rolling our own • None of the above

  11. How big data infrastructure works • All of these tools and platforms basically make the following easy to do • Break data up into many chunks, each of which can fit into memory on a given machine • Send each chunk to a machine where it is processed into chunks containing intermediate results • Combine the intermediate results into a single aggregate data set • Lather, rinse, repeat…
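
The chunk, process, combine pattern on this slide can be sketched in a few lines. This is a single-machine illustration under stated assumptions, not how Hadoop, HPCC, or any other named platform is implemented: the function names are hypothetical, the "processing" is a simple count of events per article, and the chunks are handled in a loop rather than sent to separate machines.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Break the data into chunks small enough to fit in one machine's memory."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Turn one chunk into an intermediate result (here: per-article event counts)."""
    return Counter(chunk)


def combine(left, right):
    """Merge two intermediate results into one aggregate."""
    return left + right


def run_job(records, chunk_size=2):
    chunks = split_into_chunks(records, chunk_size)
    # On a real cluster each chunk would be processed on a different machine.
    intermediate = [process_chunk(chunk) for chunk in chunks]
    return reduce(combine, intermediate, Counter())


# Hypothetical usage events: one article identifier per download.
events = ["pii-A", "pii-B", "pii-A", "pii-C", "pii-A"]
totals = run_job(events)
```

The "lather, rinse, repeat" step corresponds to feeding `totals` (or any aggregate) into another round of the same three stages.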

  12. Big data technologies within Elsevier

  13. Big data technology issues (in no particular order) • Talent acquisition • What training is needed to make big data platforms usable by our existing teams? • Who/what is a data scientist? • Best practices and design patterns for big data • @nathanmarz’ Lambda Architecture • The proliferation of big data platforms • HPCC, MapR, Cloudera… • Cloud-based vs. hosted solutions • Amazon Elastic MapReduce, Redshift • Data formats and practice for scaling ETL/ELT • Apache Avro, Google Protocol Buffers, zlib-compressed JSON • Numerical computing frameworks for optimization • High-performance computing using GPUs

  14. Can we use big data to enable new business models? • These technologies can yield a wealth of infrastructure, tools, workflows and business models to clone and adapt to the special circumstances of scientific publishing • Big data can open the door to optimizing the value exchange between author, publisher and reader • This will require us to walk away from legacy preconceptions • Ask yourself: is it this way because it was done on paper? • A thought experiment: gold open access as computational advertising

  15. Big data is key to computational advertising Reference: S. Yuan, A.Z. Abidin, M. Sloan and J. Wang. Internet Advertising: An Interplay among Advertisers, Online Publishers, Ad Exchanges and Web Users. arXiv:1206.1754v1 [cs.IR] 8 Jun 2012.

  16. Can big data enable computational publishing? [Diagram labels: Authors; Researchers; Publishers; Article exchanges; knowledge; credit; article inventories; time & focus; $$$; $$; ($)] Caption: The simplified ecosystem of author-pays scientific publishing. Authors spend budget to buy article inventories from article exchanges and publishers; article exchanges serve as matchers for articles and journals; publishers provide valuable information to satisfy and keep researchers; researchers read articles and exchange credit for knowledge from the authors. Note that normally researchers would not receive cash from publishers.

  17. Summary • Big data can play a role in creating new value for researchers and institutions • Ways in which big data is currently exploited in the consumer Internet provide guidance for its use by scientific publishers

  18. Thank You Bradley P. Allen Chief Architect, Elsevier b.allen@elsevier.com @bradleypallen
