
Ten Years and Change


marler

Presentation Transcript


  1. Ten Years and Change: the MX data archive at ALS 8.3.1

  2. Acknowledgements • ALS 8.3.1 creator: Tom Alber • 8.3.1 PRT head: Jamie Cate • Center for Structure of Membrane Proteins • Membrane Protein Expression Center II • Center for HIV Accessory and Regulatory Complexes • W. M. Keck Foundation • Plexxikon, Inc. • M D Anderson CRC • University of California Berkeley • University of California San Francisco • National Science Foundation • University of California Campus-Laboratory Collaboration Grant • Henry Wheeler. The Advanced Light Source is supported by the Director, Office of Science, Office of Basic Energy Sciences, Materials Sciences Division, of the US Department of Energy under contract No. DE-AC02-05CH11231 at Lawrence Berkeley National Laboratory.

  3. ALS 8.3.1 data collection history [plot; axis: terabytes (uncompressed)]

  4. ALS 8.3.1 data collection history [plot; axis: terabytes (uncompressed)]

  5. ALS 8.3.1 data collection history [plot; axis: images × 10^6]

  6. DVD data archive: 68 TB

  7. DVD data archive

  8. 50 TB

  9. Primary failure mode of DVDs

  10. Primary failure mode of DVDs 3000 files remain unrecoverable (~0.1%)

  11. Which data go with which PDB? • 260,000 images are called “test” • the cell 48 62 84 90 101 104 is within 5 Å and 5° of 16,000 PDBs • focusing on 2001-2006 • 490 PDBs credit ALS 8.3.1 with data • 44 of these did not actually collect data at 8.3.1 • 64 collected data at 8.3.1 but gave no credit
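
The "within 5 Å and 5°" test on slide 11 can be sketched as a per-parameter comparison. This is only an illustration: the slide does not say how "within" was measured, so the function below assumes each length and each angle is compared independently, with no cell reduction or axis permutation. The candidate cells and their names are hypothetical.

```python
def cells_match(cell_a, cell_b, max_length_diff=5.0, max_angle_diff=5.0):
    """True if all three lengths agree within max_length_diff (Å)
    and all three angles agree within max_angle_diff (degrees)."""
    lengths_ok = all(abs(a - b) <= max_length_diff
                     for a, b in zip(cell_a[:3], cell_b[:3]))
    angles_ok = all(abs(a - b) <= max_angle_diff
                    for a, b in zip(cell_a[3:], cell_b[3:]))
    return lengths_ok and angles_ok


# The cell quoted on the slide, compared against made-up candidates.
query = (48, 62, 84, 90, 101, 104)
candidates = {
    "hypothetical_A": (50.1, 60.0, 86.5, 90, 99, 105),  # close in every parameter
    "hypothetical_B": (48, 62, 120, 90, 101, 104),      # c axis far off
}
hits = [name for name, cell in candidates.items() if cells_match(query, cell)]
print(hits)
```

With tolerances this loose, many unrelated entries pass, which is consistent with the slide's count of 16,000 PDBs for a single cell.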

  12. Which data go with which PDB? • 1,604,031: images from 2001-2006 • 682,712: collected “near” edges • 3602: find “runs” of >10 images • 3331: unify multi-wedge sets • 2524: run labelit & XDS • 1479: >70% complete? • 1054: I/σ > 10 • 1 to 200+: reduced cell vs PDB
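
One step of the slide 12 pipeline, finding "runs" of more than 10 images, might look like the sketch below. It assumes file names of the form `prefix_NNN.img` (similar to the names on slide 16) and treats a run as consecutive frame numbers sharing a prefix; the actual criteria used at the beamline are not given in the slides, and `find_runs` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Assumed naming convention: <prefix>_<frame number>.img
FRAME_RE = re.compile(r"^(?P<prefix>.+)_(?P<num>\d+)\.img$")


def find_runs(filenames, min_length=11):
    """Group frames by prefix and return (prefix, first, last) for every
    stretch of consecutive frame numbers at least min_length long."""
    frames = defaultdict(list)
    for name in filenames:
        m = FRAME_RE.match(name)
        if m:
            frames[m.group("prefix")].append(int(m.group("num")))

    runs = []
    for prefix, numbers in sorted(frames.items()):
        numbers = sorted(set(numbers))
        start = prev = numbers[0]
        for n in numbers[1:]:
            if n != prev + 1:  # gap: close the current stretch
                if prev - start + 1 >= min_length:
                    runs.append((prefix, start, prev))
                start = n
            prev = n
        if prev - start + 1 >= min_length:  # close the final stretch
            runs.append((prefix, start, prev))
    return runs


# Made-up directory listing: a short test series, a 180-frame sweep,
# and a 5-frame fragment with the same prefix.
names = [f"test_1_{i:03d}.img" for i in range(1, 6)]
names += [f"egda27_1_{i:03d}.img" for i in range(1, 181)]
names += [f"egda27_1_{i:03d}.img" for i in range(200, 205)]
runs = find_runs(names)
print(runs)
```

Only the 180-frame sweep survives; the short series are dropped, which is the kind of reduction the slide reports between the image count and the run count.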

  13. myoglobin, unit cell: 90.9 90.9 46.8 90 90 120 [plot: best Rcryst after rigid-body refinement vs RMS unit cell length deviation (Å); labeled points: 1hh7, M. TB CSOR, 1rb5]
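
The horizontal axis of the slide 13 plot, "RMS unit cell length deviation (Å)", is not defined in the transcript. A minimal sketch, assuming it means the root-mean-square difference of the three cell lengths a, b, c between two cells (angles ignored), is shown here. The second cell is a made-up value for illustration.

```python
import math


def rms_length_deviation(cell_a, cell_b):
    """Root-mean-square difference of the three cell lengths, in Å.
    Assumed definition; angles are not included."""
    diffs = [a - b for a, b in zip(cell_a[:3], cell_b[:3])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


reference = (90.9, 90.9, 46.8, 90, 90, 120)  # myoglobin cell from the slide
observed = (91.2, 91.2, 45.9, 90, 90, 120)   # hypothetical second crystal
deviation = rms_length_deviation(reference, observed)
print(f"{deviation:.2f} Å")
```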

  14. Published MAD/SAD datasets [plot: Riso vs PDB deposit against best Rcryst after rigid-body refinement; regions labeled “non-isomorphous” and “Unsolved?”]

  15. Responses to inquiries “I have to find my old note book as I have no idea what that is.” “I have changed jobs a few times since and am really far away from crystallography now.” “Will see what I can find.” “We solved it but never published it. Sorry!”

  16. EGDA
      Dec 01 19:45:12 2001 egda46_*1_E#_###.img (1112 images, Se MAD)
      Dec 02 15:10:06 2001 egda27_*1_###.img (180, 1A, native?)
      Dec 02 19:21:55 2001 egdau1_*1_###.img (427, 8000eV (U?) SAD)
      Dec 02 20:58:26 2001 egdau1_*2_###.img (360, 8000eV (U?) SAD)
      Jun 01 14:07:43 2002 egda60_*1_###.img (360, Lutetium SAD)
      “I think that these EGDA data sets are very likely some of xxx’s data sets, he was working on E. coli guanine deaminase, something he brought from yyy. No structure was ever published. James, xxx was unable to solve the structure from these data.”

  17. ~2.9 Å P21212 E. coli guanine deaminase R = 0.32 Rfree = 0.39 PDB ID: ????

  18. Metadata: can we rely on it? • authors were after lobster arginine kinase • got enolase instead • arginine kinase structure still unknown • Duquerroy, et al. (1994). "Lobster enolase crystallized by serendipity", Proteins: Struct., Funct., Bioinf. 18, 390-393.

  19. raw image compresses 4.2x

  20. just spots compresses 337x

  21. pixel-wise median across dataset compresses 5x, but only one per dataset!

  22. deviation from median in “non-spot” areas compresses 3.5x

  23. after h264 of non-spot areas compressed ~50x

  24. difference between raw and compressed compresses 5.2x
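
The decomposition described on slides 19 to 24 (spots, a pixel-wise median across the dataset, and the deviation from that median in non-spot areas) can be illustrated on synthetic data. The sketch below is an assumption-laden toy: the images are randomly generated, `SPOT_THRESHOLD` and `OFFSET` are hypothetical values, and zlib stands in for whatever compressors were actually used. In particular, the residual is stored losslessly here, whereas the slides apply lossy h264 to the non-spot areas.

```python
import random
import zlib
from array import array
from statistics import median


def compression_ratio(pixels):
    """Size of the 16-bit pixel buffer divided by its zlib-compressed size."""
    raw = array("H", pixels).tobytes()
    return len(raw) / len(zlib.compress(raw, 9))


random.seed(0)
N_IMAGES, N_PIXELS = 9, 4096
SPOT_THRESHOLD = 50  # hypothetical: counts above the median that mark a spot
OFFSET = 100         # keeps the signed residual non-negative for storage

# Synthetic dataset: a shared smooth background, small noise, a few spots.
background = [100 + (i % 64) for i in range(N_PIXELS)]
images = []
for _ in range(N_IMAGES):
    img = [b + random.randint(-5, 5) for b in background]
    for pos in random.sample(range(N_PIXELS), 20):
        img[pos] += random.randint(500, 5000)
    images.append(img)

# One median image per dataset (slide 21).
median_image = [int(median(column)) for column in zip(*images)]

# Split the first image into spots (slide 20) and a non-spot residual (slide 22).
first = images[0]
is_spot = [p - m > SPOT_THRESHOLD for p, m in zip(first, median_image)]
spots_only = [p if s else 0 for p, s in zip(first, is_spot)]
residual = [OFFSET if s else p - m + OFFSET
            for p, m, s in zip(first, median_image, is_spot)]

# With a lossless residual, the three parts rebuild the image exactly.
rebuilt = [sp if s else m + r - OFFSET
           for sp, m, r, s in zip(spots_only, median_image, residual, is_spot)]

for label, data in [("raw", first), ("spots only", spots_only),
                    ("median", median_image), ("residual", residual)]:
    print(f"{label:10s} compresses {compression_ratio(data):.1f}x")
```

The ratios printed for this toy data are not comparable to the figures on the slides; the point is only that the sparse spot layer compresses far better than the noisy background, which is why the background is the natural target for lossy compression.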

  25. Lossy compression vs R/Rfree [plot: R factor vs compression ratio]
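
For readers outside crystallography, the R factor plotted on slide 25 is the standard residual between observed and calculated structure-factor amplitudes, R = Σ| |Fobs| − |Fcalc| | / Σ|Fobs|; Rfree is the same quantity computed over reflections held out of refinement. A minimal sketch follows, assuming the two amplitude lists are already on the same scale. The amplitudes are made-up numbers.

```python
def r_factor(f_obs, f_calc):
    """R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|); assumes Fcalc is already scaled."""
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


# Hypothetical amplitudes for four reflections.
f_obs = [100.0, 80.0, 60.0, 40.0]
f_calc = [90.0, 85.0, 50.0, 45.0]
r = r_factor(f_obs, f_calc)
print(f"R = {r:.3f}")
```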

  26. backblaze.com “pod” server: backblaze.com offers “unlimited storage” data backup for $5/month.

  27. backblaze offers “unlimited storage” data backup for $5/month.

  28. backblaze does not sell these “pods”, but “protocase.com” does.

  29. Summary • saving data could double productivity • unit cell is not a good score • lossy compression: rallying cry? • backup vs archive • metadata: what do we really know?

  30. Brief Summary • this is a lot of work. • who is going to pay for it?
