Discussion Points for 2nd Pseudogene Call

Discussion Points for 2nd Pseudogene Call
Mark Gerstein, 2005.09.22, 11:00 EST

Intersection of Pseudogenes from Three Groups: Original
• Havana-Gencode: 167 pseudogenes
• Yale: 184 pseudogenes
• UCSC retrogenes: 15 expressed (7-8 pseudogenes) + 143 not expressed (all pseudogenes)
• [Venn diagram; region counts as extracted, in slide order, placement not recoverable: 42, 45, 35, 21, 86, 87, 87, 18, 17, 18, 16, 22]
• 86 Havana pseudogenes overlap with at least one Yale pseudogene, and 87 Yale pseudogenes overlap with at least one Havana pseudogene (likewise for retrogenes).
• This is a global result: in some loci three Havana pseudogenes may overlap with only one Yale pseudogene, while in other loci several Yale pseudogenes overlap with one Havana pseudogene.
• Provided by France.

Intersection of Pseudogenes from 4 Groups: Updated
• Havana-Gencode: 167 pseudogenes
• Yale: 164 pseudogenes
• UCSC retrogenes: 146 not expressed
• [Venn diagram; region counts as extracted, in slide order: 52 (2), 14 (2), 16 (0), 82 (34), 15 (1), 17 (7), 33 (1)]
• The numbers in parentheses are pseudogenes from GIS.
• All from http://pseudogene.org/ENCODE/cross-ref
• Pseudo-exons were merged to form pseudogenes and used for this comparison (now a pseudogene has only a single start and end)
• Strand information is ignored
• There are a total of 229 pseudogenes in the union
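The comparison rules on this slide (merge pseudo-exons into one span, ignore strand, count overlaps) can be sketched roughly as follows. This is only an illustration, not the pipeline behind the cross-ref files: the function names, the tuple layout, the closed-interval convention, and the toy coordinates are all assumptions.

```python
def merge_pseudo_exons(exons):
    """Collapse pseudo-exons into one (chrom, start, end) span per pseudogene.

    `exons` holds (pseudogene_id, chrom, start, end, strand) tuples
    (hypothetical layout). Strand is read and then discarded.
    """
    spans = {}
    for pid, chrom, start, end, _strand in exons:
        if pid in spans:
            c, s, e = spans[pid]
            spans[pid] = (c, min(s, start), max(e, end))
        else:
            spans[pid] = (chrom, start, end)
    return spans


def overlaps(a, b):
    """True if two (chrom, start, end) spans share at least one base.

    Coordinates are treated as closed intervals (an assumption).
    """
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def count_overlapping(set_a, set_b):
    """Number of spans in set_a that overlap at least one span in set_b."""
    return sum(
        1 for a in set_a.values()
        if any(overlaps(a, b) for b in set_b.values())
    )


# Toy data, invented for illustration only.
havana = merge_pseudo_exons([
    ("h1", "chr1", 100, 200, "+"),
    ("h1", "chr1", 300, 400, "+"),
    ("h2", "chr1", 1000, 1100, "-"),
])
yale = merge_pseudo_exons([
    ("y1", "chr1", 350, 500, "-"),
    ("y2", "chr1", 380, 600, "+"),
    ("y3", "chr2", 100, 400, "+"),
])

print(count_overlapping(havana, yale))  # 1: only h1 overlaps a Yale span
print(count_overlapping(yale, havana))  # 2: y1 and y2 both overlap h1
```

The two counts differ because overlap is not one-to-one, which is why a figure such as "86 Havana vs. 87 Yale" can arise from the same pair of sets.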

Intersection of Pseudogenes from 4 Groups: Non-processed Consensus
• Havana-Gencode: 167 pseudogenes
• Yale: 164 pseudogenes
• UCSC retrogenes: 146 not expressed
• [Venn diagram; region counts as extracted, in slide order: 52 (2), 14 (2), 16 (0), 82 (34), 15 (1), 17 (7), 33 (1)]
• Rough agreement now is: 82 + 52 - 7 = 127 out of 229 total
• What to do with the remaining 229 - 127 = 102?

How to Pick Pseudogenes for RT-PCR?
• Start with the intersection 127
• Duplicated vs. processed: how many of each? (2:1?)
• Rank pseudogenes:
  • By likelihood to be transcribed according to ENCODE evidence
    • ditag, then CAGE, then tiling array
  • By their uniqueness in genome
    • Good primers
    • Non cross-hybridizing probes
  • How to get a consistent rank?
• Who will do RT-PCR?
• What coordinates to use?
• (Ignore 1 processed pseudogene already being sequenced by GIS group.)
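One possible way to get a consistent rank is a lexicographic sort: ditag evidence first, then CAGE, then tiling array, with genomic uniqueness as the tie-breaker. The sketch below assumes that reading of the slide; the `Candidate` fields, the 0-to-1 `uniqueness` score, and the example entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    ditag: bool        # supported by ditag evidence
    cage: bool         # supported by CAGE evidence
    tiling: bool       # supported by tiling-array evidence
    uniqueness: float  # hypothetical score, 0 (repetitive) to 1 (unique)


def rank_key(c):
    # False sorts before True, so "has evidence" must map to False.
    # Higher uniqueness should come first, hence the negation.
    return (not c.ditag, not c.cage, not c.tiling, -c.uniqueness)


def rank(candidates):
    """Return candidates ordered from most to least attractive for RT-PCR."""
    return sorted(candidates, key=rank_key)


# Invented examples, for illustration only.
pool = [
    Candidate("A", ditag=False, cage=True, tiling=True, uniqueness=0.9),
    Candidate("B", ditag=True, cage=False, tiling=False, uniqueness=0.2),
    Candidate("C", ditag=True, cage=False, tiling=False, uniqueness=0.8),
    Candidate("D", ditag=False, cage=False, tiling=False, uniqueness=1.0),
]

print([c.name for c in rank(pool)])  # ['C', 'B', 'A', 'D']
```

A strict lexicographic order means any ditag hit outranks any amount of CAGE or tiling support; a weighted score would be an alternative if the groups prefer evidence types to trade off against each other.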

How to generate a consensus for the remaining 102 pseudogenes?
• Stick with the intersection 127
• Develop consistent criteria for identifying pseudogenes and apply them uniformly to ENCODE
  • E.g. protein matches with disablements found from a pipeline
  • Ignores tricky cases flagged by manual annotation
• Do a simple union of UCSC, Havana & Yale, giving 229
  • GIS is a subset of the other 3
  • Describe pseudogenes as being identified by multiple approaches and then explicitly flag each group's unique ones in the final annotation
  • Easy but perhaps biases stats
• Do a qualified union
  • Allow each group to "question" particular pseudogenes in another's set
  • Send questions around and then have a call to sort out differences
  • Need a way to arbitrate, e.g. we could demand an obvious disablement
  • We might learn something!
• How do we represent this in the browser & in stats?
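The "simple union with flags" option could be sketched as clustering overlapping calls into loci and recording which groups support each locus. This is a minimal sketch under assumptions: the `(group, chrom, start, end)` layout, closed coordinates, and single-linkage merging are choices made here, not something the slides specify.

```python
def cluster_loci(calls):
    """Merge overlapping calls into loci and track supporting groups.

    `calls` is a list of (group, chrom, start, end) tuples (hypothetical
    layout). Calls are chained: a call joins the current locus if it
    overlaps the locus span built so far.
    """
    loci = []
    for group, chrom, start, end in sorted(calls, key=lambda c: (c[1], c[2], c[3])):
        last = loci[-1] if loci else None
        if last and last["chrom"] == chrom and start <= last["end"]:
            last["end"] = max(last["end"], end)
            last["groups"].add(group)
        else:
            loci.append({"chrom": chrom, "start": start, "end": end, "groups": {group}})
    return loci


def unique_to_one_group(loci):
    """Loci supported by exactly one group, i.e. the ones to flag."""
    return [locus for locus in loci if len(locus["groups"]) == 1]


# Invented calls, for illustration only.
calls = [
    ("Havana", "chr1", 100, 400),
    ("Yale", "chr1", 350, 500),
    ("UCSC", "chr1", 480, 600),
    ("Yale", "chr2", 10, 50),
    ("Havana", "chr1", 1000, 1100),
]

loci = cluster_loci(calls)
print(len(loci))                       # 3 loci in the union
print(len(unique_to_one_group(loci)))  # 2 loci called by a single group
```

Note that chaining puts the Havana and UCSC calls in the same locus even though they do not overlap each other directly; whether that counts as agreement is exactly the kind of case a qualified union would need to arbitrate.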

Once we have consensus, how to agree on pseudogene boundaries?
• Keep each group's boundaries unchanged
• If pseudogenes overlap, take the largest region (union) or the smallest
• Develop uniform criteria for assigning pseudogene boundaries and apply them to each of the pseudogenes in the consensus set
  • Could just take each pseudogene in the consensus and have one group realign it against its parent
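The "largest or smallest region" option can be illustrated with a small helper. It assumes that "smallest" means the region shared by all overlapping calls (their intersection), which is one reading of the slide; the function name and the `(start, end)` layout are hypothetical.

```python
def consensus_bounds(spans, mode="union"):
    """Combine (start, end) spans for one locus into a single boundary.

    mode="union"        -> span covering every call (largest region)
    mode="intersection" -> span shared by every call (assumed meaning of
                           "smallest"); None if the calls share no base
    """
    if not spans:
        raise ValueError("at least one span is required")
    starts = [s for s, _ in spans]
    ends = [e for _, e in spans]
    if mode == "union":
        return (min(starts), max(ends))
    if mode == "intersection":
        start, end = max(starts), min(ends)
        return (start, end) if start <= end else None
    raise ValueError(f"unknown mode: {mode!r}")


# Invented coordinates, for illustration only.
calls = [(100, 400), (350, 500)]
print(consensus_bounds(calls, "union"))         # (100, 500)
print(consensus_bounds(calls, "intersection"))  # (350, 400)
```

Neither choice uses the parent gene; realigning each consensus pseudogene against its parent, as the last bullet suggests, would replace this purely coordinate-based rule.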
