# CRAM4GH



A blog update on CRAM is well overdue, so tonight I decided to give some updates on the progress for CRAMv3.1, and musings on the nature of compression ratio vs time vs memory. #CRAM4GH https://datageekdom.blogspot.com/2019/05/data-compression-sweet-spot.html


DNA doesn't take up much space - there is a copy of our entire genome in each of our cells. When a couple of genomes are sequenced, the information takes up enough space to fill a standard laptop @GA4GH @BonfieldJames #BigData #CRAM4GH #CRAM http://bit.ly/2UUeM0v 
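The arithmetic behind that claim can be sketched out. This is a back-of-envelope estimate with assumed figures (genome size, coverage depth, bytes per base), not numbers taken from the tweet:

```python
# Why sequenced genomes fill a laptop even though the genome itself
# is small. All figures below are rough assumptions for illustration.
GENOME_BP = 3.1e9          # haploid human genome, base pairs (approx.)
BITS_PER_BASE = 2          # A/C/G/T fits in 2 bits

raw_genome_gb = GENOME_BP * BITS_PER_BASE / 8 / 1e9
print(f"2-bit packed genome: ~{raw_genome_gb:.2f} GB")   # under 1 GB

# But sequencing reads the genome many times over (coverage), and
# stores a quality score alongside every base call.
COVERAGE = 30              # typical whole-genome sequencing depth
BYTES_PER_BASE = 2         # ~1 byte base + ~1 byte quality, uncompressed

fastq_gb = GENOME_BP * COVERAGE * BYTES_PER_BASE / 1e9
print(f"Uncompressed 30x reads: ~{fastq_gb:.0f} GB")     # hundreds of GB
```

A couple of genomes at that scale is indeed laptop-filling territory, which is the gap formats like CRAM exist to close.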

GA4GH
2 months ago

AI identifies risk for certain genetic disorders, #CRAM4GH Twitter Chat Recap, and more Genomics and Health News for April 8 - 15, 2019 - https://mailchi.mp/ga4gh.org/ga4gh-news-apr152019-2677273

GA4GH
2 months ago
Missed the #CRAM4GH Twitter chat last Friday? View a recap of the conversation here: https://bit.ly/2IvHHkn 

Software developers @sangerinstitute, @emblebi and beyond, including @BonfieldJames, have been developing custom algorithms to store the #bigdata that DNA sequencing produces @GA4GH #CRAM4GH #CRAM #DataCompression #DNAsequencing http://bit.ly/2CYcMJI 


Replying to @cmdcolin @drtkeane @GA4GH

The original paper (Fritz, @ewanbirney, et al) did mention the idea of assemblies of the reads that didn't map to the reference in order to create novel embedded references (large insertions, contamination, etc). It's an idea I'd like to explore. #CRAM4GH


Replying to @cmdcolin @drtkeane @GA4GH

It's useful if the reference will be used once only, e.g. a de novo assembly. Note CRAM (currently) can only embed one reference per "slice", which harms efficiency a little. SAM/BAM have no analogue as they don't use a reference for compression. #CRAM4GH

colin
2 months ago

Replying to @drtkeane @GA4GH @BonfieldJames

Sorry to jump in: what is the idea of embedded references? Any reading material? Does it have any analogue in SAM format? #CRAM4GH


Replying to @ewanbirney @TechnicalVault @GA4GH

Instrument manufacturers need to ponder this too. E.g. if you start with a 16-bit ADC and do some processing before writing out a 32-bit float, you still only really have ~16 bits of information and 16 bits of noise. Is it really "lossy" if we quantise them? #CRAM4GH
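The quantisation argument can be made concrete with a tiny sketch. The range and value below are illustrative assumptions; the point is that re-quantising a float to 16 bits introduces an error no larger than half a quantisation step, i.e. below the instrument's own noise floor:

```python
# If the ADC was 16-bit, storing its output as 32-bit floats adds no
# information; mapping back to 16 bits loses only sub-step "noise".

def quantise16(x, lo, hi):
    """Map a float in [lo, hi] to a 16-bit integer code."""
    steps = (1 << 16) - 1
    return round((x - lo) / (hi - lo) * steps)

def dequantise16(code, lo, hi):
    """Recover the representative float for a 16-bit code."""
    steps = (1 << 16) - 1
    return lo + code / steps * (hi - lo)

lo, hi = 0.0, 1.0
x = 0.123456789                    # a "32-bit float" measurement
code = quantise16(x, lo, hi)       # what we'd actually store
x2 = dequantise16(code, lo, hi)    # what we'd read back

step = (hi - lo) / ((1 << 16) - 1)
print(abs(x - x2) <= step / 2)     # round-trip error within half a step
```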

GA4GH
2 months ago

Thanks to everyone who participated in the #CRAM4GH twitter chat! To learn more about the #CRAM #fileformat and how you can adopt it in your own pipelines and workflows, visit https://www.ga4gh.org/cram/ 


Thank you all. Signing off with a philosophical view. It has been said that data compression is an artificial intelligence problem. Truly understand the data, and you'll be able to describe it in the most succinct manner. #CRAM4GH


Replying to @GA4GH @ewanbirney @drtkeane

CRAM certainly makes use of @ga4gh refget for retrieving any remote references (optional - you can use local ones too). CRAM is also supported as an on-the-wire format by the htsget protocol. Note htsget can also support on-the-fly file format conversion. #CRAM4GH
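The refget interaction mentioned here is a simple HTTP GET keyed on a sequence checksum. A minimal sketch of building such a request URL, assuming a placeholder server and a made-up checksum (refget identifies sequences by digest, not by name):

```python
# Hypothetical refget request construction. BASE and the checksum
# below are placeholders, not a real GA4GH endpoint or sequence.
BASE = "https://refget.example.org/sequence"

def refget_url(checksum, start=None, end=None):
    """Build a refget URL, optionally for a sub-range of the sequence."""
    url = f"{BASE}/{checksum}"
    if start is not None and end is not None:
        url += f"?start={start}&end={end}"
    return url

digest = "0123456789abcdef0123456789abcdef"  # placeholder checksum
print(refget_url(digest, 1000, 2000))
```

A CRAM reader resolving a remote reference would issue a GET against a URL of this shape, falling back to local copies when configured to.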


Replying to @ewanbirney @BonfieldJames and 2 others

But ... I think there’s plenty of innovation to do here. In some sense this is as much a “responsible data model” issue as it is a compression question #CRAM4GH


Replying to @GA4GH @kauralasoo and 2 others

Good question, and sadly I've no idea on the answer. At the very least I'd expect existing lossy compression methods (e.g. crumble, qvz2) to not be detrimental, but as with everything you need to test, test, test! #CRAM4GH


Replying to @TechnicalVault @GA4GH

Those large BAMs often come from embedded "trace" data. That's a significant chunk of data which is also very hard to compress well. Even without those, the long-read technology is hard to compress. I think we really need quality binning again, a la Illumina #CRAM4GH


Replying to @kauralasoo @TechnicalVault and 3 others

I’ve not seen people assess this (I have on SNP calls); it has implicitly happened due to the binning changes on NovaSeqs - anyone had a look? One headache - as you know - is what the right truth set is #CRAM4GH

GA4GH
2 months ago

from @kauralasoo "Are there best practices 4 lossy compression of qlty scores in RNA-seq data?assume they're less important if you are not trying to call variants, but no idea what the effects would be on splice junction alignment." #CRAM4GH @ewanbirney @BonfieldJames @drtkeane


Replying to @BonfieldJames @GA4GH

Note you can really see the trade-off here. Better compression is possible, by as much as 30% in some cases, but it is significantly slower (maybe 4x). That's not something to do in pipelines, but it may be appropriate for long-term archival. #CRAM4GH
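The same ratio-versus-speed trade-off is easy to observe with general-purpose codecs. A small sketch using zlib's compression levels as a stand-in for CRAM codec choices (the input data is synthetic, sequence-like bytes):

```python
import zlib

# Higher compression levels spend more CPU searching for longer
# matches: smaller output, slower run. Both settings are lossless.
data = b"ACGTACGGTTCA" * 50_000   # repetitive, sequence-like input

fast = zlib.compress(data, level=1)   # quick, larger output
best = zlib.compress(data, level=9)   # slower, smaller output

print(len(data), len(fast), len(best))
assert zlib.decompress(best) == data  # round-trips exactly either way
```

Archival workflows can afford the level-9 end of this curve; latency-sensitive pipelines generally sit at the fast end, which mirrors the pipelines-versus-archival point in the tweet.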
