================================goal: recent Emory HIV results and clustering consensus
- Emory HIV full genome sequencing shown at ASHG, November 2012.
- Follow-up projects for stronger papers:
  - HIV genome from bulk mixtures and separated SGA samples.
    > Deconvolute complex mixtures from clinical samples.
  - HIV GAG (1.5kb) sequencing.
    > Sequence 1 billion distinct genomes on a single chip?
- Clustering consensus analysis.
  - General framework: HIV, viral populations, 16S metagenomics, HLA diploid estimates, H. pylori cagY, BCR-ABL, cancer rare variant diagnostics.
  - Many components to be optimized.

================================HIV transmission biological problem:
- Examine full HIV genomes (9kb) from clinical transmission pairs.
- Sequence complex mixtures of full HIV genomes (9kb) as well as single genomes physically separated by SGA.
- PacBio: sequence 6 donor SGAs, 2 recipient SGAs, and a pool of 60 bulk PCRs from the donor. Try SMRTbell (overhang and blunt) and TdT preps.

+------------+------+----------------------------+
| sample     | runs | comment                    |
+------------+------+----------------------------+
| R463M.OH   | 6    | new donor OH bulk          |
| R463M.BL   | 3    | new donor BL bulk          |
| R463M.Td   | 3    | new donor TdT bulk         |
| R463F.sga  | 2    | new recipient SGA          |
| R463M.sga  | 6    | new donor SGA              |
| R880F.pool | 4    | old recipient pooled SGAs  |
| R880M.pool | 4    | old donor pooled SGAs      |
+------------+------+----------------------------+

- Best outcome: identify, to the base, the exact HIV founder genome that was transmitted from the donor to the recipient.

================================Clustering consensus framework for HIV full genome sequencing:
----
The call:

  python /home/UNIXHOME/mbrown/mbrown/workspace2012Q4/HIVSGASynMix/ConsensusClusterSubset.py \
    --runDir cc-2450417-0020 \
    --fasta run0020_s1_p0.fasta \
    --ref HIVemory.fasta \
    --spanThreshold=6400 \
    --entropyThreshold=1.0 \
    --basfofn 2450417-0020.bas.fofn \
    > 2450417-0020.workflow.output 2>&1

--------
Workflow:
- estimate the single best consensus using Quiver
- align all fully spanning reads to produce an MSA
- feature-select (rare) variant columns using entropy
- compute the distance between all pairs of reads:
  - fraction of mismatches in the MSA variant columns
- cluster all pairwise distances with agglomerative complete linkage
- stratify reads by cluster using a threshold and recurse

This is the "plan B" algorithm. Quiver is the plan A that needs more time.

================================HIV transmission clustering results:
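The distance and clustering steps of the workflow above can be sketched in plain Python as follows. This is only an illustration, not the code in ConsensusClusterSubset.py. It assumes the MSA is already available as equal-length strings, and that a column counts as a variant column when its Shannon entropy is at or above the threshold (the direction of the comparison is an assumption). All function names are hypothetical. The Quiver consensus, the alignment itself, and the recursion into each cluster are left out.

```python
import math
from collections import Counter
from itertools import combinations


def column_entropy(column):
    """Shannon entropy (bits) of the symbols in one MSA column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def variant_columns(msa, entropy_threshold):
    """Indices of columns with entropy >= threshold (assumed selection rule)."""
    return [j for j in range(len(msa[0]))
            if column_entropy([row[j] for row in msa]) >= entropy_threshold]


def mismatch_fraction(a, b, cols):
    """Fraction of the selected columns at which two aligned reads differ."""
    if not cols:
        return 0.0
    return sum(a[j] != b[j] for j in cols) / len(cols)


def complete_linkage(msa, cols, cut):
    """Agglomerative complete linkage; stop once the closest pair exceeds cut."""
    dist = {frozenset((i, k)): mismatch_fraction(msa[i], msa[k], cols)
            for i, k in combinations(range(len(msa)), 2)}
    clusters = [[i] for i in range(len(msa))]

    def linkage(c1, c2):
        # Complete linkage: distance of the farthest pair across two clusters.
        return max(dist[frozenset((i, k))] for i in c1 for k in c2)

    while len(clusters) > 1:
        (x, y), d = min(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1])
        if d > cut:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy MSA (made-up reads): two subpopulations differing at columns 1 and 3.
toy_msa = ["ACGTA", "ACGTA", "ACGTA", "ATGCA", "ATGCA", "ATGCA"]
toy_cols = variant_columns(toy_msa, entropy_threshold=0.9)
toy_clusters = complete_linkage(toy_msa, toy_cols, cut=0.5)
print(toy_cols, toy_clusters)
```

To recurse as in the last workflow step, the same two functions would be rerun on the reads of each returned cluster, with the variant columns reselected inside that cluster.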
- For the 28 runs, show the clustering diagrams.

+------------+------+----------------------------+
| sample     | runs | comment                    |
+------------+------+----------------------------+
| R463M.OH   | 6    | new donor OH bulk          |
| R463M.BL   | 3    | new donor BL bulk          |
| R463M.Td   | 3    | new donor TdT bulk         |
| R463F.sga  | 2    | new recipient SGA          |
| R463M.sga  | 6    | new donor SGA              |
| R880F.pool | 4    | old recipient pooled SGAs  |
| R880M.pool | 4    | old donor pooled SGAs      |
+------------+------+----------------------------+

README_RESULT_clusterimages.html
README_RESULT_clusterimages_9k.html

- How do you determine the number of groups and where to cut the clustering?
- Given the simple distance, use binomials to threshold noise:
  - Approximate statistical cutoff given identified variant positions:

+------------------+----------+-----------+
| Run              | Variants | Threshold |
+------------------+----------+-----------+
| POOLED bulk PCRs | ~150     | 0.6       |
| SGA F runs       | ~27      | 0.725     |
| SGA M runs       | ~20      | 0.925     |
| 880F             | ~3       | 1.0       |
| 880M             | ~130     | 0.55      |
+------------------+----------+-----------+

- Problem: SGAs are different from the previous 21 SGAs:
  README_RESULT_allClusters.html

================================Alignment of all Quiver SGA and bulks.
- How are the single Quiver consensus estimates related, disregarding subspecies?
- Alignment of all 28 Quiver HIV estimates: allSGA.aln
- Weird grouping of R463M*BL with R800MpOH and R463M*OH with R880MpBL.
- R463M_D6 is the closest to R463F: the founder virus? Present as a subspecies in the bulk?

================================PacBio Emory Full HIV Future:
- Explain SGA variance in new runs. There should be only one genome per SGA.
- Tune thresholds and estimate clustering consensus: find the exact founder virus to the base.
  README_emoryProjCollab_FullSubpop.html
- Why do the BL and OH ligations differ?
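For the threshold tuning, one possible way to "use binomials to threshold noise" is sketched below. It assumes that two reads from the same genome mismatch at each variant column independently with probability `p_noise`, and it picks the smallest mismatch fraction that such noise reaches with probability at most `alpha`. Both parameters, their example values, and the function names are assumptions made for illustration. The sketch is not meant to reproduce the thresholds in the table of the results section.

```python
from math import comb


def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def noise_cutoff(n_variants, p_noise, alpha):
    """Smallest mismatch fraction k/n that noise alone reaches with
    probability <= alpha; 1.0 if even a full mismatch is not that rare."""
    for k in range(n_variants + 1):
        if binomial_upper_tail(n_variants, k, p_noise) <= alpha:
            return k / n_variants
    return 1.0


# Hypothetical noise rate and significance level, not values from the runs.
for n in (150, 27, 20, 3):
    print(n, noise_cutoff(n, p_noise=0.3, alpha=0.001))
```

With these assumed parameters, fewer variant columns give a higher cutoff, because the mismatch fraction is noisier when it is estimated from fewer positions.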