================================
goal: recent Emory HIV results and clustering consensus
- Emory HIV full genome sequencing shown at ASHG November 2012.
- Follow up projects for stronger papers:
- HIV genome from bulk mixtures and separated SGA samples.
> Deconvolute complex mixtures from clinical samples.
- HIV GAG (1.5kb) sequencing.
> Sequence 1 billion distinct genomes on a single chip?
- Clustering consensus analysis.
- General framework: HIV, viral populations, 16S metagenomics, HLA
diploid estimates, H.pylori cagY, BCR-ABL, cancer rare variant
diagnostics.
- Many components to be optimized.
Next
================================
HIV transmission biological problem:
- Examine full HIV genomes (9kb) from clinical transmission pairs.
- Sequence complex mixtures of full HIV genomes (9kb) as well as
single genomes physically separated by SGA.
- PacBio sequence 6 donor SGAs, 2 recipient SGAs, and a pool of 60 bulk
PCRs from the donor. Try SMRTbell (overhang and blunt) and TdT preps.
+------------+------+---------------------------+
| sample     | runs | comment                   |
+------------+------+---------------------------+
| R463M.OH   | 6    | new donor OH bulk         |
| R463M.BL   | 3    | new donor BL bulk         |
| R463M.Td   | 3    | new donor TdT bulk        |
| R463F.sga  | 2    | new recipient SGA         |
| R463M.sga  | 6    | new donor SGA             |
| R880F.pool | 4    | old recipient pooled SGAs |
| R880M.pool | 4    | old donor pooled SGAs     |
+------------+------+---------------------------+
- Best outcome: identify, to the base, the exact HIV founder genome that
was transmitted from the donor to the recipient.
Next
================================
Clustering consensus framework for HIV full genome sequencing:
----
The call:
python /home/UNIXHOME/mbrown/mbrown/workspace2012Q4/HIVSGASynMix/ConsensusClusterSubset.py \
--runDir cc-2450417-0020 \
--fasta run0020_s1_p0.fasta \
--ref HIVemory.fasta \
--spanThreshold=6400 \
--entropyThreshold=1.0 \
--basfofn 2450417-0020.bas.fofn \
> 2450417-0020.workflow.output 2>&1
--------
Workflow:
- estimate single best consensus using Quiver
- align all fully spanning reads to produce MSA
- feature select (rare) variant columns using entropy
- compute distance between all pairs of reads:
- Fraction of mismatches in MSA variant columns
- cluster all pairwise distances with agglomerative complete-linkage.
- stratify reads by cluster using threshold and recurse.
This is the "plan B" algorithm. Quiver is the
plan A that needs more time.
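A minimal sketch of the plan B steps above (entropy feature selection,
mismatch-fraction distance, complete-linkage clustering with a cut), in
pure Python. It is not the ConsensusClusterSubset.py code: the function
names, the toy MSA, the ">=" reading of the entropy threshold, and the
values 0.9 and 0.5 are all assumptions for illustration, and the
recursion step is left out.

```python
import math
from collections import Counter
from itertools import combinations

def column_entropy(column):
    """Shannon entropy (bits) of the symbols in one MSA column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def variant_columns(msa, entropy_threshold):
    """Indices of columns whose entropy is at least the threshold (assumed rule)."""
    return [j for j in range(len(msa[0]))
            if column_entropy([row[j] for row in msa]) >= entropy_threshold]

def mismatch_fraction(a, b, cols):
    """Fraction of the selected variant columns where two aligned reads differ."""
    if not cols:
        return 0.0
    return sum(a[j] != b[j] for j in cols) / len(cols)

def pairwise_distances(msa, cols):
    """Distance for every read pair (i, j) with i < j."""
    return {(i, j): mismatch_fraction(msa[i], msa[j], cols)
            for i, j in combinations(range(len(msa)), 2)}

def complete_linkage(dist, n, cut):
    """Agglomerate until the closest pair of clusters is farther apart than cut."""
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist[(min(i, j), max(i, j))] for i in a for j in b)

    while len(clusters) > 1:
        pairs = combinations(range(len(clusters)), 2)
        x, y = min(pairs, key=lambda xy: linkage(clusters[xy[0]], clusters[xy[1]]))
        if linkage(clusters[x], clusters[y]) > cut:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe: combinations guarantees x < y
    return clusters

# Toy MSA (made up): two subspecies differing at columns 1 and 6,
# plus one isolated sequencing error in the last column of read 2.
msa = [
    "ACGTACGT",
    "ACGTACGT",
    "ACGTACGA",
    "ATGTACCT",
    "ATGTACCT",
    "ATGTACCT",
]
cols = variant_columns(msa, entropy_threshold=0.9)  # hypothetical threshold
dist = pairwise_distances(msa, cols)
clusters = complete_linkage(dist, len(msa), cut=0.5)  # hypothetical cut
print(cols, clusters)
```

In this toy case the isolated error column falls below the entropy
threshold, so it does not contribute to the distances and the reads
separate into the two subspecies.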
Next
================================
HIV transmission clustering results:
- Show the clustering diagrams for all 28 runs.
+------------+------+---------------------------+
| sample     | runs | comment                   |
+------------+------+---------------------------+
| R463M.OH   | 6    | new donor OH bulk         |
| R463M.BL   | 3    | new donor BL bulk         |
| R463M.Td   | 3    | new donor TdT bulk        |
| R463F.sga  | 2    | new recipient SGA         |
| R463M.sga  | 6    | new donor SGA             |
| R880F.pool | 4    | old recipient pooled SGAs |
| R880M.pool | 4    | old donor pooled SGAs     |
+------------+------+---------------------------+
README_RESULT_clusterimages.html
README_RESULT_clusterimages_9k.html
- How do you determine the number of groups and where to cut the cluster tree?
- Given the simple distance, use binomials to threshold noise:
- Approximate statistical cutoff given identified variant positions:
+------------------+----------+-----------+
| Run              | Variants | Threshold |
+------------------+----------+-----------+
| POOLED bulk PCRs | ~150     | 0.6       |
| SGA F runs       | ~27      | 0.725     |
| SGA M runs       | ~20      | 0.925     |
| 880F             | ~3       | 1.0       |
| 880M             | ~130     | 0.55      |
+------------------+----------+-----------+
- Problem: these SGAs are different from the previous 21 SGAs:
README_RESULT_allClusters.html
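One way to read the binomial cutoff idea above, as a hedged sketch: if
two reads from the same genome mismatch at each variant column
independently with probability p_noise, the mismatch count over n
variant columns is Binomial(n, p_noise), and the cut can be placed at
the smallest mismatch fraction that noise alone rarely reaches. The
values p_noise=0.3 and alpha=0.01 are hypothetical, and this does not
attempt to reproduce the thresholds in the table.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def distance_cutoff(n_variants, p_noise, alpha):
    """Smallest mismatch fraction k/n with P(X >= k) <= alpha under noise alone.

    Returns 1.0 when even a full mismatch cannot be distinguished from
    noise at level alpha (too few variant columns).
    """
    for k in range(n_variants + 1):
        if binomial_tail(n_variants, k, p_noise) <= alpha:
            return k / n_variants
    return 1.0

# Hypothetical noise rate and significance level, for illustration only.
for n in (150, 27, 20, 3):
    print(n, distance_cutoff(n, p_noise=0.3, alpha=0.01))
```

Under this model the cutoff tightens as the number of variant columns
grows, and with very few columns no cut below 1.0 is justified.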
Next
================================
Alignment of all Quiver SGA and bulks.
- How are the single Quiver consensus estimates related to each other,
disregarding subspecies?
- Alignment of all 28 Quiver HIV estimates: allSGA.aln
- Weird grouping of R463M*BL with R880MpOH and R463M*OH with R880MpBL.
- R463M_D6 is the closest to R463F: is it the founder virus? Is it
present as a subspecies in the bulk?
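The "closest to the recipient" question above can be sketched as a
Hamming distance over aligned consensus sequences. This is only an
illustration: the function names, labels, and sequences below are made
up, gaps are counted as ordinary characters, and real input would come
from an alignment such as allSGA.aln.

```python
def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    return sum(x != y for x, y in zip(a, b))

def closest(target, candidates):
    """Name of the candidate with the fewest differences from target."""
    return min(candidates, key=lambda name: hamming(target, candidates[name]))

# Toy aligned sequences (not real data).
recipient = "ACGTAC-GT"
donors = {
    "donor_1": "ACGTACTGT",
    "donor_2": "ACGTAC-GT",
    "donor_3": "TCGAAC-GA",
}
best = closest(recipient, donors)
print(best, hamming(recipient, donors[best]))
```

A distance of zero to the recipient consensus would correspond to the
"exact founder genome to the base" outcome.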
Next
================================
PacBio Emory Full HIV Future:
- Explain SGA variance in new runs. Each SGA should yield only one genome.
- Tune thresholds and estimate clustering consensus: find the exact
founder virus, to the base.
README_emoryProjCollab_FullSubpop.html
- Why are there differences between the BL and OH ligations?