================================

NIAID Large Internal Delete (LID) HIV Genomes

- GOAL: Use PacBio to sequence mixtures of HIV genomes, isolated by SGA, that are shorter than full length because of large internal deletions (LIDs), and compare the results to MiSeq results so that the algorithms can be perfected. We look forward to seeing the unblinded MiSeq estimates for the runs we present here.
- Here is the sample list of the shorter genomes sent to PacBio:

Name                 Samples                            AvgEstimatedSize
----                 -------                            ----------------
Pt3-95-3mix-A-short  1995: s2, s19, s20                 2200
Pt3-95-5mix-B-short  1995: s1, s3, s8, s10, s11         3300
Pt3-95-6mix-C-short  1995: s4, s12, s13, s14, s16, s18  3400
Pt3-95-6mix-D-short  1995: s5, s6, s7, s9, s15, s17     4700
Pt3-01-3mix-E-short  2001: s29, s30, s31                2400
Pt3-01-4mix-F-short  2001: s21, s23, s27, s32           3300
Pt3-01-5mix-G-short  2001: s22, s24, s25, s26, s28      3400

- For this set of results, we give consensus genomes for the Pt3-95-3mix-A-short (3 expected genomes) and Pt3-95-6mix-D-short (6 expected genomes) runs. Below we give four download links from applying two algorithms to these two runs.
- By comparing the results against the unblinded MiSeq estimates, we will be able to iterate on the algorithms until they are correct. We will start with these two runs and then use the other runs as new "test sets" to make sure our algorithms don't overfit. Once the algorithms are settled, the goal is to apply the final versions to the 20-mix of full-length HIV genomes as the final validation test set.
================================

Consensus Methods

- We try two classes of algorithms for these results: Long Amplicon Analysis (LAA) and ClusteringConsensus (CluCon).
-- Long Amplicon Analysis (LAA) is a de novo method that first clusters out more widely divergent genes and then finds smaller-scale differences, such as SNPs, within the genes. It was originally developed for multi-gene HLA sequencing.
  - LAA was run using the standard analysis protocol in SMRTportal.
-- ClusteringConsensus (CluCon) is a reference-based approach that clusters out different genomes and then mitigates noise by taking a consensus within each cluster, in a recursive fashion.
  - CluCon was run using a development version of the code available at https://github.com/mpsbpbi/clusteringConsensus
  - The development version works to handle PacBio error modes to achieve better performance.
  - An HIV HXB2 reference was used with the middle deleted. This was necessary because the deletions are so large that, against the full-length reference, the alignment broke into two separate parts.
================================

LAA Consensus Genomes

------------
- LAA results for the Pt3-95-3mix-A-short run:

Sequence Cluster  Sequence Phase  Length (bp)  Estimated Accuracy  Subreads coverage  Kept
----------------  --------------  -----------  ------------------  -----------------  ----
Cluster0          Phase0          2,238        99.96               298                1
Cluster0          Phase1          2,234        99.76               202                2
Cluster1          Phase0          2,818        100.00              500                3
Cluster2          Phase0          1,628        100.00              328                4
Cluster3          Phase0          831          99.71               54
Cluster4          Phase0          1            100.00              37
Cluster5          Phase0          475          99.54               23
Cluster6          Phase0          1,387        96.07               22

Remove any results with coverage below 200 subreads, leaving four.
- RESULT: Here are four consensus genomes from the Pt3-95-3mix-A-short run: laa-A-amplicon_analysis.clean.fasta
- From the runsheet, we expected three genomes, but the algorithm found four with high coverage.
------------
- LAA results for the Pt3-95-6mix-D-short run:

Sequence Cluster  Sequence Phase  Length (bp)  Estimated Accuracy  Subreads coverage  Kept
----------------  --------------  -----------  ------------------  -----------------  ----
Cluster0          Phase0          4,234        100.00              500                1
Cluster10         Phase0          3,185        99.87               109
Cluster11         Phase0          121          99.25               89
Cluster12         Phase0          2,081        99.74               87
Cluster13         Phase0          2,999        100.00              81
Cluster14         Phase0          2,819        99.90               36
Cluster1          Phase0          2,297        95.40               53
Cluster1          Phase1          2,260        0.05                4
Cluster1          Phase2          2,296        99.57               76
Cluster1          Phase3          2,297        99.45               63
Cluster1          Phase4          2,298        0.05                4
Cluster1          Phase5          2,282        99.60               62
Cluster1          Phase6          2,282        98.66               72
Cluster1          Phase7          2,190        9.22                3
Cluster1          Phase8          2,278        99.98               162
Cluster2          Phase0          4,695        100.00              500                2
Cluster3          Phase0          5,201        100.00              422                3
Cluster4          Phase0          4,698        100.00              341                4
Cluster5          Phase0          5,019        100.00              285                5
Cluster6          Phase0          5,012        100.00              234                6
Cluster7          Phase0          2,705        100.00              142
Cluster8          Phase0          2,472        100.00              136
Cluster9          Phase0          2,933        100.00              115
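
The coverage cutoff applied to both LAA tables (drop any consensus sequence supported by fewer than 200 subreads) can be sketched in a few lines of Python. This is only an illustration and not part of the SMRTportal protocol: the tuple layout and the name `filter_by_coverage` are assumptions made here, and the input rows are copied from the Pt3-95-3mix-A-short table above.

```python
# Rows copied from the Pt3-95-3mix-A-short LAA table:
# (cluster, phase, length_bp, estimated_accuracy, subread_coverage)
LAA_A_ROWS = [
    ("Cluster0", "Phase0", 2238, 99.96, 298),
    ("Cluster0", "Phase1", 2234, 99.76, 202),
    ("Cluster1", "Phase0", 2818, 100.00, 500),
    ("Cluster2", "Phase0", 1628, 100.00, 328),
    ("Cluster3", "Phase0", 831, 99.71, 54),
    ("Cluster4", "Phase0", 1, 100.00, 37),
    ("Cluster5", "Phase0", 475, 99.54, 23),
    ("Cluster6", "Phase0", 1387, 96.07, 22),
]

# Cutoff stated in the text: results with coverage below 200 are removed.
MIN_COVERAGE = 200


def filter_by_coverage(rows, min_coverage=MIN_COVERAGE):
    """Return only the rows whose subread coverage meets the cutoff.

    Hypothetical helper for illustration; it is not an LAA or SMRTportal API.
    """
    return [row for row in rows if row[4] >= min_coverage]


kept = filter_by_coverage(LAA_A_ROWS)
for cluster, phase, length_bp, accuracy, coverage in kept:
    print(f"{cluster}\t{phase}\t{length_bp}\t{accuracy:.2f}\t{coverage}")
print(len(kept), "consensus genomes kept")
```

On the A-run rows this keeps the four sequences with coverage 298, 202, 500, and 328. A fixed cutoff is simple to audit, but it is sensitive to run depth, so the threshold may need revisiting for runs with different yields.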

Remove any results with coverage below 200 subreads, leaving six.
- RESULT: Here are six consensus genomes from the Pt3-95-6mix-D-short run: laa-D-amplicon_analysis.clean.fasta
- For completeness, here are all consensus genomes from the Pt3-95-6mix-D-short run: laa-D-amplicon_analysis.fasta (superset of above)
- From the runsheet, we expected six genomes and the algorithm found six with high coverage.
================================

CluCon Consensus Genomes

------------
- RESULT: CluCon consensus genomes for the Pt3-95-3mix-A-short run: short-mixes-A-clucons.fasta

There are 3 estimated genomes; 3 were expected.

len   name                        VarCols
---   -----                       -------
2767  short-mixes-A-clucons-num0  84
2771  short-mixes-A-clucons-num1  128
2781  short-mixes-A-clucons-num2  142

(VarCols is the number of columns that are estimated to be variant. It should be low for pure subsets.)
------------
- RESULT: CluCon consensus genomes for the Pt3-95-6mix-D-short run: short-mixes-D-clucons.fasta

There are 7 estimated genomes; 6 were expected.

len   name                        VarCols
---   -----                       -------
5064  short-mixes-D-clucons-num0  0
4667  short-mixes-D-clucons-num1  1
4668  short-mixes-D-clucons-num2  2
4204  short-mixes-D-clucons-num3  33
4994  short-mixes-D-clucons-num4  36
4880  short-mixes-D-clucons-num5  60
4612  short-mixes-D-clucons-num6  265

(VarCols is the number of columns that are estimated to be variant. It should be low for pure subsets.)
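
VarCols is described above only as the number of alignment columns estimated to be variant. One simple way to make that idea concrete is to count the columns in which the second most common base is seen in more than some fraction of reads. The sketch below does that. It is an assumption for illustration, not necessarily how CluCon computes VarCols; the function name `count_variant_columns` and the 0.2 cutoff are hypothetical.

```python
from collections import Counter


def count_variant_columns(aligned_reads, min_minor_fraction=0.2):
    """Count alignment columns whose second most common symbol is frequent.

    aligned_reads: equal-length strings, one per read ('-' marks a gap).
    min_minor_fraction: hypothetical cutoff; not taken from CluCon.
    """
    if not aligned_reads:
        return 0
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    n_reads = len(aligned_reads)
    variant_columns = 0
    for col in range(length):
        counts = Counter(read[col] for read in aligned_reads)
        top_two = counts.most_common(2)
        # A column is called variant when a second symbol exists and is
        # frequent enough that it is unlikely to be isolated read noise.
        if len(top_two) == 2 and top_two[1][1] / n_reads >= min_minor_fraction:
            variant_columns += 1
    return variant_columns


# Toy data: a pure subset with one isolated error, and a 50/50 mixture
# of two genomes that differ at the second position.
pure_subset = ["ACGT"] * 9 + ["ACGA"]
mixed_subset = ["ACGT"] * 5 + ["ATGT"] * 5

print(count_variant_columns(pure_subset))   # isolated error stays below the cutoff
print(count_variant_columns(mixed_subset))  # the real difference is counted
```

Under this reading, a cluster that still contains more than one genome shows many variant columns, which is consistent with the note that VarCols should be low for pure subsets.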