Goal: NIAID sent MiSeq estimates for Pt3 1995; there should be 20
(s1:s20). Check them against our results.

================================

SUMMARY:

- We sequenced four mixes of 3, 5, 6, and 6 SGA shorter LID HIV genomes from Pt3/1995, blinded to the truth.
- These results use the de-novo Long Amplicon Analysis (LAA) push-button protocol publicly available in our SMRTportal software.
- We saw 100% concordance across the entire MiSeq estimate (except ~3 bases at the beginning in some), with the PacBio estimate adding ~30 bases to the beginning and end of the MiSeq estimate.
- We correctly de-novo estimated the genomes even though one sample contained multiple genomes and several samples contained identical genomes.
- This "out-of-the-box" solution warrants unblinding the remaining Pt3 sequences (our estimates are at the end of this page). If those look good, then testing on the 20-mix might give perfect results.

================================

NIAID unblinded the Pt3 1995 MiSeq estimates (Friday 2014-6-6)

-rw-r--r-- 1 mbrown domain_users 71570 2014-06-06 15:31 pt3_20_short_frag_1995_sample_contigs.fa
-rw-r--r-- 1 mbrown domain_users 11233 2014-06-06 15:31 pt3_20_short_frag_1995_sample_info.xlsx

Here is the spreadsheet that Yan joined with our data:

-=-=-=-=
mixName | LIBSampleName | HiromiSampleName | HiromiEstSizeAgilent | LengthMiSeq | DelLengthAprox | DelHXB2 | CGCluster | Note
Pt3-95-3mix-A-short | S1_hiromi_3rd_vicunaCon(2536) | Pt3_s#19 | 1828 | | | | heterozygous | Not a SGA, i.e. containing multiple HIV genomes
Pt3-95-3mix-A-short | S8_vicuna_consensus(2884) | Pt3_#s2 | 2825 | 2784 | 2788&3377 | 2602..5390 & 5401..8778 | Unique |
Pt3-95-3mix-A-short | S2_hiromi_3rd_vicunaCon(2261) | Pt3_s#20 | 2252 | 2161 | 6720 | 1082..7801 cross p17 and gp41 | Unique | 6720 bp deletion causes frameshift in gp41
Pt3-95-3mix-B-short | S7_vicuna_consensus(3165) | Pt3_#s1 | 3132 | 3065 | 5870 | 3682..9552 | Unique |
Pt3-95-3mix-B-short | S16_vicuna_consensus(3422) | Pt3_#s10 | 3541 | 3322 | 5609 | 3279..8888 | D | All identical, but with slightly different starting/ending position
Pt3-95-3mix-B-short | S17_vicuna_consensus(3425) | Pt3_#s11 | 3533 | 3325 | 5609 | 3279..8888 | D | All identical, but with slightly different starting/ending position
Pt3-95-3mix-B-short | S9_vicuna_consensus(3416) | Pt3_#s3 | 3542 | 3316 | 5609 | 3279..8888 | D | All identical, but with slightly different starting/ending position
Pt3-95-3mix-B-short | S14_vicuna_consensus(3415) | Pt3_#s8 | 3483 | 3315 | 5609 | 3279..8888 | D | All identical, but with slightly different starting/ending position
Pt3-95-3mix-C-short | S18_vicuna_consensus(3418) | Pt3_#s12 | 3593 | 3318 | 5609 | 3279..8888 | D | All identical, but with slightly different starting/ending position
Pt3-95-3mix-C-short | S19_vicuna_consensus(3429) | Pt3_#s13 | 3709 | 3329 | 5609 | 3279..8888 | D | All identical, but with slightly different starting/ending position
Pt3-95-3mix-C-short | S20_vicuna_consensus(3439) | Pt3_#s14 | 3665 | 3339 | 5609 | 3279..8888 | D | All identical, but with slightly different starting/ending position
Pt3-95-3mix-C-short | S22_vicuna_consensus(3431) | Pt3_#s16 | 3670 | 3331 | 5609 | 3279..8888 | D | All identical, but with slightly different starting/ending position
Pt3-95-3mix-C-short | S24_vicuna_consensus(3452) | Pt3_#s18 | 3660 | 3352 | 5609 | 3279..8888 | D | All identical, but with slightly different starting/ending position
Pt3-95-3mix-C-short | S10_vicuna_consensus(3418) | Pt3_#s4 | 3634 | 3318 | 5609 | 3279..8888 | D | All identical, but with slightly different starting/ending position
Pt3-95-3mix-D-short | S21_vicuna_consensus(5092) | Pt3_#s15 | 5397 | 4992 | 320&3624 | 4857..5177 & 5464..9088 | Unique | Also has 285 bp inversion (4466..4181 in consensus or 5178..5463 in HXB2)
Pt3-95-3mix-D-short | S23_vicuna_consensus(5050) | Pt3_#s17 | 5483 | 4950 | 3960 | 5621..9581 | Unique |
Pt3-95-3mix-D-short | S11_vicuna_consensus(4279) | Pt3_#s5 | 4302 | 4179 | 4729 | 4354..9083 | Unique |
Pt3-95-3mix-D-short | S12_vicuna_consensus(4742) | Pt3_#s6 | 4879 | 4642 | 4283 | 4781..9064 | Unique |
Pt3-95-3mix-D-short | S13_vicuna_consensus(4770) | Pt3_#s7 | 4919 | 4670 | 4284 | 4775..9059 | Unique |
Pt3-95-3mix-D-short | S15_vicuna_consensus(5237) | Pt3_#s9 | 5481 | 5137 | 3764 | 5640..9404 | Unique |
-=-=-=-=

We sequenced four mixes of 3, 5, 6, and 6 SGA shorter LID HIV genomes from Pt3/1995, blinded to the truth. Here's the number of mixed SGA samples and the number of unique genomes per mix based on MiSeq:

run numSGAs numUniqMiSeqGenomes
--- ------- -------------------
A   3       4+
B   5       2
C   6       1
D   6       6
--- ------- -------------------

Note that the number of SGA samples and the number of unique genomes differ. Mixes E, F, G are not from 1995 and are still blinded, though they were run and analyzed.

================================================================
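The per-mix counts in the table above can be re-derived from the CGCluster column of the joined spreadsheet. The following is a minimal Python sketch, not the procedure actually used: the `ROWS` list is transcribed by hand from the spreadsheet, `tally_mixes` is a hypothetical helper, and it assumes that a "heterozygous" (non-SGA) sample contributes at least two genomes, which is why mix A comes out as "4+".

```python
from collections import defaultdict

# (mix, LIBSampleName prefix, CGCluster), transcribed from the joined spreadsheet.
ROWS = [
    ("A", "S1", "heterozygous"), ("A", "S8", "Unique"), ("A", "S2", "Unique"),
    ("B", "S7", "Unique"), ("B", "S16", "D"), ("B", "S17", "D"),
    ("B", "S9", "D"), ("B", "S14", "D"),
    ("C", "S18", "D"), ("C", "S19", "D"), ("C", "S20", "D"),
    ("C", "S22", "D"), ("C", "S24", "D"), ("C", "S10", "D"),
    ("D", "S21", "Unique"), ("D", "S23", "Unique"), ("D", "S11", "Unique"),
    ("D", "S12", "Unique"), ("D", "S13", "Unique"), ("D", "S15", "Unique"),
]


def tally_mixes(rows):
    """Return {mix: (numSGAs, numUniqMiSeqGenomes as a string)}."""
    num_sgas = defaultdict(int)
    genomes = defaultdict(set)
    open_ended = set()  # mixes whose genome count is only a lower bound
    for mix, sample, cluster in rows:
        num_sgas[mix] += 1
        if cluster == "Unique":
            genomes[mix].add(sample)
        elif cluster == "heterozygous":
            # Assumption: a non-SGA sample holds at least two genomes.
            genomes[mix].update({sample + "/a", sample + "/b"})
            open_ended.add(mix)
        else:
            # Samples sharing a cluster label count as one genome.
            genomes[mix].add("cluster:" + cluster)
    return {
        mix: (num_sgas[mix],
              str(len(genomes[mix])) + ("+" if mix in open_ended else ""))
        for mix in sorted(num_sgas)
    }


for mix, (n_sgas, n_genomes) in tally_mixes(ROWS).items():
    print(mix, n_sgas, n_genomes)
```

Under those assumptions the output matches the table: A 3 4+, B 5 2, C 6 1, D 6 6.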

#### LAA RESULTS

See end of page for definitions of terms in result tables.

================================

----Mix A:

RESULTS:
mixA-amplicon_analysis_summary.csv
mixA-amplicon_analysis.fasta

mymerge[mymerge$TotalCov>200,]
                             FastaName BarcodeName CoarseCluster  Phase
1 Barcode0_Cluster0_Phase0_NumReads298           0      Cluster0 Phase0
2 Barcode0_Cluster0_Phase1_NumReads202           0      Cluster0 Phase1
3 Barcode0_Cluster1_Phase0_NumReads500           0      Cluster1 Phase0
4 Barcode0_Cluster2_Phase0_NumReads328           0      Cluster2 Phase0
  TotalCoverage SequenceLength PredictedAccuracy ConsensusConverged
1           298           2238         0.9995531               True
2           202           2234         0.9975511              False
3           500           2818         1.0000000               True
4           328           1628         1.0000000               True
  NoiseSequence IsChimera ChimeraScore ParentSequenceA ParentSequenceB
1         False     False          NaN              NA              NA
2         False     False          NaN              NA              NA
3         False     False          NaN              NA              NA
4         False     False          NaN              NA              NA
  CrossoverPosition                                       qName
1                -1 Barcode0_Cluster0_Phase0_NumReads298/0_2238
2                -1 Barcode0_Cluster0_Phase1_NumReads202/0_2234
3                -1 Barcode0_Cluster1_Phase0_NumReads500/0_2818
4                -1 Barcode0_Cluster2_Phase0_NumReads328/0_1628
                                            tName qStrand tStrand  score
1 S2_hiromi_full_length_3rd_vicunaConsensus(2261)       0       0 -10805
2                      S13_vicuna_consensus(4770)       0       1  -2681
3                       S8_vicuna_consensus(2884)       0       1 -13905
4                       S7_vicuna_consensus(3165)       0       0  -4507
  percentSimilarity tStart tEnd tLength qStart qEnd qLength nCells
1          100.0000      0 2161    2161     36 2197    2238  45349
2           98.2079   4112 4670    4670   1661 2218    2234  11682
3          100.0000      0 2781    2784     31 2812    2818  58369
4           96.9792      3  937    3065      6  966    1628  20865

We get two 100% hits that are about full-length with extra bases on the ends (S2 and S8). S1 was not an SGA, and we get two imperfect hits (to S13 and S7). If S1 was really two genomes, then we are correct.

NOTE: The accuracy of a hit is in percentSimilarity.
Full-length because, on the target, tStart~=0 and tEnd~=tLength.
Extra on ends because, on the query, qStart > 0 and qEnd < qLength.

================================

----Mix B:

RESULTS:
mixB-amplicon_analysis_summary.csv
mixB-amplicon_analysis.fasta

mymerge[mymerge$TotalCov>200,]
                             FastaName BarcodeName CoarseCluster  Phase
1 Barcode0_Cluster0_Phase0_NumReads500           0      Cluster0 Phase0
2 Barcode0_Cluster1_Phase0_NumReads500           0      Cluster1 Phase0
  TotalCoverage SequenceLength PredictedAccuracy ConsensusConverged
1           500           3373                 1               True
2           500           3095                 1               True
  NoiseSequence IsChimera ChimeraScore ParentSequenceA ParentSequenceB
1         False     False          NaN              NA              NA
2         False     False          NaN              NA              NA
  CrossoverPosition                                       qName
1                -1 Barcode0_Cluster0_Phase0_NumReads500/0_3373
2                -1 Barcode0_Cluster1_Phase0_NumReads500/0_3095
                       tName qStrand tStrand  score percentSimilarity tStart
1 S24_vicuna_consensus(3452)       0       0 -16745               100      0
2  S7_vicuna_consensus(3165)       0       0 -15310               100      3
  tEnd tLength qStart qEnd qLength nCells
1 3349    3352     18 3367    3373  70297
2 3065    3065      6 3068    3095  64270

We get two 100% hits that are about full-length with extra bases on the ends (S7 and S24). Correct! S24 is the best hit among the "all identical" samples.

================================

----Mix C:

RESULTS:
mixC-amplicon_analysis_summary.csv
mixC-amplicon_analysis.fasta

mymerge[mymerge$TotalCov>200,]
                             FastaName BarcodeName CoarseCluster  Phase
1 Barcode0_Cluster0_Phase0_NumReads500           0      Cluster0 Phase0
  TotalCoverage SequenceLength PredictedAccuracy ConsensusConverged
1           500           3372                 1               True
  NoiseSequence IsChimera ChimeraScore ParentSequenceA ParentSequenceB
1         False     False          NaN              NA              NA
  CrossoverPosition                                       qName
1                -1 Barcode0_Cluster0_Phase0_NumReads500/0_3372
                       tName qStrand tStrand  score percentSimilarity tStart
1 S24_vicuna_consensus(3452)       0       0 -16745               100      0
  tEnd tLength qStart qEnd qLength nCells
1 3349    3352     17 3366    3372  70297

We get one 100% hit that is about full-length with extra bases on the ends (S24). Correct! S24 is the best hit among the "all identical" samples. This is given six input SGA tubes.
================================

----Mix D:

RESULTS:
mixD-amplicon_analysis_summary.csv
mixD-amplicon_analysis.fasta

> mymerge[mymerge$TotalCov>200,]
                              FastaName BarcodeName CoarseCluster  Phase
1  Barcode0_Cluster0_Phase0_NumReads500           0      Cluster0 Phase0
16 Barcode0_Cluster2_Phase0_NumReads500           0      Cluster2 Phase0
17 Barcode0_Cluster3_Phase0_NumReads422           0      Cluster3 Phase0
18 Barcode0_Cluster4_Phase0_NumReads341           0      Cluster4 Phase0
19 Barcode0_Cluster5_Phase0_NumReads285           0      Cluster5 Phase0
20 Barcode0_Cluster6_Phase0_NumReads234           0      Cluster6 Phase0
   TotalCoverage SequenceLength PredictedAccuracy ConsensusConverged
1            500           4234                 1               True
16           500           4695                 1               True
17           422           5201                 1               True
18           341           4698                 1               True
19           285           5019                 1               True
20           234           5012                 1               True
   NoiseSequence IsChimera ChimeraScore ParentSequenceA ParentSequenceB
1          False     False          NaN
16         False     False          NaN
17         False     False          NaN
18         False     False          NaN
19         False     False          NaN
20         False     False          NaN
   CrossoverPosition                                       qName
1                 -1 Barcode0_Cluster0_Phase0_NumReads500/0_4234
16                -1 Barcode0_Cluster2_Phase0_NumReads500/0_4695
17                -1 Barcode0_Cluster3_Phase0_NumReads422/0_5201
18                -1 Barcode0_Cluster4_Phase0_NumReads341/0_4698
19                -1 Barcode0_Cluster5_Phase0_NumReads285/0_5019
20                -1 Barcode0_Cluster6_Phase0_NumReads234/0_5012
                        tName qStrand tStrand  score percentSimilarity tStart
1  S11_vicuna_consensus(4279)       0       1 -20895               100      0
16 S13_vicuna_consensus(4770)       0       1 -23350               100      0
17 S15_vicuna_consensus(5237)       0       1 -25685               100      0
18 S12_vicuna_consensus(4742)       0       1 -23210               100      0
19 S23_vicuna_consensus(5050)       0       1 -24750               100      0
20 S21_vicuna_consensus(5092)       0       0 -24945               100      3
   tEnd tLength qStart qEnd qLength nCells
1  4179    4179     23 4202    4234  87727
16 4670    4670      9 4679    4695  98038
17 5137    5137     34 5171    5201 107845
18 4642    4642     34 4676    4698  97450
19 4950    4950     33 4983    5019 103918
20 4992    4992      7 4996    5012 104737

We got all six of them right with a coverage cutoff of 200! CORRECT! All of them are full length (except 3 bases in S21), with an extra 30 bases or so at the ends, so we extend the MiSeq estimate!
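The "full-length" and "extra on ends" readings used for each mix follow directly from the alignment coordinates (tStart, tEnd, tLength on the target; qStart, qEnd, qLength on the query). Here is a small Python sketch of that check. `describe_hit` and its `slack` parameter are hypothetical names, and the 5-base tolerance for "~=" is an assumption picked to allow the ~3 missing bases noted above. The sketch also ignores strand, so it only reports counts at the start and end of the query as aligned.

```python
def describe_hit(t_start, t_end, t_length, q_start, q_end, q_length, slack=5):
    """Summarize one alignment row of the result tables.

    slack (assumed tolerance) is how many target bases may be missing
    at either end while still calling the hit full-length.
    """
    full_length = t_start <= slack and (t_length - t_end) <= slack
    return {
        "full_length": full_length,
        "extra_at_start": q_start,          # query bases before the alignment
        "extra_at_end": q_length - q_end,   # query bases after the alignment
    }


# Mix A row 1 (hit to S2): tStart=0 tEnd=2161 tLength=2161 qStart=36 qEnd=2197 qLength=2238
s2 = describe_hit(0, 2161, 2161, 36, 2197, 2238)
# Mix A row 4 (hit to S7): tStart=3 tEnd=937 tLength=3065 qStart=6 qEnd=966 qLength=1628
s7 = describe_hit(3, 937, 3065, 6, 966, 1628)

print(s2)
print(s7)
```

The S2 hit comes out full-length with 36 and 41 extra query bases, while the S7 hit from the non-SGA sample covers only the first part of its target, so the extra-base counts are only meaningful for full-length hits.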
================================================================================================

Results for the Pt3 2001 still blinded runs.

Estimates with coverage > 200 should be examined.

----Mix E (2001: s29, s30, s31)

RESULTS:
mixE-amplicon_analysis_summary.csv
mixE-amplicon_analysis.fasta

----Mix F (2001: s21, s23, s27, s32)

RESULTS:
mixF-amplicon_analysis_summary.csv
mixF-amplicon_analysis.fasta

----Mix G (2001: s22, s24, s25, s26, s28)

RESULTS:
mixG-amplicon_analysis_summary.csv
mixG-amplicon_analysis.fasta

================================================================================================
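For the still-blinded mixes, the coverage cutoff can also be applied straight from the consensus names in the FASTA files, since every row shown on this page has a NumReads value in its name equal to its TotalCoverage. The Python sketch below assumes that pattern holds in general (it is not confirmed here), and `parse_name` and `keep_covered` are hypothetical helpers rather than part of LAA. The example names are the four Mix A consensus names.

```python
import re

# Name pattern as it appears in the FastaName / qName columns on this page.
NAME_RE = re.compile(r"Barcode(\d+)_Cluster(\d+)_Phase(\d+)_NumReads(\d+)")


def parse_name(name):
    """Split a consensus name into barcode, cluster, phase, and read count."""
    match = NAME_RE.search(name)
    if match is None:
        raise ValueError("unrecognized consensus name: " + name)
    barcode, cluster, phase, num_reads = (int(g) for g in match.groups())
    return {"barcode": barcode, "cluster": cluster,
            "phase": phase, "num_reads": num_reads}


def keep_covered(names, min_reads=200):
    """Keep names whose read count is strictly above the cutoff."""
    return [n for n in names if parse_name(n)["num_reads"] > min_reads]


# Mix A consensus names, as listed in the Mix A results.
MIX_A = [
    "Barcode0_Cluster0_Phase0_NumReads298/0_2238",
    "Barcode0_Cluster0_Phase1_NumReads202/0_2234",
    "Barcode0_Cluster1_Phase0_NumReads500/0_2818",
    "Barcode0_Cluster2_Phase0_NumReads328/0_1628",
]

print(keep_covered(MIX_A))        # cutoff used on this page
print(keep_covered(MIX_A, 300))   # a stricter, purely illustrative cutoff
```

With the cutoff of 200 all four Mix A names are kept; the stricter cutoff is there only to show the filter dropping the two lower-coverage phases of Cluster0.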

Definitions for result tables:

- FastaName: name of the consensus sequence
- BarcodeName: always 0 in this case; if using barcodes, then the barcode id
- CoarseCluster: how the algorithm split the reads
- Phase: how the algorithm split the CoarseCluster
- TotalCoverage: number of reads that went into the consensus
- SequenceLength: the length of the estimated consensus
- PredictedAccuracy: the algorithm's de-novo estimate of the consensus accuracy
- ConsensusConverged: whether the algorithm converged confidently to an answer
- NoiseSequence: is the consensus estimate likely to be noise
- IsChimera: is the consensus a cross-over of other sequences? Important in PCRing related species
- ChimeraScore, ParentSequenceA, ParentSequenceB, CrossoverPosition: information if IsChimera=True
- qName: queryID in alignment
- tName: targetID in alignment
- qStrand, tStrand: direction of query and target in alignment
- score: alignment score
- percentSimilarity: the similarity in the aligned region
- tStart, tEnd: where the alignment started and ended in the target sequence
- tLength: total length of the target sequence
- qStart, qEnd: where the alignment started and ended in the query sequence
- qLength: total length of the query sequence
- nCells: Number of ZMWs used??? I'll have to ask.

Example:

                             FastaName BarcodeName CoarseCluster  Phase
1 Barcode0_Cluster0_Phase0_NumReads500           0      Cluster0 Phase0
  TotalCoverage SequenceLength PredictedAccuracy ConsensusConverged
1           500           3372                 1               True
  NoiseSequence IsChimera ChimeraScore ParentSequenceA ParentSequenceB
1         False     False          NaN              NA              NA
  CrossoverPosition                                       qName
1                -1 Barcode0_Cluster0_Phase0_NumReads500/0_3372
                       tName qStrand tStrand  score percentSimilarity tStart
1 S24_vicuna_consensus(3452)       0       0 -16745               100      0
  tEnd tLength qStart qEnd qLength nCells
1 3349    3352     17 3366    3372  70297