goal: validate the large ~300-bp deletion observed in the HIV data.

ALIGNMENT RESULTS

Show IGV alignments for P222-D105 (strong deletion) and P9213-D59 (evidence of deletion but at lower abundance), P222-D101 (no large deletion evidence by metrics) Note, you might have to download the references and bam locally so IGV can properly index them (the webserver is password protected).

P222-D105 (strong deletion)

Given Reference: AC222_D105_assembly.fa Given Reference: AC222_D105_assembly.fa.fai BAM alignments: deletionValidation-AC222_D105_15Mar_C01_1.sorted.bam BAM alignments: deletionValidation-AC222_D105_15Mar_C01_1.sorted.bam.bai RAW reads >4k mapped: deletionValidation-clucon-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1.clustered.fasta

P9213-D59 (evidence of deletion)

Given Reference: 9213_D59.fa Given Reference: 9213_D59.fa.fai BAM alignments: deletionValidation-9213-59_14Mar~D01_1.sorted.bam BAM alignments: deletionValidation-9213-59_14Mar~D01_1.sorted.bam.bai RAW read >4k mapped: deletionValidation-clucon-10-runallgivenref-9213_D59.fa-9213-59_14Mar~D01_1.clustered.fasta

P222-D101 (no large deletion evidence)

Estimated eference: quiverResult.consensus.fasta Estimated eference: quiverResult.consensus.fasta.fai BAM alignments: deletionValidation-222-101_15Mar~D01_1.sorted.bam BAM alignments: deletionValidation-222-101_15Mar~D01_1.sorted.bam.bai RAW read >4k mapped: deletionValidation-cluconFrom4k-16-runallgivenref-X-222-101_15Mar~D01_1.clustered.fasta

METHODS: Here is Rebecca's email Sept 2 >>>>>>>>>>>>>>>> Hi Michael, I am looking at AC222 D105, where you have a 300 bp deletion in Env in your consensus reads relative to the reference genome that I provided. However, I dont see the deletion in either the 454 reads or the Pac Bio reads when I look at an alignment of the raw reads to my reference. I use the program IGV to view the read alignment to the reference. It is clear in the 454 reads that there is no deletion in Env relative to the reference. In the alignment of the Pac Bio reads, I do see a drop in coverage in that region, but it does not appear that there are reads spanning that region that have missing bases. In order to check the reads that you used, I selected reads longer than 4000 bp and did the same alignment to my D105 reference. I get about 50 reads. In these reads I also dont see a large deletion, in fact these reads seem to nicely span the whole region. Do you think this is a reasonable way to check for the deletion in the reads? I wonder what could be going on in your consensus builder that you get these deletions. I also checked the 454 reads for sample 9213 D59, and I also see no signs of such a large deletion. I'll hold off on installing the software until we iron this out. Thanks for any help with this. Best, Rebecca -- Rebecca Batorsky, PhD Research Fellow Ragon Institute of MGH, MIT and Harvard 400 Technology Square | Room 761 | Cambridge MA 02139 Phone: 857.268.7015 Email: rbatorsky@partners.org <<<<<<<<<<<<<<<<< ================================ GOAL: generate alignments of two time points in BAM format so that Rebecca can see the deletion in IGV viewer. TODO: Perhaps also give references for both deleted and non-deleted so partners.org can run their pipeline and show that the deleted and non-deleted alignments are truncations of each other (their aligner doesn't want to follow the 300 base deletion and stops on one side, giving a "half alignment". Against their reference I have: Initial Look: clucon_222-105-15Mar-AC222_D105_assembly_fixed Run all look: clucon-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1/ ma_clucon-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1.mutationAnalysis.output.top ma_clucon-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1.mutationAnalysis.output.png Bases 7033-7382 strongly deleted. MSA 35165-36910 The recursion on the 4k sub-reference: * drwxr-xr-x 2 mbrown domain_users 1237 2013-08-26 17:07 cluconFrom4k-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1 * drwxr-xr-x 2 mbrown domain_users 1436 2013-08-27 13:24 cluconFrom4k-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1-~_D1C1 * drwxr-xr-x 2 mbrown domain_users 1292 2013-08-27 13:17 cluconFrom4k-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1-~_D1C2 * drwxr-xr-x 2 mbrown domain_users 1292 2013-08-27 13:16 cluconFrom4k-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1-~_D1C3 In clucon-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1/, I have 590 reads that map >4k to their reference (+1 for the given reference). The ids are in alignments.filterFull 265 265 17977 clustergroup_1.ids 202 202 13707 clustergroup_2.ids 74 74 5021 clustergroup_3.ids 49 49 3325 clustergroup_4.ids 590 590 40030 total Let's create a fasta in the order of clustergroup_*.ids. If the aligner puts them in the same order then the alignment should clearly show the deletion. (IGV with SAM/BAM will change order... but only look at the >4k mapped) Pull the fasta in order of the clustering: export SEYMOUR_HOME=/mnt/secondary/Smrtanalysis/opt/smrtanalysis; . $SEYMOUR_HOME/etc/setup.sh; fastafetch --fasta /mnt/data3/vol53/Share/CollaboratorData/partorg/data/222-105_15Mar/C01_1/Analysis_Results/m130316_025735_42183_c100458832550000001523065103201322_s1_p0.fasta \ --index /mnt/data3/vol53/Share/CollaboratorData/partorg/data/222-105_15Mar/C01_1/Analysis_Results/m130316_025735_42183_c100458832550000001523065103201322_s1_p0.fasta.index \ -F -q /home/UNIXHOME/mbrown/mbrown/workspace2013Q3/partners-hiv/clucon-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1/clustergroup_1.ids \ > /home/UNIXHOME/mbrown/mbrown/workspace2013Q3/partners-hiv/deletionValidation-clucon-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1.clustered.fasta fastafetch --fasta /mnt/data3/vol53/Share/CollaboratorData/partorg/data/222-105_15Mar/C01_1/Analysis_Results/m130316_025735_42183_c100458832550000001523065103201322_s1_p0.fasta \ --index /mnt/data3/vol53/Share/CollaboratorData/partorg/data/222-105_15Mar/C01_1/Analysis_Results/m130316_025735_42183_c100458832550000001523065103201322_s1_p0.fasta.index \ -F -q /home/UNIXHOME/mbrown/mbrown/workspace2013Q3/partners-hiv/clucon-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1/clustergroup_2.ids \ >> /home/UNIXHOME/mbrown/mbrown/workspace2013Q3/partners-hiv/deletionValidation-clucon-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1.clustered.fasta fastafetch --fasta /mnt/data3/vol53/Share/CollaboratorData/partorg/data/222-105_15Mar/C01_1/Analysis_Results/m130316_025735_42183_c100458832550000001523065103201322_s1_p0.fasta \ --index /mnt/data3/vol53/Share/CollaboratorData/partorg/data/222-105_15Mar/C01_1/Analysis_Results/m130316_025735_42183_c100458832550000001523065103201322_s1_p0.fasta.index \ -F -q /home/UNIXHOME/mbrown/mbrown/workspace2013Q3/partners-hiv/clucon-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1/clustergroup_3.ids \ >> /home/UNIXHOME/mbrown/mbrown/workspace2013Q3/partners-hiv/deletionValidation-clucon-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1.clustered.fasta fastafetch --fasta /mnt/data3/vol53/Share/CollaboratorData/partorg/data/222-105_15Mar/C01_1/Analysis_Results/m130316_025735_42183_c100458832550000001523065103201322_s1_p0.fasta \ --index /mnt/data3/vol53/Share/CollaboratorData/partorg/data/222-105_15Mar/C01_1/Analysis_Results/m130316_025735_42183_c100458832550000001523065103201322_s1_p0.fasta.index \ -F -q /home/UNIXHOME/mbrown/mbrown/workspace2013Q3/partners-hiv/clucon-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1/clustergroup_4.ids \ >> /home/UNIXHOME/mbrown/mbrown/workspace2013Q3/partners-hiv/deletionValidation-clucon-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1.clustered.fasta There are 590 sequences in that file. ---- Run pbalign.py to get a sam file: pbalign.py deletionValidation-clucon-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1.clustered.fasta \ /mnt/data3/vol53/Share/CollaboratorData/partorg/data/assembly_fasta/AC222_D105_assembly.fa \ deletionValidation-AC222_D105_15Mar_C01_1.sam # convert sam to input for IGV samtools view -bS deletionValidation-AC222_D105_15Mar_C01_1.sam > deletionValidation-AC222_D105_15Mar_C01_1.bam samtools sort deletionValidation-AC222_D105_15Mar_C01_1.bam deletionValidation-AC222_D105_15Mar_C01_1.sorted samtools index deletionValidation-AC222_D105_15Mar_C01_1.sorted.bam cp /mnt/data3/vol53/Share/CollaboratorData/partorg/data/assembly_fasta/AC222_D105_assembly.fa ./ >>>> Run IGV: Reference: AC222_D105_assembly.fa BAM alignments: deletionValidation-AC222_D105_15Mar_C01_1.sorted.bam RAW reads >4k mapped: deletionValidation-clucon-4-runallgivenref-AC222_D105_assembly.fa-222-105_15Mar~C01_1.clustered.fasta ================================ Do the same for P9213-D59 in which the deletion should occur at lower abundance. cat clucon-10-runallgivenref-9213_D59.fa-9213-59_14Mar~D01_1/clustergroup_*.ids > /tmp/all.ids fastaindex --fasta /mnt/data3/vol53/Share/CollaboratorData/partorg/data/9213-59_14Mar/D01_1/Analysis_Results/m130315_025009_42183_c100458702550000001523065103201383_s1_p0.fasta --index /mnt/data3/vol53/Share/CollaboratorData/partorg/data/9213-59_14Mar/D01_1/Analysis_Results/m130315_025009_42183_c100458702550000001523065103201383_s1_p0.fasta.index fastafetch --fasta /mnt/data3/vol53/Share/CollaboratorData/partorg/data/9213-59_14Mar/D01_1/Analysis_Results/m130315_025009_42183_c100458702550000001523065103201383_s1_p0.fasta \ --index /mnt/data3/vol53/Share/CollaboratorData/partorg/data/9213-59_14Mar/D01_1/Analysis_Results/m130315_025009_42183_c100458702550000001523065103201383_s1_p0.fasta.index \ -F -q /tmp/all.ids \ > /home/UNIXHOME/mbrown/mbrown/workspace2013Q3/partners-hiv/deletionValidation-clucon-10-runallgivenref-9213_D59.fa-9213-59_14Mar~D01_1.clustered.fasta 306 raw sequences pbalign.py \ deletionValidation-clucon-10-runallgivenref-9213_D59.fa-9213-59_14Mar~D01_1.clustered.fasta \ /mnt/data3/vol53/Share/CollaboratorData/partorg/data/assembly_fasta/9213_D59.fa \ deletionValidation-9213-59_14Mar~D01_1.sam # convert sam to input for IGV: deletionValidation-9213-59_14Mar~D01_1 samtools view -bS $1.sam > $1.bam samtools sort $1.bam $1.sorted samtools index $1.sorted.bam cp /mnt/data3/vol53/Share/CollaboratorData/partorg/data/assembly_fasta/9213_D59.fa ./ >>>> Run IGV: Reference: 9213_D59.fa BAM alignments: deletionValidation-9213-59_14Mar~D01_1.sorted.bam RAW read >4k mapped: deletionValidation-clucon-10-runallgivenref-9213_D59.fa-9213-59_14Mar~D01_1.clustered.fasta ================================ Finally do P222-D101 (no observed deletion) on the estimated reference because I don't have a given reference. cluconFrom4k-16-runallgivenref-X-222-101_15Mar~D01_1: cat cluconFrom4k-16-runallgivenref-X-222-101_15Mar~D01_1/clustergroup_*.ids > /tmp/all.ids fastaindex --fasta /mnt/data3/vol53/Share/CollaboratorData/partorg/data/222-101_15Mar/D01_1/Analysis_Results/m130316_051506_42183_c100458832550000001523065103201323_s1_p0.fasta --index /mnt/data3/vol53/Share/CollaboratorData/partorg/data/222-101_15Mar/D01_1/Analysis_Results/m130316_051506_42183_c100458832550000001523065103201323_s1_p0.fasta.index fastafetch --fasta /mnt/data3/vol53/Share/CollaboratorData/partorg/data/222-101_15Mar/D01_1/Analysis_Results/m130316_051506_42183_c100458832550000001523065103201323_s1_p0.fasta --index /mnt/data3/vol53/Share/CollaboratorData/partorg/data/222-101_15Mar/D01_1/Analysis_Results/m130316_051506_42183_c100458832550000001523065103201323_s1_p0.fasta.index \ -F -q /tmp/all.ids \ > /home/UNIXHOME/mbrown/mbrown/workspace2013Q3/partners-hiv/deletionValidation-cluconFrom4k-16-runallgivenref-X-222-101_15Mar~D01_1.clustered.fasta 850 raw sequences pbalign.py \ deletionValidation-cluconFrom4k-16-runallgivenref-X-222-101_15Mar~D01_1.clustered.fasta \ /home/UNIXHOME/mbrown/mbrown/workspace2013Q3/partners-hiv/cluconFrom4k-16-runallgivenref-X-222-101_15Mar~D01_1/quiverResult.consensus.fasta \ deletionValidation-222-101_15Mar~D01_1.sam # convert sam to input for IGV: deletionValidation-222-101_15Mar~D01_1 samtools view -bS $1.sam > $1.bam samtools sort $1.bam $1.sorted samtools index $1.sorted.bam cp -i cluconFrom4k-16-runallgivenref-X-222-101_15Mar~D01_1/quiverResult.consensus.fasta quiverResult.consensus.fasta >>>> Run IGV: Reference: quiverResult.consensus.fasta BAM alignments: deletionValidation-222-101_15Mar~D01_1.sorted.bam RAW read >4k mapped: deletionValidation-cluconFrom4k-16-runallgivenref-X-222-101_15Mar~D01_1.clustered.fasta