<pre>

<h2>goal</h2>

goal: summarize results of BCR-ABL variant finding on a larger number
of barcoded patient samples.

<h2>summary</h2>

- METHODS: Sequence 862 base region of kinase domain from BCR-ABL
using nested PCR followed by barcoded PacBio RS sequencing for 54
clinical samples from 36 CML patients. Look for amino acid variants to
the reference sequence. Run two chips for technical repeats, although
all results only examine one chip at a time. The PacBio RS sequences
continuous single molecules spanning the entire region.

Here is workflow overview <a href="http://usmp-acid/mbrown/mbrown/workspace2012Q3/bcrAblBarcodes.Full/bcrablWorkflow.pdf">bcrablWorkflow.pdf</a>

- 106 known SEQUENCING variants. PacBio finds all of them. A threshold
of 6e-08 would give all of the sequencing mutations with no false
negatives. <a href="knownmutsSEQ.merge.tsv">knownmutsSEQ.merge.tsv</a>

- 131 known SEQUENOM variants. PacBio finds all of them. A p-value
threshold of 1.03e-03 would give all known sequenom mutations with no
false negatives. <a href="knownmutsSEQUENOM.merge.tsv">knownmutsSEQUENOM.merge.tsv</a>

- Note that the stricter p-value of 6e-08 would miss 8 SEQUENOM
variants (that have p-values >6e-08 but <1.03e-03).

- Taking the more strict pvalue threshold of 6e-08, I see 485 novel
variants (327 non-silent) that occur in BOTH technical repeat
chips occurring with abundances 0.2% to 86.15%. <a href="novelPacBioVariants.tsv.xls">novelPacBioVariants.tsv.xls</a>

- We can tell when mutations occur together in compound mutations and
sort by frequency, <a href="bcrabl.compoundvariants.runs20-21c24-37.tsv.xls">bcrabl.compoundvariants.runs20-21c24-37.tsv.xls</a>

- We can identify compound mutations at early time points at low
levels that eventually dominate, suggesting early clinical
intervention strategies.

<h2>samples, runs, and barcodes</h2>

We ran 54 samples in duplicate. Patient SampleIDs were barcoded and
run. The run table is: <a href="map2.tsv">map2.tsv</a>

Barcode yields are summarized:

limsid	thresh	ccsnum	barcodenum	barcodeYield	barcodeHitNames	barcodeHitCounts
2450177-0020	25	36722	29990	0.816676651598497	F0 F1 F2 F3 F4 F6	3172 3756 2787 9906 3185 7184
2450177-0021	25	37465	30593	0.816575470439077	F0 F1 F2 F3 F4 F6	3159 3809 2855 10238 3287 7245
2450177-0024	25	22911	17812	0.777443149578805	F0 F1 F2 F3 F4 F5 F6 F7	1187 1374 1898 5502 2669 1640 1723 1819
2450177-0025	25	22382	17575	0.785229202037351	F0 F1 F2 F3 F4 F5 F6 F7	1160 1463 1824 5301 2670 1613 1763 1781
2450177-0026	25	21233	16909	0.796354730843498	F0 F1 F2 F3 F4 F5 F7	5215 1167 1528 3010 1182 3818 989
2450177-0027	25	22824	18270	0.800473186119874	F0 F1 F2 F3 F4 F5 F7	5381 1362 1621 3241 1355 4191 1119
2450177-0028	25	17554	14646	0.834339751623562	F0 F1 F2 F3 F4 F5 F6 F7 F8	1588 1917 1601 1110 2301 2395 1887 1846 1
2450177-0029	25	19417	16104	0.829376319719833	F0 F1 F2 F3 F4 F5 F6 F7	1813 2162 1709 1241 2421 2586 2041 2131
2450177-0030	25	17045	12845	0.753593429158111	F0 F1 F2 F3 F4 F5 F6 F7	2437 2576 2175 2900 299 356 902 1200
2450177-0031	25	17290	13003	0.752053209947947	F0 F1 F2 F3 F4 F5 F6 F7	2449 2698 2221 2911 290 330 880 1224
2450177-0032	25	19790	15977	0.807326932794341	F1 F2 F3 F4 F5 F6	2436 2523 2047 1461 4905 2605
2450177-0033	25	19487	15870	0.81438907989942	F1 F2 F3 F4 F5 F6	2390 2524 2042 1426 4826 2662
2450177-0034	25	17284	14822	0.857556121268225	F0 F1 F2 F4 F5 F6 F7	1412 3990 1525 1847 1972 2389 1687
2450177-0035	25	20820	17708	0.850528338136407	F0 F1 F2 F4 F5 F6 F7	1677 4775 1844 2351 2290 2771 2000
2450177-0036	25	12859	10474	0.814526790574695	F0 F1 F2 F3 F4	3925 1879 1263 2391 1016
2450177-0037	25	12341	9997	0.810064014261405	F0 F1 F2 F3 F4	3890 1693 1134 2216 1064

Looks decent with ~80% yields, ~15k barcode reads per chip, sample
coverage 300 to 10,240 aligned reads. -0028,29 contains one false F8
barcode call. -0030,31 contains errant F4.

<h2>known variants</h2>

We were given a list of known variants by other technology (sequencing
and sequenom): <a href="Mutations&#32;by&#32;seq&#32;and&#32;MA&#32;130612_Rev.xlsx">Mutations by seq and MA 130612_Rev.xlsx</a>

I cleaned this into two normalized tables:
<a href="knownmutsSEQ.norm.csv">knownmutsSEQ.norm.csv</a>
<a href="knownmutsSEQUENOM.norm.csv">knownmutsSEQUENOM.norm.csv</a>

The headers are:
"key": run.barcode variant key for joining tables
"SampleID": your patient SampleID
"TestGroup": a pacbio run group
"PtInitials": your patient initials
"CollDate": your collection date
"Muts": your variant
"Frac": your estimate of fractional abundance
"runbc": pacbio run.barcode
"one": analysis flag
"poi": amino-acid position of interest.
"coi": codon of interest at poi in the reference
"Qcodon": query codon. the variant
"fracNull": the fraction of the variant seen in the null control run
"countNull": the number of times the variant is seen in the null control run
"SNull": the sum total of observations in the null control run at that position
"fracQAtP": the fraction of the variant seen in the sample
"countQAtP": the number of times the variant is seen in the sample
"SQAtP": the sum total of observations in the sample at that position
"pval": fisher exact test of sample counts to null counts: ((countQAtP, SQAtP),(countNull,SNull))
"mutation": the variant
"limsID": pacbio run id
"barcode": pacbio barcode
"Sample.ID": your patient SampleID

<h2>theshold estimate</h2>

In order to set an appropriate threshold to separate noise from true
variant results, I joined the known variants by the PacBio variant
estimates. A more stringent threshold of 6.0e-08 was chosen.

Here is the joined known variants with PacBio data:
<a href="knownmutsSEQ.merge.tsv">knownmutsSEQ.merge.tsv</a>
<a href="knownmutsSEQUENOM.merge.tsv">knownmutsSEQUENOM.merge.tsv</a>

This 6.0e-08 threshold yields all SEQUENCING variants and all SEQUENOM
variants except 8.

Here are the missed SEQUENOM variants at the 6.0e-08 threshold:
----
                      key SampleID TestGroup PtInitials CollDate  Muts Frac
42  2450177-0021.F1 g250e        2         1        laj   1/6/05 g250e   x%
44  2450177-0021.F1 l248v        2         1        laj   1/6/05 l248v   x%
85  2450177-0024.F6 f359c       15         1        map   6/6/05 f359c   x%
104 2450177-0025.F1 m244v       10         1        ksb  30/3/05 m244v   x%
161 2450177-0027.F0 l248v       17         1         sp  23/6/05 l248v   x%
206 2450177-0029.F4 t315i       66         4         mz  26/7/06 t315i   x%
279 2450177-0034.F2 m351t       89         4        lys  27/5/05 m351t   x%
289 2450177-0034.F6 v299l       93         4        dvd  11/1/07 v299l   x%
              runbc one poi coi Qcodon fracNull countNull SNull fracQAtP
42  2450177-0021.F1   1 250 ggg    gag  0.00028         8 28889  0.00469
44  2450177-0021.F1   1 248 ctg    gtg  0.00010         1 10000  0.00240
85  2450177-0024.F6   1 359 ttc    tgc  0.00022         4 18461  0.00317
104 2450177-0025.F1   1 244 atg    gtg  0.00010         1 10000  0.00687
161 2450177-0027.F0   1 248 ctg    gtg  0.00010         1 10000  0.00172
206 2450177-0029.F4   1 315 act    att  0.00392       164 41835  0.00999
279 2450177-0034.F2   1 351 atg    acg  0.00026        11 41782  0.00850
289 2450177-0034.F6   1 299 gtg    ttg  0.00028        12 42241  0.00450
    countQAtP SQAtP     pval mutation       limsID barcode Sample.ID     key2
42         10  2132 6.28e-08    g250e 2450177-0021      F1         2  2 g250e
44          7  2918 1.94e-04    l248v 2450177-0021      F1         2  2 l248v
85          4  1263 9.61e-04    f359c 2450177-0024      F6        15 15 f359c
104         8  1164 1.17e-07    m244v 2450177-0025      F1        10 10 m244v
161         7  4074 1.02e-03    l248v 2450177-0027      F0        17 17 l248v
206        19  1901 7.25e-04    t315i 2450177-0029      F4        66 66 t315i
279         5   588 2.03e-06    m351t 2450177-0034      F2        89 89 m351t
289         9  2002 1.45e-07    v299l 2450177-0034      F6        93 93 v299l
----

There is the possibility that a 6.0e-08 threshold is too stringent and
that a larger, more-lenient threshold would have good
performance. However, we would have to validate a larger number of
predictions to see if they are true or false positives. Also, we can
possibly boost statistics by combining technical repeat runs together.

<h2>novel variants</h2>

Taking the more strict pvalue threshold of 6e-08, I see 485 novel
variants (327 non-silent) that occur in BOTH technical repeat chips
occuring with abundances 0.2% to
86.15%. <a href="novelPacBioVariants.tsv.xls">novelPacBioVariants.tsv.xls</a>

Here is a large abundance novel variant in both technical repeats:
                         key one poi coi Qcodon fracNull countNull SNull
98846  2450177-0034.F2 t240a   1 240 acg    gcg  0.00015         6 41248
104291 2450177-0035.F2 t240a   1 240 acg    gcg  0.00015         6 41248
       fracQAtP countQAtP SQAtP pval mutation       limsID barcode Sample.ID
98846   0.86150       734   852    0    t240a 2450177-0034      F2        89
104291  0.86089       854   992    0    t240a 2450177-0035      F2        89
           key2
98846  89 t240a
104291 89 t240a

A next step is to validate these novel variants to see if they are
real.

One can also allow for significant variants that don't occur in both
technical repeat chips. This might identify some very low abundance
variants.

----
Note that in order to simplify results, I took the most significant
technical repeat and multiple codon encoding.

Here is a case where a f317l is spread across multiple codons:

                      key SampleID TestGroup PtInitials CollDate  Muts Frac
250 2450177-0036.F2 f317l       97         4        eec 27/12/05 f317l  30%
251 2450177-0036.F2 f317l       97         4        eec 27/12/05 f317l  30%
252 2450177-0036.F2 f317l       97         4        eec 27/12/05 f317l  30%
253 2450177-0036.F2 f317l       97         4        eec 27/12/05 f317l  30%
              runbc one poi coi Qcodon fracNull countNull SNull fracQAtP
250 2450177-0036.F2   1 317 ttc    tta    1e-04         2 19818  0.38975
251 2450177-0036.F2   1 317 ttc    ctc    1e-04         1 10000  0.20143
252 2450177-0036.F2   1 317 ttc    cta    1e-04         1 10000  0.00358
253 2450177-0036.F2   1 317 ttc    ttg    1e-04         1 10000  0.00596
    countQAtP SQAtP      pval mutation       limsID barcode Sample.ID
250       327   839  0.00e+00    f317l 2450177-0036      F2        97
251       169   839 6.30e-180    f317l 2450177-0036      F2        97
252         3   839  1.76e-03    f317l 2450177-0036      F2        97
253         5   839  1.58e-05    f317l 2450177-0036      F2        97

In this case, a single codon = 39% abundance but two codons=59%
abundance.

<h2>compound mutations</h2>

We can look at how mutations occur compounded together because we
sequence a continuous 862-base region from single molecules.

Here is a summary of all compound mutations: <a href="bcrabl.compoundvariants.runs20-21c24-37.tsv.xls">bcrabl.compoundvariants.runs20-21c24-37.tsv.xls</a>

For this summary, any mutation seen 10 or less times is not trusted
and discarded from compound mutation analysis.

Among 47 samples where >1 mutation was detectable by direct sequencing
or MassARRAY, SMRT sequencing revealed that 40 (85%) had compound
mutations detectable at a frequency of >1%.  In total, we detected 73
different compound mutations at a frequency of ≥1%.  In all cases
where compound mutations were detected and more than one treatment
timepoint was available, at least one compound mutation clearly
evolved from a mutation detectable at a prior timepoint.  In the most
complex case, 4 separate mutations yielded 8 different mutant alleles.

<h2>time evolution</h2>

Patient CSY has 3 time points: F3=22/1/08, F2=28/3/06, F1=21/3/05 in
run 2450177-0032.

Show all compound mutations greater than 1% abundance:

---- F1:
26	0.010874	p230p.cca,f359c.tgc
606	0.253450	
1325	0.554161	f359c.tgc

---- F2:
58	0.023529	t315i.att,f359c.tgc
334	0.135497	
1006	0.408114	f359c.tgc

---- F3:
21	0.010479	t315i.att,a350a.gct,e352d.gat
21	0.010479	t315i.att,a350a.gct
36	0.017964	t315i.att,e352d.gat,f359c.tgc
65	0.032435	
78	0.038922	t315i.att,a350a.gct,f359c.tgc
118	0.058882	t315i.att,a350a.gct,e352d.gat,f359c.tgc
140	0.069860	f359c.tgc
244	0.121756	t315i.att
903	0.450599	t315i.att,f359c.tgc

In 2005 there is a f359c mutation (55%). In 2006, there is also a 2%
t315i.att,f359c.tgc. In 2008, t315i.att,f359c.tgc domainates (45%),
followed by t315i (12%), and f359c (7%).

Perhaps this could have been caught in 2006 rather than 2008.

<h2>t315i compound mutations</h2>

Here are all mutations that compound with t315i at greater than 1%,
where the integer indicates the number of chips runs where the
compound mutation was observed.

      1 d276e.gag,t315i.att
      1 t315i.att,a350a.gct
      1 t315i.att,d482d.gat
      2 e255v.gtg,t315i.att
      2 g250e.gag,t315i.att
      2 t315i.att,a350a.gct,e352d.gat
      2 t315i.att,a350a.gct,e352d.gat,f359c.tgc
      2 t315i.att,a350a.gct,f359c.tgc
      2 t315i.att,e352d.gat,f359c.tgc
      2 t315i.att,f359v.gtc
      2 t315i.att,h396r.cgt
      2 t315i.att,m351t.acg
      2 t315i.att,m458i.att
      2 t315i.att,s417y.tac
      2 t315i.att,w478l.ttg
      2 y253h.cac,t315i.att
      4 e255k.aag,t315i.att
      4 e279k.aag,t315i.att
      4 t315i.att,f359c.tgc
     29 t315i.att

We can also highlight the most commonly independently observed
mutations in the 2HYY structure:

<img width="600px" src="2hyy.png">

You can also look at how samples and variants co-occur:
<img width="600px" src="heatmap.vartab.png">

<h2>future: mutual information, structural variants, CLIA test</h2>

To determine the dependence on one variant to another, we can compute
mutual information between all pairs. This quantifies whether one
variant always occurs with another but might be hindered by smaller
sample size.

This analysis examined amino-acid variants to a known reference. We
have also observed more structural events like ~150-base deletions in
the reads. A characterization of these structural events to clinical
status might be very interesting.

The PacBio assay appears to have sensitivity comparable to SEQUENOM
and SEQUENCING and additionally gives the entire 862-base
domain. Would this be a valuable clinical CLIA-type assay?