At the end of this demo, your browser will look like the TCGA Copy Number Profiles in the browser bookmark.

This example shows how different cancers have distinct fingerprints in their whole genome copy number profiles. For example, lung adenocarcinoma has copy number variation across the whole genome whereas for glioblastoma multiforme the copy number variation is concentrated in a few chromosomes.

The Brain Lower Grade Glioma dataset is the default dataset. It shows a segmented copy number variation profile after removing common germline copy number variation. Red indicates amplification while blue indicates deletion.

This dataset is automatically displayed in Heatmap mode, where each row of data across both maps corresponds to one sample. To change to Box Plot, use the drop down menu next to the dataset name above the heatmap image.

When zoomed out with a Box Plot, the median is represented by a black dotted curve and the outer quartiles are shown. When zoomed in, a box is drawn around the inner quartiles and the median becomes a line in the box. Color intensity of the outer quartiles is proportional to the deviation from zero. More information on box plots can be found in our user guide and on Wikipedia.
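As a rough illustration of what Box Plot mode summarizes, the sketch below computes the quartiles and median across samples at each genomic position. The copy number values and the function name column_quartiles are hypothetical, and the browser's own calculation may differ.

```python
from statistics import quantiles

def column_quartiles(rows):
    """Return (q1, median, q3) for each column of a samples-by-positions matrix."""
    summaries = []
    for column in zip(*rows):  # one column = one genomic position across samples
        q1, median, q3 = quantiles(column, n=4, method="inclusive")
        summaries.append((q1, median, q3))
    return summaries

# Hypothetical copy number values: 5 samples (rows) x 3 genomic positions (columns).
cnv = [
    [0.0, 0.8, -0.5],
    [0.1, 1.0, -0.4],
    [-0.1, 0.9, -0.6],
    [0.0, 1.2, -0.3],
    [0.2, 0.7, -0.7],
]
for q1, median, q3 in column_quartiles(cnv):
    print(f"q1={q1:.2f} median={median:.2f} q3={q3:.2f}")
```

Each row of the heatmap is one sample, so summarizing a column collapses all samples at that position into a single box.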

Next, add the breast invasive carcinoma copy number dataset. Click "Add datasets" in the thumbnails panel. From there open the TCGA breast invasive carcinoma group or search for "breast copy number". Click on the dataset named "BRCA copy number (delete germline cnv)" to display it. Close the dataset selection window by clicking the x at the top or the "Return to the Cancer Browser" button at the bottom.

Change this dataset to Box plot as well.

Repeat steps 1 and 2 for acute myeloid leukemia, glioblastoma multiforme and lung adenocarcinoma.

Type "chr1-chr22" in the search box at the top of the browser to remove chromosomes X and Y from view. Copy number variation for sex chromosomes is much more difficult to estimate than for autosomes, so it is usually best to remove them from view.

At the end of this demo, your browser will look like the PAM50 Subtypes bookmark.

Breast cancer research studies have used gene expression to classify invasive breast cancers into biologically and clinically distinct subtypes that have become known as Luminal A, Luminal B, HER2-enriched and Basal-like [1]. Subtype information has repeatedly been shown to be an independent predictor of survival in breast cancer when used in multivariate analyses with standard clinical-pathological variables. In 2009, Parker et al. derived a minimal gene set (PAM50) for classifying intrinsic subtypes of breast cancer [2]. The PAM50 gene set has high agreement in classification with larger intrinsic gene sets previously used for subtyping, and is now commonly employed.

References
1. PMID: 23035882
2. PMID: 19204204

We show the TCGA breast cancer subtypes defined by the PAM50 method. The gene expression profile is contrasted with the PAM50 classifications from the Nature 2012 TCGA publication (Luminal A, Luminal B, HER2-enriched, Basal-like, and Normal-like), as well as ER, PR and HER2 status. We will also examine differences in Kaplan-Meier survival curves between the different subtypes.

Additionally, we will examine the survivorship curves for patients categorized as triple negative (negative for ER status, PR status and HER2 status).

Click "Add datasets" in the thumbnails panel. From there open the TCGA breast invasive carcinoma group or search for "breast agilent". Click on the dataset named "BRCA gene expression (AgilentG4502A_07_3)" to display it. Close the dataset selection window by clicking the x at the top or the "Return to the Cancer Browser" button at the bottom.

This dataset shows gene expression data from the Agilent G4502A_07_3 microarray. Red indicates over expression while green indicates under expression.

Right now we are viewing the whole genome; in order to view individual genes, click the "Genes" toggle above the name of the dataset in the right viewing panel. In this view, genes are laid out next to each other, each given the same space.

The set of genes shown is the default PANCAN Mutated Genes. To change the genes shown, click on "Advanced Genesets" next to the gene entry box.

We are going to select a predefined set of genes, PAM50, which are the 50 genes that are used to call the subtypes. Click the "Select a Favorite" dropdown menu and select the "Breast Cancer:PAM50" gene set. This will replace the current geneset with the PAM50 geneset.

Close the dialog box.

To get a closer look at the PAM50 calls, click and drag vertically, either top to bottom or bottom to top, on the clinical heatmap to zoom in. To zoom out, click the zoom out icon at the top of the map.

Your browser should now look like this bookmark.

Click on the PAM50 subtype classification column to bring up a context menu that will allow you to change it to the first column. Click the "KM" button to view how the subtypes stratify patients in terms of overall survival.

Patients classified as Luminal A and Luminal B tend to have better survivorship for the first 3,000 days (~8 years), but survivorship continues to decline after that. Patients classified as Basal-like or HER2-enriched tend to have poor survival early on, but if they survive that long their prognosis improves.
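For readers curious how a Kaplan-Meier curve is built, here is a minimal sketch of the standard estimator. The follow-up times and outcomes are made up for illustration and are not TCGA values; the function name kaplan_meier is our own, and the browser's implementation may differ in its details.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with at least one death.

    durations: follow-up time in days for each patient.
    events: 1 if the patient died at that time, 0 if the patient was censored.
    """
    records = sorted(zip(durations, events))
    at_risk = len(records)
    survival = 1.0
    curve = []
    i = 0
    while i < len(records):
        time = records[i][0]
        deaths = 0
        removed = 0
        # Collect every patient whose follow-up ends at this time.
        while i < len(records) and records[i][0] == time:
            deaths += records[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((time, survival))
        at_risk -= removed  # deaths and censored patients both leave the risk set
    return curve

# Made-up follow-up data for illustration only.
days = [200, 500, 500, 900, 1500, 3000]
died = [1, 1, 0, 1, 0, 1]
for time, survival in kaplan_meier(days, died):
    print(f"day {time}: {survival:.3f}")
```

Running the function once per subtype gives one curve per group, which is what the KM plot overlays.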

Patients who are classified as triple negative are negative for ER status, PR status and HER2 status. While the browser contains these statuses individually, there is no clinical feature that shows whether a patient is negative for all 3 of these markers. To examine the Kaplan-Meier plot for triple negative patients we will have to build our own custom data that stratifies patients into triple negative and non-triple negative.

We will first download the current clinical data in view, manipulate it in a spreadsheet program (such as Microsoft Excel), and then upload it as custom data.

To download the data, click on the "Tools" button above the clinical heatmap and select "Download". Make sure that "Clinical data in view" is selected, which is just the clinical data in the heatmap right now, and click "ok". Choosing "Clinical data in cohort" will download data for all individuals in the TCGA breast cancer study, not just those for which there is genomic Agilent data. Choosing "Full processed dataset" will download all of the files that make up the dataset.

This will download the data as a .tsv, which means that all the values in the file are separated by a tab. Any spreadsheet program will be able to read this file.

In the end, we want a column next to the sample_ID that has a 1 if the patient is triple negative (i.e. negative for ER Status, PR status and HER2 status) and 0 if they are not triple negative. We have to use numerical values to indicate triple negative status as the website can only accept numerical clinical data, not categorical.
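If you would rather script this step than use a spreadsheet, the following sketch shows one way to derive the 1/0 column from the downloaded .tsv. The column names (sample_ID, ER_Status, PR_Status, HER2_Status), the sample IDs and the status strings are assumptions; check them against the header and values of your own download. The sketch also assumes that any value other than "negative", including a blank, counts as not triple negative.

```python
import csv
import io

def triple_negative_flag(er, pr, her2):
    """Return 1 if ER, PR and HER2 are all negative, otherwise 0."""
    statuses = (er, pr, her2)
    return int(all(s.strip().lower() == "negative" for s in statuses))

def label_rows(tsv_text, id_col, er_col, pr_col, her2_col):
    """Return (sample_id, flag) pairs from tab-separated clinical data."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (row[id_col], triple_negative_flag(row[er_col], row[pr_col], row[her2_col]))
        for row in reader
    ]

# Hypothetical download contents; real column names and sample IDs will differ.
example_tsv = (
    "sample_ID\tER_Status\tPR_Status\tHER2_Status\n"
    "SAMPLE-A\tNegative\tNegative\tNegative\n"
    "SAMPLE-B\tPositive\tNegative\tNegative\n"
    "SAMPLE-C\tNegative\tNegative\t\n"
)
labels = label_rows(example_tsv, "sample_ID", "ER_Status", "PR_Status", "HER2_Status")
for sample_id, flag in labels:
    print(f"{sample_id}\t{flag}")  # two columns, no header row
```

You may prefer to drop samples with a missing status rather than label them 0, since a blank does not tell you the patient is positive.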

If you are using Excel these links may be helpful: Overview of formulas, IF function, and Copy cell values, not formulas.

If you are using Google Sheets this link may be helpful: Functions and formulas.

Your final file should look like this. Please refer to the documentation of the spreadsheet program you are using for any help you may need reaching this state.

Now we are going to load the data from the file we created into the clinical heatmap. First, click on the "Tools" button and select "Upload". This will open a dialog box where you can copy and paste the first two columns of the spreadsheet, making sure to leave out the header of both columns.

Next, name the custom data "Triple Negative Status" and click "update". This will add the custom data we created to the clinical heatmap.

You can also look at the Kaplan Meier plot for your custom data. Click the KM button to see how the survivorship curves differ for triple negative patients versus non-triple negative patients.

At the end of this demo, your browser will look like the Figure 1 from Nature 2012 (pubmed:23000897) bookmark.

This example shows how to duplicate Figure 1 from the TCGA breast cancer Nature 2012 paper. It shows the relationship between copy number variation in the most mutated genes identified by the study, the gene expression driven breast cancer subtype classification (PAM50 subtypes) and the mutation status in 3 genes (PIK3CA, TP53 and GATA3).

From the paper:

"The luminal A subtype harboured the most significantly mutated genes, with the most frequent being PIK3CA (45%), followed by MAP3K1, GATA3, TP53, CDH1 and MAP2K4. .... Luminal B cancers exhibited a diversity of significantly mutated genes, with TP53 and PIK3CA (29% each) being the most frequent. The luminal tumour subtypes markedly contrasted with basal-like cancers where TP53 mutations occurred in 80% of cases and the majority of the luminal significantly mutated gene repertoire, except PIK3CA (9%), were absent or near absent. The HER2E subtype, which has frequent HER2 (i.e. ERBB2) amplification (80%), had a hybrid pattern with a high frequency of TP53 (72%) and PIK3CA (39%) mutations and a much lower frequency of other significantly mutated genes including PIK3R1 (4%)."

"A total of 773 breast tumours were assayed using Affymetrix 6.0 SNP arrays. Segmentation analysis and GISTIC were used to identify focal amplifications/deletions and arm-level gains and losses. These analyses confirmed all previously reported copy number variations and highlighted a number of significantly mutated genes including focal amplification of regions containing PIK3CA, EGFR, FOXA1 and HER2, as well as focal deletions of regions containing MLL3, PTEN, RB1 and MAP2K4; in all cases, multiple genes were included within each altered region. Importantly, many of these copy number changes correlated with mRNA subtype including characteristic loss of 5q and gain of 10p in basal-like cancers and gain of 1q and/or 16q loss in luminal tumours"

You can read the full paper at TCGA breast cancer Nature 2012 paper.

The brain lower grade glioma copy number dataset is already showing. To add the breast invasive carcinoma copy number dataset, click "Add datasets" in the thumbnails panel. From there open the TCGA breast invasive carcinoma group or search for "breast copy number". Click on the dataset named "BRCA copy number (delete germline cnv)" to display it. Close the dataset selection window by clicking the x at the top or the "Return to the Cancer Browser" button at the bottom.

This dataset shows a segmented copy number variation profile after removing common germline copy number variation. Red indicates amplification while blue indicates deletion.

The default view is of the whole genome; in order to view individual genes, click the "Genes" toggle above the name of the dataset in the right viewing panel. In this view, genes are laid out next to each other, each given the same space.

The set of genes shown is the default TCGA PanCan most mutated genes. To change the genes shown, enter the following genes in the gene box at the top of the page:

PIK3CA
ERBB2
TP53
MAP2K4
MLL3
CDKN2A
PTEN
RB1

Click "Go" to view these genes.

Let's add our own custom data to the clinical heatmap. This is going to be mutation data for three genes: PIK3CA, TP53 and GATA3. Let's start with the TP53 data.

The data format is a list of the sample or patient names on the left, and the data values on the right. If the sample contains a predicted non-silent somatic mutation in the gene, we mark it with a "1"; if not, then the sample is not included.
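To make the format concrete, the sketch below builds such a two-column list from a set of mutation calls. The record layout (sample, gene, effect), the effect labels and the sample IDs are hypothetical; the data files linked from this demo are already in the final format.

```python
def mutated_sample_lines(mutations, gene):
    """Return 'sample<TAB>1' lines for samples with a non-silent mutation in gene."""
    samples = sorted(
        {m["sample"] for m in mutations if m["gene"] == gene and m["effect"] != "silent"}
    )
    # Samples without a non-silent mutation are simply left out of the list.
    return [f"{sample}\t1" for sample in samples]

# Hypothetical mutation calls for illustration only.
calls = [
    {"sample": "SAMPLE-A", "gene": "TP53", "effect": "missense"},
    {"sample": "SAMPLE-B", "gene": "TP53", "effect": "silent"},
    {"sample": "SAMPLE-C", "gene": "PIK3CA", "effect": "missense"},
    {"sample": "SAMPLE-A", "gene": "TP53", "effect": "nonsense"},
]
print("\n".join(mutated_sample_lines(calls, "TP53")))
```

A sample with several non-silent mutations in the same gene still appears only once.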

To add custom data, click the "Tools" button and choose "Custom Annotations". This will open a window where you can copy and paste mutation data from the above TP53 data file into the box. Name the feature "TP53 mutation", and click "Update" to add it to the clinical heatmap.

Repeat this for the other two data files:

GATA3 data
PIK3CA data

There are some clinical features that are hidden from the current view that we want to add. Click the "More Features" button to show all the clinical features available for a dataset, including those hidden from view. Type "nature" in the search field at the top of the menu to show only features that have "nature" in their name. Click "Node--Coded (nature 2012)" and "Tumor--T1 (nature 2012)" to add them to the clinical heatmap.

Click on the "PAM50 array" clinical feature column and move it to the 1st position. Changing the feature order changes the sort order, as both heatmaps are sorted on the left-most clinical feature.

To finish, we need to hide some clinical features and change the order so that it matches the bookmark. Click on "sample type" and choose "hide" from the context menu. Then hide other clinical features until the clinical heatmap matches the bookmark.

Click the "KM" button to view how the PAM50 subtype classification stratifies patients in terms of overall survival. Move different features to the front to see how overall survival changes depending on how you group patients.

Somatic mutation data for other significantly mutated genes from the study have already been added to the bookmark as custom data. You can add them using the "More Features" button.

At the end of this demo, your browser will look like the TCGA Somatic Mutation Profiles in the browser bookmark.

Somatic mutation frequency profile (proportions view) of the significantly mutated genes identified by the TCGA Pan-Cancer AWG across 19 TCGA cancers.

The brain lower grade glioma copy number dataset is already showing. To add the breast invasive carcinoma mutation dataset, click "Add datasets" in the thumbnails panel. From there open the TCGA breast invasive carcinoma group or search for "breast mutation". Click on the dataset named "BRCA mutation" to display it. Close the dataset selection window by clicking the x at the top or the "Return to the Cancer Browser" button at the bottom.

This dataset shows mutations across the genome. Red indicates a non-silent mutation was found in the gene while white indicates that there was a silent mutation or no mutation was found.
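As a sketch of how this red/white encoding turns into a mutation frequency, the code below treats a non-silent mutation as 1 and anything else as 0, then computes the fraction of samples mutated per gene. The sample IDs and calls are hypothetical, the function name mutation_proportions is our own, and the sketch assumes every sample reports every gene.

```python
def mutation_proportions(status_by_sample):
    """Return {gene: fraction of samples with a non-silent mutation}.

    status_by_sample: {sample: {gene: 1 if non-silent mutation, else 0}}
    """
    totals = {}
    n_samples = len(status_by_sample)
    for genes in status_by_sample.values():
        for gene, mutated in genes.items():
            totals[gene] = totals.get(gene, 0) + mutated
    return {gene: count / n_samples for gene, count in totals.items()}

# Hypothetical calls: 1 = non-silent mutation, 0 = silent or no mutation.
status = {
    "SAMPLE-A": {"TP53": 1, "PIK3CA": 0},
    "SAMPLE-B": {"TP53": 1, "PIK3CA": 1},
    "SAMPLE-C": {"TP53": 0, "PIK3CA": 0},
    "SAMPLE-D": {"TP53": 1, "PIK3CA": 0},
}
for gene, fraction in mutation_proportions(status).items():
    print(f"{gene}: {fraction:.0%}")
```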

The dataset is automatically displayed in Heatmap mode, where each row of data across both maps corresponds to one sample. To change to Box Plot, use the drop down menu next to the dataset name above the heatmap image.

When zoomed out with a Box Plot, the median is represented by a black dotted curve and the outer quartiles are shown. When zoomed in, a box is drawn around the inner quartiles and the median becomes a line in the box. Color intensity of the outer quartiles is proportional to the deviation from zero. More information on box plots can be found in our user guide and on Wikipedia.

Repeat steps 1 and 2 for all other TCGA cancer types. Note that not all cancer types will have mutation data.

Right now we are viewing the whole genome; in order to view individual genes, click the "Genes" toggle above the name of the dataset in the right viewing panel. In this view, genes are laid out next to each other, each given the same space.

Note that by default the set of genes displayed is the most mutated genes as called by the TCGA Pancan Analysis Working Group.

At the end of this demo, your browser will look like the Prostate Cancer TMPRSS2-ERG Gene Fusion bookmark.

We are going to walk through the steps one of our researchers here at UCSC took to rediscover the TMPRSS2-ERG gene fusion event that is known to be present in approximately 50% of individuals with prostate cancer.

Reference:
1. TMPRSS2-ERG Fusion Gene Expression in Prostate Tumor Cells and Its Clinical and Biological Significance in Prostate Cancer Progression

We will first find the gene fusion in the copy number variation data. Then we will see if the fusion is being expressed in the exon expression data. To do this we will create a signature that indicates if the fusion is occurring and then visualize it in the exon expression data.

This example assumes you are familiar with the steps in the PAM50 demo.

The brain lower grade glioma copy number dataset is already showing. To add the TCGA prostate adenocarcinoma copy number dataset, click "Add datasets" in the thumbnails panel. From there open the TCGA prostate adenocarcinoma group or search for "prostate copy number". Click on the dataset named "PRAD copy number (delete germline cnv)" to display it. Close the dataset selection window by clicking the x at the top or the "Return to the Cancer Browser" button at the bottom.

This dataset shows a segmented copy number variation profile after removing common germline copy number variation. Red indicates amplification while blue indicates deletion.

ERG is a known oncogene. Let's see what is happening in the data for this gene. To zoom in, click in the search bar at the top of the screen and enter "ERG". Click on the auto-suggested gene of ERG and it will fill in the coordinates for the ERG gene. Hit enter to go to this position.

When looking at the ERG gene, it is interesting to see that for some patients, about half the gene has been deleted. To see the pattern more clearly, we are going to sort both heatmaps by the ERG gene by adding it to the clinical heatmap. To do this, we are going to open the signature menu by clicking the "Tools" button and selecting "Signatures" from the drop down.

The signatures menu lets you either add a gene signature, which is an algebraic expression on genes used to predict a clinical value such as survivorship (see some of the existing signatures in the Favorite drop-down menu), or add single or multiple genes to the clinical heatmap. This menu is very similar to the genesets menu.

Enter "ERG" in the gene expression text box, name the signature "ERG" and click update. This will add the average value for the ERG gene to the clinical heatmap. Looking at the heatmap we can estimate that about 25-30% of individuals in the dataset have this partial deletion.
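The idea of a single-gene signature can be sketched as follows: average the copy number values that fall within the gene for each sample, then sort the samples by that average. The values and sample IDs are hypothetical, and the exact way the browser averages a gene is an assumption here.

```python
from statistics import mean

def gene_average(values_by_sample):
    """Return {sample: mean of the values measured within the gene}."""
    return {sample: mean(values) for sample, values in values_by_sample.items()}

# Hypothetical copy number values across four positions within ERG.
erg_values = {
    "SAMPLE-A": [0.0, 0.0, -1.0, -1.0],  # about half the gene deleted
    "SAMPLE-B": [0.0, 0.0, 0.0, 0.0],    # no change
    "SAMPLE-C": [0.0, -1.0, -1.0, -1.0],
}
signature = gene_average(erg_values)

# Sorting by the signature groups the samples with a deletion together.
ordered = sorted(signature, key=signature.get)
print(ordered)
```

A partial deletion pulls the average below zero, which is why sorting on this column makes the pattern easier to see.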

Your browser should now look like this bookmark.

Since only part of the gene is missing, this suggests that this gene is actually being fused with another gene. To see what other gene it could be fused to we need to zoom out until we can see the other end of the deletion. To zoom out progressively, click the smaller zoom out icon above the genomic heatmap (the larger zoom out icon will zoom you out to the whole chromosome). Continue to zoom out until you see the other end of the deletion.

Now that we can see the other end of the deletion, click and drag to zoom in on that end. Below the chromosome ideogram (showing where you are in genome) there is the RefSeq track, pulled directly from the UCSC Genome Browser. If you hover over this track you can see that the deletion ends in the gene TMPRSS2.

To verify that patients that are missing half of the ERG gene are indeed also missing part of TMPRSS2, we are going to visualize both genes in genes mode. First, go into genes mode by clicking the "Genes" button in the upper left corner. Then enter "TMPRSS2 ERG" into the genes box and click "Replace". Looking at these genes, we can indeed see that patients that are missing part of one gene are missing part of the other.

Your browser should now look like this bookmark.

Looking at the CNV data we can hypothesize that ERG is being fused with TMPRSS2, especially since they are on the same strand. TMPRSS2 is expressed in prostate cells and so this could lead to over-expression of ERG which is a known oncogene.

What we want to know is if this gene fusion is being expressed. We are going to build a signature that indicates if there is a gene fusion in the CNV data and then visualize this signature in the exon expression data.

If we make a signature that is the sum of these genes, the resulting value should tell us whether the gene fusion event is happening in the CNV data. If the value is negative, then there is a deletion in at least one of the genes, if not both. To keep this example simple, we are going to assume that a negative value means there is a deletion; as we will soon see, this is a reasonable assumption.

Open the signatures menu again and put "TMPRSS2 + ERG" in the gene expression box. Name the signature "Gene fusion CNV signature" and click "Update".
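A minimal sketch of what an additive signature such as "TMPRSS2 + ERG" computes is shown below. The per-gene values and sample IDs are hypothetical, and the function names are our own rather than part of the browser.

```python
def additive_signature(values_by_sample, genes):
    """Return {sample: sum of the per-gene values for the listed genes}."""
    return {
        sample: sum(values[gene] for gene in genes)
        for sample, values in values_by_sample.items()
    }

def has_deletion(signature_value):
    """Simplifying assumption from this demo: a negative sum means a deletion."""
    return signature_value < 0

# Hypothetical per-gene average copy number values.
cnv_by_gene = {
    "SAMPLE-A": {"TMPRSS2": -0.5, "ERG": -0.5},
    "SAMPLE-B": {"TMPRSS2": 0.0, "ERG": 0.0},
    "SAMPLE-C": {"TMPRSS2": 0.0, "ERG": -0.25},
}
fusion_signature = additive_signature(cnv_by_gene, ["TMPRSS2", "ERG"])
for sample, value in fusion_signature.items():
    print(sample, value, has_deletion(value))
```

Because the values are summed, a gain in one gene could offset a loss in the other, which is one reason the negative-means-deletion rule is only a simplification.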

If we opened the exon expression dataset right now, the signature would be automatically populated to that dataset. However the signature for each dataset is calculated based on the genomic data from that dataset. In order to visualize the signature from the CNV data in the exon expression dataset we are going to need to upload it ourselves.

To download the data, click on the "Tools" button above the clinical heatmap and select "Download". Make sure that "Clinical data in view" is selected and click "ok". Open the file in a spreadsheet program like Microsoft Excel.

Since signatures are automatically added to every new dataset that is opened, we should delete the signatures to make the heatmaps cleaner. Delete the signatures we've added from the active signature list by clicking the "x". Note that a signature will still be available from the Favorite menu even after it has been deleted.

To add the exon expression dataset, click "Add datasets" in the thumbnails panel. From there open the TCGA prostate adenocarcinoma group or search for "prostate expression". Click on the dataset named "PRAD exon expression (IlluminaHiSeq)" to display it. Close the dataset selection window by clicking the x at the top or the "Return to the Cancer Browser" button at the bottom.

This dataset shows gene expression data from the Illumina HiSeq 2000 RNA Sequencing platform. Red indicates over expression while green indicates under expression.

Notice that it automatically opens to the same genes as the dataset before.

Click the "Tools" button on the expression dataset and select "Upload" from the drop down. Paste in the first two columns from the download from the CNV data into the dialog box. Name it "Gene fusion CNV signature" and click "Update".

We can see that for negative values of this signature (i.e. where there was a gene fusion event), there is indeed expression of the ERG gene (red is over expression and green is under expression). Thus, patients who have this gene fusion event are indeed expressing it in their tumors.
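The same comparison can be sketched numerically by matching samples between the two datasets by ID and averaging ERG expression within each group. All values and sample IDs here are hypothetical, and the function names are our own.

```python
from statistics import mean

def mean_or_none(values):
    """Return the mean, or None when a group has no samples."""
    return mean(values) if values else None

def erg_expression_by_fusion(signature, erg_expression):
    """Return (mean ERG expression with fusion, mean ERG expression without).

    signature: {sample: CNV signature value}; negative is treated as a fusion.
    erg_expression: {sample: ERG expression value}.
    """
    fused, other = [], []
    for sample, value in signature.items():
        if sample not in erg_expression:
            continue  # sample has CNV data but no expression data
        group = fused if value < 0 else other
        group.append(erg_expression[sample])
    return mean_or_none(fused), mean_or_none(other)

# Hypothetical values for illustration only.
cnv_signature = {"SAMPLE-A": -1.0, "SAMPLE-B": 0.0, "SAMPLE-C": -0.5, "SAMPLE-D": 0.0}
erg_expr = {"SAMPLE-A": 2.0, "SAMPLE-B": -1.0, "SAMPLE-C": 3.0}
fused_mean, other_mean = erg_expression_by_fusion(cnv_signature, erg_expr)
print(f"with fusion: {fused_mean}, without: {other_mean}")
```

Matching on sample ID matters because the signature was computed from the CNV dataset, and not every sample in one dataset is guaranteed to appear in the other.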