|
Array-based Genome Comparison of Arabidopsis Ecotypes Using Hidden Markov Models |
Seifert, M., Banaei, A., Keilwagen, J., Mette, M.F., Houben, A., Roudier, F., Colot, V., Grosse, I., and Strickert, M. (2009): Array-based Genome Comparison of Arabidopsis Ecotypes Using Hidden Markov Models, Biosignals 2009, Porto, Portugal.
Three-state HMM with Gaussian emission densities for the analysis of Array-CGH data. States of the HMM are represented by circles labeled '-' (decreased), '=' (unchanged), and '+' (increased), modeling the copy number of DNA segments in an ecotype relative to a reference genome. Arrows between states represent all possible transitions in an Array-CGH profile. Each state is characterized by a Gaussian emission density: the density of the unchanged state (gray) has a mean close to zero, whereas the densities of the decreased state (green) and the increased state (red) have means clearly different from zero. |
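The model described above can be illustrated with a minimal Python sketch of Viterbi decoding for a three-state HMM with Gaussian emissions. This is not the TilingArrayAnalyzer implementation. The start distribution, means, and standard deviations are the initial values from the training call in the case study on this page; the transition matrix `TRANS` is a purely illustrative assumption, and all function names are hypothetical.

```python
import math

STATES = ["-", "=", "+"]
# Initial values taken from the case-study training call on this page.
START = [0.2, 0.75, 0.05]
MEANS = [-2.5, 0.0, 1.5]
STDS = [1.0, 1.0, 0.5]
# Assumption: illustrative transition matrix with strong self-transitions.
TRANS = [
    [0.90, 0.09, 0.01],
    [0.05, 0.90, 0.05],
    [0.01, 0.09, 0.90],
]


def log_gauss(x, mean, std):
    """Log density of a Gaussian with the given mean and standard deviation."""
    var = std * std
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def viterbi(log_ratios):
    """Return the most probable state label ('-', '=', '+') for each tile."""
    if not log_ratios:
        return []
    n = len(STATES)
    score = [
        math.log(START[s]) + log_gauss(log_ratios[0], MEANS[s], STDS[s])
        for s in range(n)
    ]
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for x in log_ratios[1:]:
        prev = score
        score, pointers = [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + math.log(TRANS[r][s]))
            pointers.append(best)
            score.append(
                prev[best] + math.log(TRANS[best][s])
                + log_gauss(x, MEANS[s], STDS[s])
            )
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]
```

Calling `viterbi` on a list of log-ratios ordered by chromosomal position returns one state label per tile; runs of identical labels correspond to segments.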
||
|
Overview |
Why is it interesting to compare genomes of different Arabidopsis thaliana ecotypes?
TilingArrayAnalyzer: Executable JAR-file for HMM-based analysis of tiling array data including
HMM training
Detection of copy number changes
GFF and Tab-delimited output
Array-CGH data set of ecotypes C24 and Columbia
Shows how to use the TilingArrayAnalyzer in combination with the Array-CGH data set.
Shows how a data set must be structured for the application of the TilingArrayAnalyzer.
|
Motivation |
Arabidopsis thaliana ecotypes have a broad geographic distribution around the world
Natural variation among ecotypes is expected to be reflected in their DNA, including changes in
Circadian Clock
Flowering Time
Pathogen Resistance
Drought Response
Freezing Tolerance
Transposon Distribution
Epigenetic Variation
Copy Number Changes between different ecotypes can be measured by Array Comparative Genomic Hybridization (Array-CGH) using whole genome tiling arrays
Detecting copy number changes in Array-CGH data with suitable bioinformatics approaches is essential for studying natural variation
The TilingArrayAnalyzer is an HMM-based method that allows you to detect these copy number changes
Exemplary comparison of segmentation results for DNA regions on chromosome 4 for ecotype C24 compared to Columbia. From left to right, separated by gray dashed lines: Region 1 [654,108 bp - 697,518 bp], Region 2 [1,305,320 bp - 1,324,132 bp], Region 3 [3,731,013 bp - 3,761,229 bp], and Region 4 [5,411,025 bp - 5,433,126 bp]. The two top plots show the segmentation results of the HMM approach for two interleaved arrays, Array 1 and Array 2. Green dots label tiles predicted by the HMM to have a decreased copy number, red dots label tiles predicted to have an increased copy number, and black dots label tiles predicted to have an unchanged copy number. Blue dashed lines highlight DNA segments significantly different from permuted data at a Score-value threshold of 0.01. The two bottom plots show the segmentation results of the segMNT algorithm for both arrays. Red dashed lines indicate that no segmentation was obtained. The two approaches yield quite different segmentations of these DNA regions. Here, the segMNT algorithm failed to identify segments with decreased or increased copy numbers, whereas the HMM approach clearly identifies segments with significantly decreased or increased copy numbers. In addition, these biologically interesting results are reproducible across both arrays.
|
Downloads |
TilingArrayAnalyzer 64-Bit (5.2 MB) and TilingArrayAnalyzer 32-Bit (5.2 MB)
Requires Java 1.6
Usage information
Unpack TilingArrayAnalyzer.tar.gz
Type: tar xvzf TilingArrayAnalyzer.tar.gz
Go into folder TilingArrayAnalyzer
Type: java -jar TilingArrayAnalyzer.jar
Make sure the 32-Bit version has enough heap space by using the Java flag -Xmx500m
|
Case Study |
Download the TilingArrayAnalyzer
Unpack the file TilingArrayAnalyzer.tar.gz
You obtain the folder TilingArrayAnalyzer containing
Folder: Analysis
Structured storage of analysis results
Folder: RawData
Storage of data sets including the data set of ecotype C24 and Columbia
JAR-File: TilingArrayAnalyzer.jar
TilingArrayAnalyzer
Here we analyze the data set 6486702.txt of ecotype C24 compared to Columbia stored in the directory TilingArrayAnalyzer/RawData
Training
Go to the directory TilingArrayAnalyzer
Start the TilingArrayAnalyzer
java -jar TilingArrayAnalyzer.jar -TRAINING -startDistribution 0.2 0.75 0.05 -stateDurationScalingFactor 0.025 -means -2.5 0.0 1.5 -stds 1 1 0.5 -ess 10 -scaleOfAprioriMeans 10000 1000 7500 -shapeOfStandardDeviations 20000 1 1000 -scaleOfStandardDeviations 1E-4 1E-4 1E-4 -output true -dataSetNumber 6486702
You obtain the HMM file in the directory TilingArrayAnalyzer/Analysis/HMM
HMM_6486702.txt
You obtain the GFF file of Viterbi annotations in the directory TilingArrayAnalyzer/Analysis/Score/ViterbiAnnotation
VibAn_6486702.gff
Use a GFF viewer such as SignalMap or the Integrative Genome Browser to check whether the results meet your expectations (red is state '+', black is state '=', and green is state '-')
If so, proceed to scoring; otherwise, try other parameter settings and rerun the training.
Scoring
Go to the directory TilingArrayAnalyzer
Start the TilingArrayAnalyzer
java -jar TilingArrayAnalyzer.jar -SCORING -output true -dataSetNumber 6486702 -numberOfPermutations 10 -scoreValueLevel 0.01
You obtain scores of '-' and '+' segments in the original data in the directory TilingArrayAnalyzer/Analysis/Score/OrgScores
Loss_6486702.txt
Gain_6486702.txt
You obtain tab-delimited files including the genome-wide segmentation into '-', '=', and '+' segments for the original data in the directory TilingArrayAnalyzer/Analysis/Score/Segments
Seg_6486702.txt
Seg_Extended_6486702.txt
You obtain scores of '-' and '+' segments for permuted data in the directory TilingArrayAnalyzer/Analysis/Score/H0
Score_H0_Loss_6486702.txt
Score_H0_Gain_6486702.txt
You obtain tab-delimited files for significantly changed '-' and '+' segments in the original data stored in the directory TilingArrayAnalyzer/Analysis/Score/PValueLists
LossSeg_6486702.txt
GainSeg_6486702.txt
You obtain the GFF file of significant '-' and '+' Viterbi segments in the directory TilingArrayAnalyzer/Analysis/Score/ViterbiAnnotation
SigSeg_6486702.txt
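The idea of comparing segment scores from the original data with scores from permuted data can be sketched as follows. This page does not specify how the TilingArrayAnalyzer computes segment scores or Score-values, so the sketch assumes a simple empirical definition: the fraction of permutation (H0) scores that are at least as large as the observed score. All function names are hypothetical.

```python
def empirical_score_value(observed, null_scores):
    """Fraction of null scores that are >= the observed score (assumed definition)."""
    if not null_scores:
        raise ValueError("at least one null score is required")
    exceeding = sum(1 for score in null_scores if score >= observed)
    return exceeding / len(null_scores)


def significant_segments(segment_scores, null_scores, level=0.01):
    """Indices of segments whose empirical score value is at most `level`."""
    return [
        index
        for index, score in enumerate(segment_scores)
        if empirical_score_value(score, null_scores) <= level
    ]
```

Under this assumed definition, `level` plays the role of the value passed to `-scoreValueLevel`, and `null_scores` would be the scores collected over all permutations.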
|
General Usage |
The TilingArrayAnalyzer can be applied to data sets structured in the following manner:
Header line with the columns Chr, Pos, and Log-Ratio (as in the example table)
Chr: Chromosome where the tile is located
Pos: Position of the tile on the chromosome
Log-Ratio: Log-ratio of test vs. control
The column delimiter must be a tab character
No missing values
All rows in the data set file must be sorted first by increasing values for Chr and second by increasing values for Pos
Data set must be stored in the directory TilingArrayAnalyzer/RawData and its file name extension must be .txt
| Chr | Pos | Log-Ratio |
|---|---|---|
| chr1 | 108 | 1.87 |
| chr1 | 216 | -0.25 |
| chr1 | 324 | 1.37 |
| chr2 | 44 | -1.66 |
| chr2 | 88 | 0.66 |
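Before starting an analysis, a data set can be checked against the requirements listed above. The following Python sketch is one way to do this. It assumes the three-column layout of the example table (Chr, Pos, Log-Ratio), integer positions, and chromosome labels that end in a number (such as chr1, chr2); these are assumptions made for illustration, not documented behavior of the TilingArrayAnalyzer. The function names are hypothetical.

```python
import math
import re

EXPECTED_HEADER = ["Chr", "Pos", "Log-Ratio"]  # assumption: as in the example table


def chrom_key(label):
    """Sort key for chromosome labels; a numeric suffix is compared as a number."""
    match = re.search(r"(\d+)$", label)
    return (0, int(match.group(1)), "") if match else (1, 0, label)


def validate_dataset(lines):
    """Return a list of problems found in a tab-delimited data set."""
    problems = []
    rows = iter(lines)
    header = next(rows, "").rstrip("\n").split("\t")
    if header != EXPECTED_HEADER:
        problems.append(f"line 1: unexpected header {header}")
    previous = None
    for number, line in enumerate(rows, start=2):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3 or any(field.strip() == "" for field in fields):
            problems.append(f"line {number}: expected 3 non-empty tab-separated fields")
            continue
        try:
            pos = int(fields[1])
            log_ratio = float(fields[2])
        except ValueError:
            problems.append(f"line {number}: Pos must be an integer and Log-Ratio a number")
            continue
        if math.isnan(log_ratio):
            problems.append(f"line {number}: Log-Ratio is missing (NaN)")
            continue
        current = (chrom_key(fields[0]), pos)
        if previous is not None and current < previous:
            problems.append(f"line {number}: rows are not sorted by Chr, then Pos")
        previous = current
    return problems
```

A file can be checked with `validate_dataset(open(path))`, where `path` points to a .txt file in TilingArrayAnalyzer/RawData; an empty list means no problems were found.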
|
© Michael Seifert April 2009 |