Convert Sequenza *segment.txt to GISTIC input

Gistic was designed for SNP6 array data. I saw many papers use it for whole exome sequencing data as well.

Input format for gistic:

segment file:

(1) Sample (sample name)
(2) Chromosome (chromosome number)
(3) Start Position (segment start position, in bases)
(4) End Position (segment end position, in bases)
(5) Num markers (number of markers in segment)
(6) Seg.CN (log2() -1 of copy number)

The conversion should be log2 (logarithm base 2) - 1, so that copy number 2 is 0.
Every segment start and end in the segments file should appear in the markers file, not the other way around.
when the copy number is 0 (a homozygous deletion of both copies). You can’t do a log2(0)-1, just put a small number e.g. -5

marker file:

https://groups.google.com/a/broadinstitute.org/forum/#!searchin/gistic-forum/marker$20file/gistic-forum/Vq9WWDiy7jU/BSFg2zmBZ1EJ
(1) Marker Name
(2) Chromosome
(3) Marker Position (in bases)
Note gistic2 does not require a marker file anymore.

sequenza gives a segment file. Segmentation was done by copynumber bioconductor package.
13 columns of the segments.txt file:
"chromosome" "start.pos" "end.pos" "Bf" "N.BAF" "sd.BAF" "depth.ratio" "N.ratio" "sd.ratio" "CNt" "A" "B" "LPP"
We only need the chromosome, start.pos, end.pos, N.BAF and depth.ratio columns.
The depth.ratio column is the GC content normalized ratio. a depth ratio of 1 means it has copy number of 2 (the same as the normal blood control in my case).
It is not log2(2^ depth.ratio) -1 rather log2(2 depth.ratio) - 1

I have a bunch of sgement files in the same folder.
add the sample name in the final column and do the log2 math in R.

library(tidyverse)
library(readr)
seg_files<- list.files(".", pattern = "*segments.txt", full.names = F) 

seg_dat_list <- lapply(seg_files, function(f) {
        dat<- read_tsv(f, col_names = T, col_types = cols(.default = col_character()))
        sample<- gsub("_vs_.*segments.txt", "", f)
        dat$sample<- sample
        return(dat)
})

seg_dat <- do.call(rbind, seg_dat_list)

gistic_input<- seg_dat %>% select(sample, chromosome, start.pos, end.pos, N.BAF, depth.ratio) %>% mutate(depth.ratio = as.numeric(depth.ratio)) %>% mutate(depth.ratio = log2(2 * depth.ratio) -1)

write_tsv(gistic_input, "all_segments.txt")

Back to bash:

## marker file:

cat all_segments.txt | sed '1d' | cut -f2,3 > markers.txt
cat all_segments.txt | sed '1d' | cut -f2,4 >> markers.txt

## sort the files by chromosome, take the unique ones and number the markers.

cat markers.txt | sort -V -k1,1 -k2,2nr | uniq | nl > markers_gistic.txt

Run gistic
modify the gistic2 script a bit. e.g. change MCR_ROOT folder path

#!/bin/sh
## set MCR environment and launch GISTIC executable

## NOTE: change the line below if you have installed the Matlab MCR in an alternative location
MCR_ROOT=/scratch/genomic_med/apps/Matlab_Complier_runTime
MCR_VER=v83

echo Setting Matlab MCR root to $MCR_ROOT

## set up environment variables
LD_LIBRARY_PATH=$MCR_ROOT/$MCR_VER/runtime/glnxa64:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=$MCR_ROOT/$MCR_VER/bin/glnxa64:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=$MCR_ROOT/$MCR_VER/sys/os/glnxa64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH
XAPPLRESDIR=$MCR_ROOT/$MCR_VER/MATLAB_Component_Runtime/v83/X11/app-defaults
export XAPPLRESDIR

## launch GISTIC executable
./gp_gistic2_from_seg $@

一	二	三	四	五	六	日
					1	2
3	4	5	6	7	8	9
10	11	12	13	14	15	16
17	18	19	20	21	22	23
24	25	26	27	28	29	30
31