manta生成的包含structural variants(SV)结构变异的注释vcf文件,通过染色体位置获得基因symbol名称
首先确定流程:
.vcf(包含起始位点,染色体)----> .annotated.vcf(包含基因名称)
通过流程可知:
我们需要bed文件。因为bed文件包含:
染色体序号,起,止位点,基因的symbol
创建虚拟环境
conda create -n bcftools
conda activate bcftools
安装软件tabix和bcftools:
conda install -c bioconda bcftools
conda install -c bioconda tabix
这时候直接敲bcftools,如出现报错,说明还不能正常使用bcftools:
error while loading shared libraries: libbz2.so.1.0: cannot open shared object file: No such file or directory
.so文件是动态库文件,库包含的是程序运行需要的函数库,libbz2.so是bzip2的库文件,那么下载一个bzip不就有了嘛
conda install -c conda-forge bzip2
安装完成之后再敲bcftools,出现了该软件的说明文档
数据准备:
# 用GFFutils里的gtf2bed将gtf文件转换为bed文件
gtf2bed gencode.v39lift37.annotation.gtf -o gencode.v39lift37.annotation.bed
# 用R将bed文件修改为以下顺序```
# CHROM FROM TO GENE STRAND INFO
> df <- read.table("gencode.v39lift37.annotation.bed",sep='\t',header=T)
> head(df)
GENE CHROM FROM TO STRAND INFO
1 DDX11L1 chr1 11869 14409 + gene
2 WASH7P chr1 14404 29570 - gene
3 MIR1302-2HG chr1 29554 31109 + gene
4 FAM138A chr1 34554 36081 - gene
5 OR4G4P chr1 52473 53312 + gene
6 OR4G11P chr1 57598 64116 + gene
> df <- df[,c(2,3,4,1,5,6)]
> head(df)
CHROM FROM TO GENE STRAND INFO
1 chr1 11869 14409 DDX11L1 + gene
2 chr1 14404 29570 WASH7P - gene
3 chr1 29554 31109 MIR1302-2HG + gene
4 chr1 34554 36081 FAM138A - gene
5 chr1 52473 53312 OR4G4P + gene
6 chr1 57598 64116 OR4G11P + gene
> write.table(df,"gencode.v39lift37.annotation2.bed",row.names=F,col.names=T,sep='\t',quote=F
压缩文件
bgzip /home/zhoukaiwen/database/GRCh37_hg19/gencode.v39lift37.annotation2.bed
tabix前的必须步骤
# tabix为bed文件建立索引,搜寻更快
tabix -pbed /home/zhoukaiwen/database/GRCh37_hg19/gencode.v39lift37.annotation2.bed.gz
# bcftools要求是.vcf.gz文件
bgzip somaticSV.vcf
注释:
bcftools annotate -a /home/zhoukaiwen/database/GRCh37_hg19/gencode.v39lift37.annotation2.bed.gz -c CHROM,FROM,TO,GENE -h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">') somaticSV.vcf.gz -o somaticSV.annotated.vcf
# 此时vcf文件info那一列即为基因的symbol
创建perl脚本批量运行
cd WES
for n in `ls manta|grep "IBC"`;do echo bcftools annotate -a /home/zhoukaiwen/database/GRCh37_hg19/gencode.v39lift37.annotation2.bed.gz -c CHROM,FROM,TO,GENE -h "<(echo '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">')" $n/results/variants/somaticSV.vcf.gz -o $n/results/variants/somaticSV.annotated.vcf>>manta/manta_annotate.sh;done
cd somaticSV
for n in `ls *.vcf.gz|sed 's/_somatic_SV_convertedInversion.vcf.gz//'`;do echo bcftools annotate -a /home/zhoukaiwen/database/GRCh37_hg19/gencode.v39lift37.annotation2.bed.gz -c CHROM,FROM,TO,GENE -h "<(echo '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">')" $n\_somatic\_SV\_convertedInversion.vcf.gz -o $n.annotated.vcf>>manta_annotate.sh;done
cat IBC??.annotated.vcf|grep -v "#"|grep "PASS">IBC_SomaticSV.annotated.vcf
cat nonIBC??.annotated.vcf|grep -v "#"|grep "PASS">nonIBC_SomaticSV.annotated.vcf
for n in `ls *.annotated.vcf|sed 's/\.annotated\.vcf//'`;do cat $n.annotated.vcf|grep "PASS">$n.annotated.filtered.vcf;done
# 有时候vcf文件不是chr开头而直接是数字,要先进行修改
vi addchr2vcf.sh
zcat example.vcf.gz | awk -F"\t" '{if ($0 !~ /^#/) {print "chr"$0} else{print $0}}' > example.chr.vcf
#再进行替换
for n in `ls *_filtered.vcf.gz|sed 's/.vcf.gz//'`;do cat addchr2vcf.sh|sed -e "s/example/$n/g" >> renameChr.sh;done