InferCNV + UPhyloplot2 构建单细胞进化树

1. 下载UPhyloplot2软件

UPhyloplot2 Github官网： https://github.com/harbourlab/UPhyloplot2

参考文档： https://www.jianshu.com/p/38280bda882a

2. 运行 InferCNV

# 构建infercnv_obj软件报的提示：
# Please use "options(scipen = 100)" before running infercnv if you are using the analysis_mode="subclusters" option or you may encounter an error while the hclust is being generated.
# 一定要设定为random_trees运行，否则自动选择Leiden clustering，但是这样运行巨慢，跑1个树的分支就要10h，应用后台R nohup运行
options(scipen = 100)
infercnv_obj = infercnv::run(infercnv_obj,
                             cutoff=0.1, # cutoff=1 works well for Smart-seq2, and cutoff=0.1 works well for 10x Genomics
                             out_dir="InferCNV_MPT01_UPhyloplot2",
                             cluster_by_groups=FALSE, 
                             analysis_mode="subclusters", #默认是"samples"
                             denoise=TRUE,
                             write_expr_matrix = T,
                             HMM=TRUE,
                             plot_steps=T,
                             scale_data=T,
                             noise_filter=0.12,
                             HMM_type='i6',
                             num_threads=8,
                             hclust_method='ward.D2',
                             tumor_subcluster_partition_method='random_trees',
                             tumor_subcluster_pval=0.05)

3. InferCNV跑完后的数据预处理

输出结果有几个文件在后面画进化树会用到：

画进化树需要: 17_HMM_predHMMi6.rand_trees.hmm_mode-subclusters.cell_groupings（包含了根据CNV分类的结果，一共两列，一列是类别名称，共8类，但是有一类是参考细胞，所以要去掉参考，剩下7类；另一列是细胞编号。）
注释进化树的分支需要:
① HMM_CNV_predictions.HMMi6.rand_trees.hmm_mode-subclusters.Pnorm_0.5.pred_cnv_regions.dat
② HMM_CNV_predictions.HMMi6.rand_trees.hmm_mode-subclusters.Pnorm_0.5.pred_cnv_genes.dat
注意这两个文件里state的意义：
State 1: 0x: complete loss
State 2: 0.5x: loss of one copy
State 3: 1x: neutral
State 4: 1.5x: addition of one copy
State 5: 2x: addition of two copies
State 6: 3x: essentially a placeholder for >2x copies but modeled as 3x

# 去掉画进化树需要的文件的参考细胞的行
sed '/^all_references/d' <  17_HMM_predHMMi6.rand_trees.hmm_mode-subclusters.cell_groupings > trimmed_infercnv.cell_groupings

# 处理后添加header: cell_group_name  cell
cat 17_HMM_predHMMi6.rand_trees.hmm_mode-subclusters.cell_groupings| grep "all_observations" > trimmed_infercnv.cell_groupings

# 去掉注释进化树需要的文件的参考细胞的行
# genes.dat文件处理后添加header: cell_group_name   gene_region_name    state   gene    chr start   end
# regions.dat文件处理后添加header: cell_group_name cnv_name    state   chr start   end

cat 17_HMM_predHMMi6.rand_trees.hmm_mode-subclusters.pred_cnv_genes.dat|grep "all_observations" > trimmed_HMM_predHMMi6.rand_trees.hmm_mode-subclusters.pred_cnv_genes.dat

cat 17_HMM_predHMMi6.rand_trees.hmm_mode-subclusters.pred_cnv_regions.dat|grep "all_observations" > trimmed_HMM_predHMMi6.rand_trees.hmm_mode-subclusters.pred_cnv_regions.dat

4. 绘图

使用的时候，将主程序uphyloplot2.py和文件夹Inputs放在一起，上面提到cell_groupings文件放到Inputs文件夹里面。UPhyloplot2 将生成一个“output.svg”矢量图形图。此外，它将生成一个名为“CNV_files”的新文件夹，其中包含每个输入的 CNV 文件，其中包含第 1 列中由 inferCNV 标识的亚克隆 ID、第 2 列中每个亚克隆的细胞百分比以及标记亚克隆的字母第 3 列中的 output.svg 文件。

# 在文件夹下直接运行即可
python uphyloplot2.py

5. 注释图片

筛选每个克隆独特的CNV区域

# unique_cnv_region_for_cellgroup.py
import pandas as pd

# Load the data into a pandas DataFrame
file_path = './trimmed_HMM_predHMMi6.rand_trees.hmm_mode-subclusters.pred_cnv_regions.dat'
df = pd.read_csv(file_path, sep='\t')

# Extract unique CNV regions for each cell group
unique_cnv_regions = df.groupby('cell_group_name')['cnv_name'].unique().reset_index()

# Convert the unique CNV regions to a dictionary for easy viewing
unique_cnv_dict = unique_cnv_regions.set_index('cell_group_name').to_dict()['cnv_name']

# Print the unique CNV regions for each cell group
for cell_group, cnv_regions in unique_cnv_dict.items():
    print(f"Cell Group: {cell_group}")
    print(f"Unique CNV Regions: {', '.join(cnv_regions)}")
    print()

筛选每个克隆独特的CNV 基因

# unique_cnv_gene_for_cellgroup.py
import pandas as pd

# Load the data into a pandas DataFrame
file_path = './trimmed_HMM_predHMMi6.rand_trees.hmm_mode-subclusters.pred_cnv_genes.dat'
df = pd.read_csv(file_path, sep='\t')

# Extract unique CNV regions for each cell group
unique_cnv_genes = df.groupby('cell_group_name')['gene'].unique().reset_index()

# Convert the unique CNV regions to a dictionary for easy viewing
unique_cnv_dict = unique_cnv_genes.set_index('cell_group_name').to_dict()['gene']

# Print the unique CNV regions for each cell group
for cell_group, cnv_genes in unique_cnv_dict.items():
    print(f"Cell Group: {cell_group}")
    print(f"Unique CNV Genes: {', '.join(cnv_genes)}")
    print()

# 运行并输出结果：
python unique_cnv_region_for_cellgroup.py > unique_cnv_region_for_cellgroup.txt

python unique_cnv_gene_for_cellgroup.py > unique_cnv_gene_for_cellgroup.txt

一	二	三	四	五	六	日
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31