{"id":566,"date":"2025-06-25T10:51:07","date_gmt":"2025-06-25T02:51:07","guid":{"rendered":"https:\/\/www.kz-hub.tech\/?p=566"},"modified":"2025-07-10T10:30:26","modified_gmt":"2025-07-10T02:30:26","slug":"%e4%bd%bf%e7%94%a8scallele%e8%ae%a1%e7%ae%97cell-level-snv","status":"publish","type":"post","link":"https:\/\/www.kz-hub.tech\/index.php\/2025\/06\/25\/%e4%bd%bf%e7%94%a8scallele%e8%ae%a1%e7%ae%97cell-level-snv\/","title":{"rendered":"\u4f7f\u7528scAllele\u8ba1\u7b97cell-level SNV"},"content":{"rendered":"<h3>1. \u4e0b\u8f7d\u5e76\u5b89\u88c5scAllele<\/h3>\n<p>\u5b98\u7f51\uff1a<a href=\"https:\/\/github.com\/gxiaolab\/scAllele\">https:\/\/github.com\/gxiaolab\/scAllele<\/a><\/p>\n<pre><code>conda create -n scallele\nmamba install -c bioconda -c conda-forge python=3.10 pysam samtools bamtools\n\npip install scallele<\/code><\/pre>\n<h3>2. \u6309\u7167Barcode\u5206\u5272\u6bd4\u5bf9\u540e\u5355\u7ec6\u80deBam\u6587\u4ef6(Cellranger out)<\/h3>\n<h4>2.1 \u65b9\u6cd5\u4e00<\/h4>\n<p>\u53c2\u8003\uff1a<a href=\"https:\/\/github.com\/pezmaster31\/bamtools\/issues\/135\">https:\/\/github.com\/pezmaster31\/bamtools\/issues\/135<\/a><\/p>\n<pre><code>ulimit -n 65535\nsamtools sort -t CB \/groups\/phyllodes\/home\/share\/Results\/scRNA\/cellranger_count\/FETB01-MPT-01\/outs\/possorted_genome_bam.bam  &gt; .\/CBsorted_tags.bam<\/code><\/pre>\n<p>split_script.py:<\/p>\n<pre><code>##### Code has not been tested on unsorted bam files, sort on barcode (CB):\n##### samtools sort -t CB unsorted.bam &gt; sorted_tags.bam\n###\n##### INPUT: .bam file to be sorted and output directory to place split BC\n##### OUTPUT: .bam file for each unique barcode, best to make a new directory\n\n### Python 3.6.8\nimport pysam\n\n### Input varibles to set\n# file to split on\nunsplit_file = &quot;\/path\/to\/dir\/of\/data\/sorted_tags.bam&quot;\n# where to place output files\nout_dir = &quot;\/path\/to\/dir\/of\/out_data\/&quot;\n\n# variable to hold barcode index\nCB_hold = &#039;unset&#039;\nitr = 0\n# read in upsplit file and loop reads by line\nsamfile = pysam.AlignmentFile( unsplit_file, &quot;rb&quot;)\nfor read in samfile.fetch( until_eof=True):\n    # barcode itr for current read\n    CB_itr = read.get_tag( &#039;CB&#039;)\n    # if change in barcode or first line; open new file  \n    if( CB_itr!=CB_hold or itr==0):\n        # close previous split file, only if not first read in file\n        if( itr!=0):\n            split_file.close()\n        CB_hold = CB_itr\n        itr+=1\n        split_file = pysam.AlignmentFile( out_dir + &quot;CB_{}.bam&quot;.format( itr), &quot;wb&quot;, template=samfile)\n\n    # write read with same barcode to file\n    split_file.write( read)\nsplit_file.close()\nsamfile.close()<\/code><\/pre>\n<pre><code>python split_script.py<\/code><\/pre>\n<h4>2.2 \u65b9\u6cd5\u4e8c<\/h4>\n<p>\u4e5f\u53ef\u5c1d\u8bd5\u76f4\u63a5\u7528bamtools\uff1a<\/p>\n<pre><code>cp \/groups\/phyllodes\/home\/share\/Results\/scRNA\/starsolo\/FETB01-MPT-01Aligned.sortedByCoord.out.bam .\/\nbamtools split -tag CB -in FETB01-MPT-01Aligned.sortedByCoord.out.bam<\/code><\/pre>\n<h4>2.3 \u65b9\u6cd5\u4e09<\/h4>\n<p><a href=\"https:\/\/github.com\/10XGenomics\/subset-bam\">https:\/\/github.com\/10XGenomics\/subset-bam<\/a><\/p>\n<pre><code>conda activate subset-bam\n\ngunzip -c \/groups\/phyllodes\/home\/share\/Results\/scRNA\/cellranger_count\/FETB01-MPT-01\/outs\/filtered_feature_bc_matrix\/barcodes.tsv.gz &gt; \/groups\/phyllodes\/home\/share\/Results\/scRNA\/cellranger_count\/FETB01-MPT-01\/outs\/filtered_feature_bc_matrix\/barcodes.tsv\nsubset-bam -b \/groups\/phyllodes\/home\/share\/Results\/scRNA\/cellranger_count\/FETB01-MPT-01\/outs\/possorted_genome_bam.bam -c \/groups\/phyllodes\/home\/share\/Results\/scRNA\/cellranger_count\/FETB01-MPT-01\/outs\/filtered_feature_bc_matrix\/barcodes.tsv.gz -o \/groups\/phyllodes\/home\/share\/Results\/scRNA\/scallele\/FETB01-MPT-01_bam<\/code><\/pre>\n<h4>2.4 \u65b9\u6cd5\u56db (\u8fd9\u4e2a\u624d\u80fd\u7528)<\/h4>\n<p><a href=\"https:\/\/github.com\/10XGenomics\/subset-bam\/issues\/17\">https:\/\/github.com\/10XGenomics\/subset-bam\/issues\/17<\/a><br \/>\n<a href=\"https:\/\/github.com\/aertslab\/single_cell_toolkit\/blob\/master\/subset_bam_per_cb.sh\">https:\/\/github.com\/aertslab\/single_cell_toolkit\/blob\/master\/subset_bam_per_cb.sh<\/a><br \/>\nsubset_bam_per_cb.sh:<\/p>\n<pre><code>#!\/bin\/bash\n#\n# Copyright (C) 2024 - Gert Hulselmans\n#\n# Purpose:\n#   Subset BAM file per cell barcode.\n\nset -e\nset -o pipefail\n\nsubset_bam_file_per_cb_chunk () {\n    local bam_file=&quot;${1}&quot;;\n    local barcodes_file=&quot;${2}&quot;;\n    local bam_output_prefix=&quot;${3%.}&quot;;\n\n    if [ ${#@} -ne 3 ] ; then\n        printf &#039;Usage:\\n&#039;;\n        printf &#039;  subset_bam_file_per_cb_chunk bam_file barcodes_file bam_output_prefix\\n\\n&#039;;\n        printf &#039;Arguments:\\n&#039;;\n        printf &#039;  bam_file:          BAM file to subset per provided cell barcode.\\n&#039;;\n        printf &#039;  barcodes_file:     File with cell barcodes to subset input BAM file.\\n&#039;;\n        printf &#039;  bam_output_prefix: Prefix used for output per CB BAM files.\\n&#039;;\n        return 1;\n    fi\n\n    if [ &quot;${bam_file##*.}&quot; != &quot;bam&quot; ] ; then\n        printf &#039;Error: BAM file should have &quot;.bam&quot; extension.\\n&#039;;\n        return 1;\n    fi\n\n    if [ ! -f &quot;${bam_file}&quot; ] ; then\n        printf &#039;Error: BAM file &quot;%s&quot; does not exist.\\n&#039; &quot;${bam_file}&quot;;\n        return 1;\n    fi\n\n    if [ ! -f &quot;${barcodes_file}&quot; ] ; then\n        printf &#039;Error: Barcodes file &quot;%s&quot; does not exist.\\n&#039; &quot;${barcodes_file}&quot;;\n        return 1;\n    fi\n\n    if [ ! -d $(dirname &quot;${bam_output_prefix}&quot;) ] ; then\n        printf &#039;Error: Directory &quot;%s&quot; does not exist.\\n&#039; &quot;$(dirname &quot;${bam_output_prefix}&quot;)&quot;;\n        return 1;\n    fi\n\n    # First extract all requested cell barcodes from BAM file\n    # before writing individual per CB BAM files.\n    samtools view -@ 4 -h -D CB:&quot;${barcodes_file}&quot; &quot;${bam_file}&quot; \\\n      | mawk \\\n        -v bam_output_prefix=&quot;${bam_output_prefix}&quot; \\\n        -F &#039;\\t&#039; \\\n        &#039;\n        BEGIN {\n            header = &quot;&quot;;\n            CB_tag_found = 0;\n        }\n        {\n            if ( $0 ~ \/^@\/ ) {\n                # Collect SAM header.\n                if ( header == &quot;&quot; ) {\n                    header = $0;\n                } else {\n                    header = header &quot;\\n&quot; $0;\n                }\n            } else {\n                CB_tag_found = 0;\n\n                # Search for CB tag.\n                for ( i = 12; i &lt;= NF; i++ ) {\n                    if ( substr($i, 1, 5) == &quot;CB:Z:&quot; ) {\n                        CB = substr($i, 6);\n                        CB_tag_found = 1;\n                        break;\n                    }\n                }\n\n                if ( CB_tag_found == 1 ) {\n                    if (! (CB in CBs_found) ) {\n                        # Write BAM header if this is the first time this CB was seen.\n                        print header | &quot;samtools view -O bam -o &quot; bam_output_prefix &quot;.&quot; CB &quot;.bam -&quot;;\n                        CBs_found[CB] = 1;\n                    }\n\n                    # Write read to CB specific BAM file.\n                    print $0 | &quot;samtools view -O bam -o &quot; bam_output_prefix &quot;.&quot; CB &quot;.bam -&quot;;\n                }\n            }\n        }\n        END {\n            # Close all file handles.\n            for ( CB in CBs_found ) {\n                close(&quot;samtools view -O bam -o &quot; bam_output_prefix &quot;.&quot; CB &quot;.bam -&quot;);\n            }\n        }\n        &#039;\n}\n\nsubset_bam_file_per_cb () {\n    local bam_file=&quot;${1}&quot;;\n    local barcodes_file=&quot;${2}&quot;;\n    local bam_output_prefix=&quot;${3%.}&quot;;\n\n    # Number of barcodes to process in one invocation of subset_bam_file_per_cb_chunk.\n    # This is used to limit the number of open file handles and the number of parallel\n    # spawned samtools processes.\n    local chunk_size=&quot;${4:-1000}&quot;;\n\n    if [ ${#@} -lt 3 ] ; then\n        printf &#039;Usage:\\n&#039;;\n        printf &#039;  subset_bam_file_per_cb bam_file barcodes_file bam_output_prefix [chunk_size]\\n\\n&#039;;\n        printf &#039;Arguments:\\n&#039;;\n        printf &#039;  bam_file:          BAM file to subset per provided cell barcode.\\n&#039;;\n        printf &#039;  barcodes_file:     File with cell barcodes to subset input BAM file.\\n&#039;;\n        printf &#039;  bam_output_prefix: Prefix used for output per CB BAM files.\\n&#039;;\n        printf &#039;  chunk_size:        Number of cell barcodes to process simultaniously.\\n&#039;\n        printf &#039;                     If more cell barcodes are provided, the input BAM file\\n&#039;;\n        printf &#039;                     will be read multiple times. It is recommended to not\\n&#039;;\n        printf &#039;                     set this value too high as the same number of parallel\\n&#039;;\n        printf &#039;                     spawned samtools processes will be created for writing\\n&#039;;\n        printf &#039;                     the per CB BAM files.\\n&#039;;\n        printf &#039;                     Default: 1000.\\n&#039;;\n        return 1;\n    fi\n\n    if [ &quot;${bam_file##*.}&quot; != &quot;bam&quot; ] ; then\n        printf &#039;Error: BAM file should have &quot;.bam&quot; extension.\\n&#039;;\n        return 1;\n    fi\n\n    if [ ! -f &quot;${bam_file}&quot; ] ; then\n        printf &#039;Error: BAM file &quot;%s&quot; does not exist.\\n&#039; &quot;${bam_file}&quot;;\n        return 1;\n    fi\n\n    if [ ! -f &quot;${barcodes_file}&quot; ] ; then\n        printf &#039;Error: Barcodes file &quot;%s&quot; does not exist.\\n&#039; &quot;${barcodes_file}&quot;;\n        return 1;\n    fi\n\n    if [ ! -d $(dirname &quot;${bam_output_prefix}&quot;) ] ; then\n        printf &#039;Error: Directory &quot;%s&quot; does not exist.\\n&#039; &quot;$(dirname &quot;${bam_output_prefix}&quot;)&quot;;\n        return 1;\n    fi\n\n    if [ $(cat ${barcodes_file} | wc -l) -le ${chunk_size} ] ; then\n        # If barcodes file contains less than ${chunk_size} lines, process all barcodes at once.\n        subset_bam_file_per_cb_chunk &quot;${bam_file}&quot; &quot;${barcodes_file}&quot; &quot;${bam_output_prefix}&quot;;\n        return $?;\n    else\n        # Split barcodes file in chunks of ${chunk_size} lines.\n        local subset_bam_file_per_cb_tmp_dir=$(mktemp -t -d subset_bam_file_per_cb_XXXXXX);\n        split -l ${chunk_size} -d &quot;${barcodes_file}&quot; ${subset_bam_file_per_cb_tmp_dir}\/barcodes_part\n\n        # Subset BAM file in per chunk of barcodes.\n        for barcodes_part_file in ${subset_bam_file_per_cb_tmp_dir}\/barcodes_part* ; do\n            subset_bam_file_per_cb_chunk &quot;${bam_file}&quot; &quot;${barcodes_part_file}&quot; &quot;${bam_output_prefix}&quot;;\n        done\n\n        # Remove temporary files.\n        rm ${subset_bam_file_per_cb_tmp_dir}\/barcodes_part*;\n        rmdir &quot;${subset_bam_file_per_cb_tmp_dir}&quot;;\n    fi\n}\n\nsubset_bam_file_per_cb &quot;${@}&quot;;<\/code><\/pre>\n<pre><code>conda create -n subset-bam\nmamba install -c bioconda -c conda-forge mawk samtools\n\n.\/subset_bam_per_cb.sh \/groups\/phyllodes\/home\/share\/Results\/scRNA\/cellranger_count\/FETB01-MPT-01\/outs\/possorted_genome_bam.bam \/groups\/phyllodes\/home\/share\/Results\/scRNA\/cellranger_count\/FETB01-MPT-01\/outs\/filtered_feature_bc_matrix\/barcodes.tsv \/groups\/phyllodes\/home\/share\/Results\/scRNA\/scallele\/FETB01-MPT-01\n\n# \u6784\u5efa\u6279\u91cf\u8fd0\u884c\u811a\u672c\uff1a\nfor d in \/groups\/phyllodes\/home\/share\/Results\/scRNA\/cellranger_count\/*\/; do s=$(basename &quot;$d&quot;); echo &quot;mkdir $s &amp;&amp; .\/subset_bam_per_cb.sh $d\/outs\/possorted_genome_bam.bam $d\/outs\/filtered_feature_bc_matrix\/barcodes.tsv \/groups\/phyllodes\/home\/share\/Results\/scRNA\/scallele\/$s\/$s&quot;; done &gt; SubSetBam.sh\n<\/code><\/pre>\n<h3>3. \u5bf9\u6bcf\u4e2abam\u6587\u4ef6\u8fd0\u884cscAllele\uff1a<\/h3>\n<p>\u6b64\u5904\u6ce8\u610f -o \u7684 prefix \u4e0d\u80fd\u4e3abam\u6587\u4ef6\u7684\u540d\u5b57\uff0c\u5426\u5219\u4f1a\u5220\u9664bam\u6587\u4ef6\uff08bug?\uff09<\/p>\n<pre><code>for f in *.bam; do base=${f%.bam}; echo &quot;samtools index $f &amp;&amp; scAllele -n 4 -b $f -g \/groups\/phyllodes\/home\/share\/Reference\/GRCh38_Ensembl_Ensembl104\/fasta\/genome.fa -o $base-scallele &amp;&amp; echo $base ok&quot;; done &gt; run_scAllele.sh<\/code><\/pre>\n<h3>4. \u5bf9\u6bcf\u4e2a\u6837\u672c\u6587\u4ef6\u5939\u4e0b\u7684vcf\u6587\u4ef6\u7b5b\u9009PASS\u7a81\u53d8\u5e76\u6dfb\u52a0header<\/h3>\n<pre><code>#!\/bin\/bash\n\n# Headers to add (with placeholder for SAMPLE)\nHEADER=&quot;##fileformat=VCFv4.2\n##FILTER=&lt;ID=PASS,Description=\\&quot;Variant passes all filters\\&quot;&gt;\n##FILTER=&lt;ID=LowQual,Description=\\&quot;Variant does not pass one or more filtering criteria\\&quot;&gt;\n##INFO=&lt;ID=varType,Number=1,Type=String,Description=\\&quot;Variant type\\&quot;&gt;\n##INFO=&lt;ID=varGroup,Number=1,Type=String,Description=\\&quot;Variant Group\\&quot;&gt;\n##INFO=&lt;ID=varLength,Number=1,Type=Integer,Description=\\&quot;Variant length\\&quot;&gt;\n##INFO=&lt;ID=varEnd,Number=1,Type=Integer,Description=\\&quot;End position of the variant\\&quot;&gt;\n##INFO=&lt;ID=TandemRep,Number=1,Type=Integer,Description=\\&quot;Count of repeat sequence adjacent to the variant\\&quot;&gt;\n##FORMAT=&lt;ID=DP,Number=1,Type=Integer,Description=\\&quot;Number of reads overlapping the variant position\\&quot;&gt;\n##FORMAT=&lt;ID=AC,Number=R,Type=Integer,Description=\\&quot;Number of reads containing the variant allele\\&quot;&gt;\n##FORMAT=&lt;ID=RC,Number=R,Type=Integer,Description=\\&quot;Number of reads containing the reference allele\\&quot;&gt;\n##FORMAT=&lt;ID=AB,Number=1,Type=Float,Description=\\&quot;Allelic balance of the variant\\&quot;&gt;\n##FORMAT=&lt;ID=GT,Number=1,Type=String,Description=\\&quot;Consensus genotype\\&quot;&gt;\n##FORMAT=&lt;ID=GQ,Number=1,Type=Integer,Description=\\&quot;Genotype quality\\&quot;&gt;\n##FORMAT=&lt;ID=RCL,Number=1,Type=Integer,Description=\\&quot;Read Cluster id\\&quot;&gt;\n#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE&quot;\n\n# Loop through all VCF files\nfor file in *scallele.vcf; do\n    echo &quot;Processing: $file&quot;\n\n    # 1. \u63d0\u53d6cell barcode\n    cell_barcode=$(echo &quot;$file&quot; | sed -r &#039;s\/^.*\\.([A-Z0-9\\-]+\\-[0-9]+)\\-scallele\\.vcf$\/\\1\/&#039;)\n    echo &quot;Cell Barcode: $cell_barcode&quot;\n\n    # 2. \u521b\u5efa\u65b0\u6587\u4ef6\u540d\uff08\u6dfb\u52a0 .pass \u540e\u7f00\uff09\n    new_file=&quot;${file%.vcf}.pass.vcf&quot;\n\n    # 3. \u66ff\u6362header\u4e2d\u7684SAMPLE\n    header_with_cell_barcode=$(echo &quot;$HEADER&quot; | sed &quot;s\/SAMPLE\/$cell_barcode\/&quot;)\n\n    # 4. \u521b\u5efa\u65b0\u6587\u4ef6\uff08\u5305\u542b\u65b0header\u548c\u7b5b\u9009\u7684PASS\u884c\uff09\n    # \u5148\u5199\u5165\u65b0header\n    echo &quot;$header_with_cell_barcode&quot; &gt; &quot;$new_file&quot;\n\n    # \u8ffd\u52a0\u7b5b\u9009\u7684PASS\u884c\uff08\u4fdd\u7559\u539fheader\u4f1a\u5e72\u6270\uff0c\u6240\u4ee5\u8df3\u8fc7\u6240\u6709\u6ce8\u91ca\u884c\uff09\n    awk &#039;!\/^#\/ &amp;&amp; $7 == &quot;PASS&quot;&#039; &quot;$file&quot; &gt;&gt; &quot;$new_file&quot;\n\n    echo &quot;Created new file: $new_file (contains only PASS variants)&quot;\ndone<\/code><\/pre>\n<h3>5. \u5bf9\u8fc7\u6ee4\u540e\u7684vcf\u8fdb\u884cAnnovar+vcf2maf<\/h3>\n<pre><code># annovar\nls *.pass.vcf | perl -ne &#039;chomp; my $name = $1 if ($_ =~ \/([^\\\/]+)\\.pass\\.vcf\/); print &quot;~\/software\/annovar\/table_annovar.pl $name.pass.vcf ~\/database\/Annovar\/humandb_hg38 -buildver hg38 -out $name.pass -remove -protocol refGene,cytoBand,avsnp151,exac03,gnomad41_exome,EAS.sites.2015_08,cosmic102_unique -operation g,r,f,f,f,f,f -nastring . -vcfinput \\n&quot;&#039;&gt;annovar.sh\n\n# vcf2maf<\/code><\/pre>\n<h3>6. \u5bf9\u540c\u4e00\u4e2a\u6837\u672c\u7684\u4e0d\u540c\u7ec6\u80de\u7684maf\u6587\u4ef6\u8fdb\u884c\u5408\u5e76<\/h3>\n<pre><code><\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>1. \u4e0b\u8f7d\u5e76\u5b89\u88c5scAllele \u5b98\u7f51\uff1ahttps:\/\/github.com\/gxiaolab\/scAllel&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-566","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/posts\/566","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/comments?post=566"}],"version-history":[{"count":18,"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/posts\/566\/revisions"}],"predecessor-version":[{"id":605,"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/posts\/566\/revisions\/605"}],"wp:attachment":[{"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/media?parent=566"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/categories?post=566"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/tags?post=566"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}