{"id":667,"date":"2025-12-26T15:11:55","date_gmt":"2025-12-26T07:11:55","guid":{"rendered":"https:\/\/www.kz-hub.tech\/?p=667"},"modified":"2025-12-26T15:13:44","modified_gmt":"2025-12-26T07:13:44","slug":"cellranger-mkref","status":"publish","type":"post","link":"https:\/\/www.kz-hub.tech\/index.php\/2025\/12\/26\/cellranger-mkref\/","title":{"rendered":"cellranger mkref: building a reference genome package"},"content":{"rendered":"<p>Official tutorial: <a href=\"https:\/\/www.10xgenomics.com\/support\/software\/cell-ranger\/latest\/tutorials\/cr-tutorial-mr\">https:\/\/www.10xgenomics.com\/support\/software\/cell-ranger\/latest\/tutorials\/cr-tutorial-mr<\/a><\/p>\n<p>create_count_ref_Ensembl_Gencode.sh<\/p>\n<pre><code># Genome metadata\ngenome=&quot;GRCh38_Ensembl_GENCODE48&quot;\nversion=&quot;Ensembl114&quot;\n\n# Set up source and build directories\nbuild=&quot;GRCh38_Ensembl_GENCODE48_build&quot;\nmkdir -p &quot;$build&quot;\n\nsource=&quot;\/data02\/zhangmengmeng\/database\/hg38&quot;\nfasta_in=&quot;${source}\/Homo_sapiens.GRCh38.114.dna.primary_assembly.fa&quot;\n#gtf_in=&quot;${source}\/Homo_sapiens.GRCh38.114.gtf&quot;\ngtf_in=&quot;${source}\/gencode.v48.primary_assembly.basic.annotation.gtf&quot;\n\n# Modify sequence headers in the Ensembl FASTA to match the file\n# &quot;GRCh38.primary_assembly.genome.fa&quot; from GENCODE. Unplaced and unlocalized\n# sequences such as &quot;KI270728.1&quot; have the same names in both versions.\n#\n# Input FASTA:\n#   &gt;1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF\n#\n# Output FASTA:\n#   &gt;chr1 1\nfasta_modified=&quot;$build\/$(basename &quot;$fasta_in&quot;).modified&quot;\n# sed commands:\n# 1. Replace metadata after space with original contig name, as in GENCODE\n# 2. Add &quot;chr&quot; to names of autosomes and sex chromosomes\n# 3. 
Handle the mitochondrial chromosome\ncat &quot;$fasta_in&quot; \\\n    | sed -E &#039;s\/^&gt;(\\S+).*\/&gt;\\1 \\1\/&#039; \\\n    | sed -E &#039;s\/^&gt;([0-9]+|[XY]) \/&gt;chr\\1 \/&#039; \\\n    | sed -E &#039;s\/^&gt;MT \/&gt;chrM \/&#039; \\\n    &gt; &quot;$fasta_modified&quot;\n\n# Remove version suffix from transcript, gene, and exon IDs in order to match\n# previous Cell Ranger reference packages\n#\n# Input GTF:\n#     ... gene_id &quot;ENSG00000223972.5&quot;; ...\n# Output GTF:\n#     ... gene_id &quot;ENSG00000223972&quot;; gene_version &quot;5&quot;; ...\ngtf_modified=&quot;$build\/$(basename &quot;$gtf_in&quot;).modified&quot;\n# Pattern matches Ensembl gene, transcript, and exon IDs for human or mouse:\nID=&quot;(ENS(MUS)?[GTE][0-9]+)\\.([0-9]+)&quot;\ncat &quot;$gtf_in&quot; \\\n    | sed -E &#039;s\/gene_id &quot;&#039;&quot;$ID&quot;&#039;&quot;;\/gene_id &quot;\\1&quot;; gene_version &quot;\\3&quot;;\/&#039; \\\n    | sed -E &#039;s\/transcript_id &quot;&#039;&quot;$ID&quot;&#039;&quot;;\/transcript_id &quot;\\1&quot;; transcript_version &quot;\\3&quot;;\/&#039; \\\n    | sed -E &#039;s\/exon_id &quot;&#039;&quot;$ID&quot;&#039;&quot;;\/exon_id &quot;\\1&quot;; exon_version &quot;\\3&quot;;\/&#039; \\\n    &gt; &quot;$gtf_modified&quot;\n\n# Define string patterns for GTF tags\n# NOTES:\n# Since Ensembl 110, polymorphic pseudogenes are now just protein_coding.\n# Readthrough genes are annotated with the readthrough_transcript tag.\nBIOTYPE_PATTERN=\\\n&quot;(protein_coding|protein_coding_LoF|lncRNA|\\\nIG_C_gene|IG_D_gene|IG_J_gene|IG_LV_gene|IG_V_gene|\\\nIG_V_pseudogene|IG_J_pseudogene|IG_C_pseudogene|\\\nTR_C_gene|TR_D_gene|TR_J_gene|TR_V_gene|\\\nTR_V_pseudogene|TR_J_pseudogene)&quot;\nGENE_PATTERN=&quot;gene_type \\&quot;${BIOTYPE_PATTERN}\\&quot;&quot;\nTX_PATTERN=&quot;transcript_type \\&quot;${BIOTYPE_PATTERN}\\&quot;&quot;\nREADTHROUGH_PATTERN=&quot;tag \\&quot;readthrough_transcript\\&quot;&quot;\n\n# Construct the gene ID allowlist. 
We filter the list of all transcripts\n# based on these criteria:\n#   - allowable gene_type (biotype)\n#   - allowable transcript_type (biotype)\n#   - no &quot;readthrough_transcript&quot; tag\n# We then collect the list of gene IDs that have at least one associated\n# transcript passing the filters.\ncat &quot;$gtf_modified&quot; \\\n    | awk &#039;$3 == &quot;transcript&quot;&#039; \\\n    | grep -E &quot;$GENE_PATTERN&quot; \\\n    | grep -E &quot;$TX_PATTERN&quot; \\\n    | grep -Ev &quot;$READTHROUGH_PATTERN&quot; \\\n    | sed -E &#039;s\/.*(gene_id &quot;[^&quot;]+&quot;).*\/\\1\/&#039; \\\n    | sort \\\n    | uniq \\\n    &gt; &quot;${build}\/gene_allowlist&quot;\n\n# NOTES:\n# Since Ensembl 110, the PAR locus genes are included on chrY as copies of chrX\n# Using the GRCh38.p13 assembly hard masks these regions on chrY, but removing the\n# chrY PAR genes is still desirable so they do not end up as extra entries in the output.\n# The awk command below excludes all PAR_Y genes, including XGY2.\n# The non-coding gene XGY2 straddles the PAR1 boundary on chrY, and is homologous to XG on chrX.\n# GRCh38-2024-A excludes XGY2, but includes SRY and ENSG00000286130, which are in an intron of XGY2,\n# and RPS4Y1, which overlaps XGY2.\n\n# Filter the GTF file based on the gene allowlist\ngtf_filtered=&quot;${build}\/$(basename &quot;$gtf_in&quot;).filtered&quot;\n# Copy header lines beginning with &quot;#&quot;\ngrep -E &quot;^#&quot; &quot;$gtf_modified&quot; &gt; &quot;$gtf_filtered&quot;\n# Filter to the gene allowlist, and then remove PAR_Y genes\ngrep -Ff &quot;${build}\/gene_allowlist&quot; &quot;$gtf_modified&quot; \\\n    | awk -F &quot;\\t&quot; &#039;$1 != &quot;chrY&quot; || $1 == &quot;chrY&quot; &amp;&amp; $4 &gt;= 2752083 &amp;&amp; $4 &lt; 56887903 &amp;&amp; !\/ENSG00000290840\/&#039; \\\n    &gt;&gt; &quot;$gtf_filtered&quot;\n\n# Create reference package\ncellranger mkref --ref-version=&quot;$version&quot; \\\n    --genome=&quot;$genome&quot; 
--fasta=&quot;$fasta_modified&quot; --genes=&quot;$gtf_filtered&quot; \\\n    --nthreads=8<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Official tutorial: https:\/\/www.10xgenomics.com\/support\/software\/cell-&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-667","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/posts\/667","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/comments?post=667"}],"version-history":[{"count":2,"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/posts\/667\/revisions"}],"predecessor-version":[{"id":669,"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/posts\/667\/revisions\/669"}],"wp:attachment":[{"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/media?parent=667"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/categories?post=667"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kz-hub.tech\/index.php\/wp-json\/wp\/v2\/tags?post=667"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}