
Identifying repeat-mediated recombination in mitochondria and calculating recombination frequency

Source: https://yanzhongsino.github.io/2023/03/14/omics_organelle_recombination.frequency/

1. Repeat-mediated recombination in mitochondria

  1. Mitochondrial repeat sequences
  • Plant mitochondrial genomes commonly contain repeat sequences (a "repeat" here is not a transposable element in the nuclear-genome sense, but simply a stretch of sequence that also occurs, perfectly identical, at another position). Such repeats can mediate recombination and thereby generate multiple conformations of the mitochondrial genome. Repeats usually occur in pairs (occasionally in more than two copies) and are called repeat pairs.
  2. Background for identifying recombination
  • Illumina PE sequencing is paired-end sequencing, typically producing 150 bp reads. The span covered by a read pair is called the insert size. It is usually around 350 bp, but differs for every read pair and can only be measured by mapping the reads to a reference sequence.
  • If a repeat pair is shorter than the insert size of the sequencing reads, two conformations can be built (the reference conformation and the recombinant conformation) and the reads mapped to both to test whether the repeat pair mediates recombination (mapped reads on both conformations are evidence that it does). The counts of mapped reads can then be used to calculate the recombination frequency (see the toy sketch after this list).
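A toy sketch of the idea (Python 3; the sequences are invented stand-ins, not real data): a repeat R occurs in two contexts, A-R-B and C-R-D, and recombination across the repeat swaps the flanks, so a read pair anchored in A on one side and D on the other can only derive from a recombinant molecule.

# Toy sketch: two copies of repeat R in different flanking contexts,
# and the recombinant conformations produced by swapping the flanks.
A, B, C, D = "aaaa", "bbbb", "cccc", "dddd"  # stand-ins for ~300 bp flanks
R = "RRRR"                                   # a repeat shorter than the insert size

ref_1, ref_2 = A + R + B, C + R + D          # reference conformation
rec_1, rec_2 = A + R + D, C + R + B          # recombinant conformation

# A read pair spanning the repeat with one anchor in A and one in D
# matches rec_1 but neither reference sequence, so it supports recombination.
assert A + R + D == rec_1 and A + R + D not in (ref_1, ref_2)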

2. Identifying repeats

2.1. Scripts for identifying repeats

The paper Repeats of Unusual Size in Plant Mitochondrial Genomes: Identification, Incidence and Evolution (https://academic.oup.com/g3journal/article/9/2/549/6026745) summarizes the characteristics of repeats in plant mitochondrial genomes. It shows that plant mitochondrial genomes are larger than animal ones, contain large amounts of non-coding DNA, and have low mutation rates but high rearrangement rates.

  • The paper provides the Python script ROUSFinder.py for identifying mitochondrial repeats.
  • The script is written in Python 2; its main external dependency is blastn.
  • There are three versions:
  1. ROUSFinder.py
  • Runs a blastn comparison of a sequence against itself.
  • Defaults in the version below: minimum match length 24 bp (-m), blastn E-value 1000, match reward +1, mismatch penalty -20.
  • The script is below:
#! /usr/bin/env python
import sys, math, os, argparse, csv
csv.field_size_limit(sys.maxsize)

# January 16, 2018 version 1.1
# Find dispersed repeated sequences in genomes.
# Designed for plant mitochondrial genomes of up to a few Mbp.
# May be very slow with larger genomes.
# Blast can also sometimes give odd results with large or highly repetetive genomes.
# Gaps, or runs of 'N's in the sequence will definitely give weird results.
# The program assumes there aren't any, and that the longest repeat will be the full sequence to itself.
# If there are long repeats in the output that are listed as being only at one location, this is probably what happened.
# If there are a lot of repeats within repeats the results can also be odd.
# Copyright Alan C. Christensen, University of Nebraska, 2018
# No guarantees, warranties, support, or anything else is implicit or explicit.
# Input is a fasta format file of a sequence. Genbank format works but generates lots of error messages in stdout.
# Output is a list of unique, ungapped repeated sequences, fasta formatted.
# The names are in the format '>Repeat/ROUS_name_start_end_length'.
# Percent identity is limited to >=99%, to allow for sequencing errors of <1%.
# A table of repeats with the coordinates of each one is generated.
# A list of repeat name, length and copy number is generated.
# A binned table of the total number of repeats in size ranges is generated.
#
# PARAMETERS
# REQUIRED:
# input file in fasta format
# Optional
# -o output file name
# -m minimum length of exact matches to keep
# -b path to blastn (default is /usr/bin/)
# -k keep temp files
# -gb to write the repeats to a genbank format file

parser = argparse.ArgumentParser(description='Find repeats in a fasta sequence file')
parser.add_argument('infile', action='store', help='Input .fasta file')
parser.add_argument('-o', action='store', dest='outfile', help='Output file name seed, default is input_repeats', default='default')
parser.add_argument('-m', action='store', dest='minlen', help='Minimum length of matches to keep, default=24', default='24')
parser.add_argument('-b', action='store', dest='blast_path', help='Path to blastn program, default is /usr/bin/', default='/usr/bin/')
parser.add_argument('-k', action='store_true', dest='keep', help='True to keep temp files', default=False)
parser.add_argument('-gb', action='store_true', dest='genbank', help='True to write GenBank format file', default=False)
results = parser.parse_args()
infile = results.infile
outfile = results.outfile
minlen = int(results.minlen)
blast_path = results.blast_path
keep = results.keep
genbank = results.genbank

# It might be useful to define the wordsize as something less than minlen, so both variables are used.
# Wordsize smaller than minlen would give smaller core identical sequences in the middle of repeats.
# An example might be to change this to wordsize = str(int(minlen/2)).
wordsize = str(minlen)

# If no output file seed is specified, make one by stripping leading directory information
# and stripping trailing .fa or .fasta from the input file name and using that.
if outfile == 'default':
    outfile = infile
    if outfile.count('/') > 0:
        for i in range(outfile.count('/')):
            index = outfile.index('/')
            outfile = outfile[index+1:]
    if outfile.endswith('.fa') or outfile.endswith('.fasta'):
        outfile = outfile.rstrip('fasta')
        outfile = outfile.rstrip('.')
outfa = outfile+'_rep.fasta'
outtab = outfile+'_rep_table.txt'
outbin = outfile+'_binned.txt'
outcount = outfile+'_rep_counts.txt'
outgb = outfile+'_repeats.gb.txt'
tempblast = outfile+'_tempblast.txt'
temprepeats = outfile+'_temprepeats.txt'
tempparse = outfile+'_sequence_parsing.txt'

# Get sequence name and length from fasta file.
seq = open(infile, 'r')
seqname = seq.readline()
seqname = seqname.lstrip('> ')
seqname = seqname.rstrip()
seqlen = 0
for line in seq:
    if(line[0] == ">"):
        continue
    seqlen += len(line.strip())
seq.close()

# run blastn with query file plus strand (removing first line which is full length sequence), minus strand, and concatenate
print 'Performing self-blastn comparison with '+seqname
os.system(blast_path+'blastn -query '+infile+' -strand plus -subject '+infile+' -word_size '+wordsize+' -reward 1 -penalty -20 -ungapped -dust no -soft_masking false -evalue 1000 -outfmt "10 qstart qend length sstart send mismatch sstrand qseq" | tail -n+2 > tempblast1.txt')
os.system(blast_path+'blastn -query '+infile+' -strand minus -subject '+infile+' -word_size '+wordsize+' -reward 1 -penalty -20 -ungapped -dust no -soft_masking false -evalue 1000 -outfmt "10 qstart qend length sstart send mismatch sstrand qseq" > tempblast2.txt')
os.system('cat tempblast1.txt tempblast2.txt > '+tempblast)
os.system('rm tempblast1.txt tempblast2.txt')

# open tempblast.txt, convert to list of lists, and sort by length and position descending
# This is necessary because blastn does not output every possible pair of hits when there are more than 2 copies of a repeat

print 'Sorting alignments...'
f = open(tempblast, 'r')
reader = csv.reader(f)
alignments = list(reader)
f.close()
alignments = sorted(alignments, key=lambda x: (-1*int(x[2]), -1*int(x[0])))
alignments.append(['1','1','1','1','1','0','A','X'])

# New list of uniques
# Text file '_sequence_parsing.txt' includes the information on how duplicates were found.
# Start at row 0. Compare to subsequent rows.
# If repeat length is different from the next row, it has passed all the tests, write it to the file.
# If query or subject coordinates are the same as the query or subject or reversed coordinates
# of a subsequent row, it is not unique, so go to the next row and do the comparisons again.
# Thanks to Alex Kozik for repeatedly testing and finding bugs in the algorithm.
print 'Finding unique repeats...'
uniques = []
sp = open(tempparse, 'w')
for row in range(len(alignments)):
    sp.write('row '+str(row)+'\n')
    if int(alignments[row][2]) < minlen:
        # This won't happen unless the word_size is defined as something other than minlen.
        # That could be useful under some circumstances.
        sp.write('row '+str(row)+' is less than minlength')
        break
    else:
        for compare in range(row+1,len(alignments)):
            if alignments[row][2] != alignments[compare][2]:
                uniques.append(alignments[row])
                sp.write('\tadding row '+str(row)+' to unique list\n')
                break
            else:
                sp.write('\tcomparing to '+str(compare)+'\n')
                if alignments[row][0] == alignments[compare][0] and alignments[row][1] == alignments[compare][1]:
                    sp.write('\tqstart and qend of row '+str(row)+' and '+str(compare)+' are the same\n')
                    break
                elif alignments[row][0] == alignments[compare][1] and alignments[row][1] == alignments[compare][0]:
                    sp.write('\tqstart and qend of row '+str(row)+' is the same as qend and qstart of '+str(compare)+'\n')
                    break
                elif alignments[row][0] == alignments[compare][3] and alignments[row][1] == alignments[compare][4]:
                    sp.write('\tqstart and qend of row '+str(row)+' is the same as sstart and send of '+str(compare)+'\n')
                    break
                elif alignments[row][0] == alignments[compare][4] and alignments[row][1] == alignments[compare][3]:
                    sp.write('\tqstart and qend of row '+str(row)+' is the same as send and sstart of '+str(compare)+'\n')
                    break
                elif alignments[row][3] == alignments[compare][0] and alignments[row][4] == alignments[compare][1]:
                    sp.write('\tsstart and send of row '+str(row)+' is the same as qstart and qend of '+str(compare)+'\n')
                    break
                elif alignments[row][3] == alignments[compare][1] and alignments[row][4] == alignments[compare][0]:
                    sp.write('\tsstart and send of row '+str(row)+' is the same as qend and qstart of '+str(compare)+'\n')
                    break
                elif alignments[row][3] == alignments[compare][3] and alignments[row][4] == alignments[compare][4]:
                    sp.write('\tsstart and send of row '+str(row)+' is the same as sstart and send of '+str(compare)+'\n')
                    break
                elif alignments[row][3] == alignments[compare][4] and alignments[row][4] == alignments[compare][3]:
                    sp.write('\tsstart and send of row '+str(row)+' is the same as send and sstart of '+str(compare)+'\n')
                    break
                else:
                    sp.write('\t'+str(row)+' is different\n')
sp.close()

# Write uniques into output file
# Start list for copy number table
rous_count = 0
g = open(outfa, 'w')
repcopies = []

for i in range(len(uniques)):
    qstart = uniques[i][0]
    qend = uniques[i][1]
    length = uniques[i][2]
    seq = uniques[i][7]
    rous_count += 1
    g.write('>Repeat_'+str(rous_count)+'\n'+seq+'\n')
    repcopies.append(['Repeat_'+str(rous_count),length])

if rous_count == 0:
    print "\tRepeats of unusual size? I don't think they exist"
g.close()
print 'Repeat fasta file is done, as you wish.'

# Now find each copy of each repeat. Again, this is because the blastn output file does not have every possible alignment.
# It is also because the information on locations and strand is not organized well in the blastn output.
# In addition, this subroutine eliminates duplicates of nested repeats.

print "Finding all copies of repeats..."
g = open(outfa, 'r')
os.system(blast_path+'blastn -query '+outfa+' -strand both -subject '+infile+' -word_size '+wordsize+' -reward 1 -penalty -20 -ungapped -dust no -soft_masking false -evalue 1000 -outfmt "10 qseqid length sstart send sstrand qcovhsp" > '+temprepeats)
g.close()

tempr = open(temprepeats, 'r')
reader = csv.reader(tempr)
replist = list(reader)
tempr.close()

print "Making a table of the repeats..."
sum_rep_len = 0
bin_dict = {}
binned = [seqname,seqlen,0]

# defining the bins
i = 0
j = 50
while j < 1000:
    bin_dict[i] = j
    binned.append(0)
    i += 1
    j += 50
while j <= 10000:
    bin_dict[i] = j
    binned.append(0)
    i += 1
    j += 250

# make list for entire sequence, set each position as 0
posit = []
for n in range(seqlen):
    posit.append(0)

# Thanks to Emily Wynn for suggesting qcovhsp for this loop.
# if qcovhsp is >98%, write to the file
# write tab separated values of repeat name, length, start, end, strand to outtab
# make list for genbank file
# Keep stats on lengths
rt = open(outtab, 'w')
rt.write(seqname+'\t'+str(seqlen)+'\n')
templist = []
gblist =[]

# look at each repeat in turn
for i in range(len(replist)):
    # if repeat is good (>98% identical to another one), write it to the file, and put the name in a list
    if int(replist[i][5])>98:
        rt.write(str(replist[i][0])+'\t'+str(replist[i][1])+'\t'+str(replist[i][2])+'\t'+str(replist[i][3])+'\t'+str(replist[i][4])+'\n')
        if replist[i][4] == 'minus':
            location = 'complement('+replist[i][3]+'..'+replist[i][2]+')'
        else:
            location = replist[i][2]+'..'+replist[i][3]
        gblist.append(' repeat_region '+location+'\n /rpt_type=dispersed\n /label='+replist[i][0]+'\n')
        templist.append(replist[i][0])
        # then write 1's at every position in the sequence covered by that repeat
        # these can then be summed to get total bases of repeats
        # bases in overlapping repeats are only counted once
        for n in range(int(replist[i][2]), int(replist[i][3])):
            posit[n] = 1
        # then scan through bin sizes and if a repeat is greater than the
        # bin_dict size cutoff, add one to the bin
        for j in range(len(binned)-4, -1, -1):
            if int(replist[i][1]) >= bin_dict[j]:
                binned[j+3] += 1
                break
sum_rep_len = posit.count(1)
binned[2] = sum_rep_len
rt.close()
if genbank == True:
    gb = open(outgb, 'w')
    for i in range(len(gblist)):
        gb.write(gblist[i])
    gb.close()

# write tab separated values of repeat name, length, copy number to outcount
# first two lines are also a table of stats on repeats
rc = open(outcount,'w')
rc.write('Sequence\tGenome_size\tNumRepeats\tAvgSize\tAvgCopyNum\n')

numrous = 0
sizerous = 0
copyrous = 0

for i in range(len(repcopies)):
    repname = repcopies[i][0]
    replen = float(repcopies[i][1])
    repcop = float(templist.count(repname))
    numrous += 1
    sizerous += replen
    copyrous += repcop

if numrous == 0:
    avsizerous = 'NA'
    avcopyrous = 'NA'
else:
    avsizerous = sizerous/numrous
    avcopyrous = copyrous/numrous

rc.write(seqname+'\t'+str(seqlen)+'\t'+str(numrous)+'\t'+str(avsizerous)+'\t'+str(avcopyrous)+'\n')

for i in range(len(repcopies)):
    rc.write(repcopies[i][0]+'\t'+repcopies[i][1]+'\t'+str(templist.count(repcopies[i][0]))+'\n')

rc.close()

# Write binned table headers, then stats for this sequence.
binfile = open(outbin, 'w')
binfile.write('Sequence\tSeq_len\tRep_len\t')
for i in range(len(bin_dict)):
    binfile.write(str(bin_dict[i])+'\t')
binfile.write('\n')
for i in range(len(binned)):
    binfile.write(str(binned[i])+'\t')
binfile.write('\n')
binfile.close()
print "Repeat tables are done, as you wish."

# Removing temp files if necessary
if keep == False:
    os.system('rm '+tempblast+' '+temprepeats+' '+tempparse)

# Rachael Schulte, William Goldman and Rob Reiner inspired this section of code
quote_dict = {0:"48656c6c6f2e204d79206e616d6520697320496e69676f204d6f6e746f79612e20596f75206b696c6c6564206d79206661746865722e205072657061726520746f206469652e", 1:"5768656e20492077617320796f7572206167652c2074656c65766973696f6e207761732063616c6c656420626f6f6b732e", 2:"486176652066756e2073746f726d696e2720646120636173746c6521", 3:"4d79207761792773206e6f7420766572792073706f7274736d616e6c696b652e", 4:"596f75206b656570207573696e67207468617420776f72642e204920646f206e6f74207468696e6b206974206d65616e73207768617420796f75207468696e6b206974206d65616e732e", 5:"4d75726465726564206279207069726174657320697320676f6f642e",6:"496e636f6e6365697661626c6521", 7:"5468657265277320612062696720646966666572656e6365206265747765656e206d6f73746c79206465616420616e6420616c6c20646561642e", 8:"596f7520727573682061206d697261636c65206d616e2c20796f752067657420726f7474656e206d697261636c65732e", 9:"476f6f64206e696768742c20576573746c65792e20476f6f6420776f726b2e20536c6565702077656c6c2e2049276c6c206d6f7374206c696b656c79206b696c6c20796f7520696e20746865206d6f726e696e672e",10:"4e6f206d6f7265207268796d65732c2049206d65616e2069742120416e79626f64792077616e742061207065616e75743f"}
import random, binascii
z = random.randint(0,10)
print binascii.unhexlify(quote_dict[z])+'\n'
  2. MultipleRepeats.py
  • MultipleRepeats.py runs ROUSFinder.py on every sequence file in a directory.
  • Note that the path to ROUSFinder.py is hard-coded (/home/alan/applications/) and must be edited for your system.
  • The script is below:
#! /usr/bin/env python
import sys, math, os, argparse

# Usage: -din directory of files to find repeats in
# -word word_size

parser = argparse.ArgumentParser(description='Find repeats in a directory of fasta sequence files')
parser.add_argument('-din', action='store', dest='din', help='Input .fasta directory')
parser.add_argument('-word', action='store', dest='word', help='Word size for blast')
results = parser.parse_args()
din = results.din
word = results.word

li = os.listdir(din)
inputs = filter(lambda x: '.fasta' in x, li)
inputs.sort()

for i in range(len(inputs)):
    infile = str(inputs[i])
    os.system("/home/alan/applications/ROUSFinder.py -m "+word+" "+din+infile)
  3. ROUSFinder2.py
  • ROUSFinder2.py adds command-line options for the match reward and mismatch penalty.
  • The script is below:
#! /usr/bin/env python
import sys, math, os, argparse, csv
csv.field_size_limit(sys.maxsize)

# Version 2.0, November 21, 2018
# Changes: uses variable parameters
# Find dispersed repeated sequences in genomes.
# Designed for plant mitochondrial genomes of up to a few Mbp.
# May be very slow with larger genomes.
# Blast can also sometimes give odd results with large or highly repetetive genomes.
# Gaps, or runs of 'N's in the sequence will definitely give weird results.
# The program assumes there aren't any, and that the longest repeat will be the full sequence to itself.
# If there are long repeats in the output that are listed as being only at one location, this is probably what happened.
# If there are a lot of repeats within repeats the results can also be odd.
# Copyright Alan C. Christensen, University of Nebraska, 2018
# No guarantees, warranties, support, or anything else is implicit or explicit.
# Input is a fasta format file of a sequence. Genbank format works but generates lots of error messages in stdout.
# Output is a list of unique, ungapped repeated sequences, fasta formatted.
# The names are in the format '>Repeat/ROUS_name_start_end_length'.
# A table of repeats with the coordinates of each one is generated.
# A list of repeat name, length and copy number is generated.
# A binned table of the total number of repeats in size ranges is generated.
#
# PARAMETERS
# REQUIRED:
# input file in fasta format
# Optional
# -o output file name
# -m minimum length of exact matches to keep
# -b path to blastn (default is /usr/bin/)
# -k keep temp files
# -gb to write the repeats to a genbank format file
# -rew reward for match (default is 1)
# -pen penalty for mismatch (default is 20)

parser = argparse.ArgumentParser(description='Find repeats in a fasta sequence file')
parser.add_argument('infile', action='store', help='Input .fasta file')
parser.add_argument('-o', action='store', dest='outfile', help='Output file name seed, default is input_repeats', default='default')
parser.add_argument('-m', action='store', dest='minlen', help='Minimum length of matches to keep, default=50', default='50')
parser.add_argument('-b', action='store', dest='blast_path', help='Path to blastn program, default is /usr/bin/', default='/usr/bin/')
parser.add_argument('-k', action='store_true', dest='keep', help='True to keep temp files', default=False)
parser.add_argument('-gb', action='store_true', dest='genbank', help='True to write GenBank format file', default=False)
parser.add_argument('-rew', action='store', dest='reward', help='Reward for match', default='1')
parser.add_argument('-pen', action='store', dest='penalty', help='Penalty for mismatch', default='20')
results = parser.parse_args()
infile = results.infile
outfile = results.outfile
minlen = int(results.minlen)
blast_path = results.blast_path
keep = results.keep
genbank = results.genbank
reward = results.reward
penalty = results.penalty

# It might be useful to define the wordsize as something less than minlen, so both variables are used.
# Wordsize smaller than minlen would give smaller core identical sequences in the middle of repeats.
# An example might be to change this to wordsize = str(int(minlen/2)).
wordsize = str(minlen)

# If no output file seed is specified, make one by stripping leading directory information
# and stripping trailing .fa or .fasta from the input file name and using that.
if outfile == 'default':
    outfile = infile
    if outfile.count('/') > 0:
        for i in range(outfile.count('/')):
            index = outfile.index('/')
            outfile = outfile[index+1:]
    if outfile.endswith('.fa') or outfile.endswith('.fasta'):
        outfile = outfile.rstrip('fasta')
        outfile = outfile.rstrip('.')
outfa = outfile+'_rep.fasta'
outtab = outfile+'_rep_table.txt'
outbin = outfile+'_binned.txt'
outcount = outfile+'_rep_counts.txt'
outgb = outfile+'_repeats.gb.txt'
tempblast = outfile+'_tempblast.txt'
temprepeats = outfile+'_temprepeats.txt'
tempparse = outfile+'_sequence_parsing.txt'

# Get sequence name and length from fasta file.
seq = open(infile, 'r')
seqname = seq.readline()
seqname = seqname.lstrip('> ')
seqname = seqname.rstrip()
seqlen = 0
for line in seq:
    if(line[0] == ">"):
        continue
    seqlen += len(line.strip())
seq.close()

# run blastn with query file plus strand (removing first line which is full length sequence), minus strand, and concatenate
print 'Performing self-blastn comparison with '+seqname
os.system(blast_path+'blastn -query '+infile+' -strand plus -subject '+infile+' -word_size '+wordsize+' -reward '+reward+' -penalty -'+penalty+' -ungapped -dust no -soft_masking false -evalue 10 -outfmt "10 qstart qend length sstart send mismatch sstrand qseq" | tail -n+2 > tempblast1.txt')
os.system(blast_path+'blastn -query '+infile+' -strand minus -subject '+infile+' -word_size '+wordsize+' -reward '+reward+' -penalty -'+penalty+' -ungapped -dust no -soft_masking false -evalue 10 -outfmt "10 qstart qend length sstart send mismatch sstrand qseq" > tempblast2.txt')
os.system('cat tempblast1.txt tempblast2.txt > '+tempblast)
os.system('rm tempblast1.txt tempblast2.txt')

# open tempblast.txt, convert to list of lists, and sort by length and position descending
# This is necessary because blastn does not output every possible pair of hits when there are more than 2 copies of a repeat

print 'Sorting alignments...'
f = open(tempblast, 'r')
reader = csv.reader(f)
alignments = list(reader)
f.close()
alignments = sorted(alignments, key=lambda x: (-1*int(x[2]), -1*int(x[0])))
alignments.append(['1','1','1','1','1','0','A','X'])

# New list of uniques
# Text file '_sequence_parsing.txt' includes the information on how duplicates were found.
# Start at row 0. Compare to subsequent rows.
# If repeat length is different from the next row, it has passed all the tests, write it to the file.
# If query or subject coordinates are the same as the query or subject or reversed coordinates
# of a subsequent row, it is not unique, so go to the next row and do the comparisons again.
# Thanks to Alex Kozik for repeatedly testing and finding bugs in the algorithm.
print 'Finding unique repeats...'
uniques = []
sp = open(tempparse, 'w')
for row in range(len(alignments)):
    sp.write('row '+str(row)+'\n')
    if int(alignments[row][2]) < minlen:
        # This won't happen unless the word_size is defined as something other than minlen.
        # That could be useful under some circumstances.
        sp.write('row '+str(row)+' is less than minlength')
        break
    else:
        for compare in range(row+1,len(alignments)):
            if alignments[row][2] != alignments[compare][2]:
                uniques.append(alignments[row])
                sp.write('\tadding row '+str(row)+' to unique list\n')
                break
            else:
                sp.write('\tcomparing to '+str(compare)+'\n')
                if alignments[row][0] == alignments[compare][0] and alignments[row][1] == alignments[compare][1]:
                    sp.write('\tqstart and qend of row '+str(row)+' and '+str(compare)+' are the same\n')
                    break
                elif alignments[row][0] == alignments[compare][1] and alignments[row][1] == alignments[compare][0]:
                    sp.write('\tqstart and qend of row '+str(row)+' is the same as qend and qstart of '+str(compare)+'\n')
                    break
                elif alignments[row][0] == alignments[compare][3] and alignments[row][1] == alignments[compare][4]:
                    sp.write('\tqstart and qend of row '+str(row)+' is the same as sstart and send of '+str(compare)+'\n')
                    break
                elif alignments[row][0] == alignments[compare][4] and alignments[row][1] == alignments[compare][3]:
                    sp.write('\tqstart and qend of row '+str(row)+' is the same as send and sstart of '+str(compare)+'\n')
                    break
                elif alignments[row][3] == alignments[compare][0] and alignments[row][4] == alignments[compare][1]:
                    sp.write('\tsstart and send of row '+str(row)+' is the same as qstart and qend of '+str(compare)+'\n')
                    break
                elif alignments[row][3] == alignments[compare][1] and alignments[row][4] == alignments[compare][0]:
                    sp.write('\tsstart and send of row '+str(row)+' is the same as qend and qstart of '+str(compare)+'\n')
                    break
                elif alignments[row][3] == alignments[compare][3] and alignments[row][4] == alignments[compare][4]:
                    sp.write('\tsstart and send of row '+str(row)+' is the same as sstart and send of '+str(compare)+'\n')
                    break
                elif alignments[row][3] == alignments[compare][4] and alignments[row][4] == alignments[compare][3]:
                    sp.write('\tsstart and send of row '+str(row)+' is the same as send and sstart of '+str(compare)+'\n')
                    break
                else:
                    sp.write('\t'+str(row)+' is different\n')
sp.close()

# Write uniques into output file
# Start list for copy number table
rous_count = 0
g = open(outfa, 'w')
repcopies = []

for i in range(len(uniques)):
    qstart = uniques[i][0]
    qend = uniques[i][1]
    length = uniques[i][2]
    seq = uniques[i][7]
    rous_count += 1
    g.write('>Repeat_'+str(rous_count)+'\n'+seq+'\n')
    repcopies.append(['Repeat_'+str(rous_count),length])

if rous_count == 0:
    print "\tRepeats of unusual size? I don't think they exist"
g.close()
print 'Repeat fasta file is done, as you wish.'

# Now find each copy of each repeat. Again, this is because the blastn output file does not have every possible alignment.
# It is also because the information on locations and strand is not organized well in the blastn output.
# In addition, this subroutine eliminates duplicates of nested repeats.

print "Finding all copies of repeats..."
g = open(outfa, 'r')
os.system(blast_path+'blastn -query '+outfa+' -strand both -subject '+infile+' -word_size '+wordsize+' -reward 1 -penalty -20 -ungapped -dust no -soft_masking false -evalue 1000 -outfmt "10 qseqid length sstart send sstrand qcovhsp" > '+temprepeats)
g.close()

tempr = open(temprepeats, 'r')
reader = csv.reader(tempr)
replist = list(reader)
tempr.close()

print "Making a table of the repeats..."
sum_rep_len = 0
bin_dict = {}
binned = [seqname,seqlen,0]

# defining the bins
i = 0
j = 50
while j < 1000:
    bin_dict[i] = j
    binned.append(0)
    i += 1
    j += 50
while j <= 10000:
    bin_dict[i] = j
    binned.append(0)
    i += 1
    j += 250

# make list for entire sequence, set each position as 0
posit = []
for n in range(seqlen):
    posit.append(0)

# Thanks to Emily Wynn for suggesting qcovhsp for this loop.
# if qcovhsp is >98%, write to the file
# write tab separated values of repeat name, length, start, end, strand to outtab
# make list for genbank file
# Keep stats on lengths
rt = open(outtab, 'w')
rt.write(seqname+'\t'+str(seqlen)+'\n')
templist = []
gblist =[]

# look at each repeat in turn
for i in range(len(replist)):
    # if repeat is good (>98% identical to another one), write it to the file, and put the name in a list
    if int(replist[i][5])>98:
        rt.write(str(replist[i][0])+'\t'+str(replist[i][1])+'\t'+str(replist[i][2])+'\t'+str(replist[i][3])+'\t'+str(replist[i][4])+'\n')
        if replist[i][4] == 'minus':
            location = 'complement('+replist[i][3]+'..'+replist[i][2]+')'
        else:
            location = replist[i][2]+'..'+replist[i][3]
        gblist.append(' repeat_region '+location+'\n /rpt_type=dispersed\n /label='+replist[i][0]+'\n')
        templist.append(replist[i][0])
        # then write 1's at every position in the sequence covered by that repeat
        # these can then be summed to get total bases of repeats
        # bases in overlapping repeats are only counted once
        for n in range(int(replist[i][2]), int(replist[i][3])):
            posit[n] = 1
        # then scan through bin sizes and if a repeat is greater than the
        # bin_dict size cutoff, add one to the bin
        for j in range(len(binned)-4, -1, -1):
            if int(replist[i][1]) >= bin_dict[j]:
                binned[j+3] += 1
                break
sum_rep_len = posit.count(1)
binned[2] = sum_rep_len
rt.close()
if genbank == True:
    gb = open(outgb, 'w')
    for i in range(len(gblist)):
        gb.write(gblist[i])
    gb.close()

# write tab separated values of repeat name, length, copy number to outcount
# first two lines are also a table of stats on repeats
rc = open(outcount,'w')
rc.write('Sequence\tGenome_size\tNumROUS\tAvgSize\tAvgCopyNum\n')

numrous = 0
sizerous = 0
copyrous = 0

for i in range(len(repcopies)):
    repname = repcopies[i][0]
    replen = float(repcopies[i][1])
    repcop = float(templist.count(repname))
    numrous += 1
    sizerous += replen
    copyrous += repcop

if numrous == 0:
    avsizerous = 'NA'
    avcopyrous = 'NA'
else:
    avsizerous = sizerous/numrous
    avcopyrous = copyrous/numrous

rc.write(seqname+'\t'+str(seqlen)+'\t'+str(numrous)+'\t'+str(avsizerous)+'\t'+str(avcopyrous)+'\n')

for i in range(len(repcopies)):
    rc.write(repcopies[i][0]+'\t'+repcopies[i][1]+'\t'+str(templist.count(repcopies[i][0]))+'\n')

rc.close()

# Write binned table headers, then stats for this sequence.
binfile = open(outbin, 'w')
binfile.write('Sequence\tSeq_len\tRep_len\t')
for i in range(len(bin_dict)):
    binfile.write(str(bin_dict[i])+'\t')
binfile.write('\n')
for i in range(len(binned)):
    binfile.write(str(binned[i])+'\t')
binfile.write('\n')
binfile.close()
print "Repeat tables are done, as you wish."

# Removing temp files if necessary
if keep == False:
    os.system('rm '+tempblast+' '+temprepeats+' '+tempparse)

# Rachael Schulte, William Goldman and Rob Reiner inspired this section of code
quote_dict = {0:"48656c6c6f2e204d79206e616d6520697320496e69676f204d6f6e746f79612e20596f75206b696c6c6564206d79206661746865722e205072657061726520746f206469652e", 1:"5768656e20492077617320796f7572206167652c2074656c65766973696f6e207761732063616c6c656420626f6f6b732e", 2:"486176652066756e2073746f726d696e2720646120636173746c6521", 3:"4d79207761792773206e6f7420766572792073706f7274736d616e6c696b652e", 4:"596f75206b656570207573696e67207468617420776f72642e204920646f206e6f74207468696e6b206974206d65616e73207768617420796f75207468696e6b206974206d65616e732e", 5:"4d75726465726564206279207069726174657320697320676f6f642e",6:"496e636f6e6365697661626c6521", 7:"5468657265277320612062696720646966666572656e6365206265747765656e206d6f73746c79206465616420616e6420616c6c20646561642e", 8:"596f7520727573682061206d697261636c65206d616e2c20796f752067657420726f7474656e206d697261636c65732e", 9:"476f6f64206e696768742c20576573746c65792e20476f6f6420776f726b2e20536c6565702077656c6c2e2049276c6c206d6f7374206c696b656c79206b696c6c20796f7520696e20746865206d6f726e696e672e",10:"4e6f206d6f7265207268796d65732c2049206d65616e2069742120416e79626f64792077616e742061207065616e75743f"}
import random, binascii
z = random.randint(0,10)
print binascii.unhexlify(quote_dict[z])+'\n'

2.2. Steps for identifying repeats

ROUSFinder2.py is the version used here to identify the mitochondrial repeats. Its arguments:

  • required: the input sequence in fasta format
  • -o out: seed for the output file names
  • -m 50: minimum length of repeats to identify, default 50 bp
  • -b /path/to/blastn/: path to the blastn executable, default /usr/bin/
  • -k: keep the temporary files out_tempblast.txt and out_temprepeats.txt
  • -gb: also write the repeats to a GenBank-format file, out_repeats.gb.txt
  • -rew 1: match reward, default +1
  • -pen 20: mismatch penalty, default -20

An example run, ROUSFinder2.py input.fa -m 30 -gb -o out, produces the following output files:

  • out_binned.txt: counts of repeats binned by size (50 bp, 100 bp, 150 bp, ..., 1000 bp, 1250 bp, 1500 bp, ..., 10000 bp)
  • out_rep.fasta: the repeat sequences in fasta format
  • out_rep_counts.txt: the total number of repeats (NumROUS), their average length (AvgSize) and average copy number (AvgCopyNum), plus the length and copy number of each repeat
  • out_rep_table.txt: the length, start and end coordinates, and strand (plus or minus) of every copy of every repeat (a small parsing sketch follows this list)
  • out_repeats.gb.txt: GenBank-format annotation of the repeats, written only when -gb is given
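To move from the repeat table to candidate repeat pairs for the recombination analysis in section 3, the rows of out_rep_table.txt can be grouped by repeat name. This is a minimal Python 3 sketch, not part of ROUSFinder; it assumes the table format written by the script above (first line: sequence name and total length; then name, length, start, end, strand per copy):

from collections import defaultdict

# Group the copies of each repeat; repeats with exactly two copies are
# straightforward candidate repeat pairs for the recombination analysis.
copies = defaultdict(list)
with open("out_rep_table.txt") as f:
    next(f)  # skip the header line (sequence name and total length)
    for line in f:
        name, length, start, end, strand = line.rstrip("\n").split("\t")
        copies[name].append((int(start), int(end), strand))

for name, locs in sorted(copies.items()):
    if len(locs) == 2:  # repeats with more than two copies need pairwise treatment
        print(name, locs)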

2.3. Uses of the identified repeats

  1. Assisting mitochondrial assembly
  • The identified repeats can help make a mitochondrial assembly more complete.
  • They can also guide adjustment of an existing assembly, for example by calculating recombination frequencies and taking the dominant conformation as the final assembly.
  • Repeats may mediate recombination, so identifying repeats is the foundation for identifying recombination.

3. Identifying recombination and calculating recombination frequency

Once the mitochondrial repeats have been identified, you can go on to test whether they mediate recombination.

Recombination changes the conformation of the genome. The changes include:

  1. A repeat pair located on two different circular chromosomes can mediate recombination that merges the two small circles into one large circle.
  2. A repeat pair in direct orientation on the same circular chromosome can mediate recombination that splits the large circle into two small circles.
  3. A repeat pair in inverted orientation on the same circular chromosome mediates recombination that inverts the segment between the two copies; rearrangements at other repeat pairs can also convert such a pair into direct orientation or place its copies on two separate circles.

3.1. Steps for identifying recombination

3.1.1. Building the reference-conformation and recombinant-conformation sequences

  1. For each repeat pair that could mediate recombination, construct the reference conformation and the recombinant conformation.
  2. In both conformations, take the repeat plus 300 bp of flanking sequence on each side as the reference-conformation sequences (ref_1 and ref_2) and the recombinant-conformation sequences (rec_1 and rec_2) (see the sketch after this list).
  3. Notes on building the reference and on mapping:
  • Use the reference-conformation and recombinant-conformation sequences (four in total) together as the mapping reference, so that the same read cannot be mapped to several conformation sequences and counted more than once.
  • If chloroplast-derived reads might distort the recombination frequency, add the chloroplast genome to the reference as well; chloroplast reads will then map to it, which filters them out.
  • If nuclear-derived reads are a concern, apply a cutoff (e.g. 100 bp) and discard mappings shorter than it.
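A minimal Python 3 sketch of the construction (not a published tool; all names are illustrative). It assumes a single genome string with 0-based half-open coordinates for the two repeat copies, both in direct orientation and both with at least 300 bp of flank available; inverted copies would need reverse complementing, and a circular genome would need wrap-around handling:

def conformations(genome, copy1, copy2, flank=300):
    # copy1 and copy2 are (start, end) coordinates of the two repeat copies.
    (s1, e1), (s2, e2) = copy1, copy2
    rep = genome[s1:e1]                                       # the shared repeat sequence
    up1, down1 = genome[s1-flank:s1], genome[e1:e1+flank]     # flanks of copy 1
    up2, down2 = genome[s2-flank:s2], genome[e2:e2+flank]     # flanks of copy 2
    ref_1 = up1 + rep + down1        # copy 1 in its original context
    ref_2 = up2 + rep + down2        # copy 2 in its original context
    rec_1 = up1 + rep + down2        # flanks exchanged by recombination
    rec_2 = up2 + rep + down1
    return ref_1, ref_2, rec_1, rec_2

Writing the four sequences to reference.fa under the names ref_1, ref_2, rec_1 and rec_2 gives the mapping reference used below.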

3.1.2. Mapping reads to the reference

  1. Map the reads to the reference, then filter the mapped reads by position to keep only the valid mappings that span the repeat region.
  2. For example, with a 220 bp repeat and 300 bp taken on each side, each reference sequence is 820 bp long. A read pair counts as a valid mapping if it spans the 220 bp repeat with at least 10 bp mapped on each side, which means its start position must be below 290 and its end position above 530 bp (these thresholds are generalized in the small helper after the commands).
  3. In the BAM file produced by bwa, column 4 is the leftmost coordinate at which the read maps to the reference (unmapped reads have 0 there); column 9 is the template length when both reads of a pair map to the same reference sequence, which can be read as the insert size (it is signed, positive for the leftmost read of the pair).
a=530 # valid pairs must end beyond this position (300 bp flank + 220 bp repeat + 10 bp anchor)
bwa index reference.fa # index the reference
bwa mem -t 4 reference.fa sample_1.clean.fq sample_2.clean.fq | samtools sort -@ 4 -m 4G > reference.bam # map the reads to the reference
samtools view reference.bam | awk -F"\t" -v awka="$a" '$4<290 && ($4+$9)>awka {print $0}' > reference.bam.filter # keep mapped pairs that start before 290 and end after 530 bp; setting the field separator with -F"\t" matters, it has bitten me twice
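The 290/530 cutoffs follow from the flank length, the repeat length, and the minimum anchor length required on each side. A small Python 3 helper (illustrative, not part of the original commands; min_anchor is a hypothetical parameter name) makes the relationship explicit:

def valid_window(flank=300, repeat_len=220, min_anchor=10):
    # A valid read pair must start before max_start and end after min_end
    # so that it spans the repeat with at least min_anchor bases in each flank.
    max_start = flank - min_anchor              # 300 - 10 = 290
    min_end = flank + repeat_len + min_anchor   # 300 + 220 + 10 = 530
    return max_start, min_end

print(valid_window())  # (290, 530), the thresholds used in the awk filter above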

3.1.3. Calculating the recombination frequency

  1. After mapping the reads and filtering down to the valid mappings in reference.bam.filter, count the valid mapped reads (i.e. the number of lines) for each reference sequence (see the counting sketch after the commands).
  2. The read counts of the two recombinant-conformation sequences (rec_1 and rec_2) form the numerator, and the read counts of all four sequences (rec_1, rec_2, ref_1 and ref_2) form the denominator; their ratio is the recombination frequency: frequency = (n_rec_1 + n_rec_2) / (n_ref_1 + n_ref_2 + n_rec_1 + n_rec_2).
n_ref_1=$(awk -F"\t" '$3 == "ref_1"' reference.bam.filter | wc -l) # count valid reads mapped to the reference-conformation sequence ref_1
n_ref_2=$(awk -F"\t" '$3 == "ref_2"' reference.bam.filter | wc -l) # count valid reads mapped to the reference-conformation sequence ref_2
n_rec_1=$(awk -F"\t" '$3 == "rec_1"' reference.bam.filter | wc -l) # count valid reads mapped to the recombinant-conformation sequence rec_1
n_rec_2=$(awk -F"\t" '$3 == "rec_2"' reference.bam.filter | wc -l) # count valid reads mapped to the recombinant-conformation sequence rec_2
recombination_frequency=$(echo "$n_ref_1 $n_ref_2 $n_rec_1 $n_rec_2" | awk '{print ($3+$4)/($1+$2+$3+$4)}') # bash $((...)) arithmetic is integer-only, so do the division in awk
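The same counting can be done in one pass. A minimal Python 3 sketch, assuming reference.bam.filter from the previous step (column 3 of a SAM record is the name of the reference sequence the read mapped to):

from collections import Counter

# Tally valid mapped reads per conformation sequence, then compute
# recombination frequency = recombinant reads / all valid reads.
counts = Counter()
with open("reference.bam.filter") as f:
    for line in f:
        counts[line.split("\t")[2]] += 1

n_rec = counts["rec_1"] + counts["rec_2"]
n_ref = counts["ref_1"] + counts["ref_2"]
total = n_rec + n_ref
print(dict(counts))
print("recombination frequency = %.4f" % (n_rec / total if total else float("nan")))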

4. References

  1. ROUSFinder.py script: Wynn EL, Christensen AC. 2019. Repeats of Unusual Size in Plant Mitochondrial Genomes: Identification, Incidence and Evolution. G3: Genes|Genomes|Genetics 9(2):549-559. https://academic.oup.com/g3journal/article/9/2/549/6026745
