Identifying and Annotating Genes
RNAseq data were obtained
using Illumina sequencing of cDNA that had been prepared from RNA
harvested under a variety of conditions, as described in the next
section on "Gene Expression Changes." All cDNA sequencing
was performed as single (i.e., unpaired) reads using 36 cycles.
The quality of these
reads was evaluated using the program FASTQC. Of the original 412,236,998
RNAseq reads, we removed 196,647,675 duplicates (arising either
from artifacts of the sequencing process or from coincidental correspondence,
especially for the most highly expressed transcripts), leaving 215,589,323
RNAseq reads for assembly of the transcriptome. These reads were
trimmed to an error rate of less than approximately 1 in 100, then
trimmed until no ambiguous nucleotides (e.g., “N”) remained;
all reads shorter than 15 nucleotides were then discarded. This
left 212,887,812 unique, high-quality RNAseq reads.
FASTQC results are
Details for this processing of RNAseq reads are in <RNAseq_processing.pdf>
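The read-processing steps described above (duplicate removal, quality trimming, removal of ambiguous nucleotides, and a 15-nucleotide length cutoff) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the software actually used: the function names are hypothetical, qualities are assumed to be Phred+33 encoded, an error rate of 1 in 100 is treated as Phred 20, quality trimming is simplified to clipping the 3' end, and ambiguity trimming keeps the longest N-free stretch.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str]  # (sequence, Phred+33 quality string); assumed encoding


def drop_duplicates(reads: Iterable[Read]) -> Iterator[Read]:
    """Keep only the first read seen for each distinct sequence."""
    seen = set()
    for seq, qual in reads:
        if seq not in seen:
            seen.add(seq)
            yield seq, qual


def trim_low_quality(seq: str, qual: str, min_phred: int = 20) -> Read:
    """Clip the 3' end until the last base has quality >= min_phred.

    Phred 20 corresponds to an error probability of 1 in 100. This is a
    simplified stand-in for whatever trimming algorithm was really used.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - 33 < min_phred:
        end -= 1
    return seq[:end], qual[:end]


def trim_ambiguous(seq: str, qual: str) -> Read:
    """Keep the longest stretch of the read that contains no 'N'."""
    best_start, best_len, start = 0, 0, 0
    # A sentinel 'N' closes the final stretch.
    for i, base in enumerate(seq + "N"):
        if base in "Nn":
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = i + 1
    stop = best_start + best_len
    return seq[best_start:stop], qual[best_start:stop]


def process(reads: Iterable[Read], min_len: int = 15) -> Iterator[Read]:
    """Apply the filters in the order given in the text."""
    for seq, qual in drop_duplicates(reads):
        seq, qual = trim_low_quality(seq, qual)
        seq, qual = trim_ambiguous(seq, qual)
        if len(seq) >= min_len:  # reads shorter than 15 nt are discarded
            yield seq, qual
```

With 36-cycle reads, duplicate detection on the full read sequence is cheap, but holding every distinct sequence in a Python set would not scale comfortably to hundreds of millions of reads; the sketch is meant to show the order of operations rather than a production approach.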
These reads were assembled,
using a de Bruijn graph method, into a set of 27,303 transcript
contigs that are at least 200 nts in length. The longest open reading
frame (ORF) was determined for each transcript contig and then conceptually
translated into an amino acid sequence.
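As an illustration of the ORF step, here is a minimal Python sketch that finds the longest ORF in a contig and translates it. The text does not say how an ORF was defined, so the definition used here is an assumption: an ORF runs from an ATG to the first in-frame stop codon, on either strand, and is translated with the standard genetic code. The function names are hypothetical.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
          "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(zip(
    (a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO))
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def longest_orf(seq: str) -> str:
    """Return the longest ATG-to-stop ORF (stop included) on either strand.

    Assumption: ORFs lacking a start or a stop codon are ignored, which
    may not match how partial transcripts were handled in practice.
    """
    seq = seq.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if i + 3 - start > len(best):
                        best = strand[start:i + 3]
                    start = None
    return best


def translate(orf: str) -> str:
    """Conceptually translate an ORF; unknown codons become 'X'."""
    peptide = "".join(CODON_TABLE.get(orf[i:i + 3], "X")
                      for i in range(0, len(orf) - 2, 3))
    return peptide.rstrip("*")  # drop the terminal stop


orf = longest_orf("CCATGAAATTTGGGTAACC")
print(orf, translate(orf))
```

Scanning both strands matters here because the reads were unstranded, so a contig may represent either orientation of the transcript.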
The sequences of these
transcript contigs are in <Transcriptome_sequences.fa>
The sequences of the longest ORF in each transcript contig are in
The conceptually translated amino acid sequences of these ORFs are
A BLAST database of the transcriptome assembly is at <Transcriptome_db.zip>
A BLAST database of the set of longest ORFs from the transcriptome
assembly is at <Transcriptome_longORF_db.zip>
A BLAST database of the translated amino acid sequences of these
ORFs is at <Transcriptome_pep_db.zip>
A summary report of this transcriptome assembly is in <Transcriptome_summary.pdf>
A detailed report of this transcriptome assembly is in <Transcriptome_details.pdf>
An Excel file detailing length and RNAseq coverage of these transcript
contigs is in <Transcriptome_stats.xlsx>
After masking repeated
genomic elements, genes were modeled in the 3,713 genome scaffolds
using several methods, and then reconciled into a set of 7,112
genes (Gene Set version 1.0) using Maker.
Of these, 12 were found in the highly repeated portions of the genome
(113 scaffolds) and 7,100 in the more moderately covered portion
(3,600 scaffolds). This reconciled set is the most conservative and
most reliable of the gene model sets described here, but each individual
set is also provided. The methods used were as follows:
(1) We aligned all
processed RNAseq reads to the Genome Assembly version 1.0 using
followed by adjustment for intron-exon boundaries using Tophat
and creation of gene models using Cufflinks.
Tophat parameters were set to: single-end, unstranded, anchor length
8, no mismatches in the anchor region of the spliced alignment,
min intron 15, max intron 5000, no indel search, max alignments
1000, initial mismatches 2, minimum read segments 25, no microexon
search. Cufflinks parameters were set to: max intron 5000, min isoform
fraction 0.1, pre-mRNA fraction 0.15, no quartile normalization,
no bias correction. This created a set of 31,683 Cufflinks
gene models (327 on the deeply covered portion plus 31,356 on the
moderately covered portion), which were also used, separately, as
part of the Maker evidence (see below).
(2) All of these 31,683
gene models were entered into Maker
as a GFF file as evidence from the RNA sequencing.
(3) The 9,791 transcripts
of the filtered gene models from JGI
of Chlorella sp. NC64A were matched to the genome contigs
using BLASTn, and these alignments were refined using est2genome,
based on best modeling of intron-exon boundaries, to create 3,449
est2genome gene models (none on the deeply covered portion, so all
on the moderately covered portion).
(4) We chose sets of
protein sequences from six chlorophytes to align to the genome assembly
using BLASTx, with further refinement using protein2genome
based on best modeling of intron-exon boundaries. The organisms
chosen were (a) Chlorella sp. NC64A, 9,791 filtered models from JGI;
(b) Coccomyxa sp. C169, 9,629 filtered models (v. 2) from
JGI; (c) Chlamydomonas
reinhardtii, 17,114 models from Phytozome;
(d) Volvox carteri, 15,285 models from JGI;
(e) Ostreococcus sp. RCC809, 7,492 filtered models from JGI;
(f) Micromonas pusilla CCMP1545, 10,475 models from JGI.
Download the file of concatenated peptide sequences from these six
chlorophytes at <Six_Chlorophyte_peps.fa>.
This created 29,871 protein2genome gene models (104
on the deeply covered portion plus 29,767 on the moderately
covered portion).
(5) We created 8,599
ab initio gene models (9 on the deeply covered portion plus
8,590 on the moderately covered portion) using Augustus
trained on the gene structures of Chlamydomonas reinhardtii.
(6) We created
43,040 ab initio gene models (77 on the deeply covered portion
plus 42,963 on the moderately covered portion) using SNAP
trained on the gene structures of Arabidopsis thaliana.
(7) We created
12,869 ab initio gene models (125 on the deeply covered portion
plus 12,744 on the moderately covered portion) using Genemark
trained on the gene structures of Chlamydomonas reinhardtii.
(8) We used Maker
to reconcile all of these lines of evidence into a single set of
7,112 well-supported gene models (12 on the deeply covered portion
plus 7,100 on the moderately covered portion). This is the set of
genes that are supported by multiple lines of evidence, and so constitutes
what we designate as Gene Set version 1.0.
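Each method above reports a total together with its split between the deeply covered and moderately covered portions of the genome. A short Python check of that arithmetic is sketched below. The numbers are those given in the list; the protein2genome total is taken as 29,871, the sum of its two parts, and the dictionary and function names are hypothetical.

```python
# (deeply covered, moderately covered, reported total) per method
GENE_MODEL_COUNTS = {
    "Cufflinks": (327, 31_356, 31_683),
    "est2genome": (0, 3_449, 3_449),
    "protein2genome": (104, 29_767, 29_871),
    "Augustus": (9, 8_590, 8_599),
    "SNAP": (77, 42_963, 43_040),
    "Genemark": (125, 12_744, 12_869),
    "Maker (Gene Set v1.0)": (12, 7_100, 7_112),
}


def inconsistent_methods(counts):
    """Return the methods whose two parts do not add up to the total."""
    return [name for name, (deep, moderate, total) in counts.items()
            if deep + moderate != total]


print(inconsistent_methods(GENE_MODEL_COUNTS))
```

An empty list means every reported total agrees with its two parts.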
The GFFs (tables showing
genome features plotted on scaffolds from the genome assembly in
standardized format) for each of these sets of gene models are available
below. These have been used to create annotated versions of the
genome assembly scaffolds in both "CLC" and GenBank formats
for viewing in a genome browser. These files can be viewed using
the Sequence Viewer, free software from CLC Bio that works on any
platform, or using any of several alternatives.
The GFF, CLC, and GenBank files have been concatenated for "All
models" to allow users to view all of the gene models simultaneously
in a genome browser format (although concatenating their sequences
does not seem useful and so was not done). The corresponding transcript
nucleotide sequences and their inferred peptide sequences can be
downloaded in fasta format and as BLAST databases. Links labeled
113 and 3600 point to separate files for the 113 genome scaffolds
with the deepest coverage (>1,000x) and for the 3,600 genome scaffolds
that have between 100x and 999x genome sequence coverage.
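For readers working from a single concatenated GFF rather than the separate 113 and 3600 files, the split between the two scaffold groups can be recovered with a few lines of Python. This is a sketch under assumptions: the file is taken to follow the nine-column, tab-separated GFF layout, the set of deeply covered scaffold names must be supplied by the user, and the scaffold names in the example are made up for illustration.

```python
from typing import Iterable, Set, Tuple


def count_features(gff_lines: Iterable[str],
                   deep_scaffolds: Set[str],
                   feature_type: str = "gene") -> Tuple[int, int]:
    """Count features of one type on deep versus moderate scaffolds.

    Returns (count on deeply covered scaffolds, count on all others).
    """
    deep = moderate = 0
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section ends the annotations
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        if cols[0] in deep_scaffolds:
            deep += 1
        else:
            moderate += 1
    return deep, moderate


# Hypothetical example records; real scaffold names will differ.
example = [
    "##gff-version 3\n",
    "scaffold_1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1\n",
    "scaffold_1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1\n",
    "scaffold_200\tmaker\tgene\t50\t700\t.\t-\t.\tID=g2\n",
]
print(count_features(example, deep_scaffolds={"scaffold_1"}))
```

Applied to the Gene Set version 1.0 GFF with the 113 deepest-coverage scaffold names, the expected result would be the 12 and 7,100 genes reported above, provided genes are recorded with the feature type "gene".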
In addition to this
gene set, a manual search was made for any mitochondrial and plastid
genome sequences. First, we can confidently exclude the mtDNA from
this sequencing project. Of all 10,201 scaffolds in the genome assembly
v. 0.5, none map to the known complete mtDNA sequence of Pedinomonas
minor, and of 174,577,225 reads, only 428 map to that mtDNA,
almost surely as an artifact. In contrast, a small proportion
of the plastid genome is present: 851,875 reads map to the published
complete cpDNA of Chlorella vulgaris, corresponding
to four scaffolds in the genome assembly that sum to far less than
a complete cpDNA. We have annotated
the genes contained in these four scaffolds using DOGMA
and provide these in both "clc" and GenBank format, either
of which can be viewed in a genome browser such as is available
for free from CLC Bio (see above) or other software. These are available