Genome Project Solutions    
 
   
    Chlorella vulgaris Genome Project, in collaboration with  
    National Renewable Energy Laboratory  
 Partnering for Discovery
 

 

 

This project is led by Dr. Michael Guarnieri of the National Renewable Energy Laboratory.

(From the NREL website) Dr. Guarnieri uses a systems biology approach – utilizing genomics, transcriptomics, metabolomics, and proteomics – to identify and target pathways involved in algal lipid and hydrogen production.

For information or to report a problem: Chlorella@GenomeProjectSolutions.com

The Genome of the Green Alga, Chlorella vulgaris

Project Status

The genome sequencing of strain UTEX 395 of Chlorella vulgaris has been completed, an assembly performed (v.1.0), a gene set created (v.1.0), and gene expression measured and compared among several conditions. A manuscript describing these results is being prepared.

DNA sequencing was done by Eureka Genomics in Hercules, California. Genome Project Solutions advised on the sequencing strategy, led the genome assembly, determined the gene content and relative gene expression levels, performed the informatics to interpret and present this genome to the scientific community, and is participating in drafting the manuscript describing this project.

 
 

Genome Sequencing and Assembly

All sequencing was done using Illumina technology with 108 cycles (which sets the maximum possible read length). The 171,758,456 paired-end (SIPES) reads were trimmed to an error rate of less than approximately 1:100, then trimmed until no ambiguous nucleotides (e.g. “N”) remained, and all reads shorter than 20 nucleotides were then discarded. This retained 168,611,711 reads, of which 165,874,962 remained as pairs. (The rest became unpaired when their mates were eliminated by the quality and length trimming.)
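The trimming steps described above can be sketched in Python. This page does not name the trimming software or its exact algorithm, so the following is only an illustration under stated assumptions: quality values are Phred scores, "error rate below about 1:100" is approximated by stripping bases with an error probability above 0.01 from both ends, and the longest stretch without ambiguous bases is kept. The function names are hypothetical.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def trim_read(seq, quals, max_error=0.01, min_len=20):
    """Return the trimmed (seq, quals), or None if the read is discarded.

    Assumed procedure (not necessarily the one used in this project):
      1. strip bases from both ends whose error probability exceeds max_error;
      2. keep the longest stretch free of ambiguous bases (anything not ACGT);
      3. discard the read if fewer than min_len bases remain.
    """
    start, end = 0, len(seq)
    while start < end and phred_to_error(quals[start]) > max_error:
        start += 1
    while end > start and phred_to_error(quals[end - 1]) > max_error:
        end -= 1

    # Find the longest run of unambiguous bases inside [start, end).
    best = (start, start)
    run_start = start
    for i in range(start, end + 1):
        if i == end or seq[i].upper() not in "ACGT":
            if i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = i + 1

    s, e = best
    if e - s < min_len:
        return None
    return seq[s:e], quals[s:e]
```

A read whose mate returns `None` would become one of the unpaired reads mentioned above.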

The initial quality report on the genome sequencing is in <Initial_quality_report.zip>
Details of the sequence trimming are in: <Sequence trim_report.pdf>

These trimmed, paired-end sequencing reads were repeatedly assembled using a de Bruijn graph method with varying parameters. All results were similar, and the best one was chosen for version 0.5 of the assembly. This had 10,201 total scaffolds. The file <Assembly_v0.5_Summary> gives an overview of how this was done and what was found. The file <Assembly_v0.5.fa> is a FASTA-formatted file containing each of these 10,201 scaffolds.

We then sorted the scaffolds of this initial v.0.5 assembly by depth of genome coverage, because those supported by few reads are likely to be artifacts. Using 100x depth of coverage as the cutoff retains 3,713 scaffolds, which we publish for further analysis as Assembly Version 1.0. Ignoring the 6,488 scaffolds at lower than 100x coverage has only a minor effect, because it eliminates only relatively small scaffolds. Only one is larger than 2 kb and only eight are larger than 1 kb and, of these, none is at greater than 11x coverage. The sum of the sizes of the 1,076 assembly v. 0.5 scaffolds that are between 5x and 100x coverage is less than 418 kb. (Their sequences are available in FASTA format at <Assembly_v0.5_1076.fa>.)
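The coverage cutoff described above amounts to a simple partition of the scaffolds. A minimal Python sketch follows; the scaffold names and values are made up for illustration, and the function name is hypothetical.

```python
def split_by_coverage(scaffolds, cutoff=100.0):
    """Partition scaffolds into (kept, dropped) by mean depth of coverage.

    `scaffolds` maps a scaffold name to a (length_nt, mean_coverage) tuple.
    Scaffolds at or above the cutoff are kept.
    """
    kept, dropped = {}, {}
    for name, (length, coverage) in scaffolds.items():
        (kept if coverage >= cutoff else dropped)[name] = (length, coverage)
    return kept, dropped


# Made-up scaffolds for illustration only: name -> (length in nt, mean coverage)
example = {
    "scaffold_a": (152_000, 371.0),
    "scaffold_b": (900, 8.5),
    "scaffold_c": (640, 4_200.0),
}
kept, dropped = split_by_coverage(example)
print(sorted(kept), sorted(dropped))

# Counts reported on this page: 10,201 scaffolds in v.0.5, 3,713 retained.
assert 10_201 - 3_713 == 6_488
```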

The scaffolds of Genome Assembly v. 1.0 were then further divided for analysis into two components based on depth of coverage: (1) There are 113 scaffolds that are at least 1,000x depth of coverage, likely to be from highly repeated portions of the genome. These scaffolds sum to only 73,940 nts, but are estimated to represent 1,028,937 nts of the genome, based on how strongly they are overrepresented among the reads compared with the single-copy portion of the genome. These calculations are shown in the Excel spreadsheet linked below. (2) There are 3,600 scaffolds that are between 100x and 999x depth of coverage (mean of 366x).
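One way to make such an estimate is to scale each repeated scaffold's length by its coverage relative to single-copy coverage. The actual calculation is in the linked spreadsheet; the sketch below is only an assumed form of it, with a hypothetical function name and made-up input values.

```python
def estimated_genomic_length(scaffolds, single_copy_coverage):
    """Estimate how many genomic nucleotides a set of collapsed repeats represents.

    `scaffolds` is an iterable of (length_nt, mean_coverage) tuples.
    Each scaffold is assumed to be present in the genome in
    mean_coverage / single_copy_coverage copies.
    """
    return sum(length * coverage / single_copy_coverage
               for length, coverage in scaffolds)


# Made-up example: two repeat scaffolds, with single-copy coverage taken as 366x.
print(estimated_genomic_length([(1_000, 3_660.0), (500, 1_830.0)], 366.0))
```

For scale, the figures reported above (1,028,937 nts estimated from 73,940 nts of assembled sequence) imply an average copy number of roughly 14 across the 113 scaffolds.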

Genome Assembly v.1.0 contains 24 scaffolds that are longer than 100 kb (summing to 3,158,309 nts), a total of 566 longer than 20 kb (summing to 23,596,862 nts), and a total of 2,172 that are longer than 2 kb (summing to 36,189,064 nts).
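Summaries of this kind (number of scaffolds above a length threshold and their summed length) can be recomputed from any list of scaffold lengths. A minimal sketch, with a hypothetical function name and made-up lengths:

```python
def summarize_by_length(lengths, thresholds=(100_000, 20_000, 2_000)):
    """For each threshold, count the scaffolds longer than it and sum their lengths."""
    summary = {}
    for threshold in thresholds:
        longer = [n for n in lengths if n > threshold]
        summary[threshold] = (len(longer), sum(longer))
    return summary


# Made-up scaffold lengths for illustration only.
for threshold, (count, total) in summarize_by_length(
        [150_000, 50_000, 25_000, 3_000, 500]).items():
    print(f">{threshold:>7} nt: {count} scaffolds, {total} nts")
```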

The sequences of the 113 very deeply covered (>1,000x) scaffolds of Assembly v.1.0 are in <Assembly_v1.0_113.fa>
The sequences of the 3,600 moderately covered (100x-999x) scaffolds of Assembly v.1.0 are in <Assembly_v1.0_3600.fa>
A detailed report of the assembly for the 113 very deeply covered (>1,000x) scaffolds of Assembly v.1.0 is in <Assembly_v1.0_113_Details.pdf>
A detailed report of the assembly for the 3,600 moderately covered (100x-999x) scaffolds of Assembly v.1.0 is in <Assembly_v1.0_3600_Details.pdf>
Read mapping statistics for the two components of Assembly v.1.0 and the low-coverage scaffolds that were eliminated are at <Read_mapping_stats.xlsx>
A listing of these scaffolds in order of lengths, their cumulative lengths, and the cumulative genome coverage is at <Assembly_v.1.0_scaffolds.xls>

Download compressed BLAST databases for:

<Assembly v.0.5 all 10,201 scaffolds>
<Assembly v.0.5 1,076 scaffolds from 5x to 100x coverage>
<Assembly v.1.0 113 deeply covered (>1,000x) scaffolds>
<Assembly v.1.0 3,600 scaffolds from 100 to 999x coverage>

 
 

Identifying and Annotating Genes

RNAseq data was obtained using Illumina sequencing on cDNA that had been prepared from RNA harvested from a variety of conditions, as described in the next section on "Gene Expression Changes." All cDNA sequencing was performed as single (i.e., unpaired) reads using 36 cycles.

The quality of these reads was evaluated using the program FASTQC. Of the original 412,236,998 RNAseq reads, we removed 196,647,675 that are duplicates (either from artifacts of the sequencing process or coincidental correspondence, especially for the most highly expressed transcripts), leaving 215,589,323 RNAseq reads for assembly of the transcriptome. These reads were trimmed to an error rate of less than approximately 1:100, then trimmed until no ambiguous nucleotides (e.g. “N”) remained, and all reads shorter than 15 nucleotides were then discarded. This retained 212,887,812 unique, high-quality RNAseq reads.
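Duplicate removal can be illustrated as follows. This sketch assumes that "duplicate" means an identical read sequence, which this page does not state explicitly, and the function name is hypothetical.

```python
def remove_duplicate_reads(reads):
    """Keep the first occurrence of each distinct read sequence, preserving order."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


print(remove_duplicate_reads(["ACGT", "TTGA", "ACGT", "ACGT", "GGCA"]))

# Read counts reported on this page.
assert 412_236_998 - 196_647_675 == 215_589_323
```

For hundreds of millions of reads, an in-memory set like this would need a great deal of RAM, so a production tool would likely use a different strategy.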

FASTQC results are in <RNAseq_FASTQC.zip>
Details for this processing of RNAseq reads are in <RNAseq_processing.pdf>

These reads were assembled using a de Bruijn graph method into a set of 27,303 transcript contigs, each at least 200 nts in length. The longest open reading frame (ORF) was determined for each transcript contig and then conceptually translated into an amino acid sequence.
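The ORF step can be sketched as below. This page does not define an ORF precisely, so the sketch assumes an ORF runs from an ATG to the next in-frame stop codon, searches all six reading frames, ignores ORFs that run off the end of the contig, and uses the standard genetic code. The function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_orf(seq):
    """Return the longest ATG-to-stop ORF (stop codon included) on either strand."""
    seq = seq.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and CODON_TABLE.get(codon) == "*":
                    if i + 3 - start > len(best):
                        best = strand[start:i + 3]
                    start = None
    return best


def translate(orf):
    """Translate an ORF up to its first stop codon; unknown codons become 'X'."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE.get(orf[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


orf = longest_orf("CCATGGCTTGGAAATAAGG")
print(orf, translate(orf))
```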

The sequences of these transcript contigs are in <Transcriptome_sequences.fa>
The sequences of the longest ORF in each transcript contig are in <Transcriptome_longORFs.fa>
The conceptually translated amino acid sequences of these ORFs are in <Transcriptome_longORF_peptides.fa>
A BLAST database of the transcriptome assembly is at <Transcriptome_db.zip>
A BLAST database of the set of longest ORFs from the transcriptome assembly is at <Transcriptome_longORF_db.zip>
A BLAST database of the translated amino acid sequences of these ORFs is at <Transcriptome_pep_db.zip>
A summary report of this transcriptome assembly is in <Transcriptome_summary.pdf>
A detailed report of this transcriptome assembly is in <Transcriptome_details.pdf>
An Excel file detailing length and RNAseq coverage of these transcript contigs is in <Transcriptome_stats.xlsx>

After masking repeated genomic elements, genes were modeled in the 3,713 genome scaffolds using several methods, and these models were then reconciled into a set of 7,112 genes (Gene Set version 1.0) using Maker. Of these, 12 were found in the highly repeated portions of the genome (113 scaffolds) and 7,100 in the more moderately covered portion (3,600 scaffolds). This reconciled set is the most conservative and most reliable of the sets of gene models, but each individual set is also provided. Here are the various methods used:

(1) We aligned all processed RNAseq reads to the Genome Assembly version 1.0 using Bowtie followed by adjustment for intron-exon boundaries using Tophat and creation of gene models using Cufflinks. Tophat parameters were set to: single-end, unstranded, anchor length 8, no mismatches in the anchor region of the spliced alignment, min intron 15, max intron 5000, no indel search, max alignments 1000, initial mismatches 2, minimum read segments 25, no microexon search. Cufflinks parameters were set to: max intron 5000, min isoform fraction 0.1, pre-mRNA fraction 0.15, no quartile normalization, no bias correction. This created a set of 31,683 Cufflinks gene models (327 on the deeply covered portion plus 31,356 on the moderately covered portion), which are provided as a separate set and were also used as part of the Maker evidence (see below).

(2) All of these 31,683 Cufflinks gene models were entered into Maker as a GFF file as evidence from the RNA sequencing.

(3) The 9,791 transcripts of the filtered gene models from JGI of Chlorella sp. NC64A were matched to the genome contigs using BLASTn and these alignments refined using est2genome based on best modeling of intron-exon boundaries to create 3,449 est2genome gene models (none on the deeply covered portion, so all on the moderately covered portion).

(4) We chose sets of protein sequences from six chlorophytes to align to the genome assembly using BLASTx, with further refinement using protein2genome based on best modeling of intron-exon boundaries. The organisms chosen were (a) Chlorella sp. NC64A, 9,791 filtered models from JGI; (b) Coccomyxa sp. C169, 9,629 filtered models (v. 2) from JGI; (c) Chlamydomonas reinhardtii, 17,114 models from Phytozome; (d) Volvox carteri, 15,285 models from JGI; (e) Ostreococcus sp. RCC809, 7,492 filtered models from JGI; (f) Micromonas pusilla CCMP1545, 10,475 models from JGI. Download the file of concatenated peptide sequences from these six chlorophytes at <Six_Chlorophyte_peps.fa>. This created 29,871 protein2genome gene models (104 on the deeply covered portion plus 29,767 on the moderately covered portion).

(5) We created 8,599 ab initio gene models (9 on the deeply covered portion plus 8,590 on the moderately covered portion) using Augustus trained on the gene structures of Chlamydomonas reinhardtii.

(6) We created 43,040 ab initio gene models (77 on the deeply covered portion plus 42,963 on the moderately covered portion) using SNAP trained on the gene structures of Arabidopsis thaliana.

(7) We created 12,869 ab initio gene models (125 on the deeply covered portion plus 12,744 on the moderately covered portion) using Genemark trained on the gene structures of Chlamydomonas reinhardtii.

(8) We used Maker to reconcile all of these lines of evidence into a single set of 7,112 well-supported gene models (12 on the deeply covered portion plus 7,100 on the moderately covered portion). This is the set of genes that are supported by multiple lines of evidence, and so comprise what we designate as Gene Set version 1.0.
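The counts given for each method in the list above can be tabulated and checked for internal consistency. The numbers in this sketch are taken from this page; the variable and function names are hypothetical.

```python
# (deeply covered, moderately covered, reported total), as stated on this page
GENE_MODEL_COUNTS = {
    "Cufflinks": (327, 31_356, 31_683),
    "est2genome": (0, 3_449, 3_449),
    "protein2genome": (104, 29_767, 29_871),
    "Augustus": (9, 8_590, 8_599),
    "SNAP": (77, 42_963, 43_040),
    "Genemark": (125, 12_744, 12_869),
    "Maker (Gene Set v1.0)": (12, 7_100, 7_112),
}


def inconsistent_methods(counts):
    """Return the methods whose two component counts do not add up to the total."""
    return [method for method, (deep, moderate, total) in counts.items()
            if deep + moderate != total]


for method, (deep, moderate, total) in GENE_MODEL_COUNTS.items():
    print(f"{method:<24}{deep:>6}{moderate:>8}{total:>8}")
print("Inconsistent:", inconsistent_methods(GENE_MODEL_COUNTS))
```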

The GFFs (tables showing genome features plotted on scaffolds from the genome assembly in standardized format) for each of these sets of gene models are available below. These have been used to create annotated versions of the genome assembly scaffolds in both "CLC" and GenBank formats for viewing in a genome browser. These files can be viewed using free software from CLC Bio called the Sequence Viewer that works on any platform (or several other alternatives). The GFF, CLC, and GenBank files have been concatenated for "All models" to allow users to view all of the gene models simultaneously in a genome browser format (although concatenating their sequences does not seem useful and so was not done). The corresponding transcript nucleotide sequences and their inferred peptide sequences can be downloaded in fasta format and as BLAST databases. Cases with numerals of 113 and 3600 link to separate files for the 113 deepest coverage (>1,000x) genome scaffolds versus the 3,600 genome scaffolds that have between 100 and 999x genome sequence coverage.

Gene Modeling Downloads

Columns: Type of gene modeling; GFF; CLC; GenBank; Transcript sequences; Inferred protein sequences; BLAST db of nucleotides; BLAST db of proteins

Cufflinks only: 113 / 3600
est2genome only: none / 3600, none / 3600, none / 3600, none / 3600, none / 3600, none / 3600
protein2genome only: 113 / 3600
Augustus only
SNAP only
Genemark only
All models (highly redundant): N/A, N/A, N/A, N/A
Reconciled Maker models

In addition to this gene set, a manual search was made for any mitochondrial and plastid genome sequences. First, we can confidently exclude the mtDNA from this sequencing project. Of all 10,201 scaffolds in the genome assembly v. 0.5, none map to the known complete mtDNA sequence of Pedinomonas minor, and of 174,577,225 reads, only 428 map to that mtDNA, almost surely as an artifact. In contrast, there is a small proportion of the plastid genome, with 851,875 reads mapping to the complete cpDNA of Chlorella vulgaris that has been published, corresponding to four scaffolds in the genome assembly that sum to far less than a complete cpDNA. We have annotated the genes contained in these four scaffolds using DOGMA and provide these in both "clc" and GenBank format, either of which can be viewed in a genome browser such as is available for free from CLC Bio (see above) or other software. These are available at <cpDNA_scaffolds.zip>.

 
 

Gene Expression Changes

Chlorella vulgaris was grown in nitrogen-rich conditions (corresponding to a low-lipid state) at the entry of log phase of growth (optical density, "OD", of 2). Three replicate samples (OD2a, OD2b, and OD2c) were collected. This culture was then resuspended in nitrogen-poor media (corresponding to a high-lipid state) at increasing cell density and lipid content, sampled at OD of 4 where two samples were collected (OD4A and OD4B), at OD of 6 where two samples were collected (OD6A and OD6B), and at OD of 8 where three samples were collected (OD8a, OD8b, and OD8c). Additionally, two samples at OD of 7 were taken (OD7Ra and OD7Rb) from a high density, nitrogen-poor state that was naturally depleted (i.e., it did not undergo resuspension in nitrogen-poor media), and a technical replicate of one sample (OD7Ra2) was collected. Lastly, a single sample was taken in heterotrophic growth conditions (i.e., without light and with the addition of glucose). Each of these samples was processed to isolate RNA, which was converted into cDNA, sheared, and processed for Illumina sequencing.

All of these raw (i.e., without removing duplicates or trimming as above) Illumina reads were separately aligned to the 7,112 gene models created by Maker (gene set 1.0) to evaluate changes in gene expression among these various conditions. The read counts were normalized for the length of each gene model and for the differing numbers of total aligning reads per sample.
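This page does not give the normalization formula. One common way to normalize for both gene length and total aligned reads is an RPKM-style value (reads per kilobase of gene model per million aligned reads); the sketch below assumes that form, and its function name and input values are hypothetical.

```python
def normalized_coverage(read_counts, gene_lengths, total_aligned_reads):
    """Return an RPKM-style value for each gene model in one sample.

    read_counts:  gene name -> number of reads aligned to that gene model
    gene_lengths: gene name -> gene model length in nucleotides
    total_aligned_reads: total reads aligned in this sample
    """
    return {
        gene: count * 1e9 / (gene_lengths[gene] * total_aligned_reads)
        for gene, count in read_counts.items()
    }


# Made-up example: the shorter gene has half the reads but the same normalized value.
print(normalized_coverage({"gene_1": 100, "gene_2": 50},
                          {"gene_1": 1_000, "gene_2": 500},
                          1_000_000))
```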

The Excel file with these results can be downloaded at <RNAseq_mapped2Maker_genes.xlsx>.

This spreadsheet gives the number of reads aligning to each gene model for each sample and the normalized coverage. It averages the normalized coverage within each set of samples from identical conditions and calculates the deviation of each sample from this average (with red highlighting for any that deviate from the mean by 30% or more). It then compares these averaged values for each condition pairwise to the OD2 samples (with red highlighting for any decrease in expression of two-fold or more and green highlighting for any increase in expression of two-fold or more).
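The two checks the spreadsheet applies can be expressed as a short Python sketch. The thresholds come from the description above; the function names, return labels, and the handling of zero values are assumptions made for illustration.

```python
def replicate_deviations(values, threshold=0.30):
    """Return (mean, flags), where flags[i] is True if values[i] deviates
    from the mean by at least `threshold` (as a fraction of the mean)."""
    mean = sum(values) / len(values)
    if mean == 0:
        return mean, [False] * len(values)
    return mean, [abs(v - mean) / mean >= threshold for v in values]


def classify_change(condition_mean, baseline_mean, fold=2.0):
    """Compare a condition's mean with the baseline (OD2) mean.

    Returns "increased" (green in the spreadsheet), "decreased" (red),
    "unchanged", or "undefined" when either mean is zero.
    """
    if condition_mean == 0 or baseline_mean == 0:
        return "undefined"
    ratio = condition_mean / baseline_mean
    if ratio >= fold:
        return "increased"
    if ratio <= 1 / fold:
        return "decreased"
    return "unchanged"


# Made-up normalized coverage values for three replicates of one gene model.
print(replicate_deviations([100.0, 100.0, 40.0]))
print(classify_change(200.0, 100.0), classify_change(50.0, 100.0))
```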

 
  Last update: May 30, 2012