Population genomic dataset of Gafftopsail Catfish obtained aboard several vessels including the R/V Weatherbird II, WB1603, in the Gulf of Mexico from 2015-09-21 to 2017-09-03
Funded By:
Gulf of Mexico Research Initiative
Funding Cycle:
RFP-VI
Research Group:
Center for the Integrated Modeling and Analysis of Gulf Ecosystems III (C-IMAGE III)
David Seth Portnoy
Texas A&M University-Corpus Christi / The Harte Research Institute for Gulf of Mexico Studies
david.portnoy@tamucc.edu
Population genetics, next-generation sequencing, marine catfish, Gafftopsail Catfish, population
Abstract:
This dataset contains population genomic data for Gafftopsail Catfish obtained aboard several vessels, including the R/V Weatherbird II (WB1603), in the Gulf of Mexico from 2015-09-21 to 2017-09-03. The dataset consists of metadata describing the location, size, date of capture, and other pertinent information associated with the sequenced individuals. It also includes genepop files containing the multilocus genotype of every individual sampled. Every individual has a unique identifier present in both the metadata and the genepop file, allowing users to cross-reference information. The metadata include the date, latitude, and longitude of each sampling location.
Suggested Citation:
Portnoy, David and Shannon O'Leary. 2021. Population genomic dataset of Gafftopsail Catfish obtained aboard several vessels including the R/V Weatherbird II, WB1603, in the Gulf of Mexico from 2015-09-21 to 2017-09-03. Distributed by: GRIIDC, Harte Research Institute, Texas A&M University–Corpus Christi. doi:10.7266/n7-8jya-4m57
Purpose:
Assessment of population structure and local adaptation of Gafftopsail Catfish.
Data Parameters and Units:
Genepop files list alleles coded as three-digit integers. Alleles are single-nucleotide polymorphisms. No special units are associated with the metadata, but a key is provided for clarity.

In the metadata, under Sampler: CIMAGE = Center for the Integrated Modeling and Analysis of Gulf Ecosystems, TAMUCC = Texas A&M University-Corpus Christi, FSU = Florida State University, FWC = Florida Fish and Wildlife Conservation Commission, NCF = New College of Florida, GCRL = Gulf Coast Research Laboratory, LDWF = Louisiana Department of Wildlife and Fisheries, MSU = Mississippi State University.

The headers are Sample ID, Lat, Long, Sex, Weight, Sampler, Population, Field ID, Collection Date, SL(mm), FL(mm), TL(mm), Surface Temp C, Surface Salinity ppt, Surface DO mg/L, Tubidity, 3m Temp C, 3m Salinity ppt, 3m DO mg/L, T_celsius_bottom, S_bottom, DO_bottom_mgl, Depth_m, TTL WT KG, LIVER WT KG, GI WT KG, GONAD WT KG, and Liver Lipid (g).

Abbreviations used in the headers: SL(mm) = standard length, FL(mm) = fork length, TL(mm) = total length, Temp = temperature.

The cruise documentation was provided for the R/V Weatherbird II (WB1603). The cruise was led by chief scientist Dr. Steven Murawski.
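As an illustration of how the unique identifiers allow genotypes and metadata to be cross-referenced, the following is a minimal Python sketch. It assumes the common genepop layout (a title line, locus names, then "Pop" separators followed by "ID , genotypes" lines) and assumes the metadata is available as comma-separated text with a "Sample ID" column. The sample identifiers, locus names, and coordinates in the example are made up for illustration and do not come from the dataset.

```python
import csv
import io


def parse_genepop(text, digits=3):
    """Parse genepop text into {sample_id: {locus: (allele1, allele2) or None}}.

    Assumes a title line, then locus names (one per line or comma-separated),
    then 'Pop' separators followed by 'ID , genotype genotype ...' lines.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    body = lines[1:]  # skip the title line
    first_pop = next(i for i, ln in enumerate(body) if ln.lower() == "pop")
    loci = [name.strip()
            for ln in body[:first_pop]
            for name in ln.split(",") if name.strip()]
    genotypes = {}
    for ln in body[first_pop:]:
        if ln.lower() == "pop":
            continue
        sample_id, calls = ln.split(",", 1)
        per_locus = {}
        for locus, call in zip(loci, calls.split()):
            a, b = int(call[:digits]), int(call[digits:])
            # Genepop codes a missing allele as zero.
            per_locus[locus] = None if a == 0 or b == 0 else (a, b)
        genotypes[sample_id.strip()] = per_locus
    return genotypes


def join_metadata(genotypes, metadata_csv, id_column="Sample ID"):
    """Pair each metadata row with the genotypes sharing its identifier."""
    rows = csv.DictReader(io.StringIO(metadata_csv))
    return {row[id_column]: (row, genotypes[row[id_column]])
            for row in rows if row[id_column] in genotypes}


# Hypothetical example data (identifiers and values are invented).
EXAMPLE_GENEPOP = """Example title
Locus_1
Locus_2
Pop
GC_001 , 001002 003003
GC_002 , 000000 001003
"""
EXAMPLE_METADATA = (
    "Sample ID,Lat,Long\n"
    "GC_001,27.5,-97.1\n"
    "GC_002,29.0,-89.5\n"
    "GC_003,30.0,-88.0\n"
)

genotypes = parse_genepop(EXAMPLE_GENEPOP)
joined = join_metadata(genotypes, EXAMPLE_METADATA)
```

Here `joined` holds only the individuals present in both sources, so a metadata row without a genotype (GC_003 in the invented example) is left out.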
Methods:
Sampling & Library prep:
Double-digest restriction-site associated (ddRAD) libraries, consisting of ~40,000 sequence fragments, were generated following a modified ddRAD-tag procedure. Briefly, genomic DNA was digested with two restriction endonucleases (EcoRI, MspI), and the resulting fragments were ligated to two adapter oligonucleotides (P1 and P2) that serve as barcodes. Following adapter ligation, barcoded and indexed sequences were pooled and size-selected to a specific size range (338 – 412 base pairs) using a PippinPrep size-selection system (Sage Science). Polymerase chain reaction (PCR) amplification of fragments was performed to incorporate the adapters necessary for annealing to an Illumina flow cell during a sequencing run on the Illumina HiSeq platform. Approximately 150 individuals from throughout the geographic range of sampling were sequenced on a single Illumina HiSeq lane; the data represent 458 individuals sequenced across three sequencing runs.

Reduced-representation reference assembly:
RAD sequences retrieved from each sequencing run were separated by individual fish using barcode-index sequences and quality trimmed, and homologous sequences for each individual were assembled into multiple alignments to identify individual single-nucleotide polymorphisms (SNPs) using the dDocent pipeline (Puritz et al., 2014). Low-quality bases (quality < 20) were trimmed from the beginning and end of reads; additionally, bases were trimmed when the average quality dropped below 5 in a sliding window of 5 base pairs. For de novo reference assembly, the ten individuals with the highest number of reads were selected from each library (sequencing lane), and the overlapping read (OL) assembly option in dDocent was used. First, CD-HIT (Fu et al., 2012) was used to cluster reads into putative loci according to a user-selected value of percent similarity (c) and cut-off values for within-individual (K1) and between-individual (K2) coverage.
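The role of the two coverage cut-offs can be illustrated with a short Python sketch. It assumes that K1 is the minimum number of times a unique sequence must occur within an individual, and K2 is the minimum number of individuals in which a sequence must pass K1, before the sequence is passed on to clustering. This is a simplified reading of the cut-offs, not the dDocent implementation; the function name and the toy reads are hypothetical.

```python
from collections import Counter


def filter_unique_reads(reads_by_individual, k1, k2):
    """Return sequences seen >= k1 times within an individual
    in at least k2 individuals (sorted for reproducibility)."""
    individuals_passing = Counter()
    for reads in reads_by_individual.values():
        counts = Counter(reads)
        for seq, n in counts.items():
            if n >= k1:  # within-individual coverage cut-off (K1)
                individuals_passing[seq] += 1
    # between-individual cut-off (K2)
    return sorted(seq for seq, n in individuals_passing.items() if n >= k2)


# Toy reads (invented): only "ACGT" reaches 5 copies in two individuals.
reads = {
    "ind_A": ["ACGT"] * 5 + ["TTGA"] * 2,
    "ind_B": ["ACGT"] * 6 + ["GGCC"] * 5,
    "ind_C": ["TTGA"] * 7,
}
kept = filter_unique_reads(reads, k1=5, k2=2)
```

Raising either cut-off discards more rare sequences, which are more likely to be sequencing errors, at the cost of dropping genuine low-coverage loci.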
A custom script was used to assemble references for c = 0.8 – 0.92 and K1 and K2 = 1 – 10. A c-value of 0.8 was chosen based on the point at which a sudden increase in the number of loci in a reference was observed, indicating that loci were being over-split. K1 and K2 were chosen after comparing mapping statistics for ten individuals randomly chosen from each library and mapped, using BWA (Li and Durbin, 2009), to references generated for c = 0.8, K1 = 2 – 10, and K2 = 1 – 10. The aim was to maximize the number of reads mapped as a proper pair and to minimize the number of reads whose forward and reverse reads mapped to different contigs. The final reduced-representation reference genome was constructed for c = 0.8, K1 = 5, and K2 = 2, encompassing a total of 10,874,990 base pairs over 37,872 fragments (mean 287 bp; mode 307 bp).

Genotyping:
Reads were mapped to the reduced-representation reference using BWA, and SNPs were called using freebayes (Garrison and Marth, 2012). The resulting data set was filtered to remove low-quality and artefactual SNP sites, paralogs, and low-quality individuals using vcftools (Danecek et al., 2011) and custom scripts following O’Leary et al. (2018). Genotypes with quality < 20 and < 5 reads were coded as missing, and loci were retained if they had quality > 20, a genotype call rate > 90%, and a mean depth of 15 – 300. Further, loci were filtered based on allelic balance, mapping quality ratios, strandedness, paired status, depth/quality ratio, and excess heterozygosity. Individuals with > 25% missing data were removed. Finally, rad_haplotyper (Willis et al., 2017) was used to determine haplotypes (SNP combinations) for each contig. The resulting haplotyped data set was further filtered to remove loci haplotyped in < 90% of individuals, flagged as potential paralogs in > 4 individuals, or flagged as affected by genotyping error in > 10 individuals.
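The threshold-based part of the SNP filtering can be sketched in plain Python as follows. This is an illustration only, not the vcftools commands or custom scripts used for the dataset. It assumes that a genotype is coded as missing when either its quality or its read depth falls below the cut-off, and that mean depth is averaged over all individuals at a locus. It omits the locus-level quality score and the allelic balance, mapping quality, strandedness, and heterozygosity filters. All names and values are hypothetical.

```python
def mask_low_quality(genotypes, min_gq=20, min_dp=5):
    """Code a genotype as missing (None) when quality or depth is too low.

    genotypes: {individual: (genotype, quality, depth)}
    Assumption: failing either cut-off is enough to mask the call.
    """
    return {ind: (gt if gq >= min_gq and dp >= min_dp else None)
            for ind, (gt, gq, dp) in genotypes.items()}


def filter_loci(loci, min_call_rate=0.90, depth_range=(15, 300)):
    """Keep loci with call rate > 90% and mean depth within 15-300."""
    kept = {}
    for locus, genotypes in loci.items():
        masked = mask_low_quality(genotypes)
        call_rate = sum(gt is not None for gt in masked.values()) / len(masked)
        mean_depth = sum(dp for _, _, dp in genotypes.values()) / len(genotypes)
        if call_rate > min_call_rate and depth_range[0] <= mean_depth <= depth_range[1]:
            kept[locus] = masked
    return kept


def keep_individuals(kept_loci, max_missing=0.25):
    """Return individuals whose missing-data fraction is at most 25%."""
    individuals = {ind for calls in kept_loci.values() for ind in calls}
    result = set()
    for ind in individuals:
        missing = sum(calls.get(ind) is None for calls in kept_loci.values())
        if missing / len(kept_loci) <= max_missing:
            result.add(ind)
    return result


# Invented example: (genotype, quality, depth) per individual.
loci = {
    "contig_1": {"i1": ("0/1", 40, 30), "i2": ("0/0", 35, 25),
                 "i3": ("1/1", 50, 40), "i4": ("0/1", 30, 20)},
    "contig_2": {"i1": ("0/1", 40, 30), "i2": ("0/0", 10, 25),
                 "i3": ("1/1", 50, 3), "i4": ("0/1", 30, 20)},
    "contig_3": {"i1": ("0/1", 40, 900), "i2": ("0/0", 35, 800),
                 "i3": ("1/1", 50, 700), "i4": ("0/1", 30, 600)},
}
kept_loci = filter_loci(loci)
```

In the invented example, contig_2 is dropped because two genotypes are masked (call rate 50%), and contig_3 is dropped because its mean depth exceeds 300, which can indicate collapsed paralogs.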
Duplicate samples were compared to assess genotyping error based on discordant genotypes (< 5% of loci affected), and loci systematically affected by genotyping error or flagged as out of Hardy-Weinberg equilibrium (HWE) in > 5 populations were removed, resulting in a final data set consisting of 376 individuals genotyped for 5,556 SNP-containing loci (18,611 alleles).
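The comparison of duplicate samples can be sketched as a genotype discordance rate, shown below in Python. It assumes genotypes are stored as unordered pairs of allele codes, with None for missing calls, and that only loci called in both replicates are compared. The function name and the example genotypes are hypothetical and are not taken from the dataset.

```python
def discordance(calls_a, calls_b):
    """Compare two replicate genotype sets.

    calls_a, calls_b: {locus: (allele1, allele2) or None}
    Returns (rate, discordant_loci); rate is None if no locus is shared.
    """
    shared = [loc for loc in calls_a
              if calls_a[loc] is not None and calls_b.get(loc) is not None]
    if not shared:
        return None, []
    # Allele order within a genotype carries no meaning, so sort first.
    discordant = [loc for loc in shared
                  if sorted(calls_a[loc]) != sorted(calls_b[loc])]
    return len(discordant) / len(shared), discordant


# Invented duplicate pair: L2 differs, L4 is missing in one replicate.
rep1 = {"L1": (1, 2), "L2": (3, 3), "L3": (1, 1), "L4": None}
rep2 = {"L1": (2, 1), "L2": (3, 4), "L3": (1, 1), "L4": (2, 2)}
rate, bad_loci = discordance(rep1, rep2)
```

Tallying `bad_loci` across all duplicate pairs gives one way to spot loci that are repeatedly discordant and are therefore candidates for removal.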
Provenance and Historical References:
Danecek, Petr, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert E. Handsaker, Gerton Lunter, Gabor T. Marth, Stephen T. Sherry, Gilean McVean, Richard Durbin. 2011. The variant call format and VCFtools. Bioinformatics 27(15): 2156–2158. DOI: 10.1093/bioinformatics/btr330

Fu, Limin, Beifang Niu, Zhengwei Zhu, Sitao Wu, Weizhong Li. 2012. CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics 28(23): 3150–3152. DOI: 10.1093/bioinformatics/bts565

Garrison, Erik and Gabor Marth. 2012. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN]

Li, Heng and Richard Durbin. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14): 1754–1760. DOI: 10.1093/bioinformatics/btp324

O’Leary, Shannon J., Jonathan B. Puritz, Stuart C. Willis, Christopher M. Hollenbeck, David S. Portnoy. 2018. These aren’t the loci you’re looking for: Principles of effective SNP filtering for molecular ecologists. Molecular Ecology 27(16): 3193–3206. DOI: 10.1111/mec.14792

Puritz, Jonathan B., Christopher M. Hollenbeck, John R. Gold. 2014. dDocent: a RADseq, variant-calling pipeline designed for population genomics of non-model organisms. PeerJ 2(1): e431. DOI: 10.7717/peerj.431

Willis, Stuart C., Christopher M. Hollenbeck, Jonathan B. Puritz, John R. Gold, David S. Portnoy. 2017. Haplotyping RAD loci: an efficient method to filter paralogs and account for physical linkage. Molecular Ecology Resources 17(5): 955–965. DOI: 10.1111/1755-0998.12647