Recent advances in sequencing technologies have made it possible to sequence non-model organisms with very large and complex genomes. The resulting data can be used to estimate diverse genome characteristics, including genome size, repeat content, and levels of heterozygosity. K-mer analysis is a powerful biocomputational approach with a wide range of applications, including the estimation of genome size. However, interpretation of the results is not always straightforward. Here, I review k-mer-based genome size estimation, focusing specifically on k-mer theory and peak calling in k-mer frequency histograms. I highlight common pitfalls in data analysis and result interpretation, and provide a comprehensive overview of current methods and programs developed to conduct these analyses.
Keywords: BB-tools; CovEST; FindGSE; GCE; GenomeScope; Jellyfish; KSA; Kmergenie; RESPECT.
© 2023. The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature.