Works by Edmund M. Hart

Reviews

Data is the central currency of science, but the nature of scientific data has changed dramatically with the rapid pace of technology. This change has led to the development of various data formats, dataset sizes, data complexity, data use cases, and data-sharing practices. Improvements in high-throughput DNA sequencing, sustained institutional support for large sensor networks [1,2], and sky surveys with large-format digital cameras [3] have created massive quantities of data. At the same time, the combination of increasingly diverse research teams [4] and data aggregation in portals (e.g., for biodiversity data, GBIF.org, or iDigBio) necessitates increased coordination among data collectors and institutions [5,6]. Consequently, “data” can now mean anything from petabytes of information stored in professionally maintained databases to spreadsheets on a single computer to handwritten tables in lab notebooks on shelves. All remain important, but data curation practices must continue to keep pace with the changes brought about by new forms of data and new data collection and storage practices.

This article describes ten simple rules for digital data storage that grew out of a long discussion among instructors for the Software and Data Carpentry initiatives [17,18]. Software and Data Carpentry instructors are scientists from diverse backgrounds who have encountered a variety of data storage challenges and are active in teaching other scientists best practices for scientific computing and data management. Thus, this paper represents a distillation of collective experience and hopefully will be useful to scientists facing a variety of data storage challenges. We additionally provide a glossary of common vocabulary for readers who may not be familiar with particular terms.

Contents
1. Introduction
2. Rule 1: Anticipate How Your Data Will Be Used
3. Rule 2: Know Your Use Case
4. Rule 3: Keep Raw Data Raw
5. Rule 4: Store Data in Open Formats
6. Rule 5: Data Should Be Structured for Analysis
7. Fig. 1, Tables A and B: Untidy Data Example and Tidy Data Sample
8. Rule 6: Data Should Be Uniquely Identifiable
9. Rule 7: Link Relevant Metadata
10. Rule 8: Adopt the Proper Privacy Protocols
11. Rule 9: Have a Systematic Backup Scheme
12. Rule 10: The Location and Method of Data Storage Depend on How Much Data You Have
13. Further Reading and Resources
14. Glossary
15. Acknowledgments
16. References

1. Reid JG, Carroll A, Veeraraghavan N, Dahdouli M, Sundquist A, English A, et al. Launching genomics into the cloud: deployment of Mercury, a next generation sequence analysis pipeline. BMC bioinformatics. 2014; 15: 30. doi: 10.1186/1471-2105-15-30 PMID: 24475911
2. Hampton SE, Strasser C, Tewksbury JJ, Gram WK, Budden AE, Batcheller AL, et al. Big data and the future of ecology. Frontiers in Ecology and the Environment. 2013; 130312142848005. doi: 10.1890/120103
3. Eisenstein DJ, others. SDSS-III: Massive Spectroscopic Surveys of the Distant Universe, the Milky Way, and Extra-Solar Planetary Systems. The Astronomical Journal. 2011; 142: 72. doi: 10.1088/0004-6256/142/3/72
4. Adams J. Collaborations: The rise of research networks. Nature. Nature Publishing Group; 2012; 490: 335–6.
5. Fraser LH, Henry HA, Carlyle CN, White SR, Beierkuhnlein C, Cahill JF, et al. Coordinated distributed experiments: an emerging tool for testing global hypotheses in ecology and environmental science. Frontiers in Ecology and the Environment. Ecological Society of America; 2013; 11: 147–155. doi: 10.1890/110279
6. Robertson T, Döring M, et al. The GBIF integrated publishing toolkit: Facilitating the efficient publishing of biodiversity data on the internet. PLoS ONE. Public Library of Science; 2014; 9: e102623. doi: 10.1371/journal.pone.0102623 PMID: 25099149
7. Wolkovich EM, Regetz J, O'Connor MI. Advances in global change research require open science by individual researchers. Global Change Biology. 2012; 18: 2102–2110. doi: 10.1111/j.1365-2486.2012.02693.x
8. Roche DG, Lanfear R, Binning SA, Haff TM, Schwanz LE, Cain KE, et al. Troubleshooting public data archiving: suggestions to increase participation. PLoS biology. 2014; 12: e1001779. doi: 10.1371/journal.pbio.1001779 PMID: 24492920
9. White E, Baldridge E, Brym Z, Locey K, McGlinn D, Supp S. Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution. 2013; 6: 1–10. doi: 10.4033/iee.2013.6b.6.f
10. Goodman A, Pepe A, Blocker AW, Borgman CL, Cranmer K, Crosas M, et al. Ten simple rules for the care and feeding of scientific data. PLoS computational biology. 2014; 10: e1003542. doi: 10.1371/journal.pcbi.1003542 PMID: 24763340
11. Pepe A, Goodman A, Muench A, Crosas M, Erdmann C. How Do Astronomers Share Data? Reliability and Persistence of Datasets Linked in AAS Publications and a Qualitative Study of Data Practices among US Astronomers. Golden AA-J, editor. PLoS ONE. 2014; 9: e104798. doi: 10.1371/journal.pone.0104798 PMID: 25165807
12. Vines TH, Albert AYK, Andrew RL, Débarre F, Bock DG, Franklin MT, et al. The availability of research data declines rapidly with article age. Current biology: CB. Elsevier; 2014; 24: 94–7. doi: 10.1016/j.cub.2013.11.014 PMID: 24361065
13. Michener WK, Jones MB. Ecoinformatics: supporting ecology as a data-intensive science. Trends in ecology & evolution. 2012; 27: 85–93. doi: 10.1016/j.tree.2011.11.016 PMID: 22240191
14. Michener WK, Brunt JW, Helly JJ, Kirchner TB, Stafford SG. Nongeospatial metadata for the ecological sciences. Ecological Applications. Eco Soc America; 1997; 7: 330–342. doi: 10.1890/1051-0761(1997)007[0330:NMFTES]2.0.CO;2
15. Marcial LH, Hemminger BM. Scientific data repositories on the Web: An initial survey. Journal of the American Society for Information Science and Technology. 2010; 61: 2029–2048. doi: 10.1002/asi.21339
16. Michener WK. Ten simple rules for creating a good data management plan. PLoS computational biology. Public Library of Science; 2015; 11: e1004525. doi: 10.1371/journal.pcbi.1004525 PMID: 26492633
17. Wilson G. Software Carpentry: lessons learned. F1000Research. 2014; 3: 62. doi: 10.12688/f1000research.3-62.v1 PMID: 24715981
18. Teal TK, Cranston KA, Lapp H, White E, Wilson G, Ram K, et al. Data Carpentry: Workshops to Increase Data Literacy for Researchers. International Journal of Digital Curation. 2015; 10: 135–143. doi: 10.2218/ijdc.v10i1.351
19. Roche DG, Kruuk LEB, Lanfear R, Binning SA. Public Data Archiving in Ecology and Evolution: How Well Are We Doing? PLoS Biology. 2015; 13: e1002295. doi: 10.1371/journal.pbio.1002295 PMID: 26556502
20. Higgins D, Berkley C, Jones M. Managing heterogeneous ecological data using morpho [Internet]. IEEE Comput. Soc; 2002. pp. 69–76. doi: 10.1109/SSDM.2002.1029707
21. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and environmental microbiology. Am Soc Microbiol; 2009; 75: 7537–7541. doi: 10.1128/AEM.01541-09 PMID: 19801464
22. Koziol Q, Matzke R. Hdf5: A new generation of hdf: Reference manual and user guide. National Center for Supercomputing Applications, Champaign, Illinois, USA, http://hdf.ncsa.uiuc.edu/nra/HDF5. 1998.
23. Rew R, Davis G. NetCDF: An interface for scientific data access. Computer Graphics and Applications, IEEE. IEEE; 1990; 10: 76–82. doi: 10.1109/38.56302
24. Wickham H. Tidy Data. Journal of Statistical Software. 2014; 59: 1–23. http://www.jstatsoft.org/v59/i10 doi: 10.18637/jss.v059.i10
25. Buffalo V. Bioinformatics data skills: Reproducible and robust research with open source tools. O'Reilly Media, Inc. 2015.
26. NCBI. NCBI is phasing out sequence GIs - use Accession.Version instead! http://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/
27. Preston-Werner T. Semantic Versioning 2.0.0. http://semver.org; 2014.
28. Johnson GT, Goodsell DS, Autin L, Forli S, Sanner MF, Olson AJ. 3D molecular models of whole HIV-1 virions generated with cellPACK. Faraday Discuss. The Royal Society of Chemistry; 2014; 169: 23–44. doi: 10.1039/C4FD00017J PMID: 25253262
29. Strasser C, Cook R, Michener W, Budden A. Primer on Data Management: What you always wanted to know [Internet]. California Digital Libraries; 2012. doi: 10.5060/D2251G48
30. Goodin D. Poorly anonymized logs reveal NYC cab drivers' detailed whereabouts [Internet]. 2015. http://arstechnica.com/tech-policy/2014/06/poorly-anonymized-logs-reveal-nyc-cab...
31. Homer N, Szelinger S, Redman M, Duggan D, Tembe W, Muehling J, et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS genetics. Public Library of Science; 2008; 4: e1000167. doi: 10.1371/journal.pgen.1000167 PMID: 18769715
32. Kahn SD. On the future of genomic data. Science (New York, NY). 2011; 331: 728–9. doi: 10.1126/science.1197891 PMID: 21311016
33. Wandelt S, Bux M, Leser U. Trends in genome compression. Current Bioinformatics. Bentham Science Publishers; 2014; 9: 315–326. doi: 10.2174/1574893609666140516010143
34. Gaye A, Marcon Y, Isaeva J, LaFlamme P, Turner A, Jones EM, et al. DataSHIELD: taking the analysis to the data, not the data to the analysis. International journal of epidemiology. 2014; 43: 1929–44. doi: 10.1093/ije/dyu188 PMID: 25261970
35. Briney K. Rule of 3. DATA AB INITIO. http://dataabinitio.com/?p=320; 2013.
36. Henkel H, Hutchison V, Strasser C, Rebich Hespanha S, Vanderbilt K, Wayne L, et al. DataONE Education Modules. https://www.dataone.org/education-modules; 2012.
37. Witt M, Carlson J, Brandt S, Cragin M. Constructing Data Curation Profiles. International Journal of Digital Curation. 2009; 4: 93–103. doi: 10.2218/ijdc.v4i3.117

Repositories

• Global Biodiversity Information Facility (GBIF, http://www.gbif.org) provides an international
open data infrastructure to publish and disseminate biodiversity information.

• Integrated Digitized Biocollections (iDigBio, https://www.idigbio.org) is a project funded
by the National Science Foundation that facilitates the digitization of natural history collections
and provides data and images for biological specimens.

• Integrated Taxonomic Information System (ITIS, http://www.itis.gov) is an international
partnership of governmental organizations that aims at providing authoritative taxonomic
information for plants, animals, fungi, and microbes.

PLOS Computational Biology | DOI:10.1371/journal.pcbi.1005097 October 20, 2016
5653735991n | Jan 1, 2024 |

Statistics

Works
1
Members
1
Popularity
#2,962,640
Reviews
1