File Organization & File Formats
You may have planned your data management strategy down to the last detail and cleared all of the ethical issues and intellectual property rights, but if you don't organise your data properly on a day-to-day basis, there is always a risk that you won't be able to find things when you need them. Take a look at the software and web services available to you. It's worth investing a little time ensuring that you are using the most appropriate tools for structuring and working with the information you are gathering. Colleagues can be a useful source of recommendations - ask them what they use.
- It's generally useful to aim for file names which are concise, but informative - it makes life easier if you can tell what's in a file without having to open it.
- Similarly, being consistent in your file naming practices will make it easier to locate the file you want. Within a research group, you may want to agree on file naming conventions early on in the project.
- Operating systems usually default to sorting files alphabetically, so it can be helpful to think about what comes at the start of the file name - is it more useful to order the files by date, by author, or by subject, for example?
- If you have multiple versions or drafts of a file, it can also be useful to include a version number in the file name - this makes it straightforward to see which copy is the most recent one. Oxford University's Research Skills Toolkit offers a fact-sheet on file naming.
- Both the software you intend to use for analysis and the archive in which you intend to deposit your data may require a particular file format.
- Please check with your colleagues and look at the published datasets in your field. Is there a file format that they use most often, and is there a good reason for this choice?
- Please also consider that file formats and standards are in constant flux. File formats of today may not be readable by the software of tomorrow. To minimize this risk, please choose, if possible, a format that is open, or one that is well known and widely used. The owner of such a format, and other software makers, will most likely have developed many conversions from this format to others.
- Some examples: SPSS portable (.por) is open, although SPSS (.sav) is proprietary. JPEG is lossy (i.e. image data is lost), while uncompressed TIFF version 6 is considered archival quality. However, in this last example, if images were originally created as JPEG, nothing is gained by converting them to TIFF, so they should be stored as JPEG. Microsoft Office formats prior to Office 2007 were closed. Office 2007 introduced the Office Open XML file formats (.docx, .xlsx, etc.). These formats are considered open, and are freely convertible to other formats.
- Open formats include:
  - PNG – a raster image format standardized by ISO/IEC
  - FLAC – a lossless audio codec
  - WebM – a video/audio container format
  - HTML – HyperText Markup Language, the main markup language for creating web pages and other information that can be displayed in a web browser
  - gzip – a compression format
  - CSS – a style sheet format usually used with (X)HTML, standardized by W3C
- Closed proprietary formats include:
  - CDR – CorelDraw's native format, primarily used for vector graphic drawings
  - DWG – AutoCAD's drawing format
  - PSD – Adobe Photoshop's native image format
  - RAR – an archive and compression file format owned by Alexander L. Roshal
  - WMA – Windows Media Audio, a closed format owned by Microsoft
- After considering open formats, you may still have good reasons for choosing a closed or proprietary file format for your research. The HKU Scholars Hub will store any format and commit to preservation at bit level. This means that, some years from now, the stored data may still be intact while the software and platform needed to read it are no longer available. When you deposit data, you may therefore wish to deposit it in the original closed format and also convert it to an open format, which is more likely to remain readable in future, and deposit that copy as well. The UK Data Archive shows formats they consider acceptable for sharing, reuse and preservation (middle column of table).
- For unknown file formats, please check the FileInfo registry to find software that can open the format, and to see reviews of that software.
- Most operating systems default to a hierarchical file structure - files inside folders, which may be nested inside other folders. This is great if your material can easily be grouped into relatively discrete categories.
- In planning a hierarchical folder structure, aim for a balance between breadth and depth - so no one category gets too big, but also so that you don't have to click through endless folders to find a file.
- In some cases, it may be more helpful to use a tag-based system - where each file is assigned one or more tags, or labels. This makes it easier to have overlapping categories, and files can be categorised in multiple ways simultaneously (by subject, by author, and by the project it relates to, for example).
- The more recent versions of the Windows and Mac operating systems both allow you to add tags to files; file tagging software is also available.
- It's worth taking time every now and then to reassess your folder or tag structure, perhaps moving old, unused items to a folder called 'Archive' or something similar so they don't clutter up the screen.
- Oxford University's Research Skills Toolkit offers articles on organising research material, including fact-sheets on choosing an organisational system, and on using shortcuts and hyperlinks as ways of connecting related material within a hierarchical system.
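
The file naming points earlier in this list (concise but informative names, a consistent convention, a leading element that sorts usefully, and a version number) can be illustrated with a short Python sketch. The convention shown here, YYYY-MM-DD_subject_vNN.ext, is only one hypothetical choice, and the helper names `slugify` and `build_file_name` are invented for this example; your research group may well agree on something different.

```python
import re
from datetime import date


def slugify(text: str) -> str:
    """Lower-case the text and replace runs of other characters with single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_file_name(created: date, subject: str, version: int, extension: str) -> str:
    """Return a name of the form YYYY-MM-DD_subject_vNN.ext (a hypothetical convention)."""
    # An ISO-style date at the start makes alphabetical order match date order.
    # A zero-padded version number keeps v02 ahead of v10 in the same listing.
    return f"{created.isoformat()}_{slugify(subject)}_v{version:02d}.{extension.lstrip('.')}"


names = [
    build_file_name(date(2024, 3, 15), "Interview notes", 10, "docx"),
    build_file_name(date(2024, 3, 15), "Interview notes", 2, "docx"),
    build_file_name(date(2023, 11, 2), "Survey results", 1, "csv"),
]

# A plain alphabetical sort, like a file browser's default view, now lists
# the oldest file first and keeps versions of the same file in sequence.
for name in sorted(names):
    print(name)
```

If it is more useful to browse by author or by subject than by date, the same idea applies: put that element first in the name and keep the pattern identical across the whole project.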
Acknowledgement: Adapted from Oxford University's Research Data Oxford pages.