Control Character Separated Values files

Current Situation

Exchanging and converting data between spreadsheet programs and databases is a common task. Often, a CSV (comma-separated values) or TSV (tab-separated values) file carries the data between programs. Unfortunately, the lack of an early standard for these files led to many different implementations. Problems typically arise when a character that serves as a delimiter or qualifier in the destination format appears in the data itself. Developers must then escape or replace those characters in the body text so the file can still be parsed.

Introducing the .CCSV

To alleviate the problems associated with CSV/TSV files and embedded delimiters, we’re proposing the use of control characters as delimiters.

If you look at an ASCII character table, you’ll find a collection of non-printable control characters. The two we’re concerned with are the unit separator (US) and the record separator (RS). Control characters generally cannot be displayed, so to discuss them we refer to them either by their ASCII designation or by a visible surrogate character (the Unicode control pictures ␟ and ␞). In all cases we mean the ASCII control characters themselves, not the surrogates.

Description       ASCII decimal   ASCII hex   Surrogate Character   Surrogate Entity
Record Separator  30              0x1E        ␞
Unit Separator    31              0x1F        ␟

When creating a .ccsv file, you use a unit separator wherever you would use a comma in a .csv file, and a record separator wherever you would use a line break (a carriage return and/or line feed).
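
The rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `dumps_ccsv` and `loads_ccsv` are made up for this example, and the sketch assumes that record separators go between records with none after the last record, which the description above does not specify.

```python
# Delimiters from the ASCII table: decimal 31 (0x1F) and decimal 30 (0x1E).
UNIT_SEPARATOR = "\x1f"    # between fields, where a .csv uses a comma
RECORD_SEPARATOR = "\x1e"  # between records, where a .csv uses a line break


def dumps_ccsv(rows):
    """Join rows (lists of strings) into one .ccsv string.

    Assumption: no trailing record separator after the last record.
    """
    return RECORD_SEPARATOR.join(
        UNIT_SEPARATOR.join(fields) for fields in rows
    )


def loads_ccsv(text):
    """Split a .ccsv string back into a list of rows (lists of strings)."""
    if text == "":
        return []
    return [
        record.split(UNIT_SEPARATOR)
        for record in text.split(RECORD_SEPARATOR)
    ]


rows = [["id", "name"], ["1", "Ada"], ["2", "Grace"]]
encoded = dumps_ccsv(rows)
decoded = loads_ccsv(encoded)
```

Because both delimiters are single characters, plain `str.join` and `str.split` are enough here. No quoting or escaping layer is involved.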

Benefits

Using control characters as delimiters means that commas and quotes in your content do not disrupt the structure of your file, and neither do tabs or line breaks. There is no need to surround text fields with quotes.
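
As a rough illustration of this benefit, the Python sketch below round-trips fields that contain a comma, double quotes, and a line break, with no quoting or escaping. The sample data is invented for the example, and the sketch assumes the data itself never contains the characters 0x1E or 0x1F.

```python
US = "\x1f"  # unit separator, between fields
RS = "\x1e"  # record separator, between records

# Hypothetical sample data with a comma, quotes, and an embedded line break.
rows = [
    ["name", "remark"],
    ["Smith, John", 'He said "hello"'],
    ["multi-line", "first line\nsecond line"],
]

# Encode and decode with the control characters as delimiters.
encoded = RS.join(US.join(fields) for fields in rows)
decoded = [record.split(US) for record in encoded.split(RS)]

# For comparison: joining and splitting the second row on bare commas,
# with no quoting, yields three fields instead of two.
naive_fields = ",".join(rows[1]).split(",")
```

Here `decoded` is equal to `rows`, while `naive_fields` has three entries because the comma inside "Smith, John" is mistaken for a delimiter.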

Drawbacks

The delimiters used in .ccsv files are generally not visible in an ordinary text editor. Editing the files by hand can be difficult, if not impossible, using an editor that is unaware of the file format.
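
One possible workaround, sketched here in Python, is to swap the delimiters for their visible surrogate characters before displaying a file. The helper name `to_visible` is hypothetical, and the sketch is intended for viewing only: converting back would be ambiguous if the data itself contained the surrogate characters.

```python
# Map each control character (by code point) to its visible surrogate.
VISIBLE = {
    0x1E: "\u241e",  # record separator -> ␞
    0x1F: "\u241f",  # unit separator   -> ␟
}


def to_visible(text):
    """Return a copy of text with RS and US shown as ␞ and ␟."""
    return text.translate(VISIBLE)


sample = "id\x1fname\x1e1\x1fAda"
shown = to_visible(sample)
```

With this sample, `shown` is the string `id␟name␞1␟Ada`, which an ordinary text editor or terminal with a suitable font can display.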
