/**
* Utility classes to read and write tab separated value (tsv) files.
* File format description
*
*
* A tab-separated values file may contain any number of comment lines (starting with {@value org.broadinstitute.hellbender.utils.tsv.TableUtils#COMMENT_PREFIX}),
* a line containing the column names (the header line), and any number of data lines, one per record.
*
* While comment lines can contain any sequence of characters, the header and data lines are divided into
* columns using exactly one {@value org.broadinstitute.hellbender.utils.tsv.TableUtils#COLUMN_SEPARATOR_STRING} character.
* Blank lines are treated as having a single column with the empty string as the only value (or column name).
*
* The header line is the first non-comment line; any other non-comment line after it is
* considered a data line. Comment lines can appear anywhere in the file, and their
* presence is ignored by the reader ({@link org.broadinstitute.hellbender.utils.tsv.TableReader TableReader} implementations).
*
*
* The header line values (the column names) must all be different (otherwise a formatting exception will be thrown), and
* every data line must have as many values as there are columns in the header line.
*
* Values can be quoted using {@value org.broadinstitute.hellbender.utils.tsv.TableUtils#QUOTE_STRING}. This becomes necessary when the value contains
* any special formatting character such as a new-line, the quote character itself, the column separator character or
* the escape character {@value org.broadinstitute.hellbender.utils.tsv.TableUtils#ESCAPE_STRING}.
* Within quotes, special characters must be escaped using the {@value org.broadinstitute.hellbender.utils.tsv.TableUtils#ESCAPE_STRING} character.
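*
* The quoting rule above can be sketched as follows. This is only an illustration, not the code used by this
* package: the class name TsvQuotingSketch is hypothetical, the quote, escape and separator characters are
* assumed to be the double quote, the backslash and the tab, and writing a new-line as backslash-n and a tab as
* backslash-t inside quotes is an assumption made here as well.
*
```java
// Hypothetical helper, not part of this package: illustrates the quoting rule only.
final class TsvQuotingSketch {

    // Assumed values of the TableUtils constants referenced in the text.
    static final char SEPARATOR = '\t';
    static final char QUOTE = '"';
    static final char ESCAPE = '\\';

    // A value needs quotes when it contains any special formatting character.
    static boolean needsQuoting(final String value) {
        for (int i = 0; i < value.length(); i++) {
            final char c = value.charAt(i);
            if (c == SEPARATOR || c == QUOTE || c == ESCAPE || c == '\n') {
                return true;
            }
        }
        return false;
    }

    // Returns the value unchanged when it is plain, otherwise quoted and escaped.
    static String quoteIfNeeded(final String value) {
        if (!needsQuoting(value)) {
            return value;
        }
        final StringBuilder sb = new StringBuilder(value.length() + 8);
        sb.append(QUOTE);
        for (int i = 0; i < value.length(); i++) {
            final char c = value.charAt(i);
            if (c == '\n') {
                sb.append(ESCAPE).append('n');   // assumed encoding of a new-line
            } else if (c == SEPARATOR) {
                sb.append(ESCAPE).append('t');   // assumed encoding of a tab
            } else if (c == QUOTE || c == ESCAPE) {
                sb.append(ESCAPE).append(c);     // escape the quote or the escape itself
            } else {
                sb.append(c);
            }
        }
        return sb.append(QUOTE).toString();
    }
}
```
*
* Under these assumptions a plain value such as tgt_0 is written unchanged, while a value containing a tab or a
* quote is wrapped in quotes with each special character escaped.
*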
* Example 1:
*
* # comment 1
* # comment 2
* CONTIG START END NAME SAMPLE1 SAMPLE2
* # comment 3
* chr1 123100 123134 tgt_0 100.0 102.0
* chr1 134012 134201 tgt_1 50 12
* # comment 4
* chr2 ...
*
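* The layout rules of this format (comment lines ignored anywhere, first non-comment line as header, unique column
* names, equal number of values per data line) can be sketched without any dependency on this package. The class
* TsvLayoutSketch and its parse method are hypothetical names used for illustration; the comment prefix and the
* column separator are assumed to be the hash sign and the tab, and quoting is deliberately left out.
*
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical helper, not part of this package: illustrates the line layout only.
final class TsvLayoutSketch {

    // Assumed values of TableUtils.COMMENT_PREFIX and TableUtils.COLUMN_SEPARATOR_STRING.
    static final String COMMENT_PREFIX = "#";
    static final String COLUMN_SEPARATOR = "\t";

    // Returns the non-comment lines split into values; element 0 is the header.
    static List<String[]> parse(final List<String> lines) {
        final List<String[]> rows = new ArrayList<>();
        for (final String line : lines) {
            if (line.startsWith(COMMENT_PREFIX)) {
                continue; // comment lines may appear anywhere and are ignored
            }
            // limit -1 keeps trailing empty values; a blank line yields one empty value
            final String[] values = line.split(COLUMN_SEPARATOR, -1);
            if (rows.isEmpty()) {
                // first non-comment line: the header, whose names must all differ
                if (new HashSet<>(Arrays.asList(values)).size() != values.length) {
                    throw new IllegalArgumentException("repeated column name in header");
                }
            } else if (values.length != rows.get(0).length) {
                throw new IllegalArgumentException(
                        "expected " + rows.get(0).length + " values but found " + values.length);
            }
            rows.add(values);
        }
        return rows;
    }
}
```
*
* The real readers report such problems through a formatting exception; IllegalArgumentException merely stands in
* for it in this sketch.
*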
* Reading tsv files
* You will need to extend class
* {@link org.broadinstitute.hellbender.utils.tsv.TableReader TableReader}, either as
* a top-level or inner class, and override the {@link org.broadinstitute.hellbender.utils.tsv.TableReader#createRecord(DataLine) createRecord}
* method to map input data lines, each wrapped in a {@link org.broadinstitute.hellbender.utils.tsv.DataLine DataLine}, to
* your record class of choice.
*
* Example: a SimpleInterval reader for a tsv file with three columns, CONTIG, START and END:
*
*
*
* ...
*
* public void doWork(final File inputFile) throws IOException {
*
*     final TableReader<SimpleInterval> reader = new TableReader<SimpleInterval>(inputFile) {
*
*         // Optional (but recommended) check that the columns in the file are the ones expected:
*         @Override
*         protected void processColumns(final TableColumnCollection columns) {
*             if (!columns.containsExactly("CONTIG","START","END"))
*                 throw formatException("Bad column names");
*         }
*
*         @Override
*         protected SimpleInterval createRecord(final DataLine dataLine) {
*             return new SimpleInterval(dataLine.get("CONTIG"),
*                                       dataLine.getInt("START"),
*                                       dataLine.getInt("END"));
*         }
*     };
*
*     for (final SimpleInterval interval : reader) {
*         // process each interval here.
*     }
*     reader.close();
*     ...
*
* }
*
* Writing tsv files
* You will need to extend class
* {@link org.broadinstitute.hellbender.utils.tsv.TableWriter TableWriter}, either as
* a top-level or inner class, and override the {@link org.broadinstitute.hellbender.utils.tsv.TableWriter#composeLine composeLine}
* method to map your record object to an output line, represented by a {@link org.broadinstitute.hellbender.utils.tsv.DataLine DataLine}.
*
* Instances of {@link org.broadinstitute.hellbender.utils.tsv.DataLine DataLine}
* can be obtained by calling the writer's protected parameter-less method {@link org.broadinstitute.hellbender.utils.tsv.TableWriter#composeLine composeLine}.
*
*
* The column names are passed in order to the constructor.
*
*
* Example:
*
*
* public void doWork(final File outputFile) throws IOException {
*
*     final TableWriter<SimpleInterval> writer =
*         new TableWriter<SimpleInterval>(outputFile, new TableColumnCollection("CONTIG","START","END")) {
*             @Override
*             protected void composeLine(final SimpleInterval interval, final DataLine dataLine) {
*                 // we can use append with confidence because we know the column order.
*                 dataLine
*                     .append(interval.getContig())
*                     .append(interval.getStart(),interval.getEnd());
*             }
*         };
*
*     for (final SimpleInterval interval : intervalsToWrite) {
*         writer.writeRecord(interval);
*     }
*     writer.close();
*     ...
*
* }
*
* Readers and Writers using function composition
* {@link org.broadinstitute.hellbender.utils.tsv.TableUtils TableUtils} contains methods to create
* readers and writers without explicitly extending {@link org.broadinstitute.hellbender.utils.tsv.TableReader TableReader}
* or {@link org.broadinstitute.hellbender.utils.tsv.TableWriter TableWriter}, specifying their behaviour through
* lambda functions instead.
* Example of a reader:
*
* final TableReader<SimpleInterval> reader = TableUtils.reader(inputFile,
*     (columns, formatExceptionFactory) -> {
*         // we check that the columns are what we expect them to be:
*         if (!columns.matchesExactly("CONTIG","START","END"))
*             throw formatExceptionFactory.apply("Bad header");
*         // we return the lambda to translate dataLines into intervals.
*         return (dataLine) -> new SimpleInterval(dataLine.get(0),dataLine.getInt(1),dataLine.getInt(2));
*     });
*
*
* The lambda you need to provide may look complicated, but it is not: it receives the
* columns in the input and must return another lambda that translates data lines into records given
* those columns.
*
*
* Before doing that, it checks whether the columns are the expected ones and in the correct order (always recommended).
*
* The additional formatExceptionFactory parameter allows the reader implementation to correctly report formatting issues.
*
* Example of a writer:
*
* final TableWriter<SimpleInterval> writer = TableUtils.writer(outputFile,
*     new TableColumnCollection("CONTIG","START","END"),
*     (interval, dataLine) -> {
*         dataLine.append(interval.getContig())
*                 .append(interval.getStart(),interval.getEnd());
*     });
*
*
* The writer case is much simpler: there is no need to report formatting errors, as we are
* the ones producing the file.
*
*/
package org.broadinstitute.hellbender.utils.tsv;