python csv module
Refer to csv for the motivation behind the module, and to snark (and comma) for dummies for enabling the comma python modules on Ubuntu/Debian.
The csv module provides basic functionality for processing csv-style ascii and binary streams in python. Its implementation relies on the numpy package for describing data structures and for reading and writing csv streams.
Two classes are provided:
- struct: creating data structures
- stream: reading and writing csv-style streams
An object of type comma.csv.struct represents the meaning and type of data contained in a csv stream. As such, the user is required to provide field names and their numpy types when creating these objects. For instance, a struct for representing timestamped coordinates of a 3d point can be created like this:
```python
import comma

event_t = comma.csv.struct( 't,x,y,z', 'datetime64[us]', 'float64', 'float64', 'float64' )
```
where the first argument describes the fields of a csv stream ('t,x,y,z') and the following four arguments specify the numpy types of the fields (or strings corresponding to numpy types). For more details on numpy types, consult the following pages:
- http://docs.scipy.org/doc/numpy/user/basics.rec.html
- http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
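Since event_t can be used in place of a numpy dtype, the struct above corresponds to a plain numpy structured dtype. A minimal sketch of the equivalent definition in plain numpy (not using comma) may help when consulting the pages above:

```python
import numpy as np

# plain numpy structured dtype corresponding to the 't,x,y,z' struct above
event_dtype = np.dtype( [ ( 't', 'datetime64[us]' ),
                          ( 'x', 'float64' ),
                          ( 'y', 'float64' ),
                          ( 'z', 'float64' ) ] )

# a zeroed one-element array of that dtype, with one field assigned
event = np.zeros( 1, dtype=event_dtype )
event['x'] = 1.5
```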
Alternatively, one can use the comma.csv.format.to_numpy function to convert a comma format string to numpy types:

```python
import comma

event_t = comma.csv.struct( 't,x,y,z', *comma.csv.format.to_numpy( 't,3d' ) )
```
Note that event_t represents a type; it does not contain any data. It can be used in place of a numpy dtype, for instance, to instantiate numpy arrays:
```python
import comma
import numpy as np

event_t = comma.csv.struct( 't,x,y,z', *comma.csv.format.to_numpy( 't,3d' ) )
events = np.empty( 10, dtype=event_t ) # create an array of 10 objects of type event_t.dtype
```
Types defined with comma.csv.struct can be used along with numpy types to describe the types of fields in other data structures. For instance, the definition of record_t below uses event_t and a new struct, observer_t:
```python
import comma

event_t = comma.csv.struct( 't,x,y,z', *comma.csv.format.to_numpy( 't,3d' ) )
observer_t = comma.csv.struct( 'name,id', *comma.csv.format.to_numpy( 's[10],ui' ) )
record_t = comma.csv.struct( 'observer,event', observer_t, event_t )
```
A csv stream for data of type record_t will have fields observer/name,observer/id,event/t,event/x,event/y,event/z and format s[10],ui,t,3d. Complex hierarchical types can be created this way.
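Such nesting corresponds directly to nested numpy structured dtypes. A hedged sketch of what record_t looks like in plain numpy (not using comma; field types assumed as above):

```python
import numpy as np

# nested structured dtypes mirroring observer_t, event_t and record_t above
observer_dtype = np.dtype( [ ( 'name', 'S10' ), ( 'id', 'uint32' ) ] )
event_dtype = np.dtype( [ ( 't', 'datetime64[us]' ), ( 'x', 'float64' ),
                          ( 'y', 'float64' ), ( 'z', 'float64' ) ] )
record_dtype = np.dtype( [ ( 'observer', observer_dtype ), ( 'event', event_dtype ) ] )

# nested fields are accessed level by level
record = np.zeros( 1, dtype=record_dtype )
record['observer']['name'] = 'alice'
record['event']['x'] = 1.0
```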
The following command defines a csv stream for data of type record_t:
```python
record_stream = comma.csv.stream( record_t )
```
By default, it is an ascii stream with entries delimited by a comma. The delimiter can be changed with the delimiter keyword, and the precision of floating point output is controlled by the precision keyword (12 by default), e.g.

```python
record_stream = comma.csv.stream( record_t, delimiter='|', precision=4 )
```
For a binary stream, set the binary keyword to True, e.g.

```python
record_stream = comma.csv.stream( record_t, binary=True )
```
By default, the source of the stream is stdin and the target is stdout. To read data from a file and/or write to another file, set the source and target keywords to suitable file objects, e.g.

```python
record_stream = comma.csv.stream( record_t, binary=True, source=open( 'input.bin', 'rb' ), target=open( 'output.bin', 'wb' ), flush=True )
```
The flush keyword used above ensures the output stream is flushed after every write (by default, the output is buffered).
An input stream whose csv data matches the fields of a struct type is simple to handle. However, it is common for streams to have more fields than required, or to have the expected fields in a different order. To deal with such streams, specify the fields keyword (and the format keyword if the stream is binary) when creating a stream. For instance,
```python
import comma

event_t = comma.csv.struct( 't,x,y,z', *comma.csv.format.to_numpy( 't,3d' ) )
observer_t = comma.csv.struct( 'name,id', *comma.csv.format.to_numpy( 's[10],ui' ) )
record_t = comma.csv.struct( 'observer,event', observer_t, event_t )
fields = ',event/x,event/y,event/z,event/t,observer/name,,,observer/id'
format = ','.join( comma.csv.format.to_numpy( 't,3d,t,s[10],2i,ui' ) )
record_stream = comma.csv.stream( record_t, fields=fields, format=format )
```
defines a binary stream of format t,3d,t,s[10],2i,ui (specified by the format keyword) where the positions of the expected fields are given by the fields keyword. Omitting the format keyword would create an ascii stream.
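To see what the fields keyword implies, here is a hypothetical pure-Python sketch of the column mapping: blank entries mark input columns to skip, and named entries are matched to struct fields regardless of order (illustration only; the actual implementation is inside comma):

```python
fields = ',event/x,event/y,event/z,event/t,observer/name,,,observer/id'

# map field name -> input column index, skipping blank (unused) entries
columns = { name: i for i, name in enumerate( fields.split( ',' ) ) if name }

# picking values out of one ascii input row using that mapping
row = [ '99', '1.0', '2.0', '3.0', '20150101T000000', 'alice', 'junk', 'junk', '7' ]
x = float( row[ columns['event/x'] ] )
observer_id = int( row[ columns['observer/id'] ] )
```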
The following command will read a single record from record_stream and save it in a 1d numpy array record:
```python
record = record_stream.read( size=1 )
```
In general, the size keyword tells read() how many records to read. The dtype of the elements in record is the same as record_t.dtype, and the number of elements is at most size (it will be less than size if the stream ends before the specified number of records is read). If size is not given, read() will attempt to read as many records as are necessary to fill a 64Kb buffer. When reading from a file, all records can be read at once by using size=-1. For instance,
```python
record_stream = comma.csv.stream( record_t, source=open( 'input.csv', 'r' ) )
records = record_stream.read( size=-1 )
```
will read all records from the file input.csv.
The records contained in records can be manipulated like any numpy array. For instance, one can apply the following commands:
```python
import numpy

record_stream = comma.csv.stream( record_t, source=open( 'input.csv', 'r' ), target=open( 'output.csv', 'w' ) )
records = record_stream.read( size=-1 )
records['event']['t'] += numpy.timedelta64( 1, 's' )
records['event']['x'] -= 0.1
record_stream.write( records )
```
To iterate over an input stream, use iter() as follows:
```python
import numpy

record_stream = comma.csv.stream( record_t )
for records in record_stream.iter():
    records['event']['t'] += numpy.timedelta64( 1, 's' )
    records['event']['x'] -= 0.1
    record_stream.write( records )
```
iter() accepts the size keyword with the same meaning as in read(). By default, it will try to read many records at once. To read records one by one, use iter( size=1 ).
Suppose the input stream contains the following:

```
0
0
0
0
0
0
0
0
```
Then, running the following code
```python
import comma

point_t = comma.csv.struct( 'x', 'float64' )
record_stream = comma.csv.stream( point_t )
for i, points in enumerate( record_stream.iter( size=3 ), start=1 ):
    points['x'] += i
    record_stream.write( points )
```
reads points in batches of three and, therefore, yields
```
1
1
1
2
2
2
3
3
```
Note that the last batch of points contains only two elements.
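The batching done by iter( size=3 ) can be mimicked with plain numpy slicing. The sketch below (not using comma) processes an 8-element array in chunks of three and shows that the last chunk comes up short:

```python
import numpy as np

# eight zeroed points, mirroring the input stream above
points = np.zeros( 8, dtype=[ ( 'x', 'float64' ) ] )

# slice into batches of three; slices are views, so edits reach points itself
size = 3
chunks = [ points[ i : i + size ] for i in range( 0, len( points ), size ) ]
for i, chunk in enumerate( chunks, start=1 ):
    chunk['x'] += i
```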
Suppose the input stream starts like this:

```
20150101T000000.123456,0,1,2,3,4,5
20150101T000001.123456,0,1,2,3,4,5
20150101T000002.123456,0,1,2,3,4,5
...
```
where the six numbers represent the values of a 2x3 matrix. Then the matrix can be read into a record containing a 2d numpy array by using the numpy type '(2,3)float64'. For instance, running the following code
```python
import comma

event_t = comma.csv.struct( 't,signal', 'datetime64[us]', '(2,3)float64' )
stream = comma.csv.stream( event_t )
event = stream.read( size=1 )
event['signal'] += [ [0,-1,-2], [-3,-4,-5] ]
stream.write( event )
```
yields

```
20150101T000000.123456,0.0,0.0,0.0,0.0,0.0,0.0
```
It is not necessary to spell out the full path of the fields if they follow the default order, that is, the order used in the definition. For instance,
```python
import comma

coordinates_t = comma.csv.struct( 'x,y', 'float64', 'float64' )
orientation_t = comma.csv.struct( 'yaw', 'float64' )
position_t = comma.csv.struct( 'coordinates,orientation', coordinates_t, orientation_t )
timestamped_position_t = comma.csv.struct( 't,position', 'datetime64[us]', position_t )
input_stream = comma.csv.stream( timestamped_position_t, fields='position/orientation,position/coordinates,t' )
```
defines a stream with fields position/orientation/yaw,position/coordinates/x,position/coordinates/y,t.
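The shorthand works because each non-leaf field expands to its leaf paths in definition order. A hypothetical sketch of that expansion in plain Python (the real logic lives inside comma.csv.struct):

```python
# nested mapping mirroring timestamped_position_t; leaves are numpy type strings
timestamped_position = {
    't': 'datetime64[us]',
    'position': { 'coordinates': { 'x': 'float64', 'y': 'float64' },
                  'orientation': { 'yaw': 'float64' } },
}

def expand( tree, prefix='' ):
    # depth-first expansion of a field tree into full leaf paths
    paths = []
    for name, value in tree.items():
        path = prefix + '/' + name if prefix else name
        if isinstance( value, dict ):
            paths.extend( expand( value, path ) )
        else:
            paths.append( path )
    return paths
```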
It is sometimes desirable to pass the records read from an input stream to an output stream with some extra fields attached at the end. This is accomplished with the tied keyword, as illustrated in the example below.
Copy and save the following code in a file called attach-min-max:
```python
#!/usr/bin/python
import comma
import numpy as np

point_t = comma.csv.struct( 'x,y,z', 'float64', 'float64', 'float64' )
event_t = comma.csv.struct( 't,coordinates', 'datetime64[us]', point_t )
fields = ',coordinates/y,coordinates/z,,,t,coordinates/x,,'
format = ','.join( comma.csv.format.to_numpy( 'i,d,d,s[3],s[7],t,d,ui,ui' ) )
input_stream = comma.csv.stream( event_t, fields=fields, format=format )
output_t = comma.csv.struct( 'min,max', 'float64', 'float64' )
output_stream = comma.csv.stream( output_t, binary=True, tied=input_stream )
for events in input_stream.iter():
    output = np.empty( events.size, dtype=output_t )
    output['min'] = np.min( events['coordinates'].view( '3float64' ), axis=1 )
    output['max'] = np.max( events['coordinates'].view( '3float64' ), axis=1 )
    output_stream.write( output )
```
and create a file called input.csv with the following content:
```
-1,1,-2,abc,nothing,20130101T010100.123456,3,10,20
-2,1,-3,def,nothing,20140202T020200.123456,4,10,20
-3,1,-4,ghi,nothing,20150303T030300.123456,5,10,20
```
Then, executing
```
chmod u+x attach-min-max
cat input.csv | csv-to-bin i,2d,s[3],s[7],t,d,2ui | ./attach-min-max | csv-from-bin i,2d,s[3],s[7],t,d,2ui,2d
```
yields
```
-1,1,-2,abc,nothing,20130101T010100.123456,3,10,20,-2,3
-2,1,-3,def,nothing,20140202T020200.123456,4,10,20,-3,4
-3,1,-4,ghi,nothing,20150303T030300.123456,5,10,20,-4,5
```
If some of the expected fields are not present in the input stream, the missing fields are populated with zero values (a blank string for string types and the zero epoch for the time type). For instance, on an input stream
```
1
2
3
```
converted to binary with csv-to-bin i, the following code
```python
import comma

t = comma.csv.struct( 's,x,y,t', 'S2', 'i4', 'i4', 'datetime64[us]' )
s = comma.csv.stream( t, fields='x', format='i4' )
for r in s.iter( size=1 ):
    s.write( r )
```
yields a binary stream which, upon converting back to ascii with csv-from-bin s[2],2i,t, becomes
```
,1,0,19700101T000000
,2,0,19700101T000000
,3,0,19700101T000000
```
By default, the time imported from the input stream is converted to the local time zone. For instance, feeding the input stream
```
20140101T000000
20150101T000000
```
to
```python
import comma

t = comma.csv.struct( 't', 'datetime64[us]' )
s = comma.csv.stream( t )
for r in s.iter( size=1 ):
    print r['t'][0]
```
yields
```
2014-01-01T00:00:00.000000+1100
2015-01-01T00:00:00.000000+1100
```
where the time offset of the local time zone is +11 hours. A convenience function to change the time zone used by python is provided in the comma.csv.time module and needs to be invoked before reading from a stream. For instance, feeding the same input stream to
```python
import comma

comma.csv.time.zone( 'UTC' ) # set time zone to UTC
t = comma.csv.struct( 't', 'datetime64[us]' )
s = comma.csv.stream( t )
for r in s.iter( size=1 ):
    print r['t'][0]
```
yields

```
2014-01-01T00:00:00.000000+0000
2015-01-01T00:00:00.000000+0000
```
Note that the write() function of the stream class ignores the time zone and, therefore,
```python
for r in s.iter( size=1 ):
    s.write( r )
```
will yield the same output regardless of the time zone.
Python utilities using comma.csv.stream are generally about 15 times slower than the equivalent c++ utilities (both binary and ascii), provided a large enough size is used when reading (the default size is usually sufficient). Expect a further moderate degradation in performance (a factor of a few) if size=1 is used.