python csv module
Refer to csv for the motivation behind the module, and to snark (and comma) for dummies for enabling the comma python modules on Ubuntu/Debian.
The csv module provides basic functionality for processing csv-style ascii and binary streams in python. Its implementation relies on the numpy package for describing data structures and for reading and writing csv streams.
Two classes are provided:
- struct: creating data structures
- stream: reading and writing csv-style streams
An object of type comma.csv.struct represents the meaning and type of data contained in a csv stream. As such, the user is required to provide field names and their numpy types when creating these objects. For instance, a struct for representing timestamped coordinates of a 3d point can be created like this:
```python
import comma

event_t = comma.csv.struct( 't,x,y,z', 'datetime64[us]', 'float64', 'float64', 'float64' )
```
where the first argument describes the fields of a csv stream ('t,x,y,z') and the following four arguments specify the numpy types of the fields (or strings corresponding to numpy types). For more details on numpy types, consult the following pages:
- http://docs.scipy.org/doc/numpy/user/basics.rec.html
- http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
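Since event_t can be used in place of a numpy dtype, the struct above corresponds to a plain numpy structured dtype. A minimal sketch of the equivalent definition in plain numpy (not using comma) may help when consulting the pages above:

```python
import numpy as np

# plain numpy structured dtype corresponding to the 't,x,y,z' struct above
event_dtype = np.dtype( [ ( 't', 'datetime64[us]' ),
                          ( 'x', 'float64' ),
                          ( 'y', 'float64' ),
                          ( 'z', 'float64' ) ] )

# a zeroed one-element array of that dtype, with one field assigned
event = np.zeros( 1, dtype=event_dtype )
event['x'] = 1.5
```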
Alternatively, one can use the comma.csv.format.to_numpy function to convert a comma format string to numpy types:

```python
import comma

event_t = comma.csv.struct( 't,x,y,z', *comma.csv.format.to_numpy( 't,3d' ) )
```
Note that event_t represents a type; it does not contain any data. It can be used in place of a numpy dtype, for instance, to instantiate numpy arrays:
```python
import comma
import numpy as np

event_t = comma.csv.struct( 't,x,y,z', *comma.csv.format.to_numpy( 't,3d' ) )
events = np.empty( 10, dtype=event_t ) # create an array of 10 objects of type event_t.dtype
```
Types defined with comma.csv.struct can be used along with numpy types to describe the types of fields in other data structures. For instance, the definition of record_t below uses event_t and a new struct, observer_t:
```python
import comma

event_t = comma.csv.struct( 't,x,y,z', *comma.csv.format.to_numpy( 't,3d' ) )
observer_t = comma.csv.struct( 'name,id', *comma.csv.format.to_numpy( 's[10],ui' ) )
record_t = comma.csv.struct( 'observer,event', observer_t, event_t )
```
A csv stream for data of type record_t will have fields observer/name,observer/id,event/t,event/x,event/y,event/z and format s[10],ui,t,3d. Complex hierarchical types can be created this way.
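Such nesting corresponds directly to nested numpy structured dtypes. A hedged sketch of what record_t looks like in plain numpy (not using comma; field types assumed as above):

```python
import numpy as np

# nested structured dtypes mirroring observer_t, event_t and record_t above
observer_dtype = np.dtype( [ ( 'name', 'S10' ), ( 'id', 'uint32' ) ] )
event_dtype = np.dtype( [ ( 't', 'datetime64[us]' ), ( 'x', 'float64' ),
                          ( 'y', 'float64' ), ( 'z', 'float64' ) ] )
record_dtype = np.dtype( [ ( 'observer', observer_dtype ), ( 'event', event_dtype ) ] )

# nested fields are accessed level by level
record = np.zeros( 1, dtype=record_dtype )
record['observer']['name'] = 'alice'
record['event']['x'] = 1.0
```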
The following command defines a csv stream for data of type record_t:
```python
record_stream = comma.csv.stream( record_t )
```
By default, it is an ascii stream with entries delimited by a comma. The delimiter can be changed with the delimiter keyword, and the precision of floating point output is controlled by the precision keyword (12 by default), e.g.

```python
record_stream = comma.csv.stream( record_t, delimiter='|', precision=4 )
```
For a binary stream, set the binary keyword to True, e.g.

```python
record_stream = comma.csv.stream( record_t, binary=True )
```
By default, the source of the stream is stdin and the target is stdout. To read data from a file and/or write to another file, set the source and target keywords to suitable file objects, e.g.

```python
record_stream = comma.csv.stream( record_t, binary=True, source=open( 'input.bin', 'rb' ), target=open( 'output.bin', 'wb' ), flush=True )
```
The flush keyword used above ensures the output stream is flushed after every write (by default, the output is buffered).
An input stream whose csv data matches the fields of a struct type is simple to handle. However, it is common for streams to have more fields than required, or to have the expected fields in a different order. To deal with such streams, specify the fields keyword (and the format keyword if the stream is binary) when creating a stream. For instance,
```python
import comma

event_t = comma.csv.struct( 't,x,y,z', *comma.csv.format.to_numpy( 't,3d' ) )
observer_t = comma.csv.struct( 'name,id', *comma.csv.format.to_numpy( 's[10],ui' ) )
record_t = comma.csv.struct( 'observer,event', observer_t, event_t )
fields = ',event/x,event/y,event/z,event/t,observer/name,,,observer/id'
format = ','.join( comma.csv.format.to_numpy( 't,3d,t,s[10],2i,ui' ) )
record_stream = comma.csv.stream( record_t, fields=fields, format=format )
```
defines a binary stream of format t,3d,t,s[10],2i,ui (specified by the format keyword) where the positions of the expected fields are given by the fields keyword. Omitting the format keyword would create an ascii stream.
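To see what the fields keyword implies, here is a hypothetical pure-Python sketch of the column mapping: blank entries mark input columns to skip, and named entries are matched to struct fields regardless of order (illustration only; the actual implementation is inside comma):

```python
fields = ',event/x,event/y,event/z,event/t,observer/name,,,observer/id'

# map field name -> input column index, skipping blank (unused) entries
columns = { name: i for i, name in enumerate( fields.split( ',' ) ) if name }

# picking values out of one ascii input row using that mapping
row = [ '99', '1.0', '2.0', '3.0', '20150101T000000', 'alice', 'junk', 'junk', '7' ]
x = float( row[ columns['event/x'] ] )
observer_id = int( row[ columns['observer/id'] ] )
```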
The following command will read a single record from record_stream and save it in a 1d numpy array record:
```python
record = record_stream.read( size=1 )
```
In general, the size keyword tells read() how many records to read. The dtype of the elements in record is the same as record_t.dtype, and the number of elements is at most size (it will be less than size if the stream ends before the specified number of records is read). If size is not given, read() will attempt to read as many records as are necessary to fill a 64Kb buffer. When reading from a file, all records can be read at once by using size=-1. For instance,
```python
record_stream = comma.csv.stream( record_t, source=open( 'input.csv', 'r' ) )
records = record_stream.read( size=-1 )
```
will read all records from the file input.csv.
The records contained in records can be manipulated like any numpy array. For instance, one can apply the following commands:
```python
import numpy

record_stream = comma.csv.stream( record_t, source=open( 'input.csv', 'r' ), target=open( 'output.csv', 'w' ) )
records = record_stream.read( size=-1 )
records['event']['t'] += numpy.timedelta64( 1, 's' )
records['event']['x'] -= 0.1
record_stream.write( records )
```
To iterate over an input stream, use iter() as follows:
```python
import numpy

record_stream = comma.csv.stream( record_t )
for records in record_stream.iter():
    records['event']['t'] += numpy.timedelta64( 1, 's' )
    records['event']['x'] -= 0.1
    record_stream.write( records )
```
iter() accepts the size keyword with the same meaning as in read(). By default, it will try to read many records at once. To read records one by one, use iter( size=1 ).
Suppose the input stream contains the following:

```
0
0
0
0
0
0
0
0
```
Then, running the following code
```python
import comma

point_t = comma.csv.struct( 'x', 'float64' )
record_stream = comma.csv.stream( point_t )
for i, points in enumerate( record_stream.iter( size=3 ), start=1 ):
    points['x'] += i
    record_stream.write( points )
```
reads points in batches of three and, therefore, yields
```
1
1
1
2
2
2
3
3
```
Note that the last batch of points contains only two elements.
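The batching done by iter( size=3 ) can be mimicked with plain numpy slicing. The sketch below (not using comma) processes an 8-element array in chunks of three and shows that the last chunk comes up short:

```python
import numpy as np

# eight zeroed points, mirroring the input stream above
points = np.zeros( 8, dtype=[ ( 'x', 'float64' ) ] )

# slice into batches of three; slices are views, so edits reach points itself
size = 3
chunks = [ points[ i : i + size ] for i in range( 0, len( points ), size ) ]
for i, chunk in enumerate( chunks, start=1 ):
    chunk['x'] += i
```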
Suppose the input stream starts like this:

```
20150101T000000.123456,0,1,2,3,4,5
20150101T000001.123456,0,1,2,3,4,5
20150101T000002.123456,0,1,2,3,4,5
...
```
where the six numbers represent the values of a 2x3 matrix. Then the matrix can be read into a record containing a 2d numpy array by using the numpy type '(2,3)float64'. For instance, running the following code
```python
import comma

event_t = comma.csv.struct( 't,signal', 'datetime64[us]', '(2,3)float64' )
stream = comma.csv.stream( event_t )
event = stream.read( size=1 )
event['signal'] += [ [0,-1,-2], [-3,-4,-5] ]
stream.write( event )
```
yields

```
20150101T000000.123456,0.0,0.0,0.0,0.0,0.0,0.0
```
It is not necessary to spell out the full path of the fields if they follow the default order, that is, the order used in the definition. For instance,
```python
import comma

coordinates_t = comma.csv.struct( 'x,y', 'float64', 'float64' )
orientation_t = comma.csv.struct( 'yaw', 'float64' )
position_t = comma.csv.struct( 'coordinates,orientation', coordinates_t, orientation_t )
timestamped_position_t = comma.csv.struct( 't,position', 'datetime64[us]', position_t )
input_stream = comma.csv.stream( timestamped_position_t, fields='position/orientation,position/coordinates,t' )
```
defines a stream with fields position/orientation/yaw,position/coordinates/x,position/coordinates/y,t.
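The shorthand works because each non-leaf field expands to its leaf paths in definition order. A hypothetical sketch of that expansion in plain Python (the real logic lives inside comma.csv.struct):

```python
# nested mapping mirroring timestamped_position_t; leaves are numpy type strings
timestamped_position = {
    't': 'datetime64[us]',
    'position': { 'coordinates': { 'x': 'float64', 'y': 'float64' },
                  'orientation': { 'yaw': 'float64' } },
}

def expand( tree, prefix='' ):
    # depth-first expansion of a field tree into full leaf paths
    paths = []
    for name, value in tree.items():
        path = prefix + '/' + name if prefix else name
        if isinstance( value, dict ):
            paths.extend( expand( value, path ) )
        else:
            paths.append( path )
    return paths
```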
It is sometimes desirable to pass the records read from an input stream to an output stream with some extra fields attached at the end. This is accomplished with the tied keyword, as illustrated in the example below.
Copy and save the following code in a file called attach-min-max:
```python
#!/usr/bin/python
import comma
import numpy as np

point_t = comma.csv.struct( 'x,y,z', 'float64', 'float64', 'float64' )
event_t = comma.csv.struct( 't,coordinates', 'datetime64[us]', point_t )
fields = ',coordinates/y,coordinates/z,,,t,coordinates/x,,'
format = ','.join( comma.csv.format.to_numpy( 'i,d,d,s[3],s[7],t,d,ui,ui' ) )
input_stream = comma.csv.stream( event_t, fields=fields, format=format )
output_t = comma.csv.struct( 'min,max', 'float64', 'float64' )
output_stream = comma.csv.stream( output_t, binary=True, tied=input_stream )
for events in input_stream.iter():
    output = np.empty( events.size, dtype=output_t )
    output['min'] = np.min( events['coordinates'].view( '3float64' ), axis=1 )
    output['max'] = np.max( events['coordinates'].view( '3float64' ), axis=1 )
    output_stream.write( output )
```
and create a file called input.csv with the following content:
```
-1,1,-2,abc,nothing,20130101T010100.123456,3,10,20
-2,1,-3,def,nothing,20140202T020200.123456,4,10,20
-3,1,-4,ghi,nothing,20150303T030300.123456,5,10,20
```
Then, executing
```
chmod u+x attach-min-max
cat input.csv | csv-to-bin i,2d,s[3],s[7],t,d,2ui | ./attach-min-max | csv-from-bin i,2d,s[3],s[7],t,d,2ui,2d
```
yields
```
-1,1,-2,abc,nothing,20130101T010100.123456,3,10,20,-2,3
-2,1,-3,def,nothing,20140202T020200.123456,4,10,20,-3,4
-3,1,-4,ghi,nothing,20150303T030300.123456,5,10,20,-4,5
```
If some of the expected fields are not present in the input stream, the missing fields are populated with zero values (a blank string for string types and the zero epoch for the time type). For instance, on an input stream
```
1
2
3
```
converted to binary with csv-to-bin i, the following code
```python
import comma

t = comma.csv.struct( 's,x,y,t', 'S2', 'i4', 'i4', 'datetime64[us]' )
s = comma.csv.stream( t, fields='x', format='i4' )
for r in s.iter( size=1 ):
    s.write( r )
```
yields a binary stream which, upon converting back to ascii with csv-from-bin s[2],2i,t, becomes
```
,1,0,19700101T000000
,2,0,19700101T000000
,3,0,19700101T000000
```
By default, the time imported from the input stream is converted to the local time zone. For instance, feeding the input stream
```
20140101T000000
20150101T000000
```
to
```python
import comma

t = comma.csv.struct( 't', 'datetime64[us]' )
s = comma.csv.stream( t )
for r in s.iter( size=1 ):
    print r['t'][0]
```
yields
```
2014-01-01T00:00:00.000000+1100
2015-01-01T00:00:00.000000+1100
```
where the time offset of the local time zone is +11 hours. A convenience function to change the time zone used by python is provided in the comma.csv.time module and needs to be invoked before reading from a stream. For instance, feeding the same input stream to
```python
import comma

comma.csv.time.zone( 'UTC' ) # set time zone to UTC
t = comma.csv.struct( 't', 'datetime64[us]' )
s = comma.csv.stream( t )
for r in s.iter( size=1 ):
    print r['t'][0]
```
yields

```
2014-01-01T00:00:00.000000+0000
2015-01-01T00:00:00.000000+0000
```
Note that the write() function of the stream class ignores the time zone and, therefore,
```python
for r in s.iter( size=1 ):
    s.write( r )
```
will yield the same output regardless of the time zone.
Python utilities using comma.csv.stream are generally about 15 times slower than the equivalent c++ utilities (both binary and ascii), provided a large enough size is used when reading (the default size is usually sufficient). Expect a further moderate degradation in performance (a factor of a few) if size=1 is used.