Make Chunker.BUFSIZE a configuration option #107
Description
As noted in #40, long lines in a CSV file cause an ArrayIndexOutOfBoundsException in Chunker.nextWord(). I believe that making BUFSIZE configurable would let users avoid this error when they know their CSV file contains long lines. As a workaround, I currently limit every line of my CSV file to 32768 characters, but I would like to drop this extra pre-processing step. Here is the stack trace produced when a long line is encountered:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 32768
    at org.neo4j.batchimport.utils.Chunker.nextWord(Chunker.java:54)
    at org.neo4j.batchimport.importer.ChunkerLineData.nextWord(ChunkerLineData.java:37)
    at org.neo4j.batchimport.importer.ChunkerLineData.readLine(ChunkerLineData.java:47)
    at org.neo4j.batchimport.importer.AbstractLineData.parse(AbstractLineData.java:139)
    at org.neo4j.batchimport.importer.AbstractLineData.processLine(AbstractLineData.java:72)
    at org.neo4j.batchimport.Importer.importNodes(Importer.java:96)
    at org.neo4j.batchimport.Importer.doImport(Importer.java:228)
    at org.neo4j.batchimport.Importer.main(Importer.java:83)
And here is a test case that will reproduce the exception:
@Test
public void testLongLine() throws Exception {
    // NOTE: Ideally this should read BUFSIZE from the Chunker and create
    // a string of that length.
    String longString =
        new String(new char[32 * 1024]).replace("\0", "b");
    Chunker chunker = newChunker(String.format("a\t%s\n", longString));
    assertEquals("a", chunker.nextWord());
    // This will trigger the out-of-bounds exception, since this word
    // pushes the reader position beyond the buffer size defined
    // in Chunker.
    assertEquals(longString, chunker.nextWord());
}

I took a quick look around the code but didn't see a straightforward way to access the current Config from the Chunker. I will put together a PR for this if we can first discuss a good way to make the Config available in nextWord(), since any approach to exposing the config to the Chunker will probably affect much more than the Chunker itself.
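One possible direction, as a rough sketch only: pass the buffer size in through the constructor, so that whichever code creates the Chunker reads the value from Config once and nextWord() never needs to see the Config at all. The class below is a hypothetical stand-in written for this discussion, not the project's Chunker; the name SizedWordReader, the DEFAULT_BUFSIZE constant, the two constructors, and the choice to throw IllegalStateException on overflow are all assumptions.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/**
 * Hypothetical stand-in for Chunker (NOT the project's actual class).
 * The buffer size is a constructor argument instead of a fixed constant.
 */
class SizedWordReader {
    // Assumed default, chosen to match the 32768 limit seen in the stack trace.
    static final int DEFAULT_BUFSIZE = 32 * 1024;

    private final Reader reader;
    private final char[] buffer;

    SizedWordReader(Reader reader) {
        this(reader, DEFAULT_BUFSIZE);
    }

    // The caller would look the size up in its configuration and pass it in here.
    SizedWordReader(Reader reader, int bufSize) {
        if (bufSize <= 0) {
            throw new IllegalArgumentException("bufSize must be positive: " + bufSize);
        }
        this.reader = reader;
        this.buffer = new char[bufSize];
    }

    /** Returns the next tab- or newline-terminated field, or null at end of input. */
    String nextWord() throws IOException {
        int pos = 0;
        int ch;
        while ((ch = reader.read()) != -1 && ch != '\t' && ch != '\n') {
            if (pos == buffer.length) {
                // Fail with an actionable message instead of an out-of-bounds error.
                throw new IllegalStateException(
                    "Field exceeds buffer size " + buffer.length
                    + "; increase the configured buffer size");
            }
            buffer[pos++] = (char) ch;
        }
        if (ch == -1 && pos == 0) {
            return null; // nothing left to read
        }
        return new String(buffer, 0, pos);
    }

    public static void main(String[] args) throws IOException {
        String longString = new String(new char[64 * 1024]).replace('\0', 'b');
        // In the importer the size would come from Config; here it is a literal.
        SizedWordReader words =
            new SizedWordReader(new StringReader("a\t" + longString + "\n"), 128 * 1024);
        System.out.println(words.nextWord());          // a
        System.out.println(words.nextWord().length()); // 65536
    }
}
```

With this shape, only the code that constructs the reader has to know about the configuration, which may keep the change smaller than threading Config down into nextWord(). Whether the real Chunker's construction sites have a Config within reach is something I have not verified.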