fix: specify dtype for read_csv method #88
Conversation
@JosseVanDelm @jbelien Can you have a look at this? We use my own fork for now, but it would be nice to use your official repo in the future.
  logger.info('Started reading input file')
  try:
-     file = pd.read_csv(args.input_file)
+     file = pd.read_csv(args.input_file, dtype='unicode')
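A minimal sketch of what the changed line does, using an inline CSV instead of the project's real input file (the sample columns here are purely illustrative). Passing `dtype=str`, which on Python 3 has the same effect as the `'unicode'` alias used in the diff, tells pandas to skip type inference and read every column as text:

```python
import io

import pandas as pd

# Illustrative CSV with a column that mixes numbers and text, the kind of
# input that makes pandas' per-column type inference emit a DtypeWarning.
csv_data = io.StringIO("id,value\n1,10\n2,x\n")

# dtype=str disables inference: every cell comes back as a Python str.
df = pd.read_csv(csv_data, dtype=str)

print(df["value"].tolist())
```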
I'm not a Python expert so I may be wrong, but since the script is supposed to be run with Python 3, shouldn't it be dtype='str'?
@jbelien I just read this Stack Overflow post:
https://stackoverflow.com/questions/24251219/pandas-read-csv-low-memory-and-dtype-options
According to this post, the proposed code silences the error but does not resolve the main problem:
pandas tries to guess the datatype of every CSV column, and to do this it has to load all the data into memory.
If I understand correctly, the proper way is to explicitly state the NumPy datatype for each column, which would also make the code more efficient. I am not sure what implications this has for the rest of the code, so for now we can definitely accept this pull request, but we should investigate this issue further in the future.
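A sketch of the per-column approach described above, with hypothetical column names (the project's real CSV schema differs). Stating a dtype per column lets pandas skip inference entirely; numeric columns stay numeric, and string-like codes keep their leading zeros:

```python
import io

import pandas as pd

# Hypothetical columns for illustration only; the real input schema
# lives in the project's CSV files.
csv_data = io.StringIO("id,name,postal_code\n1,Alice,01000\n2,Bob,02500\n")

# An explicit dtype per column: id is parsed as an integer, while
# postal_code is kept as text so "01000" does not become 1000.
df = pd.read_csv(
    csv_data,
    dtype={"id": "int64", "name": str, "postal_code": str},
)
```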
I'm not a Python developer either, but this code change seemed to fix the issue. It tells pandas to treat each column as unicode, which resolves the issue for strings, numbers, etc. The fix has been working for several months now on my own fork and could (temporarily) be applied here too. But I agree this should be further investigated by someone with more knowledge of Python and the pandas module.
Fixing #87