fix: specify dtype for read_csv method #88
Conversation
@JosseVanDelm @jbelien Can you have a look at this? We use my own fork for now, but it would be nice to use your official repo in the future.
  logger.info('Started reading input file')
  try:
-     file = pd.read_csv(args.input_file)
+     file = pd.read_csv(args.input_file, dtype='unicode')
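A minimal sketch of what the changed line does, using an inline CSV instead of the project's real input file (the sample columns here are purely illustrative). Passing `dtype=str`, which on Python 3 has the same effect as the `'unicode'` alias used in the diff, tells pandas to skip type inference and read every column as text:

```python
import io

import pandas as pd

# Illustrative CSV with a column that mixes numbers and text, the kind of
# input that makes pandas' per-column type inference emit a DtypeWarning.
csv_data = io.StringIO("id,value\n1,10\n2,x\n")

# dtype=str disables inference: every cell comes back as a Python str.
df = pd.read_csv(csv_data, dtype=str)

print(df["value"].tolist())
```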
I'm not a Python expert so I may be wrong, but since the script is supposed to be run with Python 3, shouldn't it be dtype='str'?
@jbelien I just read this Stack Overflow post:
https://stackoverflow.com/questions/24251219/pandas-read-csv-low-memory-and-dtype-options
According to this post, the proposed code silences the error but does not resolve the main problem:
pandas tries to guess the datatype of every CSV column, and to do this it has to load all the data into memory.
If I understand correctly, the proper way is to explicitly state the NumPy datatype for each column, which would also make the code more efficient. I am not sure what implications this has for the rest of the code, so for now we can definitely accept this pull request, but we should investigate this issue further in the future.
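A sketch of the per-column approach described above, with hypothetical column names (the project's real CSV schema differs). Stating a dtype per column lets pandas skip inference entirely; numeric columns stay numeric, and string-like codes keep their leading zeros:

```python
import io

import pandas as pd

# Hypothetical columns for illustration only; the real input schema
# lives in the project's CSV files.
csv_data = io.StringIO("id,name,postal_code\n1,Alice,01000\n2,Bob,02500\n")

# An explicit dtype per column: id is parsed as an integer, while
# postal_code is kept as text so "01000" does not become 1000.
df = pd.read_csv(
    csv_data,
    dtype={"id": "int64", "name": str, "postal_code": str},
)
```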
I'm not a Python developer either, but this code change seemed to fix the issue. It tells pandas to treat each column as unicode, which resolves the issue for strings, numbers, etc. The fix has been working for several months now on my own fork and could (temporarily) be applied here too. But I agree this should be further investigated by someone with more knowledge of Python and the pandas module.
Fixing #87