$ mkdir tmp
$ cd tmp
$ wget https://github.com/usc-isi-i2/kgtk-notebooks/raw/main/datasets/imdb/IMDB.csv.gz
$ zcat IMDB.csv.gz | head -2
imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,None,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey Depew",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0
First issue: normalize-nodes does not recognize the --id-column option, although the documentation lists it:
https://kgtk.readthedocs.io/en/latest/transform/normalize_nodes/
--id-column ID_COLUMN_NAME
The name of the ID column. (default=id or alias)
$ kgtk normalize-nodes --id-column imdb_title_id -i IMDB.csv.gz -o out.tsv
In input header 'imdb_title_id title original_title year date_published genre duration country language director writer production_company actors description avg_vote votes budget usa_gross_income worlwide_gross_income metascore reviews_from_users reviews_from_critics': Missing required column: id | ID
/Users/joelb/views/kgtk/kgtk/exceptions.py:90: UserWarning: Please raise KGTKException instead of <class 'SystemExit'>
warnings.warn('Please raise KGTKException instead of {}'.format(type_))
KGTKException found
Second issue: After manually changing the column name imdb_title_id to id, there is still an error:
$ zcat IMDB.csv.gz | perl -pe 's/^imdb_title_id/id/' >IMDB.id.csv
$ kgtk normalize-nodes -i IMDB.id.csv -o out.tsv
In input header 'id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics':
Warning: Column name 'id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics' contains a comma (,)
In input header 'id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics': Missing required column: id | ID
/Users/joelb/views/kgtk/kgtk/exceptions.py:90: UserWarning: Please raise KGTKException instead of <class 'SystemExit'>
warnings.warn('Please raise KGTKException instead of {}'.format(type_))
KGTKException found
Running with -v shows that the issue is the extra filename component.
$ kgtk normalize-nodes -v -i IMDB.id.csv -o out.tsv
Starting normalize_nodes pid=74318
Opening the input file: IMDB.id.csv
input format: kgtk <--- should be csv
The code is trying to detect compression suffixes like foo.csv.gz. It sees there is no gz, but then it mistakenly defaults to kgtk.
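To make the expected behavior concrete, here is a minimal sketch of suffix-based format detection that would handle IMDB.id.csv. It is not KGTK's actual code: the function name guess_input_format and both suffix tables are assumptions made up for illustration. The idea is to drop a trailing compression suffix if there is one, then decide from the last remaining suffix only, so extra components such as .id are ignored.

```python
from pathlib import Path

# Hypothetical suffix tables for illustration; KGTK's real lists may differ.
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz"}
FORMAT_BY_SUFFIX = {".csv": "csv", ".tsv": "kgtk"}


def guess_input_format(filename: str, default: str = "kgtk") -> str:
    """Guess the input format from the filename (illustrative, not KGTK's API)."""
    suffixes = Path(filename).suffixes  # "IMDB.id.csv" -> [".id", ".csv"]

    # Drop a trailing compression suffix such as ".gz", if present.
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes = suffixes[:-1]

    # Decide from the last remaining suffix only; earlier components
    # like ".id" play no part in the decision.
    if suffixes:
        return FORMAT_BY_SUFFIX.get(suffixes[-1], default)
    return default
```

With this logic, IMDB.csv.gz, IMDB.id.csv, and foo.csv would all be treated as csv, while a name with no recognized suffix still falls back to the default.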
Workarounds: name the file foo.csv, with a single dot, or pass --input-format csv.
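For the renaming workaround, a small helper along these lines could produce a single-dot name for an uncompressed file. The name single_dot_name is hypothetical and this is only a sketch: it replaces every dot before the final suffix with an underscore, and it is not meant for compressed names like foo.csv.gz.

```python
from pathlib import Path


def single_dot_name(filename: str) -> str:
    """Return a filename whose only dot is the one before the final suffix.

    Hypothetical helper for the workaround; intended for uncompressed files.
    """
    path = Path(filename)
    # path.stem is everything before the final suffix: "IMDB.id" for "IMDB.id.csv".
    stem = path.stem.replace(".", "_")
    return str(path.with_name(stem + path.suffix))
```

For the file in this report, single_dot_name("IMDB.id.csv") gives IMDB_id.csv, and a name that already has a single dot comes back unchanged.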