I have a number of PDF files.
Each file has a few pages (from 1 to a few dozen).
Each page contains one, two or four tables (not three or five or more).
Each page is characterized by:
1) a header ("class" field) at the top center;
2) a table with:
a) 1+6 columns: one label column plus one column for each day of the week, "Monday" to "Saturday" (in the original Italian: "lunedi", "martedi", "mercoledi", "giovedi", "venerdi" and "sabato");
b) from 5 to 9 rows ("hour" field).
I would like to store the contents of each cell in an SQLite database with the following fields:
1) text - class
2) numeric - day (from 1 to 6)
3) numeric - hour (from 1 to 8)
4) text - contents of the cell.
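To make the target layout concrete, here is a minimal sketch of the four fields above as an SQLite table, using Python's built-in sqlite3 module. The table name `timetable`, the column names, the primary key and the CHECK ranges are assumptions made for illustration; only the four fields and their value ranges come from the description above.

```python
import sqlite3


def create_timetable_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists.

    Table and column names are hypothetical; adapt them as needed.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        '''
        CREATE TABLE IF NOT EXISTS timetable (
            "class"  TEXT    NOT NULL,                              -- e.g. "1A CA"
            day      INTEGER NOT NULL CHECK (day BETWEEN 1 AND 6),  -- 1 = Monday ... 6 = Saturday
            hour     INTEGER NOT NULL CHECK (hour BETWEEN 1 AND 8),
            contents TEXT    NOT NULL,
            PRIMARY KEY ("class", day, hour)                        -- assumed: one cell per class/day/hour
        )
        '''
    )
    return conn


# Usage: store one cell in an in-memory database.
conn = create_timetable_db()
conn.execute(
    "INSERT INTO timetable VALUES (?, ?, ?, ?)",
    ("1A CA", 1, 1, "TECNOLOGIA RAPP - GRAFICA, PALUMB"),
)
conn.commit()
```

The CHECK constraints are optional, but they would reject a day or hour outside the stated ranges at insert time.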
In the attached example (a single PDF file with a single page and two tables), the database would contain 65 records (33 with the class field set to "1A CA" and 32 with the class field set to "1A MME"):
rec n. 1 : "1A CA";1;1;"TECNOLOGIA RAPP - GRAFICA, PALUMB"
rec n. 2 : "1A CA";2;1;"TECNOLOGIA RAPP - GRAFICA, PALUMB"
rec n. 3 : "1A CA";3;1;"ITALIANO E STORIA - PISANO"
rec n. 4 : "1A CA";4;1;"GEOGRAFIA- CHISU"
etc
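The mapping from one table to records like those above can be sketched as follows. This assumes the table has already been extracted from the PDF as a list of rows (header row removed), each row holding the label cell followed by the six day cells. The function name `grid_to_records` is hypothetical, and skipping empty cells is an assumption inferred from the 33 and 32 record counts, which are lower than a full grid would give.

```python
def grid_to_records(class_name, grid):
    """Turn one extracted table into (class, day, hour, contents) tuples.

    grid: list of rows without the header row; each row is
          [label, monday, tuesday, wednesday, thursday, friday, saturday].
    Empty or missing cells produce no record (assumption).
    """
    records = []
    for hour, row in enumerate(grid, start=1):          # row position -> hour
        for day, cell in enumerate(row[1:7], start=1):  # column position -> day
            # Collapse line breaks and repeated spaces inside a cell.
            text = " ".join((cell or "").split())
            if text:
                records.append((class_name, day, hour, text))
    return records


# Usage with the first row of the example (last two days left empty here
# purely for illustration).
sample_grid = [
    [
        "1",
        "TECNOLOGIA RAPP - GRAFICA, PALUMB",
        "TECNOLOGIA RAPP - GRAFICA, PALUMB",
        "ITALIANO E STORIA - PISANO",
        "GEOGRAFIA- CHISU",
        None,
        "",
    ],
]
records = grid_to_records("1A CA", sample_grid)
for rec in records:
    print(rec)
```

Records come out hour by hour, with the days in order inside each hour, which matches the numbering of the sample records.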
How would you suggest proceeding, given that traditional OCR would most likely not produce the desired result?
Thanks for your attention.