Android Question PDF to SQLITE

AlpVir

Well-Known Member
Licensed User
Longtime User
I have a number of PDF files.
Each file has a few pages (from 1 to a few dozen).
Each page contains one, two or four tables (not three or five or more).
Each page is characterized by:
1) a header ("class" field) at the top center;
2) a table with:
a) 1+6 columns ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday" and "Saturday"); are the days of the week ("lunedi", "martedi", "mercoledi","giovedi", "venerdi" e "sabato");
b) from 5 to 9 lines ("hour" field)
I would like to store the contents of each cell in a sqlite database having in following fields :
1) text - class
2) numeric - day (from 1 to 6)
3) numeric - hour (from 1 to 8)
4) text - contents of the cell.
In the attached example (assuming you have a single PDF file with single page and two table) the database would contain 65 records (33 with the class field set to "1A CA" and 32 with the class field set to "1A MME")
rec n. 1 : "1A CA";1;1;"TECNOLOGIA RAPP - GRAFICA, PALUMB"
rec n. 2 : "1A CA";2;1;"TECNOLOGIA RAPP - GRAFICA, PALUMB"
rec n. 3 : "1A CA";3;1;"ITALIANO E STORIA - PISANO"
rec n. 4 : "1A CA";4;1;"GEOGRAFIA- CHISU"
etc

How would you suggest we act given that a traditional OCR would most likely not obtain the desired result ?
Thanks for the attention.
 

Attachments

  • Orario immagine.jpg
    134 KB · Views: 150

Peter Simpson

Expert
Licensed User
Longtime User

This could be quite difficult to achieve, especially as I cn't remember sosing any code on the forum for doing such a thing, but I could be incorrect.

Anyway a few years ago I used an online service for a client to achieve what you are asking. I used a free online converter (the information wasn't secretive or important) to convert multiple tables in pdf files to a csv file which I then imported into a B4X project. Just search Google/Bing/Whatever for 'free online pdf to csv converter' to do what you want, then just import the csv file. I believe that I might have used convertio online website when I did a project for a client, convertio does have a PDF to CSV converter.
 
Upvote 0

AlpVir

Well-Known Member
Licensed User
Longtime User
Thanks for your answer. I tried one of the live pdf to csv converters and apparently it does the job.
However there are 2 drawbacks:
1) at first glance I wouldn't know how to call this online converter from one of my apps written in B4A; furthermore, the use of the API of this converter requires a fee;
2) the CSV file thus obtained is difficult (but probably not impossible) to interpret; although it is not guaranteed that all but all the features of the PDF file are correctly exported.
Maybe I should orient my search towards a reader (parse) of pdf files.
Any other suggestions?
 
Upvote 0

William Lancee

Well-Known Member
Licensed User
Longtime User
If you post an example PDF file with multiple tables and multiple pages, then perhaps someone would look at the possibilities.

I searched StackOverflow:

 
Upvote 0

zed

Active Member
Licensed User
It's not new, but it might be interesting.
 
Upvote 0

RichardN

Well-Known Member
Licensed User
Longtime User
In the past I have ported lots of data from PDFs to SQLite but it is a bit of a soul destroying task. There are several ways to do it that involve going via the PDF to Text route. Text can be manipulated with Notepad++ or saved to .csv with Excel. Alternatively use B4J and the PDF library to extract text then parse to strings and write to a database. I can't imagine trying to do this within an Android application.

Unfortunately the format of PDFs can be rather unpredictable depending on the program (and author) that created it. Trying to parse from an unpredictable input format is VERY frustrating.
 
Upvote 0

William Lancee

Well-Known Member
Licensed User
Longtime User

Attachments

  • pdfTables.zip
    9.6 KB · Views: 164
Upvote 0

AlpVir

Well-Known Member
Licensed User
Longtime User
The method suggested by William Lancee is not valid in the case of empty cells.
In fact the PDF text

is transformed into

and the position (third and fourth column) of the text "246 John Doyle" is lost.
It seems to be understood that resolving the problem is very difficult.
 
Upvote 0

William Lancee

Well-Known Member
Licensed User
Longtime User
@AlpVir is absolutely right. The image in post #1 shows that empty cells are a possibility.
Another serious problem with the table in #1 is inconsistent multiple lines per cell and cells that span multiple rows.

I would conclude that unless there were excel tables embedded in the PDF (possible but not likely) there is no way to do this
with these PDF's.
 
Upvote 0

AlpVir

Well-Known Member
Licensed User
Longtime User
""Another serious problem with the table in #1 is inconsistent multiple lines per cell and cells that span multiple rows.""

I had also seen this further problem but had not mentioned it so as not to add complexity to an already too complicated problem.
So, barring surprises, is it impossible (or at least extraordinarily difficult) to be able to transfer the contents of cells into a database ?
 
Upvote 0

RichardN

Well-Known Member
Licensed User
Longtime User
Most PDFs like the one you have proposed are created manually by a clerk at a keyboard, just like the college timetable example. Aside from the basic cell format they were never created with any fixed structure in mind. If that PDF was machine created by another piece of software you might have the element of consistency on your side when parsing it.... but it appears you do not have that luxury here. To complicate matters it was probably created in MS Word with lots of stray whitespace characters floating around unseen. Which PDF document writer was the PDF created with? My Windows PC has three different choices.... at least one of which seems to employ random character-set encoding!

If capturing this data to SQLite is SUPER IMPORTANT then I suggest you write a text-field front-end B4J program that allows the author of the timetable to enter the data in their familiar format, but captures it to a database at the same time.
 
Upvote 0

AlpVir

Well-Known Member
Licensed User
Longtime User
The PDF files that I have to examine are created completely automatically by some different software that independently builds a school timetable.
This should limit interpretation difficulties quite a bit.
For the moment I am examining exclusively PDF files produced by a specific software.
I would add that with this specific software it is possible to export the time into an XML file.
My app ("Orario scolastico") has been able to interpret the XML file for a few years now, transforming it into an SQLite database that can be queried in various ways.
Now I would like to achieve the same thing, but starting directly from a PDF file which is created more easily by the various school structures rather than an XML file.
 
Upvote 0
Cookies are required to use this site. You must accept them to continue using the site. Learn more…