Android Question PDF to SQLITE

AlpVir

Well-Known Member
Licensed User
Longtime User
I have a number of PDF files.
Each file has a few pages (from 1 to a few dozen).
Each page contains one, two or four tables (not three or five or more).
Each page is characterized by:
1) a header ("class" field) at the top center;
2) a table with:
a) 1+6 columns ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday" and "Saturday"); are the days of the week ("lunedi", "martedi", "mercoledi","giovedi", "venerdi" e "sabato");
b) from 5 to 9 lines ("hour" field)
I would like to store the contents of each cell in a sqlite database having in following fields :
1) text - class
2) numeric - day (from 1 to 6)
3) numeric - hour (from 1 to 8)
4) text - contents of the cell.
In the attached example (assuming you have a single PDF file with single page and two table) the database would contain 65 records (33 with the class field set to "1A CA" and 32 with the class field set to "1A MME")
rec n. 1 : "1A CA";1;1;"TECNOLOGIA RAPP - GRAFICA, PALUMB"
rec n. 2 : "1A CA";2;1;"TECNOLOGIA RAPP - GRAFICA, PALUMB"
rec n. 3 : "1A CA";3;1;"ITALIANO E STORIA - PISANO"
rec n. 4 : "1A CA";4;1;"GEOGRAFIA- CHISU"
etc

How would you suggest we act given that a traditional OCR would most likely not obtain the desired result ?
Thanks for the attention.
 

Attachments

  • Orario immagine.jpg
    Orario immagine.jpg
    134 KB · Views: 140

Peter Simpson

Expert
Licensed User
Longtime User
I have a number of PDF files.
Each file has a few pages (from 1 to a few dozen).
Each page contains one, two or four tables (not three or five or more).
Each page is characterized by:
1) a header ("class" field) at the top center;
2) a table with:
a) 1+6 columns ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday" and "Saturday"); are the days of the week ("lunedi", "martedi", "mercoledi","giovedi", "venerdi" e "sabato");
b) from 5 to 9 lines ("hour" field)
I would like to store the contents of each cell in a sqlite database having in following fields :
1) text - class
2) numeric - day (from 1 to 6)
3) numeric - hour (from 1 to 8)
4) text - contents of the cell.
In the attached example (assuming you have a single PDF file with single page and two table) the database would contain 65 records (33 with the class field set to "1A CA" and 32 with the class field set to "1A MME")
rec n. 1 : "1A CA";1;1;"TECNOLOGIA RAPP - GRAFICA, PALUMB"
rec n. 2 : "1A CA";2;1;"TECNOLOGIA RAPP - GRAFICA, PALUMB"
rec n. 3 : "1A CA";3;1;"ITALIANO E STORIA - PISANO"
rec n. 4 : "1A CA";4;1;"GEOGRAFIA- CHISU"
etc

How would you suggest we act given that a traditional OCR would most likely not obtain the desired result ?
Thanks for the attention.

This could be quite difficult to achieve, especially as I cn't remember sosing any code on the forum for doing such a thing, but I could be incorrect.

Anyway a few years ago I used an online service for a client to achieve what you are asking. I used a free online converter (the information wasn't secretive or important) to convert multiple tables in pdf files to a csv file which I then imported into a B4X project. Just search Google/Bing/Whatever for 'free online pdf to csv converter' to do what you want, then just import the csv file. I believe that I might have used convertio online website when I did a project for a client, convertio does have a PDF to CSV converter.
 
Upvote 0

AlpVir

Well-Known Member
Licensed User
Longtime User
Thanks for your answer. I tried one of the live pdf to csv converters and apparently it does the job.
However there are 2 drawbacks:
1) at first glance I wouldn't know how to call this online converter from one of my apps written in B4A; furthermore, the use of the API of this converter requires a fee;
2) the CSV file thus obtained is difficult (but probably not impossible) to interpret; although it is not guaranteed that all but all the features of the PDF file are correctly exported.
Maybe I should orient my search towards a reader (parse) of pdf files.
Any other suggestions?
 
Upvote 0

William Lancee

Well-Known Member
Licensed User
Longtime User
If you post an example PDF file with multiple tables and multiple pages, then perhaps someone would look at the possibilities.

I searched StackOverflow:

This is a very complex problem. There are multiple master dissertations about this even.
An easy analogy: I have 5000 puzzle-pieces, all of them are perfectly square and could fit anywhere. Some of them have pieces of lines on them, some of them have snippets of text.
However, that does not mean it can't be done. It'll just take work.
General approach:
  1. use iText (specifically IEventListener) to get information on all rendering events for every page
  2. select those rendering events that make sense for your application. PathRenderInfo and TextRenderInfo.
  3. Events in a pdf do not need to appear in order according to the spec. Solve this problem by implementing a comparator over IEventData. This comparator should sort according to reading order. This implies you might have to implement some basic language detection, since not every language reads left-to-right.
  4. Once sorted, you can now start clustering items together according to any of the various heuristics you find in literature. For instance, two characters can be grouped into a snippet of text if they follow each other in the sorted list of events (meaning they appear next to each other in reading order), if the y-position does not differ too much (subscript and superscript might screw with this), and if the x-position does not differ too much (kerning).
  5. Continue clustering characters until you have formed words
  6. Assuming you have formed words, use similar algorithm to form words into lines. Use PathRenderInfo to withhold merging words if they intersect with a line.
  7. Assuming you have managed to create lines, now look for tables. One possible approach is apply a horizontal and vertical projection. And look for those sub-areas in the page that (when projected) show a grid-like structure.
This high-level approach should make it painfully obvious why this is not a widely available thing. It's very hard to implement. It requires domain-knowledge of both PDF, fonts, and machine-learning.
If you are ok with commercial solutions, try out pdf2Data. It's an iText add-on that features this exact functionality.
http://itextpdf.com/itext7/pdf2Data
 
Upvote 0

zed

Active Member
Licensed User
It's not new, but it might be interesting.
 
Upvote 0

RichardN

Well-Known Member
Licensed User
Longtime User
In the past I have ported lots of data from PDFs to SQLite but it is a bit of a soul destroying task. There are several ways to do it that involve going via the PDF to Text route. Text can be manipulated with Notepad++ or saved to .csv with Excel. Alternatively use B4J and the PDF library to extract text then parse to strings and write to a database. I can't imagine trying to do this within an Android application.

Unfortunately the format of PDFs can be rather unpredictable depending on the program (and author) that created it. Trying to parse from an unpredictable input format is VERY frustrating.
 
Upvote 0

William Lancee

Well-Known Member
Licensed User
Longtime User

Attachments

  • pdfTables.zip
    9.6 KB · Views: 149
Upvote 0

AlpVir

Well-Known Member
Licensed User
Longtime User
The method suggested by William Lancee is not valid in the case of empty cells.
In fact the PDF text
uno.jpg

is transformed into
due.jpg

and the position (third and fourth column) of the text "246 John Doyle" is lost.
It seems to be understood that resolving the problem is very difficult.
 
Upvote 0

William Lancee

Well-Known Member
Licensed User
Longtime User
@AlpVir is absolutely right. The image in post #1 shows that empty cells are a possibility.
Another serious problem with the table in #1 is inconsistent multiple lines per cell and cells that span multiple rows.

I would conclude that unless there were excel tables embedded in the PDF (possible but not likely) there is no way to do this
with these PDF's.
 
Upvote 0

AlpVir

Well-Known Member
Licensed User
Longtime User
""Another serious problem with the table in #1 is inconsistent multiple lines per cell and cells that span multiple rows.""

I had also seen this further problem but had not mentioned it so as not to add complexity to an already too complicated problem.
So, barring surprises, is it impossible (or at least extraordinarily difficult) to be able to transfer the contents of cells into a database ?
 
Upvote 0

RichardN

Well-Known Member
Licensed User
Longtime User
Most PDFs like the one you have proposed are created manually by a clerk at a keyboard, just like the college timetable example. Aside from the basic cell format they were never created with any fixed structure in mind. If that PDF was machine created by another piece of software you might have the element of consistency on your side when parsing it.... but it appears you do not have that luxury here. To complicate matters it was probably created in MS Word with lots of stray whitespace characters floating around unseen. Which PDF document writer was the PDF created with? My Windows PC has three different choices.... at least one of which seems to employ random character-set encoding!

If capturing this data to SQLite is SUPER IMPORTANT then I suggest you write a text-field front-end B4J program that allows the author of the timetable to enter the data in their familiar format, but captures it to a database at the same time.
 
Upvote 0

AlpVir

Well-Known Member
Licensed User
Longtime User
The PDF files that I have to examine are created completely automatically by some different software that independently builds a school timetable.
This should limit interpretation difficulties quite a bit.
For the moment I am examining exclusively PDF files produced by a specific software.
I would add that with this specific software it is possible to export the time into an XML file.
My app ("Orario scolastico") has been able to interpret the XML file for a few years now, transforming it into an SQLite database that can be queried in various ways.
Now I would like to achieve the same thing, but starting directly from a PDF file which is created more easily by the various school structures rather than an XML file.
 
Upvote 0
Top