B4A Library Pdf To Text

Hi all.
Pdf To Text
This library converts PDF files to text (.txt) files.
I was looking for a library that could convert a PDF file to text. Following a tip from @Johan Schoeman (thank you!), I am releasing this wrapper for itextpdf-5-5-6.jar ( https://sourceforge.net/projects/itext/ ).


pdftotext
Author:
DevilApp
Version: 1
  • PdftToText
    Events:
    • onMessage (Success As String)
    Methods:
    • Initialize (EventName As String)
    • ParsePdf (filepdf As String, filetxt As String)


You need the file itextpdf-5-5-6.jar and the pdftotext wrapper (attached), so you have 3 files in total:
pdftotext.xml
pdftotext.jar
itextpdf-5-5-6.jar
Copy all of them to your libraries folder.

Here is an example (the same code is included in the attachment):
B4X:
Sub Activity_Create(FirstTime As Boolean)
    'Do not forget to load the layout file created with the visual designer. For example:
    'Activity.LoadLayout("Layout1")
    File.Copy(File.DirAssets, "test-armen.pdf", File.DirRootExternal, "test-armen.pdf")
    Dim filepdf As String = File.DirRootExternal & "/test-armen.pdf"
    Dim filetxt As String = File.DirRootExternal & "/test-armen.txt"
 
    Dim pdf As PdftToText
 
    pdf.Initialize("pdf")
    pdf.ParsePdf(filepdf, filetxt)

End Sub

Sub pdf_onMessage(Success As String)
    Log("Status conversion: " & Success)
End Sub
 

Attachments

  • PdfToText-Example.zip
    212.6 KB · Views: 679
  • PdfToText-Library.zip
    2.1 KB · Views: 671
  • pdftotext-source.zip
    4.1 KB · Views: 554

johndb

Active Member
Licensed User
Longtime User
This is fantastic @MarcoRome. Thank you very much for your work! I hope I don't sound ungrateful but the iText library has many useful features related to PDF:
  • PDF generation
  • PDF manipulation (stamping, watermarks, merging/splitting PDFs, ...)
  • PDF form filling
  • XML functionality
  • Digital signatures
Yes, I know that other developers have already published PDF libraries that partially include similar features, but the iText libraries appear to include many more functions. Would there be a possibility of including these in the B4X library in addition to the PDF-to-text conversion? I know that this is a lot of work and I should start looking into "how" to create libraries myself. :confused:

Thanks again for your much appreciated work :)
 

MarcoRome

Expert
Licensed User
Longtime User
This is fantastic @MarcoRome. I know that this is a lot of work and I should start looking into "how" to create libraries myself. :confused:

That is the right way :).
Anyway, if it is urgent, there are excellent wrapper masters in this community (for example @Johan Schoeman and @DonManfred) whom you can contact. For a reasonable donation (depending on the wrapper you ask for) you can expect excellent results. If they have time, they will certainly help.
 

johndb

Active Member
Licensed User
Longtime User
You are absolutely right .... downloading Eclipse .... wish me luck!
 

Star-Dust

Expert
Licensed User
Longtime User

MarcoRome

Expert
Licensed User
Longtime User

Robert Valentino

Well-Known Member
Licensed User
Longtime User
Found one problem - not with your interface but with the conversion of text.

If you try this file: https://www.b4x.com/android/forum/attachments/test-armen-pdf.33756/
you will see that it repeats lines that are BOLD multiple times. In the above example it says "Organization League" 9 times and "Team Standings" 9 times on multiple lines.

This is something to watch out for when processing the text.
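
One possible way to guard against the repeated bold lines is to post-process the extracted text. The following is only a sketch, assuming the duplicates arrive as identical consecutive lines; the class and method names are hypothetical and are not part of the pdftotext wrapper or of iText. Note that it would also collapse lines that are legitimately repeated back to back in the PDF.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the wrapper: cleans up extracted text.
class DuplicateLineFilter {

    /**
     * Drops a non-blank line when it equals the previously kept non-blank line
     * (compared after trimming). Blank lines are always kept and reset the
     * comparison, so duplicates separated by a blank line are NOT collapsed.
     */
    static String collapseConsecutiveDuplicates(String text) {
        List<String> kept = new ArrayList<>();
        String previous = null;
        for (String line : text.split("\\r?\\n", -1)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && trimmed.equals(previous)) {
                continue; // same text as the line just kept: skip it
            }
            kept.add(line);
            previous = trimmed.isEmpty() ? null : trimmed;
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        // Sample input imitating the repeated headings described above.
        String extracted = "Organization League\nOrganization League\n"
                + "Organization League\nTeam Standings\nTeam Standings\nRow 1";
        System.out.println(collapseConsecutiveDuplicates(extracted));
    }
}
```

The same idea could be applied to the contents of the generated .txt file after the conversion has finished.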


ALSO: Notice at the GitHub site that you may need to buy a license.

iText is licensed as AGPL software.


AGPL is a free / open source software license.


This doesn't mean the software is gratis!


Buying a license is mandatory as soon as you develop commercial activities distributing the iText software inside your product or deploying it on a network without disclosing the source code of your own applications under the AGPL license. These activities include:


  • offering paid services to customers as an ASP
  • serving PDFs on the fly in the cloud or in a web application
  • shipping iText with a closed source product

Contact sales for more info: http://itextpdf.com/sales



Does anyone know what the license might cost?
 

Johan Schoeman

Expert
Licensed User
Longtime User

Johan Schoeman

Expert
Licensed User
Longtime User
Hi Roberto

Browse the web to see if you can find solutions for the problems that you mentioned. It is very simple to make use of this JAR via inline Java code...

@MarcoRome 's project is based on this:
B4X:
#Region  Project Attributes
    #ApplicationLabel: b4aReadPDF
    #VersionCode: 1
    #VersionName:
    'SupportedOrientations possible values: unspecified, landscape or portrait.
    #SupportedOrientations: unspecified
    #CanInstallToExternalStorage: False
#End Region

#AdditionalJar: itextpdf-5.5.6

#Region  Activity Attributes
    #FullScreen: False
    #IncludeTitle: True
#End Region

Sub Process_Globals
    'These global variables will be declared once when the application starts.
    'These variables can be accessed from all modules.
    Dim nativeMe As JavaObject

End Sub

Sub Globals
    'These global variables will be redeclared each time the activity is created.
    'These variables can only be accessed from this module.

End Sub

Sub Activity_Create(FirstTime As Boolean)
    'Do not forget to load the layout file created with the visual designer. For example:
    'Activity.LoadLayout("Layout1")
    Log(File.DirAssets)
    Log(File.DirRootExternal)

    nativeMe.InitializeContext
    nativeMe.RunMethod("parsePdf", Null)



End Sub

Sub Activity_Resume

End Sub

Sub Activity_Pause (UserClosed As Boolean)

End Sub

#If Java

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;


/** The original PDF that will be parsed. */
public static final String PREFACE = "/storage/emulated/0/preface.pdf";
/** The resulting text file. */
public static final String RESULT = "/storage/emulated/0/preface.txt";


    public void parsePdf() throws IOException {
            String pdf = PREFACE;
            String txt = RESULT;
            PdfReader reader = new PdfReader(pdf);
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            PrintWriter out = new PrintWriter(new FileOutputStream(txt));
            TextExtractionStrategy strategy;
            BA.Log("number of pages = " + reader.getNumberOfPages());
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
                out.println(strategy.getResultantText());
            }
            reader.close();
            out.flush();
            out.close();
    }




#End If

The remaining changes needed to turn it into a library you will have to make on your own, as @MarcoRome did.
 

Robert Valentino

Well-Known Member
Licensed User
Longtime User
Thanks for the attachment.

For the last few hours I have been trying a lot of online converters, and I have downloaded and installed some as well.

Most do something wrong when converting the PDF.
Some output the bold line multiple times. Some output it once but string multiple lines together on one line, which makes parsing harder.

I'll keep looking and will work on making the one I am using (cPDF2Text) more data friendly.

I am not making enough money from my app to allow me to pay for licensing other products. Maybe someday LOL
 

MarcoRome

Expert
Licensed User
Longtime User

In post #1 you also have the source, so you can modify it as you want. If you do modify it, don't forget to share the new wrapper with the whole community.
 

DonManfred

Expert
Licensed User
Longtime User
Nice one ;-)
 

Robert Valentino

Well-Known Member
Licensed User
Longtime User
In post #1 you also have the source, so you can modify it as you want. If you do modify it, don't forget to share the new wrapper with the whole community.

Always, but I am not doing much coding right now: it is summer and there are too many golf rounds to play. I will start coding again in the fall.
 

MarcoRome

Expert
Licensed User
Longtime User
I have tried to convert a pdf file to a text file using the CLI.
pdftotext -layout (file name.pdf) (new name.txt)
All I get is a message saying the characters are unrecognized.
Any ideas?

Which file?
If you attach it, it may be easier to understand the problem.
 