B4J Question Reading PDF fields with B4J

Radisk3

Member
Licensed User
Longtime User
Dear all

From the little Python I know, I can write code using the "fitz", "os" and "re" libraries to open a PDF, find data inside it, split the PDF, and save the resulting files.

I want something simple but effective that lets me find fields and extract their data.

Here is an example of a Python function that searches the text of a PDF page:

I have commented each line to make it easier to understand.

Extrair_Nome:
import re

def extrair_nome(texto):
    """Look up the person's name in the page text."""
    linhas = texto.split("\n")  # Split the text into lines
    for i, linha in enumerate(linhas):
        if "Nome Completo" in linha:  # Found the "Nome Completo" (full name) field
            if i + 2 < len(linhas):  # Make sure the two lines below exist
                cpf = linhas[i + 1].strip()  # CPF is on the first line below
                nome = linhas[i + 2].strip()  # Name is on the second line below
                nome = re.sub(r'[<>:"/\\|?*]', '_', cpf + '_' + nome)  # Replace characters that are invalid in file names
                return nome
    return None  # Return None if the name is not found
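
A function like this is usually one piece of the split-and-save workflow described above. The following is only a minimal sketch of that workflow, assuming PyMuPDF (imported as `fitz`) is installed. `build_output_path` and `split_pdf_by_name` are hypothetical helper names, and `name_for_text` stands for any function, such as `extrair_nome`, that turns page text into a file name.

```python
import os


def build_output_path(out_dir, name, page_index):
    """Return the target path for one page.

    Falls back to a page-number based name when no name was found.
    """
    base = name if name else f"page_{page_index + 1}"
    return os.path.join(out_dir, base + ".pdf")


def split_pdf_by_name(pdf_path, out_dir, name_for_text):
    """Save every page of pdf_path as its own PDF file.

    name_for_text is a callable that receives the page text and returns
    a file name (without extension) or None.
    """
    import fitz  # PyMuPDF; imported here so build_output_path works without it

    os.makedirs(out_dir, exist_ok=True)
    saved = []
    doc = fitz.open(pdf_path)
    try:
        for page_index in range(doc.page_count):
            text = doc[page_index].get_text()
            target = build_output_path(out_dir, name_for_text(text), page_index)
            single = fitz.open()  # new, empty PDF
            single.insert_pdf(doc, from_page=page_index, to_page=page_index)
            single.save(target)
            single.close()
            saved.append(target)
    finally:
        doc.close()
    return saved
```

Called as `split_pdf_by_name("input.pdf", "out", extrair_nome)`, it would write one PDF per page into `out`. Both file names are placeholders.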

Can I do something similar in B4J?
If so, where do I start?
 

Erel

B4X founder
Staff member
Licensed User
Longtime User
Extract text with pdfplumber: https://github.com/jsvine/pdfplumber

1. Create a new PyBridge project.
2. Click on the "create a local Python runtime" link.
3. Click on "Open local Python shell".
4. Run: pip install pdfplumber
5. Add the following subs:
B4X:
Private Sub GetPageText (pdf As PyWrapper, PageNumber As Int) As ResumableSub
    Wait For (pdf.GetField("pages").Get(PageNumber).Run("extract_text").Fetch) Complete (Result As PyWrapper)
    Return Result.Value
End Sub

Private Sub OpenPdf (FileName As String) As PyWrapper
    Dim pdf As PyWrapper = PDFPlumber.Run("open").Arg(FileName)
    Return pdf
End Sub

Usage:
B4X:
PDFPlumber = Py.ImportModule("pdfplumber") 'global PyWrapper variable
Dim pdf As PyWrapper = OpenPdf("C:\Users\H\Downloads\aaaa.pdf")
Wait For (pdf.GetField("pages").Len.Fetch) Complete (Result As PyWrapper)
Dim NumberOfPages As Int = Result.Value
Log($"Opened pdf with ${NumberOfPages} pages"$)
Wait For (GetPageText(pdf, 0)) Complete (Text As String) 'get first page text
Log(Text)
'close when done
pdf.Run("close")
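
For comparison, the PyBridge calls above correspond roughly to the plain Python below. This is a sketch rather than part of the B4J project: `read_page_texts` and `value_after_label` are hypothetical names, and the label/offset lookup assumes the layout from the first post, where the wanted value sits a fixed number of lines below its label.

```python
def read_page_texts(path):
    """Return the extracted text of every page, in page order."""
    import pdfplumber  # third-party; installed with: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() may yield nothing for image-only pages,
        # so normalise that case to an empty string.
        return [page.extract_text() or "" for page in pdf.pages]


def value_after_label(text, label, offset=1):
    """Return the stripped line found `offset` lines below the first line
    containing `label`, or None when the label or that line is missing."""
    lines = text.split("\n")
    for i, line in enumerate(lines):
        if label in line:
            if i + offset < len(lines):
                return lines[i + offset].strip()
            return None
    return None
```

With the text of a page in hand, `value_after_label(text, "Nome Completo", 2)` would return the line two below the label, which is where the first post expects the name.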
 

Erel

B4X founder
Staff member
Licensed User
Longtime User
If you also want to read files from the assets folder, you need to read the bytes in B4J and send them to the Python process:
B4X:
Private Sub OpenPdf (Dir As String, FileName As String) As PyWrapper
    Dim stream As PyWrapper = Py.ImportModule("io").Run("BytesIO").Arg(File.ReadBytes(Dir, FileName))
    Dim pdf As PyWrapper = PDFPlumber.Run("open").Arg(stream)
    pdf.Print
    Return pdf
End Sub
 

DonManfred

Expert
Licensed User
Longtime User
Extract text with pdfplumber
Thank you for the example. I fear the thread opener does not need the PDF content as text.

Instead, he wants to know the field names of the AcroForm fields contained in the PDF. At least, this is how I interpret the question in the subject.
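
If form fields are indeed what is needed, one possible starting point on the Python side is PyMuPDF's widget API, since `fitz` was already mentioned in the first post. This is a minimal sketch, assuming PyMuPDF is installed and the PDF contains AcroForm fields. `fields_from_widgets` and `read_form_fields` are hypothetical names, and this has not been tried through PyBridge.

```python
from types import SimpleNamespace


def fields_from_widgets(widgets):
    """Map field name -> field value for objects that expose
    `field_name` and `field_value` attributes (as PyMuPDF widgets do)."""
    return {w.field_name: w.field_value for w in widgets}


def read_form_fields(pdf_path):
    """Collect the form fields of every page into one dictionary."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    doc = fitz.open(pdf_path)
    try:
        fields = {}
        for page in doc:
            fields.update(fields_from_widgets(page.widgets()))
        return fields
    finally:
        doc.close()


# Stand-in objects, only to show the shape of the result without a real PDF.
demo_widgets = [
    SimpleNamespace(field_name="cpf", field_value="123"),
    SimpleNamespace(field_name="nome", field_value="Ana"),
]
print(fields_from_widgets(demo_widgets))
```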
 