Tuesday, June 30, 2009

Java PDF Library

I have been playing around with extracting data from PDF files. Apache PDFBox looked promising, but unfortunately it is far behind some of the other libraries available. iText is a mature library, but it is primarily a PDF creator and lacks the ability to extract information. The library from LAB Asprise! impressed me much more. It took only minutes to understand their API and start coding. The parsing is fast and, so far, appears accurate. The library is also extremely small for what it provides (just over 3MB). If you are looking for a powerful Java API for processing PDFs, I strongly recommend it. Here is a code sample for extracting text (taken from their site). It shows how little code the job takes:

PDFReader reader = new PDFReader(new File("my.pdf"));
reader.open(); // open the file.
int pages = reader.getNumberOfPages();

for (int i = 0; i < pages; i++)
{
    String text = reader.extractTextFromPage(i);
    System.out.println("Page " + i + ": " + text);
}
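If you would rather save the text than print it, here is a minimal sketch that writes one file per page using only the JDK. It does not touch any PDF library: the `pages` list stands in for whatever the extraction loop above returned, and the names `PageTextWriter`, `writePages` and the `page_N.txt` file pattern are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of any PDF library.
final class PageTextWriter {

    // Writes pages.get(i) to <dir>/page_<i+1>.txt as UTF-8 and
    // returns the number of files written.
    static int writePages(List<String> pages, Path dir) throws IOException {
        Files.createDirectories(dir);
        for (int i = 0; i < pages.size(); i++) {
            // File names are numbered from 1, the way a reader counts pages.
            Path out = dir.resolve("page_" + (i + 1) + ".txt");
            Files.write(out, pages.get(i).getBytes(StandardCharsets.UTF_8));
        }
        return pages.size();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in real use each entry would come from the PDF library.
        List<String> pages = Arrays.asList("text of first page", "text of second page");
        Path dir = Files.createTempDirectory("pdf-text");
        int written = writePages(pages, dir);
        System.out.println("Wrote " + written + " files to " + dir);
    }
}
```

Collecting the page text first and writing it afterwards keeps the file handling separate from the PDF code, so the same helper works whichever extraction library you end up using.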

2 comments:

Qoppa Software said...

Or with Qoppa's jPDFText library:

import java.io.FileWriter;

import com.qoppa.pdfText.PDFText;

public class ExtractTextByPage
{
    public static void main (String [] args)
    {
        try
        {
            // Load the document
            PDFText pdfText = new PDFText ("input.pdf", null);

            // Loop through the pages
            for (int pageIx = 0; pageIx < pdfText.getPageCount(); ++pageIx)
            {
                // Get the text for the page
                String pageText = pdfText.getText(pageIx);

                // Save the text in a file
                FileWriter output = new FileWriter ("output_" + pageIx + ".txt");
                output.write(pageText);
                output.close();
            }
        }
        catch (Throwable t)
        {
            t.printStackTrace();
        }
    }
}

katherine johanson said...

You should also try the Aspose.Pdf for Java library for extracting text. Below is the code I found on their documentation page for extracting text from all pages of a PDF file:


// Open the document
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("input.pdf");
// Create a TextAbsorber object to extract text
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
// Accept the absorber for all the pages
pdfDocument.getPages().accept(textAbsorber);
// Get the extracted text
String extractedText = textAbsorber.getText();

// Create a writer and open the file
java.io.FileWriter writer = new java.io.FileWriter(new java.io.File("Extracted_text.txt"));
// Write the extracted text to the file
writer.write(extractedText);
// Close the stream
writer.close();