PDFReader reader = new PDFReader(new File("my.pdf"));
reader.open(); // open the file.
int pages = reader.getNumberOfPages();
for (int i = 0; i < pages; i++)
{
    String text = reader.extractTextFromPage(i);
    System.out.println("Page " + i + ": " + text);
}
Software Development and IT security - adamboulton@gmail.com http://uk.linkedin.com/in/adamboulton
Tuesday, June 30, 2009
Java PDF Library
I have been playing around with extracting data from PDF files. Apache PDFBox looked pretty promising, but unfortunately it is far behind some of the other libraries available. iText is a mature library but lacks the ability to extract information (it is really a PDF creator). The work done by LAB Asprise! impressed me a lot. It took minutes to understand their API and start coding. The parsing is fast and, so far, appears accurate. The library is also extremely small for what it provides (just over 3MB). If you are looking for a powerful Java API for processing PDFs, then I strongly recommend it. Here is a code sample for extracting text (taken from their site). It clearly demonstrates what an awesome job these guys have done.
1 comment:
Or with Qoppa's jPDFText library:
import java.io.FileWriter;
import com.qoppa.pdfText.PDFText;

public class ExtractTextByPage
{
    public static void main (String [] args)
    {
        try
        {
            // Load the document
            PDFText pdfText = new PDFText ("input.pdf", null);
            // Loop through the pages
            for (int pageIx = 0; pageIx < pdfText.getPageCount(); ++pageIx)
            {
                // Get the text for the page
                String pageText = pdfText.getText(pageIx);
                // Save the text in a file
                FileWriter output = new FileWriter ("output_" + pageIx + ".txt");
                output.write(pageText);
                output.close();
            }
        }
        catch (Throwable t)
        {
            t.printStackTrace();
        }
    }
}
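One thing about the save step above: if write() throws, close() is never reached and the file handle leaks. A rough sketch of the same "one file per page" loop using try-with-resources is below. It uses only the standard library, so the page text is passed in as plain strings standing in for whatever the PDF library returns. The class name PageTextSaver and the method savePages are made up for illustration and are not part of any PDF library.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jPDFText or any other PDF library.
class PageTextSaver {

    // Writes each page's text to output_<index>.txt inside dir
    // and returns the paths of the files that were written.
    static List<Path> savePages(List<String> pageTexts, Path dir) throws IOException {
        List<Path> written = new ArrayList<>();
        for (int pageIx = 0; pageIx < pageTexts.size(); pageIx++) {
            Path target = dir.resolve("output_" + pageIx + ".txt");
            // try-with-resources closes the writer even if write() throws
            try (Writer output = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
                output.write(pageTexts.get(pageIx));
            }
            written.add(target);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder strings; in real use these would come from the PDF library.
        List<String> pageTexts = List.of("Text of page 0", "Text of page 1");
        Path dir = Files.createTempDirectory("pdf-text");
        for (Path p : savePages(pageTexts, dir)) {
            System.out.println("Wrote " + p);
        }
    }
}
```

In real use you would fill the list from the extraction loop and pass the directory you want the files in. Writing with an explicit charset also avoids depending on the platform default encoding.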