PDFReader reader = new PDFReader(new File("my.pdf"));
reader.open(); // open the file.
int pages = reader.getNumberOfPages();
for(int i=0; i < pages; i++)
String text = reader.extractTextFromPage(i);
System.out.println("Page " + i + ": " + text);
Tuesday, June 30, 2009
Java PDF Library
I have been playing around with extracting data from PDF files. Apache PDF Box looked pretty promising but unfortunately it is far behind some of the others that are available. iText is a mature library but lacks the ability to extract information (it is actually a PDF creator). I was very impressed by the work done by LAB Asprise!. It took minutes to understand their impressive API and start coding. The parsing is fast, and so far appears accurate. The library is also extremely small for the abilities it provides (just over 3MB). If you are looking for a powerful Java API for processing PDFs then I strongly recommend it. Here is a code sample for extracting text (taken from their site). The code clearly demonstrates how much of an awesome job these guys have done....