Thursday 25 July 2013

Parse a formatted PDF in Rails

I had to parse a PDF that was formatted with different styles, e.g. bold text. I tried the pdf-reader gem, but it wasn't parsing properly: bold text was repeated, and it was really hard to figure out what the original text was.

Then I tried iText, which is implemented in Java. I used it through the Ruby Java Bridge, but none of its strategies could parse my formatted PDF properly.

At last I found Docsplit, and it could parse my PDF properly.

Here are the steps:

These are the gems you need to include. You might also need sudo permission to install some of them:

gem "pdftk"
gem "docsplit"
gem "glib2"
gem "gdk_pixbuf2"
gem "poppler"

Some of the gems don't install through bundle install alone, and you might need to install their dependencies using apt-get first. If you have trouble installing any gem, search for the error message or leave a comment below.
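
Before running bundle install, it can help to check whether the command-line tools the gems rely on are already on your PATH. The following is only a rough sketch: the tool names (pdftotext from Poppler, and pdftk) are my assumption about what this setup needs, and the helper names are made up for illustration.

```ruby
# Sketch: report which command-line tools are missing from PATH.
# The tool names below are an assumption, not an authoritative list.
REQUIRED_TOOLS = %w[pdftotext pdftk].freeze

# Returns true if an executable file with the given name exists
# in any directory listed in the given PATH string.
def tool_on_path?(name, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Returns the subset of tools that could not be found.
def missing_tools(tools = REQUIRED_TOOLS, path = ENV.fetch('PATH', ''))
  tools.reject { |tool| tool_on_path?(tool, path) }
end

missing = missing_tools
puts "Missing tools: #{missing.join(', ')}" unless missing.empty?
```

Anything this reports as missing is a candidate for installing through your system package manager before retrying the gems.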

After successfully installing all the gems:

You only need one call to parse your PDF and save all of its text to a text file with the same name as the PDF file.

# Directory where you want to save the text file
file_dest = Rails.public_path.to_s + '/pdfparser/text'
Docsplit.extract_text(pdf_path, :output => file_dest)
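
Since the text file takes its name from the PDF, you can work out its path up front and read it later. This is a small sketch that assumes the output is written as "<PDF name>.txt" inside the output directory; extracted_text_path is a hypothetical helper name, not part of Docsplit.

```ruby
# Sketch: predict where the extracted text file ends up.
# Assumes the file is written as "<pdf basename>.txt" into the
# output directory when the whole document is extracted.
def extracted_text_path(pdf_path, output_dir)
  File.join(output_dir, "#{File.basename(pdf_path, '.*')}.txt")
end

puts extracted_text_path('/tmp/invoices/report.pdf', '/tmp/pdfparser/text')
# prints /tmp/pdfparser/text/report.txt
```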

There are many other options: for example, you can parse specific pages of the PDF or even extract images from it. Please refer to the Docsplit documentation for details.

After this, if you want to find the text between two strings in the file, here's what you need to do:

text_main = File.read(extracted_text_url)
# Use Regexp.escape on both strings in case they contain any special regex characters.

text = text_main.scan(/#{Regexp.escape(str_from_text)}(.*?)#{Regexp.escape(str_to_text)}/m)
text = text[0].try(:first).try(:rstrip).try(:to_s)

The text variable will contain the text you want, or nil if nothing matched.
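
If you want to try the same idea outside Rails (where try is not available), here is a minimal plain-Ruby sketch. text_between is a hypothetical helper name, and the sample string is invented for illustration. Note that it strips whitespace on both sides of the result, not just trailing whitespace.

```ruby
# Returns the text between the first occurrence of from_marker and the
# next occurrence of to_marker, or nil when either marker is missing.
# Both markers are escaped so special regex characters match literally.
def text_between(source, from_marker, to_marker)
  pattern = /#{Regexp.escape(from_marker)}(.*?)#{Regexp.escape(to_marker)}/m
  match = source.match(pattern)
  match && match[1].strip
end

# Invented sample text standing in for the extracted file contents.
sample = "Invoice No: (A-42)\nTotal: 100.00\nThank you"

puts text_between(sample, 'Invoice No: (', ')')   # prints A-42
puts text_between(sample, 'Total:', "\nThank")    # prints 100.00
```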
