Ruby Forum > Ruby > PDF to text converter?

Posted by dare ruby (martin_mercy2001)
on 11.08.2008 11:44
Dear all,

Could anyone explain how to convert a PDF to text format?

Thanks in advance

Regards,
Jose Martin
Posted by Axel Etzold (Guest)
on 11.08.2008 12:06
(Received via mailing list)
-------- Original Message --------
> Date: Mon, 11 Aug 2008 18:41:51 +0900
> From: dare ruby <martin@angleritech.com>
> To: ruby-talk@ruby-lang.org
> Subject: PDF to text converter?

> Dear all,
> 
> Could anyone explain how to convert a PDF to text format?
> 
> Thanks in advance
> 
> Regards,
> Jose Martin

Dear Jose,

It depends on whether your PDF actually contains text or just images of
text that only a human can read.
In the first case, you can try a tool like pdftotext
(http://en.wikipedia.org/wiki/Pdftotext), at least on Linux and Mac.
On Windows, some PDF viewers also offer a "Save as text" option.

In the second case, you'll have to use OCR (optical character
recognition) software. There are some good commercial packages
available. I've liked ABBYY's FineReader (on Windows).
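
One rough way to tell the two cases apart is to run pdftotext and see
whether anything readable comes out. The sketch below assumes pdftotext
is installed and on the PATH. The helper names and the 20-character
threshold are made up for illustration, so treat it as a heuristic
rather than a reliable test.

```ruby
require "open3"

# Runs the pdftotext command-line tool and returns the extracted text.
# Assumption: pdftotext is on the PATH; "-" sends the text to stdout.
def extract_text(pdf_path)
  out, err, status = Open3.capture3("pdftotext", pdf_path, "-")
  raise "pdftotext failed: #{err.strip}" unless status.success?
  out.scrub # replace any invalid bytes so later string calls are safe
end

# Heuristic (hypothetical threshold): a PDF whose extracted text has
# fewer than min_chars non-whitespace characters is probably scanned
# images and would need OCR instead.
def text_layer?(text, min_chars = 20)
  text.gsub(/\s+/, "").length >= min_chars
end

if __FILE__ == $PROGRAM_NAME && ARGV[0]
  text = extract_text(ARGV[0])
  puts text_layer?(text) ? "Looks like real text" : "Probably scanned, try OCR"
end
```

If pdftotext is not installed, Open3.capture3 raises Errno::ENOENT, so
a real script would probably want to rescue that and print a hint.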

Best regards,

Axel
Posted by Kouhei Sutou (Guest)
on 11.08.2008 13:11
(Received via mailing list)
Hi,

In <59a3f50dc89e69c5250b753986657c78@ruby-forum.com>
  "PDF to text covertor?" on Mon, 11 Aug 2008 18:41:51 +0900,
  dare ruby <martin@angleritech.com> wrote:

> Could anyone explain how to convert a PDF to text format?

It seems that Ruby/Poppler(*1), the Ruby bindings of
Poppler(*2), is what you're looking for.
  http://ruby-gnome2.svn.sourceforge.net/viewvc/ruby-gnome2/ruby-gnome2/trunk/poppler/sample/pdf2text.rb?view=markup

(*1) http://ruby-gnome2.sourceforge.jp/hiki.cgi?Ruby%2FPoppler
(*2) http://poppler.freedesktop.org/

pdftotext is a command-line application bundled with Poppler.
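
For reference, the linked sample boils down to roughly the following.
This is a minimal sketch, assuming the poppler gem (Ruby/Poppler) is
installed and that Poppler::Document.new accepts a file path; the exact
constructor arguments have varied between versions of the bindings, so
check the one you have. The method names pdf_text_via_poppler and
join_pages are made up for this example.

```ruby
# Joins per-page text, separating pages with a form feed character.
def join_pages(page_texts)
  page_texts.map { |text| text.to_s.rstrip }.join("\n\f")
end

# Extracts the text of every page using the Ruby/Poppler bindings.
# The require is done lazily so this file loads even without the gem.
def pdf_text_via_poppler(pdf_path)
  require "poppler"
  document = Poppler::Document.new(pdf_path) # assumption: a path is accepted
  page_texts = []
  document.each { |page| page_texts << page.get_text }
  join_pages(page_texts)
end

if __FILE__ == $PROGRAM_NAME && ARGV[0]
  puts pdf_text_via_poppler(ARGV[0])
end
```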


Thanks,
Posted by dare ruby (martin_mercy2001)
on 19.08.2008 08:14
I have some study materials as PDF documents. I need to convert the
PDFs to a text format such as Microsoft Word or plain text (for
TextPad) on Windows. I need to do the conversion with a Ruby program.
Could anyone suggest how to do this?

Thanks in advance

Regards,
Jose Martin
Posted by Martin DeMello (Guest)
on 19.08.2008 19:43
(Received via mailing list)
On Mon, Aug 18, 2008 at 11:10 PM, dare ruby <martin@angleritech.com> 
wrote:
> I have some study materials as PDF documents. I need to convert the
> PDFs to a text format such as Microsoft Word or plain text (for
> TextPad) on Windows. I need to do the conversion with a Ruby program.
> Could anyone suggest how to do this?

Your best bet is a Ruby script that calls out to xpdf to do the actual
PDF-to-text conversion and then parses the text. There's a Windows port
of the xpdf command-line utilities.

http://gnuwin32.sourceforge.net/packages/xpdf.htm
http://www.perlmonks.org/?node_id=298041
http://www.kapustabrothers.com/2008/01/20/indexing-pdf-documents-with-zend_search_lucene/
http://forjournalists.com/cookbook/index.php?title=XPDF
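
A minimal sketch of that approach, assuming the xpdf (or Poppler)
pdftotext executable is installed and on the PATH. The helper names
pdf_to_txt and pages_from are hypothetical, and the "parsing" here only
splits the text into pages and non-empty lines; what you do with the
lines depends on your documents.

```ruby
require "open3"

# Converts a PDF to a .txt file by shelling out to pdftotext.
# -layout asks pdftotext to keep the original physical layout.
def pdf_to_txt(pdf_path, txt_path = pdf_path.sub(/\.pdf\z/i, "") + ".txt")
  _out, err, status = Open3.capture3("pdftotext", "-layout", pdf_path, txt_path)
  raise "pdftotext failed: #{err.strip}" unless status.success?
  txt_path
end

# pdftotext separates pages with a form feed ("\f"). This returns an
# array of pages, each an array of non-empty, right-stripped lines.
def pages_from(text)
  text.split("\f")
      .map { |page| page.lines.map(&:rstrip).reject(&:empty?) }
      .reject(&:empty?)
end

if __FILE__ == $PROGRAM_NAME && ARGV[0]
  txt_path = pdf_to_txt(ARGV[0])
  pages_from(File.read(txt_path)).each_with_index do |lines, index|
    puts "Page #{index + 1}: #{lines.length} non-empty lines"
  end
end
```

Passing the command and its arguments as separate strings avoids shell
quoting problems with file names that contain spaces, which are common
on Windows.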

martin