Dear all,

Could anyone explain how to convert a PDF to text format?

Thanks in advance.

Regards,
Jose Martin
on 11.08.2008 11:44
on 11.08.2008 12:06
-------- Original Message --------
> Date: Mon, 11 Aug 2008 18:41:51 +0900
> From: dare ruby <martin@angleritech.com>
> To: ruby-talk@ruby-lang.org
> Subject: PDF to text covertor?
>
> Could anyone explain how to do convert PDF to text format.

Dear Jose,

it depends on whether your PDF actually contains text, or only images that a human can recognize as text.

In the first case, you can try a tool like pdftotext (http://en.wikipedia.org/wiki/Pdftotext), on Linux and Mac at least. On Windows, some PDF viewers also offer a "Save as text" option.

In the second case, you'll have to use OCR (optical character recognition) software. There are some good commercial products available. I've liked ABBYY's FineReader (on Windows).

Best regards,
Axel
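To tell the two cases apart from a script, one rough approach is to look at how much visible text the extraction step produced. The sketch below is only a heuristic under stated assumptions: `likely_scanned?` is a hypothetical helper name, the threshold of 20 characters is arbitrary, and the input string is assumed to be whatever a tool such as pdftotext printed for the file.

```ruby
# Hypothetical helper: guesses whether a PDF is image-only, based on the
# text that an extraction tool (e.g. pdftotext) produced for it.
# min_chars is an arbitrary threshold chosen for illustration.
def likely_scanned?(extracted_text, min_chars = 20)
  # Drop all whitespace, including the form feeds that mark page breaks.
  visible = extracted_text.gsub(/\s+/, "")
  visible.length < min_chars
end

# Output that contains nothing but page breaks suggests scanned pages.
puts likely_scanned?("\f\f\f")
# Output with real sentences suggests the PDF carries a text layer.
puts likely_scanned?("Chapter 1\nRuby is a dynamic, open source language.\f")
```

A PDF flagged this way would then go to OCR software instead of a plain text extractor. Very short documents can be misclassified, so the threshold would need tuning for real material.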
on 11.08.2008 13:11
Hi,

In <59a3f50dc89e69c5250b753986657c78@ruby-forum.com>
  "PDF to text covertor?" on Mon, 11 Aug 2008 18:41:51 +0900,
  dare ruby <martin@angleritech.com> wrote:

> Could anyone explain how to do convert PDF to text format.

It seems that Ruby/Poppler (*1), the Ruby bindings for Poppler (*2), is what you're looking for. There is a sample script:

http://ruby-gnome2.svn.sourceforge.net/viewvc/ruby-gnome2/ruby-gnome2/trunk/poppler/sample/pdf2text.rb?view=markup

(*1) http://ruby-gnome2.sourceforge.jp/hiki.cgi?Ruby%2FPoppler
(*2) http://poppler.freedesktop.org/

pdftotext is a command-line utility bundled with Poppler.

Thanks,
on 19.08.2008 08:14
I have some study materials as PDF documents. I need to convert the PDFs to a text format, such as a Microsoft Word document or a plain text file that opens in TextPad, on Windows. I need to do the conversion with a Ruby program. Could anyone suggest how to do this?

Thanks in advance.

Regards,
Jose Martin
on 19.08.2008 19:43
On Mon, Aug 18, 2008 at 11:10 PM, dare ruby <martin@angleritech.com> wrote:
> I have some of the study materials as PDF documents. I need to parse the
> PDF to any text format like microsoft word or text pad in windows OS. I
> need to do parsing using a ruby program. Could any one suggesst on this?

Your best bet is a Ruby script that calls out to xpdf to do the actual PDF-to-text conversion and then parses the resulting text. There's a Windows port of the xpdf command-line utilities.

http://gnuwin32.sourceforge.net/packages/xpdf.htm
http://www.perlmonks.org/?node_id=298041
http://www.kapustabrothers.com/2008/01/20/indexing-pdf-documents-with-zend_search_lucene/
http://forjournalists.com/cookbook/index.php?title=XPDF

martin
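As a minimal sketch of that approach (not a tested recipe for any particular setup), the script below shells out to the `pdftotext` utility that ships with xpdf and splits the result into pages. It assumes `pdftotext` is installed and on the PATH and relies on pdftotext's default behaviour of separating pages with a form feed. The helper names `pdftotext_command`, `pdf_to_text` and `split_pages` are made up for illustration.

```ruby
require "open3"

# Builds the argument list for pdftotext. Passing "-" as the output file
# name sends the text to standard output instead of writing a .txt file.
def pdftotext_command(pdf_path, layout: false)
  cmd = ["pdftotext"]
  cmd << "-layout" if layout # keep the physical layout of the page
  cmd + [pdf_path, "-"]
end

# Runs pdftotext and returns the extracted text.
# Assumption: the pdftotext executable can be found on the PATH.
def pdf_to_text(pdf_path, layout: false)
  out, err, status = Open3.capture3(*pdftotext_command(pdf_path, layout: layout))
  raise "pdftotext failed: #{err}" unless status.success?
  out
end

# pdftotext ends each page with a form feed, so splitting on "\f"
# yields one string per page.
def split_pages(text)
  text.split("\f")
end

# Example: ruby pdf2txt.rb notes.pdf  (pdf2txt.rb is a hypothetical file name)
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  split_pages(pdf_to_text(ARGV[0])).each_with_index do |page, i|
    puts "Page #{i + 1}: #{page.split.size} words"
  end
end
```

Keeping the command construction separate from the call makes it easy to check the arguments without having a PDF at hand. The per-page strings can then be parsed with ordinary Ruby string methods or written out to a text file.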