From grdetil@scrc.umanitoba.ca Fri Aug 23 16:25:00 2002
Date: Fri, 23 Aug 2002 15:29:38 -0500 (CDT)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
To: Ted Stresen-Reuter <bowlofcereal@hotmail.com>
Cc: "ht://Dig mailing list" <htdig-general@lists.sourceforge.net>
Subject: Re: [htdig] pdf-parser

According to Ted Stresen-Reuter:
> On a related note, is there any way to customize the TITLE attribute
> htsearch displays for pdfs? We have over 100 MB of pdfs we index every night
> and it would be VERY helpful to be able to provide more accurate titles in
> the search results.

Well, the best way is to edit the PDF description information, in Acrobat
Exchange, to set the title.  That way, the conv_doc.pl or doc2html.pl
script will pick it up automatically, via pdfinfo.

Failing that, the other option is to put a hook into your Perl script to
read the alternate title for a given URL from a file.  Here's how I did
it in conv_doc.pl, for some PDFs of scientific papers...

--- contrib/conv_doc.pl.orig	Thu Jul 12 09:38:29 2001
+++ contrib/conv_doc.pl	Thu Oct 18 12:23:58 2001
@@ -71,6 +71,7 @@ $CATPDF = "/usr/bin/pdftotext";
 $PDFINFO = "/usr/bin/pdfinfo";
 #$CATPDF = "/usr/local/bin/pdftotext";
 #$PDFINFO = "/usr/local/bin/pdfinfo";
+$titlelist = "/home/httpd/html/SCRC/manuscripts/titles.lst";
 
 #########################################
 #
@@ -183,6 +183,23 @@ if ($ishtml) {
 print "<HTML>\n<head>\n";
 
 # print out the title, if it's set, and not just a file name, or make one up
+if (-r $titlelist) {
+    if (open(INFO, "grep \"$ARGV[2]\" $titlelist 2>$null |")) {
+        while (<INFO>) {
+            if (/^$ARGV[2]/) {
+                s/^$ARGV[2]\s+//;
+                s/\s+$//;
+                s/\s+/ /g;
+                s/&/\&amp\;/g;
+                s/</\&lt\;/g;
+                s/>/\&gt\;/g;
+                $title = $_;
+                last;
+            }
+        }
+        close INFO;
+    }
+}
 if ($title eq "" || $title =~ /^[A-G]:[^\s]+\.[Pp][Dd][Ff]$/) {
     @parts = split(/\//, $ARGV[2]);         # get the file basename
     $parts[-1] =~ s/%([A-F0-9][A-F0-9])/pack("C", hex($1))/gie;


Here, for example, is a line from titles.lst:

http://www.scrc.umanitoba.ca/SCRC/manuscripts/41.pdf	Spinal circuitry of sensorimotor control of locomotion

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


-------------------------------------------------------
This sf.net email is sponsored by: OSDN - Tired of that same old
cell phone?  Get a new here for FREE!
https://www.inphonic.com/r.asp?r=sourceforge1&refcode1=vs3390
_______________________________________________
htdig-general mailing list <htdig-general@lists.sourceforge.net>
To unsubscribe, send a message to <htdig-general-request@lists.sourceforge.net> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html

