Script to extract date & subject

raynewalker · 2019-08-23 20:09

Hello!

I have a gazillion family history PDF files that I need to rename. In their original format, most were emails (.eml) that were retrieved from a backup and then converted to PDF.

I'd like to extract the date the email was created or sent and the info in the subject line. That extracted data becomes the new PDF filename.

I don't have a clue how to write a script. I searched this forum and found two that I thought might work. I copied and pasted each into a "Pascal Script" window. I am using ReNamer Pro v7.1. One script required downloading and installing pdfinfo.exe, which I did. Neither script worked.

Here's an example of what I'd like to accomplish:

RENAME:
06283EF9-0000001C.pdf

TO:
2011-02-23_1787 billings & bradford deaths [nehgs].pdf

Thanks in advance for any help you can give.

==============================
These are the two scripts I tested that didn't work.

This script made no changes in the filename.

{ Extract PDF tags using Xpdf }

const
  EXE = 'pdfinfo.exe';
  TAG = 'Title\s*\:\s*(.*?)[\r\n]';

var
  Command, Output: String;
  Matches: TWideStringArray;

begin
  Command := EXE+' "'+FilePath+'"';
  if ExecConsoleApp(Command, Output) = 0 then
  begin
    Matches := SubMatchesRegEx(Output, TAG, False);
    if Length(Matches) > 0 then
      FileName := Matches[0] + WideExtractFileExt(FileName);
  end;
end.

===============AND===============

This script changed the filename 2F2A0G80-0000011B.pdf to "(.pdf)" without the quotes.

{ Extract PDF tag }

const
  PDF_INFO = 'pdfinfo.exe';
  PDF_TAG = 'Title';

function ExtractTagPDF(const Info, Tag: string): string;
var
  Lines: TStringsArray;
  I, Delim: Integer;
begin
  Result := '';
  Lines := WideSplitString(Info, #13#10);
  for I := 0 to Length(Lines)-1 do
  if WideSameText(Tag, Copy(Lines[i], 1, Length(Tag))) then
  begin
    Delim := WidePos(':', Lines[i]);
    if Delim > 0 then
    begin
      Result := WideCopy(Lines[i], Delim+1, WideLength(Lines[i]));
      Result := Trim(Result);
    end;
  end;
end;

var
  Command, Output: string;
  TagValue: string;

begin
  Command := '"'+PDF_INFO+'" "'+FilePath+'"';
  ExecConsoleApp(Command, Output);
  TagValue := ExtractTagPDF(Output, PDF_TAG);
  FileName := TagValue + WideExtractFileExt(FileName);
end.

Stefan · 2019-08-23 21:19

Hi.

raynewalker wrote:

I have a gazillion family history PDF files...
I'd like to extract the date the email was created or sent and the info in the subject line.
RENAME:
06283EF9-0000001C.pdf
TO:
2011-02-23_1787 billings & bradford deaths [nehgs].pdf

You have to use pdfinfo.exe on a command line (cmd.exe) first.
Next see what part of the output of pdfinfo.exe you could use for your new file name.

Here is some documentation:
https://www.experts-exchange.com/videos … Files.html
See "6. Run the PDFinfo utility on the sample PDF file."

After that, adjust the parameters of the scripts to get exactly what you need:
- TAG = 'Title\s*\:\s*(.*?)[\r\n]' //extract parts of the title tag.
or
- PDF_TAG = 'Title'; //take the whole title tag
(or use a completely different approach and script for your issue)

While searching I also found this old topic from the early days of PDF extraction:
https://www.den4b.com/forum/viewtopic.php?id=349

We can probably help if you post one such output here and say which parts of it you want.

raynewalker · 2019-08-23 21:49

Yes, I tried the script in the link you found from the older post. It's the second script in my first post, the one where the new filename came out as "(.pdf)" without the quotes.

I'll get back to you after I've watched the video about pdfinfo.exe. I followed the directions in another post, but having a video for instruction is much better.

Thanks for your reply. I appreciate your time.

den4b · 2019-08-23 22:18

raynewalker wrote:

In their original format, most were emails (.eml) that were retrieved from a backup and then converted to PDF.
I'd like to extract the date the email was created or sent and the info in the subject line.

You should've kept your archive in the original and native *.eml format. This would have allowed you to quite easily extract the subject line and sent date via the email meta tags, e.g. Email_Subject, Email_DateSent, and others.

The conversion to PDF essentially prints the text of your emails to a virtual page, almost like a physical printer. This process loses the original email headers (meta tags) and makes automated extraction of the content more difficult and error-prone.

The PDF meta tags have nothing in common with the meta tags in the original *.eml format. So the pdfinfo tool would be quite useless in this case, unless your PDF printer has magically transcribed email headers into appropriate PDF meta tags.

If you don't have your original *.eml files, then your best bet is to use the pdftotext tool to extract the printed content from each PDF file and then parse the text for the required pieces of information, provided that information was printed into the PDF content.

You can perform this task within a Pascal Script rule in ReNamer.

raynewalker · 2019-08-24 16:14

Good morning and thanks for this info. You've saved me a lot of time testing and searching for a method to extract data from this directory of archived PDF files.

I'm probably reaching but.... if I convert the PDF files to MS Word .docx, is there a script that will extract the data needed?

Thanks again for your assistance.

den4b · 2019-08-24 21:11

den4b wrote:

... your best bet is to use the pdftotext tool to extract the printed content from each PDF file and then parse the text for the required pieces of information, provided that information was printed into the PDF content.

Download the Xpdf command line tools, and extract the "pdftotext" tool.

Then, try it on your PDF files, to extract the content:

pdftotext.exe "my file.pdf" "my file.txt"

The example above will convert "my file.pdf" to "my file.txt".

If the content in the txt file has your data in a parseable format, then the story shall continue...

raynewalker · 2019-08-26 22:28

The command pdftotext.exe "my file.pdf" "my file.txt" worked fine, but doing this one file at a time won't work with thousands of PDF files to rename. I added about 30 PDF files to ReNamer, selected them, and used "Copy to Clipboard". I was hoping the filenames would paste into the command window, but they didn't.

What are your thoughts on the next steps to extract the information I need?

Thanks again!

den4b · 2019-08-28 07:21

The next step is to create a Pascal Script rule which will automatically call pdftotext for every file, extract the content, find the needed pieces of information and put them into the new names of files.

Something like what is demonstrated in this Xpdf script:
http://www.den4b.com/wiki/ReNamer:Scripts:Xpdf

If you provide a few sample files, or sample pdftotext output at least, then we can help you with writing a script.

raynewalker · 2019-08-30 04:12

Thank you. I'll work with the script, although I think I already tried it.

I don't know how I can provide you with output without making email addresses and other private information public. I don't trust redaction to be foolproof.

I'll get back to you next week, after the holiday (U.S.).

Best Regards.

den4b · 2019-08-30 06:58

Take one of those *.txt file exports and substitute the sensitive content with some junk placeholders, but keep all formatting intact for correct parsing.

For example:

Date: 2019-08-26 10:55:00
Subject: aaaaaaaaaaaaaaaaaaaaaa
From: aaaaaaa <aaaaaa@aaaaa.aaa>
To: aaaaaaa <aaaaaa@aaaaa.aaa>

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

We don't need the whole text, just the areas with the key pieces of information which you want to extract.

You may want to experiment with additional options of the "pdftotext" tool to find the most parsable output, including:

pdftotext <file>
pdftotext -layout <file>
pdftotext -simple <file>
pdftotext -table <file>
pdftotext -raw <file>
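
Once a sample is available, the parsing step could look roughly like the sketch below. It follows the same pattern as the pdfinfo script quoted earlier in this thread, but calls pdftotext instead. It assumes, without any confirmation yet, that the exported text contains lines shaped like the placeholder example above ("Date: 2019-08-26 ..." and "Subject: ..."). The two patterns DATE_TAG and SUBJ_TAG are hypothetical and will need adjusting to the real output. It also assumes that ReNamer can find pdftotext.exe (the same requirement as for pdfinfo.exe in the earlier scripts), and it relies on pdftotext printing the text to standard output when "-" is given as the output file.

```pascal
{ Sketch: build the new name from Date and Subject lines in pdftotext output }

const
  EXE = 'pdftotext.exe';
  // Assumed formats, to be adjusted once the real output is known
  DATE_TAG = 'Date[ \t]*\:[ \t]*(\d{4}-\d{2}-\d{2})';
  SUBJ_TAG = 'Subject[ \t]*\:[ \t]*(.*?)[ \t]*[\r\n]';

var
  Command, Output: String;
  DateParts, SubjParts: TWideStringArray;

begin
  // "-" as the output file sends the extracted text to stdout
  Command := EXE + ' "' + FilePath + '" -';
  if ExecConsoleApp(Command, Output) = 0 then
  begin
    DateParts := SubMatchesRegEx(Output, DATE_TAG, False);
    SubjParts := SubMatchesRegEx(Output, SUBJ_TAG, False);
    // Rename only when both pieces were found and the subject is not empty
    if (Length(DateParts) > 0) and (Length(SubjParts) > 0) then
      if SubjParts[0] <> '' then
        FileName := DateParts[0] + '_' + SubjParts[0] + WideExtractFileExt(FileName);
  end;
end.
```

If either line is missing, the name is left unchanged, so files that do not match can be spotted in the preview and handled separately. A subject containing characters that Windows does not allow in file names (such as : or ?) would still need cleaning before the rename.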

den4b Forum
