Create Corpus By PDF Directory
Operation
To create a Corpus based on a folder containing PDF files, click the Corpora datatype object in the clipboard and select the option Corpus -> Create -> PDF Directory.
Selecting this option launches the Corpus Creation wizard. The steps are outlined briefly below.
Select Loader Options and Directory
A dialog appears in which you select the folder where the PDF files are kept.
Here, you can choose one of the following Loader Options:
- Default (Without): The PDF meta-information is not loaded automatically from PubMed (Workflow Diagram Step 1)
- Use File Name as PMID (Other ID): The name of each PDF file is taken as its PMID; in later steps, @Note uses this PMID to find meta-information (abstract, title or authors) for the article in PubMed (Workflow Diagram Step 2)
- Import Document Meta information from TSV File: You provide a tab-separated (TSV) file with columns for file name and PMID. In later steps, @Note uses the PMID defined for each file to find its meta-information in PubMed (Workflow Diagram Step 3)
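To illustrate the "Use File Name as PMID" option, here is a minimal Python sketch of how a file name could be read as a PMID. It is not @Note's implementation: the function names are hypothetical, and the rule that only purely numeric names count as PMIDs is an assumption made for this example (the option also accepts other IDs, which this sketch does not cover).

```python
from pathlib import Path


def pmid_from_file_name(file_name):
    """Return the file name without extension if it looks like a PMID.

    Assumption for this sketch: a PMID is a name made only of digits,
    e.g. '12345678.pdf' gives '12345678'. Any other name gives None.
    """
    stem = Path(file_name).stem
    return stem if stem.isdigit() else None


def pmids_for_directory(directory):
    """Map every PDF file in `directory` to its guessed PMID (or None)."""
    mapping = {}
    for entry in sorted(Path(directory).iterdir()):
        # Only regular files with a .pdf extension are considered.
        if entry.is_file() and entry.suffix.lower() == ".pdf":
            mapping[entry.name] = pmid_from_file_name(entry.name)
    return mapping
```

Files whose names cannot be read as a PMID are mapped to None here; in the wizard, the PMID of such files can be entered by hand in the Update Document PMID step.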
Select Corpus Name
Enter a name for the corpus, e.g. “New Corpus”, and press Next. By default, the corpus name is the full path of the directory.
Import Meta-Information
A dialog opens in which you select the file, preview its first lines, and set the General Delimiter, Text Delimiter, Default Value and the mapping between file name (or full path) and PMID.
- General Delimiter: the delimiter that separates the columns of the file (shown in blue)
- Text Delimiter: the delimiter that encloses the content of a field
- Default Value: the value used to represent empty records (shown in orange)
- Column Selection Options: select the column that holds the file name or full file path, and the column that holds the PMID
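The settings above correspond closely to the options of an ordinary delimited-file parser. The following Python sketch, using the standard `csv` module, shows one way such a file could be read. It is an illustration only: the function name, parameter names, sample file names and sample PMID are hypothetical, and treating a cell equal to the Default Value as "no PMID" is an assumption about how that setting is interpreted.

```python
import csv
import io


def read_pmid_table(handle, delimiter="\t", quotechar='"',
                    default_value="", file_column=0, pmid_column=1):
    """Return {file name: PMID or None} from a delimited text stream.

    delimiter     -> General Delimiter (splits columns)
    quotechar     -> Text Delimiter (encloses a field)
    default_value -> Default Value (marks an empty record)
    file_column / pmid_column -> Column Selection Options (0-based)
    """
    mapping = {}
    reader = csv.reader(handle, delimiter=delimiter, quotechar=quotechar)
    for row in reader:
        if len(row) <= max(file_column, pmid_column):
            continue  # skip blank or short rows lacking either column
        name = row[file_column].strip()
        pmid = row[pmid_column].strip()
        # A cell equal to the default value is treated as "no PMID".
        mapping[name] = None if pmid == default_value else pmid
    return mapping


# Hypothetical two-line file: tab-separated, double-quoted, "NA" as default.
sample = 'paper_a.pdf\t11111111\n"paper b.pdf"\tNA\n'
table = read_pmid_table(io.StringIO(sample), default_value="NA")
```

The sketch does not skip a header row; a file that has one would need its first line dropped before the mapping is used.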
Update Document PMID (OtherID)
A dialog opens in which you can edit the PMID of each file. Pressing the PDF button opens a window showing the PDF file.
Update Publication Meta Information and Full Text
In this last dialog you can change the meta-information of every PDF and also edit the PDF-to-text conversion. On the left side you can inspect the list of documents, indexed by document title (if available after the meta-information search) or by file name. There are two colors: green means that the document meta-information is complete (title, abstract and author are filled in); otherwise the document is shown in red.
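The color rule can be written as a small check. The Python sketch below assumes a document is represented as a dictionary with the keys `title`, `abstract` and `authors`; this representation and the function names are hypothetical and are not taken from @Note.

```python
REQUIRED_FIELDS = ("title", "abstract", "authors")


def is_filled(value):
    """A field counts as filled if it is non-empty (ignoring whitespace)."""
    if isinstance(value, str):
        return bool(value.strip())
    return bool(value)


def status_color(document):
    """Return 'green' when title, abstract and authors are all filled."""
    complete = all(is_filled(document.get(field)) for field in REQUIRED_FIELDS)
    return "green" if complete else "red"
```

A document that is missing any one of the three fields, or has it blank, is reported as red under this rule.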
After selecting a document from the list, you can edit its meta-information; by switching to the "Full Text" tab, you can also correct the PDF-to-text conversion, which in most cases contains errors.
When you have finished, press the Ok button and a new Corpus will be created.
Result
A new Corpus is created and becomes available in the clipboard. It can be viewed through the Corpora View and is opened automatically.
In addition, a Corpus Create will be launched