API Reference
BibTeX Validation and Enrichment Script
This script validates BibTeX entries by:
1. Checking DOI information via the Crossref API
2. Checking arXiv information via the arXiv API
3. Searching Google Scholar for missing information (optional)
4. Comparing and updating fields
5. Generating a validation report
- class validate_bibtex.BibEntry(entry_type: str, citekey: str, fields: Dict[str, str])
Bases:
object
- class validate_bibtex.BibTeXValidator(bib_file: str, output_file: str | None = None, update_bib: bool = False, delay: float = 1.0)
Bases:
object
Validates and enriches BibTeX entries.
- ARXIV_DOI_PATTERN = re.compile('10\\.48550/ARXIV\\.(\\d{4}\\.\\d{4,5})', re.IGNORECASE)
- ARXIV_NOTE_PATTERN = re.compile('(?i)arxiv:\\s*(\\d{4}\\.\\d{4,5}(?:v\\d+)?)', re.IGNORECASE)
- FIELD_SCHEMA = {'common': {'core': ['author', 'editor', 'title', 'year', 'month', 'note', 'key', 'crossref'], 'extended': ['doi', 'url', 'urldate', 'eprint', 'archiveprefix', 'primaryclass', 'isbn', 'issn', 'language', 'keywords', 'file']}, 'strongly_recommended': {'article': ['volume', 'pages'], 'inbook': ['chapter', 'pages'], 'incollection': ['pages', 'chapter'], 'inproceedings': ['pages'], 'techreport': ['number']}, 'types': {'article': {'extended': ['doi', 'url', 'urldate', 'issn'], 'optional': ['volume', 'number', 'pages', 'month', 'note'], 'required': ['author', 'title', 'journal', 'year']}, 'book': {'extended': ['doi', 'url', 'urldate', 'isbn'], 'optional': ['volume', 'number', 'series', 'address', 'edition', 'month', 'note'], 'required': ['title', 'publisher', 'year'], 'required_any': ['author', 'editor']}, 'booklet': {'extended': ['doi', 'url', 'urldate'], 'optional': ['author', 'howpublished', 'address', 'month', 'year', 'note'], 'required': ['title']}, 'inbook': {'extended': ['doi', 'url', 'urldate', 'isbn'], 'optional': ['volume', 'number', 'series', 'address', 'edition', 'month', 'note'], 'required': ['title', 'publisher', 'year'], 'required_any': ['author', 'editor'], 'required_any_2': ['chapter', 'pages']}, 'incollection': {'extended': ['doi', 'url', 'urldate', 'isbn'], 'optional': ['editor', 'volume', 'number', 'series', 'type', 'chapter', 'pages', 'address', 'edition', 'month', 'note'], 'required': ['author', 'title', 'booktitle', 'publisher', 'year']}, 'inproceedings': {'extended': ['doi', 'url', 'urldate', 'isbn'], 'optional': ['editor', 'volume', 'number', 'series', 'pages', 'publisher', 'organization', 'address', 'month', 'note'], 'required': ['author', 'title', 'booktitle', 'year']}, 'manual': {'extended': ['doi', 'url', 'urldate'], 'optional': ['author', 'organization', 'address', 'edition', 'month', 'year', 'note'], 'required': ['title']}, 'mastersthesis': {'extended': ['doi', 'url', 'urldate'], 'optional': ['type', 'address', 'month', 
'note'], 'required': ['author', 'title', 'school', 'year']}, 'misc': {'extended': ['doi', 'url', 'urldate', 'eprint', 'archiveprefix', 'primaryclass'], 'optional': ['author', 'title', 'howpublished', 'month', 'year', 'note'], 'required': []}, 'phdthesis': {'extended': ['doi', 'url', 'urldate'], 'optional': ['type', 'address', 'month', 'note'], 'required': ['author', 'title', 'school', 'year']}, 'proceedings': {'extended': ['doi', 'url', 'urldate', 'isbn'], 'optional': ['editor', 'volume', 'number', 'series', 'publisher', 'organization', 'address', 'month', 'note'], 'required': ['title', 'year']}, 'techreport': {'extended': ['doi', 'url', 'urldate'], 'optional': ['type', 'number', 'address', 'month', 'note'], 'required': ['author', 'title', 'institution', 'year']}, 'unpublished': {'extended': ['doi', 'url', 'urldate', 'eprint', 'archiveprefix', 'primaryclass'], 'optional': ['month', 'year'], 'required': ['author', 'title', 'note']}}}¶
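As a rough illustration of how a schema of this shape could be consulted, the sketch below checks an entry against a trimmed copy of two `types` entries. The helper name `missing_required` is hypothetical, and the reading of `required_any` (and `required_any_2`) as "at least one of these fields" is an assumption based on the key names, not confirmed behaviour of `validate_entry_schema`.

```python
from typing import Dict, List

# Trimmed copy of two FIELD_SCHEMA['types'] entries from the listing above.
SCHEMA_TYPES = {
    "article": {"required": ["author", "title", "journal", "year"]},
    "book": {
        "required": ["title", "publisher", "year"],
        "required_any": ["author", "editor"],
    },
}


def missing_required(entry_type: str, fields: Dict[str, str]) -> List[str]:
    """Hypothetical helper: list the unmet requirements for one entry."""
    rules = SCHEMA_TYPES.get(entry_type, {})
    # A field counts as present only if it has a non-blank value.
    present = {name.lower() for name, value in fields.items() if value.strip()}
    missing = [name for name in rules.get("required", []) if name not in present]
    # Assumption: any key starting with 'required_any' means "at least one of".
    for key, group in rules.items():
        if key.startswith("required_any") and not present.intersection(group):
            missing.append(" or ".join(group))
    return missing
```

For example, a `book` entry with only `title` and `year` would be reported as missing `publisher` and one of `author` or `editor`.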
- compare_fields(bib_entry: Dict, api_data: Dict, source: str = 'crossref') -> Dict
Compare a BibTeX entry with API data and classify each field as updated, conflicting, identical, or different.
- Returns:
Dictionary with ‘updated’, ‘conflicts’, ‘identical’, ‘different’, ‘sources’ keys
- extract_arxiv_id(entry: Dict) -> str | None
Extract arXiv ID from BibTeX entry
Checks:
1. note field: “arXiv: YYYY.NNNNN” or “arXiv: YYYY.NNNNNvN”
2. doi field: “10.48550/ARXIV.YYYY.NNNNN”
3. eprint field: “YYYY.NNNNN”
- Returns:
Normalized arXiv ID (YYYY.NNNNN format, version suffix removed) or None
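The three checks can be sketched with the two class-level patterns listed earlier. This is a minimal illustration, not the module's implementation: the function name `extract_arxiv_id_sketch` and the eprint validation regex are assumptions, and the redundant inline `(?i)` flag is dropped because `re.IGNORECASE` already covers it.

```python
import re
from typing import Dict, Optional

ARXIV_DOI_PATTERN = re.compile(r"10\.48550/ARXIV\.(\d{4}\.\d{4,5})", re.IGNORECASE)
ARXIV_NOTE_PATTERN = re.compile(r"arxiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def extract_arxiv_id_sketch(entry: Dict[str, str]) -> Optional[str]:
    """Return a version-free arXiv ID, checking note, then doi, then eprint."""
    note_match = ARXIV_NOTE_PATTERN.search(entry.get("note", ""))
    if note_match:
        # Drop a trailing version suffix such as 'v2'.
        return re.sub(r"v\d+$", "", note_match.group(1))
    doi_match = ARXIV_DOI_PATTERN.search(entry.get("doi", ""))
    if doi_match:
        return doi_match.group(1)
    eprint = entry.get("eprint", "").strip()
    # Assumption: the eprint field is accepted only if it looks like a new-style ID.
    if re.fullmatch(r"\d{4}\.\d{4,5}(?:v\d+)?", eprint):
        return re.sub(r"v\d+$", "", eprint)
    return None
```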
- extract_string_from_api_value(api_value) -> str
Extract string from API value (handles list format)
- fetch_arxiv_data(arxiv_id: str) -> Dict | None
Fetch metadata from the arXiv API. Respects strict rate limiting: 1 request every 3 seconds.
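One way to enforce a limit of one request every 3 seconds is a small throttle that sleeps until the minimum interval has elapsed. The class below is a hypothetical sketch and says nothing about how `fetch_arxiv_data` actually does it; the clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Hypothetical helper: wait until min_interval seconds since the last call."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds waited."""
        waited = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        # Record the time at which the caller is allowed to proceed.
        self._last_call = self._clock()
        return waited
```

A caller would create `MinIntervalThrottle(3.0)` once and call `wait()` before each request. The sketch is not thread-safe; sharing it between worker threads would need a lock around `wait()`.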
- fetch_crossref_data(doi: str) -> Dict | None
Fetch metadata from Crossref API
- Parameters:
doi – DOI string
- Returns:
Dictionary with metadata or None if not found
- fetch_datacite_data(doi: str) -> Dict | None
Fetch metadata from DataCite API
- Parameters:
doi – DOI string
- Returns:
Dictionary with metadata or None if not found
- fetch_dblp_data(title: str, author: str | None = None) -> Dict | None
Fetch metadata from DBLP API
- Parameters:
title – Paper title
author – First author name (optional)
- Returns:
Dictionary with metadata or None if not found
- fetch_openalex_data(doi: str | None = None, title: str | None = None) -> Dict | None
Fetch metadata from OpenAlex API
- Parameters:
doi – DOI string
title – Title string
- Returns:
Dictionary with metadata or None if not found
- fetch_pubmed_data(pmid: str) -> Dict | None
Fetch metadata from PubMed API via Entrez
- Parameters:
pmid – PubMed ID
- Returns:
Dictionary with metadata or None if not found
- fetch_semantic_scholar_data(title: str, author: str | None = None) -> Dict | None
Fetch metadata from Semantic Scholar API
- Parameters:
title – Paper title
author – First author name (optional)
- Returns:
Dictionary with metadata or None if not found
- fetch_zenodo_data(doi: str) -> Dict | None
Fetch metadata from Zenodo API
- Parameters:
doi – DOI string
- Returns:
Dictionary with metadata or None if not found
- filter_entry_fields(entry: Dict) -> Dict
Filter entry fields to keep only allowed fields for the entry type
- format_crossref_author_list(authors: List[Dict]) -> str
Convert Crossref author list to BibTeX format
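Crossref returns authors as a list of objects, typically with `given` and `family` keys, or a `name` key for organisational authors. A conversion to a BibTeX `author` string might look like the sketch below. The function name and the choice of the "Family, Given" form are assumptions; the real method may format names differently.

```python
from typing import Dict, List


def format_authors_sketch(authors: List[Dict[str, str]]) -> str:
    """Join Crossref-style author dicts into a BibTeX 'A and B' string."""
    names = []
    for author in authors:
        family = author.get("family", "").strip()
        given = author.get("given", "").strip()
        if family and given:
            names.append(f"{family}, {given}")
        elif family or given:
            # Only one name part is available; use it as-is.
            names.append(family or given)
        elif author.get("name"):
            # Organisational author: brace it so BibTeX treats it as one unit.
            names.append("{" + author["name"].strip() + "}")
    return " and ".join(names)
```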
- map_api_type_to_bibtex(api_type: str, source: str = 'crossref') -> str
Map API entry type to BibTeX entry type
- normalize_entry(entry: BibEntry) -> BibEntry
Normalize entry based on BibTeX mode policies.
- Map BibLaTeX fields to BibTeX
- Normalize aliases (conference -> inproceedings)
- Normalize DOI and URL
- Apply Type Promotion Rules (ArXiv -> Inproceedings/Article)
- normalize_string_for_comparison(s: str, field_name: str = '') -> str
Normalize string for comparison according to BibTeX conventions
Normalizations:
- Remove LaTeX braces { }
- Remove leading/trailing whitespace
- Decode HTML entities (&amp; -> &)
- For title: lowercase for comparison
- For ISSN: remove hyphens and take the first if multiple (0378-7788, 1476-4687 -> 03787788)
- For DOI: lowercase for comparison
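The normalizations listed above can be approximated in a few lines. This is a sketch under stated assumptions: the function name `normalize_sketch` is hypothetical, multiple ISSNs are assumed to be comma-separated as in the example, and the order of the steps is a guess.

```python
import html


def normalize_sketch(s: str, field_name: str = "") -> str:
    """Approximate the documented comparison normalizations."""
    text = html.unescape(s)                          # '&amp;' becomes '&'
    text = text.replace("{", "").replace("}", "")   # drop LaTeX braces
    text = text.strip()                              # trim surrounding whitespace
    field = field_name.lower()
    if field == "issn":
        # Keep only the first ISSN and remove its hyphen.
        text = text.split(",")[0].strip().replace("-", "")
    elif field in ("title", "doi"):
        text = text.lower()
    return text
```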
- search_google_scholar(query: str) -> Dict | None
Search Google Scholar for publication information
- Parameters:
query – Search query (title + first author)
- Returns:
Dictionary with metadata or None
- validate_all(show_progress: bool = True, max_workers: int = 30) -> List[ValidationResult]
Validate all entries in the BibTeX database
- Parameters:
show_progress – If True, show progress indicators
max_workers – Number of threads for parallel execution
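The `max_workers` parameter suggests a thread pool. The sketch below shows one common way to fan work out over threads with `concurrent.futures` while keeping results in input order. The helper `validate_in_parallel` and the callback signature are hypothetical; whether `validate_all` preserves order or passes indices this way is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def validate_in_parallel(
    entries: List[Dict],
    validate_one: Callable[[Dict, int, int], str],
    max_workers: int = 30,
) -> List[str]:
    """Run validate_one(entry, index, total) on a thread pool, in input order."""
    total = len(entries)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(validate_one, entry, index, total)
            for index, entry in enumerate(entries, start=1)
        ]
        # Collecting in submission order keeps results aligned with entries.
        return [future.result() for future in futures]
```

Threads suit this workload because each validation mostly waits on network responses; any shared rate limiter would still need to be thread-safe.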
- validate_entry(entry: Dict, index: int = 0, total: int = 0) -> ValidationResult
Validate a single BibTeX entry
- validate_entry_schema(entry: BibEntry) -> List[LintMessage]
Validate entry against schema rules.
- class validate_bibtex.LintMessage(level: str, code: str, message: str, field: str | None = None)
Bases:
object
- class validate_bibtex.ValidationResult(entry_key: str, entry_type: str = 'misc', has_doi: bool = False, doi_valid: bool = False, has_arxiv: bool = False, arxiv_valid: bool = False, arxiv_id: str | None = None, normalized_entry: BibEntry | None = None, lint_messages: List[LintMessage] = <factory>, fields_missing: List[str] = <factory>, fields_updated: Dict[str, Tuple[str, str]] = <factory>, fields_conflict: Dict[str, Tuple[str, str]] = <factory>, fields_identical: Dict[str, str] = <factory>, fields_different: Dict[str, Tuple[str, str]] = <factory>, field_sources: Dict[str, str] = <factory>, all_sources_data: Dict[str, Dict] = <factory>, field_source_options: Dict[str, List[str]] = <factory>, original_values: Dict[str, str] = <factory>, errors: List[str] = <factory>, warnings: List[str] = <factory>)
Bases:
object
Stores validation results for a single entry.
- lint_messages: List[LintMessage]
- validate_bibtex.create_gui_app(validator: BibTeXValidator, results: List[ValidationResult]) -> FastAPI
Create FastAPI application for BibTeX validator GUI
- Parameters:
validator – BibTeXValidator instance
results – List of ValidationResult objects
- Returns:
FastAPI app instance