In the last few weeks I’ve been rolling out improvements to the venerable ADS to BibDesk service:

- A full-fledged command line edition, installable with pip,
- A PDF ingest mode, great for getting your legacy folder of PDFs into BibDesk, and
- Lots of bug fixes to make ADS to BibDesk more robust against the peculiarities of some papers.

It is now possible to run ADS to BibDesk from the command line. This opens up new possibilities for hacking your own workflows: from automatic scripts to integration with Mac OS X launchers like Alfred. To get started, you can pip-install the latest release (you may need to run this as sudo):

    pip install adsbibdesk

Then check out the help:

    adsbibdesk --help

The command line edition takes the very same tokens as the Service edition: an ADS or arXiv URL, an ADS bibcode, an arXiv pre-print ID, or a DOI. For example:

    adsbibdesk 1998ApJ...500..525S

BibDesk is becoming more popular with astronomers. One request I’ve received from new users is an easier way to add folders full of papers downloaded from ADS and arXiv into BibDesk (along with the matching BibTeX and abstract data). ADS to BibDesk is good at downloading papers, BibTeX and abstracts; the challenge here is reliably identifying a paper given its PDF.

The approach I’ve taken is borrowed from an older script by Dr Lucy Kim. The first step is to extract text from a PDF, and the second is to extract a DOI string from that text. ADS to BibDesk can then act on that DOI as usual.

To extract text from a PDF, I’ve opted for the pdf2json program. It can easily be installed with Homebrew on your Mac. Before you try the PDF ingest mode, go ahead and install pdf2json.

Next, we need to extract a DOI from the paper’s text: a perfect job for regular expressions. The solution is written by Alix Axel in this excellent StackOverflow post, and the Python implementation uses the re module. Reading through that StackOverflow post, it appears that DOI is a tricky format to parse. Fortunately, this regular expression seems to work with the astronomical literature.

You can give this PDF ingest workflow a try via:

    adsbibdesk -p my_pdf_dir/

where my_pdf_dir/ is a directory containing PDFs that you want to ingest into BibDesk. Note that DOIs are not present in all papers, particularly those more than a few years old. You can easily find the DOI text on the first page of newer papers.

Personally, I’m most excited about some of the bugs we’ve been able to fix (mostly with the prodding of Issues posted on GitHub).

First, we’ve fixed a lot of problems caused by unicode characters and LaTeX markup in BibTeX data. The point of failure was how this data was escaped and passed via pipes between the Python scraper code and the AppleScript interface script to BibDesk. The solution was simple: don’t try to escape characters passed on the command line; just pass the data through a temporary file.

Some papers would work fine with the command line edition, but crash the Service edition. Thanks to a bug report we determined that the problem is triggered by papers with quotation marks in the paper title, such as The “True” Column Density Distribution in Star-Forming Molecular Clouds. It turns out the problem was ultimately with the HTML served by ADS. Abstract pages are laden with helpful metadata, but these metadata fields are not escaped! Thus the header of the aforementioned paper’s HTML page contains a metadata line in which the title’s quotation marks appear unescaped. Those extra unescaped quotation marks break the HTMLParser module in Python, except not always. With the command line edition I run Python 2.7.3, whose HTMLParser is robust against this type of malformed HTML. But the Service edition uses the default Python provided by Apple (version 2.7.1 for Mountain Lion).
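For readers who want to experiment with the DOI step of the PDF ingest mode, here is a minimal sketch of pulling a DOI out of text that has already been extracted from a PDF. The pattern is written in the spirit of the StackOverflow answer mentioned in this post, but it is an assumption and not necessarily the exact expression that ships with ADS to BibDesk. The function name `find_doi` and the sample strings are hypothetical, chosen only for illustration.

```python
import re

# Assumed DOI pattern (not necessarily the one used by ADS to BibDesk):
# "10.", a registrant code of four or more digits, optional sub-codes,
# a slash, then a suffix of non-space characters excluding " & ' < >.
DOI_PATTERN = re.compile(
    r'\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+)\b'
)


def find_doi(text):
    """Return the first DOI-like string in ``text``, or None if none is found.

    ``find_doi`` is a hypothetical helper name used for this sketch.
    """
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1)


if __name__ == "__main__":
    # Made-up snippet standing in for text extracted from a PDF's first page
    sample = "The Astrophysical Journal, 500:525, 1998 doi:10.1086/305772 Received"
    print(find_doi(sample))  # -> 10.1086/305772
```

The trailing `\b` in the pattern means that sentence punctuation directly after a DOI (for example a full stop at the end of a line) is left out of the match. A function like this returns `None` for PDFs without a DOI, which is the case a caller has to handle for older papers.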