In this post I present an analysis of astronomical publication abstracts using natural language processing (NLP). I use publicly available abstracts from NASA's astronomical abstract service. I develop a model which infers the name of the publishing journal based solely on the text of a paper's abstract.
This project is part of my learning of the data science tools and techniques for working with text data. As an observational astronomer, I worked mostly with numerical data and had little idea that text data can also be analysed with computational methods. In this project I learn how text data can be digitized and analysed to provide valuable information.
Writing papers is a necessary part of the life of any researcher. When a scientist makes a discovery or simply gets an interesting result, the first order of business is to publish it. By publishing the results of your work you establish ownership, inform the scientific community of the new findings, and receive feedback, which is very important in defining the direction of new research projects. If not properly publicised, the result of your research cannot be used in proposals for further research funding or as an argument in a scientific dispute. Without a solid publication you simply can't take full credit for it, and you should also be prepared to see your result reported and claimed by somebody else.
An article submitted to a refereed journal passes a review process. The journal editor sends the paper to an expert scientist in your research field for evaluation. Based on the reviewer's recommendations, the editor can accept the publication, reject the paper, or, most commonly, recommend corrections to your work before publishing it.
For example, "The Astrophysical Journal" and "Astronomy & Astrophysics" are refereed journals. The most common example of a non-refereed publication is conference proceedings, in which scientists usually report intermediate results of ongoing research. These publications carry much less impact. Refereed journals also differ in importance, as measured by the journal impact factor (https://en.wikipedia.org/wiki/Impact_factor).
Publication in a refereed journal is a quality stamp from the community of your peers. However, it is usually a tedious and lengthy process. Can there be some way to make it easier? A seasoned scientist usually knows in advance which journal they plan to submit a new article to, based on past experience. For a young researcher, choosing the journal can be confusing. As part of my experimentation with NLP analysis, I decided to investigate whether it is possible to infer the journal where a paper was published based on the text of its abstract. As a side product, such a model can serve as a way to recommend a journal for submission. Submitting to the right journal can save a lot of time and energy through an easier and faster review process.
An abstract is a short paragraph which summarizes the main results of the presented work. It is therefore the most important and informative part of a publication.
For this project I use the NASA Abstract Service, the primary resource for astronomers to search publications. I downloaded all abstracts for papers published in refereed astronomical journals since January 2017. I use the data for 2017 to train the model and the data for the year 2018 to test it.
For the year 2017 there are a total of 23,118 abstracts published in 257 different journals. For model training I keep only journals that have 200 or more abstracts. This leaves me with 15,179 abstracts in 17 journals. The figure below gives the number of abstracts for each journal used in model training.
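The filtering step above can be sketched as follows. This is a minimal illustration, not the code used for the project; the `records` list and the threshold of 2 (200 in the real data set) are stand-ins.

```python
from collections import Counter

# Hypothetical list of (abstract_text, journal_name) pairs for 2017.
records = [
    ("We study galaxy cluster dynamics...", "The Astrophysical Journal"),
    ("Lunar crater counts suggest...", "Icarus"),
    ("We analyse AGN variability...", "The Astrophysical Journal"),
]

MIN_ABSTRACTS = 2  # 200 for the real 2017 data set

# Count abstracts per journal, then keep only well-represented journals.
counts = Counter(journal for _, journal in records)
kept = [(text, journal) for text, journal in records
        if counts[journal] >= MIN_ABSTRACTS]
```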
Our goal is to create a classification model which predicts the journal from the text of an abstract. Generally, a classifier takes a set of fixed-length numerical feature vectors and corresponding labels, and creates a model by optimizing a set of parameters. However, an abstract is a collection of words of arbitrary length. To transform a text object into a numerical vector I create a pipeline consisting of several standard NLP processing steps.
I achieve these steps by creating a processing pipeline using the Python NLTK and scikit-learn packages.
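A minimal version of such a pipeline can be sketched with scikit-learn alone: TF-IDF vectorization turns each abstract into a fixed-length vector, and a linear classifier maps that vector to a journal label. The training texts and the choice of `LogisticRegression` below are illustrative assumptions, not the post's exact setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for the real 2017 training abstracts and journal labels.
train_texts = [
    "GPS satellite orbit determination and navigation performance",
    "Star formation in distant galaxies observed with deep imaging",
    "Ionospheric response to a geomagnetic storm measured by GNSS",
    "Spectroscopy of an active galaxy reveals a massive black hole",
]
train_labels = ["Advances in Space Research", "The Astrophysical Journal",
                "Advances in Space Research", "The Astrophysical Journal"]

# TF-IDF maps an abstract of arbitrary length to a fixed-length vector;
# the linear classifier then predicts a journal label from that vector.
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", lowercase=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train_texts, train_labels)

print(model.predict(["Precise point positioning with GPS satellites"])[0])
```

In the real project, NLTK can supply additional preprocessing (tokenization, lemmatization) before vectorization by passing a custom `tokenizer` to `TfidfVectorizer`.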
After the model has been fit to the training data, I apply it to the test data. The results of applying the model to the abstracts published during the first three months of 2018 are summarised in the table below using standard metrics:
Journal Title | Precision | Recall | F1-score | Support |
---|---|---|---|---|
Advances in Space Research | 0.76 | 0.54 | 0.63 | 131 |
Astronomy & Astrophysics | 0.67 | 0.80 | 0.73 | 326 |
Astrophysics and Space Science | 0.50 | 0.05 | 0.09 | 61 |
Classical and Quantum Gravity | 0.84 | 0.69 | 0.76 | 116 |
Earth and Planetary Science Letters | 0.80 | 0.53 | 0.64 | 176 |
Geochimica et Cosmochimica Acta | 0.69 | 0.89 | 0.78 | 165 |
Geophysical Research Letters | 0.62 | 0.76 | 0.68 | 190 |
Icarus | 0.72 | 0.53 | 0.61 | 150 |
Journal of Cosmology and Astroparticle Physics | 0.74 | 0.51 | 0.60 | 168 |
Journal of Geophysical Research: Space Physics | 0.55 | 0.86 | 0.67 | 118 |
Monthly Notices of the Royal Astronomical Society | 0.76 | 0.80 | 0.78 | 1272 |
Monthly Notices of the Royal Astronomical Society: Letters | 0.00 | 0.00 | 0.00 | 100 |
Nature Astronomy | 0.93 | 0.23 | 0.36 | 62 |
Physical Review D | 0.22 | 0.59 | 0.32 | 59 |
The Astronomical Journal | 0.66 | 0.26 | 0.37 | 142 |
The Astrophysical Journal | 0.59 | 0.76 | 0.66 | 752 |
The Astrophysical Journal Letters | 0.55 | 0.14 | 0.22 | 150 |
Avg / Total | 0.67 | 0.67 | 0.65 | 4138 |
The overall precision is 67%. In other words, the model correctly predicts the journal for two out of three abstracts. While this may not appear very impressive, I think it is actually not that bad. We are trying to predict one of seventeen categories based on very complex and noisy data. For me as a human, even as a former astronomer, it would be challenging to guess the publishing journal at this accuracy. There is no obvious reason for the abstracts in, for example, "The Astrophysical Journal" to differ much from those published in "Astronomy & Astrophysics". However, the model has found a way to do a good job by figuring out features which reflect such subtle aspects as editing styles, possible differences in terminology, regional differences, etc.
It is also interesting to look at the confusion matrix:
First, the model does a great job of guessing the group of the journal. There are at least two major groups of journals: an astrophysical group ("Astronomy & Astrophysics", "The Astrophysical Journal", "Monthly Notices of the Royal Astronomical Society" and their Letters counterparts) and a planetary and Earth sciences group ("Earth and Planetary Science Letters", "Geochimica et Cosmochimica Acta", "Geophysical Research Letters", "Icarus"). In addition, "Classical and Quantum Gravity" and "Physical Review D" may form a small group related to general physics. A prediction for a journal in one of these groups falls, with high probability, within the same group. This shows that the model does distinguish between texts belonging to different subdivisions of the natural sciences.
Another thing to notice is that, as expected, the smaller the number of abstracts, the poorer the performance. Some journals even have zero precision, like "Monthly Notices of the Royal Astronomical Society: Letters": all of its abstracts were classified either as the main journal (in most cases) or as another astrophysical journal. This may be fixed by balancing the training set or by class weighting during the model fit.
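The failure mode for the Letters journal shows up clearly in a confusion matrix, which can be computed with scikit-learn. The toy labels below are constructed to mimic the pattern seen above, where every "Letters" abstract gets absorbed by the main journal.

```python
from sklearn.metrics import confusion_matrix

# Toy labels mimicking the zero recall for MNRAS: Letters: the model
# assigns all of its abstracts to the main MNRAS journal.
labels = ["MNRAS", "MNRAS: Letters", "ApJ"]
y_true = ["MNRAS", "MNRAS", "MNRAS: Letters", "MNRAS: Letters", "ApJ"]
y_pred = ["MNRAS", "MNRAS", "MNRAS", "MNRAS", "ApJ"]

# Rows are true journals, columns are predicted journals; the
# "MNRAS: Letters" row has all its counts in the "MNRAS" column.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

As for the suggested fix, scikit-learn classifiers such as `LogisticRegression` accept `class_weight="balanced"`, which reweights classes inversely to their frequency during the fit.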
We can also look at the most and the least important words in the corpus of astronomy abstracts:
Coefficient | More Important | Coefficient | Less Important |
---|---|---|---|
2.0431 | iri | -0.6375 | large |
2.0154 | gnss | -0.6103 | source |
1.9738 | attitude | -0.5405 | ssw |
1.7598 | satellite | -0.5331 | find |
1.6631 | navigation | -0.4760 | epb |
1.3828 | gps | -0.4564 | new |
1.3444 | orbit | -0.4400 | light |
1.3148 | positioning | -0.4385 | instrument |
1.2859 | paper | -0.4301 | wind |
1.1731 | performance | -0.4301 | f |
1.1602 | ppp | -0.4263 | meteor |
1.1582 | flight | -0.4244 | tilde |
1.1464 | propose | -0.4191 | ice |
1.1376 | space debris | -0.4146 | 2 |
1.1242 | debris | -0.4131 | plasma bubble |
1.1001 | hmf2 | -0.4009 | star |
1.0918 | station | -0.3935 | scale |
1.0765 | fof2 | -0.3932 | mstids |
1.0680 | design | -0.3927 | galaxy |
The first and third columns show the words, each preceded by its corresponding coefficient, which gives the feature's weight in the model prediction. We can see that terms with a narrow, specific meaning, like "satellite", "navigation", "orbit", and abbreviations like GPS or IRI (International Reference Ionosphere), are among the most important model features, while general terms like "source", "light", "instrument", "scale" are among the least important.
In this post I applied natural language processing to a set of scientific abstracts. I created a model that correctly predicts the journal which published a paper for two out of three abstracts. This shows the power of NLP as a classification tool for identifying a text as relevant to a particular scientific field.