Analysis of astronomy abstracts using NLP

Summary

In this post I present an analysis of astronomical publication abstracts using natural language processing (NLP). I use publicly available abstracts from NASA's astronomical abstract service and develop a model that infers the name of the publishing journal based solely on the text of a paper's abstract.

Motivation

This project is part of my learning of data science tools and techniques for working with text data. As an observational astronomer, I worked mostly with numerical data and had little idea that text data could also be analysed with computational methods. In this project I learn how text data can be digitized and analysed to provide valuable information.

Scientific Publications

Writing papers is a necessary part of the life of any researcher. When a scientist makes a discovery or simply gets an interesting result, the first order of business is to publish it. By publishing the results of your work you establish its ownership, inform the scientific community of the new findings, and receive feedback, which is very important in defining the direction of new research projects. If not properly publicised, the result of your research cannot be used in proposals for further research funding or as an argument in a scientific dispute. Without a solid publication you simply can't take full credit for it, and you should also be prepared to see your result reported and claimed by somebody else.

An article submitted to a refereed journal passes a review process. The journal editor sends the paper to an expert scientist in your research field for evaluation. Based on the reviewer's recommendations, the editor can accept the publication, reject the paper, or, most commonly, recommend corrections to your work before publishing it.

For example, "The Astrophysical Journal" and "Astronomy & Astrophysics" are refereed journals. The most common example of a non-refereed publication is conference proceedings, in which scientists usually report intermediate results of ongoing research. This kind of publication carries much less impact. Refereed journals also differ in terms of importance, as measured by the journal impact factor (https://en.wikipedia.org/wiki/Impact_factor).

Publication in a refereed journal is a quality stamp from the community of your peers. However, it is usually a tedious and lengthy process. Can there be some way to make it easier? A seasoned scientist usually knows in advance which journal they plan to submit a new article to, based on past experience. For a young researcher, choosing the journal can be confusing. As part of my experimentation with NLP analysis, I decided to investigate whether it is possible to infer the journal where a paper was published based on the text of its abstract. As a side product, such a model can serve as a way to recommend a journal for submission. When a paper is submitted to the right journal, a lot of time and energy can be saved by going through an easier and faster review process.

Data: Astronomy Abstracts

An abstract is a short paragraph which summarizes the main results of the presented work. It is therefore the most important and informative part of a publication.

For this project I use the NASA Abstract Service, the primary resource for astronomers to search publications. I downloaded all abstracts for papers published in refereed astronomical journals since January 2017. I use the data for 2017 to train the model and the data for 2018 to test it.

For the year 2017 there are a total of 23,118 abstracts published in 257 different journals. For model training I keep only journals that have 200 or more abstracts. This leaves me with 15,179 abstracts in 17 journals. The figure below gives the number of abstracts for each journal used in model training.
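The journal-count filtering can be sketched with a few lines of standard-library Python. The records here are toy stand-ins (the real data set has 23,118 abstracts and a 200-abstract threshold); the variable names are my own, not from the original analysis.

```python
from collections import Counter

# Hypothetical records: (abstract_text, journal_name) pairs
records = [
    ("We study the dynamics of ...", "The Astrophysical Journal"),
    ("New GNSS positioning results ...", "Advances in Space Research"),
    ("A survey of galaxy clusters ...", "The Astrophysical Journal"),
]

MIN_ABSTRACTS = 2  # 200 in the actual analysis

# Count abstracts per journal, then keep only well-represented journals
counts = Counter(journal for _, journal in records)
kept = [(text, journal) for text, journal in records
        if counts[journal] >= MIN_ABSTRACTS]
```

With the real data, the same two lines reduce 257 journals to the 17 used for training.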

[Figure: number of abstracts per journal]

Data Processing and Modeling

Our goal is to create a classification model which predicts the journal from the text of an abstract. Generally, a classifier takes a set of numerical feature vectors of fixed length and corresponding labels, and creates a model by optimizing a set of parameters. However, an abstract is a collection of words of arbitrary length. To transform a text object into a numerical vector, I create a pipeline consisting of several standard NLP processing steps:

  • Turn the journal names into numerical labels
  • Tokenize the text, i.e. break it into sentences and words
  • Put the words into lowercase
  • Apply part-of-speech (POS) tagging
  • Lemmatize, i.e. reduce the words to their basic form by removing plurals, endings, etc.
  • Apply an N-gram model, keeping 1-gram and 2-gram sequences
  • Apply the term frequency–inverse document frequency (TF-IDF) statistic to vectorize the data
  • Finally, feed the vectors and corresponding journal labels obtained in the previous steps to a stochastic gradient descent (SGD) classifier

I achieve these steps by building a processing pipeline using the Python NLTK and scikit-learn packages.
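The scikit-learn part of such a pipeline can be sketched as below. This is a simplified version on toy data: the NLTK POS-tagging and lemmatization steps are omitted here, and TfidfVectorizer handles the lowercasing, 1-/2-gram extraction, and TF-IDF weighting in one component.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the real abstracts and journal names
abstracts = [
    "galaxy cluster dark matter survey",
    "gnss satellite orbit positioning",
    "stellar spectra galaxy redshift",
    "gps navigation satellite attitude",
]
journals = ["The Astrophysical Journal", "Advances in Space Research",
            "The Astrophysical Journal", "Advances in Space Research"]

# Journal names -> numerical labels
labels = LabelEncoder().fit_transform(journals)

# Lowercasing, 1- and 2-grams, and TF-IDF weighting, then an SGD classifier
model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("clf", SGDClassifier(random_state=0)),
])
model.fit(abstracts, labels)
```

After fitting, `model.predict(["dark matter galaxy survey"])` returns a numerical label that LabelEncoder can map back to a journal name.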

Model Performance

After the model has been fit to the training data, I apply it to the test data. The results of applying the model to the abstracts published during the first three months of 2018 are summarised in the table below using standard metrics:

Journal Title Precision Recall F1-score Support
Advances in Space Research 0.76 0.54 0.63 131
Astronomy & Astrophysics 0.67 0.80 0.73 326
Astrophysics and Space Science 0.50 0.05 0.09 61
Classical and Quantum Gravity 0.84 0.69 0.76 116
Earth and Planetary Science Letters 0.80 0.53 0.64 176
Geochimica et Cosmochimica Acta 0.69 0.89 0.78 165
Geophysical Research Letters 0.62 0.76 0.68 190
Icarus 0.72 0.53 0.61 150
Journal of Cosmology and Astroparticle Physics 0.74 0.51 0.60 168
Journal of Geophysical Research: Space Physics 0.55 0.86 0.67 118
Monthly Notices of the Royal Astronomical Society 0.76 0.80 0.78 1272
Monthly Notices of the Royal Astronomical Society: Letters 0.00 0.00 0.00 100
Nature Astronomy 0.93 0.23 0.36 62
Physical Review D 0.22 0.59 0.32 59
The Astronomical Journal 0.66 0.26 0.37 142
The Astrophysical Journal 0.59 0.76 0.66 752
The Astrophysical Journal Letters 0.55 0.14 0.22 150
Avg / Total 0.67 0.67 0.65 4138
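A per-journal table like the one above is what scikit-learn's `classification_report` produces from true and predicted labels. The labels below are toy stand-ins, not the actual 2018 test set:

```python
from sklearn.metrics import classification_report

# Toy true/predicted journal labels standing in for the test set
y_true = ["ApJ", "ApJ", "A&A", "A&A", "MNRAS", "MNRAS"]
y_pred = ["ApJ", "A&A", "A&A", "A&A", "MNRAS", "MNRAS"]

# Prints precision, recall, F1-score, and support per journal
print(classification_report(y_true, y_pred))
```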

The overall presision is 67%. In other words, the model correctly predicts journals for two out of three abstracts. While this may appear not very impressive, I think it is not actually that bad. We are trying to predict one from seventeen categories based on very complex and noisy data. For me as a human, even though former astronomer, it would be a challenging task to guess the publishing journal at this accuracy. There is no direct reason for the abstracts in ,for example, "The Astrophysical Journal" to be too much different from those published in "Astronomy & Astrophysics". However, the model have found the way to do a good job by figuring out features, which reflects such subtile aspects as editing styles, possible differences in terminology, regional differences, etc.

It is also interesting to look at the confusion matrix:

[Figure: confusion matrix]

First, the model does a great job in guessing the group of the journal. There are at least two major groups of journals, which can be defined as an astrophysical group ("Astronomy & Astrophysics", "The Astrophysical Journal", "Monthly Notices of the Royal Astronomical Society" and their Letters counterparts) and a planetary and Earth sciences group ("Earth and Planetary Science Letters", "Geochimica et Cosmochimica Acta", "Geophysical Research Letters", "Icarus"). Also, "Classical and Quantum Gravity" and "Physical Review D" may form a small group related to general physics. The prediction for a journal in a group falls with high probability into the same group. This shows that the model does distinguish between texts belonging to different subdivisions of the natural sciences.
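A confusion matrix of this kind comes straight from `sklearn.metrics.confusion_matrix`; the labels below are toy abbreviations, where the "intra-group" confusion shows up as off-diagonal counts between the two astrophysical journals:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: two "astrophysical" journals and one planetary journal
y_true = ["ApJ", "ApJ", "MNRAS", "Icarus", "Icarus"]
y_pred = ["MNRAS", "ApJ", "MNRAS", "Icarus", "Icarus"]

# Rows are true journals, columns predicted ones; the off-diagonal
# ApJ -> MNRAS count is confusion within the astrophysical group
cm = confusion_matrix(y_true, y_pred, labels=["ApJ", "MNRAS", "Icarus"])
print(cm)
```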

Another thing to notice is that, as expected, the smaller the number of abstracts, the poorer the performance. For some journals we even get zero precision, for example "Monthly Notices of the Royal Astronomical Society: Letters". All abstracts for this journal were classified either as the main journal (in most cases) or as another astrophysical journal. This may be fixed by balancing the training set or by class weighting during the model fit.
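The class-weighting fix is a one-argument change in scikit-learn. The data below is a hypothetical imbalanced toy set, not the abstract corpus:

```python
from sklearn.linear_model import SGDClassifier

# Toy imbalanced training set: class 1 (the "rare journal") has one sample
X = [[0.0], [0.1], [0.2], [1.0]]
y = [0, 0, 0, 1]

# class_weight='balanced' reweights each class inversely to its frequency,
# so a rare journal contributes as much to the loss as a common one
clf = SGDClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)
```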

We can also look at the most and the least important words in the corpus of astronomy abstracts:

Coefficient More Important Word Coefficient Less Important Word
2.0431 iri -0.6375 large
2.0154 gnss -0.6103 source
1.9738 attitude -0.5405 ssw
1.7598 satellite -0.5331 find
1.6631 navigation -0.4760 epb
1.3828 gps -0.4564 new
1.3444 orbit -0.4400 light
1.3148 positioning -0.4385 instrument
1.2859 paper -0.4301 wind
1.1731 performance -0.4301 f
1.1602 ppp -0.4263 meteor
1.1582 flight -0.4244 tilde
1.1464 propose -0.4191 ice
1.1376 space debris -0.4146 2
1.1242 debris -0.4131 plasma bubble
1.1001 hmf2 -0.4009 star
1.0918 station -0.3935 scale
1.0765 fof2 -0.3932 mstids
1.0680 design -0.3927 galaxy

The first and third columns show the words preceded by their corresponding coefficients, which give the feature's weight in the model prediction. We can see that terms with a narrower, more specific meaning, like "satellite", "navigation", "orbit", and abbreviations like GPS or IRI (International Reference Ionosphere), are among the most important model features, while general terms like "source", "light", "instrument", "scale" are among the least important.

Conclusion

In this post I applied natural language processing to a set of scientific abstracts. I created a model that correctly predicts the journal which published a paper in about two out of three cases. This shows the power of NLP as a classification tool for identifying a text as relevant to a particular scientific field.