Analysis of astronomy abstracts using NLP

Summary

In this post I present an analysis of astronomical publication abstracts using natural language processing (NLP). I use publicly available abstracts from NASA's astronomical abstract service and develop a model that infers the name of the publishing journal based solely on the text of a paper's abstract.

Motivation

This project is part of my learning of data science tools and techniques for working with text data. As an observational astronomer, I worked mostly with numerical data and had little idea that text data could also be analysed with computational methods. In this project I learn how text data can be digitized and analysed to provide valuable information.

Scientific Publications

Writing papers is a necessary part of the life of any researcher. When a scientist makes a discovery or simply gets an interesting result, the first order of business is to publish it. By publishing the results of your work you establish its ownership, inform the scientific community of the new findings, and receive feedback, which is very important in defining the direction of new research projects. If not properly publicised, the result of your research cannot be used in proposals for further research funding or as an argument in a scientific dispute. Without a solid publication you simply can't take full credit for it, and you should also be prepared to see your result reported and claimed by somebody else.

An article submitted to a refereed journal passes a review process. The journal editor sends the paper to an expert scientist in your research field for evaluation. Based on the reviewer's recommendations, the editor can accept the publication, reject the paper, or, most commonly, recommend corrections to your work before publishing it.

For example, "The Astrophysical Journal" and "Astronomy & Astrophysics" are refereed journals. The most common example of a non-refereed publication is conference proceedings, in which scientists usually report intermediate results of ongoing research. This kind of publication carries much less impact. Refereed journals also differ in terms of importance, as measured by the journal impact factor (https://en.wikipedia.org/wiki/Impact_factor).

Publication in a refereed journal is a quality stamp from the community of your peers. However, it is usually a tedious and lengthy process. Can there be some way to make it easier? A seasoned scientist usually knows in advance which journal they plan to submit a new article to, based on past experience. For a young researcher, choosing the journal can be confusing. As part of my experimentation with NLP analysis, I decided to investigate whether it is possible to infer the journal where a paper was published based on the text of its abstract. As a side product, such a model can serve as a way to recommend a journal for submission. When a paper is submitted to the right journal, a lot of time and energy can be saved by going through an easier and faster review process.

Data: Astronomy Abstracts

An abstract is a short paragraph which summarizes the main results of the presented work. It is therefore the most important and informative part of a publication.

For this project I use the NASA Abstract Service, the primary resource for astronomers to search publications. I downloaded all abstracts for papers published in refereed astronomical journals since January 2017. I use the data for 2017 to train the model and the data for 2018 to test it.

For the year 2017 there are a total of 23,118 abstracts published in 257 different journals. For model training I keep only journals that have 200 or more abstracts. This leaves me with 15,179 abstracts in 17 journals. The figure below gives the number of abstracts for each journal used in model training.
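The journal-count filtering can be sketched with a few lines of standard-library Python. The records here are toy stand-ins (the real data set has 23,118 abstracts and a 200-abstract threshold); the variable names are my own, not from the original analysis.

```python
from collections import Counter

# Hypothetical records: (abstract_text, journal_name) pairs
records = [
    ("We study the dynamics of ...", "The Astrophysical Journal"),
    ("New GNSS positioning results ...", "Advances in Space Research"),
    ("A survey of galaxy clusters ...", "The Astrophysical Journal"),
]

MIN_ABSTRACTS = 2  # 200 in the actual analysis

# Count abstracts per journal, then keep only well-represented journals
counts = Counter(journal for _, journal in records)
kept = [(text, journal) for text, journal in records
        if counts[journal] >= MIN_ABSTRACTS]
```

With the real data, the same two lines reduce 257 journals to the 17 used for training.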

[Figure: number of abstracts per journal]

Data Processing and Modeling

Our goal is to create a classification model which predicts the journal from the text of an abstract. Generally, a classifier takes a set of numerical feature vectors of fixed length and corresponding labels, and creates a model by optimizing a set of parameters. However, an abstract is a collection of words of arbitrary length. To transform a text object into a numerical vector, I create a pipeline consisting of several standard NLP processing steps:

  • Turn the journal names into numerical labels
  • Tokenize the text, i.e. break it into sentences and words
  • Put the words into lowercase
  • Apply part-of-speech (POS) tagging
  • Lemmatize, i.e. reduce the words to their basic form by removing plurals, endings, etc.
  • Apply an N-gram model, keeping 1-gram and 2-gram sequences
  • Apply the term frequency–inverse document frequency (TF-IDF) statistic to vectorize the data
  • Finally, feed the vectors and corresponding journal labels obtained in the previous steps to a stochastic gradient descent (SGD) classifier

I achieve these steps by building a processing pipeline using the Python NLTK and scikit-learn packages.
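The scikit-learn part of such a pipeline can be sketched as below. This is a simplified version on toy data: the NLTK POS-tagging and lemmatization steps are omitted here, and TfidfVectorizer handles the lowercasing, 1-/2-gram extraction, and TF-IDF weighting in one component.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the real abstracts and journal names
abstracts = [
    "galaxy cluster dark matter survey",
    "gnss satellite orbit positioning",
    "stellar spectra galaxy redshift",
    "gps navigation satellite attitude",
]
journals = ["The Astrophysical Journal", "Advances in Space Research",
            "The Astrophysical Journal", "Advances in Space Research"]

# Journal names -> numerical labels
labels = LabelEncoder().fit_transform(journals)

# Lowercasing, 1- and 2-grams, and TF-IDF weighting, then an SGD classifier
model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("clf", SGDClassifier(random_state=0)),
])
model.fit(abstracts, labels)
```

After fitting, `model.predict(["dark matter galaxy survey"])` returns a numerical label that LabelEncoder can map back to a journal name.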

Model Performance

After the model has been fit to the training data, I apply it to the test data. The results of applying the model to the abstracts published during the first three months of 2018 are summarised in the table below using standard metrics:

Journal Title Precision Recall F1-score Support
Advances in Space Research 0.76 0.54 0.63 131
Astronomy & Astrophysics 0.67 0.80 0.73 326
Astrophysics and Space Science 0.50 0.05 0.09 61
Classical and Quantum Gravity 0.84 0.69 0.76 116
Earth and Planetary Science Letters 0.80 0.53 0.64 176
Geochimica et Cosmochimica Acta 0.69 0.89 0.78 165
Geophysical Research Letters 0.62 0.76 0.68 190
Icarus 0.72 0.53 0.61 150
Journal of Cosmology and Astroparticle Physics 0.74 0.51 0.60 168
Journal of Geophysical Research: Space Physics 0.55 0.86 0.67 118
Monthly Notices of the Royal Astronomical Society 0.76 0.80 0.78 1272
Monthly Notices of the Royal Astronomical Society: Letters 0.00 0.00 0.00 100
Nature Astronomy 0.93 0.23 0.36 62
Physical Review D 0.22 0.59 0.32 59
The Astronomical Journal 0.66 0.26 0.37 142
The Astrophysical Journal 0.59 0.76 0.66 752
The Astrophysical Journal Letters 0.55 0.14 0.22 150
Avg / Total 0.67 0.67 0.65 4138
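A per-journal table like the one above is what scikit-learn's `classification_report` produces from true and predicted labels. The labels below are toy stand-ins, not the actual 2018 test set:

```python
from sklearn.metrics import classification_report

# Toy true/predicted journal labels standing in for the test set
y_true = ["ApJ", "ApJ", "A&A", "A&A", "MNRAS", "MNRAS"]
y_pred = ["ApJ", "A&A", "A&A", "A&A", "MNRAS", "MNRAS"]

# Prints precision, recall, F1-score, and support per journal
print(classification_report(y_true, y_pred))
```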

The overall presision is 67%. In other words, the model correctly predicts journals for two out of three abstracts. While this may appear not very impressive, I think it is not actually that bad. We are trying to predict one from seventeen categories based on very complex and noisy data. For me as a human, even though former astronomer, it would be a challenging task to guess the publishing journal at this accuracy. There is no direct reason for the abstracts in ,for example, "The Astrophysical Journal" to be too much different from those published in "Astronomy & Astrophysics". However, the model have found the way to do a good job by figuring out features, which reflects such subtile aspects as editing styles, possible differences in terminology, regional differences, etc.

It is also interesting to look at the confusion matrix:

[Figure: confusion matrix]

First, the model does a great job in guessing the group of the journal. There are at least two major groups of journals, which can be defined as an astrophysical group ("Astronomy & Astrophysics", "The Astrophysical Journal", "Monthly Notices of the Royal Astronomical Society" and their Letters counterparts) and a planetary and Earth sciences group ("Earth and Planetary Science Letters", "Geochimica et Cosmochimica Acta", "Geophysical Research Letters", "Icarus"). Also, "Classical and Quantum Gravity" and "Physical Review D" may form a small group related to general physics. The prediction for a journal in a group falls with high probability into the same group. This shows that the model does distinguish between texts belonging to different subdivisions of the natural sciences.
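A confusion matrix of this kind comes straight from `sklearn.metrics.confusion_matrix`; the labels below are toy abbreviations, where the "intra-group" confusion shows up as off-diagonal counts between the two astrophysical journals:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: two "astrophysical" journals and one planetary journal
y_true = ["ApJ", "ApJ", "MNRAS", "Icarus", "Icarus"]
y_pred = ["MNRAS", "ApJ", "MNRAS", "Icarus", "Icarus"]

# Rows are true journals, columns predicted ones; the off-diagonal
# ApJ -> MNRAS count is confusion within the astrophysical group
cm = confusion_matrix(y_true, y_pred, labels=["ApJ", "MNRAS", "Icarus"])
print(cm)
```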

Another thing to notice is that, as expected, the smaller the number of abstracts, the poorer the performance. For some journals we even get zero precision, for example "Monthly Notices of the Royal Astronomical Society: Letters". All abstracts for this journal were classified either as the main journal (in most cases) or as another astrophysical journal. This may be fixed by balancing the training set or by class weighting during the model fit.
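The class-weighting fix is a one-argument change in scikit-learn. The data below is a hypothetical imbalanced toy set, not the abstract corpus:

```python
from sklearn.linear_model import SGDClassifier

# Toy imbalanced training set: class 1 (the "rare journal") has one sample
X = [[0.0], [0.1], [0.2], [1.0]]
y = [0, 0, 0, 1]

# class_weight='balanced' reweights each class inversely to its frequency,
# so a rare journal contributes as much to the loss as a common one
clf = SGDClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)
```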

We can also look at the most and the least important words in the corpus of astronomy abstracts:

Coefficient More Important Word Coefficient Less Important Word
2.0431 iri -0.6375 large
2.0154 gnss -0.6103 source
1.9738 attitude -0.5405 ssw
1.7598 satellite -0.5331 find
1.6631 navigation -0.4760 epb
1.3828 gps -0.4564 new
1.3444 orbit -0.4400 light
1.3148 positioning -0.4385 instrument
1.2859 paper -0.4301 wind
1.1731 performance -0.4301 f
1.1602 ppp -0.4263 meteor
1.1582 flight -0.4244 tilde
1.1464 propose -0.4191 ice
1.1376 space debris -0.4146 2
1.1242 debris -0.4131 plasma bubble
1.1001 hmf2 -0.4009 star
1.0918 station -0.3935 scale
1.0765 fof2 -0.3932 mstids
1.0680 design -0.3927 galaxy

The first and third columns show the words preceded by their corresponding coefficients, which give the feature's weight in the model prediction. We can see that terms with a narrower, more specific meaning, like "satellite", "navigation", "orbit", and abbreviations like GPS or IRI (International Reference Ionosphere), are among the most important model features, while general terms like "source", "light", "instrument", "scale" are among the least important.

Conclusion

In this post I applied natural language processing to a set of scientific abstracts. I created a model that correctly predicts the journal which published a paper in about two out of three cases. This shows the power of NLP as a classification tool for identifying a text as relevant to a particular scientific field.