Parsing a Maven pom.xml file with a shell script
My entry in this year's "stupid SED tricks" is a command-line run tool
for Maven-based Java projects. Maven's a great choice for managing complex
dependencies, especially because of the central repository that allows you
to just declare, for example, that your project uses Apache Commons HTTPClient
and let Maven resolve and download it for you. One thing that bothers me
when I use Maven from the command line, though, is that it's a hassle to
build up the correct CLASSPATH environment variable that lets you run the
finished product. Ant made that pretty easy, but Maven wants to control
everything via its <dependencies> section. This is fine
when you're building a webapp, since a webapp builds its classpath dynamically
from its /classes and /lib directories, but if you want to run from the
command line, you're stuck with Maven's horrible exec target (which doesn't
seem to let you define multiple configurations) or the assembly:assembly
goal that conglomerates all of your class files into a single monster
.jar file.
If you look at the <dependencies> section, though, you can see that it sort
of looks like a classpath. Once you recognize that the <groupId>,
<artifactId>, and <version> identify a directory under your Maven
repository, and that the <artifactId> and <version> identify a file within
that directory, you can see an automatic way to translate that section into
a classpath. This is, of course, what Maven does in Java, but can you do it
using command line tools? As it turns out, you can, if you don't mind trying
to maintain a three-deep multi-command pipe configuration.
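The mapping from coordinates to a file can be sketched as a small shell function. The name jar_path and the M2_REPO variable are made up for illustration, and the default of $HOME/.m2/repository assumes an unmodified Maven setup:

```shell
# Hypothetical helper: turn Maven coordinates into the jar's expected location.
# M2_REPO is an assumed variable name; it defaults to Maven's usual local repository.
M2_REPO="${M2_REPO:-$HOME/.m2/repository}"

jar_path() {
    # $1 = groupId, $2 = artifactId, $3 = version
    # Dots in the groupId become directory separators; the version keeps its dots.
    group_dir=$(printf '%s' "$1" | tr '.' '/')
    printf '%s/%s/%s/%s/%s-%s.jar\n' "$M2_REPO" "$group_dir" "$2" "$3" "$2" "$3"
}

jar_path com.google.gwt gwt-user 2.5.1
```

The rest of this post is an attempt to do the same thing for every dependency in the POM at once.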
I'll use the GWT POM file as an example. The XML is "pretty-printed" for human consumption; sed is going to want it "de-pretty-printed". First, extract the <dependencies> section, since that is what determines the classpath:
sed -n -e '/<dependencies>/,/<\/dependencies>/p' pom.xml
This gives you:
<dependencies>
<!-- Google Web Toolkit (GWT) -->
<dependency>
<groupId>com.google.gwt</groupId>
<artifactId>gwt-user</artifactId>
<version>2.5.1</version>
<!-- "provided" so that we don't deploy -->
<scope>provided</scope>
</dependency>
<!-- GWT projects do not usually need a dependency on gwt-dev, but MobileWebApp
contains a GWTC Linker (AppCacheLinker) which in turn depends on internals
of the GWT compiler. -->
<dependency>
<groupId>com.google.gwt</groupId>
<artifactId>gwt-dev</artifactId>
<version>2.5.1</version>
<!-- "provided" so that we don't deploy -->
<scope>provided</scope>
</dependency>
<!-- RequestFactory server -->
<dependency>
<groupId>com.google.web.bindery</groupId>
<artifactId>requestfactory-server</artifactId>
<version>2.5.1</version>
</dependency>
<!-- RequestFactory will use JSR 303 javax.validation if you let it -->
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-validator</artifactId>
<version>4.1.0.Final</version>
<exclusions>
<exclusion>
<groupId>javax.xml.bind</groupId>
<artifactId>jaxb-api</artifactId>
</exclusion>
<exclusion>
<groupId>com.sun.xml.bind</groupId>
<artifactId>jaxb-impl</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- Required by Hibernate validator because slf4j-log4j is
optional in the hibernate-validator POM -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.6.1</version>
</dependency>
<!-- Google App Engine (GAE) -->
<dependency>
<groupId>com.google.appengine</groupId>
<artifactId>appengine-api-1.0-sdk</artifactId>
<version>1.7.1</version>
</dependency>
<dependency>
<groupId>com.google.appengine</groupId>
<artifactId>appengine-testing</artifactId>
<version>1.7.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.google.appengine</groupId>
<artifactId>appengine-api-stubs</artifactId>
<version>1.7.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.google.appengine</groupId>
<artifactId>appengine-api-labs</artifactId>
<version>1.7.1</version>
</dependency>
<!-- Objectify for persistence. It uses the stock javax.persistence annotations -->
<dependency>
<groupId>com.googlecode.objectify</groupId>
<artifactId>objectify</artifactId>
<version>3.0</version>
</dependency>
<dependency>
<groupId>javax.persistence</groupId>
<artifactId>persistence-api</artifactId>
<version>1.0</version>
</dependency>
<!-- GIN and Guice for IoC / DI -->
<dependency>
<groupId>com.google.inject</groupId>
<artifactId>guice</artifactId>
<version>2.0</version>
</dependency>
<dependency>
<groupId>com.google.gwt.inject</groupId>
<artifactId>gin</artifactId>
<version>1.0</version>
</dependency>
<!-- Use the JSR 330 injection interfaces-->
<dependency>
<groupId>javax.inject</groupId>
<artifactId>javax.inject</artifactId>
<version>1</version>
</dependency>
<!-- Unit tests -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.8.1</version>
<scope>test</scope>
</dependency>
</dependencies>
Now, concatenate everything onto one line:
sed -n -e '/<dependencies>/,/<\/dependencies>/p' pom.xml | tr -d '\n'
And remove whitespace:
sed -n -e '/<dependencies>/,/<\/dependencies>/p' pom.xml | tr -d '\n\t '
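To try these two steps without the full GWT POM, here is a minimal sketch on a made-up one-dependency fragment (the sample text and the variable names are mine, not part of the original project):

```shell
# A trimmed-down, made-up POM fragment used only for this demonstration.
sample='<project>
  <dependencies>
    <!-- Unit tests -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.8.1</version>
    </dependency>
  </dependencies>
</project>'

# Keep only the <dependencies> block, then delete newlines, tabs, and spaces.
flat=$(printf '%s\n' "$sample" |
    sed -n -e '/<dependencies>/,/<\/dependencies>/p' |
    tr -d '\n\t ')

printf '%s\n' "$flat"
```

The result is a single line with no whitespace at all, including inside the comment text.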
Ah, nice, command-line-friendly ugliness. Remove the comments (which are now useless anyway with the removal of whitespace):
sed -e 's/<!--[^>]*>//g'
(Notice that I can't capture using ".*" here; I have to use "[^>]*" instead to avoid greedy consumption). I couldn't do this in the prior step since it wouldn't have handled multi-line comments correctly. Finally, you have your classpath — it just happens to still be in XML form. You can almost translate it directly like this:
sed -e 's/<dependency><groupId>\([^<]*\)<\/groupId><artifactId>\([^<]*\)<\/artifactId><version>\([^<]*\)<\/version><\/dependency>/\/Users\/joshuadavies\/.m2\/repository\/\1\/\2\/\3\/\2-\3.jar:/g'
Here's what this almost classpath looks like:
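A minimal sketch on a made-up, already-flattened two-dependency input shows the shape of the result. The REPO path is a placeholder, and I use | as the s-command delimiter here purely to avoid escaping every slash:

```shell
# Made-up flattened input: two dependencies with whitespace and comments already removed.
flat='<dependency><groupId>com.google.gwt</groupId><artifactId>gwt-user</artifactId><version>2.5.1</version></dependency><dependency><groupId>junit</groupId><artifactId>junit</artifactId><version>4.8.1</version></dependency>'

# REPO is a placeholder path, not anyone's real repository location.
REPO=/home/me/.m2/repository

# Each <dependency> element becomes one colon-terminated path.
almost=$(printf '%s\n' "$flat" |
    sed -e "s|<dependency><groupId>\([^<]*\)</groupId><artifactId>\([^<]*\)</artifactId><version>\([^<]*\)</version></dependency>|$REPO/\1/\2/\3/\2-\3.jar:|g")

printf '%s\n' "$almost"
```

The first entry comes out with a com.google.gwt directory component, which is not how the repository is laid out on disk.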
One problem here is that the groupIds have .'s in them instead of path
separators. You can't just s/\./\//g the whole line, since the versions
themselves have .'s that you want to preserve. The solution is to apply
s/\./\//g to the <groupId> lines only, before concatenating onto one line:
sed -n \
-e '/<groupId>/s/\./\//g' \
-e '/<dependencies>/,/<\/dependencies>/p' pom.xml |
tr -d '\n\t ' | \
sed \
-e 's/<!--[^>]*>//g' \
-e 's/<dependency><groupId>\([^<]*\)<\/groupId><artifactId>\([^<]*\)<\/artifactId><version>\([^<]*\)<\/version><\/dependency>/\/Users\/joshuadavies\/.m2\/repository\/\1\/\2\/\3\/\2-\3.jar:/g'
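To see what the /<groupId>/ address buys you, here is a small sketch on a made-up three-line fragment. Only the line containing the groupId tag is touched:

```shell
# Made-up pretty-printed fragment for illustration.
fragment='<groupId>com.google.gwt</groupId>
<artifactId>gwt-user</artifactId>
<version>2.5.1</version>'

# The /<groupId>/ address limits the substitution to lines containing that tag,
# so the dots in the version survive.
converted=$(printf '%s\n' "$fragment" | sed -e '/<groupId>/s/\./\//g')

printf '%s\n' "$converted"
```

This only works while the file is still one element per line, which is why the substitution has to happen before tr flattens everything.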
There are still a few stragglers in there, though. The inclusion of optional
elements like <scope> and <exclusions> causes the last pattern not to match.
The easiest way to handle that is to strip them out. <scope> is easy enough
to deal with:
sed -n \
-e '/<groupId>/s/\./\//g' \
-e '/<dependencies>/,/<\/dependencies>/p' pom.xml |
tr -d '\n\t ' | \
sed \
-e 's/<scope>[^<]*<\/scope>//g' \
-e 's/<!--[^>]*>//g' \
-e 's/<dependency><groupId>\([^<]*\)<\/groupId><artifactId>\([^<]*\)<\/artifactId><version>\([^<]*\)<\/version><\/dependency>/\/Users\/joshuadavies\/.m2\/repository\/\1\/\2\/\3\/\2-\3.jar:/g'
<exclusions> is a little trickier, though, since it has multiple child
elements. This can be accomplished at the upper (pretty-print) layer by just
deleting those lines from the output entirely:
sed -n \
-e '/<exclusions>/,/<\/exclusions>/d' \
-e '/<groupId>/s/\./\//g' \
-e '/<dependencies>/,/<\/dependencies>/p' pom.xml |
tr -d '\n\t ' | \
sed \
-e 's/<scope>[^<]*<\/scope>//g' \
-e 's/<!--[^>]*>//g' \
-e 's/<dependency><groupId>\([^<]*\)<\/groupId><artifactId>\([^<]*\)<\/artifactId><version>\([^<]*\)<\/version><\/dependency>/\/Users\/joshuadavies\/.m2\/repository\/\1\/\2\/\3\/\2-\3.jar:/g'
Finally, of course, get rid of the "<dependencies>" delimiters:
sed -n \
-e '/<exclusions>/,/<\/exclusions>/d' \
-e '/<groupId>/s/\./\//g' \
-e '/<dependencies>/,/<\/dependencies>/p' pom.xml |
tr -d '\n\t ' | \
sed \
-e 's/<scope>[^<]*<\/scope>//g' \
-e 's/<!--[^>]*>//g' \
-e 's/<dependencies>//g' \
-e 's/<\/dependencies>//g' \
-e 's/<dependency><groupId>\([^<]*\)<\/groupId><artifactId>\([^<]*\)<\/artifactId><version>\([^<]*\)<\/version><\/dependency>/\/Users\/joshuadavies\/.m2\/repository\/\1\/\2\/\3\/\2-\3.jar:/g'
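The pipeline above hard-codes my own home directory. A sketch of the same steps wrapped in a function, with the repository location as a parameter, might look like this. The name pom_classpath and the $HOME/.m2/repository default are my assumptions, and a repository path containing | or & would break the last substitution:

```shell
# pom_classpath is a hypothetical wrapper around the pipeline above.
# $1 = path to pom.xml, $2 = local repository (assumed default: $HOME/.m2/repository)
pom_classpath() {
    repo="${2:-$HOME/.m2/repository}"
    # Stage 1 works on the pretty-printed file, one element per line.
    sed -n \
        -e '/<exclusions>/,/<\/exclusions>/d' \
        -e '/<groupId>/s/\./\//g' \
        -e '/<dependencies>/,/<\/dependencies>/p' "$1" |
    # Stage 2 flattens everything onto a single line.
    tr -d '\n\t ' |
    # Stage 3 strips the leftovers and rewrites each dependency as a path.
    sed \
        -e 's/<scope>[^<]*<\/scope>//g' \
        -e 's/<!--[^>]*>//g' \
        -e 's/<dependencies>//g' \
        -e 's/<\/dependencies>//g' \
        -e "s|<dependency><groupId>\([^<]*\)</groupId><artifactId>\([^<]*\)</artifactId><version>\([^<]*\)</version></dependency>|$repo/\1/\2/\3/\2-\3.jar:|g"
}

# Example use (assumes a pom.xml in the current directory):
#   CLASSPATH=$(pom_classpath pom.xml)
```

Because the repository path is spliced in after tr has run, a path containing spaces is not mangled by the whitespace removal.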
I did cheat in one place — the original POM file had placeholders for the GWT and GAE versions ${gwtVersion} and ${gae.version}. I replaced them in the POM file to simplify my work here; you could get a little wilder and do something like this:
GWT_VERSION=`grep "<gwtVersion>" pom.xml | sed -e 's/^ *<gwtVersion>\(.*\)<\/gwtVersion> *$/\1/'`
GAE_VERSION=`grep "<gae.version>" pom.xml | sed -e 's/^ *<gae.version>\(.*\)<\/gae.version> *$/\1/'`
sed -e 's/${gwtVersion}/'${GWT_VERSION}'/' -e 's/${gae.version}/'${GAE_VERSION}'/' pom.xml
You would run this before parsing the POM file. This approach, however, would
still require you to name each expansion variable in the script itself.
Ideally, you'd do here exactly what Maven does: dynamically replace the
property placeholders with the values that the user entered.
These variable values come from the <properties> section of the pom. You can
convert the <properties> list into a set of name-value pairs via:
sed -n -e '/<properties>/,/<\/properties>/s/ *<\([^>]*\)>\([^<]*\)<.*$/\1=\2/p' pom.xml
So that:
<properties>
<!-- Convenience property to set the GWT version -->
<gwtVersion>2.5.1</gwtVersion>
<!-- GWT needs at least java 1.6 -->
<maven.compiler.source>1.6</maven.compiler.source>
<maven.compiler.target>1.6</maven.compiler.target>
<!-- GAE properties -->
<gae.version>1.7.1</gae.version>
<gae.application.version>1</gae.application.version>
<!-- Don't let your Mac use a crazy non-standard encoding -->
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
</properties>
Becomes:
gwtVersion=2.5.1
maven.compiler.source=1.6
maven.compiler.target=1.6
gae.version=1.7.1
gae.application.version=1
project.build.sourceEncoding=UTF-8
project.reporting.outputEncoding=UTF-8
As is, this would be executable via:
eval `sed -n -e '/<properties>/,/<\/properties>/s/ *<\([^>]*\)>\([^<]*\)<.*$/\1=\2/p' pom.xml`
This would create a set of shell variables in the script that would represent the expansion variables declared in the POM. Note, though, that names containing dots, such as gae.version, are not valid shell variable names, so the shell would reject those assignments. In any case, that's not really what you want here: you want sed substitution. So, rather than building a list of variable declarations you can actually build another sed command! (Bet you never realized that sed supports recursion, eh?)
eval "sed `sed -n -e "/<properties>/,/<\/properties>/s/ *<\([^>]*\)>\([^<]*\)<.*$/-e 's\/\\\\$\\{\1\\}\/\2\/' /p" pom.xml |
tr -d '\n'` pom.xml"
Which expands to:
sed -e 's/\${gwtVersion}/2.5.1/'
-e 's/\${maven.compiler.source}/1.6/'
-e 's/\${maven.compiler.target}/1.6/'
-e 's/\${gae.version}/1.7.1/'
-e 's/\${gae.application.version}/1/'
-e 's/\${project.build.sourceEncoding}/UTF-8/'
-e 's/\${project.reporting.outputEncoding}/UTF-8/' pom.xml
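If the nested quoting is too much, one alternative (not what I do in this post) is to write the generated commands to a temporary sed script and run it with sed -f. The sketch below is a variation under my own assumptions: the name expand_properties is made up, property values must not contain |, & or backslashes, and it escapes the dots in property names so that they only match literal dots:

```shell
# expand_properties is a hypothetical name; it prints the POM ($1) with ${name}
# placeholders replaced by the values found in its <properties> section.
expand_properties() {
    script=$(mktemp) || return 1
    # Extract name=value pairs, allowing any leading whitespace (spaces or tabs).
    sed -n -e '/<properties>/,/<\/properties>/s/^[[:space:]]*<\([^>]*\)>\([^<]*\)<.*$/\1=\2/p' "$1" |
    while IFS='=' read -r name value; do
        # Escape the dots so that gae.version cannot match, say, gaeXversion.
        pattern=$(printf '%s' "$name" | sed -e 's/\./\\./g')
        # Emit one substitution per property, e.g. s|\${gae\.version}|1.7.1|g
        printf 's|\\${%s}|%s|g\n' "$pattern" "$value"
    done > "$script"
    sed -f "$script" "$1"
    rm -f "$script"
}

# Example use (assumes a pom.xml in the current directory):
#   expand_properties pom.xml
```

The output is the whole POM with placeholders expanded, so it can be piped into the classpath builder in the same way as the eval version.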
And feed this stream into the classpath builder:
eval "sed `sed -n -e "/<properties>/,/<\/properties>/s/ *<\([^>]*\)>\([^<]*\)<.*$/-e 's\/\\\\$\\{\1\\}\/\2\/' /p" pom.xml |
tr -d '\n'` pom.xml" |
sed -n \
-e '/<exclusions>/,/<\/exclusions>/d' \
-e '/<groupId>/s/\./\//g' \
-e '/<dependencies>/,/<\/dependencies>/p' |
tr -d '\n\t ' | \
sed \
-e 's/<scope>[^<]*<\/scope>//g' \
-e 's/<!--[^>]*>//g' \
-e 's/<dependencies>//g' \
-e 's/<\/dependencies>//g' \
-e 's/<dependency><groupId>\([^<]*\)<\/groupId><artifactId>\([^<]*\)<\/artifactId><version>\([^<]*\)<\/version><\/dependency>/\/Users\/joshuadavies\/.m2\/repository\/\1\/\2\/\3\/\2-\3.jar:/g'
And this actually creates a CLASSPATH that you can use to execute the project from the command line; no setup required. There are probably a few edge cases not dealt with here — spaces in command lines or instances where property names might mis-expand due to the use of '.' characters in the search criteria of a sed command are two that spring to mind — but in practical use, I've actually found this to be workable for real-world Maven projects.
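Given those edge cases, it can be worth checking that every entry actually exists before handing the result to java. A minimal sketch follows; check_classpath is a made-up helper name, and the java invocation in the comment uses a placeholder main class:

```shell
# check_classpath is a hypothetical helper: it prints a line for every entry of a
# colon-separated classpath ($1) that is not a regular file on disk.
check_classpath() {
    printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r entry; do
        # The generated classpath ends with a colon, which yields an empty entry.
        [ -z "$entry" ] && continue
        [ -f "$entry" ] || printf 'missing: %s\n' "$entry"
    done
}

# Example use, once CLASSPATH holds the generated value:
#   check_classpath "$CLASSPATH"
#   java -cp "target/classes:$CLASSPATH" com.example.Main   # placeholder class name
```

A missing entry usually means the dependency has never been downloaded into the local repository, or that one of the patterns above failed to match.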
Comments:
I'm using this in Ansible to install Groovy and add some libraries that we require in the build. They're all managed in a build POM, which I can now read thanks to your sed overkill ;-)
I am running a Jenkins pipeline build. In the build I clone a repo from Git, and it has a pom.xml. Once it is cloned, I want to increase the version in the POM by one. I am looking for a sed command to do so.
Thanks in advance