
AI for Tax Analogies and Code Renumbering

Posted on Mar. 29, 2021
[Editor's Note:

This article originally appeared in the March 29, 2021, issue of Tax Notes Federal.

]
Benjamin Van Durme
Andrew Blair-Stanek

Andrew Blair-Stanek is a tax law professor at the University of Maryland Carey School of Law, and Benjamin Van Durme is an associate professor of computer science at Johns Hopkins University.

In this article, Blair-Stanek and Van Durme present an artificial intelligence tool that can complete analogies in tax law and provide evidence-based guidance on how Congress can renumber IRC sections in future tax reform efforts.

This research was supported in part by a Defense Advanced Research Projects Agency grant. The views and conclusions are those of the authors and should not be interpreted as representing official policies or endorsements of that agency or the U.S. government.

Copyright 2021 Andrew Blair-Stanek and
Benjamin Van Durme.
All rights reserved.

Words used in similar contexts tend to have similar meanings. In law, this insight underpins the canon noscitur a sociis, that “a word is known by the company it keeps.”1 If, for example, Congress had passed a law stating that “no lions, jaguars, or tigers may be imported,” then an attempt by the president to bar imports of Jaguars, the British-made cars, would fail in court thanks to this canon.

Computer scientists have gone one step further, creating algorithms that use the contexts of words to calculate precise numerical representations for all words in a large body of text. For example, Tomas Mikolov and his colleagues at Google created the program word2vec,2 which they ran over billions of words of English text gathered from the internet to get a vector (that is, a list) of 50 numbers for every word in the vocabulary.

These vectors had surprising properties for a range of artificial intelligence (AI) tasks, including completing analogies. Most humans can complete the analogy “man is to king as woman is to __.” The vectors could, too. If you started with the vector representing “king,” subtracted the vector representing “man,” added the vector representing “woman,” and then looked for the most similar vector, you got the vector for “queen.”

How did this work? The algorithm had derived a 50-dimensional vector for each of the several hundred thousand words that appeared sufficiently often in the texts. These vectors were simply 50 numbers, but those numbers collectively embedded the relevant attributes of each word, gleaned from its contexts. The vector for “king” embedded the attributes human, monarch, and male. The vector for “man” embedded male and human, while the vector for “woman” embedded female and human. Completing the analogy then used vector addition and subtraction, which simply means adding or subtracting each of a vector’s 50 numbers element by element. If you take the vector for “king,” subtract the vector for “man,” and add the vector for “woman,” you end up with a vector roughly representing a human monarch who is female. You then look for the closest existing vector, which, not surprisingly, is the one for “queen.” The computer has thus completed the analogy. These vectors embed a wide range of knowledge; for example, they can complete the analogy “Italy is to Rome as France is to __.”
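
For readers curious about the mechanics, the arithmetic can be sketched in a few lines of Python. The four-dimensional vectors below are invented for illustration, with each coordinate loosely tracking one attribute; real word2vec vectors have 50 or more dimensions learned automatically from text, not hand-assigned.

```python
import math

# Toy 4-dimensional vectors standing in for word2vec's 50-dimensional ones.
# Words and numbers are invented for illustration, not from a trained model;
# the coordinates loosely track: human, monarch, male, female.
vectors = {
    "king":   [0.9, 0.9, 0.1, 0.0],
    "queen":  [0.9, 0.9, 0.0, 0.9],
    "man":    [0.9, 0.0, 0.1, 0.0],
    "woman":  [0.9, 0.0, 0.0, 0.9],
    "person": [0.9, 0.0, 0.05, 0.05],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, lower as vectors diverge."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def complete_analogy(a, b, c, vectors):
    """'a is to b as c is to __': compute b - a + c, return the nearest other word."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))
```

On these toy vectors, `complete_analogy("man", "king", "woman", vectors)` returns `"queen"`, because king minus man plus woman lands almost exactly on the “queen” vector.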

But Google’s vectors represent little knowledge of law, let alone tax law. For example, given “corporation is to shareholders as partnership is to __,” the vector for “partners” is nowhere near a top hit.

We set out to create vectors representing U.S. federal tax law. We collected 188 million words, consisting of decisions in federal tax cases and private letter rulings.3 Then, we standardized citation formats, so that the program knew that equivalent citations had the same meaning. We also treated citations as single units of meaning, so that the text “section 11” would have a single vector rather than one vector for “section” and another for “11.” Similarly, we treated tax law terms of art, like “gross income” or “step transaction doctrine,” as single units of meaning. We then ran the program word2vec.
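
As an illustration of that preprocessing, the short Python sketch below standardizes a couple of citation formats and joins terms of art into single tokens. The specific regexes and phrase list are our own simplification, not the actual pipeline; the resulting token streams would then be fed to word2vec.

```python
import re

# Illustrative preprocessing: standardize citation formats and merge multi-word
# terms of art into single tokens before training word2vec. The patterns and
# the phrase list here are a simplified sketch, not the pipeline actually used.
TERMS_OF_ART = ["gross income", "step transaction doctrine"]

def preprocess(text):
    # Map variant citation formats ("I.R.C. § 11", "26 U.S.C. § 11") onto one
    # canonical token, so that equivalent citations share a single vector.
    text = re.sub(r"(?:I\.R\.C\.|26\s+U\.S\.C\.)\s*§\s*(\d+)", r"section_\1", text)
    text = text.lower()
    text = re.sub(r"section\s+(\d+)", r"section_\1", text)
    # Join each term of art with underscores so it becomes one unit of meaning.
    for phrase in TERMS_OF_ART:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text.split()

tokens = preprocess("Gross income under I.R.C. § 61 differs from section 11 rates")
# tokens: ['gross_income', 'under', 'section_61', 'differs', 'from', 'section_11', 'rates']
```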

The vectors we got exceeded our expectations. Not only could these vectors correctly complete the analogy “corporation is to shareholders as partnership is to __,” but they also correctly completed much more sophisticated legal analogies like “corporation is to section 351 as partnership is to __.” We simply took the vector for “section 351,” subtracted the vector for “corporation,” and added the vector for “partnership.” We then looked for the closest vector, which was the vector for “section 721.” From the millions of words of tax law text, word2vec learned that section 351 is used to form corporations with nonrecognition in the same way that section 721 is used to form partnerships with nonrecognition. The vectors embed the actual real-world meaning of legal authorities like IRC sections, cases, and revenue rulings.

The vectors can correctly complete many other analogies, including “partnership is to section 701 as S corporation is to __,” which it correctly completes as “section 1363.” Just as section 701 says partnerships are not taxed, with partners taxed instead, section 1363 says S corporations are generally not taxed, with their shareholders taxed instead. Similarly, the vectors correctly complete “partnership is to section 702 as S corporation is to __” with “section 1366.” Just as section 702 details how partners report their partnership’s tax items, section 1366 details how shareholders report their S corporation’s tax items.

We have put a simple interface for our vectors on the University of Maryland website for all to use free of charge, so that anyone can try their own analogies. AI technology can be imperfect, and one can expect this tool to sometimes give nonsensical responses.

The web interface also offers an even simpler research tool called “Nearest Tax Concept,” which lists the closest vectors for any word, citation, or tax term of art. For example, type in “step transaction doctrine,” and it will give you a list of related cases and concepts. You can do the same with any tax case, code section, Treasury regulation, revenue ruling, or tax term of art that appeared sufficiently often in the tax law texts we used. Listing the nearest tax concepts in this way can be a useful starting point for research.
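
Under the hood, a nearest-concept lookup is just a ranking of every other vector by similarity to the query. The sketch below shows the idea on a made-up four-entry vocabulary; the vectors, and hence the rankings, are illustrative only.

```python
import math

# Sketch of a nearest-concept lookup: rank every other vector by cosine
# similarity to the query. This tiny vocabulary and its three-dimensional
# vectors are invented for illustration.
vectors = {
    "step_transaction_doctrine": [0.9, 0.8, 0.1],
    "substance_over_form":       [0.8, 0.9, 0.2],
    "gross_income":              [0.3, 0.1, 0.7],
    "section_1031":              [0.1, 0.2, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_concepts(query, vectors, k=3):
    """Return the k entries most similar to the query, best first."""
    others = [(w, cosine(vectors[query], v)) for w, v in vectors.items() if w != query]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]

ranked = nearest_concepts("step_transaction_doctrine", vectors)
# On these toy vectors, "substance_over_form" ranks first.
```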

Renumbering Tax Sections

Our vectors can be used to complete analogies and find nearest concepts, but those are just two of many possible uses. Another use is helping Congress renumber the sections of the IRC.

As background, in 1954 Congress renumbered the entire code, giving us roughly the section numbering and breakdown into chapters, subchapters, parts, and subparts that we have now. Since 1954 Congress has renumbered sections more than 140 times. For 1954 and all the following renumberings, Congress aimed — quite sensibly — to keep sections covering similar subject matter in the same subdivisions, with the section numbers close to each other. Good organization makes the code easier to navigate and more accessible.

But in all its renumbering, Congress has been guided by guesswork. By contrast, our vectors provide quantitative evidence of how sections are actually used. In any future tax reform or smaller-scale changes, Congress can use these vectors to identify sections that could be renumbered to make the IRC’s organization better reflect actual usage, and thus be more usable.

Using the vectors, we identified 192 sections that Congress should consider renumbering to be in a different subchapter.4 How does this work? You can determine precisely how well a section “fits” into a subchapter by calculating the distance between the section’s vector and the vectors of the other sections in the same subchapter. If a section’s vector is closer to a different subchapter than to the subchapter containing it, that is evidence that Congress should consider renumbering that section to fit in the closer subchapter.
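
As one plausible rendering of this idea (our actual code and vectors are in the sources cited in the footnotes), the sketch below scores a section’s fit to a subchapter as its average cosine similarity to that subchapter’s sections, then reports the gain from moving it to the best-fitting other subchapter. The toy two-dimensional vectors and subchapter rosters are invented for illustration.

```python
import math

# Toy data: "section_1032" nominally sits in subchapter O, but its vector
# points toward the subchapter C sections. All numbers are invented.
vectors = {
    "section_1031": [0.9, 0.1],
    "section_1032": [0.2, 0.9],
    "section_1033": [0.8, 0.2],
    "section_351":  [0.1, 0.9],
    "section_354":  [0.2, 0.8],
}
subchapters = {
    "O": ["section_1031", "section_1032", "section_1033"],
    "C": ["section_351", "section_354"],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def fit(section_vec, member_vecs):
    """Average similarity between one section and a subchapter's sections."""
    return sum(cosine(section_vec, v) for v in member_vecs) / len(member_vecs)

def suggest_move(section, current):
    """Return (best other subchapter, gain over the current subchapter)."""
    sec_vec = vectors[section]
    current_fit = fit(sec_vec, [vectors[s] for s in subchapters[current] if s != section])
    best, best_fit = None, float("-inf")
    for name, members in subchapters.items():
        if name == current:
            continue
        f = fit(sec_vec, [vectors[s] for s in members])
        if f > best_fit:
            best, best_fit = name, f
    return best, best_fit - current_fit

better, gain = suggest_move("section_1032", current="O")
# better == "C", with a clearly positive gain on this toy data.
```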

For example, we found that section 1032 is substantially closer to subchapter C (sections 301-385) than to subchapter O (sections 1001-1092), where it currently resides. This result makes sense: Section 1032 gives a corporation nonrecognition when it receives property in exchange for its own stock. Section 1032 applies solely to corporations and involves a quintessentially corporate transaction, the exchange of stock for property. Section 1032 frequently works in tandem with sections 351 and 354, which are in subchapter C. By contrast, section 1032’s neighbors in subchapter O include section 1031, dealing with like-kind real estate exchanges, and section 1033, covering involuntary conversions like theft, fire, and eminent domain. Congress should consider renumbering section 1032 to be in subchapter C. Where in subchapter C? Running the same analysis part-by-part rather than subchapter-by-subchapter indicates that section 1032 should be moved to Part III (sections 351-368).

The figure graphically demonstrates why section 1032 should be in subchapter C, not subchapter O. In it, our vectors have been reduced to a two-dimensional plot,5 and we have zoomed in on just the neighborhood containing most of subchapter C. Sections in subchapter C are in red (or gray), and section 1032 is highlighted with an arrow. The closer that sections are in this plot, the more similar they are in actual usage. You can see that section 1032 is very close to many of the sections in subchapter C. Meanwhile, most of the other sections in subchapter O are nowhere near section 1032.

Figure: Plot of Sections in Two Dimensions
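
A plot like the figure requires squeezing 50-dimensional vectors down to two dimensions. Our actual reduction code is in the repository cited in the footnotes; the sketch below uses plain principal component analysis via power iteration, a standard dimensionality-reduction technique, purely as an illustration of how 2-D plotting coordinates can be obtained.

```python
import math

# Illustrative dimensionality reduction: PCA via power iteration. This is a
# generic sketch of one standard technique, not the authors' actual method.
def pca_2d(rows, iters=200):
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def principal_axis(data):
        # Power iteration: repeatedly multiply by the (implicit) covariance
        # matrix until the vector converges to the top principal direction.
        axis = [1.0] * d
        for _ in range(iters):
            new = [0.0] * d
            for x in data:
                s = dot(x, axis)
                for j in range(d):
                    new[j] += s * x[j]
            norm = math.sqrt(dot(new, new)) or 1.0
            axis = [a / norm for a in new]
        return axis

    ax1 = principal_axis(centered)
    # Deflate: subtract each point's first component, then find the second axis.
    deflated = [[x[j] - dot(x, ax1) * ax1[j] for j in range(d)] for x in centered]
    ax2 = principal_axis(deflated)
    return [(dot(x, ax1), dot(x, ax2)) for x in centered]
```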

The table includes several examples of the 192 sections that our calculations show Congress should consider renumbering to be in a different subchapter. The rightmost column, “Gain,” measures how much closer the section is to the suggested “Better Subchapter” than to the “Current Subchapter” containing it. A higher number means stronger evidence for moving the section; at 0.282, the evidence that section 1032 should be moved to subchapter C is particularly strong.

Sample of Sections That Should Be Moved

| Section | Current Subchapter | Better Subchapter | Gain |
| --- | --- | --- | --- |
| Section 1032 — exchange of stock for property | Subchapter O — gain or loss on disposition of property (sections 1001-1092) | Subchapter C — corporate distributions and adjustments (sections 301-385) | 0.282 |
| Section 457 — deferred compensation plans of state and local governments and tax-exempt organizations | Subchapter E — accounting periods and methods of accounting (sections 441-483) | Subchapter D — deferred compensation, etc. (sections 401-436) | 0.274 |
| Section 7454 — burden of proof in fraud, foundation manager, and transferee cases | Subchapter C — the Tax Court (sections 7441-7479) | Subchapter A — additions to the tax and additional amounts (sections 6651-6665) | 0.244 |
| Section 1248 — gain from certain sales or exchanges of stock in certain foreign corporations | Subchapter P — capital gains and losses (sections 1202-1298) | Subchapter N — tax based on income from sources within or without the United States (sections 861-999) | 0.225 |
| Section 1040 — transfer of certain farm, etc., real property | Subchapter O — gain or loss on disposition of property (sections 1001-1092) | Subchapter A — estates of citizens or residents (sections 2001-2058) | 0.205 |
| Section 83 — property transferred in connection with performance of services | Subchapter B — computation of taxable income (sections 61-291) | Subchapter D — deferred compensation, etc. (sections 401-436) | 0.188 |
| Section 521 — exemption of farmers’ cooperatives from tax | Subchapter F — exempt organizations (sections 501-530) | Subchapter T — cooperatives and their patrons (sections 1381-1388) | 0.113 |
| Section 166 — bad debts | Subchapter B — computation of taxable income (sections 61-291) | Subchapter H — banking institutions (sections 581-597) | 0.05 |

Another example is section 457, which governs section 401(k)-like deferred compensation plans for employees of state and local governments and of nonprofits. Section 457 is in subchapter E (sections 441-483), dealing with accounting methods. But our analysis strongly indicates that it should be in subchapter D, titled “Deferred Compensation, etc.” (sections 401-436), which is also home to section 401(k). Similarly, section 83, which allows deferring gross income when restricted property is given as compensation for services, should also be moved to subchapter D.

The candidates for moving include not only substantive sections like sections 1032 and 83 but also procedural sections. For example, section 7454 provides that the IRS bears the burden of proving taxpayer fraud. This section expressly applies to “any proceeding,” and courts have long applied it to refund suits in federal district courts.6 Yet oddly, section 7454 is in the subchapter at sections 7441-7479 governing the Tax Court, nestled between the section on Tax Court procedure and evidence7 and the section governing service of process in the Tax Court.8 Our analysis strongly suggests that section 7454 should be moved instead to the subchapter at sections 6651-6665 governing “Additions to the Tax and Additional Amounts,” which includes fraud penalties.9

Even international tax sections might be moved. When a U.S. person sells stock in a controlled foreign corporation (CFC), section 1248 generally causes the gain to be treated as a dividend. This section is important for international tax planning and refers repeatedly to international concepts like the definition of a CFC. Yet it resides in subchapter P, dealing with “Capital Gains and Losses.” Our metric strongly indicates that section 1248 would be more at home in subchapter N (sections 861-999), home to most of the international tax provisions.

There are other possible metrics and algorithms for reorganizing the IRC based on vector representations. We have made all the vectors we derived freely downloadable10 and put our code for suggesting moves on GitHub, a code-hosting website.11 Others can easily try alternative approaches.

One alternative approach that requires neither math nor computer programming is for congressional staffers to manually look at a two-dimensional plot of all the section vectors. We have posted such a plot in PDF form,12 containing all 1,245 IRC sections for which we have sufficient data. (To fit in all 1,245 sections, the section labels are in two-point font, which is why we have not reproduced that plot here.) The figure is simply a small portion of this much larger plot, zoomed in to just the neighborhood containing most of subchapter C. The closer sections are to each other in the plot, the closer they are in usage. If most sections in a subchapter are clustered together — except for one outlier that is nearer to some other subchapter’s sections — then it might make sense to move that outlier. Looking at this plot also can give additional context to the renumbering suggestions we have made.

Conclusion

We have described two limited applications of AI in tax law, but we and other researchers are pursuing many others. New models with millions of mathematical neurons, loosely modeled on the neurons in the human brain, promise much more power than the model we used here.13 Moreover, all AI models rely on data; more data and higher-quality data are always better. The full Tax Analysts Federal Research Library, just released under an agreement between Tax Analysts and Deloitte Tax LLP, contains extensive, very high-quality tax law text. This combination of more powerful models with more and better data is reason for optimism that AI will yield many more tools to aid tax practitioners and policymakers.

FOOTNOTES

1 Yates v. United States, 574 U.S. 528, 543 (2015).

3 About 76 million came from Harvard Law Library’s “Caselaw Access Project,” with the remainder scraped from the Tax Court and IRS websites.

4 The full listing of sections that Congress should consider moving is at Andrew Blair-Stanek, “TaxVectorReorganization: Data Output,” GitHub (2021).

5 The mechanism for this is in Blair-Stanek, “TaxVectorReorganization: Reduce Dimensions,” GitHub (2021).

6 E.g., United States v. Prince, 348 F.2d 746 (2d Cir. 1965) (refund suit).

10 Blair-Stanek, Nils Holzenberger, and Benjamin Van Durme, “Tax Law NLP Resources,” Johns Hopkins University Data Archive (2021).

11 Blair-Stanek, “TaxVectorReorganization,” GitHub (2021).

12 Blair-Stanek, “TaxVectorReorganization: Tax Vectors,” GitHub (2021).

13 E.g., Jacob Devlin et al., “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding,” arXiv (2018); Holzenberger, Blair-Stanek, and Van Durme, “A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering,” Proceedings of the 2020 Natural Legal Language Processing Workshop, arXiv (2020).

END FOOTNOTES
